Checkpoints and the Background Writer:

PostgreSQL 8.3 Improvements and Migration

If you have a PostgreSQL database that is written to heavily, it's not difficult to find yourself suffering from significant delays (many seconds long) during the periodic database checkpoints. A checkpoint is a point in time where all the "dirty" (modified or added) data in the system's memory is guaranteed to have been written to disk; see WAL Configuration for a longer introduction to this topic. As the amount of memory available to the database (via the shared_buffers tunable) increases, the potential for a long delay at checkpoint time goes up. Since memory capacities in computer systems keep increasing far faster than improvements in disk write speed, checkpoint slowdowns become more likely every year. PostgreSQL 8.0 introduced a way to write dirty buffers in advance of the checkpoint via a background writer process. It became obvious to many people during the development of the 8.3 release that this mechanism wasn't sufficient anymore to keep up with current generation hardware. To improve this situation, the way checkpoints are written to disk was completely rewritten for this new version. Here we will go over the issues with the 8.0-8.2 releases, introduce the new features in 8.3 that improve this situation, and talk about how to migrate existing systems to the new version.

A Quick Introduction to Checkpoint Timing

As you generate transactions, Postgres puts data into the write-ahead log (WAL). The WAL is organized into segments that are typically 16MB each. Periodically, after the system finishes a checkpoint, the WAL data up to a certain point is guaranteed to have been applied to the database. At that point the old WAL files aren't needed anymore and can be reused. Checkpoints are generally caused by one of two things happening:
  1. checkpoint_segments worth of WAL files have been written
  2. more than checkpoint_timeout seconds have passed since the last checkpoint
The system doesn't stop working while the checkpoint is happening; it just keeps creating new WAL files. As long as each checkpoint finishes before the next one is required, things should be fine.
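Both of those triggers correspond directly to postgresql.conf settings. As a hedged illustration only (the numbers below are examples, not recommendations; the defaults in this era are checkpoint_segments=3 and checkpoint_timeout=5min):
checkpoint_segments = 16        # start a checkpoint after this many 16MB WAL segments have filled
checkpoint_timeout = 5min       # ...or after this much time has passed, whichever comes first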

In the 8.2 model, processing the checkpoint occurs as fast as data can be written to disk. All of the dirty data is written out in one burst, then PostgreSQL asks the operating system to confirm the data has been written via the fsync call (see Tuning PostgreSQL WAL Synchronization for lots of details about what fsync does). 8.3 lets the checkpoint occur at a more leisurely pace.

Monitoring checkpoints

Before PostgreSQL 8.3, there were two settings you could change together in the postgresql.conf file to try and track down whether checkpoints were causing serious slowdowns on your system:
  1. checkpoint_warning, which writes a warning to the logs whenever segment-driven checkpoints happen closer together than that many seconds
  2. log_min_duration_statement, which logs any statement that takes longer than that many milliseconds to execute
With these two settings, you can look through your log files and see whether slow statements correlate strongly with the times when checkpoints are occurring. Useful tools to help you comb through the PostgreSQL log files looking for this sort of thing include pgFouine and PQA.

PostgreSQL 8.3 adds an additional parameter, log_checkpoints. Turning that on results in every checkpoint creating a verbose commentary about how much disk activity it generated and how long it took to execute, which is much easier to draw conclusions from than the simple warnings generated by the earlier versions. You will still want to keep the log_min_duration_statement set to a useful value (one that only puts entries in the logs occasionally rather than all the time) if you have reason to suspect checkpoint issues are involved in a slowdown.
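A minimal logging setup for chasing down a suspected checkpoint problem might look like the sketch below; the one-second threshold is just an illustration, pick whatever value keeps your log volume reasonable:
log_checkpoints = on                  # 8.3 only: detailed log entry for every checkpoint
log_min_duration_statement = 1000     # log any statement taking longer than 1000ms
checkpoint_warning = 30s              # warn if segment-driven checkpoints come closer together than this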

For more quantitative analysis of checkpoints and the background writer, 8.3 also introduces the pg_stat_bgwriter view. This will show you exactly how much data has been written to disk by checkpoints and by the other components of PostgreSQL that write buffers. Here's a sample:

pgbench=# select * from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc 
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
                 1 |               3 |              24114 |         69461 |               41 |          115348 |        130816
This shows the balance of checkpoints that happened because the timeout was hit vs. ones where a checkpoint was requested, which normally only happens when you've filled checkpoint_segments worth of WAL. In this example, relatively few buffers are being written by the LRU background writer compared to the ones written out by the client backends themselves. Combine that with a fair number of maxwritten_clean events, and this suggests bgwriter_lru_maxpages is set too small for this system. Figuring out things like whether the maxpages parameter was big enough was nearly impossible in earlier versions. In 8.3, some simple scripts that watch pg_stat_bgwriter can figure that out easily.
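One way to summarize that balance (a sketch only, using nothing beyond the columns shown above) is to compute what percentage of all buffer writes came from each source:
SELECT round(100.0 * buffers_checkpoint / nullif(total, 0), 1) AS pct_checkpoint,
       round(100.0 * buffers_clean      / nullif(total, 0), 1) AS pct_bgwriter_lru,
       round(100.0 * buffers_backend    / nullif(total, 0), 1) AS pct_backends
  FROM (SELECT buffers_checkpoint, buffers_clean, buffers_backend,
               buffers_checkpoint + buffers_clean + buffers_backend AS total
          FROM pg_stat_bgwriter) s;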

Aggressive Checkpoint Tuning in 8.0-8.2: Theory

There were four basic techniques available to try and cope with checkpoint issues in PostgreSQL versions before 8.3:
  1. Keep shared_buffers small, relying on the operating system (and any disk controller cache) to do most of the caching, so a checkpoint never has very much dirty data to write out
  2. Keep checkpoint_segments and/or checkpoint_timeout low, so checkpoints happen frequently but each one has relatively little work to do
  3. Make the all-buffer scan of the background writer (bgwriter_all_percent and bgwriter_all_maxpages) aggressive enough that most dirty data has already been written before the checkpoint starts
  4. Make the LRU scan of the background writer (bgwriter_lru_percent and bgwriter_lru_maxpages) aggressive as well, so buffers about to be reused have already been cleaned
These are all software-based approaches; it's worth noting that some people have worked around this issue by altering their disk hardware, with adding a larger write cache being one possibly helpful technique.

Sample Aggressive Tuning in 8.0-8.2

To give an idea how all these ideas can work together, consider this configuration provided by Kevin Grittner as the result of a careful tuning exercise aimed at lowering his checkpoint spikes in 8.2.4. This is on a 4-processor server with 6GB of RAM, a 256MB disk controller cache, and a large RAID array:
shared_buffers = 160MB
effective_cache_size = 5GB
checkpoint_segments = 10
bgwriter_delay = 200
bgwriter_lru_percent = 20.0
bgwriter_lru_maxpages = 200
bgwriter_all_percent = 10.0
bgwriter_all_maxpages = 600
Here we see several of the ideas suggested above. What he's done here is make shared_buffers smaller than even the write cache on his disk controller (256MB), so that any time a checkpoint hits, the controller should be able to absorb a write of even the entire buffer cache if necessary. The OS handles caching everything else; that memory isn't involved in the checkpoint process and therefore doesn't weigh as heavily on the problem situation. With checkpoint_segments at 10, the system can't do much activity (relative to how fast a server this powerful can dirty pages) before a checkpoint happens, which also limits how much any one checkpoint has to clean up. And the background writer settings are aggressive enough that the entire pool is getting scanned constantly, which wastes CPU resources, but there are plenty to spare on this 4-CPU system.

In addition, a key part of this solution is that the background writer settings are aggressive enough to write out any dirty data almost as soon as it shows up. For more information about these older background writer tunables, see the documentation. bgwriter_all_percent=10.0 means that 16MB of the shared buffer cache is scanned each time the background writer executes, and at the default bgwriter_delay=200ms that means 80MB of buffer cache is scanned per second--the whole cache could potentially be covered every 2 seconds. If a substantial portion is dirty, progress will instead be limited by the number of writes allowed per scan. bgwriter_all_maxpages=600 means that the potential maximum I/O rate for that portion of the scan is 600*8KB/(200 msec)=24MB/sec. That rate is actually quite hard to sustain, because the pages coming out of the background writer are not sequential writes--on a large database they will have a heavy seek component to them. These background writer settings, with scan percentages that creep into double digits (10% is a very high setting for bgwriter_all_percent in particular) and maxpages values in the several hundreds, are on the aggressive side. But when shared_buffers is set relatively low, as in this case, there aren't that many buffers to scan, so even a high percentage of them isn't all that many.
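If you want to reproduce that ceiling computation on a live 8.0-8.2 server, a sketch (assuming the standard 8KB block size, and using only the pg_settings view) looks like this:
SELECT (SELECT setting::numeric FROM pg_settings WHERE name = 'bgwriter_all_maxpages') * 8
       * (1000.0 / (SELECT setting::numeric FROM pg_settings WHERE name = 'bgwriter_delay'))
       / 1024 AS all_scan_max_mb_per_sec;   -- pages per round * 8KB * rounds per second, as MB/sec
With the settings above this works out to about 23MB/sec, matching the hand calculation give or take rounding.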

This is a reasonable solution for working around checkpoint issues in 8.2.4, but this system sometimes has checkpoints spaced only 30 seconds apart, which is a weak spot. Running PostgreSQL with a tiny shared_buffers cache is another, and it's expected that overall performance would improve if it were possible to increase that without getting killed by checkpoints. Since there are plenty of systems where making performance trade-offs like these just to get checkpoints under control would be unacceptable, a whole new approach was added to PostgreSQL 8.3.

Setting checkpoint_completion_target in 8.3 to lower checkpoint impact

In 8.3, the checkpoint writes can be spread out over a much longer time period. A new parameter called checkpoint_completion_target suggests how far along the system should aim to have finished the current checkpoint relative to when the next one is expected.

If your checkpoints are being driven by the timeout, it's fairly easy to predict how this will work. Let's say you have checkpoint_timeout=300 seconds. With checkpoint_completion_target=0.5 (the default), checkpoints should be finished writing out all their data by 150 seconds after one starts. If you set it to 0.9 instead, the practical upper limit, the checkpoints will aim to be finished 270 seconds later, leaving only 30 seconds for the operating system to finish writing everything out before the next checkpoint should start.

The parameter works similarly if your system tends to hit the segments limit before the timeout. For example, let's say your current system has checkpoint_segments=10. Assume that you have checkpoint_timeout set to a large number such that the checkpoints are typically being driven by the number of segments being filled (so you get a checkpoint every 10 WAL segments, period). If checkpoint_completion_target was set to 0.5, the expectation is that the writes for the currently executing checkpoint would be finished about the time that 0.5*10=5 segments of new WAL data had been written. If you set it to 0.9 instead, you'd expect the checkpoint is finishing just about when the 9th WAL segment is being written out, which again is cutting things a bit tight.
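Putting the two cases together, a configuration fragment like the following sketch (illustrative values only) is what the examples above describe:
checkpoint_timeout = 300s              # checkpoint at least every 5 minutes...
checkpoint_segments = 10               # ...or after 10 WAL segments fill, whichever comes first
checkpoint_completion_target = 0.9     # aim to finish writing ~270s (or ~9 segments) into the interval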

You can get something similar to the 8.2 behavior by setting checkpoint_completion_target=0.0.

Changes to the background writer

The 8.3 background writer tunables have been slimmed down considerably from earlier versions. The major change is that the part of the writer that used to scan and write out the entire buffer pool is now gone. That approach was fairly inefficient, because it meant that popular pieces of data, like the index blocks on heavily used tables, would get written over and over again in between checkpoints. On the type of systems that have checkpoint issues, where you're creating dirty data at a high rate, constantly writing out the most frequently updated database blocks wastes disk I/O. Physical I/O to the disk is usually the bottleneck on such systems and is therefore precious, so wasting any of it isn't a great idea. Spreading the writes out at checkpoint time instead accomplishes the same goal--reducing the size of the checkpoint I/O spike--without ever writing the popular data more often than it absolutely has to be.

The other portion of the background writer, the LRU scan, only writes out dirty blocks that haven't been used recently, in anticipation that they will be re-used for new data in the near future. There hasn't been much strong evidence that this is a big benefit on highly loaded systems either. Most operating systems cache writes in such a way that waiting until the buffer is absolutely needed before writing it isn't as big a problem as one might think. However, there may very well be cases where preemptively doing such writes is helpful, particularly because it gets writes heading out to the disk earlier than they otherwise would. Most default Linux configurations, for example, can wait as long as 30 seconds between when you write a disk block and when the OS actually starts putting that data onto disk.
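If you're curious where that 30 second figure comes from, Linux exposes its writeback tuning under /proc/sys/vm; the exact defaults vary by kernel, but 3000 centiseconds (30 seconds) is a typical value for the expiration setting:
cat /proc/sys/vm/dirty_expire_centisecs      # how long dirty data may sit in the OS cache before writeback
cat /proc/sys/vm/dirty_writeback_centisecs   # how often the kernel writeback daemon wakes up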

But because there are some inefficiencies possible here as well, where you might write data out only to discover it was dirtied again before the buffer could be reused, the LRU background writer was redesigned for 8.3 using a "just in time" model. The writer now looks at how many buffers your system typically allocates per unit of time, then tries to keep just that many clean buffers available in advance of actual usage. Accordingly, the idea of scanning a fixed percentage of the buffer pool every time was dropped. Instead, the only tunable for the general aggressiveness of the background writer is bgwriter_lru_multiplier. The total of expected upcoming allocations is multiplied by this number to determine how many clean buffers the system tries to keep available. The default for the multiplier is 2.0, which pre-beta testing determined gives sufficient slack for the things that keep the estimate from being completely accurate: transitions from idle to full speed, buffers that get dirtied before they can be reused, and so on. Pre-beta tests have suggested this model is fairly robust, but it's known to work less well as you decrease bgwriter_delay significantly. As you lower that interval toward the lower practical limit of its range (10ms is the lowest useful setting on many platforms, and smaller values don't make a lot of sense anywhere), you may discover you need to increase the multiplier a bit, perhaps to 3.0 or 4.0.

Note that all this LRU writing is still capped, with bgwriter_lru_maxpages being the most pages that can be written in one interval. This lets you put an upper limit on how much disk I/O the background writer generates. The default there used to be an emaciated 5 buffer pages per cycle. With improvements in the background writer internals and general forward progress in computing technology, the 8.3 default is now 100 pages. That makes the default maximum write rate 100*8KB/(200 msec)=4MB/sec. That's not particularly high relative to modern disks, and the automatic tuning of the JIT mechanism means you'll only hit even that rate if there's enough activity to justify it.

Transitioning an aggressively tuned 8.2 system to 8.3

Many of the techniques adopted to reduce checkpoint spikes in earlier versions are counterproductive when using PostgreSQL 8.3. The fundamental ideas behind features like checkpoint_completion_target presume you're not using any of those tricks, and are instead adopting the standard best practices for setting configuration parameters: a generous shared_buffers allocation and checkpoints spaced well apart. The hope here is that by letting PostgreSQL cache more, and by avoiding writes of popular buffers except at checkpoint time, your total I/O will be significantly lower with 8.3 than what an aggressive background writer writes in 8.2.
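In postgresql.conf terms, the starting point for such a migration might look like the following sketch; the absolute numbers here are purely illustrative and depend entirely on your hardware and workload:
shared_buffers = 2GB                   # much larger than an aggressively tuned 8.2 setup; size to your RAM
checkpoint_segments = 30               # space checkpoints further apart than a "keep them small" 8.2 setup
checkpoint_timeout = 5min
checkpoint_completion_target = 0.5     # consider raising toward 0.9 after testing, as described below
# bgwriter_all_percent and bgwriter_all_maxpages no longer exist in 8.3; remove them from your configuration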

Once you've done all that, a useful test, if you have a way to generate a load representative of your production workload, is to run it with checkpoint_completion_target at 0.5 (the default), 0.7, and 0.9 to see how each of those works out for you (0.7 is the one to drop if you only have time for two runs). Based on the results of those tests, it may be possible to increase shared_buffers further, and checkpoint_segments/checkpoint_timeout may need some adjustment one way or the other.

One thing to be careful about when comparing the 8.2 and 8.3 behavior is that you may be getting many fewer checkpoints than before. It's important to have an idea how many checkpoints to expect during each test run in order to put together a fair comparison. As you increase checkpoint_segments, you need to adopt a mindset that asks "how many sluggish transactions am I seeing per checkpoint?" rather than how many total per test run.
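One low-tech way to count checkpoints per test run is to snapshot pg_stat_bgwriter before each run and subtract afterwards; this sketch uses an arbitrary table name for the snapshot:
CREATE TABLE bgwriter_prev AS SELECT * FROM pg_stat_bgwriter;   -- take this snapshot before the test run
-- ...run the test workload...
SELECT cur.checkpoints_timed  - prev.checkpoints_timed  AS timed_checkpoints,
       cur.checkpoints_req    - prev.checkpoints_req    AS requested_checkpoints,
       cur.buffers_checkpoint - prev.buffers_checkpoint AS buffers_written_by_checkpoints
  FROM pg_stat_bgwriter cur, bgwriter_prev prev;
DROP TABLE bgwriter_prev;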

Also, note that if you're running your server on Windows, using a larger value for shared_buffers is not as effective as it is on other platforms. You may get better results by setting that to a relatively small amount and instead pushing up effective_cache_size to properly model how much memory the operating system is caching for you.

Appendix A: Looking at usage count data

The main difference between the LRU portion of the background writer, which now tries to tune itself under 8.3, and the all-scan that was removed from 8.3 is how it treats recently used buffers. Each buffer in the cache has a usage count on it. Every time that buffer is used for something, the count goes up, with an upper limit of 5. Every time the system passes over a buffer looking to allocate a new one the count goes down, and only ones with a count of 0 can be re-used for a new allocation.

The only way to really know how much the LRU writer might do for you is to actually take a look at the usage counts on your system. This can also give you some insight into what the 8.2 all-scan was writing out. There is a contrib module called pg_buffercache that lets you look into the buffer cache using SQL, but the version distributed with 8.2 doesn't show usage count data at all. If you're testing 8.3 with the same sort of workload you used to run on 8.2, you can presume the distribution was similar in that version. It's also possible to patch pg_buffercache to add this feature to 8.2. It's a small patch and doesn't touch the main server executable, so that may be an acceptable modification even on a live server. You can get that patch from the list archives. I've also assembled this and some other 8.3 features into a combined patch for 8.2, but that's a much riskier patch that you shouldn't even think of running on a production system.

Once you've got a pg_buffercache with the 8.3 improvement, you can install it using something like this:

cd contrib/pg_buffercache/
make
make install
psql -d pgbench -f pg_buffercache.sql
Here's a typical query that shows what the usage counts are on the system, summarized usefully:
pgbench=# select usagecount,count(*),isdirty from pg_buffercache group by
  isdirty,usagecount order by isdirty,usagecount;
 usagecount | count | isdirty 
------------+-------+---------
          0 |  5633 | f
          1 |  7495 | f
          2 |  6214 | f
          3 |  3913 | f
          4 |  4557 | f
          5 |  2982 | f
          4 |     1 | t
          5 |     2 | t
            |  1971 | 
(9 rows)
This particular example shows there's very little dirty data at all, suggesting either a recent checkpoint or aggressive background writer settings. There is a fairly even distribution of data across the usage counts, so I wouldn't expect the LRU writer (which, again, only writes things out when their count drops to 0) to be particularly effective on this system.

Note that running this query produces a significant load on the server, and you don't want to run this too frequently lest the monitoring drag down actual performance.

Appendix B: pg_stat_bgwriter sample analysis

Here is an example from a busier server than the earlier one, courtesy of a pgsql-general report, along with the advice that was given based on these statistics:
db=# select * from pg_stat_bgwriter;
checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
              118 |             435 |            1925161 |        126291 |                7 |         1397373 |       2665693
You had 118 checkpoints that happened because checkpoint_timeout passed. The other 435 happened before that, typically because checkpoint_segments worth of WAL was filled. This suggests you might improve your checkpoint situation by increasing checkpoint_segments, though that's not a bad ratio. Increasing that parameter and spacing checkpoints further apart gives the checkpoint-spreading logic of checkpoint_completion_target more room to work with, which reduces the average load from the checkpoint process.

During those checkpoints, 1,925,161 8K buffers were written out. That means the average checkpoint wrote out around 3,481 buffers, which works out to 27.2MB each. Pretty low, but that's an average; some checkpoints could have written a lot more while others wrote nothing, and you'd need to sample this data regularly to figure that out.
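That arithmetic can be pulled straight out of the view as well; here's a sketch, assuming the standard 8KB block size:
SELECT buffers_checkpoint / nullif(checkpoints_timed + checkpoints_req, 0)
         AS avg_buffers_per_checkpoint,
       round(buffers_checkpoint * 8 / 1024.0 / nullif(checkpoints_timed + checkpoints_req, 0), 1)
         AS avg_mb_per_checkpoint
  FROM pg_stat_bgwriter;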

The background writer cleaned 126,291 buffers (cleaned=wrote out dirty ones) during that time. 7 times, it wrote the maximum number it was allowed to before meeting its other goals. That's pretty low; if it were higher, it would be obvious you could gain some improvement by increasing bgwriter_lru_maxpages.

Since the last statistics reset, 2,665,693 8K buffers were allocated to hold database pages. Out of those allocations, 1,397,373 times a database backend (probably the client itself) had to write a page out in order to make space for the new allocation. That's not awful, but it's not great. You might try to get a higher percentage written by the background writer in advance of when the backends need the space by increasing bgwriter_lru_maxpages and bgwriter_lru_multiplier and by decreasing bgwriter_delay--making the changes in that order is the most effective strategy.
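In postgresql.conf terms, that advice might translate into a sketch like this; the specific values are illustrative only, and it's worth changing one at a time while re-checking pg_stat_bgwriter:
bgwriter_lru_maxpages = 500      # up from the 8.3 default of 100
bgwriter_lru_multiplier = 3.0    # up from the default of 2.0
bgwriter_delay = 100ms           # down from the default of 200ms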

Copyright 2007 Gregory Smith. Last update 03/04/2008.