PostgreSQL 8.3 Improvements and Migration

If you have a PostgreSQL database that is written to heavily, it's not difficult to find yourself suffering from significant delays (many seconds long) during the periodic database checkpoints. A checkpoint is a point in time where all the "dirty" (modified or added) data in the system's memory is guaranteed to have been written to disk; see WAL Configuration for a longer introduction to this topic. As the amount of memory available to the database (via the shared_buffers tunable) increases, the potential for a long delay at checkpoint time goes up. Since memory capacities in computer systems keep increasing far faster than improvements in disk write speed, checkpoint slowdowns become more likely every year.

PostgreSQL 8.0 introduced a way to write dirty buffers in advance of the checkpoint via a background writer process. It became obvious to many people during the development of the 8.3 release that this mechanism was no longer sufficient to keep up with current generation hardware. To improve this situation, the way checkpoints are written to disk was completely rewritten for this new version. Here we will go over the issues with the 8.0-8.2 releases, introduce the new features in 8.3 that improve this situation, and talk about how to migrate existing systems to the new version.
In the 8.2 model, processing the checkpoint occurs as fast as data can be written to disk. All of the dirty data is written out in one burst, then PostgreSQL asks the operating system to confirm the data has been written via the fsync call (see Tuning PostgreSQL WAL Synchronization for lots of details about what fsync does). 8.3 lets the checkpoint occur at a more leisurely pace.
PostgreSQL 8.3 adds an additional parameter, log_checkpoints. Turning that on results in every checkpoint creating a verbose commentary about how much disk activity it generated and how long it took to execute, which is much easier to draw conclusions from than the simple warnings generated by the earlier versions. You will still want to keep the log_min_duration_statement set to a useful value (one that only puts entries in the logs occasionally rather than all the time) if you have reason to suspect checkpoint issues are involved in a slowdown.
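For example, a postgresql.conf fragment along these lines would enable both (the 1000ms threshold is just an illustrative value, not a recommendation):

```
log_checkpoints = on
log_min_duration_statement = 1000    # milliseconds; example threshold, tune to taste
```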
For more quantitative analysis of checkpoints and the background writer, 8.3 also introduces the pg_stat_bgwriter view. This will show you exactly how much data has been written to disk by checkpoints and other components of PostgreSQL. Here's a sample:
pgbench=# select * from pg_stat_bgwriter;
checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
1 | 3 | 24114 | 69461 | 41 | 115348 | 130816
This shows the balance of checkpoints that happened because the timeout was hit vs. ones where a checkpoint was requested, which normally only happens when you've filled checkpoint_segments worth of WAL. In this example, there are relatively few buffers being written by the LRU background writer compared to ones written out by the client backends themselves. Combined with a fair number of maxwritten_clean events, this suggests bgwriter_lru_maxpages is set too small on this system. Figuring out things like whether the maxpages parameter was big enough was nearly impossible in earlier versions. In 8.3, some simple scripts that watch pg_stat_bgwriter can easily figure that out.
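To turn that kind of reading into a number, you can compute what fraction of all buffer writes each source accounted for. Here's a quick sketch in plain Python (not anything built into PostgreSQL), using the sample values above:

```python
# Values from the pg_stat_bgwriter sample above
buffers_checkpoint = 24114
buffers_clean = 69461      # written by the LRU background writer
buffers_backend = 115348   # written by client backends themselves

total_writes = buffers_checkpoint + buffers_clean + buffers_backend
backend_share = buffers_backend / total_writes
print(f"backend writes: {backend_share:.1%} of all buffer writes")
# -> backend writes: 55.2% of all buffer writes
```

Over half of all writes coming from the backends themselves is exactly the pattern described above: the LRU writer isn't keeping up.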
Here we see several of the ideas suggested above:

shared_buffers = 160MB
effective_cache_size = 5GB
checkpoint_segments = 10
bgwriter_delay = 200
bgwriter_lru_percent = 20.0
bgwriter_lru_maxpages = 200
bgwriter_all_percent = 10.0
bgwriter_all_maxpages = 600
In addition, a key part of this solution is configuring the background writer so that any dirty data that shows up gets written out almost immediately. For more information about these older background writer tunables, see the documentation. bgwriter_all_percent=10.0 means that 16MB of the shared buffer cache is being scanned each time the background writer executes, and at the default bgwriter_delay=200ms that means 80MB of buffer cache is being scanned per second--the whole cache could potentially be scanned every 2 seconds. If a substantial portion is dirty, that will be limited by the number of writes that can happen per scan. bgwriter_all_maxpages=600 means that the potential maximum I/O rate for that portion of the scan is 600*8KB/(200 msec)=24MB/sec. That rate is actually quite hard to reach, because the pages coming out of the background writer are not sequential writes--on a large database they will have a heavy seek component to them. These settings for the background writer, with scan percentages that creep into double digits (10% is a very high setting for bgwriter_all_percent in particular) and maxpages values that go into the several hundreds, are on the aggressive side. But when shared_buffers is set relatively low, as in this case, there aren't that many buffers to scan, so even a high percentage of them isn't all that many.
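The arithmetic above is easy to reproduce. A small sketch in plain Python, using 8KB pages and the settings from this example:

```python
PAGE_KB = 8
ROUNDS_PER_SEC = 5              # bgwriter_delay = 200ms -> 5 rounds per second
shared_buffers_kb = 160 * 1024  # shared_buffers = 160MB

# bgwriter_all_percent = 10.0: each round scans 10% of the buffer cache
scan_per_round_kb = shared_buffers_kb * 10 // 100
scan_per_sec_kb = scan_per_round_kb * ROUNDS_PER_SEC
print(scan_per_round_kb // 1024, "MB scanned per round")   # 16 MB
print(scan_per_sec_kb // 1024, "MB scanned per second")    # 80 MB

# bgwriter_all_maxpages = 600 caps how much can be written per round
max_write_kb_per_sec = 600 * PAGE_KB * ROUNDS_PER_SEC
print(max_write_kb_per_sec // 1024, "MB/sec maximum write rate")  # ~23 MB, rounded to 24 in the text
```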
This is a reasonable solution for working around checkpoint issues in 8.2.4, but it has weak spots: this system sometimes has checkpoints spaced only 30 seconds apart, and PostgreSQL is operating with a tiny shared_buffers cache, where it's expected that overall performance would improve if it were possible to increase that without getting killed by checkpoints. Since there are plenty of systems where making performance trade-offs like these just to get checkpoints under control was unacceptable, a whole new approach was added to PostgreSQL 8.3.
If your checkpoints are being driven by the timeout, it's fairly easy to predict how this will work. Let's say you have checkpoint_timeout=300 seconds. With checkpoint_completion_target=0.5 (the default), checkpoints should be finished writing out all their data by 150 seconds after one starts. If you set it to 0.9 instead, the practical upper limit, the checkpoints will aim to be finished 270 seconds later, leaving only 30 seconds for the operating system to finish writing everything out before the next checkpoint should start.
The parameter works similarly if your system tends to hit the segments limit before the timeout. For example, let's say your current system has checkpoint_segments=10. Assume that you have checkpoint_timeout set to a large number such that the checkpoints are typically being driven by the number of segments being filled (so you get a checkpoint every 10 WAL segments, period). If checkpoint_completion_target was set to 0.5, the expectation is that the writes for the currently executing checkpoint would be finished about the time that 0.5*10=5 segments of new WAL data had been written. If you set it to 0.9 instead, you'd expect the checkpoint is finishing just about when the 9th WAL segment is being written out, which again is cutting things a bit tight.
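Both cases reduce to the same multiplication. Here's a sketch in plain Python (the helper function is hypothetical, purely to illustrate the expectation, not anything in the server):

```python
def checkpoint_spread(target, timeout_s=None, segments=None):
    """Rough point at which a spread checkpoint aims to finish writing,
    against whichever limit is driving your checkpoints.
    Hypothetical illustration of checkpoint_completion_target."""
    if timeout_s is not None:
        return target * timeout_s   # seconds into the checkpoint cycle
    return target * segments        # WAL segments into the cycle

print(checkpoint_spread(0.5, timeout_s=300))   # 150.0 seconds
print(checkpoint_spread(0.9, timeout_s=300))   # 270.0 seconds
print(checkpoint_spread(0.9, segments=10))     # 9.0 segments
```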
You can get something similar to the 8.2 behavior by setting checkpoint_completion_target=0.0.
The other portion of the background writer, the LRU scan, only writes out blocks that haven't been used recently, in anticipation that they will be re-used for new data in the near future. There hasn't been much strong evidence that this is really a benefit on highly loaded systems either. Most operating systems cache writes in such a way that waiting until the buffer is absolutely needed before writing isn't as big of a problem as one might think. However, there may very well be cases where preemptively doing such writes can be helpful, particularly because it gets writes heading out to the disk earlier than they otherwise might. Most default Linux configurations, for example, can wait as long as 30 seconds between when you write a disk block and when the OS actually starts putting that data onto disk.
But because there are some inefficiencies possible here as well, where you might write data out only to discover it was dirtied again before the buffer could be reused, the LRU background writer was redesigned for 8.3 using a "just in time" model. The writer now looks at how many buffers your system typically allocates per unit of time, then only tries to keep just that many clean buffers in advance of actual usage. Accordingly, the idea of scanning a fixed percentage of the buffer pool every time was dropped. Instead, the only useful tunable for general aggressiveness of the background writer is bgwriter_lru_multiplier. The total for expected upcoming allocations is multiplied by this number to actually determine how many clean buffers the system tries to keep available. The default for the multiplier is 2.0, which via pre-beta testing was determined to give sufficient slack for the things that keep the estimate from being completely accurate: transitions from idle to full speed, buffers that get dirtied before they can be reused, etc. Pre-beta tests have suggested this model is fairly robust, but it's known to work less well as you decrease bgwriter_delay significantly. As you lower that interval tunable towards the lower practical limit of its range (10ms is the lowest useful setting on many platforms, and smaller values don't make a lot of sense anywhere), you may discover you need to increase the multiplier a bit, perhaps to 3.0 or 4.0.
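A deliberately simplified model of that logic can make the roles of the tunables concrete. This plain Python sketch is not the actual server implementation (which, among other things, smooths its allocation estimate over time); it is just an illustration of the multiplier and the maxpages cap interacting:

```python
def lru_writes_this_round(recent_allocs, clean_available,
                          lru_multiplier=2.0, lru_maxpages=100):
    """Simplified model: how many buffers the LRU writer tries to clean
    in one bgwriter_delay interval. Defaults mirror the 8.3 settings
    discussed in the text (multiplier 2.0, maxpages 100)."""
    # Target: keep multiplier * expected demand worth of clean buffers ready
    target_clean = int(recent_allocs * lru_multiplier)
    needed = max(0, target_clean - clean_available)
    # bgwriter_lru_maxpages caps the I/O generated per interval
    return min(needed, lru_maxpages)

print(lru_writes_this_round(recent_allocs=50, clean_available=60))   # 40
print(lru_writes_this_round(recent_allocs=200, clean_available=0))   # 100 (hits the cap)
```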
Note that all this LRU writing is still capped, with bgwriter_lru_maxpages being the most pages that can be written in one interval. This lets you put an upper limit on how much disk I/O the background writer generates. The default there used to be an emaciated 5 buffer pages per cycle. With improvements in the background writer internals and general forward progress in computing technology, the 8.3 default is now 100 pages. That makes the default maximum write rate 100*8KB/(200 msec)=4MB/sec. That's not particularly high relative to modern disks, and the automatic tuning of the JIT mechanism means you'll only hit even that rate if you're doing enough activity to justify it.
Once you've done all that, a useful test, if you have a way to generate a load representative of your production data, is to run that test with checkpoint_completion_target at 0.5 (the default), 0.7, and 0.9 to see how each of those works out for you (0.7 is the one to drop if you only have time for two runs). Based on the results of those tests, it may be possible to further increase shared_buffers, and checkpoint_segments/checkpoint_timeout may need some adjustment one way or another.
One thing to be careful about when comparing the 8.2 and 8.3 behavior is that you may be getting far fewer checkpoints than before. It's important to have an idea how many checkpoints to expect during each test run in order to put together a fair comparison. As you increase checkpoint_segments, you need to think in terms of "how many sluggish transactions am I seeing per checkpoint?", not how many total per test run.
Also, note that if you're running your server on Windows, using a larger value for shared_buffers is not as effective as on other platforms. You may get better results by setting that to a relatively small amount and instead pushing up effective_cache_size to properly model how much memory the operating system is caching for you.
The only way to really know how much the LRU writer might do for you is to actually take a look at the usage counts on your system. This might also give you some insight into what the 8.2 all-scan is writing out. There is a contrib module called pg_buffercache that lets you look into the buffer cache using SQL, but the version distributed with 8.2 doesn't show usage count data at all. If you're testing 8.3 using the same sort of workloads you used to run on 8.2, you can presume that 8.2 had a similar distribution. It's also possible to patch pg_buffercache to add this feature to 8.2. It's a small patch that doesn't impact the main server executable, so it may be an acceptable modification even on a live server. You can get that patch from the list archives. I've also assembled this and some other 8.3 features into a combined patch for 8.2, but that's a much more risky patch that you don't want to even think of running on a production system.
Once you've got a pg_buffercache with the 8.3 improvement, you can install it using something like this:
cd contrib/pg_buffercache/
make
make install
psql -d pgbench -f pg_buffercache.sql

Here's a typical query that shows what the usage counts are on the system, summarized usefully:
pgbench=# select usagecount,count(*),isdirty from pg_buffercache group by
isdirty,usagecount order by isdirty,usagecount;
usagecount | count | isdirty
------------+-------+---------
0 | 5633 | f
1 | 7495 | f
2 | 6214 | f
3 | 3913 | f
4 | 4557 | f
5 | 2982 | f
4 | 1 | t
5 | 2 | t
| 1971 |
(9 rows)
This particular example shows there's very little dirty data at all, suggesting a recent checkpoint or aggressive background writer settings. There is a pretty even distribution of data at various usage counts, so I wouldn't expect the LRU writer (which again only writes things when their count drops to 0) to be particularly effective on this system.
Note that running this query produces a significant load on the server, and you don't want to run this too frequently lest the monitoring drag down actual performance.
db=# select * from pg_stat_bgwriter;
checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
118 | 435 | 1925161 | 126291 | 7 | 1397373 | 2665693
You had 118 checkpoints that happened because of checkpoint_timeout passing. 435 more happened before the timeout, typically because checkpoint_segments worth of WAL had been filled. This suggests you might improve your checkpoint situation by increasing checkpoint_segments, but that's not a bad ratio. Increasing that parameter and spacing checkpoints further apart helps give the checkpoint spreading logic of checkpoint_completion_target more room to work over, which reduces the average load from the checkpoint process.
During those checkpoints, 1,925,161 8K buffers were written out. That means on average, a typical checkpoint is writing 3481 buffers out, which works out to be 27.2MB each. Pretty low, but that's an average; there could have been some checkpoints that wrote a lot more while others wrote nothing, and you'd need to sample this data regularly to figure that out.
The background writer cleaned 126,291 buffers (cleaned=wrote out dirty ones) during that time. 7 times, it wrote the maximum number it was allowed to before meeting its other goals. That's pretty low; if it were higher, it would be obvious you could gain some improvement by increasing bgwriter_lru_maxpages.
Since last reset, 2,665,693 8K buffers were allocated to hold database pages. Out of those allocations, 1,397,373 times a database backend (probably the client itself) had to write a page in order to make space for the new allocation. That's not awful, but it's not great. You might try to get a higher percentage written by the background writer in advance of when the backend needs them by increasing bgwriter_lru_maxpages and bgwriter_lru_multiplier, and decreasing bgwriter_delay--making the changes in that order is the most effective strategy.
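The per-checkpoint and backend-write arithmetic from this example can be double-checked the same way, in plain Python:

```python
# Values from the pg_stat_bgwriter sample above
checkpoints = 118 + 435            # timed + requested
buffers_checkpoint = 1925161
buffers_backend = 1397373
buffers_alloc = 2665693

per_ckpt = buffers_checkpoint // checkpoints
print(per_ckpt, "buffers per checkpoint on average")          # 3481
print(round(per_ckpt * 8 / 1024, 1), "MB per checkpoint")     # 27.2
print(f"{buffers_backend / buffers_alloc:.0%} of allocations forced a backend write")  # 52%
```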