Testing your disks for PostgreSQL

Like all database software, PostgreSQL performance is largely dependent on how the disk drives on the database server are configured. Before you can usefully adjust PostgreSQL to improve throughput, it is wise (some would say necessary) to do some basic testing on your disks to determine how fast they are. It's unfortunately common to find systems where disk performance is grossly below expectations, for a variety of reasons. If you don't catch that early, you can waste a lot of time adjusting database parameters that won't have the slightest impact, because operations are bottlenecked by the underlying physical I/O limitations. It doesn't matter how well respected the vendor is who told you your system is a great performer; if you haven't tested it yourself, don't believe it.

Warning: all of these tests assume you have the server basically to yourself in order to get accurate results. If other people are using the server at the same time, they will badly skew the results here, and you will likely slow them down considerably. Also, make sure you have enough disk space available in the area where you run the tests; filling up your disk with test data and crashing other programs because of it is a bad scene I don't recommend.
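If you aren't sure how much free space or memory you have to work with, it only takes a moment to check before starting. Here's one way on a Linux system; the path is just a placeholder for wherever you plan to write the test files:

$ df -h /path/to/testdir    # free space on the filesystem you'll be writing to
$ free -m                   # installed RAM in MB, which sets how big the test file must be

On other UNIX variants, substitute your platform's equivalent of free.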

Basic read and write tests

The most basic useful test you can execute copies a large block of data to and from disk. In order to minimize the effect of disk caching, you need to copy an amount significantly larger than the amount of memory in your system; 2X the amount of RAM in your server is a useful amount. Here is how you can use the UNIX dd utility to execute that test, writing the data out in 8KB blocks similar to the ones PostgreSQL writes:

1. Compute 2X the size of your RAM in 8KB blocks: blocks = 250,000 * gigabytes of RAM (a shell sketch that computes this for you follows the steps below)

2. Time how long it takes to write that many blocks and flush the data to disk like this:

# time sh -c "dd if=/dev/zero of=bigfile bs=8k count=_blocks_ && sync"
3. Time reading that data off disk again:
# time dd if=bigfile of=/dev/null bs=8k
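If you'd rather not do that math in your head, a few lines of shell can compute the block count for you. This is just a sketch that assumes a Linux system, where /proc/meminfo reports the installed memory in kilobytes:

RAM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
BLOCKS=$((RAM_KB * 2 / 8))      # 2X RAM, expressed as a count of 8KB blocks
echo "using count=$BLOCKS"
time sh -c "dd if=/dev/zero of=bigfile bs=8k count=$BLOCKS && sync"

On other UNIX variants you'll need to get the memory size some other way, but the arithmetic is the same.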
For our test system with 1GB of RAM this is easy: 250,000 * 1 = 250,000 blocks. Here are the results:
-bash-3.00$   time sh -c "dd if=/dev/zero of=bigfile bs=8k count=250000 && sync"
250000+0 records in
250000+0 records out

real    0m47.961s
user    0m0.086s
sys     0m6.018s
That's 41.7MB/sec writing with 12.7% CPU utilization. Since there are 2 processors in this system, it's better to think of that as 25.4% of one processor. If we can write as fast as the disks can keep up while only using 1/4 of a processor to do it, that suggests the CPUs in this system should be able to keep up with an I/O-bound load.
-bash-3.00$   time dd if=bigfile of=/dev/null bs=8k
250000+0 records in
250000+0 records out

real    0m35.450s
user    0m0.073s
sys     0m2.128s
That's 56.4MB/sec reading with 6.2% CPU utilization.
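If you want to check that arithmetic yourself, throughput is just the test file size divided by the elapsed ("real") time, and the CPU figure is user plus sys time divided by elapsed time. Here are the numbers above reproduced with bc, using the 250,000 x 8KB file size:

$ echo "scale=1; 250000 * 8 / 1000 / 47.961" | bc      # write throughput: 41.7 MB/sec
$ echo "scale=1; 250000 * 8 / 1000 / 35.450" | bc      # read throughput: 56.4 MB/sec
$ echo "scale=1; 100 * (0.086 + 6.018) / 47.961" | bc  # write CPU: 12.7%
$ echo "scale=1; 100 * (0.073 + 2.128) / 35.450" | bc  # read CPU: 6.2%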

These results are pretty good for a single 7200RPM disk; 56MB/s read and 42MB/s write is certainly close to the maximum I/O you can expect one cheap drive to accomplish. Your results should scale based on disk technology and number of disks. For example, were this a RAID-0 volume with 4 disks in it, I'd want to see something closer to 200MB/s as a read result here.

What's nice about dd test results is that it's very hard to argue with them. If you tell your vendor "I only get 10MB/s of writes when I run a simple dd test with 8k blocks", there is not a lot of room for them to weasel out of that by saying your test methodology is unsound.

Perform full disk tests with bonnie++

bonnie++ is as close to an industry standard as there is in the area of testing UNIX disk performance. It will make a reasonable guess about how much data to write for you, and it displays the results in several easy-to-read formats. It takes considerably longer to run than the simple dd test, and it's a bit easier for people to argue that the test might be unsound, but generally a bonnie++ result is all an experienced UNIX person needs to get a good idea of what your disks are capable of. Here are the results of a bonnie++ run on our test server:
# ./bonnie++
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thud            2G 33324  60 46554  13 24155   5 47561  81 55243   5 182.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
thud,2G,33324,60,46554,13,24155,5,47561,81,55243,5,182.4,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
Note that the block results here (46MB/s write with 13% CPU, 55MB/s read with 5% CPU) are essentially the same as the dd results above. For database use, those block figures are the main things that matter; the database doesn't do its I/O on a character-at-a-time basis. We can create a fancy HTML version of the output by pasting the CSV summary line back into another bonnie++ utility:
# chmod +x ./bon_csv2html
# echo thud,2G,33324,60,46554,13,24155,5,47561,81,55243,5,182.4,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++ | ./bon_csv2html > disk.htm
That produces the following chart:
[HTML chart produced by bon_csv2html, presenting the same thud results as the bonnie++ output above]
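If you'd rather script this than cut and paste, the CSV summary is the last line bonnie++ prints, so something along these lines should work. The -d, -s, and -u options (test directory, test file size in MB, and the user to run as when starting it as root) are standard bonnie++ flags, but check the man page for your version; /tmp and postgres here are only example values:

# ./bonnie++ -d /tmp -s 2048 -u postgres | tail -n 1 | ./bon_csv2html > disk.htm

Point -d at the filesystem you actually care about and set -s to twice your RAM in megabytes.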

What does a poor result look like?

The above example is a good result from a disk controller well suited to high-performance PostgreSQL duty. It's instructive to look at a slower result to see how it differs. Here is a bonnie++ result for a system that's in the same general class as our test server: Pentium 4 3GHz, 1GB RAM, running RedHat Enterprise Linux 4.0. It has an LSI Logic MegaRAID 320-2 SCSI Controller, running with a RAID-1 pair of 36GB 10K RPM SCSI disks. The controller is configured with a WriteThru cache:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
slowserver       2G 20583  49 20830   9 10314   3 34440  72 57350   7 447.0   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  3021  97 +++++ +++ +++++ +++  2938  93 +++++ +++  7080 100
slowserver,2G,20583,49,20830,9,10314,3,34440,72,57350,7,447.0,0,16,3021,97,+++++,+++,+++++,+++,2938,93,+++++,+++,7080,100
[The same slowserver results rendered as an HTML chart via bon_csv2html]
The block read score (57.3MB/s) suggests we're only getting one disk's worth of transfers there; a more advanced setup might get closer to two disks out of this sort of RAID-1 configuration. The really disappointing number is the block write rate, which is only 20.8MB/sec. That's less than half of what a single disk should get. Because I tuned this server for paranoia rather than performance by turning off the controller's write cache, this is disappointing but not surprising; in this type of setup the controller has to execute writes against both drives before returning to the OS. This is a server where the data is worth a lot but it doesn't have to run very fast at all, so never relying on the write cache was the right choice here.

On the plus side, these disks do get a much better seek rating than my test server. This is a combination of them having a faster rotation speed (10K vs. 7200RPM), the fact that these SCSI disks are much smaller (seeks on a 36GB drive can execute a whole lot faster than they do on a 160GB one), and that seek reads may be getting split between the two drives in the RAID-1 volume.

You'll often find people recommending the LSI controllers for Linux SCSI RAID, and for good reason: they are very stable and reliable. They're just not fast for this application. This sort of thing is exactly why you need to run your own performance tests on your hardware. There are so many links in the disk performance chain, any one of which can completely destroy throughput, that testing is the only way to make sure you're getting the end-to-end performance you expect. This isn't limited to hardware, either. There are plenty of ways to screw up things like filesystem configuration, where everything from a bad journaling setup to poorly performing LVM software can trash the results from otherwise solid equipment. Test yourself, make sure you understand the results, and when you do run into a performance issue you'll be in a much better position to understand at what level it's being introduced.
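As a starting point for that kind of verification, it only takes a minute to confirm what filesystem, mount options, and volume layers your database directory is actually sitting on. A quick Linux-oriented sketch, assuming the RHEL-style default data location under /var/lib/pgsql:

$ df -h /var/lib/pgsql   # which device and filesystem the data directory lives on
$ mount                  # filesystem types and mount options currently in effect
$ lvs                    # lists LVM logical volumes, if the lvm2 tools are installed (run as root)

Substitute wherever your data directory actually lives; on non-Linux systems use your platform's equivalents.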


Notes, references, and credits

Copyright 2007 Gregory Smith. Posted on 5/15/2007, last update 5/19/2009.