Tuning PostgreSQL WAL Synchronization

 

This is an introduction to the low-level implementation of the write logging used in PostgreSQL 8.2.  The database engine ensures data integrity by a process that involves writing data out to a Write-Ahead Log (WAL), then writing to the main database.  Because PostgreSQL runs on so many operating systems, the exact way the WAL is handled varies considerably from platform to platform.  Tuning how writes to the WAL are done without compromising the data integrity requirements of the database requires understanding quite a few low-level implementation details; that’s what’s covered here in excruciating detail.

 

What’s a WAL?

 

Much of this won’t make sense unless you’re already familiar with the terminology of the Write-Ahead Log.  There are two sections of the PostgreSQL documentation that are prerequisites here.  I recommend reading them from the latest documentation, currently located at http://www.postgresql.org/docs/8.2/static/ .  First read chapter 27, “Reliability and the Write-Ahead Log”, then skim section 17.5, “Write Ahead Log”, to see which parameters can be adjusted.

The section we’re focusing on here is 17.5.1, specifically wal_sync_method (yes, this entire document is about one parameter).  PostgreSQL lets you specify which methods it should use when writing to a WAL file via the configuration wal_sync_method.  From the documentation:

·         open_datasync (write WAL files with open() option O_DSYNC)

·         fdatasync (call fdatasync() at each commit)

·         fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)

·         fsync (call fsync() at each commit)

·         open_sync (write WAL files with open() option O_SYNC)

Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform.

This is certainly cryptic if you don’t know something about UNIX file mechanics.  To get started, you can determine which fsync method is currently being used by your database (helpful when you’re letting the server pick automatically) inside psql with:

     show wal_sync_method;

Here’s a summary of which options are available on various popular platforms.  Look at the line for your platform and read from left to right; the first one that says Yes or Direct will be your default:

 

Platform          open_datasync   fdatasync   fsync_writethrough   fsync   open_sync

Linux             No              Yes         No                   Yes     Direct

Mac OS X/Darwin   No              No          Yes                  Yes     Yes (Direct?)

Solaris           Yes             Yes         No                   Yes     Yes

Windows           Yes (Direct?)   No          Yes                  Yes     Yes

BSD/OS 4.3        No              No          No                   Yes     Yes (Direct?)

FreeBSD 4.9       No              No          No                   Yes     Yes (Direct?)


There seems to be a problem with O_DIRECT not working right on Solaris.  The above table will make more sense after each method is explained.  But first we need a diversion into hard disk technology and UNIX file writing techniques.

 

Write Caches and Disk Arrays

 

To understand why this decision is complicated at all, you need to know something about how hard drives deal with disk writes.  Disk performance at all levels is greatly affected by caching.  At the lowest level, hard drives have their own internal cache that’s used for both reads and writes.  PostgreSQL expects that the disks connected to it can be trusted to correctly report when data has been written to them.  Inexpensive IDE and SATA drives are widely criticized for PostgreSQL use because they often “lie”, reporting that data has been written when in fact it’s just been put into the disk cache.  A good introduction to this subject is at http://www.jasonbrome.com/blog/archives/2004/04/03/writecache_enabled.html and comments on how this interacts with Postgres are at http://www.thescripts.com/forum/threadnav173635-2-10.html (it’s also a regular topic of discussion on the PostgreSQL performance mailing list at http://archives.postgresql.org/pgsql-performance/ ).

 

So the first rule here is that if your hardware isn’t actually writing when it says it is, your performance numbers will be inflated; that’s extra performance at the expense of reliability, and eventually you’ll get database corruption.  SCSI disks are considered much more trustworthy about actually having written data when they say they have.  This is one of the reasons the performance difference between IDE and SCSI has seemed to narrow recently; write caching artificially inflates the IDE results, so the better SCSI hardware doesn’t look as impressive as it is.  Unfortunately, IDE drives are so dependent on this feature that turning it off gives unreasonably low results; see section 11.12.1.5 at http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/configtuning-disk.html for comments on how FreeBSD has been challenged by this issue.

 

If you’re using any disk array or intelligent controller card, again you’re back to needing to make sure that disk writes are being completed when they say they are.  Better controllers and arrays have their own cache, and in those cases you normally need to make sure it’s configured in write-through mode for proper Postgres integrity.

A good reference implementation here for Linux is the LSI MegaRAID line of SCSI cards, which have a write-caching policy that’s adjustable in the card BIOS.

 

One approach regularly debated on the PostgreSQL performance list uses a card or array that has a battery-backed disk cache on it.  That allows you to get the instant fsync() response that allows extremely high database disk write rates, while still ensuring that data reported as written will properly make it to the disks eventually, even if there’s a system failure.  Critics of the battery backup approach suggest that if you run such a system under load, eventually you’ll have a failure in this relatively complicated cache mechanism that will corrupt your database in a way that’s nearly impossible to detect or recover from.  Fans of a battery-backed cache suggest they can’t achieve their target workloads without it, and that a good backup strategy is the cure for an event deemed as rare as this one.

 

WAL writing

 

In order to completely tune how your system performs, it’s necessary to learn a bit more than you probably wanted to know about how UNIX does file I/O operations.

 

The PostgreSQL Write-Ahead Log (WAL) does only a few basic operations to the logs it works on:

 

1)      Open a log file

2)      Write a series of records to the log

3)      Close the log file

4)      Switch to or create a new log file

5)      Repeat
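To make the pattern concrete, here’s a minimal Python sketch of that life cycle.  The file naming, the wal_writer function, and the segment-switching logic are all invented for illustration; only the 16MB segment size matches PostgreSQL’s actual default, and the real WAL code is of course far more involved.

```python
import os, tempfile

SEGMENT_SIZE = 16 * 1024 * 1024   # PostgreSQL WAL segments are 16MB by default

def wal_writer(directory, records, segment_size=SEGMENT_SIZE):
    """Illustrative sketch of the WAL life cycle: open a log file, append
    records to it, and close/switch to a new file whenever one fills up."""
    segment_no, written = 0, 0
    fd = os.open(os.path.join(directory, f"{segment_no:08d}.wal"),   # 1) open a log file
                 os.O_WRONLY | os.O_CREAT)
    for record in records:
        os.write(fd, record)               # 2) write a series of records
        written += len(record)
        if written >= segment_size:        # 3) close the log file...
            os.close(fd)
            segment_no += 1                # 4) ...and switch to a new one
            fd = os.open(os.path.join(directory, f"{segment_no:08d}.wal"),
                         os.O_WRONLY | os.O_CREAT)
            written = 0
    os.close(fd)

# Tiny segment size so the demo rotates after 16 bytes of records
wal_dir = tempfile.mkdtemp()
wal_writer(wal_dir, [b"begin;", b"update...;", b"commit;"], segment_size=16)
```

The missing piece in this sketch is the whole point of the next few sections: step 2 has to guarantee the records are physically on disk before a commit is acknowledged.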

 

Normally, when you ask to write a disk block to a drive, your operating system caches that information and writes it when it gets a chance.  This write-caching greatly improves performance for normal applications.  In order for the WAL mechanism to work correctly, Postgres needs to know when the data written has actually been committed to the disk drive. 

 

Synchronous Writes

 

Situations where the application can’t continue until the data has been completely written out are referred to by UNIX as synchronous writes.  When you open a UNIX file using the POSIX standard calls (see http://www.opengroup.org/onlinepubs/007908799/xsh/open.html ) you can ask that all writes to that file be done in one of two synchronous modes:  O_DSYNC and O_SYNC.  The official spec doesn’t tell you a lot about the difference between these two.  A better discussion comes from IBM’s discussion of how these settings impact journaling filesystems at http://publib16.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixprggd/genprogc/fileio.htm#wq222

From that document:

 

O_DSYNC:  When a file is opened using the O_DSYNC open mode, the write() system call will not return until the file data and all file system meta-data required to retrieve the file data are both written to their permanent storage locations.

 

O_SYNC: In addition to items specified by O_DSYNC, O_SYNC specifies that the write() system call will not return until all file attributes relative to the I/O are written to their permanent storage locations, even if the attributes are not required to retrieve the file data.

 

The main thing to take away from this description is that O_DSYNC can be a much smaller operation than O_SYNC.  O_SYNC requires that all file-system meta-data updates be completed, while O_DSYNC just requires that the writing of your data be completed.  Since database systems like PostgreSQL have their own method for dealing with metadata within the WAL, there’s no need to wait for all the file-system information to be written out; the data alone is enough.
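As a concrete sketch of the open-flag approach, here’s how a file could be opened for synchronous writing in Python.  The open_wal_sync name is invented for illustration; note that os.O_DSYNC is only defined on platforms that support it, which is why the sketch falls back to the heavier O_SYNC:

```python
import os, tempfile

def open_wal_sync(path):
    """Open a file so that every write() is synchronous.  Prefer O_DSYNC,
    which only waits for the data (plus the metadata needed to get it
    back); fall back to O_SYNC, which also waits on file attributes."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    flags |= getattr(os, "O_DSYNC", os.O_SYNC)
    return os.open(path, flags)

path = os.path.join(tempfile.gettempdir(), "wal_sync_demo")
fd = open_wal_sync(path)
os.write(fd, b"commit record")   # doesn't return until the data is on stable storage
os.close(fd)
```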

 

The other way to cope with a synchronized writing situation is to open the file normally, write your data, and then ask the operating system to flush that data.  There are two ways to do this under UNIX:

 

fdatasync: http://www.opengroup.org/onlinepubs/007908799/xsh/fdatasync.html

fsync: http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html

 

These two operate similarly to the above, with fdatasync being like a write done with O_DSYNC, while fsync is like a write with O_SYNC. 
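A minimal sketch of this write-then-flush style, with invented names; os.fdatasync isn’t available on every platform (Windows lacks it, for example), so the sketch falls back to os.fsync:

```python
import os, tempfile

def commit(fd, record, flush=os.fsync):
    """Write a record, then block until the kernel reports it's on disk.
    Passing os.fdatasync (where the platform provides it) skips the
    metadata-only flush, like O_DSYNC; os.fsync waits for everything,
    like O_SYNC."""
    os.write(fd, record)
    flush(fd)

path = os.path.join(tempfile.gettempdir(), "wal_flush_demo")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
commit(fd, b"commit record", flush=getattr(os, "fdatasync", os.fsync))
os.close(fd)
```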

 

For a description of what’s going on at the programmer level, recommended reading is the second edition of “Advanced Programming in the UNIX Environment” by Stevens/Rago.  Chapter 3, “File I/O”, covers all these concepts, and section 3.14 (“fcntl Function”) even includes a simple benchmark showing the difference in speed among these methods on Linux and Mac OS X.

Direct I/O

Since we can’t use the operating system’s disk cache usefully, why not bypass it altogether?  As of PostgreSQL 8.1, writing with O_DSYNC/O_SYNC will also use O_DIRECT when available, which keeps WAL writes from going through the operating system’s buffer cache.  Since the WAL is never read during normal operation, this can be a big benefit, especially for write-heavy loads.  More information about O_DIRECT in a database context is at http://www.mgogala.com/directio.pdf and the actual details of O_DIRECT writes are at http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm
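Here’s a hedged sketch of what an O_DIRECT write involves.  os.O_DIRECT is only defined where the OS provides it, and the flag requires that the buffer, the write length, and the file offset all be suitably aligned, which is why the data is staged in a page-aligned mmap buffer; not every filesystem accepts the flag either, so the sketch falls back to a normal write.  The write_direct name is invented for illustration:

```python
import mmap, os, tempfile

def write_direct(path, data):
    """Sketch of an O_DIRECT write.  Buffer address, length, and file
    offset must all be aligned (typically to the sector or page size),
    so the data is staged in a page-aligned buffer and written as one
    full block.  Falls back to a normal buffered write if O_DIRECT is
    missing or the filesystem refuses it."""
    block = mmap.mmap(-1, mmap.PAGESIZE)       # page-aligned scratch buffer
    block[:len(data)] = data                   # remainder stays zero-padded
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    try:
        fd = os.open(path, flags | os.O_DIRECT)
    except (AttributeError, OSError):          # no O_DIRECT, or fs refused it
        fd = os.open(path, flags)
    os.write(fd, block)                        # one aligned, full-block write
    os.close(fd)

path = os.path.join(tempfile.gettempdir(), "wal_direct_demo")
write_direct(path, b"commit record")
```

The full-block, zero-padded write is also a reasonable picture of why O_DIRECT suits the WAL: the log is written in whole blocks anyway, and the cached copy the OS would otherwise keep is never read back.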

fsync_writethrough

We’ve now covered four of the methods available for synchronizing the WAL.  Just what is fsync_writethrough then?

Any time the drive itself is caching writes, you can’t necessarily assume fsync will actually make sure the data reaches the disk platters.  Some operating systems recognize this, and have a way to call fsync that also forces the drive’s cache to flush. 

Right now, fsync_writethrough works on two platforms (the same two that show Yes in that column of the table above):

·         Windows

·         Mac OS X/Darwin

On other platforms, fsync_writethrough returns an error (unless fsync is turned off altogether).

 

The default method for writing under Windows is open_datasync, which assumes write caching is not occurring on your drives; if the drives cache, you don’t have proper data integrity.  If you want to keep caching on, but have PostgreSQL work as designed, switching to fsync_writethrough is one option.  http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php has additional notes on this topic.
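On Mac OS X, the mechanism behind this behavior is the fcntl F_FULLFSYNC operation, which asks the drive itself to flush its cache; on Windows the equivalent write-through request goes through the FlushFileBuffers API.  Here’s a sketch of the Darwin-style call in Python, with an invented function name, falling back to a plain fsync on platforms without F_FULLFSYNC:

```python
import fcntl, os, tempfile

def fsync_writethrough(fd):
    """Flush a file all the way through the drive's write cache.  On
    Mac OS X this is fcntl(F_FULLFSYNC); elsewhere this sketch falls
    back to a plain fsync, which may stop at the drive's cache."""
    if hasattr(fcntl, "F_FULLFSYNC"):
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)   # Darwin only
    else:
        os.fsync(fd)

path = os.path.join(tempfile.gettempdir(), "wal_writethrough_demo")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
os.write(fd, b"commit record")
fsync_writethrough(fd)
os.close(fd)
```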

 

Back to the beginning

 

Repeating where we started from, PostgreSQL lets you specify which of these methods it should use when writing to a file via the configuration wal_sync_method.  Repeating the documentation again, which should make a lot more sense now:

·         open_datasync (write WAL files with open() option O_DSYNC)

·         fdatasync (call fdatasync() at each commit)

·         fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)

·         fsync (call fsync() at each commit)

·         open_sync (write WAL files with open() option O_SYNC)

Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform.

The developers have put the possible write methods in order of how efficient they usually are.  The database server prefers to write with O_DSYNC, but it will use the various fsync-type calls if that’s not available.  Should none of those be available either, it falls back to writing with O_SYNC.
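That selection logic amounts to taking the first supported entry from the preference list.  This sketch is purely illustrative; PostgreSQL actually makes the decision at compile time, based on which calls and open flags the platform defines:

```python
# Preference order from the documentation; which entries a platform
# actually supports is determined at build time in PostgreSQL itself.
PREFERENCE = ["open_datasync", "fdatasync", "fsync_writethrough",
              "fsync", "open_sync"]

def default_wal_sync_method(supported):
    """Return the first method in preference order the platform supports."""
    for method in PREFERENCE:
        if method in supported:
            return method
    raise RuntimeError("no sync method available")

# Linux (per the table above) lacks O_DSYNC, so fdatasync wins there:
print(default_wal_sync_method({"fdatasync", "fsync", "open_sync"}))  # fdatasync
```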

These defaults are good ones, and most setups will never need to touch this parameter (with the Windows case covered in the fsync_writethrough section being one glaring exception).  The problem comes when you need to optimize use of PostgreSQL on hardware that does some of this work for you, like when using a storage array that actually has reliable disk caching.  A basic introduction to fsync concepts from a performance perspective is at http://developer.osdl.org/dev/doubt/diskio/documents/2005-10-19-visualizing_diskio.pdf. 

test_fsync

If you want to learn exactly what the capabilities of your platform are, the first thing you should do is compile and test out the test_fsync tool.  It comes with the PostgreSQL source code, found under the src/tools/fsync directory.  You use it like this:

cd postgres/src/tools/fsync

make

./test_fsync -f /databasedirectory/fsync_test.out

 

If you get results that are all basically 0, that probably means you don’t have the permissions needed to perform all the operations this tool requires in that directory. 

 

Results from this tool are measured in milliseconds.  Here is a sample run:

 

Simple write timing:

        write                    0.009100

 

Compare fsync times on write() and non-write() descriptor:

(If the times are similar, fsync() can sync data written

 on a different descriptor.)

        write, fsync, close      4.090127

        write, close, fsync      4.092272

 

Compare one o_sync write to two:

        one 16k o_sync write     4.204583

        two 8k o_sync writes     8.308203

 

Compare file sync methods with one 8k write:

 

        (o_dsync unavailable)

        write, fdatasync         4.295135

        write, fsync,            4.325998

 

Compare file sync methods with 2 8k writes:

        (o_dsync unavailable)

        open o_sync, write       8.301195

        write, fdatasync         4.293409

        write, fsync,            4.280166

 

This is from a Linux system, a platform that doesn’t have O_DSYNC.  You can compare these results against the theoretical maximum performance based on the rotation rate of the disk drive involved, in this case a 15,000 RPM model:

 

  RPM     Rotations/sec   Rotation (ms)

 5400          90.0           11.1

 7200         120.0            8.3

10000         166.7            6.0

15000         250.0            4.0

 

At around 4.3ms to write and sync, we’re pretty close to the maximum speed the drive is capable of (4ms).  If the results were better than predicted by an analysis based on RPM, you would have reason to believe there is some write caching at work.  Write-caching that PostgreSQL doesn’t know about is deadly to its ability to ensure data integrity for your database.
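The numbers in the table come straight from the drive’s rotation rate: one revolution takes 60000/RPM milliseconds, and a worst-case commit has to wait roughly one full rotation for the WAL block’s position to come back under the head.  A quick check of the math:

```python
def rotation_ms(rpm):
    """Worst-case rotational delay: one full revolution, in milliseconds."""
    return 60_000 / rpm

for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:>6} RPM: {rpm / 60:6.1f} rotations/sec, "
          f"{rotation_ms(rpm):4.1f} ms/rotation")
```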

 

So all you need to do here is pick the method with the fastest speed, and you’re done, right?  Not necessarily.  This is a test that writes a small amount of data.  In the real world, when writing an enormous amount of data, you may discover that the implementation quirks of your hardware or of the filesystem your database is stored on strongly prefer one mode over the others, regardless of what the low-level test suggests.  The overhead of the fsync and fdatasync calls gets larger as the amount of data you’re passing through a filesystem goes up, and it’s possible for writing in O_SYNC mode to outperform them under load, even though this simple test suggests O_SYNC is the slowest method to use.  This is particularly true for platforms that support Direct I/O, where fsync-type methods have a very negative impact on the operating system cache.

 

Filesystem Characteristics

 

In order to completely understand what you’re working with, you’ll also need to take a look at the writing characteristics of the filesystem your WAL is on.  If you’re running a journaling filesystem, and you’re also using the WAL, understand that in some ways you’re paying for data integrity twice.

 

Linux ext3

 

Current generation ext3 defaults to flushing updates to disk before committing the update to the main filesystem, what ext3 calls ordered mode.  This is overkill for the disk the WAL is on.  Assuming you’ve put the WAL on a separate disk (or at least a separate partition), you can reduce Linux’s overhead by changing /etc/fstab to mount that filesystem in writeback mode (the data=writeback option). 

 

Similarly, by default an ext3 filesystem will update the last access time attribute every time a file is touched.  This is also unneeded for the WAL.  It can be turned off as well, with the noatime option.

 

Combining these two, an optimal fstab for the WAL might look like this:

 

/dev/hda2  /var  ext3  defaults,noatime,data=writeback  1 2

 

While it’s obsolete at this point, http://developer.osdl.org/dev/doubt/diskio/result/NTT-SPLab-2004-09-29.html is an interesting table showing how changing the ext3 options can affect performance in the various sync modes discussed here.

 

Veritas VxFS Filesystems

 

Veritas’s VxFS product is one of the better journaling filesystems available.  In addition to good performance, it has enormous flexibility in terms of optimizing for performance.

 

The first setting that normally affects data integrity is the option controlling how the VxFS intent log is written.  The best documentation on this subject is in the man page for mount_vxfs.  There are three options available:  log, delaylog, and tmplog.  Normally, using delaylog or tmplog can introduce the possibility of data corruption during a crash.  But this is not the case for data written synchronously.  From the man page:

 

In all cases, VxFS is fully POSIX compliant.  The persistence guarantees for data or metadata modified by write(2), writev(2), or pwrite(2) are not affected by the logging mount option.  The effects of these system calls are guaranteed to be persistent only if the O_SYNC, O_DSYNC, VX_DSYNC, or VX_DIRECT flag, as modified by the convosync= mount option, has been specified for the file descriptor.

 

This means that in the case of PostgreSQL, using either the open_datasync or open_sync WAL methods (which translate into O_DSYNC and O_SYNC writes) will push the burden of maintaining disk integrity to the VxFS software, which guarantees that data will survive. On platforms that don’t support O_DSYNC (like Linux and BSD), you will normally default to using fdatasync for synchronizing writes.  It may be the case that with VxFS, using open_sync instead will be faster, because you’re letting the VxFS driver handle data integrity at its level rather than requiring the more complicated operating system level fdatasync transaction.

 

In fact, it may be possible to get even better PostgreSQL performance out of VxFS by using the convosync=dsync option when mounting the filesystem.  Since Postgres only requires DSYNC level integrity for the WAL, using this conversion would serve to emulate the more efficient DSYNC behavior on operating systems that don’t necessarily support it.  Because this convert option makes VxFS break POSIX compliance, the main danger here is that other applications doing synchronous writes to that disk might misbehave.  Another possibility is convosync=direct, which will bypass the buffer cache (utterly appropriate as WAL data isn’t going to be read again later); further documentation from Veritas is at http://ftp.support.veritas.com/pub/support/products/FileSystem_UNIX/vxfssept_231815.pdf

Relying on VxFS to enforce data integrity is not entirely without risk.  For example, pages 5-6 of the HP-UX Performance Cookbook ( http://h21007.www2.hp.com/dspp/files/unprotected/devresource/Docs/TechPapers/UXPerfCookBook.pdf ) describe one failure mode where relying on the intent log can introduce application integrity issues.  They recommend mounting VxFS with the options delaylog, nodatainlog in order to keep that from happening.  Here is what the VxFS File System Administrator’s Guide has to say on this issue (see http://ftp.support.veritas.com/pub/support/products/FileSystem_UNIX/ for this documentation, with http://ftp.support.veritas.com/pub/support/products/FileSystem_UNIX/269437.pdf being the Linux version):

 

Use the nodatainlog mode on systems with disks that do not support bad block revectoring.  Usually, a VxFS file system uses the intent log for synchronous writes.  The inode update and the data are both logged in the transaction, so a synchronous write only requires one disk write instead of two.  When the synchronous write returns to the application, the file system has told the applications that the data is already written.  If a disk error causes the metadata update to fail, then the file must be marked bad and the entire file is lost.

 

If a disk supports bad block revectoring, then a failure on the data update is unlikely, so logging synchronous writes should be allowed.  If the disk does not support bad block revectoring, then a failure is more likely, so the nodatainlog mode should be used.

 

A nodatainlog mode file system is approximately 50 percent slower than a standard mode VxFS file system for synchronous writes.  Other operations are not affected.

 

The “revectoring” of blocks here is alternately referred to as bad block remapping or reallocated sectors, depending on whom you’re talking to; these all mean the same thing.  Database use of VxFS should certainly consider whether the possible performance hit of mounting with nodatainlog is worth the small possibility of corruption from drive failure.  The revectoring case described here shouldn’t impact PostgreSQL, as it doesn’t care about the metadata when it needs to rebuild using the WAL.

Copyright 2007 Gregory Smith. Last update 5/15/2007.