Tuning PostgreSQL WAL Synchronization
This is an introduction to
the low-level implementation of the write logging used in PostgreSQL 8.2. The database engine assures data integrity
by a process that involves writing out data to a Write Ahead Log (WAL), then
writing to the main database. Because
there are so many operating systems PostgreSQL runs on, the exact way the WAL
is handled varies considerably from platform to platform. Accurately tuning how writes to the WAL are
done without impacting the data integrity requirements of the database requires
understanding quite a bit of low-level implementation details; that’s what’s
covered here in excruciating detail.
What’s a WAL?
Much of this won’t make sense
unless you’re already familiar with the terminology of the Write-Ahead
Log. There are two sections of the
PostgreSQL documentation that are prerequisites here. I would recommend reading the sections from the latest
documentation, currently located at http://www.postgresql.org/docs/8.2/static/ First read chapter 27, “Reliability and the
Write-Ahead Log”, then skim section 17.5 “Write Ahead Log” to see what the
parameters can be adjusted.
The section we’re focusing on here is 17.5.1,
specifically wal_sync_method (yes, this entire document is about one
parameter). PostgreSQL lets you specify
which methods it should use when writing to a WAL file via the configuration wal_sync_method. From the
documentation:
·
open_datasync (write WAL files with open() option O_DSYNC)
·
fdatasync (call fdatasync() at each commit)
·
fsync_writethrough (call fsync()
at each commit, forcing write-through of any disk write cache)
·
fsync (call fsync()
at each commit)
·
open_sync (write WAL files with open() option O_SYNC)
Not
all of these choices are available on all platforms. The default is the first
method in the above list that is supported by the platform.
This is certainly cryptic if don’t know
something about UNIX file mechanics. To
get started, you can determine which fsync method is currently being used by
your database (helpful when you’re letting the server pick automatically)
inside psql with:
show wal_sync_method;
Here’s a summary of which
options are available on various popular platforms. Look at the line for your platform and read from left to right;
the first one that says Yes or Direct will be your default:
|
Platform |
open_datasync |
fdatasync |
fsync_writethrough |
fsync |
open_sync |
|
Linux |
No |
Yes |
No |
Yes |
Direct |
|
Mac OS X/Darwin |
No |
No |
Yes |
Yes |
Yes (Direct?) |
|
Solaris |
Yes |
Yes |
No |
Yes |
Yes |
|
Windows |
Yes (Direct?) |
No |
Yes |
Yes |
Yes |
|
BSD/OS 4.3 |
No |
No |
No |
Yes |
Yes (Direct?) |
|
FreeBSD 4.9 |
No |
No |
No |
Yes |
Yes (Direct?) |
|
|
|
|
|
|
|
There seems to be a problem with O_DIRECT
not working right on Solaris. The above table
will all make more sense
after each method is explained. But
first we need a diversion into hard disk technology and UNIX file writing
techniques.
Write Caches and Disk Arrays
To understand why this
decision is complicated at all, you need to know something about how hard
drives deal with disk writes. Disk
performance at all levels is greatly affected by disk caching. At the lowest levels, hard drives have their
own internal cache that’s used for reads and writes. PostgreSQL expects that the disks connected to it can be trusted
to correctly report when data as been written to them. Inexpensive IDE and SATA drives are widely
criticized for PostgreSQL use because they often “lie”, reporting that data has
been written when in fact it’s just been put into the disk cache. A good introduction to this subject is at http://www.jasonbrome.com/blog/archives/2004/04/03/writecache_enabled.html
and comments on how this interacts with Postgres are at http://www.thescripts.com/forum/threadnav173635-2-10.html
(it’s also a regular topic of discussion on the PostgreSQL performance mailing
list at http://archives.postgresql.org/pgsql-performance/
).
So the first rule here is
that if your hardware isn’t actually writing when it says it is, your
performance numbers will be inflated--it's extra performance at the expense of
reliability, and eventually you’ll get database corruption. SCSI disks are considered much more accurate
in terms of actually having written data when they say they have. This is one of the reasons the perceived
performance difference between IDE and SCSI is perceived as narrowing recently;
write caching is artificially inflating the results so that the better SCSI
hardware doesn’t look as impressive as it is.
Unfortunately, the IDE drives are so dependant on this feature that
turning it off gives unreasonably low results; see section 11.12.1.5 at http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/configtuning-disk.html
for comments on how FreeBSD has been challenged by this issue.
If you’re using any disk
array or intelligent controller card, again you’re back to needing to make sure
that disk writes are being completed when they say they are. Better controllers and arrays have their own
cache, and in those cases you normally need to make sure it’s configured in
write-through mode for proper Postgres integrity.
A reference good
implementation here for Linux are the LSI MegaRAID SCSI cards, which have a
write-caching policy that’s adjustable in the BIOS.
One approach regularly
debated on the PostgreSQL performance list uses a card or array that has a
battery-backed disk cache on it. That
allows you to get the instant fsync() response that allows extremely high
database disk writes, while still ensuring that data that has been reported
written will properly make it to the disks eventually even if there’s a system
failure. Critics of the battery backup
approach suggest that if you run such a system under load, eventually you’ll
have a failure in this relatively complicated cache method that will corrupt
your database in a way that’s nearly impossible to detect or recover from. Fans of a battery-backed cache suggest they
can’t achieve their target workloads without it, and that a good backup
strategy is the cure for an event that’s deemed as extremely rare as this
possibility.
WAL writing
In order to completely tune
how your system performs, it’s necessary to learn a bit more than you probably
wanted to know about how UNIX does file I/O operations.
The PostgreSQL Write-Ahead
Log (WAL) does only a few basic operations to the logs it works on:
1) Open a log file
2) Write a series of records to the log
3) Close the log file
4) Switch to or create a new log file
5) Repeat
Normally, when you ask to
write a disk block to a drive, your operating system caches that information
and writes it when it gets a chance.
This write-caching greatly improves performance for normal
applications. In order for the WAL
mechanism to work correctly, Postgres needs to know when the data written has
actually been committed to the disk drive.
Synchronous Writes
Situations where the
application can’t continue until the data has been completely written out are
referred to by UNIX as synchronous writes.
When you open a UNIX file using the POSIX standard calls (see http://www.opengroup.org/onlinepubs/007908799/xsh/open.html
) you can ask that all writes to that file be done in one of two synchronous
modes: O_DSYNC and O_SYNC. The official spec doesn’t tell you a lot
about the difference between these two.
A better discussion comes from IBM’s discussion of how these settings
impact journaling filesystems at http://publib16.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixprggd/genprogc/fileio.htm#wq222
From that document:
O_DSYNC: When a file is opened using the O_DSYNC open
mode, the write () system call will not return until the file data and all file
system meta-data required to retrieve the file data are both written to their
permanent storage locations.
O_SYNC:
In addition to items specified by O_DSYNC, O_SYNC specifies that the write ()
system call will not return until all file attributes relative to the I/O are
written to their permanent storage locations, even if the attributes are not
required to retrieve the file data.
The main thing you need to
walk away from this description with is that O_DSYNC can be a much smaller
operation than O_SYNC is. O_SYNC
requires all file-system meta-data to be completed, while D_SYNC just requires
the writing of your data be completed.
Since database systems like PostgreSQL have their own method for dealing
with metadata within the WAL, it really doesn’t need to wait for all the
file-system information to be done; it just needs the data written.
The other way to cope with a
synchronized writing situation is to open the file normally, write your data,
and then ask the operating system to flush that data. There are two ways to do this under UNIX:
fdatasync: http://www.opengroup.org/onlinepubs/007908799/xsh/fdatasync.html
fsync: http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html
These two operate similarly
to the above, with fdatasync being like a write done with O_DSYNC, while fsync
is like a write with O_SYNC.
For a description of what’s
going on at the programmer level, recommended reading is in the second edition
of “Advanced Programming in the UNIX Environment” by Stevens/Rago. Chapter 3, “File I/O”, covers all these
concepts, and section 3.14 (“fcntl Function”) even includes a simple benchmark
showing difference in speed among these methods for Linux and Mac OS X.
Direct
I/O
Since we can’t use the operating disk cache
usefully, why not bypass it altogether?
As of PostgreSQL 8.1, writing with O_DSYNC/O_SYNC will also use O_DIRECT
when available, which keeps WAL writes from using the operating system’s buffer
cache. Since the WAL is never read,
this can be a big benefit, especially for write-heavy loads. More information about O_DIRECT in a
database context is at http://www.mgogala.com/directio.pdf
and the actual details of O_DIRECT writes are at http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm
fsync_writethrough
We’ve now covered four of the methods
available for synchronizing the WAL.
Just what is fsync_writethough then?
Any time there’s a caching write situation in
the actual drive, you can’t necessarily assume fsync will actually make sure
data makes it to the disk. Some
operating systems recognize this, and have a way to call fsync in a way that
forces the drive’s cache to flush.
Right now, fsync_writethrough
works on two platforms:
On other platforms, fsync_writethrough
returns an error (unless fsync is turned off altogether).
The default method for
writing under Windows is open_datasync, which assumes write caching is not
occurring on your drives; if the drives cache, you don’t have proper data
integrity. If you want to keep caching
on, but have PostgreSQL work as designed, switching to fsync_writethrough is
one option. http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php
has additional notes on this topic.
Back to the beginning
Repeating where we started
from, PostgreSQL lets you specify which of these methods it should use when
writing to a file via the configuration wal_sync_method. Repeating the documentation
again, which should make a lot more sense now:
·
open_datasync (write WAL files with open() option O_DSYNC)
·
fdatasync (call fdatasync() at each commit)
·
fsync_writethrough (call fsync()
at each commit, forcing write-through of any disk write cache)
·
fsync (call fsync()
at each commit)
·
open_sync (write WAL files with open() option O_SYNC)
Not
all of these choices are available on all platforms. The default is the first
method in the above list that is supported by the platform.
The developers have put the possible write
methods in order by how efficient they usually are. The database server prefers that it write with O_DSYNC, but it
will use various fsync-type calls if it’s not available. Should all other methods not be available,
it falls back to writing with O_SYNC.
These defaults are good ones,
and most setups will never need to touch this parameter (with the Windows case
covered in the fsync_writethrough section being one glaring exception). The problem comes when you need to optimize
use of PostgreSQL on hardware that does some of this work for you, like when
using a storage array that actually has reliable disk caching. A basic introduction to fsync concepts from
a performance perspective is at http://developer.osdl.org/dev/doubt/diskio/documents/2005-10-19-visualizing_diskio.pdf.
test_fsync
If you want to learn exactly what the
capabilities of your platform are, the first thing you should do is compile and
test out the test_fsync tool. It comes
with the PostgreSQL source code, found under the src/tools/fsync
directory. You use it like this:
cd postgres/src/tools/fsync
make
./test_fsync -f /databasedirectory/fsync_test.out
If you get results that are
all basically 0, that probably means that you don’t really have correct
permissions to do all operations necessary to run this tool in that
directory.
Results from this tool are
measured in milliseconds. Here is a
sample run:
Simple write timing:
write 0.009100
Compare fsync times on write() and non-write()
descriptor:
(If the times are similar, fsync() can sync data
written
on a
different descriptor.)
write, fsync, close 4.090127
write, close, fsync 4.092272
Compare one o_sync write to two:
one
16k o_sync write 4.204583
two
8k o_sync writes 8.308203
Compare file sync methods with one 8k write:
(o_dsync unavailable)
write, fdatasync 4.295135
write, fsync, 4.325998
Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open
o_sync, write 8.301195
write, fdatasync 4.293409
write, fsync, 4.280166
This from a Linux system, a
platform that doesn’t have O_DSYNC. You
can compare these results against the theoretical maximum performance based on
the RPM of this disk drive, which is a 15K rpm drive:
|
RPM |
Rotations/sec |
|
Rotation
(ms) |
|
5400 |
90.0 |
|
11.1 |
|
7200 |
120.0 |
|
8.3 |
|
10000 |
166.7 |
|
6.0 |
|
15000 |
250.0 |
|
4.0 |
At around 4.3ms to write and
sync, we’re pretty close to the maximum speed the drive is capable of
(4ms). If the results were better than
predicted by an analysis based on RPM, you would have reason to believe there
is some write caching at work.
Write-caching that PostgreSQL doesn’t know about it deadly to its
ability to ensure data integrity for your database.
So all you need to do here is
pick the method with the fastest speed, and you’re done, right? Not necessarily. This is a test that writes a small amount of data. In the real world, when writing an enormous
amount of data, you may discover that the implementation quirks of your
hardware or the filesystem your database is stored on strongly prefers one mode
over the others, regardless of what the low-level test suggests. The overhead of using the fsync and
fdatasync calls gets larger as the amount of data you’re passing through a
filesystem goes up, and it’s possible for writing in O_SYNC mode to outperform
it under load—even though this simple test suggests O_SYNC is the slowest
method to use. This is particularly
true for platforms that support Direct I/O, where fsync-type methods have a
very negative impact on the operating system cache.
Filesystem Characteristics
In order to completely
understand what you’re working with, you’ll also need to take a look at the
writing characteristics of the filesystem your WAL is on. If you’re running a journaling filesystem,
and you’re also using the WAL, understand that in some ways you’re paying for data
integrity twice.
Linux ext3
Current generation ext3
defaults to flushing updates to disk before committing the update to the main
filesystem, what ext3 calls ordered mode.
This is overkill for the disk the WAL is on. Assuming you’ve put the WAL on a separate disk (or at least
partition), you can reduce Linux’s overhead by changing /etc/fstab to mount the
filesystem in writeback mode.
Similarly, by default ext3
filesystem will update the last access time attribute every time a file is
written to. This is also unneeded for
the WAL. That can be turned off as
well.
Combining these two, an
optimal fstab for the WAL might look like this:
/dev/hda2 /var ext3 defaults,writeback,noatime 1 2
While it’s obsolete at this
point, http://developer.osdl.org/dev/doubt/diskio/result/NTT-SPLab-2004-09-29.html
is an interesting table showing how changing the ext3 options can affect
performance in the various sync modes discussed here.
Veritas VxFS Filesystems
Veritas’s VxFS product is one
of the better journaling filesystems available. In addition to good performance, it has enormous flexibility in
terms of optimizing for performance.
The first setting that
normally affects data integrity is the option for writing the VxFS intent
log. The best documentation on this
subject is in the man page for mount_vxfs.
There are three options available:
log, delaylog, and templog.
Normally, using delaylog or templog can introduce the possibility of
data corruption during a crash. But
this is not the case for data written synchronously. From the man page:
In all
cases, VxFS is fully POSIX compliant.
The persistence guarantees for data or metadata modified by write(2),
writev(2), or pwrite(2) are not affected by the logging mount option. The effects of these system calls are
guaranteed to be persistent only if the O_SYNC, O_DSYNC, VX_DSYNC, or VX_DIRECT
flag, as modified by the convosync= mount option, has been specified for the
file descriptor.
This means that in the case
of PostgreSQL, using either the
open_datasync or
open_sync WAL methods (which translate into O_DSYNC and O_SYNC writes) will
push the burden of maintaining disk integrity to the VxFS software, which
guarantees that data will survive. On
platforms that don’t support O_DSYNC (like Linux and BSD), you will normally
default to using fdatasync for synchronizing writes. It may be the case that with VxFS, using open_sync instead will
be faster, because you’re letting the VxFS driver handle data integrity at its
level rather than requiring the more complicated operating system level
fdatasync transaction.
In fact, it may be possible to get even better PostgreSQL
performance out of VxFS by using the convosync=dsync option when mounting the
filesystem. Since Postgres only
requires DSYNC level integrity for the WAL, using this conversion would serve
to emulate the more efficient DSYNC behavior on operating systems that don’t
necessarily support it. Because this
convert option makes VxFS break POSIX compliance, the main danger here is that
other applications doing synchronous writes to that disk might misbehave. Another possibility is convosync=direct,
which will bypass the buffer cache (utterly appropriate as WAL data isn’t going
to be read again later); further documentation from Veritas is at
http://ftp.support.veritas.com/pub/support/products/FileSystem_UNIX/vxfssept_231815.pdf
Relying on VxFS to enforce data integrity is not entirely
without risk. For example, pages 5-6 of
the HP-UX Performance Cookbook ( http://h21007.www2.hp.com/dspp/files/unprotected/devresource/Docs/TechPapers/UXPerfCookBook.pdf
) describes one failure mode where relying on the intent log can introduce
application integrity issues. They
recommend mounting VxFS with options delaylog, nodatainlog in order to keep
from happening. Here is what the VxFS File System Administrator’s
Guide has to say on this issue (see http://ftp.support.veritas.com/pub/support/products/FileSystem_UNIX/
for this documentation, with http://ftp.support.veritas.com/pub/support/products/FileSystem_UNIX/269437.pdf
being the Linux version):
Use the
nodatainlog mode on systems with disks that do not support bad block
revectoring. Usually, a VxFS file
system uses the intent log for synchronous writes. The inode update and the data are both logged in the transaction,
so a synchronous write only requires one disk write instead of two. When the synchronous write returns to the
application, the file system has told the applications that the data is already
written. If a disk error causes the
metadata update to fail, then the file must be marked bad and the entire file
is lost.
If a
disk supports bad block revectoring, then a failure on the data update is
unlikely, so logging synchronous writes should be allowed. If the disk does not support bad block
revectoring, then a failure is more likely, so the nodatainlog mode should be
used.
A
nodatainlog mode file system is approximately 50 percent slower than a standard
mode VxFS file system for synchronous writes.
Other operations are not effected.
The “revectoring” of blocks
here are alternately referred to as bad block remapping or reallocated sectors,
depending on who you’re talking to; those all mean the same thing.
Database use of VxFS
should certainly consider whether the possible performance hit of mounting with
nodatainlog is worth the small possibility of corruption from drive
failure. The revectoring case described
here shouldn’t impact PostgreSQL, as it doesn’t care about the metadata when it
needs to rebuild using the WAL.