One downside to having a specialty
you’re known for is that the people with the hardest problems in that
area tend to find you. For me this is good in that it keeps me
working, but it can easily turn out badly if the difficult isn’t
correctly distinguished from the impossible. As someone who’s done a
lot of work on tuning PostgreSQL for heavy write volumes, the
workloads I see skew closer to the impossible side some days than
I’d like. One of the areas that’s really snuck up on
me recently is just how hard it is to tune autovacuum for a database
that’s constantly written to.
The way PostgreSQL’s transaction
visibility system works, every database page you write will be
written at least one more time, to freeze its transaction IDs. And
you may get yet another write on top of that, to reclaim the free space
from deleted items. Hint bits
are another nuisance. And that’s just the average case; pages can be
overwritten many times if the deletion happens in chunks.
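If you’re curious how much of that freeze work is hanging over your own tables, a quick look at pg_class will tell you. Here’s a small Python sketch, not part of any patch discussed here, that lists the tables with the oldest unfrozen transaction IDs; the connection string and the LIMIT are just placeholders:

# Sketch: list the tables whose transaction IDs are oldest, i.e. the ones
# most likely to be rewritten soon by a freeze-oriented autovacuum.
# The connection string is a placeholder; adjust it for your environment.
import psycopg2

QUERY = """
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age,
       pg_size_pretty(pg_table_size(c.oid)) AS table_size
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10
"""

conn = psycopg2.connect("dbname=pgbench")
try:
    cur = conn.cursor()
    cur.execute(QUERY)
    for table_name, xid_age, freeze_max_age, table_size in cur.fetchall():
        pct = 100.0 * xid_age / freeze_max_age
        print(f"{table_name}: {table_size}, xid age {xid_age} "
              f"({pct:.0f}% of autovacuum_freeze_max_age)")
finally:
    conn.close()

Tables whose age is creeping toward autovacuum_freeze_max_age are the ones due for the extra round of writes described above.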
The big problem happens if you build a
system that can just barely keep up with the incoming write volume at its
beginning. That system will then be crushed by its workload once the
overhead of this background maintenance kicks in. Here’s how it will
happen: early performance tests say the server meets specifications.
It survives initial rollout. But after a few months go by, the
archive of data gets bigger, and next thing you know autovacuum needs
to run 25 hours a day to keep up. (Hint: this doesn’t work. That
consulting advice you get for free, as in beer, which you’ll likely
end up drinking more of once this happens.)
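To put rough numbers on that trap, here’s a back-of-the-envelope check in Python. Every rate in it is invented for illustration; only the background write figure is in the same ballpark as the one I measure below:

# Back-of-the-envelope headroom check, with invented numbers.  A server that
# looks comfortable on day one can end up over budget once the background
# maintenance writes show up.
app_writes_mb_s    = 3.0  # what the application itself writes (made up)
disk_budget_mb_s   = 5.0  # sustained write bandwidth you can spare (made up)
vacuum_writes_mb_s = 2.4  # background rewrite rate, similar to the one measured below

day_one = app_writes_mb_s
steady_state = app_writes_mb_s + vacuum_writes_mb_s

print(f"day one:      {day_one:.1f} of {disk_budget_mb_s:.1f} MB/s")
print(f"steady state: {steady_state:.1f} of {disk_budget_mb_s:.1f} MB/s"
      f" -> {'fine' if steady_state <= disk_budget_mb_s else 'over budget'}")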
What doesn’t help matters is how arcane
the parameters for tuning autovacuum are. Yesterday I submitted a
patch that adds the first readout I’ve been able to put together to
help with this problem. Developed with my new co-worker Noah Misch,
the patch itself is pretty simple. It takes the information autovacuum
makes its decisions about and exposes it, in real time, via the
process’s command line. Just as you can track the progress of things
like the archiver this way, sampling this data turns out to be very useful
for predicting how long things are going to take to run. There’s
also a report in the log file at the end. It looks like this:
LOG:  automatic vacuum of table "pgbench.public.pgbench_accounts": index scans: 1
    pages: 0 removed, 819673 remain
    tuples: 19999999 removed, 30000022 remain
    buffer usage: 809537 hits, 749340 misses, 686660 dirtied
    system usage: CPU 5.70s/19.73u sec elapsed 2211.60 sec
The “buffer usage” line is the new
one here. It’s in units of buffer pages, which are 8192 byte
chunks of memory. So seeing 686660 of them dirtied means this
autovacuum wrote 8192 * 686660 bytes = 5364MB in 2212 seconds. That
makes for an average write rate of 2.43MB/s. If you’ve ever tried
to tweak autovacuum before, you’ll know that guessing the rate at
which it’s going to write to disk is the hard part. This doesn’t
solve that problem completely, but it does let you trivially figure
out something that is enormously useful: how fast things are writing
given the current parameter set. And even that used to be really
painful to figure out.
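Since I’d rather not do that multiplication by hand every time, here’s a minimal Python sketch that computes the same rate from a log entry formatted like the one above. The regular expressions assume exactly that wording:

# Compute the average dirty-write rate from the "buffer usage" and
# "system usage" lines of an autovacuum log entry like the one shown above.
import re

log_entry = """
    buffer usage: 809537 hits, 749340 misses, 686660 dirtied
    system usage: CPU 5.70s/19.73u sec elapsed 2211.60 sec
"""

BLOCK_SIZE = 8192  # bytes per buffer page

dirtied = int(re.search(r"(\d+) dirtied", log_entry).group(1))
elapsed = float(re.search(r"elapsed ([\d.]+) sec", log_entry).group(1))

mb_written = dirtied * BLOCK_SIZE / (1024 * 1024)
print(f"{mb_written:.0f} MB dirtied in {elapsed:.0f} seconds"
      f" = {mb_written / elapsed:.2f} MB/s average write rate")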
The patch is really small and could
easily be applied with minimal risk to a production server. I expect
it to be the next thing I end up backporting heavily onto troublesome customer systems. Next time I’ll write a
bit more about how the information gathered by this patch has given
me new insight into vacuum tuning, including some observations that can help
you even if you’re not running a server with it installed.