At Pixlee, we're big fans of proven, scalable technology. Serving millions of photos a day at millisecond speed requires an amalgamation of superb caching, scalable architecture, and an efficiently normalized data store. We depend on the JVM for our API, and on SQL RDBMS for primary data storage.
Data storage needs to be reliable, and for the last two years or so we've been well taken care of by Amazon's excellent RDS Postgres offering. We run an m3.2xlarge instance, backed by Provisioned IOPS for steady I/O with daily snapshotting.
Generally, this has run very smoothly despite constantly changing and rapidly growing traffic patterns from clients running large campaigns and including our widgets on more of their online assets.
However, a few weeks ago we ran into a strange behavior. We noticed that every 6-12 hours, cloudwatch graphs would indicate that the write throughput would spike up precipitously, causing a dramatic increase in queue depth. The graphs looked something like this:
Basically, all signs indicative of a massive surge in database read/write. There was only one problem: given what we knew about our code, there was no reason to believe that our I/O was not constant throughout the parts of the day in which the spikes occurred.
So what could be going on?
First, we needed to eliminate the possibility that it was our queries themselves that were causing the spike. We double checked the times that our more expensive scheduled database jobs ran to ensure that none overlapped the spike times - none did. We also checked for expensive, long running queries via pg_stat_statements:
(total_time / 1000 / 60) as total_minutes,
(total_time/calls) as average_time,
ORDER BY 2 DESC
and cross-checked them against where we knew they were being run - none explained the spike (more information on effectively using pg_stat_statements: http://www.craigkerstiens.com/2013/01/10/more-on-postgres-performance/).
We also used RDS's performance logging during spike times (setting log_min_duration_statement was helpful here) to determine whether any strange queries were being run. We found that the queries that seemed to cause the spikes were everyday insert and update queries - all on one table. The same queries that seemed to run normally for many hours, at the same frequency, would suddenly cause a spike.
So what next?
There were a few things we knew about that that can cause a buildup of writes within the underlying Postgres architecture. Poor checkpointing, transaction id wraparounds, and autovacuums.
Bad checkpointing can cause problems since at each checkpoint, all data files will get flushed to disk. If you let checkpoints build up, that disk flush can get very pricey.
Luckily, there's a simple way to test whether your checkpoint parameters are forcing too much disk flush: simply run the CHECKPOINT command, which forces a checkpoint immediately. We did so at about the time we expected a spike to happen, after a lull of around 5 hours. While the queue depth did spike to about 4, it wasn't nearly enough to explain the problem.
Transaction id wraparound is a phenomenon unique to Postgres. Basically, every transaction (generally INSERT and UPDATEs, with some occasionally hefty SELECTs) gets assigned a 4 byte integer. This ID determines whether concurrent users can see the changes caused by the transaction. Since they are a 4-byte integer, after about 2.1 billion transactions, they'll need to wrap around, or run out of space to record. Wrapping around in this case means stamping transactions "frozen", meaning that all users can see it. It also means a complete freeze of database write operations until a full vacuum finishes. That's bad. Very bad.
Luckily, regular vacuuming takes care of this scenario, since vacuuming will freeze appropriate transaction ids. In order to check that our freeze parameters were set properly, we followed Josh Berkus' guide here (http://www.databasesoup.com/2012/09/freezing-your-tuples-off-part-1.html) and confirmed that it wasn't a problem for us.
The last issue that can cause inadvertent heavy writes is autovacuuming. Autovacuum is effectively a daemon process that can't be turned off, and autovacuums all of your tables after it passes a certain threshold of write changes. If a big table hits an autovacuum at a peak usage time, you're in for a world of hurt. Luckily, while you can't shut down (and shouldn't shut down) autovacuum, you can tell it to ignore tables that you'd rather vacuum analyze yourself, in less peak time. We did that, but unfortunately, the spikes kept happening.
A lot of cursing and hair tearing ensued, but unfortunately, the spikes still kept coming. At this point, we had to attempt more drastic measures.
We knew that based on the pg_stat_activity during the spike times, it seemed that only one table was causing the spikes at any given time. Other tables could still be written to during that interval.
This led us to the conjecture of indexes. Perhaps an abundance of indexes (we had nearly 28GB worth over 18 indexes on that table alone) could be causing some sort of periodic index structure bloat under the hood that we weren't aware of.
Since we depended on most of the indexes to make our daily operations performant, we booted up a snapshot database to work with, and created a script that would simulate the regular amount of write traffic that would occur into this database. We turbocharged this script to write at a heavier rate, which we found made the spikes occur more frequently.
Then, it was simply a process of binary search and elimination of indexes. We killed half of them, ran the script, waited for a spike, killed another half, and so on.
Amazingly, after killing nearly all of the indexes, we found that the spikes disappeared!
We wanted to believe that it wasn't the sheer number of indexes that were causing the problem, since some of our other tables had similar setups but no spike. So we rebuilt a few indexes until we narrowed it down to - luckily - just one.
CREATE INDEX t1_title_fts_gin ON t1 USING gin (to_tsvector('simple'::regconfig, title))
This was an index that improved search performance on photo captions, which could be quite long per row (http://rachbelaid.com/postgres-full-text-search-is-good-enough/). As it turns out, our constant writing and updating to the table caused a considerable amount of disk write operations to this index. For a reason that we're still unsure of, the gin index was forced to write heavily to disk in periodic intervals. What we do know is that this doesn't affect all gin indexes, as we employ similar indexes on slightly smaller, non-spiking tables with text rows of fixed size.
EDIT: reddit user u/lluad pointed out the probable cause: gin fast updates http://www.postgresql.org/docs/9.4/static/gin-implementation.html
Gin indexes don't update very efficiently since the nature of inverted indexes cause them to potentially insert a lot of rows into the index for one change. Postgres gets around this by attempting to buffer this work up to a certain point, then bulk inserting them all at once. The bulk insertion doesn't really get logged anywhere, which was what made this such a difficult activity to trace.
Luckily, we run an Elasticsearch layer on top of our data (an ETL script actively listens to Postgres triggers on row updates and builds changes in ES, so we have a denormalized copy of the data. We may go in more depth in a later blog post). We simply swapped out the search logic from Postgres to Elasticsearch, deleted the index, and went merrily on our way.
Throughout this journey, we were able to pick up lots of knowledge about the Postgres architecture that we may not have discovered otherwise. We were able to opportunistically tune our database further with the lessons that we had picked up along the way. For more interesting problems like this, join our team!