Do you ever have one of those bugs that, when it is first discovered, appears simple to fix, but in the end forces you to accept that any proper fix will require Serious Engineering™ and in the process breaks your mind a bit? I had one of those this week.
At work, part of the platform is built using the Event Sourcing design pattern. We model changes as a sequence of events, and then have multiple subscribers consuming these events. The subscribers process the events to build up aggregated statistics and various denormalized representations of domain objects which are fast to query. We are using a framework that supports this design pattern, but it has the significant limitation that all event processing happens synchronously. This has tradeoffs in terms of performance and deployment. To mitigate these tradeoffs and allow us to quickly and easily add or remove subscribers, we extended the framework to support asynchronous subscribers.
The core of our asynchronous subscriber mechanism is pretty simple. There is an events table with a sequence-backed primary key that is being written to by a bunch of processes. Some other processes read events out of this table as it is being written to. Each reader process looks something like this:
```ruby
while true
  sql = "SELECT * FROM events WHERE id > $1 ORDER BY id"
  events = fetch_records(sql, last_seen)
  for event in events
    process(event)
    last_seen = event.id
  end
end
```
If you spot the problem already, that’s great, but keep it to yourself and don’t spoil the surprise for the others.
This code was running in production for a few weeks when we started getting many errors related to event processing. At first it appeared as though one of the processors was making invalid assumptions about some event state, but it turned out the problem was worse than that. We were trying to apply an event to an invalid state because some previous events had not been processed. How could this be? Enter Multiversion Concurrency Control (MVCC1).
Based on our logs, the events we processed looked like this:
..., 185, 186, 189, ...
However when we later queried the events table, we saw that events 187 and 188 were present…
The missing events hadn’t been picked up by the processor and dropped on the floor, and there was no processing error that silently failed and went unretried. The missing events were temporarily invisible because the transaction that wrote them had not yet committed at the time we fetched the events just after them in the sequence. When we queried later, both transactions had committed, so the missing events were now visible. The event processor skipped them entirely because by the time they became visible, the code had already “seen” up to 189 and moved on past them.
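The failure mode can be reproduced without a database at all. Here is a toy simulation in plain Ruby (the ids and the visibility snapshots are hypothetical, standing in for what each poll of the events table would return):

```ruby
# Simulate a reader polling an events table while a slow transaction
# (holding ids 187 and 188) has not yet committed.
# Each "snapshot" is the set of committed rows visible at poll time.
snapshots = [
  [185, 186, 189],           # first poll: 187/188 still uncommitted
  [185, 186, 187, 188, 189], # second poll: everything committed
]

processed = []
last_seen = 184

snapshots.each do |visible|
  events = visible.select { |id| id > last_seen }.sort
  events.each do |id|
    processed << id
    last_seen = id
  end
end

# 187 and 188 are skipped forever: by the time they become visible,
# last_seen has already advanced past them.
puts processed.inspect  # => [185, 186, 189]
```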
This is how MVCC works. Even though the transaction that created events 187 and 188 started before the transaction that created 189, it committed afterwards. You can actually see evidence of this if you query xmin2 along with the ids:
```
psql=> SELECT id, xmin FROM events WHERE id BETWEEN 185 AND 189;
 id  | xmin
-----+-------
 185 | 10673
 186 | 10676
 187 | 10679
 188 | 10679
 189 | 10680
```
Okay, “no problem!”, we think. The code just needs to account for this out-of-order commit situation. We’ll just watch out for it by verifying that the event we are about to process is the next one in the sequence like so:
```ruby
while true
  sql = "SELECT * FROM events WHERE id > $1 ORDER BY id"
  events = fetch_records(sql, last_seen)
  for event in events
    if event.id != last_seen + 1
      break
    end
    process(event)
    last_seen = event.id
  end
end
```
Not so fast. The problem is that sometimes event last_seen + 1 will never show up, because there are gaps in the sequence that will never be filled.3 This is the case even though records in this particular table are only ever created; there are never any updates or deletes. Why? Sequences in Postgres (and sequential primary key generation mechanisms in other RDBMSs) only guarantee that the numbers they hand out are always increasing, not that the set of values they produce is free of gaps.
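To see why the guard is fatal when a gap is permanent, consider a toy run in plain Ruby (ids hypothetical), where id 190 was consumed by a rolled-back insert and will never exist:

```ruby
# ids actually committed; 190 was handed out by the sequence to a
# transaction that rolled back, so it will never appear.
committed = [189, 191, 192]

processed = []
last_seen = 188

# One polling pass with the "next id must be last_seen + 1" guard.
events = committed.select { |id| id > last_seen }.sort
events.each do |id|
  break if id != last_seen + 1  # guard against out-of-order commits
  processed << id
  last_seen = id
end

puts processed.inspect  # => [189]
# Every future poll stops in the same place: the reader waits
# forever for id 190, which will never be written.
```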
Why not implement some mechanism on top of event creation that generates a strictly sequential series of event ids? There are a few ways to do this, but all of the simple approaches use pessimistic locks in one form or another. Using pessimistic locks for this purpose destroys concurrency, drags down performance, and introduces the possibility of deadlocks.
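As a sketch of why the pessimistic approach serializes everything, here is a gap-free id allocator in plain Ruby, with an in-process Mutex standing in for a row lock (e.g. SELECT ... FOR UPDATE on a counter table); the class and method names are hypothetical:

```ruby
# A gap-free id allocator: every writer must take the same lock,
# and must hold it for the entire insert, or a rollback would
# reintroduce gaps. Concurrent writers are therefore fully serialized.
class GapFreeAllocator
  def initialize
    @mutex = Mutex.new
    @next_id = 0
  end

  def insert_event
    @mutex.synchronize do
      @next_id += 1
      # ... write the event row here, still holding the lock ...
      @next_id
    end
  end
end

allocator = GapFreeAllocator.new
threads = 4.times.map do
  Thread.new { Array.new(5) { allocator.insert_event } }
end
ids = threads.flat_map(&:value).sort

# Serialized allocation means no gaps: the ids are exactly 1..20,
# but the writers spent their time waiting on one another.
puts ids.inspect
```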
There are optimistic locking strategies you can use as well.4 However, they are more involved, and they work best when your implementation language has good concurrency primitives.
Despite the looming feeling that we are grasping at straws, we vaguely recall past instances where esoteric Postgres features and nontrivial compromises to our data abstraction layer have saved us.
After poring over countless Postgres mailing list posts5 we learn about the commit timestamp feature that was introduced in 9.5. When the track_commit_timestamp setting is enabled, Postgres records the commit time of every transaction and makes it available via the function pg_xact_commit_timestamp(xmin). This is fantastic! All we have to do is order events by pg_xact_commit_timestamp(xmin) instead of by id. Not so fast: there is no index on this value, and we are sorting on it in a table with tens (and eventually hundreds) of millions of records.
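The effect of ordering by commit time rather than by id can be illustrated with another toy Ruby snippet (the timestamps are hypothetical stand-ins for what pg_xact_commit_timestamp(xmin) would return):

```ruby
# Each event carries its insertion id and the commit time of the
# transaction that wrote it.
Event = Struct.new(:id, :committed_at)

events = [
  Event.new(187, 103), # started early, committed late
  Event.new(188, 103),
  Event.new(189, 102), # started late, committed early
]

# ORDER BY id yields 187, 188, 189 -- the ordering that burned us.
by_id = events.sort_by(&:id).map(&:id)

# Ordering by commit timestamp (id as tiebreaker) matches the order
# in which events can actually become visible to readers.
by_commit = events.sort_by { |e| [e.committed_at, e.id] }.map(&:id)

puts by_id.inspect      # => [187, 188, 189]
puts by_commit.inspect  # => [189, 187, 188]
```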
At this point we6 have found many simple solutions that don’t work, learned a lot about Postgres internals, and are still stuck with the same problem.
While I would like to end the post by revealing the simple solution that does work, the unfortunate reality is that we had to introduce significant complexity in order to address this problem.
On the bright side, the solution works really well and enabled us to learn about even more things that make Postgres great.
We’ll talk about it in our next post.
NOTE: crossposted at https://tech.precisionnutrition.com
Multiversion Concurrency Control is a mechanism that many relational database systems use to enhance performance while preventing clients from ever seeing data in an inconsistent state. This is accomplished by keeping multiple snapshots of data around. Whenever a client queries, they will see some copy of the data that provides a consistent view. This view will exclude pending transactions that have yet to be committed, unless you specify otherwise by locking for read. ↩
the system column that stores the id of the transaction that created the row ↩
in this particular table, about 0.6% of primary keys are missing. The number will be higher for tables where there is a lot of contention on record creation. ↩
and if we are being honest, StackOverflow ↩