6 March 2020

Fun with Multiversion Concurrency Control

Do you ever have one of those bugs that, when it is first discovered, appears simple to fix, but in the end forces you to accept that any proper fix will require Serious Engineering™ and in the process breaks your mind a bit? I had one of those this week.

The setup

At work, part of the platform is built using the Event Sourcing design pattern. We model changes as a sequence of events, and then have multiple subscribers consuming these events. The subscribers process the events to build up aggregated statistics and various denormalized representations of domain objects which are fast to query. We are using a framework that supports this design pattern, but it has the significant limitation that all event processing happens synchronously. This has tradeoffs in terms of performance and deployment. To mitigate these tradeoffs and allow us to quickly and easily add or remove subscribers, we extended the framework to support asynchronous subscribers.

The core of our asyncronous subscriber mechanism is pretty simple. There is an events table with a sequence primary key that is getting written to by a bunch of processes. There are some other processes that are reading events out of this table as it is being written to. Each reader process looks something like this:

while true
  sql = "SELECT * FROM events WHERE id > $1 ORDER BY ID"
  events = fetch_records(sql, last_seen)

  for event in events
    process(event)
    last_seen = event.id
  end
end

If you spot the problem already, that’s great, but keep it to yourself and don’t spoil the surprise for the others.

The bug

This code was running in production for a few weeks when we started getting many errors related to event processing. At first it appeared as though one of the processors was making invalid assumptions about some event state, but it turned out the problem was worse than that. We were trying to apply an event to an invalid state because some previous events had not been processed. How could this be? Enter Multiversion Concurrency Control (MVCC¹).

MVCC breaks our simple loop

Based on our logs, the events we processed looked like this:

..., 185, 186, 189, ...

However when we later queried the events table, we saw that events 187 and 188 were present…

The missing events weren’t picked up by the processor and somehow dropped on the floor. There wasn’t an error in processing that silently failed and wasn’t retried. The missing events were temporarily invisible because the transaction that wrote them had not yet committed at the time that we fetched the events just following them in the sequence. When we queried afterwards, both transactions had committed and so the missing events were now visible. The event processor completely skipped the missing events because by the time they were visible, the code had already “seen” up to 189 and proceeded on past them.

This is how MVCC works. Even though the transaction that created events 187 and 188 started before the transaction that created 189, it committed afterwards. You can actually see evidence of this if you query xmin² along with the ids:

psql=> SELECT id, xmin FROM events WHERE id BETWEEN 185 AND 189;
 id  | xmin
-----+-------
 185 | 10673
 186 | 10676
 187 | 10679
 188 | 10679
 189 | 10680

The simple fix that doesn’t work

Okay, “no problem!”, we think. The code just needs to account for this out-of-order commit situation. We’ll just watch out for it by verifying that the event we are about to process is the next one in the sequence like so:

while true
  sql = "SELECT * FROM events WHERE id > $1 ORDER BY ID"
  events = fetch_records(sql, last_seen)

  for event in events
    if event.id != last_seen + 1
      break
    end

    process(event)
    last_seen = event.id
  end
end

Not so fast. The problem is that sometimes event last_seen + 1 will never show up because there are gaps in the sequence that will never be filled.³ This is the case even though for this particular table, records are only ever created. There are never any updates or deletes. Why? Sequences in Postgres (and sequential primary key creation mechansims in other RDBMSs) only guarantee that the numbers provided are always increasing, not that the set of values they produce are free from gaps.

Maybe we can fix the sequence?

Why not implement some mechanism on top of event creation that generates a strictly sequential series of event ids? There are a few ways to do this, but all of the simple approaches use pessimistic locks in one form or another. Using pessimistic locks for this purpose destroys concurrency, drags down performance, and introduces the possibility of deadlocks.

There are optimistic locking strategies you can use as well.⁴ However, they are more involved and it helps when your implementation language has good concurrency primitives.

Maybe we can use Postgres magic?

Despite the looming feeling we are grasping at straws, we vaguely recall past instances where esoteric Postgres features and nontrival compromises to our data abstraction layer have saved us.

After poring over countless Postgres mailing list posts⁵ we learn about the commit timestamp feature that was introduced in 9.5. This is a feature which has Postgres track the commit time of every transaction and make it available via the function pg_xact_commit_timestamp(xmin). This is fantastic! All we have to do is:

enable this config parameter on every Postgres instance on production, staging, and everyone’s dev machine
restart every Postgres server in every environment
work around the fact that we don’t have the data available from before enabling the feature
figure out a way to avoid losing commit timestamps due to 32 bit xid wraparound
work around the fact that you can’t index the results of pg_xact_commit_timestamp(xmin) and we are sorting on this value in a table with tens (and eventually hundreds) of millions of records
deal with the fact that commit timestamps aren’t unique so we have to some bookkeeping to record both the last seen timestamps as well as event ids…which are no longer received in order…

Giving up on simplicity

At this point we⁶ have found many simple solutions that don’t work, learned a lot about Postgres internals, and are still stuck with the same problem.

While I would like to end the post by revealing the simple solution that does work, the unfortunate reality is that we had to introduce significant complexity in order to address this problem.

On the bright side, the solution works really well and enabled us to learn about even more things that make Postgres great.

We’ll talk about it in our next post.

NOTE: crossposted at https://tech.precisionnutrition.com

Multiversion Concurrency Control is a mechanism that many relational database systems use to enhance performance while preventing clients from ever seeing data in an inconsistent state. This is accomplished by keeping multiple snapshots of data around. Whenever a client queries, they will see some copy of the data that provides a consistent view. This view will exclude pending transactions that have yet to be committed, unless you specify otherwise by locking for read. ↩
the system colum that stores the creation xid ↩
in this particular table, about 0.6% of primary keys are missing. The number will be higher for tables where there is a lot of contention on record creation. ↩
an awesome one can be found here and here ↩
and if we are being honest, StackOverflow ↩
special thanks to drteeth for fighting this fight with me ↩

tags: postgres - databases