Editor's note
This is a stream-of-consciousness style post. I had initially meant for it to be more of a deep-dive blog post, but making it everything I wanted such a post to be would have required many hundreds of hours of research, and at the same time $JOB went from being a bit of a slow-moving affair to being fairly fast-paced and understaffed.
CRDTs are an extremely deep field of research, and I find them fascinating. But I don't have the brain-bandwidth to do a high-effort @patio11-style deep dive on them anymore.
I've preserved the below for posterity, and so that I can publish other stuff and link to it for friends.
CRDT Info Dump
After hyperfocusing on CRDTs for a few months, I collected the involuntarily-received info dumps I sent to friends and then lightly edited them into a... somewhat coherent train of thought? 😅
Mom, Dad, look... I've been getting into CRDTs lately.
It started out with Operational Transform, just a little Google Docs with friends. But then I started hanging out with programmers and they were bringing DVCSs to parties and one thing led to another and... what? Mom, no, I'd never touch blockchains, I'm not that gullible!
Conflict-free Replicated Data Types have been the focus of pretty much every free minute I've had for the past few weeks. In those small hours of the morning, before the 4-year-old has awoken, when I'm just becoming sharp enough to think new thoughts, CRDTs have not been far from my mind. Like a song I can't get out of my head, sometimes the only way to be free is to scratch the itch once and for all. So here's a big info dump of everything I've learned.
Disclaimer
I am not a CRDT researcher, nor am I familiar with much of the wide range of academic literature on the topic. For those reasons and others, this post is going to focus on Kleppmann et al.'s OpSet design, Roh et al.'s RGA, and Nicolaescu et al.'s YATA. I am aware of the existence of WOOT, Logoot, Astrong, LSEQ, and Treedoc, but have not looked into them closely due to my interest being in the narrative of the evolution of techniques powering libraries like Automerge, Y.js, Diamond Types, and, most recently, JSON-Joy.
As such, what history I provide will be incomplete but, I hope, still informative and entertaining. I'll have many links to sources at the bottom if you're interested.
Motivations
Why do we care about CRDTs? What do they enable?
In short, all kinds of distributed, collaborative applications. CRDTs can be used to:
- Enable collaborative editing in applications like Figma, Google Docs, or Notion (although *ahem* none of the listed applications actually use CRDTs, but my point is that they could).
- Synchronize state between distributed databases, like Redis does in geographically distributed clusters.
- Represent the state of a shared game world.
- Handle pretty much any other scenario where distributed peers want to exchange some state efficiently.
What are we trying to solve?
The tricky bit when it comes to distributed, collaborative applications is making sure that every peer receives every edit and converges to the same state, without anyone having to manually resolve conflicts when edits arrive out of order or when simultaneous edits touch the same data.
Let's say Alice and Bob both have the string "Hello world". Alice edits the string to read "Hello, Charlie!" and Bob edits his copy to read "Hello... Alice?". How do we communicate those edits, and what should the final state be?
[Fig 1]

| Alice           | Bob             | Stage                          |
|-----------------|-----------------|--------------------------------|
| Hello world     | Hello world     | Starting state                 |
| Hello, Charlie! | Hello... Alice? | Parallel edits                 |
| Hello, Charlie! | Hello, Charlie! | Possible end state: data loss  |
| Hello... Alice? | Hello, Charlie! | Possible end state: data swap  |
Fig 1 shows what could happen if we take a naive approach where each peer simply sends its edited copy of the document and overwrites whatever the other had: one peer's edit silently wins and the other's is lost, or, if the messages cross on the wire, the peers end up swapping documents and still disagree.
Okay, so: main takeaways from all this playing with CRDTs?
If you want to play with CRDTs, I suggest maybe not writing one yourself like I'm doing, haha. Play around with some of the popular libraries like Automerge or Y.js instead.
CRDTs were largely dismissed as an academic toy for the last 15 years or so, because even the academics thought they were simply too slow.
Turns out academics aren't always the best judges of what optimizations are available, and even the most popular libraries have some catching up to do in terms of algorithmic optimizations.
At least when it came to client-side JS, the major libraries had decided that the only way to get more of the requisite speed was to use Rust + WASM.
But then this guy, the author of a massive utility library called json-joy, came in and beat everyone by orders of magnitude with pure TypeScript. He even compared his CRDT implementation against popular rope libraries (a rope is a data structure that stores text in a way that makes lots of insertions and random deletions efficient, which is what text editors need) and against native string handling, and won:
https://jsonjoy.com/blog/list-crdt-benchmarks
Some criticism has been levelled at the benchmarks because they focus only on the speed of processing operations and not on, say, the in-memory or on-disk size of the document, or on whether the speed difference even matters when you're ultimately going to be limited by the network in terms of receiving operations. Buuut I like that he posted them, because they have a lot of libraries going back to the drawing board on how to squeeze out better perf.
One of the hardest problems for CRDT performance basically comes down to the fact that everything is stored in a big tree-shaped structure called an OpSet.
In the bad old days, OpSets stored every character in its own node, which is always going to be slow: you spend most of your time chasing pointers, branches get pretty deep, and the tree doesn't stay terribly balanced because sequential operations from the same user basically degenerate into linked lists.
Now most implementations store runs of operations from the same user in a single node, splitting nodes as necessary for concurrent edits.
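Here's a minimal sketch of what I mean, in Rust since that's what I've been writing lately. The names and layout are my own invention, not Automerge's or Y.js's actual internals:

```rust
// A run: the characters one peer typed consecutively, stored in one node.
// A concurrent edit landing mid-run forces a split.
#[derive(Clone, Debug)]
struct Run {
    agent: u64,    // which peer inserted this run
    seq: u64,      // sequence number of the run's first character
    text: String,  // the characters themselves
    deleted: bool, // tombstone: deletes mark a run, they don't remove it
}

impl Run {
    /// Split this run at character offset `at` (a byte offset; assume
    /// ASCII for brevity) so another edit can attach in the middle.
    /// `self` keeps the left half; the right half is returned.
    fn split(&mut self, at: usize) -> Run {
        Run {
            agent: self.agent,
            seq: self.seq + at as u64,
            text: self.text.split_off(at),
            deleted: self.deleted,
        }
    }
}
```

Buuut even with run-based nodes there's still another problem.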
Automerge uses an algorithm that has to search the tree for the insertion point on every edit, even when the new insertion point is just the last insertion point + 1.
Y.js instead dumps all the nodes into an array and just uses node IDs to form links. It gains over Automerge in two ways: a) all nodes live in roughly the same chunk of memory, so there's better locality of reference, and b) it caches the index of the last insertion so it doesn't need to search again in the common case. But if the cached insertion point isn't valid, it has to degenerate to a linear scan of the nodes, because maintaining an ID -> node index would cost more space than they're willing to spend.
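To make that concrete, here's a rough sketch of the cached-cursor trick in the spirit of the description above; names and layout are mine, not Y.js's actual code:

```rust
// A run of characters, flattened into one entry of a flat array.
struct Item {
    agent: u64, // peer that inserted this run
    seq: u64,   // sequence number of the run's first character
    len: u64,   // how many characters the run covers
}

struct Doc {
    items: Vec<Item>,  // every node in one flat allocation: good locality
    last_index: usize, // where the previous edit landed
}

impl Doc {
    /// Find the item containing the character with ID (agent, seq):
    /// check the cached index first, fall back to a linear scan on a miss.
    fn find(&mut self, agent: u64, seq: u64) -> Option<usize> {
        let hit = |it: &Item| {
            it.agent == agent && seq >= it.seq && seq < it.seq + it.len
        };
        if self.items.get(self.last_index).map_or(false, |it| hit(it)) {
            return Some(self.last_index); // common case: O(1)
        }
        let idx = self.items.iter().position(|it| hit(it))?; // miss: O(n)
        self.last_index = idx;
        Some(idx)
    }
}
```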
json-joy wins out over both of them by giving every node in the OpSet two sets of tree pointers: one spatial (an in-order traversal of the spatial tree gives you the layout of the document) and one temporal (an in-order traversal gives you the entire editing history of the document). He then makes them splay trees, which are "self-optimizing" in that they rotate recently accessed nodes up toward the root of the tree for easier access.
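Here's roughly the node shape I'm describing, with field names of my own invention (json-joy's real nodes look different). I'm using arena indices rather than pointers, which is the idiomatic way to do this kind of multi-tree structure in Rust:

```rust
type Idx = usize; // index into a Vec-based node arena

struct DualNode {
    // Spatial tree: in-order traversal yields the document text.
    space_left: Option<Idx>,
    space_right: Option<Idx>,
    space_parent: Option<Idx>,
    // Temporal tree: in-order traversal yields the editing history.
    time_left: Option<Idx>,
    time_right: Option<Idx>,
    time_parent: Option<Idx>,
    // Payload: the run of inserted text this node represents.
    text: String,
}
```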
So now you have: a) optimization for the common case of inserting or deleting right at the last place you inserted or deleted, b) efficient traversal by document position or by temporal ID, and c) because he also maintains a "length" on each node, equal to that node's span plus its children's lengths, navigating the spatial tree is equivalent to navigating a rope, that text-editor-optimized string data structure we talked about.
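The rope trick in (c) is worth a sketch. This is the classic length-augmented descent, my simplification rather than json-joy's actual code: each node caches its subtree's total character count, so finding the node that contains document position `pos` is a single root-to-leaf walk instead of a scan.

```rust
struct Node {
    len_here: usize,          // characters stored in this node's own run
    len_subtree: usize,       // len_here + both children's len_subtree
    left: Option<Box<Node>>,  // earlier document positions
    right: Option<Box<Node>>, // later document positions
}

/// Returns the node containing document position `pos` and the offset
/// within that node. Assumes pos < root.len_subtree.
fn find_at(node: &Node, mut pos: usize) -> (&Node, usize) {
    // Characters in the left subtree come first in document order.
    let left_len = node.left.as_ref().map_or(0, |l| l.len_subtree);
    if pos < left_len {
        return find_at(node.left.as_ref().unwrap(), pos); // left subtree
    }
    pos -= left_len;
    if pos < node.len_here {
        return (node, pos); // position falls inside this node's run
    }
    pos -= node.len_here;
    find_at(node.right.as_ref().unwrap(), pos) // must be in right subtree
}
```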
Of course, he pays for this by making every node about 7x larger, since most traditional OpSets are singly-linked "backwards", i.e. each node only records its parent and not its children. Although I think that's mainly a storage-format thing now, and most in-memory formats are doubly linked (except maybe Y.js?). Speaking of storage, it's got pros and cons.
The main con is that traditionally you kind of have to keep all edits for all time; otherwise a disconnected client might come back with changes they can't merge, because those changes depend on edits that have been purged.
In reality, this is a tunable parameter. You pick a point in time, say "I'm okay with folks who have been disconnected longer than X possibly needing to figure out their merge manually", and do your garbage collection.
But even keeping everything isn't so bad in terms of storage. It turns out that if you imagine the operations of the OpSet stored in a table, the columns of that table tend to have low entropy, and they react very well to being stored in a columnar format where you can take advantage of compression techniques like run-length encoding, dictionary lookups, variable-length integer encoding, etc.
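Run-length encoding is the easiest of those to show. A toy example, not any library's actual wire format: the "agent" column of an op table is usually long runs of the same value, so it collapses dramatically.

```rust
/// Run-length encode a column of values into (value, count) pairs.
fn run_length_encode(column: &[u64]) -> Vec<(u64, u32)> {
    let mut out: Vec<(u64, u32)> = Vec::new();
    for &value in column {
        match out.last_mut() {
            Some((v, count)) if *v == value => *count += 1, // extend run
            _ => out.push((value, 1)),                      // start new run
        }
    }
    out
}

fn main() {
    // 8 ops from agent 1, then 3 from agent 2: 11 values become 2 pairs.
    let agents = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2];
    println!("{:?}", run_length_encode(&agents)); // [(1, 8), (2, 3)]
}
```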
My project has been translating json-joy's implementation to Rust, mostly as a way to learn Rust.
But my biggest stumbling block so far has been that the json-joy code is very dense, has zero comments, uses lots of single-character names, and often inlines code in unusual ways to eke out better performance.
In other words, the code is bad haha
Also, it's not an OpSet in the strictest sense.
Really it's a projection of the operations that doesn't store enough data about those operations to preserve the full editing history. It has the full set of CRDT metadata, so it can do arbitrary merges, but you couldn't, say, diff against different points in time, because it doesn't keep deletion operations, only the tombstoned insertions they create.
However, that might not be a terrible thing to work around?
There's nothing wrong with having most clients only be interested in that projection while storing the full history somewhere else.
Ohh yeahh Brain blast
Two different clients: OpSet clients and projection clients.
OpSet clients have the full editing history and can focus on efficient in-memory and on-disk representation and on repeating/forwarding operations to other clients, but they may suffer from slower edit operations.
Whereas projection clients are still full CRDTs but don't care about maintaining that full history, so they can get away with less space-efficient representations in exchange for better edit performance.
Of course, you suffer for it because CRDTs aren't super easy to implement. OpSets have the advantage that all the different kinds of data structures you want (lists, maps, registers, etc.) are really just different projections of the OpSet.
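As a toy example of "data types are just projections", entirely my own illustration and not any library's API: a last-writer-wins register is nothing more than "the op with the greatest timestamp wins".

```rust
struct Op {
    timestamp: (u64, u64), // (Lamport clock, agent id): gives a total order
    value: String,
}

/// Project a flat op log into a last-writer-wins register value.
fn lww_register(ops: &[Op]) -> Option<&str> {
    ops.iter()
        .max_by_key(|op| op.timestamp)
        .map(|op| op.value.as_str())
}

fn main() {
    let ops = vec![
        Op { timestamp: (1, 1), value: "draft".into() },
        Op { timestamp: (2, 2), value: "final".into() },
        Op { timestamp: (2, 1), value: "finel".into() }, // concurrent; loses tie
    ];
    assert_eq!(lww_register(&ops), Some("final"));
}
```

(The tie-break by agent id is an arbitrary choice of mine; the point is just that the order is total, so every peer projects the same winner.)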
But for the projection clients, it means building a number of different data structures that are each CRDTs in their own right, with structures optimized for their purpose. The testing alone, man.
Oh, also there's this project called Braid that wants to make CRDTs a core data type of the web, and one of the first things they've tackled is GC.
Their method is to have all clients gossip acknowledgments for received changes. When a set of changes has been acknowledged by all known peers, those changes can be "blooped" together. Then you can tune how many acknowledgments are required, when you decide to forget about a peer that has disconnected, etc. Of course, in this scenario it doesn't really make sense to have the separation between OpSet clients and projection clients, since the size of the history tends towards zero.
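The acknowledgment-tracking part might look something like this. To be clear, this is my own reconstruction of the general idea, not Braid's actual protocol:

```rust
use std::collections::{HashMap, HashSet};

struct GcTracker {
    peers: HashSet<String>,              // peers we still expect acks from
    acks: HashMap<u64, HashSet<String>>, // op id -> peers that acked it
}

impl GcTracker {
    /// Record an acknowledgment; returns true once every known peer has
    /// acked the op, meaning it's safe to "bloop" it into the base
    /// snapshot and forget the individual operation.
    fn ack(&mut self, op: u64, peer: &str) -> bool {
        let acked = self.acks.entry(op).or_default();
        acked.insert(peer.to_string());
        self.peers.iter().all(|p| acked.contains(p))
    }
}
```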
If you wanted to support some clients choosing to remember everything, you'd have to make sure that operations referencing the newly "blooped" ops still make sense to full-history peers.
That, or instead of having a distinction between client types, some clients may just choose to keep snapshots of the OpSet from before and after blooping, plus some metadata to link them in time the right way.
(Ed: this stream-of-consciousness ends here. I remain interested in CRDTs, but they are a very deep field of research, and I think in many ways they are not yet fully ripe. I look forward to seeing the improvements that are offered by the community over time, though.)