Making CRDTs 98% More Efficient

This was supposed to be a two-part blog post, but I got inspired to write a part three!

At the end of Building a Collaborative Pixel Art Editor with CRDTs, we had just that: a fully functional collaborative pixel editor using state-based CRDTs. The drawback to state-based CRDTs is that you need to transmit the full state between peers to sync up. And our CRDT’s full state is pretty big! For a 100x100 image, it’s around 648kb. Yikes!

I won’t bury the lede: we’re going to reduce that by almost 98% to around 14kb. For context, that’s around half as big as the uncompressed image without any CRDT metadata. While we’ll never beat PNG, we can still do pretty well.

The benchmarks I’ll cite in this article are all based on this image I drew, which was encoded here using the same format we’ll create by the end of the post:1

This playground uses the same pixel editor we built in the last post. You can change the color and brush size by clicking on the buttons below the canvas. The table to the right shows the size of the data in kilobytes. Right now, it just shows the size of the uncompressed data and the final result in our mystery format. As we compress our data further, we’ll add rows to the table showing how each format stacks up against the others, and their relative sizes.

Currently, our data looks something like this:

{
  "0,0": ["0442197c-8144-47f7-ae64-340a2df3d796", 3, [255, 255, 255]],
  "1,0": ["0442197c-8144-47f7-ae64-340a2df3d796", 4, [255, 255, 255]],
  "2,0": ["0442197c-8144-47f7-ae64-340a2df3d796", 2, null],
  "0,1": ["4ae8bd76-e84c-4652-bcd8-a5e339c574f3", 2, [108, 79, 255]],
  "1,1": ["4ae8bd76-e84c-4652-bcd8-a5e339c574f3", 3, [108, 79, 255]]
}

It’s just the JSON state of our LWW Map CRDT.2 Each key is a string of the pixel’s (x,y) coordinates, and each value is a LWW Register state: a tuple of a peer ID, a timestamp and an RGB value. If a pixel has been erased, its RGB value is null.

Here’s a blank canvas so you can see how quickly the size increases as you draw. Pay attention to the different formats as we go through the article: some will actually start out larger than our uncompressed data, but quickly become smaller as more pixels are drawn!

Hex Codes

The easiest change to make is storing colors as hex codes rather than RGB tuples. As a reminder, here’s the type of each color:

type RGB = [red: number, green: number, blue: number];

Each channel has between one and three digits, so our color values look something like this:

[1, 234, 56]

That’s ten characters: six for the numbers, two for the commas and two for the brackets. If each number had three digits, it would take up a whole 13 characters.

Hex codes come in at eight characters, max. By using sixteen digits (A–F representing the numbers 10–15) instead of our usual ten, we can represent each channel with two characters instead of three. Plus, since we know that each channel is two characters long, we can omit the commas.

There’s one more trick: if both characters in each channel are the same, we can omit one, which takes us down to five characters (including quotes). Here’s what the data looks like now:

{
  "0,0": ["0442197c-8144-47f7-ae64-340a2df3d796", 3, "fff"],
  "1,0": ["0442197c-8144-47f7-ae64-340a2df3d796", 4, "fff"],
  "2,0": ["0442197c-8144-47f7-ae64-340a2df3d796", 2, null],
  "0,1": ["4ae8bd76-e84c-4652-bcd8-a5e339c574f3", 2, "6c4fff"],
  "1,1": ["4ae8bd76-e84c-4652-bcd8-a5e339c574f3", 3, "6c4fff"]
}

All in all, using hex codes rather than RGB tuples gets us down from 648kb to 606kb — a 6% improvement. It’s a start, but we’ll need much bigger improvements than that if we’re going to get all the way down to 14kb.
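To make the conversion concrete, here’s a sketch in TypeScript (the function names are mine, not from the editor’s codebase). It applies the three-character shorthand only when every channel can be shortened, so the string length stays unambiguous:

```typescript
type RGB = [red: number, green: number, blue: number];

// Convert an RGB tuple to a hex code, using the three-character
// shorthand ("ffffff" → "fff") only when every channel allows it.
function rgbToHex([r, g, b]: RGB): string {
  const channels = [r, g, b].map(c => c.toString(16).padStart(2, "0"));
  const short = channels.every(c => c[0] === c[1]);
  return channels.map(c => (short ? c[0] : c)).join("");
}

// Reverse the process, expanding shorthand back to six digits.
function hexToRgb(hex: string): RGB {
  if (hex.length === 3) hex = [...hex].map(c => c + c).join("");
  return [
    parseInt(hex.slice(0, 2), 16),
    parseInt(hex.slice(2, 4), 16),
    parseInt(hex.slice(4, 6), 16),
  ];
}
```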

UUID Table

The next things to look at are the UUIDs. They’re a whopping 38 characters each (including quotes) and they show up in every single pixel. We could — and should! — remove the hyphens, but that will only shave off four characters per pixel. We need something more substantial.

Instead of keeping the UUIDs directly in each pixel, we’ll store them in a table. That way, they only appear once in the document. In their place, each pixel will store a table key that maps to the full UUID. When we need the full state back, we perform the same process in reverse, replacing each pixel’s table key with its corresponding UUID.

Here’s what the data looks like now. The 0th element of each array in data represents an index in the uuids array:

{
  "uuids": ["0442197c814447f7ae64340a2df3d796", "4ae8bd76e84c4652bcd8a5e339c574f3"],
  "data": {
    "0,0": [0, 3, "fff"],
    "1,0": [0, 4, "fff"],
    "2,0": [0, 2, null],
    "0,1": [1, 2, "6c4fff"],
    "1,1": [1, 3, "6c4fff"]
  }
}

Our UUID table uses an array rather than a map. Arrays have implied keys: each item’s index. Using numbers as keys also saves two characters every time we reference a UUID, since we don’t need quotation marks as we would with strings.

This turns out to be a substantial savings, taking us down to 236kb — a 63% improvement!
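The interning step is the same for any repeated value, so one helper can build both the UUID table and (as we’ll see next) the palette. A minimal sketch, with names of my own invention:

```typescript
// Return the value’s index in the table, appending it the first
// time we see it so each value is stored exactly once.
function intern<T>(table: T[], value: T): number {
  let index = table.indexOf(value);
  if (index === -1) index = table.push(value) - 1;
  return index;
}

// Compressing: each pixel stores a table index instead of a UUID.
const state: Record<string, [string, number, string | null]> = {
  "0,0": ["0442197c814447f7ae64340a2df3d796", 3, "fff"],
  "2,0": ["0442197c814447f7ae64340a2df3d796", 2, null],
  "0,1": ["4ae8bd76e84c4652bcd8a5e339c574f3", 2, "6c4fff"],
};

const uuids: string[] = [];
const data: Record<string, [number, number, string | null]> = {};
for (const [key, [uuid, timestamp, color]] of Object.entries(state)) {
  data[key] = [intern(uuids, uuid), timestamp, color];
}
```

Decompression is the reverse lookup: replace each index with `uuids[index]`.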

Palette Table

We can use the exact same technique for the colors. Rather than storing each color directly in each pixel, we build a table of all the colors used and replace the hex codes in each pixel with table keys.

The last element of each array in data now represents an index in the palette array:

{
  "uuids": ["0442197c814447f7ae64340a2df3d796", "4ae8bd76e84c4652bcd8a5e339c574f3"],
  "palette": ["fff", "6c4fff"],
  "data": {
    "0,0": [0, 3, 0],
    "1,0": [0, 4, 0],
    "2,0": [0, 2, null],
    "0,1": [1, 2, 1],
    "1,1": [1, 3, 1]
  }
}

Using a table like this seems strictly better, but there are some pathological scenarios in which it actually wastes space. Consider the case in which each color only appears once: we’d have the same number of colors as we have pixels, plus the overhead of mapping each pixel to a color in the table. Our current strategy works for pixel art, which tends to have a limited palette — but if we were dealing with photographs, we might want to choose another way to compress the color data.

In this case, the palette table slims us down to 187kb — 71% smaller than our uncompressed data.

Run-Length Encoding

The next big things are the pixel coordinates — the map keys of our data object. With x and y each between one and three digits, these can take up to ten characters each (including the quotes and colon).

Remember how, when we were drawing to the screen, we mapped indices in the buffer to coordinates? We went down the array four bytes at a time, drawing to each pixel in the first row and wrapping around when we reached the end.

Here we’ll do the same thing, but in reverse. We’ll make the data an array, setting the pixel at "0,0" as the first element, "1,0" as the second element, etc.

This raises a new issue: how do we represent blank pixels, where no user has drawn yet? Before, they would just be missing from the map. But now we have to represent them — otherwise, every pixel after a blank space would be in the wrong place!

We could just put some sort of “spacer” value like null in the array where blank pixels would be. That’d work, but it would also waste a lot of space.

Instead, we’ll use something called run-length encoding. It’s a very simple compression algorithm: when the same item appears consecutively in a sequence, instead of storing each item, store one of the item and the number of repeated instances.

Let’s take this weird sentence as an example:

James while John had had had had had had had had had had had a better effect on the teacher.

There’s a lot of repetition there. The sentence has 92 characters, and the word “had” appears eleven times!

With run-length encoding, the sentence might look like this:

1James 1while 1John 11had 1a 1better 1effect 1on 1the 1teacher.

The idea is that before each word, we add a number indicating how many consecutive times it shows up. That lets us skip every consecutive occurrence after the first. If we omit the spaces, this sentence has 54 characters — a 42% reduction from the original!
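As a sketch, run-length encoding a sequence of words takes only a few lines:

```typescript
// Collapse consecutive repeats into [count, word] pairs.
function encodeRuns(words: string[]): [number, string][] {
  const runs: [number, string][] = [];
  for (const word of words) {
    const last = runs[runs.length - 1];
    if (last && last[1] === word) last[0]++;
    else runs.push([1, word]);
  }
  return runs;
}

// Expand each pair back into `count` copies of the word.
function decodeRuns(runs: [number, string][]): string[] {
  return runs.flatMap(([count, word]) => Array<string>(count).fill(word));
}
```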

It’s not all sunshine and lollipops, though. Consider this classic pangram:

The quick brown fox jumps over the lazy dog.

That sentence is 44 characters and no words repeat consecutively. Here’s what it might look like with run-length encoding:

1The 1quick 1brown 1fox 1jumps 1over 1the 1lazy 1dog.

53 characters — a 20% increase! What gives?

The issue is that when you compress something, you need to add metadata so that you can figure out what it looked like originally. Compression algorithms make assumptions about what patterns will occur in the original data in order to optimize the metadata. If the assumptions are right, the compressed data (including the metadata) is much smaller than the original data. But if the assumptions are wrong, the “compressed” data can even end up bigger than it was before!

Our assumption here is that run-length encoding will not be a good fit for LWW Register states, since there are three separate pieces of data that can vary independently. All blank spaces are the same, though, which makes them an excellent candidate for run-length encoding!

Rather than somehow representing each individual blank space in the array, we’ll store a number indicating how many consecutive blank spaces we came across. That lets us know exactly which coordinates the next pixel will be at.

Now that we’re no longer storing the coordinates of each pixel, we also need to store the width of the image. That way, when we’re decompressing the data, we know when the pixels should wrap onto the next row.

Here’s what the data looks like now. Since we’re now using a lot of numbers without any labels, I added some comments so it’s clear what everything is referring to.

{
  "uuids": ["0442197c814447f7ae64340a2df3d796", "4ae8bd76e84c4652bcd8a5e339c574f3"],
  "palette": ["fff", "6c4fff"],
  "width": 100,
  "data": [
    [/* uuid index */ 0, /* timestamp */ 3, /* palette index */ 0],
    [/* uuid index */ 0, /* timestamp */ 4, /* palette index */ 0],
    [/* uuid index */ 0, /* timestamp */ 2, /* palette index */ null],
    97, // run of blank pixels
    [/* uuid index */ 1, /* timestamp */ 2, /* palette index */ 1],
    [/* uuid index */ 1, /* timestamp */ 3, /* palette index */ 1]
  ]
}

Notice how the data array is not homogeneous: it contains either a tuple of peer ID, timestamp and color ID, or a number indicating how many consecutive blank spaces there are.

Using run-length encoding brings us to 129kb — just over 80% smaller than our uncompressed data. We’re doing pretty well, but those last few percentage points are the trickiest ones to eke out. We’re gonna need to try something new.
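Putting the section together, here’s a sketch of flattening the coordinate-keyed map into a row-major array, collapsing blank pixels into run lengths (the types and names are mine):

```typescript
type Pixel = [uuid: number, timestamp: number, color: number | null];

// Walk the canvas row by row. Drawn pixels are pushed as-is;
// consecutive blanks collapse into a single count.
function flatten(
  pixels: Map<string, Pixel>,
  width: number,
  height: number
): (Pixel | number)[] {
  const data: (Pixel | number)[] = [];
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const pixel = pixels.get(`${x},${y}`);
      const last = data[data.length - 1];
      if (pixel) data.push(pixel);
      // extend the current run of blanks, or start a new one
      else if (typeof last === "number") data[data.length - 1] = last + 1;
      else data.push(1);
    }
  }
  return data;
}
```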

Binary Encoding

So far, we’re at just more than an 80% decrease. That’s pretty good, and we could probably get it even smaller if we stopped using JSON. But we’re running into a fundamental limit: it’s inefficient to store things as text. To get much further, we’ll need to use binary encoding.

When we’re using a plain text format like JSON, we think in characters. When we’re using binary encoding, we think in bytes. A byte holds an integer from 0 to 255. If we treat a group of bytes as a single unit, it can hold much larger numbers — for example, two bytes can hold up to 65535.3
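In JavaScript, a DataView makes this concrete: it reads and writes multi-byte integers over the bytes of a Uint8Array:

```typescript
const bytes = new Uint8Array(2);
const view = new DataView(bytes.buffer);

// Two bytes interpreted as one unit hold integers from 0 to 65535.
view.setUint16(0, 65535); // big-endian by default: both bytes are 0xff

// 300 doesn’t fit in one byte, but it does in two:
// 300 = 1 × 256 + 44, so the bytes become 0x01 and 0x2c.
view.setUint16(0, 300);
```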

The numbers can represent anything we want them to. One common way to interpret them is as ASCII, meaning each number corresponds to a text character. Another way is just as the number they hold, meaning a byte that holds 127 would represent the number 127. We’ll interpret bytes in a few different ways as we create our format.

We’ll define a format similar to RIFF, which serves as the basis for file types like WAV and AVI. It’s made up of “chunks”, each of which has two sections: a “header” that identifies the chunk type and tells us how to parse the bytes that follow, and the “data”, which is the actual information we need to store.

First, binary files usually have a magic number that identifies the format. Ours will represent the ASCII text CRDT, so the first four bytes of the file will be 43 52 44 54.4 That goes in the “info” chunk, along with the width of our artboard.

Here’s a diagram of the binary file. The first part contains a human-readable overview; all the data is labeled, and the values are in a format that makes sense to us (for example, you can see that the magic number reads “CRDT” and that the artboard is 100 pixels wide). The second column contains the raw bytes in the file; each byte shows its index in the file and the hex digits representing its number from 0 to 255 (or 00 to ff in hex). You can hover over data on the left and see the corresponding bytes highlighted, and vice versa.

After the info chunk, the file has three more chunks: one to hold our UUID table, one to hold our palette table and one to hold our pixel data.

We’ll start with the UUID chunk. The header consists of two parts:

  1. Four bytes that read UUID in ASCII.
  2. Two bytes that represent the length of the rest of the chunk in bytes.5

The UUID chunk’s data is just all the peer UUIDs, one after the other. These each took up 34 characters (including quotes) in our JSON, but here we can represent each of them in a mere 16 bytes. Just like before, the key will be the index of each UUID — so the first one in the chunk will be 0, the second will be 1, etc.

Here’s the full UUID chunk. Each white block represents one item in the chunk data — in this case, a UUID:

The next chunk is the palette chunk. The header looks similar to the UUID chunk header:

  1. Four bytes that read PLTT in ASCII.6
  2. Two bytes that represent the length of the rest of the chunk in bytes.

The palette chunk’s data is also similar to the UUID chunk’s data: all the colors, one after the other. When we represented these with hex codes, they took up either five or eight bytes, but here we can represent each one in a concise three bytes — one for each RGB color channel.7

This chunk has one extra quirk: when writing the data, we skip the first item’s place (so the first color starts at the fourth byte of data, or the tenth byte including the header). That might not make sense now, but we’ll learn why in a minute.

Here’s the full palette chunk:

The last chunk is the data chunk. You know the drill by now:

  1. Four bytes that read DATA in ASCII.
  2. Two bytes that represent the length of the rest of the chunk in bytes.

That’s the header. There are two types of items that can appear in the data chunk: pixels and blanks. Each item will start with a single byte “tag” indicating which type it is — 0 for blank and 1 for pixel.

Blank items are easy. After the tag, they have two bytes indicating how many consecutive blank spaces they represent.

There’s a tradeoff here: by using two bytes to indicate the run length, a single blank item can represent up to 65,535 blank spaces. But then every blank item needs to use two bytes to represent the number of spaces — even if none of them need the second byte. If we used one byte, it would mean each blank item is more compact, but if there are more than 255 blank pixels in a row we’d need to use two or more consecutive blank items to represent them.

Pixel items are a bit more complicated. Remember, each pixel is really a LWW Register, which has three pieces of information: the peer ID that last wrote to the register, the timestamp and the value. After the tag, a pixel item has a section for each one:

  1. A byte for the peer ID, indicating its index in the UUID chunk.
  2. A byte for the timestamp.8
  3. A byte for the color, indicating its index in the palette chunk.

We also need a way to store erased pixels. In the JSON format, we stored null as a way to indicate that the pixel had been added and then erased. In binary, there’s no null — all we have are bytes. That’s why we skipped the first three bytes when writing the palette chunk: we know that no color will use the 0 index. If any pixel has 0 as its index, that pixel was erased.
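Here’s one way the data chunk’s items could be serialized (a sketch with my own names; palette indices are shifted up by one so that 0 can mean “erased”, matching the skipped slot in the palette chunk):

```typescript
type Item =
  | { type: "pixel"; uuid: number; timestamp: number; color: number | null }
  | { type: "blank"; length: number };

// Tag byte: 0 = blank, 1 = pixel. Blank runs use a two-byte,
// big-endian length; pixels store a uuid index, a timestamp and a
// palette index shifted up by one (0 is reserved for "erased").
function encodeItems(items: Item[]): number[] {
  const bytes: number[] = [];
  for (const item of items) {
    if (item.type === "blank") {
      bytes.push(0, item.length >> 8, item.length & 0xff);
    } else {
      const color = item.color === null ? 0 : item.color + 1;
      bytes.push(1, item.uuid, item.timestamp, color);
    }
  }
  return bytes;
}
```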

Here’s our full file. As you hover over pixel data that refers to items in the UUID and palette chunks, you’ll see arrows pointing back to the actual data in those chunks.

This gets us down to just 40kb. An RGB color takes up three bytes, and at this size we’re representing it with only a single byte of overhead! At an almost 94% reduction from the uncompressed data, we’ve already done a great job and could probably stop here if we wanted. But we can take it even further.

Run-Length Binary Encoding

We’re in the home stretch now. The info, UUID and palette chunks are about as compact as can be, but we can save a bunch of space in the data chunk.

If you recall a couple sections ago, the reason we run-length encoded only the blank spaces was that each pixel had three pieces of information that could vary independently: the timestamp, the peer ID of the last writer and the color. We’re going to split those up into individual chunks — one for each piece of information. Within each chunk, there will be only one single thing that can vary between items, which will let us use run-length encoding on each individual piece of data. So instead of one data chunk, we’ll have three: a writer chunk, a color chunk and a timestamp chunk.

The writer and color chunks are very similar to each other. Their headers have four bytes that read WRTR (for the writer chunk) or COLR (for the color chunk) in ASCII and two bytes for the length of the rest of the chunk in bytes. Each item in these chunks takes up three bytes: one for the index of the corresponding item in the UUID or palette chunk, and two for the length of the run.

You might notice that there are two items in the writer data, but three items in the color data. It means that — going from the top left pixel to the bottom right — there are two runs of consecutive pixels with the same writer, but three runs with the same color. Because we’re now keeping track of each kind of data separately, we can compress them without worrying about the others.9

The final chunk is the timestamp chunk. The header has four bytes that read TIME in ASCII, and two bytes for the length of the data items in bytes. Each item in the timestamp chunk starts with a single byte representing the timestamp of that pixel. We can use a clever trick here to disambiguate blank pixels from drawn pixels. When we instantiate a LWW Register, the timestamp is set to 1. There will never be a register with a timestamp of 0, which means we can use that value to indicate a blank pixel.

We’ll use the same run-length encoding strategy from the previous binary format. If the timestamp is a positive number, the next byte will be the next timestamp. But if the timestamp is 0, we’ll interpret the next two bytes as the length of the run of blank pixels.10

That also lets us save a bit of space by ignoring runs of blank pixels in the writer and color chunks, by which I mean that a blank pixel doesn’t end the run of a UUID or color. Only when we see a different UUID or color do we start a new run. We consider the timestamp chunk to be the ultimate authority on whether a pixel is blank: if the timestamp is 0, then it doesn’t matter what we think the UUID or color is.
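As a sketch, decoding the timestamp chunk’s data might look like this, with blank pixels represented as timestamp 0:

```typescript
// A positive byte is one pixel’s timestamp. A 0 byte is followed
// by a two-byte, big-endian count of consecutive blank pixels.
function decodeTimestamps(bytes: Uint8Array): number[] {
  const timestamps: number[] = [];
  let i = 0;
  while (i < bytes.length) {
    if (bytes[i] === 0) {
      const run = (bytes[i + 1] << 8) | bytes[i + 2];
      for (let j = 0; j < run; j++) timestamps.push(0);
      i += 3;
    } else {
      timestamps.push(bytes[i]);
      i += 1;
    }
  }
  return timestamps;
}
```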

Here’s the full file:

This version of our format comes in at 14kb. That’s a 98% decrease from our initial 648kb, and it’s 50% smaller than the uncompressed RGB data — we’re using less than one and a half bytes per pixel!

If you’re curious how the formats stack up when the canvas isn’t blank, here’s the benchmark image with all compression formats compared:

Conclusion

As you can see, even though state-based CRDTs require their whole state to be synced between peers, it’s possible for the actual data transferred to be significantly more efficient. We took a CRDT state weighing in at 648kb and reduced its size by almost 98% to only 14kb!

The optimization techniques here are inspired by Martin Kleppmann’s talk CRDTs: The Hard Parts. He and Ink & Switch managed to reduce the gzipped size of a 100kb plain text file with CRDT metadata to only 114kb — a roughly 14% increase over the uncompressed plain text (and compressing text is significantly harder than the pixel data example we just went through).

This all might seem like a lot of busywork to get back to a reasonable baseline. But in practice, there are CRDT libraries available, such as Yjs and Automerge, that have already made these or similar optimizations for you.

There’s also a great website called crdt.tech that has a list of CRDT libraries, as well as blog posts and academic papers.

CRDTs are awesome! Go forth and build something cool!

Footnotes

  1. I’m imagining this picture hanging in the Louvre soon after this post is published.

  2. The JSON is formatted for easier reading, but when we’re talking about how much space it takes up, we’ll pretend it’s minified by omitting whitespace.

  3. If this is too abstract, here’s how it works in JavaScript: a binary file is just a Uint8Array, an array of 8-bit unsigned integers. Each element of the array is a single byte. Using multiple bytes to represent numbers greater than 255 gets a little tricky, but this is the basic idea.

  4. Note that this is in hexadecimal, so each of those numbers is base 16. In decimal, they come out to 67 82 68 84.

  5. Why do we need to know the length in advance? With binary, we don’t usually use delimiters like quotes or brackets to indicate when things begin or end. Instead, we expect to know the size in bytes of each item in advance. That way, we can just go through the items that many bytes at a time, counting up until we’ve gone through the total number of bytes.

  6. Yeah, we’re missing a few letters, but we need each chunk name to fit in the same number of bytes.

  7. Remember how each RGB color channel is an integer between 0 and 255? It’s not a coincidence that those are the exact numbers you can fit in a single byte!

  8. Using one byte for the timestamp makes each pixel smaller, but limits us to a maximum timestamp of 255. Lifting that limit would mean adding another byte to every pixel — even if every timestamp is below that amount. Since we know the maximum timestamp in advance, we could make an optimization here: use two bytes to store the timestamp if and only if one of the timestamps is above 255. Saving one byte per pixel sounds small, but that reduces the size of a 100x100 image by a full 10kb!

  9. One downside is that our data is a bit more confusing: items at the same index in each category might not refer to the same pixel. This is a tradeoff of compressing data: the more we squeeze it into as little space as possible, the less it resembles the original thing we’re compressing.

  10. Why not run-length encode all the timestamps? In my (very limited) testing, the timestamps varied enough that run-length encoding non-blank pixels wasted space.