Description
Let's peek under the hood of serialization formats and see how properties inherent to the data representations themselves will either help or hinder us for a given problem.
Learn what to consider when writing your own formats by looking inside some of the best.
More at https://rustfest.global/session/9-everything-is-serialization/
Because of these wires and the physical separation of the system's components, the data which drives each component must be in well-specified, agreed-upon formats at this level of abstraction of the system.
We usually think of the data in terms of serialization. Serialization at this level includes many well-known formats: MP3, JSON, and HTTP, among others.
There are also specified, predefined, purpose-built serialization formats. At this level we're thinking about smaller serialization formats, like the little-endian format for integers, instruction sets, addresses, floating point, opcodes, and many others. At each level of abstraction of the computer system, you will find components driven by data sent over wires in serialization formats.
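As a minimal sketch (my example, not from the talk's slides), here is what that little-endian format for integers looks like as a round trip in Rust:

```rust
// Little-endian: the least significant byte comes first on the wire.
fn main() {
    let n: u32 = 0x1234_5678;
    let bytes = n.to_le_bytes();
    assert_eq!(bytes, [0x78, 0x56, 0x34, 0x12]);

    // The receiving component parses the same four bytes back out.
    let roundtrip = u32::from_le_bytes(bytes);
    assert_eq!(roundtrip, n);
}
```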
Perhaps you think in high-level tasks, like serving an HTTP request. High-level tasks are all described in the same terms: first, parsing data (in this case a URL, which is a standard serialization format having a host, a path, and other data), followed by a transform, and lastly a serialization (in this case, an HTTP response).
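Here is a rough sketch of that parse, transform, serialize shape in Rust. The names and the crude URL handling are mine, for illustration only, not the talk's code:

```rust
// Serving a request is: parse a serialization format (the URL),
// transform, then serialize another format (the HTTP response).
fn serve(raw_url: &str) -> String {
    // Parse: a crude split of "host/path", for illustration only.
    let (host, path) = raw_url.split_once('/').unwrap_or((raw_url, ""));

    // Transform: compute the body from the parsed pieces.
    let body = format!("host={host} path=/{path}");

    // Serialize: write out the HTTP/1.1 response format.
    format!(
        "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n{}",
        body.len(),
        body
    )
}

fn main() {
    println!("{}", serve("example.com/users/42"));
}
```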
If the serialization format has low entropy, the throughput of data flowing through the system is limited by the wires connecting components. Put another way, bloat in the representation of the data throttles the throughput of information. Also, data dependencies in the serialization format pause the flow of data through the system, incurring the latency cost of the wires.
Wires also impose a size limit, even when that limit is bound primarily by throughput and time. I'd like to drive these points home with a series of case studies. We will look at some properties inherent to the data representations used by specific serialization formats and see how the formats themselves either help us solve a problem or get in the way. In each example, we will also get to see how Rust gives you best-in-class tools for manipulating data across any representation.
Serialization formats tend to reflect the architecture of the systems that use them. Our computer systems are constructed of many components, nesting into other subsystems comprised of more components. Serialization formats nest in a way that reflects that. For example, inside a TCP packet (a serialization format) you may find part of an HTTP request.
The goal is to push serialized data from each format into the same buffer, rather than serializing into separate buffers independently and then copying each buffer into the nesting format above. Moving control over memory to the caller and safely passing mutable or immutable data is the name of the game. These capabilities are all necessary when parsing and writing nesting serialization formats.
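A minimal sketch of that pattern in Rust (the formats here are invented for illustration): the outer format hands its buffer down, and the nested format appends in place, with no intermediate allocations or copies.

```rust
// Each format writes into the same caller-owned buffer,
// so nesting formats costs no extra allocations or copies.
fn write_inner_format(buf: &mut Vec<u8>, value: u32) {
    buf.extend_from_slice(&value.to_le_bytes());
}

fn write_outer_format(buf: &mut Vec<u8>, values: &[u32]) {
    // Outer format: a length prefix, then the nested inner format,
    // pushed directly into the same buffer.
    buf.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for &v in values {
        write_inner_format(buf, v);
    }
}

fn main() {
    let mut buf = Vec::new();
    write_outer_format(&mut buf, &[1, 2, 3]);
    assert_eq!(buf.len(), 4 + 3 * 4); // length prefix + three u32s
}
```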
What's great about this is that Value is generic over the kind of text to parse into. One type that implements the Text trait is String, so you can parse a GraphQL query with String as the text type, and because Value then will own its data, this allows you to manipulate the GraphQL and write it back out. That capability comes with a trade-off.
That's okay: Rust takes this up a notch, because there is a third type from the standard library that implements Text. This type is a Cow of str, a clone-on-write string. With this safe and convenient type, enabled by our friend and ally the borrow checker, we can parse the GraphQL in such a way that all of the text efficiently refers to the source, except just the parts that you manipulate, and it's all specified at the call site. This is the kind of pleasantry that I've come to expect from Rust dependencies.
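The real parser's Text trait has more to it; this is only a rough sketch of the call-site choice being described, with illustrative types and signatures of my own:

```rust
use std::borrow::Cow;

// Sketch of the idea: the parsed Value is generic over its text type.
// String owns, &str borrows from the source, and Cow<str> borrows
// until you actually mutate.
struct Value<T> {
    text: T,
}

fn parse<'a, T: From<&'a str>>(source: &'a str) -> Value<T> {
    Value { text: T::from(source) }
}

fn main() {
    let source = String::from("query { hero }");

    // Owned: Value can outlive the source and be mutated freely.
    let owned: Value<String> = parse(&source);

    // Borrowed: zero-copy, tied to the source's lifetime.
    let borrowed: Value<&str> = parse(&source);

    // Clone-on-write: borrows now, clones only the parts you touch.
    let mut cow: Value<Cow<str>> = parse(&source);
    cow.text.to_mut().push_str(" # edited");

    assert_eq!(owned.text, *borrowed.text);
}
```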
The issue is in the way that GraphQL nests its serialization formats. The GraphQL string value is Unicode, but the way that GraphQL embeds strings is by putting quotes around them. With this design choice, any quotes in the string must be escaped, which inserts new data interspersed with the original data. This comes with consequences.
One: when encoding a GraphQL string value, the length of the value is not known up front; the length may increase from the re-encoding process. That means that you can't rely on resizing the buffer you are encoding to up front before copying, but instead must continually check the buffer size when encoding this value, or over-allocate by twice as much. Two: when reading GraphQL, it is impossible to refer to the data in place, because it needs to go through a parse step to remove the escape characters.
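To make that second point concrete, here is a sketch (mine, not the talk's) of why unescaping forces an allocation rather than a borrow:

```rust
// Removing escape characters forces a parse step, so the result
// cannot simply borrow the raw input bytes.
fn unescape(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // The escape is interspersed with the original data,
            // so we must copy the payload out, not borrow it.
            if let Some(next) = chars.next() {
                out.push(next);
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    assert_eq!(unescape(r#"say \"hi\""#), r#"say "hi""#);
}
```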
This problem compounds if you want to nest a byte array containing another serialization format in GraphQL. There is no support for directly storing bytes in GraphQL, so bytes must be encoded into a string using base16 or base64 or similar. That means three encode steps are necessary to nest another format: there is encoding the data as bytes, encoding that as a string, and finally re-encoding the escaped string. That may compound even further.
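A small sketch of those three encode steps (my illustration, using a hand-rolled base16 rather than any particular crate):

```rust
// The three encode steps needed to nest bytes in a quoted-string
// format: 1. the inner format's bytes, 2. base16 into a string,
// 3. re-encoding the string with quotes and escapes.
fn to_base16(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

fn quote_and_escape(s: &str) -> String {
    // Escape backslashes first, then quotes.
    let escaped = s.replace('\\', "\\\\").replace('"', "\\\"");
    format!("\"{escaped}\"")
}

fn main() {
    let inner_format: &[u8] = &[0xde, 0xad, 0xbe, 0xef]; // step 1
    let as_string = to_base16(inner_format);             // step 2
    let embedded = quote_and_escape(&as_string);         // step 3
    assert_eq!(embedded, "\"deadbeef\"");
}
```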
JSON strings are also quoted strings, meaning the same data goes through another allocation and decode step. It is common to log the JSON: another layer, another encode step. So now, if we want to get that binary data from the logs, it's just allocating and decoding the same data over and over, up through each layer, for every field.
The difference between the two can be the difference between having decoding be a major bottleneck or instant. No amount of engineering effort spent on optimizing the pipeline that consumes the data can improve the situation, because the cost is intrinsic to the representation of the data. You have to design the representation differently to overcome this.
The bottom part depicts the data with three f32 slots, one for each position coordinate; three slots, one for each color channel; and a blank slot for padding, which just makes everything line up nicely.
If you want to break up data into batches for parallelism, the most straightforward way you can do that is to have fixed-size structs stored in contiguous arrays. With that choice of serialization format, you can know where any arbitrary slice of data lives, and therefore break the data up into batches of any desired size in constant time.
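For example (a sketch of mine, not from the slides), with 16-byte vertices the byte range of any batch is pure arithmetic:

```rust
// With fixed-size structs in a contiguous array, the byte range of
// any batch is a constant-time computation.
const VERTEX_SIZE: usize = 16; // e.g. 3 x f32 position + 3 x u8 + pad

fn batch_range(start: usize, len: usize) -> std::ops::Range<usize> {
    start * VERTEX_SIZE..(start + len) * VERTEX_SIZE
}

fn main() {
    let buffer = vec![0u8; 1000 * VERTEX_SIZE];
    // Hand vertices 250..500 to another thread: no scanning needed.
    let batch = &buffer[batch_range(250, 250)];
    assert_eq!(batch.len(), 250 * VERTEX_SIZE);
}
```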
The serialization format reflects the architecture of the system. Contrast that to sending the data to the GPU in, say, JSON. With JSON, the interpretation of every single byte in the data depends on every preceding byte. The current element's length is unknown until you search for and find a token indicating the end of that item, often a comma or a close bracket.
Arguably, it's the data dependencies that make writing a correct JSON parser a challenging engineering problem in the first place. Returning to the vertex buffer format: if we were to graph its data dependencies, the interpretation of each byte in the data is only dependent on the first few bytes in the description of the buffer.
The trade-off is inherent to the representation. JSON can utilize fewer bytes for smaller values: in JSON, for example, a smaller number will take fewer bytes to represent than a larger number. Integers between 0 and 9 take 1 byte, because they only need a single character; numbers between 10 and 99 take 2 bytes, and so on. Here's a depiction of that.
To recap: that the format used by vertex buffers has a different set of capabilities than JSON is not something that can be worked around with any amount of engineering effort when consuming the data. Those capabilities are inherent to the representations themselves, and if you want different capabilities, you need to change the representation.
Okay. Having established that writing the data is the problem we are trying to solve, and the characteristics the serialization format must have because of the GPU's architecture, let's write a program to serialize the data. We'll write this program in two languages, first in TypeScript and then in Rust. I don't do this to disparage TypeScript.
The function we will write is a very stripped-down version of what you might need to write a single vertex to a vertex buffer for a game. Our vertex will consist of only a position with three 32-bit float coordinates and a color having three u8 channels. There are likely significantly more fields you would want to pack into a vertex in a real game, but this is good for illustration.
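The slide's code isn't captured in this transcript; the following is a hedged reconstruction along the lines the talk describes (the names and details are mine, not a verbatim copy of the slide):

```rust
#[repr(C)] // we control the layout: field order and padding are fixed
#[derive(Clone, Copy)]
struct Color {
    r: u8,
    g: u8,
    b: u8,
}

#[repr(C)]
#[derive(Clone, Copy)]
struct Vertex {
    position: [f32; 3], // 12 bytes
    color: Color,       // 3 bytes
    _pad: u8,           // the blank padding slot, lining up to 16
}

struct VertexBuffer {
    bytes: Vec<u8>,
}

impl VertexBuffer {
    fn push_vertex(&mut self, vertex: &Vertex) {
        // Vertex already is the wire format, so writing it is a copy.
        let bytes = unsafe {
            std::slice::from_raw_parts(
                vertex as *const Vertex as *const u8,
                std::mem::size_of::<Vertex>(),
            )
        };
        self.bytes.extend_from_slice(bytes);
    }
}

fn main() {
    let mut buffer = VertexBuffer { bytes: Vec::new() };
    buffer.push_vertex(&Vertex {
        position: [0.0, 1.0, 2.0],
        color: Color { r: 255, g: 0, b: 0 },
        _pad: 0,
    });
    assert_eq!(buffer.bytes.len(), 16);
}
```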
As we did with the interfaces in the TypeScript program, we leave out the interface for the buffer holding the byte array and count; we aren't going to need that now. Let's look at the function to write the vertex: buffer.push_vertex. That's it. Rust isn't hiding the fact that our data is represented as bytes underneath the hood, and has given us control of the representation. We needed to annotate the structs on the previous slide with #[repr(C)], moving all error-prone work into the compiler. Between JavaScript and Rust, which do you think would have better performance?
Remember that, because the choice of serialization format is a deciding factor in how you can approach the problem, the advantage Rust gives us, of being able to choose how data is represented, carries forward into every problem, not just writing vertex buffers. For the final case study, I'd like to take some time to go into how a new experimental serialization format called Tree-Buf represents data in a way that is amenable to fast compression.
We have a data set: a game of Go. What we want is an algorithm to predict the next move in the game. To help us, we're going to visualize the raw data from the data set. This scatter plot is a visual representation of the actual bytes of a Go game. As you read from left to right, there is a dot for each byte in the file, with the dot's height corresponding to the value of that byte.
Our eyes can kind of pick up on some clustering of the dots; they don't appear random. That the data does not appear random is a good indication that some sort of compression is possible. Coming up with an algorithm to predict the value of a dot may not be apparent from just looking at a scatter plot, though.
Gzip's prediction works great for text, where words are often repeated. At least in the English language, words are constructed from syllables, so it's even possible to find repetition in a text in the absence of repeated words. The problem is that in our Go game, the same coordinate on the board is seldom repeated.
We first need to separate the data so that logically related data are stored locally. Instead of writing an x followed by a y, like most serialization formats would do, let's write out all the x's first and then all the y's. Here's a visual representation of that: it looks maybe tighter than before.
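That separation is a struct-of-arrays transform. A minimal sketch of it in Rust (names and the sample moves are mine):

```rust
// Store all x coordinates together, then all y coordinates, so that
// logically related values sit next to each other.
fn split_moves(moves: &[(u8, u8)]) -> (Vec<u8>, Vec<u8>) {
    let xs = moves.iter().map(|&(x, _)| x).collect();
    let ys = moves.iter().map(|&(_, y)| y).collect();
    (xs, ys)
}

fn main() {
    let moves = [(3, 3), (15, 15), (3, 15), (16, 4)];
    let (xs, ys) = split_moves(&moves);
    assert_eq!(xs, [3, 15, 3, 16]);
    assert_eq!(ys, [3, 15, 15, 4]);
}
```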
That's going to be our prediction. With our prediction algorithm in hand, next we need to come up with a representation. We're going to write a variable-length encoding. In this graphic we have three rows of boxes, where we will describe the variable-length encoding. Each box holds a single bit. There are three boxes on the top row; the first box contains zero, and the next two boxes are blank.
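The slide's exact bit scheme isn't captured in this transcript. As a stand-in, here is a sketch of the general idea of a prefix-bit variable-length encoding (LEB128-style), where a leading 0 bit means the value fits in the remaining bits of one byte:

```rust
// A prefix-bit variable-length encoding: a leading 0 bit means the
// value fits in the remaining 7 bits of this byte; a leading 1 bit
// means another byte follows.
fn encode_var(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(0x80 | (value & 0x7f) as u8); // continuation bit set
        value >>= 7;
    }
    out.push(value as u8); // leading 0 bit: final byte
}

fn main() {
    let mut out = Vec::new();
    encode_var(3, &mut out);   // small value: one byte
    encode_var(300, &mut out); // larger value: two bytes
    assert_eq!(out, [0x03, 0xac, 0x02]);
}
```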
It didn't have to work out that way, but we can do this because a Go board has only 19 points along each axis, which means that we're not using the full range of a byte. If we did use the full range, the encoding would have to have some values extend beyond 8 bits. But indeed, most data sets do not use the full range of the underlying types used in the representation.
It requires less work to subtract the previous value in a sequence than to search for redundancy by scanning many values in the sequence. Note that this is not the best prediction algorithm possible; if you wanted to get serious about compression and squeeze the file down further, you could make an even better prediction algorithm.
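A sketch of that delta prediction (my illustration): predict that each value equals the previous one and store only the difference, which the variable-length encoding above then stores in very few bits.

```rust
// Delta prediction: store the (signed) difference from the previous
// value instead of the value itself.
fn delta_encode(values: &[u8]) -> Vec<i16> {
    let mut prev = 0u8;
    values
        .iter()
        .map(|&v| {
            let delta = v as i16 - prev as i16;
            prev = v;
            delta
        })
        .collect()
}

fn main() {
    let xs = [3, 4, 4, 5, 16, 15];
    // Small deltas dominate in clustered data like Go coordinates.
    assert_eq!(delta_encode(&xs), [3, 1, 0, 1, 11, -1]);
}
```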
Sounds like it might be useful for more than just Go. Let's review by comparing these methods in a matrix. This chart shows each of the three methods we considered written across the top: gzip, delta compression, and AI compression. Written on the side, we have three categories. The compression ratio is how small the file is; performance is how fast we can read and write the file; and the difficulty is the engineering effort required to produce and maintain the code that implements the compression method.
The overall score hardly matters, though, because where gzip wins is in the difficulty category. It doesn't take a lot of engineering effort to grab an existing crate from crates.io and run gzip on your data. You get a lot with minimal effort using something like gzip, and effort is important for working professionals under tight deadlines.
That's especially true when those performance gains come with any engineering cost: you're not likely to be criticized by your peers for using gzip, whereas the delta compression method required a fair bit of custom code. But what if we could move that check mark for the lowest difficulty in engineering effort from gzip to the delta compression method?
The next thing we did with the delta compression was to apply a type-aware compression method after having arranged the data to maximize the locality of related data. Subtracting ints and writing the deltas was only possible because we knew that the bytes were u8s, and not, say, strings, where subtracting adjacent characters would produce nonsense.
Tree-Buf, again, generalizes this principle and uses different high-performance, type-aware compression methods for the different kinds of data in the tree. Since no compression method is one-size-fits-all, it even spends some performance trying a few different compression techniques on a sample of the data from each buffer.
The benchmark data has a cardinality that reflects a real-world distribution of values. What will be measured is relative CPU time to round-trip the data through serialize and deserialize, and the relative file size. The format we'll be comparing to is MessagePack, which, as described by messagepack.org, is like JSON but fast and small.
The improvements are significant, considering that the first thing Tree-Buf has to do is reorganize your data into a tree of buffers before starting to write, and then reverse that transformation when reading the data. It has no right to even match the speed of MessagePack, much less significantly outperform it. If you wonder how this can be real, the answers have everything to do with data dependencies and choices made in representing the data as bytes: everything we just covered.
GeoJSON is a relatively compact format as far as JSON goes, because GeoJSON doesn't describe each point with redundant tags like latitude and longitude repeated over and over, as most JSON formats would, but instead opts to store that data in giant nested arrays to minimize overhead. Here are the results; the green box is GeoJSON.
There are more capabilities that you can design into representations that we did not explore today. If you consider serialization and representation as first-class citizens next to algorithms and code structure, and if you use the proper tools to parse and manipulate data, you'll be surprised by the impact. Thank you.
Host: Yes, thank you, Zach. Actually, to be honest, your serialization talk is much, much deeper than I understand right now. Thank you for such a deep presentation. And we have...
Zach: Yeah, well, I'm sorry that I didn't make it as accessible as I planned it to be. I did find it a bit of a struggle to get the ideas down into a small package and really present them. It was a struggle, so I'm sorry that it wasn't as easy to follow as I had hoped when I planned the talk.
Host: No worries about it; it has a lot of, I'd say, case studies, so it should be fine. To be honest, I need some more time to digest your presentation right now, but I understand how important this actually is. Yeah, that should be our first lesson in programming.
Zach: Yeah, that really is the focus of the talk. I mean, if you want to bring your programming to the next level, I think that the best way to do that is to just go back to the basics of the problem. Every problem really is a problem about data: transforming data and then, in the end, serializing data. So just keeping that in mind, instead of adding a lot of layers of complexity on top of that, and really focusing on that problem, I think can help a lot. If you're looking for presentations to watch, I'd recommend, for example, watching Data-Oriented Programming by Mike Acton. He talks about a lot of things in the same terms, so that's interesting. Definitely start there and then just follow the line with data-oriented programming; there's a lot to learn in that field.
Host: Yeah, thanks for the suggestion. And we have one question, that is: is there anything Tree-Buf is bad for?
Zach: Sure. So Tree-Buf takes advantage of being able to find predictability in data with arrays.
So if you want to do, say, messages like server-to-server communication, for things which do not contain arrays, then maybe something like protobuf would be better for that; Tree-Buf tries hard not to be bad in that kind of case where there are no arrays.
But there are some fundamental trade-offs: wherever Tree-Buf can optimize for the case with arrays, it will do so, because the gains there can be significant, even at some cost to the case where there isn't that.
I'm going to stick around in the chat too for a little bit, so if anyone has questions there, I'll answer them.