Description
Protocol Labs' @mikeal, @chafey, and @rvagg will discuss tips and tricks and key underlying concepts for storing large data sets on Filecoin!
Keep up with events for the Filecoin community by heading over to the Filecoin project on GitHub:
https://github.com/filecoin-project
Check out the Filecoin community resources:
https://github.com/filecoin-project/community
And stay connected on Filecoin Slack:
https://app.slack.com/client/TEHTVS1L6
A: Yeah, welcome everyone. We're just going to give a few more minutes here to allow some more time for people to come in, but then we will get started.

A: Okay, let's get started. Welcome, everybody, and thank you so much for joining us today. We have a really exciting hour ahead of us with another Filecoin Master Class, and this happens to be one of the most requested ones we've had so far, so it's going to be a really great hour with a lot of great information.

A: So let's hop right into it. I'd like to start by introducing Mikeal, Rod, and Chris from Protocol Labs; they're going to be the ones taking us through how to prepare large data sets for Filecoin storage. So with that, Rod, would you like to get us started?
B: Okay, so I'm Rod Vagg, I'm here with Chris, and Mikeal is here as well. We are on the IPLD team. IPLD is the middle layer of the stack: it's concerned with the data layer and with how we connect pieces of data together in a distributed fashion.

B: One of my aims here is to get folks a little bit out of the standard IPFS frame of thinking and into thinking more about content-addressed data structures and how we link pieces together to form complex graphs, rather than just jamming everything into the IPFS way of building.

B: When Chris gives his talk after mine, he's going to cover how we used these approaches to process extremely large data sets in a very parallel way, so that they're suitable for storing in Filecoin and Discover. With that, I'm going to start by rewinding right back to the beginnings of content addressability and some of the basic terminology and concepts that you may or may not be familiar with, just to set the theme.
B: Let's start off by talking about Merkle trees. I'm assuming most people listening in have the very basics of content addressing: a content address, in our world at least, is a way of addressing content by its hash, its hash digest. Content and hash have a one-to-one relationship, and we can look up content by having its hash.

B: The term Merkle tree comes from a patent that Ralph Merkle filed in 1979, and the basic idea is that you can have hash digests embedded within hash digests, forming a tree-like structure. What this means is that at any point in the tree, the address, the hash, of that point authenticates all of the linked data underneath it.

B: So you can have a single hash that references many hashes underneath it. Importantly, the original patent diagram shows the classic-style Merkle tree, where you concatenate hashes to form a very pretty binary graph. A Merkle tree is not just a concatenation of hashes, although commonly in the real world, in Bitcoin and other places, you'll see the term Merkle tree used for a concatenation of hashes forming that very clean binary tree.
B: A DAG is something you might have encountered in Filecoin documentation; we use this term a lot. It stands for directed acyclic graph. It comes from graph theory and isn't necessarily to do with content addressability, but it's a useful concept. What it really means is that you have a graph of nodes. It's directed, so there's directionality to all of the links; in graph theory generally you can have bi-directional graphs as well, but a DAG is directed, so the links only go one way. And it's acyclic: there's no possibility of cycles within the graph.

B: So you can't link from a later node back to an earlier node. This fits really nicely with Merkle trees because of hash functions. A hash function gives us directionality; it's a one-way operation: you can only go, in theory anyway, from the hash of the data to the data. And there are no cycles: you can't link to data that doesn't already exist, data you don't already have a hash for. You can't leave a placeholder for a hash that gets filled in later, so there's no way to form a cycle.
B: So we often end up with terms like Merkle DAG, or DAG by itself, or Merkle tree. These terms get thrown around a bit, but that's what they mean. Now, to bring it into something a little more concrete: a classic thing people often want to do with content-addressed data structures is build a file system. This is very similar to the IPFS way of building a file system, and many other systems do similar things. In this example, we might have eight files.

B: We might just have an array of hashes, and that might be our directory; maybe that's all we're hashing. We hash those directories as well, whatever form they take, and then the hashes of those directories become addresses that we can point to. So our file system, in this very simplistic example, has ten independent chunks, each of which has its own address. You can see the directionality there, and the lack of cycles as well. So this is a DAG.
B: This might be familiar if you know anything about Git; this tree looks very similar, a little more complicated, but in the lower levels of the Git format we find exactly the same thing. You have what are called trees, which are essentially directories, and the trees point to blobs, which are essentially files. You can have trees pointing to other trees, but all the way at the bottom you have these blobs.

B: The blobs get hashed, Git uses SHA-1 at the moment, and those SHA-1 hashes get included in the trees, which point to other trees and blobs. You hash those, include the hash of the top tree within the actual commit, and then you hash the commit to get the commit hash that we're used to seeing. The commit itself contains not only a link to the tree and its blobs.
B: So it's built in the Merkle fashion, but the commit also includes other metadata, like the commit message, and it also includes a link to the previous commit. In this way we build a tree that grows over time and can mutate over time. In this example, on the left-hand side I've got the newer commits pointing back to older commits, and you can see that I'm mutating my files over time: in the first commit, on the right, I've got four files.

B: In the second commit I've added two more; in the third commit I've made a subdirectory, put some files in it, deleted a file, and added a new one, but I'm still pointing back to some of the original blobs, I've got some new ones, and I've got directory structure. This is a very familiar pattern in content addressing that you'll see repeated again and again: these acyclic graphs, using the hashes to authenticate the data all the way down at the leaves by including it at the top.
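A minimal sketch, not from the talk, of how Git derives a blob's content address in exactly this fashion: SHA-1 over a "blob <size>\0" header plus the file bytes, using only Node.js built-ins.

```js
import { createHash } from 'crypto';

function gitBlobHash(content) {
  const body = Buffer.from(content);
  // Git prepends an object header before hashing the bytes.
  const header = Buffer.from(`blob ${body.length}\0`);
  return createHash('sha1').update(Buffer.concat([header, body])).digest('hex');
}

// Should match `git hash-object` for the same bytes:
console.log(gitBlobHash('hello world\n')); // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```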
B: So IPLD comes in here with some primitives that we think help with building these structures. The first one is the CID, which is our extension of a hash digest. CID stands for content identifier, and we use it as a self-describing content identifier, because a hash digest is really just an array of bytes: it could be any of a number of different lengths of standard hash digests, from different hash functions, and we also don't know what it points to without extra context.

B: We add a multicodec that tells us the content type, I'll talk about that in a minute, and that tells us essentially what the hash is pointing to: what is it a hash of, and what will we find when we get there? The CID also includes a multihash code, which tells us what hash function was used, and the multihash also includes the length of the hash, so we can see that right up front.
B: We know what the length should be and how many bytes to expect. There are some links on the slide, and in the Zoom chat as well, that will take you to the specifications for these things, and there are code repositories too. To break that down a little further into something that might be more familiar: CIDs can be represented as strings, and we use this thing called multibase, which is just a set of different ways of representing bytes as base-encoded strings.

B: Because we've got an array of bytes, we need to turn it into something we can actually print, so we use multibase to do that. The CID I'm showing you there at the top is in base32, and the "bafy" prefix is a very common one you'll see, for reasons that will be clear in a minute. The very first character tells us something important: it tells us what base the string is in.
B: In this one we've got the 'b', and in the multibase table we can look up 'b' and see that it means base32. We can use that to decode the string into bytes, and I've given you the hexadecimal there. In those bytes we have the original hash, the original content address, which I've put in bold on the right. The little prefix before it tells us a lot about what's going on.

B: Working through the prefix: first of all there's the version, CID version 1, which is the first byte, just 0x01. The second part is the codec, which is hexadecimal 0x70, and that tells us that the data we're pointing to is in dag-pb format, the typical IPFS UnixFS file format.
B: So we know what type of data this thing is pointing to, and when we get there we can decode it: we can pull up the dag-pb codec and say, decode this binary.

B: The next little bit is the multihash. It tells us the hash function, SHA2-256, which is hexadecimal 0x12, and then the number of bytes in the digest, hexadecimal 0x20, which is 32 in decimal. Then we decode those 32 bytes, and that gives us the hash we can use. When we load that binary, we can rerun the hash function and verify that it's the data we expected.
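A minimal sketch, assuming the JavaScript `multiformats` library, of pulling apart a CIDv1 the way just described: multibase prefix, version, codec, and multihash. The CID string here is just an illustrative base32 CIDv1; substitute any real one.

```js
import { CID } from 'multiformats/cid';

const cid = CID.parse('bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi');

console.log(cid.version);          // 1    -> CID version
console.log(cid.code);             // 0x70 -> dag-pb codec
console.log(cid.multihash.code);   // 0x12 -> sha2-256
console.log(cid.multihash.size);   // 32   -> digest length in bytes
console.log(cid.multihash.digest); // the raw 32-byte hash digest
```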
B: So that's what a CID is. They're very flexible; you can do all sorts of things with CIDs, and you'll see them a lot, sometimes in some novel ways too. Now, CIDs also have a version 0, which you'll commonly see around. It has been deprecated and we wouldn't advise actually using it in anything new, but you will encounter them: they start with a capital Q.

B: That tells us the multibase is base58btc, and we can use that to decode the rest. CID version 0 assumes dag-pb, it doesn't vary on the codec, and it also forces SHA2-256. So when you load a CID version 0, you know it's dag-pb and it will be SHA2-256.

B: This is the typical UnixFS-for-IPFS format. Everything newer is using CID version 1, and you'd be strongly recommended to use that because it carries a lot more information.
B: The second thing that IPLD brings is codecs. We use a table to map integer codes to codecs. A codec tells us how to decode and encode binary block data, and we have a lot of different codecs; a lot of them are not ours. We've got our own codecs, but codecs can be generic: JSON as a codec tells us how to decode strings into object data. CBOR is another one, designed specifically for binary, and it's much more compact. And then there's also just raw bytes.

B: If we have video or something where we don't actually want to decode it into a data model form, where we want to do something else with it, we use raw bytes. This is used a lot in Filecoin Discover, which Chris will be covering as well; there are a lot of raw bytes in there.
B: Codecs can also be other content-addressed formats that include their own implicit linking types, like Git and Bitcoin blocks. We can view them through an IPLD lens and interpret them with our own tools; we can produce CIDs when we instantiate data out of them. So we have codecs that will interpret these things and give us CIDs when we read them, but these are not formats we can write CIDs into; they've got their own hashes in them.

B: We just know what those hashes actually point to. And then lastly, we have the IPLD native codecs: dag-pb, dag-cbor, and dag-json are the main ones. dag-pb is used for UnixFS in IPFS, you'll encounter that a lot, and most of the data for Filecoin Discover is dag-pb and raw. dag-cbor is the recommended codec if you're building something new that's not just plain file data; it's compact and very flexible.
B: It will basically allow you to put almost any data shape into block form. dag-json is not something you'd be recommended to use for data you have a lot of, because it's not very efficient, but it's an interesting format to look at, at least. Moving on, let's get back to our example of the file system we're building. I want to point out the concept of a graph root, because this becomes a really critical concept when you're talking about managing large amounts of data.

B: In our example, directory one would be our graph root, and this means it's the single thing we need to hold on to in order to address the whole lot, like in the Merkle tree at the beginning. We only need directory one to be able to address and authenticate all of the files in our whole tree. We don't need references to all of these things in an index somewhere; directory one serves as our index, and we can traverse through the graph to get to everything we need.
B: We could also use directory two as our index if we only care about that subgraph. Perhaps we only care about those four files; we could just grab directory two and use that as our root for whatever purpose we need. The root is fairly arbitrary when it comes to these graphs, because graphs can be very large and can span across each other. We might only want subgraphs, or we could make an entirely new directory that references only part of our other graph, which might still exist.

B: In this example we care about four files down the bottom and two new files, so we've got one root that points at them. Roots are important if we care about mutability, and I know it sounds funny talking about mutability with content addressing, but mutability is something we can build on top of these essentially append-only data structures by caring about the root.
B: If we change the root, we can make things mutable. In this example, say I want to edit file number one. I can't edit the bytes in place, because the hash would change. So what I end up doing is making a new blob and hashing that, and then I have to include that hash in its parent, which is directory one, and that gets a new hash as well. Essentially I'm making two new nodes, but I'm still referencing eight different files.

B: One of them is different from the original, but I'm still referencing eight files and I've only created two new nodes. So I've got a mutable file system going on here, and a neat feature of this kind of behavior is that we can actually use it for snapshots: we could use it to roll back in time. If we built in some notion of garbage collection, we could get rid of the old nodes we don't care about. We can change it over time.
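A minimal sketch, my own illustration rather than the slide's code, of the "edit one file, bubble a new root" idea: raw blocks for file bytes, a dag-cbor map of name-to-CID for the directory, and a hypothetical in-memory `store` mapping CID strings to bytes.

```js
import * as dagCbor from '@ipld/dag-cbor';
import * as raw from 'multiformats/codecs/raw';
import { sha256 } from 'multiformats/hashes/sha2';
import { CID } from 'multiformats/cid';

// Encode a value with a codec, hash it, store it, return its CID.
async function put(store, codec, value) {
  const bytes = codec.encode(value);
  const cid = CID.create(1, codec.code, await sha256.digest(bytes));
  store.set(cid.toString(), bytes);
  return cid;
}

async function editFile(store, dirCid, name, newBytes) {
  // New leaf block -> new CID for the edited file.
  const fileCid = await put(store, raw, newBytes);
  // Re-encode the parent directory with the replaced link -> new root CID.
  const dir = dagCbor.decode(store.get(dirCid.toString()));
  dir[name] = fileCid;
  return put(store, dagCbor, dir); // only two new nodes were written
}
```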
B: IPFS does some of this natively, but it's important to think about roots, and how a root references the tip of your graph, the most important thing you care about. Mutability extends to anything you want to change in a graph. So this graph is some kind of arbitrary B-tree; maybe it contains file data, maybe it contains something else.

B: Maybe it's nicely formed, with rules about how it's formed. If I want to add, delete, modify, or do any operation on this thing, I just need to care about the root and how the modifications bubble up through the tree to give me a new root. Say I've done a bunch of edits in this tree and it's given me a new root: I've done some deletions, some additions.

B: I'm still referencing the bulk of the old tree, but I also have some new elements, and again I've got this snapshotting: I can look back in time, I can change which root I care about for which purpose, maybe I garbage collect old things. You can see what I'm getting at here, which is the concept that changes in these Merkle trees bubble up to the tip, it's the hashes that bubble up, and the tip is the bit you need to care about.
B: This next example is fairly simplistic, but it does extend to a real data structure that we have specified. I want to build a super large array, an array that can live in content-addressed land and be arbitrarily large, but that we can interact with in pieces. Perhaps this thing is so large I couldn't fit it in memory; perhaps it's so large I couldn't even fit it on my own disk.

B: It has to live out there in Filecoin, perhaps, or in some other content-addressed space. This thing is fairly generic: I'm not caring about how it's encoded, that's a separate concern; I'm building an algorithm here. The things I'm storing in this array are also generic: they could be links to other objects, or they could be simple values, or something else.
B: Let's start off with the naive case, which is to just put everything in one block. That obviously falls apart when you get to large sizes, because your blocks become unreasonably big. In IPFS, if you're storing things naively, the recommended maximum is about one megabyte; I think there's an actual maximum of two megabytes because of Bitswap, but the advised maximum is about one meg. Now, the size of your blocks will depend on your use case.

B: Perhaps you want smaller blocks for other reasons, because they're faster to load; there are various trade-offs with block size. But I can't just pack all of my elements into one block, so I need a way to extend beyond the block. What I'm going to do is say there's a maximum width for my array within a single block: any block in this array can only get up to a certain width, and I'm going to fix that in my example at five.
B: If I start with four elements and then add another one, I've got my maximum-sized array: it's full. In both the first and the second case I've only got one block, so that block is my root, and my root is my whole graph. When I want to mutate it, I get a new root, which is a new version of it. As long as I stay within five elements, I've still got just the one block.

B: If I want to push beyond five, then I have to start building a tree. In this algorithm I'm going to add a second sister block, also holding up to a maximum of five, and then I'm going to address both of those blocks in a parent block. I'm going to call this a height in my graph, so I've got a height of 2, and at height 2 I can fit five blocks of data, each with five elements, which gives me a maximum of 25.
B: The elements in my height-2 block are just links to my height-1 blocks; the height-1 blocks contain the actual elements I care about. If I overflow that capacity, I have to add a new height. In this second part I'm adding a new element, I've got 26 elements now, I've overflowed, so I have to add a new height, height 3, and my new capacity is 125, because now it's 5 to the power of 3. So you can see where this is going.

B: I can keep going arbitrarily large; it just means the tree also has to get taller at the same time, and all it does is take me to a new single root, one node I can hang on to that gives me the address of everything in the entire array. Now that's all well and good, except that I want to be able to get to individual elements without loading everything. So I have to have some way of getting from the root to the element I care about.
B: A block doesn't even know where it is in this list; it's just a free-floating block with five elements. I've got to be able to tell it: I want the fourth element along in your array. So at each level I've got to adjust my index, and there's an algorithm for doing that which gets you closer and closer as you descend towards the leaf, and it leaves you with blocks that are all just arrays.
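A minimal sketch, assumed rather than taken from the spec, of descending a fixed-width block tree like the one just described: width 5, `height` levels, with the height-1 blocks holding the actual elements.

```js
// Return the slot to follow at each level to reach element i (0-indexed).
function indexPath(i, height, width = 5) {
  const path = [];
  for (let h = height; h >= 1; h--) {
    path.push(Math.floor(i / width ** (h - 1)) % width);
  }
  return path;
}

// Element 26 in a height-3 tree (capacity 125 = 5^3):
console.log(indexPath(26, 3)); // [1, 0, 1] -> second child, first child, second element
```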
B: The hashes of those arrays give us the links, we use CIDs for that, and there's no other metadata included in any of these blocks; they're just arrays, which is a really nice format because it gives us some nice properties. It also gives us some unfortunate properties, but those are trade-offs we have to care about.

B: So let's look at some properties of this algorithm. A larger width means larger blocks but also fewer levels, so your traversals can be quicker but your block load times are longer; you've already got trade-offs in considering how wide your blocks need to be. There's also a mutation cost if you're adding one element at a time.
B: You're also discarding blocks at the edge of your array at a rate of up to the width, so there are garbage collection costs there as well. Get and size operations are efficient; they use roughly the algorithm I already showed, just one step per level down to height one.

B: For size, we traverse down the trailing edge to see how full each block at the end is. Appending data requires mutating a maximum of one existing node per level: when we append data at the end, the hash change bubbles up through the parent levels, and that's a nice property; it means we're not modifying huge chunks of the data structure.
B: Iteration is really easy: it becomes a left-to-right tree traversal, which is a really nice traversal. Slicing is only efficient if you perform it at the boundaries of the width. Say I want to take ten elements from the middle of this array and make a new array: that would be really easy if they fall within the width boundaries.

B: If I want to take elements 6 to 15, then I just need those two blocks plus a new parent block to address them; that's all I need. If I need anything else, I'm going to have to start shuffling things around, which is essentially rewriting the whole lot, so it gets messy. Prepending data is similarly costly because you end up rewriting things; there are ways to do it efficiently if you work at the width boundary, but you can see it gets messy.
B: This is the kind of thinking that goes into IPLD and how we build data structures on top of it, and hopefully that kind of thinking is useful for your application. For further reading, if you want to know more about IPLD and any of the things I've talked about, there were some links in the talk, but a lot of them you can reach through ipld.io.

B: If you go there, that's our documentation site; we're working on making it more informative. We have a specifications repo on GitHub, in the IPLD org, and that includes a spec for this array, which we call Vector, and also a spec for a HashMap, which uses a HAMT algorithm that you'll see again and again in this world.
B: The HAMT is a hash array mapped trie, and it's a way of building really efficient key-value stores across very large data structures; it's quite an elegant algorithm if you want to look into the details, just to learn how to think more about these things. That's it from me. I don't know if we want any questions, or whether we want to hop straight over to Chris.
D: Great, thank you. Hi everyone, I'm Chris Hafey, and I'm working with the IPLD team. Thanks, Rod, for that great introduction to IPLD.

D: I think many of you, coming from the IPFS and Lotus world, work with the APIs that are given to you, but there's a lot going on underneath the hood, which is what Rod just went through. One of the things the IPLD team is really focused on is these underlying building blocks, so you can build things like IPFS on top of them, and also your own custom applications.
D: Today I'm going to be talking a bit about how the IPLD team applied these primitives to prepare very large data sets for Filecoin storage, in a project we named Dumbo Drop. Here's how my presentation is going to go: a quick overview of Dumbo Drop, how we approached the problem, the architecture, and then some lessons learned and tips and tricks at the end.

D: Our hope is that you'll take away some tangible tidbits, at least a better understanding of how we approached a very large-scale data ingestion problem, and maybe some ideas on how you can do it yourself and solve your specific problems.
D: First of all, an overview: what is Dumbo Drop? Our goal was to process a very large amount of open data in a short amount of time for Filecoin. Mikeal Rogers is on the call; he's actually the brains behind this, I kind of just came in and cleaned it up a little bit, so most of this design work is his handiwork. But we're pretty happy with the amount of data that we've processed.

D: Over three petabytes of data has gone through Dumbo Drop, and it's amazing to see how it worked in terms of scalability. If you're a performance person, a scalability person, you like cloud stuff, you're going to get kind of excited by some of the things we've done here. But that was our goal: process a large amount of data in a short amount of time.
D: So how did we approach this? Most of our data was already in Amazon S3 buckets: there are public data sets like Landsat and whatnot, and we wanted to convert those into Filecoin. One of our strategies was to create our own custom application using the same underlying libraries that Lotus and IPFS use. If you've ever looked at the Lotus or IPFS projects, you'll see all sorts of package dependencies.

D: Hundreds of them, made by Protocol Labs, some by the IPLD team, some by different teams. What we found is that to really get the type of scalability and performance we needed, we had to go a level lower than Lotus and IPFS, so we went directly to the libraries and stitched together this whole pipeline ourselves. Number two is that we exploited AWS Lambda.
D: Lambda is a great AWS capability for serverless functions, and it allows us to get a very high level of parallelization at a very affordable price. The other thing is language selection: we actually didn't use any Go code, we wrote most of it in JavaScript, and the main reason is just speed. We could iterate really quickly; we didn't have a lot of time, but we wanted to move fast, and JavaScript allows us to move very, very quickly.

D: We used Rust as well, because there's some proof generation code written in Rust that's very high performance, which we didn't feel like rewriting in JavaScript, so we leveraged that too.
D: So what does the architecture look like? The first thing is that we have a massive amount of data that has to go through a whole pipeline: it begins in its raw state, and we have to produce a couple of output artifacts. What you're seeing here as vertical bars are the three stages: transform, aggregate, and CommP generation. Cutting across horizontally are the architectural layers: a storage layer where we store data, a processing layer where we do a lot of number crunching, and then an indexing and content database layer.

D: If I put the actual architectural blocks in here, you'll see there are three different S3 buckets. We had, I don't know, maybe 30 or 40 data sets in the three petabytes, so this is replicated over and over again, but for each source data set in a bucket we generate output data as IPLD blocks.

D: We had three different processing functions, one for each of the pipeline stages, and down here we used DynamoDB to store information about the whole processing state, as well as the CommP.
D: So that's our high-level architecture, and one of the takeaways is to think in terms of pipeline stages. When you're dealing with massive amounts of data, you don't want to do it all in one go; you want to break it into stages, and you also want to think of these as different architectural components.
D: So what happens? The data begins in the source data bucket, and the first phase kicks off. We pull data out of there, iterating over all the objects; we record the fact that we know about these objects in the database for tracking purposes, and then we convert them into IPLD blocks over here. As Rod said, there's a block size maximum, so sometimes we have to break a single file up into multiple blocks, and we do that.
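A rough sketch, my own rather than the Dumbo Drop source, of that transform step: stream a file's bytes and cut them into raw IPLD blocks of at most 1 MiB, each addressed by a CIDv1 with the raw codec and SHA2-256.

```js
import * as raw from 'multiformats/codecs/raw';
import { sha256 } from 'multiformats/hashes/sha2';
import { CID } from 'multiformats/cid';

const MAX_BLOCK = 1024 * 1024; // ~1 MiB, the advised maximum mentioned earlier

async function* toRawBlocks(byteStream) {
  let pending = new Uint8Array(0);
  for await (const chunk of byteStream) {
    // Accumulate bytes, then emit full-size blocks.
    const buf = new Uint8Array(pending.length + chunk.length);
    buf.set(pending);
    buf.set(chunk, pending.length);
    pending = buf;
    while (pending.length >= MAX_BLOCK) {
      const bytes = pending.slice(0, MAX_BLOCK);
      pending = pending.slice(MAX_BLOCK);
      yield { cid: CID.create(1, raw.code, await sha256.digest(bytes)), bytes };
    }
  }
  if (pending.length) {
    yield { cid: CID.create(1, raw.code, await sha256.digest(pending)), bytes: pending };
  }
}
```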
D: We keep track of all of that in the source DynamoDB database. This takes days to run, even in parallel against S3, and I'll explain why it's kind of slow as we go on, but that's the first step. The second step is aggregating it up: we take all the data we just produced, run it through another Lambda function, update the database, and store CAR files. That's like concatenating a bunch of IPLD blocks into a single big file.

D: A CAR is kind of like a tar file: it's a content-addressable archive, as opposed to a tape archive, same idea. It's a bunch of blocks back to back to back, and that's step number two. The third processing step is to take those CAR files and run them through the Rust proofs code; this runs in a Lambda, and we generate the CommP, which is part of the piece proof that Filecoin depends upon. I don't need to get into the details of some of this stuff.
D: In fact, this last part I'm not going to talk about at all; we don't even encourage you to get down to that level of detail, because you don't have to. We now have things like Powergate, which lets you target IPFS as the output of your pipeline and then use Powergate to move it directly into Filecoin, so I won't get into more detail on CAR files and CommP right now.
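A minimal sketch of the aggregation step, assuming the `@ipld/car` package rather than the actual Dumbo Drop code: write a set of blocks into a single CAR file with one root.

```js
import fs from 'fs';
import { Readable } from 'stream';
import { CarWriter } from '@ipld/car';

async function writeCar(rootCid, blocks, path) {
  // `blocks` is an iterable of { cid, bytes }, such as the output of the
  // chunker sketched earlier.
  const { writer, out } = CarWriter.create([rootCid]);
  Readable.from(out).pipe(fs.createWriteStream(path));
  for (const { cid, bytes } of blocks) {
    await writer.put({ cid, bytes });
  }
  await writer.close();
}
```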
D: So, a little bit of lessons learned. That's basically what we did: three petabytes of data ran through this system, across a large number of data sets. Some things we learned. One: there are limits on AWS S3 performance based on prefixes. The familiar name for them is directories, but Amazon calls them prefixes, and it will actually throttle your ability to read from S3 based on how data is partitioned into different directories or prefixes.

D: As I mentioned earlier, the longest phase we had was actually reading from the source data buckets, and the problem we ran into is that these source buckets weren't organized or designed with this prefix limitation in mind. What you'd have is one top-level directory called foo and then thousands and thousands of subdirectories and files underneath it. Well, S3 will rate-limit on that one foo prefix, and so there was only so quickly we could go.
D: We could only go into that directory, into that prefix, and pull out files for processing so fast, and that was a huge bottleneck we ran into. If you're like us and you have an S3 bucket you're reading from that you can't really change, there's not much you can do; but if you do have the opportunity upstream in your pipeline to partition your data with this prefix limit in mind, you'll have far better scalability and things will move much quicker.

D: Another thing: when you're dealing with extremely large amounts of data, petabytes like we're talking about here, you actually start running into reliability and performance issues in S3. The data is just so large, there's so much going on, it can be weeks' worth of processing sometimes, and Amazon will cycle servers, scale out your infrastructure, all sorts of things, and you have to deal with that.
D: We ran into a lot of hiccups we had to work around, where it would get slow or start returning errors we wouldn't expect and that aren't documented.

D: It would have been so much nicer if we could iterate through the whole bucket, take that as input from one stage into another, and directly generate the IPLD blocks as output. Random failures do occur, so make sure your pipeline is resumable, and build in retry logic.
D: I have a couple of examples of some really nasty code that's in Dumbo Drop to work around these things. Again, we were trying to do a lot in a short amount of time, and we don't exactly have the most beautiful code in some places; this is one of those areas where you look at it and go, what? It's just that we were getting errors we didn't know or understand.

D: AWS custom Lambdas are tricky: if you have to do a custom Lambda, using a non-standard language like Rust, the good news is there's a way to do it.
D: We had our Rust proofs code that we didn't want to port to JavaScript, so we had to figure out how to get Rust into a custom Lambda, and it was really challenging, because the base Docker image you build against is based on CentOS 7.6, which is quite old, and toolchains and whatnot change over time. It's hard to get current tech to work with old tech like that, so I spent quite a bit of time making that work.

D: The other thing is that there's a hard upper limit on Lambdas in terms of RAM and disk, and with what we're doing with large data sets we ran into both of those limits, in terms of RAM consumption and how much disk we were using, and by the way that includes temp files in the temp directory.
D: So we actually had to modify the Rust proofs code to be streaming, rather than writing the whole thing to disk, because we didn't have the disk to do it and we certainly didn't have enough RAM. It had to be turned into a completely streaming-based implementation.
D: Another lesson learned is that compute is way cheaper than storage. We're dealing with petabytes and petabytes of data, and the AWS S3 cost goes up quite a bit. We actually found in one month, when we were doing a lot of processing, that still only about a fifth of our cost was compute; four fifths of it was storage. In one case we did this massive conversion but didn't actually need the data for, I don't know, another couple of months, and we did the math and realized it was going to be cheaper to reprocess this huge data set in the future than to keep it in S3 for the next couple of months.

D: So we deleted it and reprocessed it a couple of months later, when we actually needed it. Storage is expensive, so think about this stuff: compute is cheap and plentiful, storage is plentiful too, but it's costly, so factor that into your design just like we did.
D: The other thing is that you need to be able to tune your concurrency parameters. Different data sets will run into different bottlenecks at different times, so if you hard-code things, or assume that because you got one data set working the next one will work the same way, you may find you need to tune some parameters to get optimal performance. Build that in; it's just good design.
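A small sketch of making concurrency tunable rather than hard-coded, using the `p-limit` package. The CONCURRENCY environment variable and `processObject` function are illustrative assumptions, not Dumbo Drop's actual configuration.

```js
import pLimit from 'p-limit';

// Read the concurrency level from configuration so it can be tuned per data set.
const limit = pLimit(Number(process.env.CONCURRENCY || 8));

async function processAll(objects, processObject) {
  // Every object is queued, but at most CONCURRENCY run at once.
  return Promise.all(objects.map((obj) => limit(() => processObject(obj))));
}
```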
D: It's especially important when you're talking about a highly parallel, highly concurrent system. And, you know, not all the work we do at Protocol Labs is our proudest work; this is an example, like I said, of us trying to move quickly and make stuff work in a short amount of time while dealing with some tricky things we ran into with AWS.
D: So we had to build in retry logic all over the place, with some kind of random timeout backoff, just to make sure that when you have a thousand parallel Lambdas running out there, they don't get into a kind of race, I don't know what you'd call it, where they're all hitting the same thing at the same time and overwhelming the system. So there's code like that, and there's another example where we just get random errors that aren't documented anywhere, errors that AWS gives you as it's scaling out or rebalancing under the load.
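A minimal sketch of the retry-with-random-backoff idea described here; the attempt count and delay numbers are made up for illustration, not Dumbo Drop's actual values, and `fetchObjectSomehow` is a placeholder for whatever call is flaking.

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(fn, attempts = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= attempts) throw err;
      // Exponential backoff plus random jitter so a thousand Lambdas
      // don't all retry against the same service at the same instant.
      await sleep(2 ** attempt * 100 + Math.random() * 500);
    }
  }
}

// Usage: await withRetry(() => fetchObjectSomehow(key));
```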
D: So be prepared not just to solve the problem, but to keep your data pipeline running; it's not an easy thing. On to tips and tricks. One thing we did here is to consider building your own pipeline from the same libraries IPFS and Lotus use. You don't need to use IPFS or those APIs for everything; you can actually build a pipeline like we did.

D: We pulled in the libraries that they use to do things, and you can actually get much better gains in some cases by doing that, so don't be afraid to do it, although it does require some programming capability.
D: Lambda is a great tool if you haven't done serverless before; it's extremely powerful and affordable, and it's really a way to get incredible scalability, which, by the way, ties into what Rod was talking about: we have immutable, content-addressable data.
D: We talked about that prefix problem; one of the ways we got around it in our block store and our CAR store is that we use the content identifiers themselves as the object prefix. Content identifiers spread out very evenly, and therefore they don't run into that concurrency-limiting performance limitation S3 has. So we put the CID in front and then followed it with that same CID as the object name, and we got around this bottleneck in terms of performance.

D: Phase two, as you remember: phase one is to take all the data in the bucket and turn it into blocks, and phase two is to take that same data and turn it into CAR files. That second phase runs like 10 or 20 times faster, even though it's doing basically the same kind of work, because we don't have this prefix problem.
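A tiny sketch of the key layout described: using the CID both as the S3 prefix and as the object name so keys spread evenly across prefixes. The bucket name is a placeholder.

```js
function blockKey(cid) {
  const c = cid.toString(); // e.g. a base32 CIDv1 string
  return `${c}/${c}`;       // prefix = CID, object name = CID
}

// e.g. putObject({ Bucket: 'my-block-store', Key: blockKey(cid), Body: bytes })
```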
D: Fourth: transform your data into immutable IPLD blocks as soon as possible in your data pipeline. It just solves tons and tons of problems for you if you can have data that is immutable. This is a common theme: immutability is good, and the quicker you get into the immutable world, the easier and faster your life will get.
D: The last thing in terms of tips and tricks: be aware that AWS performance will change as it rebalances while your pipeline runs. One thing we'd run into is that we'd start up a big processing job and it would go really fast, then after a few hours it would get a little bit slower, and after a few more hours it would just come to a crawl; we'd be watching the logs going, what's going on? And then all of a sudden, boom.

D: When a given account is using a lot of resources, AWS will start tuning for your load, and while it's adjusting or rebalancing you get these hiccups in speed. So don't be afraid of that, get comfortable with it, because when you start moving into high-scalability, massive graphs like this, you start pushing the limits of what AWS can do in ways that no one else has.
C: Yeah, I did look at that early on. I mean, if you're really, really diligent about it, you can probably squeeze some more performance out of EC2.

C: The issue is that if you spin up a box with a bunch of cores that you're then going to balance work onto, the moment you stop using it you're just eating all of that cost. We were constantly tuning and tweaking and iterating on this as we were building it, so it was never going to be the case that we could always saturate all of our resources, and it was a massive amount of extra work to try to spin up some kind of clustering around all this. And honestly, Lambda keeps getting cheaper, and in general it's just cheaper to do anything with than EC2.
C: The whole model for Lambda seems to be: if we give you basically free processing, you will generate data that you then have to store, so the model is that Amazon just wants to eat all of your money in recurring storage costs. As an example, we processed two petabytes of data, and the storage bill for storing it that month was four times the compute bill to process it, and that storage bill was going to recur every month if we didn't get rid of it.

C: So there's just a huge delta there. Lambda being so flexible and so easy to use compared to EC2 was kind of an obvious win, and we weren't really going to split hairs on the cost difference versus EC2, because the storage was always going to cost us way, way more.
C: The main thing you have to make sure of is that any other services or storage you talk to are in the same region, so you don't hit the transfer costs as well, but other than that it's pretty good.
E: There we go, okay, sorry about that. I think you may have already covered it a bit, but I was just wondering if you could talk a little bit more about why, for Dumbo Drop, you're making the recommendation to transform the data into IPLD blocks as early as possible in the process.
D: Yeah, that's a good question. I probably didn't clarify the value of that. Once you move into an immutable world, you gain all sorts of properties you don't have in the mutable world. You have complete data validation: you can verify, bit for bit, that the data you're working with hasn't changed, and that is going to save you a lot of headaches when you're diagnosing AWS failures and whatnot. You know, did it fully process this file?

D: If not, how far did it get, and do I need to reprocess the whole thing? The sooner you move into the immutable world, the more knowns you have in your problem set than you had before, in a very variable, dynamic environment like AWS with large data. That's probably one of the biggest reasons, but I could list many more. The other thing is just tooling, right?

D: All of the Protocol Labs libraries work with CIDs and blocks, so as soon as you're in that world you can start using all of those tools to your benefit, as opposed to hand-rolling your own tools and whatnot.
C: Otherwise you basically just have to reprocess everything, because you have no guarantee, nothing you can check, to really figure out which data matched and which didn't. Whereas once you hit an immutable state, you then have new immutable states, or some immutable reference, for everything after that, and so it becomes really easy to go and find all the data you may have messed up. And you know you're going to have bugs, so you might as well plan for that early on.
E: I have another question too, which is for Slingshot, for people who are maybe just doing this for the competition or something. What would the recommendation be? Because I know you all were doing this for an insane amount of data.

E: And I'm not sure to what extent you wanted it to be a super repeatable pipeline or whatever, but if someone's just trying to do this as a one-off thing, what's the easiest way to do it now, especially given what Chris mentioned before, that Powergate is maybe a thing? I don't know how much that changes it, but what would be the recommendation now if you were just trying to do a one-off ingest?
C: Well, it always depends on the amount of data and the shape of the data. If you have a lot of data, and you know that if you were just to shove it into IPFS on your machine it wouldn't complete until after Slingshot is over, then you have to figure out a way to parallelize it. And then the question is: do you have a bunch of data that is really large files, or do you have a lot of data that is a lot of really small files?

C: That really changes the approach. But one big thing, and Chris mentioned this: do not do what we did, which was to actually create the CAR files, generate CommP, and use the offline flow for the deals. You're going to run into a lot of things that just don't work as well as you want them to, but also it's a really sort of unnecessary step.
C: Unless you're shipping drives to people like we are, it's much easier to just prepare the data in a format that IPFS can accept, then get an IPFS node up and connected to that data you processed, and now you can use Powergate to do the deals, right? That's way easier. Then, in terms of how you decide to parallelize the processing: use Lambda if you need that much compute.

C: I mean, we had to have them increase our concurrency limit from a thousand to ten thousand just so we could get through all this data quickly enough. It turns out that when you go over about 3,000, though, there are other infrastructure things that will keep you from actually going above that, so you can't use it for our use case quite as much as you would like.
C: Or maybe it's just using a bunch of cores on your machine and parallelizing that way. The thing to remember about these graphs is that if you're doing operations serially, it's very slow; it's one of the natures of these immutable data structures that they're really expensive to create and mutate serially, but they parallelize incredibly well.

C: So if you can figure out how to isolate sub-components of the graph, then you can parallelize all of those as much as you want. For us, that really became: oh, we can process every individual file. And then when we have a lot of small files, we'll hand, you know, 100 or 1,000 files to one Lambda and say go and do all of these; or if it's a large file where we know we're going to go over the Lambda time limit, we'll cut that file up into parts and give a part to each one. So it really changes what you want to do, based on the shape of your data.
E: Nice, thanks. And then, sorry, last question for me: what are some of the fields that you maintain in the database that you use for indexing all the information?
D: Yeah, so we need to store the paths to the source file and then to the output blocks and the CAR files. I actually have it all documented; I need to refresh my memory while I pull it up. Why don't you answer, Mikeal, if you remember it all?
C: Yeah, so one thing to remember that's a little bit unique about our data set, and the way we have to do these CAR files, is that each CAR file is about a gig, because that's how much we can process in parallel.

C: We take a bunch of those and do one deal with them, and then we have information about that deal, and we need to be able to figure out: okay, for this source data, for this one URL to a file, where is that in the Filecoin network? So one of the tables is just all of those origin URLs, for, I think, literally billions of files.
C: I think we're in the billions now. That points from the origin URL to which CAR file it's in; then we have a database of all the CAR files and their CommP generation, and from there you can go: okay, that's the CAR file, what deals did we do for that CAR file? Then we can find the deals and get the miner IDs and all of that, so we can work back from there to actually access any of the data.

C: One of the things Chris did that was really good was that we stopped using one table for all of the data and started doing a table per data set, so we have billions of rows, but they're now split across more tables; and then we have one table for all of the CAR files and CommP, because there are fewer of those, since we're compacting so many things together into one.
C: There are still millions of them, because we did about three petabytes of data, but it's not billions. Also, Dynamo is terrible, it's really bad, and in each of these rows we have to put a list of all of the hash IDs for all of the individual raw blocks, and that actually gets a little bit bigger than Dynamo tends to like having in a row.
E: Awesome, thanks, and this schema link is really helpful too. Thank you, Chris.
C: We had very early ship dates in our minds at the time, and it was literally, we don't think this will finish running unless I start running it now, so I was writing it as fast as I could. It was a very impressive thing that it was doing, but it was some of the worst actual code I may have ever written, so Chris did a great job cleaning all of that up.