From YouTube: DEX - Smart File Importing - Juan Benet
Description: Originally recorded during the Berlin Developers Meetings, July 9-13, 2018.
DEX: that name stands for data importing and exporting. There's a whole bike-shed thing about the name; since we chose it, a number of people have brought up that, hey, there are a lot of things called DEX, so maybe this shouldn't be called DEX. You can weigh in on that bike shed; I don't want to think about it.
If you search for it, it will be there. There's a kind of rough spec there that talks about what layouts are and what splitters are (we're going to talk about that in a moment), and then about the architecture of the whole thing and how to plug it into everything else. It's very, very basic, but this has been part of the IPFS goal for a long time, and the benefits are truly massive for a lot of use cases.
The point of the next ten minutes is that there's a lot of really strong promise to this, and that if we prioritize it for the next quarter or two, a lot of things will get faster and better and a bunch of new use cases are going to open up. We'll dive into that, but let's start more basic than that.
So imagine adding some stuff to IPFS: you do a normal `ipfs add` and you get back a bunch of hashes.
[Files] weren't the right thing to base all of this on, and so all the IPLD work has been about taking a project in flight and rebasing it on top of a new, better heart, because really thinking about structured data at a lower layer makes a lot more sense: you can easily turn structured data into files, while going the other way around is quite clunky. Think about APIs in the Web 2.0 world: it's all structured data; you're not moving around a bunch of files.
I mean, yes, you are, but not in your logic. So a lot of the next stuff is file-specific, but it does present a lot of opportunities for other kinds of data, and so the goal is to motivate it with files, nail that, and then think about other kinds of data structures.
In a normal UNIX file system we have directories and files. In IPFS, directories are dag nodes and files are dag nodes, and we take a big file and chunk it up into a bunch of pieces.
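As an aside, here's a minimal sketch in Go of what that naive fixed-size chunking looks like. It's illustrative only: the function name is made up, and go-ipfs's real chunker (which defaults to 256 KiB pieces) lives in its own package.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// chunkFixed splits r into fixed-size pieces, the naive default
// strategy; the size is a parameter here.
func chunkFixed(r io.Reader, size int) ([][]byte, error) {
	var chunks [][]byte
	buf := make([]byte, size)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			chunk := make([]byte, n)
			copy(chunk, buf[:n])
			chunks = append(chunks, chunk)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return chunks, nil // final (possibly short) piece done
		}
		if err != nil {
			return nil, err
		}
	}
}

func main() {
	data := bytes.Repeat([]byte("0123456789abcdef"), 4096) // 64 KiB
	chunks, _ := chunkFixed(bytes.NewReader(data), 16*1024)
	fmt.Println("pieces:", len(chunks)) // 4
}
```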
So this is a graph of this directory, I think. Yeah, I think it is. You can see a bunch of things here: you can see that the directory is an object pointing to some other objects, and right there is a file that happens to be big enough to point to four sub-blocks. The big file is getting chunked into smaller pieces; if you had a really big file, this would go on and on. Now, how do you do that?
This is a standard file-systems technique. How you choose to chunk a file and lay it out has vast implications for the performance of your system and for the use cases it serves; we'll get inside that in a moment. The structure of the graph is what we mean by a layout: when we take a file and import it, the graph we construct will have some shape, and that shape, which we declare, is what we call the layout. There are a bunch of different ones.
Time to first byte is affected, for example. If you have a really nice balanced tree and you try to seek to the first byte, you're going to go down a bunch of levels before you find the first block and can start streaming it in. So actually having a lopsided tree, where the very first node immediately points to data, is way more efficient for normal files.
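To make the trade-off concrete, here's a toy Go model, not the real unixfs node format, that counts how many nodes must be fetched before the first data byte under each shape:

```go
package main

import "fmt"

// Node is a simplified dag node: it may hold data and/or link to
// children (loosely modeled on unixfs nodes; illustrative only).
type Node struct {
	Data  []byte
	Links []*Node
}

// nodesBeforeFirstByte walks the leftmost path, counting how many
// nodes must be fetched before any file data is seen.
func nodesBeforeFirstByte(n *Node) int {
	count := 1
	for len(n.Data) == 0 && len(n.Links) > 0 {
		n = n.Links[0]
		count++
	}
	return count
}

func main() {
	leaf := &Node{Data: []byte("first chunk")}
	// Balanced: data sits below several levels of index nodes.
	balanced := &Node{Links: []*Node{{Links: []*Node{{Links: []*Node{leaf}}}}}}
	// Trickle-style: the root links to leaf data immediately.
	trickle := &Node{Links: []*Node{leaf, balanced}}
	fmt.Println("balanced:", nodesBeforeFirstByte(balanced)) // 4 fetches
	fmt.Println("trickle:", nodesBeforeFirstByte(trickle))   // 2 fetches
}
```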
This has been around in IPFS, not as the default, but if you pass `ipfs add -t trickle` it does this for you. Other kinds of files might want different layouts: the trickle structure makes sense for things like video, but there may be something else you want to do. So that's what layouts are about. The other thing that goes into this is starting to think about being able to deduplicate data in a much smarter way.
Think for a moment: you have two files, and you change one segment in one of them, say you insert a single character. With a normal, naive size-based chunker, the rest of the file is going to be different, right? It's going to chunk to a different set of hashes. I'll show that in a moment. So suppose you chunk a file, and then you make a change in, say, C3 or something: you insert a single character.
Rolling hash functions (I forget the exact wording) are specific kinds of hash functions that, as you apply them across a file, yield certain numbers, and you can come up with an algorithm to choose chunk boundaries such that if you insert something into C3, maybe C3 changes, maybe C4 changes, but a large subset of the rest of the file would be chunked in exactly the same way. So you get the same nice properties.
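Here's a toy content-defined chunker in Go in the spirit of Rabin fingerprinting: a rolling hash over a sliding window, cutting a chunk whenever the low bits of the hash match a fixed pattern, so boundaries depend only on nearby bytes. The constants and names are illustrative, not the parameters IPFS actually uses:

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	window = 48     // bytes in the rolling window
	mask   = 0x1FFF // cut when the low 13 bits are all set (~8 KiB avg)
	prime  = 31     // multiplier for the toy polynomial hash
)

// chunkRolling is a toy content-defined chunker: boundaries depend
// only on the bytes near them, so an insertion disturbs at most a
// couple of chunks before the boundaries resynchronize.
func chunkRolling(data []byte) [][]byte {
	var pow uint64 = 1
	for i := 0; i < window; i++ {
		pow *= prime
	}
	var chunks [][]byte
	var hash uint64
	start := 0
	for i := 0; i < len(data); i++ {
		hash = hash*prime + uint64(data[i])
		if i >= window {
			hash -= uint64(data[i-window]) * pow // drop the byte leaving the window
		}
		if i-start+1 >= window && hash&mask == mask {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	r := rand.New(rand.NewSource(1))
	orig := make([]byte, 1<<20)
	r.Read(orig)
	edited := append([]byte{'X'}, orig...) // insert one byte up front

	a, b := chunkRolling(orig), chunkRolling(edited)
	seen := map[string]bool{}
	for _, c := range a {
		seen[string(c)] = true
	}
	same := 0
	for _, c := range b {
		if seen[string(c)] {
			same++
		}
	}
	// With a fixed-size chunker, zero chunks would survive the
	// insertion; here nearly all of them do.
	fmt.Printf("chunks: %d vs %d, unchanged after 1-byte insert: %d\n",
		len(a), len(b), same)
}
```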
A lot of people have been doing this for a long time, and it's also in IPFS: you can pass a special chunker, Rabin fingerprinting, and it'll do this. There's a myth about IPFS that Rabin fingerprinting doesn't help. Total myth, and I'll show it: it does help, quite a bit. It just creates really small chunks, which was slow for a while, but now with Badger it's fine.
Now, that's a general-purpose thing: whatever data you throw at it, it's going to apply some algorithm to try and chunk it intelligently. Rabin fingerprinting is one version of this; rsync does something similar with Adler-32, I think, as the rolling function it uses.
Yeah, so if I just `ipfs add` that, it's not going to be very nice. But there's a special command in there, which I think is `ipfs tar`. What that did (it's a very special-case thing) is import the tarball smartly: it looked at the headers of the tarball and created specific objects for them. I'll show you the difference between the two graphs.
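A sketch of what that header-aware walk could look like, using Go's standard archive/tar. The real importer builds dag objects where this just prints, and the file path is hypothetical:

```go
package main

import (
	"archive/tar"
	"fmt"
	"io"
	"log"
	"os"
)

// importTar walks the archive's own structure instead of chunking it
// blindly: each header and each file body would become its own node,
// so an identical file inside two different tarballs hashes the same.
// The dag construction itself is elided; this just shows the walk.
func importTar(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	tr := tar.NewReader(f)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		body, err := io.ReadAll(tr)
		if err != nil {
			return err
		}
		// A real importer would emit one node for the header and a
		// separately chunked subtree for the file bytes.
		fmt.Printf("entry: %s (%d bytes)\n", hdr.Name, len(body))
	}
}

func main() {
	if err := importTar("example.tar"); err != nil { // hypothetical path
		log.Fatal(err)
	}
}
```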
So let's look at this one. This is the graph, right: these are the underlying file blocks of the tarball, and this is an intermediate node, so this is a balanced tree; this is the root node, and so on. It's not perfectly balanced, but it's built that way. If we add it with a different layout, like the trickle one, we should do the same exercise
and refresh this, and then you get a different shape. So what's going on here? Well, there are a lot of data blocks at the top of the first one: it's really trying to optimize the first few blocks. So that's a bunch of data objects first, and then the very bottom one happens to be an indirect block, which might point to more indirect blocks.
I think ext4 has four or something; there's a whole bunch of research into what layout you might want, optimized for certain kinds of file-system use cases. But the tar one, and this is what's kind of cool about this: if we now visualize the tarball version...
Yeah, there we go. So what's in there? Some data is in there, and this is the tarball header stuff, surprisingly large. There might be some other stuff in there too, but that's basically the file stuff. If the importer were smarter (which is the goal), there would be some way to transparently see this data right here, and you'd be able to see the structure within it. That's the motivation: you shouldn't have to, you know...
IPFS should get smart enough to be able to understand those internal data structures. That's where the header is, and the actual data isn't here, right? So there's an actual data link, and if we were to follow it, that's the actual data of that file. So why might this be useful? Does anyone have ideas why importing a tarball this way might be useful?
Yep. One thing you might be able to do, knowing that it's a tarball, is seek into the tarball to a specific file and just pull that out. So if we want to see one specific thing... I think we can actually do this here. Let's try it out now; I don't think I've ever tried it, but let's do it.
So if we take this... oh, this might not work, because I think it's expecting... we can probably do it with this. So if we get this out, then we can walk into this. That's because this is using the protobuf stuff, the dag-pb stuff; if it was using the CBOR stuff, we could traverse it much more nicely.
But the point is, you should be able to traverse into that graph and pull out a specific file. So that's one thing. The other thing: what if you have ten tarballs, all of which include the same file? If you chunk them the normal, standard way, with IPFS not understanding the tarball, it's going to chunk them in its standard way, and maybe it'll get lucky.
If you use Rabin fingerprinting, it may align and maybe deduplicate stuff, but chances are it's not going to. If you do it with smart tar importing, then immediately it will deduplicate: whenever there's another tarball that happens to include the same file, it'll deduplicate that exactly. Now, where do you know of a lot of tarballs that have the exact same contents?
Releases. Releases of what? Software: package managers, right. Think of an npm or apt, something like that. How many packages include the exact same code? Like all of the versions of your package, right? All of them. Right now in npm they're each sitting in a whole different tarball. So think about deduplicating all of that. That's not just useful for storage; it's useful for bandwidth.
If you were using IPFS to pull down the npm stuff and you're pulling down package X, you might not need to pull down the whole tarball every time. npm is so slow; that's one of the reasons why. So that's very promising, but it needs this smarter importer stuff, and there's a bunch of things to figure out there. What other kind of package manager includes a set of files that are all very similar?
So imagine you ran the Ubuntu or Debian ISO installer system and you have hundreds of ISOs, all targeting different architectures and different versions and whatnot, and you have to keep all of these around. In reality each ISO, which is probably hundreds of megabytes, includes a bunch of files that are exactly the same, version after version after version. If you import an ISO smartly, you can look into its internal file system and deduplicate it intelligently.
Then having another ISO would just be the changes. That's the promise of this kind of stuff, and there's a ton of things like this. Because of how UNIX and computing evolved, we ended up using files, and moving around files, as the native blob of data, and it turns out those blobs are usually full of the same stuff. Let me give you an example: I make a lot of presentations that reuse the same images.
So what I did here: I have a tool that lets me test the differences between different repo types and the importer we're using. Let's check out the size of importing with the normal size chunker: that's seven gigs. It deduplicated some stuff, which is pretty random; I'm kind of surprised. Those must be exactly the same files or something; otherwise it would be really difficult to get that lucky. But then let's look at Rabin.
Two gigs. So I just imported a bunch; that was the (not fully naive) size-based chunking, and this is the general-purpose Rabin fingerprinting thing. It just saved me four gigs of stuff without knowing anything about Keynote. Now imagine if it knew something about Keynote and how it represents its own internal files; this could get chunked drastically better, right?
So this is the promise: there's a whole bunch of different kinds of files that we can learn how to chunk, and the goal of the project, the next thing, is to come up with a general way of attacking this problem, because there are a ton of different file types out there and you don't want to have to implement special chunkers for everything.
And then you can think about how to configure that chunker, to tune how it works. IPFS would import it as a library and run it locally; maybe some other implementations might actually call out to a different program or something. We can make it pluggable, so that people can file PRs against it and say: hey, I just came up with a way to import Keynote files, here you go; or: hey, I came up with a way to do something else.
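One hypothetical shape for that plugin surface, sketched in Go. This is not go-ipfs's actual interface, just an illustration of what a pluggable, format-aware chunker registry might look like:

```go
package dex

import "io"

// Chunker is a hypothetical plugin interface; a format-aware
// implementation inspects the stream and decides where the chunk
// boundaries should fall.
type Chunker interface {
	// Name identifies the strategy, e.g. "size-262144", "rabin", "tar".
	Name() string
	// Split consumes the stream and returns the chunks in order.
	Split(r io.Reader) ([][]byte, error)
}

// registry is how third parties could contribute importers for new
// formats ("here's a way to import Keynote files") without touching
// the core: register a constructor under a name, select it at add time.
var registry = map[string]func() Chunker{}

// Register makes a chunker available under the given name.
func Register(name string, mk func() Chunker) {
	registry[name] = mk
}

// Lookup returns a new chunker for the named strategy, if registered.
func Lookup(name string) (Chunker, bool) {
	mk, ok := registry[name]
	if !ok {
		return nil, false
	}
	return mk(), true
}
```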
But for now we could just follow the same structure as what we do with everything else, which is: implement it in the two languages and see how it goes. We have some of the basics already: we have the layout stuff, we have some of the chunkers, we have the tarball thing. But we want to do a lot cooler stuff. I think the ISOs thing is very promising, and I think getting it to work really well for package managers is also really promising.
If we treat those as the short-term use cases and go after them, we can potentially speed up a ton of stuff. Earlier this morning we were talking about potentially doing this for JavaScript CDNs. Think of a CDN with a huge bundle of JavaScript in there, like jQuery, every version of it or whatever: it's pretty big and it's full of the same stuff. You could intelligently chunk that, like rip out the modules that are the same, or take a bundle, like a webpack bundle, and rip out
the modules that are the same and import those that way, so you don't have to ship all the same stuff all the time, right? So there's a bunch of interesting work here. I think the way we can reboot the project is by forming a working group, and the way to think about that is to first identify the use cases, the people that might be interested, and the projects that might benefit from it, and then see
whether we have the bandwidth to do it in the near term, or whether we keep putting it off, and whatnot. But this is going to benefit IPFS, both implementations. It's going to benefit Cluster, because it'll be able to more intelligently understand the data it's pulling in. There's a tool called ipfs-pack (I don't know if you've ever seen it; it was an experiment around dataset distribution) that already has a way to describe the importers you're using, so that we have a little description.
With that, you can always take the same file, apply the description, and output the same graph, so that as importers change over time you can always reproduce the exact same data.
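As a guess at what such a description might carry, here's a small Go struct; the type and field names are invented for illustration, not ipfs-pack's real format:

```go
package dex

// ImportRecipe is a hypothetical record of how a file was imported,
// in the spirit of ipfs-pack's description: keep it next to the data
// and anyone can re-run the import later and get the identical graph,
// even as importer defaults change over time.
type ImportRecipe struct {
	Chunker   string `json:"chunker"`   // e.g. "size-262144", "rabin", "tar"
	Layout    string `json:"layout"`    // e.g. "balanced" or "trickle"
	MaxWidth  int    `json:"maxWidth"`  // max links per interior node
	RawLeaves bool   `json:"rawLeaves"` // store leaf data as raw blocks
}
```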
Package managers, ISOs, containers: it turns out that a Docker container is just a bunch of tarballs smashed together, right? A lot of those tarballs contain the same stuff; sometimes there are tarballs inside tarballs.
I think IPFS could probably shrink Docker Hub drastically, and then you wouldn't have to pay tons of... think of all the bandwidth people are spending, especially in CI, just downloading the same images over and over and over again. The same bytes are moving.
I don't know if you've ever seen some of that work, but you have a live running system with a hypervisor and a VM running, and then something happens, and there's a slightly older snapshot of that VM, and they want to migrate that process somewhere else. This is used a lot in the cloud, and this kind of stuff could even apply there. And yeah, we talked about media. Then you get into a very interesting one, which is: once you have all this chunking and importer stuff,
it looks like a regular file, but if you can intelligently execute code before it comes out of IPFS to the user, it would feel like a file in exactly the same way as everything else, because it outputs the same bytes, but it's stored in a very different way underneath the hood. That suggests a whole bunch of other things we can start doing: you can do compression this way, you can do encryption and decryption.
I had a little random hack before where I have an encrypted file, and an IPLD object that points to the encrypted file and points to a key; I mount that, and I see the file unencrypted, and every time I write it, it writes encrypted into IPFS.
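A minimal sketch of the read-side transform, assuming gzip as the stored encoding: the stored bytes are compressed, and the transform runs on the way out so the user just sees ordinary file bytes. The names and path are illustrative:

```go
package main

import (
	"compress/gzip"
	"io"
	"log"
	"os"
)

// readThroughGzip decompresses the stored representation on the way
// out. A "computable file" node would point at the compressed blob
// plus the transform needed to reproduce the original bytes on read.
func readThroughGzip(compressed io.Reader, w io.Writer) error {
	zr, err := gzip.NewReader(compressed)
	if err != nil {
		return err
	}
	defer zr.Close()
	_, err = io.Copy(w, zr)
	return err
}

func main() {
	f, err := os.Open("blob.gz") // hypothetical stored, compressed blob
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := readThroughGzip(f, os.Stdout); err != nil {
		log.Fatal(err) // the user just sees the plain file bytes
	}
}
```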
So that's where a lot of this can go, but there are some important roadblocks to get there, which is, you know, how do we lay all this stuff out, and so on. So yeah, I think I'm going to pause here, because I think we're already going into lunch, and I'm holding you back from food, and I don't want to do that.
I'm going to take questions, and then if you're interested in looking more into this and might want to join a working group on it, let me know, and I'll see about whether or not we can form a working group for next quarter or the quarter after. Great: any questions?

[Question from the audience, inaudible.]
And then, you know, if you want to upgrade from there, that's totally sane. So this could be fed in: the `ipfs add` thing could be enhanced so you could specify the string if you know what it is, instead of specifying the other options that are there. But I think over time we could get to sane defaults, and that should come through benchmarking: rigorous benchmarks that show, hey, when you do this, videos get better.
[Question from the audience, inaudible.]

I think probably the biggest roadblock there is more social than anything. It's just demonstrating the stuff and then coming up with something that seems palatable to the git community, where they would have to learn how to deal with files that are chunked internally, and that would require thinking it through. The internals aren't that complicated; it's a pretty good structure, so this is pretty feasible.
It would be quite a significant effort, but I think it would definitely be worth it, to cover a whole bunch of use cases where you suddenly should be able to use all of the tooling that works with git, versioning and commits and so on, and apply that to versioning really large data. Right now it's not easy to version really large data.
It still sucks: people have been working on this for many years, and it's still terrible. A lot of the tooling around git is really nice, and a lot of the stuff people have built in terms of GitHub and a whole bunch of other tools is also really nice. It would be really great to reuse all of that for versioning large data, but that becomes a question of how we do it. What does the pull request to git really look like that says, hey, we're going to add all this support? I think the biggest part is convincing that group that this is worthwhile, and that would just be with really hard facts: look at what these repos look like; look at what git could now suddenly do.
[Question from the audience, inaudible.]

Code is data, right, and data is code. For those unfamiliar, the IPLD effort is diving deep into computation. It turns out that a lot of the way to improve distribution, and the applications on top of it, is to really rethink how we mix computation and data. You can read a whole bunch of notes here that talk through use cases, potential things, a bunch of different discussions. Probably the most valuable thing we've gathered together here
is this awesome bibliography that you can dive through; there are a bunch of really great papers here. A lot of this stuff is much more about computation than data, so I don't know whether it'll all remain in IPLD; it might pop out as a different project or something. But yes, I think a lot of it does come down to really thinking through just how we represent files in the first place. Why are we storing just byte streams?
There's a set of issues. Actually, it's funny, because I was writing an email about this today to somebody who is working on a thing called, I think, 'com blocks' or 'call blocks' or something, which talks about doing exactly this. One example: this stackstream thing is part of the IPFS explorations, and the idea was to come up with a very simple language that would combine other objects and yield stuff. So a normal file, to read it, would concatenate its children, right?
A file would link to two sub-files, and it would concatenate those plus its own internal data segment, and then you could write a single function that represents that, and then you could ship it.
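In Go, that concatenation function is essentially io.MultiReader; a minimal sketch, with the node layout simplified to raw readers:

```go
package main

import (
	"bytes"
	"io"
	"os"
)

// fileReader treats a file node as the function "concatenate my own
// data segment with my children", the evaluation a unixfs-style file
// performs when read.
func fileReader(own []byte, children ...io.Reader) io.Reader {
	readers := append([]io.Reader{bytes.NewReader(own)}, children...)
	return io.MultiReader(readers...)
}

func main() {
	r := fileReader([]byte("own-segment;"),
		bytes.NewReader([]byte("child-1;")),
		bytes.NewReader([]byte("child-2")))
	io.Copy(os.Stdout, r) // prints: own-segment;child-1;child-2
}
```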
This is where the IPLD stuff came in: we didn't want to start writing special cases into IPFS. We didn't want to have to say, oh, this is a really cool idea,
let's have this code thing in a different language and include it along with go-ipfs. It'd be really nice to have a VM here, like a WebAssembly VM, and then point to the code itself as IPLD objects and apply the function on top of another thing. What this would look like: a file would be a function invocation that also points to the code itself, so it knows how to interpret itself, and then it points to the data, and that's how you read it.
Exactly, so all of that. There's something amazingly elegant and powerful lurking there, and we've been exploring those branches of things, and we've found pretty good ways of representing them. Where we're blocked is actually time: I think a number of us are just so spread thin that we can't do all this implementation at once. I think Stephen, for example, is somebody who has already figured a lot of this out and just wants to build it, but then needs to maintain go-ipfs, right?
Well, so this is definitely an open call to help out on the IPLD team, and come to the discussion right after lunch if you want to hear what IPLD needs, because there are a lot of really valuable things that could come out of this. The importer stuff: the IPLD primitives are there, enough to do the importer work without thinking very much about the rest, and at the same time we can continue those explorations and then yield something much more interesting.
So we kind of want to stage this: provide the shorter-term benefit in terms of the very simple kind of chunking and splitting we can already do, and then later, once we're thinking about computable files, that opens up a whole bunch of other possibilities. So yeah, I think there's a very interesting, bright future ahead for computing and how we represent stuff. We just have to chart a path through all the possibilities and then start doing some of it early as we go. But yeah.