Description
In our next How We Built This, @brian-hoffman-ob1 will talk through how his team stored package manager registries on Filecoin, covering the nuances of storing versioned data (without automatic deduping) + small files on Filecoin and building a UI to explore this data meaningfully!
Keep up with events for the Filecoin community by heading over to the Filecoin project on GitHub:
https://github.com/filecoin-project
Check out the Filecoin community resources:
https://github.com/filecoin-project/community
And stay connected on Filecoin Slack:
https://app.slack.com/client/TEHTVS1L6
All right, hi everybody. Hopefully there are some people on the live stream, and there will be more people viewing this after I get it recorded. My name is Brian Hoffman, and I'm from OB1.
As you can see on this slide, today I just wanted to talk a little bit about a project we've been working on for the last few weeks called 5mb. I know this is the How We Built This series, so hopefully you'll get some of that and find it interesting to learn about our journey of using Filecoin and developing on top of it.
I don't expect this to be a super long talk, but hopefully it's still comprehensive and interesting. On the agenda today: we'll talk quickly about who I am and what OB1 is, so you have some context. Then I'd like to talk a little bit about the problem we were really trying to solve with 5mb and what it is, and then, most importantly, how we built it. And then, as with any new and growing project, you want to talk a little bit about the challenges that you either had to overcome or are still experiencing, so that we have an idea of where we want to go from here and how things can get even better; and then just briefly, where we want to go next, or where it could go next. Like I said, my name is Brian Hoffman. I'm currently the CEO of OB1, a venture-backed startup that's been around since 2015.
My background is technology, so I'm a tech CEO. Prior to this work, I spent about a decade working in government and corporate consulting on cybersecurity, which is what led to my interest in Bitcoin, decentralized networks, and crypto. So that's a little bit about me.
OB1, the company, sprang out of a project called OpenBazaar, which you can probably see above me if you can see my screen. It's an open-source, decentralized marketplace, and we built it on top of IPFS, so we're pretty intimately familiar with Protocol Labs and the technology that's been built there over the last few years.
One of the cool things about these two, and why it's worth mentioning here, is that we've integrated Filecoin into the app. It hasn't launched yet because we're going to launch it with mainnet, so we're looking forward to that coming soon. Then you'll be able to send and receive Filecoin and buy and sell things with it on our apps, which is really cool.
But what we're here to talk about today is a problem that came up when we were discussing how OB1 could help with some of the Filecoin work going on. The key problem is that there are a bunch of software package repositories out there. If you want to download a package for Arch Linux or CentOS or one of these operating systems, you usually go to mirrored FTP sites or websites to download those packages.
You pull the packages down from there, or you use something like npm to get software. One of the issues these repositories have is that they have to ask all these people to mirror the content. There's a lot of content they put out there so people can get it, but much of it isn't used very often.
It could be an obscure package that maybe only a few people need every once in a while, but that data has to stay online somewhere, and that presents a challenge. So we think a hybrid online data storage system, something we could use Filecoin for, would be needed to ensure that data is readily available when necessary, while the less-used data gets archived: it's there, but not necessarily always immediately accessible. It just has to be retrievable, and it has to be cost-efficient. That's the general problem we're looking at, and I know a lot of the projects in the Filecoin space are solving similar things and tackling them in different ways, but this is what we came up with. I also wanted to briefly mention why we even find Filecoin interesting.
Like I mentioned earlier, we've been building on top of IPFS for years now. OpenBazaar is built on top of it, and the entire network, the whole marketplace, is accessed by peer-to-peer communication using IPFS. All the data is stored by those nodes in kind of an altruistic way (not necessarily fully altruistic, but in a way where they all contribute to the network and nobody is paying other people to store the data).
So we feel like Filecoin could be a way to move towards a more sustainable data ecosystem where you're not just asking people to store data. Sometimes people can't store data: if they're on our Haven wallet, that's a mobile device, and they're typically not running the app all the time; on iOS it would get backgrounded and killed. So there still needs to be a more reliable and sustainable way to store that data, for other apps as well.
There still need to be more and better tools for managing and using the network, and for providing consumer-facing applications that let people get to the data they want, such as the source code repositories and the package repositories. And most importantly, in my opinion, the Protocol Labs team and the contributors involved in building things like IPFS and libp2p, and all the other underlying technologies that have gone into Filecoin, have really shown that this can work. It's strong and it's growing in the right direction compared to a lot of the competition out there, so we consider them a very trusted brand and group of developers.
So what is 5mb? You might not recognize what's in the photo: it's a five-megabyte hard drive from 1956.
It's one of the very first hard drives considered portable, or at least somewhat movable, so since we're talking about storage, we decided to call the project 5mb in honor of it; we really didn't have a much better reason than that. It's an experimental project that we've built to make software package repositories available via the web, and it uses IPFS and Filecoin.
These are the tech components that primarily go into building this. We have a Golang-based repository processor service that we've tentatively called Amazon, because it involves packages and delivering packages. We might get sued for that, so it'll probably have to change at some point, but that's the working title for the moment. We also have a Golang-based HTTP server that serves up the front end.
We use Powergate, in the middle of the diagram, which helps us facilitate the communication between IPFS and Filecoin. Behind that we have Lotus, and we use MongoDB as the back-end data store for our application.
We use IPFS, obviously, and we also have some rsync scripts. If you're not familiar with rsync, it allows you to synchronize data sets across servers. In this case, as the icons show, you have CentOS, Arch Linux, Debian, and things like that: we're pulling terabytes of data down from those repositories and putting it onto our server so we can move it where appropriate within the IPFS and Filecoin ecosystem.
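For context, a minimal sketch of what one of those mirroring steps might look like, driven from Go; the mirror URL and local path are hypothetical examples, not the actual endpoints 5mb uses:

```go
package main

import (
	"fmt"
	"os/exec"
)

// mirrorArgs builds the rsync arguments for mirroring a package
// repository into a local directory. --archive preserves file
// metadata, --delete drops files removed upstream, and --partial
// lets interrupted transfers of large packages resume.
func mirrorArgs(remote, local string) []string {
	return []string{"--archive", "--delete", "--partial", remote, local}
}

func main() {
	// Hypothetical mirror endpoint; a real deployment would pick an
	// official rsync mirror for each distribution.
	args := mirrorArgs("rsync://mirror.example.org/centos/", "/data/repos/centos/")
	cmd := exec.Command("rsync", args...)
	fmt.Println(cmd.String()) // show the command without running it
}
```

Wrapping the invocation this way makes it easy for a scheduler to re-run the same mirror step per repository.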
Stage one we call ingestion, and it's pretty straightforward. Essentially, OB1 uses rsync and scripting to ingest the package repositories into an Amazon EBS volume; we're running it all with containers and data storage on Amazon at the moment. That data is then partitioned into different logical data sets so it can be processed further on, because we have a bunch of data sets of different sizes with different types of data.
That's part of the first step. Going forward, you have to do periodic synchronization of these data sets, because each of these repositories updates its content on a different schedule. Some of them update every hour, some every 24 hours, some less often than that, and so there are different rule sets for the different package repositories.
At this point we've got the Golang processor, the Amazon service I told you about, which works to stage the data into IPFS; that's the first step. We push all that data into IPFS, and then we're able to look at the data structure, and we have to do a couple of things here, because the data sets are different and the way we handle their push into Filecoin depends on the structure of that data.
If we're pushing, say, 200 terabytes, we treat that differently than if we're pushing one or two megabytes of data; it's a completely different kind of beast, so we have some logic there to handle that. Then we basically break these objects up into what we call buckets, and I know that's kind of an overloaded term; I'm sure the Textile guys are probably saying "we have buckets," and essentially these are like those. We take the IPFS objects, rearrange them into buckets, and keep track of them in our data store, MongoDB, so we know where each piece is. The reason we do that is so we can sort them into comparably sized buckets when we push them into Filecoin, because we can't push each file separately.
Take one repository, for instance: I think CentOS is something like 600,000 individual files of varying sizes. You can't really push them all in separately, and it doesn't make sense to push them all as one big file either, because if you want to retrieve one specific small file, you don't want to pull down, say, two terabytes from Filecoin when you only need a small set. So there's an algorithm there.
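To make the bucketing idea concrete, here's a minimal sketch of one way to pack files into comparably sized buckets with a single greedy pass; the size cap, file sizes, and CID labels are illustrative assumptions, and the real service also records the resulting layout in MongoDB:

```go
package main

import "fmt"

// File is one object staged in IPFS, identified by its CID.
type File struct {
	CID  string
	Size int64 // bytes
}

// Bucket groups files so each Filecoin deal covers a comparably
// sized chunk of a repository, rather than one tiny file or one
// multi-terabyte blob.
type Bucket struct {
	Files []File
	Size  int64
}

// packBuckets greedily fills buckets up to maxBucket bytes; a file
// larger than the cap ends up in a bucket of its own.
func packBuckets(files []File, maxBucket int64) []Bucket {
	var buckets []Bucket
	cur := Bucket{}
	for _, f := range files {
		if cur.Size > 0 && cur.Size+f.Size > maxBucket {
			buckets = append(buckets, cur)
			cur = Bucket{}
		}
		cur.Files = append(cur.Files, f)
		cur.Size += f.Size
	}
	if cur.Size > 0 {
		buckets = append(buckets, cur)
	}
	return buckets
}

func main() {
	files := []File{
		{"cid-a", 400}, {"cid-b", 700}, {"cid-c", 300}, {"cid-d", 900},
	}
	for i, b := range packBuckets(files, 1000) {
		fmt.Printf("bucket %d: %d files, %d bytes\n", i, len(b.Files), b.Size)
	}
}
```

Sorting files by size before packing would tighten the fit further; the trade-off the talk describes is choosing maxBucket so deals are big enough to be economical but small enough that retrieving one file stays cheap.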
What's the best trade-off you can make so it's not ridiculous but still feasible? That's the processing phase, and the bulk of the time we've spent building this has really been in this stage, because you want to do it properly: it has to be functional and it has to be usable.
Then we move on to stage three, which is called archiving. At this point the data buckets are getting pushed into Filecoin, and we use Powergate to do that. This is another very challenging area.
I think a lot of people probably have these challenges: so much can go right or wrong when you're doing these storage deals, and now we're talking about hundreds and hundreds of deals of different sizes with different miners. You'll definitely come across a lot of interesting problems there to work through. It doesn't always work as planned, but this is the part where we handle those kinds of things.
So we have code that monitors and handles the deal errors, and we track that; that information is also part of the MongoDB database we've built. One to-do we'd like to get to is exposing that information through the web. I know other people work on similar tools, and we didn't go down that road too far, because we know that stuff is being worked on and we want to make sure we're not duplicating efforts here.
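A minimal sketch of the kind of deal bookkeeping this implies; the status values, miner IDs, and retry policy here are assumptions for illustration, not Powergate's actual API:

```go
package main

import "fmt"

// DealStatus is the lifecycle state tracked per storage deal.
type DealStatus int

const (
	DealPending DealStatus = iota
	DealActive
	DealFailed
)

// DealRecord is the per-bucket, per-miner record persisted in the
// data store (MongoDB in 5mb's case).
type DealRecord struct {
	BucketID string
	Miner    string
	Status   DealStatus
	Retries  int
}

// retryable collects failed deals still under the retry budget, so
// a background worker can re-propose them, possibly to other miners.
func retryable(deals []DealRecord, maxRetries int) []DealRecord {
	var out []DealRecord
	for _, d := range deals {
		if d.Status == DealFailed && d.Retries < maxRetries {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	deals := []DealRecord{
		{"bucket-1", "f01234", DealActive, 0},
		{"bucket-2", "f05678", DealFailed, 1},
		{"bucket-3", "f05678", DealFailed, 5}, // over budget: needs a human
	}
	for _, d := range retryable(deals, 3) {
		fmt.Println("retrying", d.BucketID, "with", d.Miner)
	}
}
```

Deals that exhaust the retry budget are exactly the ones worth surfacing in the web view the talk mentions as a to-do.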
For the front end we went with the existing IPFS UI, which we thought would be the most expeditious way of getting this out the door so people could use it. What you can see here in the slide is a screenshot of the CentOS repository that we've pushed into IPFS and Filecoin. Essentially, the way it works is this: if the data is hot and available via IPFS, which in many cases it is, you can retrieve it very quickly through the web. You can traverse all the files and folders, find the file you want, and retrieve it immediately. If it's not, we've added logic to fall back to Filecoin and have Powergate retrieve that content from Filecoin if necessary. What happens then is that the UI serves up a notice that says, hey, this content isn't immediately available.
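That hot/cold fallback can be sketched as a two-tier lookup; the interfaces below are an abstraction for illustration, standing in for the IPFS node and the Powergate client rather than their real APIs:

```go
package main

import (
	"errors"
	"fmt"
)

// Store is satisfied by both the hot tier (IPFS) and the cold tier
// (Filecoin via Powergate) in this sketch.
type Store interface {
	Get(cid string) ([]byte, error)
}

var ErrNotHot = errors.New("content not available in hot storage")

// fetch tries the hot tier first and falls back to the cold tier.
// delayed tells the UI whether to show the "not immediately
// available" notice while the slow retrieval runs.
func fetch(hot, cold Store, cid string) (data []byte, delayed bool, err error) {
	if data, err = hot.Get(cid); err == nil {
		return data, false, nil
	}
	data, err = cold.Get(cid) // slow path: retrieve from a Filecoin miner
	return data, true, err
}

// mapStore is a stub tier used only for demonstration.
type mapStore map[string][]byte

func (m mapStore) Get(cid string) ([]byte, error) {
	if b, ok := m[cid]; ok {
		return b, nil
	}
	return nil, ErrNotHot
}

func main() {
	hot := mapStore{"cid-hot": []byte("pkg")}
	cold := mapStore{"cid-hot": []byte("pkg"), "cid-cold": []byte("old-pkg")}
	_, delayed, _ := fetch(hot, cold, "cid-cold")
	fmt.Println("delayed retrieval:", delayed)
}
```

In the real pipeline the slow path is asynchronous: the cold retrieval is kicked off, the notice is shown, and the content is hot on the next visit.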
It will let you know when the content comes back, and you can revisit that location and hopefully it comes through. That challenge in itself is potentially quite a user experience problem, if you can't get access to the data. But the idea behind this whole thing is that the content that really needs to be accessible all the time will be there and be quickly retrieved, while in the case that some random or more obscure file is being requested, that's when people would have to wait. In this case we've also abstracted away the payment piece: we don't require people to pay the Filecoin to go get the data, we handle that behind the scenes for now, so really it's just a time delay on their side.
As for challenges: things are breaking and getting fixed, breaking and getting fixed, and people are adding things and changing things, and that goes across the whole stack, because we're working not only with Lotus but also with Powergate. I know the Textile team is working really hard to keep up with that and improve it, and they've been really great to work with; everybody has been really great, which is awesome and very helpful.
Secondly, as I've mentioned several times in this discussion, user experience is super important. As a company we've spent the last five years building software focused on the user experience, and in crypto and decentralized applications, user experience is sometimes a second priority to just making it work. We really wanted to find a way to make this usable and practical, not just an exercise so we could say we did it. And finally, there's such a steep learning curve for these technologies. It's just an immense amount of code and logic, and the verbiage is like a completely different vernacular from a lot of what people work on, which you have to get a hold of in order to really understand how to handle things when they come up; we're still grappling with that. But that's one of the exciting things about it: obviously you're learning, and you're doing something that no one else really is doing, and that always comes with a steeper learning curve than normal. So, where to next?
These are just a few of the ideas I have, at least for this project, so we're focusing on those. I think there's a lot of room for growth on the UI side, to really manage the entire pipeline all the way from ingestion to presentation to the user, and a lot on the management side of that.
One of the biggest things about dealing with Filecoin is that there's just an immense number of different actors and processes going on at the same time, and there's also the extra challenge of time: when you're building online applications, not many things present such a challenging time constraint on top of everything else, so that's pretty unique.
We're also already in the process of expanding the data sets that are available online. At the moment I think we're targeting about 11 terabytes of data, and we're going to try to get that all up and running.
I think that as we expand we're going to start running into some scalability difficulties, which will be interesting, and we're going to figure out how to handle that. And finally, there's the process of pushing that much content into Filecoin for repositories that change pretty quickly.
If a repository is changing hourly and it's two terabytes, what's the thinking around whether or not it even makes sense to try to push that kind of data into Filecoin? Is that really what we're trying to solve, or how do we handle it?
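As a rough back-of-the-envelope on why this question matters (the churn rate below is a made-up assumption purely to show the scale, not a measured number):

```go
package main

import "fmt"

// dailyChurnGB estimates how much data would need re-archiving per
// day for a repository of repoGB gigabytes where changedFrac of the
// content changes every syncEveryHours hours.
func dailyChurnGB(repoGB, changedFrac, syncEveryHours float64) float64 {
	syncsPerDay := 24.0 / syncEveryHours
	return repoGB * changedFrac * syncsPerDay
}

func main() {
	// Assume (illustratively) 0.1% of a 2 TB repository changes
	// each hour: that is already ~48 GB of fresh deal data per day.
	fmt.Printf("%.0f GB/day\n", dailyChurnGB(2000, 0.001, 1))
}
```

Even a small hourly change fraction compounds into a steady stream of new deals, which is why batching updates or archiving only stable snapshots may make more sense than mirroring every change.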
So that's another kind of mental challenge we have to get over as we go forward, and that's what we're thinking through. That's basically my last slide, but I did want to briefly mention that I think the repositories for this code are still private right now; we're going to be opening them up soon and putting them out there.
That way people can take a look at it, help out, expand it if they want, or use it for whatever they want on their side, which should be useful. And if you're trying to get in touch with us or want to find out more about what we're working on, you can follow us on Twitter at @ob1company, and we also have our OpenBazaar handle.
Okay, great. It sounds like we're not going to do a Q&A for this, but thanks for tuning in anyway. Thank you.