Description
Filecoin -- an exabyte-scale IPFS system -- presented by @jbenet at IPFS Thing 2022 -- IPFS Implementations -- https://2022.ipfs-thing.io
I'm going to talk about Filecoin, the largest kind of IPFS deployment. Right after me you'll hear about Lotus, one of the implementations of Filecoin, from Aayush. I'm going to cover three parts: a quick intro, with use cases and scale; the Filecoin architecture, the broader system architecture, so you get a sense of how all the different components operate; and then some sets of problems around retrieval, interop, indexers, and computation.
I've given a lot of talks about this, but in summary: Filecoin is a crypto-powered storage network. It's a blockchain-coordinated storage market, so think of using a blockchain to advertise storage providers, advertise retrieval providers, advertise clients, and so on, and to be able to coordinate operation.
The network verifies storage using zero-knowledge proofs. It's the largest SNARK system to date, I think; I don't think anything has surpassed it yet. It uses proof of replication and proof of spacetime, some deep cryptographic constructions (not quite primitives; they're built on other primitives) that we ended up having to create and then use.
These are the stats of the network: there are about 17 exabytes of storage capacity and about 4,000 storage providers, which is about 400 organizations, and lots of projects and so on. To get a sense of the capacity: most of the facilities are large-scale, cloud-style data centers.
So you can think of a lot of data centers in cities all around the world with very large racks of capacity, and that capacity right now is actual arranged bits that are being proven at all times, so that we know, every 24 hours, that all of that capacity, and all the data that's stored (I'll show the data in a moment), has been proven.
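As a rough illustration of that proving cadence: Filecoin's WindowPoSt divides the 24-hour proving period into 48 half-hour deadline windows, and each partition of sectors must be proven inside its window. The sketch below is a hypothetical helper, not Lotus code, just to show the arithmetic:

```go
// Illustrative sketch of the WindowPoSt cadence: a 24-hour proving
// period split into 48 deadline windows of 30 minutes each.
// The deadlineIndex helper is hypothetical, not a Lotus API.
package main

import (
	"fmt"
	"time"
)

const (
	ProvingPeriod = 24 * time.Hour
	NumDeadlines  = 48
	DeadlineSpan  = ProvingPeriod / NumDeadlines // 30 minutes
)

// deadlineIndex returns which proving window is open at time t,
// relative to the start of the current proving period.
func deadlineIndex(periodStart, t time.Time) int {
	elapsed := t.Sub(periodStart) % ProvingPeriod
	return int(elapsed / DeadlineSpan)
}

func main() {
	start := time.Now().Truncate(ProvingPeriod)
	fmt.Printf("current deadline window: %d of %d\n",
		deadlineIndex(start, time.Now()), NumDeadlines)
}
```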
So we know that capacity is there. By the way, the blockchain and all of that is in IPLD, using CBOR for this.
It's a totally IPFS-native blockchain system.
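To make "the chain is IPLD" concrete, here is a minimal sketch (my illustration, not Lotus code) that decodes a DAG-CBOR block into a generic IPLD node with go-ipld-prime; the byte slice is a stand-in for real block bytes from the chain store:

```go
// Decode a DAG-CBOR block into a generic IPLD node.
// The raw bytes here are a placeholder ({"data": 1} in CBOR);
// real block bytes would come from the chain store.
package main

import (
	"bytes"
	"fmt"

	"github.com/ipld/go-ipld-prime/codec/dagcbor"
	"github.com/ipld/go-ipld-prime/node/basicnode"
)

func main() {
	raw := []byte{0xa1, 0x64, 0x64, 0x61, 0x74, 0x61, 0x01}

	nb := basicnode.Prototype.Any.NewBuilder()
	if err := dagcbor.Decode(nb, bytes.NewReader(raw)); err != nil {
		panic(err)
	}
	n := nb.Build()

	v, _ := n.LookupByString("data")
	i, _ := v.AsInt()
	fmt.Println("data =", i) // data = 1
}
```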
This is the capacity, this is the capacity growth, and this is the data onboarding. It's onboarding on the order of 0.5 to 1 petabyte a day, and that's a lot of data. So we went from Kubo to Lotus, where Kubo was, you know, at the beginning, built for dealing with megabytes and gigabytes, making its way to terabytes (like, you finally made it), and then Lotus is like: bam, petabytes, exabytes.
Now, of course, it's not petabytes or exabytes in data distribution; that's capacity, and it ends up dealing with all of the storage outside. So one extremely big difference between all the Filecoin implementations and Kubo and other IPFS implementations is that the data size is so big that you have to do a lot of stuff outside of any kind of DAG manipulation, or outside of, like, one nice DAG store or something like that.
There are a lot of native web3 use cases: things like consumer storage, video and audio, NFTs, web3.storage, and so on. All of those have tens of thousands of users (developers, mostly) generating millions to tens of millions of objects; I think it's getting to on the order of 100 million. But all of that stuff is tiny.
Now, Filecoin still has to operate really well for all the web3 use cases, because that's where a lot of the web3 applications are. So it has to do both: do really well at being the underlying datastore for those web3 use cases, and start pulling in large-scale web2 things. Where it's headed is enabling large-scale web3 applications. This is what needs to happen in order for web3 to cross the chasm.
A
We
need
to
be
able
to
build
things
like
all
the
applications
on
the
left
using
web
through
primitives,
so
things
like
ipfs
things
like
blockchains
and
whatnot.
It's
a
long
road
to
get
here.
The
big
kind
of
like
next
big
blocker
is
things
like
data
pipelines
and
consensus,
scalability
and
so
on,
and
there's
a
lot
of
kind
of
direction
into
computation.
That
falcon
is
taking
so
things
like
pluck
one
out
of
the
fvm,
which
is
a
wasm
based
virtual
machine.
The
the
upgrade
just
went
live
earlier
this
week.
A
Sorry
last
week
I
think
last
week
it's
blur,
and
that
is
going
to
enable
a
bunch
of
different
runtimes
to
be
added,
on
top,
so
think
of
being
able
to
add
the
evm
or
algorith
ses,
or
things
like
that
swings
out
and
whatnot
and
a
bunch
of
other
other
systems
or
target
wasm
directly
through
some
some
other
kind
of
wasm
native
runtime
from
there.
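A toy illustration of why a WASM-based VM can host many runtimes: anything that compiles to WASM runs. This sketch uses wasmtime-go as a stand-in host (the FVM's actual runtime and APIs differ; nothing here should be read as the FVM interface) to instantiate and call a tiny module:

```go
// Instantiate and call a trivial WASM module with wasmtime-go.
// This is a generic WASM host, used here only to illustrate the idea
// of a WASM-based VM; it is not the FVM.
package main

import (
	"fmt"

	"github.com/bytecodealliance/wasmtime-go"
)

func main() {
	// A tiny module exporting add(i32, i32) -> i32.
	wat := `(module
	  (func (export "add") (param i32 i32) (result i32)
	    local.get 0
	    local.get 1
	    i32.add))`

	engine := wasmtime.NewEngine()
	wasm, err := wasmtime.Wat2Wasm(wat)
	if err != nil {
		panic(err)
	}
	module, err := wasmtime.NewModule(engine, wasm)
	if err != nil {
		panic(err)
	}
	store := wasmtime.NewStore(engine)
	instance, err := wasmtime.NewInstance(store, module, nil)
	if err != nil {
		panic(err)
	}

	add := instance.GetFunc(store, "add")
	sum, err := add.Call(store, 2, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println("2 + 3 =", sum) // 2 + 3 = 5
}
```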
One reason that there are going to be many computational networks is that cryptographic primitives are very different, and so they yield different rationales, different economic structures for getting verifiability, and very different performance profiles. So it's unlikely that we'll see one single computational network; we'll probably see many for a while, and it'll take a while for these to synthesize. There's a bunch of next-gen scalability
that needs to happen. This is what the ConsensusLab group is working on, and the goal is to get to billions or trillions of transactions per second through things like applying hierarchical consensus and so on. But we really want to get to blockchains that have very fast finality, things like millisecond finality within a data center. That means you have a blockchain that can operate within a region and then checkpoint up. Great.
So let's talk about IPFS. Filecoin is an IPFS system. At the end of the day, the entire blockchain is IPLD data, all of the sectors are IPLD data, the data within the sectors is IPLD data, all of this stuff is IPLD, and the transfer protocols are IPFS-oriented protocols. But Lotus started in a different direction.
The DHT is not going to scale for petabytes or hundreds of petabytes of stuff, so we really need much more scalable content routers. That's what the whole session tomorrow is about: very scalable content routing systems. You basically need O(1) network accesses to be able to do this with very large indices, and you need decentralization, which is tricky. But basically you don't want to do something like a DHT layout, because then you end up with a bunch of hops, and you end up advertising petabytes of stuff into the DHT.
Filecoin has a bunch of different consensus nodes and storage provider nodes: Lotus, Venus, Forest, and Fuhon. This comes from blockchain client diversity requirements, where you want separate, independent codebases to make sure that if you get into some problem with one of the nodes, the network can potentially carry on. I found this table; I don't know if it's up to date, but it's a table that implies where things are, and you'll hear about Lotus in a moment.
There are other programs and other nodes. Think of Boost as a layer that gets added to things like Lotus and other nodes, which can do storage and retrieval for a storage provider. Think of it as kind of an interop layer, plus a bunch of added tools and systems, to be able to mediate the client and storage provider deal-making: like, "oh, I want to retrieve this thing and I want to pay you", or "hey,
I want you to take this data and store it." Once you start moving around large amounts of data, like tens of gigabytes to terabytes, you have to deal with all kinds of scheduling problems and so on, especially when you have some computation that you're going to run in the background, like sealing. That's where things like Boost come in. There's another implementation called Estuary, which we'll also hear about later today, I think. Think of that as an IPFS implementation tuned for medium-scale data onboarding into Filecoin.
So this is like 10 gigabytes to a petabyte; it can't handle 10 petabytes plus. At that point you want to move things outside of the wire; you just don't want to use the internet. But 10 gigabytes to a petabyte is the sweet spot for Estuary, and think of Estuary as an intermediate node between clients and storage providers.
Filecoin nodes are IPFS nodes. So think of this diagram, where the libp2p network is a very large network, IPFS is on top of that, and Filecoin is a subset of the IPFS network.
It's IPLD and UnixFS all the way down: all the blockchain data is IPLD data and so on. Files today are imported like regular POSIX files, imported as UnixFS; maybe they should be WNFS. The transports today are Bitswap and GraphSync, and the network is libp2p, HTTP, or offline.
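A small sketch of what "IPLD all the way down" means in practice: every blob of bytes gets a self-describing content address. This hypothetical example hashes raw bytes into a CIDv1 with go-cid and go-multihash, the same kind of identifier Filecoin deals, sectors, and chain objects are keyed by:

```go
// Compute a CIDv1 for a raw byte blob: sha2-256 multihash wrapped
// with the "raw" codec. Any IPLD block is addressed this way.
package main

import (
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/multiformats/go-multihash"
)

func main() {
	data := []byte("hello filecoin")

	mh, err := multihash.Sum(data, multihash.SHA2_256, -1)
	if err != nil {
		panic(err)
	}
	c := cid.NewCidV1(cid.Raw, mh)
	fmt.Println(c) // base32 CIDv1, e.g. bafkrei...
}
```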
So there's a lot of offline movement of data, because at these scales you just can't use the internet. One important thing: a lot of people always talk about Filecoin and IPFS interop, but that's a misnomer.
Think of it as Lotus and Kubo interop. Part of what led to this was that people were calling Lotus "Filecoin" and people were calling go-ipfs "IPFS", and so we have to undo this conceptual damage. I'm just gonna let that stay up for a moment; just burn it into your mind.
So that's a good one. Filecoin was designed to have all these different components that are meant to compose when you want to create a node. So think of a consensus node as having a set of libraries: it has a libp2p node, a local repository, some local notion of time, and a bunch of facilities for the blockchain.
Think of a storage mining node as having all of that, plus the ability to deal with files and the ability to do the storage mining components. Think of a storage provider as all of that, plus the ability to make deals and participate in the storage market. And think of a retrieval provider node as not having to maintain the blockchain state at all: it's just another node and a client, and doesn't have to check in with consensus.
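Here's an illustrative sketch of that composition (hypothetical types, not Lotus's actual ones), where each node type embeds the previous one and adds capabilities:

```go
// Hypothetical shapes for the node types described above.
// Each type is the previous one plus extra capabilities.
package main

// Placeholder capability types standing in for real subsystems.
type Libp2pHost struct{} // peer-to-peer networking
type Repo struct{}       // local repository
type Clock struct{}      // local notion of time
type ChainStore struct{} // blockchain facilities

type ConsensusNode struct {
	Host  Libp2pHost
	Repo  Repo
	Clock Clock
	Chain ChainStore
}

type FileStore struct{} // deal with files
type Sealer struct{}    // storage mining (sealing/proving) components

type StorageMiningNode struct {
	ConsensusNode // all of the above, plus:
	Files  FileStore
	Sealer Sealer
}

type DealMarket struct{} // make deals, participate in the storage market

type StorageProviderNode struct {
	StorageMiningNode // all of the above, plus:
	Market DealMarket
}

// A retrieval provider carries no chain state at all: it's just
// another node serving clients, with no consensus check-ins.
type RetrievalProviderNode struct {
	Host  Libp2pHost
	Files FileStore
}

func main() {}
```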
That's in theory. So that's the theory of the protocols, and the protocols allow that kind of thing. In practice, the implementations are a lot murkier and the libraries are not as easily decoupled and so on, because when you're making a thing, you encounter all kinds of constraints that are different from the theoretical constraints. So it's not as easy as plug-and-play to build these things; they end up being totally different codebases. So, some quick notes on retrieval, interop, and indexers.
I think this is a useless slide now, so I'll make a couple of comments and then I'll hop off. On retrieval: the whole storage flow is working pretty well now, but the retrieval flow is not working very well yet. This is why we made an indexer service that can index all of the stuff that the storage providers have, ingest it into one place, and then make it accessible.
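As a rough sketch of what querying that service looks like over HTTP (the cid.contact endpoint and response shape here are my assumptions based on the public IPNI indexer, not something stated in the talk; check its docs before relying on this):

```go
// Look up which providers advertised a CID via the network indexer.
// Endpoint and response format are assumptions about the public
// cid.contact (IPNI) service.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	c := "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi" // example CID

	resp, err := http.Get("https://cid.contact/cid/" + c)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The JSON body lists provider records (peer IDs, addresses,
	// transport metadata) for providers that advertised this CID.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```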
So that's live now, and it's getting connected to the IPFS gateway; you'll probably hear about that tomorrow. And then we want retrieval provider networks to be able to use that indexing information and pull data from storage providers. Those retrieval provider networks are being tested right now; they're getting built at the moment. There are some testnets live that work and are trying to serve the gateway and so on, but they're not good enough yet.
You have to make that retrieval flow work well. So today you can relatively easily write your data into Filecoin, depending on how big it is: if it's big, it's going to be harder; if it's small, it's very easy using many of the on-ramps. But then retrieving it again depends. If you use an on-ramp that has really nice caching for you and solves it for you, great; things like web3.storage and NFT.storage and so on all work really well.
So that's what retrieval networks and retrieval markets are aiming for: on the order of 100,000 to 10 million nodes, and we're thinking of being able to deal with 10^18 objects. That's web scale; maybe 10^15 is good enough, but 10^18 is safe territory. And these will likely end up being tiered.
The L2 might have a bunch of content that was positioned there ahead of time, predicting that the data is going to be requested in a region; and then, if all of those fail, the L1 might just go straight to the L3, which is the SP. So think of standard system-style caching, but per region, and think of these as being deployed in a locale so that you minimize the latency; you don't want to be crossing continents.
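An illustrative sketch of that tiered lookup (all names here hypothetical): try the nearest cache first and fall back toward the storage provider:

```go
// Tiered retrieval: consult L1 (edge), then L2 (regional), then the
// storage provider (L3). Purely illustrative; not a real client.
package main

import (
	"errors"
	"fmt"
)

type tier struct {
	name  string
	fetch func(cid string) ([]byte, error)
}

// retrieve walks the tiers in order and returns the first hit.
func retrieve(cid string, tiers []tier) ([]byte, error) {
	for _, t := range tiers {
		if data, err := t.fetch(cid); err == nil {
			fmt.Println("served from", t.name)
			return data, nil
		}
	}
	return nil, errors.New("not retrievable from any tier")
}

func main() {
	miss := func(string) ([]byte, error) { return nil, errors.New("miss") }
	hit := func(string) ([]byte, error) { return []byte("block bytes"), nil }

	tiers := []tier{
		{"L1 edge cache", miss},
		{"L2 regional cache", miss},
		{"L3 storage provider", hit},
	}
	if _, err := retrieve("bafy...", tiers); err != nil {
		panic(err)
	}
}
```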
Think of it as a web-scale data repository where you can dump a bunch of data in and then run a bunch of data pipelines and computation, to then build other kinds of applications. So all this stuff can take large archives, like all the crypto trees from all the applications, and just be this cold storage of all the stuff; and then whatever you want to
pin, you get close to the end users, and you keep the local hot caches for everything closer to wherever the person is. Cool. That's it.