Filecoin Filecoin Orbit, 23 Oct 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Indexing and Interoperability on Filecoin and IPFS

Description

Learn more about indexing and interoperability across Filecoin and IPFS in this session from research engineer Will Scott.

A

Hi, my name is: will scott and I'm a member of the data systems team at protocol labs? We work on the data layer of a bunch of the data transfer protocols and stacks in the in the interplanetary stack that vertical labs helps maintain, and so this talk is looking at content routing and how we find content um in both filecoin and in ipfs and and more broadly, um what that part of the system uh looks like today and and some of the work that we're doing to try and help scale that uh going forward.

A

And so this talk is looking at content routing and how we find content um in both filecoin and in ipfs and and more broadly, um what that part of the system uh looks like today and and some of the work that we're doing to to try and help scale that uh going forward.

A

So I'm going to start by talking about where content routing is today uh and then I'll move sort of up the stack through this talk so that by the end, I'm talking more about how uh the providers who have content um are providing that content out to uh the network and and and those protocols.

A

So today, from from something like ipfs uh content, routing happens through a dht, and what that means is each each piece of content which is identified by a cid. A content id which is the hash of that content is the the item that is both being requested and being published.

A

So people who have data that they want to make available to other ipfs users will publish that data into the dht by finding the around 20 appropriate uh nodes that make up the mesh and and putting a record into them, uh indicating that they have this content and so it it is a abstraction that looks a lot like a key value store where the key is this cid the hash and the value. Is that publisher that that location, part of the way that the dht works is that in order to prevent stale records?

A

So so some publishers saying that they have something and then going away, but you still getting that answer a long time later and not actually being able to connect to them. Is that records that get put into the dht expire after a day and so publishers who want to keep content available uh in in the current ipfs world republish every day to sort of indicate, they're still alive and still providing that content?

A

This has a couple scalability limitations right. The dht setup grows very well if the amount of content grows with the number of participants right. So so, if you have something that looks like a blog or you know, everyone bringing their own library of data, um and so as more users join and the dht grows larger. The data can also increase linearly at the same rate, but there's other sorts of quantities of data that are just too big for a dht in this in this sort of uh paradigm.

A

That they'll really overload all of the nodes that are participating in the dht. So if you imagine wanting to do something like you know a search index over a lot of data, so you know think think a big library of content, maybe web search, archives, other things like that.

A

It just means that there's so many cids that each individual participant in the network, unless you you, also grow the network, a huge amount at the same time will end up having more records than they really can meaningfully supply, and it means that the amount of bandwidth for the publisher who's trying to keep all that data available gets very large.

A

So um we start facing these larger scales of data with uh filecoin and and are looking at. You know what how what is the sort of next iteration of this protocol that we can use to to deal with this increased scale.

A

So the focus from from my perspective is the thing we want to get right. Is the protocols and we'll figure out the ability to iterate the specific mechanism um at a different rate if we've got good protocols on the two edges of this? So if we have a good protocol for publishing and a good protocol for requesting and finding data, then we can allow ourselves some additional flexibility and freedom to experiment with different uh specific implementations of content wrapping.

A

So what I'm going to talk through next is sort of the thoughts on where we go on those two. So the first is is content routing and- and so this is a user looking for a specific cid and we already have an interface in ipfs. That is roughly what we might want, which is when ipfs needs to find data. It makes a request.

A

That is the content, routing interface, and it is, I want to find providers for seid and I get back asynchronously, and this is the the golang terminology a channel that over time gives me the addresses of people who may have that data, and so what we'd like to do is take something that looks a lot like this and expose it over a network interface, so that I can have some other external system either either as some other program on the same computer as a library or as some remote service.

A

They can serve to implement this interface, and so I think, we'll start to see this um and and some abstraction that looks a lot like this coming to to a set of clients that are able to uh retrieve and and provide data from some some mesh of of ipfs and or filecoin um and and more broadly, for content addresses. So that's our thought of you know what is this simple client side of this interface?

A

Okay? So so, then, what do we do to actually implement? That is the next question, um and the work that my team is uh pursuing right now uh is is an indexing solution where we're trying to say well. Okay, can we just make something that happens to know the answer to all of these queries, uh and this is building on a bunch of great work? That's happened over the last couple of years. There was an initial implementation of a storage system.

A

That's optimized for these cid hash, based keys, called store, the hash that volkl made, and that has then been ported to golang and that we're now, using as the basis of a indexing network index system that we're calling store the index. So let me give you a brief um or or sense of what that's doing and how that's actually working, so it takes a cid and it uses some bytes of that cid itself, which it believes are generally random right, because it's it's a hash in in general, and so what it does is.

A

It first uses some bytes of that cid to decide which bucket to look at right.

A

So so it has a bunch of different buckets in memory and those buckets will point to index files on disk that grow, and so it will use a specific subset of bytes in the cid to determine first, a bucket that then points to an index and then we'll use another set of bytes in the cid again that it already has, due to time, determine an offset in that index file on disk that the record for that cid should end up, and so then it'll make a read of that index file at that offset.

A

So this is the first read on disk and that'll. Give it a record that has a relatively small, comparatively number of cids that match both that bucket and that index offset and what it will get in that record that it reads from disk is a pointer that hopefully includes that cid.

A

If it knows about it and the relevant metadata entries for that cid, and so then it will do reads to get those specific metadata entries and the metadata that we're imagining ends up being stored for a cid is the provider who has it um where they are the multi address. So this is this full address info that we talked about. Basically it's how to find them, and then also you know what protocol they're talking about and relevant metadata. That's protocol specific!

A

So how you actually go about retrieving the data from that provider and this gets. You know this is one additional level of complexity than what we have had previously, because now we're having multiple data transfer protocols, both bitswap and graphsync, as potential data transfer options, and we expect that there may end up being additional ones and that's part of the extensibility and flexibility here.

A

The nice thing about this setup um is that there's a fixed number of reads from disk that happen, regardless of how many cids end up in the database. You do one read in the index to get a record chunk and then you do a read to get the specific metadata entries that you want. The reason we do the extra lookup on metadata is because we expect that there is sort of for a given provider.

A

There's many cids right. So if you think about a file coin deal, that's you know 32 gigs, typically of of data, which has many many uh cids within it, and all of those the process that you should then use to retrieve is the same, and so we can de-duplicate and spend a lot less disk resources.

A

If we do that deduplication, but it means that there's only two to three reads from from disk for a given cid, regardless of how many cids are in the index, and so that means the only thing that we're really having to scale is disk usage, but we can, but, but we end up retaining a relatively low latency we're optimistic. This ends, up being you know, order of five or ten milliseconds. To answer a query. uh Sorry, microseconds, uh you know random access disk seeks, um even as the database of indexing gets very, very long.

A

Okay, so so you've got this abstraction of this library. That's able to do you know individual indexing uh requests, it can answer for a cid who has it relatively quickly and it should be able to scale, but that's part of the story. Right we've still got to answer. How does this indexing component actually learn about all of these cids?

A

And then I want to talk a little bit about how we see this sort of evolving in the future. Now that now that we've got interfaces right, you know this is one solution, but it's certainly not the bl end. All I'm sure that there's already a lot of you know uh questions in your head about okay. So if there's this centralized thing well, aren't we building a decentralized web, and- and yes of course, we are okay.

A

So the the other interface- and this is, I think, right now- the the focus on the thing that we want to get right and the protocol we want to get right is ingestion, which is how do we learn? What data various storage providers, um either filecoin or large ipfs nodes have and get that reflected in the network index.

A

So so the current thought is that what we're expecting from storage providers is that they will have a hash chain of uh basically advertisements of what content they have um and and they'll expose this list of advertisements and the cids in those advertisements um as a thing that network indexers can get from them. The reason that we're having this as a sort of hash chain of advertisements that that are signed by the storage provider is it is.

A

This is part of that extensibility, which is it's not that the index, the storage providers are pushing or or just sort of saying, I've got this data um which would not be you know, um immutable would allow them to provide potentially different views of their data to different indexers.

A

um We want to make sure that the the sort of expectation on search providers is they have one view of what data they have, and that is global, so that, as other network indexing solutions come in, they can also use this protocol to get the same view into what content is available in the network.

A

Okay. So what? What does it look like? As you know, uh either a new file coin deal happens uh for a provider or they, you know through some other means, get an additional set of cids.

A

Well, the first thing that they'll do is they'll they'll, take those cids they'll make the list of them, and then they'll make an advertisement that points to both that list of cids and the previous advertisement and that'll be their their new sort of head of of what content is available for this provider.

A

The provider then advertises this and it's the best effort gossip sub. Basically advertisement of I have new data available, and this is on a topic that network indexers can listen to so that they can get notifications for lower latency polls from storage providers who have new data.

A

uh In addition, we expect that the network indexers will sort of keep track of known search writers and we'll pull them.

A

um You know on some regular basis to see if they have new content, um so the gossip sub is, is um you know a way to to lower that latency in most cases, um but there there's a sort of a fallback to make sure you are keeping everything in sync um once a networker indexer sort of learns about this and adds it to its queue of providers that it needs to get updates from it will use graphsync to whole first that advertisement and then the list of cids um that it does not yet have um and and there's a couple reasons for doing it.

A

This way, the graph sync means we only get the portion of the dag that we do not already have. So it gives us this partial resumption of oh, I already have older advertisements, I'm not going to re-get them.

A

We also expect that there's a bunch of deduplication in the sids, which is um you know, a deal that is made to file point- is likely made to multiple providers um and so by by having this sort of structured through this ipld setup, um the indexer will only have to fetch that from one provider not from all of them uh and not duplicate that work, because it knows.

A

Oh, this is the same list of cids that I've seen in other places, so the indexer pulls that and and then adds to its index, that these cids are now provided by this, this provider. Okay. So so so the the advertisements um either introduce a a new set of cids uh or retract and say I am no longer providing some previous set of cids.

A

Those are those are the two basic deltas um and then, in that advertisement layer uh it says what metadata uh and protocol should be used to retrieve them right, because we expect that you know, for some large set of cids. You've got one uh sort of mechanism, so if, if you end up as a storage provider having you know different protocols or whatever, you would do that through multiple advertisements, each with the smaller set of appropriate cids that match that metadata um other metadata- that you might imagine uh is things like.

A

Is this available for free retrieval, or is it going to cost cost file point to to do the retrieval? Okay? So so an indexer has the sid maintains it. um But let's talk now about sort of what what that expectation of the the growth and sort of evolution of the indexes so we're doing one index right um there there's one thing which is like okay as the scales: what what's the what's the plan and there's two ways that it can scale?

A

You can either scale in terms of having a lot more queries or you can scale in terms of having a lot more data and- and hopefully we scale in both um as we get more queries um likely. What you want to do is have caches of recently queried cids that end up closer to clients right and that takes load off of the actual primary index, um because many of the there will be a small number of cids that are requested many times and many cids that are, you know, only never requested at all.

A

So you can pull that that frequently queried head into a set of caches that are closer to clients.

A

The the index itself is then limited primarily by that growth right, like as those disks get really big that gets expensive or it can't pull from the growing number of storage providers in a reasonable way. And so you can imagine sharding it where you end up with regional partial indexes. So there's some index that looks at storage providers that are in north america and a different one.

A

That looks at storage providers that are in europe, um and you can probably scale that uh to another order of magnitude, potentially just by starting across regions, and then the local caches will will send out sort of parallel requests back to these multiple different partial shards, uh maybe potentially preferring the local ones. Because they they want to connect uh clients uh with data that is closer to them uh by default. um But you know, if that fails, then they fall back to further away charts uh and, and that way you you've got.

A

You know a a set of you know fairly tractable uh extensions that likely allow you to keep scaling this for a while, okay, but that's all still within the scope of a single uh trust, administrative, regent, right. There there's some operator, that's operating this whole system, and- and this is why the focus initially is on the protocols of both content, retrieval on one side and storage provider ingestion on the other side, which is we don't want to be making the only network index, and we don't want to be running the only network.

A

So there's a few different sort of thoughts here, one is, the expectation is pretty quickly we'll end up with federated replicas, where other entities also are running network indexes, and that gives you a couple options. One is that clients can query from multiple providers indexes and can check them against each other to make sure that no one is lying and that and that these are all sort of operating in sync and and have similar views into uh providers on the network.

A

The other thing in the provider ingestion interface right is that you've got this now a tested chain. That's not mutable from each provider saying what data they have, and that means you can do a dispute process where, if the index responds to a query either saying the provider doesn't have it when the provider does, or vice versa, saying a provider has something that they don't.

A

You've got a process where you can show the a tested record from the provider and the response from the network index and say: look these don't correspond and you can figure out who is lying or who has provided incorrect data based on the the underlying provider records from the provider, and so that means you can now introduce a whole set of incentive games uh where you you make it um irrational for a network index or to not provide the the standard view of the network um so that moves us closer towards an untrusted or supporting uh potentially byzantine indexers, where the the indexer itself can be malicious can provide incorrect information because you've got a dispute resolution process for those who have grievances we're trying not to specify fully a protocol for moving towards untrusted uh either.

A

You know how how would you have sort of hierarchies where, where people who are offering network resources end up slotted into this, because that will likely depend and be interrelated with a larger set of incentives that need to happen with retrieval that are still sort of evolving right.

A

So you could imagine schemes, for instance, where, when I do a retrieval, there's some sort of finders fee where the indexing subsystem or the set of indexers that help me match make to a given provider, gets a percent or of the retrieval cost, or get some get some fee for doing that. Finding and maybe there's you know indexes that are fast, but only have some of the content and they charge less of a fee and then there's other more expensive to run indexes that have all of the data but charge a higher fee.

A

So there's there's potentially some sort of market here and before we've fully understand the set of constraints and incentives that we would that are useful in uh structuring that market. We don't want to sort of overspecify it so we're expecting that evolves and that the right thing in this iteration to get right is those two interfaces so that we can then iterate and change. What component actually exists as that networking interface without having to keep updating the protocol that either the providers or the clients are speaking so to finish off um we're working on indexing.

A

You can see the code, the the network indexer, that store the index, there's a reference provider at index reference provider and this work. um You know I'm not doing this work, but the the whole data systems team that you see in these pictures below uh is really the ones to credit for this. So with that, uh thank you. I'm happy to take questions on slack uh or or uh through any of these. Various protocol, labs uh messaging channels- um or you can reach me at will scott. Thank you.

A