From YouTube: RETMKT Builders - Retrieval Market Indexing
Description
Will talks about content indexers at Retrieval Market Builders Mini-Summit in April 2021.
Rowell just showed us a lot about what we're seeing happen to Lotus as a process, and a lot of what motivates that is that there are new capabilities we want to add around making a Filecoin miner more valuable, in terms of, you know, enabling this retrieval market. One of those is indexing. What we mean when we say indexing is that it's important for a miner to know what content it has and to be able to get that content back out.
So right now, content is stored in Filecoin in pieces and deals. These are, you know, an eight-gig or a 32-gig CAR archive: a bunch of individual CID-addressed data items that we refer to by the hash of the overall thing. What's hard to know is which individual CIDs, which pieces of content, are inside those archives, and how to map an individual piece of content that someone may want later back to which miner has it, and which piece, which archive, that individual piece of content is in.
So this problem of content routing, routing from a CID to the entity that has it, is sort of what IPFS is already solving, right: it is the content routing problem. But one of the things we think is the useful next step to enabling this in the context of the Filecoin network is to have some interaction between an indexer, some external node that is going to help us with content routing, and the miners, the nodes that have the content. We think this extension, where we define an interface for a node with content that accumulates new content and notifies some external party that can then learn what the index and the available content are, is one that both isn't and doesn't need to be specific to Filecoin.
We can make this interface more general, so that existing services that hold a lot of content-addressed data may also fit the same model: in some sort of checkpointed way, they get more content that they want to make available in a content-addressed network setting. That interface probably looks pretty similar to a miner's lifecycle, with content coming and going in batches.
So we've got two sides of this interface and interaction that we need to think about: what we're going to ask for on the miner side, and then what this indexing process is that flows up to clients. The clients we're thinking about here specifically are IPFS clients: how do we make this content available to them, and served more generally in a useful, low-latency manner, enabling the CDN-style use case?
(Oh, that's a little bit small; it really doesn't want to make that sidebar smaller.) So we're imagining that there is some sort of indexing service that lives on a miner, and that as the market completes a deal, that completion is probably the trigger for the indexing service to make the new content available.
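As a sketch of the hook just described, where deal completion triggers the indexing service, the shape might look like the following. This is illustrative, not the actual Lotus API: `Cid`, `CompletedDeal`, and `IndexingService` are invented stand-ins (a real implementation would use `github.com/ipfs/go-cid` and the markets module's own types).

```go
package main

import "fmt"

// Cid is a stand-in for a real content identifier; a plain string
// keeps this sketch self-contained.
type Cid string

// CompletedDeal is a hypothetical summary of a storage deal the
// markets process has just finished.
type CompletedDeal struct {
	PieceCid Cid   // hash of the overall piece (the CAR archive)
	Blocks   []Cid // the individual CIDs stored inside it
}

// IndexingService is the hook the talk describes: deal completion
// triggers making the new content available for indexing.
type IndexingService interface {
	OnDealComplete(d CompletedDeal) error
}

// memIndexer records which piece each CID lives in.
type memIndexer struct {
	byCid map[Cid]Cid // block CID -> piece CID
}

func (m *memIndexer) OnDealComplete(d CompletedDeal) error {
	for _, c := range d.Blocks {
		m.byCid[c] = d.PieceCid
	}
	return nil
}

func main() {
	idx := &memIndexer{byCid: map[Cid]Cid{}}
	idx.OnDealComplete(CompletedDeal{
		PieceCid: "piece-1",
		Blocks:   []Cid{"bafy-a", "bafy-b"},
	})
	fmt.Println(idx.byCid["bafy-a"]) // which piece holds this block
}
```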
What does the indexing service need to do? Well, it needs to generate an index. And what is an index? At some level, it's just the list of all of the CIDs that are in that deal, and we have tools that can go over a CAR archive, which is how a deal is currently stored. We could imagine that with a different proof of replication the storage looks different, but it is still, you know, a bunch of content-addressable pieces referenced by CIDs.
So if an indexer, or the rest of the network, wants to ask "hey, what content do you have?", is the answer really that list of CIDs? The motivating example here is that a lot of the data we have is in UnixFS format, and UnixFS has two properties that are sort of interesting.
One is that if you've got a large file, that file is spread across multiple items, right: individual blocks of data are limited in size, so a large contiguous file, anything more than a few megs, is split into many blocks.
And so right now what we're imagining is: there's probably some baseline of just the undifferentiated list of CIDs. That may turn out to be too big, and so we may ask the miner, who knows this structure or can apply some heuristics, to also be able to provide alternative lists or sub-lists, like a semantic index: the list of files, in case it's UnixFS, or for other data formats that we find to be commonly used where it's valuable to do that, where you do need to know some semantics about the actual data in order to do efficient pruning.
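To make that pruning concrete: if the miner knows the UnixFS structure, it can offer a semantic sub-list of file roots instead of the undifferentiated list of every block. A toy sketch, with all structure and names invented for illustration:

```go
package main

import "fmt"

type Cid string

// file is a hypothetical view of one UnixFS file in a deal: a root
// CID plus the leaf blocks the large file was chunked into.
type file struct {
	Root   Cid
	Blocks []Cid
}

// flatList is the baseline answer: every CID in the deal.
func flatList(files []file) []Cid {
	var out []Cid
	for _, f := range files {
		out = append(out, f.Root)
		out = append(out, f.Blocks...)
	}
	return out
}

// semanticList prunes to just the file roots, which is what a client
// asking "what files do you have?" actually needs.
func semanticList(files []file) []Cid {
	var out []Cid
	for _, f := range files {
		out = append(out, f.Root)
	}
	return out
}

func main() {
	files := []file{
		{Root: "file-1", Blocks: []Cid{"b1", "b2", "b3"}},
		{Root: "file-2", Blocks: []Cid{"b4"}},
	}
	fmt.Println(len(flatList(files)), len(semanticList(files))) // 6 2
}
```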
Perhaps that needs to live on the miner, because it's unclear, given either the CID or the DAG of CIDs, that there's a good balance that can be provided at this interface that lets an indexer node do useful sub-selection on that list alone; it may need to be done someplace that has the data, which would be the miner. Okay, so you've got some service that now has this list of data. How do you get that data to an indexing node?
That node is then going to do the work of generating an overall index and being able to say: "oh, this CID, that's on this miner." Probably what that looks like is two pieces. First, there's a send of an advisory that new data is available: whenever a new deal finishes, or there's a new chunk of data items that have become available on the miner, it'll publish, probably on a gossipsub topic that we agree upon.
There's a message with a new root and a new list of CIDs that have become available, and that will trigger indexers who are watching that miner to then go and pull that set of CIDs from it. We can imagine that pull happening over, you know, an IPFS-like thing, where these are all IPLD data: a new CID gets put out, and the indexers fetch that CID over a connection to the miner, or a connection to this index process that maybe exports itself over IPFS, and get back the list of new CIDs that need to be indexed. And so we started writing up that interface a little bit in terms of how we think about these.
So what is the advertisement that happens over pubsub? You have a CID that points to sort of a manifest of this new list of data that exists, along with the previous advertisement and along with how to connect to you, and then you sign it. This means that miners get one global view of what content they are making available for indexing.
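A rough shape for that signed advertisement, as described: a link to the manifest of new entries, a link to the previous advertisement (forming a chain that gives everyone the same global view), and the provider's addresses, all signed. The field names and the trivial "signature" here are placeholders; a real version would use IPLD encoding and the miner's key.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

type Cid string

// Advertisement chains each batch of newly available content to the
// previous batch, so indexers can walk the whole history.
type Advertisement struct {
	Entries   Cid      // manifest of the new list of CIDs
	Previous  Cid      // previous advertisement ("" for the first)
	Addresses []string // how to connect to the provider
	Signature [32]byte // placeholder: real ads are signed with the miner key
}

// sign is a stand-in for a real signature: just a hash over the links.
func sign(a *Advertisement) {
	a.Signature = sha256.Sum256([]byte(string(a.Entries) + string(a.Previous)))
}

// chainLength walks Previous links back to the first advertisement.
func chainLength(head Cid, ads map[Cid]Advertisement) int {
	n := 0
	for head != "" {
		n++
		head = ads[head].Previous
	}
	return n
}

func main() {
	ads := map[Cid]Advertisement{}
	a1 := Advertisement{Entries: "batch-1"}
	sign(&a1)
	ads["ad-1"] = a1
	a2 := Advertisement{Entries: "batch-2", Previous: "ad-1"}
	sign(&a2)
	ads["ad-2"] = a2
	fmt.Println(chainLength("ad-2", ads)) // 2
}
```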
That helps us have a consistent view that we are able to enforce, and to make sure that if miners advertise data as available, they have to advertise it as available to everyone. Once we have those semantics, we can ensure the availability of an open marketplace: you don't have, at least through this process, miners that claim they'll only make their data available to their partnering CDN; rather, any indexer node would be able to find miners and see and validate the same index. Okay.
So this hopefully starts to sound like a plausible interface for something we might ask for as a way to now have these lists of what content is available, and we could imagine aggregating that. I've sort of been purposefully vague about what an indexer node actually is doing, and whether an indexer node is even a single logical entity, or whether this is, you know, multiple decentralized ones, or sharded.
Likely, the first version is that we make something logically centralized, wait for it to fall over in terms of its ability to scale, and then figure out how to shard it, as a way to get to an MVP. And you could also imagine that there are multiple of these that different people run, as a secondary thing.
Okay, but then there's also the subsequent question of the second side of this interface. You've got these indexer nodes that now have big indexes of a lot of CIDs that are available within Filecoin, and of which miners and which pieces they reside in, and we still have the problem of: you've got a bunch of clients that want to do that lookup of "I've got a CID and I would like to know who has it." This is a problem that IPFS already solves via an API called content routing, and what that actually looks like right now is the method FindProviders, where you give a CID and you get back a peer's AddrInfo, so you learn which peer has that content. This is the interface that currently is exported by the DHT and by other content routing modules within IPFS.
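The shape of that call can be sketched as below. The real libp2p interface is `routing.ContentRouting` with `FindProvidersAsync` returning a channel of `peer.AddrInfo`; the placeholder types here keep the sketch self-contained.

```go
package main

import "fmt"

type Cid string

// AddrInfo stands in for libp2p's peer.AddrInfo: a peer ID plus the
// multiaddrs it can be reached at.
type AddrInfo struct {
	ID    string
	Addrs []string
}

// ContentRouting mirrors the find-provider half of content routing:
// give a CID, get back who has it.
type ContentRouting interface {
	FindProviders(c Cid) ([]AddrInfo, error)
}

// tableRouter answers from a static table, the way a DHT or an
// indexer node would answer from its records.
type tableRouter struct {
	providers map[Cid][]AddrInfo
}

func (t tableRouter) FindProviders(c Cid) ([]AddrInfo, error) {
	return t.providers[c], nil
}

func main() {
	var r ContentRouting = tableRouter{providers: map[Cid][]AddrInfo{
		"bafy-x": {{ID: "12D3Koo-miner", Addrs: []string{"/ip4/1.2.3.4/tcp/1347"}}},
	}}
	peers, _ := r.FindProviders("bafy-x")
	fmt.Println(peers[0].ID) // 12D3Koo-miner
}
```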
So the relevant thing here is that what we actually probably want to return is not just a peer AddrInfo, but something that's a little bit more extensible, so that we can do things like provide a record that says: well, this miner has this data, and here is some other metadata that we start to really care about, like which piece it's in, and maybe other metadata that's relevant so that the miner can efficiently get it for you. These start to look a lot like the smart records that Petar described to us yesterday, which is what we want to be able to have: the shape that we get back is not simply a peer, but a record.
That record is this sort of extensible view of what we know about the CID, right, because the record that I get back is actually likely to have multiple miners, multiple options of miners that all have that piece. So you've got multiple miner-piece records, and then you've got some extensibility of the protocol, where you're able to say: and you should get to it via this protocol.
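That more extensible record could look like the following sketch: several miner/piece options for one CID, plus metadata naming the protocol the client should use to fetch. Every field name and value here is illustrative, not a defined format.

```go
package main

import "fmt"

type Cid string

// MinerPiece says which miner holds the CID and inside which piece,
// plus metadata that helps the miner serve it efficiently.
type MinerPiece struct {
	Miner    string
	Piece    Cid
	Metadata map[string]string
}

// ProviderRecord is the extensible answer: possibly several miners
// all holding the content, and how a client should retrieve it.
type ProviderRecord struct {
	Target    Cid
	Providers []MinerPiece
	Protocol  string // e.g. a graphsync/filecoin identifier (illustrative)
}

func main() {
	rec := ProviderRecord{
		Target: "bafy-x",
		Providers: []MinerPiece{
			{Miner: "f01234", Piece: "piece-7"},
			{Miner: "f05678", Piece: "piece-9"},
		},
		Protocol: "graphsync-filecoin-v1",
	}
	fmt.Println(len(rec.Providers)) // 2 miners to choose between
}
```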
There's a bunch of interop, sort of path-of-least-resistance stuff, that's still getting figured out in terms of how we enable that actual subsequent fetch. Okay, so an IPFS node gets this record, potentially from an indexer node, and then how does it know how to actually get the data out of a miner? There are a few different ways, and we'll see how hard they end up being relative to each other.
With the current GraphSync / go-data-transfer process that exists in the markets part of Filecoin miners, the process happens in a few steps: first, you do a retrieval deal on chain and ask for that data, and that results in a voucher; you use that voucher to then initiate the go-data-transfer session, over which you transfer the data, or some subset of the data that's stored.
You can pass in an empty voucher here and just go straight ahead to data transfer and see if it works, and for miners that allow free deals that may be sufficient, so you may be able to do this without popping back up. Once it is a paid deal, then you need to, you know, do this Filecoin-specific process next of actually paying for it, or delegating to someone else, like your content provider, the app, the dapp, or something that's going to be willing to pay for it, and that starts to get somewhat specific.
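The step-down just described, trying data transfer with an empty voucher first and only falling back to the paid-deal flow when the miner refuses, can be sketched as control flow. Everything below is a stand-in: real code goes through go-data-transfer and the retrieval market client, and these function names are invented.

```go
package main

import (
	"errors"
	"fmt"
)

type Cid string

var errPaymentRequired = errors.New("payment required")

// transfer stands in for a data-transfer pull with the given voucher;
// freeOK models whether this miner allows free retrievals.
func transfer(c Cid, voucher string, freeOK bool) ([]byte, error) {
	if voucher == "" && !freeOK {
		return nil, errPaymentRequired
	}
	return []byte("data for " + string(c)), nil
}

// makeRetrievalDeal stands in for the on-chain retrieval deal that
// yields a voucher to pay with.
func makeRetrievalDeal(c Cid) string { return "voucher-" + string(c) }

// fetch tries the free path first, then falls back to a paid deal.
func fetch(c Cid, freeOK bool) ([]byte, error) {
	if data, err := transfer(c, "", freeOK); err == nil {
		return data, nil
	}
	return transfer(c, makeRetrievalDeal(c), freeOK)
}

func main() {
	free, _ := fetch("bafy-a", true)
	paid, _ := fetch("bafy-b", false)
	fmt.Println(string(free), "|", string(paid))
}
```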
And so we need to think about how we allow that to get bubbled up in an IPFS client, so that it can delegate out to, you know, a Lotus plug-in that can make the deal, and provide a way to allow for extensible protocol handlers at the IPFS layer.
But if we have some point of extensibility, both in what records look like and also in protocols, where, if it isn't built into IPFS, there's a place to plug in on the client, we think we can get from the record, and from this different interface of finding providers, to getting the data and making it available on gateways, on go-ipfs nodes, on embedded nodes, and to the bunch of people who want CIDs.
The other thing that we're going to have to figure out is how the IPFS node knows to ask an indexer node. If we start with a limited number of these indexer nodes initially, what does that initial arrow look like, where we find the right indexer nodes and make use of them? That also likely goes through a few iterations. We could imagine that you do sort of an initial delegated content routing.
This is something that JavaScript IPFS nodes already do, where they delegate their find-provider requests over to a companion go node that they know, but you could imagine writing that same implementation of content routing in go-ipfs so that, for instance, if there is a known indexer, the gateway could just delegate to an indexer that sits very close to it, so that it's low latency. So that's one option for an initial version. You could also imagine that the thing that happens afterwards, as you begin to decentralize this, is that you have an advertisement of indexers, so indexing services.
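Delegated content routing, as JS-IPFS does today, just means implementing the find-providers call by forwarding the query to a known remote node. A sketch, with the actual RPC to the indexer elided behind a function value (the URL and names are made up):

```go
package main

import "fmt"

type Cid string

type AddrInfo struct {
	ID string
}

// delegatedRouter satisfies find-providers by forwarding the query to
// a nearby indexer instead of walking the DHT itself.
type delegatedRouter struct {
	indexerURL string
	// query stands in for the actual RPC call to the indexer.
	query func(url string, c Cid) []AddrInfo
}

func (d delegatedRouter) FindProviders(c Cid) []AddrInfo {
	return d.query(d.indexerURL, c)
}

func main() {
	r := delegatedRouter{
		indexerURL: "https://indexer.example",
		query: func(url string, c Cid) []AddrInfo {
			// fake indexer: everything is on one miner
			return []AddrInfo{{ID: "f01234"}}
		},
	}
	fmt.Println(r.FindProviders("bafy-x")[0].ID) // f01234
}
```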
They can, you know, announce or provide themselves as service providers of indexing, and then nodes keep track of who they've seen as potential indexers, keep track of how fast and how reliable these indexers are, and use that as a way of prioritizing who they ask for indexing queries. So there is, I think, naturally going to be a market for this indexing: you want an indexer who is close to you, and there's a trade-off within the indexing itself.
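Prioritizing indexers by observed speed and reliability, as described, might be as simple as keeping a score per indexer and asking the best-scoring one first. An illustrative sketch; the scoring formula is an assumption, not anything specified in the talk:

```go
package main

import (
	"fmt"
	"sort"
)

// indexerStats tracks what a node has observed about one indexer.
type indexerStats struct {
	name      string
	latencyMs float64 // smoothed response latency
	successes int
	failures  int
}

// score rewards high reliability and low latency.
func (s indexerStats) score() float64 {
	total := s.successes + s.failures
	if total == 0 {
		return 0
	}
	reliability := float64(s.successes) / float64(total)
	return reliability * 1000 / (s.latencyMs + 1)
}

// rank sorts indexers best-first by score.
func rank(stats []indexerStats) []indexerStats {
	sort.Slice(stats, func(i, j int) bool { return stats[i].score() > stats[j].score() })
	return stats
}

func main() {
	ranked := rank([]indexerStats{
		{name: "far-but-reliable", latencyMs: 200, successes: 99, failures: 1},
		{name: "near-and-reliable", latencyMs: 20, successes: 98, failures: 2},
		{name: "flaky", latencyMs: 15, successes: 50, failures: 50},
	})
	fmt.Println(ranked[0].name) // near-and-reliable
}
```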
Are you going to prune down to a smaller index of head queries, the popular CIDs that are getting asked for a lot, where you can respond to 80% of the queries in a relatively small number of milliseconds? Or are you going to keep the much larger list of all possible CIDs, so that you can answer the longer tail of rarer queries, which will be a more expensive lookup with higher latency?
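That head-versus-tail trade-off can be sketched as a two-tier lookup: a small hot cache of popular CIDs answered immediately, backed by the full, slower index. The structure is illustrative:

```go
package main

import "fmt"

type Cid string

// twoTierIndex answers popular ("head") queries from a small cache
// and falls back to the full index for the long tail.
type twoTierIndex struct {
	hot  map[Cid]string // popular CIDs -> miner, fast path
	full map[Cid]string // everything, modeling the expensive lookup
}

// lookup reports the miner and whether the fast path answered.
func (t twoTierIndex) lookup(c Cid) (miner string, fromHot bool) {
	if m, ok := t.hot[c]; ok {
		return m, true
	}
	return t.full[c], false
}

func main() {
	idx := twoTierIndex{
		hot:  map[Cid]string{"popular": "f01234"},
		full: map[Cid]string{"popular": "f01234", "rare": "f09999"},
	}
	m1, fast1 := idx.lookup("popular")
	m2, fast2 := idx.lookup("rare")
	fmt.Println(m1, fast1, m2, fast2) // f01234 true f09999 false
}
```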
So there's a couple of places of potential differentiation there in indexing. I think we also still need to figure out, you know: is the story going to be that these indexing nodes get remunerated directly in this protocol? There are two places where you could imagine retrieval mining and indexing playing into the overall economic game.
One is that a client could potentially, you know, pay somehow, or, just based on usage, on the number of queries that go into the network, resources could somehow get allocated to the indexing nodes. The other is for deals that are made off of an index pull.
The indexer could get some small commission on those, so that it goes within the existing sort of payment channel mechanism. Basically, there's like a finder's-fee type thing, where the indexer node that did help make that deal happen somehow gets tagged in the record that it returns, so that when the miner does its payment channel settlement, the miner sort of tips the indexer for bringing in the traffic. [Moderator: Will, quick time check, can you wrap it up in one minute?] Yeah.
I think I'm basically done talking about these two protocols. This is, you know, sort of a bit more future-looking, but I think it gives you a sense of what the hopefully first concrete steps look like in terms of getting to something that works, and then a bunch of the things we're imagining. We would certainly like feedback in terms of other constraints we need to be thinking about, and other ways that you all in particular are imagining plugging in, or would have to interact with this, or things that sound unrealistic.
So there is a breakout session in about, I don't know, an hour, something like that, where I'm happy to talk with you all more about the details of what will actually happen here. And with that, I will turn it over to Hannah.