Description
How to be an index provider - presented by @masih at IPFS Thing 2022 - Content Routing 1: Performance - https://2022.ipfs-thing.io
A: Right, hello everyone. I'm Masih, from the Bedrock team, the team responsible for building and expanding the network indexer. I'm here to talk to you about how to become an index provider. Sorry about the typo on this slide.
What is an index provider? This concept has been mentioned a few times, so what is it really? An index provider is nothing but a content provider that does two things. One, it keeps an index of all the multihashes it contains and shares that list of multihashes with the network, so it tells everybody what multihashes it has. Two, it tells everyone how the content can be retrieved. That is the key difference between a content provider and an index provider: an index provider is a content provider that tells everybody about its multihashes and teaches them how to retrieve them.
So what is the general overall process of providing indices? First, we have some content. That content goes into a process of generating advertisements, which I'll get to in a minute. The advertisements that are generated are stored on a local file system. Advertisements themselves are nothing but content: they're all addressable. Then the content provider makes an announcement to the network saying, hey, I've got this stuff. From now on I'm going to dive deeper into each of these components and talk about how they work.
What are they made of? The first thing I'm going to talk about is what an advertisement is. An advertisement is basically a piece of information that contains: a link to the previous advertisement (I'll come back to that in a minute); the identity of the provider, in terms of peer ID; and addresses, i.e. where you can contact this provider, so you can see how you could construct something like an AddrInfo from the provider addresses. It also has a signature, which verifies that it's actually the provider that produced the record.
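The fields just described can be pictured as a small struct. This is a sketch, not the wire format: the field names are illustrative, and the authoritative shape is the IPLD schema linked on the slide.

```go
package main

import "fmt"

// Illustrative sketch of the advertisement fields described above.
// Names are ours; the IPLD schema is the source of truth.
type Advertisement struct {
	PreviousID *string  // link (CID) to the previous advertisement; nil at the chain's start
	Provider   string   // peer ID of the provider
	Addresses  []string // multiaddrs where the provider can be contacted
	Signature  []byte   // signs the record so consumers can verify its origin
}

func main() {
	ad := Advertisement{
		Provider:  "12D3KooWexample", // hypothetical peer ID
		Addresses: []string{"/ip4/203.0.113.7/tcp/24001"},
	}
	fmt.Println(ad.PreviousID == nil) // no previous link yet: prints true
}
```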
We try to use IPLD wherever we can, and everything we talk about has IPLD schemas. You'll find links at the bottom of the screen that point you to the IPLD schemas, so you can have a look at those and understand how they work.
So how do these advertisements connect to each other? Here we have a picture of an advertisement chain, which I think Andrew had earlier in his slides. As you can see, you have an advertisement with the list of fields that I mentioned earlier. It has a link to the previous advertisement; that link may or may not be present, and its absence means we have reached the end of the chain of advertisements. It also has a link to entries. Entries are the actual thing that contains the list of multihashes we're trying to advertise as a content provider, and entries themselves can form a chain. I'll go deeper into entries in a minute and explain the different types of entries.
When it comes to entries, we have two kinds. One is what we call the entry chunk, which is basically an array of multihash bytes; remember, multihashes are nothing but bytes, just a multicodec code and a digest. And it has a link to the next chunk.
In the entry chunk type you can see how multihashes are contained in a single message. The point of having the next link is that at some point you will hit the limits of the message size, for example. So providing a next link is a way for us to chunk these: a way to basically impose a pagination mechanism on top of IPLD data.
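The chunk-plus-next-link pagination described above can be sketched in a few lines. This is an in-memory illustration only; in the real structure the next link is a CID, not a pointer, and the names are ours.

```go
package main

import "fmt"

// Illustrative sketch of the EntryChunk idea: a chunk holds raw
// multihash bytes plus an optional link to the next chunk.
type EntryChunk struct {
	Entries [][]byte    // each entry is a multihash: multicodec code + digest bytes
	Next    *EntryChunk // nil when this is the last chunk
}

// chunkEntries paginates a flat multihash list into linked chunks of at
// most size entries, mimicking how large lists stay under message limits.
func chunkEntries(mhs [][]byte, size int) *EntryChunk {
	if len(mhs) == 0 {
		return nil
	}
	end := size
	if end > len(mhs) {
		end = len(mhs)
	}
	return &EntryChunk{Entries: mhs[:end], Next: chunkEntries(mhs[end:], size)}
}

func main() {
	mhs := [][]byte{{0x12, 0x01}, {0x12, 0x02}, {0x12, 0x03}}
	head := chunkEntries(mhs, 2)
	fmt.Println(len(head.Entries), head.Next != nil, head.Next.Next == nil) // prints: 2 true true
}
```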
The other type of entries we could have is HAMT, and HAMT is an advanced data layout (ADL) in IPLD. It's effectively a prefix trie: a way for you to define a map. This is a very recent addition; you can find the specification of HAMT at the link below. So now I'm going to talk about the difference: why would you choose entry chunk versus HAMT?
When we started building the index provider, we started with entry chunks. It's nice and simple: you have just a list of multihashes, all chained together. Great, right? It is really easy to understand, but it comes with a problem: for example, when you want to divide up multihashes across multiple shards or multiple nodes. This connects to what Will mentioned earlier in terms of next steps, decentralizing the network indexer; for a distributed network indexer, you really can't use entry chunks.
So we wanted a simpler way to slice and dice multihashes, and a way to quickly find out which multihashes are under a specific link that I, as an indexer, am responsible for. That's where HAMT comes in. The way HAMT is used in the index provider is actually as a set. HAMT itself is a map, so we use it as a set where the key is the multihash and the value is always set to something simple like true, and the entire set of multihashes is basically sorted by prefix.
A
You
have
a
prefix
tree.
You
have
a
really
nice
and
efficient
way
of
finding
out
which
multi-ashes
exist
in
a
in
a
link,
but
it
does
have
these
disadvantages.
It's
a
bit
more
complicated
to
work
with
it's
all
built
on
very,
very
new
and
fresh
implementations
of
ipld
hands,
which
are
still
under
development,
but
it
has
a
huge
potential
and
opens
up
a
huge
array
of
design
decisions
for
us
to
make
in
the
future.
If
we
have
a
humped
based
entry
structure,.
I touched on metadata, so what is metadata in an advertisement? If you remember, I started by saying an index provider is a content provider that tells everybody about the multihashes it stores and also tells them how the content can be retrieved. Metadata is the thing that captures how the content can be retrieved. The metadata itself is again designed to be extremely extensible.
The only structure it has is that it starts with a protocol ID, followed by optional bytes after the protocol ID, which can define whatever protocol you would like. So today you could invent your own special way of fetching information and have your own metadata; there's nothing stopping you from that. You can define it. There are two specific metadata types that are defined today.
One is transport over Bitswap; you can see links to the multicodec CSV table there, and you can find these codes inside the CSV table. The second one is graphsync for Filecoin v1 data. The first protocol ID, Bitswap, really doesn't have any bytes after it, because as soon as you know an endpoint supports Bitswap, the rest is simple: you just ask for CIDs and you get blocks back. It's extremely simple.
Here I'm showing you the IPLD schema for the structure of the bytes inside the metadata for advertisements that support graphsync Filecoin v1 retrieval. As you can see, it has things like the piece CID, whether it's a verified deal or not, and whether there is fast retrieval or not. This metadata then makes sense to something like Lotus and Boost, and that would enable you to retrieve the content.
There are links at the bottom that point to the IPLD schema and the Go package documentation; you can have a look at those. So we've talked about advertising content. How do we tell people, hey, I no longer have this? Or what happens if my address changes? Because I've told people, hey, I have this multihash and you can get it from this address.
What should I do if my address changes? The structure of the advertisement supports two specific fields that allow you to modify advertised content. One is the context ID; the other is a Boolean field called IsRm ("is removed"). The context ID is basically a unique identifier that, together with the provider identity, identifies the metadata; you can think of it as a way of grouping multihashes together. I think Andrew touched on this earlier.
Imagine you want to remove a whole bunch of multihashes that you've advertised. You really don't want to advertise those multihashes again just to say, remove these multihashes. Instead, you can tag them, if you like, with a context ID, and then you just say, hey, remove context ID X, and all the multihashes associated with that context ID are going to be removed. IsRm, like I mentioned, is a field inside the advertisement, and whether it's set to true defines whether the content is being added or removed.
In a case where you would like to change the address and metadata associated with content without removing it: again, we don't want to advertise all the multihashes again just to do that. So there is a specific no-content CID that you can use as the link to entries, and then you simply include the context ID with the new address and new metadata. As soon as the advertisement is published, the indexer node will change the information associated with that context ID.
So far we've talked about generating advertisements: the actual fundamental data structure we need to produce as an index provider. Now we're going to talk about how you tell the network about it, so that an indexer node can then come and ingest the information, make it available to the world, and make it actually findable. The point where we tell the network is called an announcement. An announcement basically includes two things. One is the head of the advertisement chain.
If you remember, I showed you the advertisement chain, which is interlinked; the head of the chain means the latest advertisement that exists in the chain. The second thing is the publisher address, as in: where can I retrieve these advertisements from? The announcements themselves can be made over two different channels. You can either use gossipsub to publish this information, or you can use explicit HTTP PUT requests to the indexer via its announce URL.
Examples of providers that use the different avenues here: Lotus and Boost are using the gossipsub network to disseminate information about these announcements, whereas something like NFT.Storage makes explicit PUTs to the indexer to notify it: hey, I've got a new advertisement, come get it.
There's one subtle point here I wanted to call out. There's this concept of a publisher, and the publisher can be different from the content provider, or the index provider if you like. That basically means you can delegate publication of advertisements to a separate process, which can be a powerful, yet potentially complicated, concept that allows you to scale your system.
If the thing that is providing content doesn't have the capacity, or wants to isolate the task of publishing advertisements into a separate process, you can do that. But it comes with a specific clause that hopefully we will make less important later on: for now, the IDs need to be explicitly whitelisted.
If the publisher and the index provider use different identities, then because advertisements carry a signature, it means a totally new chain: you're creating a different chain of advertisements, which is something to consider. In a typical case you can imagine identities being shared, and that way the chain of advertisements produced by the publisher and by the actual provider would be identical.
So what are the interfaces for an index provider, i.e. how does a network indexer connect to the index provider to ingest information? There are actually two different interfaces. One is HTTP, which is extremely simple; the other is graphsync. On the HTTP side, you simply have an endpoint that exposes the head advertisement, and an endpoint that, given a CID, returns the block associated with that CID in the form of JSON.
On the graphsync side, you can have a combination of gossipsub or HTTP for publication of the announcements that I talked about, and a good old-fashioned graphsync server that allows you to simply fetch the blocks associated with CIDs. The links you see on the right-hand side are the libraries that implement the graphsync interface that you can use; again, Andrew kindly pointed at these in his presentation earlier.
On to implementations. There is currently one implementation of index provider; it's written in Go. It is written to be two things at once, and there are historical reasons for that. One is a standalone index provider: you can use it to expose an endpoint and give it CAR files, and it will then publish to the network and say, hey, I have the content in this CAR file. We initially built that to test the network indexer, but now it's available for anybody who would like to use it.
The other side is an SDK that allows you to embed an index provider inside your Go application and basically build your own thing. This library is used by Filecoin in Boost and Lotus, and there's a whole bunch of other clients using it. The URL at the bottom is where you can find more information. I also wanted to quickly mention a Rust implementation of the HTTP interface, written by Marco sitting over there, which is excellent.
Anything works as long as it follows the network indexer protocol. The things you can do with the provider CLI: you can list advertisements from a provider given its multiaddr, you can verify ingestion, and you can use it to run an index provider in a standalone mode, called daemon, which takes CAR files and publishes them to the network.
I wanted to dive a little bit deeper into this tool. Having written an index provider, no matter in what language, whether with the index-provider library or not, how would you verify that it actually worked? You can use the provider CLI to verify that it is working: you can list the advertisements that exist on a provider.
Here you see an example of the command: ls ad lists advertisements from a provider multiaddr, and the output shows the CID of the advertisement, the CID of the previous advertisement, the ID of the provider, the addresses that are included in the advertisement, and whether it is removed or not. In the background it runs a process that goes and actually fetches all the entries in the advertisement.
How would you verify that a published advertisement is actually ingested by an indexer? Again, the provider CLI gives you a tool that allows you to verify ingestion. The command you see here is verifying the advertised multihashes from a provider against cid.contact, which so happens to be the endpoint of a network indexer. Today it only recurses once.
You can recursively walk the chain of advertisements to find out which multihashes exist. The sampling flag that you see there randomly samples 10 percent of the multihashes: you might not want to verify the ingestion of every single multihash, so it supports random sampling. The peer ID at the end tells the CLI what the expected client peer ID is for any index records it finds in the indexer. And then, as you can see, the output shows you how many multihashes were verified.
It also shows you how many were not indexed. The output is actually quite long: it gives you things like a numerical distribution over the number of multihashes and things like that if you run it recursively, but I haven't included it here; please go and have a play with it. As for sources, you can actually get multihashes from different places. If you think about it, the advertisement chain is just one source of a list of multihashes; a list of multihashes could also come from a CAR file, or from just a detached CAR index. All three of those are supported by the provider CLI: you can just point it at a source of multihashes and it goes, extracts the multihashes, and verifies ingestion. Now, a little bit of stats on cid.contact, because this is a question that came up earlier when Andrew was giving his talk.
Right now we have 172 providers on cid.contact, about 26 of them are Filecoin providers, and in the last seven days we have ingested about 5 billion multihashes.
You can see the list of all the providers that exist on cid.contact using that URL, and what you'll see is things like the CID of the latest advertisement that was processed, when it was processed, what the peer ID is, and so on. It shows you the entire list of all the providers that exist today.
So what are the next steps for index providers? Like I mentioned, the HAMT work is just at its infancy. It is implemented on the storetheindex side, so the indexer understands entries links that point to a HAMT, and it is also implemented on the index provider side, in that you can produce advertisements with a HAMT as the entries.
This also connects to the workstream around decentralizing the network indexer. As part of verifying the HAMT work, we're developing an indexer mirror, if you like. You can imagine that, because the chain of advertisements from a provider is completely open and public, I can technically download that chain, reshape and change it, and then republish it myself, or republish it identically, a bit like a CDN. So we're actually working on a tool in index-provider that allows you to mirror providers and potentially remap the advertisement chain into a HAMT, for example.
There's a whole bunch of open questions; I've trimmed them down. Will touched on a few, Andrew touched on a few, so there's a whole rich wealth of open and difficult questions, both on the network indexer and on the index provider. On the index provider specifically, you have things like: how long should a published advertisement remain servable by the index provider, and how long should it be discoverable, having been ingested by an indexer?
The other thing we want to talk about is what the limits should be on advertisements published by an index provider. Where should we say, hey, you are big enough to be two index providers, please, because it would be easier to fit you into a distributed, sharded network? You could have different ways of doing that, say by the depth of the advertisement chain, or by the amount of multihashes that are published. These are all techniques that we're thinking about, all relating to what I mentioned earlier.
Last but not least, everything you see here is the work of a team, more specifically Will and Andrew on the Bedrock team, and everyone willing to integrate all of this into their applications, as well as the rest of the PL network. So thank you all, and I'll take any questions that you might have.
B: [inaudible question]

A: Everybody stores their multihashes and lists of files differently, so how can we build it so that it is agnostic of, say, Lotus or Boost, agnostic of IPFS itself, and it just works? I would recommend looking at that example. If it's not a Go client, I'll point you to the documentation on storetheindex, which talks about what the protocols are in terms of providing indices. There are two: one is the IPLD schema, and you can find the link in my slides, and the other...
B: Marco already has providers publishing HAMTs, yeah.
C: I think the real question is, there are two options: there's native integration of this sort of protocol, or waiting for IPFS Reframe, and then we'll likely have a sidecar type thing where I could just publish to something that goes along with this through that Reframe portal, and then it ends up...
B
It
might
be
really
useful
to
just
get
get
into
the
habit
of
getting
other
people
to
write
other
tools
that
ingest
other
content
and
publish
it
through,
because
otherwise
people
will
learn
that
they
can
do
that
and
they'll
think
that
everything
is
really
cool
and
so
like,
even
probably
even
after
returning
lessons.
I
work
so
for
groups
running
large
scale
systems
separated.