IPFS IPFS þing 2022 - Content Routing 1: Performance, 9 Aug 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Network Indexer: storetheindex - @gammazero - Content Routing 1: Performance

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Hello, I'm andrew gillis, more commonly known as gamma zero in the uh various uh forums and github, and today I'm going to be talking about the network indexer, which is an important component of a content routing system. So let me share my screen and we'll get started so the network indexer. What is a network indexer? Well, a network indexer is a node that stores mappings of cids to provider data records.

A

uh This is um what allows us to find find where content can actually be retrieved from so and think of an indexer like a very specialized key value store. It has two primary groups of users: that's the storage providers and the retrieval clients.

A

Storage providers want to advertise their content by storing data in the indexer, the indexer that handles us with ingest logic. uh Retrieval clients want to query the indexer to find which storage providers have content and how to retrieve that content from the storage providers. So that's part of the indexer's find logic.

A

So how do these different user groups interact with an indexer?

A

So, let's just start with the basics. uh A storage deal is um is created by a storage client, so data is stored on a storage provider.

A

When a storage provider has that data, it's going to announce that it has new content, uh it does that by publishing um and the cid of a special record called an advertisement and it lets the indexers know that it has this this new content to be uh to be indexed.

A

This is normally published over a gossip pub sub.

A

um Usually through that mainnet nodes, I can also be published to indexers directly via http, so as the storage writer announces it that gets to the indexers and then the indexers want to sync that new content, so the sync portion of the of ingest means that we're going to go ahead and read the all of these.

A

The latest advertisement records from a storage provider and we're going to get the information that we want to index. That includes a context id metadata and all the multi hashes which map to that data. So we'll talk a bit more about what the ingest process involves in a bit, so once the indexer nodes, uh ingest the data and now they've in that they've actually indexed all of this content for the storage provider.

A

A client can then uh query the indexer to find where that content is and how to get it. So a client uh is going to issue a query for a cid or a multi-hash, and the indexer is going to look up the provider information for that for that cid and it's going to respond with one or more provider records if it has any to the client and say here are all the providers that provide this. This content and information about how to go retrieve that content from each of those.

A

Providers then the client can go to the storage provider and retrieve the content.

A

um It's going to part of the information in the in the record that I received from the indexer was information about what protocol to use so graph sync bit swap and then the client's going to send that provider record that it got from the indexer back to the storage provider, which allows the storage provider to look into whatever content is in that record, to use it to find the data that's being requested. That's maybe a like a deal id or maybe some internal record keys or whatever, whatever it may be.

A

So here it is all together um in one one picture uh just to see all the different uh interactions that uh that are happening with index or notes all right. So let's talk a little bit more about the ingest.

A

Just to give you an idea more about how indexing works.

A

So ingest really consists of two parts: uh the publish of uh the announcing of the availability of more content to index and then the sync which is actually where the indexer is pulling in that content and creating the the index uh for that content.

A

It's called sync because we're basically synchronizing up to the latest state in a chain of of advertisements which are the records of of content that are available.

A

So the first part is the publish a little more detail on that. The announce message is what gets uh broadcast out from the publisher to the indexer.

A

It's usually sent over gossip pub sub, but we can also send it via http. This is already built into lotus clients and it will be able to uh over gossip pub sub, send this to the mainnet nodes, which then relay uh that that publication to the indexers.

A

The indexers then get this announcement message which contains the cid of the advertisement that's being announced and along with the publisher's address, which is where to retrieve the advertisement record from and that allows them to go. Get that information. Indexers can also ignore uh publications. If they already happen to have the advertisement, they may have synced it from a previous from another publication or from a direct announcement or or from any number any number of different ways. They may have been notified, um so additional announcements don't cause additional work.

A

So note about the what the publisher is. We say: provider and publisher, but when talking about indexing, it's important to realize that the publisher can be different from the storage provider. So, specifically, the publisher is the entity that publishes the advertisement records, in other words the content that is being indexed generally, it's the same as the storage provider, but it does not have to be.

A

Other entities can publish on behalf of one or more storage providers and there's policies to control which publishers are allowed and publishers can sign advertisements uh if they're, creating they may create those advertisements on behalf of the storage provider and a publisher can sign those advertisements.

A

We'll talk about advertisement signing a bit more anyway. The sync process works like this. We have a chain of of advertisements that also have entries associated with them. This forms an ipld graph, so the ingestion reads: the chains uh from the latest. Most recent that's been announced all the way back to uh the either the end of the chain, or at least until the uh whatever advertisement the indexer has already ingested.

A

um So it reads the whatever part of the chain it doesn't have yet and then processes it uh in order, so from oldest to newest, so that we can apply the updates uh properly and come out with the correct state.

A

So advertisements are signed, including the links. uh What this does is create a blockchain like structure um and the uh and so the all of the advertisements and all of their their associated content. Hashes all become immutable and um and they're all signed as well. So we can verify. We have a a chain that we can verify um every uh every portion of and uh and and we have immutable data, then we can make sure we uh we can see that the proper signatures are applied.

A

um So I wanted to talk a little bit about context, id and metadata, which were mentioned earlier. So a context. Id is what uniquely identifies metadata in the provider record, so the metadata is the portion that says how to get the the content and um and then some opaque data, which is uh sent back to the storage provider. That tells a storage provider where to look it up like a deal id um internal record, key, etc.

A

So, by using a context id on the provider id, we have a globally unique id that in that tells us it gives us a piece of metadata for for the provider.

A

So the context id is what a provider uses to be able to update any of its metadata or delete that or delete its records. So the context id once once an advert has been published with the context id and we can publish a subsequent advertisement context id can be used to add multi hashes. It can be used to um to update the metadata and it can be used to remove all all of the data associated with the context id.

A

So when we update the metadata that means anytime, we query any of the multi hashes of pointing to that the provider record. The metadata that comes back it'll always have the updated metadata.

A

So let's talk a little more about how indexer stores data, so indexing content means well. What is it means to create an index, and that means taking the input of a provider record and it's associated all the associated multi hashes that are uh part of that data and inverting this to be able to have multi hashes, which then map back to that provider records. In other words, we can look up the provider record uh by using that multi-hash.

A

When I say multi-hash, it's synonymous with cid and multi-hash is just a cid without the codec portion, so we can look up the the provider record and it and a multi-cache can also refer to multiple provider records, provide a record.

A

Different providers can provide the same data um or maybe a provider has the same data in different deals, etc. So maybe available from from multiple places. When when we need do, we we do need to update the metadata. So how does new mixer refer to that? We don't want to have to do an update for every single multihash. That would be millions of multihashes potentially.

A

So this is where the context id comes in the context. Id is then used to uh look up the um or delete that provider record. So how do we? So? What's that actual mapping look like if we, if we need to refer to provider records both by metadata and provider id well, we get basically an indexer stores, a two-level mapping. So we have any multi-hash that maps two each multihatch maps to a list of provider keys and then each provider key maps to the individual provider record.

A

And by doing this we can now be able to index provider records by multi-hash or by the context id and, let's talk about, say, index or data sharing, so indexers are able to share data with each other. They can share the discovered providers and publishers or they can discover providers and publishers from other indexers.

A

So if I look at uh a provider's, I'm sorry an indexer's set of providers, I can configure any other indexer to go and retrieve the providers from it. So it can learn about them and then be able to exchange indexing information with those providers or those and those publishers.

A

Indexers can also republish http amount, http announcements to each other. So if you send it an announcement directly to a single index or via http, that indexer can then be configured to republish over gossip pub sub to any other indexers, and they may want to receive those announcements.

A

um And then that's x, indexers, like the main net nodes, can also act as gossip pub sub relays.

A

So what are the the next steps in indexing so talk about how it works? What are what? What are we going to be providing soon uh we're going to be doing uh advertisement chain snapshots, which is basically a mechanism of compressing advertisement chains?

A

uh This means that we that uh publishers can have a much can replace their chains with a very compressed form and eventually truncate those chains so that they don't have to keep the data all the way. Back to the beginning of their very existence.

A

We can archive chains using car files, and this will allow us to bootstrap new indexers that are coming online, so we want. We want to start up a new indexer.

A

Instead of having to read from each provider directly and ingest an entire chain of advertisements, we can read from an archive which may be available either from the provider or from some type of cdn, which we can then bootstrap our indexers we're also working on pro on sharding by provider, so that we can spread ingestion across multiple nodes as well as sharding, by multihash for storage and lookup across multiple nodes.

A

So uh when I say multiple nodes generally, this is going to be within a single deployment, because we'll want to have more indexer deployments for more local multiplies content routing. But we still want to be able to scale each of those deployments and we want to scale them by spreading a work across multiple nodes. And this is how, as a these, are two different ways. We divide up the work by uh ingestion and by uh storage.

A

So, just to give you some reference, we have our current indexer implementation is store. The index. The code is a publicly available on our github repo, the value store implementation.

A

That's the portion that actually stores the index content is implemented by go store, the hash which is being used because it's a highly efficient storage, and it gives us a lot of advantages to storing uh huge amounts of data very compactly, although we can replace the value store by any other, uh any other store that implements the the relatively simple uh index or core interface, so we've actually used pagreb. I think we've actually tested with some experiments with postgres, but we can hook it up to anything.

A

If you any any other storage that you want, um just you have to implement the index the interface, and um so we have also our another repo. That's worth mentioning is called go legs, it's uh it's. What provides the synchronization logic to keep indexers in sync with publishers and that's it for the? uh What does an indexer do? How does it work at a somewhat of a summary level and I'd like to uh ask if there's any questions or things that we want to discuss further at this point,.

B

One question: um can you talk a little bit about the scale that the extruder is dealing with right now, like how many records what fraction of the.

A

Network does that represent?

A

Yes, let me let me go ahead and.

A

So we have um about 26 of the uh the providers that we're currently indexing. We actually have stats that we're keeping on our dashboard here, and so we can uh point to some of these so about 26 of the of the providers and it's about 23 of the deals there's other. I think that's going to be discussed in other presentations in more detail.

A

So, yes, we have a number we'd like to keep adding a lot more providers and we'll be growing the number of providers and also the amount of data that each provider will be indexing. Of course, so we are still ramping up and you can actually look at the the size of the the index. This is um this. Is the current size we're about uh three and a half terabytes? uh Let me expand the time on this a little bit, so you can see a bit more linear graph.

A

Let's go so the this uh sawtooth is just the uh garbage collection, but you can see how it's been growing. Over the week, we've gone from around uh 3.2 to around 3.5, terabytes and so uh third to half of a terabyte per week is a pretty substantial growth rate. But we're actually still running this on a single node and um not having any problems with either in with either the storage or the um speed of the responses.

A

Anyway, not sure if that that probably answered or tried to explain a lot more in your question than me.

B

How can um can ipfs um large pinning services, and so on, connect to the elixir.

A

B

A

That the way that's done is so there's a what's called a reframe client, and so a reframe client is part of the content routing system and that's going to be what's responsible for working side by side with the dht and it's going to ask indexer nodes where, where cids are available from so that they can also stream the answers coming from indexers as part of the the content finding solution.

A

All right thanks, I have a question: do you see like.

C

Will uh you always run the indexers, or do you foresee, maybe application developers who run the ebook servers for their own like application, for example, I I'm doing something big, some like telegram or some service, and I might run my only mixer, and so there would be a hierarchy of connections or something do you do like plans or something like that.

A

So in there for the general indexing, it's probably going to be a few different providers of indexing services that are essentially going to index uh everything.

A

But yes, there, there are business models where a particular entity will want to do their own indexing, so it maybe for internal consumption, or maybe they have a specialized client to look up data uh in within their services and so they'll they'll index, all of their their content, um or maybe maybe they'll index the content of certain providers that they're partnering with um so if, for example, a a service that provides some sort of uh data storage and retrieval, maybe they partner with certain uh filecoin storage providers, and so they index the content that they uh they're, that they're serving out and they that may be something that they keep internally um but there as far as a hierarchy of indexers um that actually gets into a lot more difficult models like uh how do you you know you trust them?

A

How do you describe how you just determine, which one has uh which portion of the the key space so uh really there's at this point, we're looking at having general indexers and then specific indexers that are for specific purposes and the clients that use those specific indexers will either be explicitly configured to use them by? Whoever is whoever's services is providing those indexers.

A

All right, thank you.

C

How do you store efficiently hashes? They are basically around one day.

A

ah Yes, so the hashes are stored efficiently uh by storing only the the portion of the hash that has the um uh the the uh the longest matching prefix of any other hash. So, if you think of all of this hash data, we don't actually store any of the the hash data other than the part other than the minimum necessary to differentiate it from any other hash.

A

um So that's that's part of that's the main part of the solution um and then how we we index, those records on on disks on disks and things like that allows for a fairly efficient, find and look up of those. That's. The basic idea is not actually storing the whole hash.