From YouTube: Scaling content routing - Andrew Gillis
Description
How many CIDs are there, and what are the open scaling questions in content routing?
So here's a rough idea of the kind of scalability we're talking about, our goal being 10^15 indexes. To give you an idea of what that means: if you look at this small red square in the corner, that represents 1 billion indexes, which is about what we are currently getting in 12 to 24 hours.
So a thousand times that, 10^12, would probably, given the rate of acceleration of indexing data coming in, be something we'd start seeing in maybe less than a year. And then looking at the large gray block, our eventual goal, that's a little bit harder to estimate the time on, but as a really rough guess: maybe five years.
Here's where we are now: we are at less than 10^12, which is basically the whole blue block, or one small bit in that big gray block.
It's going to take a number of steps to get there, and we're going to have to proceed in a way that allows us to course-correct as we go forward, accommodate new features, and understand how the growth velocity changes as new patterns of usage, technologies, and things like that are experienced by the indexer. But we're taking steps to eventually get there, and the next steps we're going to take should hopefully get us a couple of orders of magnitude closer to our end goal.
So what needs to be done first? Well, to figure out what we need to do first, we have to look at what the indexer struggles with the most. Currently, the vast majority of the indexer's work comes from ingesting all the published index data. It's a huge amount of data being published every day, and it's increasing. We see sustained rates of twenty thousand to sixty thousand new indexes per second, and that's normal: over 7 billion new indexes per week, which is increasing every week.
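As a sanity check on these figures, here's a quick back-of-the-envelope calculation. It assumes a constant ingest rate, which the talk notes is not the case (the rate is accelerating, so real times to the larger milestones would be shorter):

```python
# Back-of-the-envelope check of the ingest rates quoted in the talk.
# Projections assume a constant rate; the talk says the rate is
# actually accelerating, so these are upper bounds on the time needed.

SECONDS_PER_DAY = 86_400

def days_to_reach(target: float, rate_per_sec: float) -> float:
    """Days of sustained ingestion needed to reach `target` indexes."""
    return target / (rate_per_sec * SECONDS_PER_DAY)

low, high = 20_000, 60_000  # sustained new indexes per second

# 1 billion indexes: consistent with the "12 to 24 hours" figure.
print(f"1e9 at 20k/s: {days_to_reach(1e9, low):.2f} days")   # ~0.58 days
print(f"1e9 at 60k/s: {days_to_reach(1e9, high):.2f} days")  # ~0.19 days

# 10^12, the "less than a year" milestone (acceleration helps here),
# and 10^15, the end goal, which clearly depends on acceleration.
print(f"1e12 at 60k/s: {days_to_reach(1e12, high):.0f} days")
print(f"1e15 at 60k/s: {days_to_reach(1e15, high) / 365:.0f} years")
```

At a constant 60k/s, 10^12 takes about 193 days, which lines up with "less than a year"; the five-year guess for 10^15 only works because the rate keeps growing.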
So let's understand the ways that that overloads the indexer. A couple of ways: one is the sheer volume of data. The indexer just can't keep up with processing it: pulling it all over the network, and dealing with processing and storing it. The absolute volume of data is too much for it, and when the indexer gets backed up, it starts falling behind on its backlog of work. That means we need a scaling strategy so that we can divide all of that incoming data among multiple indexers, so that no one indexer is forced to handle all that data and become overloaded.
So we have to have a strategy that accounts for a way of handling when indexers reach maximum capacity. We have to be able to do something to hand off responsibility, or redistribute data, or something, so that the storage load is not too much for any one indexer to handle.
So this gives us some ideas for basic requirements. We know we need to be able to handle adding indexers at any time, as we need to increase the capacity of our indexing deployment.
Even if it's from a single data source, we need to have some way of handling the case where an indexer can't absorb any more data. We also need redundancy, because if we're indexing huge amounts of data, it can be very expensive to lose that data if an indexer goes down, so we do want the ability to have redundancy within our deployment. And since we don't know exactly which indexer is going to have a particular piece of data, we'll need a way of sending queries to all indexers so that we can merge the results, because the result that we want may come from any of them, or all of them.
So the decision we've made, to split up the work of ingestion, is to divide the publishers over the indexers in a pool of indexers. The publisher is the entity that publishes indexing data to indexers; it may or may not be the same as the actual content provider.
This strategy is very simple, and it allows us to assign an indexer a publisher. After it's assigned a publisher, the indexer can interact independently and directly with the publisher. It doesn't require any use of any sort of upstream service to parse and redirect data to indexers; the indexer can just interact directly with the publisher. And in order to decide how we assign publishers to indexers, we're going to use a coordination service.
A coordination service is a simple service that listens for announce messages, over pubsub or coming directly over HTTP. An announce message is just a publisher saying it has new data to index. The coordination service will look at the publisher in the announce message and see if it's already been assigned. If it has not been assigned to an indexer, the coordination service will decide on an indexer to assign it to.
After assignment, the indexer will be told to sync with that publisher, and from then on out the indexer will act independently with that publisher; the coordination service doesn't need to do anything more for that publisher.
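As a rough illustration (not the actual implementation; the class, method names, and the least-loaded policy here are all hypothetical), the assignment flow just described might look like this:

```python
# Illustrative sketch of a coordination service assigning publishers
# to indexers when announce messages arrive. All names are hypothetical.

class CoordinationService:
    def __init__(self, pool):
        self.pool = pool          # list of indexer ids in the pool
        self.assignments = {}     # publisher id -> indexer id

    def handle_announce(self, publisher: str) -> str:
        """Called for each announce message (via pubsub or HTTP).

        If the publisher is already assigned, there is nothing to do:
        the assigned indexer syncs with the publisher on its own.
        Otherwise pick an indexer (here: the least-loaded one).
        """
        if publisher in self.assignments:
            return self.assignments[publisher]

        load = {ix: 0 for ix in self.pool}
        for ix in self.assignments.values():
            load[ix] += 1
        indexer = min(self.pool, key=lambda ix: load[ix])

        self.assignments[publisher] = indexer
        # A real deployment would now call the indexer's admin API to
        # tell it to sync with the publisher; from then on the indexer
        # interacts with the publisher directly.
        return indexer
```

Repeated announces from the same publisher are cheap no-ops, which is why the service can go offline without disrupting already-assigned indexers.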
The coordination service was chosen because it's a very simple way of making the decision for the indexers. It doesn't require any sort of consensus mechanism shared amongst the indexers to find some way to agree on who gets which publishers. And it's also really not a single point of failure, certainly not in the short term, because the coordination service can go offline and the indexers will all continue to operate just fine without it.
And then, when the coordination service comes back online, it can resume assigning publishers to indexers, and everything will catch up as they go and index the content. Nothing will be missed.
The pool is also very simple: it's just a group of indexers that a coordination service is configured to control. All that means is the indexers are on the same network and their administrative interface is available to the coordination service. Indexers can be put into the pool by updating the coordination service's configuration, and can be removed from the pool the same way. When an indexer is added to the pool, any publishers that are in need of replication, in other words, publishers that don't already have enough indexers to replicate as per the configured replication settings, will be assigned to that indexer right away.
The coordination service can do some other useful things. One of those is polling the indexers: when an indexer is at 80% capacity, in other words it only has 20% of its storage remaining, the coordination service will inspect the other indexers to make sure that there are others available to take on the reassignment of that indexer's publishers when the almost-full indexer finally gets full. And if there aren't, it can alert administrators until something can be done about it. So if it predicts that there aren't going to be enough indexers to take on the additional load at some point, it gives enough time to send out alerts and have a person react to the need for more resources in the pool.
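A sketch of that capacity check (the 80% threshold is from the talk; the data shapes, the free-space heuristic, and the function names are assumptions for illustration):

```python
# Illustrative sketch: poll indexer capacities, and when one crosses
# 80% used, verify the rest of the pool could absorb its publishers;
# otherwise alert an operator while there is still time to act.

WARN_THRESHOLD = 0.80

def check_capacity(pool, alert):
    """`pool` maps indexer name -> {'used': bytes, 'capacity': bytes}.
    Calls `alert(message)` when an almost-full indexer's remaining
    growth could not be taken over by the other, non-full indexers."""
    for name, ix in pool.items():
        used_frac = ix["used"] / ix["capacity"]
        if used_frac < WARN_THRESHOLD:
            continue
        # Growth left before this indexer freezes and hands off.
        remaining = ix["capacity"] - ix["used"]
        # Free space on indexers that are themselves below the threshold.
        free_elsewhere = sum(
            o["capacity"] - o["used"]
            for n, o in pool.items()
            if n != name and o["used"] / o["capacity"] < WARN_THRESHOLD
        )
        if free_elsewhere < remaining:
            alert(f"{name} at {used_frac:.0%}: pool lacks capacity for hand-off")
```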
So what happens when an indexer does get full? This is where we have what's called freeze mode, or frozen mode, and publisher hand-off. When an indexer reaches 90% capacity, it'll automatically go into frozen mode. It's a special mode where the indexer stops ingesting any new data, and only responds to removal advertisements and metadata updates from its assigned publishers.
The indexer continues to respond to any queries for index data while it's frozen; it just can't accept any more new data, so it's not going to grow in size anymore. After it's frozen, when the coordination service sees that the indexer has become frozen, it will then assign that indexer's publishers to other non-frozen indexers in the pool, and the new indexers will sync starting from the last CID that was handled by the frozen indexer.
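A minimal sketch of freeze mode and hand-off as described (the 90% threshold is from the talk; the classes and the round-robin reassignment are illustrative assumptions):

```python
# Illustrative sketch of freeze mode and publisher hand-off.
# A frozen indexer keeps serving queries and still accepts removals
# and metadata updates (not modeled here); it just stops growing.

FREEZE_THRESHOLD = 0.90

class Indexer:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0
        self.frozen = False
        self.publishers = {}   # publisher -> last CID synced

    def ingest(self, publisher, cid, size):
        """Ingest new data unless frozen; freeze at 90% capacity."""
        if self.frozen:
            return False
        self.used += size
        self.publishers[publisher] = cid
        if self.used >= FREEZE_THRESHOLD * self.capacity:
            self.frozen = True
        return True

def hand_off(frozen, pool):
    """Reassign a frozen indexer's publishers to non-frozen indexers.
    Each new indexer resumes syncing from the last CID the frozen
    indexer handled -- no stored data is moved anywhere."""
    targets = [ix for ix in pool if not ix.frozen]
    moves = {}
    for i, (pub, last_cid) in enumerate(frozen.publishers.items()):
        target = targets[i % len(targets)]
        target.publishers[pub] = last_cid   # a resume point, not a data copy
        moves[pub] = target.name
    return moves
```

The key property is in `hand_off`: only the sync position travels between indexers, never the indexed data itself.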
So one of the results of this is that the indexing data gets spread across multiple indexers, even if it's from the same publisher.
So, for example, say we had an advertisement chain on a publisher running from A to Z. The first indexer gets A, B, C, D, then freezes because it's full; the next one gets E, F, G, then gets frozen; the next one gets H, I, J; and so on. You can see how that would be spread all over the indexer pool, and that's fine. It allows us to just keep adding indexers as more and more data pours in.

Then, as deletions occur, there may be sufficient space to consolidate some of the frozen indexers, or possibly re-initialize them into the pool if their data is no longer useful. So there are things that can be done with them. But the point is that freezing provides a way of handing off the data. Basically, freezing is a graceful way of letting the data that's coming into the indexers overflow from the full ones onto ones that aren't full, and it doesn't involve moving any data around.
If you think of other strategies, where adding indexers, or adding a node to a network, would cause a redistribution of data, even a minor one, that could be very expensive in an indexing situation just because of the amount of data involved: it could be expensive and very time-consuming. So having this freezing strategy allows us to handle full indexers without moving any data around.
What this requires is scatter-gather queries, because any indexer, or all of the indexers, could have the data that's being looked for. A client doesn't know which indexer to ask, and there's no idea of which indexer might have an answer for any key. Multiple indexers might, since multiple providers could provide the data for a key. So we have to have the ability to do scatter-gather queries.
We already have a service that does that; it's called indexstar. And since we have a homogeneous network of indexers right now, we already have to do that, so this service is already available and working well. It sends a client query to all the indexers, collects their responses, and merges them into a single response back to the client.
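A toy sketch in the spirit of what's described here (this is not indexstar's actual code; the function and its skip-on-failure behavior stand in for the real HTTP fan-out and circuit-breaker logic):

```python
# Illustrative scatter-gather front end: fan a lookup out to every
# indexer, tolerate failures, and merge the provider results.

from concurrent.futures import ThreadPoolExecutor

def scatter_gather(key, indexers, timeout=1.0):
    """Query every indexer for `key` and merge their results.

    `indexers` is a list of callables, each taking a key and returning
    a list of provider records (stand-ins for real HTTP queries).
    Unresponsive or failing indexers are simply skipped -- a crude
    stand-in for circuit-breaker behavior.
    """
    results = []
    with ThreadPoolExecutor(max_workers=len(indexers)) as pool:
        futures = [pool.submit(ix, key) for ix in indexers]
        for fut in futures:
            try:
                results.extend(fut.result(timeout=timeout))
            except Exception:
                continue  # skip failed or slow indexers
    # Merge: deduplicate providers while preserving response order.
    seen, merged = set(), []
    for provider in results:
        if provider not in seen:
            seen.add(provider)
            merged.append(provider)
    return merged
```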
It also does things like circuit-breaker logic: if an indexer becomes unresponsive, it's not going to wait for the response from that indexer until it becomes responsive again. And indexstar, like I said, is currently working; it's a necessary part of this network.
So, scatter-gather queries: yes, they do multiply all of the incoming queries by the number of indexers in the pool. This may be some cause for concern if the query load is sufficiently heavy, but right now it's not.
The cache takes care of about 95% of the query load, and what does get through is actually fairly small. Queries themselves are small, and so are the responses. So it's not taking a huge amount of processing power or network bandwidth to deal with the query load at this point. That may change at some point in the future, but currently we don't feel that we need to address it in this step. But as the query load does increase, what are the next steps?
One of the eventual strategies that we'll get to, and this depends on a number of factors which we don't know yet, which is why it would be considered a next step, is that we would want to distribute the query load by sharding over the key space.
In other words, different indexers would be responsible for handling a different portion of the key space, and we'd do something like that with a consistent hash. We'd split the ingestion layer and the query layer into two separate layers, so that ingestion could be handled separately. The nodes handling ingestion would then write the results back to the query layer, which is key-sharded, and queries would then be directed, based on their key, to specific nodes.
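A minimal consistent-hash ring of the kind mentioned here (the node names, virtual-node count, and hash choice are illustrative assumptions, not the indexer's design):

```python
# Illustrative consistent-hash ring for sharding a key space.
# Each node gets many virtual points on the ring so keys spread
# evenly, and adding a node remaps only about 1/N of the keys.

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._h(f"{n}-{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """First ring point clockwise from the key's hash."""
        i = bisect.bisect(self.points, self._h(key)) % len(self.ring)
        return self.ring[i][1]
```

With a sharded query layer, a lookup for a key would go only to `node_for(key)` instead of fanning out to every indexer; the trade-off discussed next is that changing the ring can still force data to move.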
In the query layer, that may prove to be critical in the future, but it could also be problematic: if there's ever any need to redistribute data, even though a consistent hash minimizes it, the sheer amount of data could be a problem.
So it's a matter of weighing cost versus benefit, and we don't really have all of those factors yet at the scale where we'd start being concerned about it.
So that concludes my presentation. Thank you very much. Here are some links where you can find more information about the indexers or get in contact with us; feel free to reach out. Thank you very much.