From YouTube: The Graph's Core Devs Meeting #5
Description
The Graph’s Core Devs Meeting #5
This video was recorded: Tuesday, July 6 @ 8am PST, 2021.
A
Alrighty, welcome everyone to the fifth protocol town hall, now being converted to the core devs call. Thank you for joining us. I just wanted to say a few words about this transition. Over the last few months we've seen a lot of engagement from indexers and developers that are working on graph-node or very close to the protocol, and we're ready to create a more focused group around core dev processes, doing a lot more ideation and building in public, while transitioning to a bit more of a community call cadence, so delegators, curators, and subgraph developers can have their own forum for their own discussions and challenges.

I would also like to welcome StreamingFast to the core devs call and to The Graph community. This will be the first call they've joined, and I'm excited to see what they come up with in their research and in working with Edge & Node and other teams. I also would like to welcome Oliver Zerhusen, who has joined the Foundation to work on governance, so you'll be seeing him around a lot more frequently in the core dev and community calls. And lastly, I wanted to welcome Semiotic; they'll be presenting a little later, but we're excited to have them lean in more with our community as well. With that, Ariel, do you want to take it away?
B
Yeah, hello, nice to have you here for another core devs meeting. I will share my screen. I'm going to talk about protocol upgrades first. There are some changes in the staking contract and dispute manager that we talked about in the previous core devs meeting; those are related to having different slashing percentages on the dispute manager. We already talked about the reasons for having this GIP, but what I want to update is that these are already approved, in GIPs 3 and 4, by the council, and they are undergoing an on-chain vote so the upgrade can happen. We already have four out of six votes, and it will probably happen today, tomorrow, or the next day.

There's also a delegation working group that is being organized by Oliver. He's collecting feedback from the community, all the delegation feedback, so core devs can be connected and we know the right focus to give to the work. But he's going to talk more about that, so Oliver, if you want to take it from here.
C
Yeah, perfect, okay. Let me just tee up the history around what we're going to do. Back in January, Brandon consolidated the feedback he had received up to that point in time into six distinct, broader subjects that fall under the umbrella of delegator experience enhancements.

Now, since January we have in fact seen more items come through, and quite a bit, as a matter of fact. If you screen through the forum posts, you find probably a couple of dozen different forum threads that touch on a number of different things, in addition to what Brandon had collected.

So these are the items that you're seeing here, and we've formed a cross-network group where we have Ariel in there, Chris Russell is in there from the indexer community, and we also have Chris Ewing in there from the delegator community, providing us guidance and feedback from that end, and we are really calibrating on what the most important issues to the delegators are.

And how should we prioritize them? We just came out of a meeting in the last hour and had a very interesting discussion around the dynamics and some of the constraints that we have within the network. So we're really having deep discussions so that we come up with solutions that both meet what the delegators want and also fit well within our protocol.

The goal really is to present back to the community, when we're done with our discussions, a sort of roadmap around delegator process and parameter improvements. That is something we're going to have posted in the forum and that we're looking to get feedback on. In essence, what we want to come up with is guidance for the core dev teams so that, out of the couple dozen forum threads that are out there, they know what they should be focusing on next as they do protocol enhancements.

We also want to provide transparency to the community. There has been a lot of engagement around forum posts that delegators have raised, and all of that we are going to capture in here, and we also want to continue to get feedback from delegators.

So when you see the roadmap being posted in a couple of weeks, make sure you give us your feedback on how you think about it, whether you feel that we got it right or whether we need to fine-tune some of it. Now, Ariel has already worked on a GIP draft that essentially addresses some of the items that you saw a couple of slides ago, and I really encourage everyone to go into the forum post.

It's labeled under "Proposal to change indexer cut mechanism". It's a very interesting proposal that addresses a few pain points. That includes transparency around understanding how indexers set their reward cut percentages, and it also makes it easier for indexers to maintain their cut percentages without needing to readjust every time the delegation amount changes. It also addresses one specific situation that we've seen coming up, which we call a sandwich attack, where the reward cuts can be changed right before and after allocations; that has presented itself as a pain point, and it is also a component that is being addressed. So make sure you get into the forum and provide your feedback there. We are currently fine-tuning that with Ariel and are looking to get a poll posted fairly soon, and that was my update on that.
B
Thank you. I can take it from there. Based on what Oliver was saying about the proposal: this was an original proposal from Gary in the forum, and I took it and worked on a candidate implementation. I encourage indexers and delegators to participate, add feedback, and check that everything makes sense and is aligned as a solution for this.

This is the current rewards distribution. When you have an indexer taking 30 percent of the stake and delegators 70 percent, let's say we have 100 rewards and the indexer reward cut is ten percent: the indexer will get ten percent of the whole rewards and the delegators get the rest. So this is the way it works now. One potential issue with this is that even if you have a delegator with a small amount of delegation, this distribution is not taking into account the delegation-to-total-stake ratio. So it's not that the rewards accrue in proportion to the amount of delegation; they are distributed from the absolute amount of rewards.

The new calculation (this is the new calculation, changing the slide for the same stake) will first assign rewards based on how much the indexer and the delegators are staking. So the distribution of the hundred rewards will first be done on this ratio, and then the indexer will take 10 percent of the delegation part of the rewards. So you will have a different share of rewards based on this new formula.
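To make the difference concrete, here is a minimal sketch in TypeScript using the numbers from the example above (a 30/70 stake split, 100 rewards, a 10 percent reward cut). The function names are illustrative, not the staking contract's API.

```typescript
// Compares the current and proposed reward splits described above.
// Purely illustrative; names and numbers are examples, not the contract API.

interface Split { indexer: number; delegators: number }

// Current formula: the reward cut is applied to the *total* rewards,
// regardless of how much of the allocation's stake is delegated.
function currentSplit(totalRewards: number, rewardCut: number): Split {
  const indexer = totalRewards * rewardCut;
  return { indexer, delegators: totalRewards - indexer };
}

// Proposed formula: rewards are first divided pro rata by stake, and the
// cut is then taken only from the delegators' portion.
function proposedSplit(
  totalRewards: number,
  indexerStakeRatio: number, // indexer's own stake / total allocated stake
  rewardCut: number,
): Split {
  const indexerShare = totalRewards * indexerStakeRatio;
  const delegatorShare = totalRewards - indexerShare;
  const cut = delegatorShare * rewardCut;
  return { indexer: indexerShare + cut, delegators: delegatorShare - cut };
}

console.log(currentSplit(100, 0.10));        // { indexer: 10, delegators: 90 }
console.log(proposedSplit(100, 0.30, 0.10)); // { indexer: 37, delegators: 63 }
```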
B
I posted a GIP about this, and there's a link to a spreadsheet where you can see how the delegator yield and the indexer yield change based on different indexer-to-delegation ratios, so you can basically see this formula in movement as those change. I encourage you to take a look at the spreadsheet and provide some feedback.

The other thing is that the current formula works by using the indexer reward cut at the time the allocation is closed, so it's not taking into account the indexer reward cut at the time the allocation was created. That means the indexer has the opportunity to change the cut before the allocation is closed; it can be changed multiple times. That is, I would say, changing the conditions for delegators, and it invites a bit of speculation around when you close the allocation.

This candidate implementation has a change where we take a snapshot of the indexer cuts when the allocation is created, and that is what gets used. So multiple different allocations can each have a different reward cut based on when they were created. The reason for that is to avoid last-minute condition changes.
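A rough illustration of that snapshot idea is below (a minimal sketch; the field and function names are hypothetical, not the staking contract's actual interface, and the split itself is simplified so the focus stays on which cut value is read).

```typescript
// Illustrative sketch of snapshotting the reward cut per allocation,
// as described above. Names are hypothetical, not the contract's API.

interface Allocation {
  id: string;
  rewardCutAtCreation: number; // captured once, when the allocation opens
}

const allocations = new Map<string, Allocation>();

function openAllocation(id: string, currentIndexerRewardCut: number): void {
  // Snapshot the cut now; later changes to the indexer's cut no longer
  // affect this allocation.
  allocations.set(id, { id, rewardCutAtCreation: currentIndexerRewardCut });
}

function closeAllocation(id: string, rewards: number) {
  const alloc = allocations.get(id);
  if (!alloc) throw new Error(`unknown allocation ${id}`);
  // Use the snapshotted cut, not whatever the cut happens to be today,
  // which removes the incentive to change it right around closing.
  const indexerCut = rewards * alloc.rewardCutAtCreation;
  return { indexer: indexerCut, delegators: rewards - indexerCut };
}
```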
B
So that's the other main thing, and that's all about this upcoming candidate GIP. Again, I encourage you to provide feedback. I'm going to talk now about disputes and arbitration.

One thing that happened during the last two weeks: there were open disputes for P2P, one of the indexers, which the core team analyzed, and we found there were determinism issues in the presented POIs. These determinism issues were related to a divergence on a particular block number, caused by an eth_call that returned something different. The resolution by the arbitrator, following the arbitration charter, was to draw the disputes, so the deposit was returned to the fisherman and the indexer was not slashed. That's the main thing. As a next step, the team is working on reproducing the wrong eth_call; the investigation now is to find out whether this was a particular condition on one of the Ethereum nodes or something else. So the current line of work is setting up a way to reproduce the determinism issue, and we will post more details about that in the forum.

One more thing to mention about disputes: the team published a CLI that you can install easily and use from your terminal. It's a CLI that lets you manage all your disputes. You can list open disputes, you can create indexing disputes, and you can create query disputes by presenting an attestation. This is a first version; I think it will help a lot, we'll keep improving it, and it's open for feedback.
A
Awesome. I believe Adam is next.
D
Great. Hey everyone, I'm Adam, I'm a product manager working at Edge & Node on graph-node. Let me find... yeah, I wanted to talk about two things today. Can you see my screen? Yeah. So firstly, it's about core processes. Let me actually just narrow this so I can see everything. Obviously this is a very exciting time for the protocol.

In terms of the core development of the protocol, we've got more participants. As the network continues to roll out, we've got more indexers, so we increasingly have to keep an eye on that. But then we've also obviously got more core contributors. That includes StreamingFast, but we've also got, for example, LimeChain working on a piece of work to support unit testing of mappings. So we've got more people participating, and it's increasingly important to standardize and organize how we're developing and releasing software. It's obviously a coordination challenge, so that everyone involved knows what to expect.
D
Specifically here, my areas of focus would be around graph-node; graph-cli; graph-ts, which supports graph-cli and writing mappings; and then on the indexer side there's indexer-cli, indexer-service, and indexer-agent. Obviously those components will have different versions, and sometimes they need to move in lockstep and other times they wouldn't, but we want to try and standardize the way in which we're releasing, at least in particular when we're coordinating across core development teams. So those are the core components.

These are also the two important components of versioning: the spec version, which is the version of the subgraph manifest and what it looks like, and the API version, which is the version of the mappings API. That's going to be particularly relevant, I guess, with an upcoming release, which I'll talk about in a second. So essentially I've just got a couple of updates and requests for people. Firstly, we really want to move to more frequent releases of graph-node.
D
We know there was quite a long delay between 0.22 and 0.23, and we want to try and move to a slightly more regular cadence. We're not going to commit to a specific day of the week or of the month, but we want to move, I guess, to every two to four weeks, to try and essentially build that muscle and get people across the ecosystem used to upgrading on a regular basis. And, as part of this, I've got a request for help.

So firstly, I need to find more people to help beta test, so indexers for beta testing. We've got a couple of people who we have been working with for the last release, and we want to move to doing that with more people, in particular if you've got software that you write which is used by others. We know that with 0.23 there are quite a lot of migrations, quite a lot of changes.
D
So if you maintain Grafana dashboards that are used by other people, then we want to know, so that we're not taken by surprise. Please get in touch, either in the forum or on Discord; that would be great. We also want to have more contributions to graph-cli from outside the core development teams, and this comes back to the LimeChain AssemblyScript work, the unit testing work. They're essentially creating a test framework which you would ideally want to build in, and we want this to be an ongoing thing where it happens more and more often, so we want to essentially have a bit of a working group. I think Petko is here, so it'd be great if you could be part of this; and if you're interested in being part of this and working out how we could build more things into the CLI, then that'd be really great, so please get in touch.

And then the last piece is, I guess, a call to subgraph developers. Upgrading AssemblyScript to the latest version is something we're currently working on; credit to Otávio, who has been working on this. It introduces new functions, a new latest version across graph-node, graph-cli, and graph-ts, and we're doing some early testing on this, but we obviously want to make sure that we understand all the changes, because there are some breaking changes in the latest AssemblyScript version. So if you're interested in testing the latest version with us, then again, please get in touch; that'd be great.

I'll pause there. That's obviously only a light touch on releases, just in case there are any questions about any of that, or anything in general about how we're managing things at the moment.
F
I'll put a question out there just for discussion, for these more frequent releases on the two-to-four-week cadence.
D
Official? Good question. So I think, yeah, an overarching thing is that graph-node itself hasn't had its 1.0 version yet, and we're working our way towards that. That will be, I guess, a major version bump and a bit of a milestone, and I'll be talking later about what we want to fit into that. But I think in general these will probably be minor versions.

So I guess every couple of weeks, and I think we still need to work out how that fits into the protocol, because there are sometimes some reasons... like, at the moment there's a push, I guess, to move to the latest version, 0.23.1, essentially because it allows us to upgrade the IPFS node that everyone's using, since 0.22 currently only supports up to IPFS 0.4.

So there'll be some instances where we'll, I guess, encourage folks to upgrade, but I think in general it probably won't be every two to four weeks. Then again, there might be some instances where we've introduced new things, and that comes down to the versioning strategy that, full disclosure, I think we're still working out. But hopefully, by releasing frequently, we can, as I say, build that muscle.

Where is that... yes, yes, so there will be breaking changes, and essentially they'll be tied to the API version which I mentioned.
D
If I listed them all off now I'd miss some, but essentially it's going to be pretty well documented, and because it's tied to that API version, you'll be able to keep using the older API, the current API version, and that will also work fine. graph-node will essentially support both in parallel for a while, and, as someone says, there's a detailed upgrade guide to help you get through that.

Yes, yes, that request has certainly come through. I wouldn't say that there's a specific time frame, but I think adding some more control in terms of block handlers and other filters has certainly come up quite a lot, and hopefully we'll be able to get some of those kinds of improvements in.

But yeah, one thing, slightly off topic: we want to, in general, start to be a bit more open about what we're working on, some of the ongoing projects and also things that we're discovering. So, I guess, watch this space on that front, particularly as we start to collaborate more with StreamingFast and work out what we're all respectively working on.
D
Cool. So then, I guess, just quickly, the other thing. This is more about feature management, and it's one of those things that's leading into the graph-node 1.0 release. Essentially there's a GIP which Brandon has been working on, and I think I'll be iterating on it a bit, but essentially this provides a formal pathway for introducing new features to the decentralized network, essentially from experimental through to full support. There are a couple of ways in which we can do that; essentially it comes down to whether a given subgraph that uses a feature will have query arbitration, or indexing rewards and indexing arbitration. And there are some challenges to this.

We don't necessarily track all the features that we'd want to migrate in this way, so that's sort of up first: these levers aren't all tracked on the network. I think there is an indexing rewards flag which is set by the subgraph oracle, but I also think we need to establish where the locus of control is, where an indexer would say that they do want to index these kinds of subgraphs or make those kinds of decisions. So I'm very interested to talk to indexers about how to best surface this configuration on the network, and I'll be reaching out.
G
For example, indexers need to optimize pricing and consumers need to bid efficiently to get the best service, and we've been investigating automating query negotiations with deep reinforcement learning. Towards that long-term goal, today we would like to share some initial steps, which probably have some more short-term impact on the protocol, and this is about indexers' query price optimization, because this is an essential task.

Some queries cost a lot more in terms of infrastructure costs, and latency can be used as a metric to measure this complexity. However, predicting latency given a GraphQL query is a very hard problem, and we started tackling this issue using AI. I will let Matt describe what we've been doing so far.
H
So, as Ahmet was saying, we are currently trying to develop a method to predict the cost of running a query without actually running it, which would be very useful in the everyday scenario you see here, where a dapp developer sends a query to the gateway along with a bid for how much they are willing to pay for it. The indexer is then tasked with pricing that query and figuring out how expensive it will be to run.

Currently this is done with Agora, which gives you a lot of flexibility to specify how exactly you want to price a query, but it turns out that figuring out how to write simple rules to price queries does not quite catch all of the very expensive queries, which is a very hard problem, as was explored in the IBM cloud research paper that you see at the bottom here.
H
So what we are doing is developing a deep learning model to predict the latency from a raw GraphQL query, and this is done in a few steps. First, we take the raw GraphQL query string and we need to turn it into a more digestible form for the neural network, so we take each individual word and piece of syntax in the query and turn that into a categorical token.
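As a rough illustration of that tokenization step, here is a minimal sketch; it is not Semiotic's actual implementation, and the splitting rules and vocabulary here are made up.

```typescript
// Illustrative sketch: turn a GraphQL query string into categorical tokens
// for a neural network. The vocabulary and splitting rules are hypothetical.

const vocab = new Map<string, number>();

function tokenToId(token: string): number {
  if (!vocab.has(token)) vocab.set(token, vocab.size);
  return vocab.get(token)!;
}

function tokenize(query: string): number[] {
  // Keep punctuation and syntax as separate tokens, then split on whitespace.
  return query
    .replace(/([{}():,])/g, " $1 ")
    .trim()
    .split(/\s+/)
    .map(tokenToId);
}

// Each word and piece of syntax becomes an integer category.
console.log(tokenize("{ pairs(first: 10) { id } }"));
```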
H
So to solve that, we moved on to solving query generation. Being able to generate queries comes with many advantages. Number one, we have unlimited data, which means that we can train the model until it is able to actually learn the syntactical qualities of GraphQL and predict the latency of completely new queries. There's also the added benefit of not needing any historical query data at all. So one thing we could do is pre-train a model before a subgraph is even deployed; before any queries have even come in for a subgraph, we could already have a way to predict how much those queries would cost to run.

So this is all great, but it comes with one very large challenge, and that's that generating queries is very hard. The subset of random strings that actually form a valid query is extremely small. So what we have done is build a tool that can shrink this set of random strings down to the set of valid queries.
H
At the top you can see a query being randomly generated, and at the bottom you can see the valid next tokens being generated by our tool. Right now the generator is just randomly picking from that list, adding the pick to the query, and then the tool looks at that partially finished query again and outputs all of the possible next tokens.
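Sketched very roughly, that generation loop might look like the following; `validNextTokens` stands in for the grammar tool described above, and the toy grammar at the bottom exists only to make the sketch self-contained.

```typescript
// Illustrative sketch of the token-by-token query generator described above.
// A grammar tool returns every token that keeps the partial query valid, and
// the generator picks one at random until nothing legal is left.

type NextTokensFn = (partial: string[]) => string[];

function generateRandomQuery(validNextTokens: NextTokensFn, maxTokens = 50): string {
  const query: string[] = [];
  while (query.length < maxTokens) {
    const candidates = validNextTokens(query);
    if (candidates.length === 0) break; // nothing legal left: query is complete
    // Today: uniform random choice. The planned adversarial version would let
    // a learned policy pick, to target the cost model's weak spots.
    query.push(candidates[Math.floor(Math.random() * candidates.length)]);
  }
  return query.join(" ");
}

// Toy grammar covering a tiny fragment of GraphQL, just to run the sketch.
const toyGrammar: NextTokensFn = (partial) => {
  if (partial.length === 0) return ["{"];
  const last = partial[partial.length - 1];
  if (last === "{") return ["pairs", "tokens"];
  if (last === "pairs" || last === "tokens") return ["}"];
  return []; // a closing "}" ends the query
};

console.log(generateRandomQuery(toyGrammar)); // e.g. "{ pairs }"
```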
H
So now that we have a query generator, we can plug it into our latency prediction training pipeline, which will look something like this. The query generator generates a query, and our latency prediction model predicts the latency for that query. We also feed that query through our local indexer to get the true latency. We can then compute the error between the latency prediction and the true latency, and then train our model as before, using supervised learning.
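Put together, the training loop described above might look roughly like this sketch; the components are placeholders supplied by the caller, and none of the names are real Semiotic or Graph APIs.

```typescript
// Illustrative sketch of the supervised training loop described above.

interface Pipeline {
  generateQuery(): string;                        // random query generator
  predictLatency(query: string): Promise<number>; // the neural model
  measureLatency(query: string): Promise<number>; // run on a local indexer
  update(query: string, error: number): void;     // one training step
}

async function train(p: Pipeline, steps: number): Promise<void> {
  for (let i = 0; i < steps; i++) {
    const query = p.generateQuery();
    const predicted = await p.predictLatency(query);
    const actual = await p.measureLatency(query);
    // Supervised learning on the prediction error; the planned adversarial
    // step would additionally reward the generator when this error is large.
    p.update(query, actual - predicted);
  }
}
```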
H
Our next steps are to introduce an adversarial component to this query generation pipeline. This is the same exact graph you just saw, with the added component of a reward arrow going from that supervised learning error to the query generator. The query generator will be trained via reinforcement learning to make the cost prediction model mess up, so it will be incentivized to come up with queries that are interesting.
A
Are there any other questions while we have the Semiotic team?
F
Here, I can share some comments just on why I think this is super cool. A common challenge that you see in a lot of blockchain protocols is figuring out how to cost different operations, usually compute and storage, and the approach that you've seen taken in almost every protocol to date has been to set this at the protocol level. So you have, like, Ethereum gas costs for storage and for different types of compute, opcodes for compute, and when we were designing The Graph we came to a similar question, which is: how do we price, or how do we cost, queries? One of the challenges here is that configurations are so much more heterogeneous than you would see with, say, node operators running a blockchain node. For example, the cost of executing a query is very contingent on the caching that you've set up, both in graph-node and the caching that happens in Postgres, the indexes that you've created, and how horizontally scaled your setup is. So there's really no way that we ever could have done the equivalent of Ethereum gas costs as a protocol-wide set of costs.

Instead we ended up with Agora, which Zac has presented in past calls, and which gives indexers a way to flexibly communicate their own costs to consumers in more of a peer-to-peer fashion. But historically it's been difficult for indexers to figure out how to set those Agora cost models; it's actually quite a bit of work to figure out what my own costs are, which are idiosyncratic to my specific configuration, and how I might change that configuration in response to what those costs are, or in response to the market opportunity of the types of queries that are coming in. So this is kind of the first step in really making that a first-class experience. The tools that Semiotic is building and the models that Semiotic is learning are really making this kind of peer-to-peer pricing, price discovery, and cost communication really efficient. So yeah.
H
But actually, when we look at the data, we find that a very large majority of the queries are extremely similar to one another, because they're generated by scripts that are just filling in variables or doing something similar. So even though the model is able to perform well within this small data set, it's not expected to actually perform well on completely new queries, because the set of unique queries that the hosted service is getting is just a bit too small.
H
We need to be able to predict the query in less time than it takes to run the query, and there are actually some great tools that exist for doing efficient neural network inference. With those tools we were able to get some of our models down to 200 to 500 microseconds, which is sufficiently low to be very useful for predicting these queries, since most of them take at least a millisecond or so.
H
Yeah, so originally we were training on the full hosted service data set that we got, and that took a very long time; we actually didn't even get through the whole data set. But, as I said, we were training on lots of duplicates, so when we processed the data set further and deduplicated it, we got a data set of unique queries, and we were able to train on that in a few hours, I think, and that's including a few passes through, actually. So yeah, it's a very small set of unique queries.
A
Well, we have a few minutes left, about 15, which is abnormal for us, but if there are any questions around GIPs, upcoming protocol changes, maybe the arbitration charter, now would be a good time.
F
Yeah, I mean, just... I guess generally I'd love to figure out a way to make these more interactive. I love the presentation format for the start of it, but now that we do have multiple core dev teams, it would be great to discuss... I know there's lots of other meetings happening where some of these ideas are being discussed, but it'd be great to discuss even some partly formed ideas on these calls, to give people kind of a taste of work in progress.

One thing I think came up in the indexer office hours with StreamingFast last week was discussion around changes to the information model of subgraphs, the idea of actually breaking subgraph mappings into multiple stages that can be run concurrently, for kind of optimizing performance. That's just an example, but yeah, it'd be great to give people a view of work in progress, even if something's not fully polished yet.
I
So I think, Brandon, you're talking about the parallelism, right? You want me to go through a little bit of that, the story, the PancakeSwap story? Is that the right place to do that, and some of the elements that we thought of and that we're writing an RFC on? That's what we're doing. Do you hear me well? Yeah, yeah, I think that sounds great. So I did a demo to Chris from GraphOps. We had a lot of fun. I showed him the few lines, the six lines we added, to transform the PancakeSwap subgraph, which is basically a clone of Uniswap, to stage it in different steps and to run all these steps in parallel, instead of processing the full history linearly, the thing that was taking two months, going linearly through mappings and querying.

So that's sort of a blend of the StreamingFast technology, because it's based on files, opening files, processing data linearly in chunks, but with the fact that you can parallelize that. And so that's some of the thinking we want to bring there, or at least a proposition. We're explaining our way of how we thought of it and how we implemented it, and we'll see if that resonates, or if there are better ways, maybe, to do these things. So, is that useful to all of you? I don't know.
I
So, I don't know if that's right. Okay, so look: the StreamingFast tech started from the principle that we want to take a blockchain and extract all the data there is, all of the data there is, so that we theoretically would be able to rebuild a node of the concerned blockchain, because we would have all the data. So all of the accounts and nonce changes, all the state changes, all of whatever you find in the storage of a blockchain, we want to extract and put into files so that we can process them. And files have great advantages as opposed to processes like, for example, an Ethereum node, an archive node: there's CPU and memory overhead, there's disk access, and you have large systems (I'm thinking of an archive node here) that need to be managed, and you need another copy of that to scale it out.

If you have things in files, like in GCP, which we're using, there's not even a cost for downloading the file, there's no egress cost, everything is internal, so it's extremely cheap. And when we were processing the full history of that section, I'll tell you, we processed the whole chain, all transactions unfiltered, and we did that four times, and it still took, for that part, maybe 30 minutes, because that's negligible: we're dealing with files, and it unpacks very fast. So that's our core principle.
I
We take the chains, extract the data, and put them into files. We wanted to have the most complete description of all the protocols we handled, so we defined protobufs; that's what we're using. We wanted that to be easily consumed by a lot of different languages. So that's the first delineation of, I would say, a contract between data consumers and data producers: any chain would produce protobufs in chunks of 100 blocks.

These are stored in files, and then we have what we call the Firehose, which is being integrated by the team right now, and which bridges files and real-time streams. So the nodes produce those streaming blocks, and the Firehose knows when to go to the files and when to switch to the live stream; it ensures contiguity and navigates forks for you. That's sort of the fusing of those things, and this is chain-agnostic. And so, where am I with the stage, with the... does that answer your question?
I
So what happens is that that is specifically the parallelized version, right. Of course, for the PancakeSwap story we built the linear version that just goes linearly. This one already went faster because it was using files and the linear storage, not relying on hitting the node; because there's a little bit more data in there, we tried to leverage it more and avoid hitting the node for information.
I
So that's one thing. But the parallel version, yes, runs the first stage and then drops the entities to disk, and then the second stage starts and picks back up the entities that would exist if it had run linearly. You can imagine why: you can't run a Uniswap-style subgraph without first knowing the pairs. So stage one goes and extracts the pairs for the full history and drops them out on disk. The next stage starts knowing the pairs, so it loads the pairs, and now it can do a little bit more work, computing whatever price average, and then you have another value that relies on the price average. So you flush the prices first, and then you restart the process and continue from the pairs plus the average, on which you can build the next stage.
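Very roughly, that staging idea might be sketched like the following; all the names are hypothetical (the real implementation is file-based and will be described in the upcoming RFC), but it shows why each stage can run its block ranges in parallel once the previous stage's output is on disk.

```typescript
// Illustrative sketch of the staged, parallelizable processing described
// above. Names are hypothetical; the real system reads flat block files and
// persists each stage's entities before the next stage starts.

interface Pair { address: string }
interface Price { pair: string; average: number }
type BlockRange = [number, number];

async function runStaged(
  ranges: BlockRange[],
  extractPairs: (r: BlockRange) => Promise<Pair[]>,
  computePrices: (pairs: Pair[], r: BlockRange) => Promise<Price[]>,
): Promise<{ pairs: Pair[]; prices: Price[] }> {
  // Stage 1: extract pairs over the full history. Block ranges are
  // independent, so they run in parallel; the combined output is what
  // "dropping the entities to disk" provides in the real system.
  const pairs = (await Promise.all(ranges.map(extractPairs))).flat();

  // Stage 2: starts from stage 1's output, exactly as if stage 1 had run
  // linearly, and computes price averages, again in parallel per range.
  const prices = (await Promise.all(ranges.map((r) => computePrices(pairs, r)))).flat();

  // Further stages would build on pairs + prices in the same way.
  return { pairs, prices };
}
```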
I
Then you write a little bit more, and all these stages run very quickly, I'm thinking maybe half an hour, so you have JSON files that are extremely easy to introspect, and you have the full history. For iterating, that's also one of the core principles: if you want data agility, the processes need to be extremely fast. That's why we always strive for parallelization, because if you can't do that in the blockchain world, you're just going to take more and more time, forever. We can't do that, right; we'll have more computers, we can spin up more computers, but we won't be able to stretch space-time to keep going more and more linearly. So that's why, in our opinion, parallelism is important. But it also gives you a lot of agility, so you can inspect the latest bits of the data before they hit Postgres, and that's the second stage in our operation.
I
We took those files and then generated huge CSV imports and used a sort of parallel injection, the fastest method there is for injecting into Postgres: loading CSVs, removing indexes, and then adding the indexes back afterwards, to really have an optimal injection. That takes, how long, three or four hours with the rebuild of the indexes. So that's how we managed to shrink the full history down to six hours. Is that interesting? Is that fun?
F
Yeah, I think one thing that's really exciting to me about this kind of model that you guys are proposing is that it really maps a lot more to what a subgraph is doing, this kind of pipeline nature of the data. One of the features that I know Adam's been thinking about a lot is what it looks like to have more analytics-style functionality as a first-class concept in subgraphs, and we were seeing this with, I believe, the Uniswap subgraph, which PancakeSwap initially forked: a lot of the overhead was actually within the mappings themselves, imperatively doing these sorts of aggregations, the types of things that really could be declared declaratively and maybe staged as a downstream part of the pipeline. But because we didn't give you that facility, it was like, okay, how do we put everything that we'll ever need inside this one mapping function that runs sequentially? Once we start breaking things out within the subgraph into these pipelines, it really starts opening up possibilities for extending subgraph functionality, but also on the performance side as well.
D
Yeah, and I think one of the interesting challenges there is that, at the moment, writing subgraphs conceptually just runs from top to bottom, so it's a relatively simple mental model, whereas obviously this is much more powerful and much faster; it's definitely the way we want to go. But it's thinking about how we surface that to subgraph developers in a way that makes it easy to build the powerful subgraphs that they want to build.
F
One question I have, just sort of an open question as this work is developing, is what the impacts on determinism are from the file approach, and any impacts on being able to bring along the proofs, the sort of provenance of the data, along with the data that's getting added to the files. So not just the raw data itself, but the proof that this data came from, you know, block XYZ.
I
I guess there are two aspects to that. Okay, so that's in the works; the notion, perhaps, or the proposition (we're going to write an RFC about that) is to have sort of a network subgraph. The Firehose I'm talking about is raw data, not mishmashed; there are no mappings there. The blockchain produces a lot, and there's data in there that should be deterministic. One thing we've seen with the work on Erigon: they're trying to have another, you know, sort of Parity traces, and they figured out there are small inconsistencies between the implementations. So we did a lot of work so that different clients output the same thing, so the protobuf is the same, and the content there could eventually be hashed and cross-verified in a way, and that could become... that's one of our thoughts here. We're thinking, and we're going to write an RFC about that, of having yet another service on The Graph, perhaps, that would be that.

You would not query it; you shape it up just to have the information you want, but it's raw, unmapped, and then that could be used, and composed, by the other mappers. So you'd have, yes, provenance. Maybe that really touches on composability of subgraphs, and it starts making sense when you have a thing that is more raw and a thing that is less raw, because there's refinement; Brandon was talking about these things. So that aspect, verifiability, could be a service, and then, yes, a subgraph building on such a Firehose could rely on it. And that thing is also very deterministic: once the data is on disk and it's verified, there are just a lot fewer questions of "the node had an issue", "I hit a node that was in the wrong place", or "someone crippled the disk there, and I can't hash the SSD of the archive node to know if the answer is going to be good."
F
Right, it's a lot easier to compare files than it is to compare the sort of runtime behavior of these different nodes. Right, right, yeah. The network subgraph idea is really fascinating. There were some projects a few years ago, I think after The Graph came out, that were trying to apply the GraphQL concept to blockchain semantics, to querying that blockchain data, or to the way that it was extracted from the chain. And so I'm excited not only about the performance gains of exposing this through The Graph, but about actually providing, potentially, a maybe more definitive or useful set of semantics on top of these networks that might even get more usage than the sort of native JSON-RPC-style APIs.
I
I think there's tremendous value in that flat layer, because it empowers the rest, but also even for those communities: if such a radical approach is brought to their chain, it brings a lot of value to their data, right. It's always been about valorizing the data, putting value to the data, making it valuable and useful. So...
A
Yeah, thank you so much, that was really insightful, and I'm excited to have more of these kinds of open forum discussions as we focus on core dev processes and more core dev items. A few questions came through throughout the call on curation and delegation. We will be having community calls that Oliver and a few others are setting up, so that will be the best forum if you're looking to learn more about curation and subgraph development too. Thank you everyone for joining us for the fifth core dev call; see you next month.