From YouTube: Compute Over Data Working Group 4th Session
Description
Joel from the Ceramic team shares a walkthrough of the purpose and structure behind data streams and their upcoming ComposeDB project. Luke from the Bacalhau team takes us through the Bacalhau architecture, roadmap, and a live demo.
Ceramic: https://ceramic.network/
Bacalhau: https://bacalhau.org/
A
All right, all right, hello everyone, welcome. This is our fourth meeting of the Compute Over Data working group. Very glad to have everyone who is joining in live, and we're very fortunate today to have the Ceramic team, who's going to give us a rundown of their tech stack, and then also the Bacalhau project after them. So, Joel, we'll let you lead it off first. Thank you for joining, and we're looking forward to whatever content you'd like to share.
B
Thank you. Hi everyone, I'm Joel, technical co-founder of Ceramic. With me today I have Spencer, and maybe Sergey if he joins later, so we have a lot of people who can answer questions today if we get to them. Yeah, so I kind of wanted to jump right in: I'm going to share an intro, high-level overview of Ceramic, then dive into some of the things that I think might be a little bit more relevant to this group.
B
Perfect, okay. So, an intro to Ceramic: we see Ceramic as kind of an open graph for web3 data. You can think of it as this open knowledge graph of the internet. It's alive, meaning that anyone can update it and add things to it, and it's relational.

B
You have relationships between different things, and from this you can get this emergent web of trust, and we can start to make sense of the world, because now the internet is not only something that's presented to us by web pages; we can actually inspect the underlying contributions and how people interacted.

B
And so some of the thinking that became important when building Ceramic is that we wanted applications to be able to share data among themselves, so across applications and across organizations, and we want people to be able to optimize their workflows for their applications.
B
We
want
data
in
the
network
to
be
composable,
meaning
that
if
there
are
two
different
applications
out
there,
and
I
see
some
useful
data
in
both
of
them,
I
want
to
be
able
to
take
those
put
them
into
my
application
and
just
run
with
it
kind
of
use
that
and
when
I
make
updates
in
my
application,
those
should
propagate
to
the
other
application.
So
the
data
kind
of
is
composable
and
that's
like
interoperable
and
kind
of
there,
and
it's
not
locked
into
like
any
particular
app,
and
so
with
this.
B
We want audit trails that we can audit after the fact, and if we have this, we can actually let authors of the different parts of this decentralized knowledge graph build reputation. And so, why don't we just use the blockchain to do this? Because it has all these properties, right?
B
It means that we can have public goods funding, we can have NFTs. But the main limitation is that blockchains are limited in throughput by the block producer, basically the node that is producing the current block, the next block, and so on, and that can only be so big. You can't really build a Twitter or a Facebook on a strong consistency model where everything goes through one node.
B
So
different
people
have
approached
different
or
have
different
approaches
to
scaling
blockchains
for
data.
Some
do
like
the
bigger
block
approach,
so,
like
solana,
celestia
are
weave.
They
have
different
ways
of
making
sure
like
convincing
themselves
like
hey
this
big
block
producer
is
fine,
because
we
have
this
kind
of
validation
methods
for
it
and
so,
but
you're
still
limited
by
the
throughput
of
the
block
reducer.
B
Then
you
have
like
the
sort
of
proof
of
storage
approach
which
falcon
uses
with
cia
and
storjay
also
uses,
and
here
you
can
have
big
blobs
of
data
you
publish
and
then
you
have
some
kind
of
agreement
between
two
different
nodes
or
a
set
of
nodes
that
hey
this
data
will
be
stored
and
be
available,
but
you're
still
kind
of
limited
by
the
throughput
of
the
block,
because
you
can,
if
you
have
like
100
million
users,
they
can't
all
make
one
transaction
each
to
the
to
like
hey
here's.
B
So what we want is something that's decentralized and eventually consistent. We want to have parallel data production, and we think we can achieve this through audit trails that are independently verifiable. We still want properties like data composability: data shouldn't be locked to the different places where it lives. This would generally only work for non-financial data, of course, because for financial data you want strong consistency. And so imagine, if you have a set of event streams that are verifiable, you could have something that looks like this.
B
You
have.
The
different
colors
here
represent
like
different
subsets
of
the
network,
and
so
your
database
might
care
about,
like
some
particular
subset
here
or
basically
like
the
network
in
this
way,
so
you
have
like
the
different
subsets
of
shards,
but
someone
building
a
different
application
might
want
to
like
have
a
different
topology
of
like
how
they
build
their
indexes,
so
same
data,
but
like
different
configuration,
and
you
can
actually
achieve
this
if
you
have
independently
verifiable
event
streams.
B
But
if
you
have
something
like
hierarchy,
critical
systems,
you
still
kind
of
like
produce
the
falcon
users,
you
still
produce
like
blocks
and
they,
like
the
data,
is
kind
of
in
a
particular
shard,
and
then
you
would
need
to
move
it.
But
here
you
can
actually
produce
the
data
independently
and
then
you
can
build
kind
of
like
specific
consensus
over
different
places.
So
that's
why
in
ceramic
we
have
separated
two
layers
essentially
or
three
layers.
If
you
think
about
the
kind
of
user
experience
so
the
base
layer,
we
have
an
event
streaming
protocol.
B
On
top
of
that,
we
have
database
protocol,
and
on
top
of
that,
we
kind
of
have
this
nice
graphql
apis,
and
this
is
the
kind
of
system
we're
building
for
this
is
called
compost
db,
so
I'll
kind
of
quickly
go
into
composite
b.
I'm
not
gonna.
I'm
kind
of
gonna
rush
through
this
because
I
think
most
relevant,
for
this
call
is
the
event
streaming
layer.
B
So a data model just describes what data looks like, and any user in the network can write data to a model. Models are created using the GraphQL schema definition language, and of course you use GraphQL to query over them, and these data models are discoverable and composable. So I can take two existing data models, pull them into my application, and build something new. Roughly, this is what it would look like to define a model. This is a little bit outdated, so don't use this in your code.
B
But
here
you
have
a
model
for
a
proposal
to
imagine
like
a
dog
proposal.
It
has
an
author,
it
has
text,
and
so
this
author
here
is
like
given
proven
crypto
graphically
by
this
document,
account
kind
of
tag,
and
then
the
text
is
some
string
that
the
user
puts.
Then
you
have
a
comment
and
same
thing
here.
B
There's
an
author
there's
text,
but
we
also
have
a
proposal
id
which
references
kind
of
like
the
id
of
a
particular
instance
of
a
proposal,
and
then
you
can
kind
of
see
that
we
have
a
comments
relation
here
in
the
proposal
model,
so
you're
going
to
have
this
kind
of
relationship
that
allows
you
to
query
hey
pro.
This
proposal
give
me
all
of
the
models,
and
so
you
can
kind
of
imagine
how
this
generalizes
to
a
graph
of
of
data
models
anyway.
So
that's
the
database
system
we're
building
on
top
of
ceramic.
B
If
you're
curious
about
that,
please
reach
out,
but
that's
not
what
this
is
about
so
event
streaming
here.
I
want
to
like
slow
down
a
little
bit
and
allow
us
to
dig
a
little
bit
deeper.
If
you
at
any
point,
have
any
questions,
please
feel
free
to
interrupt
me
and
we'll
go
through
that.
B
So,
first
of
all
event,
streams
in
ceramic
are
completely
independently
verifiable,
meaning
that
I
can
synchronize
one
event
stream
in
the
network,
verify
it
and
be
sure
of
its
integrity
without
having
to
verify
any
of
the
other
data
in
the
network.
B
Excuse me. Then, we use decentralized identifiers, DIDs, for authentication, and we've built a system that allows us to use essentially any blockchain wallet, in principle, to write into Ceramic, and I'll go into a little bit more of what that looks like. We have a peer-to-peer network for synchronizing events, or synchronizing the event streams in general, and we sort of have a prototype for how this works right now. In the future we want to build something that's a little bit more scalable.
B
So
that
looks
like
some
sort
of
like
custom
epp
protocol
and
I'm
happy
to
like
dive
into
that.
If
anyone
has
questions
about
that,
and
then
all
data
is
sold
bound,
meaning
that
data
is
authored
by
a
particular
account
and
there's
no
way
to
like
transfer
the
data
like
transfer
ownership,
essentially
so
like
the
data
is,
is
with
you
and
you
can't
like
trade,
the
data
in
the
sense
of
like
an
nft.
If
you
want
to
trade,
something
use
the
blockchain
and
use
nfts
for
particular
things
or
tokens.
B
So
an
event
stream
looks
roughly
like
this.
This
kind
of,
like
the
high
level
view
and,
of
course
like
if
you
dig
into
particulars
here,
there's
more
nuance,
but
essentially
like
this,
so
you
have
a
different
events
or
the
event
stream
is
made
of
events.
B
And
first
you
have
a
genesis
event
that
essentially
creates
this
stream
and
we
have
something
called
stream
id.
The
stream
ids
is
like
an
fire
for
the
stream
and
that's
based
on
the
cid
of
the
genesis
event.
B
Then
there
are
two
types
of
events
in.
In
addition
to
the
genesis
event,
one
is
an
anchor
event
and
one
is
a
signed
event.
So
an
anchor
event
is
essentially
a
timestamp
that
utilizes
a
blockchain
in
this
case
ethereum,
to
provide
a
cryptographically
like
verifiable,
trustless
timestamp
that
hey
this
previous
event
was
published,
did
exist,
at
least
at
this
point
in
time,
and
so
we
have
a
system
for
like
how
we
reference
that
and
how
we
batch
that
into
merkle
tree.
B
So
we
can
actually
like
anchor
not
only
one
stream
with
one
transaction,
which
would
be
like
very
unscalable,
but
actually
make
one
on-chain
transaction
and
anchor
a
large
amount
of
events.
At
the
same
time,
then,
we
have
signed
events
and
the
signed
events
includes
a
signature,
basically
proving
that
hey,
I'm,
making
an
update
or
I'm
I'm
adding
an
event
to
the
stream,
and
I'm
actually
like
authorized
to
do
so.
So
the
genesis
event
specifies
the
account
the
id
that's
allowed
to
make
updates
to
this
event.
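The Merkle batching described here can be sketched like this: many event hashes are reduced to a single root, so one on-chain transaction can anchor them all, and any single event stays verifiable with a short proof. This is a generic sketch in Python, not Ceramic's actual anchoring code; the hash function and the odd-level padding rule are illustrative assumptions.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Reduce a list of leaf hashes to one root by pairwise hashing."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Collect the sibling hashes needed to recompute the root for one leaf."""
    proof, level, i = [], list(leaves), index
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        sib = i ^ 1
        proof.append((level[sib], sib < i))  # (sibling hash, sibling-is-left)
        level = [h(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
        i //= 2
    return proof

def verify_proof(leaf_hash, proof, root):
    acc = leaf_hash
    for sibling, is_left in proof:
        acc = h(sibling + acc) if is_left else h(acc + sibling)
    return acc == root

# A thousand event payloads; only `root` would go on chain.
events = [f"event-{i}".encode() for i in range(1000)]
root = merkle_root([h(e) for e in events])
```

With this shape, one Ethereum transaction carrying `root` timestamps all thousand events, and each stream only needs the proof path for its own event.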
B
This whole event log fits very neatly into the IPLD data model. And then you can just add more signed events, you can add new timestamp anchor events, and so on. You can keep growing this event log, and it's all hash-linked together from genesis to the latest event.

B
Right now in Ceramic we support only one canonical branch of history. In the future we want to allow the log to diverge and converge, because you might have nodes that are not completely in sync and we don't want to lose data. So that's the data structure at a high level, and now, diving into how the authorization and authentication works.
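A minimal sketch of the hash-linked event log described above, assuming a toy `cid` function (a hash of the JSON-serialized event) in place of real IPLD CIDs, and with signatures elided:

```python
import hashlib
import json

def cid(obj) -> str:
    """Stand-in for an IPLD CID: hash of the canonically serialized object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def genesis(controller: str, data) -> dict:
    """The genesis event names the account allowed to update the stream."""
    return {"type": "genesis", "controller": controller, "data": data}

def signed_event(prev_cid: str, data) -> dict:
    # A real signed event would also carry a signature by the controller's key.
    return {"type": "signed", "prev": prev_cid, "data": data}

def verify_log(events) -> bool:
    """Check that each event links to the CID of the event before it."""
    for earlier, later in zip(events, events[1:]):
        if later["prev"] != cid(earlier):
            return False
    return True

g = genesis("did:pkh:eip155:1:0xabc", {"title": "hello"})
stream_id = cid(g)  # the stream ID is derived from the genesis event's CID
e1 = signed_event(cid(g), {"title": "hello, world"})
e2 = signed_event(cid(e1), {"title": "hello again"})
```

Because every event embeds the hash of its predecessor, tampering with any event in the middle breaks every link after it, which is what makes a single stream independently verifiable.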
B
Generally, if you think about the distribution of public keys in the wild right now, you have essentially Ethereum wallets and Solana wallets, which have really big market penetration in just providing public key infrastructure, and we want to leverage that, right? We don't want to have to build a new wallet for all these different kinds of things.
B
So
in
this
case
you
have
a
wallet
that
holds
a
public
private
key
pair,
say
an
ethereum
address
from
an
ethereum
mattress.
We
can
create
a
d80
pth,
I'm
using
ethereum
here
as
just
an
example.
This
would
work
for
like
any
any
sort
of
blockchain
wallets
that
has
like
a
public
private
keeper.
B
Then
pkh
stands
for
public
key
hash.
So
if
you
know
an
ethereum
address,
it's
basically
a
hash
of
a
public
key
and
the
pkh.
Basically,
access
ethereum
address,
plus
which
network
is
on,
and
now
it's
a
d80
and
then
there's
a
resource
in
ceramic.
In
the
case
of
ceramic,
it's
an
inventory,
and
so
in
the
genesis
event
it
specified
that
hey
this
particular
did.
Pkh
controls
this
resource.
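For illustration, the did:pkh construction is just string assembly over a CAIP-2 chain identifier plus an on-chain address, following the did:pkh method specification:

```python
def did_pkh(namespace: str, chain_ref: str, address: str) -> str:
    """Build a did:pkh identifier: did:pkh:<caip-2 chain id>:<address>."""
    return f"did:pkh:{namespace}:{chain_ref}:{address}"

# Ethereum mainnet is CAIP-2 chain id "eip155:1"; the address is an example.
did = did_pkh("eip155", "1", "0xb9c5714089478a327f09197987f16f9e5d936e8a")
```

The same scheme covers other chains by swapping the namespace and chain reference, which is what lets any blockchain wallet act as a Ceramic identity.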
B
B
But
if
you
think
about,
like
the
user
experience,
you
don't
generally
want
to
have
to
sign
the
message
in
meta
mask
every
time:
you're
making
an
update,
especially
if
it's
like
a
social
media
app.
You
just
want
to
make
a
comment
or
like
as
a
dow
proposal
like
you,
you
don't
want
to
like
always
have
to
pop
up
a
meta
mask
pop-up
and
so
to
mitigate
that
we're
using
something
called
object,
capabilities.
B
So
we
basically
leverage
sign
in
with
ethereum
for
ethereum,
there's
a
similar
standard
being
built
for
solana
and
can
be
extended
to
like
different
blockchains
as
well.
B
Basically,
you
generate
the
session
key,
which
is
the
id
key
in
the
browser.
Then
you
have
a
message
or
you
generate
a
message
that
includes
the
public
key
of
the
session
key
and
includes
the
shadow
identifier
of
the
resource,
a
stream
id
for
ceramic,
and
then
it
signs
over
that.
So
the
wall
is
basically
signs
giving
the
session
key
access
to
write
on
behalf
of
the
did
pkh
to
this
resource.
B
Now
the
the
application
has
the
session
key
and
it
kind
of
packages.
The
signature
and
the
message
that
was
signed
from
the
wallet
into
a
an
iple
format
called
cacao
and
every
time
it
now,
the
application
wants
to
update
this
resource.
It
basically
makes
jws
signatures
store
that
as
jose
and
includes
a
reference
to
the
basically
includes
the
cid
as
part
of
like
the
signature
payload
of
the
cacao
as
an
invocation.
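The session-key flow can be sketched roughly as follows. This is not Ceramic's implementation: HMACs stand in for real wallet and JWS signatures (so verification here uses the secret keys directly, which a real asymmetric scheme would not), and the resource identifier is a placeholder.

```python
import hashlib
import hmac
import json

def sign(key: bytes, msg: bytes) -> str:
    """HMAC stands in for a real wallet/JWS signature in this sketch."""
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

# 1. The app generates an ephemeral session key in the browser.
session_key = b"ephemeral-session-secret"
session_pub = hashlib.sha256(session_key).hexdigest()  # toy "public key"

# 2. The wallet signs a capability: this session key may write to this resource.
wallet_key = b"wallet-private-key"
capability = {
    "aud": session_pub,
    "resource": "ceramic://<stream-id>",  # placeholder resource identifier
    "issuer": "did:pkh:eip155:1:0xabc",
}
cap_payload = json.dumps(capability, sort_keys=True).encode()
cacao = {"payload": capability, "signature": sign(wallet_key, cap_payload)}

# 3. Later updates are signed with the session key and reference the capability,
#    so no wallet pop-up is needed per update.
def make_update(data: dict) -> dict:
    body = json.dumps({"data": data, "cap": cacao["payload"]}, sort_keys=True).encode()
    return {"data": data, "cap": cacao, "sig": sign(session_key, body)}

def verify_update(update: dict) -> bool:
    cap = update["cap"]
    payload = json.dumps(cap["payload"], sort_keys=True).encode()
    if cap["signature"] != sign(wallet_key, payload):  # wallet authorized the key
        return False
    body = json.dumps({"data": update["data"], "cap": cap["payload"]},
                      sort_keys=True).encode()
    return update["sig"] == sign(session_key, body)    # session key signed this
```

The point of the two-level structure is that the wallet signs once per session, while every subsequent update only needs the cheap in-browser session key.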
B
You
can,
of
course,
also
you
know,
create
an
event
streaming
in
ceramic
that
only
relies
on
the
dad
key
and
the
resource
is
actually
controlled
by
directly
at
the
id
key.
It
kind
of
depends
on
like
what
sort
of
application
you
want
to
deploy
so
like.
B
If
you
have
a
server
side
application,
you
don't
really
need
like
a
wallet
and
all
that
infrastructure
for
users,
so
you
can
just
like
use
a
public
card
key
pair
and
just
use
the
adk
method,
so
there
it
depends
a
little
bit
on
how
you
would
approach
things,
but
for
users
this
is
really
neat
cool,
yeah
and
there's
also
work
happening
to
kind
of
align,
the
cacao
format
and
the
cacao
support
for
like
signing
with
ethereum
and
potentially
like
other
on-chain
capabilities,
with
the
work
done
by
ucan,
so
kind
of
like
harmonizing
those
standards,
because
they
kind
of
achieve
somewhat
different
goals,
but
they
they
have
or
they
have
similar
goals,
but
they
they
do
different
things
right
now,
or
they
have
different
capabilities.
B
No
pun
intended
right,
now,
cool
and
then
so.
What
are
kind
of
the
the
use
cases
for
event
streams,
so
one
that
I
already
talked
about
because
we
are
building
a
database
on
top
of
this
is,
of
course
you
can
use
event,
streams
for
databases
and
what's
neat
about
this-
is
that
you
don't
need
to
build
like
a
mapping
like
one
to
one
like
one
so
event,
type
of
event
stream
to
a
database.
You
can
actually
have
event
streams
and
then
build
different
sorts
of
databases
on
top
of
it.
B
Someone might want to build, say, a local-first database that doesn't really care about global state, or you might want something that works in both cases, both local and global, optimized for both. The event streams are agnostic as to what sort of indexes and what sort of databases you build on top, and since we leverage event streams, we can have different consistency models as well.

B
For example, I might plug into some event streams and build my own database locally on my machine, and other people do the same. Or I might want to build a database that actually leverages a blockchain to ingest events from event streams, build consensus on what we observed in those streams, and build an index, so that we have consensus over that index.
B
I
think
what
is
probably
more
interesting
for
this
group
is
like
using
event
streams
for
compute,
so
we
can
think
of
like
an
event.
You
must
like
sort
of
like
a
formal
glue
between
different
compute
stages
in
a
data
pipeline,
and
so
imagine
that
you
have
like
someone
produces
some
data
and
puts
outputs
that
into
an
event
stream
and
signs
it
and
now,
okay.
I
trust
this
author
of
the
data
I'm
going
to
actually
plug
into
that
run.
Some
computation
on
top
of
that
or
the
set
of
event
streams.
B
I'll
put
my
my
result
of
my
computation
into
a
new
event
stream
and
then
maybe
like
do
an
extra
step
after
that
and
do
the
same
sort
of
thing
or
you
might
actually
want
to
have
these
sort
of
open-ended
pipelines
where
I
build
a
pipeline
based
on
like
some
events
and
then
some
computation,
some
events
and
I
achieve
a
particular
goal.
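A toy pipeline in the spirit described above: each stage consumes one stream and emits its results into a new stream for the next stage. The DIDs and event shapes are made up for illustration, and signing is elided:

```python
def emit(stream, author, payload):
    """Append an event to a stream (signature elided in this sketch)."""
    stream.append({"author": author, "payload": payload})

# Stage 1: a producer publishes raw data into an event stream.
raw = []
for n in [3, 1, 4, 1, 5]:
    emit(raw, "did:example:producer", {"value": n})

# Stage 2: a compute node I trust consumes that stream, transforms each
# event, and writes its results into a new stream for downstream stages.
def square_stage(inputs, outputs):
    for e in inputs:
        emit(outputs, "did:example:compute", {"value": e["payload"]["value"] ** 2})

squared = []
square_stage(raw, squared)

# Stage 3: another consumer aggregates over the intermediate stream.
total = sum(e["payload"]["value"] for e in squared)
```

Because each stage's output is itself a stream with an author, downstream consumers can decide per-stage whose results they trust, which is the "formal glue" idea.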
B
Maybe
like
you
want
to
have
a
system
where
you
have
like
five
different
nodes
or
five
different
organizations
that
produce
they
run
the
same
computation
and
produce
the
output.
And
if
it's
the
same,
you
can
kind
of
trust
it
or
you
actually
have
a
consistent
system
around
it.
But
you
can
kind
of
like
mix
and
match,
which
is
is
kind
of
cool.
B
And
so,
if
you
were
to
build
this
sort
of
like
system
on
top
of
ceramic
event,
streams
ceramic
wouldn't
really
care
about
like
how
you
define
compute
jobs
and
how
they're
executed.
B
I
know
there's
like
because
the
only
thing
ceramic
would
care
about
is:
is
you
produce
an
event
stream
that
the
signatures
are
correctly
produced
and
it's
up
to
you
to
enter
how
you
interpret
those
events
and
yeah?
I
know
that
there's
like
a
bunch
of
different
interesting
projects
in
like
the
ipfs
file
coin
space,
where
I
think
there's
like
this
project
starting
to
build
the
ipli.
B
I
think
it
was
called
like
the
linked
invocation.
I
don't
know
if
there's
anything
public
about
that,
but
that
was
talked
about
during
the
ipfs
thing.
B
I
know
block
science
is
building
this
cat
thing,
which
is
like
content,
addressable
transformers
and
there's
probably
like
a
bunch
of
different,
much
more
different
examples
that
I'm
not
aware
of,
and
it
would
be
cool
if
we
can
kind
of
plug
in
plug
this
thing
into
these
event,
streams
and
kind
of
like
sort
of
mix
and
match
these
sort
of
different
approaches
to
computation
and
yeah.
So
my
question
to
this
group
is:
is
there
like
an
appetite
to
standardize
around
some
of
these
things?
B
Like
do
we
want
to
standardize
how
we
do
an
event,
log
data
structure
and
how
we
verify
signatures,
how
we
do
the
id
authentication
authorization
and
how
we
do
like
timestamps
through
anchors,
I
think,
maybe
like
we
want
to
standardize
around
how
we
synchronize
these
event
streams
in
a
peer-to-peer
manner,
but
I
think,
like
only
standardizing
around
like
you
know,
a
data
structure
would
be
really
interesting,
because
then
we
can
at
least
have
a
standardized
way
of
validating
things
across
across
the
ecosystem.
B
So
yeah,
that's
that's
the
question
I
want
to
pose
to
to
this
community
and-
and
you
know,
if
there's
interest
we'd
be
happy
to
like-
maybe
make
a
first
proposal
of
like
what
this
sort
of
thing
could
look
like.
If
this
is
not
interesting,
then
you
know
yeah,
that's
essentially
it
from
from
my
end
happy
to
talk
about
this
or
answer
any
questions.
A
Well, Joel, for what it's worth, I know the Fission folks are not here today, but they're very actively pushing the DID standard forward as well, so I think other folks in the community will also have strong opinions. I'll take a couple of notes on the standards component that you raised, because I think that's a great idea. I love the idea of you guys building it out: you can be the first, and if other folks want to collaborate on that, all the better.
A
If
not
at
least
we
have
a
first
implementation,
and
I
thought
it
was
really
interesting
how
you
were
sharing
to
go
back
to
the
notes
here,
the
the
different
permutations
of
event,
streams,
and
do
you
have
any
indication
about
like
the
data
sizes
that
would
be
ideal
for
event
streams?
Are
they
limited
to
one
node
processing
the
data
streams,
or
is
it
even
large?
Big
data
sets
as
well.
B
Yeah, so currently, the way Ceramic is designed, the Ceramic network doesn't really care about how the event streams are replicated. It's up to your node to decide, "hey, I want to replicate this particular event stream."

B
So if you want to build a really large event stream that has huge chunks of data, that's fine. Or maybe you just want to build an event stream that has references to CIDs that you could retrieve off Filecoin, and you have the integrity stored in Ceramic; not the data integrity, that's given by the CID, but the integrity of the process, of how different pieces of data were processed and attested to over time.
C
Yeah, I'd like to plus-one creating the general standards, and yeah, we don't have everyone here, but to the extent that you can start getting those documents onto our pages for people to discuss, I think that would be amazing, and Wes is exactly right: be the first person to propose something.

C
This is a lowercase-s standard, I want to make that clear: it's just up to us to figure out how to produce it, and anyone who wants to use it can, and anyone who doesn't, doesn't have to. But the extent to which we can have things go across platforms is, I think, incredibly powerful, and certainly something that I think a lot of people would be excited about.
A
Very good. All right, Joel, thank you so much for the overview, appreciate that, that was tremendous. So, moving on to the second half of the session: David, I can hand it over to you, if you'd like to give us a brief intro of Bacalhau.
D
Okay, great, here we go. Cool, so hi folks, I will share my screen. The first challenge is sharing the correct screen, because sometimes I share the wrong screen and then I do the whole presentation and someone's just looking at my inbox. So hopefully you can see some slides. Can everyone see some slides?
D
Okay, great, cool. So hi everyone, I'm Luke. I'm here with Kai; we are the tech co-leads on the Bacalhau project, and we also work with David and Wes on this project too. Just by way of setting some context: we are working on a compute-over-data project, but we also have this kind of dual mandate, so we have two jobs.
D
The
the
teams
in
this
working
group
to
be
more
successful
by
building
building
blocks,
for
you
finding
ways
to
collaborate
and
build
common
libraries
or
common
standards
and
by
splitting
out
working
code
from
whatever
we're
building
that
can
be
shared
in
the
community,
and
I
wanted
to
share
a
little
bit
of
sort
of
history
here.
So
dave-
and
I
have
some
history
together.
D
We were both involved in the Kubernetes community, in this working group called SIG Cluster Lifecycle, and we built a reference implementation, which was kubeadm, for installing Kubernetes. But we also made sure that kubeadm could be vendored into other projects and was providing building blocks for other projects, and we had great success with this, because within a few weeks and months of kubeadm going out there, we had Kubespray, kops, minikube, kind...
D
Lots
of
the
other
kubernetes
installer
projects
started
adopting
the
building
blocks
that
that
we
put
out
there
and
therefore
they
were
able
to
spend
more
time
on
their
differentiated
features
and
became
more
successful
as
as
installer
projects.
So
that's
that's
basically
we're
kind
of
we're
here
to
do
that
again
in
the
in
the
cod
in
the
computer
over
data
web
3
space,
so
that's
kind
of
what
we're
hoping
to
achieve.
D
So
with
that.
I
want
to
talk
a
little
bit
about
what
we've
got
now
and
as
you
look
at
this,
if,
if
you're
working
on
a
project
in
this
space,
if
you're
here
in
the
room
now
or
if
you're
watching
the
recording
later
have
a
think
about
whether
there's
anything
in
this
that's
useful
to
you
and
if
there
is
then
please
ask
us
to
split
it
out
and
we'll
split
it
out
into
a
reusable
component
that
you
can
that
you
can
use
so
that
you
have
to
maintain
less
code.
D
So
we
have
this
peer-to-peer
compute
network
that
interacts
primarily
with
ipfs
supports
users
showing
up
using
a
cli
to
submit
jobs
that
can
either
be
docker
jobs,
so
they
can
bring
their
own
docker
image
or
they
can
just
bring
a
python
script
and
we'll
run
python
inside
wasm
for
them.
The
reason
for
that
is
that
we
are
going
down
a
path
of
determinism
which
is
easier
to
achieve
in
wasm
and
I'll
talk
about
that
more
in
in
a
minute.
D
The
determinism
is
in
order
to
enable
verifiability
which
enables
us
to
start
to
have
some
confidence
in
the
results
we
support
concurrency.
So
you
can
say
how
many
times
you
want
to
run
your
job,
and
so
you
can
say,
concurrency
equals
three
and
then
three
nodes
in
the
network.
We'll
pick
it
up.
You
can
also
just
recently
we
added
support
for
sharding,
so
you
can
say
I've
got
a
job
that
has
ten
thousand
files
in
it.
I
need
to
resize
ten
thousand
images.
D
I
want
to
do
that
in
a
batch
size
of
a
hundred,
and
you
just
set
you
just
mentioned
a
glob
pattern.
Dot
jpeg
batch
size
equals
100
and
your
command,
and
then
it
will
get
run
across
10
nodes
in
parallel
and
the
results
get
reassembled
automatically
we
support
reading
from
ipfs.
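Conceptually, sharding selects the inputs that match a glob pattern and splits them into fixed-size batches that different nodes can process in parallel. A sketch of just that partitioning step (this is the idea, not Bacalhau's actual scheduler):

```python
import fnmatch

def shard(files, pattern, batch_size):
    """Select inputs matching a glob pattern and split them into batches,
    each of which could be dispatched to a different node."""
    matched = sorted(f for f in files if fnmatch.fnmatch(f, pattern))
    return [matched[i:i + batch_size] for i in range(0, len(matched), batch_size)]

# Ten thousand images plus an unrelated file, as in the talk's example.
files = [f"img{i:05d}.jpeg" for i in range(10_000)] + ["notes.txt"]
batches = shard(files, "*.jpeg", 100)  # 100 batches of 100 images each
```

With 10,000 matching files and a batch size of 100, there are 100 independent units of work, so the wall-clock time drops roughly with the number of nodes that pick batches up.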
D
Obviously, we also support reading from HTTP, because there's an awful lot of data sets available over HTTP on the internet, so this can also act as a bridge, a way for people to get data into IPFS by reading it from HTTP, processing it, and writing the results to IPFS.

D
Right now we also write to IPFS, and we've got this thing scaling up to 750 nodes, with lots of performance work ongoing to make things fast at 750, 1k, 10k. We're working on scaling it up to tens of thousands of nodes operating on petabytes of data. So that's kind of where we are today.
D
I
can
show
you
a
quick
demo
if
you
please
pray
to
the
demo
gods.
We
just
got
this
working
a
few
hours
ago.
So
what
we
have
here
is
I'm
going
to
do
a
little
bit
of
work
locally.
So
imagine
I
am
working
on
processing
video
data
and
I've
got
some
video
files
on
my
local
machine
and
I've
also
got
a
great
deal
more
video
data
on
ipfs.
D
That
is
more
data
than
I
can
fit
on
my
local
machine,
and
so
here
we
have
the
local
data,
and
here
we
have
kind
of
you
have
to
use
your
imagination
a
little
bit,
but
imagine
that
there
was
tons
and
tons
and
tons
of
data
in
in
this
ipfsc
id.
That
was
more
than
I
could
than
I
could
fit
on
on
my
physical
machine.
D
So
I'm
going
to
look
at
the
video,
it's
just
some
nice
footage
of
looking
at
a
gothic
building
and
then
there's
some
other
footage
that
we're
gonna
that
we're
gonna
use
and
what
we
wanna
do
is
we
want
to
apply
funky,
matrix
style
overlay
on
the
top
of
this
video,
so
I'm
gonna
do
that
locally
by
running
a
docker
command,
and
this
is
gonna
use
ffmpeg
inside
a
docker
container,
and
this
is
just
operating
on
this
literally
this
local
file
that
I
have
on
my
computer
and
it's
it's
processing
it
and
outputting
the
processed
video
data
in
in
my
outputs
directory
that
I
have
here,
I
can
show
you
exactly
what
that
job
was.
D
If
I
scroll
up
enough,
so
we
did
a
docker
run,
we
mounted
the
inputs
directory
and
the
outputs
directory,
and
we
ran
this
video
resize
example
container
image
and
we
just
ran
a
shell
script
inside
it
and
pointed
it
to
the
inputs
directory
in
the
outputs
directory.
So
that's
how
you
run
things
in
docker,
normally,
and
so
it's
finished,
and
so
we
can
see
what
this
looks
like
and
there
we
have
it
sort
of
funky,
matrix
style
text,
that's
being
overlaid
over
the
image
over
the
video.
D
Now
we're
probably
not
gonna
win
any
awards
for
for
our
our
video
work,
but
this
will
kind
of
serve
to
to
demonstrate
the
point.
I
hope
so
I'm
now
going
to
export
this
cid
now
remember
this
cid
points
to
this
data
which,
for
the
sake
of
argument,
imagine
this
is
more
data
than
you
could
process
on
your
local
machine.
So
I'm
going
to
I'm
going
to
show
you
two
things.
D
The first thing is, I'm going to show you just running that same container image on the Bacalhau network, and I'm going to time it so you can see how long it takes. So instead of docker run, we're doing bacalhau docker run: we've made the docker run command in Bacalhau very similar to the one you get from Docker, so that people are familiar with it.

D
Oh yeah, just quickly, I want to show you what's actually happening on the nodes. This is three nodes on our production network.
D
One
of
those
nodes
has
picked
up
that
job
and
is
running
it
and
it's
finished
so
in
31
seconds
it
has
processed
those
those
three
files
so
yeah
just
to
show
you
the
command
in
a
bit
more
detail.
D
We're
re
we're
mounting
the
cid
as
an
input,
volume
and
back
liao
will
handle
reading
the
data
in
from
ipfs,
and
then
we
can
specify
you
get
c
two
cpus
for
gigabyte
memory,
and
then
we
can
say
please
wait
until
the
job
finishes,
which
is
why
it
actually
waited
this
command
blocked
until
the
command
finished.
And
then
you
can
run
bakulyao
get
on
the
job
id
and
we
will
see
that
we
can
see
the
results
of
that.
D
Excuse
me,
so
we
can
see
now
that
we've
got
processed
video
footage
from
the
baccala
network.
Okay,
so
that's
pretty
cool
and
you
can
also
see
if
there
was
any
errors
or
any
messages
on
standard
out.
Those
also
get
written
out
to
these
files
here.
D
So
I'm
going
to
clean
up
from
that
run
and
then
I'm
also
going
to
show
you
what
you
can
do
when
you
use
sharding
and
so
sharding
is
about
parallelizing
the
work
across
multiple
machines
on
the
network,
and
so
what
we're
doing
here-
and
actually
I
need
to
show
you
this
quite
quickly,
because
it's
so
fast-
is
that
the
work
got
spread
out
across
multiple
of
the
nodes
on
the
network.
You
can
see
this
middle
node
picked
up,
one
of
the
jobs
and
the
third
node
here
picked
up
two
of
the
jobs.
D
It's
a
bit
random
at
the
moment
which
nodes
pick
up,
which
jobs
but
and
you
can
see
we
did
the
work
10
seconds
faster
than
we
did
when
we
just
ran
it
one
after
the
other.
So
that's
pretty
cool.
We
can
slash
the
run
time
of
our
work
by
parallelizing
it
across
the
network
and
now,
if
we
do
baklyow
get
on
on
that
job
id,
then
you
can
see
this
time
we
had
three
shards.
D
We've
got
shard
zero
and
shard
two
and
shard
one
showing
up,
and
so
you
get
separate
exit
codes
and
standard
out
and
standard
error
from
all
of
them.
But
then
inside
volumes
you
get
the
same
data
reassembled
in
in
the
output
volume
directly,
so
yeah,
that's
the
demo,
that's
kind
of
what
we
have
now
yeah
I'll.
If
you've
got
questions,
please
hold
on
to
them.
D
I'll,
take
take
questions
at
the
end,
so
I'll,
just
kind
of
blast
through
our
road
map
and
talk
about
what
we're
gonna
do
next
and
and
like
I
said
so
as
you're
looking
through
the
road
map.
If
there
are
things
here
that
are
interesting,
that
you'd
like
us
to
split
out
into
separate
projects,
then
please
come
and
tell
us
and
because
we
we
want
to
do
that
to
make
the
whole
community
successful.
D
So
the
current
roadmap
looks
like
this.
I've
got
a
few
slides
here
for
each
year
half
so
I've
got
till
the
end
of
this
year,
first
half
of
next
year,
first
second
half
of
next
year
and
then
the
first
half
of
2024
and
notice
that
in
our
roadmap
we
have
these
explicit
goals.
D
Like
I
said
for
making
the
compute
over
data
working
group
successful,
so
we
want
to
find
partners
who
can
use
our
code
or
integrate
with
us
in
a
meaningful
way,
and
we
don't
know
how
much
engineering
effort
each
of
those
integration
efforts
will
take.
But
we
are
resourced
to
do
that.
D
We
are
also
working
on
making
users
successful
so
you'll
see
that
throughout
the
roadmap
is
that
we
have
targets
on
that,
and
then
you
can
also
see
the
kind
of
feature
work
and
that
we're
adding
so
there's
a
big
focus
at
the
moment
on
performance,
getting
performance
down
to
seconds
of
job
execution
in
thousand
node
networks.
We're
not
far
from
that
now.
D
Like I mentioned earlier with the deterministic execution mode in WASM, you will be able to start assuming that if multiple nodes came to the same result, then either they all did the work correctly or they're all lying. So being able to hash the outputs of deterministic jobs is step one in our approach to addressing byzantine fault tolerance. We're also looking to make users successful, and after that we are adding support for DAG execution.
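Step one of that byzantine-fault-tolerance approach, comparing output hashes of a deterministic job across nodes, can be sketched like this. The "nodes" are just local files standing in for real network participants, an assumption made purely for illustration:

```shell
#!/bin/sh
# Sketch of "hash the outputs of deterministic jobs": if every node's
# output hashes to the same digest, then (in Luke's framing) either all
# nodes did the work correctly or all are lying. Nodes are simulated
# here as local files.
set -eu
work=$(mktemp -d)
echo "deterministic result" > "$work/node-a.out"
echo "deterministic result" > "$work/node-b.out"
echo "deterministic result" > "$work/node-c.out"
# One digest per node; consensus means exactly one distinct digest.
digests=$(sha256sum "$work"/node-*.out | awk '{print $1}' | sort -u)
count=$(printf '%s\n' "$digests" | wc -l)
if [ "$count" -eq 1 ]; then
  verdict=consensus
else
  verdict=disagreement
fi
echo "$verdict"
```

Replacing one file's contents would flip the verdict to disagreement, which is the signal a verifier would act on.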
D
Ideally, we don't want to reinvent the wheel, so if we can integrate with something like Airflow, for example, then we might do that, but that's all TBD; we haven't designed it yet. And then, of course, as we go through this, we're interested in finding ways to factor out what we've done and use it to help other participants in the working group. Then we are upping our game on byzantine fault tolerance towards the end of the year, and we're looking at making more users successful as well.
D
Of
course,
in
january
next
year
we
might
tackle
non-determinism,
which
is
how
do
you
verify
so
we,
you
can
currently
run
non-deterministic
jobs
on
the
network,
but
obviously
you
can't
it's
hard,
it's
harder
to
verify
non-deterministic
jobs
than
it
is
deterministic
jobs.
So
we
we'll
we
built
a
prototype
earlier
on
in
the
project
of
how
you
could
verify
non-deterministic
jobs.
I
won't
dive
too
much
into
that
now.
D
But
if
you
have
questions,
please
ask
then
more
focus
on
work
on
working
groups,
successes
and
user
successes
and
then
we're
going
to
probably
around
march
next
year,
pick
up
prototyping
and
verification
protocol
around
battle
yell.
D
So
the
let
me
just
pull
up
some
notes
here.
The
the
verification
protocol
will
help
us
to
move
towards
being
able
to
build
systems
like
this
that
eventually
interact
with
incentives
and
so
the
verification
protocol.
For
example,
we
have
a
draft
of
right
now
that
we,
for
example,
can
have
nodes.
D
Punish
slash
those
nodes
in
some
way
and
then
that
prototyping
of
that
protocol,
which
we
have
a
draft
of
in
the
wiki,
will
allow
us
to
start
kind
of
testing
those
ideas
and
that
will
lead
towards
being
able
to
connect
the
network
to
a
smart
contract.
D
So
we
have
this
large
piece
of
work
in
april
through
june
next
year,
around
integrating
with
a
smart
contract,
and
one
of
the
ideas
is
that
our
entire
transport
and
scheduler
layer,
which
currently
is
just
a
custom
implementation
on
top
of
gossip
sub
on
live
p2p,
would
become
replaced
with,
for
example,
an
fvm
contract
and
so
that
fvm
contract
would
then
be
responsible
for
implementing
the
verification
protocol
and
then
around
that
time,
we're
looking
at
so
just
after.
We
start
work
on
the
smart
contract.
D
Some
notes
on
this
as
well.
Sorry,
I
should
have
had
these
already,
but
basically.
D
Let
me
see
sorry
the
yeah
here
it
is
so
so
the
formal
verification
we're
looking
at
tools
like
glow,
daphne
and
y3
as
ways
to
encode
the
behavior
of
the
verification
protocol
in
a
in
a
formal
system
that
you
we
that
we
can
mathematically,
prove
things
about,
and
the
goal
of
that
effort
is
to
make
it
it's
basically
to
eliminate
bugs
in
the
protocol
like
earlier
than
finding
them
because
we
get
hacked,
and
so
that
will
be
quite
a
big
effort,
but
we
believe
it
will
be
worthwhile
and
we
think
it's
it's
very
interesting.
D
I
think
there's.
So.
I've
worked
previously
on
formal
verification
of
some
network
algorithms,
where
we
actually
found
novel
attacks
against
established
protocols
by
by
using
formal
verification.
So
I
think
that's
an
area
that
is
super
interesting
in
the
space
honestly
so
excited
to
to
potentially
spin
up
an
effort
around
that,
and
then
I
see
basically
the
the
big
challenge
that
I
see
with
moving
to
a
smart
contract.
Implementation
of
the
of
the
transport
and
the
scheduler
layer
in
the
system
is
efficiency.
D
So
once
we
have
all
of
the
kind
of
scheduling
and
metadata
around
this
job
execution
happening
on
chain,
then
we
expect
that
will
initially
dramatically
reduce
the
performance
of
the
system,
and
so
we've
got
like
a
whole
half
a
year
dedicated
to
getting
that
back
up
to
scratch.
In
parallel
with
formal
verification,
all
the
while
making
users
successful
and
helping
split
out
work
that
we're
doing
and
then
come
october.
We
plan
to
build
a
plug-in
system.
D
We're
also,
then,
going
to
look
at
a
reputation
system
which
is
kind
of
using
the
outputs
from
the
smart
contract
and
the
verification
protocol
to
make
a
basically
a
published
public
dashboard
of
which
compute
nodes
are
are
reputable
and
which
ones
are
not
based
on
which
ones
have
been
deemed
to
to
be
publishing
good
results
and
which
ones
are
not,
and
then
for
the
following:
half
of
2024,
looking
at
incentive
models,
developer
experience
and
then
we've
also
heard
a
lot
of
demand
for
secrecy.
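The reputation dashboard Luke describes, scoring nodes by how often their published results matched the verified outcome, can be sketched as a tally over a toy result log. The "node job digest" log format below is invented purely for illustration:

```shell
#!/bin/sh
# Toy reputation tally: given a log of "node job digest" lines, score
# each node by how often its published digest matched the majority
# digest for that job. This stands in for outputs from a verification
# protocol; the format is hypothetical.
set -eu
log=$(mktemp)
cat > "$log" <<'EOF'
nodeA job1 aaa
nodeB job1 aaa
nodeC job1 bbb
nodeA job2 ccc
nodeB job2 ccc
nodeC job2 ccc
EOF
scores=$(awk '
  { votes[$2 " " $3]++; line[NR] = $0 }
  END {
    # find the majority digest per job
    for (k in votes) {
      split(k, p, " ")
      if (votes[k] > best[p[1]]) { best[p[1]] = votes[k]; win[p[1]] = p[2] }
    }
    # per-node agreement score: matches / total submissions
    for (i = 1; i <= NR; i++) {
      split(line[i], f, " ")
      total[f[1]]++
      if (win[f[2]] == f[3]) good[f[1]]++
    }
    for (n in total) printf "%s %d/%d\n", n, good[n], total[n]
  }' "$log" | sort)
echo "$scores"
```

Here nodeC disagrees with the majority on job1, so its score drops; a dashboard would surface exactly that kind of ratio per node.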
D
So we're really interested in talking to other working group participants who are already addressing secrecy, and potentially partnering with them, because frankly we see that as a very difficult problem to solve. Then there's more user success: scaling up the number of users, and also the number of working group partners building on the foundations that we're laying, hopefully.
D
That's dealing with an arbitrary API request, because the profile of that execution is going to be different depending on what API requests come in, so there's a whole interesting area of work there. But I'll stop to make sure that we have time for Q&A and discussion here. I'd love to hear from folks on the call, or in the recording afterwards.
D
Is there anything on the roadmap that people would like to collaborate on? We'd love to work with other teams. And is there anything you think we should work on that we don't have on our roadmap today? Because we're here to serve the working group.
D
I'd love to hear answers to those questions, or any other questions that people have.
A
Well, I'll give everybody else a minute to chime in if they have questions. I will also add, Luke, for these questions and also the ones that Joel raised: we are going to be launching a Discourse server soon for the community. At the very least, even for folks that can't join live, we'll mark these topics as our first issues for discussion purposes and try to funnel folks there as well. But I'll just pause in case anybody else has questions.
A
Okay, all right. Well, we'll have the content up on YouTube here shortly so other folks can view it; a lot of folks weren't able to join the call live today. But Luke, Joel, I'll just pause if you guys have anything else.
D
Yeah, if Joel is still on the call, I wanted to say I really enjoyed your presentation, thank you. As you can see from what we're working on, the identity standard stuff that you mentioned is not directly relevant to our goals, but it sounds like a very useful thing to have. Maybe you have some thoughts on whether there are intersections between our work that could be helpful.
B
Yeah, I mean, the stuff you presented on your roadmap is very wide, with a lot of open problems, so I can't really speak to exactly how it would fit in, but we basically use the DID standard.
B
So
I
think
like
if
you
ever
have
like
a
thing
that
needs
more
than
just
like.
Okay,
I
have
a
computer
job
that
I
run
and
there's
a
signature
over
it
and
I
put
it
on
the
blockchain.
Then
that
might
be
interesting
or
yeah.
I
guess
like
so
I
guess
the
question
back
to
you
is
like,
as
you
imagine,
back
layout
right
now.
Is
it
like
a
fully
kind
of
just
like
incentivized
compute
network,
or
do
you
imagine
like,
as
people
build
different
like
compute
pipelines?
B
On
top
of
it,
you
actually
have
some
people
that
run
like
a
centralized
services
puts
inputs
output
in
there
and
then
maybe
there's
part
of
the
pipeline.
That's
like
more
distributed
and
more
verified.
B
D
I think if we're able to help develop the building blocks for, say, 10 different projects to be successful, then that will be deemed a success. At the same time, we're also going to be guided by what our users are asking for, and what the storage providers we want to integrate with are asking for, and things like that. So not a direct answer, I'm afraid.
B
No,
that
makes
sense
and,
like
I'm
interested
to
see,
like
you
know,
from
from
a
very,
very
abstract
perspective
like
how
can
we,
because
different
people
are
building
different
story
like
you're
building
one
computer
project?
There
are
other
people
building
like
different
sorts
of
compute
projects.
I
think
the
the
thing
that
block
science
is
doing
with
like
having
just
what
they
call
cat
a
content
adjustable
transformers.
B
And
like
the
ability
to
take
these
different
sorts
of
compute
systems
and
like
plug
them
together,
because
they
might
be
good
at
different
things,
yeah
and
that's
why
I
think
it's
like,
like
wondering:
can
we
standardize
around
some
way
where
we
like?
This
is
how
we
shuffle
data
between
the
systems.
D
That
would
be
very
interesting.
I
think
that's
a
thread
that
we
should
pull
on
and
work
on
together
and
also
on
the
did
thing.
I
think
it
might
be
interesting
to
see
how
whether
identity
for
both
users,
submitting
the
jobs
and
also
for
the
compute
providers,
is
useful
in
that
in
that
world.
Sorry
go
ahead,
kai.
E
Well, I was just going to say, when we get to the DAG story, that's an opportunity for Bacalhau to be one stage of a DAG. So when we start to think about what a DAG is, it's about bringing various systems together, because if I, say, consume an input from IPFS and write the output to IPFS, and then I tell another system, "hey, your input is here, I just wrote it,"
E
Then
we
start
to
have
like
is
in
the
the
the
pipeline
between
these
various
systems
can
be
the
storage
of
the
output
and
then
reading
from
that
input
and
there's
various
ways
that
we
can
integrate
like
that
and
yeah.
I
think
it's
a
it's
a
compelling
picture
to
say:
there's
lots
of
different
compute
projects
out
there.
Each
would
be
good
at
a
various
stage
of
the
pipeline
is
a
very
strong
statement.
I
like
it.
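Kai's "your input is here, I just wrote it" hand-off can be sketched with a local directory standing in for IPFS, and sha256 digests standing in for real CIDs. Everything below is a simulation under those assumptions, not an actual IPFS or Bacalhau integration:

```shell
#!/bin/sh
# Sketch of chaining two compute stages through content-addressed
# storage: a local directory plays the role of IPFS, and sha256
# digests play the role of CIDs. The two stages communicate only
# through the store, never directly.
set -eu
store=$(mktemp -d)

put() {  # store a file, print its content address
  addr=$(sha256sum "$1" | awk '{print $1}')
  cp "$1" "$store/$addr"
  echo "$addr"
}
get() {  # fetch a file by content address
  cat "$store/$1"
}

# Stage 1: one compute system writes its output to the store.
tmp=$(mktemp)
echo "hello from stage one" > "$tmp"
cid=$(put "$tmp")

# Stage 2: a different system is handed only the address, reads the
# data, transforms it, and writes its own output back to the store.
tmp2=$(mktemp)
get "$cid" | tr 'a-z' 'A-Z' > "$tmp2"
cid2=$(put "$tmp2")
get "$cid2"
```

Because each stage only ever sees an address, any system that can read and write the shared store can slot into the pipeline, which is the standardization point raised above.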
B
So I don't view it as if we're building this as an identity piece in the event log. You can use it for identity, but that's not what it is at its core.
C
Okay, I know we are out of time, and before everyone departs I just wanted to say we will be circulating... please do mark it down on your calendar. I'm sorry.
C
Sorry to cut off the conversation when we're just about out of time, but I do want to cover this one item: November 2nd and 3rd in Lisbon is going to be the Compute Over Data Summit. I will be sending something around; we're trying to build the schedule right now. It is supposed to be collaborative, and I would love to have lots of people talk over the course of the day.
C
Talk
about
your
projects,
talk
about
things
that
are
interesting
to
you
and
then
work
on
some
tracks,
hallway
tracks
for
or
not
hallway,
but
unconference
or
other
tracks,
where
we
can
collaborate
on
particular
items
like
specs,
and
things
like
that.
C
So
please
keep
keep
your
eye
out
for
that
and
do
jump
in
the
compute
over
data
working
group
slack
to
continue
the
conversations
I
just
don't
want
any.
I
mean
we
can
continue
to
talk
now
to
to
be
clear,
but
I
just
didn't
want
to
miss
out
on
the
opportunity
for
everyone
to
collaborate.