From YouTube: Compute-over-Data Networks: Discovery and Composability
Description
Push-based approach with Aqua
Dmitry Kurinsky, Fluence Labs CTO
Compute Over Data Summit, Nov 2, 2022
Original recording: https://youtu.be/WqquUQDgHj0
My name is Dmitry. I'm CTO and co-founder of Fluence Labs, and I just want to share some raw thoughts about compute-over-data networks, discovery, and composability, and how Fluence can help: what our approach to this topic is. I decided to go straight from the fundamentals. I will speak about compute-over-data networks: what it means to have a compute-over-data network and why we need one, what discoverability and composability mean, and what compute-over-data networks, discovery, and composability mean together. And finally, when we have some understanding of the problem, I will tell what we at Fluence do to help with it.
So, first of all, data and compute. Data is things known or assumed as facts, making the basis of reasoning or calculation. To compute, in my opinion, means to shuffle data in a certain way for a purpose. So data is usually needed in a different form than the one it appears in: it appears in some raw form, but is needed in another. And what does this other form mean? It means aggregated, retrieved, analyzed, filtered, augmented, indexed, searched, and so on.
So we need data in many, many steps and forms, and compute actually accompanies data all the way; the same goes for data. If we want to capture a fact, it actually means that we can yield this fact without providing anything before it: we have some way, represented as an arrow here, to get some data. If we transform data, it means that we provide data, we have an arrow, some compute, and we get other data.
If we have an intermediary state that is held between different invocations of compute, then we have this kind of arrow, like a state monad, where we have some internal state and an input, and we produce a new state and an output. This state change may help with augmenting data; it may mean storing data and providing nothing as the result, and so on. And finally, we need to finalize the life cycle of the data.
We need to consume it, for example to show it, and we have some kind of sink for the data. It means that you provide the data as an argument and get nothing inside the system out of it, but you have some effects: for example, something shown on the display. So there's no way to compute without data; compute without data makes no sense, absolutely no sense. You cannot see anything. If compute without data seems to make sense, it means that there is data involved after all.
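The arrow shapes described above can be sketched as ordinary function signatures; a minimal Python illustration (the names and types are mine, not from the talk):

```python
from typing import Callable, Optional, Tuple

Data = bytes  # stand-in for any piece of data

# Capture: yield a fact from nothing, e.g. a sensor reading.
capture: Callable[[], Data] = lambda: b"fact"

# Transform: data in, data out, e.g. filtering or aggregation.
def transform(d: Data) -> Data:
    return d.upper()

# Stateful compute: (state, input) -> (new state, output),
# the "state monad" shape: may store data and emit nothing.
def stateful(state: int, d: Data) -> Tuple[int, Optional[Data]]:
    return state + len(d), None

# Consume (sink): data in, nothing back, only an external effect.
def consume(d: Data) -> None:
    print(d.decode())
```

The four shapes cover the whole life cycle: capture, transform, hold state, consume.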
Actually, it was consumed, or exported, or something happened. And data without compute is a lost fact: it happened, but it isn't even stored, so it's just lost. Some more arrows and letters: we have some predefined computes. Providing the data and getting a CID is also a compute job, and it's not always trivial to compute a CID, and it's not always easy to understand that this file corresponds to this CID and not to something else. And sometimes we offload this compute, even this compute, from the client to the server: for example, if we stream data, or if the data is too big and we just upload it and then get back a CID. Then we have this arrow inversed, and that's not quite compute.
If we go from CID to data, it means much more than compute, because we need to discover it, and a lot more things are involved here. And we can think about user-defined compute, which is an arrow without data yet, but ready to get it. Luckily, we can represent it as data as well: a WebAssembly binary, or a Docker image, or any code.
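As a toy illustration of "getting a CID is also a compute job", here is a simplified content-addressing sketch in Python. Real IPFS CIDs additionally involve chunking, multihash prefixes, and base encoding, so this is an analogy, not the actual algorithm:

```python
import hashlib
from typing import Dict, Optional

def content_address(data: bytes) -> str:
    # The address is a function of the bytes alone.
    return hashlib.sha256(data).hexdigest()

store: Dict[str, bytes] = {}

def upload(data: bytes) -> str:
    # The "inversed arrow": send the data, get the address back.
    cid = content_address(data)
    store[cid] = data
    return cid

def fetch(cid: str) -> Optional[bytes]:
    # Going from CID to data needs discovery; here it is just a dict lookup.
    return store.get(cid)
```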
Both cases make sense, and fundamentally it makes sense because it's just more efficient: the closer the compute is to the data, probably the more efficient the computation is, and it brings more privacy, more security, and overall it's fine. Then a few words about the network. It's a concept which is orthogonal to compute and data, and I wish we could just skip it,
so that we would just have to use physics. But if these resources have something important in common, if they share some attribute, then we can organize a network out of them.
So all of this matters and means that we have a lot, a lot of networks. We have room for a lot of networks, which could be very big, like protocols, but which also could be very small: for example, all the providers of a particular file, for a given CID on the IPFS network, effectively also form a kind of network by this attribute. And basically, a network means assurance of connectivity and/or agreements over data within a set of peers.
Now let's speak about compute-over-data networks, also with arrows and letters. If you think about IPFS as a network as a whole, then every IPFS deployment constitutes a compute-over-data network, and compute here is, at least, getting a CID or getting from a CID to files. The notation could be something like: we can yield a function from CID to maybe-data, if it exists,
if we can find it: maybe data. We can think about every IPFS node as a single-peer cache network, or any subset of these nodes as a cache network, so that we provide a peer ID and get this function. That works with very different latency, because we look up locally and use IPFS as a hot cache
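The "IPFS as a hot cache" idea can be sketched as a lookup that tries the local cache before falling back to network-wide content discovery; a hypothetical sketch, with the slow path stubbed out:

```python
from typing import Dict, Optional

class CachePeer:
    """A peer used as a hot cache in front of network-wide discovery."""

    def __init__(self, pinned: Dict[str, bytes]):
        self.pinned = pinned  # content already present on this peer

    def get(self, cid: str) -> Optional[bytes]:
        # Local lookup first: very different latency than a DHT walk.
        if cid in self.pinned:
            return self.pinned[cid]
        return self._discover_on_network(cid)

    def _discover_on_network(self, cid: str) -> Optional[bytes]:
        # Slow path: network content discovery (stubbed out here).
        return None
```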
instead of using the network for content discovery. It's very different for compute: we should have many compute-over-data networks, for different geographies (for example, if you want them locally), for types of compute, or for different ways to compose this compute. But fundamentally it means that somehow we can get access to the peers that can take our compute in some form of abstract arrow: WebAssembly, for example, or Docker, or whatever.
And after that, we will have the ability to execute these computations on data, so the notation becomes longer and longer. And also we can think that a single device, even one device, or a small set of user devices that belong to one user, for example, also constitutes a single-device compute-over-data network that, for example, provides a way to capture effects from, I don't know, measurements, or click streams, or something like that.
Then comes discovery. Discovery means that we need to get some capabilities out of thin air. Previously we had this "from nothing we get some capability", and that's discovery. And we have a lot of different domains that we want to discover for and in. For example, different types of data sources: we want to discover data sources, be it devices, access to a chain,
access to sensors, to external events, to randomness. These are different data sources, and we need to find a way to get this data. Or a type of data store: cold storage like Filecoin, hot storage, a blockchain as storage, whatever. Or a particular data set: we need to find this particular data set. Or a kind of compute capacity: we need GPU, we need whatever, we need a ZK kind of compute so that we get certain proofs. And does this compute fit the certain data that we want to provide?
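One way to picture discovery across all these domains is matching required attributes against what peers advertise; a toy matcher, with made-up attribute names:

```python
from typing import Dict, FrozenSet, List

# Each peer advertises a set of capability attributes (illustrative names).
peers: Dict[str, FrozenSet[str]] = {
    "peer-a": frozenset({"gpu", "region:eu"}),
    "peer-b": frozenset({"zk", "region:us"}),
    "peer-c": frozenset({"gpu", "zk", "region:eu"}),
}

def discover(required: FrozenSet[str]) -> List[str]:
    # A peer qualifies when it satisfies every required attribute.
    return sorted(p for p, attrs in peers.items() if required <= attrs)
```

The same shape works whether the attribute is "has this data set", "offers GPU", or "can produce ZK proofs"; only the attribute vocabulary changes.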
Is it possible to connect a data source to compute capacity? Do we need to prepare the data before we do compute? And for the efficiency of everything above, we need to employ very different discovery algorithms. So there must be some solution on top of that, to think about all those problems in some layer that just simplifies them, or at least lets us speak about them: to have a common language to speak about all these problems while they seem different. And in Web 2 we have a solution.
So when we just do Web 2, we have a lot of easy ways. The easiest way is to point to the local capabilities: localhost, the file system, the local CPU, or something that you have in the web browser; just call a function in JavaScript or write a variable. That's the access to data and to compute, very simple. Or, for the remote capabilities,
usually they just have an endpoint, something.xyz, some host, Infura, I don't know, that goes through DNS for discovery, and we have this discovery layer hidden in a fundamentally centralized way. But if we get to the endpoint that fulfills our needs, we're just happy, so that's absolutely fine. Web 2 is very optimized for that, and we may discover compute-over-data networks the same way, with gateways, and many people are happy doing just that with Infura or something like that.
So let's switch to composability: what to compose and why. What we want is to utilize the same network or a sub-network (it could be a small network), or the same compute capabilities, or the same data, wherever they are, as building blocks for something more complex, to deliver value to the end user.
One example is data processing pipelines. You have some device, something that produces the facts that we want to capture and analyze. It could be a real device like a phone, but it could be access to a chain; it could be a source of time, of randomness, other triggers.
Then we have the edge, like a relay node in libp2p, for example, or the closest cache, where we could do some filtering and pre-processing. And it means that we want to do it very close to the device, just for optimization: because we might get an insight very fast, or we can filter out unnecessary data very fast, or we can do a small aggregate and hold back further processing until we have collected some data, so it's batching. After that we could have some real-time processing close to the edge.
Probably we want to be in the same region; we want good latency to get some real-time insights, or to store some data in read models, to have some databases at the application level. And finally, we store it, probably into Filecoin, into the cloud: we accumulate a lot of data and we batch-process it, because we have a lot of data available, and this is a very, very different domain.
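The device, edge, fog, cloud pipeline can be sketched as staged functions; in reality each arrow would cross the network, but the shape of the flow is the same (a simplified sketch, with invented stage names):

```python
from typing import Iterable, List

def device() -> List[int]:
    # Source of facts: a click stream, sensor readings, chain events.
    return [3, -1, 7, 0, 5]

def edge_filter(events: Iterable[int]) -> List[int]:
    # Close to the device: drop unnecessary data as early as possible.
    return [e for e in events if e > 0]

def fog_aggregate(events: List[int]) -> int:
    # Same region, good latency: a real-time aggregate.
    return sum(events)

archive: List[int] = []

def cloud_store(batch: int) -> List[int]:
    # Cold storage: accumulate batches for later batch processing.
    archive.append(batch)
    return archive

result = cloud_store(fog_aggregate(edge_filter(device())))
```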
So the question is: if you want this, how do you express this pipeline? What is the way to speak about this pipeline, to approach this pipeline, how to code it? Especially given that we have a lot of discovery processes here and different protocols, because probably the edge nodes are a sub-network that is capable of doing one thing, while the cloud is a very different set of peers: different hardware, different capabilities, different connectivity, different latency.
And where should this data pipeline be expressed? Say, for example, we have just these four levels: who owns this pipeline from one level to another, who writes this code, who deploys it, and where to deploy it? Is there any place to deploy it? And, yeah, along the way we have the user's data; probably we get some insights, we make some decisions,
we should transfer some permissions. So there are a lot of questions about security that were ignored, or exploited, in the Web 2 world, but we want to do better. So, for compute-over-data networks, discovery, and composability, if we take all the words together: what we want is to utilize many networks with different kinds of resources and attributes. We want to point to these resources and have a single flow of processing; we want to take resources by geography, by latency, by different ways to establish security:
something easy, some consensus, some trust; different kinds of access to data: it could be real-time, it could be a chain, whatever. We have different compute capabilities and want to point to them. And we have a very limited number of peers actually connected to the end user, to the device, and that's the question: is it connected or not?
So a coordinator to organize, to orchestrate, this workflow seems very inefficient, because if we think about just these four layers (there could be much more, there could be less), then there is no good dedicated place or location for it, either a geographical location or a protocol-wise location. What network should it be located in? Or should it be a Web 2-like endpoint? There's just no efficient place for it.
In Web 2 it's not the case, because you can have everything in one place: you can have one endpoint that actually leads to the workflow being executed inside the cloud. But in Web3, probably not. And also, if you have a coordinator, it's going to be too powerful in terms of its ability to censor the facts, or to change them, or to move them to some other storage that the user hasn't authorized to work with this data, and so on.
All of this could be managed inside a single network. So if we say that compute over data is actually about creating just one network that is a one-stop shop that does everything, then probably it would be enough to have some protocol inside this network to discover the right place to make the call, and so on. But in real life, even for Web 2, one network is not enough: even in Web 2 we have different solutions for the edge, and different regions, and different stores, and so on.
So probably what we want to have, finally, is a flow of service composition that is permissionless, ready to be composed and integrated with different protocols, and we want this workflow to look somewhat like this.
So we have certain steps and we can describe the workflow, be it some kind of controller for a user interface, or really data processing, or anything like that. We can abstract it into something that consists of very simple steps. For example: I want to get data; then I want to discover the edge peers which are ready to serve me (ideally, they should be prepared);
then I run compute on the edge; I wait for the proper results, given the consensus or security capabilities of these edge peers, and proofs. After that, I discover some kind of fog, some intermediary computation for the real time; I move the control flow there, run compute on the fog, wait for proper results and proofs. And after that, I discover storage. Why should I discover it? Because it could depend on the data that I got from the fog compute.
For example, I could determine the right shard, which points to a sub-network, a small network, only during this computation. Probably I cannot have everything in place from the very beginning, but I can discover every next step. So it goes like this: discover, run, wait; discover, run, wait; discover and write; and finally done. But if these peers are not prepared to serve me, if they're not warmed up, then I want to have a pipeline like this: I try to discover the edge; if I cannot, I try latency-based discovery to find the closest peers and deposit the code on them, so as to prepare the edge nodes.
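The discover, run, wait loop with warm-up failover might be sketched like this; `discover_prepared`, `latency_based_discovery`, and `deposit_code` are hypothetical names standing in for the mechanisms described above:

```python
from typing import Dict, List, Optional

def discover_prepared(layer: str, warm: Dict[str, List[str]]) -> Optional[List[str]]:
    # Peers of this layer that are already warmed up, if any.
    return warm.get(layer)

def latency_based_discovery(layer: str) -> List[str]:
    # Fallback: find the closest peers by latency (stubbed out).
    return [f"{layer}-peer-1", f"{layer}-peer-2"]

def deposit_code(found: List[str]) -> List[str]:
    # Push our code onto the peers so the next step is prepared for us.
    return found

def step(layer: str, warm: Dict[str, List[str]]) -> List[str]:
    peers = discover_prepared(layer, warm)
    if peers is None:
        peers = deposit_code(latency_based_discovery(layer))
    # ...run compute on `peers`, then wait for results and proofs.
    return peers
```

Each step decides about the next one and pushes its requirements forward, rather than relying on everything being in place from the very beginning.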
The same for the fog: I want to find the capable peers, and if they are not ready, if they are not in cache, deploy the content on them. So on every step I do some computation, I make a decision about the next step, and I push my requirements to have the next step prepared for me.
So, ideally, the pipeline should look like this, and it should not depend on any particular network. So that's the background for the Fluence approach. With all of this in mind, we created a set of tools that makes it possible, like, today. First of all, we have Aqua. Aqua is the way to express these workflows, which are decomposed into these discovery,
compute, discovery, compute steps, with maybe some provisioning (deploy something or not, fail over or not, and so on), and you can compose distributed workflows, which are suitable for open networks, from these basic steps. It is push-based, so the next peers are resolved on the fly, and you can take the actual real-time connectivity into account so that you can optimize the way you go. You don't say "I want";
you say "I'm going here, because it's meaningful for me". And a push-based approach means that, while still having an open network, you involve the minimal number of peers possible. And the minimal number of peers depends heavily on the discovery mechanics: if you have a log n lookup, then you touch log n peers; if you have several logarithmic discovery processes, like in Kademlia, then you have k log n or something like that.
But if you can reduce the scope of the lookup, for example by not using the whole Kademlia layer or the whole network (pretending that all the peers are the same), but instead, using some attribute, selecting a subset of the big network and organizing some kind of Kademlia on top of it, then it's much, much more performant. And if you have a hot cache, it's even better.
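The effect of reducing the lookup scope is easy to see with numbers: a Kademlia-style lookup contacts on the order of log2(n) peers, so shrinking the searched set shrinks the hop count. A back-of-the-envelope sketch:

```python
import math

def lookup_hops(network_size: int) -> int:
    # A Kademlia-style lookup contacts on the order of log2(n) peers.
    return math.ceil(math.log2(network_size))

# Searching the whole network vs. a Kademlia built over an
# attribute-selected subset of it:
whole_network = lookup_hops(1_000_000)   # about 20 hops
attribute_subset = lookup_hops(1_000)    # about 10 hops
```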
But fundamentally, it depends on how exactly you write the code, and you have full control. It works on top of libp2p, but it doesn't require you to recompile, rebuild, or redeploy the network or the networks. So if some protocol is ready to serve Aqua, then it's just ready, and you can use it for composition without any additional, like, communication with the team, I don't know, protocol forking, and so on. And you can have different workflows.
So it's open for integration with all the other protocols. And as the result of execution, as a side effect, it provides an audit log showing that only what you expected to happen has happened, and you can use this log; we are going to use this log for the incentives layer.
With Aqua, you can express these smaller networks on top of the compute-over-data network protocols. A sub-network is a network formed around a certain resource and/or attribute: for example, data for a user; or peers of a protocol close to region G (it's an attribute, or a set of attributes); or peers capable of accessing some data, like blockchain B: peers which run Fluence and at the same time run, I don't know, Filecoin, for example, so that you can access the new blocks. And with Aqua you can do sub-network formation, using just Aqua libraries to interconnect, sync, bring code to peers, deploy, provision, and so on.
Because all these network concepts of peering, connectivity, and agreement over data could also be decomposed into single workflows. And if you can have a set of workflows to run on the network, without the end user interacting with these workflows, but instead with the sub-network, then effectively you have the sub-network, and you can have sub-networks discover each other and collaborate as well, and the composition of sub-networks, which means pushing data and control flow to the next and next sub-networks, meaning peers of these sub-networks, because there is nothing magical:
it's just a newline character; you get from one line of the code to another, and it could mean moving compute and data to the next peer. And finally, Aqua is part of Fluence, and you can think about Fluence as the workflow tech for compute-over-data networks, discovery, and composability; yeah, CoD and DnC. So Fluence accompanies protocols to help them organize push-based and purpose-driven workflows.
So we are open for different protocols, as a sidecar or as a libp2p protocol. With Aqua, the language, you can describe the workflow, or provisioning, and so on. And we have Marine: Marine is a WebAssembly runtime to run compute, or, quite likely, to lift another API or protocol running on the same peer to Aqua.