From YouTube: CODwg Live Jam Session 03
Description
Join @DeveloperAlly & @silentspring30 to jam with Simon Worthington (@51M0NW) from Bacalhau / Expanso / Register Dynamics and Luke Marsden from Lilypad on federated learning, compute use cases and engineering challenges
A
About yourself, I know we've worked together before. Simon was actually instrumental in building out Lilypad version zero which, for those of you who might not know, was a bridge between the Filecoin Virtual Machine and the Bacalhau public network. So it allowed you to call Bacalhau compute jobs from your FVM smart contracts, and you can still do that. He was instrumental in helping me build that bridge, and I think I learned a lot about Ethereum at the time. But what have you been up to otherwise?
C
Besides that? I was laughing so hard at your joke. So, what we've been working on since then: we've been taking Bacalhau on from the 1.0 release that we did in May, and starting to add lots of new features that lots of people have been asking for, which I could talk about a bit more if you like. Some of them are things like custom executors, so you don't have to just build a Docker image.
C
You don't have to use Docker to run a job; you can actually run a job that's continuous and streaming and lives for a very long time. So yeah, we're working on lots of those features, bringing Bacalhau out to more people, all that good stuff, basically.
A
Awesome. I think one of the reasons we asked you to join us here today as well is, I know you're leading that engineering effort on Bacalhau now. One of the things that was super interesting to me, and to Kartika as well, was your talk from the Compute Over Data Summit, where you released Bacalhau version 1.0 and talked all about federated learning. I think that was something you really enjoyed as well, Kartika. Could you...?
D
Sure, yeah, it was super interesting. So Simon will talk through a couple of the latest developments, but I think later on we'll talk a bit about a case study. What I love about these federated learning blocks is what they could mean for medical data, and how we can really be able to answer some of the questions.
D
How can I basically ensure that my data is safe and not being shared around, while, in terms of sharing, still being able to give access to people I might not trust that much? That's one area I thought was really interesting. And then consider that, I think, 68% of enterprise data is not really being used, or access to it can't be given.
D
That's one of the things I'd quite like to explore a bit more. And then in terms of data governance as well: almost 100% of data is governed, under GDPR and all of these other regulations you mentioned in your talk. So yeah, it'd be super interesting to learn a little bit more about that. And we chatted a bit about the medical case study, in terms of breast cancer, or medical data generally, and how we could set this up.
C
Yeah, sure, happy to talk a little bit about Bacalhau. Bacalhau is a compute over data platform. It's open source; it's on GitHub, under the bacalhau-project organization. What it aims to do is be a kind of commercial off-the-shelf, commodity platform for building compute over data networks of all sorts of different types.
C
We have a kind of global public network which you can join, and anyone can submit jobs to that network and have them run against volunteer compute nodes out there in the world.
C
So that's the public network. But maybe more importantly, what you can do is download the software and run your own compute network, with your own hardware and your own environment. What that allows you to do is take a general Bacalhau architecture and apply it in lots of different interesting ways, and computing over federated data is exactly one of those ways.
C
Since the 1.0 release we've been working on a couple of things. I talked about the executors; I'll say a bit more about them when we come to the actual compute over federated data bit. But we've also been working on long-running jobs: allowing a job to exist for a long time, you know, weeks, months, and continually receive new input, because lots of processes are not "run it once and then forget about it".
C
They're more like: it's going to receive new input over time, and you need to react to that new input with some sort of operation. As an example...
C
A key one might be: a user presses a button and says something into a microphone, and then you need to do speech-to-text and then apply it to a large language model. But you need to do that every single time the user presses that microphone button. So it's less about a one-off job; it's more about a job that will get run multiple times under the same context, and often in quite constrained environments.
C
One of the places we think a lot of this stuff is going to get applied is actually on home devices, so on your smart fridge or your smart speaker, and it's also about bringing compute over data jobs to those places as well; they're reasonably constrained.
C
I have also talked about running jobs on satellites in the past, which I actually think is something that is happening. I think we have a partner who's going to run some Bacalhau nodes on their satellites, but I can't remember what I shouldn't say, so I'll stop there.
A
So yeah, I could definitely see Bacalhau... I mean, it can even be run on IoT devices as well, which is another really cool feature about it. Yeah.
C
So you're going to have this big spike and then nothing; how does the system handle that? We've been putting in persistent storage and load-spike handling tech, to make sure that if you submit 100,000 jobs, you get 100,000 results, and they don't just disappear into the bin or anything like that.
C
So yeah, increasing the quality of life, I suppose, has been a lot of what we're working on, and also experimenting with new ways to submit jobs. At the moment Bacalhau is very much designed around the public network, so it was: here's a job, who's going to run it? You might give it to one person to run and they'll come back, or maybe you need three people to run it.
C
They then come back. But what we also see, especially in these edge scenarios, is: I've got a whole fleet of, you know, satellites, smart fridges, whatever, and I want to run it on all of them. So one of the things we've been doing is making it possible to target an entire network, or an entire subset of a network. It's not just running the job, but actually running it everywhere that it needs to run, which brings a different dimension to what's possible as well.
C
I know, that's exactly the sort of thing. Maybe you'd have some LoRaWAN sensors connected to it, or maybe they would be the compute, but it's absolutely designed to work over those sorts of very low-bandwidth, interesting connections.
B
Gotta love your QA testers.
A
I've been working more on Lilypad, so I haven't had as much to do with Bacalhau as I used to, so this is really great for me as well, learning all the new things happening. Is the public network still running, that's my question, and how robust is it these days?
C
Yes, it's definitely still running. We've still got a bunch of compute nodes that are run by the Bacalhau team, and we've also still got a bunch of volunteer nodes being contributed by people, some of whom we don't know, but that's the nature of having a network like this.
C
The public network is deliberately limited, because we don't know who's going to be submitting jobs to it; it's not just public compute, it's also public users. So there are lots of things you can't do on the public network that you can do in a private setting: jobs can't access the internet, and you can't have a job that lasts longer than about an hour.
C
You can't use too many resources at once, and you can't access local file systems or anything like that. It's really intended as a way for people to check it out, as a demo, ultimately, and see how it works. We're happy to also support people running jobs for the public good on there, absolutely no problem, but it's intentionally limited for security reasons, and to protect the people who are providing their compute.
C
So where people have been coming up against those limits, we've been saying to them: well, it's also really easy to run a private network, and here's how you do it. I'm trying to make it as easy as possible to set up one of those things.
A
That's awesome. So I imagine there's a pretty good how-to guide in the docs now on how to set up your own private network?
C
Yeah, definitely keen for people to do that, and for us to make it even easier. We'd really like to get to a place where it's kind of one-click setup, or maybe not one click, but one click per compute node, so you just run it on the machines and they all automatically connect together. That's the end goal for how easy it should be, I think.
D
So I'm going to jump into the case study a little bit, just to bring the technology a little bit more to life, and also for the non-technical folks I thought we'd go through the scenario. Let's say I want a private network, and I add all my medical data, as I said earlier.
D
My
dad
is
an
oncologist,
and
wouldn't
it
be
amazing
that
you
know
all
the
mammograms
and
like
all
this,
other
data
would
be
he
could
actually
share
like
first
within
you
know,
universities,
but
then,
and
then
you
know,
people
can
actually
contribute
to
that
as
well.
D
People could add to those data sets, and then eventually there could be federated machine learning, and some jobs could be run. The nice thing would be that people could actually get royalties at the end for contributing the data. And one thing we would like to ensure is that nobody can tamper with that data, or de-anonymize it. With GDPR being around now...
D
People
are
much
more
aware
and
have
to
be
quite
careful
in
terms
of
you
know
how
data
is
being
shared,
so
yeah,
so
that
would
be
sort
of
like
the
the
overall
scenario.
Yeah.
D
In
terms
of
the
you
know,
setting
up
like
a
private
Network,
how
would
you
know
how
would
I
go
about
that.
C
So
it
should
be
super
simple
and
and
Ally
link
to
the
the
docs
on
the
on
the
comments.
C
But
broadly,
all
you
need
to
do
is
download
the
back
of
your
software
and
it
has
a
you,
have
a
command
in
that
software
when
you
do
that,
it
will
spin
up
a
compute
node
and
what
you
ideally
want
is
to
make
sure
those
compute
nodes
are
next
to
the
databases
that
you
want
people
to
be
able
to
access
or
have
access
to
that
data.
So
there's
lots
of.
C
The requester node is like a trusted agent of the user, but it's also a trusted agent of the network. So the requester node, as well as responding to user requests, is kind of responsible for moderating them as well.
C
You
sounded
that
request,
node
configure
it
appropriately,
and
then
you
connect
into
it
all
the
different
compute
nodes
that
you
want
attached
to
all
the
different
databases
that
you
want
to
make
available
and
then
that's
It.
Ultimately
like
you,
only
need
to
do
those
things
and
then
you've
got
a
fully
functional
back
of
your
network,
which
can
which
can
answer
these
sorts
of
requests.
C
It's
not
like
there
are
various
like
flags
for
controlling
the
privacy
of
stuff,
but
ultimately
all
you
need
to
do
is
specify
that
the
things
that
you're
operating
on
a
private
so
you're
not
connecting
to
any
kind
of
other
background,
Network
you're
not
connecting
to
any
other
ipfs
network,
nothing
that
you're
doing
is
gonna,
leave
the
confines
of
what
what
you've
configured
and
then
you
can
configure
the
like
moderation
options.
C
So
by
default
the
request
will
send,
or
by
default
the
request
will
just
allow
all
jobs
to
to
be
used
so
a
way
in
which
we
would
control.
That
is
by
controlling
the
users.
Who
can
access
that
requester
node
like
via,
like
AWS
credentials
or
gcloud
credentials,.
C
Or
running
in
your
data
center,
like
your
kind
of
VPN
credentials,
that's
quite
a
broad
brush
thing,
so
it's
obviously
like
you
know
you
let
one
person
in
and
they
can
access
everything
the
next
person
can,
but
actually
sometimes
what
you
want
is,
but
different
people
have
different
levels
of
access
or
you
want
to
be
able
to
know
all
you
want
to
know
for
audit
purposes
like
who
who's
doing.
What
so
Bachelor
has
the
concept
of
a
client
identifier.
C
So
every
user
has
a
unique
client
identifier
associated
with
their
machine,
and
you
can
collect
those
up
and
you
know,
assign
them
to
specific
people
and
then
say
to
the
requester.
Node
actually
only
accept
jobs
from
these
client
identifiers
or
only
accept
jobs
from
these
real
world
identities,
or
you
can
go
a
bit
even
deeper
than
that
and
say.
Actually,
this
person
should
only
be
able
to
access
this.
This
person
shouldn't
only
be
able
to
access
that
blah
blah
blah
and
you
can
get
as
specific
as
you
want
ultimately
like.
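To make that concrete, here is a purely illustrative sketch of the per-client, per-dataset policy idea Simon describes. Nothing here is the real Bacalhau API; the identifier strings, the `POLICY` table, and the `authorize` function are all made up to show the shape of the idea:

```python
# Illustrative access policy keyed by client identifier.
# These names do not come from the real Bacalhau API; this just models
# "only accept jobs from these identifiers, and only for the datasets
# each identifier is allowed to touch".

# Hypothetical mapping: client identifier -> datasets that client may query.
POLICY = {
    "client-alice": {"mammograms-anonymized", "public-census"},
    "client-bob": {"public-census"},
}

def authorize(client_id: str, dataset: str) -> bool:
    """Reject unknown clients outright, then check per-dataset access."""
    allowed = POLICY.get(client_id)
    return allowed is not None and dataset in allowed

print(authorize("client-alice", "mammograms-anonymized"))  # True
print(authorize("client-bob", "mammograms-anonymized"))    # False
print(authorize("client-mallory", "public-census"))        # False
```

An identifier-keyed allowlist like this gives you both gating and an audit trail, since every accepted job is tagged with the identifier it came from.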
C
There is a lot of control over how that requester node will accept jobs. You can also get into a middle ground, so you can say: these users shouldn't have access, these users should have global access, but, more realistically, most users can access all the things that we think are not particularly controversial. If a user wants to access this data set, well, it's mainly public anyway, so they can just access it. But if users want to access that data set, it's sensitive.
C
Maybe it's pseudo-anonymized, but it's still got quite high-value data in there, so their jobs will need to be moderated by a human, to check that what they're doing is valuable. One of the slides that I presented during the CoD Summit was on exactly this; let me see if I can find it... not that one, this one. So, these are the various ways in which people talk about what is acceptable, and I guess the key element...
C
The key elements that mean you need moderation are that what people are going to do with the data is as important as who they are, and as important as what the data is. You're obviously happy to share data to do with mammograms for the purposes of curing cancer. But if someone came along and said, actually, I'll only use this data to increase the profits of my pharmaceutical company, maybe you would say: no, I don't support that goal, that's not in line with
why I'm sharing this data in the first place, so I'm not happy with that. Understanding the context around what the job is going to do is part of that moderation job. So yeah, you can ultimately configure, with a high level of control, which jobs should be run, and also pass a job to a human moderator to say: you need to check this, to see whether it's going to do something inappropriate or not. The thing that's interesting about that, at the moment, is...
C
Moderating those jobs is quite hard. If someone says, "I'm going to run a random Docker image over your sensitive data", I would be like: well, what's it going to do? How do I know what's in the Docker image? Sure, you can inspect a Docker image and check it out, but actually that's very difficult to do, and it's a technical job: you need to be a technology expert to know how to do that and to really assess it.
C
This is one of the reasons we want custom executors, so that you could just submit, say, a Python script to run against the data. That can be significantly less complicated, significantly less complex, than a Docker image, which makes it a lot easier to moderate. Or you can go even higher level and just submit an SQL query to run against a database, which is quite easy to moderate compared to, say, a blob of WebAssembly, which is very difficult.
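As a sketch of that "ease of moderation" ladder (the levels and scores below are my own illustration of the ordering Simon describes, not anything Bacalhau ships):

```python
# Illustrative only: rank job types by how hard a human moderator finds
# them to review, per the discussion above. Lower score = easier to vet.
MODERATION_DIFFICULTY = {
    "sql": 1,     # a query is short and declarative: easiest to read
    "python": 2,  # a script is inspectable source, still reviewable
    "docker": 3,  # an opaque image: you must unpack and inspect it
    "wasm": 4,    # a compiled blob: hardest to audit by hand
}

def easier_to_moderate(a: str, b: str) -> bool:
    """True if job type `a` is easier for a human to vet than `b`."""
    return MODERATION_DIFFICULTY[a] < MODERATION_DIFFICULTY[b]

print(easier_to_moderate("sql", "docker"))   # True
print(easier_to_moderate("wasm", "python"))  # False
```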
D
How can you really evaluate what's going to happen and what they're actually running? But yeah, perfect, this sounds amazing, and I'm glad it's coming. I would love to learn a little bit more, and to help people actually use it, make it accessible, and have a great interface around it.
C
Definitely. I mean, that'd be a great hackathon topic. We've kind of always imagined that there would be automated tools that can help with that moderation. You know, can you apply some AI to a piece of code, or to a whole job, and get it to generate some level of risk, or to try to describe what it's doing in a way that's helpful for a human moderator? We've never really tried it, so it would be great to see people doing something like that.
A
It's a big field, I think, with a lot going on there.
A
I don't know if anyone's heard of Bluesky, the social media app, which is trying to create this version of Twitter, or X, or whatever it is now.
A
Bluesky is creating this version, kind of like Mastodon as well, where you'll be able to bring your own servers, and then you, as a community, decide what sort of moderation rules you want around that. It's only loosely parallel to what we're doing here, but I think content moderation is one of those challenges that has been a challenge for many years, and it's still really unsolved.
A
How do we do it well? There are lots of different ways to do it, and I definitely think AI is going to be able to help, though. Yeah.
C
Yeah, it's definitely really hard, and I think you're absolutely right that the key is to understand, in your community, or I guess in this context your topic area, what the risk factors are. Because right at the bottom of that list is safe outputs, which is exactly the same problem as content moderation, right? You've got some data coming out.
C
You need to moderate it as automatically as possible, and that's a lot easier if you know that, say, we're working over patient data and the big risk is people's details being leaked. It's actually a lot easier to solve that problem than to solve a general problem like "is this image offensive?". So yeah.
A
Yeah,
exactly
exactly
right
and
I
think
there's
kind
of
a
parallel
here,
or
at
least
a
a
random
segue
that
I'm
going
to
take
into
this.
This
is
also
the
case
with
how
we're
building
out
you
know
kind
of
this
incentivization
layer,
which
is
Lily
Pad,
which
you
know
is
basically
taking
the
vapia
public
network
and
turning
that
into
an
incentivized
version
of
that
or
a
like
three
version
of
that.
Like
so
much
like
kind
of
file
coin
is
to
ipfs.
A
You
could
say
little
iPad
is
to
to
backly
out
a
lot
of
extra
code
going
into
this,
but
the
basis
of
it
was
you
know
the
Buffy
okay
code.
A
Not all compute jobs are deterministic, so at some point we are going to have to figure out how to run jobs that are, to a certain degree, non-deterministic on the network, and still have the ability for the game theory to catch when the results are wrong. So, really interesting research problems anyway.
C
Yeah, that's really interesting. I think there is an interesting pedigree here from people like Koii, who have sort of taken the problem and punted it, but in quite an interesting way, which was to say: actually, you're the one running the job, you're the one running the community, so it's up to you to decide how best to determine whether or not the job's been done correctly. Because if you're screen-scraping something, well, it can change between results, right? So it's kind of statistical...
C
Is one result better than another? It just depends on the workload, which is quite hard to handle generally, but lots of engineering research there, for sure.
A
Yeah, definitely. Levi Rybalov, who's our researcher on this, knows way more than me; I'm just regurgitating some of the things I've heard from smarter people than me, basically. Actually, I know Koii has worked with them.
B
Exactly
exactly
really
afraid
there.
D
Another
question
was
I
think
you
we
were
looking
at
that
in
terms
of
you
know,
saving
costs
and
sort
of
egress
costs
in
terms
of
bakayo,
and
you
know
you
had
like
an
interesting
calculation.
I
was
just
wanted
to
highlight
that
a
little
bit
again.
C
Yeah
yeah
so
I
mean
this
is
this
is
a
a
slide?
That's
come
out
for
Lori,
which
is
really
like,
demonstrating
the
sort
of
the
reason.
Part
of
the
reason
why
we
like
doing
computer
data
in
place
is
valuable
because
I
mean,
for
example,
this
is
the
egress
cost.
C
There's a similar ingress cost, which I actually think is more than this as well, but ultimately the act of moving data around costs you money. Getting data in and out of these big clouds, like AWS and Azure and Google Cloud, is costing people thousands of dollars.
C
That's for 50 terabytes. I don't know whether people still think that sounds like a lot or not; when I had my first PC, that was an unfathomably large amount of data, but I think you don't have to go to very large scale any more before that becomes a commonplace amount of data. I mean, the amount of data that Netflix has, or the amount of data that Amazon has, is going to outstrip that by orders of magnitude, several in fact.
C
If you imagine the number next to AWS, like 4,300, and just multiply it by 100 or a thousand... and in fact I think the costs aren't linear either, for these clouds.
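As a back-of-the-envelope version of that slide (the $0.09/GB rate is my assumption, roughly a commonly quoted list price for internet egress from the big clouds, not a figure from the talk):

```python
# Back-of-the-envelope cloud egress cost, assuming a flat $0.09 per GB.
# Real cloud pricing is tiered and changes over time; this is illustrative.
RATE_PER_GB = 0.09
tb = 50
cost = tb * 1000 * RATE_PER_GB  # 50 TB -> 50,000 GB

print(f"50 TB egress at ${RATE_PER_GB}/GB: about ${cost:,.0f}")
# Netflix/Amazon-scale data is orders of magnitude bigger:
print(f"1,000x that volume: about ${cost * 1000:,.0f}")
```

That lands in the same ballpark as the slide's 4,300 figure, and multiplying by a hundred or a thousand quickly reaches the "prohibitively expensive" territory Simon mentions next.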
C
That's when you start to see something that is just prohibitively expensive, and definitely the reckoning from most compute over data platforms, I think, and certainly from Bacalhau in general, is that this is not affordable or sustainable, ultimately. Another one of my favourite stats is data growth: the growth of data volume is outstripping the growth of network bandwidth by a factor of something like 43. That just means it's not going to be very long before we physically cannot move all the data we have into a central place.
D
And also the time required, right? I think one super interesting area would be to analyze the sort of carbon-footprint reduction you would actually get because of that. So that would maybe be another hackathon, or...
D
Just
a
research
study
in
terms
of
you
know,
like
Net
Zero,
there's
a
lot
of
criticism
around
that
as
well.
C
Yeah, I mean, doing less, ultimately, is the answer, and moving data is one of those easy things to do less of, if you have the ability to query it in place. And I think you're absolutely right that a lot of people don't appreciate how slow data transfer still is. I was reading a news article the other day, I can't remember where, in which they raced a 50-terabyte data transfer between two European cities.
C
One transfer was over the internet and the other was on a pigeon, and the pigeon got to the other data center about 20 times faster than the internet transfer. I think the internet transfer was only at four percent when the pigeon arrived. It's just: yeah, this thing is really slow; a pigeon can do it faster. For those sorts of volumes, that's exactly it.
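For intuition about why the pigeon wins, a quick sanity check, assuming a sustained 1 Gbit/s link (my number; real long-haul transfers often sustain far less, which only widens the gap):

```python
# How long does 50 TB take over a sustained 1 Gbit/s link?
# The line rate is an assumption for illustration, not from the anecdote.
terabytes = 50
bits = terabytes * 1e12 * 8  # 50 TB expressed in bits
seconds = bits / 1e9         # at 1 Gbit/s
days = seconds / 86400

print(f"{days:.1f} days at a full, sustained gigabit")  # ~4.6 days
```

A bird carrying drives between two nearby cities needs hours, not days, so "only four percent done when the pigeon arrived" is entirely plausible at lower, more realistic sustained bandwidths.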
A
But yeah, that's great. I think that's one of the reasons Filecoin was born as well, if I'm allowed to sneak that in, since we are on the Filecoin channel: the point is to decentralize all this. Also, an interesting thing to think about when we're talking about hardware is that even the big providers are mostly concentrated in North America or Europe, a couple of places in the world.
A
Really
it's
not
a
very
distributed
geography
of
data
centers
even
owned
by
these
big
massive
Cloud
providers.
So
you
know
you're
also
kind
of
disadvantaging
people
in
general
in
a
more
Global
World
and
that
that
does
that
does
lead
to
issues.
We've
seen
that
the
internet
tried
to
you
know,
did
a
lot
to
change
that
as
well,
like
people
can
get
an
education
with
an
internet,
but
even
even
now,
like
that's
not
everywhere
so
yeah
I,
don't
know
what
I'm
on
about,
but.
D
Just
to
sort
of
you
know
get
back
to
the
case
study
so
the
last
part.
Let's
let's
say:
okay,
you
know
I
set
up
my
private
Network
and
then
I
can
do
my
run
mine.
You
know
machine
Federate,
machine
learning
on
there
as
well,
and
you
know
get
all
these
amazing
results,
but
so
in
terms
of
Lily,
Pat
and
being
able
to
then
get
royalties,
can
you
sort
of
talk
a
little
bit
about
around
that?
How
would
work
for
the
for
end
users.
A
That's a really interesting point, and I think it's one that AI is bringing up a lot these days. One of the things that's coming to the surface because of AI is: how do we attest to, one, the originality of potentially any creative endeavour these days, blogs, code, whatever it is? Was it actually written by a human? Was it actually done by a human?
A
Is it actually your prime minister saying that? Who knows; these deep fakes are going to be literally everywhere. So that's one problem, and the other problem is...
We've trained up all these data sets as well, and this is what you're mentioning here, Kartika: we've trained off all these data sets and we don't really know where they came from, because a lot of this is a black box, a closed black box of information.
A
These
days
I
mean
we
can
have
a
guess
where
these
data
sets
came
from,
but
but
anyway,
you
know,
but
the
original
artists
or
the
original
creators
aren't
getting
any
attributions
here.
So
how
do
we
manage
to
do
that
and
look
I?
You
know,
I
I
do
work
in
the
web,
3
space
and
I
think
you
know
this
is
exactly
what
blockchains
are
for.
These
are
the
kind
of
things
that
blockchain
can
help
with.
A
One
of
them
is
verification
and
attestation,
one
of
the
another
ones,
a
Providence
record
and
the
other
one
is
our
payments
layer
and
we've
kind
of
done
something
similar
to
that
with
a
project
called
waterloo.ai
which
trains
a
an
artist,
a
transa
model
on
an
artist's
work,
so
fine
tunes
really
to
an
artist's
work,
so
this
artist
would
upload,
say
50
images
or
so,
and
then
we
run
this
machine
learning
fine-tuning
model
on
it
to
fine-tune
to
that
particular
artist's
style.
A
So
it
works
best
for
artists
that
have
a
unique
style
if
they
have
stuff
everywhere,
it's
a
bit
harder
to
train
on
clearly
and
then
a
user
can
come
along
and
pay
a
certain
amount
to
have
an
image
created
out
of
a
prompt
they
put
in.
Like
you
know,
my
favorite
one
is
a
rainbow
unicorn
in
space.
This
is
the
one
I
use
all
the
time,
which
is
why
you'll
see
unicorns
everywhere,
but
and
then
in
the
background,
there's
smart
contracts
here
that
automatically
pay
out
to
that
artist.
A
So
the
idea
is
that
that
artist
deserves
some
attribution
for
the
original.
You
know
data
that
they
provided
to
this
AI
model.
So
that's
one
way:
blockchain
could
help
with
that
and
that's
just
a
proof
of
concept
project
like
I,
really
hope,
like
people
extend
and
expand
on
on
these
kind
of
ideas,
for
where
blockchain
can
kind
of
help
with
offset
some
of
the
maybe
more
problematic
areas
of
AI
and
I.
I.
Definitely
think
verification
or
provenance
and
attribution
are
some
of
those
so
and
yeah
under
the
hood.
D
So
Theory,
then
you
know
you
could
have
like
a
group
of
let's
say,
and
in
this
instance
it's
just
it's
this
one
artist
who
gets
the
the
royalties
back.
But
let's
say
they're,
like
hundreds
of
people
who
you
know
part
of
this,
maybe
a
Data,
Trust
or
something
else,
and
then
by
giving.
D
You
know,
like
all
this
data,
rather
than
you
know,
like
currently
people
sort
of
give
data
for
free,
like
even
not
knowing
right,
but
it
would
be
much
more
conscious,
like
a
conscious
way
of
contributing
data
to
like
a
sort
of
controlled
or
a
good
cause
or
but
then
you
get
like
your
little
royalties
back
yeah.
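In the simplest case, the royalty mechanics Kartika describes are a pro-rata split, which on Lilypad would live in a smart contract rather than a script. A minimal Python sketch of the split itself, with made-up contributor names and share counts:

```python
# Illustrative pro-rata royalty split for a data trust: each contributor
# is paid in proportion to how much data they contributed. In practice
# this logic would be enforced by a smart contract, not a Python script.
def split_royalties(pool: float, contributions: dict[str, int]) -> dict[str, float]:
    total = sum(contributions.values())
    return {who: pool * n / total for who, n in contributions.items()}

# Hypothetical: 100 tokens earned by a model trained on three people's scans.
payouts = split_royalties(100.0, {"alice": 500, "bob": 300, "carol": 200})
print(payouts)  # {'alice': 50.0, 'bob': 30.0, 'carol': 20.0}
```

The same shape scales to hundreds of contributors, and the whole pot is always paid out, which is the property a data trust would want to verify on-chain.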
A
Yeah, 100%. I think that's something that's happening in the traditional tech space too; people are going: okay, we want to have these public-good data sets that people can train AI models on. Clearly AI is going to make a massive difference in the world. There are problems it's going to solve that we wouldn't be able to solve on our own, complex problems, and it's going to speed up the time to solve some of them; we were talking about the medical side before, and that's...
A
Imagine if, next to those data sets, you had something that could process that data, like Bacalhau, like Lilypad, something that can access and use that data in that same data house. You could run that as part of your open source model, or you could run these Python scripts, or whatever it is.
You know, your own scripts, clearly better than mine, although you've got JavaScript now, so maybe try out some new things on top of this data. And this is what the whole open source movement was about, wasn't it: speeding up the process of coming up with solutions. A really interesting article that I actually read around this...
A
It was this leaked Google document about how Google has no moat, and neither does OpenAI. It was talking all about how open source is basically eating their lunch: they can do it quicker, they can do it better, and they can do it more targeted, these open source developers, with access to data and GPUs. So let's go give it to them, and let's see what we get out of it. That's my little spiel right there.
C
I think so. I guess the way in which it comes back to the federated stuff is that it's one thing to have an open data set that is built in an open source way, or built by a community in an open way. In the medical case, though...
C
Obviously, it's more challenging to do that. Maybe you could have some medical data sets that were anonymized, but it's really hard to know whether or not that will be good enough before you release stuff into the open. Because even if you have anonymized data, it only takes something like three data sets which contain the same individuals, with some very specific properties, before you can take...
C
Three data sets that are each anonymized can, combined, actually uniquely identify an individual in those data sets. So if you're a data custodian for some of this sensitive data, the bar for being able to release it publicly is very high. I don't know whether you've seen the news in the UK about a massive data breach.
C
It happened with all the police data: basically, they accidentally released the data of all currently serving police officers in the United Kingdom, including Northern Ireland, where police officers are routinely in danger as part of their duty. It's that sort of story that takes people who are in the difficult position of being responsible for sensitive data and pushes them back towards being very data-fearing, which is the term people use for that "oh no" reaction.
C
C
Think
people
had
all
the
data
in
Excel
and
didn't
realize
that
if
you
delete
the
tab,
but
you
still
have
a
pivot
table
that
references,
the
data
and
the
data
is
still
in
the
file,
so
they
release
the
Excel
with
the
pivot
table,
and
then
they
didn't
realize
that
all
the
data
was
still
there.
That's
how
it
happens
so.
C
So the term that people seem to be using to talk about this sort of stuff is "compute islands", and the idea, basically, is topic-specific compute, or gating access. In the use case that you talked about, your father, who's an epidemiologist, has done some data collection. That was an expensive process, and he would like to make that data more available for use, because ultimately you're furthering science, so making the data as available as possible is a positive thing. But he also doesn't want to be involved with moderating requests, or spinning up Bacalhau nodes, or doing any of the tech, because that's not the scientist's job, right? They should do what they do best.

And so the idea here is that people will contribute their data sets, or contribute their compute hours, but not in a way that actually removes the data from their control. It's still within their domain, or within that kind of enterprise boundary, but they put their trust in an external moderator, who is well chosen, to allow or deny access based on their policy. So your father might say: I only want this data to be used for medical science; I don't want it to be used to develop private drugs, and I don't want it used to train AI models. That's my ethical stance. And it'd be up to the moderator to apply that policy. Because you've got that whole end-to-end thing there, if you introduce a remuneration element, where the user on the right is paying to access that data, that can correctly flow back into the coffers of the university to fund more research.
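A minimal sketch of the gating idea just described. The policy fields, purpose tags, and types below are hypothetical, invented for this example, and are not part of Bacalhau's actual API: the custodian declares allowed and denied purposes, and an external moderator approves or rejects each compute request against that policy before anything runs.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    """Hypothetical policy a data custodian attaches to their data set."""
    allowed_purposes: set   # e.g. {"medical-research"}
    denied_purposes: set    # explicit ethical exclusions

@dataclass
class ComputeRequest:
    requester: str
    purpose: str            # what the job claims to be for
    job_spec: str           # e.g. a container image reference

def moderate(req: ComputeRequest, policy: Policy) -> bool:
    """Moderator's decision: an explicit denial always wins over an allowance."""
    if req.purpose in policy.denied_purposes:
        return False
    return req.purpose in policy.allowed_purposes

# The epidemiologist's stance from the conversation, expressed as a policy.
policy = Policy(
    allowed_purposes={"medical-research"},
    denied_purposes={"private-drug-development", "ai-model-training"},
)

print(moderate(ComputeRequest("uni-lab", "medical-research", "img:analysis"), policy))      # True
print(moderate(ComputeRequest("pharma", "private-drug-development", "img:screen"), policy))  # False
```

The point of the design is that the custodian only has to state the policy once; moderation, node operation, and payment flow are someone else's job.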
C
Right, so, you used the words "data trust"; that's absolutely a big thing at the moment. Data trusts are very much a social or legal construct; they're not on a very strong technical footing, and mostly they involve a bunch of companies or organizations that don't have a huge desire to share data with each other, other than under specific circumstances, ultimately giving their data up. Some of the compute over data technology is able to say what queries they have run, and you can look through it and say: oh, actually, that wasn't really what I had in mind, I'm going to end this relationship. Whereas if you hand over your data, obviously you no longer have visibility of what people are doing with it. So yeah, I definitely think there is a way to bring more of that private data, in a safe way, to good use: training models that have public benefit, or doing other science with public benefit.
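One way to picture that visibility is an append-only query log that the custodian can review after the fact. This is an illustrative sketch only, assuming invented names throughout, and not a description of any particular compute-over-data implementation.

```python
# Illustrative sketch: every job run against the data set is recorded,
# so the custodian can audit usage and end the relationship afterwards.
audit_log = []

def run_query(requester, query):
    """Record the request; the query itself would execute inside the
    custodian's own boundary, so the raw data never leaves."""
    audit_log.append({"requester": requester, "query": query})

run_query("uni-lab", "SELECT region, AVG(age) FROM cohort GROUP BY region")
run_query("uni-lab", "SELECT * FROM cohort")   # raw-row export attempt

# The custodian reviews the log and spots a query outside the spirit of the deal.
suspicious = [e for e in audit_log if "SELECT *" in e["query"]]
for entry in suspicious:
    print("ending relationship with", entry["requester"])
```

Contrast this with handing the data over outright: once the file leaves your boundary, there is no log to review and nothing left to revoke.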
D
Absolutely. When you think about where the data comes from and how AI models have been trained so far, and the biases in it, it's crazy, right? So if we could enhance that and have these really reviewed data sets, with a bias checker or whatever, you'd know what you're actually looking at. And I completely understand: if you want to train your model, you just need to get started, so you go the easiest route. But at least you'd know what the holes are, where you need to add to it. And there will be different pockets, that's how I envision it: different pockets of people, or different groups of people, who can then supplement it and support each other within that, but still get a monetary reward at the end. I think that would be sort of an ideal world, yeah.
C
Definitely, and I do think there's something really powerful in people being able to be part of it without having to do it all themselves. In this case, the scientists or the clinicians are able to contribute stuff and see some of that reward without having to run the whole system or build the technology from scratch. I think that's what compute over data in general almost has a responsibility to do; part of the vision of the whole movement is bringing more of this tech, which is not new, let's be clear, they were doing this in the 60s, to more people, through all the stuff that we've learned over the past decades. And, exactly like you say, building it in a way that allows people to take part no matter where they are, and be compensated for that.
B
I think that's a good place to leave off here, unless you have any further comments, Cardika, or...