Description
Project Bacalhau is focused on helping projects increase trust via open data processing pipelines. This talk describes how Project Bacalhau is working to improve the state of Decentralized Science via public data processing use cases.
So I'd like to start the talk with a little bit of a reflection, talking about Nicholas's history of the scientific process, and specifically the challenges we want to help with through open infrastructure.
So let's start with an example of how it started. This is a visualization of a 1768 painting of an inventor who created a new instrument of science known as the air pump, which we refer to today as the vacuum, and you can see the audience surrounding him as excitement floods the room, waiting to observe the fate of the bird in the glass vial. This is a fun example, because it's open science and it's happening in real time. It's verifiable! It's trustless!
You can see that this is actually playing out in real time, and maybe that's some version of what we're trying to get back to in the new scientific movement.
So, a little bit of reflection: I follow a number of folks on what we might call DeSci Twitter, in particular Jocelyn Pearl, and she had some interesting science memes that I thought would be fun to share. This is a bit of a reflection on the challenge academics are in today. To give you a little background, my own background is more in cloud infrastructure, so I've been learning about academia, and it's been really fascinating to see the systemic challenges academics have to go through. These are people who really want to make the world a better place, and through no fault of their own, and no fault of their institutions,
the structures have solidified in a way that makes it very painful for them to do the work they do. In fact, there was one academic, studying, I want to say, advanced physics and computational work, who was just at the end of her rope: she's fighting with the grant process, she's trying to get published, and she's effectively prohibited from doing the work she really cares about. So what we're seeing is a movement of brilliant scientists who are willing to take a different direction, and I think LabDAO, VitaDAO, and Molecule DAO represent the best aspirations of the people who are going to start new industries in this space. In particular, we want to empower these folks with the best tools, so that if you're a researcher who's been in an institution, where you have your Python notebook and you've been doing your work in that space, we can give you tools that are as good or better in this new space. So there are a couple of things we need to fix.
First off, the public cloud platforms are very robust, but they're often oriented more toward closed systems. And even when you get into the economics of web3 systems and the things that Nicholas was describing, tokenization and those sorts of payments become an issue. Web3 projects that are innovating in the decentralized science space also deserve first-class decentralized infrastructure that is on par with what they're trying to do. Now, there have been open storage platforms like IPFS and Filecoin for some time, but we're just now seeing the burgeoning of compute platforms, and Bacalhau is just one of many projects trying to bring that compute capability to the ecosystem. So let's visualize this, particularly for folks who may not have a technical background.
You have your storage infrastructure here on the bottom: you can store whatever you want, as much as you want, at super low cost or for free. Above that you have some compute infrastructure, which may be Bacalhau or another project in our space. And then you'll have apps; I think LabDAO represents a good example of this.
You have IP-NFT protections for scientists, who can fund their research and align their financial incentives, and ultimately, hopefully, over time we can rebuild an industry where scientists can be self-supporting and do great work on their own. All right, so to dive a little deeper into the technical piece, I want to give you an example of a problem that came up through the Max Planck Institute, from a project that was launched about two years ago.
It's named Eureka, and it's aimed at the challenges of measuring Earth's temperature. It turns out that climate scientists struggle with accurately identifying the temperature of the ocean when clouds appear, because clouds can make it difficult to accurately measure the humidity and temperature of the ocean in certain places. So they launched this large-scale survey.
At the end of this, you have terabytes and petabytes of data spread across different universities, different GDPR jurisdictions, and different academic ownership of the data, all trying to solve this problem, which is truly good for humanity; the problem gets magnified significantly. As an example, if you go to the case studies at bacalhau.org, the team has, through the Eureka project, actually posted those data sets on IPFS, and you can get access to most of that raw data.
Today it's been cataloged; it lives in different places, and they're starting to stitch this data together. Rather than having it siloed in an individual university or an individual private repository, it's now publicly available. But one of the challenges we want to solve is that when you get to very large data sets like that, terabytes and petabytes, it can be difficult to move that data quickly across long distances. It's so large, it just takes time.
The network pipes have limitations, and so one of the things we're very interested in is sending the compute to where the data lives. If you host a large portion of the scientific data, many petabytes of information, we want the researchers who are going to do pre-processing of the cloud images, the cloud masking, to send their compute to where that data lives. It's much more efficient, and you again get the best-in-class experience you would get from a public cloud, but through these web3 technologies.
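To make that concrete, here is a minimal sketch of what sending compute to the data looks like with the Bacalhau CLI. The CID, container image, and script path are hypothetical placeholders, and the `-v CID:path` mount syntax follows the project's early published examples rather than anything shown in this talk:

```
# Mount a content-addressed dataset into the container by CID (placeholder
# CID and image). A compute node that already has this CID pinned locally
# can bid on the job and run it without petabytes ever crossing the network.
bacalhau docker run \
  -v QmYourDatasetCID:/inputs \
  yourname/cloud-mask:v1 \
  -- python /app/mask_clouds.py /inputs /outputs
```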
And so let's bring this back a little bit to the impact we would like to have on the way researchers work today. Imagine a researcher says: I just did some pre-processing of this Eureka data set, and through my sophisticated machine learning I was able to refine the images of clouds, and now we can more accurately measure the Earth's ocean surface temperature. Researcher two says: great, can you send me the files? I'd love to reproduce your work.
I fork my code on GitHub from other people all the time and build off their work; I would love to do the same with yours. This is what I expect; this is how technology works today. And all these other researchers say the same thing, so now you've got an audience, a community, and a bit of a scaling concern.
So this traditional approach of having the data live on an FTP server at an academic institution is going to hit up against scale limitations very quickly, and our solution is salted codfish. The name of the project is a bit of a play on words: bacalhau is the Portuguese term for cod, and COD stands for compute over data; that's how we got our name. And so the goal, and this is stealing a quote about the Bacalhau project from a famous builder in the DeSci space,
A
Is
that
now,
when
the
data
and
the
processing
are
completely
in
public,
the
resources
are
automatically
shared
automatically.
It's
it's
default
to
open,
which
I
think
a
lot
of
you
are
hearing
in
the
space.
Not
only
is
it
natural
for
the
web
3
Community,
but
it's
really
natural
for
what
academics
want,
even
if
they
are
limited
in
some
way
by
their
institutions,
and
so
now
you
have
an
annotated
graph.
You
share
it,
you
build
on
it
and
in
many
ways
the
scientific
Community
moves
faster.
So
going
back
to
the
technical
schematic.
Now we have all this data that lives in IPFS. We send a copy of that information out, and it lives on these different pinning services; there's a similar architecture for Filecoin, which we'll get to in a bit. The data now lives in all these places, with everyone contributing to storing a copy of it. And so, as a researcher, I bring my data and I bring my code; maybe it lives in GitHub and it's a Docker container. We send it to the Bacalhau cluster, it gets processed, and the resulting data also lives in IPFS.
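One nice property of this schematic is that because everything is addressed by CID, it doesn't matter which pinning service or peer actually serves the bytes. A rough sketch with the standard ipfs CLI, using a placeholder CID:

```
ipfs pin add QmEurekaDatasetCID   # volunteer your own node as one more copy
ipfs ls QmEurekaDatasetCID        # list the files inside the dataset
ipfs get QmEurekaDatasetCID       # fetch it from whichever peers hold it
```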
So let me give you a little more of a technical deep dive, for those of you who are more hands-on with IPFS and Docker technology. The Bacalhau platform is meant to treat Docker containers and WASM binaries as first-class citizens. To translate that into slightly less technical language: any work you've done as a scientist, whether you've written Python libraries or something in Julia, can all be wrapped into a container when you submit it to the Bacalhau network, as in the sketch below.
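As a sketch of what "wrapping your work into a container" means in practice (the script name, dependencies, and image tag below are placeholders, not something from the talk):

```
# Package an existing Python analysis script as a Docker image that a
# Bacalhau compute node can pull and run.
cat > Dockerfile <<'EOF'
FROM python:3.10-slim
RUN pip install numpy xarray             # whatever your script imports
COPY mask_clouds.py /app/mask_clouds.py
ENTRYPOINT ["python", "/app/mask_clouds.py"]
EOF

docker build -t yourname/cloud-mask:v1 .
docker push yourname/cloud-mask:v1       # image must be publicly pullable
```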
One of the compute nodes will bid on your job, maybe because it has the data locally or because it has the availability to process your job; that all happens transparently to the user. The data is moved between IPFS and Filecoin transparently, and then the user gets their results back just as if they were running locally. In fact, recreating that local development experience is a big focus for us with the architecture, and so this is an example of what it looks like.
Obviously a simple CLI is our first goal there; eventually we're going to build out more web-based user experience capabilities for the platform. So, just briefly, I'd love to show you a video of what this looks like in action. On the left-hand side you've got a command line where you're going to submit some Bacalhau jobs; on the right you've got a bunch of files that live in IPFS. These are Landsat images; they have clouds, and they need to be processed. We're going to run a job here.
In fact, you can see this is the bacalhau docker run command; we're going to run a simple image-resizing job against those files, and when we submit it, it goes off over the internet to a Bacalhau cluster that lives somewhere in the world. When we say bacalhau list, we can see that the job is actively running; then it gets completed, and we have a nice IPFS CID, which everyone in the world can access afterwards. Now I say bacalhau get for my job, and it brings me back all the results.
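Pieced together, the session in the video looks roughly like this. The dataset CID and job ID are placeholders, and the imagemagick invocation follows the project's published image-resize example as best I can reconstruct it, so treat the exact flags as illustrative:

```
# Submit: resize every Landsat image in the IPFS-hosted dataset.
bacalhau docker run \
  -v QmLandsatImagesCID:/input_images \
  dpokidov/imagemagick \
  -- magick mogrify -resize 100x100 -path /outputs '/input_images/*.jpg'

# Watch the job move from Running to Completed.
bacalhau list

# Pull back stdout, any errors, and the resized images; the resulting
# CID is fetchable by anyone over IPFS.
bacalhau get <job-id>
```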
I get some standard output; if there were any errors, I get that information; and very quickly I get my new file, which has been automatically resized. That's now available on the internet for everybody else to view, and it's all entirely transparent. If you have interest, or if you have an opportunity to make use of public compute for data processing, please reach out to us. You've got my contact information here, and we also have a channel on the Filecoin Slack. Jump in, ask questions, give us feedback.