From YouTube: Overview of Project Bacalhau - David Aronchick
Description
Over three days in April 2022, we brought together 50+ people from across the ecosystem (Starfleet, OuterCore and PLN) to discuss opportunities and architecture of Compute over Data.
- Core problems to be solved across analytics platforms
- How to meet data engineers/scientists with the tools they use today
- Vision (familiar, simplified, collaborative) and high level roadmap for Bacalhau
Learn more: https://www.protocol.ai
Subscribe to Protocol Labs: https://youtube.com/@protocollabs?sub_confirmation=1
Follow Protocol Labs on Twitter: https://www.twitter.com/protocollabs
Compute over Data Summit returns in 2023!
Dates: May 9-10
Location: Boston, MA
Registration: https://www.codsummit.io
Thank you so much. So, this is Bacalhau. I know this is the temporary logo. I love it too, but yeah, it looks like a dead fish. It's not good. You can't have a new company without some incredibly pithy, you know, data-transform imagery, so I hope you used that as a placeholder. I actually also like it as an official way to process data. I want to add a caveat for this session.
My session is going to be an extremely high-level walkthrough, drawing from my own experience. A lot of you don't know me from Adam: I previously led Kubernetes for several years (I was the first non-founding PM for Kubernetes) and started the Kubeflow project. I've been in ML and data science for a long time, so this is very much about pains that I have seen in the world. However, it is going to be super high level.
So the first thing I want to do is really set the stage for the world that we're going into. It's an incredibly powerful world, and it has gotten so much attention. In the web3 space we obviously know a lot about this, but it's really a general thing that people know about. As Juan mentioned, we're adding a petabyte of storage deals a day; this is new stuff coming onto the network.

That is unheard of out there in the world, for a platform to be adding this much data constantly. And when you look at the analysts, they talk about how big data and data usage will affect every industry. I know a lot of this slide is washed out again.
A
You
can
go
watch
it
all
online,
but
down
the
left-hand
side
is
basically
every
domain
you
could
possibly
think
about,
and
and
within
the
next
few
years
they
will
all
be
affected
by
the
accumulation
and
use
of
big
data
to
measure
this.
A
lot
of
times,
okay,
well,
are
you
using
it?
Does
it
actually
matter?
This
is
spend
per
year
and
just
a
reminder
like
juan
put
up
something
where
it's
like
total
cloud
spend
is
about.
You
know,
369
billion
is
what
he
proposed.
A
You
can
say,
like
whatever
fermi
estimation,
that
at
between
300
and
400
billion
roughly
in
five
years,
one
third
of
that
ish
will
be
on
data
alone,
like
that's
an
enormous
number
and
an
enormous
piece
of
this
pie,
and
obviously
we
are
participating
in
almost
none
of
that
right.
So
there's
an
opportunity
there,
so
there
you
go
so
and
to
show
you
that
it
is
really
early
innings.
You can kind of follow where the big data investments are happening, and here you have about 67 billion dollars in 2020 moving into big data platforms. This is all public markets or private markets and things like that. So this isn't spend; this is betting on companies to solve this problem.

I promise this is the end of the stage-setting, but it's kind of impossibly big to talk about some of these numbers. People estimate that about three trillion dollars is wasted yearly on bad data, bad data processing, and things like that. And users generate, and I promise this wasn't coordinated, but it's actually the number Juan had circled in that pie chart, down in the lower left-hand side, about 2.5 exabytes of data every day.
That was the circle that it adds up to, and so on and so forth. And just to give you some inspiration that we're potentially saving the world here: Google is literally using big data to help solve fusion. So, not a terrible thing to be spending our time on. Juan talked about debuggability, monitoring, and so on and so forth. This is the level of success for organizations today, and again, the sample sizes are small and the numbers vary, but they almost all say the same thing:

70% of projects are successful, or less, meaning you can basically flip a coin on whether your project is going to fail. It's that bad, so there's lots of stuff we can do to improve this. Okay, so that's kind of the market. I hope I've inspired you, suggested that this is big, and that we can go after it and make a big difference.
So one thing I want to do to set the stage: first off, there are many, many super smart data developers in this room right now, and you should go talk to them. They work in academia.

They work at big companies, and things like that, and they'll have a really good sense for this. For those that are not big data developers, or have not done this previously, I'm just going to walk through the pain of what they experience today, and I hope to impart to you who our target market is, at least at the start. So, just to give you a sense of this:
Our target is about four million data developers today who are using big data in some way: developing big data pipelines, transforming large data sets, or things like that. And they're growing extremely quickly, by various measures about 10x from 2016 to where they are today. We can plug into them, and, for better or worse, again referencing Juan's talk:

They are almost entirely ignored by a lot of developer tooling today. For example, the standard developer tool for doing breakpoints and things like that is gdb. There is no gdb for a distributed system, for data processing, for any sort of pipeline, and that's a nightmare. How do you set a breakpoint in your data pipeline to know whether or not you are transforming the thing wrong? It's really hard today. Really hard.
That is a goal, among many others. If we did nothing else but take all the standard development tools that people have today on their local machines and enable them to work in a distributed way, we would have already won; we would have done so much better than whatever 67 billion dollars of investment have done. That's the modest version, and we can go much, much further than that. Oh, sorry, this slide is a little bit washed out, but where do they spend their time? On the left-hand side:

Here you have the standard flow for a data developer today. It's data loading, data cleaning (I'm just reading them out because it's washed out), data visualization, model selection, model training and scoring, and model deployment. This is broadly from the ML space, but the concept of developing a model based on data is not specific to ML. That's really standard for what people do: basically, output a set of things, create an artifact, and use that artifact in your code, or whatever it might be.
So, for those that don't know what a typical data pipeline looks like: you start with ingestion and processing. You move to engineering and splitting the data into some form of "this is the live stuff I want to train on or analyze." Then you have a holdback set that never touches your training or other code, but that you use to test what you trained. You always want to keep those separate, because if you allow your test set to bleed into training, then you can overfit and have other issues like that.

Then you finally train, or create your artifact based on the result, then you serve it in your overall application, and then, ideally, you loop the results back to the original data.
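The holdback discipline just described can be sketched in a few lines of Python (my own minimal illustration; the function name and the 80/20 split are assumptions, not anything from the talk):

```python
import random

def split_dataset(rows, holdout_frac=0.2, seed=42):
    """Shuffle deterministically, then carve off a holdout set that the
    training code never touches; it is only used for final evaluation."""
    rng = random.Random(seed)
    shuffled = rows[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    return train, holdout

train, holdout = split_dataset(list(range(100)))
# The whole point: no leakage, the test set never bleeds into training.
assert not set(train) & set(holdout)
```

The fixed seed is what makes the split reproducible, which matters later when anyone else wants to verify the same artifact.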
If you show this to just about any developer today, they're like: of course, this is exactly what I do. And it doesn't need to be on a distributed platform; this is also what they do locally. This is what they do all over the world.

This is our focus for now. Again, it's not to say we don't want to do federated learning, or distributed training, or checkpointing, or any crazy things like that. But if we just solve this, we will make such a huge difference to the world. What would they like? To lay it out for you, I've summarized it in three categories (a shameless plug, I guess). It really comes down to one of three things. First: familiar.
They want to understand it already, ideally. Second: simplified, even from where they are today. And third: collaborative. I'll get into what each of these means in a second. First, a little bit about data pipelines. I mentioned that pipeline system earlier, and people often focus very narrowly on just building a model or an artifact for their end solution, but in truth it looks like this: many, many components wired together. Ingestion, transformation, engineering, validation, training, then doing all those steps again at scale.

Then rolling it out, and then ultimately monitoring and observing it in production. That is what really moving things to production means, and again, this is not new. This is kind of what software development looks like today. Each of these steps is independent, and each of these steps is independently composable.
Now, the challenge here is that every person doing this development will experience it in a slightly different way. This is the classic Microsoft Office thing, where people ask: why does Microsoft Office have so many features? I don't use 95% of it. It turns out everyone uses a different five percent, which is super annoying if you're a product developer, but it is the reality of the world. What does this look like inside a big organization? Well, here you go. Microsoft actually published a paper about this about three years ago.

They did their own survey internally, at one of the most sophisticated machine learning and data development organizations in the world, and they have 159 different tools. 159! Can you imagine being an SRE there? You might be saying: oh, I have to support whatever ten-year-old CNTK, what the hell? But 11 people need it. So what are you going to do, tell them to go f themselves?
So this is again super standard, but it really highlights a need for us to understand that people are going to use the tools they're going to use, and we need to encapsulate those tools so they can keep using them in the way they're familiar with, but still give them an opportunity to participate in this very public data platform. Make sense?

Okay, so that's what familiar is. Now, to up the level of difficulty: that was just the tools; we haven't even gotten to platforms. By platforms I mean: you have compute providers, and you've got a lot of those, and then you have data platforms, and you've got a lot of those, and those are useful as well. Except I think there are too many choices. We should get rid of all of them. This is what users actually want:
So here's sed. This is the Wikipedia page for sed; sed was invented in 1974, so that's pretty good: 48 years old. I think we should bring data science back to the '70s. We should make it as easy to use this 48-year-old technology on your brand-new technology as it is today, and I cannot tell you how many people use sed. It is so commonly used out there, just to process a CSV file or something like that.
It is a wonderful tool. Let's not reinvent it; let's not try and throw it out. Instead, let's try and meet folks where they are. Okay, so that's, for me, simplified. Now let me walk you through what the data scientist's workflow actually looks like. Here you have a very, very standard example; this is like the canonical tutorial in machine learning and data science, the one where you create the housing-price data frame. Go to any tutorial and you're going to see it.
One of these "predict my house price for me" things, in a Jupyter notebook. There you can see it's about three lines of code; half of that is literally loading the thing in, and you're done. Pretty simple, and you can get going. Now try to do that exact same thing over there in Hadoop, also a whatever-15-year-old technology.
It looks like this, and this is still missing about half of it; that's how bad it is. So you're asking a data scientist who had something working pretty well locally to now translate all that mess into this, for the exact same functionality. Not so good, because it's just super painful. Now, to be clear, it's not just data developers that are facing this pain. SREs face this too.
So here you go, I'm going to take you through a play in one act. The data scientist has her local machine and it's running perfectly; she's found her data set, her model works locally, and it converges. Presto, I'm ready to go! So the first thing she does is go to our ITOps person to provision an entire cluster. Again, this is something most folks actually face.

You can't just get unlimited compute; you don't have the Protocol Labs credit card, so you need to go to your central IT staff and get it. The first thing she has to do is provision it, and that by itself takes forever. The ITOps person is going around with a hundred things to do; maybe you filed a ticket on it, and I'll get to it this afternoon, later this week, whatever. Finally it's provisioned, and now the ITOps person says: okay, well, great, I'm glad you provisioned it. Now:
A
Can
you
do
this
right,
like
here's
half
a
dozen
things
or
more,
that
she
has
to
do
just
to
take
that
code?
That
runs
locally
perfectly
well
to
production,
and
many
of
these
things
are
because
she
works
in
an
itops
organization
that
requires
you,
know
acls
and
various
things
like
that.
You
have
to
rewrite
it
into
java.
That's
a
super
common
request
which
no
data
scientists
want
to
do.
I
promise
you
use
out-of-date
libraries
that
have
passed.
A
You
know
global
security
requirements
because
we're
not
going
to
allow
anything
to
deploy
that
touches
production
data
without
this
various
things
like
this,
it's
just
a
lot
of
stuff
that
they
asked
and
she's
just
like.
Well,
I
just
want
to
run
that
simple
job.
Why
can't?
I
just
do
that,
and
the
reason
is
is
because
that's
the
requirements,
so
she
does
that
that
sucked
and
she
says
fine,
great
off
you
go
and
she
provisions
it.
It
runs
and
success.
A
It
actually
did
except
they
forgot
to
turn
it
off,
which
happens
all
the
time
as
well
and
presto
now,
you've
just
blown
through
your
entire
monthly
budget,
because
you
forgot
that
these
were
gpu
machines
that
cost
whatever
two
thousand
dollars
an
hour.
Super
super
common
situation.
You
see
this
all
the
time
and
you
see
like
super
pernicious
behaviors
around
this,
where
it's
like.
Oh
I'm,
gonna
secretly
like
spin
up
and
use
someone
else's
cluster,
I'm
gonna
plant
on
gpus,
because
we
have
a
limited
number
of
gpus.
So
I'm
not
gonna.
We can do better. And finally, I want to inspire you around collaborative. Literally the reason I joined Protocol Labs six months ago was that I want to try and help our children and our children's children avoid living in a barren hellscape, and this is part of that; it's really positive.

The problem is that collaboration around science today is really hard, and around data it's even harder. Today you have these open data sets all over the place, literally petabytes of very valuable data out there in the world, and they are awesome. Here you can see the Cancer Genome Atlas. This is just hosted on Amazon. Actually, technically, it's not hosted on Amazon.
It's on an FTP site, and you can click a button that provisions an S3 bucket and copies it there, which means you now start paying Amazon for it, which is messed up as well. But suffice to say, there's at least a catalog of these things out there, so far so good. This is Landsat; Alex is going to talk a little bit about Landsat later. Landsat is already hosted on IPFS, which is awesome. But today, let's say you have three scientists that come together, and they're like, well:

I would like to use Landsat. It's a super popular satellite data set, donated by governments all over the world. Number one, the first data scientist says: I want to create a tiled version, so I'm going to take a subset of the original version, tiled so that it focuses on the different areas that are interesting to me. In this case she's a volcanologist, all right?
A
Whatever
anyhow,
she
wants
to
study
volcanoes,
so
she's,
like
I'm,
just
gonna,
grab
a
picture
of
that
volcano
second
person
wants
to
do
scale
so
sam
same
thing
before
just
reduced
pixel
density.
This
is
a
super
common
requirement
because
of
you
know
not
needing
that
kind
of
fidelity
and
images
being
very,
very
large,
and
then
the
third
data
scientist
says.
A
Oh,
I
want
to
do
the
same
thing,
but
I
want
to
actually
grayscale
it
again
very
common
when
you're
building
your
artifacts
is
to
use
lower
resolution
versions,
because
you
don't
need
that
higher
resolution.
You
can
achieve
the
same
thing
at
you
know,
one
tenth
or
one
hundredth
of
the
cost
by
working
on
these
smaller
sets
so
far
so
good.
So
each
data
scientist
has
gone
off
and
done
her
own
thing
and
we
have
a
fourth
data
scientists
come
from
says.
Oh,
I
actually
want
all
three
of
those
right.
I want it scaled to the interesting elements, I want it tiled, because I don't need all that various land and water, and I want it grayscaled. But she can't; she can't touch any of those. Those all went off into private research. They didn't republish their methodology for doing these particular things, and again, oftentimes papers will describe it, but it's like we were talking about last night.

I was talking with Alfonso, and he's like: I hate reading papers, because the first thing I want to do is at least try and attempt to figure out how the hell they did this thing, and it's hard, because oftentimes they don't publish it correctly; their code doesn't work anywhere except on their machine, and so on and so forth. So, not so good. However, with Bacalhau, we can change this. Same exact situation, except in every case they republish the CID, and now it's out there, and I can see what happened.
I can see lineage. I can see, from the original data set, how it came down and what they did, and now the fourth person comes along and says: oh great, I'm just going to grab all those and use them as my data set. There's a variety of ways we can go about achieving that. But not just that: it then becomes collaborative, and they can get leverage on it. And so the next person comes along and says: oh, I just want to know what they did.
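The lineage idea can be sketched with an ordinary content hash standing in for a real CID (purely illustrative: real CIDs are multihash-based, and this is not Bacalhau's actual API or data model; all names and payloads here are invented):

```python
import hashlib

def cid(data: bytes) -> str:
    """Stand-in for a real IPFS CID: any content hash gives the same
    property, namely identical content yields an identical address."""
    return hashlib.sha256(data).hexdigest()[:16]

original = b"landsat scene: raw pixels ..."       # hypothetical payload

# Each scientist publishes a derived data set plus the recipe that made it.
lineage = {}
def publish(parent_cid, transform_name, derived_bytes):
    child = cid(derived_bytes)
    lineage[child] = (parent_cid, transform_name)  # recoverable provenance
    return child

root      = cid(original)
tiled     = publish(root, "tile:volcanoes", original + b"|tiled")
scaled    = publish(root, "downscale:10%",  original + b"|scaled")
grayscale = publish(root, "grayscale",      original + b"|gray")

# The fourth scientist can walk every result back to the original:
assert all(lineage[c][0] == root for c in (tiled, scaled, grayscale))
```

Because the address is derived from the content, republishing the CID plus the transform is enough for anyone else to verify, reuse, or recombine the derived sets.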
I'm just going to save the time. Make sense? So: unprecedented collaboration, because of the way that we're operating here. So that's the scope. I hope I'm getting you to the inspiration, and that you're excited. Familiar, simplified, collaborative. Again, these are just my words; I'd love for us, as a community, to come together and figure out what our core tenets are and move forward from there. So: can we improve big data with small changes for data developers? That's what the Compute over Data and Filecoin, or the Bacalhau, project vision is.
A
So
this
is
my
words
again
take
it
for
what
you
will.
I
think
there's
lots
of
crafting
here.
That's
too
buzzwordy,
but
you
get
the
idea.
I
think
we
can
transform
big
data
by
giving
developers
simple
first-class
distributed
tools
and
unlocking
a
collaborative
ecosystem.
This
is,
I
think,
our
mission
again
lots
of
honing.
I
think
we
should
probably
have
an
unconference
just
talking
about
how
we
talk
about
this
thing,
but
setting
that
aside,
this
is
what
I
would
like
to
do.
It
looks
like
you
know.
All
the
things
I
mentioned
are
already.
A
We
simplify
it,
give
meeting
people
where
they
are
using
these
tools
that
they
already
know
and
love.
We
deliver
performance
improvements
because
we
can
and
we'll
I'll
talk
about
that
in
great
depth
in
a
moment
and
then
folks
later
will
and
launch
this
new
collaborative
science
community.
What does this look like? You take a 10-gigabyte CSV file, you upload it to IPFS, and from that you get a CID. You then execute using the command line; we have a downloadable, and you can go to bacalhau.org right now and install the binary yourself. You submit your job with your CID: you name the CID, and then, in the command,
you name the command. This one right here: I used sed, as I mentioned earlier, to process the large CSV and filter it down to just the rows within whatever, 50 kilometers of Portugal. Pretty simple stuff, stuff that data scientists do every single day. And then I fetch the results. Presto: I have added one new tool, but most of this is totally understandable to a data scientist, no matter where they are. I didn't have to use Hadoop or HDFS. I didn't have to rewrite this in Java.
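As a rough stand-in for that sed job, here is what the same filter looks like in plain Python (my own sketch: the column names are invented, and Lisbon's coordinates stand in for "Portugal"):

```python
import csv, io, math

LISBON = (38.7223, -9.1393)  # reference point standing in for "Portugal"

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

# A tiny stand-in for the 10 GB CSV named by its CID in the real job.
raw = io.StringIO(
    "name,lat,lon\n"
    "lisbon-sensor,38.72,-9.14\n"
    "madrid-sensor,40.42,-3.70\n"
    "cascais-sensor,38.70,-9.42\n"
)

near = [row for row in csv.DictReader(raw)
        if haversine_km(*LISBON, float(row["lat"]), float(row["lon"])) <= 50]
print([row["name"] for row in near])  # → ['lisbon-sensor', 'cascais-sensor']
```

The point of the talk's design is that exactly this kind of everyday filter runs unchanged; the only new step is naming the CID instead of a local path.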
I didn't have to figure out any kind of concurrency or job resolution or orchestration. Presto, it just works. In addition to that: no temporary storage, mostly idle compute being used, and the results were automatically added back to the chain. Privacy and things like that we're going to have to tackle; right now we're just focused on public data and performance. As Juan mentioned: familiar commands, failures automatically resolved, retries, concurrency, and ideally quite cheap. And I haven't even gotten to the biggest thing, which is no egress.

You didn't have to move this 10-gigabyte file. That's it; it was already there, so it was like running locally, and it obviously gets much, much worse as the data size gets bigger. Egress, I think, is Amazon's most profitable thing. I'm not going to say bad things about you; I'll stop there. So, we go back to our play in one act.
The data scientist comes along and says: here's a data set, that's perfect, how do I engineer it? Presto: she now submits. She adds her CID in there, she writes her sed command. It ran great locally, so she knows it's going to run great. We're already checking bash syntax, which is convenient, because I cannot tell you the number of times that I personally have made syntactic mistakes.

She runs it, off it goes, and after a while it's all done, and now she knows how many cat videos are uploaded to YouTube every second. And our ITOps person has her own stuff to do too, which is very important. So you might say: wait a second, what about these?
These are all good things: homomorphic encryption, selective execution, GPUs, enclaves, so on and so forth. Is the vision there? Nope, not yet. Okay? We're going to get there. We have the vision, we want to achieve all these things and enable great domain-specific things, exactly like Juan was saying. We need to enable businesses, organizations, and projects to go and do great things on our platform. But not yet. Our goal is exactly like Juan said: let's achieve performance first. Let's make sure jobs run.

Let's make sure they run well and efficiently, that they resolve correctly, that they recover from errors: all these things that are kind of the blocking and tackling for a system to even be valuable. Here is our roadmap; again, that tilde is doing a lot of work. Approximately in May, we would like to launch for public consumption: no incentives, 100 nodes, data smaller than 32 gigabytes, fitting into a single sector, ideally on a single machine.
One CID only, public data only, deterministic only, CPU only, no incentive structure, no verification of results. So again, this is not for general use, but ideally anyone in the world will be able to consume it, use it, and engage with it. By October (again, that tilde is doing a lot of work here):

Approximately a thousand-plus nodes, and we're not going to stop there; we'd love to get to ten thousand, a hundred thousand, a million, and so on and so forth. Running ten thousand jobs, one petabyte of processing across many files, a ninety-nine percent job success rate, 49% malicious nodes supported, and DAG execution, as in directed acyclic graphs:
A
Allow
multiple
steps
to
connect
together
a
primitive
reputation
system
likely
only
at
a
reporting
level,
not
injecting
like
choice
on
whether
or
not
I
want
to
deploy
to
a
reputable
node
or
a
provider
and
swappable
systems
swappable
their
verifications
of
execution.
So
I'll
reputation
make
sense,
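The DAG-execution idea, where each step can start once the steps it depends on have produced their outputs, can be sketched with the standard library's topological sorter (an illustration of the scheduling concept, not Bacalhau's implementation; the step names are invented):

```python
from graphlib import TopologicalSorter

# A hypothetical pipeline DAG: each job maps to the set of jobs whose
# outputs it needs before it can run.
pipeline = {
    "tile":      {"ingest"},
    "downscale": {"ingest"},
    "grayscale": {"downscale"},
    "train":     {"tile", "grayscale"},
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)

# Every step appears after all of its dependencies, so a scheduler can
# launch each one as soon as its inputs' CIDs exist.
assert order.index("ingest") < order.index("tile") < order.index("train")
```

Because outputs are content-addressed, the edges of the graph are just CIDs, which is what makes multi-step jobs composable across nodes.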
So you might ask about incentives: why would I choose to run this? Seriously, not yet, I promise; we're going to get there, and by incentives I mean tokens, verification, all these things that are required. Staking, in particular, is dependent on this. How we get there: TBD. But unless we have a well-functioning system, there's no point in going forward and figuring the other things out. So let's get to a high-functioning system first. It is not that we are ignoring this; it's just a little bit later, promise. I cannot stress this enough:
We expect there will be many incentive structures, most of which will not be built by this project, and I cannot stress that enough either. Trusted execution environments, GPUs, super fast resolution times for subnets, scheduling, and all those kinds of things: wonderful. We will support all of those, and ideally we'll support them via interfaces and loosely coupled systems. By that I mean this: extensibility. Luke and Kai will momentarily talk about the overall architecture, interfaces, and pluggability, and they will go into this diagram as well.

These are core elements that we expect to have many implementations, most of which will not be written by us. It is our job to build clean interfaces and explain to people how to extend the system, so that they can build their own incentive structures and other things like that, and we provide core primitives that work out of the box, but that ideally you can swap out. So, critical from day zero:
Our system must run on these interfaces; there's no cheap-and-cheerful way of not having interfaces at the start. Even at launch we expect to have those, with various optionality and, like I said, domain-specific customization over time. Sounds great; when? Well, I already told you, nothing new here. Again, the tilde is approximate, very approximate, software engineering. What's the rule of thumb? Double the time and add two weeks, something like that. But it's not about the date, or the idea.

Now, the number one way to identify that someone is an absolute blowhard is that they put up a slide with a quote from Steve Jobs, right? So this one is not Steve Jobs, it's me; I said it. But this is actually critically important, and it leverages exactly what Juan said: the disease is thinking that the idea matters. I cannot stress this enough: the idea does not matter at all. This is about execution.
A
Ux
is
the
killer
feature
ux
at
every
phase.
Is
the
killer
feature
for
the
data
developer
for
the
sre
for
the
storage
provider
for
the
eventual
compute
provider,
the
browser
everything
ux
is
the
killer
feature.
We
cannot
move
forward
unless
this
works
liquid
smooth,
but
you
say
I
want
it
now
how
we
move
faster.
A
It's
all
of
you.
We
have
some
key
skills
and
hires
that
we
are
missing
right
now.
It's
like
three
of
us
doing
the
coding,
so
that's
so
good
we're
hiring
very
fast.
Obviously
we
would
also
love
many.
We
have
lots
of
partners
in
the
room
right
now
or
on
the
stream.
We
would
love
to
understand
where
you
would
like
to
go
and
see
what
we
can
do
in
our
core
project
to
support
you.
So: what interfaces, core primitives, and things can we take off your plate and work on collaboratively, and then figure out where to go from there? That is stuff like storage, or plugging in schedulers, or things like that. The time to suggest things to us is now, and even if it's just coming by and looking at the already published interfaces and documentation and so on, that's enough. But if you can, collaborate with us.

I know that we've talked to folks about how we execute with WASM, how we do this, how we do that; we would love to talk more. And with that, that is my overview. We're on time, which I'm very pleased about.