From YouTube: Keynote & Overview Survey - David Aronchick, Wes Floyd
Description
This talk was given at IPFS Camp 2022 in Lisbon, Portugal.
So what does that mean? Well, first I'd like to do a quick survey on exactly what we're talking about when we talk about big data. The amount of data on the web is obviously shooting up like mad. You can see here: the number of things that happen every minute on the web is enormous. This is something that Juan alluded to earlier. In Web3 and distributed science (excuse me, distributed compute), we're going to have to deal with numbers at these levels.
That's such an enormous number that it's almost impossible to believe. It's literally billions of times larger than the hard drives we have today. And you might think to yourself that you don't have a big data problem, and that may be true today, but it's actually remarkably easy to get to big data. So let's put a marker on the board and call it a hundred terabytes. How would you accumulate a hundred terabytes of data today?
A hundred terabytes is much, much larger than most of our hard drives and computers and things like that; it becomes hard enough for people to manage. Well, you could start with just a thousand nodes across your overall deployment, each producing a gigabyte of logs a day. Maybe it's a hundred VMs, each of which has ten services on it, and each of those produces 100 megabytes a day. Maybe you're doing video collection.
Or video streaming: you have uploads from 20,000 users, five videos each, five minutes long, and that'll get you to a hundred terabytes. Fleets of vehicles or edge devices: a thousand vehicles, where the average vehicle today has 70 sensors (a number that's probably going to go up over time), each producing 150 megabytes a day; that's 100 terabytes. Or maybe you're doing something really broad, collecting from millions of IoT devices all across the world, each producing only one megabyte a day, and you're going to hit it.
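As a rough back-of-the-envelope check on those scenarios (a sketch using the per-device rates quoted above; the exact figures are illustrative, not measured):

```python
TB = 1e12  # decimal terabyte, in bytes

# (scenario, number of producers, bytes produced per producer per day)
scenarios = [
    ("1,000 nodes, 1 GB of logs/day each",      1_000,     1e9),
    ("100 VMs x 10 services, 100 MB/day each",  1_000,     100e6),
    ("1,000 vehicles x 70 sensors, 150 MB/day", 70_000,    150e6),
    ("millions of IoT devices, 1 MB/day each",  1_000_000, 1e6),
]

for name, count, per_day in scenarios:
    daily = count * per_day
    print(f"{name}: {daily / TB:.1f} TB/day, "
          f"100 TB in ~{100 * TB / daily:.0f} days")
```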
These numbers are very realistic today, and each of them will get you to 100 terabytes. And you might wonder why I'm talking about 100 terabytes. Even at extremely fast bandwidth, with no interruptions, moving 100 terabytes from one place to another at 10 gigabits per second takes an entire day. Meaning, if you're generating that much on a daily basis and moving it from place to place, then with any interruption whatsoever, or any compute necessary along the way, you're going to fall behind immediately.
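That transfer-time figure is easy to verify, assuming a sustained, uncontended 10 Gbit/s link with no protocol overhead:

```python
data_bits = 100e12 * 8   # 100 TB expressed in bits
link_bps = 10e9          # 10 Gbit/s, sustained
hours = data_bits / link_bps / 3600
print(f"{hours:.1f} hours")  # ~22.2 hours: essentially a full day
```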
We put together a quick summary here on what it costs to do the storage yourself, if you're going to build your own on-prem data center. Hard drives are cheap, so how bad could it be? Well, the cost of doing this over a five-year period to store one petabyte (that's just ten of these 100-terabyte chunks) is 1.3 million dollars. And you might say, well, hard drives are cheap, maybe they're coming down, but the system hardware only makes up a small percentage of this, at five hundred thousand dollars.
The rest of it is maintenance, ongoing facilities, all that kind of stuff. So it's really going to be quite costly to store it yourself.
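Working those quoted figures through (the $1.3M total and $500k hardware share come from the talk; the derived numbers just restate them):

```python
total_5yr = 1_300_000   # quoted 5-year cost to self-host 1 PB on-prem
hardware = 500_000      # quoted hardware share; the rest is maintenance, facilities, etc.
tb_months = 1_000 * 60  # 1 PB held for 5 years

print(f"non-hardware share: {(total_5yr - hardware) / total_5yr:.0%}")  # ~62%
print(f"effective cost: ${total_5yr / tb_months:.2f} per TB-month")     # ~$21.67
```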
So I would argue that compute over data isn't just recommended; it's practically the law today. Now, you have a bunch of options for doing compute over data, for actually pushing your compute to where the data is, and they're pretty good, to say the least. These are huge platforms: hundreds of millions, billions of dollars invested, billions of dollars of market cap here.
But the truth is that they're really built for a very particular set of needs. They're not really built for our new decentralized world, or, I would argue, for data scientists. If you look at the most canonical example of doing data science or data analysis today, it will look something like this: the Pandas example of doing an analysis across a small set of houses in a particular region. Now try to do the exact same thing in Hadoop.
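The slide itself isn't reproduced here, but a minimal Pandas analysis of the kind being described might look like this (the file name and column names are hypothetical):

```python
import pandas as pd

# Hypothetical housing dataset: one row per house.
df = pd.read_csv("houses.csv")  # assumed columns: region, price, sqft

# Filter to one region and summarize price per square foot.
regional = df[df["region"] == "Lisbon"]
print((regional["price"] / regional["sqft"]).describe())
```

A handful of lines locally; the Hadoop equivalent drags in job definitions, cluster configuration, and a rewrite of the logic itself.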
The problem is that you have to take what people know and love today (Python, Jupyter, and so on) and convert it into that platform's form so it can run in this distributed way. And this isn't even the whole thing: we haven't even gotten to scheduling, orchestration, and other things like that, let alone the maintenance of your overall cluster. Now again, I don't want to dismiss those platforms; they're built for a very specific reason. But is that what's needed for everything? Let's say I just want to filter my data at the edge.
Let's say I want to do some trivial transforms, or something more profound, like building reproducible experiments. That might not be appropriate for a centralized, centrally authorized and maintained cluster.
So now you're saying: okay, you're making a case that maybe everything shouldn't be centralized in a single place, but why does it have to be truly decentralized, operating in a trustless environment? The funny part is, I was trying to come up with a good answer for this, and I actually just went to the dictionary. The definition of decentralization is: organizations whose activities are not performed in one central place but happen in many different places.
There you go; I didn't even have to say it. Going back to those hundred terabytes: you have many machines, many devices, many users, all of them spread all over the world. They're not sitting in your data center; they're all over the world. They're already decentralized. Your data is already decentralized, whether you want it to be or not, and the problem is exactly what you see in front of you. So I'm going to walk you through a kind of trivial example.
You have your data, and you have your data scientist, in this centralized example, with her centralized compute, and we're going to make it easy here and just have three data centers. She says: I'd like one data processing job. The moment she makes that request, she has to go out to every one of those data centers and move the data that was sitting there at those edges into the central machine. And God help you if you have bandwidth throttling at that central place, where maybe it's only 10 gigabits.
Now you have to do it serially instead of in parallel, so it gets even worse. Again, everyone faces this: it takes a long time, maybe an entire day if you're dealing with 100 terabytes. But at the end of it, all the data has been moved, it's handed back to her, and she says: okay, great, ready to start.
She runs it, it gives her her results, and she says: oh no, I got one thing wrong, or I have to rerun it, or any of a number of different questions come up. If even a remotely long time has passed (an hour, a day) since she ran her job and finally did her analysis, it's highly likely that that very expensive data cache will have been evicted, and you'd have to start all over again. So, not so good. So you might say: well, everything's better on a hyperscale cloud. Well, sort of. Make no mistake:
they are very, very scalable, and they have some degree of effective decentralization, in that they have many data centers all over the world. But the truth of the matter is, your users aren't sitting in those data centers. They're still going to be decentralized, whether you want them to be or not, and we really haven't even gotten to the cost.
That's one of the reasons, by the way, why after you upload your data to these clusters, they tear them down and delete the data: it's so expensive just to maintain them. So I would like to propose that we need a system that maps to how we collect and store data today. And what does that look like? Well, for us, it's the center of the Venn diagram.
As you're going to see in examples today, your data scientist needs to declare her job and the pipeline necessary to run it, so we're going to have to build a reproducible-ish job and pipeline environment. True reproducibility is very hard; you're going to see some great platforms today that get close, but we're certainly not done. Second, you need to start those jobs. Now that you have the job defined, you have to start it, and start it at a distance, where the data lives, which means we need decentralized and orchestrated execution.
I need to be able to spread out automatically to a number of machines and execute those jobs in each place. And finally, I need to ensure my job finishes and that I can trust the results, so we're going to need to build an incentivized, consistent, and verifiable network. And it can't require rewriting everything. We can't go back to Hadoop; let's not make the same mistakes that were made in the past. Let's meet folks where they are. So in this new world, it looks something like this: you have your data scientist again.
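To make the "declare your job and pipeline" idea concrete, here is a minimal sketch of what such a declarative job might look like; the field names are hypothetical, not from any particular platform:

```python
# Hypothetical declarative job: data is named by CID rather than by location,
# so the network (not the user) decides where the job actually runs.
job = {
    "image": "my-analysis:1.0",             # an existing Docker container, unmodified
    "inputs": ["bafy...datasetCID"],        # content-addressed input data
    "cmd": ["python", "analyze.py"],
    "constraints": {"move_data": False},    # e.g. "never move this data", per the talk
    "verification": "deterministic-rerun",  # one of several possible strategies
}
```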
She says: I'd like to do one compute processing job, and she hands it to the network. The network says: all right, I have a CID, it has three chunks, who's got them? The first node says: I have a chunk, I'll take that. The second node says the same thing: I'm good, ready to go. Then we have a problem.
The third node says: well, I have the chunk, but I don't have any CPU space; I'm already doing something else. And then another node says: you know, I have CPU space, but I don't have a chunk. So let's get them to work together: let's have the network automatically move it from point A to point B.
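A toy version of that matchmaking logic, assuming each node advertises the chunks it holds and whether it has free CPU (all names here are hypothetical):

```python
# Each node advertises (chunks held, CPU availability).
nodes = {
    "n1": {"chunks": {"c1"}, "cpu_free": True},
    "n2": {"chunks": {"c2"}, "cpu_free": True},
    "n3": {"chunks": {"c3"}, "cpu_free": False},  # has the data, but busy
    "n4": {"chunks": set(), "cpu_free": True},    # idle, but no data
}

def place(chunk):
    # Prefer a node that already holds the chunk: no data movement.
    for name, node in nodes.items():
        if chunk in node["chunks"] and node["cpu_free"]:
            return name, "run in place"
    # Otherwise pair a data holder with an idle node and move the chunk.
    holder = next(n for n, v in nodes.items() if chunk in v["chunks"])
    idle = next(n for n, v in nodes.items() if v["cpu_free"])
    return idle, f"move {chunk} from {holder}"

for chunk in ("c1", "c2", "c3"):
    print(chunk, "->", place(chunk))
```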
Now, obviously, I mentioned before that you want to try to avoid moving data where possible, but that may not always be an option, and it's up to you as the job definer to spec it. "I never want to move this job"? That's all fine; maybe we'll wait until the first node is done. Or: "you know what, I do want it in a hurry, I'm ready to pay that overall cost, in money or time or whatever it might be." So they run it.
The node gets the job, and then it's able to run it. Again, we can't get processing for free, but you are executing in parallel by default. And then they say: okay, we're done. But we're not done; the compute over data network isn't done. We need to verify it. So then another node volunteers itself and says: I want to verify that those jobs were completed correctly. It automatically goes through that process, it takes a few minutes, and then it says: the job is verified, download at your leisure. And she can do so.
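One simple verification strategy (arbitrary lambda functions and SNARKs come up later as alternatives) is deterministic re-execution: rerun the job on another node and compare content hashes of the outputs. A sketch:

```python
import hashlib

def digest(output: bytes) -> str:
    # Content-address the result: identical computations yield identical hashes.
    return hashlib.sha256(output).hexdigest()

def verify(original: bytes, rerun: bytes) -> bool:
    return digest(original) == digest(rerun)

print(verify(b"result-v1", b"result-v1"))  # True: job verified
print(verify(b"result-v1", b"result-v2"))  # False: flag for dispute
```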
So the case that I hope I'm making here, in the landscape of the overall compute over data environment, is what you see here. We're proposing that it's easier to manage, because it provides you a self-organizing network. This is something where all the nodes understand each other, know how to process, know how to maintain, restart, and move things around. It's compatible with existing tools: you can take your Docker container,
you can take your WASM, you can take whatever it might be, and move it to where the data is. And it's got built-in verification: within that, you can use a number of different verification techniques. You can provide arbitrary lambda functions; you can provide, we hope soon, SNARKs and other things like that, to provide execution and verification at the edge.
It's more cost-effective, as I already mentioned, and it's effectively serverless: those nodes, the data, and so on already take advantage of powerful networks like IPFS to maintain themselves as things come up and go down, and reproducibility automatically kicks in when nodes need to move things around. You can also get efficient bin packing: this thing has CPU over here,
and this thing has storage over here; let's figure out how to bring those together.
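Bringing those together is a classic bin-packing problem. A toy first-fit sketch of that placement decision (job CPU demands packed onto identical node capacities; the numbers are made up):

```python
# First-fit-decreasing bin packing of job CPU demands onto nodes.
def first_fit(jobs, capacity):
    nodes = []  # remaining capacity per opened node
    for job in sorted(jobs, reverse=True):
        for i, free in enumerate(nodes):
            if job <= free:
                nodes[i] -= job
                break
        else:
            nodes.append(capacity - job)  # open a new node
    return len(nodes)

print(first_fit([4, 3, 2, 2, 1], capacity=6))  # -> 2 nodes suffice
```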
And, of course, it's impossible to overlook the greatly reduced ingress and egress. You're able to take advantage of the fact that you're executing right next to the data: you have a file handle, instead of a network port, to go and get your data from.
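In other words (paths hypothetical):

```python
# Compute shipped to the data: the input is just a local file...
with open("/inputs/data.csv") as f:
    rows = f.readlines()

# ...rather than the data shipped to the compute, which is what we avoid:
# rows = requests.get("https://central-store/data.csv").text.splitlines()
```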
Finally, it's reproducible and collaborative. Obviously, everything you see here is going to be content-addressable: content-addressable jobs, content-addressable hashes, content-addressable by Merkle tree. You're going to be able to do everything you need, and know that you're not reproducing things that are already out there in the world. You also get metadata and lineage for free.
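Content addressing is what makes the lineage free: every derived dataset is named by the hash of its contents and records the addresses of its inputs, so a result can always be walked back to exactly what produced it. A toy illustration (real systems use CIDs and Merkle DAGs rather than bare SHA-256 over JSON):

```python
import hashlib, json

def address(obj) -> str:
    # Name content by the hash of its bytes.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:16]

raw = {"readings": [1, 2, 3]}
raw_id = address(raw)

# The derived dataset's record carries its own lineage in "inputs":
derived = {"mean": 2.0, "inputs": [raw_id], "job": "compute-mean:1.0"}
print(raw_id, "->", address(derived))
```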
These networks will also be able to provide things on chain that carry proof of exactly what happened and when, which is something that none of today's platforms provides natively; they all have to go out somewhere else and wire in that metadata and lineage themselves.
Without that, you run into real problems as you begin to get down your pipeline. As you begin to produce derivative data set after derivative data set, you need to be able to walk all the way back: where did I collect this, what did it come from, and so on and so forth. That's on us, to build that into the platform by default. And finally, I think you can get to truly innovative models today. Maybe you have an S3 bucket, or, by the generosity of a hyperscale cloud,
they'll support an open data set, and that's nice, that's great, and we obviously don't want to turn our nose up at it. But truly, if you're doing this properly, you'll have ways for communities to join together and each contribute. They may contribute money, they may contribute their own compute, they may contribute storage, they may contribute working time on these things, and provide overall models and new ways of processing the data. In a decentralized world, that becomes much more possible, without sharing a single
API credential. Now, we're working on this right now. We have a very passionate group of people: this is the Compute Over Data working group, already 15 members representing 75 people, who meet every other week. We're already beginning to work on many of these things, whether they're standards or collaborations between these various organizations.
We all care about this right now. And you might ask: okay, you just showed me 15, which one would you like me to pick? And the truth is, there isn't one. You might have seen this earlier: Juan presented this, and we're very inspired by it, because this is the true reality. You have a three-axis system right now, where it's basically up to you to decide what you want.
You're going to have privacy on one axis, verifiability on another axis, and performance on a third axis, and it's up to you to decide which of these is the right fit for your situation. And in truth, this is again something that really differs from the systems that you have today.
Today it's really one-size-fits-all. If you went out and spun up a Spark cluster, or a Hadoop cluster, or EMR, or take your pick: it's a great platform, but it's making a bunch of decisions for you, and it's asking you to accept, okay, here's where you are, you're done. What we think will happen is that people will pick and choose, even within a single organization, and say: oh, this is HIPAA data, I need it to be FHE; or, this is actually just log data, and I'm totally
okay with it, I want this to be very performant, but I don't care that much about verifiability, things like that. We want people to be able to pick and choose these as L2s on top of a common storage and network solution. And you might say: that is a lot to think about. And it is. Would you like to learn a bunch more? I have great news for you.
This is what the compute over data track looks like over the course of the day. You just heard me talk a little bit about the overview and what's going to happen. Coming up right after this, you're going to see Hashgraph, a really cool platform.
There's a talk about warp work from Eric, and then I'll be back to talk about one of the potential platforms that are out there:
a platform called Bacalhau. You might have heard about it. It operates natively on IPFS and can be the platform for many other platforms in the future.
We'll have Zach (excuse me, Matt) from the FVM team come and talk about how this all integrates with chain consensus. Then we'll come back to showing you how to build the kinds of apps that take advantage of these underlying platforms. And at the end, we'll talk about Filecoin mining and helping to build an infrastructure network that layers over the top of all of this.