Description
Luke Kim from Spice.ai takes us through Spice.ai's approach to helping organizations build the next generation of Data, AI, and Web3 native apps.
Spice.ai: https://spice.ai/
A: Everyone listening remotely: this is our fifth community Compute over Data working group community call, and we're very fortunate to have Luke from the Spice.ai team, as well as Derek, and today we're going to learn a lot more about their platform. In particular, it's going to be a bit of an unusual session, because we had a different team that wanted to join, but they're based in Asia-Pacific, so they're going to have a second call later today; we'll probably end up splicing these together. But nevertheless, I'm looking forward to hearing everything about Spice.ai. So Luke, I'll hand it over to you and let you take it from there.
B: Cool, thanks Wes, thanks so much for having us. I'm going to share my screen here; let me set this up a second.
B: You guys see that okay? Looks good, cool. So I was just going to demo a bunch of our stuff, and then I looked back at the previous presentations and no one was freeforming; there were a bunch of really awesome, high-quality presentations. I was like, I guess I'll have to do a deck, so here we go. We'll go through it pretty fast, though, and hopefully spend a bunch of time in the product. So, who are we? I'm Luke Kim, founder and CEO of Spice.ai.
B: Before this I worked in Azure, where I built a team called Azure Incubations that built a bunch of projects. You might have heard of one called Dapr, which is a distributed runtime for building applications, and I also worked on a bunch of developer tools and infrastructure for GitHub, Azure, and Microsoft. At Spice.ai, we're an early-stage startup, and we're actually a Protocol Labs portfolio company as well, and our mission is really to make building data- and AI-driven applications easy.
B: Then we went to public cloud and cloud applications, and they really started to evolve into much more sophisticated distributed applications: highly reliable, scalable, and so forth. Then, as an increment on that, we decided to build these containerized applications, even more sophisticated and more reliable, and this term came around called cloud native, and a lot of these applications were built on platforms like Kubernetes. And our thesis at Spice is that these data and AI native applications are the next increment: they're really going to be how intelligent applications get built.
B: You could argue that a lot of today's applications are intelligent, but maybe not in the definition of intelligent that we use, which is really leveraging AI techniques like deep learning and machine learning and so forth.
B: And so think about how you might build web3 versions of intelligent applications. Think about things like QuickBooks or Xero, the accounting packages: how would you build that in the web3 world, with things like intelligent transaction categorization, for example? Or some of the biggest businesses today in web2, like YouTube, Netflix, and Amazon: how do you build AI-driven recommendations? Say OpenSea wants to do NFT recommendations when you go there; how would they build that? Or even just things that we've had for years and years, like Gmail with spam filtering and fraud detection: how would you build that on, say, wallet-to-wallet messaging, for protocols like XMTP?
B: The other thing when you get into web3, which I'm sure a lot of you already know, is that web3 and blockchain data is painful, right? You have to build and operate these massive blockchain nodes, tens of terabytes going to petabytes with chains like Solana; you have to build and operate massive big data infrastructure, AI infrastructure, and ML infrastructure; and you have to understand smart contracts, calls, logs, and events. And that's just to get the basics in place to start making data- and AI-driven applications.
B: So, a lot of work, right? And if you look out there, there are really no good solutions that help you with this. There are really great analytics solutions, like The Graph and so forth, but these are focused on analytics, and they're not designed for massive-scale training, big historical data sets, bulk data APIs, or being ML friendly. So what you find is that most companies try to build it themselves, but often spend a lot of their engineering time doing it.
B: So what we're building Spice.ai for is something specific: not analytics or dashboarding or dapps, but specifically bulk data applications and machine learning. And if you're thinking about an application, a production application really needs to be very high performance.
B: If you think about machine learning, you need to be able to query and fetch millions of records, tens of millions of records, at once, to do training, to do aggregations, and so forth. And our specific audience here is really developers and data scientists: people who are building applications and want it to be easy, right? So, three lines of code to get your web3 data into NumPy or pandas.
B: And if you think about the grid that we had here before, Spice.ai's focus is really on applications and machine learning. It does SQL query, but it also really supports these massive historical bulk data APIs, and we're building it to be as ML friendly as we can. So here's the general platform: we support Ethereum and Bitcoin, and we run our own nodes.
B: We index all that data, and we have more chains coming soon: Polygon, Solana, and so forth. The idea is that you can build your applications on top of the platform and access it over these really high-performance, low-latency Apache Arrow APIs. Arrow is a high-performance columnar in-memory data format used by projects like Spark and pandas, so you can actually use those libraries directly against the platform, along with all of the tools you would already use today in the ecosystem. And so with that, we'll get to the actual demo part, and we'll take you through a little bit of the platform.
B: Now, it is early access; we started building this in January, so it's pretty early, and so we can show some of it, but we're still building quite a lot of it. So let me just bring this up really quickly.
B: This is our home page, spice.xyz, and so we'll just log in. Because we're targeting developers and data scientists, you log in with GitHub; that's just the canonical identity that people in this space share. We'll probably add other developer-friendly logins later on when we're out of preview; we are still in preview here.
B: If you look here, it's a very simple, application-first interface, and so we're just going to go into this application here. This is really just to help you explore the API and explore the data.
B: The actual product itself is the API, and so, like other products in the space, you can just do a really quick SQL query, and you'll see it's super fast. Everything we do is aimed at being high performance, because if you're building a production data-driven application, you need it to be high performance. And so essentially you can explore the data sets that we have; we have a bit of a data set reference here, and you can go through it.
B: We have tokens, NFTs, and DeFi data sets, and if you look here in our docs, we actually have a bunch of example queries that you can go through, and a couple of those we'll actually take. So here's this one, just getting the latest block number. I'll put that one in there, and you see the latest block number just pulled up, ending in eight-seven-eight; come over to Etherscan here and refresh, and you'll see eight-seven-eight.
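The latest-block query in the demo is a simple SQL aggregation. As a rough sketch (the real table and column names on the platform are assumptions here), the same shape of query can be exercised locally against an in-memory SQLite table:

```python
# Sketch of the "latest block number" query from the demo. A tiny
# in-memory SQLite table stands in for the hosted blocks dataset so the
# SQL itself can run offline; table and column names are assumptions.
import sqlite3

query = "SELECT max(number) AS latest_block FROM blocks"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blocks (number INTEGER, hash TEXT)")
con.executemany(
    "INSERT INTO blocks VALUES (?, ?)",
    [(15_600_876, "0xaa"), (15_600_877, "0xbb"), (15_600_878, "0xcc")],
)
(latest,) = con.execute(query).fetchone()
print(latest)  # -> 15600878
```

On the platform the same statement would run against the indexed chain data rather than a local table.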
B: So you can see the real-time data that we have in here. If we continue over to this pandas, er, this Python tab: what I mentioned before is that the goal is to make this as easy for developers as possible, so we have our SDKs, and one of them is this Python SDK. You can see here it's three lines of code, including your API key and your query, and then you have your data in pandas. What we've done is abstract away the use of Apache Arrow, the high-performance gRPC connection to the service, which means you can get all this data down super fast with just a couple of lines of code, and you've got it into pandas or NumPy or any of these other libraries (Matplotlib, Plotly) without doing a lot of extra work, and you still get all the benefits of this very high-performance interface. We also have a Node.js SDK, and we're going to be building other ones like Go and Rust and so forth.
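The three-line flow described here would look roughly like the sketch below. The `Client` class is an offline stand-in for the real SDK client (the real one takes an API key and streams Apache Arrow record batches over gRPC), and the method names `query` and `read_pandas` are assumptions, so check the SDK docs for the exact surface.

```python
# Illustrative sketch of the "query -> pandas" SDK pattern. The stub
# classes stand in for the real client so the flow runs offline;
# nothing here is the actual SDK implementation.
import pandas as pd


class _StubReader:
    """Stands in for the Arrow record-batch reader a real client returns."""

    def __init__(self, df):
        self._df = df

    def read_pandas(self):
        return self._df


class Client:
    """Offline stand-in; a real client would take an API key and open a
    gRPC connection to the service."""

    def __init__(self, api_key):
        self._api_key = api_key

    def query(self, sql):
        # A real client streams Arrow batches for `sql`; we return fixed rows.
        return _StubReader(pd.DataFrame({"number": [15_600_878]}))


# The "three lines of code" shape from the demo:
client = Client("YOUR_API_KEY")
reader = client.query("SELECT number FROM blocks ORDER BY number DESC LIMIT 1")
df = reader.read_pandas()
print(df)
```

Because the wire format is Arrow, the conversion to pandas avoids a row-by-row deserialization step, which is where much of the speed Luke mentions comes from.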
B: We also have Kaggle notebooks up here; I'll pull this down because I can't see. If I look at, say, one of these DEX liquidity pool notebooks over in Kaggle, I can easily do the same thing. Here we're just pulling down the SDK, and again it's just a couple of lines of code. This time the query is a little bit longer, but essentially it's just a query, and now I can use things like Matplotlib with my data, and continue to have it ongoing and refreshed and super fast.
B: So, a couple of other examples of queries: we can do things like average transaction fees, gas fees, all the things that you would expect to be able to do with a data platform in this space, and they should all run pretty fast, right? So here are some transaction fees, the NFT API, NFTs and tokens. One thing that we're pretty proud of is that we have very good detection for tokens, so here we have ERC-1155, and it's all automated.
B: So if a new token comes up on the chain, we'll be able to detect it, and we give you the token standard as well in our token APIs. But again, this is really just the start; we're just getting started, and we're going to be working together with the community to build a whole bunch of data sets here.
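An average-fee query of the kind mentioned a moment earlier might be sketched like this. Again an in-memory SQLite table stands in for the hosted transactions dataset, and the table and column names are assumptions:

```python
# Sketch of an "average transaction fee" aggregation. The fee is
# approximated as gas_price * gas_used per transaction, in wei.
import sqlite3

query = "SELECT AVG(gas_price * gas_used) AS avg_fee_wei FROM transactions"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (gas_price INTEGER, gas_used INTEGER)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(20_000_000_000, 21_000), (30_000_000_000, 21_000)],
)
(avg_fee,) = con.execute(query).fetchone()
print(int(avg_fee))  # -> 525000000000000 wei, i.e. 0.000525 ETH
```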
B: We also support, just like a lot of the other data providers, direct access to our nodes. So if you want to go and use our JSON-RPC nodes, you can do that too, and we have other value-added APIs like prices and gas fees and so forth.
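Direct node access is standard Ethereum JSON-RPC, so fetching the latest block number is the usual `eth_blockNumber` exchange. The sketch below builds and parses such an exchange offline; the endpoint URL is left out, and the sample response value is made up:

```python
# Shape of an eth_blockNumber JSON-RPC exchange. No network call is made;
# `response` is a hand-written example of what a node would return.
import json

payload = {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
body = json.dumps(payload)  # POST this to the node's RPC endpoint

# A node replies with the block number as a hex string:
response = {"jsonrpc": "2.0", "id": 1, "result": "0xee0c7a"}
latest_block = int(response["result"], 16)
print(latest_block)  # -> 15600762
```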
B: So that's the basic platform, and our goal is really, first, to provide the data in a high-performance, bulk data way. For example, with this API you can go fetch ten million rows at once, and because it's coming down this high-performance Apache Arrow API, a long-lived connection that's very efficient, you can even stream it down. And the next step after that is what we'll talk about within the context of this group.
B: How do you actually apply compute over that data and start processing and doing some more interesting things with it? So let me come back to the presentation here.
B: So if you think about the things that we can do here, we see it in two categories. One, we just love the idea of doing compute over data, especially working with Filecoin and IPFS and Protocol Labs projects, and so we expect to be able to contribute to the working group, but also to projects like the one whose name I can never pronounce, and we're going to go over to the data summit in Lisbon and give a talk there.
B: Phillip, my co-founder, will be going over there. But also, we actually want to work with the group to integrate some of the projects into the platform. So you can imagine that we would host nodes in Spice and then enable you to do compute over IPFS data there, but combined with that queried web3 data. So imagine that you have a job, and you're querying data, say, fetching all of the NFTs for the last year.
B: Then you have that whole list of NFTs in your actual job. You can take that, query the IPFS data to get the actual NFT, do some processing on the actual image, and then send the result back to the chain or somewhere else, or even generate data sets for other Spice users. And so we think that's a really great application of the project.
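The job described here (query the chain data, pull the referenced content from IPFS, process it, emit a derived data set) can be sketched as a small pipeline. Every function below is an illustrative stub with made-up names; a real job would call the bulk-data query API and an IPFS gateway or node:

```python
# Pipeline sketch: NFT query -> IPFS fetch -> image processing -> derived rows.
# All three stages are stubs so the shape of the job can run offline.

def query_nfts(sql):
    # Stand-in for a bulk-data SQL query returning NFT rows with IPFS CIDs.
    return [{"token_id": 1, "cid": "cid-example-1"},
            {"token_id": 2, "cid": "cid-example-2"}]

def fetch_from_ipfs(cid):
    # Stand-in for fetching content by CID via a gateway or local node.
    return ("content-for-" + cid).encode()

def process_image(data):
    # Stand-in for real image processing; here, just the content size.
    return len(data)

rows = query_nfts("SELECT token_id, cid FROM nfts")  # hypothetical table name
derived = [
    {"token_id": row["token_id"], "size": process_image(fetch_from_ipfs(row["cid"]))}
    for row in rows
]
print(derived)
```

The point of running this next to the data, as discussed later in the call, is that the bulk fetch and the per-item IPFS reads never leave the platform, so only the small derived rows travel over the network.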
B: So we have the platform right now; as mentioned, it's in preview, and we have a bunch of customers on it, all doing some really cool stuff.
B: Everything from NFT marketing analytics to, obviously, trading and financial applications, but also NFT authenticity services and wallet messaging. The other thing I wanted to mention is that we're also developing an open-source project to make building AI-driven applications easier for developers: essentially being able to use a simple API to access training and inferencing, and we will wire up that ongoing data for you from the platform. So you can say: here is a query, I want to work on this data, bring it into this runtime, and now let me train and inference on that. The longer-term goal for us is to build this into the platform as well, so that we make the entire experience of building a data- and AI-driven application just really easy: the data's there, the runtime's there, the frameworks are there, and all of the ecosystem projects are there as well. So, that was a really fast overview of the platform; you can check it out at spice.xyz.
B: There is a waitlist, but if you PM me I'll let you in. And yeah, thank you so much for the time today.
A: And could you say, when you think about the types of AI applications that will be built and interacting with things like Spice.ai, do you think of the mode of operation as sort of a more batch-style request, or are there more scenarios where the applications will need frequent requests to Spice.ai, maybe not real time? Where on the spectrum do you think a lot of the demand will come from?
B: Yeah. I've worked in big data platforms for a long time, and there's always this notion of batch pipelines, obviously, and then real-time pipelines. If you think about, say, working at large scale, like hundreds of millions of NFTs, or if you want to do a big training job, it has to be batch to some degree, right?
B: I have to get all of this data and history and learn from it. But then, if you think about how to actually do inference on that model, it needs to be real time, because as new data comes in, I want to use it for inferencing on that model.
B: So you really need to combine both of those modes if you're going to be doing some type of real-time intelligent actions in the world, but it doesn't necessarily have to be perfectly real time, and real time means different things to different people. When we first set out, we were working with a bunch of financial applications, and we asked, do you need real time? They said, yeah, we do, and we thought that meant HFT-style, sub-second real time. Then we went back and actually asked them what they meant.
B: And they said: actually, if you get us data within a day, we're fine; we're not even that sophisticated, and for our trading strategies, a day is real time for us. So real time means different things to different people, but I think you'll eventually need to combine both techniques to really build out these ongoing, continuous data-driven applications.
B: And again, that can be like the examples I gave before. It could be: I'm going to take a whole bunch of data and train a recommendation model on what NFT is a cool NFT to look at or buy. Or it could be something similar to the spam detection we've had for years: I have to look over a whole bunch of content, figure out what's spam and what's not, and then I'm going to need ongoing data to keep it updated.
B: I'll say one thing in terms of that use case.
B: Why do such hard things yourselves? We already have customers who have looked at the project and are like: this is super awesome, but it's too hard for us to use. Or not necessarily too hard, but: we want to focus on our business logic, right, and we don't necessarily have time to set up all this infrastructure. So if we have a way to give them access to the benefits of that project, to be able to do compute over some of this IPFS data, that's awesome; please go help do that for us. And we already have a couple of customers who are really struggling with massive egress amounts of data, and this would make things much more efficient, right, if we can bring the compute closer to where the data lives, close to where those results from the queries are, and so forth.
A: You know, it's a really interesting point, because just like in the web2 world, there should be stacks: there should be back-end services for back-end needs, and there should be a Tableau of web3 data, which, in some ways, you guys are going to be able to help build. I totally agree; that's a good point. And it needs to be built that way, especially for less technical business folks to get value from it. Exactly, yeah.
A: Yes, yes, we appreciate it. Thank you so much, and we'll get this on YouTube shortly. One last, I guess, advertisement for the rest of the group: the Lisbon summit, November 2nd through 3rd. Luke's going to be there, so we're going to have lots of folks. Please let us know if you're able to join, and we'll have a little bit more content that we'll add on here from this afternoon. But Luke, thank you so much for joining; that was tremendous.