Description
Svetly Metodiev and Aligned.co - custom FPGA hardware for high-performance compute workloads: Filecoin sealing, ZK proofs, and more.
Alexander Borzunov and Petals.ml - run 100B+ large language models (LLMs) like BLOOM at home, BitTorrent-style.
A
Okay, and we are good. All right everyone, hello, and welcome to everyone listening remotely to the Compute over Data working group. We are fortunate to be joined by Svetly from Aligned, who is going to tell us about some really interesting work they do building custom hardware for various compute networks. We'd like to cover not only decentralized compute networks themselves, but also the tool manufacturers and technology builders that go into them. Aligned is the first physical hardware manufacturer we've featured, and if anyone's building decentralized networks, these are the guys to get in touch with to make sure you think about optimization. Secondly, we're going to hand it over to Alexander, who's going to tell us more about the work they're doing with Petals.ml, which is a decentralized ML training network. We're going to find out all about the things going on there, and whether or not OpenAI is actually open. Just kidding. Okay, so anyways, Svetly, thank you so much for getting the content ready for us. I'll hand it over to you, if you're ready.
B
Thanks, Wes, appreciate that. So yeah, I'm coming from Aligned. What we do is sealing as a service, and we're here to talk about decentralized compute in general, with sealing being a specific use case that sits on the back end, in the trenches of Filecoin. So it's a great thing to discuss here, to let you guys know what we're up to, and then what we can do as far as compute as a service in general.
B
So, just kicking it off, we'll talk a bit about Aligned first.
B
Aligned is a high-performance compute company offering solutions as a service. Our team comes from a background of video compression and rendering, where they specialized in building their own hardware for big data processing jobs. We've now taken that skill set and applied it to decentralized networks, to blockchains, where you have big compute needs such as ZK proving and, of course, Filecoin sealing. So, as I mentioned, we specialize in building our own hardware; we utilize FPGAs, that is, field-programmable gate arrays.
B
We build our own boards with super fast connectors, and we write our own software on top of those boards to take advantage of and optimize for certain compute jobs. So what we've done is build one of these systems for Filecoin sealing. We did this back in May of last year as kind of a proof of concept, and we presented it at FIL Austin, and what we found was:
B
Our system was a lot cheaper, an order of magnitude cheaper than what we were seeing out in the market, so we knew we were onto something good there. We started developing a remote sealing solution as a service, commercialized it back in November, and now we're commercially sealing for Filecoin storage providers. The way our remote sealing system works, essentially, is that we built a remote bridge.
B
We had to build a custom bridge that can separate out the sealing pipeline from the actual storage provider (miner) setup, and it's a super easy deployment. Essentially, it acts just like a local worker on the storage provider side: you install our bridge, set some custom environment variables, whitelist an IP, and then you can start sending jobs to our compute pipeline.
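For illustration, a deployment of that shape might look like the sketch below. Everything here is hypothetical: the environment variable names, endpoint, and payload fields are stand-ins, not Aligned's actual interface. It only shows the "set env vars, whitelist, send jobs" pattern.

```python
import os
import requests

# Hypothetical configuration; Aligned's real bridge defines its own
# variable names and endpoints.
os.environ["SEALING_BRIDGE_URL"] = "https://bridge.example.com"  # assumed endpoint
os.environ["SEALING_BRIDGE_TOKEN"] = "<api-token>"               # assumed auth token

def submit_seal_job(sector_id: int, unsealed_cid: str) -> str:
    """Send one sealing job to the remote compute pipeline (sketch)."""
    resp = requests.post(
        f"{os.environ['SEALING_BRIDGE_URL']}/jobs",
        headers={"Authorization": f"Bearer {os.environ['SEALING_BRIDGE_TOKEN']}"},
        json={"sector_id": sector_id, "unsealed_cid": unsealed_cid},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]
```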
B
And for very big needs: we have our data center in Canton, Ohio. Power is great over there, and it has a super fast internet connection, but there are bandwidth constraints for super big needs.
B
So what we can do is essentially co-locate our sealing clusters in your data center, and then we can have a direct line connection. That opens the door for a lot of shared revenue opportunities, plus it obviously solves the bandwidth constraints. And then, just at a high level for the Filecoin network: what we're doing is trying to accelerate the Filecoin network's growth by decoupling sealing compute from storage.
B
What
we're
doing
is
really
helping
lower
the
costs
by
allowing
different
storage
providers
or
different
service
providers
to
specialize
in
what
they're
good
at
so,
for
example,
if
you're
a
storage
provider,
don't
worry
about
building
out
the
ceiling
pipeline,
stuff
just
focus
on
storage,
but
you
know
looking
at
high
level
about
2.5
billion
dollars
of
network
Hardware
has
been
invested
so
far
with
a
large
portion
of
that
going
towards
ceiling.
B
The Filecoin network still has a lot of growth left, so a lot of additional investment will have to be made in sealing compute, and this is where separating it out will really help with the next leg of growth. Specifically, for any service provider or user of the compute, in this case storage providers who are using our sealing as a service, the benefit to them is moving the business model away from an upfront capex investment to a pay-as-you-go operating expense. You only pay for what you need, and it really eliminates the problem we're seeing out there with storage providers who have over-invested in sealing compute that is now sitting idle. So, essentially: pay as you go, and pay for what you need.
B
Moving that capex to opex and optimizing it will allow storage providers to focus on storage, invest more in storage, and help that side of the business. Also, it's complicated to build sealing compute systems, and as a storage provider, or in general, you should focus on what you're good at. So, again on the specialization point, storage providers can now focus on doing storage and building out APIs for the client-facing stuff, whereas they don't need to focus on building sealing infrastructure. Again, that really benefits the whole network and everybody who's in it.
B
Now, where are we today? Aligned is currently running a large fleet of FPGAs in Canton, Ohio. We have a data center there with a 200-gigabit connection, and we have a lot of chips sitting on the shelf ready to be deployed. So we can build out different types of compute pipelines for different use cases, and we are actively looking at and building different types of compute based on what we see out there.
B
You
know,
I
wanted
to
talk
to
you
guys
to
open
myself
up
to
communicate
as
you're
building
out
your
different
compute
use
cases
on
the
front
end
or
wherever,
however,
you're
building
it
out.
You
should
talk
to
us
because
we
could
we
should
work
together.
You
know,
if
there's
a
big
opportunity
there,
you
can
use
our
infrastructure
to
start
to
build
out
your
compute
pipelines
so
feel
free
to
reach
out
to
me.
B
You can always get in touch with me by email, or I'm on the Filecoin Slack; the info's right there. So yeah, I'd love to talk, and I'll open it up to any questions.
A
If someone wants to engage with Aligned, is there a typical timeline or process? Let's say we identify some compute that could benefit from custom hardware: could you give us a sense of what the ballparks would look like? How do we do a proof of concept, and how does that eventually get into a larger rollout? Any sort of timelines around that would be super interesting.
B
The only thing is, as you know, resource constraints are a big challenge for startups, so right now we have specific focuses. If it's a big opportunity, we would love to hear about it and work with you. As for the timeline, it could be weeks: we could actually spin up something in a week or two, or it could be something that takes a bit longer, depending on the job.
B
It really depends on each individual job, but I encourage you to reach out to me with these use cases if you have something, so we can map out what the opportunity is and whether it makes sense to build something out for it.
D
Yeah, just a quick question: I've done some FPGA work in the past. Do you guys allow any access for working with higher-level languages on top of Verilog or VHDL, like HardCaml, or something like Clash for Haskell? Is there a way to operate more with software and libraries on top, for programmers who want to run workflows in your environment?
B
A
This is super helpful, Scott. One of the other things that I see coming along a lot is, you mentioned ZK algorithms: zero-knowledge proofs. The zero-knowledge proving era holds a lot of promise for folks to fix a lot of issues around private data sets and machine learning; people talk about using ZK for distributed learning and things like that. So I'm really glad to hear what you guys are working on in that space, because anything we can do to overcome these hardware limitations and bring more of these use cases to market, I think, helps all of us in the decentralized compute space.
A
Very good. All right, thank you so much. Unless anybody has questions, I think we can hand it over to our next guest, Alexander Borzunov from Petals.ml. Can you hear us, Alexander?
E
Yeah, sure. Hi everyone, and thanks, Wes, for inviting me to give the talk today. So let me share the slides.
E
Looks good? Yes? Great. So, okay, let me start. My name is Alex, and today I'll present our new system called Petals. This is basically a decentralized platform for running large language models, and by large I mean the same size as GPT-3. By the way, if you have any questions during my talk, you can just drop them in the chat, and I'll try to answer them either now or after the talk.
E
Okay, let's begin. Just to give you a quick background: I think most of you have already tried large language models, maybe GPT-3, ChatGPT, or other chatbots. The main feature of large language models is that they can solve many language processing tasks, not only chatting with people for entertainment, but many practical language processing tasks, out of the box or with really minimal fine-tuning.
E
As an example, in the picture on the left: imagine someone creates a chatbot that accepts orders for delivering food. A few years ago, you would need an NLP engineer to parse a sentence in English, in human language, into some JSON format that you can later process in your database, in your delivery system. But now you can actually just give it a few examples.
E
It's enough to give just one example of what you want to do with the input text, basically how to convert it into this kind of JSON, and you make the model convert any later text for you. So you provide one example of a response, and it does it automatically afterwards; often you don't even need the one example. You can just describe the task, and it understands what it should do and generates, for example, a line of code, or maybe a longer text. And importantly, language models can be used not only for generating text: you can use their internal representations. You can basically cut off the head that chooses the next character and use those representations.
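As a toy illustration of that one-shot pattern (this prompt is an invented example, not the one from the slides), a single worked demonstration teaches the model the input-to-JSON mapping:

```python
# One worked example, then a new order for the model to convert.
prompt = """Convert food orders to JSON.

Order: I'd like two margherita pizzas delivered to 5 Main St.
JSON: {"items": [{"name": "margherita pizza", "qty": 2}], "address": "5 Main St"}

Order: One pad thai and a green tea to 42 Oak Ave, please.
JSON:"""

# Feed `prompt` to any text-generation interface (e.g. model.generate(...)
# in the Petals example shown later) and parse the completion as JSON.
print(prompt)
```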
E
GPT-3 is 175 billion parameters, and models keep growing; there are now closed models with up to two trillion parameters, and so on. At first, all these models were proprietary, so companies like OpenAI and maybe DeepMind had them, but there was not much access to them for people not working at those companies. Maybe we had some limited APIs.
E
We
could
try
some
limited
features,
but
we
didn't
have
full
access
for
research
for
like
implementing
different
stuff,
different
methods,
with
these
models
and
and
like
I
think
last
year
the
situation
has
changed,
because
multiple
large
models
were
released,
most
notably
Bloom
by
the
big
science
initiative
and
meta
AI
by
meta
or
x,
Facebook
and
several
other
models,
but
turns
out.
It
doesn't
really
change
situation
much
because.
E
Very few people were able to actually run these models, because they're really huge. Basically, if the model is the size of GPT-3, so something like 175 billion parameters, you need at least that number of gigabytes of GPU memory to run the model efficiently, and that's even if you use very smart compression techniques. If you just store each parameter as a floating-point number, you will need twice or maybe four times as much.
E
So
anyway,
with
even
with
the
state-of-the-art
compression
methods,
you
can
only
run
it
if
you
have
like
really
lots
of
high-end
Hardware
such
as
maybe
a
three
a100
gpus,
with
lots
of
GPU
memory,
or
maybe
eight
of
more
consumer
grade
gpus
3090.
So
it
is
actually.
This
Hardware
is
very
expensive
to
buy
quite
expensive
to
rent
and
it
is
still
difficult
for,
like
independent
researchers
or
maybe
small
labs,
small
companies,
Universal
apps,
to
use
all
of
this
so
actually
letter.
E
So let's take a look at what options we had before Petals. You can download the model, assuming you have maybe only one consumer-grade GPU or one high-end GPU, but not a GPU cluster. One method you could use: take one machine with a GPU, and when you run the model, copy model blocks from your disk or RAM to the GPU on demand.
E
But this turns out to be extremely slow, because to generate even one token, even one part of a word, you need to go through all 70 transformer blocks of this large language model. You basically load it part by part, and to generate even one word you need to transfer almost 200 gigabytes to GPU memory, and you need to transfer it again for each word. So this turns out to be very slow, even if you just copy it from RAM.
E
If
so
in
this
case,
it
consumes
like
something
like
20
seconds
per
token.
So
definitely
this
is
not
close
for
anything.
You
could
use
to
like
make
some
interactive
inference.
Make
chat.
Bots
for
any
like
practical
applications-
that's
not
quick
enough!
Unfortunately,
and
it
gets
even
slower
if
you
don't
have
200
gigabytes
of
RAM
and
you
only
use
a
floating
from
s
his
team.
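A rough back-of-envelope check of those numbers (the bandwidth figures below are illustrative assumptions, not measurements from the talk):

```python
params = 175e9        # parameters in a GPT-3-scale model
bytes_per_param = 1   # assumes int8 quantization; fp16 would double this
weights_gb = params * bytes_per_param / 1e9  # ~175 GB moved per token

pcie_gbps = 16        # assumed host-to-GPU bandwidth, GB/s (PCIe 3.0 x16)
ssd_gbps = 3          # assumed NVMe read bandwidth, GB/s

print(f"from RAM:  ~{weights_gb / pcie_gbps:.0f} s/token")  # ~11 s/token
print(f"from disk: ~{weights_gb / ssd_gbps:.0f} s/token")   # ~58 s/token
```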
E
So
okay
and
another
obvious
option
is
that
you
could,
maybe
you
don't
even
have
a
GPU,
but
you
could
use
some
hosted
apis
from
open,
AI,
maybe
other
companies,
and
they
are
very
convenient,
of
course,
but
they
may
be
expensive.
E
And you only get the output text; you can't analyze what happens in the model under the hood. So that's where Petals comes in. We suggest a new option for running any large language model if you don't have a GPU cluster yourself, and that is basically to collaborate with others over the internet, as some of my colleagues say, in the style of BitTorrent or other decentralized projects.
E
The core idea is that you load a small part of the model, then team up with people serving the other parts to run inference, or maybe adapt the model to your own tasks. To give some terms here: we have participants called servers that load the model blocks, for example BLOOM blocks or those of any other large LM. They load them onto their GPUs and set up a service for others, so that they can do forward and backward passes, basically computations, through these blocks.
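In client code, this looks roughly like the sketch below, modeled on Petals' public examples (exact class and checkpoint names may differ between Petals versions):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM  # name may vary by version

model_name = "bigscience/bloom"  # a model served by the public swarm
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Only the small embedding layers live locally; the transformer blocks
# run on remote servers found through the swarm.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0]))
```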
E
So we may want to send requests to the servers holding the first third of the model, then to the servers holding the second third, and so on. This may actually come as a surprise, because I've already told you that running the model locally, if you don't have enough GPUs, is very slow, and here we not only run something locally, we also communicate over the internet, and the internet is a relatively slow network compared to the local networks in high-performance clusters.
E
So you may wonder why it is faster, because the internet is faulty and possibly slow. It came as a surprise to me too that generation this way is actually at least 10 times faster than with offloading, the method I described for running the model locally. So why is it faster? It turns out:
E
This
is
just
because,
okay,
even
if
we
use
like
the
internet,
a
very
goal
and
possibly
fall
to
network,
we
do
not
send
much
data
this
way,
and
this
is
because,
in
the
offloading
case,
we
needed
to
like
constantly
send
huge
model
blocks.
But
in
our
approach,
every
peer
can
just
load
a
certain
part
of
the
model
like
to
their
gpus
and
then
just
like,
accept
small
requests
with
a
small
internal
representations
of
the
model
to
like
do
their
part
of
the
job.
E
So, to sum up: we use a slow network, but we send thousands of times less data, so this actually works faster, even though some peers may leave, and even though there can be faults related to the unstable internet, for example. Also, an important feature is that you have a lot of control over the model that you didn't have in the hosted API case, because you call parts of the model yourself, from your client that runs locally on your PC. For example, you might want to take a look at what state the model had after a certain block, or insert some new features, insert adapters, or maybe change the order of layers.
E
Somehow
you
can
do
this
with
the
as
you
do
this
like
in
the
usual
in
the
usual
machine
learning
Frameworks,
and
this
is
something
that
is
not
possible,
usually
in
the
apis.
Okay,
let
me
let
me
take
a
look
at
the
question.
Is
the
history
available
for
transformation
models
throughout
if
its
life
cycle,
not
sure
I,
understand
the
question
correctly?
History
is
not
stored
by
almost
all.
E
Let me go further, and feel free to ask more questions if I didn't respond very clearly. So, okay: generation is fast, but here is another problem. Basically, all the peers share a common model, and what if they want to adapt the model to their own tasks? Maybe teach the model a new language, or teach it new tasks; basically, do anything that implies changing the weights of the model.
E
Like
is
it
possible
and
it
turns
out
that,
yes,
basically,
even
if
we
assume
that
all
these
shared
blocks,
hostage
and
servers
are
constant
and
modern,
NLP
suggests
that
you
can
use
some
parameter.
Efficient
adapters
or
trainable
prompts
basically
a
very
small
additions
to
the
model
compared
to
the
like
this
size
of
the
pre-trained
model
to
adapt
the
model
for
like
most
real
world
tasks.
So
basically,
this
deeper
drained
model,
it
kind
of
stores
all
the
knowledge
about
the
world.
E
It
has
learned
from
the
internet
and
code
and
so
on,
and
you
can
add,
just
a
small,
very
small
adapter,
to
adapt
it
to
the
task
you
need
and
this
adapter
it
doesn't
need
much
memory
or
compute,
so
it
can
be
even
stored
locally,
even
on
CPU.
At
some
cases,
when
your
client
doesn't
have
a
gpus,
so
clients
can
store
adapters
in
prompt
locally
and
just
train
them
as
usual
neural
network,
using
like
usual
Frameworks
like
pytorch
you're
used
to
and
so
on.
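Here is a minimal sketch of that idea as plain PyTorch prompt tuning; this is a generic illustration, not Petals' exact fine-tuning API. Only the small soft-prompt parameters and a local head receive gradients, while the big shared model stays frozen:

```python
import torch
import torch.nn as nn

hidden_size, n_prompt_tokens, vocab = 1024, 16, 50000  # toy sizes

# The only trainable parameters: a few "soft prompt" vectors prepended
# to the input embeddings, plus a small local task head.
soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, hidden_size) * 0.02)
head = nn.Linear(hidden_size, vocab)
optimizer = torch.optim.Adam([soft_prompt, *head.parameters()], lr=1e-3)

def frozen_backbone(h):
    # Stand-in for the frozen transformer blocks. In Petals these run on
    # remote servers, which also perform the backward pass, so gradients
    # still reach the local soft prompt.
    return h  # identity placeholder for this sketch

embeds = torch.randn(1, 10, hidden_size)              # fake input embeddings
h = torch.cat([soft_prompt.unsqueeze(0), embeds], 1)  # prepend the prompt
logits = head(frozen_backbone(h))
loss = logits.mean()   # dummy loss, just to drive the backward pass
loss.backward()        # gradients flow into soft_prompt and head only
optimizer.step()
```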
E
These small additions can be trained and imported from each other, and of course you can inspect some intermediate states for research, which is important for ML scientists. Okay, and before I show you some examples, some technical stuff: the crucial thing is that we use libp2p for all the networking, which allows us to have protocol-agnostic networking code. So maybe some peers want to communicate over TCP, while others want to communicate over QUIC, over UDP, because that is more convenient for UDP hole punching.
E
Basically,
the
only
way
we
can
communicate
with
UDP
all
punch
links
to
use
like
quick,
not
TCP,
and
we
can
actually
like
this
allows
us
this
allows.
So
if
someone,
for
example,
some
user
decides
to
join
petals
basically
provide
the
computing
power
of
their
GPU,
maybe
their
GPU
at
home
or
at
their
company.
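Joining as a server is roughly a one-line command; the sketch below wraps it in Python for consistency with the other examples, and the exact module path and checkpoint name may differ between Petals versions:

```python
import subprocess

# Serve a share of the model's blocks from a machine with a spare GPU.
# The server announces itself via libp2p and joins the public swarm.
subprocess.run(
    ["python", "-m", "petals.cli.run_server", "bigscience/bloom"],
    check=True,
)
```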
E
Yeah, and importantly, the whole system is decentralized, so we don't have any critical point of failure. There are a couple of services I'll show you later; you can just clone the repository and set up a replica of them yourself, and there is nothing in the system that cannot be replicated by other people, except maybe the bootstrap peers, the bootstrap addresses like the ones we have in IPFS. And importantly, all steps are fault-tolerant. Of course, people may leave at any time, maybe because they want to use their GPU for something else, and the clients will always be able to find another server holding the same blocks. And a couple of words on how to make this efficient: basically, we need to compress everything so that we use as few servers and as little communication as possible. So we compress model weights with the latest techniques: we don't actually store them as floating-point numbers, but rather in a quantized state.
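As a tiny illustration of the quantization idea, here is a simplified 8-bit absmax scheme; the techniques Petals actually uses, such as 8-bit mixed-precision quantization, are more sophisticated than this:

```python
import torch

def absmax_quantize(w: torch.Tensor):
    # Store int8 weights plus one float scale: roughly 4x smaller than fp32.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, s = absmax_quantize(w)
print((w - dequantize(q, s)).abs().max())  # small reconstruction error
```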
E
Okay, and of course there is some load balancing. First of all, if some servers leave, imagine in this picture that all these servers leave and we have a gap in the network; the other servers are able to change the blocks they hold to close this gap, or maybe a bottleneck, and the same goes for clients.
E
So
clients
may
choose
servers
to
optimize
latency
for
Generation
to
optimize
throughput
for
training
with
larger
batches,
because
throughput
is
much
more
important
than
latency
in
this
case
and,
of
course
like
there
is
some
randomization
so
that
we
don't
always
fall
into
the
same
path
and
we're
kind
of
trying
to
distribute
a
lot
over
the
network
and
now
I'll
just
show
you
a
couple
of
examples.
So
right
now
our
public
swarm
hosts
two
models.
So
this
is
a
Bloom
and
Bloom
Z.
E
These
are
basically
a
kind
of
publicly
released
models
that
are
similar
to
gpt3.
Maybe
they
are
of
the
same
size
as
jupiter3
and
Bloom
is
the
rec
is
a
regular
language
model
and
Bloom
Z
is
its
version
fine-tuned
to
follow
human
instructions?
So
basically,
that's
maybe
something
closer
to
chat
GPT,
because
the
standard,
mod
language
model
May
refuse
to
follow
your
instruction,
but
this
does
its
best.
So
yeah
we
have
two
of
such
models.
E
Many
servers
contributed
by
different
users
of
and
organizations
of
different
capacity
that
allows
us
to
like
form
the
full
chain.
So
basically,
both
of
these
models
have
70
blocks
and
the
client
when
it
wants
to
generate
something
or
train
an
adaptation
of
the
model.
They
choose
a
chain
through
all
these,
through
all
the
available
servers
given
the
conditions
they
want
to
optimize
and
then
run
the
model
with
maybe
like
a
decent
speed.
So
our
speed
is
something
like
one
token
per
second.
E
It
is,
of
course
not
not
as
strong
as
charge
PT,
because
we
didn't
spend
that
much
time
on
the
modern
side,
but
still
it
runs
to
run
this
Bloom
Z
Model
the
model
fine
tune
to
follow
the
human
instructions
and
see
its
outputs
in
real
time
as
soon
as
it's
it
is
generated
by
the
Swarm.
So
it
should
work
with
one
or
two
seconds
per
token.
Sometimes
it's
slower
because
sometimes
we
have
lots
of
clients
and
of
course
we
need
more
servers
joining
to
handle
increased
Lots
properly.
E
But
anyway
this
works
and
without
battles
like
you
couldn't
you
couldn't
actually
try
balloon
Z,
because
it's
not
available
yet
in
most
inference
apis
so
and
like
inference
with
offloading
is
like
20
seconds
per
token.
So
this
allows
many
people
to
to
try
this
model
now
and
if
you
are
an
ml
engineer,
a
machine
learning
engineer,
it
is
important
that
we
made
battles
interfaces
loop
so
that
we
made
battleless
interfaces
so
that
petals
is
no
more
difficult
to
use
than
running
on
a
small
model
on
your
PC.
E
So,
basically,
all
these
interfaces
are
very
similar
to
a
popular
hug
and
face
Transformers
library
and
you're
kind
of
just
create
another
models.
And
if
you're,
just
an
ml
engineer,
you
don't
need
to
know
that,
like
under
the
hood,
there
are
like
huge
amount
of
algorithms,
all
the
lipidopy
stack
and
so
on,
like
lots
of
going
on,
and
for
you
it's
just
like
the
usual
model,
you
run
so
basically,
that's
mostly
it
some
points
about
the
future
work.
We
may
do
so.
E
We
consider
introducing
rewards
for
hosting
servers
because,
of
course
there
may
be
some
imbalance,
for
example,
if
it
turns
out
that
many
people
made
some
client
applications
for
battles,
but
not
many
people
make
decide
to
like
join
as
servers
so
yeah
we
wanna
to
introduce
some
rewards
that
will
serve
people
hosting
servers
will
then
be
able
to
use
them
on
high
priority
inference,
and
maybe
some
extra
features
like
increase
Tech
flanks.
We
do
not
consider
like
any
crypto
at
the
moment.
E
It
will
be
just
like
a
simple
reward
system
because,
like
our
system
get
all
this
ml
focused
so
but
anyway,
I
hope
this
will
motivate
some
people
to
contribute.
Also,
we
wanted
to
make
like
a
leaderboard
of
who
contributed
the
most,
so
people
can
kind
of
advertise
their
companies-
maybe,
for
example,
you
own
a
GPU
hosting,
and
you
can
like,
provide
a
couple
of
GPU
and
advertise
its
name
in
our
leaderboard.
E
If
they
cheat,
we
can
like
ban
them
and
maybe
and
make
some
subtract
some
amount
of
their
points,
and
so
on
and
I
I
would
say
that
the
most
important
downside
of
our
system
compared
to
other
approaches,
is
privacy,
because
peers
in
the
Public's
form
May
recover
parts
of
your
data.
That's,
unfortunately
how
it
works
right
now,
of
course,
there
are
like
a
lot
of
stuff
about,
like
some,
you
know:
multi-party
computations,
like
maybe
ziki
ZK
proofing,
and
that's
more
for
security
and
so
on.
E
But unfortunately, this does not work well for large-scale machine learning yet, because it usually involves a 10x or 100x slowdown, due to the fact that these methods usually work with, you know, integer, modular arithmetic, and machine learning is a completely different field: you calculate everything in floating-point numbers, and these MPC algorithms don't translate well to floating-point.
E
So
you
have
lots
of
Converse
versions
so
anyway,
like
we,
we
think
that,
unfortunately,
it's
not
practical
to
apply
any
known,
MC
methods
here
yet,
but
maybe
some
methods
will
appear
in
the
future
and
for
now
we
we
just
just
set
up
if
you
like,
if
you
still
wanted
to
use
pedals.
For
example,
you
are
in
some
Universal
app
that
wants
to
process
some
private
data.
You
can
set
up
a
private
swarm
between
organizations.
E
Organizations you trust: maybe you join with another small company or another university lab, and you can easily set up a private swarm, basically a private network, where all the data is processed among the peers you trust, so security and privacy won't be issues for you. So basically, that's it. You can check out our website; we have the GitHub repo, docs, tutorials, everything there, and also the paper, if you want to dig into the technical details. Feel free to ask any questions now.
A
Alexander, this is amazing; I'm a huge, huge fan of the work that you guys have built, in so many ways. I'll start with one question, and I have a lot more, but I'll give some other folks an opportunity to weigh in. Could you share a little bit about when you were starting the network of compute providers available in your network, the people that are actually running the Petals software?
A
Could
you
talk
a
little
bit
about
how
you
grew
that
network?
Was
it
simple
word
of
mouth
or
do
you
have
any
more
sort
of
intentional
way
of
growing
that
and
maybe
what
you
would
like
for
it
to
become
in
the
future.
A
E
Sure. So basically, it grows at some rate for now. We got some initial visibility because BLOOM is a very popular model, and indeed, running such large models was an issue for many people, so we got some visibility on the ML subreddits, maybe ML Twitter, Hacker News, and some people just came to try it out.
E
It's not difficult for some people to host a couple of servers, because they may have a couple of spare GPUs, but of course we are not growing at a very quick rate yet, and that's why we want to work on introducing incentives as fast as possible, so that people are more motivated to do that. However, I think the growth rate is not zero; I suppose it's something similar to BitTorrent.
A
Yeah, great. It's a testament to the interest right now in the language models that you guys are serving.
B
What's the current traction that you're seeing? And when you look at it, I don't know, two years out, what stops you? How do you quantify it for, you know, a team that's looking to do something that's more commercial, more scaled, and less on the science-fair-project side?
E
Yeah,
so
we're
actually
a
research
team
saw
like
petals
is
not
a
startup.
We
we're
just
like
researchers
at
a
lab,
and
so
we
decided
to
make
this
project
with,
like
other
researchers,
from
the
big
science
collaboration,
basically,
collaboration
that
unites
all
the
researchers
from
lots
of
different
universities
and
companies,
and
so
we
actually
don't
want
to
monetize
it
don't
plan
to
monetize
it
to
get
any
profit
ourselves
for
now.
E
So
our
I
think
our
primary
focus
for
now
is
to
just
make
some
kind
of
self-sustainable
system,
like
maybe
BitTorrent
or
Bitcoin.
So
maybe
maybe
you
will
move
on
to
any
of
the
other
project
or
company,
but
the
system
will
live.
Maybe
someone
else
will
be
able
to
like
contribute,
and
so
on
so
yeah.
We
don't
have
like
explicit
plans
for
monitor,
Asian
and
only
want
to
add
incentives
so
that
like
to
balance
the
demand
for
the
servers
and
Supply,
but
these
incentives
they
are
not
like
planned.
Like
you
know,
they
are
not.
B
Do you have a team, like our team, to develop a pipeline for these two offerings, you know, for the reward? Can you define, like, okay, in a year or two years we think it could be this big because of this inflection point? And I think someone wanted to ask what universities you're working with; maybe that's kind of the path. Is there a partnership or something that you need that would cause that growth?
E
Yeah, so right now we're thinking about different ways, you know, to spend, to award points, so thanks for the suggestion. I guess we need to take a look at different approaches, because there are lots of similar networks out there awarding people for computations, but for now I can't state any specific plans.

C
Very cool, thank you.
E
So, a question from Irina: yes, where am I based? I'm currently in Armenia. My university... so yeah, we are actually a research lab at Yandex, in the Yandex company, but we're kind of very independent, because we don't do anything for the business.
E
We're
mostly
focused
on
like
something
and
contributing
papers,
and
we
did
this
project
with
people
from
many
other
companies
like
on
the
first
slide,
like
people
from
also
hug
and
face
that's
a
well-known
company
in
ml
in
NLP
field,
also
from
people
from
with
people
from
University
of
Washington
and
so
on.
Yeah
yeah,
thanks
for
the
link
yeah
with
people
from
big
science.
Basically.
E
Sure. Yeah, so basically, I think our top priorities now are these. First of all, we want to finish the incentive system, because it is important for matching demand and handling the growth more naturally. Also, we want to do some technical optimizations; I'll move to this slide. Basically, there is huge room for improvement in the routing algorithms, that is, choosing the path among servers so that everything works as fast as possible, and we can also consider implementing some things like tensor parallelism.
E
So
basically,
if
you
have
one
machine
with
a
few
gpus
right
now,
you
need
to
like
run
and
battle
servers,
but
you
will
be
able
to
run
one
tensor
parallel,
tensor
pedal
server
that
will
be
like
n
times
faster.
So
basically,
this
also
will
decrease
the
inference
time,
so
so
yeah
I
think
that's,
basically
it
so
incentives
and
some
work
on
making
the
network
more
fast
and
more
stable.
C
Is there any understanding of the asymptotic limits of how well this could perform, relative to models being run all in one location?
E
Yeah, so there is something you cannot optimize: basically, you can optimize the computations that are done inside one server, but you cannot do much about the latency between servers. For example, right now, for BLOOMZ, you will need three hops at best.
E
So, if you choose one server holding, say, blocks 0 to 16, then this server, and then this server, you can optimize what happens inside a server, but the latency between them stays. I think this can also be optimized to some extent, for example by choosing the servers geographically closest to you. I imagine that if the network grows, we'll have servers in America, in Europe, maybe on other continents as well, and the clients will be able to choose servers so as to minimize this latency.
C
Also
are
there
substance,
are
there
like
what
are
the
main
differences
between
training
and
inference
in
these
models,
and
also
you
did
go
over
trading
somewhat,
but
it
would
you.
It
was
also
those
models
weren't
being
trained
from
scratch.
So
is
it
also
so
what's
the
story
with
training
models
from
scratch.
E
Yeah, so actually, our research group started by approaching how to train very large models from scratch. I think they started in 2020, and I joined soon after, and at that point there was GPT-3, but there were no publicly released models of this size. So we thought that maybe we needed to make a system for training them from scratch in order to get them.
E
But
actually,
like
turned
out
that
this
system,
like
we
made
a
library
of
this,
it's
called
hive,
mind
and
actually
battles
uses
it
a
lot
under
the
hood.
So
like
we
built
on
our
previous
work
but
have
mine
in
my
opinion,
and
it
turned
out
not
as
popular
because
to
train
large
models
from
scratch.
You
need
like
lots
of
ml
expertise
like
much
more
than
like.
You
need
to
just
like
fine
tune
it
for
something
or
run
a
chart
and
so
on,
and
it
turns
out.
E
There
is
not
too
many
people
out
there
who
don't
work
for
a
sub
company
yet
and
who
wants
to
like
create
these
large
models,
and
it
was
much
easier
for
them
to
find
some
funding
find
some
like
donations,
basically,
then,
to
set
up
this
complicated,
distributed
training
system
that
like
uses
sleep,
b2p
and
many
other
things.
So
basically,
I
have
mentioned
that
to
be
much
more
complicated,
though
you
can
still
use
it
to
to
train
like
models
from
scratch,
maybe
not
100
billion
plus,
but
I.
E
Think
like
5
or
10
billion,
definitely
works
yeah
and
so
yeah
yeah
a
lot
harder
for
open
sources.
Alexandra
writes
in
the
chat
and
yeah
and
as
for
the
difference
from
between
training
and
between
inference
and
fine
tuning,
so
like
right
now.
Basically,
an
important
difference
is
that
you
usually
do
inference
like
for
one
sequence
or
for
a
few
sequences
in
parallel,
and
you
usually
do
training
for
for
large,
very
large
batches
of
examples.
E
Latency
is
not
important
for
training,
because
throughput
is
much
more
important
because,
like
your
transfer,
lots
of
data
then
process
a
lot,
a
huge
batch,
then
transfer
them
and
so
on
and
and
yeah
and
for
instance,
another
important
consideration
is
that
it
consumes
memory
because,
like
people
for
example,
if
you're
talking
to
like
our
chat,
our
chatbot,
you
need
like
while
you're
talking
all
the
servers
in
your
inference
chain
should
store
all
the
previous
like
States
for
the
previous
tokens,
so
that
you,
the
Transformer,
can
attend
to
it.
E
That way it can look back and understand what you were talking about, and so on. So inference consumes some memory, which is called the cache in this table, and you need to allocate that memory for long periods of time to allow inference.
A
Alexander, one last question here; I know we're getting a little close on time, but I'm super interested in the checking, the validation, that your team is doing for potential cheaters in the system. Could you give a high-level perspective? When you're thinking about designs for detecting malicious behavior or cheating behavior, do you try to approach it in a programmatic way, or is it somehow more of a manual task? That'd be very helpful, because a lot of other projects are concerned with similar problems.
E
Sure. So, the security system that we want to start with is definitely not a silver bullet.
E
It won't be perfect, but I think it will be able to catch at least hardware failures, or people who didn't spend multiple weeks digging into our whole stack, and so on. What we want to do is basically make validators that pretend to be clients and send different kinds of requests to the servers. They know the correct answer in advance, because we may pre-calculate it, or we can calculate it using a subset of trusted servers, since there are a couple of servers hosted by us in this scheme.
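Here is a minimal sketch of that validator idea, with hypothetical function names; a real check would compare floating-point activations within some tolerance:

```python
import torch

def validate_server(query_server, trusted_forward, block_input, atol=1e-2):
    """Probe a server with a request whose correct output is known in advance.

    `query_server` and `trusted_forward` are placeholders for "run this
    block on the server under test" and "run it via trusted servers".
    """
    expected = trusted_forward(block_input)  # pre-calculated / trusted result
    observed = query_server(block_input)     # the suspect server's answer
    # Allow small numeric drift (different GPUs, quantization); flag the rest.
    return torch.allclose(observed, expected, atol=atol)

# Tiny self-check with stand-in functions: an honest server passes.
honest = lambda x: x * 2
print(validate_server(honest, honest, torch.randn(4)))  # True
```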
E
We can compare the results, and if you assume there is some malicious server that responds with incorrect data, maybe not all the time but with a certain probability, it will be caught sooner or later. Of course, it will be able to harm a couple of people, but it will be penalized eventually.
E
So I think that's good enough to start with, and for increased security guarantees, one thing clients can do is run inference through multiple chains, through two disjoint sets of servers, so that you can double-check your result. And of course, for 100% guarantees, you need to use a private swarm in this scheme, unfortunately. There is no good proof of work for neural networks at the moment.
A
And if not, we'll wrap up. Thank you so much for joining, both you and Svetly; this was tremendous content, much appreciated. I'm going to post this recording to the Slack channel soon, under the Compute over Data working group, but if people want to go to cod.cloud, they can get a link to the Slack channel there. And obviously, Svetly and Alexander, we'd love to have you guys join us there for any future conversations. But just to wrap up:
A
Thank
you
so
much
for
taking
time,
love,
love
what
you
guys
are
building,
and
we
appreciate
you
for
joining.
E
Yeah.
Thank
you.
Thank
you
for
inviting
me
like
protocol
Labs.
We
use
a
lot
of
your
stuff
and
I.
Think
like
it's,
we're
very
happy
that
you
invited
us.