Cloud Native Computing Foundation KubeCon + CloudNativeCon Europe 2020 - Virtual, 4 Sep 2020

From YouTube: Is There a Place For Distributed Storage For AI/ML on Kubernetes? - Diane Feddema & Kyle Bader

Description

Is There a Place For Distributed Storage For AI/ML on Kubernetes? - Diane Feddema & Kyle Bader, Red Hat

Containerized machine learning workloads running on Kubernetes receive benefits such as portability, declarative configuration, less administrative toil, all with marginal performance impact. The best published results for performance sensitive machine learning workloads, e.g. MLPerf v0.6, were obtained by reading the datasets from local SSDs. While the MLPerf datasets fit comfortably on a single SSD, it’s a luxury not afforded to folks training models against petabyte scale datasets. We’ll share our experience running MLPerf training jobs in Kubernetes, against datasets stored by Kubernetes stateful storage services orchestrated by Rook. Highlights include the performance and scalability tradeoffs associated with local and open source distributed storage, and how machine learning formats like RecordIO and TFRecord provide performance utility and model validation flexibility.

https://sched.co/ZerS

A

Hello, everyone, and welcome to virtual Amsterdam and "Is There a Place for Distributed Storage for AI/ML on Kubernetes?" I'm Diane Feddema, a principal software engineer at Red Hat, and today I'm speaking with Kyle Bader, who is a principal software architect for cloud storage and data services at Red Hat.

A

So, next slide, please.

A

We want to know: is there a place for distributed storage for AI/ML workloads on Kubernetes?

A

What do you think?

B

I would say yes, absolutely. And why were we interested in doing this? Well, it stemmed from a number of conversations that we had internally as we were thinking about distributed storage and the implications of machine learning. It boils down to this: many of you probably have an environment...

B

Well, if you're lucky, you have an environment that looks something like this, right? And Kubernetes is awesome, so you've chosen to use Kubernetes to manage this environment, and most of the time there will be

B

some number of general-purpose compute nodes that are running your applications, your storage, and so on and so forth. And then the lucky part is if you are able to have some special-purpose compute with additional GPUs to accelerate your machine learning workloads, so that you can finish both training jobs and inference jobs a little bit more expeditiously. And collectively,

B

this Kubernetes cluster that we're looking at here represents a fairly significant capital investment. So I want to make sure that it's used as efficiently as possible, right? We want to maximize our utility from this investment. So, Kubernetes to the rescue, right? At a very simple level, Kubernetes is a scheduler.

B

It does a lot of other things in addition to scheduling, but one of the things that it's here to help you with is packing bins. You can think of nodes as these bins, and then, instead of having dimensions like length and height and depth,

B

those nodes have dimensions like amount of CPU and amount of memory. How much storage space do they have on their local storage devices? How much I/O can those storage devices handle? We want to effectively fill these bins with pods, and we want to fill them as full as possible without them impacting each other,

B

in order to maximize the utility of that cluster. So we schedule a pod, and that pod is going to consume some amount of CPU and some memory. It will generate some amount of storage I/O and consume some amount of storage space, and if you draw this up onto this kind of multi-dimensional object here, you get a shape. So each pod has its own shape, and with all these different shapes of pods,

B

you want to try to fill up each of these dimensions as full as possible.

B

So, as you add another pod, represented here by this dotted purple line, we see that it's taking up an additional amount of CPU and an additional amount of memory. It's going to consume some more storage I/O and take up some more storage space.

B

Now, some of you might have noticed a problem here, and I'll point it out: we've exhausted the amount of CPU resources here. What does that mean? Why is that bad? We want to maximize usage of it, so we want to use all the CPU, don't we? Well, yes, we do, but we don't want to trap the other resources.

B

Basically, we don't want to have this stranded capacity in terms of memory, unused storage space, and unused storage I/O. We want to use those as efficiently as possible too. So one of the problems that you want to solve with the scheduler is to avoid trapped capacity.

B

One way you can help avoid trapped capacity is by having fewer dimensions that are constraints, and this is where distributed storage gets interesting, because you can abstract away storage. Because it's being remotely accessed, the pods can be scheduled in more places and there are fewer constraints on scheduling them, such that it's easier to fill your bins maximally and, in turn, maximize the utility of that significant capital investment.
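
A minimal sketch of the bin-packing argument above, not the actual Kubernetes scheduler. The node size, the pod shapes, and the helper names (`fits`, `place`) are all made up for illustration. It shows how one exhausted dimension strands the others, and how dropping local storage as a constraint lets a pod land on a node it otherwise could not use.

```python
from typing import Dict

Resources = Dict[str, float]


def fits(free: Resources, pod: Resources) -> bool:
    """A pod fits only if every dimension it requests is still available."""
    return all(free.get(dim, 0) >= need for dim, need in pod.items())


def place(free: Resources, pod: Resources) -> Resources:
    """Return the capacity left on the node after placing the pod."""
    return {dim: free[dim] - pod.get(dim, 0) for dim in free}


# Hypothetical node: 16 CPUs, 64 GiB of memory, 1000 GiB of local disk.
node = {"cpu": 16, "memory_gib": 64, "local_disk_gib": 1000}

# Two CPU-hungry pods exhaust CPU while memory and disk sit unused.
cpu_heavy = {"cpu": 8, "memory_gib": 8, "local_disk_gib": 50}
free = place(place(node, cpu_heavy), cpu_heavy)
stranded = {dim: amount for dim, amount in free.items() if amount > 0}

# A pod whose data set is larger than the node's local disk cannot be
# scheduled there. If the data lives on distributed storage instead, the
# local-disk dimension disappears from the pod's shape and it fits.
big_data_pod = {"cpu": 4, "memory_gib": 16, "local_disk_gib": 2000}
remote_data_pod = {"cpu": 4, "memory_gib": 16}

print("stranded capacity:", stranded)
print("fits with local data:", fits(node, big_data_pod))
print("fits with remote data:", fits(node, remote_data_pod))
```

With CPU at zero, the remaining memory and disk cannot be used by any pod that needs CPU, which is the trapped capacity described in the talk.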

B

Another time when distributed storage comes into the picture is when you need more storage space for a particular application than you have on any given node in your cluster.

B

So if you have data that's larger than a node, then you're going to have to spread that data over multiple nodes, and guess what: distributed storage helps you do that. By virtualizing the storage, collecting a bunch of disks together and then using various software-defined storage methodologies,

B

it can create either virtualized block storage, file system storage, or object storage, and then you can have larger data sets that pods interact with than what you could get from any individual node.

B

The other problem that might present itself is: okay, you have available compute and you want to use it up, but the data already exists on another node, because you're using local storage somewhere and your data is over here. So now, sure, you could have something that's going to make that data over on the right host remotely accessible,

B

maybe iSCSI or NFS or something to the node on the left, but running that NFS or iSCSI gateway is going to consume some amount of resources, so you can't just ad hoc provision it. It would really be better if we just had a distributed storage system in place that gave Kubernetes more flexibility in terms of scheduling. None of this is unique to Kubernetes or machine learning workloads, but it does apply equally to machine learning workloads. It was in this thinking,

B

this context, that we started to ask the questions about distributed storage and machine learning. So this is a very simplistic, truncated version of the machine learning life cycle. Of course, there's going to be something on the left, the applications or sensors or whatever that's generating the data, and then there's more on the right. Training is not the end-all, be-all.

B

You're going to have to take that model and wire it into an intelligent application and do something useful with it: shopping cart suggested products, or detecting fraud, and so on and so forth. But for the purposes of this discussion, we've simplified it. So the time to value is something that we want to shrink, and some of the things that need to be performed before we can realize value from that data

B

are these: we're probably going to have to prepare it, wrangle it together in some sort of way, and then, without distributed storage, we're going to have to move the data. So if it's going to take a really long time to run it on the general-purpose compute, and it was processed on the general-purpose

B

compute, then we're going to have to move it to the special-purpose compute in order to finish it in any sort of reasonable time frame. And then, finally, once the data is there, we can actually start training the model against that data, and once it's completed, we're on to the next phases of the life cycle. So, with distributed storage, can we compress this? Can we get to that value faster?

B

Can we reduce the time that it takes to get to that value?

B

And if you implement some sort of distributed storage, you could potentially compress the data moving and model training into the same space. Perhaps that does mean that it inflates the time that the actual model training takes, but you potentially save yourself that data-moving stage. Now, of course, an astute reader might go, "Oh, okay.

B

Well, if the data is now remote and the GPU node is accessing it over the network, isn't it kind of like the data is moving?" And the answer is yes. Over the course of the model training, either incrementally over the course of the whole training

B

it's grabbing some of the data it needs to keep the GPUs busy, or it bulk loads it in the beginning and then processes it from there. So this is what we want: we want to see if we can compress that time to value by collapsing these two stages, using remote storage and accessing the data in situ in some sort of distributed file system or object store or something like that.
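
The two access patterns just described, bulk loading up front versus fetching incrementally during training, can be sketched as follows. This is only an illustration: `read_remote` is a hypothetical stand-in for a read from a distributed file system or object store, and the sample IDs and batch size are made up.

```python
from typing import Iterator, List, Sequence


def read_remote(sample_id: int) -> bytes:
    """Hypothetical stand-in for one read from remote storage."""
    return f"sample-{sample_id}".encode()


def bulk_load(sample_ids: Sequence[int]) -> List[bytes]:
    """Read everything before training starts: one burst of I/O up front."""
    return [read_remote(i) for i in sample_ids]


def stream_batches(sample_ids: Sequence[int], batch_size: int) -> Iterator[List[bytes]]:
    """Read one batch at a time: I/O is spread across the whole training run."""
    for start in range(0, len(sample_ids), batch_size):
        yield [read_remote(i) for i in sample_ids[start:start + batch_size]]


ids = list(range(10))

# Pattern 1: all reads happen first, then the training loop touches only memory.
dataset = bulk_load(ids)

# Pattern 2: each training step triggers just the reads it needs.
batch_sizes = [len(batch) for batch in stream_batches(ids, batch_size=4)]
print(len(dataset), batch_sizes)
```

Either way the same bytes cross the network; the difference is whether the storage system sees one short burst or a steady trickle.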

B

So the other interesting thing that comes about with distributed storage is that it helps a lot with parallelism. Just a simple explanation of parallelism; many of you already know all this. If I'm digging a ditch and I'm alone, and I break it up into six work units, it's going to take me longer.

B

If I want to dig that ditch faster, I can recruit two of my best friends who are equally hard workers as me, and we'll each complete two units of work, and that ditch will be dug faster. That's great, and the reduction in time that it takes to dig that ditch is speedup; that's the computer-sciencey term for it.
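
The ditch example works out as a simple ratio: speedup is the serial time divided by the parallel time. A small sketch, assuming each work unit takes the same time (the 10 minutes per unit is an arbitrary figure chosen for illustration) and that workers do not slow each other down:

```python
import math


def time_to_dig(work_units: int, workers: int, minutes_per_unit: float = 10.0) -> float:
    """Wall-clock time when work units are split evenly across equal workers."""
    units_per_worker = math.ceil(work_units / workers)
    return units_per_worker * minutes_per_unit


def speedup(serial_time: float, parallel_time: float) -> float:
    """Speedup = time with one worker / time with N workers."""
    return serial_time / parallel_time


alone = time_to_dig(work_units=6, workers=1)         # one person digs all six units
with_friends = time_to_dig(work_units=6, workers=3)  # three people dig two units each

print(alone, with_friends, speedup(alone, with_friends))
```

Under these idealized assumptions three workers give a 3x speedup; real training jobs usually fall short of that because of coordination and data-movement overhead.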

B

So, in addition to collapsing those two stages, the training stage and the data-moving stage, can we also get speedup to further reduce the training time, potentially compensating for that or going further than we would be able to otherwise?

B

So if we have this collapsed model training and data moving step, can we parallelize? Can we add additional GPU hosts? Of course, this would be after we've stuffed in as many GPUs as NVIDIA has engineered to fit into a single system, but once you've exhausted the ability to scale up, you're only left with the ability to scale out.

B

So, if we scale out, can we further reduce the amount of time it takes to do this collapsed data moving and model training step?

B

So, with this in mind, this is the overall theory-level thinking behind the question: is there a place for distributed storage for AI/ML in Kubernetes?

B

So next we needed to test this hypothesis, and so, without further ado, I'll let Diane talk a little bit about a benchmark that we used for some testing, called MLPerf.

A

Kyle, next slide, please.

A

Okay, so, as Kyle said, we're going to run an experiment, and we need models to train for this experiment. We're going to train them with data that's sitting on distributed storage, then compare that to data that's sitting on local storage and see how each of them does. So we decided to run the MLPerf training benchmarks. MLPerf is an emerging industry-standard benchmark that is designed to be used as a level playing field for comparing machine learning infrastructure.

A

There are all these companies shown here that are contributors to MLPerf, and university researchers as well.

A

So we thought this would be a great choice of a real-world benchmark that tests the entire system to run in our distributed storage experiment. Next slide, please. The MLPerf benchmarks solve real-world problems, as I said, such as computer vision problems like identifying objects in images, and translating one language to another.

A

So we are going to run the SSD benchmark, which consists of a model, SSD with ResNet-34, and uses the COCO dataset, which is a publicly available dataset. We are also going to run the Transformer model, and that uses the WMT public dataset.

A

So I'll talk a little bit about each dataset. All the code that we used here is open source. You can go to mlperf.org and see the code that we ran, and we're running MLPerf version 0.6 training in our experiments. Next slide, please.

A

So this is an example of object detection and segmentation. You can see that the objects have been detected and labeled, and the segmentation has also happened on the guy on the bicycle.

A

So these sorts of applications use things like the COCO dataset, which we're using, and each year there are competitions to improve the state of the art in this object detection and segmentation.

A

So we just ran one of these benchmarks. Next slide, please.

A

A little bit about the COCO dataset that we used: it's the Common Objects in Context dataset, and it has over 300,000 images.

A

It allows researchers and people who are doing benchmarks, like we are, to take real-world problems of finding these objects in these scenes, outlining them and labeling them, and seeing what accuracy level your model achieves. The dataset was created by Cornell in a joint project with Microsoft originally, and if you'd like to check out the dataset online (next slide, please), you can download it, and you can also peruse the dataset and search for things, like in this case.

A

You can submit queries on the COCO Explorer where you want to see all the scenes with horses and bicycles and cars, or whatever combination of the objects it recognizes.

A

You can choose them here and check whether the dataset has the sort of objects that you are looking to recognize in your images. Next slide, please.

A

So Kyle talked a bit about why we want GPUs. We want to get that speedup that Kyle was talking about. In these benchmarks we use neural networks that rely heavily on linear algebra, things like matrix multiply and dot product, and GPUs are extremely efficient at doing that matrix math.

A

They are great number crunchers for matrix math, and so you can get thousands of times speedup. It's not unusual to get, say, a 3,000 times speedup if you run a large matrix multiply on a GPU instead of a CPU.

A

So what we're trying to do here is use these GPUs to reduce the training time and make those matrix operations go faster. We also have to feed those GPUs with data, and that's what we're testing.

A

Are we getting the data from this distributed file system fast enough to keep the GPUs busy? Training times can take days or weeks, and if you made a slight error in your code and you figure that out a week later, then the turnaround time for the data scientist or the researcher is not reasonable. So we want to shrink down that time to train so that you can iterate more, make more changes, and improve your model quicker. Next slide, please.

A

In MLPerf we will be using Python. There are many languages that you can use, Lisp or Java; you could choose other languages to write a neural network application, but the benchmarks that we are going to be running use Python. There's a good reason why many data scientists prefer Python: you have access to libraries, and I'm just listing four here that are very useful when writing neural networks: Theano, TensorFlow, Keras, and PyTorch. Today, in our example, we'll be using PyTorch.

A

Another really nice thing about using Python is that in your Python code you not only have access, in our case, to the PyTorch library, but you also have access to NumPy, and you can move these multi-dimensional arrays back and forth between NumPy and PyTorch very easily. The API is very easy to work with, and I think that's why researchers and programmers like it so much. Next slide, please.

A

So, just a little bit more about PyTorch: it's open source. It was created by Facebook. It was initially intended as a replacement for NumPy that used GPUs. NumPy runs on CPUs; PyTorch runs on GPUs. With just minor modifications to your code,

A

you can make your multi-dimensional arrays in NumPy run on GPUs when you're using PyTorch. So PyTorch was created so that it was easier to build neural networks. It uses something called tensors, which are really just multi-dimensional arrays: imperative multi-dimensional arrays that run on GPUs. By imperative, I mean that as you go down through your code and execute a line that uses a tensor, the computation is performed immediately when you hit that line of code, and not all implementations are like that.

A

The first release of TensorFlow was not imperative. They have now switched to the imperative model, because it's so popular, and in TensorFlow 2.0 they also use the imperative computation approach.

A

So in PyTorch you can run on CPUs or GPUs. You can switch back and forth easily with some minor changes in your code, and you get the benefit that slight changes to your code allow you to get huge speedups by using GPUs. Next slide, please.
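
A short sketch of the NumPy interoperability and CPU/GPU switching described above. It assumes NumPy and PyTorch are installed and returns `None` if either is missing; the function name `numpy_torch_roundtrip` and the array contents are made up for illustration, and this is not code from the MLPerf benchmarks.

```python
import importlib.util
from typing import Optional


def numpy_torch_roundtrip() -> Optional[bool]:
    """Move an array NumPy -> PyTorch -> (GPU if present) -> NumPy and compare."""
    if importlib.util.find_spec("numpy") is None or importlib.util.find_spec("torch") is None:
        return None  # libraries not installed; nothing to demonstrate

    import numpy as np
    import torch

    a = np.arange(6, dtype=np.float32).reshape(2, 3)

    # NumPy array -> PyTorch tensor (shares memory with the array on the CPU).
    t = torch.from_numpy(a)

    # The "minor change" for GPUs: pick a device and move the tensor to it.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    t_dev = t.to(device)

    # Imperative execution: the matrix multiply runs as soon as this line does.
    product = t_dev @ t_dev.T

    # PyTorch tensor -> NumPy array (copied back to the CPU first).
    result = product.cpu().numpy()
    return bool(np.allclose(result, a @ a.T))


print(numpy_torch_roundtrip())
```

The only line that differs between a CPU run and a GPU run is the choice of `device`.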

A

So when we were preparing our benchmarks to run on the Kubernetes cluster that we created in AWS, we first containerized the MLPerf benchmark. We created a Dockerfile for it, and in our Dockerfile we used NVIDIA's version of PyTorch. We built it from source, then we added the scripts that run the MLPerf benchmark, and we created that container image and pushed it to quay.io,

A

which is just an online repository, and then we pulled it into AWS when we actually ran this job on our cluster. I'll show you next the YAML that we used to pull that image. Next slide, please.

A

So this is actually a different image that we're pulling. In this case we're pulling a CUDA vector add, but in our case we pulled the image for the MLPerf benchmark.

A

But what I'm showing here is that you have some YAML that Kubernetes interprets, and it schedules you onto the correct worker node based on the resources you're asking for. So here I'm asking for four GPUs, which is what we had in our cluster in AWS, and Kubernetes and OpenShift will schedule that pod onto a node that has the GPUs.
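
The slide's YAML is not reproduced in the transcript, so here is a hedged sketch of what such a request can look like, built as a Python dictionary and printed as JSON (which Kubernetes also accepts for manifests). The pod name and image path are placeholders, not the ones used in the talk; `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a minimal Pod manifest that asks the scheduler for `gpus` GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # The scheduler only places the pod on a node that can
                    # satisfy this limit, i.e. one with enough free GPUs.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Placeholder name and image path, chosen only for illustration.
manifest = gpu_pod_manifest(
    name="mlperf-training",
    image="quay.io/example/mlperf-pytorch:latest",
    gpus=4,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, assuming the cluster has GPU nodes with the device plugin running.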

B

So what do we see? What happened when the rubber hit the pavement? How did this turn out? Before I talk about what we saw, we'd like to splash up some details about the environment, so that if people want to recreate this sort of work, they can do so. We have the laundry list of versioning on the left.

B

There we used OpenShift Container Platform 4.5, which maps to Kubernetes 1.18, and then we used OpenShift Container Storage 4.4, which gave us Ceph Nautilus and Rook and CSI all neatly packaged together. The cluster itself was

B

just a few modest master nodes, and then we had three pretty modestly sized dedicated storage workers, for which we used taints and tolerations to ensure that only the storage workloads were running on them.

B

They had a couple of SSDs each. The workers are not particularly relevant here, because the one workload that we were running was basically the different PyTorch MLPerf tests, and those had GPU constraints, which made sure that they were scheduled to the p3.8xlarge that we had provisioned, which has four of the NVIDIA V100s.

B

They each have 16 GB of memory. We used EC2 to run our cluster, in the us-west-2 region, and we spread the masters, the storage, and the worker nodes across different availability zones for fault tolerance. But of course, because we only had one GPU worker, it was in a single availability zone.

A

So the first benchmark that we ran is an object detection benchmark called SSD, and you can see here the software stack that we ran, just like we described a minute ago. We're using Python and PyTorch and CUDA, and on top of that we're running OpenShift, and of course Red Hat CoreOS is the operating system that we're running. You can see that we have a training time where the data was sitting on local SSD.

A

So when we say local SSD, we mean on the same worker node where the V100 GPUs are. We had a time of 45 minutes 92 seconds; well, it's 0.92 minutes, actually, so 45.92 minutes. Then we had a training time using Ceph of 45.78 minutes. So you can see there's essentially no difference here; the difference is definitely noise at this point. We could have run it again and it could have been faster with Ceph.

A

So that was an excellent outcome, and it shows that yes, it's reasonable to use distributed storage for these training benchmarks. Next slide, please.

B

That's good news, right? We're not completely crazy.

B

What was the other one that we ran? It was...

A

So we ran the translation benchmark, which translates from German to English and English to German. This is a natural language processing benchmark, and there was only about a four percent difference here in our timings, slightly over four percent, 4.01,

A

I think. So again, we put the training data on the local node and tested it and got a timing of 62.91 minutes, and then, to compare side by side, we put the training data in Ceph and got a timing of 65.43 minutes.
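
The percentages quoted for the two benchmarks follow directly from the training times given in the talk. A small check, where the function name `percent_difference` is just a label chosen here:

```python
def percent_difference(local_minutes: float, ceph_minutes: float) -> float:
    """Relative change in training time when moving from local SSD to Ceph."""
    return (ceph_minutes - local_minutes) / local_minutes * 100.0


# Training times in minutes, as stated in the talk.
ssd_local, ssd_ceph = 45.92, 45.78
transformer_local, transformer_ceph = 62.91, 65.43

ssd_delta = percent_difference(ssd_local, ssd_ceph)
transformer_delta = percent_difference(transformer_local, transformer_ceph)

print(f"SSD:         {ssd_delta:+.2f}%")
print(f"Transformer: {transformer_delta:+.2f}%")
```

A negative value means the Ceph run was slightly faster; a positive value means it was slower.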

A

So again this was a great experiment and we got a good result.

A

So we're happy about that.

B

Yeah, it's interesting. I mean, they're so close, right? You said four percent variance or something. If we ran each of these 20 times, would it just be within the error bars? It's effectively no different. So we go, okay, well, this was interesting. Why?

B

Why did it come out this way? So we wanted to dig in a little bit with Prometheus and look at some of the telemetry from the test to really try to understand what's going on here. Why do we see what we saw?

A

And so it's very handy and useful to use Prometheus and Grafana together in Kubernetes and OpenShift. Right here we're looking at our Transformer run, and you can see that the top row graph is GPU utilization, and it goes from zero to nearly 100 percent

A

when the job starts at 9:55. You can see the time along the x-axis there, and so we are keeping these GPUs very busy for the duration of that run. The next row down is the graph for the memory utilization, and we're at about 15 gigabytes of GPU memory used for the duration of the job. And then the Ceph IOPS are distributed fairly equally across the entire run and are pretty low; we're not really hitting the storage that hard. So, Kyle,

A

would you like to say any more about this?

B

Yeah, it was kind of like, we launched this particular workload, and I was sitting there looking at the storage cluster, and it was like, "Is it working?"

B

I was like, Ceph is bored, right? We're not doing anything. And I think the I/O that we're mapping here is the data operations, so there might be some additional file system operations, like stats and listings and stuff, that are going on here that are increasing the I/O a little bit, but for the most part these are

B

relatively... you know, it's reading like a sentence, right? Those I/O sizes were really small if you divide the amount of data that's coming out by the number of I/Os over a particular period. So we're like, well, it's probably just reading a sentence, then converting the sentence, and then doing a comparison type of thing. So, at least for this workload, it's not a particularly storage-intensive machine learning workflow.

A

Right, and it trains on sentence pairs. So the English sentence and the German sentence are sent together, and then you train; that's how the model learns to translate from one to the other. And so yeah, it's not particularly I/O intensive, and the I/O is human-language sentence pairs.

A

So look at the...

B

This was more interesting, I thought, because it was totally different than what we saw with the Transformer. So this was the single... what is it, single...

A

Single Shot Detector.

B

So, my favorite, because it's the SSD, right? And being a storage guy, we love SSDs. So the SSD workload:

B

overloaded terminology, but it was interesting from a storage perspective because it basically just bulk loaded everything at the very beginning, and then once it was over to the GPUs,

B

Ceph was just kind of relaxing while we burned up a bunch of energy crunching and doing this image detection. So that was kind of interesting. We saw a lot more I/O, but it was concentrated in a brief, almost two-minute window.