From YouTube: OpenShift Commons ML Briefing: Distributed Training Performance on K8S with Lachlan Evenson (MSFT)
Description
Lachlan Evenson (Microsoft) discusses the results of his team's research on Distributed Training Performance on Kubernetes with the Machine Learning on OpenShift SIG of OpenShift Commons.
Learn more at https://github.com/joyq-github/TensorFlowonK8s
Join OpenShift Commons https://commons.openshift.org#join and join the conversation
B: Can you, can you give me...
A: Okay, okay, fantastic. So if you have any questions, feel free to interrupt, because I can't see everybody's faces while I'm giving the presentation, so don't be shy. What I'm going to share is some performance testing that we did with distributed training on Kubernetes with TensorFlow. This is a piece of a presentation that I gave at KubeCon, but I think you'll find it very interesting.
A: So, as David mentioned, a lot of folks are looking to run ML workloads on Kubernetes, and the SIG is really about doing that on OpenShift: how do we get ML onto Kubernetes? This work came out of something about eight months ago, when the Microsoft Research team came to the container team at Microsoft Azure and asked whether we could lend some of our expertise in running Kubernetes and help them understand why they were having performance issues when they tried to scale distributed training models on Kubernetes. So I went over and worked with the MSR team for several weeks, and we went through all the scenarios; this is just the output of that work, so hopefully you'll find some of it interesting. At the end there's also some future work that we're doing in this space. But before we get into it, most people ask: what were you testing on?
A
So
you
can
taste
this
and
test
it
on
any
cloud
and
any
any
bare
metal
setup
that
you
have.
What
we
do
here
is
we've
just
documented
the
exact
VMS
specs
that
we
have
for
the
workers
and
for
the
parameter
servers
the
type
of
GPUs.
We
have
the
type
of
the
diversion
of
kubernetes,
all
the
parameters
that
we
have
for
the
actual
test
environment
and
we're
just
using
I'm,
not
sure,
if
everybody's
aware
we're
actually
just
using
the
upstream
tensorflow
benchmark
scripts
to
actually
run
these
test
frameworks
against
different
models.
That's
all
there
for
you.
A: If you want to go and replicate this and test it yourself, the data set we're using is ImageNet, which is a real data set, not synthetic. So those are the parameters of the test environment.
The first thing we did was no distribution at all. We just ran a single node with multiple GPUs to get a baseline of how a model could be trained and how many images per second we could process out of that ImageNet data set. You can see here, at 1, 2 and 4 GPUs, we get pretty much linear scalability, and we made sure that all the GPUs were indeed saturated during those tests. That gave us a fairly good baseline of what we could do with ResNet-50, so 64 images per second there. So that was the standard: we get linear scaling on a single node. What we wanted to see next was whether we still got linear scaling in a distributed setup, or whether it broke down.
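As a point of reference, a minimal sketch of driving the upstream tf_cnn_benchmarks script for that single-node, multi-GPU baseline might look like the following. The paths, batch size and model flag are illustrative assumptions rather than the team's exact commands, and flag names can differ across script versions.

```python
# Hypothetical invocation of the upstream TensorFlow benchmark script
# (tf_cnn_benchmarks.py from tensorflow/benchmarks) for the single-node baseline.
# Paths and flag values are assumptions for illustration only.
import subprocess

for num_gpus in (1, 2, 4):                      # the 1, 2 and 4 GPU baseline runs
    subprocess.run([
        "python", "tf_cnn_benchmarks.py",
        "--model=resnet50",                     # also googlenet, inception3, vgg16, ...
        f"--num_gpus={num_gpus}",
        "--batch_size=64",
        "--data_name=imagenet",
        "--data_dir=/data/imagenet",            # real ImageNet data, not synthetic
    ], check=True)
```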
A: When we moved to distributed, we started with one parameter server and two workers. We used asynchronous variable updates; these are all parameters you can tune inside TensorFlow when you set things up, and we used the CPU as the local parameter device. Each worker pod had its own dedicated host, so we didn't have any contention between parameter servers and worker nodes: it was essentially three VMs, one parameter server and two workers, each with a pod for the respective service it was hosting. We had the variable update mode set to parameter server, and we were using gRPC as the network protocol. What you see here is single-pod training with 4 GPUs versus distributed training. The blue bars are the numbers we actually saw with a single pod on a single node, and then what we saw when we scaled it out.
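As a rough illustration of that configuration (one parameter server, two workers, parameter-server variable updates, CPU as the local parameter device, gRPC), the same benchmark script can be launched per role along the lines of the sketch below. Host names, ports and flag values are assumptions, not the exact test setup.

```python
# Hedged sketch of the distributed tf_cnn_benchmarks setup described above.
# Each command runs in its own pod/VM; names, ports and paths are illustrative.
import subprocess

def benchmark_cmd(job_name, task_index):
    return [
        "python", "tf_cnn_benchmarks.py",
        "--model=resnet50",
        "--batch_size=64",
        "--num_gpus=4",
        "--data_name=imagenet",
        "--data_dir=/data/imagenet",
        "--variable_update=parameter_server",    # gradients flow through the PS
        "--local_parameter_device=cpu",          # CPU as the local parameter device
        "--server_protocol=grpc",                # gRPC transport
        "--ps_hosts=ps0:2222",
        "--worker_hosts=worker0:2222,worker1:2222",
        f"--job_name={job_name}",
        f"--task_index={task_index}",
    ]

subprocess.run(benchmark_cmd("ps", 0))          # on the parameter-server VM
# subprocess.run(benchmark_cmd("worker", 0))    # on worker VM 0
# subprocess.run(benchmark_cmd("worker", 1))    # on worker VM 1
```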
These are across the different models: GoogLeNet, Inception v3, ResNet-50, ResNet-152 and VGG-16. You can see that GoogLeNet actually gets fairly good scaling. At four GPUs you still get fairly linear scaling, but here at eight GPUs on two machines you can see the scaling factor drops a little depending on the model, and for VGG-16 in particular it actually gets worse: distributed training with the VGG model is slower than using a single machine. So this was interesting to us, and we wanted to dig in, figure out what was actually causing those bottlenecks, and understand them.
A
So
what
we
can
came
to
find
was
we
took
a
look
at
Network.
We
took
a
look
at
disk.
We
took
a
look
at
different
network
configurations
that
we
could
do,
but
really
it
came
down
to
the
compute
versus
been
with
ratio
and
really
when
it
goes
to
setting
up
a
parameter
server.
The
bottleneck
of
the
whole
system
is
how
fast
that
parameter,
server
can
process
the
gradient
changes
and
redistribute
them
to
the
nodes,
and
that
is
actually
basically
on
the
network.
Although
different
models
have
different
networks,
so
computation
cost
readily
alex
net.
A
How
much
computation
costs
we
have
for
these
different
models
and
across
training
speed
up
of
two
nodes
versus
a
single
node,
so
Google
getnet
down
in
the
bottom
right.
Here
you
get
one
point,
eight
six,
obviously
as
a
public
cloud
provider
or
anybody
who's
concerned
with
costs.
If
people
are
paying
for
two
nodes
and
they're
using
distributed
tensorflow,
we
want
them
to
get.
You
know
pretty
much
linear
scaling
as
they
pay
for
more
compute
and
with
Google
net.
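To make the compute-versus-bandwidth point concrete, here is a back-of-envelope sketch (my own illustration, not from the talk) of how long one full fp32 gradient exchange occupies a 10 Gb/s NIC for models with very different parameter counts. The parameter counts are approximate published figures, and the link is assumed to be fully available.

```python
# Back-of-envelope: time to ship one full fp32 gradient/parameter exchange
# over a 10 Gb/s NIC. Parameter counts are approximate; link assumed idle.
PARAMS = {
    "googlenet": 7e6,      # roughly 5-7M parameters
    "resnet50": 25.6e6,    # roughly 25.6M parameters
    "vgg16": 138e6,        # roughly 138M parameters
}
BYTES_PER_PARAM = 4            # fp32
NIC_BYTES_PER_SEC = 10e9 / 8   # 10 Gb/s ~= 1.25 GB/s

for model, n in PARAMS.items():
    size_mb = n * BYTES_PER_PARAM / 1e6
    seconds = n * BYTES_PER_PARAM / NIC_BYTES_PER_SEC
    print(f"{model}: ~{size_mb:.0f} MB per exchange, ~{seconds * 1000:.0f} ms on the wire")

# VGG-16's ~550 MB exchange keeps the parameter server's NIC busy for roughly
# 0.4 s in each direction per step, which is why it scales so much worse than
# GoogLeNet's ~28 MB exchange.
```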
A: So these are just some things to keep in mind as you look at distributed training, because what surprised me, as an operator of Kubernetes for many years, was that "just throw more hardware at the problem, add more nodes, and surely your results will be better" is only true to a certain extent, and not to the extent you would think. Some observations we jotted down: it really depends on the model, and on network bandwidth in particular, so we tested both non-specialized and specialized network hardware.
A: What that boiled down to was 10-gig NICs on the VMs versus InfiniBand on the VMs, and how much bandwidth we could consume. The GPUs were not fully saturated on the worker nodes: even when we scaled out, we couldn't reach full GPU saturation, and it's likely because of how the gradients and batches are being redistributed that the GPUs could never get there. We could reach it with those models on single-node systems. So it was suboptimal performance, with the GPUs starved most of the time and nothing for them to do.
A
To
do
we,
we
actually
tried
doing
a
number
of
different
PS
server
configurations
so
running
on
the
same
note
as
the
pods
actually
seemed
to
create
worse
performance
and
that
we
tried
to
do
multiple
PS
servers
and
understand
whether,
if
we
had
multiple
parameter
servers,
whether
we
would
actually
get
better
distribution
to
those
workloads.
We
also
tried
synchronous
and
asynchronous
variable
updates,
but
what
we
actually
published
in
those
previous
slides
was
best
case
that
we
actually
found
so
really
they've
got
us
asking
what
we
can
do
better.
A
So
we
went
out
and
we
re
running
a
lot
of
these
tests
today
results
yet
to
be
published.
But
before
we
went
in
to
cube
con
about
a
week
before
goober
released
hor
Avadh,
which
is
an
open
source,
distributed
learning
framework
for
tensorflow.
So
if
you've
already
got
your
models,
written
intensive
flow,
there's
no
extra
work
that
you
had
to
have
to
do:
hor,
Avadh,
simply
shims.
In
now
here's
this
is
taken
from
the
horrible
dock
specifically,
but
really
what
you
can
do
is
enabling
different
batch
sizes
here
really
get
linear
scaling
distributed.
A
Tensorflow
gets
this
horrified
will
get
that
with
TCP
and
obviously
with
our
DMA.
You
can
get
a
little
bit
more
and
they
also
publish
the
ideal
linear
scalability
here.
So
really,
what
you
can
see
is
with
large
about
sizes
in
the
distributed
training
here
on
kubernetes,
you
can
actually
get
closer
to
linear
scaling
with
horev
on
now.
It's
a
standard
load
Python
package,
if
you're
not
familiar
with
it
and
it's
seamlessly
installed.
Now
it
uses
nickel
and
nickel
is
intra.
Node
GPU.
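To show how little an existing TensorFlow model changes, here is a minimal sketch based on Horovod's documented TensorFlow (1.x era) usage; the model, optimizer and learning rate are placeholders, not anything from the benchmarks in the talk.

```python
# Minimal Horovod "shim" around an existing TensorFlow (1.x) training loop,
# adapted from Horovod's documented example; model/loss here are placeholders.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                           # one process per GPU

# Pin each process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# ... build your existing model and loss here (placeholder shown) ...
loss = tf.reduce_mean(tf.square(tf.get_variable("w", [1]) - 1.0))
global_step = tf.train.get_or_create_global_step()

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged with ring-allreduce (NCCL within a node).
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01 * hvd.size()))
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial variables from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=1000)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

A script like this is typically launched with MPI, one process per GPU across the worker pods, which is part of what makes it straightforward to run as a plain job on Kubernetes.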
A
So
this
is
just
a
way
that
we
can
start
to
look
at
using
tensor
flow
and
there
are
other
tools
out
there
that
we
might
be
able
to
achieve
linear
scaling,
because
when
we
go
to
these
distributed
models
on
distributed
compute
and
in
the
cloud
we
really
want
to
get
as
close
to
linear
scaling
as
possible.
That's
what
we're
going
for
here!
So
there's
a
another
paper
out
there,
that's
also
of
interest.
So
it's
deep
grading
compression,
so
there's
been
analysis
on
those
gradients.
A
So
that's
as
the
models
being
changed,
the
Delta
in
the
parameters
that
get
distributed
back
up
to
the
parameter
server
they're
actually
have
to
be
fairly
redundant.
99.9
percent
of
the
gradient
is
strange,
is
actually
redundant,
but
the
purpose
of
this
gradient
progression
is
really
to
get
your
parameters,
your
gradients,
actually
down
to
reasonable
sizes
before
shooting
them
down
the
wire
right.
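Deep Gradient Compression itself does more than this (momentum correction, local gradient clipping, and so on), but the core idea of sending only the largest entries and accumulating the rest locally can be sketched roughly as follows; this is a simplified illustration of the idea, not code from the paper.

```python
# Simplified sketch of top-k gradient sparsification with local residual
# accumulation, the core idea behind Deep Gradient Compression.
import numpy as np

def compress(grad, residual, keep_ratio=0.001):
    """Send only the largest ~0.1% of entries; carry the rest forward locally."""
    acc = grad + residual                       # add back what was withheld earlier
    k = max(1, int(acc.size * keep_ratio))
    threshold = np.partition(np.abs(acc).ravel(), -k)[-k]
    mask = np.abs(acc) >= threshold
    sparse_update = np.where(mask, acc, 0.0)    # what actually goes on the wire
    new_residual = np.where(mask, 0.0, acc)     # withheld locally for the next step
    return sparse_update, new_residual

# Example: a 1M-element gradient shrinks to ~1,000 non-zero values per step.
g = np.random.randn(1_000_000).astype(np.float32)
update, residual = compress(g, np.zeros_like(g))
print(np.count_nonzero(update))                 # roughly 1000
```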
A
So
you're,
going
from
a
hundred,
Meg
or
500,
meg
and
or
less
then
I'll
make
for
most
of
this
so
another
way
that
we
can
actually
not
take
care
of
scaling
and
as
we
go
obviously
we
only
had
two
nodes,
but
people
made
one
hundreds
or
thousands
of
worker
nodes,
but
this
really
allows
you
to
scale
in
our
testing
on
one
gig
Ethernet
VM.
So
you
don't
need
specialist
hardware
and
you
don't
need
large
NICs
in
your
VM.
So
another
interesting
fact
there
now
just
just
some
other
other
tools
that
we
actually
have
out
there.
A
So
there
is
a
deep
learning
open
source,
deep
learning
workspace
that
we
publish
now.
This
is
akin
to
a
very
point-and-click
experience
and
where
we're
also
working
with
with
cue
blow
as
David
mentioned,
but
the
deal
workspace
predates
Qubo
a
little
bit.
So
if
you're
interested
you
can
shoot
over
here,
it's
open
source.
A
You
can
take
a
look,
but
that's
very
UI,
driven
supposed
to
give
data
scientist
access
to
a
bunch
of
different
frameworks,
tensorflow
PI,
torch,
spark
and
lightweight
way
to
get
entry,
there's
a
couple
of
other
things
that
were
working
on
so
free
flow,
C&I
plugin.
So
this
may
be
of
interest
in
this
forum,
but
you're
really
leveraging
shared
memory
and
RDMA
to
improve
network
performance.
A
Let
me
just
pop
over
so
these
are
both
open
source
tool
kits
that
you
can
go
and
take
a
look.
So
if
you
want
to
run
spark
in
this
deal,
workspace,
there's
a
document
on
how
to
do
that
and
set
this
up,
and
this
will
run
on
any
kubernetes
cluster
anywhere
on
any
platform
I'm
just
using
kubernetes
abstractions.
A
Finally,
just
this
week,
we
repackaged
and
open
sourced
a
custom
CRI
to
give
us
more
granularity
in
GPU
constraints
and
a
custom
scheduler
for
kubernetes
to
be
able
to
bubble
these
GPU
constraints,
not
only
how
many
GPUs
you
have,
but
different
factors
inside
those
GPUs
so
feel
free
to
go
and
pick
that
up,
take
a
look
at
it
and
and
contribute
at
this
point.
I
will
open
it
up
to
any
questions.
B: Listening to this and wondering: some of the stuff you just presented reminds me a little bit of how the old MapReduce formalism got its traction by being opinionated, as in, we're going to focus purely on stuff that actually scales well. So in terms of the Kubeflow direction and other things like that, I wonder whether this indicates we should be fairly opinionated about what kinds of implementations we support. Sorry, I also have a bit of a flu here.
C: I think that's a really excellent point, Eric. I'm not so hardcore on implementation, and this is my opinion, not as a Kubeflow person, just as an ML person: I'm not so hardcore that I would literally block people from running stuff that's going to have a bad time. But I think it behooves us to really double and triple down and make sure people are aware, to the extent possible, that "hey, do this, or you may not scale linearly." It's exactly the kind of research that Lachlan has provided here, and things like that, that will help people be aware of it. And second, it will help us indicate, "hey, if you want to get linear performance here, you really need to be including a higher-level MPI framework or something along those lines."
A: Yeah, absolutely. And I think even upstream in TensorFlow specifically, now that these performance benchmarks are out there and people are starting to process the numbers, the TensorFlow community as a whole is doing a lot. As you mentioned, it's very akin to MapReduce: once people start to see the numbers, they actually take action to get this closer to linear scaling, because it's a surprise to most people that it currently isn't. So we're continuing that research and expect to see a little bit more, and I can add the slides to the talk. If you're interested in chatting with either myself or the Microsoft Research team that I worked on this with, I'm happy to discuss.