From YouTube: Leveraging Kubernetes for Machine Learning
Description
This talk will demonstrate how a developer can build machine learning pipelines on top of Kubernetes. This presentation will include a deep dive on how Kubernetes is being enhanced to add GPUs as a resource. The presentation will also demonstrate how an operator can spin up a Kubernetes cluster on OpenStack with Terraform.
Hello, thank you very much for coming. My talk is on machine learning with Kubernetes. I'm very excited to be the first presenter here at Kubernetes Day. It's great to be here in Boston. So, who am I?
My name is Christopher Luciano. I'm part of the open source technology team at IBM's Digital Business Group. I'm blessed with being able to work on Kubernetes full time; I mostly concentrate on SIG Node and SIG Network.
My GitHub handle is up there, along with my Twitter ID. Feel free to tweet during the keynote, but please make sure to add the underscore in _cmluciano on Twitter. The Luciano without the underscore founded some junket service; he's a lot more successful than I am and he doesn't need the PR, but I do, so make sure to keep the underscore in there. I have a very unsuccessful blog for some reason. I'm hoping to fix that pretty soon, so check back later on; I'll be posting some more stuff there.
So, a lot of talks talk about the more technical things: how to set things up, how to twist some of the knobs, how to tune things. But not a lot of talks talk about the who and the why. So I've coined a series of talks that I've been calling the what, where, and why series: why do I want to use these types of technologies before actually using them?
When we talk about machine learning, it's very important to note what we're trying to accomplish here. With machine learning, it's very important to get the most accurate results possible. So if we have a traditional bell curve, that's garbage; we don't want anything like that. We're going to continually train our system to eke out the best possible accuracy that we can. So the goal: we start with some sort of base knowledge. We have points of analysis, a corpus of unstructured data. Then we feed that into our system, and we notice the errors.
So let's take a very simple example. This is my cat, Sprinkles. She has very distinct features. If you look, her ear is very pointy. If you notice her feet, you can see how they come down and connect; their shape is very similar to that of her paws. You can kind of see her tail, not really make it out, but what we're highlighting here are some of the features that say that this is a cat. We've figured out a few different data points, and this is some of the base knowledge we're going to be feeding in.
So we move on. Again, here's another picture of Sprinkles, a little darker, but we can still make out the pointy ears. We have a circular face, and we can see her feet. Maybe we start to notice some more patterns about the system. Then we get to here: we see myself, my fiancée, and a penguin. We ask the system: is this a cat? Well, I don't see any of the features that I noticed before. I don't see the circular face, I don't see the tail, I don't see the fur, I don't see the small nose.
What is this? So here we have a sloth. Does the system know what that is? Maybe not. What it does know is: it has a smaller face, it's circular in nature, it has a smaller nose and a distinct mouth. There's fur, and the feet seem to come together. Everything's kind of coming together; it kind of looks like a cat. So in this instance our system might become confused, and this is where we would see on our error rate that maybe we'll have a jump, and we'll have some false negatives or, sorry, false positives.
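The sloth confusion can be sketched as a toy feature-matching classifier. This is a hedged illustration, not the system from the talk: the feature names and the 0.6 threshold are made up for the example.

```python
# Toy feature-overlap classifier: label something a "cat" when enough of
# the known cat features are present. Purely illustrative.

CAT_FEATURES = {"pointy ears", "circular face", "small nose", "fur", "long tail"}

def looks_like_cat(observed, threshold=0.6):
    """Return True when the fraction of matched cat features >= threshold."""
    overlap = len(CAT_FEATURES & observed) / len(CAT_FEATURES)
    return overlap >= threshold

sprinkles = {"pointy ears", "circular face", "small nose", "fur", "long tail"}
sloth = {"circular face", "small nose", "fur"}  # enough overlap to confuse us
penguin = {"beak", "flippers"}

print(looks_like_cat(sprinkles))  # True
print(looks_like_cat(sloth))      # True: a false positive, the jump in the error rate
print(looks_like_cat(penguin))    # False
```

The sloth trips the threshold because three of the five cat features match, which is exactly the kind of error that another round of training is meant to correct.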
So moving on here: oh, we found another cat, potentially. We notice the pointy ears, we notice a circular face. It's a little harder to see, but this is Sprinkles again. We notice the long tail. Is this a cat? What if we have many cats? Did we tell our system that we might be expecting many cats? I don't know. Once again, another opportunity to come back around and train the system. So you might be thinking: why would I care about this? Is this really a very useful example?
I already have a ton of cats; I don't need any more. Give me something I can work with here. So: IBM first started to surface some of its more public AI knowledge with Watson. Prior to the IBM Digital Business Group, I worked on Watson for two years; this is why I'm here today. Watson started out on our ever-famous Power machines. If you watched it on Jeopardy!, you saw it; if you have not, I encourage you to do so: it essentially beat Ken Jennings at his own game. But this also isn't very useful for the common person. Unless you're really trying to impress your friends, inviting them over for Jeopardy! every day only to dominate them with Watson, this isn't a very effective method for you to utilize. So we can see that Watson started off as a research project, there was the demonstration at Jeopardy!, and then some of the more advanced features of Watson started to be separated out into smaller services.
So first off we started in healthcare, moved on to financial services, and then it kind of ballooned such that we had a ton of services. The Watson Developer Cloud is something that goes through the IBM Cloud today. These are services you can hook into, and you end up using some of the key pieces of the older Watson application that are actually useful to you. What are some uses that you can think of for these smaller examples?
Another interesting example that I've found it being used in today is security. Intrusion detection systems also have a corpus of knowledge: the cases that they know about. As far as "does this look like a security breach or not," they'll classify it, and then maybe, if you have an intrusion prevention system, it will try to actively block it. But attacks are getting so advanced today that it's necessary to potentially incorporate artificial intelligence in order to detect and adapt to newer types of attacks.
So now we'll get into how you can do this yourself. These links: I'll provide the slides, and obviously you can't click on these now, but we're going to start with GPUs. You're going to need a machine that exposes GPUs. GPUs are being leveraged because of the number of cores; these training jobs take a long time. So you can have the case where you think you're going to have a short, iterative solution: spin up some GPU virtual machines or bare metal, do your training, and then tear them down. But that's often not the case.
If you think back to the examples we had with cats and dogs, it takes a lot of time to train your system to notice these things and to error-correct. So it's not uncommon for these jobs to take weeks, even months, and Kubernetes is going to help you cut out some of the corner cases you have to deal with if you're deploying this on bare metal directly. TensorFlow is also an interesting project that has come out of Google, one that allows you to leverage some of these more advanced APIs.
A
You
will
need
to
build
it
yourself.
There
is
also
examples
of
a
deploying
tensorflow
atop
kubernetes
now
I
know
you're
thinking,
there's
lot
going
on
here.
I've
got
stacks
on
stacks
on
stacks.
If
I'm
going
all
out,
I
have
to
start
off
with
the
bare
metal,
then
I
put
some
OpenStack
on
it.
Now
a
virtual
machine
I
deploy
a
containerized,
runtime
docker
rocket,
then
I
put
kubernetes
on
it.
Then I put TensorFlow on it. That's a lot going on, and I can understand wanting to cut a lot of these layers out yourself. You want to cut out the OpenStack? Cut out the OpenStack. You want to cut out the TensorFlow, you want to do this yourself? Go for it. But what I want you to think of is an Irish breakfast. You're not beating my Irish breakfast: I'm not going to eat the blood sausage without the pork sausage, the toast perfectly complements the eggs, and I wouldn't want it any other way.
So when you're thinking about these systems, think about how Kubernetes can help you better deliver some of these machine learning training systems. The information I present in the next few slides is hot off the press; some of these proposals were just discussed last week in SIG Node. It is important to note some of the characteristics of GPUs that distinguish them from other types of resources, and we'll go through each of these right now. Multiple video cards: one node, one blade, could have a ton of different video cards.
You could even potentially have different models of video cards in there. You have some faster ones, you have some slower ones, you have some purpose-built ones for a certain topology. Kubernetes is going to help you out with this by allowing you to use a node selector, which is in the next slide and which I'll discuss in a couple more slides; it will allow you to specifically target the exact GPU that you want.
There's a lot of discussion going on in the community about exposing topology up through Kubernetes so that you can target the exact topologies you want, but in the next few slides I'll show you ways that you can get around that and still do the right thing. The first item is driver installation. If you're using a video card, you need to install the drivers; these are proprietary drivers, and you're going to get them from Nvidia's website.
An important thing to note, though, is that the driver version on the host most often needs to match whatever you're deploying in your container, whatever the workload. If you don't, they clash, you get a weird error message, and nothing works. So when you're matching driver versions, you want to be sure that version 2 of the driver matches up with version 2 of your container; a mismatch is bad. You can deploy these things with a Kubernetes DaemonSet, and the DaemonSet will essentially deploy a target across all of your nodes and do whatever work is necessary.
So when you spin up new nodes, it's automatically going to install these things for you. Now, there is some confusion around whether you need to reboot the machine. The official Nvidia docs say that you should reboot to grab all of the latest kernel modules prior to utilizing the machine. I've seen mix and match: sometimes you have to reboot, sometimes you don't. Certainly on my laptop it said I had to reboot when I installed the Nvidia drivers, so you might find discrepancies.
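A driver-installing DaemonSet like the one described might look roughly like this. Everything here is an assumption for illustration: the image name is hypothetical, the API version reflects the Kubernetes 1.6 era, and a real installer needs host access tuned to your distribution.

```yaml
# Hypothetical driver-installer DaemonSet: runs on every node,
# including nodes added later.
apiVersion: extensions/v1beta1   # DaemonSet API group in the Kubernetes 1.6 era
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
spec:
  template:
    metadata:
      labels:
        app: nvidia-driver-installer
    spec:
      containers:
      - name: installer
        # Made-up image; pin it to the same driver version your workloads expect,
        # since host and container driver versions need to match.
        image: example.com/nvidia-driver-installer:375.26
        securityContext:
          privileged: true          # needs host access to load kernel modules
        volumeMounts:
        - name: host-root
          mountPath: /host
      volumes:
      - name: host-root
        hostPath:
          path: /
```

The point of the DaemonSet is exactly what the talk describes: new nodes get the installer automatically, so you never have a GPU node missing its drivers.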
So this is for when you have multiple video cards. Kubernetes allows you to have a node selector in order to target the specific video cards you want. You have some Tesla P100s over there? Go ahead and target them right here. And with the pointer down here, you're saying how many GPUs you want. This is a normal pod spec that you're going to be using to deploy whatever system, like TensorFlow, or your application-specific solution.
The link down here will show you this example and give you a little bit of base knowledge about that resource. Fragmentation is also a big thing. As I said, these jobs take a long time to run, so you want them always to run on time, as fast as possible, with the most correct results. If you are consistently working around slower systems in your data center (maybe you have some older nodes mixed in with newer nodes), you're always going to want to target the newer nodes if this is an important job, because these take weeks or months.
Kubernetes is trying to figure out how to use topology in a way that doesn't necessarily expose it to the end user, because that increases complexity, and people who just want to get a job done would have to figure out how to use it; it could get a little messy. GPUs fail in different ways, and it's not always a hard failure. Sometimes it just gets too hot in there: if you're running a training job, just grinding away at it, it may start to overheat.
Then you're stuck: you're going to start to get inconsistent builds, insufficient power problems, what have you. I have an example here of the error messages you might see when that happens. In a normal node setup, if you have one card in your blade that fails, what are you going to do? Are you going to go in there and hot-swap it out, or are you just going to hope it's fine in the end? What Kubernetes is going to do is proactively mark it as unavailable, and it's going to target only the working GPUs.
So at your leisure, you can come back and fix it. Again, this type of performance and problem analysis is active in the node problem detector, and we're starting to add some pull requests to Kubernetes to actively do that blocking. However, if a node is just completely dead, your job will get migrated somewhere else. So, on to Kubernetes 1.6, just released a few weeks ago.
A lot of the work went into trying to make the GPU experience a little better, so it officially reached the alpha stage, and you can have multiple pods using GPUs on your nodes (a pod in Kubernetes is just your unit of work, really). Video card discovery also got a little better: now it's using some fancy regex to figure out any active video card so that it can expose basic failures.
Recovery, as I mentioned, is in there. The only problem at the moment is that it only works with Docker, and that's because of some very interesting handoff with the kubelet trying to figure out which containers are actively using things; this is something that's going to come in a future version. So where are we going with GPUs in Kubernetes? We're going to start with device recovery. It is very important to be able to segment off one card and allow that job to continue somewhere else.
That's where the health checking features come in. Topology, as I mentioned before: you could just full stop allow the user to configure these things themselves. However, you want to schedule it right the first time, and Kubernetes has a degree of quality-of-service features. BestEffort is: I just want this to run; if it fails a few times, if I can't look at my resources right now, it's fine. Burstable is: start out with these resources, this amount of CPU, this amount of RAM.
However, it might creep up to this limit, in which case I want you to cut it off. Or Guaranteed: that's where you know how this application works, and Kubernetes is going to do its best to almost always guarantee those resources for you. Now, if you bubble up topology to the Guaranteed level and you say, "I want this thing to be guaranteed to have these things," building topology into that is a lot more consumable for a user. Then they don't even need to know what the best topology for GPUs is; they're just going to be assured that, because of the base knowledge you have about GPUs, they're going to get that every time.
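The three quality-of-service classes come straight from the pod's resources stanza; the numbers below are made up for illustration. Omitting requests and limits entirely yields BestEffort.

```yaml
# Burstable: starts at the request, may creep up toward the limit,
# and gets cut off there.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
---
# Guaranteed: requests equal limits, so Kubernetes does its best to
# reserve exactly these resources for the pod.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
```

The idea in the talk is that a user who writes "Guaranteed" shouldn't also have to write GPU topology by hand; the scheduler would apply its own base knowledge to honor the guarantee.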
That's another thing we're going to work on: metrics. Metrics are always a good thing. Kubernetes utilizes cAdvisor for every container that's spinning up, so it's going to give you metrics per container that you launch in there. It's also going to give you metrics node by node. Some upcoming features we want to work on: we want to make it work with things that are not Docker.
There was a significant effort in Kubernetes 1.5 to attempt to abstract a lot of these things out into the Container Runtime Interface, so that you can pass the same base information to Kubernetes and deploy a VM, or a container through rkt or through Docker, whatever you want.
Last week we met with a lot of people from Nvidia (Nvidia is having their conference this week on the other side of the country), and they're going to help us out with some of the libraries where they try to capture some of their best practices, via NVML and, in their newest addition, libnvidia-container. This will be something that the kubelet, which runs the worker nodes of Kubernetes, will call out to to gain this functionality.
So that's basically the end of my talk. I apologize: I did put at the end that I was going to have a demo of deploying Kubernetes on top of OpenStack, but I noticed there are five to ten talks here targeting exactly that, and I didn't want to steal their material. So instead I'll send up a blog post on the IBM Code page with an example, if you really want to see my specific example later on. But come talk to me about these things; I'm very interested.
[Audience] Thanks for the presentation; I have a question. In my area, GPU is less interesting, but the networking part is way more interesting, because I'm basically working with service providers and they are looking into networking performance. Is there any plan to do something similar in the Kubernetes environment, to support something similar to GPU but for networking cards?

Yes.
This relates to some of the topology work. We're trying to get it right at the node level first, before we move on to inter-node communication. Scheduling on the network takes place in kind of an upstream, potentially soon-to-be CNCF project: a lot of those networking things happen in the Container Network Interface, which is a separate project where a lot of these plugins come together in order to achieve some of those advanced things.
So a lot of the hope is to contain a lot of that knowledge in those plugins and allow you to chain them together to get what you're trying to achieve. This both keeps a cleaner code base for Kubernetes and still will potentially allow you to achieve those same results. All right, thank you. Any other questions? Nope? Well, thank you very much.