From YouTube: Kubernetes Machine Learning WG 20180329
B
I had a question. The idea, initially, was to start off by having different groups show or demonstrate what their current machine learning pain points are when running on Kubernetes. Is that still of interest to people? Or it seems like nobody's willing to step up at this point.
C
At least... but yeah, we're actually finding that it tends to be the students who are bringing this stuff, Kubernetes, containerization, and things of that nature, to the research themselves. Yeah.
C
The other thing is that, for us, Kubernetes is still fairly small. Mostly we have lots of experience with our students in, like, JupyterHub and things of that nature, and getting up and going on there. But our shop is a very classic big HPC group, so some of the container experience they're getting is using Singularity and all that to run things through Slurm on our HPC.
C
But again, the students and people like that would much prefer to, you know, use Kubernetes, use containers, just use JupyterHub and get going on there, rather than try to push something through a batch processing system, especially when it comes to things like parameter optimization and all the other fun things.
C
Just because of the massive growth of things like JupyterHub, and more things that are not your classic batch scheduler. So for us as research support staff, if we can, you know, have Kubernetes as the unified platform that things sit on, and then, if we want to spin up Slurm or something like that, that can just live in a namespace or something like that in there. It allows us to make better usage of our resources as things sort of shift in these different directions.
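
As a rough illustration of that "scheduler stack in a namespace" idea, here is a minimal sketch using the official Kubernetes Python client; the namespace name and quota values are illustrative, not anything the group agreed on:

```python
# Sketch: carve out a namespace with a resource quota so a separate
# scheduler stack (e.g. a Slurm deployment) shares the cluster without
# starving other tenants. Assumes a configured kubeconfig.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

ns = "slurm-poc"  # illustrative name
v1.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=ns)))
v1.create_namespaced_resource_quota(
    ns,
    client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="slurm-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={"requests.cpu": "64", "requests.memory": "256Gi"}
        ),
    ),
)
```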
C
We have basically a proof-of-concept thing where a user SSHes in and it spawns a container as their UID and GID, mounts their home directory, and does all that. It is a little bit of a challenge because, you know, as a campus we have all these central resources with, you know, big shared storage and all the other fun things, a big scratch disk.
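
A minimal sketch of the proof of concept described above, assuming Docker is available on the login node; the image name and helper are illustrative:

```python
# Sketch of the SSH-spawn pattern: run a container as the calling user's
# UID/GID with their home directory mounted. Assumes this runs inside the
# user's interactive SSH session (so a TTY is available for -it).
import os
import subprocess

def spawn_user_container(image: str = "jupyter/base-notebook") -> None:
    uid, gid = os.getuid(), os.getgid()
    home = os.path.expanduser("~")
    subprocess.run(
        [
            "docker", "run", "--rm", "-it",
            "--user", f"{uid}:{gid}",   # run as the SSH user's UID/GID
            "-v", f"{home}:{home}",     # mount their home directory
            "-w", home,                 # start in their home directory
            image, "/bin/bash",
        ],
        check=True,
    )

if __name__ == "__main__":
    spawn_user_container()
```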
A
Okay. So Mission and I were having some discussion about potentially trying to improve, not for everything, but some of the documentation surrounding the Google ML products and some of the examples there. By products I don't mean our actual products; I mean the open source stuff that we do, like TensorFlow, for instance, and Kubeflow, and just trying to come up with some...
A
You know, onboarding and best practices that generally pertain to running machine learning workloads on top of Kubernetes, and maybe trying to aggregate some of the general limitations that were discussed last time. Like, you know, depending on what you actually do and what your storage and networking solution looks like, we really don't have a lot of good answers around high-performance networking and storage solutions directly integrated with Kubernetes, but at least we could capture the state of the art in terms of what you could do.
A
What is available: if you did want to run some type of batch machine learning workload across an HDFS partition, that is feasible. It's just probably not such a great idea to try to run your HDFS cluster inside of Kubernetes, but you could do things like co-locating the Kubernetes cluster with the HDFS cluster on the same physical machines, but having them logically partitioned.
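
For illustration, a sketch of what that co-located setup looks like from the job's side: the training code runs in a Kubernetes pod but reads from the adjacent HDFS cluster via its own namenode. This assumes pyarrow built against libhdfs with Hadoop client configuration available; the host, port, and path are illustrative:

```python
# Sketch: read training data from an external, co-located HDFS cluster
# from inside a Kubernetes pod, rather than running HDFS in Kubernetes.
import pyarrow as pa

fs = pa.hdfs.connect(host="hdfs-namenode.example.internal", port=8020)
with fs.open("/datasets/train/part-00000") as f:
    data = f.read()
print(f"read {len(data)} bytes from HDFS")
```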
D
Hey, it's Jeremy from Kubeflow. So I think within Kubeflow we're trying to solve a lot of those problems. I mean, I think a lot of the issues you just mentioned are coming up, and so we have these open issues to sort of provide recommendations about how you deploy Kubernetes in various settings, various cloud providers, various Kubernetes distributions, to sort of get the things you need. And some of the issues that have come up so far...
D
...are secure ingress. We've also got an issue about storage, and, you know, how new things come in, like object stores, shared file systems, POSIX-compliant file systems. We haven't specifically looked at, you know, other high-performance networking yet. I think one of the big ones for us next is things like monitoring, you know, especially integration with things like Istio, and also monitoring for things like RPC metrics and latency with respect to your model servers. So any guidance or help with that would be amazing. Yeah. So you're talking about...
A
If we want to do that... I talked to Bishop about this. If we're gonna do that: inside the SIG... I'm sorry, inside the working group and community, we have a machine learning repo, basically. So anything that we want to collect in terms of documentation, that would probably be the best place to put it, just because of the kind of rules around what's a SIG and what's a working group. So if we want to start owning code, we've gotta, like, form a SIG.
B
Because I think that's the lowest hanging fruit in terms of just getting started and addressing a lot of the points that, you know, Bob pointed out just now, and things we've seen ourselves. And then all of the harder problems that we definitely want to get to as well, like Jeremy was talking about, you know, those can happen... this kind of points to those, right. This is kind of the landing zone or entry point to those.
A
So I just linked the location of the ML working group, for anybody who's unaware. We can put whatever documents we want in this subdirectory in the community repo to capture all of this, and according to the steering committee, that's the appropriate place. So we can do that, or, I mean, if people prefer to use Google Docs, that's also another option, and we could just link out to the information from there. It's really up to, you know, what people's preference is with respect to, you know, whether they prefer to produce stuff in Markdown or edit docs.
A
But did somebody want to take an action item to do something a little bit more specific in terms of aggregating some of the pain points and...
D
Sorry, I think my Zoom crashed. I think, you know... so we want to do it not just for TF Serving; we have these other model servers, and we also want to do it for our batch jobs, like TensorFlow. We'd like to have, you know, basic metrics like CPU and memory. And I think one of the questions I have, that's not clear to me...
D
...but, you know, cloud-specific and distribution-specific backends, and I don't know what the best practices are. So the more that we can have the Kubernetes community tell us what the best practices are, so we can just kind of follow them, that'd be great, in my opinion. So, I mean, I think it depends on what you were...
D
For the proxy, we were looking at things like RPC metrics, so RPC counts, rates, and errors. Errors and latencies, I'd say, are the ones we want. So ideally it would be great if gRPC sort of exported those, or TensorFlow Serving exported those automatically, but I think you and I talked about this in the past, and from what I recall of our conversation, it doesn't happen that way today. And once...
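
Since TF Serving and gRPC don't export these automatically today, here is a minimal sketch of exporting the metrics named above (counts, errors, latencies) from a model-server process, using the prometheus_client library; the metric names and the predict() stub are illustrative, not TF Serving's API:

```python
# Sketch: instrument a serving endpoint with RPC count, error, and
# latency metrics, scraped by Prometheus from /metrics on port 9090.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

RPC_COUNT = Counter("model_rpc_total", "Total RPCs", ["method", "status"])
RPC_LATENCY = Histogram("model_rpc_latency_seconds", "RPC latency", ["method"])

def predict(payload):
    # Stand-in for the real model-server call.
    time.sleep(random.uniform(0.001, 0.01))
    return {"ok": True}

def handle_predict(payload):
    start = time.time()
    try:
        result = predict(payload)
        RPC_COUNT.labels(method="Predict", status="ok").inc()
        return result
    except Exception:
        RPC_COUNT.labels(method="Predict", status="error").inc()
        raise
    finally:
        RPC_LATENCY.labels(method="Predict").observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(9090)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_predict({"x": 1})
```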
A
Specifically, I mean, it's not brain surgery to get it out, but it's not something that I'm gonna have the time to go do a PR for myself. So even if you're looking at doing autoscaling, or if you're looking at telemetry, another way to do it would be: if you put a proxy in front of it, you can extract all of that directly, with the exception of the actual hardware telemetry. I think you could autoscale off of that too, if you wanted to, I think.
D
I'm leaning towards the proxy solution, mainly because it seems like we get the most bang for the buck, and I think integrating with Istio to do centralized policy management is something that we want to do anyway. So I don't know the details, it's not an area of my expertise, but my expectation is that once you're using Istio, you also need an Envoy proxy, and so it just seems like that all fits together nicely. But the details elude me, so...
D
But my main point was that if you plan on moving in the direction of Istio, you're gonna have an Envoy sidecar in front of everything anyway. So it's not like we're introducing an Envoy proxy just for metrics, or thereby incurring latency. Another, I guess, other issue, just for the metrics, I mean...
A
...is something people are interested in. I think the hard part to get my head around is, like, you know, the TF Serving model server to some extent, and XGBoost I've played with a little bit, but there just seems to be such a large collection of potential frameworks that would have to be supported in order to have a unified solution that works for everybody. And then each of those frameworks takes a different...
A
Even if it's really just a different type of input from the application layer or the transport layer: most share the same transport, but the XGBoost one is structured differently, for instance, right? I mean, it's not the same. I don't know what a unified proxy that works for both of those looks like.
B
So is there interest in doing essentially request batching for better throughput under certain latency budgets at the proxy level? I know this is kind of a direction that the RISE Lab at Berkeley has gone with the Clipper project, but it seems like that naturally kind of fits into the Envoy, you know, service mesh layer.
B
Sorry, my audio cut out, can you repeat that? Right, so: has anyone heard of interest in essentially another feature at the service mesh layer for machine learning inference, which would be kind of request batching, to increase throughput while maintaining a certain latency SLO? And, you know, there are kind of academic projects, like RISE Lab's Clipper, which are designed to do this, but it seems to fit pretty naturally into an existing service mesh like Istio or Envoy, rather than having a separate project to do this.
D
So where we hear a lot about batching is actually with inference for GPUs, in the serving setting, because with GPUs you really need to batch multiple requests together to get, you know, high throughput and sort of manage latency. You can't process them sequentially. So that's the use case I've heard of most where this has come up. And, you know, one thing that's been mentioned is, I guess, NVIDIA has... I forget what it's called. It's like...
D
I think the problem is that you need to process multiple requests together to get the efficiencies of the GPU. So if you just process each request as soon as it comes in, then while your last request is processing, it's blocking all the other requests; it's not like it's multi-threaded. So if you have requests coming in every millisecond, right, you get better performance if you actually wait 10 milliseconds, aggregate all those requests, and then process them all at once.
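
A minimal sketch of that micro-batching idea: wait up to a small latency budget, aggregate the queued requests, then run one batched call. run_batch() is a stand-in for the real batched GPU inference, and the 10 ms window matches the example above:

```python
# Sketch: micro-batching requests for a single-worker model server.
import queue
import threading
import time

REQUESTS = queue.Queue()   # items are (input, reply_queue) pairs
BATCH_WINDOW_S = 0.010     # wait up to 10 ms to collect a batch
MAX_BATCH = 32

def run_batch(inputs):
    # Stand-in for one batched GPU inference call.
    return [x * 2 for x in inputs]

def batcher():
    while True:
        batch = [REQUESTS.get()]              # block for the first request
        deadline = time.time() + BATCH_WINDOW_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                batch.append(REQUESTS.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = run_batch([x for x, _ in batch])
        for (_, reply), out in zip(batch, outputs):
            reply.put(out)                    # hand each caller its result

def infer(x):
    reply = queue.Queue()
    REQUESTS.put((x, reply))
    return reply.get()

threading.Thread(target=batcher, daemon=True).start()
print([infer(i) for i in range(4)])           # each call served from a batch
```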
D
Yeah, I think that's accurate. I'm not sure about the exact, you know, under-the-hood details, but essentially the whole point of the GPU is that it's high throughput because you can process multiple requests in parallel simultaneously, so you need to have multiple requests; otherwise you're not gonna get high throughput. So for Istio...
A
Or Envoy. I think that would be very challenging for them to handle at that level, because they basically have to intercept every request going through the entire cluster once you've enabled the service mesh network, and because of that they try to be super lightweight in terms of touching the traffic. I don't know if they'd be... I could talk to them and see if they'd be willing to add a feature to do something like buffer a series of requests and then pass them back through.
A
The other thing is, I'm not entirely sure... I mean, they're HTTP-aware, for, like, layer 7; I'm not sure how aware they are in terms of application-level traffic. Like, I mean, I know they do gRPC and HTTP, and they do certain things at the transport layer. I think UDP has some nascent support, but not too much.
D
Honestly, in my opinion, GPU serving is, I think, still very much an unknown, you know, whether that's actually the right thing to do, or whether you're better off trying to serve efficiently from CPU, maybe by, you know, compressing your model or doing some other approximation. So I don't know, at this time, that I would necessarily invest in trying to convince them, right.
A
I'm not sure; that's why I say maybe it doesn't hurt to ask. So, I mean, I would think that this would be something that might be challenging for them, but we could always just reach out to them and say: hey, have you thought about request batching? Because there are definitely features they do in terms of, like, draining connections, circuit breaking, and so forth. So there is definitely intelligence built into the proxy; it's just a question of how much they're willing to do.
B
I think another angle on this, too, is that if this is kind of the best practice we're recommending, you know, "this is how we recommend you serve models: you're using Istio and Envoy," then that's just more moving parts that, you know, smaller shops have to set up. So I think that just underlines the kind of amount of lift that's required for people to get up and running with these systems. So I think that's something to keep in mind as well.
A
The other thing would be, if you tried to batch in your network: if ultimately you're trying to batch for the silicon that's serving, you'd have to make your network, at that point, aware of the silicon that it's trying to serve to, right. That doesn't seem like it's a super awesome separation of concerns, normally.
A
You could, so, I mean, if you're getting the network latency out, it's more a question of where you actually insert the batching logic, right. Like, in terms of getting the network intelligence, Istio, Envoy, or another type of proxy is a good place to get that, but for the batching implementation, I don't know if you necessarily need to put that into the proxy as well.
D
So I think, yeah, that's part of it. I think, you know, there's a question of, like, you know, if we recommend object store, how do we come up with a good reference implementation for object stores that can run anywhere, so that you're not just running on the clouds that have them, but you can also do it on-prem. And then I think there are use cases for shared POSIX, because you've got applications...
D
...applications like HDF5 that require it, so I guess shared POSIX is one of them. But I guess, you know, another potential solution is to use something like ABC and automatically sync data back and forth behind the scenes. But yeah, I think for us we're just kind of trying to figure out at this point, you know... we've got some issues about, like, what are, you know, the requirements and best practices that we can sort of recommend to people, so we can say, like, okay, for examples and whatnot...
D
...we make use of a shared POSIX-compliant file system. So on these types of clusters or Kubernetes distributions, go and provision NFS, and here's how you can do that; or on, you know, this cloud provider there is a cloud filer solution already, so you can use that. Basically, that sort of thing, so that, you know, in our solutions and examples we don't sort of have to reinvent the wheel every time.
D
So I think, you know, as an example, there's some discussion about, like, MinIO versus Rook, right, and, you know, which one of those gives you more bang for the buck in terms of getting object store and shared NFS, and is it performant, and what use cases you can use it for. So it's just coming up with a good recommendation, so that we're not just telling people, like, okay, here's a list of options, go and figure it out yourself.
A
That would get you up and running for, like, demo purposes; coming up with something that's robust and scalable would be challenging, though. For object storage, I think the biggest problem I've seen is that most people I see want to use the S3 interface for their object storage. Like, GCS, for instance, has an S3-compatible interface; Azure Blob Store...
A
Other than coming up with a friction log about, like, for this provider, this is how you use it; for that provider, this is how you use it. For Rook, if you're turning up Ceph clusters, that's, I mean, that's a somewhat S3-compatible API; they have their own semantic differences there. But we could come up with something like: if you're using this one, here's what you should be aware of, I guess.
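
To illustrate the S3-interface point, a sketch showing that the same boto3 client code can target AWS S3, MinIO, or another S3-compatible store just by swapping the endpoint; the endpoints, bucket, and file names are illustrative:

```python
# Sketch: one S3 client, multiple S3-compatible backends.
import boto3

def make_client(endpoint_url=None):
    # endpoint_url=None talks to AWS S3 itself; override it for other stores,
    # e.g. a MinIO deployment at "http://minio.example.internal:9000".
    return boto3.client("s3", endpoint_url=endpoint_url)

s3 = make_client("http://minio.example.internal:9000")
s3.upload_file("model.pb", "models", "resnet/1/model.pb")
# Semantic differences still leak through (consistency, multipart limits,
# ACLs), which is why a per-provider "here's what to be aware of" doc helps.
```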
D
Yeah, I think, you know, we're iterating on all that ourselves, and I don't think we sort of understand all the requirements yet, like, you know, how much is needed, whether object store is enough, whether NFS is needed, and whatnot.
D
It's proved convenient for a couple of things, and we do have some use cases. So I think that's where we're just going to learn over time, as we have sort of these examples running in clusters and we get more feedback from the customers, and we see what solutions, you know, work and don't work, and what the problems are, I think.
B
One thing that simplifies the problem is that, in my experience, the storage solution you need on your cluster often doesn't have to have the resiliency or robustness guarantees that you're used to from storage providers, because these kinds of clusters tend to be more in the computation domain. I mean, this doesn't hold in all cases, but what I mean is, you know, some grad student lab will have their NFS box, but then they have a separate cluster...
B
That's
the
compute
cluster,
and
so
you
just
want
to
get
the
data
in
a
format.
That's
closer
and
more
scalable,
but
you
don't
necessarily
need
there
were
buses,
because
it's
a
second
copy,
the
data
that
doesn't
always
hold,
but
that
it
does
vastly
simplify
the
problem
in
terms
of
making
it
not
have
to
be.
You
know
highly
available
and
full
tall.
Aren't
all
these
kind
of
things.
A
Okay, I don't really have anything else either. Do we wanna end early and take back 15 minutes, then?
A
Might as well. Okay, but before we go, I want to discuss: does anybody have anything specific that they want to add to the working group repository between now and the next meeting in two weeks? I know Mission definitely wanted to capture some things. Jeremy, did you want to capture some of the feedback that you just gave, in terms of potentially at least outlining what we might do for storage and proxying?
A
All right, cool. All right, guys, we'll see you in two weeks then. Thanks. Okay.