Kubernetes Machine Learning Working Group, 24 May 2018

Previous Meeting

Next Meeting

⏯

youtube image

►

From YouTube: Kubernetes Machine Learning WG 20180524

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Everyone welcome to yet another episode of resource I mean learning working group meeting and today is the 24th of May. We have a few items in agenda today.

A

Let's see the first one is FF dear the agenda says an image is supposed to talk about it. Are you on the call and.

A

So FF DL seems to stand for fabric for deep learning. That's.

A

You got the second.

B

Yep I'm just moving because there's some noise in the other room keep you one minute, then I'll be ready to %.

B

Yeah I should be ready now. I can share my screen, whatever you've already.

A

Just to make sure Nick you were the one who added the machine learning, oh right, because it's a similar I, don't.

C

Think that's the same Nick, it's the Nick from University of Michigan or something who wanted to present his own user pain list.

A

C

But I don't think he's also on the call.

B

Whatever works, wish, you start with the Charter presentation if I p.m. is not on the call.

A

D

Howdy, my name's Josh burkas, in addition to being personally interested in machine learning, as their main reason why dialed in today, I'm also on sink inhibited experience if you weren't familiar sink contributing experiences mission, is to make it as easy and pleasant as possible to contribute to kubernetes, as well as taking care of quite a number of bookkeeping things that we do for the project.

D

The as such we've been trying to attend all of the sig and working group meetings this season in order to just let people know where we are and what we can do for you. Let me go through some of the dock here. One thing is: we are kind of continuously trying to tweak how we handle repo and workflow management for the repos in the kubernetes namespace and including changes to labeling changes to prow bot commands to do issued triage guidelines and many other things.

D

So in that respect, I actually have a.

D

Statement and a question, the statement is that a lot of those things there's going to continue to be changes over the next three or four months, you'll hear from us and test in four on that we post that to contribute. Sig leads and kubernetes dev mailing lists ahead of time, so that people can voice objections before we actually push changes.

D

But it does mean that, if you're concerned with those sorts of workflow changes that you need to be following, at least one of those mailing lists, the I and the question is: do people in cig in the machine learning working group have particular concerns or feedback about things like the pran issue and Docs workflow the way that it stands within kubernetes. There are things that we should hear about and take into account.

B

That's good, okay,.

D

I think there's a know: if something tastes later we've got, you know, contribute to experience. Mailing list contribute experience. Slack account.

D

You know, suggestions are open at anytime.

D

D

Does the machine learning working group have any kind of mentoring program? Will you participate in the in the mentoring activity.

A

A

D

A

We're still like trying to figure out what is the exact Charter like how we're gonna, actually I feel, like maybe we'll, have better answers. 2.

A

Months from now, ok.

D

I would say it's never too late to start and mentoring, particularly when you're new and everybody's new.

E

Ok, like see gaps, we can mentor people to how to contribute to the workloads. Api Otakon should be to helm right yeah. What does mentoring? Look like for a working group says it doesn't own code.

D

The stupid question does working, does: does the the ml working group isn't do they know we're really not on any code? No working group.

E

Owns any code by definition, yeah.

D

By definition, however, the material reality is sometimes different. Okay,.

E

That's true I guess I mean like they're like there is that there definitely many projects that we have here yeah and there are things that we're interested in collaborating on as a community, but they don't live inside of kubernetes core and they don't live inside of a kubernetes, SIG's repo. So yes well, but it's unclear that that's necessarily the direction that anyone's even going anymore yeah and it's.

D

Not external repo and having stuff as an external repo is not a limitation. As far as we're concerned, I mean at this early stage. What I would think would be appropriate is showing up for, for example, a meet our contributors.

E

D

Or, depending on people's available free time, we do get funding for programs like outreach, e and google Summer of Code and other internship tech programs. So if you have and I'll tell you from an applicant standpoint, if we say hey, we have funding for this machine learning related internship position. We will get a lot of applications.

D

So if you have anybody who has the time to do that kind of mentoring and that's a lot more time, you're talking about minimum of five hours a week, probably more, then you know, as that rolls around, for whatever the next season are for those various programs. We'd certainly be interested in having the mo or who participate in those, because it is a very attractive thing, for you know, students and unconventional background folks and other folks interest in internships to participate. It.

A

We can extend it beyond the cold contributors to actually start contributing, feel like right now. It's things are a little.

A

Great points and if you can like post these solutions and the meeting.

D

And then the last thing is actually pretty small, which is.

D

It would be good to have a sort of statement of purpose and link to your working group, doc on your site.

D

Okay, I will link to more information in the notes. Mostly this was to say hi. You were here. This is what we do for the project. If you have problems in these areas or questions or things, we want to do think somebody in contribute experience.

A

D

And those questions can also go to us. Okay, I mean sometimes depending on what the question is. We have to escalate it to the steering committee, but you could you should start with us.

A

To the next topic, just checking again is an image on the call now or is Nick on the call. Okay, so is it someone say something I.

B

Don't think so, the other on the call.

A

Got Nick can no take over okay.

B

Let me know when you see a beautiful white slide. Yes,.

A

I can see it. Okay,.

B

So, let's see this is kind of inspired source by a conversation between like rich Kenneth, Escott and I, and the D email thread we started last week around what kind of concrete things would a mission with that? Would a a working group like this actually go and target if the resource management working group would be any kind of inspiration.

B

It was a bunch of people with common interests where work means it needed to get done across different six, so that was kind of inspiring. What are the first working groups and we thought that the machine-gunning working group would be somewhat the same, so just to kind of snag out of vicious email.

B

This is a pretty good summary I. Think, like our goal, is to identify gaps and extend kubernetes appropriate for all working on in workloads. Fine. So if there's any confusion between, for example, what different I asses mentioning just queue flow as being one of them, and what we would take on here is a lot of AI.

B

Asses needs to provide a end-to-end experience where the machine learning working group would, on a per case basis, look at opportunities for doing some of the heavy lifting, at least as far as I've seen it from the outside kubernetes has been really good to target the kind of 80 20 percent like use cases where you know, as long as a youth case is our pattern, is common, then, would be a good way of doing it with with kubernetes itself.

B

So we're a lot of AI asses comes from top down to make sure that everything is great and kind of to some extent doing I won't say, stop get stopgap solutions, but would need to figure out a way to make it a good experience. The machine learning working group would be a bottom up to figure out. How can we enhance existing abstractions? Is there a space for new ones? How do we lash yourselves to ongoing efforts that can help ease the machine, learning use cases and those pain points does that is?

B

Is that controversial in any in any way?

B

If not I'm, just gonna continue yeah I like it.

B

So, at least from from to start breath, this was informed by just like our thoughts on you know the pain points we've seen and the first-class solution that we'd like to see and the the list that that which he commented on the email thread in the previous meetings meeting ourselves going over a different solution for simply boilerplate templating and like from source to deployment within scheduling.

B

That's a bunch of things you'd like to see we worked on our algo team and or our our teams are submitting thousands of jobs and our a comp of background with HPC schedulers and I used predictable job scheduling, very large-scale job scheduling and the best answer they have in kubernetes is the job abstraction and there are many ways for other things to be desired within that one, then again, when there's a limited amount of accelerators that tends to be used for training, fair sharing tends to be an issue. We sorry about that game.

B

Scheduling, as a team has already been talked about in six scheduling. A lot of the patterns that we see in the job operators for tensorflow hi torch cafe, could potentially be helped from from extensions within commodities um also related to runtime, and these are incomplete list by the way I think already. Now we have ideas for two or three more runtime related topics, but yeah.

B

You know loading in hundreds to terabyte size data sets easily having solid support for accelerators, having ways where jobs could buy checkpoints without having to be opinionated about which capoeira is coming up with it. So there's a lot of different topics that are not tied to any one kind of I ass on top, which is where people have to come up with solutions and they're coming up with different ones, so it basically creates like siloed environments on top and that that is not in the spirit of kubernetes.

B

So what we can do now for each one of these we have a slide, but I don't know. If you have enough time to do it. We have about 9 min minutes, I'm, either I skip to the very end. You know the slide tag is formed, so it goes through each one.

B

At the very end, we could talk about priorities and dependencies between them, because I think we need to do that kind of work. In order to establish a you know, suggested orderings before we can suggest different roadmaps, so British Scott. Whoever is on the call. What do you mean.

B

Just in terms of what we do now, we can go through each one, but I don't know if 9 minutes is enough to get through all of it so to be as efficient as we can.

B

People have a rough idea of what these mean. Then we can talk about how people, how much people care about each one. So we use that to feed into like roadmap, or we just got us for as long as as far as we make it, and then we can continue next week.

A

Maybe just like rephrase what is saying the ideas to figure out what what problems can become part of the community layer, either like as a core component or as part of a sort of an official blast extension.

A

Some of these aspects will will help even HPC workloads. So in that sense, like you want to do minimal work as much as possible by the same time like enable cumulus to become really friendly for so I think I'm, pretty confident, there's probably more in the core layers, but this seems to be a pretty good.

F

Yeah Nick: can you just that quick overview of some of them and see how far we can go yeah.

B

We could definitely do that. So, if you were on the call around the source to deployment, we spent a full hour talking about this. This is the first half of it where some of the things we saw was people are copying a lot of the a lot of pod templates and job templates, some people in some solving problems for themselves that they don't it out like up streaming. So we've seen more than one solution to certain problems. Some data Sciences go go on just still having problems that other people have solved.

B

So what I'm listed here is basically it's hard to keep things in sync. If people need to start a new project which one of these they data sciences templates, did you copy?

B

So it's kind of a a bit of a low practical problem, but I think it is a fairly common one. This is where we discuss at least like five or six different ways of doing it, so that is just to mention that part of it. That's, maybe it's a whether that should be in kubernetes or not. That's, that's something we can begin discuss.

G

May I ask hi: this is Dave Ron check? Have you have you looked at cube flow there? There are other folks in Intel working on that. I was just curious. If you had a compare contrast. Okay, that's us.

B

Okay, so yeah I talked about templating here because wish unlisted listed it. So I was just to mention it and yes, we have our project within this space called MLT. We know what a scaffold could be used. You know that draft could be used. There is two others is forged: SH and Red Hat's source 2 container project as well. So we we have. We have a like two weeks ago, I think two weeks ago or or four weeks ago, we we had like a pretty elaborate talk about this.

B

This problem, in particular um yeah same thing, I'm, really just pulling in things that they'll be presented back then, where the what what datacenters tend to care about is make a change to my Python file.

B

I need to see it I'm almost in real time in the container, so that has been somewhat some of the struggles that we've had here in a variety was based on a user interview, three or four different ways that he or she would be doing code editing, either from home, pretty straight to github, pull it down the developer laptop, be attached for the Samba share and doing editing as a job would volume on that in so like a pretty fragile set up, but you go to a long length in order to get the kind of very interactive real-time developer experience so for bin packing.

B

That's another issue that we've seen internally, where, if people are for different machine learning workloads, there is a very different requirement for the amount of acceleration you need behind it. Some workloads can deal fine with one. These are just to illustrate some part, some parts for the quantity, one order or two accelerates, but if some boyish models need, for example, for they can end up waiting for a long long time before we see those jobs being scheduled. So the the only alternative right now is for that needs to.

A

B

Why, yes, exactly the accelerator in interconnectivity as well right.

A

To schedule against apology, yeah.

B

And few management at least what we see- people use the Java abstraction just like they would have done with slow-mo app or compare it to to those tools, there's very little a few mechanisms for doing introspection and for doing accounting on those. But people start something they have, and they don't have a good idea of. When it's going to start, you don't have a good idea of who is using what they don't know where poor things are running so yeah like at least internally.

B

We have you have like thousands of dollars like lined up, and there are some pretty bare-bones things that people are asking for. So those those are some opportunities for going in and help enhance job controller or to go and kind of explore.

B

You know yeah be warned, I get a job version 2 or something like that. But it's just food to explain like the problem is really there's fine people schedule many many jobs or people try to use priority, but it's only at admission. So you know in all that there's a lot of a lot of things in terms of the experience. Think that could be enhanced.

B

When it comes to to sharing, we have a quota mechanism, but it's not nearly as sophisticated as the ones that you would see in traditional HPC scheduler or you can have like a cascading quotas that can apply all the time that fits into preemption.

B

That is based on some kind of budgetary measure of who is funding the cluster. Some of those those concerns we've been looking for answers as well I'm trying to buy controllers on top of quota, but in general kind of pushing some of these down, so they become almost like a detector standard would be amazing.

B

For some of the game scheduling there are already, as I said before, some initiatives in six scheduling.

B

Scott can probably comment on this or what a lot of the things we see today, it stated parallelism, but with upcoming specialized hardware, might be able to see a lot of model parallelism as well. So this this issue is not it's only going to be exacerbated by it.

B

We've seen a lot of kind of specialized job operators come up. There's one for ten to flow, then they've been fork, for you know early versions of MX net. We forked it for PI torch that was contributed to to cube flow, but a lot of them follow the same pattern. So there might be an opportunity to find a way to enhance either the Java abstraction or some abstraction within kubernetes that doesn't love the heavy lifting.

B

So we make sure that whatever we come up with is scalar and robust and maybe more importantly, has the same experience across different frameworks. So people don't have to learn a new C or D to do something special for tens of lovers this by torch. If there's no good reason for it, does that make any sense, please P speed up if there is anything where I'm making like gross assumptions, are going Ophelia's pretty pretty quickly.

H

What observation is that you know you brought up these alternate schedulers, slurm and Moab from HPC, but there are. There are others specifically in data analytics like spark and storm that I think are commonly used in, maybe not so much the training aspect of machine learning but executing it later. Those are the ones I've encountered more frequently at conferences involving data analytics with machine learning and AI.

E

So case, storm is really about stream processing. More than anything else, if it's a true stream processing, add in micro batching after the fact, but it's really meant for processing streams of data and kind of giving you realize views for like art and now live are good for certain types of machine learning, but for learning and training it's probably not the way you want to go.

H

Right, yeah they're more like applying it after you train your neural net and getting it to run and give you results, and there are lots of people interested in running those workloads on kubernetes I. Don't know if you consider it out of scope of this working group or not, but there's interested there's interest in the topic. We.

E

Already have a lot of work. That's gone into running spark on kubernetes, so we like the working the workloads team at Google Masek. The data did a lot of work to make spark kubernetes a first-class scheduler in the same way that yarn me so send spark native are inside of spark, and we also have the SPARC operator to kind of give spark jobs. First-Class support, the committee's API. So there's a lot of work there already, but that's again, still not the right way for I'm.

H

Just asking if you are declaring that that's out of scope for this working group, that it lives somewhere else or are you continuing, there's no need for it, I'm just wondering because I'm interested in it, but maybe I, came to the wrong place. That's why.

B

I think you know we are heavily biased by deep neural networks, but this is an ml. You know forum, so I think it has a lot of value and I, don't think they're gonna be mutually exclusive, so no your input would be I'm. Sorry.

E

C

E

And that's where most of the SPARC contributors who overlap with kubernetes contributors for those discussions like I'm, not saying we shouldn't talk about it, but ultimately, if we wanted to look at spark and ml live as part of what we want to tackle, then we should probably involve them. Okay,.

H

A

And, like spark could, in theory, be a really good solution for that face, so in that sense, like data pre-processing is probably the the more time consuming task than just training. So in that sense we would love your feedback here.

F

E

Things upstream in this part is extraordinarily difficult, right, major changes, so it took over a year to get the work that was done enough to integrate kubernetes, maybe year and a half back, and so we can in terms of like understanding use cases and so forth. We could do that, but if we really want to change the way spark works or like try to modify spark to better suit ml or stream processing, I, don't know if that's something we want to bite off.

E

If it is I, don't think 68 is a better place to bite that off in here. That's.

A

I think what I'm saying is there is not extending spark but more like high, and it might just be like.

H

Okay, well I think he answered my question thanks quite.

B

So this is a this one area where we already did one open source project like the kubernetes volume controller, but in this is a general problem and even with the careers volume controller, that there are some things that we'd like to aim to enhance or to improve in terms of the experience, but, for example, for the unit demo that we have for our tool. The brain scans data set is about a terabyte, so it is a non-starter to start copying things around.

B

So how do you make that a better experience, because it's gonna be something that even for you know, it's like a straight out. Kubernetes job is something that you'd like to to have wired ovens in some way, so yeah I, don't think I have more for this one under Scott. You have anything else. You wanna mention.

I

No I think that's pretty good I guess. The one thing to sort of keep in mind is the demands on having data, local and closeby right so because you're visiting the same data over and over as a models training the closer you can get it to the device and keep it there. The better.

F

B

Yeah, so um you know I think from a long history of working for kind of pluggable device support. We bought ourselves a lot when it comes to turning these things on in kubernetes, so we do have the device plug-in API, but this is more like a call out to its continued investment and to its you know, hardening it getting things like pluggable accelerator metrics in so it's not so sinkhole accelerator, specific.

B

Anything else to add here.

A

I feel like we can add values like through a feedback from a user standpoint as to what matters and what doesn't matter. Because oftentimes.

A

Develops a new hardware and expose all its functionality, while when we approach it from a user standpoint, lot of that doesn't really make sense and it wasn't like shell together and the overall user experience is in what we were originally thinking of so I'm hoping we can bring that user voice and bring in that ml expertise and then say whether this makes sense or not, at least for Emerson point right.

H

Now, let me ask you this: when, when you want this support, is this going to be typically a scenario when you're standing up your own kubernetes on your own on prem hardware, because if, if, if I'm hosted in GCE, for example- and they have the tensorflow processing units, aren't those consumed more as a service so that you don't have to have kubernetes manage those things yourself?.

B

So wish I guess you can talk for for the Google News case.

A

Okay, yeah, that's that's correct for, like Google, has chosen to explore, expose tensor processing units as points but Google also exposes GPUs, and they are not exposed as in points they expose as devices and- and you are correct in that, like if you're spinning up your own cumulus cloud. So you would have to be with these things. It's just that, like you can't like no take Kate or and like whatever you get as part of the code release and say now all these work. We have to install a few extensions.

H

B

Yes, that's that and then you know pretty like practical things like if, if accelerators break it could be hard to tell, but like Richards resource management, working group is probably the right place to do this, but we can bring in if you know things we are seeing in amongst our users.

B

So this thing I mentioned before and I've heard about this concept of persistent jobs for the first time from from Ken, oh yeah, like almost all jobs, will output checkpoints that you would wanna keep around for as long as you can, because you used you use them for experiment comparison, which is almost a most important way of using that output, because, as you know, standing alone is really hard to understand those metrics, but it's they tend to be relative here on the right is another output from our unit model, where we would be outputting to Google storage, then we will point a local sends a board instance to it.

B

However, it would be nice if that would be don't be another level of indirection where, from the jobs perspective, you don't have to worry about which backing store. That is actually being used can do we have anything else to add to the persistent job.

E

Not really I mean basically for distributed training and something like all the jobs worth could be flowed. Training, TF, training, I, torch.

E

We they all use a similar pattern, we've said before, and then there's also stream processing, which kind of sometimes do so. It's something that people continue asked for, but that we decided we're, probably not going to do in core. So something we'd like to do as an extension and there's a common expansion. They could be leveraged across the board. Yeah.

B

So be super useful and.

B

Talk to this one.

A

Eventually, once mmm starts taking over, then there will be more and more utilization demands and also like- and this is pretty difficult to achieve. Given the current state, where, like VMs, are sort of the boundary for most of the so I guess, one of the themes would be. How do we work with existing multi-currency efforts and like what sort of fluids do we have to enable at the node or at the cluster level, in order to make like the cumulus machine learning clusters multi-tenant ready, while at the same time not sacrificing performance right.

B

And the like a myriad of different ways of wiring things together, because it's layering right or you know we have the cat a container initiative with you know: high speed, one container and VM wiring I think there's like three or four other ways of using running VMs in kubernetes either you know just for hosting legacy applications or if I, actually, you know, wrap the individual workloads in in a VM.

A

I think that's why I would love feedback from this community, which is my assumption, is deep?

A

Learning really really cares about performance, so even like five percent performance, jobs are actually problematic, and so, whenever we add a VM layer in there is going to be non-trivial amount of performance head and so I haven't, like I, don't know, what's the right models for this space, I think for this, the pure services space, the performance that is most likely okay for many apps, but the added security is much more important, but I'm not sure how much that matters here in this space. So.

B

Does all the things speaking pinko, maybe sieve out from part of the e because of use of pain, points this gathering process as well or maybe like at the end, but it comes to a broke back. We can see which things are ongoing efforts and what requires more research. You know.

H

All I can suggest I don't want to turn this into a promo, but but just full disclosure I do work for a vendor VMware, but hypothetically if there was I'm not announcing any products but hypothetically, if there was to be virtualized, GPUs or accelerators I assume. That might be something that you'd be interested in.

B

Is it like multi-tenant abuse, maybe but yeah Alex, I guess the most common thing we've seen has been bare metal on Prem virtualized cloud setups, but it would be interesting to either do studies of what people do or what people would be like would like to do. If someone is trying to string together a you know, multi-tenant appliance how people go about doing this, so maybe it's a bit more of a hypothetical but I'm sure that wishes related to to your work with gke, okay, I'm.

F

Sorry I think I think they should can was given a reason why distributed training you know be supported in kubernetes core fighting, get that because I think even be sure can was.

I

F

An explanation why distributed training was a request that an extension to coordinators can be added, but they were not gonna offer native support in core.

B

Yeah, like that, music requests also comes from us, where you don't want to have a different experience based on different frameworks. That basically follows the same pattern when it comes to scheduling things and keeping things up or sharing certain pieces of information across these these jobs, so they can get in touch with one another. So.

E

For job it's already be, one is primarily the answer and we I mean we actually want to move something fast without right now, in order to do any API change, you have to get agreement from the architecture committee and they're being really really conservative with what they're letting into core.

E

If you want to do it as an extension, that's fine, I, don't think you're gonna get concurrence from the architecture committee to add a job as a core resource unless you first do it as an extension and the extension become so prevalent, and so common in use that adopting it as a core resource is, you know basically just the best thing to do.

A

Yeah I mean to add to what kind of saying the general guidance is not that the community has to stop working on these kind of things. Well, it's just that not have the community think about everything else core in the sense that we now have the the tools and the extension points that we need to prove new api's and make sure that it's solid and robust enough to be supported for many years to come, and so, from that point of view, I, don't think ken was saying no.

A

We should not do it, but rather what you're saying is like? Let's not go change the core constructs or add core a piece, but rather let's build it as an extension and then make sure that it works really well before we bring up that conversation right I feel like this secondary conversation of core is probably not that important, because we all want cumulus to succeed, 4ml uses and if they happen to, if we happen to ask them to install one add-on or one or two add-ons, that's probably not the end.

J

Of the world, the general.

E

Thing is this: if you're going to do something as a core resource now- and there are some things there are- it's not saying- they're never going to take any new core resources, but you really have to bring a very strong argument that we have to do it as a core resource and here's. Why and four staple a job. It's kind of already been proven that Bachelor clothes can be done as an extension. So we don't really have a strong argument to do it in core first and evolve. It be about that.

E

If we do it as an extension and again, it becomes so prevalent that you know everyone's already using it anyway, and we see it tons and tons of different uses and it's very mature there's nothing to stop it from evolving from a custom resource definition into a core resource. Icy'.

F

B

Right so I think it's um we can find out the details of this, but it's maybe just a call, a shot up from from us being I think we saw this pattern with mesos, where a new framework is easy to make, but making a good and stay stable and scalable. One is really really tough. So if there's gonna be a lot of operators that are of varying quality, that users are better off having a a common abstraction, and that's that again is a is a bad and a hypothesis, but maybe worthwhile was doing.

B

Does that make sense like we only have two more than za we're way over? Yes,.

F

It makes sense we can't go over the rest already.

B

So, which do you want to talk to this in terms of how this would be help from the from bottom up?.

A

Not necessarily a separate API like if you just keep a history of jobs, that by itself is like an indicator of so big right now. Jobs has a whole bunch of issues by the skin said it's it's v1, so we can't really change it. So the only alternative we got is like. We have to write extensions to work around the limitations of jobs and so I.

A

Just I was just calling out that this is a common problem of a like people want to track their experiments and whether it's for ml or for HPC, or for like just your batch like it's, it's a very common use case and whether we do it as part of the state, full job abstraction or whether we do it as a separate extension, so that it works for other controllers, like that's up to us, but I think giving people a way to do this and and giving people a path to think start.

A

Thinking in terms of like more and more curious, narrative, abstractions is probably a powerful and important thing we have to do. Fine I found that once you delete your once, you pod the eyes. The job doesn't know what happened previously or odd gets deleted. Job goes berserk, like that's, that's not going to work at scale. Yeah.

E

But the one thing of when I hear is that if you take spark, for instance, they provide their own history server right so at an application level and they even think of tense report, in a certain extent, as a way of storing the results or history of jobs that have been running internally right so I, don't know what is the core thing.

J

Look like it's the.

A

The like I think Nick cut a slide about that previously right, which is like job output tracking, like you had some, so you you want to order what includes UK and and have some unique identification for those set of input parameters yeah and then you have set of outputs and being able to go from inputs to outputs through an IB. That's something! That's not easily possible today, with jobs and I would expect that to be something important, even for other custom controllers. That sends like what.

J

Reason, the inputs, outputs in the way.

E

There is your handle or something that's kind of application. Specific they don't know. My only thing about this is I. Don't see what we need to do in core to do this like there, you write it to a database right looking, yes,.

A

Like I said I'm not saying you should we should change the core I'm, just saying that it's a common pattern yeah, but it's worth abstracting right and I'm, not saying it's it's! What abstracting in core I was saying that it's a common enough problem that.

B

Can I think we saw something similar like in in the in the cube flow community, where it was talking about history servers as well, and it just seemed to be a wider issue right if you want to have a way to keep track of objects in their history forever. If there is a way to plot that, and maybe is something it's probably something you have on the side that does backups of it and then stores it in some that.

G

This is David again, that's my hope as well that we aren't able to but I I don't want to stop people from experimenting or thinking like outside the box, for example- and this may be an absolutely crazy idea, so you should dismiss it easily immediately but like what, if we overloaded using labels or something like that to help with experiment tracking like I, think there might be some some things where we could use kubernetes core constructs to make experiment tracking at a higher level easier and and I'm all for that I.

G

Just don't I, don't want to like stop stop of creating potential solutions. Just because you know we have captive already running on cue flow.

E

Here's here my primary concerns with this: it's not clear to me that experiment, the results of an experiment are closely tied to the life cycle of a kubernetes cluster and backed by imagine that you might turn clusters up and turn them back down as experiment over the lifetime of multiple experiments. But you'd want to be able to store the results of an experiment and it's outside of the scope of the kubernetes cluster yeah.

G

Sorry you're exactly right: Kent I didn't mean to suggest that we wouldn't need something like that. That's.

E

My only concern really and then the other thing being emitted. You may have multiple clusters turned up simultaneously with experiments that are being run and they may want to be able to share the results and the collection of experimental data as a global resource. So I don't think it's as simple as saying we can build an abstraction inside of a kubernetes cluster bit. I, don't know, I think it's a harder problem than just saying. We can just build some abstraction that handles that particular well.

B

I think we're not looking for well.

B

We have some idea of some solutions for this, but it's mostly being aware of there being a problem and I think it's a good indication that if it's a tough problem, then you know they tend to walk around them for for a bit, so yeah I think it's just like the lodging that um whoever pulls in is today will have to deal with some of this, and it's yet one of these kind of easy to get into a silo if there isn't like a good way of meeting as some kind of common as shared format.

B

Somehow this is something we can. We can leave until we we have an idea about how to how to go. Do it.

A

Example, which dumping his historical application, like historical states of objects, you can think of like grease, simple solutions and not over complicated to start with I think at the end of the day, people just want to run something, and if the cluster goes berserk or like jobs, EPA goes because I still some way to go. Look at what happened previously.

E

I'll be more like a generic who's to be serviced and up and running, and that you don't really care about whether the underlying implementation is based on kubernetes run as a service, and you know something that's backed by multiple types of durable storage. So you can put it on GCS as three agent.

C

E

Friend, that type of a solution is more I, don't know, I mean kübra days could be part of the answer there, but I don't really know how much of the answer it is. Yeah.

A

That's yeah, but anyways Nick. You just got three minutes all.

B

Right, yeah, mom's learning, there's also yours, so you get to talk to it.

E

Yeah so I mean there's some challenges, particularly with penser flow serving, um particularly if you want to try to get it to work on GPUs and how we deal with it as an abstraction also with running experiments. It's not necessarily the same use case that you'd have four typical deployments like stage canary type deployments that you do with serving workloads on kubernetes.

E

Auto scaling becomes a bit of a problem in particular of GPU, like it's not clear, so, with a CPU bound workload and with TF serving as a CPU bound workload, you can pretty much auto scale off of CPU utilization as a good trigger. Gpu duty cycle is probably not a good trigger request. Latency might be a better trigger or total QPS, which is probably tied directly with request, latency and batching.

E

There's just a lot of things that we could probably do better and potentially in a more general way and then ultimately, a question becomes to TF serving, is you know it works for tensorflow out of the door right, but there's nothing that stops you from providing other plug-ins for TF serving to serve other different types of models like I. Think there's a there's, an XT boosts in the community ready, there's a Kathy, 2-1 yeah, so I mean looking.

E

Is that the right extension point is that the right way to serve models on kubernetes in general should is the right thing to do to invest EF serving to support other types of models, or is it to provide another level abstraction above TF serving be safer questions.

B

Sounds good, so I think the last one here is we don't have time for this, but we try to start a you know a list here which is like for each one of the slides. We can go and vote and comment and do whatever and figure out which SiC is relevant for it, and if there is ongoing proposals we could write that too. So, maybe in two weeks time we'll be in a better position to figure out what is less ambiguous and what code do now versus what requires a lot of effort.

B

But it's worthwhile for us to look at.

B

Does that make any sense.

A

Posting your slide back and also this list of priorities sponsoring six just so that we make sure that we have.

A

Any potential concerns.

B

Right and also adding to it right. So if there were several programs requests ideas about how to enable profiling or whatever we have missed here, we can go ad and we can figure out who's interested in what because it will only become real if people start working towards those goals. Fine.

F

Yeah thanks maker, can you make a copy of slides, are recording available to us yeah.

B

So they are a link to them right now. The Charter, which any one group tried right here and let's call it the priorities and then oh now, there's a spreadsheet to that. So is that good enough here?

B

What do you wanted to move? Having moved up further.

F

It looks good what the recordings.

B

Sounds good I, don't have anything else, hope to see people kind of chime in with uh you know what they care about and then we could. We could figure out like which one which, which topic to attack first, you can get a bit more focused. You know today we were all over the place, but I was on purpose so, like gonna, start funneling and get focused.

A

B

Thanks a lot guys.