
A

Is everyone able to see the presentation? Yes.

B

Okay.

A

Okay, I can start. This is a presentation about machine learning user pain points, drawn from interactions and interviews we have done with internal data scientists at Intel. That's why there's an Intel-specific title on it. Connor, Nicholas, Jason, Carlos, and I are all from the AI Products Group within Intel; it's called AIPG.

A

The overview of the talk: we interviewed and interacted with a lot of internal data scientists, machine learning practitioners, and sometimes interns within Intel, and this is a catalogue of their pain points when interacting with a scheduling substrate like Kubernetes. What we want to do with this presentation is categorize the user pain points into different areas, and give some suggestions on which areas this ML working group should focus on in order to reduce the barrier to entry for people to use Kubernetes.

A

The user pain points are categorized into three areas in this presentation: the first is user experience, the second is resource management, and the third is data management.

A

Before we go into the details of those categories: the users we interviewed and interacted with mainly fell into two categories. One is researchers or data scientists, who were involved in basic ML research, including changing how models are used, which models to use, and how to change modeling techniques in order to provide better approaches for model training. The other type of user fell into the category of ML practitioners, who had some kind of business use case in mind.

A

They take existing models, apply them to some data, and come up with some kind of service for users; sorry, for customers.

A

This is a representation of the time to solution for a typical user, in this case either a data scientist or an ML practitioner. Some of the work is labor intensive, some of it compute intensive, and this gives a simplified overview of the stages for these kinds of users. To start, they look at the data, label it, and then augment it; this augmentation might also include some pre-processing steps.

A

These first three steps we think are labor intensive. Then they go into the compute-intensive training phases, where they look at which models they want to train and whether they want to change any details within their network layers.

A

They would change that in the experiment-with-topology stage. Finally, they run these models and train them, using either hyperparameter tuning or some kind of iterative training phase, and then they look at the results and figure out whether they're useful. If not, they iterate on these models further, or they're done with these phases and talk about the results.

A

This is a very simplistic view of iterative training. There is model input, in this case the parameters input to the model, also called hyperparameters. Here there are two, I and J, and they have some values, U1 and V1. In between, the user can also make some code changes to the actual modeling code itself. They run it through model training and then look at the model output.

A

The model output usually consists of something like an accuracy or error rate and a loss function. If the data scientist finds this optimal for their use case, they move on. If not, they come back to the first phase, make some code changes or change some other parameters, and go through the same workflow again until they find something useful.
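
A toy sketch of that loop; train_model and the acceptance threshold are hypothetical stand-ins, not any particular framework's API:

```python
# Toy sketch of the iterative modeling loop described above.
# train_model and the 0.9 threshold are hypothetical stand-ins.

def train_model(i, j):
    """Pretend training run: returns (accuracy, loss) for hyperparameters i, j."""
    accuracy = max(0.0, 1.0 - (abs(i - 2) + abs(j - 5)) / 10.0)
    return accuracy, 1.0 - accuracy

i, j = 1.0, 3.0                      # first input values (U1, V1 on the slide)
while True:
    accuracy, loss = train_model(i, j)
    print(f"i={i} j={j} accuracy={accuracy:.2f} loss={loss:.2f}")
    if accuracy >= 0.9:              # the data scientist judges it useful
        break
    i, j = i + 0.5, j + 1.0          # code/parameter changes before the next run
```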

A

This is just an example: they come back with a second set of values, in this case U2 and V2, and get outputs W2 and X2. The other common workflow is called hyperparameter optimization, where again there are two parameters that are inputs to the model, I and J, and the user wants to look at the model accuracy, loss function, and usefulness of the model for these different parameter inputs.

A

What they provide is basically a matrix of input values, and they spawn jobs that each run the model with a different set of parameters; at least that's what's represented here. In the first case it's running one set of parameters, and in the second case another set with different values.

A

Each gives some output in terms of accuracy and loss, and depending on the data scientist's use case, they figure out whether this parameter value and the model used for training are useful or not. In cases where some parameter combinations might not be so useful, they want some introspection capability while the job is running, in order to look at how the model training is progressing.

A

If they find it's not useful, they also want the ability to stop it early, an early-stop capability, so that they can focus on the model and parameter combinations that are more useful to them, to improve productivity, for example.
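
A toy sketch of the sweep pattern just described, with a hook where introspection and early stopping would go; train_model is the same hypothetical stand-in as before:

```python
# Hedged sketch of a hyperparameter sweep over a matrix of (i, j) values.
# In practice each trial would be a separate cluster job; here it's inlined.
import itertools

def train_model(i, j):
    """Toy stand-in returning (accuracy, loss)."""
    accuracy = max(0.0, 1.0 - (abs(i - 2) + abs(j - 5)) / 10.0)
    return accuracy, 1.0 - accuracy

results = {}
for i, j in itertools.product([1, 2, 3], [4, 5, 6]):
    accuracy, loss = train_model(i, j)
    results[(i, j)] = accuracy
    if accuracy < 0.5:
        # Early-stop hook: an unpromising trial could be killed here
        # instead of consuming cluster resources to completion.
        continue

best = max(results, key=results.get)
print("most useful parameters:", best, "accuracy:", results[best])
```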

A

Our first set of pain points fell into the user experience category. A typical user here would be, for example, a data analyst or a biology grad student.

A

They know, for example, scikit-learn and how to use TensorFlow. They know Python, and maybe they use a data management tool like pandas.

A

This slide is about barrier to entry, and also about concerns which drain productivity. Going from scikit-learn, or coding in Python, to running it on Kubernetes is one of the main issues: understanding concepts related to pods and containers,

A

what builds are, what tags are, the differences between command and entrypoint, the CLI, and namespaces. There's also the concern of putting together a lot of information that's provided by the scheduling substrate: job status, pod names, logs from the containers, tying them back to the job that ran, the data associated with these jobs, etcetera.

A

If something goes wrong, or if the user wants to debug certain runs, all these things come into play.
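
As an illustration of the bookkeeping involved, a hedged sketch using the official Kubernetes Python client; the namespace and job name are placeholder values:

```python
# Hedged sketch: given a Job name, find its pods and pull status and logs.
# The Job controller labels pods with job-name=<job>, which this relies on.
from kubernetes import client, config

config.load_kube_config()            # or load_incluster_config() in-cluster
core = client.CoreV1Api()

namespace, job_name = "default", "mnist-train"   # placeholder names
pods = core.list_namespaced_pod(
    namespace, label_selector=f"job-name={job_name}")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
    print(core.read_namespaced_pod_log(pod.metadata.name, namespace))
```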

A

A more concrete example of user experience is experiment tracking, which I alluded to in the last slide when I talked about iterative modeling. The data scientists make some code changes and want to go through this workflow: train the model, look at the output, figure out if it's useful, and then go back and do the same again. It is difficult or cumbersome for them to track code changes, and the data related to those code changes, using a Dockerfile.

A

They feel like it's labor intensive. This comes back again to tracking logs and output: to understand the status of a job, they have to look at logs and tie them back to the data and the task ID itself, and this kind of iterative process requires putting together a lot of things. This is applicable to both hyperparameter sweeps and iterative modeling, and as I mentioned earlier, they might also want to stop certain jobs that were spawned in the cluster

A

during a hyperparameter sweep, because they realized the trial was useless.

A

Finally, some data scientists also use visualization tools like TensorBoard in order to compare and contrast some of the runs being done when they're using hyperparameter tuning.

A

In the case of resource management, this goes back to the hyperparameter tuning process, where there is a matrix of values for the hyperparameters and the user wants to spawn a bunch of jobs, usually hundreds or maybe even thousands, in order to figure out which model is useful for them. So a single user can hog all the resources in a resource-constrained cluster, and the requirement for this kind of fair distribution of resources comes up because these users sometimes come from an HPC-style background.

A

They want to submit to the queue and forget; they expect the scheduler to act fairly without their involvement. They want to know the lead time or wait time, something like an ETA for the job to run, and they want the ability to prioritize one task over another when required.

A

In terms of data management: if some users' data is stored in a data source that is not natively supported in Kubernetes, for example, they want to cache it locally and then use that data either for distributed training or for simple single-node model training. Along with that, it's not only about data replication.

A

They also want to keep track of the data that was used for each of their runs, so that they can figure out which data and which parameters were used for a certain run, either to discard it or to reuse it because it was useful; that comes out of experiment tracking.

A

The other problem we have run into is around running out of local storage itself. They want a seamless way to figure out, or for the scheduler to take care of, where a job lands: when they're trying to pull data from, or locally cache data from, a different source onto their node, they want to land on a node where there is storage space.

A

There are also concerns about data clobbering and labels themselves. This goes to role-based access control, or RBAC: when they try to label data, there is a concern that some other user within the namespace can clobber it.

A

Finally, tying all of these things together, we wanted to make some suggestions of areas that the ML working group can focus on, and we have a list of items here. It's mostly around higher-level abstractions for researchers, data scientists, and ML practitioners. Each of the pain points we have listed directly correlates to one of the items in this list, regarding either templating, workflows, experiment tracking, resource scheduling, resource monitoring, and so on.

A

These are also focused on reducing the barrier to entry for ML researchers, data scientists, and practitioners using Kubernetes.

A

A final note about the approaches: we also want to meet users where they are. Some users may already have a workflow; some don't understand, or are not familiar with, the concepts, use cases, or ways to use Kubernetes.

A

It might be better for us to meet users where they are, rather than imposing some sort of restrictions on them. Maybe we should also try to add value incrementally, addressing these specific pain points by building from small components and then building on top of them.

A

So that's the presentation.

A

All right, thank you. I also want to mention that there is some ongoing work on each of the items we have talked about here, all these list items. If time permits in later meetings, we can also present some of the work that we are doing, but that might be at a later stage. I think right now we are focusing on figuring out what the user pain points are within the cluster,

A

sorry, for cluster usage by ML practitioners. Okay, so Connor and Nicholas, do you have anything to add?

C

At the moment, yep, I guess everything we had went into the slides, so yeah.

D

More.

C

People can see the recording. It looks like we have a bit low attendance this week, but I'm curious to see how this matches up with other people's lists when they get to present.

B

Should we spend some.

A

I mean I'm on.

B

Chat.

B

Yeah.

A

So I cannot share the slide deck, but I'll copy it somewhere else and share it with the group, yeah.

A

So, are there any other questions before we move on from what we just presented?

D

I guess the comment I'm going to make is, you know, for each one of those items on the bullet list you showed, there's a whole cottage industry of tools that exists around those. So a lot of this is taking stock of what exists out there and finding ways to either combine them or make them appropriate, you know, easy on-ramps or what have you. So it's definitely not happening within a vacuum, given the amount of things happening in the ecosystem, yeah.

A

The other thing I want to mention here is: we want to identify the gaps, and if there are already existing tools, we don't want to reinvent those. We just want to make sure our tools are filling the gaps wherever they exist.

B

I've got a question.

B

How can I get hold of frameworks that are optimized for Intel hardware?

A

I can speak to that. There are some wheels already available. I don't think they are containerized yet, but wheels are available for TensorFlow and some other frameworks as well. I don't exactly recall off the top of my head; if anybody else recalls, they can help me out. I can send you links to them.

A

They are already optimized with MKL and take advantage of all the intrinsics if they are available, such as AVX and FMA, and all those special instructions available in the underlying hardware. Okay.

B

Actually, a side question: does anyone feel the need for curating basically a list of tools and links that one should probably know about if they're trying to do this? Because right now, when I try to navigate this area, it's spread out everywhere. There are so many different possibilities and solutions, and for a new user it's a little bit daunting, not even taking into account the fact that setting up Kubernetes itself is a Herculean task all by itself.

B

So does anyone else see value in trying to curate, maybe just a doc or a GitHub README or something of that sort, something that links to all the tools that exist out there? The one that most interests me is having optimized frameworks, because everybody should be using those, but finding them has been hard for me.

D

Yeah, I agree 100% on a list like that. And on the optimized frameworks, actually, I just talked to a team yesterday who's packaging these up, and I offered to give them a big kiss, because it's been such simple low-hanging fruit that, you know, it's really been frustrating. It's kind of fallen between the gaps of several teams, and so we're addressing that. But this would be kind of a perfect place to put that.

B

I.

E

Completely agree; it's essentially just like an awesome list. Yeah.

F

So, um, to some extent, I don't know what the scope of Kubeflow is, but it seems that in terms of the automation of all these tools, or pulling them together, there would be a place for it. How do you see the relationship between the two solutions?

B

I think the list just keeps growing; beyond the ones that I can list, there are probably a zillion of them out there. So how do we stay unopinionated in terms of the application stack that people want to use, but still give them access to all these different tools?

F

You probably end up with a map, like an ecosystem map, almost like CNCF's, right? You have like four different ways of doing the same thing, three different schedulers, and so on. Yeah.

B

I agree; at some point the list will start losing value because people don't know what to choose. But even at the infrastructure level: okay, leave aside the part about identifying the right set of applications to use, that's a hard problem, with each one having its own subjective advantages and disadvantages. But just knowing how to set up Kubernetes, and what sort of plugins you have to use, and, like, how do we recommend...

B

How do we recommend users consume namespaces and quotas, right, recommend how they partition their capacity? All of that is a lot of information. We have it spread across one or two different features, but there's no easy user story around that, right?

C

For.

B

Things like having optimized images could...

C

Could we put a doc like that on the website, or is it too application-specific?

B

That's a good question. Maybe we can bring this up in SIG Docs and ask them if the website would be a good location for this. If not, just having a simple repository to start with, where we can all send some patches and add the things we find are going to be useful, might be a good start by itself. But I agree that for discovery, having it somehow linked into the docs...

B

Yeah.

F

I guess it's also been an internal problem for us; we've been trying to understand how all the tools work together. And then you have that big pile of cognitive load versus the actual use case, which is that people don't want to know why, and there needs to be a bridging of that gap somehow. You know, I don't remember if Balaji got to it, but the expectation is really that someone comes in and doesn't care that Kubernetes is underneath.

F

I know it's kind of a controversial statement, but you know, it's about the code, it's about the results in the end, yeah.

B

That makes sense; actually, the goal is for it to sort of hide itself in the background.

B

So that'll be a win if we actually end up getting to that point.

B

So to start with, I don't really know what the right repository is, because this is more of a curation effort, and so I don't think it can be an incubator or whatever. Maybe given the new changes around SIGs and how repositories are being organized, ideally this should be part of... I don't know which SIG this should actually be part of.

B

Maybe.

F

If you think about it, it's been project-oriented so far, but what if you think about activity, right? So if you say, you know, labeling, and say we have templates for how to do labeling, we have templates for how to do training, and so on. So it becomes, say, the best known method of doing X, like the best known method of doing TensorFlow for this problem. It might be a set of different things depending on what you're using TensorFlow for, right?

F

So, do you want to use the Estimator APIs, do you want to use the lower-level APIs, and so on.

C

So.

F

Close what.

C

About the activity, so you're saying you could have, like...

G

A.

C

Number of different recipes for doing a thing like if you want to do hyper, parameter tuning, here's three options and how to set it up.

C

It looked like the trend was kind of away from the incubator repo and towards, you know, each SIG having a separate GitHub org. Is that right? There's, like, kubernetes-sig-node; can we get one made for the working group? Then we could just, you know, make our own repo. Okay.

C

Is that an option?

B

Because.

B

Another thing that might be helpful here is listing a bunch of use cases, or features that we want from the system; today's presentation is a good start there. So that when someone wants to add a feature, they don't go straight to a very specific SIG and say, hey, I want to change this API here, but rather they justify what they're trying to do at a high level and then get specific.

A

And also, from an infrastructure standpoint, you might need to add some best practices guides on how to set up the cluster for, I don't know, running ML workloads. I mean, it would be very useful.

A

At least within our group, it's kind of hard, but we have learned a lot of things. As a community, if we come together and put up a guide for setting up the cluster itself, including RBAC, how to use namespaces, how to use storage, PVs and PVCs and all the provisioners, there are a lot of things that go into baking together a cluster, so that might be very useful.

B

That's an excellent idea, Balaji, just capturing the existing tribal knowledge and making it easily accessible; that sounds perfect. Okay, so Connor, or someone else, do any of you want to make a proposal for opening this repository with that specific goal, or should I take it?

C

It'd be easier for you to approve it if we suggest it.

C

Yeah, we can do that. Okay.

B

I don't expect a lot of resistance to that. Okay, sorry, I didn't mean to take up so much time on that, but we should be good. Go back to your presentation and the specifics.

A

Sure. Connor, Nicholas, and Jason are also on the call, so they can all chime in.

A

So, with respect to resource management, okay, I'm here, so I'll start here. This idea of fair distribution of resources comes up because there are existing features we can use, like quota and priority, within Kubernetes. But the users that we have are expecting something similar to HPC-style environments, where they want to submit to the queue and forget. Quota is actually implemented using an admission controller, so it just rejects high-level objects or pods at admission time, and priority has its own drawbacks.

A

Some users don't even specify priority, or they all specify the same priority, for example either high or low, and then it becomes useless. So those are the problems that we see.

B

Quota is not meant to be used directly; rather, you have controllers.

B

So, in that sense, controllers are expected to work around quota problems and maybe resubmit new pods and handle that.

A

Right. The other problem here is in terms of using the Job API itself in Kubernetes. Some of them don't want a job to restart, and there are details about how you submit a job that doesn't restart on failure, so there are some issues around there as well. Some of these things just need a best practices guide saying this is how you use it, and for some of them we have to go around and figure out something with existing features.

A

That's something we've heard; maybe somebody else can chime in here, and that's one of the reasons they use bare pods.
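
A minimal sketch of the run-once pattern with the Job API, using the official Kubernetes Python client; the image, command, and names are placeholders:

```python
# Hedged sketch: a Job whose failed pods are not restarted or retried,
# via restartPolicy Never and backoffLimit 0.
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-once"),
    spec=client.V1JobSpec(
        backoff_limit=0,                     # no retries after a failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",      # don't restart failed containers
                containers=[client.V1Container(
                    name="trainer",
                    image="example/trainer:latest",      # placeholder image
                    command=["python", "train.py"],      # placeholder command
                )],
            )
        ),
    ),
)
batch.create_namespaced_job("default", job)
```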

D

I guess one comment I would make about the previous statement: if it's the controller's job to handle quotas and resubmission, that just kind of moves the problem from one area to another. And, you know, Kubeflow and such are coming online, but until those are in a state where they can handle basically anyone's use case, there's still the need to submit individual jobs. So I feel like that doesn't necessarily...

E

Solve the problem; it moves it around. Well, it moves it to a better place, right? Because some jobs contain, you know, more than one job, and have a couple of different containers involved, and if you're trying to figure out a priority of who gets what when, you need to have the whole scope of, okay, how much CPU consumption will I spawn when the job starts. That's why the controller ends up being the thing that monitors it.

E

Rather than keeping track of priority in queues alone, you have to have a holistic view of every part of the job before you launch it, so you can allocate, so you can actually say, okay, I can submit this job now, where the job is not the Kubernetes Job but rather a set of containers that interact in various ways.

G

Yeah, and just to give a bit of context for the pod stuff in Kubeflow: it's because basically there are two types of failures, right? There are failures that are retriable; let's say your pod crashes for some reason related to resources, you might want the job to restart, and we retry the TensorFlow job, right. But, for example, let's say there is a misconfiguration or a file is missing; we don't want the pods to be retried 20 times and hog a GPU for that, right?

G

So that's why we moved from using the Job controller to a custom controller. Basically, we are checking the exit code, and if the exit code is within a certain range, we assume that the failure is retriable; otherwise we just stop the job there.

G

I guess that's mostly because GPUs are so expensive that we try to optimize the use of the system.

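
A minimal sketch of that exit-code policy; the retriable range here is illustrative, not Kubeflow's actual values:

```python
# Hedged sketch: decide whether a failed trainer should be retried based on
# its container exit code. The ranges below are illustrative assumptions.
RETRIABLE_EXIT_CODES = range(128, 256)   # e.g. signal deaths such as OOM kills

def should_retry(exit_code: int, attempts: int, max_attempts: int = 3) -> bool:
    """Retry transient failures; fail fast on likely misconfiguration."""
    if attempts >= max_attempts:
        return False
    return exit_code in RETRIABLE_EXIT_CODES

print(should_retry(137, attempts=1))   # True: SIGKILL/OOM looks transient
print(should_retry(1, attempts=1))     # False: likely a permanent error
```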


G

I don't think we did. I'll try that after the call. Okay.

A

I mean, it is mostly around debugging and diagnosing what went wrong before the next restart, at least; maybe having some control over that. Okay.

F

Yeah, that was another point, right: the number of sources that you potentially have to dig through. If the submission went wrong for some reason, then you have this sequence of steps to go through, like inspecting the job, inspecting the pod, figuring out whether it actually did get to start, before you even grab the logs.


F

Those kinds of Kubernetes tooling usability issues, I don't know if they're ML-specific; it's been more just general user experience with debugging jobs, because you need to go from the job to whatever pods ended up fulfilling that job.

B

Yeah, I suppose, you know, existing logging systems don't annotate log lines with specific labels, so it can get a little tricky to go from a bunch of labels on a job to a list of log lines in some back-end logging system. I think this specific feature request was raised this week by the Kubeflow community; I saw they were annotating TensorFlow logs with specific pod labels so they can actually track them throughout. But we should definitely file a feature request for this, because I think this is something that Kubernetes should be doing: an existing logging system should be providing the necessary hints to aggregate container and pod logs into higher-level entities like jobs.

B

Do you want to file that issue?

G

Yeah, I can do it. I just asked Jeremy because he might already be doing that, so I asked him; but if he isn't, we'll file an issue.

A

The other one is about generically monitoring resource utilization, in terms of accelerators. It's not about the quantity itself; they want to introspect the memory utilization, for example when a model doesn't fit in GPU memory, or even CPU memory. If it doesn't fit, they want to know how far they can push the system, or how much memory is required for them to run those models. So maybe monitoring for getting more details about these job runs.

A

Maybe that's the reason why the accelerator resource utilization item is there. In terms of storage, we talked about it: they don't want the disk space to run out, the local disk on each node.

B

For GPU metrics, right, it's just that they're not part of, say, kubectl, yeah.

B

I think.

F

It's like with the logs: it's a way to tie it back to the job and be historical, right, so you can...

B

Okay, can you not do that today with Prometheus, for example?

F

You probably can, right, but there's probably just no best method for how to tag everything correctly. So once you tag it right, then, okay.

B

Yeah.

F

So it's probably just usability; I don't think it's impossible to do. Okay.


D

I've tried to do that exact thing using the new GPU metrics that are exposed, with CoreOS's Prometheus Operator, but it was beyond my abilities, so I'd appreciate some pointers there. Okay.
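
For reference, a hedged sketch of the kind of query involved, against Prometheus' HTTP query API; the endpoint and metric name are placeholders that depend on the exporter in use:

```python
# Hedged sketch: query Prometheus for a GPU metric tied back to one pod.
import requests

PROM = "http://prometheus.example:9090"                   # placeholder endpoint
query = 'gpu_utilization{pod_name="mnist-train-abc12"}'   # hypothetical metric

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
for sample in resp.json()["data"]["result"]:
    print(sample["metric"], sample["value"])
```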

B

Yeah, I mean, this is what we could do: if we have a repository, you can actually have an issue just for this, because this is more about documenting how to consume an existing feature rather than anything else.

B

This should be trivially possible. If it's not possible, then we sort of failed in trying to add this feature to Kubernetes.

F

That matches some of the internal process we've been following here, where there have been three steps to solving a problem. The first step, and this has also been inspiring this, has been to collect the issue from the users themselves,

F

and add enough detail to it. The first thing we would do is figure out if we can solve it with what we already have by just documenting it; second, whether we can extend something that exists; and only thirdly would we start looking into new projects. But I wonder if there's something we can mirror in open source. There would be something similar; it could be Kubeflow, or what you're talking about with all these best known practices, but I'm not quite sure, like I said before, what the relationship is between those.

F

If Kubeflow is an opinionated way of doing things, saying this is going to be the push-button easy thing, then what's being done here in this workgroup is to identify the ecosystem landscape at large.

B

Yeah, I mean, there might be multiple ways to solve the same problem, and Kubeflow might be choosing the most optimal way to solve the problem, but that still doesn't preclude other users and other use cases, if you will. I mean, I don't have a problem starting anywhere; wherever we all think is the right place to document this, let's go with that.

B

So, Nicholas, are you suggesting that we actually have a repository in Kubeflow for this?

F

Um, as long as there's one place for it. But as you can see here, some of this might be really simple and, to my mind, not really warrant big projects or anything like that, but just, you know, recipes, like we said before. Yeah.

B

So.

C

In your vision, in your mind, for the repo that we talked about submitting a proposal for: if people file issues there, would these be the kinds of issues that you'd expect to see?

B

Yes, it could be that.

B

If not, then we can start exploring how to extend our system to meet that specific use case. Because right now, if someone files an issue against the core Kubernetes repo saying "I want to do foo", it will get lost, whereas

B

if it's focused on a specific workload, then hopefully there are people who will pay attention to it.

B

I.

C

I think that makes sense, to the effect that it would benefit people that may not be using a particular workflow solution. You know, there are a lot of people doing ML on Kubernetes that aren't using Kubeflow, and so if we put everything in the Kubeflow org, it might only be visible to a small segment of the users.

G

Yeah, I agree with that.

B

Okay, awesome. So I guess we're deciding to do it in the Kubernetes ecosystem, the core community. We still haven't figured out exactly where it's going to be, but I guess we will know through the proposal process. So, Balaji, going back to the other monitoring problem you mentioned: GPU memory should already be available, so I guess there is a usability issue there.

B

Aren't storage metrics already available as well?

A

Yeah, storage metrics are already available, but this is about visibility for the user, and again this would very simply be coming up with some kind of best practices guide to make sure that everybody understands how to expose these metrics to users, yeah.

B

Okay, so this falls again in the documentation category. Yep.

B

You also mentioned templating.

A

I see. So in this kind of iterative training process, they also make code changes where, ideally, they want to just stay within the IDE they are using, and something else should understand that there has been some sort of code change, make sure the Docker build happens, and then deploy it on the Kubernetes cluster so that they can look at whether the job runs or not.

A

Or, if they know the job is successful, they want it directly deployed in the cluster. Connor and Nicholas can talk about this; they've been working on a related area. Okay.

B

So is the use case: as an ML practitioner, I want to be able to try out my model on my workstation, make sure it works, and push it to a cluster? Or is it one where I actually push the model to a cluster, and then I need to go and iteratively change the code that I have previously submitted and trigger another training run?


A

It's both, I think. Ultimately, after some code changes, they want to do this kind of sweep. Someone else can correct me if I'm wrong, but ultimately, after doing all these code changes, they would want to do a sweep in order to figure out what parameters are useful for training their model, and then, at the end, deploy it in the cluster, like in this model.

A

So.

C

The main idea behind the templating issue there was to get from source code to deployed container, okay, because a lot of the tooling that crops up in the ecosystem kind of takes over at the point where it assumes you already have a container built with your code in it, and that's not something that, you know, some data scientists who haven't worked with scheduling environments are familiar with.

C

They may not even be familiar with the toolchain to build the container. And then, once you get this kind of best practice that tends to build up within a research group, how do you share it in a good way? What we see is a lot of people may just clone and continually modify prior projects.

C

But improvements for one project may not get pushed back into the one that's used as a template. So it's just a way to aggregate and codify best practices, usually focused on a specific combination of a deployment model and a framework.

C

So you may have a known-good way to deploy multi-node TensorFlow, for example, or just something that works within your cluster setup, so that you can ease the time from zero to getting something running.

B

For potential users, containerizing their app is a huge blocker for adoption. They love the fact that there are so many possibilities, but the fact that they have to move from a VM- and Jupyter-based workflow into a container-based, auditable workflow, that barrier has been a huge problem for them. So this is an area where I've been meaning to invest some time, which is: can we make it easy for users to move their source code? This source code is typically written against well-known libraries; we can literally have a single file describe the dependencies. So can we provide tooling that would take that source code and deploy it against Kubernetes? Brendan's Metaparticle is one way to approach that, I think, or Kelsey's Go-based demo, which also self-deploys. We could do something even really simple, where you specify the framework that you want to use, or don't even specify it, like you have a mega-container that has all the frameworks: you just literally run a command that wraps your directory, it knows an entry point, and then it just runs that. Something really simple that makes it possible for people to try out Kubernetes.
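
A hedged sketch of that wrap-and-run idea, assuming a Docker and kubectl setup; the base image, registry, and entry point are placeholders:

```python
# Hedged sketch of "wrap a directory and run it": build an image from the
# user's source on top of a framework base image, push it, and run it as a
# one-off pod. Registry, image names, and entry point are placeholders.
import os
import subprocess

def wrap_and_run(src_dir, entry_point="train.py",
                 base_image="tensorflow/tensorflow:latest"):
    # Generate a throwaway Dockerfile inside the user's source directory.
    with open(os.path.join(src_dir, "Dockerfile"), "w") as f:
        f.write(f"FROM {base_image}\n"
                "COPY . /app\n"
                "WORKDIR /app\n"
                f'ENTRYPOINT ["python", "{entry_point}"]\n')
    image = "registry.example.com/wrapped:latest"   # placeholder registry
    subprocess.run(["docker", "build", "-t", image, src_dir], check=True)
    subprocess.run(["docker", "push", image], check=True)
    # --restart=Never makes kubectl run create a one-off pod.
    subprocess.run(["kubectl", "run", "wrapped-run",
                    "--image", image, "--restart=Never"], check=True)

wrap_and_run(".")  # wrap the current directory and run it on the cluster
```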

A

Okay, I'm.

G

Sorry, yeah, just quickly: I don't know if you guys have heard about FloydHub, but it's working with containers, and what they do is exactly what you just said. Basically you say, I don't remember the exact command, but something like floyd run with framework equals TensorFlow or whatever else, and then they put your source into an image that already has a lot of the dependencies. The issue is that the image, I don't remember exactly, but it's really very heavy; but it's really easy to use. That's pretty cool.

B

Maybe just Python to begin with. We could...

B

Statically introspect what libraries are being used, and then even infer what base image to use for that.

B

This is a concrete project, I think.

E

I think, as an MVP, we might want to meet users where they are. If they're already using Jupyter, I think Jupyter has a back-end we can extend; we can maybe add a button to Jupyter that, you know, creates a pod or something like that. Because if they're already inside Jupyter, then by default they have all their dependencies in one place, so we can use the same container, put the code in there, and deploy it as a TFJob or some other abstraction.

B

But I feel that Jupyter is just one possible IDE, and the Jupyter IDE is not really that great; maybe JupyterLab is better, but it's severely limiting. So for folks who are traditionally used to regular text editors or IDEs, that might not help them. To pull it the other way: if we can design such a tool to be modular, so that it can be consumed both from within Jupyter's context as well as, say, run standalone, I think we could probably try it.

A

The other place templating is useful comes from hyperparameter tuning, where you already have a pod template. This can already be done, but it's mainly about just documenting how to change these parameters, with something like a bash script for Kubernetes, or maybe via Helm or a custom resource which does that for you.
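
A minimal sketch of that templating idea: stamp out one Job manifest per parameter combination from a single template. Plain string formatting stands in for Helm or a custom resource; the image and script name are placeholders:

```python
# Hedged sketch: generate one Job manifest per (i, j) combination.
import itertools

TEMPLATE = """apiVersion: batch/v1
kind: Job
metadata:
  name: sweep-i{i}-j{j}
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: example/trainer:latest
        command: ["python", "train.py", "--i={i}", "--j={j}"]
"""

for i, j in itertools.product([1, 2], [4, 5]):
    with open(f"job-i{i}-j{j}.yaml", "w") as f:
        f.write(TEMPLATE.format(i=i, j=j))
# Then: kubectl apply -f job-i1-j4.yaml (and so on) to launch the sweep.
```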

A

For.

B

So maybe we should revisit the rest of the stuff that we discussed, or the stuff that you presented, next week, or actually two weeks from now. Meanwhile, as an action item before the next meeting, let's see if we can get the repository and basic documentation going.

A

Sounds good.

G

Thanks. Thank you.

A

Thank you.
From YouTube: Kubernetes Machine Learning WG 20180315
