Kubernetes WG Resource Management, 13 Dec 2017

From YouTube: Kubernetes Resource Management WG 20171213

Description

Meeting Agenda:

https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU

A

All right, so this is the December 13th meeting of the Resource Management working group, and there are a number of items on the agenda. I believe Vish put a number of these there, if I'm not mistaken. So in the interest of time, let's go through each of these items one by one. Vish, do you want to go through the 1.10 work plan topic?

B

Sure. I just want to check whether someone is taking notes.

C

B

Thank you, Balaji, awesome. So I had a few thoughts in mind, but before I go on about them, I want to understand what it is that we, as a group, are considering achieving in the 1.10 time frame. I just want us to be cognizant of the fact that, even though we have a lot more people in the community now focusing on this area, the review bandwidth is pretty limited.

The churn rate also needs to be limited, so we need to figure out some form of balance between graduating existing features and stabilizing them versus adding new things. So let's spend the next ten or fifteen minutes together identifying priorities for all of us, and then see how we can place them in the 1.10 time frame.

A

Maybe I can kick off the discussion, Vish. From my perspective, at least representing Red Hat in the community here, we would like to see device plugins be able to move to beta in 1.10. The basic CPU pinning support that Red Hat and Intel have been collaborating on, we would like to see move to beta in 1.10. And then the huge pages work, which at least we found valid with a lot of exhaustive QE on the Red Hat side, I see no reason why that can't be beta in 1.10.

From Red Hat itself, I'm not interested in driving a ton of new features in the 1.10 time frame versus finishing up what we've started. I'm not sure how others in the community feel, but that's kind of where we're coming from at this point.

B

I +1 most of that. Our main priority now is to graduate device plugins to beta, and also to continue focusing on graduating all the other existing experimental features, because we have had them for a while, and we have a few experimental controls and so on. So spending energy on graduating them would be the top priority. Secondly, I would like to start collecting use cases for better resource APIs: not necessarily to finalize the design, but at least to have a sketch of the different scenarios that we are trying to cover, and maybe have a few high-level ideas circulated in the community. That's about it, I guess. It's not about finalizing, and it's not about deciding the execution plan and so on.

Those are the main things from our side, at least.

D

E

D

I want to set it as a stretch goal that we work on the resource class API, and hopefully we can come up with some execution plan. I mean, as a stretch goal. Like I said, it's a goal that will help people prioritize the work.

A

Just on the resource class front, I'm curious to gauge where the community of folks here is. From my perspective, I still have work to do to make it easy to operationalize a homogeneous set of clusters or nodes, so some of the advantages around resource classes and getting heterogeneous support are at least more forward-looking for me. So is this a near-term production concern for folks, such that having it be only a stretch goal would be a major detriment?

B

We want to make progress on that, but at the same time it's not the topmost priority. If I had to actually prioritize it, I'd probably say it's like a P2.

B

D

B

Does that align with your thoughts?

D

I think, whether we put it as P1 or P2, let's try to make some progress. I mean, I feel nobody will really come to us and try to push; I don't think that will be the situation. But I agree with Derek that we may want to put more time into graduating the existing features to beta. I think that's always our priority. But we may also want to plan to put some time into this effort as well. At least on my side, I think I may try to spend some time on this.

B

Most of the reviewers, who would definitely want to be stakeholders on this project, might not have cycles to come to some consensus in the 1.10 time frame. I think that's what we are trying to convey by calling it a stretch goal.

B

D

That's definitely especially like I think this feature that I suppose no set on the schedule. Instead.

D

Getting out to prove it in here.

B

So yeah, there's one more thing that I forgot to mention. I would also like to see, in the 1.10 time frame, consumption of NVIDIA GPUs made really easy in minikube and other local cluster scenarios, without users having to go through a lot of effort. I don't know if that's a common interest for people, but at least with the Kubeflow project that we announced recently, this has come as a very strong signal and feedback from people trying it out.

So I'd like to make some progress in that area, but it should maybe be a P3.

A

Sorry or your process.

B

A

Did you say minikube?

B

What I meant to say with minikube was that the most common Kubernetes deployment solutions should support NVIDIA GPUs without too much friction.

B

F

I think it should be, could be, kubeadm as well.

B

Yeah, see, it's a little bit complicated. Kubeadm, as I understand it, is expected to just be the core component which deploys the core, what in Brian's language is the nucleus, and then just one layer above it which is absolutely necessary for any sort of minimal deployment function. It is not meant to be a full-fledged, fully user-friendly cluster deployment tool. So let's not even get into what the right abstraction for expressing that is. Maybe this is not the right forum for that, but yeah, we definitely have to identify the least common denominator for focusing our energy.

E

From Intel, we'd like to help with a design for NUMA, just something minimal: making the CPU manager and the device plugins agree on NUMA affinity.

That's in addition to all of the graduations that Derek mentioned.
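As a minimal sketch of what "agreeing on NUMA affinity" could mean, assuming a very simplified view of a machine, the example below picks a single NUMA node that can satisfy both a CPU request and a device request. It is not the CPU manager or device plugin API; `NumaNode`, `pick_aligned_node`, and the sample topology are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NumaNode:
    """Hypothetical view of the free resources on one NUMA node."""
    node_id: int
    free_cpus: List[int]
    free_devices: List[str]


def pick_aligned_node(nodes: List[NumaNode], cpus: int, devices: int) -> Optional[NumaNode]:
    """Return the first node that can hold both the CPUs and the devices, else None."""
    for node in nodes:
        if len(node.free_cpus) >= cpus and len(node.free_devices) >= devices:
            return node
    return None


# Assumed two-socket topology: only node 1 has a free device.
topology = [
    NumaNode(0, free_cpus=[0, 1], free_devices=[]),
    NumaNode(1, free_cpus=[8, 9, 10, 11], free_devices=["gpu-1"]),
]
choice = pick_aligned_node(topology, cpus=4, devices=1)
```

A real design would also have to decide what happens when no single node fits, for example falling back to an unaligned placement or rejecting the pod; that policy question is left open here.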

A

Is that something you just want to write a design for, or are you expecting an alpha code proof point? Because I guess the one concern I have is, if we're trying to graduate what we have now, I don't want to do a ton of refactoring.

E

Right, just a design for 1.10.

B

I have a point. I mean, it may not be anything, but I still want to discuss this: we already have auto-NUMA. So I want to understand what level of performance gains we will actually get by trying to pin to a specific NUMA node. Say you didn't have static CPU pinning and NUMA, and you still did device allocation; the device allocation by itself will have its own hardware locality issues. But I want to understand the performance impact of all of these different features and their combinations.

My gut feeling is that maybe we first invest time in building some good real-world performance benchmarks. With synthetic ones, we can set them up such that we will see performance differences, but does it actually matter to users? And then maybe we can give them specific guidance or white papers on how to configure nodes. So before we even invest more energy in identifying more of these features and setting them up with specific policies, I would like to see some benchmarks available in Kubernetes.

Maybe we start with just a few workloads, and then we can open it up to the community, where they can contribute much more. I see this discussion happening in the SIG Scale and SIG Testing areas, and I'll try to forward a design doc for that.

Google has invested in something called PerfKit Benchmarker, which is also open source. So there are basically a lot of things that we can leverage today, and it's just about setting up the right infrastructure for us to get that data, probably across different environments.

D

So maybe we should come up with some standard performance benchmark as a criterion for evaluating NUMA, because I also think this is something we would like to see. And I hope the benchmark is not just limited to NUMA, although maybe NUMA will be the first feature that we use it to evaluate.

G

Why do we have to have this thing gate it? I mean, it's a well-understood thing; there are piles of benchmarks.

B

My question is, what if we enabled auto-NUMA and then just, you know...

G

It doesn't account for PCI locality. That is a fatal flaw.

B

Yeah, I was just going to ask: what if you just do device scheduling without taking care of NUMA, and just let auto-NUMA deal with that? What would be the outcome?

G

It won't schedule the threads on the local node. Auto-NUMA handles page faulting, so it handles memory and tasks, not the PCI locality.

B

I'm saying that if you had just CPU pinning, and if you chose the right devices based on the CPU, for example, then if we had auto-NUMA enabled, what level of latency would we be looking at? That's the question, I guess.

A

That's what I was trying to differentiate. Connor said one thing, which was that he wanted locality decisions made relative to the pinned CPU choice. I can't tell, Vish, if you're saying you don't want to do that and would just depend on auto-NUMA, but I feel like we do need to get that aligned.

B

My thought process is that there are typically more CPU cores than external PCIe hardware devices, so, I mean, choose the rest based on that, just because of how machines are typically set up. But again, this is getting into the details, where we don't have any proper data to verify where the performance level is today and what the latency is like. I think the Intel folks did some work around this, something which tries to benchmark this thing. I would like to see that become part of the Kubernetes community.

Then we have some data on top of which we add these features, because we've mentioned this quite a few times but we never made progress in that area.

D

The CPU manager pinning work has come up with some performance benchmark to show the benefit of static CPU management, so I think that will be very interesting work if we can make some progress on that.

H

How some resort quality it is a moment. I had already didn't you, we could maybe.

I

C

But we can do sex again, like it.

B

C

Somerset, along with the coordinates.

B

SIG Scale and SIG Testing are trying to do something similar, so let's just see if we can have a single abstraction layer for running these kinds of benchmarks, because otherwise we'd have to go build all of it, which is probably not that interesting for us. So maybe we need to engage with those SIGs and find out which working group is working on real-world benchmarking, and maybe we have a few tests that are specific to our features.

B

I

Especially when it comes to colocation, right? Because I guess this would pretty much be about end to end, how fast or how much, versus what happens when more than one thing is running at a time.

B

I agree. I mean, we all know that we have to implement a whole bunch of features based on experience, but it's just sort of hard to say what sort of applications we're targeting and what the actual end-user benefit is.

B

A

I feel like, for the initial set of items we talked about, if we want to scope today's discussion, I didn't hear any rampant disagreement with "let's work towards graduating what we've had thus far." At best, we're now kind of prejudging any design thoughts that Connor might put out, which is a bit premature, I guess. So as a community, I'm not saying we should commit to doing anything with NUMA from a code standpoint in 1.10, and I believe you're saying the same.

I think it's fair for Connor to go do the benchmarking, or work with Bala to talk about the motivation for the next step of the design. But at this point, I feel like we're arguing about things that are more like three months out instead of, you know, January.

B

E

I think it makes a lot of sense, especially as we try to optimize in multiple dimensions. Identifying regressions is going to get less and less trivial as we add more things under management, so yeah, point taken.

I

We did try to kick this off like Jeremy and Christopher and I. At some point um we could look at what we could scribble down back then and Mikey would hausky was also a part of that you pump. We could have the gun and see where lifted.

B

C

I have this thing: I cannot sign it with this I'm working on some synthetic benchmarks, but if you can update that you can reduce the.

G

Real-world benchmarking is a significant effort. If we block on the inclusion of that, we're going to be waiting for a long time. A potential suggestion is Phoronix or PerfKit Benchmarker, which would probably be a lot easier to integrate, with a lot less debate about what gets run. A lot simpler. I think our team even has some experience with both of those, but I just don't know if we can commit to integrating them.

I

In 1.10. But there's just a collection of benchmarks that exist, like YCSB for Cassandra, and SPEC CPU and whatnot. Well, I hope this goes in parallel, right? Because some things won't work unless we have NUMA awareness, like the huge pages, isn't that right? Some things, if you get to many-socket systems, just won't be a great experience.

B

On the Google side, some of the teams that are maintaining PerfKit Benchmarker are actually adding benchmarks with TensorFlow, where we do random device allocations and try to measure the standard deviation. So all those benchmarks, I would expect them to be upstream. I would consider it to be less technical work to get that running in a pod, and then it's just running a lot of them and storing the data in Prometheus, or something like that.

I'm saying that since there is an effort already happening to build a framework for running real-world benchmarks, maybe we can have that group of people focus on that, and we can just bring in some specific workloads that we think are good indicators of performance.

Jeremy, to reduce your concern that this might just take forever, I'm saying that maybe we can work towards a simple MVP, which is not what we would ideally like, but at least it's still giving us some data.
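The measurement described above (run a workload many times with randomly chosen devices, then look at the spread) can be sketched as follows. This is only an illustration under stated assumptions: `measure_spread`, `fake_workload`, and the device names are hypothetical, and the stand-in workload returns a made-up run time rather than running TensorFlow.

```python
import random
import statistics
from typing import Callable, List, Tuple


def measure_spread(
    run_once: Callable[[str], float],
    devices: List[str],
    runs: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Run the workload `runs` times on randomly chosen devices.

    Returns (mean, sample standard deviation) of the reported run times.
    """
    rng = random.Random(seed)  # seeded so the device choices are repeatable
    samples = [run_once(rng.choice(devices)) for _ in range(runs)]
    return statistics.mean(samples), statistics.stdev(samples)


def fake_workload(device: str) -> float:
    """Stand-in for a real benchmark: pretend the remote device is slower."""
    return 1.0 if device == "gpu-local" else 1.5


mean, stdev = measure_spread(fake_workload, ["gpu-local", "gpu-remote"], runs=20)
```

A large standard deviation relative to the mean would suggest that device placement matters for the workload; a small one would suggest locality-aware allocation buys little for it.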

G

PerfKit Benchmarker was brought up in that document. I'm not against that at all. It's just that the fullness of that document was, well, ambitious, and I would say necessary. Quite honestly, we pitched a giant proposal to SIG Scale like a year and a half ago, and it just didn't go anywhere. We never staffed it. So I was glad to see someone else in the community bring this up again, if they can actually staff it.

B

I agree, it definitely needs one or two people who would work for a month or two to at least get something going. What I've noticed is that once we have these frameworks, people usually end up adding more and more benchmarks and more tests. It's that initial inertia of building the framework that typically takes more time.

E

We can ask around, but we probably can't commit to anything in this meeting.

B

Okay, nice to have I'm, probably.

C

Just on that design proposal being out for NUMA: did I hear that we're not discussing execution anymore?

A

Firstly, I have no qualms with people wanting to write proposals and argue in the content of a proposal why their feature is valuable. But I'm not setting any expectation that, in the 1.10 timeframe, any approver of that proposal will have time to review it and keep up with it.

So as long as the author is aware of the bandwidth demands on the reader, and I think we're communicating that pretty clearly here now, if people have time and energy and do want to invest in research like that, let's get that done.

What I'm most concerned about right now is what we have to allocate people to sign up for in 1.10. If we can just make sure that, for the feature sets we described that we want to graduate, we have clear owner and reviewer pairings, that's probably the best thing we can do, and we can handle that offline if you want, because there are other topics on the agenda.

But if Intel wants to go and write proposals, I mean, I've written thousands of proposals that never merged, but from each one I learned something. So it's okay.

E

So where is the roadmap being stored? Is it in the same spreadsheet?

B

No, I feel like we should probably start moving that to a community file, like some README file. Maybe we just create a file within the working group directory in the Kubernetes community repository.

A

I have an action to do this and I've been really delinquent, but I will do that before the new year.

We can then handle that via pull requests afterwards. It seems like we have general consensus on the near-term 1.10 areas that we want to close out, with some stretch goals of maybe iterating more on a resource class design and maybe throwing more thoughts out to the community around a co-locality design with CPUs and devices.

A

Should we move on to the next item on the agenda? There was a next item on here around discussing cluster-scoped resources, or actually, I think, Jeremy, you had an item on here to give a rundown of the KubeCon meeting.

G

So if everyone on the phone was there, or attending remotely, then I don't need to go into it again. I would say that there was a lot of good scheduling talk there. I mean, there's kube-arbitrator; I learned about that. But anyway, I can't see who's on the call, so I guess I'll just assume that folks have a good sense of what was discussed there.

J

Jeremy, can you go over whether there were any requirements from the community that were brought up in relation to the roadmap or anything, and the changes that people were looking for?

G

I'll tell you what: the one thing I kind of heard loud and clear was capacity management, capacity planning. That was the one thing. And then the other one, although it was probably heavily biased because the gentleman from Intel was there talking about FPGAs, is that we keep hearing about those, especially since Amazon is selling them. So that was the other thing that came up requirements-wise. And quite honestly, I think everybody is waiting for us to get the device plugin and the user story around GPUs nailed down.

Let's get that buttoned up and nailed down, you know, I mean have a useful user story, preferably across distros, preferably across platforms.

B

There was one more thing, which is resource sharing across namespaces, especially if you have a large set of resources and different people holding quota while aiming to improve utilization. I think there was a project called kube-arbitrator, and there is an effort to make it more upstream and more relevant. Those are the other discussions going on.

D

For capacity management, does it relate to resource quota, or is it something different?

A

The arbitrator works on top of that. In early discussions on it, they wanted to do it through quota, but we talked through that challenge.

B

G

Here's the thing.

B

G

B

Jeremy, go ahead.

G

I was going to elaborate a little bit more on capacity planning. Our team has been digging pretty deeply into the Prometheus metrics that are currently available.

There are some giant gaps in being able to make sense of what's running in a cluster and how busy your cluster is. So I think over the next couple of, well, sprints on the Red Hat side, we're going to poke there quite a bit, actually, to help see how to build useful Prometheus queries. The intention is ultimately to inform cluster autoscaling, but really in the short term just to figure out cost efficiency, potential changes for cost efficiency, et cetera.

You know where this started, guys, was: where the heck do we set kube-reserved, and what do we set system-reserved to? We started looking into it, and it's really difficult to answer this at cluster scale right now, given the existing metrics. So on our side, well, I'm not exactly sure where the changes will go, but that's how I was attempting to address this. It's a big concern internally as well.
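For context on why the reservation values matter for capacity planning: a node's allocatable amount is what remains of its capacity after subtracting kube-reserved, system-reserved, and the hard eviction threshold. The sketch below works through that arithmetic for memory and sums it across nodes. The function name and all the numbers are illustrative assumptions, not values from any real cluster.

```python
def node_allocatable(capacity: int, kube_reserved: int,
                     system_reserved: int, eviction_threshold: int = 0) -> int:
    """Allocatable = capacity - kube-reserved - system-reserved - eviction threshold.

    All arguments use the same unit (here MiB of memory). Clamped at zero.
    """
    allocatable = capacity - kube_reserved - system_reserved - eviction_threshold
    return max(allocatable, 0)


# Assumed per-node settings in MiB: (capacity, kube-reserved, system-reserved, eviction).
nodes = [
    (16384, 1024, 512, 100),
    (8192, 512, 512, 100),
]

per_node = [node_allocatable(*n) for n in nodes]
cluster_allocatable = sum(per_node)
cluster_capacity = sum(n[0] for n in nodes)
reserved_fraction = 1 - cluster_allocatable / cluster_capacity
```

With these made-up numbers, the two nodes expose 14748 MiB and 7068 MiB to pods. Choosing the reservations well at cluster scale needs per-node usage data for system and Kubernetes daemons, which is the metrics gap described above.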

D

Does that mean we may want to add more monitoring signals or metrics throughout Kubernetes, or do you think maybe there are some other ways to address this problem?

G

Yes, I mean, I think we'll prototype with a plugin, either a node exporter plugin or an additional Prometheus scraping endpoint, and then we'll see if it's worth upstreaming. My opinion is that this is pretty generic stuff that anyone who's building capacity planning systems would want.

A

So personally, I think this discussion, if we're talking about reservations on a node and stuff, is probably something we should drive through SIG Node, more properly. And as you know, Jeremy, our observations are that this couples a lot of things; we've had large numbers of services destroy our CPUs as well. So were there any other topics we wanted to bring out from the KubeCon meetings, or is there someone here who wants to talk through the cluster-scoped resources?

B

Were not able to see the standing so there's so much of expectations and.

A

I saw Jeremy's picture, and I can attest that, trying to listen remotely, it was hard, because it seemed like it was a big room that was full of people, so that's great. So do we want to go through the next topic on cluster-scoped resources?

B

A

I do not hear an author, so I would say let's move on to the next topic here, which there was an interest in talking through. At the end of 1.9, we had a very good discussion around graduating device plugins. Vish, do you want to give an update now on 1.10 and how we can get over the hump that we fell in?

B

B

I just feel like, as Jeremy heard at KubeCon, we need a few more implementations. So maybe trying to have a prototype of FPGA integration working, and maybe, if we can get it, the Solarflare acceleration; those are things that will really be beneficial when we are having this discussion.

B

A

For Solarflare, we did try to pursue that, right, and then we've basically kind of halted that. So absent revisiting that original discussion on what we pass in through the device plugin, at least from my perspective, I'm not really motivated to continue pushing that.

B

A

Clearly there was pushback on trying to use this as a way to handle anything around networking, correct.

B

D

I think on the networking side, I hope they can use CNI, because that is the protocol used for network extension, and I just feel like sending the whole...

B

A

I don't want to waste our own engineering time on devices that might be outside of the comfort area of this group, versus devices that we think would help us demonstrate that device plugins can satisfy more than one device type. Because personally, right now, device plugins is very clearly a hardware accelerator plugin model, and even then not yet satisfying the monitoring side generically.

So we have to choose what the other one is going to be. Right now, if we're getting a lot of interest in FPGAs, I'm not sure that we can staff that this upcoming quarter, but I'd like to know what the other device type is that we think would be reasonably covered by the existing surface area, so that we can graduate.

B

The only thing is, maybe as part of beta we should... If the plugin is not complete, that's okay; that's probably less interesting for me. I just want to understand those high-level details: what is the scope, and how do all these plugins interact?

A

Jeremy is pushing some other work around multi-network support via other angles. I don't know, Jeremy, what your appetite here is with respect to that.

G

Yeah, so that's how I spent my week last week, for the most part: talking with different people. If you're subscribed to SIG Network, you'd have seen me propose a new working group to sort this out amongst the vendors. In those meetings, I made sure they understood that there's an eventual intersection between the device plugin and whatever they come up with, initially using CRDs.

We actually committed some engineering resources inside Red Hat to work and build prototypes in that area, along with several of the other vendors, including Intel. If you track SIG Network, you'll see it's coming up there. One thing we can do when that kicks off on December 21st is bring up the concerns mentioned now around what to do in 1.10. I just don't know if they're going to have solid direction in the timeframe where we're looking to potentially graduate things, but it's worth mentioning to that group.

D

G

D

I want to clarify what it is we want to get from them. I mean, I know it definitely helps us to evaluate the API design with different types of device plugin implementations, but I also don't want to depend on that, since we don't really have the human resources allocated for those device plugin implementations. Right now it seems like no one has confirmed that, in the 1.10 timeframe, they will be able to make a device plugin implementation available.

I just want to set a clear goal for what we want to get from the different implementations. If we just want to hear feedback, I think we already have some feedback from early developers of different device plugins. Maybe we can just summarize that feedback, and then maybe we can just ask around for folks to say whether they think the feature is ready to graduate to beta, instead of really depending on a real implementation used in production or whatever.

J

I know they want to get to a point of stability, but is there a huge hurry to graduate device plugins to beta that would preclude waiting on some of the networking features to get solved?

A

I mean, from my perspective at Red Hat, we will have a very low ability to direct our customers to use a feature that's alpha, so you get very little meaningful feedback. I can't comment for other vendors, but right now alpha is a big inhibitor to meaningful feedback.

D

I see. I also feel the same way: if we just hold the feature at the alpha stage, we may never get enough feedback. And I know it's a commitment for us to support different versions of the API. But I just want to say, from other people who have implemented or have tried to implement different plugins, if they say some feature is not working and has to be fixed for beta, I would like to see the particular feature requests or any particular issues, instead of trying to rely on saying we have to have some other device plugin implementation to graduate the feature to beta.

A

To me the question is: are there other devices that users in production, or vendors in this space, are looking to support in any timeframe, outside of GPUs? And if that's not the case, is it realistic to expect that we can meet the goal of demonstrating that device plugins could support more than one device?

You know, at Red Hat we tried to push forward the Solarflare work, because that would be something compelling to our user community, but obviously that met resistance. And right now, correct me if I'm wrong, I'm not aware of any strong demand pushing us to get FPGA support in the next three months. There is a major user need for accelerated workload types, though.

So the best I can think of is that we could either rebrand device plugins, or just be honest with ourselves that we're not having people demonstrate other devices in the right time frame, or maybe we revisit the pushback we had on the Solarflare work. Otherwise, in three months, we will have the same question of whether device plugins supporting more than one device type has been demonstrated.

E

Sasha is also on the line, yeah.

K

So we are planning to work on it. I'm not sure how fast we will be able to open source our plugin to be able to demonstrate it, but yeah, it's in our plans.

B

Functionality and I'm like highlight the any changes that is necessary, but that should be good I guess yes,.

K

Yes, I think in the 1.10 time frame we will be able to do that.

B

So there's at least one person or a group of people who are going to spend some energy in this area.

A

I guess, how firm is the commitment? I mean, there are a lot of ways we can handle this, but if we find ourselves at the end of 1.10 without a clear alternative device plugin, I'm not sure we should continue to hold, because at some point, where there's actual demand and need, that seems to be what the community should be directed toward.

G

Would agree with you on that from my side, I mean I've done to at least go there several times to Frank I mean since there's been more cycles on this even has ten sessions. Earlier this week, I wouldn't.

H

Say they're not.

A

H

Have another act yet.

G

In the 1.10 timeframe... The problem with our initial PR was that it touched the pod object with an annotation.

B

No, it's not with an annotation.

G

It touched the pod object, which kind of got a full stop from this group.

A

Ask for the pod object to be passed to the plugin.

A

H

We annotate it with an address. It.

J

A

It didn't touch the pod object. It just asked for more data to be made available.

B

The problem is not passing the data. The problem is clearly documenting the scope of device plugins, which is a little bit murky at this point.

A

If we scope device plugins to FPGAs and hardware accelerators, is that sufficient scope? I mean, I guess that's where I sense the direction of the community scope is heading, but...

K

Even talking about FPGA scope, we still need a bunch of additional parameters that need to be passed to the device plugin to properly operate the device.

D

We can discuss them in more detail, like what's the right way to send those parameters between the kubelet and the device plugin. I just feel like sending the container ID or the pod spec may open the door too wide, but maybe we can discuss the particular use cases and see whether we can come up with some solution.

A

That was the last topic on today's agenda. The only other topic I want to raise is that I do not think we should hold any more meetings for the remainder of the year, so just a heads up on that. And then another topic I want to have people think through a little bit: right now we have a weekly cadence for this meeting. I think, given what we've discussed about the community's interest around 1.10 and where our focus would lie, if we were to move to a biweekly meeting.

A

Would there be objections to that? Or, you know, given the new year, we have some time to think if we want to do anything differently. So if people have suggestions on how we could run the group better or more efficiently, or anything in that regard, please just reach out to both me and Vish, but let's use the new year to do things better, if possible. Yeah.

B

I'd like to second that. Maybe we just move to a biweekly cadence to begin with, unless someone objects, and we can always have more focused meetings.

B

If people are collaborating on specific items in an agenda, then they can always have their own meetings. The only request from Brian has been that those meetings should also be documented, so if you're having those meetings, just document them in our community page.

A

Okay, so with that, everyone have a great holiday, and I will see you in these meetings next year. I will update our invites to go to a biweekly cadence and try that out to begin 2018.

F

I still have a topic to discuss today. Did I miss something on the agenda?

F

It's very quick, and I actually talked about this before: it's about the topology of the GPUs you are using. Actually, we are planning to move all of the Microsoft Research machine learning workloads to Kubernetes. It's a very big plan, but every machine learning training job actually depends on this kind of

F

GPU topology, for example NVLink, and we have papers and benchmarks to prove that this kind of acceleration can improve training performance to a very high level. But the blocker we see here is that, in our environment, we cannot use device plugins at all, because with the device plugin you can only say "I want two GPUs", for example. What we want is to use two GPUs which are connected by NVLink, so it's kind of a different requirement.

F

So, although we are using Kubernetes 1.9, we cannot use device plugins. What we are doing today is that we have a CRI shim. It's very ugly, because we have to copy most of the logic from the device plugin and write it in the CRI shim to do that. So what I want to discuss today is whether the topology stuff can be handled, in most cases, by the resource class proposal.

F

Right? For example, you can say that I want two GPUs linked together. It's kind of like a resource class, but for topology stuff.

B

I think so. You should be able to go find.

B

Coverage on exactly this topic, which should help you come up to speed on what the community consensus was. And, lastly,

B

You can have this conversation again after New Year's, or over email or something. The TL;DR is that we decided not to deal with topology at the cluster level and to do it at the node level, like have the kubelet sort of handle a graph and do some scheduling. But if you read through those notes, you'll sort of understand the rationale behind that.

B

So if you have some specific things, then you could probably highlight just those in the conversation.

A

To be fair, the resource class proposal did come out of that meeting, right, and it was prototyped. So if you're saying the idea that's germinating is something that you would get value out of, I guess that's good feedback. But if you haven't had a chance to read through those notes, Harry, I'd be happy to point you to them. It's amazing how many requests I get to be able to get access to those, because people don't want to join kubernetes-dev, but that's not an issue. Yes.

D

And I also want to understand, Harry: are you interested in doing this on-prem, or are you trying to use some cloud provider environment? Because I think there are certain ways to work around the problem even right now. Say, you can create some cluster with particular hardware and group the nodes.

B

A subjective question of which one is better, with some facts, and less about "we need to do this."

B

Because the alternative is that we spend an hour just talking about this.

B

It might be a more effective use of time.

F

That makes sense yeah.

D

I think I want to learn the particular pain points of the current way of doing this. I can see that there may be issues, and that more automation may be required in the future, which could also be valuable. But I would also like to hear from you, given that you have good experience on this. Just describe the pain points you have. Mm-hmm.

F

That makes sense.

E

Speaking of following up: at KubeCon, a few people mentioned a next iteration of an in-person meeting. I was wondering if people were interested in that next year, and

E

Timeline, maybe March-ish, February, or.

A

I would be interested. I can't say on a timeline right now. When is KubeCon Europe?

D

Is it possible that we meet up someplace closer?

A

No, no, I'm just trying to think what the conference dates were. I thought it started with an M, and I wasn't sure if it was earlier. And I know there's an Open Source Leadership conference in March, but I would be interested in having a get-together again. I thought that was really constructive.

B

One thing: make sure that if we do it, that we have.

A

Enough of a breadth of topics. Vish and myself, I think he and I can vouch that we had to do multiple weeks of preparatory work for that face-to-face, which was rather exhausting. So it takes time to coordinate those to be as productive again. I'm in favor of them, but we just need to make sure that they have enough lead time and topic space.

B

Like trying to work towards so on so maybe like doesn't be the best period because we can actually plan.

A

That sounds good to me. Let's get back to each other on that next year. And with that, sorry, Harry, for missing a topic, but I know you can read the links.

F

Yeah, yeah, I noticed that you have updated the proposal, yeah, and.

A

With that, everyone have a great holiday and I will see you all next year.

B