Kubernetes SIG Scalability, 10 Jun 2021

From YouTube: 2021-06-10 Kubernetes SIG Scalability Meeting

Description

Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?ts=5d1e2a5b

A

Okay, I think we're good to go, so welcome, everyone, to our bi-weekly SIG Scalability public meeting. Today is 10 June. I think I have already let everyone in, so let me actually present the meeting notes.

A

All right, can you see my screen with the meeting notes?

A

Yeah, Matt. Okay, perfect! It's always confusing when I do that on Zoom. All right, Arnold, I see you already added something to the agenda, so go ahead.

B

Matt, we wanted to discuss one of the issues we have raised, related to a couple of test cases we added a couple of months back for network performance metrics measurement.

B

At that time, I think you were on leave, and Oz was handling that issue, in the sense that we uploaded the test cases and the code, he reviewed it, and all that code is committed. But what we see is... okay, let me paste that issue here so that we can have a discussion.

B

So we have used a couple of tools, like iperf, to measure the UDP latency, packets per second, and all those measurements. What we see is that, for the latency, negative values are being reported in the run. When we searched on the internet, we saw that time synchronization could be the problem; that is what people are saying. But Ozzy was saying that all the worker nodes are perfectly time-synced in the CI setup where these test cases run. So there has been no progress on this issue. We asked him whether we can disregard those negative values so that we can keep going and keep reporting this metric, because, out of many pod pairs, maybe 50 percent of them, from what we have seen, report negative values, so we can at least use the other, positive values for the calculation and go ahead. That was our question in that issue, but I think Oz seems to be busy, so he hasn't had time to look at this. That's why we joined this meeting today, to ask whether he could take a look.
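Editor's sketch, not spoken in the meeting: the proposal above (compute the metric from the positive samples only) could look roughly like the following Python. The function name, the sample values, and the choice of the median are assumptions made for illustration; none of them come from the perf-tests code. If a clock offset between nodes is what produces the negative values, the positive samples may carry a similar error, so the result should be read as a rough figure at best.

```python
import statistics


def summarize_positive_latencies(samples_us):
    """Summarize one-way latency samples, ignoring non-positive values.

    samples_us: measured latencies in microseconds, one per pod pair
    (hypothetical input shape, chosen for this sketch).
    """
    valid = [s for s in samples_us if s > 0]
    dropped = len(samples_us) - len(valid)
    if not valid:
        # Nothing usable: report the drop count and no median.
        return {"count": 0, "dropped": dropped, "median_us": None}
    return {
        "count": len(valid),
        "dropped": dropped,
        "median_us": statistics.median(valid),
    }


if __name__ == "__main__":
    # Made-up samples: half of the pod pairs report negative latency.
    print(summarize_positive_latencies([120, -300, 80, -50]))
```

Reporting the number of dropped samples next to the median keeps the amount of discarded data visible on a dashboard.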

A

Could you share the issue? I'm vaguely aware of the things you're talking about, this network performance testing, but, as you said, I don't think I was the one reviewing.

A

I think someone else was doing that. So if you can drop it here, that would be perfect.

B

Pasted it in the chat.

A

Yeah, is this the one? No, that's something else.

A

No, the number is one eight...

B

One eight... I have actually pasted it in the chat window.

A

On which platform? Sorry, the Zoom chat room, or...?

B

Yeah, Zoom, yeah.

A

Sorry, this Zoom window is always open.

A

Okay, thank you for bringing this to our attention. I will make sure someone takes a look at this and helps you unblock that. I don't think we will be able to debug it now, but definitely...

D

Just to add a few more details: we raised at least one issue against the tool itself; the sourceforge.net link points to it. What they were saying is that, I think, by default GCE uses NTP, but I'm not sure whether the place where we are running, the SIG Scalability CI/CD pipeline I mean, is really using NTP or some other clock synchronization algorithm. So I'm not really sure, but I think the default is NTP.

D

So what they were saying is that, with NTP, even in the best case the clocks can be off by milliseconds, while what we see in the test is latency of around hundreds of microseconds, so...
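Editor's sketch, not spoken in the meeting: the point above can be illustrated with a small Python simulation. A one-way latency is a receive timestamp from one node's clock minus a send timestamp from another node's clock, so any offset between the two clocks is added to the result. All numbers are assumptions chosen for illustration (a true latency of 200 microseconds and a clock offset of up to 1 millisecond in either direction), not measurements from the CI setup.

```python
import random


def measured_one_way_latency_us(true_latency_us, receiver_clock_offset_us):
    """Latency computed from two different clocks.

    The send timestamp is read on the sender's clock and the receive
    timestamp on the receiver's clock, which runs ahead of the sender's
    clock by receiver_clock_offset_us (negative means it runs behind).
    """
    send_ts = 0.0
    recv_ts = send_ts + true_latency_us + receiver_clock_offset_us
    return recv_ts - send_ts


def fraction_negative(true_latency_us, max_offset_us, pairs, seed=0):
    """Share of simulated pod pairs that report a negative latency."""
    rng = random.Random(seed)
    negatives = 0
    for _ in range(pairs):
        # Assumption: offsets are uniform in [-max_offset_us, max_offset_us].
        offset = rng.uniform(-max_offset_us, max_offset_us)
        if measured_one_way_latency_us(true_latency_us, offset) < 0:
            negatives += 1
    return negatives / pairs


if __name__ == "__main__":
    print(measured_one_way_latency_us(200, -1000))
    print(fraction_negative(true_latency_us=200, max_offset_us=1000, pairs=10000))
```

Under these assumptions a large share of pairs comes out negative whenever the possible clock offset is bigger than the latency being measured, and none do when the offset stays below it.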

A

Let me write this down; that's a good point. Actually, that might be a time precision issue.

A

Okay, so someone from SIG Scalability will take a look, and we will try to help you here. Going forward, I assume this is blocking you, or maybe [unclear]. So what's the long-term plan here? Basically, are we currently running this test continuously, using our infrastructure? Is that what you're saying?

D

Right now it is running, but it's not getting displayed in Perfdash. I think maybe this negative value is the hindrance.

A

Okay, I see like yeah yeah, so.

B

There was code written to display these metrics in Perfdash. Even that one is not yet merged by Oz, I think. I mean, is it merged? No?

D

It is merged. I think Perfdash has to be upgraded, and that version has to be the one running for the test. That's fine.

A

Okay, so two weeks ago, or maybe four weeks ago, we did have this Perfdash issue, which was documented here. We actually needed to roll back to the previous version because it started consuming too much memory. I'm not sure whether this is resolved or not. I can check on that.

A

Maciek is not here today, but he was working on it. This is closed, and the question is whether we upgraded again. So just double-check whether your changes are there or not. It's possible they are already there; if not, just let me know on Slack.

D

The only thing is that, because of the negative values, I don't know how the dashboard would look. It will be weird to see that.

A

We should definitely have the names of the measurements you're using here, right? Do you know what the name of the test is?

D

Network measurements. It should be in the thousand-node one.

A

I mean I haven't seen it.

D

Running so I don't know if.

B

But that version of Perfdash is not there, right? So I...

A

I think what's important is that it was updated. We deployed the new version two weeks ago or something like that, so if the change was committed, it should be here. We are basically building from master. So, anyway...

A

I will make sure that we help you here.

A

All right, let me just write it down too.

A

Okay, if you can find the PR where you made the Perfdash changes, that might be helpful, so that we can easily check whether the version we currently have contains it or not.

B

But it was merged long back, right? Quite a...

A

We can check that offline. Basically, just let me know on Slack; you can send me the PR number. But if you say it was merged a long time ago, then it should be there. Then the question is whether the PR did everything that is needed to actually have the jobs displayed, because Perfdash is not a great example of good code: the config is hard-coded, and there are many places you need to change if you want to add new tests, so it's very error-prone.

A

It's likely that we missed one place or two. Anyway, I will get back to you on Slack and we can figure it out.

A

So, thanks again for bringing this to our attention. Arnaud, do you want to shed more light on the migration of scalability jobs to the community infrastructure?

F

Okay, so hi everyone. My name is Arnaud Meukam, you can call me Arnaud, and I'm the co-chair of the k8s-infra working group. The main mission of this group is to migrate any resource running on internal Google infra to the new community infra.

F

We identified some Prow jobs using special projects in GCP, and we would like to migrate those jobs to the new community infra. We realized that we need to raise some quotas on GCP before doing that.

F

I saw a comment saying that there is a list of quotas, besides CPU, that need to be raised. So I would like to know whether you have a specific list of quotas to raise, or whether we just need to discover them over time.

A

Yeah, I don't think we have documented anywhere the exact list of which quotas should be raised. Off the top of my head, there will be the storage quota. Basically, the issue is that we run this super large test at a scale of 5,000 nodes, so to spin up 5,000 VMs you need a lot of CPU quota, that's obvious. But I think there is also a quota on the number of VMs in a network, and, if I remember correctly, there are quotas for disks. The standard quota definitely won't be enough to spin that up.

A

Okay, probably there is something more. I see that you are working with [name unclear], so that's actually a great guy to work with. If you can create a doc and share it with us, then we can definitely check the projects we have and fill in the quotas there, and then you can proceed from there.

C

So, actually, a few days ago I noticed this bug and shared it with the team, so I think almost everyone from SIG Scalability is aware of this issue. If you need help understanding what kind of quotas you need in the new project, then I think we can discuss it on Slack and help you. But, as Matt said, there is basically no specific list. We will just need to figure it out.

F

I think, basically, as I said in the pull request, I will create a new document and share it with the mailing list of this group, so we can just...

A

You can share it with Marcel as well, and we will take care of it.

F

Except for quota, do we have something else, like a specific bucket or a specific permission, that we need to deal with, or...?

F

Yeah, I know about a bucket that needs to be migrated, but I wondered whether we need something else.

C

Yeah, there's one more bucket, I think. When we were checking through the issue, there was a list of buckets that we also need to migrate, and I know that one of our buckets was not included in this list. I think there is a bucket called sig-scalability-logs, and that is also a bucket that needs to be migrated.

C

I think we posted the name of this bucket in one of the issues about buckets.

F

Okay, I will look for that later.

F

I think that's all from me. Thank you for the help.

A

Yeah, thank you for doing that and for running this effort to migrate the jobs. As Marcel said, just start a doc and we will make sure it's filled in with the requirements for migrating the tests to the new project.

C

Yeah, and if you have any questions, you can write to me, and either I will try to answer or I will ask someone else to help.

A

That would be perfect. All right, cool, thanks. Okay, Abu, I see that you already added a comment. I wanted to ask about this. What I see is... okay, so you are going to spend some time on it in 1.23, right? So right now you are not working on that. That's completely fine! I just wanted to ask, more or less, about the timeline.

H

Yeah, definitely. Actually, right now I'm working on the support for list. Yeah, I definitely plan to spend some time on it.

A

Is it the KEP that Wojtek opened for the server? Okay.

A

Okay, that's cool. All right, and last but not least, Marcel.

C

Yeah, so actually I wanted to talk about the bug scrubbing process. I think that currently we don't have any, and because of that we might be missing a lot of issues. So actually...

A

An example is the issue that was the first thing we discussed today, right?

C

Yeah, exactly. And I think two days ago I also found an issue that had been opened two weeks earlier and that seems pretty important to me. It was some kind of memory leak with watches, so I believe it was important to at least look at it and maybe redirect it to someone else. But basically, right now we don't have good visibility into what kind of issues are assigned to SIG Scalability.

C

There are a few reasons, in my opinion, why that is.

C

First of all, we have at least, I think, three different repositories, right? We have test-infra, perf-tests, and also kubernetes, where there are multiple issues that are later assigned to SIG Scalability. So I did a little bit of research and found out that, among the other SIGs, there is one in particular, SIG CLI, that is using a tool called Triage Party, which is an open source tool. Basically, what it allows you to do is this: you have tabs, you go through them, and you can see all the issues that are not triaged.
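Editor's sketch, not spoken in the meeting: the kind of view described above (untriaged SIG Scalability issues collected across several repositories) can be pictured with a few lines of Python. This is not how Triage Party is implemented or configured. The label names `sig/scalability` and `triage/accepted`, the shape of the issue records, and the issue numbers are all assumptions made for illustration.

```python
from collections import defaultdict


def untriaged_by_repo(issues, sig_label="sig/scalability",
                      triaged_label="triage/accepted"):
    """Group issue numbers by repository for issues that carry the SIG
    label but have not been marked as triaged.

    issues: list of dicts with "repo", "number", and "labels" keys
    (hypothetical record shape, chosen for this sketch).
    """
    grouped = defaultdict(list)
    for issue in issues:
        labels = set(issue["labels"])
        if sig_label in labels and triaged_label not in labels:
            grouped[issue["repo"]].append(issue["number"])
    return dict(grouped)


if __name__ == "__main__":
    # Made-up records; the numbers do not refer to real issues.
    sample = [
        {"repo": "kubernetes/perf-tests", "number": 1,
         "labels": ["sig/scalability"]},
        {"repo": "kubernetes/kubernetes", "number": 2,
         "labels": ["sig/scalability", "triage/accepted"]},
        {"repo": "kubernetes/test-infra", "number": 3,
         "labels": ["sig/testing"]},
    ]
    print(untriaged_by_repo(sample))
```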

C

So I think this would be a great help to us, but that's probably not enough. A tool is just a tool; we would probably also need to spend some time on it. What I would like to do is start a document, and I also want to discuss it. What do you guys think about it? What is your opinion? How would you like to see it? For example, maybe some part of this meeting could go towards the bug scrubbing process. What's your opinion?

A

So I'm very supportive of this. As I said, we had a great example even today, and you mentioned another one, and I also remember that it happens a lot that we have some bug open and sometimes we just miss it. That is something we definitely should address, so starting a process for that is definitely a great idea. When it comes to using this meeting for this, I would say why not. It's actually a good standing item for the agenda, to make sure all the issues are triaged and there is an owner for them. So yes, it sounds like a good idea: starting a doc, proposing something, and then we can discuss, maybe at the next meeting.

C

Yeah, I would say that at the beginning we probably won't be able to triage all those bugs that are in our backlog, but moving forward we can handle new issues, starting from now. And then, if we have some spare time, we could go back and look at those backlog issues, maybe the newest ones first.

A

Yeah, this basically sounds like a great start. Let's start with something simple that we know we can handle, and once it works, we can figure out what to do with the backlog, right?

A

Shyam, I know you're here somewhere. What do you think?

I

Hey, yeah, I think the backlog idea is a good one, because I think this is a problem for us as well. I keep getting pinged from time to time by folks wanting to contribute stuff for SIG Scalability, and I always have to start searching myself.

A

Yeah, exactly. So having a process for that and making sure we have some properly tagged issues for things like this, for example good first issues.

I

Yeah, I started doing this recently for some of the tickets. I just cut a ticket this week, and it seems people are pretty keen to work on good first issues, like within minutes of cutting an issue. The one I cut in particular was around duplicate metrics we had for watch counts. We have two metrics, two different things, both of which count watches. So, very...

A

So if you are already doing it, that's, I think, another reason to have it written down somewhere and to make sure it's coordinated, so that we are not duplicating the work. Even more reason for doing that.

C

Yeah, also, there are multiple issues, and there is a limited number of people in SIG Scalability who can actually do those tasks. Basically, what I'm saying is that it will also help us prioritize issues.

A

Yeah, cool, definitely a great idea. Thank you so much, Marcel, for tackling that.

F

I'm the one running Triage Party for SIG Release. So if you need it, just ping me, and I will walk you through the process of deploying Triage Party on the community infrastructure.

C

Oh nice, okay, great.

F

Thanks. We've been working on this for about a year now. We had some issues over time because, basically, Triage Party has some glitches, but I can give you tips when you're [unclear]. So ping me, maybe next week or in two weeks, and then I can walk you through the process. Oh great, thank you.

A

Thank you. That's very kind of you. All right, cool. Do we have anything else to discuss today?

E

Hi, can I ask a question?

E

I think, yeah. I'm new to this SIG group, so I'm just wondering about the scope of this SIG group. For example, I have a proposal for us to increase scalability by running multiple scheduler instances, because, from my experience, the bottleneck is in the scheduler. Does that fall into this SIG group's scope, running multiple scheduler instances in parallel so that we can support a larger number of nodes?

A

It's definitely a good starting point. The nature of SIG Scalability is that we work with other SIGs, or we do things in the area of some other SIG. I think it's even written somewhere here that we very often do something that falls into the charter of other individual SIGs. It might be that eventually you need to go to some other SIG to also discuss it. In this case, SIG Scheduling is definitely a good place, maybe not a good place to start, I don't know, but maybe they have experimented with running multiple schedulers.

A

So, basically, the bottleneck we are talking about is the pod throughput, right? Like, your...

A

Yeah, what is the throughput that you are trying for when you're running?

E

Yeah, so basically it's the latency. If we have one scheduler, the latency is very long. Everything is kind of blocked there and queued there. So I think if we can run multiple schedulers, that will help solve the problem in parallel.

E

So that's one thing. Another thing we could do, though I'm not sure whether it falls into this scope, is the vertical autoscaler. We have the horizontal one, HPA, and VPA, right? For VPA, as far as I know, it currently needs a kind of reboot if you want to increase the boundary of the pods; that's my understanding. So if we could have a mechanism that doesn't need the reboot, I think that would be great. But does that fall into this group, or does it fall...

A

Another group. There's SIG Autoscaling for issues like this, so HPA, that's there.

E

Which group? It's called auto...? Oh, there's a SIG Autoscaling.

E

Another general question: is this channel open to everyone? I mean, for every SIG group, is it open to everyone, or do I need to get access approval, something like that?

A

Sorry, for what exactly? If you could specify, sorry.

A

If you could, could you repeat the question? Oh.

E

Okay, so, if I need access to the Slack channel...

A

No, it should be open to everyone, actually.

C

Yeah, the SIG Scalability channel has almost 2,000 people, so...

E

Oh, okay. So you folks discuss problems on the Slack channel and also in the Zoom meeting, is that so? Or do you have other vehicles for discussion? What is the most commonly used way?

A

These two are the most commonly used. Obviously, we also discuss things on GitHub, right? Either in PRs, or you can open a KEP, a Kubernetes Enhancement Proposal, and then we can basically review it and discuss there. So these three are, I think, the only ones. Sometimes we also write docs and discuss there, but it's pretty rare.

E

KEP, PR, and then... but before that, if I would like to bring some ideas for discussion, should I just join this meeting, or should I go to the Slack channel? What's the common way you folks are using?

A

I think starting with the Slack channel, writing down what you want to do or what problems you are facing, is probably a good starting point. If it requires more discussion, or a detailed discussion, then you can add it to the agenda in this document and we'll discuss it during the next meeting. The thing is, our TL, Wojtek, is not here today, but usually he is, so he would definitely be a good person to have in this discussion. He has basically been in Kubernetes from the beginning, or almost the beginning, and he knows a lot of stuff.

A

Ideas like running multiple schedulers will surely ring a bell in his head, and he will know whether someone already tried this or what the potential issues are. He will definitely know things about it. So if you write on Slack, then we can make sure that Wojtek takes a look.

A

If it requires more discussion, then we can discuss it in two weeks.

E

Okay, so if I'd like to discuss something, I can just add it to these meeting notes, right? Is it open to everyone? I do not have access to modify it.

A

You should have comment access, I believe, right?

E

Just the comment access, okay, I see.

A

You can propose changes as well; people are doing that during the meeting. I think it should work, but I will double-check on that. If not, I will make sure it's open to everyone to propose changes.

E

I see, got it. Oh, and the Slack channel too, right? I mean, that's another way to discuss ideas. Okay.

A

Yep, that's another way, exactly. Okay, good.

E

Yeah, thank you.

A

You're welcome. All right.

F

Just quickly, I added in the Zoom chat the links for SIG Scheduling and SIG Autoscaling. So if KD wants information about those two groups, they have their communication information in the links I posted. What happens is that you join the mailing list of any of those SIGs and you will get an invitation in your calendar for the next meetings. And in every README we basically have the link to the Slack channel for you to join.

E

Okay, great! Thank you so much.

A

Okay, all right, thank you. We are out of time, so thank you, everyone, for joining us today, and I hope to see you in two weeks. Thank you.
