From YouTube: k8s-infra-team's Biweekly meeting 20200109
Description
wg-k8s-infra recurring biweekly meeting
E
So the main piece that needs to be done is launching, or I should say creating, e2e tests for the auditing mechanism, and then afterwards launching it on prod. When we do launch it on prod, we need to make sure that if it fails or errors out, permissions are set up on the project such that people, or a group, are notified about those failures, not just me. That part can happen at some point in the future.
D
Somebody else sent me a PR for the Terraform stuff for now, because I'm sure that will take some iterations while we convert the scripts over. So I would say: go look in groups.yaml and find all of the... we wouldn't call them roles, but they basically are role groups, and see if one of them maps to this group. It's sort of not the alerting group, now, but who's going to administer the auditor, right?
D
Okay. I don't know if we need a separate one for container registry, or if that should just fall under storage admins. I'm pretty sure we have a... I know we have a storage admins group. If that's sufficient, then that's the group we'll want to add permissions to, to be able to run the auditor.
E
So, off the top of my head, I can't think of tasks that could be done at the same time in parallel, but maybe... I mean, it's already merged in, but people could look at how the auditor works. On the way to work this morning, I just thought of one additional feature, which was a requirement that I don't know if I've implemented: if an image changes in GCR and it's what I call a child image of a fat manifest. So, you know, there's the fat manifest, and then you have, like, child images.
E
There could be, like, ten different architectures, so one of those child images gets pushed out first, which makes sense, because that's how fat manifests are created. That means the Pub/Sub messages coming from GCR will say, you know, "this child image"... it's not going to say "child image", because it's not that smart; it will say this image was added to the registry. Now I need to check, because it's not going to show up in our promoter manifest YAML text.
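The digest-only case being described can be sketched with a small filter. This is a hedged illustration, not the auditor's actual code: the payloads are made up, and the field names follow GCR's documented Pub/Sub notification format, where a child image of a fat manifest arrives as an `INSERT` with a digest but no tag:

```python
import json

def is_untagged_insert(payload: str) -> bool:
    """Return True for a digest-only INSERT -- what a child image of a
    fat manifest looks like when pushed ahead of its manifest list."""
    msg = json.loads(payload)
    return msg.get("action") == "INSERT" and not msg.get("tag")

# Hypothetical payloads in GCR's notification shape.
tagged = json.dumps({
    "action": "INSERT",
    "digest": "gcr.io/my-proj/img@sha256:aaa",
    "tag": "gcr.io/my-proj/img:v1",
})
untagged = json.dumps({
    "action": "INSERT",
    "digest": "gcr.io/my-proj/img@sha256:bbb",
})

print(is_untagged_insert(tagged))    # False
print(is_untagged_insert(untagged))  # True
```

A real auditor would then look the digest up against the promoter manifests before deciding whether the push is legitimate.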
A
Sounds good. So, about the new cluster: I think I'm currently working on checking how things are going with monitoring, whether we have some consensus, et cetera, because for the last few weeks I missed many things. But about the turning-down of all this you have shown, how is the other thing going?
D
The next steps were to... I think somebody sent me a PR to start converting the k8s.io stuff. We created the namespaces; that one actually should be relatively easy. There's not a lot of state there, but I'll have to manually copy the old secret so we can make a clean transition. And there is a list of the other ones; we filed a separate issue for each of them before the holiday was up, and they all seemed relatively approachable. At this point they're backed by stuff we know works.
D
We still don't have monitoring and alerting, but I guess it's not really fair to raise the bar beyond where it already is, so I guess I'm not standing in the way of that progress anymore. Somebody else was also working on the promoter, or rather the publisher, which is the only thing left running in that other dev cluster, and I'd love to turn that off.

I saw yesterday that there's a bunch of PRs that were opened in the last two weeks, some of which I got through email and some of which I didn't; see my tweets about GitHub changing the way their emails work. But I will take a look at those PRs. I probably will not have a chance to until next week, obviously.
D
I don't know what that means either. Maybe I'll use that as a jumping-off point, though. I ran the auditor script just before the holidays, and a few things came to mind. One: this script was logging a whole lot of stuff that wasn't really useful for auditing purposes, so I sent some PRs and tidied that up a little bit; it's certainly not done. Second: because we didn't run it on a regular basis, it had a ton of changes that were very difficult to audit, because they were so large.
D
So I did a pass. I hope that... you know, I didn't find anything that was glaringly obvious. If people want to contribute... it looks like it's relatively sparse day to day, but if there's something people want to jump in and help on, the auditing script, like running through each of the APIs, auditing the things that are important to audit for that API, and producing usable diffs, would be an awesome place for people to contribute.
A
I think that, after this call, I will spend some time tomorrow to create an email to our group about what is happening right now, etc. So I will mention that and pass it on to somebody who can help. Okay, there is no topic... there are, like, open discussion topics here, and there is a topic from [moosh loop] about new accounts for image-builder projects. Can you tell us more?
B
Yeah, I'm [worship]. So, as part of the image-builder project, there are two distinct requirements. One is the ability to add e2e tests for building images, some of which require nested virtualization and some of which require different accounts, and we can probably get around some of those. And then the second requirement is to be able to publish those images. So initially, this would be... I think Timothy St. Clair said he wanted CAPI head and image-builder head to kind of run on Kubernetes head, on a combinatorial basis, so that you're testing the tips of each of those against each other, and then eventually getting to a stage where we can publish images to the community and take some of that workload off of existing subprojects that are kind of duplicating the effort.
B
So, on using an existing GCP account: we want named accounts for publishing. If we're using the AWS or GCP accounts we have currently, we would get a pool from Boskos, if I'm understanding it correctly. It would be a little bit harder to get access to a direct account and to limit who, using Prow, gains access to those accounts. So that's kind of why we can't use the existing accounts and why we would want to have new accounts.
B
So these are... as part of SIG Cluster Lifecycle, there's an image-builder project. At the moment those images get built out-of-band and published from people's desktops, and it works there. Yes, so we want to add e2e testing of those images: add pull-request testing or nightly testing, so that the images produced by image-builder actually run a Kubernetes version.
D
And we can only build them by spinning up VMs as well? Why is that the case?

B
So, for example, if you want to build images using nested virtualization, you can do that on GCP, but you actually need to create a custom image template, which, if it's being spun up in a different GCP account, means every build is going to be quite slow. So you do want to create one image template to host the nested-virtualization host, to be able to build images. That's on the GCP side, and then on the AWS side...
B
So there are two kind of... two ways of doing it. There's that way, which is the Packer model, and then there's another model which uses QEMU and basically downloads a cloud image, spins it up locally using nested virtualization, configures it, and then creates an image out of that, without requiring the cloud resources, but it does require nested virtualization. So there are two kind of different competing models at the moment. Okay.
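For the Packer-style model, a hedged sketch of what the nested-virtualization host image could look like on GCE (project and image names are invented; the `enable-vmx` license is GCE's documented switch for nested virtualization, and `image_licenses` is the Packer `googlecompute` builder option that applies it):

```json
{
  "builders": [
    {
      "type": "googlecompute",
      "project_id": "k8s-example-image-builder",
      "source_image_family": "ubuntu-1804-lts",
      "image_name": "nested-virt-host",
      "image_licenses": [
        "projects/vm-options/global/licenses/enable-vmx"
      ],
      "ssh_username": "packer",
      "zone": "us-central1-b"
    }
  ]
}
```

The QEMU model skips this: it needs no cloud resources at build time, only a host that itself supports nested virtualization.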
B
I think a named account... so, one account per provider. I don't know who handles the Azure ones, but we can probably talk to them directly. Just a named AWS account and a named GCP account that we can potentially give access to only these specific jobs, so maybe they run on some trusted Prow cluster or something like that.
D
So, the pattern that we're following from the other staging projects, with respect to container images, not VM images, is that we have a project... GCP calls them projects, Amazon calls them accounts. We have a separate GCP project for each staging effort. The staging project comes with a GCR registry and a GCS storage bucket, and could come with other things if we wanted it to, which could include the ability to run VMs.
D
But what happens today is we then link it via Prow. So Prow watches your source repo; changes to your source repo trigger Prow, which does some logging stuff and then triggers Cloud Build in your staging project. Your staging project does whatever your Cloud Build needs to do, and then publishes the images to that staging project. So I don't know how much of that maps exactly to VM images.
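The linkage being described (Prow watching the repo, fanning out to Cloud Build in the staging project) roughly corresponds to a postsubmit job that runs `gcloud builds submit`. A minimal sketch, with all repo, project, and image names hypothetical:

```yaml
# Hypothetical Prow postsubmit: on merge, kick off Cloud Build in the
# staging project, which builds and pushes to that project's GCR.
postsubmits:
  kubernetes/k8s-example-repo:
    - name: post-example-push-images
      branches:
        - ^master$
      decorate: true
      spec:
        containers:
          - image: gcr.io/example-images/gcloud:latest   # any image with gcloud
            command:
              - gcloud
            args:
              - builds
              - submit
              - --project=k8s-staging-example
              - --config=cloudbuild.yaml
              - .
```

The `cloudbuild.yaml` in the repo then does whatever the build needs and pushes to `gcr.io/k8s-staging-example`.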
D
So we'd probably want to do something very akin to the promoter process that we're working on for images. Justin, who I don't believe is here, wants to do the same promoter process for arbitrary staging-bucket artifacts, so maybe that covers this. The idea being, again, that we have one true prod, or some number of true prod buckets, and only a bot touches those; as a human, you file a YAML pull request that says...
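That YAML pull request is a promoter manifest. A minimal sketch of its shape (the registry names, service account, and digest here are invented, not the real prod configuration): the bot promotes, by digest, from the `src` staging registry to the prod registry that only it can write to.

```yaml
registries:
  - name: gcr.io/k8s-staging-example        # humans and CI push here
    src: true
  - name: gcr.io/k8s-example-prod           # only the promoter bot writes here
    service-account: promoter@k8s-example-prod.iam.gserviceaccount.com
images:
  - name: example-image
    dmap:
      "sha256:0000000000000000000000000000000000000000000000000000000000000000":
        - "v0.1.0"
```

Merging the PR is the human act of promotion; the bot reconciles prod to match the manifest.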
D
Okay, so we have two options. One: you can follow both branches simultaneously. I would say, go start with the staging repos that already exist for the Cluster API; there's probably a half dozen of them for the different providers. See if there are enough permissions already. My guess is there's probably not, and you're going to run into a brick wall that says, you know, you need access to the compute API or something like that, and then we can talk about how we govern that. The other side of it would be...
D
So, okay then, given my druthers, I would say start with the Cluster API projects. If you don't have access to those already, you can add yourself to the appropriate groups in groups.yaml. Okay, do you know where that is? Sorry, I'm being very vague, like you're new to the group. The groups.yaml is in the subdirectory called groups under the k8s.io repo. Okay, I'll...
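For reference, entries in that groups.yaml look roughly like this (the group name, description, and member address below are invented for illustration); adding yourself is a one-line PR to a `members` list, which the group reconciler then applies:

```yaml
groups:
  - email-id: k8s-infra-staging-cluster-api@kubernetes.io
    name: k8s-infra-staging-cluster-api
    description: |-
      ACL for the Cluster API staging project
    settings:
      ReconcileMembers: "true"
    members:
      - someone@example.com
```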
F
So one thing that is still, like, a preventative thing, is the retention rates. We have a retention of 60 days on staging storage and the staging GCS buckets, which was sort of arbitrarily done, and there's two things which might be nice to get focused first, before we go production on the image promoter. One is the retention for staging images: currently we don't have anything, so we don't clean them up, so it's basically retention forever.
F
We might not want to do that, because it might lead to people using staging images instead of the production images. And the second thing is, the test cluster, or the tests basically, might get pushed into a specific GCS bucket which might need to be around longer; basically, all the logs from the tests, where the retention of sixty days was sort of asked to be lifted, or at least not be added. Does that make sense?
D
So, let's be clear, there's a difference between... unfortunately, the words are very confusing. There's a difference between retention, which means the minimum amount of time you must keep it, and lifecycle, which lets you configure the maximum amount of time before it's automatically deleted. For production, we set the retention to ten years, so anything that gets uploaded to production will stick around for ten years; put a little asterisk on that, because I'll come right back to it. And we set no lifecycle.
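In GCS terms, the two knobs are separate: the minimum keep is a bucket retention policy (`gsutil retention set 10y gs://<prod-bucket>`), while the maximum is a lifecycle rule. A rule like the following (age is in days; bucket names here are placeholders), applied with `gsutil lifecycle set lifecycle.json gs://<staging-bucket>`, auto-deletes staging objects after 60 days:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 60}
    }
  ]
}
```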
D
We don't auto-delete anything from production. For staging, I agree we sort of want to delete them after a certain amount of time. Here's going back to that asterisk: GCR doesn't support lifecycle for images yet. Now, GCR happens to be built on GCS, and GCS does support lifecycle, so I can go ahead and put that in place, and in fact it kind of works: it deletes the image blob, but it doesn't delete the image header. The image header is stored separately.
F
So there should be an open issue; I looked it up. In the docs it also mentions that you can do that on the Google Cloud side, basically, but not on the GCR side, which makes sense, as you mentioned. And the other one I'm still looking up is the test logs: if you want to keep the logs longer than sixty days, something was mentioned like half a year, or is it?
D
The test logs, are these the testgrid ones? And where are we storing those? Do we have a separate bucket for those somewhere? The plan is to have a separate bucket, I think. Sure, I would consider those to be prod logs and set their retention to ten years. There's no reason to delete them that I can think of, yeah.
D
Which isn't that big a deal for the end user. So I'm happy to change or set retention for staging stuff if we think it's useful; I'm happy to fiddle with numbers and move it from 30 days to 6 months if we need to. In the end, I just want as much automation around cleanup as possible, so that we don't end up with a bill we're looking at, going, "Well, I don't know, what do we need? Are these okay?"
D
Yeah, it's true. And honestly, keeping things forever in storage is not a huge deal, other than it gets confusing as to whether people can rely on them being there forever. I feel like somewhere between zero and one year there's a threshold beyond which it effectively becomes permanent, right? So if we have a policy that, like, at six months these things go away, then people will stop relying on them; but if we keep them around for a year, we might as well keep them forever. Yeah.
D
I mean, six months is two releases on the current cadence. I'm okay with that; I don't think that's egregious. It feels a little on the long side, but if we think that dipping between two major releases is useful, then that's fine. Honestly, if you haven't promoted something out of a staging repo within six months, you probably don't need it, but...
F
Before we jump to production, I would probably advocate for setting the retention as low as possible, because increasing it is always better than breaking someone that is relying on the said retention. Sure. So I would go with 30 or 60 days, because that feels like a normal dev cycle; why would you have something that is not pushed to production, to the production bucket or to the production GCR, yet?
D
Yet... well, just let me be devil's advocate: what are we protecting against in putting a lifecycle... putting a retention on staging? Mostly, staging is operated on by humans. So are we afraid that humans are going to come through, do something dirty, and then try to clean up their own mess? Or...
F
So, we had one issue with Docker Hub still having images which weren't updated for more than two years, including the pause image, which was used in, I think, VMware's testing infrastructure, and we completely broke them just by removing a pause image, because they relied on it being there. If we had a retention rate, or at least the expectation of 30 days, that might not have happened.
D
I don't see how those two things are linked. Just to draw the analogy: Docker Hub was a prod-like repo that somebody was assuming was still alive, and it really wasn't, but we don't have a good way to advertise that, nor do we have any sort of retention policy on it. That's not a staging repo; it was never a staging repo. Staging repos explicitly say "staging" or something like that. So, if, like, you're going to production and your tests say k8s-staging-cluster-api, you're wrong in the head.
F
True,
but
it's
like
being
wrong
in
the
head
and
making
sure
that
expectations
are
set
to
different
things
like
I,
remembered
that
that's
some
Etsy,
do
you
like
service
internally,
was
to
hold
tolerance
that
everybody
was
using
it,
so
there
might
have
been
introduced
arrows
to
like
remove
that
huddle
or
to
expectation
that
it's
not
the
reliable
service.
Overall.
Sorry.
D
I
just
realized
I'm
Ari
fifteen
minutes
late
for
my
next
meeting,
I'm
happy
to
discuss
I'm,
not
as
convinced
as
I
might
be,
but
it's
not
also
a
huge
deal
with
the
caveat
that
I
can't
actually
do
anything
about
it
for
container
images.
So
the
example
that
you
used
is
kind
of
moot
anyway,
because
retention
doesn't
actually
work.
So,
let's
take
it
to
slack
or
something
yeah.