From YouTube: wg-k8s-infra biweekly meeting 20200415
A: I want to remind everybody about our code of conduct, which we can summarize as: be excellent to each other. I don't see anyone new, but to be sure, there will be a time for small introductions. So if you are new and want to introduce yourself, there will be a few minutes to do that. I think we can start with the billing review, because we didn't do it [last time]. Also, please add your name to our agenda if you are on this call. And then, let's start with the billing review.
C: Thank you. Sorry, could you say that again, Bart? I didn't understand what you were saying. The billing report that you were showing: can you paste the link to that in chat? I don't have it in this laptop's history, and it made me realize I don't think it's actually linked anywhere.
D: Hi, yeah. There hasn't been a whole lot of updates: after the flip happened and we rolled it back, we've been trying to unravel any other lingering dependencies that we were unaware of, which caused the rollback in the first place. That's an ongoing thing right now. Other than that, I can say the backup jobs have been restarted, they've been running fine, and I've been clearing out the old time-stamped backups. I think that's about it.

We expect the next flip to happen soon, but I can't really give any guaranteed dates right now. We're hoping to do it on a Monday, so it'll either be next Monday at the earliest or the Monday after that. The reason is that the rollout, the flip itself, takes a number of days; typically it should take four days, and we want to catch any errors during the work week.
A: Thank you. Okay, so my first item is the billing report. There is not much progress, because I was unsure; I don't want to work on the live document, so I tried to clone it and play with that, but I got an error. A very uninformative one: "dataset configuration error", with no further information. So I assume it's related to some kind of permissions. Maybe, team, do you know if I need to have some more permissions to clone the billing report and play with it?
E: I have no idea; Justin set that all up. I have not touched the billing report. The auditor/accounting group is supposed to have enough permissions to do all of this accounting. So if there's a permission that you're missing, we can figure it out and add it to the accounting group, but I don't know off the top of my head what it would be.
A: Another update from me, about automating DNS reconciliation: as we discussed yesterday, I tried it. The idea was to use a bash script which would run in a container, as a sidecar next to [git-sync], and whenever the DNS directory changes, the bash script would call octoDNS to reconcile the DNS records.
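A minimal sketch of that approach, assuming hypothetical image tags and paths; the repository URL, symlink layout, and octodns config location are illustrative, not the working group's actual setup:

    apiVersion: v1
    kind: Pod
    metadata:
      name: octodns-reconciler
    spec:
      volumes:
        - name: zones
          emptyDir: {}
      containers:
        - name: git-sync
          # Keeps a checkout of the zone configs up to date.
          image: k8s.gcr.io/git-sync:v3.1.6          # hypothetical tag
          args:
            - --repo=https://github.com/kubernetes/k8s.io
            - --root=/zones
            - --dest=current
          volumeMounts:
            - name: zones
              mountPath: /zones
        - name: reconcile
          image: octodns/octodns:latest              # hypothetical image
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Re-run octodns-sync whenever git-sync swaps the checkout.
              last=""
              while true; do
                cur=$(readlink /zones/current 2>/dev/null)
                if [ -n "$cur" ] && [ "$cur" != "$last" ]; then
                  octodns-sync --config-file=/zones/current/dns/octodns-config.yaml --doit
                  last="$cur"
                fi
                sleep 30
              done
          volumeMounts:
            - name: zones
              mountPath: /zones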
A: The problem is, I feel like this would be a little bit too hacky. I spent some time trying to understand prow jobs and how we could achieve this with prow, and it looks like it should be easier. But the problem I faced is that I don't understand how we can connect octoDNS, which needs the service account credentials, with the prow secret. So this is the question, if anybody here knows.
E: We can use the workload identity feature, which basically says a Kubernetes-native service account can act as a Google service account without having tokens for it. Basically, we use the [OIDC identity] from the cluster as the authentication, so we can do that. We've done that for a few other things like prow, where we've enabled other clusters, even outside of our own [organization], to be able to do this. Yeah.
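For reference, the Kubernetes side of Workload Identity on GKE is a single annotation on the service account; the names and project here are hypothetical:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: octodns                       # hypothetical KSA name
      namespace: dns
      annotations:
        # Lets this Kubernetes service account act as the Google service
        # account, with no long-lived key mounted anywhere.
        iam.gke.io/gcp-service-account: octodns@my-project.iam.gserviceaccount.com

The Google side then needs a matching roles/iam.workloadIdentityUser binding on that Google service account for the member serviceAccount:my-project.svc.id.goog[dns/octodns].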
A: I will play with that. Another item is the artifact server progress. My actual item was to do some research about what we need and what we should do; I didn't start doing it, I focused on different items, so no progress so far. And the next topic is moving [the service] to the aaa cluster. It's currently running on the canary subdomain [canary.k8s.io].
A: It looks like the data are consistent with the original [k8s.io] deployment; I confirmed that [unintelligible]. The only problem we have right now, which the team is debugging currently, is that the people from the Google Group, it looks like they can't access the namespace in the aaa cluster. Once they can access that, they will have the option to update the project. So when we figure that out, it looks like we can switch the DNS to the aaa cluster.
A: Also the publishing bot: it looks like we have most of the things needed to just deploy it. There are two problematic things. One is: we currently provision the infrastructure with terraform, and if we want to add some resources, like a storage class, should we add the storage class via terraform, as was suggested?
E: This is one of those resources that is halfway between infrastructure and applications, right? I'm not sure that I'm fond of using terraform to manage deployments and services, because those things are intended to be living resources. A storage class really kind of isn't; it's really kind of part of the infrastructure, and it is really sort of part of the cluster definition. The fact that we have node pools that terraform can manage, but storage classes that it can't, is an artifact of the implementation.
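For context, the resource being debated is only a few lines of YAML either way; a minimal sketch of a GCE persistent-disk storage class, with an illustrative name and parameters:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ssd                          # hypothetical name
    provisioner: kubernetes.io/gce-pd
    parameters:
      type: pd-ssd                       # SSD persistent disks
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer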
C: [I'm taking] notes, because I want to walk and talk at the same time, so cherry-picking out actions can be difficult. What I'm trying to do right now is just highlight things and assign them to people; that way, after the meeting, I know to go back, take everything that was assigned to people, turn it into a GitHub issue, and assign it to that person. So it would be helpful if we kind of consistently highlighted action items like that.
A: Okay, we can try that. From the next meeting, I will update this invite and we will try to do it. I know it will be hard, because there are not many people who like to do notes, so probably there won't be a lot of people who like to do action items either. But we will try, and I think it can be helpful.
C: The reasoning for this is that there are CI jobs that use this, and these CI jobs are considered release-blocking. If they're important enough to be release-blocking, they should probably be important enough not to use somebody's personal GCS bucket and account. So Stephen Augustus asked that we use community infrastructure to host these: latest builds of kind, as well as releases of kind.
E: I would say that seems like my preference; I'm just finding the link now. If it's not, I mean, we can change the retention on staging. The point was to make it small enough that people don't treat it like it's permanent. If we needed it to be 90 days or 120 days, I wouldn't object too hard.
A: I don't know if I understand exactly what he means, because it's not clear whether he wants to use only the images, or whether he wants to use some binary artifacts. Because if he wants to use the images, I don't see any problem with just using the production bucket for canary artifacts, exactly.
C: So we have two columns there: the images, like kind's node image and stuff like that, and then there are the kind binaries. Dims, I see your hand raised, but what I'm trying to get at is: we seem to have this well-established staging and then promotion process for images, but we lack the equivalent for binary artifacts. It's unclear to me whether we should just be expedient and say, for people who have binary artifacts, push straight to prod, or whether we should be coming up with some equivalent solution.
E: For the time being, we do not have that, and so there are a handful of things that have been enabled to push to prod, like the release stuff, Stephen's work. The CNI project actually has a different GCS bucket that we created just for them. I think that one was open-coded; I wasn't part of the review on that one, but I saw the result of it. We could do the same thing for kind here: just make a new bucket, create a service account.
E: Great point. Just to be clear, I'm not sure there's a need for a new script; I just didn't look at the original CNI special case when it was pushed in. I came across it as I was doing something else, threw it to the back of my brain to come back and look at, and then forgot about it until just now. So I'm more than happy to have somebody else take a look at it and do the actual PRs.
C: I personally am really sad that it's down, because it's the only thing I know of that has friendly, visible views of data about our tests and our jobs that goes back longer than 90 days. But velodrome is actually this complicated thing that runs a bunch of code to also scrape GitHub and dump stuff into a Cloud SQL store. Did anybody realize this? It took a week; we kind of discovered this.
C: We've only been using it as, like, a Grafana-and-InfluxDB thing. So it might be cool if a community member who knew Grafana and InfluxDB stood up a more modernized version of that stack, and we could see if we could use it. But again, given the relative lack of noise about it, we have deemed it a lower priority.
E: So I can leave the data around; I would just like to decommission the old cluster. There are only two things left in it, one of them being [perf-dash], which I will try to resolve today, the other being velodrome. If velodrome is deactivated, then there's literally nothing running in that old cluster, and I can scale it down to zero nodes or something, or get rid of it entirely, and just save the volumes that the velodrome jobs are using.
A: Okay, so the next topic is: I would like to build some consensus about moving the slack infra to our new cluster. I did some research about it, and it looks like it should be an easy thing to do; the only thing we lack right now is the secrets. So we started a discussion yesterday, and I also discussed the topic with [Katharine] today, because the cluster admins will have access to the slack secrets, and we need to decide if that's okay for us. Of course, this is only seven people or so.
E: I mean, we could go off and use a different key management system and give different groups access to it, and not cluster admins. But I think the point of cluster admins was that it's a small group of people that we trust to manage the cluster and everything in it. Like you said, you already have access to the SSL secrets, right?
A: Okay, so I feel like we have a consensus. The action item we need is: it would be good to have the six secrets and the one config map in the repository; they're not present there. So I don't know who has access to this cluster and could add these to the repository, encrypted with [git-crypt].
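For reference, the sort of object in question; a minimal sketch of one such Secret, with hypothetical names and a placeholder value (whatever encryption tooling is used, only the encrypted form would be committed):

    apiVersion: v1
    kind: Secret
    metadata:
      name: slack-oauth                  # hypothetical name
      namespace: slack-infra
    type: Opaque
    stringData:
      # Placeholder only; never commit real tokens in plaintext.
      token: "<slack-oauth-token>"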
A: I didn't want to, but I did the work. I think I checked all of the documents; I put in links to where these secrets are being used and in which manifests. So there are, as you can see, six secrets and one config map which are not present.

[?]: Are we talking about the one for the publishing bot?

A: No, this is just for the slack stuff. Okay.
A: Also, I have a question, because at least as far as I can see now, they are all in separate namespaces, and I don't think it would be necessary to put them in separate namespaces in our aaa cluster. So I would suggest we use a slack-tools or slack namespace and put it all there.
A: In different namespaces, you mean? Yeah; per the convention we have right now, we would have to create four different directories in the repository and write documentation about how to deploy each of them separately. If we put them in one namespace, we could just write one readme file with deployment instructions. I don't see any need to have separate namespaces, but maybe I'm wrong.
A: The next topic is: I started digging a little bit into two projects which I see in the issue with the inventory of the clusters and projects, and I don't have a lot of knowledge about them. These are kettle and greenhouse, and these two are the last two, not counting prow and boskos. The first, kettle: it looks like it should be easy to move.
C: Okay, I can talk about both of these. Kettle would maybe be, like, the last thing I'd try moving. It is very brittle and very creaky; nobody has touched the codebase in years at this point, and it has, I think it's up to, a 500 gig volume, for reasons that nobody has taken the time to investigate and refactor.

A: Okay, can you say two sentences about what it is and where it runs?
C: The complicated part of it is: because it's so creaky, there's no quick and easy way to know whether or not it works. A complete restart of kettle, to get it up and running again, takes about 10 to 11 hours. I'm sure somebody who knew what they were doing, capable of hacking the codebase, could maybe bring that down. But again, it's kind of been a lower-priority thing for us.
A: I will look a little bit more into the code and try to understand it a little better, because at least as far as I looked at the code, it didn't seem that hard to understand.

C: Famous last words, my friend.

A: Yeah, I know, I know. Because of that, I'm not suggesting doing it right now, but I wanted to understand it. Okay.
C: So that's kettle. Greenhouse is a homegrown distributed cache for Bazel. It's something that we would deploy when we have a build cluster, so it's not something you would deploy in the aaa cluster; it's something that's solely there to service prow jobs running in a prow build cluster, yeah.
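A rough sketch of the shape of such a deployment, purely illustrative (the image, port, namespace, and sizing are hypothetical, not greenhouse's actual manifest):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: greenhouse
      namespace: bazel-cache
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: greenhouse
      template:
        metadata:
          labels:
            app: greenhouse
        spec:
          containers:
            - name: greenhouse
              image: gcr.io/k8s-testimages/greenhouse:latest   # hypothetical image
              ports:
                - containerPort: 8080                          # cache HTTP endpoint
              volumeMounts:
                - name: cache
                  mountPath: /data
          volumes:
            - name: cache
              persistentVolumeClaim:
                claimName: greenhouse-cache

Build jobs in the same cluster would then point Bazel at it with something like --remote_http_cache=http://greenhouse.bazel-cache:8080.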
C: Not necessarily. Like I was trying to describe last week, my suggestion would be that we first focus on creating just a build cluster, a basic bare-bones build cluster that's capable of running, let's say, some of the image builder jobs or some of the container image promoter jobs: things that don't have to spin up e2e clusters and things that don't use Bazel. A sketch of what that involves follows.
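Pointing a job at such a cluster is mostly a one-line change in the prow job config; a sketch of a periodic along these lines, where the job name, image, and cluster alias are hypothetical:

    periodics:
      - name: periodic-image-promoter-example    # hypothetical job
        interval: 4h
        cluster: k8s-infra-prow-build            # alias for the new build cluster
        decorate: true
        spec:
          containers:
            - image: gcr.io/example/image-promoter:latest   # hypothetical image
              command: ["/promoter"]
              args: ["--dry-run=false"]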
C: As I say that, actually, I think the image builder jobs do use Bazel. Then next we could stand up greenhouse and try running jobs that use Bazel and would benefit from having a Bazel cache. And next we can stand up boskos, and then provision a bunch of projects for boskos to manage, so that we can start running e2e jobs that stand up clusters in those projects.
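For reference, boskos hands out resources from pools declared in a config file; a minimal sketch of a project pool, with a hypothetical type and project names:

    resources:
      - type: gce-project
        state: free
        names:
          - k8s-infra-e2e-boskos-001   # hypothetical pre-provisioned projects
          - k8s-infra-e2e-boskos-002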
G: One reason might be quotas. We ran into that with the backup job, where we hit quotas and that resulted in other things failing. In my mind that could happen here: let's say the trusted [cluster] gets out of hand build-wise, and then, I don't know, we can't get another one up because of quotas, and that makes our actual core infrastructure not work. That feels like a case that might actually happen. Yeah.
C: Okay, I will take an AI [action item] on trying to put a plan together for the next meeting. This is kind of what I had said last meeting, and then some things happened which have taken up my time, but which will hopefully give me a lot more insight into exactly what prow needs out of a Google Cloud project. So I'll try to have something prepared for next time.
A: Great. And the last topic is the monitoring. We've started discussing it a little bit with Scott; I suggested that Scott take the initiative and suggest some things, and he did. I mentored a little bit; he and I have permissions to do some stuff, so we want to have some proof of concept fairly soon. Scott, any update?
H: No, I got you the configurations you can apply, so we can try that soon. And then I posted in the Slack about any kind of Stackdriver alerts that we want to get going. I don't think I got any replies, last time I checked, but yeah: are there specific Stackdriver alerts that we can get started either applying or defining in terraform? I mean, is it just standard CPU and memory usage alerts, or are there other specific kinds of alerts that the group would like to see?
C: Yeah, I kind of agree with Michael. I feel like starting with resources first, without understanding what a normal operational workload looks like, could lead to us having a lot of noise. I come from the school of thought that says: if you're monitoring an application, you should be monitoring what that application is expected to be doing.
E: [Nobody has] had the time to go through and figure out what the right CPU allocation is for the nginx instances that are running k8s.io, to say "oh, I know it doesn't need as much as I've given it." So having graphs that show what the resource usage was over an extended period of time means we can go back through and say: well, on average it uses less than half a core, so let's give it a half.
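The outcome of that exercise lands as ordinary resource requests on the deployment; a minimal sketch with illustrative numbers:

    # Excerpt of a container spec; values are illustrative, and would be
    # derived from observed usage rather than guesses.
    containers:
      - name: nginx
        image: nginx:1.17
        resources:
          requests:
            cpu: 500m        # "less than half a core, so give it a half"
            memory: 128Mi
          limits:
            cpu: "1"
            memory: 256Mi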