From YouTube: Agones Community Meeting - April 2023
A: Welcome to another episode of the Agones community meeting. Oh God, I'm not doing well. Let me take things away.
C: So we skipped last month because of GDC, so we haven't met in a couple of months. We've got a few things on the agenda today.
C: So I just did a quick audit, went through, and filed tickets for all the ones where it seemed like it made sense to promote them, with the idea of mentioning them today and getting some consensus, or, if one of these is a feature you've been interested in or been testing out, especially if it's an alpha. You know, I think the main things we're looking for to go from alpha to beta are that the person who asked for the feature has actually used it and found that it actually solves the problem.
C: Once features are in beta they're on for everybody, but they can still be turned off, and so what we're looking for to go from beta to stable is that it hasn't, you know, broken anybody's deployments. Right, so once a feature goes from alpha to beta and it's on, if it breaks you in some way, you can flip the feature flag and turn it off to make things continue working as they were before. Once we go from beta to GA, that's no longer possible.
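For anyone following along, here is a minimal sketch in Go of the feature-gate lifecycle being described: alpha gates default off, beta gates default on but still switchable, and stable removes the switch. The gate names and the query-string format below are illustrative assumptions, not Agones' exact implementation.

```go
// Minimal sketch of the feature-gate pattern discussed above. Alpha features
// default off (opt-in), beta features default on (opt-out); once a feature is
// stable, its gate disappears entirely. Illustrative only.
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// Defaults follow the lifecycle: alpha off by default, beta on by default.
var defaults = map[string]bool{
	"SplitControllerAndExtensions": false, // alpha: opt-in
	"SDKGracefulTermination":       true,  // beta: on, but can be flipped off
}

// parseGates applies user overrides (e.g. from a Helm value) on top of the
// defaults, using an "A=true&B=false" query-string format.
func parseGates(spec string) (map[string]bool, error) {
	gates := make(map[string]bool, len(defaults))
	for name, enabled := range defaults {
		gates[name] = enabled
	}
	overrides, err := url.ParseQuery(spec)
	if err != nil {
		return nil, err
	}
	for name, values := range overrides {
		if _, known := defaults[name]; !known {
			return nil, fmt.Errorf("unknown feature gate %q", name)
		}
		enabled, err := strconv.ParseBool(values[0])
		if err != nil {
			return nil, err
		}
		gates[name] = enabled
	}
	return gates, nil
}

func main() {
	// A beta feature broke this deployment, so the operator flips it off until
	// a fix lands; once the feature is stable, this escape hatch is gone.
	gates, _ := parseGates("SDKGracefulTermination=false")
	fmt.Println(gates["SDKGracefulTermination"]) // false
}
```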
C: So what we're looking for to go from beta to stable is that they're not breaking anybody. So there are a number of things here. I think I sorted them in the meeting notes: the first set are the ones going from beta to stable, and there are one, two, three, four of those, which are SDK graceful termination, the custom fleet autoscaler sync interval, the state allocation filter, and safe-to-evict. Those have all been beta for a couple of releases.
C: Now, it seems from my gander at the issues that they're all ready to move up to stable. And the three that are currently in alpha that it would be nice to turn on by default, so that everybody is using them, are the split controller and extensions, pod host name, and resetting metrics on delete. I think out of those three, the first one is probably the one that might impact people, because when we split controller and extensions, it's actually changing our deployments; it's going to be running an extra pod or two in your cluster.
C: This actually has a number of benefits, though, and I think we're going to end up there anyway, so I don't see any reason to wait. This gives us much higher reliability on the Agones control plane and does allow us to scale out the Agones controller from one to more than one if we want to, which is great in terms of things like upgrades and resiliency and node failure, you know, time to recovery and so forth.
C: So, with that sort of diatribe over: I filed issues for all of these. If people want to follow up on any specific one, please go chime in on the issue; we'll use lazy consensus for moving forward on the ones that we're particularly excited about pushing. Oh, you put a note that one's already moved to beta, so I missed one, apparently, since I...
D: I think the only thing that gives me pause about any of this is that it does seem like a lot of users tend to stay on one version and then jump much later, and so I don't know how to properly get signal off of the feature gates.
D: So, like, you know, I think SDK graceful termination as an example is probably fine, because 1.18 is like a year and a half old or something, but for the ones that have been in since 1.30, I don't know what the right answer is. And similarly, I don't know if we can figure out a right answer without, you know... I think that's the reason that lazy consensus has to be the way we go, and kind of telegraphing in these meetings, like: by the way, we're doing this. Yeah.
C: For these people that are jumping releases: even all the ones that are going up to stable, except for safe-to-evict, have been there for quite a while. So even people who are jumping releases hopefully have jumped into a release where this is already on by default. Yeah, and if not, then they might have to run an old release while we figure out what bug got introduced, which has definitely happened before. So...
A: This is basically end-to-end tested. It isn't really a breaking change in terms of API or anything; it's just providing more stuff that I think a lot of people are already using. And this one, since there's an option to always turn it off, I think that's also okay. So that would be all right; I'll write notes on the tickets, but that would be...
D: I think we could move on to Max's; it sounds like we dispatched that topic.
F: Yeah, sure. So the thing I want to talk about is a design proposal to basically expose a status on the Fleet to indicate whether a rollout is complete.
F: The background is that right now we don't have such an indicator on the Fleet to tell a user, or some third-party CI orchestrator, whether a fleet rollout is complete or not, especially in the case when you use Agones with Cloud Deploy. When you trigger a rollout through Cloud Deploy, in the UI or CLI you will see that the Cloud Deploy rollout is immediately finished after you submit the command.
F: You know, no matter whether the fleet rollout is actually finished or not. So the general design idea is pretty straightforward: I propose to integrate with another open source library, which is called kstatus, which provides some standard statuses that can be understood by Cloud Deploy.
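A minimal sketch of what that could look like, assuming the proposal follows the kstatus convention of a standard Reconciling condition; the helper and its reasons and messages below are hypothetical, not the actual Agones API.

```go
// Hypothetical sketch: a Fleet controller maintaining the standard kstatus
// "Reconciling" condition so tools like Cloud Deploy can tell whether a
// rollout is still in progress. Not the actual Agones implementation.
package fleetstatus

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setRolloutCondition flips the "Reconciling" condition on a Fleet's assumed
// status.Conditions slice. While it is True, a kstatus-based poller reports
// the resource as InProgress; when False, as Current.
func setRolloutCondition(conditions *[]metav1.Condition, generation int64, rolloutDone bool) {
	cond := metav1.Condition{
		Type:               "Reconciling",
		Status:             metav1.ConditionTrue,
		Reason:             "RolloutInProgress",
		Message:            "new GameServerSet is still rolling out",
		ObservedGeneration: generation,
	}
	if rolloutDone {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "RolloutComplete"
		cond.Message = "fleet is running the desired game server version"
	}
	// SetStatusCondition only bumps LastTransitionTime when the status
	// actually changes.
	meta.SetStatusCondition(conditions, cond)
}
```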
F: The open question I have is how we define the completion of a fleet rollout. I think, by definition, the completion is when there is no old version of the game server running in the fleet, which is similar to the definition of Deployment rollout completion. But the issue is that for a Fleet, game servers have the Allocated state, and Allocated game servers will not be terminated or scaled down during a rollout process, so this can make the rollout process hang.
F: If customers do not terminate their allocated game servers in the old game server set for a long period, then, if we define rollout completion as there being no old-version game servers, their rollout will basically never finish, or only finish after a very long time, after the customer manually or somehow terminates their old game servers. So I'm not quite sure how we should define this completion, or what the use case or the customer's user journey is: how they will use this, what's the real state or real indicator customers care about. Yeah.
A: Thoughts? I have thoughts, but you're the, I don't want to say token, end user here.
E: Yeah, I guess usually, immediately when I change the fleet, I'm interested just in the ready servers, once they've obviously been updated; I'm less concerned about the allocated ones.
E: I might want to do some... I don't know if I could do a thing where I'd be notified once the ready ones have been updated, and then maybe later on the allocated. But I think, yeah, we have that exact use case where, after updating a fleet, it might, you know, take a day for the allocated servers to recycle.
E: Yeah, it would be nice to actually override that for some of the fleets as well, to kill the allocated servers during that rollout. But that's probably a different problem.
D: The thing that gives me pause about just the ready servers, and allowing the allocated to continue, is that, at the very least (and I obviously come from a more traditional web services background), wanting to know that the binary deployment is fully rolled out is sometimes necessary for things like dependency management. So if you have binary A, where version one uses some back-end service and version two doesn't, you want to know: hey, I can actually drain that back-end service.
D: Now, that's the case where you really want the allocated servers to be replaced as well.
A: Well, actually, that's a really interesting point. Should the litmus test almost be: has a new game server gone from ready to allocated and had someone play on it, and therefore we know it's okay? Oh, because that's literally your point, right: in a traditional web service manner, you would roll out part of it and see if it's still working, but the only way to test the game server is to actually have someone play on it.
D: The thing that I want to know is that the binary that uses it is no longer there. But what you're talking about is actually interesting in a different way, which is, yeah, okay, canary, which I honestly think is a more interesting feature... or not more interesting, but it's a different feature, right? Well...
A: No, it's not necessarily canary. I'm just thinking about, okay, when is a fleet rollout done enough, right? Because I think that's the word, because in theory you could have a fully allocated fleet if you had a completely full cluster. That's probably not likely; you may have some buffer, which would be ideal, so we may assume some buffer. You could argue that if you had one ready game server of the new type, that's enough. Maybe. I think you could argue it's done if all of them were replaced, like the old game server set's gone and everything in between is gone. I think you could also argue... yeah, actually, no, I'm going to rescind it. My idea of "if it's played on, it's a valid rollout, because it worked": I think you're right, I think that's canary, and if you want to do that, that has to be a different approach. Yeah. No, okay, no, yeah.
D: So one option, and I think Mark already said something like this in the ticket or in the doc somewhere, might be just to provide configuration as to which thing you want the rollout Reconciling condition to be tied to. Yeah, and a way to accomplish that might actually be to surface both conditions and actually have, like, three. What I would say is have a three-way, so that you can have, call it...
D: ...maybe, you know, ReconcilingDone and Reconciling, or ReconcilingAll and ReconcilingReady, or something like that, and then there's a configuration value as to which one of those two conditions the actual kstatus Reconciling condition is.
D: So that Reconciling condition would just be equal to one of the two values, but then kstatus would be able to pick up the Reconciling depending on whatever configuration you wanted. But if you wanted to have, like, custom logic on the other condition, you could just write a controller that was looking for that condition too.
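A sketch of that three-way idea, with every name hypothetical (none of this is the actual Agones API): publish both detailed conditions, then mirror whichever one the operator configured into the standard kstatus Reconciling condition.

```go
// Hypothetical three-way condition scheme: two detailed conditions plus the
// standard kstatus "Reconciling" condition that mirrors one of them, chosen
// by configuration.
package fleetstatus

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	condReconciling      = "Reconciling"      // the standard kstatus condition
	condReconcilingReady = "ReconcilingReady" // ready game servers replaced
	condReconcilingAll   = "ReconcilingAll"   // allocated ones drained too
)

// mirrorReconciling copies whichever detailed condition the operator chose
// (source would be condReconcilingReady or condReconcilingAll, taken from
// some config value) into the standard Reconciling condition. A user wanting
// custom logic could still watch the other condition directly.
func mirrorReconciling(conditions *[]metav1.Condition, source string) {
	detail := meta.FindStatusCondition(*conditions, source)
	if detail == nil {
		return
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    condReconciling,
		Status:  detail.Status,
		Reason:  detail.Reason,
		Message: detail.Message,
	})
}
```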
A: I'm almost wondering if it's, like, either a number or a percentage of ready of the new version, if that makes sense. We say we're done when, either, this is how many, so 100 percent would mean that, or you just say five: make sure there's five ready, or two, or one, whatever it happens to be. The default could either be one or 100 percent; I actually don't really care.
F: Do you think it's enough to only think about the new version of the game server? Do customers really care about the clean-out process of the old game servers? Because I'm thinking, maybe they care about the resources the old game servers are consuming, so, like, how many old game servers have already been cleared.
A: Which I just need to drop the docs on, pretty much, to have as a thing. Because...
E: Yeah, I think either way is fine, I guess, as long as the command line isn't hanging for a day while we wait for the fleet to be updated, and we have to control-C and then...
C: If we pick one way to do it to start, which we think is reasonable, we know how we could add stuff later if people actually need it. But it almost feels like we're over-designing if we're adding a whole bunch of stuff that we think people may need in the future, or may want to do slightly differently, whereas right now this doesn't exist at all. And once people know, like, oh, I can now plug into kstatus, or hook this up to my Jenkins or whatever,
C: ...then they might start asking for things that we haven't thought of, right? So if we say: let's put the condition in, let people start using it, make it based on something we think is reasonable, then we know there are ways we can change that condition in the future, or give people ways to set it differently, et cetera, et cetera. We can always add that later; I'm not sure it's necessary to add it from the start, especially when Steven's kind of like, anything would be okay.
C: Right, so I think putting something in there, we can then, you know, have a demo of using it with kstatus, of using it with rollouts, explain to people how to use it, and if they say, oh, you're telling me the wrong signal, then we can make sure we put the right signal in. But we'll be giving them a signal where right now one doesn't exist at all, and I think we can do that in a much simpler way to start.
D: That was actually why I asked the overly complicated thing; it's similar to when Mark described percentages, blah blah blah. It's like, yeah, we could do all these things, but I am very concerned about introducing API that doesn't seem necessary, because then we have to continue to support it. So it makes sense that...
C: Yeah, so I would argue to start with something simple, and then I think we have a great way to extend it if we want: here's how you can tweak how this variable gets set, through extra conditions, and pick which one gets, you know, propagated up, or by specifying some sort of rules on how this condition gets set, or whatever it is. That stuff can all be added to the API later.
E: Maybe another complication, but what would happen, let's say, if we do a fleet update and then, before it's done, we do another fleet update? Do we need to think about this? What happens on the first one? Is there another event to say it was overridden, or...?
F: Yeah, what's the current expectation? To only show that the Fleet is doing the reconciling, the progress, so no difference whether it's, like, overridden by a new fleet rollout or something else. Yeah.
D: I think both you and Steven expressed that just doing the ready game servers seems to be the intuition you both have, so I say let's roll with that, as y'all are actually probably more expert at this than we are at this point.
F: Okay, so the takeaway for me is that we will start from simple. The basic condition to start is that when the new fleet, the new game server set, has one ready game server, then we call the phase done. Does that make sense?
A: Depending... or, wait, did you say either: as soon as the new game server set has one ready game server, we're done? Or did we say that whenever the old game servers that are ready get replaced, it's done? If that makes sense. Or is it something in between?
A: Actually, if it doesn't come up, the rolling update should actually abort; there's some stuff in there about that. Oh, I see, so we already kind of do that. We already kind of do that anyway. Okay, I'm writing... I'm just gonna... I'm writing an audit trail.
B: I'm saying that only...
D: Well, actually, I recently backed off a little, and Max did the last couple of days of investigation, but both of you, then, yeah. So the e2e has been very flaky recently. I did find there's a GKE, seemingly a GKE product issue, that's been affecting some of it, and we've still been working through that internally. I actually need to get some time to look at our...
D: ...side of it and work with teams here. But we think, I think, it has a workaround in the e2es that's already present in main.
D: Max took the most recent run at this, and it is a very broad set of failures that all kind of seem to be failing in different ways, and we don't have a handle yet on what's going on. And then, Max, I think you said it actually cleared up for a little while, so it's kind of, you know, weather-related, or...
A: So it was happening and we don't know why. The worst kind of bugs.
D: Yeah, so, Stephen, I hope this is inspiring a lot of confidence.
D: Just a BigQuery table that I set up might make it fairly trivial to see where the e2e test failures are, like whether there is really a cluster on the autopilot side. So it might be... we'll see if I can poke some cycles into that soon. And then, separate from the e2es: Max, didn't you call it yesterday that the SDK conformance test just started flaking? No?
D: You've mentioned the fix... well, you've mentioned a fix for it several times, and...
A: Wait, oh, do we have a bug for the C# conformance test? We absolutely do; we have that as a thing. Here we go. Yeah, I found it, I found the one.
D: Because that particular test has bitten us several times, and the most infuriating part of it is, when I see the flake, I intuitively just go and hit the retry build button, and it...
D: So it's kind of a more general question of how to get confidence back. Max and I have been talking about using some bandwidth during this Fix-It to try to... I've been trying to work on those kinds of reports, on, like...
D: ...e2e flakes. I think between that, and then also improving how we either fan the logs out, or, like...
D: Yeah, I totally agree with that, and so maybe there are some ideas we could take forward to try to help with that. You know, one idea that I had, on...
D: ...the script that Max already has for fanning out the build: you could then kind of bring together all the stuff and then dump the logs out. That's how we handle it internally: when our internal builds fail, we often have a whole set of different log files to look at.
D: Yeah, exactly, and if we had such a per-test dumping ground, I could also then... I once had Aaron turn on the Helm debug logs, because his build kept failing during helm install, but the Helm debug output is like 20 times bigger than our e2e logs, and I was like: oh, that's not going to work in general. And it'd be nice if I could just do that and then redirect standard error to somewhere else,
D: ...so we always have the Helm debug logs available. Because, you know, there is like a one in 20 or one in 50 flake or something where we just try to run the helm install on e2e and it fails, and it's like: well, there's no way I'm going to be able to debug that, because I don't have anything to go on.
D: So there's an idea running around there somewhere of, like: okay, maybe we just have a way to get more artifacts out of the build. Maybe that's an interesting request for Cloud Build: how is this pattern supposed to work? Is there a way that, like, a build container can, you know, specify an artifact directory, of this...
D: So I've been doing a little bit of picking on that too. Yeah, just for context for anyone listening: like, four months ago, I don't remember how long ago, I changed it so that we retry our end-to-end tests using this package called gotestsum, which is super nice. Yeah, it's an interesting package in that it's basically just a harness around go test -json that then looks at whether the tests failed or not, and reruns the tests.
D: It is conservative when a test panics, which is why sometimes you see e2e runs where you'll see a panic as one of the first failures, and then you'll see a bunch of unknown tests: because on the panic, the test harness takes out all of the other tests, and so you'll see Go report them as unknown.
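This is not gotestsum itself, but a rough sketch of the mechanism being described: wrap go test -json, collect the tests whose final action is "fail", and rerun only those. The real tool is far more careful, notably around panics, where per-test results can't be trusted.

```go
// Sketch of a rerun-failed-tests harness around `go test -json`.
// Illustrative only; gotestsum is the real, much more robust version.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
	"strings"
)

// testEvent matches the JSON event stream emitted by `go test -json`.
type testEvent struct {
	Action string `json:"Action"`
	Test   string `json:"Test"`
}

// failedTests runs the package's tests once and returns the failed test names.
func failedTests(pkg string) ([]string, error) {
	// A non-zero exit just means some tests failed; we still get the stream.
	out, _ := exec.Command("go", "test", "-json", pkg).Output()
	var failed []string
	dec := json.NewDecoder(bytes.NewReader(out))
	for dec.More() {
		var ev testEvent
		if err := dec.Decode(&ev); err != nil {
			return nil, err
		}
		if ev.Action == "fail" && ev.Test != "" {
			failed = append(failed, ev.Test)
		}
	}
	return failed, nil
}

func main() {
	failed, err := failedTests("./...")
	if err != nil || len(failed) == 0 {
		return
	}
	// Rerun only the failed tests, matched by exact name.
	pattern := "^(" + strings.Join(failed, "|") + ")$"
	fmt.Println("rerunning:", pattern)
	_ = exec.Command("go", "test", "-run", pattern, "./...").Run()
}
```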
D: I've been doing some light scrubbing any time I see something like that. Like, I sent the PR to Max on Monday where, for example, we had a slew of tests... I actually just fixed, like, a dozen of them in the fleet tests, where they were using an assert.NotNil or an assert.Nil on errors, but then they would just continue on, so I changed them all to require.
D: Yeah, yeah, yeah, and a funny case where require is actually the right thing, versus, like, that weird other case we found.
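For anyone unfamiliar with the distinction being discussed: testify's assert records a failure and keeps executing, while require stops the test immediately, which matters when the very next line would dereference a nil. The test and helper below are made up for illustration.

```go
// assert vs require in testify: the pattern being fixed in the fleet tests.
package fleet_test

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// Fleet and getFleet are stand-ins for the real types and client calls.
type Fleet struct{ Replicas int32 }

func getFleet(name string) (*Fleet, error) { return &Fleet{Replicas: 3}, nil }

func TestFleetLookup(t *testing.T) {
	fleet, err := getFleet("simple-game-server")

	// assert: a failure here is recorded but the test keeps running, so if
	// err were non-nil and fleet were nil, the dereference below would panic
	// and take out the rest of the test binary.
	assert.NoError(t, err)

	// require: fails fast, ending the test here before any nil dereference.
	require.NoError(t, err)
	require.NotNil(t, fleet)

	assert.Equal(t, int32(3), fleet.Replicas)
}
```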
D: That makes sense. If we wanted to, maybe as part of the Fix-It again, we could have a concerted effort to kind of audit the tests and figure out... like, you know, just bang through and make sure that, if you hit an error and you're also taking a pointer to an object, you stop there, because you're about to look at it and panic. That could be something we do.
D: On the other hand, handling it kind of piecemeal as they come up is not the end of the world either. No.
A: That makes a lot of sense. The other thought I had in my head, that you've probably already thought of: I had a look, especially thinking of, like, autopilot...
D: Like, early on on the autopilot side, I actually did change... there's a framework: most of the calls that wait for resources, like wait for a ready game server, etc.,
D: ...go through, like, a framework retry of some sort, like wait-for-ready, blah blah blah, and there was a hard-coded, like, five minutes, and I think I buffed that to like 10 minutes for autopilot early on, just because it makes sense. I would like to know if that's what is causing it, because I inserted that just for any flakiness around the nodes coming up, or rather, because, you know, the P90 on a node starting on autopilot might be above five minutes.
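A sketch of the kind of framework wait being described, with a per-platform timeout budget; the helper names and numbers here are illustrative, not the actual Agones test framework.

```go
// Hypothetical wait helper with a timeout budget buffed for Autopilot,
// where the tail latency of node provisioning can exceed five minutes.
package framework

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// timeout returns the wait budget for resource readiness.
func timeout(autopilot bool) time.Duration {
	if autopilot {
		return 10 * time.Minute // node startup P90 may exceed five minutes
	}
	return 5 * time.Minute
}

// waitForReady polls check until it reports done or the budget expires.
func waitForReady(ctx context.Context, autopilot bool, check wait.ConditionWithContextFunc) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout(autopilot), true, check)
}
```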
D: But autopilot could be causing us some other problem, and that's what might really work out. Yeah, and the other part is that, and this is, you know, kind of what we're trying to work through internally as well, our standard clusters are very standard clusters: they're pretty much the bog-standard configuration, if you were just to run the defaults.
D: Autopilot has much more of the kind of fancy features for GKE: it has, you know, Workload Identity, it's running Dataplane V2, it's running, like, all of the features, because those are the features we think that, you know, people...
D: ...are running anyways. And so whether it's autopilot or whether it's, like, ABCD feature is unknown at this point, and we may be seeing issues with some other feature that we could probably reproduce on standard as well. So it makes sense.
D: We are not right now, but actually that was the reason I set up... the question, yeah: we're now logging to a BigQuery sink, so that I, or someone else, can do post-hoc analysis on it. And that was actually one of the reasons: I was gonna try to have it output in some funny way that I could pick up in that, but yeah.
D: One report does an inference on whether it was an internal retry. It also tries to make an inference on, as an example: have I ever seen this commit pass on this configuration at all?
D: So if someone actually went and retried the entire build, it says, you know: oh, I have seen hash abc123 pass on autopilot 1.26 before; ergo, if anything fails, it must be a flake. So we have several layers of possible retries that can all kind of come into play there.
D: We should probably move on to your thing, and the other stuff... yeah, I was realizing we've only got 10 minutes left. Oh...
A: Boy. Mine's pretty easy: I just need to push up the docs, and then that's done. I actually have the docs; I just need to push them up.
A: And it's approved and it's gone through, so let's go ahead; it'll be behind a feature gate. So, yes, we'll have that functionality around. Just, yeah, once the, what do we call it, the kstatus stuff comes through for the rollout mechanisms, people will be able to do a bit of both, which is nice.
D: So, moving on to the last topic: we now have a robot that will... do you remember the timing criteria? It marks things stale after... oh, I see you're here. Do you want to talk to that part?
G: Yeah, so basically, if an issue is inactive for more than 30 days, it will start marking the issue as stale. If it stays inactive for another 30 days, then it will mark it as obsolete, and if it's still inactive, then it will close it.
G: So we activated this GitHub Action this month, so there are around seven or eight issues that got marked as stale, because we don't want to mark all of them at once, because it would be too much to handle. Some of them, I think, Mark marked as "we still need it", and he added the awaiting-maintainer label so that they will be excluded from monitoring. So right now there are five other issues that are marked as stale.
G: So do we want to take a look at those, or what should we do?
D: So I think Max proposed that in this meeting we could actually just triage the stale issues, and we could probably just insert this as an ongoing thing, Mark, if that sounds reasonable to you. Yeah.
D: Because then we'll always just have a process of: every month we go through and try to make sure the stale issues get some sort of... if we're going to close them, we at least do it publicly.
C: I just figured, since it was only five, I think we could do it pretty quickly, yeah. So on this first one... can we start from the top or the bottom? All right, let's...
C: I've got them all open in tabs. So this one is actually interesting, because I think there are multiple solutions to this problem now. I think when this was filed, those didn't exist, or we didn't know about them. So I think for UDP game servers we can use Quilkin, and then you can use nodes without an external IP; and for TCP game servers, we know that you can use Traefik, and you can use nodes without an external IP. So we actually have solutions for this.
C: I think, really, the ask at this point is just to sort of document, and/or put together a demo of, how to use them. So we can either leave this open, say that these are the two solutions, and close it with documentation; or we can close it as-is. I already dropped a comment on here a while back about Quilkin.
C: We can add a note about Traefik if you want to do TCP, and kind of leave it up to, you know, folks to find and/or discover those solutions through the issue. So I think either way would be fine: if we want to leave this as a documentation tracking issue, I'm okay leaving it open, or we can just kind of let it close on its own.
G: This issue, I think Robert had already added the stale label to; that's why it has got the obsolete label, so it will get closed in the next 30 days if it is not addressed. So, do we want... is that okay?
D: I'm not against changing this to, you know, "document Quilkin or other options". I would just change the issue title and then put an awaiting-maintainer or whatever on it, or we...
B: Oh, and I'll do that. Let's...
A: This could be a great... actually, this is a good first issue. You should probably, because there's a docs issue...
G: Is it something that, since you added it as a good first issue, someone can help with? Yeah, yeah, I think...
A: So this one is about passing arbitrary metrics on the game server through to OpenCensus. I was thinking: let this just pass through, yeah, and let the system... unless someone immediately...
A: I have no strong opinions either way; I could flip a coin. I don't care, go ahead and close it, all...
A: We got asked about this a lot in the past; people probably had more legacy game servers then. I don't think it happens nearly as much now, but I figured it seems like a good one to see if somebody complains about it.
A: Weird legacy game servers... they have a thing: security best practices.
A: ...the right spot in the docs, and just some cleanup. It wasn't terrible, but I also wonder whether we could actually just say: go look at these Kubernetes docs.