From YouTube: Scalability Team Demo Call - 2021-08-06
A: I guess I'll go first. So this is the thing I was looking at the other day. I don't have an answer yet; I have a few things that I know it's not, so I'm just going to share my screen. I guess so. This came from me looking at the error budgets for a stage group, and I noticed... so this is the distribution. This is the distribution of HTTP requests in our logs into one-second buckets by duration.

A: The time they spend, or that our logs say they spend, in the shared state Redis. I've excluded ones that take less than a second from here, because the vast majority of requests should take less than a second in any Redis; it's Redis, this shouldn't be a thing. But what I did notice as well was that there's this big spike at five, which is suspicious, and there's also, you can see, basically nothing at 11, 12, 13, 14, but there is something at 10 and 15.

A: And possibly at 20 as well, though it's hard to see. So I've just taken the filter off here and you'll see that under one second dominates this. But this is odd, right, because a Redis command's not going to take five seconds, certainly not an appreciable number of Redis commands, and that's, you know, 44,000 in a day, because if they did, our slow log would be full of that and...

A: Normal, though, for the shared state Redis, for Sidekiq...

A: Oh, no, no, sorry, yeah, so it happens. So this, the top one, was from the Rails logs. Oh yeah, so here's the Rails logs if you include under one second, and this, you know, obviously that's unreadable. Okay, just give me a second.

B: So this makes me wonder if maybe we are running blocking commands, not necessarily a BRPOP, but any blocking command on that, that would have an implicit timeout and a reason to... yes, sorry, oh, you heard that, I...

B: I was wondering if there was, like, a block... maybe in persistent Redis we are still running blocking commands that would have a timeout, not necessarily BRPOP, but, you know, any blocking command, yeah.
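A minimal sketch of the kind of blocking call being discussed, assuming the redis-rb client (the list key here is made up): a BRPOP against an empty list simply blocks client-side until its timeout expires, so it would show up in instrumentation as a multi-second Redis call even though the server is idle.

```ruby
# Sketch only: a blocking command with a 5s timeout looks like a 5s "Redis call".
require "redis"
require "benchmark"

redis = Redis.new

elapsed = Benchmark.realtime do
  # "demo:empty:list" is a hypothetical key; nothing is ever pushed to it,
  # so this blocks for the full timeout and then returns nil.
  redis.brpop("demo:empty:list", timeout: 5)
end

puts format("brpop returned after %.1fs", elapsed) # ~5.0s, with Redis itself idle
```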
C: Could we explicitly remove those commands from that Redis duration, then? The blocking ones.

A: Are you talking about in the instrumentation, where we... yeah, commands? Let me check. That's the wrong one... this one. Because they are excluded from Apdex, but not from... are they excluded from logs? I can't even tell with how this works. Let me see.

A: I think this is what adds them to metrics and to logs, so these will be excluded from both Apdex and logs, but I can double-check on that. I'm just going to rule out one more thing: I was speaking to Jakob about this yesterday; he was off today.

A: He was wondering if it was at a certain point in the life cycle of the process, like either the first or the last request, especially the first, if it's some kind of timeout-related thing; maybe it's a client timeout. But it's not: so here's when this happened for two different Kubernetes pods, and you can see that there are requests before, ticking along as normal, then there's a spike, and then it goes back to normal.

A: It also doesn't happen for any other Redis instance, which points towards maybe a command, like I think Matt was just saying, because it doesn't happen on the cache instance, the queues instance or the trace chunks instance.

B: Can we get... can we get a frequency by time of day?

B: I'm mostly asking because I'd like to catch some of these in the act, and if I can get an idea of how often they occur and at what density, then I'll know how long to run the instrumentation for.

A: Yeah, so one idea I had for this was the performance bar. We have a stack trace for requests to Redis, so I was like, if I can grab one of these... because these happen for, like, me and for you, probably, but they mostly happen on async requests, so you don't notice them. If I can grab one of these for somebody who's got the performance bar enabled, I can grab that request ID and put it in the performance bar to get it.

A: So I got the performance bar data, but unfortunately the performance bar, I think because it happens in the Rails controller, won't catch things that happen outside of the controller life cycle. So if we see here... where are we? First of all, we can see the duration is five seconds.

A: So it's unlikely to be a measurement error, which is another thing I was wondering, because if it was a measurement error the total duration wouldn't be affected, and also the Workhorse duration says this took over five seconds. So it seems like this was actually a very slow request. But we can see that total Redis calls were 19, of which seven are on the shared state, and here I've got a total of six, of which zero are on the shared state.

A: So I'm clearly missing a bunch of stuff in the performance bar, which is a separate issue. Another idea I had to get a stack trace... just, sorry, I'll set up this time-of-day chart. What do we want? Let's just say this specifically, so let's say this is between 5 and 5.1. Oh, this chart is the wrong way around really, isn't it?

A: Yes, right, so buckets by date histogram on json.time, y-axis can be count, for the last day... or actually we could probably... yeah, let's just do that. Oh wow. But I guess that does help you, Matt, in that you can do it at literally any time.

B: Yes, that's great. Zoom into any one of those, I guess. Let's look at the last couple of hours and... yes, that's it. I wanted to see if it's bursting on a shorter time scale. Yeah, change the granularity. What do we... what's it set to now? I know it's set to auto by default. Yeah, 30 seconds.

A: Yeah, wait, what? Oh, that's the start.

B: I'll... which...

A: Good question, Andrew. So I discovered this because it happens a lot on internal/allowed, but this is basically a chart of which endpoints this hits the most often.

A: Wow. I'll just pop that link there for now, Matt.

C: The other thing we can try and do... which one was it, the shared state? Yeah, shared state, and it doesn't happen on the others.

C: The one thing you could look at is what calls are being made on shared state that you don't get on the others. Yes, I mean, I suspect there's quite a lot, because it's so much more generic than, you know, the other ones.

A: But yeah, basically, if we just say: if this duration is over five in our logging block, track it but don't raise an exception, and put it in Sentry, then we'll get the backtrace via Sentry, and maybe we can see where in our stack this is coming from. Because, yeah, the other thing with five seconds is what we are measuring.
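A rough sketch of that idea, assuming the sentry-ruby gem is already initialised; the instrument_call wrapper and threshold constant below are hypothetical stand-ins for whatever wraps Redis calls in the logging code, not the actual implementation.

```ruby
require "sentry-ruby"

SLOW_REDIS_THRESHOLD_S = 5.0

# Used purely as a reporting vehicle; it is never raised.
SlowRedisCall = Class.new(StandardError)

def instrument_call(command)
  started  = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result   = yield
  duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  if duration >= SLOW_REDIS_THRESHOLD_S
    error = SlowRedisCall.new("#{command} took #{duration.round(2)}s")
    error.set_backtrace(caller)     # attach the application backtrace by hand
    Sentry.capture_exception(error) # report to Sentry without ever raising
  end

  result
end
```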
A: I don't think it's this, but the measurement includes this: you can pass a block of Ruby code to be run with some commands, so it will include the time it takes to run that block.

A: But these are the commands that can take a block, so not every command can take a block, and to me it sounds more like a timeout than Ruby code taking exactly five seconds every time.

B: With a little post-processing I'll probably also be able to find out which hosts are sending those requests, but not necessarily which application code path, at least with what I'm thinking of doing right now.

C: ZSCAN is at the bottom of that list of things that take a block; all the scans, in fact. Do those blocks get called multiple times? I'm just trying to remember how they work in the case of those scans. Oh, they get called for every single item, don't they, so those could take five seconds if it's measuring the entire scan, right?

A: Possibly. I just feel like it's unlikely that it would take pretty much exactly five seconds every time. Yeah, I don't know. I guess if we get the commands, that will narrow it down a lot already, because if it's not one of these, then it's clearly not a Ruby-side issue.

A: Well, I mean, you can literally do, like, a sleep in the block and then it'll take five seconds; that was how I tested it, yeah. So yeah, it's not super helpful to conflate the two things like that. So yeah, I just wanted to share this as something I've been looking at, but I don't really...
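That conflation can be reproduced with something like the following, assuming the redis-rb client and a made-up key: scan_each yields once per matching key, so time spent in the Ruby block (here a deliberate sleep, as in the test just described) is counted inside the measured call even though Redis itself answered quickly.

```ruby
require "redis"
require "benchmark"

redis = Redis.new
redis.set("demo:key", "value") # hypothetical key so the scan yields at least once

elapsed = Benchmark.realtime do
  redis.scan_each(match: "demo:*") do |_key|
    sleep 5 # Ruby-side work inside the block, not Redis being slow
  end
end

puts format("scan_each measured at %.1fs", elapsed) # ~5s even though Redis was fast
```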
B: This is why I want to do the instrumentation, to catch a few...

A: ...a few examples of these occurring, yeah.

A: So, sorry, what I was going to say was that the reason I found this in the first place was because of the error budgets work. So I was looking at the error budgets for Source Code, who have a lot of these ones at the top, because Source Code is one of the groups that owns most of the requests that we make, because they own Git access; it's about three percent, right, there we go. And I was like, you know, by proportion these endpoints are a lot of their failing requests.

A: This actually does make quite a big difference to their error budget, and I was going to create an issue for them. But then I realised it happens on everybody's endpoints, and it doesn't seem to be related to what any particular stage group is doing. So this did come out of the work on error budgets, and I think this probably would make a dent in our overall Apdex score.

B: Some of the alerts I got last week ended up resolving down to internal/allowed running for multiple seconds. Okay, so yeah, definitely, this would help.

A: It should always be one of the ones... it should really, yeah, it should always be approximately the same speed, yeah. And the thing is, I've noticed when I'm looking at these that so many requests to the web and API servers are either API requests from CI - like, you know, that's what a lot of the API internal/allowed calls are, they're from CI - or, oh, AJAX requests like the one I showed earlier. That was my request; it was the request for the merge request widget, you know, polling.

A: You probably won't notice it that much, because it'll just be in the background somewhere, but it's still happening. So yeah, I want to know... basically I want to know what's going on here.

A: So yeah, Matt, I just pinged you on that issue. I was wanting to chat to you later anyway about the Sidekiq stuff you were talking about yesterday, so maybe we should just chat about both later, if that works for you. Or... yeah, we can see, but yeah, it would be good to get some help and try and unravel this.
B: Yeah, definitely. So yeah, because we've got this periodicity, I think what I'm going to try to do as a first cut is... sorry, Andrew, I said earlier that I had two ideas, but I neglected to articulate what they were.

B: One is packet capture during one of these periods, and because we have a dense period... I should screen share just to show this. This is Sean's query... oops, stop that. I just pasted a screenshot in the issue, but this shows even more clearly the periodicity. All I did was switch this to an area display, from a line chart to a bar chart.

B: So I'm going to time it. So the first idea is packet capture, then do post-processing analysis to pull out the calls that have this duration, and that will identify the command, its arguments and the client IP, which, you know, in many cases I think will be a Kubernetes pod, so that won't be especially useful, but identifying the nature of the command is probably going to be more useful for analytical purposes.

B: Since we do have these dense periods, I think I could probably get away with doing, like, a 30-second capture if I time it right. So I think that's probably the better approach, because it will give more data and it doesn't burn Redis's CPU; it will burn a separate CPU. The second...

B: Exactly, yeah, that's exactly it: the absence of data on the Redis server side would tell us that, yeah. So I would come back here to confirm that the selected time span did have client-side measurements that met that duration, and we'll see how they match up. The other idea I had, which I don't think I'll do, because I think we'll get more information and less impact from the pcap instrumentation, is... because we, thank goodness, have debug symbols on the Redis server.

B: It's possible to instrument shared code paths, like processCommand, for example. So we could measure the distribution of durations for completing processCommand, and that would tell us something about whether or not this was, for example... that would be a reasonably cheap and quick way to identify if this was a blocking command or not. I'd need to instrument additional...

B: ...additional function calls to determine which specific command, and if it is a blocking command that goes down a separate code path to instrument. So that would be a little bit more hunting, and it would still not give us information about, for example, the arguments to the command coming in. So that's... and it would consume some CPU time within the Redis main thread, which is undesirable because that's its bounding capacity.
C: It's timing out from a BRPOP timeout... sorry, I meant the client. I meant a block on the client, you know, the yield blocks, rather than a blocking command.

C: What I was going to say is, yeah, I mean, looking at the spread, right, it's probably something in the access checks. It's got to be like a permissions check kind of thing, I imagine. Yeah, and I mean it could be something else, but, like, just...

C: That's also a good option, but neither of those, I would imagine, has any blocking calls. I mean, they might do, I could be surprised about anything, but I wouldn't imagine... oh.

A: Yeah, and I don't know what the periodicity on that is, because it's happening once in the day, so yeah, yeah. I think it's worth looking into this, because... yeah, I mean, actually, the other thing I should have mentioned was: I said I was looking into this because of the Source Code group's error budgets. The Source Code group will also look into the error budgets, which is good; that's what we want, right, Rachel, teams looking at their budget spend. And they asked me, like...

A: ...ourselves, anything...
D: What I was going to ask was whether there is anything more on that topic, or can I ask another question about error budgets? Cool. I'm looking for a reminder about "not owned" in the error budgets, and I was wondering, apart from GraphQL, what other things need to happen to attribute more of that not-owned group?

A: So yeah, we talked about this in the last demo, Andrew. So I've got an MR to make the reactive caching take the feature category from its caller, which it will inherit as well; so sometimes you have a reactive caching worker called by a reactive caching worker, and it will take that from the first caller. So basically, I think the logic is getting...

A: I don't want to make the logic too much more complicated than it is now, because this already caused some confusion for Matt yesterday. It's actually related to what you were asking me about yesterday, Matt. With our context propagation, particularly from a web request to a background job, a Sidekiq job, we want to inherit every single field, like user, root namespace, project, except in some cases in a background job.

A: We want to clear all of that, because we're operating across multiple contexts. And for the feature category, we normally want the feature category of the job, not the feature category of the caller, because the feature category of the job is generally more specific. So even if, say, a merge job is called from a CI page...

A: ...the merge job is still owned by the team that owns the merge job; it's not owned by the CI team just because they make a page that triggers that worker. The exception being not-owned workers, because for those, "not owned" is essentially... we can think of that as a null value. So for those we do want to get it from the caller, because that at least helps us, yeah.
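A sketch of the rule being described here; the method and category names are illustrative rather than the actual GitLab implementation. A job normally keeps its own feature category, because it is more specific than the caller's, unless that category is not_owned, which behaves like a null value and falls back to the caller's category.

```ruby
# Illustrative only; not the real GitLab code.
NOT_OWNED = :not_owned

def effective_feature_category(worker_category:, caller_category:)
  # A not-owned (or missing) category acts like null: inherit from the caller.
  return caller_category if worker_category.nil? || worker_category == NOT_OWNED

  # Otherwise the job's own category wins, since it is more specific.
  worker_category
end

# A merge-related worker triggered from a CI page stays with its own team:
effective_feature_category(worker_category: :code_review,
                           caller_category: :continuous_integration)
# => :code_review

# A not-owned worker (e.g. a generic reactive caching worker) inherits instead:
effective_feature_category(worker_category: NOT_OWNED,
                           caller_category: :continuous_integration)
# => :continuous_integration
```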
A: Yeah, so I don't want to make it... this is the most complicated I want to make it. So I've kind of paused that MR for now, because it's also going to cause more confusion as we do the rest of the catch-all rollout; something related already gave Matt a fair question yesterday. But I think inheriting if it's not owned... if we had null instead of "not owned" there, I think that would make perfect sense.

C: So if we did more work to fix the attribution on the Rails controllers, or on the Grape and Rails controllers, right, and GraphQL, there's... we have an approach for that, then we wouldn't need that kind of conditional; we could just say it comes from the... oh, wait.

A: Sorry, no, we'd still need the conditional, but we're saying the opposite: we basically never need to care about not-owned Sidekiq workers, because if they are not owned, they'll get their feature category from whatever calls them. So we don't have to go... we only have to go through the not-owned controllers and API endpoints, not the not-owned jobs. But I agree it is a little bit confusing to explain, so I'm not saying that's the most elegant option.

C: Yeah, I mean, the one thing I've really noticed is, now that this is starting to bite... the error budget, the GraphQL... no, no, just in general: error budgets, and people pushing back on fixing technical debt. People are becoming a lot more interested in the accounting of it, and, you know, often it's in big meetings with lots of very busy people, and I'm realising the importance of keeping it simple.

C: But this might be the only way. But I am also seeing, like, people are like, "no, but this number doesn't make sense, explain it to me", and they sort of give you ten seconds to explain it. And so it is kind of like... and those are the decision makers, right, so you've got to have them bought in. So just something I've been noticing recently, especially.

A: We kind of talked about this a bit before, but the other option is, with these not-owned workers, we could go to the other extreme and just make, like, a reactive caching worker for each feature category that uses reactive caching, and have each one use the correct one. I mean, that's a lot of code duplication.

A: Yeah, we'll see how it goes. Like I said, I've paused it for now, because I think it could cause confusion with the next stages of the catch-all Sidekiq rollout, because we're doing that by feature category. So if we create a metrics query that uses feature category with "not owned" in the mix, we don't want those to be conflated just right now; like, in a week's time it'll probably be fine. But yeah, so, Rachel, that answers a small part of your question, but I don't think it answers the rest of it.

D: So what I'm doing at the moment is trying to find all the pieces that people may be concerned about regarding error budgets and say, well, either this is what we're doing, or this is what we're going to do, or no, we hadn't thought about that yet; but to list out those things so that people have some idea of where error budgets are going to go next. So this is helpful for me, because I can just put that in the list.

B: Yes, but I need just a moment before I'm ready to talk. I want to go back to our first topic.

B: I don't know why I'm having an awful time putting a comment on the issue. I think this is probably a client-side thing. Okay, I'll just push this as is for now and I'll put more details in later. Okay, screenshot.

B: This is what we were looking at earlier. Is this sharing properly, can you see? Yeah, perfect, okay, great. So we've seen this together, we just did this: these are the 5.0 to 5.1 duration periodic spikes.

B: Yeah, which is, you know, not super great to see, but this is... yeah. So this is... it's easier to see on this graph that's showing per-CPU-core usage. You can see that these spikes are attributable to exactly one core, which means exactly one process, which means this is very likely to be the RDB backup process.

B: RDB backups effectively run by forking the main Redis server process, so that it begins life with an identical copy of the virtual memory of the main Redis process as of a point in time, and Linux will implement copy-on-write for all of those memory pages. So as the main process mutates those pages, new pages get allocated, because the forked child process that's actually trying to write the backup file out still needs the originals, and that is why this graph shows spikes in page usage.
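A toy illustration of the fork behaviour being described (plain Ruby, not Redis itself): the forked child keeps the point-in-time snapshot it started with, while the parent's later mutations trigger copy-on-write and stay invisible to the child writing out its "backup".

```ruby
data = { "counter" => 0 }

child = fork do
  sleep 1 # give the parent time to mutate its own copy
  # The child still sees the snapshot taken at fork time:
  File.write("/tmp/backup-demo.txt", data.inspect)
end

data["counter"] += 1 # parent mutates; the kernel copies only the touched pages
Process.wait(child)

puts "parent sees: #{data.inspect}"                      # {"counter"=>1}
puts "child wrote: #{File.read('/tmp/backup-demo.txt')}" # {"counter"=>0}
```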
B: So if I suppress this, we get to see that the growth in memory usage is primarily coming from anonymous pages, which is because of the copy-on-write for Redis's actual data store, and also in cache, because the whole point of this forked process is to write out files to disk, which, of course, go through the page cache.

B: We definitely don't do them on Redis cache. For main, yes. I would have expected us to do those on Redis Sidekiq too, but I'll go verify that; it may be that... I don't know. I mean, right.
C: Yeah, Matt, can you... there's a graph, if you go to the Redis dashboard, for the amount of memory used during the copy-on-writes. If you go to Redis main and then go... or however you want to get there, there's the amount of memory that was used during the copy-on-write, from the fact that it was mutating in the main Redis thread.

C: Yeah, if you... yeah, down... it'll be down, not in the indicator detail, somewhere down here, I think. I mean, it's definitely a metric, but I don't know if we plot it; it doesn't look like it. And there, those spikes are... again, was that... is that... no.

C: Yeah, there's a, like, a metric which shows you how much memory is basically used by, you know, the copy part of the copy-on-write, and that, I've found in the past, is quite an interesting metric to tell if...

B: Yeah, no, yeah, it's a good question. I think the answer is just that we configure the primary and secondaries the same, because any node could take up that role.

C: No, I think it's a very interesting one. One of the things that's kind of weird about it, though, is that it doesn't seem to affect all the Redis queries that are... you know, it seems to affect a handful, because obviously there might be 20,000 going through, yeah, and we don't see it in the Redis slow log.

C: Which... yeah, it's... this is going to be very interesting, and I'm really looking forward to figuring out, someone figuring out, what it actually was, yeah, because that is very...

C: These shirts... this is pretty old, but they're just good swag shirts. Nice, yeah, yeah. That's... yeah, Sean, what I think, because, like, what I've seen in the past, I think when we were still on Azure a long time ago, there was something with the hypervisor and forking Redis processes. I forget, it's, like, lost in the mists of time, but basically when we forked and did the copy-on-write it performed really, really badly and things basically came to a halt.
C: But if it was something like that, you would see far more requests that would have slowed down. But it's just, like, a handful, and that's what's really strange, yeah.

B: So I think I'll proceed as we talked about before, with the pcap instrumentation, and I'll try to get that done today, so we've got something to look at tomorrow.

B: Having to modify the page tables, I would expect the process doing the mutating to incur that overhead, so that would be the Redis main thread, not the process doing the RDB backup, would be my guess, and I would expect that to be on...

B: ...you know, perhaps a microsecond time scale on a per-event basis. But those events would happen very often, and so it could cumulatively add up to a lot, especially early in the process, before most of the pages have mutated. I don't see how that would... I cannot imagine a way that that would be biased.
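A back-of-the-envelope version of that argument; every number below is an assumption for illustration, not a measurement from this incident.

```ruby
page_size_bytes = 4 * 1024     # typical Linux page size
resident_bytes  = 10 * 1024**3 # assume ~10 GiB of Redis data
fault_cost_s    = 2e-6         # assume ~2 microseconds per copy-on-write fault

pages      = resident_bytes / page_size_bytes
total_cost = pages * fault_cost_s # only pages that actually mutate pay this

puts "#{pages} pages, ~#{total_cost.round(1)}s of cumulative fault overhead if all mutate"
# => 2621440 pages, ~5.2s of cumulative fault overhead if all mutate
```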
A: No, so the other thing that's weird is that, I don't know, I was assuming that they would show up in the Redis slow log. But if it's an issue with actually processing the command on the Redis side, like before it gets to the processCommand stage, you know, before it decides which command it is, then maybe that wouldn't be counted in the slow log, yeah.

A: Weird. Also, to be honest, the spike above one second is still a bit concerning, you know, like in that first chart I showed, this, like, one...

D: I don't mind if the rest of the time is spent looking at this, because it's interesting to see what's in here and why this is happening, and I think any time that we spend researching interesting things like this gives us more understanding of how else we can tweak Redis in our favour, to make it more and more performant. Because, I mean, as we've seen from the latest 10-minute reports, we're also going to have a problem with the cache one as well, so yeah.
B: Sounds good. Sean, I saw you wrote a comment; I rolled out of bed to join this meeting, so I haven't read it yet, but I'll read it right after this.

A: Yeah, my comment, yeah. No, I asked you on Slack, actually, like, when do you want to pair up? So just let me know on Slack. But if you want to roll back into bed for a bit, that's totally fine, because it's, like, before 7 where you are. So yeah, cool. Anything...

D: ...else? Thanks so much for joining the call, looking forward to seeing what we find. This is interesting.