Description
Walkthrough of what an strace of a typical request in Puma looks like, and a quick comparison of how Unicorn differs.
Part 1: https://www.youtube.com/watch?v=fcY5pPUq35U
Part 2: https://www.youtube.com/watch?v=tThs8QeP2qY
Part 4: https://www.youtube.com/watch?v=dgJH4wpR5OE
Hi everyone, welcome to another strace video. In this one we're going to use strace to understand what Unicorn, Puma, and Rails in general are doing when a request comes in. So let's dive in.

Okay, so here we have a GitLab 13.2 instance running Puma, and if we just look at ps we can see we've got two cluster workers, each of which has four worker threads that can each handle a request separately. So we basically have eight threads total that can take a request.
All right, so we're going to follow any child processes, get accurate timestamps, capture the time each syscall took with capital T, print out the file paths with -y, capture up to a kilobyte of each string, and here's our output file.
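The flags described here correspond to an invocation along these lines (the output path is a stand-in; in the video the trace is attached to the running workers with -p, while this self-contained sketch traces `true` instead):

```shell
# -f      follow child processes and threads
# -tt     absolute timestamps with microseconds
# -T      append the time spent inside each syscall as <seconds>
# -y      print the path or socket behind each file descriptor
# -s 1024 capture up to 1 KiB of each string argument
strace -f -tt -T -y -s 1024 -o /tmp/trace.out true
head -n 2 /tmp/trace.out
```

To trace a live worker you would replace `true` with `-p <pid>`, which usually needs root.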
Okay, so we'll kick off the trace and we'll pull up this project, which is on local storage. All right, there we go, and we can see we actually have three different requests that came out from the performance bar here.
And here, this is kind of an idle system with no traffic, because this is just a little demo instance with me as the only user.
All right, here is the first interesting line: the select. So basically right here, this process is listening on these two addresses, gitlab.socket and localhost:8080, and seeing which one of those comes back first. Actually it's listening on three: there's also this pipe from somewhere, which might be from its parent process. Anyway, gitlab.socket is going to get any traffic from Workhorse, so external requests. localhost:8080 will get requests from GitLab Shell, so SSH requests that come in, and some Gitaly requests as well; those fire off internal API calls to make sure they're allowed to do whatever the request is. Okay, so that select is just saying: either look for external requests or look for internal requests, whichever one comes in first.
Here's the encoding we accept and so on; you can see the performance bar enable right there, okay. And then it's a total of roughly 16k of data, and we only capture the first kilobyte of that, so we can't see the full request there. We could use the browser tools if we really cared about that. Okay, so: Unicorn and Puma are essentially single threaded within the context of a given request. That's not entirely true, but it's true enough for our purposes.
Okay, so this is a thread on one of the cluster workers: cluster worker one. Yeah, okay, so two-two-three... okay, yeah, there's the parent process, which is cluster worker one here. So that's one of the threads, not the lead thread, which is the PID shown in ps. And here you can actually see these getpid calls right here. This is asking: what's the actual process ID, the actual thing that you see in ps, like we just saw.
Okay, futex is just synchronizing with other threads. I'll skip over most of this and just get to the interesting parts. So the first thing we do, before anything else happens, is write out to production.log that we're starting this request. clock_gettime is something we'll see a lot with Puma and Unicorn.
Typically what that means is that something is happening in the background. When you see a long string of clock_gettimes, we're doing some kind of processing in userland and we just need to know what time it is; we're not accessing the kernel at all other than to get the time. So when you see a lot of those, it's kind of idle in terms of the kernel, not in terms of the actual worker: it's doing something.
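One rough way to see those clock_gettime stretches in aggregate is to count syscall names in the capture. The trace path here is a stand-in, and this quick tally ignores "resumed"/"unfinished" continuation lines:

```shell
# With -f -tt, column 3 of each line is "syscall(args...". Strip the
# arguments and count occurrences of each syscall name.
awk '$3 ~ /^[a-z_0-9]+\(/ { sub(/\(.*/, "", $3); n[$3]++ }
     END { for (s in n) print n[s], s }' /tmp/trace.out | sort -rn | head
```

On an idle-looking worker, clock_gettime and futex tend to dominate this list.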
Okay, so here we're starting to query the database. This is a local Postgres, so it's over a unix socket. If it were an external Postgres it'd be over TCP, and it may or may not be encrypted at that point, which would make it hard to read, but conveniently local is not encrypted. So you can see that we're actually sending a SELECT to Postgres and we're getting something back, though we haven't actually made a real query there as far as I can tell. So here we go.
We make our first query: we're selecting namespaces, and we can see all this right here, select namespaces.
So we have this long query where we're getting the namespace detail, and then here you can see the response back, which is not quite human readable, since most of the data is binary encoded. But you can kind of guess at what it's saying, and it's pretty obvious that we got back a result as opposed to an error or anything like that.
And these are all queries that I think are generated by Rails itself. Okay, and I guess we're getting pretty close... no, okay, yeah, all right. So now we've found our route, we've found our method, so here we write out to production.log what we're doing.
All right, so here we're hitting Redis. When you see these, you know, dollar signs: that's a length marker, with each piece on its own line, and then GET. This is the actual Redis query we're making, getting the session. You'd probably see this in the performance bar.
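Those dollar signs are the Redis wire protocol (RESP): `*` gives the number of parts in a command, and each `$` gives the byte length of the string that follows it. A GET for a key looks like this on the wire (the key name here is illustrative, not the actual session key):

```shell
# "*2"  = the command has two parts
# "$3"  = a 3-byte string follows ("GET")
# "$7"  = a 7-byte string follows (the key)
printf '*2\r\n$3\r\nGET\r\n$7\r\nsession\r\n'
```

This is why the strings in the strace output look like `$3\r\nGET` rather than a plain `GET session`.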
Maybe not, but yeah, you should be able to see most of those in the performance bar. Okay, more Redis. It looks like we lose our connection to Redis briefly, and then we reconnect.
Okay, now we're back to querying on namespaces; down here we're just getting a group back. Okay, now we're seeing if the cache for the project is... this should show up here, I think. Let's see.
All right, so here's what I was saying about these stretches of clock_gettime. When you see this, it's doing something that we can't see in strace, because strace is only showing us what happens when we reach out to the world, and this is all internal processing. So it's got some value back from Redis and some value back from the database, and it's trying to decide what to do next, or maybe rendering. No, not rendering yet. Okay, more Redis queries.
Okay, this is the internal IP address of the server, I think, which is odd because it should be local storage. For some reason it's trying to connect over UDP, and it doesn't work because we're not listening on UDP.
We set some options on the socket, and here we try to connect to that same IP address we saw above for UDP, but with TCP and in an IPv6 format this time. Once again it's port 8075, which is how we know this is actually Gitaly. All right, so we connect, and now we're epolling here. We try to connect, it comes back and says it's working on it, so here we create an epoll on that, which is basically saying: tell us when that connect finishes.
Okay, and then eventually we're connected, and now we send a message. So this is an actual Gitaly message; let me bring this up to the top of the page so you can see it. Because it's gRPC, we're sending an HTTP/2 request.
And you can kind of see here the length of each section of this request. Most of this is in binary and it's not really legible, but some parts will be; in this particular request you don't see anything yet that is very easy to read. Okay.
And it's all binary. I think this is just setting up the connection in the first place. So when I said at the start of the video that Unicorn and Puma are single threaded, that the work is enclosed entirely in a single thread of execution, this is where that's actually not quite true. You will see the Gitaly requests go out on the main thread or process, depending on whether it's Puma or Unicorn, but then on both of them there are listener threads.
Okay, all right, so here we're getting back an actual response.
Let me bring that up to the top. All right, so this is a little bit easier to follow along. You can see we're calling the GetTreeEntries RPC. Here is the bearer token that we send to Gitaly to authenticate that we are who we say we are. Basically, what we do is take the current unix timestamp, do a hash of that against whatever secret you set in your Gitaly config, and that creates a string; then we add the timestamp to the end, to say "and here's the timestamp that I signed", to make it easier for humans to understand what's going on. Or actually, I guess it also helps validate it. In either case: here, this looks like the correlation ID; here, session ID; maybe this is the correlation ID.
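A sketch of that token scheme: HMAC the unix timestamp with the shared secret, then append the plain timestamp so the receiver can recompute and compare. This matches Gitaly's "v2" tokens as I understand them, but treat the exact format (HMAC-SHA256, the `v2.` prefix) as an assumption to verify against your version; the secret below is obviously a stand-in:

```shell
secret='not-the-real-secret'   # stand-in for the configured Gitaly shared secret
ts=$(date +%s)                 # current unix timestamp
# Sign the timestamp with the secret...
sig=$(printf '%s' "$ts" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')
# ...then append the plain timestamp so the server can redo the HMAC itself.
echo "v2.${sig}.${ts}"
```

Because the timestamp is part of the token, these tokens expire quickly, which is why the clock skew between Rails and Gitaly matters.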
The deadline type is regular, so that's going to be 30 seconds, I think. And then here we have the actual path down the hashed storage to that project.
Here's the name of the project, which we can see. That should be it, so we send that. Yeah, okay, so here we go; this is what I was talking about. On 20243 we send that, and then milliseconds later we start querying the database, seemingly without ever getting a response back. But let's see if another thread got it back. Yeah, okay, so 20833 did receive a message back, and it's binary; we can't really understand it.
That's just the header, actually, and then down here we get the full data back. Okay, so yeah, that's what I was saying: there's this child thread that's just handling Gitaly responses, essentially. So let's go back to the main thread now. All right, here we see if we have any project mirrors, and now notification settings. So, things to keep in mind:
You want to just look at the timestamps here, and what you're looking for are big jumps. Keep in mind that most syscalls are really fast, so it doesn't really matter if you see ten thousand lines of a fast syscall one right after the other, because that's not a big impact. A lot of the time what we're looking for is one request that took five seconds, for example.
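Since -T appends the time spent in each syscall as `<seconds>`, those big jumps can also be pulled out mechanically rather than by eyeballing timestamps. The threshold and trace path here are arbitrary:

```shell
# Print any line whose trailing <duration> exceeds half a second.
awk '$NF ~ /^<[0-9.]+>$/ {
       d = substr($NF, 2, length($NF) - 2)   # strip the angle brackets
       if (d + 0 > 0.5) print
     }' /tmp/trace.out
```

A blocked connect, poll, or read to LDAP, DNS, or the database shows up immediately with this.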
We're seeing if we have Auto DevOps turned on; I think we do, because I didn't turn that off. All right, then we query the license.
Some more Redis queries. So, things that can commonly cause problems but are not immediately obvious: a slow LDAP sync. If we're waiting on DNS to get to your LDAP server, or the LDAP server is taking a long time to get back to you, you're not going to see that in the logs anywhere. That's something strace is useful for: you can actually see that we're getting a DNS entry for LDAP, we're calling out to it, it's taking five seconds to respond, and that's what makes everything slow.
That's one example that's happened somewhat often. Things like the database being slow or Gitaly being slow are usually pretty obvious from the logs; again, we log those durations in production_json.log and api_json.log, so it's not going to be hard to understand those, and I wouldn't use strace on those in general, unless you're trying to understand something very specific about what we're getting stuck on. And even then, assuming the page is loading, you can use the performance bar to really dig into what's taking a long time and where it's coming from.
Okay, those are Gitaly requests. This time we're getting TreeEntry, which is, I believe, going to tell us the details of a file tree entry. Yeah, okay, so this is probably... yeah, all right, so we're getting README.md by the look of it. You can see the file name here, and here's the file name down here.
Yeah, all right, and down here we write out to production_json.log the full details of the request. Okay, great. So that is kind of the life cycle of a standard request that you'll see: it comes in, we query Redis and the database a lot, we make a few Gitaly queries, we spend some time getting what time it is, and then we write out to the logs and to the socket with the response we have.
One thing to note is that we do log before we write the response, so it is every now and then possible that we've logged that we were successful even though the request actually fails. This happens a lot when Workhorse drops the connection. So what will happen is: a client makes a connection, it goes to Workhorse, Workhorse forwards it down to Unicorn or Puma, Puma starts the work, but then the client drops the connection. Workhorse then drops the connection to the socket as well. So Unicorn will do all the work,
log a 200 (or Puma will log a 200), write it out to the socket, and then fail to write it because the socket has already been closed. So in that case you've got three different response codes being logged: NGINX will log a 499, because the client hung up; Workhorse will log a 502, because Puma didn't respond in time; and Puma itself will log a 200, because as far as it knows it succeeded, right up until it actually tried to write the response.
So that's a little confusing, and something that happens fairly often, just to be aware of. Okay, so that is again the basic life cycle of a request. Let's actually go back up to the start here... find that... okay. So with Puma, the process that gets the request in is usually not the one that actually does all the work.
Yeah, so we've got 20246 and then 20243. 20243 doesn't take it, but 20246 actually took in the request, and you can see these are both threads of the same process. So what we basically have is one thread that just listens on the socket and accepts the request as it comes in, but the PID that accepts the request will not be the one that actually does the work.
So then you'll see a different PID do a recvfrom and actually read in the details of the request over the socket. That's what you're seeing here: 20246 accepts it and responds with the initial response, and then 20243 does the actual work and sends back the full response at the end.
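With -f, each line of the trace is prefixed with the thread ID, so you can replay a single thread's view of the request by filtering on that first column. The PIDs are the ones from this trace; yours will differ, as will the path:

```shell
# Show only what thread 20243 (the one doing the work) did, in order.
grep '^20243 ' /tmp/trace.out | head
```

Doing this for the accepting thread and the working thread side by side makes the hand-off described above easy to see.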
Okay, let's briefly go over the differences with Unicorn, which are not very big. We'll just go over that same thing we just went over, with Unicorn this time, real quick. And, what is it, enable or enabled? I can't remember... enable, okay. And let's load the page. Yeah, 502, all right; I'll give Unicorn 30 seconds to come up. I know this is the best part, waiting here for this to come up, but we'll be basically done once it does. So if you're not interested in Unicorn, feel free to stop the video. All right, great, it's up.
5628 is the same process that both accepts and actually does all the work. So, like I mentioned at the start of the video, Puma has multiple threads per process, and each one of those threads can handle an individual request. With Unicorn, each process is kind of an island: there's one main thread within it, which does the work, with the notable exception of the gRPC listener threads. But other than that,
it's just one process doing all the work, with no real threading involved at all. So in this case you see that the process gets the request, the process then goes and does the work, and then...
Right now it's Unicorn, and if we look here, the process ID in ps actually matches what we see in strace, because it's the main thread within the process that's doing all the work. Okay, so that's the only real difference between Puma and Unicorn; other than that, all the work you see within the strace is going to be the same. Yep.
So that is a quick summary of what you can see in a typical Rails request. Great, all right. Next time we'll go over Gitaly, which will be interesting. Thanks.