Description
Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0
When building a large enough set of services using Node.js, there will be a point when you find that your application is suffering from performance or memory issues. When this happens, you have to roll up your sleeves, get your tools, and start digging. This talk explains how you can use tools such as ab, flame graphs, heap snapshots, and Chrome's memory inspector to find the cause of these issues. We will go over two real-life issues, a CPU bottleneck and a memory leak, that we found while building our services at Auth0, and also explain how we fixed them.
And I'm here to talk about a couple of things that we found over these four years that are kind of the usual suspects: things that you find are making your application or your services crash, and how we find them and how we get them fixed. So this is kind of a repeatable process.
There are two important things that we want to talk about. One of them is memory leaks, and the other one is CPU bottlenecks, or performance-related issues. So let's start with the first one: memory leaks.
We first have to define what it is; this first part is going to be pretty fast. The main cause of a memory leak, as a colleague of mine likes to say, is unwanted references: we are keeping something alive that we aren't going to be using in the future. We don't need it.
So if we represent the memory model like this, we can see that the garbage collector has kind of pointers to what it calls roots, and those roots have arrows, or references, to our objects, and then we have kind of a dependency graph.
So when we do something like this and we say, OK, b.d = null, what we're saying is: get rid of that arrow. That's it. At this point, when D is no longer referenced, the garbage collector comes in and says: hey, let me take that out of there, you don't need it anymore, and you're good. So that's how your program keeps running over time.
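As a minimal sketch of that mental model (the names b and d mirror the slide; when V8 actually collects is up to the runtime):

```js
// Build a small reference graph: a root-reachable object b pointing at d.
const b = { d: { payload: Buffer.alloc(10 * 1024 * 1024) } }; // ~10 MB

// "Get rid of the arrow": drop the only reference to that inner object.
b.d = null;

// Nothing reachable from a GC root points at it anymore, so the garbage
// collector is free to reclaim it on a future pass.
```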
You end up with something like this, which is the sawtooth pattern: you start allocating memory, you allocate a bit more, and every once in a while you start having things that you don't need, so you get a large drop and then another one. And at this point, if you keep accumulating a lot of things you don't need, either your browser or, in this case, your Node application will crash.
So this is what we were saying. This is an actual chart from one of our monitoring services. Basically, what you can see is that memory went up and it kept going up; there were no garbage collections, and the process just died.
So this is a really bad situation to be in; if you think about it, this is not the best place to be. So you have to figure out what's going on, and you have to go: OK, find them and bust them, right? How do we find it? How do we fix it? One important thing that we learned is that, if you're ever in this situation, the first thing you need to do is take control.
You don't want to just start researching and leave everything as is. That's because every time the application crashes, responses to requests are not being generated, so some requests are failing. You don't want that. One trick is to increase the heap size of your process: in general it's like 1.2 to 1.4 gigabytes, if I remember correctly, and you can raise it with V8's --max-old-space-size flag. If you make that higher, your application crashes less often. The other thing that you want to do is drain connections: if your memory reaches like the 80% or 85% limit, you should probably stop accepting new requests and just process the current ones. What you're basically doing is garbage collecting manually by restarting the process. Not ideal, but you have to buy time until you can find the real reason.
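A minimal sketch of that drain idea, assuming an http server and a supervisor that restarts the process; the threshold and interval are illustrative:

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));
server.listen(3000);

// Assume we started with --max-old-space-size=1536, i.e. a ~1.5 GB heap.
const HEAP_LIMIT = 1536 * 1024 * 1024;

setInterval(() => {
  if (process.memoryUsage().heapUsed > HEAP_LIMIT * 0.85) {
    // Stop accepting new connections, finish the in-flight requests,
    // then exit so the supervisor (pm2, systemd, ...) restarts us.
    server.close(() => process.exit(1));
  }
}, 5000).unref();
```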
Once you have done this, the first thing you should do is get a heap snapshot. What is a heap snapshot? It's kind of a picture of everything in your memory, in your heap. You can use a profiling module for this; there are other tools. It basically allows you to send a signal to the process, and it will take a heap snapshot, right?
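On recent Node versions you can get the same signal-driven behavior with the built-in v8 module instead of an external one (the choice of SIGUSR2 here is an assumption; pick any free signal):

```js
const v8 = require('v8');

// From a shell: kill -USR2 <pid>  ->  dumps a snapshot next to the process.
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot(); // returns the generated filename
  console.log('heap snapshot written to ' + file);
});
```

You can then load that file in the Memory tab of Chrome's DevTools and inspect it.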
So let's dig a bit deeper into this: what does this mean, what can I do with the heap snapshot? So, let's see. Can you see that? Yeah? What we're seeing here are all the objects in our application, in our service: the type of them; how far they are from the root, so the distance; how many of them there are; the shallow size, which is basically how much they occupy in memory; and the retained size, which is the size of everything else that they are pointing to and keeping alive. So these are fairly different.
My first recommendation, if you're looking for a memory leak, is to check the strings, in case for any reason you are creating a lot of them. Strings are good because they are very contextual: based on the content of a string, you can figure out where in your program that string is being created. Many of these are, of course, strings from node_modules code; that's always kept alive in memory. And then we started seeing some of these.
So this is, again, an actual heap snapshot of a memory leak we found, and this is how we send logs to Kinesis, to our stream service. So the next thing you do is you come here, you pop this thing up, and you see the retainers: who is keeping that object in memory, right? So we have a body, and that's being pointed to by an HTTP request. This is an anonymous function, so name your functions: if you name it, you will have a better name here. And eventually we get here, right?
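A tiny sketch of the "name your functions" advice (request here is a stand-in for whatever HTTP client is in play):

```js
// Stand-in for the real HTTP client.
function request(options, cb) { cb(null, {}, '{}'); }
const options = { url: 'https://kinesis.example.com' };

// Hard to trace: shows up as "(anonymous)" in snapshots and stack traces.
request(options, (err, res, body) => { /* ... */ });

// Easier to trace: "onKinesisResponse" appears by name in the retainer tree.
request(options, function onKinesisResponse(err, res, body) { /* ... */ });
```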
So we see a forever agent has sockets keeping a key for Kinesis, and then there's a TLS socket. So we see that this is being kept alive, eventually, by this forever agent. So what does this mean? It means we have something like this as a mental picture: a sockets object that's pointing to another object that has a key, and that key has an array of TLS sockets. That's how the forever agent works.
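Roughly, that mental picture as data (a simplified sketch; the real agent keeps more bookkeeping than this):

```js
// Placeholders standing in for live TLS sockets.
const tlsSocket1 = {};
const tlsSocket2 = {};

// One entry per origin: the key encodes the host and port, and the value is
// the array of sockets kept alive for that origin. The leak showed up as
// this array growing on every request instead of sockets being reused.
const agent = {
  sockets: {
    'kinesis.us-east-1.amazonaws.com:443': [tlsSocket1, tlsSocket2],
  },
};
```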
It keeps a socket, or more than one, alive for each of your origins, right? But if you're doing keep-alive and you're sending logs, you shouldn't be creating a new socket on every connection; that's what keep-alive is for, to avoid that. So that was kind of the first sign that something was off. I'm explaining all of this in a very sequential manner, but it was a bit more chaotic than that. You can look at the PRs for this stuff and say, well...
We didn't really know exactly what was happening, right? But we had a couple of different approaches. One thing that we found is that the AWS SDK actually wasn't getting rid of a couple of event listeners, so they were always alive and kept holding references to the strings. And the other thing we found is that the forever agent was actually creating a new socket, on a new connection, every time we logged. So the more our application was used and the more logs we generated, the more memory we consumed. That was the "find it" part.
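The listener half of the problem looks roughly like this; a hedged sketch of the pattern, not the actual AWS SDK code:

```js
const EventEmitter = require('events');
const bus = new EventEmitter(); // some long-lived emitter

function handleLogBatch(batch) {
  // Leak: a new listener per call, never removed. Each closure, and every
  // string it captures, stays reachable from `bus` forever.
  bus.on('flushed', () => console.log('flushed', batch.length));
}

function handleLogBatchFixed(batch) {
  // Fix: remove the listener when done, or use once().
  bus.once('flushed', () => console.log('flushed', batch.length));
}
```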
It took a lot of time, but once we found it, we said: OK, let's go back, use the normal agent with keep-alive instead of the forever agent, and that's it.
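A minimal sketch of that fix using Node's standard agent (the host and socket limit are placeholders):

```js
const https = require('https');

// The built-in agent with keepAlive: true reuses one pool of sockets per
// origin instead of piling up new connections.
const agent = new https.Agent({ keepAlive: true, maxSockets: 10 });

https.get({ host: 'example.com', path: '/', agent }, (res) => {
  res.resume(); // drain the response so the socket goes back to the pool
});
```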
But this hopefully gives you an idea of how you can find a memory leak, figure out what it is, and fix it. Once you fix it, you'll go back to the sawtooth, right? So this is kind of the normal graph: these are not restarts that we forced; this is real memory being garbage collected. OK, so the other thing I want to talk about is CPU bottlenecks.
The thing is, you start saying: OK, we want to create performance tests for this, to avoid regressions and stuff like that. And if you are doing something like that, you would expect a chart like this one, which has a minimum; in this case it's 400, but that's because of latency: it's like 200 milliseconds to the server and 200 back. And then you start getting the errors, right? That's probably the small bar there: OK, we have some bad requests, so you don't have to do a lot of processing.
We found this tool called flame graphs, and what flame graphs do is give you a representation of how much time each function is taking in your program: the wider the bar of a function, the more time it's taking. So in this case, B is taking 80% of the program, C and D are taking the same, and what you see in the height is the call stack, so it's how you actually got there, right?
Let's do a demo. OK, so let's see. This is kind of a beautiful simplification of the problem that we had. So we are using this store, which has a user and the hash for that user, right? That's the password hash, and we know there's not a problem with the store, so we can just keep it in memory. And then we have an authorize endpoint, which basically fetches something from the store and does a comparison of the password and the hash, and that's it, right? So, where's the problem?
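A hedged reconstruction of that ~30-line file (the framework and names are assumptions; the important detail is the synchronous bcrypt call):

```js
const express = require('express');
const bcrypt = require('bcrypt');

// In-memory store: user -> password hash. The store is not the problem,
// so keeping it in memory is fine for the demo.
const store = {
  alice: bcrypt.hashSync('correct horse battery staple', 10),
};

const app = express();
app.use(express.json());

app.post('/authorize', (req, res) => {
  const hash = store[req.body.user];
  if (!hash) return res.sendStatus(401);

  // The suspect: compareSync is CPU-bound and runs on the event loop thread.
  const ok = bcrypt.compareSync(req.body.password, hash);
  res.sendStatus(ok ? 200 : 401);
});

app.listen(3000);
```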
A
This
is
actually
a
like
30
line
file.
We
have
like
tens
of
thousands
or
even
hundreds
of
thousands
of
lines
if
you
consider
dependencies,
so
it's
a
lot
simpler
than
it
looks,
but
let's,
let's
not
even
take
a
guess
right.
So
what
we
can
do
is
we
can?
Can
you
see
that
yeah?
So
we
can
come
here
and
say:
okay,
I'm,
going
to
run
this
program
as
like
in
benchmark
mode
using
a
tool
called
zero
X
and
what
0x
allows
you
to
do
is
to
get
flame
graphs
from
your
node
code.
This will actually ask for my password, because it requires permissions for some kernel-level stuff, so root. And then I can generate load. We can use ab or any other tool for this, and I'm doing a hundred requests in total with a concurrency of ten (ab -n 100 -c 10), so like 100 people logging in, right? And you really have to be patient, because, oh, we have a problem: this is a CPU issue. So it finishes, and the times that you can see here are kind of similar to what we saw before.
OK, so that part is done. OK, so let's see, what's this here? Although you probably can't read it, it's bcrypt compare; that's like 34% of the code. The same thing here, bcrypt compare; it's actually on different bars because the call stacks are different, depending on when the event finished, parsing the body and stuff like that, but again it's always going into the same function. And the same thing here: bcrypt compare. So evidently we have a problem, and it's related to bcrypt compare.
If you take a look at our implementation, we are calling the sync version, compareSync. So then we say: OK, async things are better; we'll just change that to run asynchronously, right? But what does asynchronous mean in this case? Well, this is a CPU-bound operation, and being a CPU-bound operation, it would block the event loop. So what we're doing (and this is related to a talk I saw earlier today) is queuing this on the libuv thread pool. So let's run this again and...
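Sketched against the handler above, the async version hands the work to bcrypt's native code, which runs it on libuv's thread pool and calls back on the event loop:

```js
app.post('/authorize', (req, res) => {
  const hash = store[req.body.user];
  if (!hash) return res.sendStatus(401);

  // Runs off the main thread; the event loop stays free to accept requests.
  bcrypt.compare(req.body.password, hash, (err, ok) => {
    if (err) return res.sendStatus(500);
    res.sendStatus(ok ? 200 : 401);
  });
});
```

Keep in mind the libuv thread pool defaults to four threads, so the CPU work doesn't disappear; it just stops blocking the loop. That's why the next step is scaling.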
So we need to scale, and scaling is not just about handling throughput; it's also about how much money you spend handling that throughput, because if you buy a Cray supercomputer, well, it's not that good. So you could think about using a faster hash function. The problem is that that's not safe: the reason we use a slow hash function is that if someone gets hold of the hashes, they have a hard time cracking them. So that's not an option for us; we're a security company, we don't want to go that way.
And with vertical scaling you don't have a lot of elasticity. So then you go kind of the horizontal scaling way, which can be combined with vertical scaling. But if you create multiple auth services, you run into another problem, which is that the service not only allows you to log in; it can also allow you to change your email, and changing your email is an I/O-bound operation, because you just go to the database, unlike login, which is again CPU-bound.
We fixed this by creating a service called baas; it's open source, you have the link there. The idea is that you have the same interface as you do with bcrypt, but it actually works like this: you set up the baas service behind a load balancer, it communicates with the client using protocol buffers or Avro, depending on the day and your configuration, and it does two things: it either compares a password to a hash, or it hashes a password. That's it.
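A hedged sketch of what "same interface as bcrypt" means; this client shape is illustrative, not the actual node-baas API:

```js
// Hypothetical client: bcrypt's signatures, but the CPU-heavy work happens
// on a remote baas cluster. The real thing would keep a socket open and
// speak protocol buffers or Avro; here the transport is stubbed out.
function createBaasClient({ host, port }) {
  return {
    hash(password, cb) { /* send { op: 'hash', password } to host:port */ cb(null, '<hash>'); },
    compare(password, hash, cb) { /* send { op: 'compare', ... } */ cb(null, true); },
  };
}

const baas = createBaasClient({ host: 'baas.internal', port: 9485 });
baas.compare('secret', '<hash>', (err, ok) => console.log(err, ok));
```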
The good thing is that it's very easy to figure out when this is actually going to be a bottleneck, and to autoscale, because you can measure the work very effectively: you know that it takes between 70 and 100 milliseconds to run a bcrypt comparison, so you can say: OK, I can handle like ten of these per second per core (1000 ms / 100 ms = 10). That's it; that's when you scale. What's important, regardless of the numbers, is to always do the cost comparison: did I achieve the desired throughput, and am I spending the lowest amount of money that I can, right? So those are the two key things. And always fail gracefully when you introduce a new dependency. So you see, she's keeping her hands up even though she didn't stick the landing; that's good. If you introduce a new dependency, you should be able to run things in a different way.
So if the baas cluster starts to fail, what we do is turn back to running the bcrypt comparison locally, as a fallback, and that gives our operations team time to figure out what's going on and get the cluster back up. It's not ideal in terms of cost, it's not ideal in terms of performance, but it is ideal in the sense that it's the best we can do for our customers right now.
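A minimal sketch of that fallback, assuming the hypothetical client above; the policy is the point, not the exact code:

```js
const bcrypt = require('bcrypt');

function compareWithFallback(password, hash, cb) {
  baas.compare(password, hash, (err, ok) => {
    if (!err) return cb(null, ok);
    // Cluster unreachable or unhealthy: run the comparison locally so logins
    // keep working while ops brings the cluster back. Slower and costlier,
    // but available.
    bcrypt.compare(password, hash, cb);
  });
}
```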
One last thing: cat picture. I was missing one of these, so I'm adding it.