Node.js Node + JS Interactive 2019, 19 Dec 2019


From YouTube: Throw Me a Lifebuoy: Debugging Node.js in Production with Diagnostic Reports - Christopher Hiller

Description

Christopher Hiller, IBM
Diagnostic Reports are a recent addition to Node.js core. This feature enables insight into Node.js processes running in production—without needing to attach a debugger—and the results can be interpreted offline. If you've ever had to debug issues in production with a customer, you know this can be a life-saver.

I’ll show you how to trigger report generation manually and automatically, then use the results to diagnose a problem process. While this is fine and dandy, manual diagnosis can be tedious, so I'll also demo a toolkit I've been working on. This toolkit can help automatically detect known issues, redact secrets from a report, and much more.

A

So this talk is about diagnostic reports in Node.js. It's going to cover some of the material that Gireesh covered yesterday, but I'm also going to give you an introduction, say a few things about what you can do with diagnostic reports and how to use them, the basics, and I'm going to talk about some tooling I built to help you use them.

A

So my name is Chris Hiller. I come from Portland, Oregon. I'm known as boneskull on the internet. I work for IBM, primarily working on Node.js-related things. I'm a maintainer of Mocha, which is a testing framework, and as a Mocha maintainer I'm also involved in the OpenJS Foundation Cross Project Council. I am boneskull on GitHub and on Twitter. If you have nothing better to do, you can look at my tweets, and that's boneskull with a zero.

A

So I want to start with some hypothetical problems. So you have a hypothetical problem: your process crashed. What happens in a process crash is, if you're lucky, you're going to get a stack trace somewhere.

A

Unless things went really south. But so you might get a stack trace, and you're a developer and you've been tasked with investigating the stack trace and trying to figure out what's going on. So remember, this is a dead process.

A

Maybe your stack trace is in your logs, and so you look at the stack trace, which says, oh well, you're doing something weird, and the stack trace points to this code, where you're calling rmdir because you want to delete a temp directory or something, and you pass this recursive flag. The error that you get looks like this: ENOTEMPTY, directory not empty, yadda yadda yadda. So, okay, why would this fail? Some of you may have an idea. You're passing a correct flag.

A

Your meticulous integration tests pass. Works on your machine, works in CI, build's green, but this happens. So one way to help you figure out this problem is to use a diagnostic report. Can you see that? Anyway, it says, use a diagnostic report. So let me describe the diagnostic report, and this is the gist: it's an experimental module, some functionality added, and it's in Node 12, so this is in LTS. You can use it, but it is an experimental API, so that means it's behind a flag.

A

You need to pass a flag to use it. Experimental, if you're not familiar, in the Node sense essentially means the API or the behavior could break outside of the normal major release cadence. So if you do start using them, please be aware that they could break. That being said, they do their job very well, but the API might change, and the output might change slightly, before we hit the next major.

A

So essentially what this is, is a huge JSON dump reflecting the state of the process. Most of the ones I've seen work out to be roughly 25K. You can trigger it several ways: you can give it some command-line flags, you can create one programmatically, and you can even tell it to dump a diagnostic report when the process receives a user signal. So how do we want to create a report in this case, where we've got this process

A

that's crashed? So we're going to start up that process again, except with a few of these flags. First, --experimental-report: you need that to do any of this stuff right now. Then you see --report-uncaught-exception, and then we give it a nice filename. You don't need to pass the filename, but in our case that will be helpful; normally it'll create this very long filename based on the timestamp. So you run this in production, and time passes, and now you have another problem.
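As a rough sketch of the restart described above, the flags could be collected in a small Node.js launcher script. The helper names (`buildReportArgs`, `launch`) and the file names `app.js` and `report.json` are hypothetical and only for illustration; the flags are the Node 12 ones named in the talk.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: builds the Node 12 flags that write a report
// when an uncaught exception takes the process down.
function buildReportArgs(script, filename) {
  const args = ['--experimental-report', '--report-uncaught-exception'];
  if (filename) {
    // Optional: without this, Node picks a long timestamp-based name.
    args.push(`--report-filename=${filename}`);
  }
  args.push(script);
  return args;
}

// Hypothetical launcher: starts the app again with reporting enabled.
function launch(script, filename) {
  return spawn(process.execPath, buildReportArgs(script, filename), {
    stdio: 'inherit',
  });
}

// Not run here: launch('app.js', 'report.json');
console.log(buildReportArgs('app.js', 'report.json').join(' '));
```

Keeping the flag list in one function makes it easy to drop the experimental flag later if a newer Node release no longer needs it.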

A

So now you have a diagnostic report, because it crashed, and now you have a lot of JSON. It looks kind of like this, where it's just this blob, and we can kind of zoom in and maybe take a closer look. It contains a whole lot of stuff, and I'm going to try to run through this pretty quick. There are eight or nine top-level properties, depending, and the first one is going to be header, and that's going to tell you all about the report itself:

A

information about the Node process, the command line. You can see the Node version, the versions of the libraries that Node uses, operating system version, CPUs, all sorts of stuff. So that's going to be in the header. The next one, if we scroll down, and this is in order, is the JavaScript stack, and it's going to, of course, give you the stack. In this case, it crashed on an error.

A

Next you'll get the native stack, which is pretty far under the hood, and you may or may not need that, but it's there anyway. Next will be information about the heap, so this is your memory usage. Resource usage will be your CPU usage and a little bit about filesystem activity.

A

Next is this libuv property. It might need a better name, but it's essentially the state of the event loop: what's in that event loop right now. It gets a little technical, but there's stuff in this particular event loop. And over there, environment variables. This has been trimmed, but it's everything in your environment.

A

Windows users will not get this one: user limits. If you're a user on a Linux system, you'll have limits on what you can consume. Shared objects will be the shared libraries that Node is using. So what are we concerned with, what can help us solve the problem we have? Well, it would be here in the header. So we look in this header, and we want to focus on this: the Node.js version.

A

So the problem here is that rmdir with that recursive flag didn't land until 12.10, so your Node version is too old, but a stack trace wouldn't tell you that. So great, hey, you found the problem, good job. So you take this and you want to say, oh look, this is the problem, everybody, and you go into Slack with this big report.
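The check done by eye in the header could be scripted along these lines. This is a minimal sketch, assuming the report has already been parsed from JSON; `parseVersion`, `isAtLeast` and `supportsRecursiveRmdir` are hypothetical helper names, and the sample version is made up.

```javascript
// Parses strings such as "v12.9.1" into [12, 9, 1].
function parseVersion(version) {
  return version.replace(/^v/, '').split('.').map(Number);
}

// True when `version` is the same as or newer than `minimum`.
function isAtLeast(version, minimum) {
  const a = parseVersion(version);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    const left = a[i] || 0;
    const right = b[i] || 0;
    if (left !== right) {
      return left > right;
    }
  }
  return true;
}

// The recursive option for rmdir landed in 12.10, as noted in the talk.
function supportsRecursiveRmdir(report) {
  return isAtLeast(report.header.nodejsVersion, '12.10.0');
}

// Made-up sample standing in for a parsed report file.
const sample = { header: { nodejsVersion: 'v12.9.1' } };
console.log(supportsRecursiveRmdir(sample)); // false
```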

A

You paste it in there, and now you have another problem. What you did was you just leaked the entire environment into Slack, or wherever you sent it. Maybe you sent it through email. Hopefully you didn't put it on Pastebin. But yeah, there's going to be your AWS stuff in there, who knows. So your team lead is pissed, and that's kind of what we need to avoid. So what are we going to do about this?

A

If you want to send one of these report files around, you need to make sure they're scrubbed of things that shouldn't get out. So what you do is you go back and you delete your Slack message, and you go and you open the report and you delete the secrets, and then figure out how to exit Vim. And then, of course, this is all very tedious.

A

So there is a tool that I was working on, and it's out now. It's called report-toolkit, and it's a tool for processing and analyzing diagnostic reports. It's kind of a multi-tool, so it does several different things. It's not Unix-y. You know how multi-tools kind of suck at doing any of those things, so they don't do any one thing great. But I'm getting ahead of myself. This does some cool stuff. It gives you a CLI tool to consume these things, and there's a programmable API.

A

You can check out the docs, which are incredible, and there's the repo up there. So what can we do? We can use report-toolkit and give it this redact command and pass it the report.json file, or whatever I called it, and what this command will do

A

is it will look for things that it knows are potentially naughty and need to be kept secret. It's based on the blacklist that the AWS git-secrets project uses, which you may be familiar with, but you can customize it to your needs. So what it'll do is replace all those terrible secrets in that report file with this string, and it'll overwrite the file in place. So nobody's the wiser, right?
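To make the idea concrete, here is a hand-rolled sketch of redaction by key name. It is not report-toolkit's implementation: the pattern list, the placeholder string and the `redactEnvironment` name are all assumptions for illustration, and unlike the command described above it returns a copy rather than overwriting a file.

```javascript
// Assumed, illustrative patterns; not report-toolkit's actual list.
const SECRET_KEY_PATTERNS = [/secret/i, /token/i, /password/i, /api_?key/i];

// Assumed placeholder string.
const PLACEHOLDER = '[REDACTED]';

// Returns a copy of the parsed report with secret-looking
// environment variable values replaced by the placeholder.
function redactEnvironment(report) {
  const env = report.environmentVariables || {};
  const redacted = {};
  for (const [key, value] of Object.entries(env)) {
    const isSecret = SECRET_KEY_PATTERNS.some((pattern) => pattern.test(key));
    redacted[key] = isSecret ? PLACEHOLDER : value;
  }
  return { ...report, environmentVariables: redacted };
}

// Made-up sample standing in for a parsed report file.
const leaky = {
  header: { nodejsVersion: 'v12.9.1' },
  environmentVariables: {
    PATH: '/usr/bin',
    AWS_SECRET_ACCESS_KEY: 'not-a-real-key',
    GITHUB_TOKEN: 'not-a-real-token',
  },
};
console.log(redactEnvironment(leaky).environmentVariables);
```

Matching on key names is only a heuristic; secrets can also hide in values, command-line arguments or stack messages, which is part of why a maintained tool is preferable to a one-off script.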

A

And so now you can safely pass this report around and share it with your colleagues, discuss it over dinner. But time passes and you have another problem. You have this process, and maybe it's even a test or something, and it's running, but you thought it should have stopped. It's not really a zombie process, but I'm just going to call it a zombie process.

A

So you don't know why, and this is weird. You open up your debugger and it doesn't stop. It's not doing anything; it's just sitting there. It's not hitting lines of code, you set breakpoints, whatever. So you don't know why. This is something that diagnostic reports can help you with: you can actually generate a diagnostic report on demand.

A

The process doesn't have to crash for you to get a diagnostic report. I know we love command-line flags, so we can pass --report-on-signal. By default, the process will respond to the SIGUSR2 signal, and that's configurable. So you'll start your process, and then you can do this sort of thing, kill with the signal and the process ID, and that sends the SIGUSR2 signal. When the process receives that signal,

A

Node will say, ah, it's time for me to create a diagnostic report, and so it'll dump a diagnostic report out.
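The same signal can be sent from another Node.js script instead of the shell. A minimal sketch, assuming the target process was started with the report-on-signal flag and runs on a platform that supports SIGUSR2; `requestReport` is a hypothetical name, and the `kill` parameter exists only so the function can be exercised without a real process.

```javascript
// Asks a running Node process to write a diagnostic report by
// sending it the signal it listens for (SIGUSR2 by default).
function requestReport(pid, signal = 'SIGUSR2', kill = (p, s) => process.kill(p, s)) {
  kill(pid, signal);
  return { pid, signal };
}

// Not run here; 12345 is a placeholder process ID:
// requestReport(12345);
```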

A

So you look at this diagnostic report, and I'm going to cheat, because I know where to look here. I would look in this libuv property, and I would go down and look. Oh, look, there's this timer.

A

So there's this timer, and it's active, so it's in the event loop, and it's referenced, so it's still around and hasn't been garbage collected. Its firesInMsFromNow is 999-something; that's a while, right? Using this you can get a clue: ah, so I must have created some setTimeout or setInterval or something, and I was off by several orders of magnitude, who knows. But that'll give you a clue to try to figure it out.
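The manual scan of the libuv section could look roughly like this in code. It is a sketch of the idea only, not report-toolkit's rule; `findLongTimers`, the one-hour threshold and the sample values are assumptions, while the field names follow the timer entry described above.

```javascript
// Returns active, referenced timers that will not fire for longer
// than `thresholdMs` milliseconds.
function findLongTimers(report, thresholdMs) {
  return (report.libuv || []).filter(
    (handle) =>
      handle.type === 'timer' &&
      handle.is_active &&
      handle.is_referenced &&
      handle.firesInMsFromNow > thresholdMs
  );
}

// Made-up sample standing in for a parsed report file.
const stuck = {
  libuv: [
    { type: 'timer', is_active: true, is_referenced: true, firesInMsFromNow: 999999999 },
    { type: 'timer', is_active: true, is_referenced: true, firesInMsFromNow: 250 },
    { type: 'tcp', is_active: true, is_referenced: true },
  ],
};

const ONE_HOUR_MS = 60 * 60 * 1000; // assumed threshold
console.log(findLongTimers(stuck, ONE_HOUR_MS).length); // 1
```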

A

Oh, this is where a problem could be. If you don't know where to look, report-toolkit can do this sort of thing for you. It has this inspect subcommand, and this is the thing I think is really neat.

A

So there are these rules. They're heuristics, just some algorithms and functions that accept a report file. The function examines the report file and decides what to do. There are built-in rules, and one of these happens to be the long-timeout rule, which will look for this very situation in your report file. So you could run this on your report file, any report file really, and it'll look and see: is there anything fishy going on here?

A

So one of those rules is the long-timeout one, where it will let you know if there's a timeout that's far off in the future and still active. And you could write your own rules for this. It's like a plugin system, so you could write your own. It works similarly... I can't say that word, but that's how it works. It works like ESLint, so you can write your own rules, publish them to npm. You could have

A

it talk to the blockchain. I'm not sure why you'd do that, but you could do that. And so this is what the output would look like.

A

So, pretty simple. It's just this kind of tabular thing where it says, oh, there's an error-severity issue in this report file, and the rule that was triggered is this one, and then there's the thing with this bad expiration time.

A

That's one of the rules. There are others that will look and make sure that your memory usage is within an expected range and your CPU usage is within an expected range. There's another one that will examine your shared libraries versus the libraries that Node was built with and tell you if there's a mismatch there. That's not going to be something most people would be concerned about, but if you're compiling Node, that might come up, where you, say, have a different version of OpenSSL than Node expects.

A

Another problem you might have: you've got this flaky process. It's running, and you're not sure why, but it fails once in a while.

A

You know, maybe it fails on one machine but not the other, and you can't really tell what the difference is. One way report-toolkit can help us here is that it provides a diff subcommand. You could take two reports, a.json and b.json, and give them to your favorite diffing tool, but that's for diffing source code or text files; it's not for diffing these report files. A neat thing is that when we know the data we have,

A

we can create a custom, purpose-built diff tool for it, and so that's what this is. It tries to ignore stuff that it thinks you probably won't care about, to improve the signal-to-noise ratio. It tries to make it nicer for you to look at your reports and say, oh well, that's how they're different, instead of this huge unified dump or side-by-side diff. And so it answers the question about your process:

A

how does this change? If you run this again and again, you can diff them all and say, how does the process change over time? Maybe that's a single process, maybe that's a process on several different machines, but you can diff any two reports this way, and the diff output looks something like that.

A

In this case, we see that the command-line flags are a little different. With this first report file, we actually passed -e for eval, and the command that was sent was actually, hey, just write a report. The other one, who knows, but it didn't have any command-line options. The first report was generated with 12.1, the second one was generated with 11.2. This is an excerpt of that diff.
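A purpose-built diff of this kind can be approximated by comparing only a chosen set of fields and ignoring everything else. The sketch below is not how report-toolkit does it; the `FIELDS` list, the helper names and the sample values are assumptions for illustration.

```javascript
// Hypothetical list of fields worth comparing; the rest is treated as noise.
const FIELDS = ['header.nodejsVersion', 'header.commandLine', 'header.osName', 'header.arch'];

// Reads a dotted path such as "header.arch" from a nested object.
function get(obj, path) {
  return path
    .split('.')
    .reduce((value, key) => (value == null ? undefined : value[key]), obj);
}

// Returns one entry per compared field whose value differs.
function diffReports(a, b, fields = FIELDS) {
  const changes = [];
  for (const field of fields) {
    const before = get(a, field);
    const after = get(b, field);
    if (JSON.stringify(before) !== JSON.stringify(after)) {
      changes.push({ field, before, after });
    }
  }
  return changes;
}

// Made-up samples standing in for two parsed report files.
const first = {
  header: { nodejsVersion: 'v12.13.0', commandLine: ['node', '-e', 'process.report.writeReport()'], arch: 'x64' },
};
const second = {
  header: { nodejsVersion: 'v11.15.0', commandLine: ['node'], arch: 'x64' },
};

for (const change of diffReports(first, second)) {
  console.log(change.field, JSON.stringify(change.before), '->', JSON.stringify(change.after));
}
```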

A

But yeah, that's kind of the idea there. And if you don't like the tabular kind of output, you can choose different formats. Maybe you want it in JSON or CSV or something.

A

Another thing is, maybe you've got processes that are crashing somewhere, maybe a lot of them, and maybe you're like, that's not a big deal, we can just restart them, because it's Node, right? But you want to know how frequently certain exceptions are happening, and maybe this will help you prioritize bug fixes or who knows what. But to be able to figure out how often a particular exception happens, you need to be able to count them.

A

So how do you count an exception? Well, you could take the whole exception and stuff it somewhere, who knows, but what you can do here is take a hash of that exception. There's some customization that can happen here, but you can take a hash and just output a little bit of JSON with a SHA-1 here, using report-toolkit. Of course you could do that with a script;

A

report-toolkit will do it out of the box. It'll also convert these diagnostic reports to CSV or JSON. You can filter stuff, so if you only want a couple of those fields, you can use filter. Table, of course, is the kind of output you saw before. Newline would be newline-delimited JSON, if you need that sort of thing. Numeric is kind of an experiment, where you can use it in a shell context and actually pipe it to something and maybe generate a graph.

A

There are these neat little tools that'll generate graphs in your console. You could do that and just combine it with filter to pick out only a certain field, and keep running that over time. Redact, of course, is essentially the same thing as the redact command. So you can combine these transforms. Writing your own and publishing it to npm? No, you can't do that. But so this is what

A

it would look like. You'd get this stack hash, and you can see there's a SHA-1 hash calculated for this. I think you need to be able to customize this a bit. Maybe your exceptions have some user information in them and you want to get rid of that; maybe there's some personally identifiable information in there. You should be able to pass it a regular expression or just a function, write your own and plug it into this thing, and it'll help

A

you generate that hash, and then you can give this to your logging tool or your metrics system, or what have you.
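For the script route mentioned above, a minimal sketch of hashing an exception from a parsed report might look like this. It uses Node's built-in crypto module; `stackHash`, the `strip` callback and the sample data are hypothetical, and this is not report-toolkit's own hashing scheme.

```javascript
const crypto = require('crypto');

// Computes a SHA-1 over the exception message and stack frames.
// `strip` can remove user-specific text so equivalent exceptions
// collapse to the same hash.
function stackHash(report, strip = (text) => text) {
  const { message = '', stack = [] } = report.javascriptStack || {};
  const text = strip([message, ...stack].join('\n'));
  return crypto.createHash('sha1').update(text).digest('hex');
}

// Made-up samples: same failure, different user IDs in the message.
const crashA = {
  javascriptStack: { message: 'Error: user 42 not found', stack: ['at find (/app/users.js)'] },
};
const crashB = {
  javascriptStack: { message: 'Error: user 7 not found', stack: ['at find (/app/users.js)'] },
};

// Assumed customization: replace digits before hashing.
const stripNumbers = (text) => text.replace(/\d+/g, 'N');

console.log(stackHash(crashA) === stackHash(crashB)); // false
console.log(stackHash(crashA, stripNumbers) === stackHash(crashB, stripNumbers)); // true
```

Counting then becomes a matter of grouping by the hash in whatever logging or metrics system receives it.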

A

So I think that's about it. What we learned is what a diagnostic report is and how to create them. That's not every way you can create them, but that's a couple of them.

A

You can also create them programmatically, which might be useful if you're trying to grab them in, like, a serverless environment. We learned how you can use them to solve certain problems; they're especially useful, of course, in postmortem debugging, where you don't have the option of running a debugger because your process has stopped. And, of course, how report-toolkit can help you work with diagnostic reports when they become tedious, or how it can help you uncover problems that you may not be aware of. If you want more information about diagnostic reports, of course, it is in the lovely Node.js documentation.
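Programmatic creation could be sketched as follows, using the `process.report` API. The `captureReport` wrapper is a hypothetical name; the guard is there because on Node 12 `process.report` is only present when the experimental flag is passed.

```javascript
// Returns the current process's diagnostic report as an object,
// or null when the report API is not available.
function captureReport(err) {
  if (!process.report) {
    return null;
  }
  const report = process.report.getReport(err);
  // Defensive: accept a JSON string as well as an object.
  return typeof report === 'string' ? JSON.parse(report) : report;
}

const current = captureReport(new Error('example'));
console.log(current ? current.header.nodejsVersion : 'process.report is not available');
```

In a short-lived environment the returned object could be shipped to a log or storage service instead of being written to local disk.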

A

There's a tutorial written by Gireesh, who spoke about diagnostic reports yesterday, and he was also the one who got this code into core. But there's a tutorial there, which links to those two developer.com. Also, and I apologize, this is not very legible, but the documentation site for report-toolkit is ibm.github.io/report-toolkit, and I'll leave that up for a second. It is an IBM project. I'm the only person working on it, but it's still an IBM project. And so again, I am Christopher Hiller.

A

You can call me Chris. I work for IBM. I like Node, Mocha and stuff. Look at my website and things. So thank you, Montreal, and Node+JS Interactive.