Node.js Tracing Working Group, 9 Mar 2017

Previous Meeting

Next Meeting

⏯

youtube image

►

From YouTube: Instrumentation and Tracing in Node.js - Thomas Watson, Opbeat

Description

Instrumentation and Tracing in Node.js - Thomas Watson, Opbeat

This talk is a head first dive into the next-gen core tracing API’s being developed under the Node.js Tracing Working Group. We’ll learn about the upcoming AsyncWrap API and how this can be used to build your own high level instrumentation logic in production.

A

My name is Thomas Watson I, hang out and github a lot and do a lot of open source and mainly node I've written a lot of node modules, I just tweeted a couple minutes ago, a link to these slides. So if you see some code in there that you want to look in details, Luke look into in details, you don't have to like scribble and take all the notes.

A

Now you can just look on Twitter later so I suggest, of course, that you follow me there and I'll get up so with that out of the way I work for this company called upbeat and contrary to popular belief. We are not a coffee company, though we love coffee, as one might might expect. So, instead of coffee. What we do is we do instrumentation of your note app in production, so you can see exactly why some requests are slow and why some are fast and fix it.

A

So, but that's not what I'm gonna talk about now we're gonna talk about now is instrumenting node and tracing node in production. So how do we get good insights into a node application, while still it being a node application because know what applications are freaking are to instrument? So what I mean by this? So normally you have. Let's this example: we have an HTTP application.

A

It receives an HTTP request and at some point in sense of response back to the client and in the meantime it might, you know just query readies to see if it has what it needs in the cache. It doesn't have that so it's gonna just kind of look in the database instead and then it might write something to the database and it would be really cool if we could like figure out how long these things took.

A

We call them traces, wanna trace how long this is each of these IO request take, and then we want to be able to like group it all under a transaction so that we can later refer to the request that actual cause, this these traces to occur in our production code so and another reason why you wanna what you might want to trace your application besides figuring out what slow is.

A

If this happens, you wanna there's an error and if there's an uncaught exception, what note we'll do is we'll just exit, and you want to be- you get stack, trace, that's cool, but you get a stack trace back to the latest Asian cooperation. So you can't. Actually you can't see like what request was active at this time.

A

You can't see why you call this database query, but there might be multiple entries into the database, so it'll be nice to try all these things nice and neatly together and that's another reason why we want to do tracing a note. So I'm part of the note yes Diagnostics working group and one of the things that we work on is tracing and to figure out how to solve this problem in node. We have to dive into this thing, called the event loop and learn why this is a problem.

A

The window back should look like this so notice. As you know, single threaded, so only one piece of code can run at the same time, but notice clever about what goes on in your application. So if you've talking to the database, the file system, doing any external HTTP call, you'll they'll wait and leave blank space on the process for other stuff to appear.

A

So here we have like one request coming in and it'll get into the router first and figure out which controller to call and anybody might do it database a call in there's nothing going on and then a little bit later. The database calls returns and you know you know the drill, and this is all synchronous code.

A

So only one piece of these blocks can run at the same time, but then another request comes in and it kind of interweaves it selves inside the process, and so so what you have is multiple requests going on at the same time. So that's why you you you, you might be tempted to do something like this like have a global with the current request and refer to that. But obviously that will not work because here we said it when the first request comes in so that requests one. Then another request comes in.

A

We set it again to the request to and then okay this is cool because it's still the same request and oh no. Now it's actually running the code from the previous request, but it's still pointing to the to the new request. So you'll get like weird weird things happening, so that's not good. What we want to do is we want to have like this happening like in some magic way.

A

We want to have current request to point to the correct request, based on the code that's currently executing in this threat, so there is a way to fix this. Does anybody know what this is? There's a monkey patch.

A

And what do I mean by monkey-patching? So this is. This looks really really dirty. Let's just imagine for a second that everything in node all async stuff is is happening via set timeout. We all know set timeout this. Of course, a lot of other places where async stuff can happen, but just look at it from from set timeouts perspective. So first we just when the application starts. We just save a reference to the actual set timeout function, and then we basically just override it with our own function and then whatever.

A

If somebody calls a timeout with a callback that they want to be called it like two in two seconds, whatever we just make a note out, which watch the active request right now and just store this in a closure, so we can get access to it later and then we call the original set timer, because obviously we want to keep the functionality and instead of keeping it the the callback that the user provides, we give it our own callback and the reason why we want to give it our all callback is because we want to.

A

Of course, you notified when, when the two seconds or whatever have passed, and then when that callback is called, we make sure to call the actual callback so that we don't want to break the users code. Obviously, but before that we call the actual callback, we want to restore the state as it were, when this callback was actually huge. So when the user called set timeout in the in the in the first place, so current request might have changed.

A

In the meantime, we want to just hold that reference to whatever it was in the meantime, then we want to restore it to whatever it was before. Then we want to actually call the callback and every one a restart back to the to the one when the callback is called. So that's another word, for this is magic. Yeah I wanted to get that that, given there somewhere, this felt like the appropriate time. So basically what we have to do. We have to patch the world.

A

We have to patch everything for this to work and it seems really really nasty and it kind of is. But if you run any monitoring solution in your production code, be it New, Relic or my company up beat. This is actually what's going on it's a little bit more clever than what you saw. The very simplified example, a lot of more guards make sure that we don't break anything, but basically that's what's going on. So what you want to do we want to you want to patch every async operation in core of node.

A

So that's like timers, the next tick call and process. We've got promises now, the native ones, and we have deep UV, which basically is where all the stuff that has to do with IO system calls all that stuff. It's it's in leap, you view which is a c-plus plus module that ships with note, and then we also want to patch certain third-party module.

A

Excuse me: you want to patch certain third-party modules that opera that have their own callback queues and stuff like if you, if you're using Bluebird, for example, every time you give a call back to Bluebird, it stores it in its own little queue and it doesn't touch the core and it been when the promise results. We want to make sure that whatever it was before is restored, so we have the current context all over the place. So this is a pain in the butt.

A

Basically, obviously so, luckily in the Diagnostics working group, Trevor Norris have been working on this thing, called async wrap and there's a link there to the diagnostic working group repo. We can see all the stuff we're working on so I had I was I was making these slides and then this thing happened.

A

So he decided to disturb our Norris I decided to rename a sling wraps racing cooks, which is the way better name, but he didn't consult me first so have to fix this quickly good.

A

So now it's called Aysen cooks and it's it's been. She actually shipping with note for quite a while.

A

You just didn't know this, and normally, when you want to require some part of core you you say require and then like HTTP, your FS or whatever, but that part is not shipping yet so the only thing that shipping is the actual see stuff beneath the scenes, so you need to main it make it binding to some core stuff inside node in the C++ library and the way you do this is you call process that and then acing rap, and this will, of course be a sin cooks at some point, but right now, if and the reason why I show you this code here is because right now, if you are a note, 4 above this code will actually run, this will work this actually ships and it's been shipping for a while.

A

The API have changed a little bit, but it's not in it's not in master. Yet it's a pull request. Currently, so, even if you build for math, you will still. This will actually still be the code and needs to you need to run so so we requite, we get a binding tracing wrap. Let me set up the hooks, and these are the hooks that will be called every time. Something every time note does anything to do with asynchronous stuff.

A

Timers HTTP calls all that stuff I'll image to enable the hooks, and then we need to define the hooks, of course, so there's four hooks in it before after and destroy and in it is called the moment that you actually try to create the create the callback. So that would be like when you call sip timeout unit is called, and it gives you a lot of variables.

A

I'll go briefly over in a second then, just before the callback is called in a synchronous context. So when the callback to set timeout is called after, let's say two seconds, the pre or before hook will be called, and then when it ends like in a synchronous way, then the actual callback will be called and then just after the callback has finished running the post hook after hook is called and then at some point later the destroy hook is called.

A

It depends a little bit of what type of thing are we looking at if it's a timer, it's one thing: if it's a HTTP request is a it's, it's called sometimes when the garbage collection is being run, but it's a nice place if you want to do some cleanup to do it in there. So let's try to use this. That's let's say: let's do the let's require FS and let's try to read a file and then I'm, locking some stuff here so I'm, lucky alike.

A

Just before I I opened the file I get a file descriptor just before that I write a user before and then just after I've set it up. I'll write you up and then, when I actually get the final scripture or right done so so how does this Alice's actually look here? We have the whole code, so first you'll see user.

A

Before that's like this line here like just before SFO boom, and then here we are queuing a callback, so you will see that unit in it hook being called and then right after the in hook is called now we're down. Here we haven't really done anything but the poly head or try to read it we're running synchronously down through the JavaScript and then at some point. The file system responds back, say up. I have I, have a file descriptor here the pre-hook is called just before the function.

A

The callback here is called, then the callback is called BC done, then the posts and then at some point, you'll see the destroyed. So how come we don't use console.log, I use this magic, lock thing so turns out beneath under the covers, console lock is actually doing asynchronous stuff. It might doing it a blocking way on most platforms so that you actually get things out as you expected, but it's actually using process Dutch standard out that right right standard up.

A

So if you try to use console.log to keep up what are you trying to freak out houses? This magic thing work? You right console luck to debug something.

A

Then it was trying to do something, asynchronously sort of call itself and you will get like and infinite infinite loop and you will get this error. So of course we don't want to do that. We want to how do we debug this? When will you try to develop this? You can either to you. We have synchronous right to the magic file. Descriptor called one which is standard out, there's another called which is standard error, but then we need to do our own formatting.

A

You would be nicer if we could have the rich richness of console locks. So there's this magic thing called underscore wrote epoch that it's basically console locked, but it's synchronous. So we use that and then everything works again. So so this init work. What does that give you, and what can you do use with you support, so you get a UID, so ever call back that is cute. Cat gets its own ID and you get that ID in all the four hooks.

A

So you can, you can keep track of which call is calling me before and after the show I hook by by using this UID SS to test your key, you get a provider provider is so right. Now it's a number in the final one. That's gonna ship, a node which will be back ported to node 4. By the way it would be a string right now, it's a number and you can see what the numbers mean by I in the in the Revel.

A

You can just get the binding and you can save dot providers and you get like all these different providers, so you can see if it's like there's one call: where is it timer, wrapped their number 17 timer app, so time arab would be. If somebody tried tries to run, set, timeout or set interval or said mediate, it will be a timer app. So if you want to handle stuff differently depending on what type of async thing is going on, you can use this, and then you have the parent UID.

A

So in certain cases one call back is cute based on another ace incorporation, so you might have a parent it. Otherwise it will be null, and then you also have this handles. Both this will be. The current handle and parent handle will be Panhandle.

A

Obviously, so what a handles handles are just a way for for node core, which is written in JavaScript, to talk to the view we leave you libuv and v8, which are both Whitten in c and c++, so it needs some way to transport data between the JavaScript world and the c-d-c world, and it uses handles for this. So if you take, for example, a look-in net connects, this is inside net connects you can see inside behind-the-scenes, we need to something with net connect.

A

It actually uses the process that binding TCP wrap to get CCCP the TTP code from libuv. It gets then two handles and then it instantiate the handles and it puts properties on these handles.

A

So these are the handle objects, and that is basically, you had your this property inside inside each of the four hooks. So you can act, you can. You can get access to like the port or the address of the of the TCP connection. If you want to use that for some some debugging or something and then at some point later it will, it will call the destroy in the socket and that's when actually the destroy book is called timeless is a little bit funny.

A

So, basically, if you have two timers with the same number of seconds, they will share the same callback, so you will only get one called init one call to pre and one call to post and one call to destroy, which is a little bit weird. If you don't know this in advance. So if you want to be notified of each of them individually, you have to still do a little bit of monkey patching.

A

So how do can you actually use this? So I wrote a little bit of code that actually show how you can get access to the current request at any point in your application, using this new API bear in mind that some things are not covered by the API yet so you will still have to monkey patch those things manually like process process. Next, stick: when you know when this ships for real and note process next state will be covered, but right now it isn't. So there are certain things that you need to patch manually.

A

Also, the the timeless thing I shall talk about. So basically you get the the binding we want to handle timer separately. So you just keep a reference to what's the idea of the timer, the timer provider, we set up the different hooks, we enable them and that's it and how it hooks. Oh yeah, so we have a. We have an interstate and a previous state.

A

That's basically just a nice little map where we keep track of what was the state of the current requests are at the point of in it and the point of the the before pre-hook being called. Then we have the unit provider and we just checked. Oh, is sorry the init hook, and then we check if they provide us a timer and we just abort, because we do our own monkey patching of that. So we don't want to do anything with that.

A

Otherwise we just set using the UID or whatever current request this in this map, and then we just restore it in the before hook and in the after hook. We on restore it basically and then we clean up after ourselves and that's basically some gotchas, so the handle creation time is not always obvious. So, for example, you do HTTP to create server. You would think that that's the time where you actually keep the callback to create server and that's the time where the eniac is called.

A

No, it's actually, when you call the dot, listen function on the server, so sometimes it's a little bit tricky to know when the creation time with the handle is- and that's also when the unit hoop is- is called console.log. As I said, it's a little bit tricky right now. Next tick is not covered by this, but it will be that's why the brackets timers have the problem. I talked about with sharing sharing the the hooks. Promises are currently not covered by this, because VA doesn't give us an API to monitor these things and promises.

A

It's something that we work together with Google to to fix, so hopefully it will be fixed soon and if you have in your codebase, multiple people trying to use async hooks at the same time only the latest s up now. This might hopefully also change. Only the latest one to register hooks well will will win, but that will hopefully also change.

A

Then there's this bastard. So if people like use Redis, for example, a radius keeps it own. Its own queue keeps a pool of TCP sockets open to the red Redis database and will like figure out when a response come back like which, which callback actually needs to be called based on the response from the Redis database, and that, of course, will I. It would be totally we will. You will not know like which request was was was responsible for queuing this callback, so you have to patch stuff like Redis yourself again.

A

This is something that is actually being worked on, so there's a new providers API being made so that people who make modules like this, for example, read- is I, can't tell node when these things happen. So node knows to fire these hooks, so that will also ship, but until all the module owners use it, we still have a problem. There then there's es modules.

A

All of those all right something so yes, modules is, is what is the word? It's you I'm blanking on that word yeah. You can't you can't overwrite the stuff that the import statement returns. It's immutable, that's what which means you can monkey patches easily or you can't monkey patch certain things at all in node.

A

So that's gonna be a problem, but I don't know we'll we'll see. Finally,.

A

This is the the link to the public they'd. The the final draft of the API for async hooks is currently a pool requests. If you want to know more about how it's going to be when it ships this is the place to look.

A

The tracing workgroup or its today is called. The Diagnostics working group was pre released, called the tracing work group. That's where you can get a lot of information about all the stuff that we're gone, there's also more documentation facing crap or async hooks, there's even an issue. We can keep track of like the progress of how close you are to shipping this and with that I'm done. Hopefully, one minute overtime, danke ye, so that I say.