Description
Gareth Ellis is part of the Node.js Benchmarking Working Group: https://github.com/nodejs/benchmarking.
In this video, he provides an introduction to benchmarking: how to get started (depending on what you are looking to test), key challenges, approaches to benchmarking, benchmarking Node.js itself, and use cases.
Gareth's GitHub page is here: https://github.com/gareth-ellis.
Thank you to Opbeat for sponsoring the videos for Node.js Live Paris, and to IBM for sponsoring the Node.js Live Paris event.
So, an introduction to benchmarking. One of the most important things when you're benchmarking or performance testing your application, or your runtime, or whatever it is, is that you should change one thing, and one thing only. This might seem quite obvious, but the thing you should change, obviously, is whatever it is you're wanting to work out the performance of. So if you've gone and checked in a new copy of your application code and you're wanting to know whether it performs as well as, or better than, it was doing previously, you'll be keeping everything else
the same: you'll be running on the same version of Node, on the same machine, using the same set of npm modules, and the same versions of those modules as well, and just changing your application code. It can sometimes be quite tempting to go and change everything at once. Oh, you've just released version 2 of the application, and you think, actually, we'll also go and update the version
of Node that we're running on, we'll update to Node version 6, and we're going to put it on this brand new machine we've got, and we may as well go and pull in all the latest versions of all the modules. That makes it quite difficult to decide whether there is actually a performance regression or not: you could just be masking a performance regression with an increase in, for example, the speed of the machine that you're running on.
Something else that's worth mentioning is that performance testing and benchmarking is quite different to functional testing. In functional testing, typically you'll do a run and it will either pass or it will fail, and if it fails you can maybe rerun it and it might work the next time. With performance testing, it's not only whether the run completes or not; you're also going to be interested in whereabouts
the data sits in your range of acceptable scores. This brings us on to some of the key challenges when doing performance runs and benchmarking. The first key challenge is that there is fundamental run-to-run variance. If you go and run the same benchmark, on the same version, on the same machine, one run after the other, there's a very good chance that you'll get two different scores, and depending on your environment and a lot of other things, they could be quite far apart.
So the first thing you're going to need to do when you're benchmarking is to run your benchmark a good number of times, so you can get a good feel for the data, for the range of scores that you see. This is also something that can hold you back later on, if you go and run your benchmark and find out that there's a ten percent difference between the best score and the worst score: that spread is what you'll have to judge any regression against later.
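As a rough illustration of that point, here is a small sketch, not from the talk, that summarises a set of repeated scores so you can see the spread before you start comparing builds; the scores are invented examples.

```js
// Summarise repeated benchmark scores: mean, standard deviation, and the
// spread between best and worst as a percentage of the mean.
'use strict';

function summarize(scores) {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance = scores.reduce((a, b) => a + Math.pow(b - mean, 2), 0) / n;
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  return { mean, stddev: Math.sqrt(variance), spreadPct: ((max - min) / mean) * 100 };
}

// Ten ops/sec scores from repeated runs of the same build (made-up numbers).
const scores = [10.1, 10.4, 9.8, 10.2, 10.0, 10.3, 9.9, 10.2, 10.1, 10.0];
const s = summarize(scores);
console.log('mean ' + s.mean.toFixed(2) +
            ', stddev ' + s.stddev.toFixed(2) +
            ', spread ' + s.spreadPct.toFixed(1) + '% of mean');
```

If the spread is already several percent of the mean, a regression smaller than that will be hard to call from a handful of runs.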
So, the machine that you're doing your testing on: you're going to want to try to make sure that it's in the same state each time you run your tests. In our team at IBM, one of the things that we do to try and get around this is we reboot our machine before we run each set of tests. This means that each time we ran our tests, the machine should hopefully be in the same state, so there won't be any leftover processes or anything else like that
affecting our scores. It also means it's very easy for us to get back into that same state: if we want to rerun a set of tests, we can just reboot the machine and run them again, and hopefully the machine should be in the same state. Something else that helps with getting a consistent environment is making sure your machine is isolated from outside interference. That could be making sure other people aren't going to be logging into your machine and running something at the same time, which could skew your scores.
Something else you may want to do would be to interleave your good build and whatever it is that you're wanting to test. By that I mean you could run one copy of your good build, followed by a copy of the one you're testing, and so on, and that should hopefully mean that any interference happening on the machine is alleviated. It's much better than comparing against the scores you got from your good build when you ran it two months ago: things on a machine will change, even though you think they may not.
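Here is a minimal sketch of that interleaving, assuming two Node builds at made-up paths and a benchmark.js that prints a single ops/sec number; none of these names come from the talk.

```js
// Interleave runs of a known-good build and the build under test, so that
// slow environmental drift on the machine affects both sets of scores.
'use strict';
const { execFileSync } = require('child_process');

const builds = {
  good: '/opt/node-good/bin/node', // known-good binary (assumed path)
  test: '/opt/node-test/bin/node', // build under test (assumed path)
};
const results = { good: [], test: [] };

for (let i = 0; i < 10; i++) {
  // good, test, good, test, ... rather than all of one build first.
  for (const name of ['good', 'test']) {
    const out = execFileSync(builds[name], ['benchmark.js'], { encoding: 'utf8' });
    results[name].push(parseFloat(out)); // benchmark.js prints ops/sec (assumed)
  }
}

for (const name of ['good', 'test']) {
  const mean = results[name].reduce((a, b) => a + b, 0) / results[name].length;
  console.log(name + ': mean ' + mean.toFixed(2) + ' ops/sec over ' +
              results[name].length + ' runs');
}
```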
The final key challenge is jumping to conclusions. It's very easy to do a single run of your good build and a single run of the build you're testing and say: they're the same, brilliant, job done, we don't need to bother doing all those reruns, because everything looks fine. You should still go back and make sure you collect a good number of measures, so you have confidence in the data. So, some different approaches that you may take towards benchmarking.
The first approach could be to do some micro-benchmarks. This could be useful if you're implementing a new API or a new function, or you're making some changes to an API or function, and you want to see whether your changes have changed how it performs. These are quite useful for comparing key characteristics.
Some things to be aware of, though: even if you go and improve, for example, the Buffer API so that it now runs 300 times faster, that may or may not actually translate into any real-world improvement, because creating a buffer might be a very, very small percentage of the time spent in a real-world application.
Another approach to benchmarking would be whole-system benchmarking. This could be where you pull various metrics out of your full application. An example of that might be the Acme Air benchmark, which is one of the benchmarks that we run in the community Benchmarking Working Group. This is a fictional airline: a user can create themselves an account, log in, book themselves on flights, check in, all sorts of stuff like that, and we
use JMeter to drive load against it, to measure how many requests a second Node.js can serve, along with some other metrics. This is good because it represents a more realistic, real-world approach. The disadvantage is that the more things you are doing within your benchmark, the more room you introduce for variance. So there's a good chance that you might get really consistent micro-benchmark results, but when you go and exercise lots of those same operations in a whole real-world system, you may find that the variance between your different runs increases quite a bit.
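The working group drives Acme Air with JMeter; purely as a toy illustration of the requests-per-second idea, here is a sequential Node sketch against an assumed local endpoint. A real whole-system run would use a proper load driver with concurrency.

```js
// Toy throughput check: fire sequential HTTP requests for a fixed period
// and report requests per second. Not a substitute for JMeter.
'use strict';
const http = require('http');

const url = 'http://localhost:3000/'; // assumed endpoint
const durationMs = 5000;
const start = Date.now();
let completed = 0;

(function fire() {
  http.get(url, (res) => {
    res.resume(); // drain the body so 'end' fires
    res.on('end', () => {
      completed++;
      if (Date.now() - start < durationMs) {
        fire();
      } else {
        console.log((completed / (durationMs / 1000)).toFixed(1) + ' req/sec');
      }
    });
  }).on('error', (err) => { throw err; });
})();
```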
So you've gone and collected a load of data; what are you going to do now? You think you've found a regression. The first thing to do is to make sure that you actually have found a regression: make sure that you've got a good range of data for your good build and for whatever it is that you're testing, and have a look at the variance compared to the percentage regression
you think you've found. As I mentioned before, if you've got a ten percent variance, a ten percent range of scores, for your benchmark, and you think you're seeing a two percent regression, that's going to make things quite difficult later on when we try to narrow it down, because we're going to need to make sure that this regression we think we've found is easy to reproduce, to give us the best chance of finding whatever it is that's caused it.
So, if we're sure that the regression exists, we then need to have a look at what it is that we actually changed. If it's our application code, one way in would be to have a look at what's changed between our good build of the application code and the one that we're testing. It could be that you've gone and upgraded your copy of Node.js to a later version, in which case we need to have a look at what's changed within Node.js.
It could be that you've just gone and moved your whole stack onto a new server, one that should be running a bit faster, but maybe it's not quite running faster, in which case we need to start looking at what V8 or other platform-specific things are doing. Perhaps it's doing something that we're not expecting.
We then need to be able to compare between the good and the bad cases. So, the tools that we can use: well, in fact, there are hundreds and hundreds of different tools we could use, and I'm going to talk about a few of them. Another option is we could just binary-chop our change sets. So if we're looking at Node.js, and we can see that on one version of Node we were good, and then on another we were bad, we can have a look at what it is
that's changed between those two different versions, to try and help us work out what might have caused the regression. Within Node.js, as I'm sure you're all aware, there are lots of different things that come into the Node.js project to provide us with our Node.js binary. There's the native JavaScript in the lib folder, so buffer, cluster, lots and lots of things in there. There's V8: if you upgraded your version of Node, it may well have pulled in a newer version of V8, and that could potentially have caused a regression.
There may have been a recent security fix that could also affect performance. It could be libuv; it could be a new version of npm that you've pulled in, or an npm module that you've pulled in. It could be lots of different things. So, some of the tools that we could use: we could look at some JavaScript profilers, and there we've got the V8 profiler, and we've also got something such as appmetrics, but there are lots of others available as well. We may also want to use a native system profiler.
What we do is we go and create quite a large array with 60 numbers in it. We've then got our for loop at the bottom there, which is going to go and create lots of new buffers from this array. We then go and run this through a test harness, which will keep repeating until we either get a good consistency of data, or until we've reached our maximum number of runs. When we ran this on Node 4.3.2, we were seeing around ten-point-something operations a second.
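The slide with the code isn't captured in the transcript, so this is a reconstruction of the shape of that micro-benchmark: a 60-element array, a loop creating buffers from it, and repeated timed runs. The harness details are assumptions.

```js
// Micro-benchmark sketch: repeatedly create buffers from a 60-number array
// and report operations per second for each timed run.
'use strict';

const values = [];
for (let i = 0; i < 60; i++) values.push(i % 256);

function timedRun(iterations) {
  const start = process.hrtime();
  for (let i = 0; i < iterations; i++) {
    Buffer.from(values); // on the Node 4.x of the talk this was `new Buffer(values)`
  }
  const [sec, nsec] = process.hrtime(start);
  return iterations / (sec + nsec / 1e9); // ops/sec
}

// The talk's harness repeats until the scores are consistent or a maximum
// number of runs is reached; here we simply do a fixed ten runs.
const scores = [];
for (let r = 0; r < 10; r++) scores.push(timedRun(1e5));
console.log(scores.map((s) => s.toFixed(0)).join(' '));
```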
So these are some of the tools that we could use to try and narrow down what it is that's caused this regression. The first one is the V8 profiler. This comes as part of V8, and you can expose it through Node by adding --prof to your command line, for example: node --prof app.js. When you run that, it will produce a file just as we've got there, named isolate, then a hex number, then v8.log, so you can run your benchmark or your application normally, and then go and post-process
this later. On more recent versions of Node 4 and Node 5, you can use --prof-process, and that will go and post-process this log and give you something in a format that's a little bit more readable. There are also some helper modules, for example v8-profiler, which will let you programmatically enable and disable the V8 profiler.
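As a sketch of that programmatic use, assuming the v8-profiler npm module's documented start, stop, and export calls (the talk doesn't show this code):

```js
// Start and stop the V8 CPU profiler around the code under investigation,
// then write a .cpuprofile file that Chrome DevTools can load.
'use strict';
const fs = require('fs');
const profiler = require('v8-profiler'); // npm module mentioned in the talk

function doWorkUnderTest() {
  // Placeholder for whatever you actually want profiled.
  for (let i = 0; i < 1e6; i++) Buffer.from([1, 2, 3]);
}

profiler.startProfiling('buffer-test');
doWorkUnderTest();
const profile = profiler.stopProfiling('buffer-test');

profile.export((err, result) => {
  if (err) throw err;
  fs.writeFileSync('buffer-test.cpuprofile', result);
  profile.delete(); // free the profile held by V8
});
```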
So when I went and ran this on our good build, Node 4.3.2, and on the one that we were testing, Node 4.4.0, we saw quite an increase in the compilation of the from function. Originally we'd seen just over twenty-three percent of the time spent in that, and then when we ran 4.4.0 this went up to forty-seven percent, so it's a fair increase. You can go and get a similar set of data out of perf, which is a whole-system
profiler, and the output of this would also show you time spent in native system modules and things like that. But again, the big difference between the good and the bad profile was the time spent in this lazy compile, and that suggests it could be something that V8 is doing differently. So we can then go and turn on some extra trace options in V8.
Again, we can pass these straight into our node command line: you can use --trace-opt and --trace-deopt, which are going to look at optimizations and deoptimizations. In our good case, we saw that V8 noticed that the from function was quite hot, so it went and compiled it and optimized it. Then in our bad case we saw that, just after doing all that work of compiling and optimizing, it went and dropped the optimized code and deoptimized it. So that's a bit funny.
Another way that we could try and find the difference between these, as I said before, is to just binary-chop the change sets, which is all right in this case, because we were looking at two adjacent versions of Node. But if you were upgrading from, say, Node 0.10 to Node 6, you'd have a few more change sets to look through, and it might take a bit more time.
You can go and get all sorts of CPU information, GC, memory and profiling information out of V8. You can either ask for the information you want programmatically, like this, turning on CPU information and printing it out to the console, or you can connect IBM Health Center to it over the network and connect directly, and that also allows you to enable and disable the method profiling.
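The code on that slide isn't in the transcript either; here is a minimal sketch of the programmatic style using the appmetrics module mentioned earlier. Event and field names follow appmetrics' documented API, but treat the details as an assumption.

```js
// Subscribe to CPU and GC events from appmetrics and print them.
'use strict';
const appmetrics = require('appmetrics');
const monitoring = appmetrics.monitor();

monitoring.on('cpu', (cpu) => {
  // process/system are fractions of total CPU in use.
  console.log('cpu: process=' + cpu.process + ' system=' + cpu.system);
});

monitoring.on('gc', (gc) => {
  console.log('gc: type=' + gc.type + ' duration=' + gc.duration + 'ms' +
              ' used=' + gc.used);
});
```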
The fix was just a change to the scope of a variable, and when we made that change, we saw the performance go back up to where it was in the previous version of Node. So I'm just going to talk now, very briefly, about the Node.js Benchmarking Working Group.
The workgroup has been going for quite a while. We've got a mandate to track and evangelize performance gains between Node releases, and we've got key goals of defining use cases for Node, identifying benchmarks that represent these use cases, and then running them, capturing results and reporting them to the community.
We've currently got about 13 members and we have meetings every month or so; we had a meeting on Tuesday, and our next one's likely to be in early May. You can have a look at what's going on on the GitHub page, nodejs/benchmarking, and you can also see the various graphs for the benchmarks we're running there, at benchmarking.nodejs.org.
So these are a few of the use cases that we've discussed in the community and then approved at the meeting on Tuesday. These aren't necessarily set in stone,
so if you're looking at this list and saying, actually, my use case isn't there, then come and let us know, and we can see where we can go from there. Some of the use cases we've got are things such as back-end API services: REST and REST-like APIs, typically running over the internet or over public networks, where we're going to want to make sure that Node can perform, and doesn't regress, in those sorts of situations. Then service-oriented architectures, which may typically be private
networks, and may be cases where different types of networking protocols are used, possibly things such as UDP, so we're going to want to make sure that Node is as successful as possible at transmitting this information. Generating and serving dynamic web page content: we've got modules such as Express, hapi, koa, React and so on. All of these modules are very popular within the Node ecosystem, and we want to make sure that things we change within Node don't go and regress
the use of modules such as these. Single-page applications, communicating back to the backend over WebSockets and HTTP/2. Agents and data collectors that may be distributed through networks, where we're going to want to be able to update those automatically rather than having to go and redeploy Node. Node is also used quite a lot in small scripts, being able to script stuff to run quickly, so in those sorts of cases we'd want it to be using very low amounts of CPU, low amounts of memory,
quick startup, things such as that. So across all of these different use cases, we're going to be looking at this list of metrics: we want consistent low latency in our communication and the ability to support high concurrency; we're going to be wanting to look at throughput; we want a fast startup, a fast shutdown, and therefore also a fast restart; and also low resource use, so memory and CPU.
If you want to have a look at these use cases in a bit more detail, the information is, again, on the Node.js benchmarking workgroup repo. So, the benchmarks that we've been running so far: at the moment, we've been looking at some quite basic startup tests, where we have a look at how quickly Node can start up when it's not going to be doing very much, and we've been looking at footprint, or how much memory, the resident set size, Node uses.
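A rough sketch of that kind of startup test, assuming nothing beyond Node's own child_process module: spawn node with an empty script and time how long the process takes to come and go.

```js
// Time `node -e ''` repeatedly to get a feel for bare startup cost.
'use strict';
const { spawnSync } = require('child_process');

const times = [];
for (let i = 0; i < 20; i++) {
  const start = process.hrtime();
  spawnSync(process.execPath, ['-e', '']); // start node, run nothing, exit
  const [sec, nsec] = process.hrtime(start);
  times.push(sec * 1000 + nsec / 1e6); // milliseconds
}
times.sort((a, b) => a - b);
console.log('startup: min ' + times[0].toFixed(1) + 'ms, median ' +
            times[times.length >> 1].toFixed(1) + 'ms');
```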
We've been running this since February, once a day: we go and run the latest checkout from GitHub of 0.12, 4 and master, and then once we've cut Node 6, we'll be adding that to the chart as well. Obviously, at the moment it's being tracked as master, and it's quite a good sign: we can see there at the top we've got the master branch performing the best, followed by Node 4, followed by Node 0.12. So it's the right sort of pattern.
So how can you get involved? Go and have a look at our GitHub repo, nodejs/benchmarking. Have a look at what it is we're running, have a look at our use cases, and have a think about how you're using Node. If you're saying, well, actually, how I use Node, or the sort of things that I run, aren't there, then open an issue and let us know. We want to try and get as many benchmarks running as possible that cover all the uses of Node.js.