Description
Node.js Community Benchmarking Efforts - Gareth Ellis, IBM
Benchmarks and the information they provide are important to ensure that changes going into Node.js don’t regress key attributes like startup speed, memory footprint and throughput. Come and hear about some of the fundamentals of benchmarking, how to go about narrowing down the cause of a regression between versions of node along with the efforts underway in the community benchmarking workgroup (https://github.com/nodejs/benchmarking) to run/capture/report and act on benchmark information.
Hello, my name is Gareth Ellis, and I'm going to be talking to you about the Node community benchmarking efforts. First, a little bit of information about me: I've been working as a runtime performance analyst at IBM since 2012. Originally I was looking at the performance of the version of Java that we produce for our stack products, but for the past 18 months I have been looking at the performance of Node.js. I'm also a member of the benchmarking workgroup.
Okay, can everyone still hear me? So, an introduction to benchmarking. The first thing to mention is that benchmarking, or performance testing, is quite different to, say, functional or system testing. In functional testing there are typically only two answers: either it worked or it didn't.
In performance testing, the number of results you could get is pretty much infinite: if you're measuring startup time, it could be anything from zero milliseconds up to hundreds or thousands of milliseconds. So one of the important things to try to do is to change one thing, and one thing only, between runs, and the thing that you change is typically going to be whatever it is you want to test.
So if you want to measure the performance of the latest version of your application code, you'd try to keep everything else involved the same: you'd run both on the same version of Node.js, ideally on the same machines, using the same versions of whatever npm modules the project requires. That means that once you've done your tests and you've got a result, you can be fairly confident that the improvement or regression has come from your application code, whatever it is you've changed.
Sometimes it can be quite tempting, when you issue a new version of your application code, to also pull in the latest versions of your modules, upgrade the version of Node that you're running on, and maybe also put it on the latest hardware that you've got available. But if you do that, it's going to be very difficult to work out where an improvement or a regression has come from.
So, some of the key challenges that you may face in performance testing. The first one is that there is going to be variability in the scores that you get: if you run the same test, say, ten times, there's a very good chance that you'll end up with ten different answers. One of the things that you need to get used to is that you can't simply do a single run and expect to be able to say how good or bad a change was.
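As a rough sketch of coping with that variability, you can summarize repeated runs with a mean and standard deviation instead of trusting any single score. The run times below are made-up numbers for illustration only:

```javascript
// Summarize repeated benchmark runs rather than trusting one score.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample standard deviation: a feel for how spread out the runs are.
function stddev(xs) {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1));
}

// Ten illustrative startup times in milliseconds (made-up numbers).
const runs = [102, 98, 105, 99, 101, 97, 103, 100, 104, 96];
console.log(`mean=${mean(runs).toFixed(1)}ms stddev=${stddev(runs).toFixed(1)}ms`);
```

A change smaller than the measured spread is not evidence of an improvement or a regression.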
Another thing to look out for is false positives. Say you've just checked in a version of your code and you think it's probably going to give you about a ten percent improvement. If you do one run and it comes out about ten percent better, you could be tempted to just stop there and deploy your code, even though a single run proves very little.
Another key challenge is being able to keep a consistent environment. One of the things that we do to make sure that we're always running with the machine in the same state is to reboot the machine before we start a round of benchmark testing. Whilst this can sometimes add a little bit of variability to the scores, it means that if we need to recreate the environment, it's very easy for us to do so, just by rebooting the machine. Compare that with a machine that's been running for months and months with thousands of hours of testing done on it: it's going to be very difficult to get it back into that same state if, for example, a security update comes along that means you need to reboot the machine, or you have a power cut, something along those lines.
Another part of keeping a consistent environment is being able to isolate the machine. This includes making sure that people aren't going to be logging in and using the machine whilst you're doing your testing, because otherwise you could actually be measuring the fact that they've been using some of the CPU, memory or other resources. Also, if your test is going to be using a network, having a private or dedicated network available for your test is a good approach.
Otherwise, in that sort of situation, a private, dedicated network where you've effectively got infinite bandwidth may not be the most representative way to go. Something else that you can do to try to reduce variability, and to make your runs a little bit more consistent, is interleaving the measurements that you take. Say you've decided that, because of the range of scores you get, you're going to need to run your test ten times.
What we may do is run our baseline, our known-good version, once, then run the version of the code that we want to test, then the good version again, then the one under test, and so on, switching back and forth between them. This means that if there is any sort of natural variation in the performance of the machine that you're running on, it should hopefully be accounted for by the fact that you keep alternating one with the other.
The third key challenge to get over is jumping to conclusions. It's very easy to see the first run of your baseline and then of your build, see an apparent improvement, and decide: actually, no, I don't need to bother doing loads of reruns, I'm happy with that. You need to make sure that you've continued with the number of iterations that you've decided is necessary. Sometimes the data can be very misleading, and it's always important to try to get a good set of consistent, reliable data.
So, there are two different approaches that you could take to performance testing. The first would be running micro benchmarks. These are quite useful for measuring a specific function or API; for example, you could be measuring how long creating a new buffer takes. They're quite good for comparing key characteristics of either your application or the runtime. There are some disadvantages as well, however, one of them being that micro benchmarks may not always represent a real-world improvement.
So if you take a micro benchmark and check in some changes to try to improve it, that may not actually translate into a real-world improvement in the final product. The second disadvantage of micro benchmarks is that you sometimes risk not actually measuring what you think you're measuring. The way quite a lot of micro benchmarks work is by repeating the same action many times, and if the optimizing JIT in your runtime is able to spot that the repeated work has no observable effect, it can optimize the work away, leaving you measuring nothing at all.
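A sketch of that pitfall: in the first loop below the result is thrown away, so an optimizing JIT is free to eliminate the work entirely; accumulating into a visible sink, as the second loop does, keeps the measurement honest. The loop body is an arbitrary example, not from the talk:

```javascript
// Work whose result is discarded: eligible for dead-code elimination,
// so a timer around this loop may end up timing nothing at all.
function deadLoop() {
  for (let i = 0; i < 1e6; i += 1) {
    Math.sqrt(i); // result unused
  }
}

// Accumulating into a visible "sink" makes the work observable,
// so the JIT cannot simply remove it.
let sink = 0;
function liveLoop() {
  for (let i = 0; i < 1e6; i += 1) {
    sink += Math.sqrt(i);
  }
}

deadLoop();
liveLoop();
console.log(sink); // consume the sink so it cannot be optimized away either
```

Whether a given engine actually eliminates the dead loop is version-dependent, which is exactly why the risk is hard to spot.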
The second approach would be a whole-system benchmark. This is where you try to test something that represents a solution that might be deployed in production. One of the benchmarks that we use in the benchmarking workgroup is Acme Air, which simulates a fictional airline company; we measure the throughput, in requests per second, of users logging in, booking flights, checking in, logging out, things like that.
One of the disadvantages of this type of benchmark, however, is that typically the more things your system is doing, the more room there is to introduce variance into the scores that you get out. So whilst you may get more reliable data from a micro benchmark, you could argue that you get more useful, real-world data from a whole-system benchmark.
So, you've followed all of this and you think you've found a regression. What should you do now? The first thing is to check that the environment and the data that you've collected are correct. Otherwise you could find yourself investing many hours trying to track down the problem when actually it was just variable data that you'd collected. Make sure you've had a look at the data and that the difference is outside the expected variance that you've measured.
If we're running on Node.js, there are a number of different possible sources that a regression may come from. It could be from some of the native or JavaScript libraries. It could be that the new version of Node.js you're testing has picked up a V8 upgrade, so the issue could potentially be a result of that V8 upgrade.
It could be that there's been a security fix in OpenSSL, and that's where the regression has come in. It could be a libuv update. It could even be that you've downloaded the latest versions of your modules and they've caused the regression; or, if you've updated your version of V8, your npm modules may have had to be recompiled, and some of the issues may have come in that way.
So, a quick example. This is a micro benchmark that we've been running at IBM, and it simply measures how long it takes to create a new instance of a buffer from an array of numbers. We have a function which repeats this 300,000 times, and we push that through a test harness, which runs it until either we've reached the maximum number of attempts or we've got some reliable data.
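A minimal sketch of that sort of micro benchmark, without the harness, shown with today's Buffer.from API: it times many Buffer creations from an array of numbers and reports a rate. The iteration count and source array here are assumptions for illustration, not the workgroup's actual test:

```javascript
// Time repeated Buffer creation from an array of numbers.
const ITERATIONS = 300000;
const source = [1, 2, 3, 4, 5, 6, 7, 8];

function run() {
  const start = process.hrtime.bigint();
  for (let i = 0; i < ITERATIONS; i += 1) {
    Buffer.from(source); // create a new Buffer instance from the array
  }
  const elapsedSec = Number(process.hrtime.bigint() - start) / 1e9;
  return ITERATIONS / elapsedSec; // individual creations per second
}

console.log(`${run().toFixed(0)} Buffer.from() calls/sec`);
```

A real harness would repeat this whole function and apply the variability rules from earlier: many iterations, interleaved with the baseline, summarized rather than taken from one run.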
When we were running this a few months ago, between Node 4.3.2 and Node 4.4, we noticed quite a sizable regression between the two versions. In Node 4.3.2 we were getting about ten operations a second, 10.6 operations a second; when we went up to Node 4.4, we were only getting just over six operations a second. So it's a fairly sizable regression.
So, as I mentioned before, appmetrics could be one option. It's something that you can install through npm, and you can programmatically get information out about your CPU usage, GC, memory, V8 profiling and loads of other things. You can either write some code to output that to a file, or you can connect it over the network to IBM Health Center to get live monitoring.
Then there's the V8 profiler. This comes as part of the node binary in more recent versions, and you can turn it on just by adding --prof to your command line. This will generate a file called isolate-<hex number>-v8.log, which you can then post-process by running node --prof-process on the isolate file that's been created. That will give you a breakdown of where the time is spent in the various JavaScript methods and the other things that V8 is doing.
There are also some helper modules that you can use to automate this if you prefer. When I ran this on our two versions of node, I collected my post-processed data and then diffed the top few lines. In Node 4.3.2 we can see 23% spent in the lazy compile of fromObject in buffer, and in Node 4.4 that had gone up to forty-seven percent. So that's perhaps somewhere to look.
For completeness, I then also ran it through perf, which is a system profiler that you can get on Linux-based systems. Again, there's a wide range of different options you can pass into perf; the ones I've got on the screen here I found worked quite well. You can then run perf report to do the post-processing, and when we compare the two versions we can again see an increase in the time spent, from 23% up to forty-six percent.
If we weren't able to spot anything by doing that, the next thing we could have a go at would be a binary chop. We could either do it manually, getting a known good and a known bad result and the list of change sets between the good and the bad version, and then going through and manually rebuilding them; or we can use git bisect and put together a small script that runs our benchmark and then decides whether it passed or failed.
A
If
you
wish
in
that
pull
request
5819
so
now,
I'm
going
to
briefly
talk
about
the
work
that
we've
been
doing
in
the
community
benchmark
work
group,
some
of
the
things
that
we're
up
to
and
how
you
could
also
get
involved.
The workgroup has a mandate to track and evangelize performance gains between different node releases. Some of our goals are to define use cases, identify benchmarks that we can run, and then run them and capture the results.
We've got 12 members at the moment, and we have meetings every month or so, with the next meeting likely to happen probably next week. You can get some more information by looking on GitHub at nodejs/benchmarking, and you can also look at the graphs and charts from the results of our runs at benchmarking.nodejs.org. This is a list of the people that we've currently got involved.
Michael Dawson from IBM is the facilitator, along with some people from various other communities and also freelancers. Some of the benchmarks that we're currently running: we've got some that measure startup time, which is looking at how long it takes for node to actually start and get going, and we've got some that measure the footprint, or how much memory node actually uses.
There's also one that measures the amount of time it takes to require a module; with most projects using a large number of npm modules, both direct requires and their child requires, that's a very important metric. Then there's Acme Air, as I mentioned before, where we track the throughput, or the number of operations a second; we track the response time; and we also take footprint measurements at different stages of the run, to see how node performs not only when it's idle but also when it's doing a lot of work.
We also have a Dockerfile available for comparing two different versions of node. If you want to have a look at how two different versions of node compare, you can run them through the Dockerfile and it will produce some output at the end telling you whether the new version is better or worse. We're also in the process of putting together the facility to test a particular pull request against some of the benchmarks that come as part of the node source; that work is taking place in 8157.
On benchmarking.nodejs.org we have a number of charts that look a bit like this. This is the one where we're tracking throughput on Acme Air. You can hopefully see there's a bit of a jump around April time, which was when we took a new version of V8, and you'll also notice that more recently, mid-August, we've got a blue line at the bottom.
The second one would be service-oriented architectures: this is typically where APIs are provided that may call into a large number of other APIs to produce one result. Then there are micro-service-based applications: these are typically very nimble, low-resource, quick-to-start applications, where there may also be a number of different micro services running on the same system, so a small footprint there would be better. And there's generating and serving dynamic web page content.
Some of the key attributes: these are the different metrics that we're looking to track, and we've obviously still got some gaps here that we need to fill in. We've got two memory footprint measures, one for how much node uses once it's started, before it's done any work, and one taken after load. There's also node's CPU usage at idle.
Ideally, node shouldn't really be using any CPU at all when idle. There's throughput, which we're tracking through Acme Air in operations per second, and then how large the node package is when you download it, and, once you've installed it, how much space it's going to use on disk. We also want to start collecting some metrics on GC: not only the CPU usage impact on the node process, but also how quickly it's able to allocate memory, tracking the max pause times when under load.
And finally, if you're wanting to get involved, you can take a look at the GitHub repo, nodejs/benchmarking. Have a look at what's there, have a look around at the charts, and if you think something's missing, or if you think something's wrong, then by all means open an issue. We've got an issue open at the moment organizing the next meeting, so if you wish, you can either join the meeting and contribute, or even just listen via the YouTube On Air recording. And that is it. Thank you very much.