From YouTube: Node.js Benchmarking WG Meeting 07-09-2019
A: We talked about multiple approaches. We thought, because we have so many of those micro-benchmarks, that we could take a subset and run it on a weekly basis, and run all of them maybe once a month, that kind of thing. So we need some kind of proposal or approach for how to cover the micro-benchmarks in this regard.
C: That is, could we do a random sample of the benchmarks, a time-limited random sample? If we limit it to, say, two hours or something like that, we can run certain benchmarks at certain times and just choose them randomly, so we can run that more often.
C: That way the data points would be a little sparser, but at least on average all the benchmarks would be covered, except maybe buffer, because buffer takes a long time; no matter how rarely you run it, it always takes a long time. That was the only thought I had on this topic, and I'm wondering what you all think about it.
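For illustration, a minimal sketch of the time-limited random sampling being discussed: pick a random subset of benchmark files whose estimated run times fit inside a fixed budget, such as the two-hour figure mentioned above. The file names and per-file durations below are made-up placeholders, not the real benchmark suite.

```js
'use strict';
// Hypothetical sketch of the time-limited sampling idea: shuffle the list of
// benchmark files and take as many as fit in a fixed budget. File names and
// per-file durations are made-up placeholders.

const benchmarks = [
  { file: 'buffers/buffer-creation.js', minutes: 45 },
  { file: 'http/simple.js', minutes: 20 },
  { file: 'fs/readfile.js', minutes: 10 },
  { file: 'streams/pipe.js', minutes: 15 },
  { file: 'crypto/hash-stream.js', minutes: 25 },
];

function sampleWithinBudget(all, budgetMinutes) {
  // Fisher-Yates shuffle on a copy, then greedily fill the budget.
  const pool = all.slice();
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  const picked = [];
  let used = 0;
  for (const b of pool) {
    if (used + b.minutes <= budgetMinutes) {
      picked.push(b);
      used += b.minutes;
    }
  }
  return { picked, used };
}

const { picked, used } = sampleWithinBudget(benchmarks, 120); // two-hour budget
console.log(`Tonight's sample (${used} min):`, picked.map((b) => b.file));
```

Over many nightly runs the shuffle covers every file on average, which is the "sparser but covered" trade-off described above.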
C: I mean, yeah, I guess we could take the lazy evaluation approach: cache the charts and generate them on the fly. For things that are only ever needed once, it takes that person a long time to see their chart, but the data would be there, right? If we did this random sampling, we would just be collecting data but not generating any charts, right?
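A small sketch of the lazy evaluation idea: raw data points are always collected, while a chart is only rendered the first time someone asks for it and then cached. The `renderChart` stub, the cache keying, and the sample data points are assumptions for illustration only.

```js
'use strict';
// Sketch of lazy, cached chart generation: data is always stored; a chart is
// only built when a user first requests it. renderChart is a stand-in for
// whatever plotting step a real dashboard would use.

const chartCache = new Map();

function renderChart(benchmarkId, dataPoints) {
  // Placeholder for the expensive rendering step (e.g. producing an SVG).
  return `<svg><!-- ${benchmarkId}: ${dataPoints.length} points --></svg>`;
}

function getChart(benchmarkId, loadDataPoints) {
  if (!chartCache.has(benchmarkId)) {
    const points = loadDataPoints(benchmarkId); // slow only on first access
    chartCache.set(benchmarkId, renderChart(benchmarkId, points));
  }
  return chartCache.get(benchmarkId);
}

// Usage: the first caller pays the rendering cost, later callers do not.
const chart = getChart('buffers/buffer-creation.js', () => [
  { date: '2019-07-01', opsPerSec: 120000 },
  { date: '2019-07-08', opsPerSec: 118500 },
]);
console.log(chart);
```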
C: Yeah, yeah. I guess the question is what people are interested in, right? They just made a change and they want to know how it affects things, or maybe they want to see, before a release, how a change has affected performance. If we run the random sample every night, then chances are that before a release is made, at least two or three samples will have been collected for the benchmark that is relevant to them, and so they can have a look. They go in, the chart is generated for their particular case, and they can check what happened: there are a few data points from, say, three months ago, and so on.
C: See, that's... classifying the benchmarks like that is a tough problem. I mean, we're having the same problem with the buffer benchmarks: there's one PR that was submitted to try to cut back on the number of buffer benchmarks, and we still couldn't decide, even at the TSC level. It's like, okay, but we are going to be missing some stuff, because it doesn't strictly eliminate duplicates.
C
You
know
it
does
eliminate
some
benchmarks,
and
so
you
know
we
would
be
losing
perspective
and
so
arguing
as
to
you
know
whether
this
particular
aspect
to
benchmarking,
a
particular
aspect
is
important
or
not
important
or
how
important
it
is.
You
know
from
zero
to
five.
It's
it's
tough
too.
It's
tough
to
classify!
C: It's going to be that conversation repeated over and over for all the different benchmarks, so I suspect that coming up with the classification might be complicated and a slow-moving process. We would need to consult a lot of people, and a lot of people would need to agree that, okay, this benchmark is good to have but it's not critical. Making that decision for every single benchmark might take a long time. So that's why I was thinking:
C: If we do a random sample, then it doesn't matter how important the benchmark is; eventually we will get to it. Because what is the limiting factor? The limiting factor is CPU time, since we don't want to hog the whole system just for our benchmarks. That's one end of the spectrum; the other end of the spectrum is that you run these benchmarks only as needed.
C: But then, of course, you don't have perspective, right? You need to know how the performance of any given micro-benchmark changes over time. So I think the in-between solution is: give us as much CPU time as you can, and we'll try to distribute it as randomly as possible across the benchmarks and get some numbers. Then there's a good chance, not a perfect chance, but a good chance, that you will get the perspective you need once you look at the numbers. When you need them, the numbers may not be there, in which case you're like, dang, okay, can we do a run for this particular commit ID now, because I really need to have this before-and-after picture. But normally we would run a set of benchmarks that fits within the CPU time that we have allocated.
C: Yes, yes, yeah, I think that's a good idea. What granularity are we looking at here? We're going to treat each file as a separate benchmark, right? We're not talking about individual parameters, like saying this parameter is more important, or that testing with 512-byte chunks is more important than testing with 1024-byte chunks. We're not going to that level of granularity, right?
B: Rather than a random portion, just do the first tenth, then the second tenth, or twentieth, or whatever, and over time it will all fill in, right? Because really, is it worthwhile having two runs for one benchmark and zero for another? It may be better just to have one for everything and run it that way.
A: Well, I think with the top 12 or 15 categories we have, we can run through them in a sequence and pick one at a time. That way maybe we can cover all the inputs already in the files, so we don't have to pick and choose a number; we can just run that one. Okay, I'm going to run buffer tomorrow, maybe DataView after that, whatever sequence we have, yeah.
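One way to read the rotation idea in the last two turns (run a fixed slice, such as the first tenth or one category, each day so everything fills in over a couple of weeks) is a deterministic day-indexed rotation. The category list below is illustrative rather than the actual benchmark directory layout.

```js
'use strict';
// Sketch of a deterministic rotation: instead of a random subset, cycle
// through the benchmark categories in a fixed order, one per day, so the
// whole set fills in over a couple of weeks. Category names are illustrative.

const categories = [
  'assert', 'buffers', 'crypto', 'dgram', 'es', 'events', 'fs',
  'http', 'misc', 'module', 'net', 'path', 'process', 'streams',
];

function categoryForDate(date) {
  // Days since the Unix epoch, modulo the number of categories, gives a
  // stable one-per-day rotation with no bookkeeping.
  const day = Math.floor(date.getTime() / 86400000);
  return categories[day % categories.length];
}

console.log('Today:', categoryForDate(new Date()));
console.log('Tomorrow:', categoryForDate(new Date(Date.now() + 86400000)));
```

Unlike random sampling, this guarantees that no category is run twice before every category has been run once.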
A: So, I believe... I know Gabriel somehow shows as unmuted, but I think we should put up a proposal like that: okay, we take the top categories, we run one a day, and that way we cover all of them over maybe a two-week period, if not within three or four days; that is still good enough. And if someone wants to go into detail, at least we have a mechanism and infrastructure to run these benchmarks in our benchmarking setup. Right now we don't have anything, yeah.
B: So yeah, I mean, today, with the existing ones, we actually post individual results to MySQL. But in this case I'm not sure that's the best way, because we'd have to extract all the data points individually and push them out. It might be better to somehow store the files, because there's already, I think, a comparison you can use: the run that says show me the deltas between two runs, yeah.
B: If we could somehow reuse that... say, take the most recent data points, maybe from the last three or four or five or six days, whatever, and pull them into one thing you could compare against. That would basically let you download a file, say we'll compare against this, and then you could run a subset of the buffer benchmarks and compare against what we've got in the database, but...
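The comparison being referred to is the existing tooling for diffing two benchmark runs; the sketch below is not that tool, just an illustration of the stored-files idea: load two result files and print per-benchmark deltas. The CSV layout (name,rate) and the file names are assumptions.

```js
'use strict';
// Illustrative only: diff two stored benchmark result files and print the
// percentage change per configuration. The CSV layout (name,rate) is an
// assumption; a real setup would reuse whatever format the existing
// comparison tooling expects.

const fs = require('fs');

function loadResults(file) {
  const rows = new Map();
  for (const line of fs.readFileSync(file, 'utf8').trim().split('\n')) {
    const [name, rate] = line.split(',');
    rows.set(name, Number(rate));
  }
  return rows;
}

function printDeltas(oldFile, newFile) {
  const oldRows = loadResults(oldFile);
  const newRows = loadResults(newFile);
  for (const [name, newRate] of newRows) {
    if (!oldRows.has(name)) continue; // only benchmarks present in both runs
    const oldRate = oldRows.get(name);
    const pct = ((newRate - oldRate) / oldRate) * 100;
    console.log(`${name}: ${pct.toFixed(2)}%`);
  }
}

// Usage (hypothetical file names): node diff-results.js old.csv new.csv
if (process.argv.length >= 4) {
  printDeltas(process.argv[2], process.argv[3]);
}
```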
C
I'm
wrong:
the
the
comparison
algorithm
actually
establishes
not
only
uses
uses
like
a
mean
and
standard
deviation
to
compare
and
to
also
establish
relevancy
right.
So
so
so
so
then,
if
we
do
that
and
that,
then
that's
even
more
expensive,
because
Marya
stablishing
a
single
data
point.
It
then
runs
the
benchmark
multiple
times
right
to
establish
a
standard,
deviation
and
a
mean
you
need.
You
need
multiple
run.
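To make the cost concrete: establishing a mean and a standard deviation for one data point requires several runs of the same benchmark on each build. The sketch below shows that with made-up ops/sec numbers and a simple Welch-style t statistic; the real comparison tooling performs its own statistical analysis.

```js
'use strict';
// Why a single run is not enough: comparing two builds needs a mean and a
// standard deviation, which require several runs each. The numbers below
// are made up for illustration.

function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function variance(xs) {
  const m = mean(xs);
  return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
}

// ops/sec from repeated runs of the same benchmark on two builds
const oldRuns = [118200, 119900, 117500, 120400, 118800];
const newRuns = [121300, 122100, 120700, 123000, 121500];

const diff = mean(newRuns) - mean(oldRuns);
// Standard error of the difference between the two means (Welch form).
const se = Math.sqrt(variance(oldRuns) / oldRuns.length +
                     variance(newRuns) / newRuns.length);
console.log(`mean delta: ${diff.toFixed(1)} ops/sec, t ~ ${(diff / se).toFixed(2)}`);
```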
C: I mean, the truth is, being able to run that tool on every benchmark would be ideal, even more so because, since it takes six days to run everything, if you run one benchmark once and get a single value for it, that value could be the outlier for all you know.
C: For the relevant group, I guess I'll have to look at who touched the various different benchmarks and just mention them and see. You know, collaborators come and go, so they may no longer be active in that area, that kind of thing. So it will be a little bit of a forensic task, but yeah, I guess I can try to find some cycles for this, yeah.
D: Sure, this is specifically what we talked about. I have followed this data and I'm trying to share the screen here... I don't know how to do that... share the screen... share screen, yeah. So here are some dates when I tracked the data, and here are master, the 10.x version, and so on. What I observed is that at a certain stage all the data was stagnant for more than a week; things did not change. On the 17th of June I stopped the tracking, because I didn't know whether we had to continue or not.

D: So it was a long time that the numbers didn't change. Today I looked again and I saw some strange numbers: master and 10.x have the same performance, which I would say is a little bit strange. The only difference that I saw was here in this column. The columns here are operations per second, latency, footprint before load, and footprint after load.
A: Yeah, I mean it makes sense, for sure, that master and 10.x match. The only question is why we don't have numbers for the 12.x, 8.x, and canary builds, and also why, in some rows, we have ops per second but not latency, while for the 8.x version we have latency but not the ops per second.
A: Normally what I used to do is, whenever I see something like this, I would try to reproduce it on my local machine. For example, if you clone the whole benchmarking repo, you will find all the scripts you need to run Node-DC-EIS against all these various versions of Node, and you can see whether you can reproduce any of them. Okay.