From YouTube: OMR Architecture Meeting 20200716
Description
Agenda:
* Add Dynamic Breadth First Scan Ordering to GC (#5377) [ @jonoommen ]
A: Right, so my name is Jonathan Oommen. For those that don't know me, I've been an IBM Runtimes GC developer for the last three years. Today I'm here to talk about my most recent PR for the Eclipse OMR and Eclipse OpenJ9 projects, called Scavenger dynamic breadth-first scan ordering.

Just a quick outline of what I'll talk about today: what is a scavenger scan ordering, which many of you will be familiar with, so I'll move through these topics somewhat quickly; what is dynamic breadth-first scan ordering, which for the sake of time I'll call DBFSO; what is the goal of DBFSO; the design and implementation; some results; and then a look at the code.
So, a scan ordering dictates the order in which objects in an object graph are scanned and copied during the GC. Currently within scavenger there exist two possible scan orderings: we have hierarchical scan ordering, which is the default scan ordering for gencon, and then we have breadth-first scan ordering.
So what is dynamic breadth-first scan ordering? It's an optimization for breadth-first scan ordering, packaged as a new scan ordering. Scavenger will still scan objects breadth-first, but DBFSO enables the recursive depth copying of hot fields, as marked by the JIT, immediately after the object that scavenger is currently copying that contains them. I have a small visual for this, so it'll be more clear shortly. It is currently developed for gencon, although the first main expected use for it will be in the balanced GC, where breadth-first scan ordering is still currently the default scan ordering.
So what is the goal of DBFSO? It is to improve the locality issues of breadth-first scan ordering. Just a brief overview of some basic locality principles. First, the 90/10 rule: a program spends 90 percent of its time in 10 percent of its code.
Spatial locality says that items whose addresses are near one another tend to be referenced close together in time, and then we have temporal locality, which states that recently accessed items are likely to be accessed again in the near future. Moving on, and building on locality with regards to hot fields and hot access patterns: a hot field is a field that is frequently accessed by instances of an object. Take a String object, say, where certain fields are accessed whenever the string is used.
So, according to the 90/10 rule, if 90% of the time is spent in 10% of the code, there are likely some very hot object access patterns that it would be great if we could exploit. And looking at locality, it would be great if we could have frequently accessed objects beside each other in memory, which will likely reduce cache misses.
So take a look at this very basic object graph. Let's say you have object A, which has two fields, B and C: 10% of the time B is accessed, and 90% of the time C is accessed. If you look at object C, 10% of the time field F is accessed and 90% of the time field E is accessed. And then, lastly, for B, 10% of the time D is accessed and 90% of the time E is accessed. So how can we optimize this object graph for locality? Ideally, in memory:
Looking at the root set: when A is copied, at the end of the copy we'll ask if A has any hot fields, and it'll say yes, I have C. So then, dynamically, what we'll do is recursively depth copy the hot fields of A. So after A is copied, C will be copied, and then, as you would expect during this recursive depth copying, as C is being copied, at the end of C's copy the same question is asked of C.

Okay, and then we'll recurse our way back up, and then, looking at the root set now that A is done, F would be copied, and we'll move on. A while after A is scanned, we'll have B, and then, while B is being copied, at the end (and you can probably see where I'm going) B will likewise ask: do I have any hot fields?
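To make the copy order concrete, here is a minimal C++ sketch of the recursive depth copy just described. All helper names (copyBreadthFirst, getHotFieldOffset, readObjectField and so on) are hypothetical; this is an illustration of the technique, not the actual OMR scavenger code.

    /* Minimal illustrative sketch (C++), not the actual OMR code: after an
     * object is copied, its JIT-marked hot field is copied immediately and
     * recursively, so parent and hot child end up adjacent in memory.
     * All names here are hypothetical. */
    omrobjectptr_t
    dynamicBreadthFirstCopy(omrobjectptr_t object, uintptr_t depth)
    {
        omrobjectptr_t copy = copyBreadthFirst(object); /* normal BFS copy */
        if (depth < MAX_HOT_FIELD_COPY_DEPTH) {
            uint8_t hotFieldOffset = getHotFieldOffset(getObjectClass(object));
            if (NO_HOT_FIELD != hotFieldOffset) {
                omrobjectptr_t hotChild = readObjectField(copy, hotFieldOffset);
                if ((NULL != hotChild) && !isAlreadyCopied(hotChild)) {
                    /* the hot child lands directly after its parent */
                    dynamicBreadthFirstCopy(hotChild, depth + 1);
                }
            }
        }
        return copy;
    }

For the graph above, this yields: A is copied, then its hot field C, then C's hot field, before the breadth-first scan resumes with the remaining fields.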
The result is frequently accessed objects beside each other in memory, which will likely result in fewer cache misses. So, at a high level, looking at the design and implementation, there is leveraging of existing information that's already collected. Applications consist of method compilations, and the JIT compiler is a tiered compiler, meaning that each method has a base compilation level and then the JIT can decide to optimize a method further based on various heuristics that relate to how frequently the method is being run. So, looking at the Testarossa compilation levels, we have cold, warm, hot, very hot and scorching.
So this hotness value is computed for every access in every compilation, for each field of a class, and once again this is also done as a compilation is promoted or optimized. So, for each field of a class, we can aggregate these hotness values for the field's accesses across all method compilations and get an approximate value for the hotness of each field of the class.
We can look at some of the code shortly. There are more performance benchmarks being run, but one initial result right now is with respect to SPECjbb2005 and SPECjbb2015.
A
So,
with
regards
to
getting
into
the
code,
I
wasn't
I
wasn't
entirely
sure
on
the
best
best
way
to
do
it.
But
I
have
I've
links
right
here
to
the
to
the
pull
request
for
the
eclipse
Omar,
and
it
comes
up
when
j9
projects
as
well
as
some
details,
were
the
more
important
aspects
of
the
can
be
found
just
and
then
I
also
have
a
touch
or
just
slide
representing
summarizing.
A
Some
of
the
key
data
structures
that
set
that
are
used
within
this
feature,
so
that
for
that,
for
the
presentation
that
would
be
all
and
but
Darryl,
is
there
any
way
or
any?
Do
you
know
how
you
would
like
to
move
forward
with
looking
at
some
of
the
code
or
perhaps
talking
about
some
of
their
design.
B: Well, I guess we can just pause here and see if there are any questions about what you've talked about, and if you want, you can certainly dive in and take people through a more structured walkthrough of the code. I know myself I have a question about something you talked about on slide 11, where you talked about aggregating...
A: That's right. There are a few different reduction algorithms I've been playing around with, but yes, an average would be one of them. I've also done a summation, and then an average of the summation, and I did also implement a minimum and a maximum, but for the sake of the initial commit I left those in just a private branch. But yes, that's correct: these are averaged among blocks.

B: Is the...
A: The aggregation is with regards to the block frequency solely. But if you look at the information that's stored for the field, after we've aggregated this block frequency, what we do is similar to what was implemented before in the old hot field implementation: we multiply this value by a factor. As of now it's more or less copied directly from what used to happen before for hot fields. So, for a compilation, this aggregated block frequency value is multiplied by one; for a hot method the aggregated value is multiplied by ten; and for a scorching method it is multiplied by a hundred. It has been looked into, potentially for the next phase of this, to try to get CPU utilization for each method, which would allow a far greater increase in accuracy, as methods at the same compilation level, warm for instance, can differ somewhat significantly, especially in applications such as DayTrader, where otherwise there would be minimal benefit.
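As a rough sketch of that weighting, assuming a hypothetical enum and function name (only the one/ten/hundred multipliers come from the talk itself):

    /* Sketch of the hotness weighting described above: the aggregated
     * block frequency of a field access is scaled by the compilation
     * level of the containing method. Names are hypothetical; how the
     * very-hot level is scaled was not stated in the talk. */
    enum CompilationLevel { LEVEL_COLD, LEVEL_WARM, LEVEL_HOT, LEVEL_VERY_HOT, LEVEL_SCORCHING };

    uint64_t
    weightedFieldHotness(uint64_t aggregatedBlockFrequency, enum CompilationLevel level)
    {
        switch (level) {
        case LEVEL_SCORCHING:
            return aggregatedBlockFrequency * 100;
        case LEVEL_HOT:
            return aggregatedBlockFrequency * 10;
        default:
            return aggregatedBlockFrequency * 1; /* base compilations */
        }
    }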
B: So I seem to recall that you created a pull request several months ago for this, and there was some discussion around how the hotness information was communicated from the JIT to the GC. I don't remember what the resolution of that was. How is the hotness information exchanged with the GC?
A: For all the VM decisions, Tobi was my go-to for expertise; for all JIT-related questions, Andrew Craik was my go-to; and then for GC, when it came to architecture, Aleksandar Micic was the one I contacted. So this was the best design that we collectively came up with. Looking at a J9ClassLoader (this came from speaking with Tobi): each class loader will have a hot field pool, which will hold all hot fields related to all classes for that class loader, and then we'll have a global pool of what we've called J9ClassHotFieldsInfo. So we'll have a global pool of these objects, and each class has an initially null hot fields info; then, as its first hot fields are discovered, we'll initialize this pool element.
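A rough sketch of those structures follows; the field and type names are illustrative stand-ins, and the real definitions live in the OpenJ9 sources linked from the pull requests.

    /* Sketch of the data structures just described; illustrative only. */
    struct HotField {
        struct HotField *next;    /* linked into the class's hot field list */
        uint8_t offset;           /* slot offset of the hot field           */
        uint64_t hotness;         /* aggregated, weighted block frequency   */
    };

    struct ClassHotFieldsInfo {   /* one per class, from a global pool */
        struct HotField *hotFieldListHead; /* head of the list; entries
                                              live in the loader's pool */
        uint8_t firstHotFieldOffset;  /* cached for scavenger; U8_MAX if unset */
        uint8_t secondHotFieldOffset; /* cached for scavenger; U8_MAX if unset */
    };

    /* conceptually, each J9ClassLoader carries a pool of HotField
     * entries for all classes it owns: */
    struct ClassLoaderHotFieldData {
        void *hotFieldPool;       /* pool of HotField for this loader */
    };

    /* each class starts with a NULL ClassHotFieldsInfo pointer and is
     * assigned one from the global pool when its first hot field is
     * discovered */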
What we'll do is go through the J9 hot fields info pool; we only store the head of the list, so we'll iterate through all these objects. As you can see, the hot field list head just points to the head of the list within the hot field pool of the class loader. So what we'll do is iterate this list and find...
Yes, that's correct. When scavenger is copying, it asks for the first hot field offset and the second hot field offset, and then, based on that, it will recursively depth copy the hot field. There's a special U_8 value, U8_MAX, that lets scavenger know that there currently doesn't exist a hot field there. And then, working back a little bit to your initial question regarding the issues of the first implementation...
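The scavenger-side lookup might look roughly like this sketch, with U8_MAX used as the "no hot field" sentinel as just described (the struct and helper names are hypothetical):

    /* Sketch: while copying an object, scavenger consults the class's
     * cached hot field offsets; U8_MAX marks "no hot field here".
     * copyHotFieldAtOffset() would perform the recursive depth copy. */
    void
    depthCopyHotFields(omrobjectptr_t copy, struct ClassHotFieldsInfo *info, uintptr_t depth)
    {
        if (NULL != info) {
            if (U8_MAX != info->firstHotFieldOffset) {
                copyHotFieldAtOffset(copy, info->firstHotFieldOffset, depth);
            }
            if (U8_MAX != info->secondHotFieldOffset) {
                copyHotFieldAtOffset(copy, info->secondHotFieldOffset, depth);
            }
        }
    }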
C: As my understanding goes, it looks like this hot fields info is like a static for every class instance, meaning if there is a new instance of this class, all the hot fields are inherited from whatever values they were previously. So what about the difference between, say, a class that has two or three fields and another class that has eight, nine, ten fields? How does this behave in those scenarios?
A: So, as of now, there's a hot field max: that list's length is capped at ten. But there's also a certain threshold during the hot field marking passes, a certain threshold of block frequency that has to be met in order for that field's information to be stored, and that's to avoid the possibility of having excessive amounts of hot fields. So, for each new object...
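A small sketch of those two guards, the length cap of ten and the minimum block-frequency threshold (the threshold constant here is a placeholder, and the names are hypothetical):

    /* Sketch of the recording guards described above: a field is only
     * recorded if its block frequency clears a threshold, and a class
     * keeps at most ten hot field entries. Values are placeholders. */
    static const size_t MAX_HOT_FIELDS_PER_CLASS = 10;
    static const uint64_t MIN_BLOCK_FREQUENCY = 1000; /* placeholder threshold */

    bool
    shouldRecordHotField(size_t currentListLength, uint64_t blockFrequency)
    {
        if (blockFrequency < MIN_BLOCK_FREQUENCY) {
            return false; /* too cold to be worth storing */
        }
        return currentListLength < MAX_HOT_FIELDS_PER_CLASS;
    }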
B: Okay. So you said that you were maybe prepared to take people through the code; I think that would help for those that are going to be doing the review. Unless people really want that level of detail, I don't know if we need to go that deep. But if somebody does want John to do that, by all means, please speak up.
D: Just one last question, or one question: how consistent are you finding the results from this from run to run? Because there's a certain aspect of this where compiles that happen earlier sort of influence compiles that happen later, and there's some non-determinism in the order in which compiles happen anyway.
A: Quite consistent, very consistent values. Looking back at the analysis that I did (I'm just trying to remember), if you look at, for example, these results, each run would have been roughly within one to two percent of all the values. So if you look at the percent improvement, sometimes it varied between, let's say, six and eight, or seven and nine.

And if you look at the improvements from the perspective of max pause times, they tended to be within a thin margin. Along with this, compile time, footprint, throughput and all that was analyzed, and all of those have shown very consistent values.
A: It is perhaps a month dated, but at several points I did perform an analysis on all the fields and all their hotness values from run to run, and on the overall two hottest fields. I think from run to run there was some variance among the ordering of the hotness of the fields, but when it came down to the two hottest fields, those were consistent every time with the benchmarks I looked at.
A: Yes, I did find some. I found some that had, let's say, three or four hot fields, but then some with fewer. So I think there's room, I really believe there's room, for plenty more optimizations to this moving forward. But yes, there were some that had at least three fields that were quite hot.
B: I guess the trade-off would be the added footprint cost of storing that frequency information for a couple of extra fields. It just might be useful, or interesting, from a "see what kind of performance you can get" perspective, to increase the size there and see how it affects some of these workloads that you're running, if you capture more than just two fields.
A: Yeah, that for sure would be a possibility. I'm not sure how wide the benefits would be, though. If you look at, let's say, this object graph, and now let's say that both B and C were hot fields.
So in that scenario we're now copying two fields. And if we're talking depth, I have the depth set to three, so it'll copy to a max depth of three. But if, within that, you're copying to a depth of three for hot fields, and you go along this path, I believe that even if you extend that to a third hot field, you'd be outside of the cache line anyway, so I'm not sure what benefits would be inherited from extending further. That was my thought when I saw some results that had three, or a third, hot field. Does that make sense? Yeah.
A: There's an adaptive sorting that takes place. I have it now (I experimented with some different values) so that for the first 200 scavenges, on each scavenge, a quick sorting of all the hot fields takes place, and that allows for adaptability of the hot fields. Then, moving forward, for the rest of the duration, every so often (I can't quite remember the value, but every X amount of GCs) we increment that adaptive sorting interval. So, for example, by GC 1000, you're probably sorting every six or seven GCs. That was done from looking at various benchmark runs: as the life of the program went longer and longer, there wasn't much need to continue to sort every GC.
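A sketch of that back-off schedule: sort on every one of the first 200 scavenges, then stretch the interval as the run ages. The growth rate here is a placeholder, since the exact constant wasn't recalled in the talk.

    /* Sketch of the adaptive sort schedule described above. Assume
     * *interval starts at 1 and *nextSortAt at 201; both constants and
     * the growth rate are placeholders. */
    bool
    shouldSortHotFields(uint64_t scavengeCount, uint64_t *nextSortAt, uint64_t *interval)
    {
        if (scavengeCount <= 200) {
            return true; /* adapt quickly early in the run */
        }
        if (scavengeCount >= *nextSortAt) {
            *interval += 1; /* sort progressively less often */
            *nextSortAt = scavengeCount + *interval;
            return true;
        }
        return false;
    }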