From YouTube: OMR Architecture Meeting 20210722
Description
Agenda:
* GC Parallelism and Adaptive Threading (#5829) [ @RSalman ]
A
Welcome everyone to the July 22nd OMR Architecture Meeting. Today we have one topic: it's a GC topic presented by Salman Rana, so I'll just turn it over to him. Take it away.
B
All right, hello everyone and good afternoon.
B
My name is Salman; as mentioned, I work at IBM on the runtimes team, specifically on the GC team, focusing mainly on optimization and performance. Today I'll be talking to you about parallelism and adaptive garbage collection threading, or adaptive GC threading for short. The goal of this talk is to give an under-the-hood look at the GC and to speak about performance and optimizations, specifically by taking a look at GC parallelism and adaptive threading. One thing I should point out before we start is about the work around this.
B
It has also been referred to as dynamic threading and thread throttling in the OMR GitHub repos, on the various issues and the different PRs that have been opened. So it's been referred to by different names, but I think adaptive GC threading is a good umbrella term which encompasses the different goals of this work.
B
Another thing I should point out is that for most of this discussion I'll be approaching it from an OpenJ9 VM implementation perspective. OpenJ9 is the most comprehensive consumer of OMR, so I'll be talking from that perspective, even though this is all common code and technically it could be used by any consumer of OMR. All right, so, an overview of the talk. I assume everyone knows what the GC is and what it does, but I'll
B
throw a slide or two up just to go over it. Then I'll talk specifically about OMR GC technology, and then we'll get into parallelism: what parallelism means to the GC, the implications of it, how the GC takes advantage of it, and some throughput and pause time discussion around that. Then we'll talk about the motivations for adaptive threading, resulting from the issues with parallelism that we observed in certain workloads, and we'll talk about
B
why we saw them and the assumptions and the investigation around that. Then we'll get into adaptive threading itself: the core idea of it, how it solves some of the issues with parallelism, and the models and heuristics that we use. From there we'll get into some internals; we'll talk about the actual implementation of the optimization,
B
so anyone could go into the code, take a look at it and make sense of what's happening under the hood. Then we'll wrap things up with performance results and perhaps some future work that can come out of this.
B
So I'll just quickly go over garbage collection. As many of you know, garbage collection is automatic memory management, with the end goal of reclaiming garbage, that is, memory that's allocated by a program but no longer referenced. But even though the end goal is reclaiming memory, memory management implies a lot more than that: it has to handle allocations,
B
it has to worry about the heap layout and memory layout, have access barriers in place, ensure object validity, things of that sort. All that to say, the GC touches a lot of different areas in the runtime other than just popping up when memory needs to be reclaimed. Nowadays the majority of languages, at least the top 10 if you take a look at them, provide some sort of GC component, and I don't think we need to make a case for it.
B
There are a lot of advantages: it eliminates a big category of bugs, speeds up development, lets developers focus on the problem at hand, etc. But without a doubt there are some associated costs and drawbacks. Although we can't get rid of all of them,
B
we can have optimizations in place to handle them in a smarter way, and also look for certain patterns, certain scenarios and situations where we can mitigate some of those drawbacks; the biggest drawbacks being the runtime costs associated with it and unpredictable application pauses.
B
I think, before speaking about optimizations, it's important to understand the user's perspective. From a high-level user perspective, it's usually a compromise between application throughput, pause time and sometimes footprint, and these can be considerable things for the user.
B
For example, if the user has a VM deployed in the cloud, there are thousands of instances running and they're being charged based on memory footprint, then footprint is something significant for the user to consider. What this means for us as GC developers is that the internal technology we develop to accommodate these different goals of the user is significantly different: we have different technologies to accommodate these different goals. So, for example, we could have different heap layouts, a flat heap versus fixed-size regions,
B
or, for example, generational versus just traditional mark-sweep collection. But I think it's most evident when you consider stop-the-world technology versus concurrent, and its implications on throughput and pause time: a user developing an application that's more sensitive to pause times might want to consider concurrent technology and give up some throughput, while someone who's more concerned with throughput might be
B
okay with the longer pauses and might go for stop-the-world. So, as developers, we have to accommodate these different needs and requirements with different technology. And I think, to understand and categorize the OMR GC technology specifically, we can think of it in terms of policies.
B
So, in order of increasing complexity from the top down, we can start with optthruput, which is the simplest of the three policies I have listed here. Optthruput is a traditional, standard stop-the-world collector, mark and sweep; it gives very good throughput, but at the expense of some pause time. And optavgpause builds on that with a concurrent component:
B
it adds a concurrent marking component to that stop-the-world collector, and it gives up some throughput for better pause times. Then we have gencon, which takes those concepts from optavgpause, uses that as a global collector, and then introduces a new local collector for a generational style of garbage collection.
B
For the purposes of this talk we'll be speaking about gencon specifically, as it is the default in OpenJ9; it provides the best results in most cases and is a good compromise between pause time and throughput. But even though we'll be speaking about gencon, the concepts of parallelism and adaptive threading apply universally to all the different styles of collection.
B
The underlying technology actually comes into perspective when you look at the internal hierarchy, the class structure of the different collectors. This is the high-level hierarchy of the collectors themselves. If you notice, optthruput is a parallel global GC collector, and optavgpause
B
inherits components from the parallel collector and adds a concurrent component to it, whereas gencon uses two different collectors: it uses scavenger as a local collector and it uses the concurrent GC as a global collector.
B
Again, for the purposes of this talk we'll be speaking about scavenger specifically, since the majority of collections happen in scavenger by default, as gencon is the default policy in OpenJ9. So first we'll take a look at things from an application perspective and see what's happening: if you were to observe GC activity and see what's happening relative to the application, you would see something like this.
B
We'd see application threads running: the application is running, doing useful work, allocating, doing activity, and at some point perhaps there's an allocation failure, where there's not enough memory to allocate an object. At this point the GC kicks in and there's a pause: application threads are halted, the GC does its work to reclaim memory, and then it releases control back to the application threads with freed memory, and they continue on.
B
This is how the default scavenger currently works, and there's also a concurrent variant, which I have also shown a trigger for. I think the only thing relevant to our talk here is that even with the concurrent variant we also have these pauses; even though it's a concurrent style of collection, the pauses are a lot shorter, but they're there.
B
So we're focused on observing these pauses and how they can be made shorter, or whether there's anything obvious that sticks out in certain scenarios or cases that might result in unpredictable or unjustified long pauses. We'll be taking a look at the effects of parallelism on these GC pauses and how we can end up with some long, unnecessary pauses as a result.
B
So, parallelism. I think parallelism is a given with any modern application that's CPU intensive, that's processing work, that's computation heavy, especially in the last 10 to 15 years with how computing has evolved. Throughout the years, computing hardware has increasingly scaled and there are more resources available, and consequently garbage collection tasks are parallelized to take advantage of this and overall be more performant.
B
So parallelism decreases pause times, as we have more resources available to us. All collectors in all major VMs have parallelism; it's a given, there's nothing new about it in general. It's a key optimization to reduce GC cycle times: for example, traversing an object graph with two threads will be a lot faster than doing it with one, or with any larger multiple in that case.
B
So currently it would make sense to use all the available resources. That's why, in OpenJ9 at least, the total number of GC threads used is equal to the number of hardware system threads available, so we maximize the utilization of available resources.
B
If we come back to this diagram, zoom into the GC pause time to see what's happening there, and look at parallelism and the threads running, we would expect to see one thread executing, known as the main GC thread, and at some point worker threads are spawned, so parallelism kicks in. These worker threads do work, helping the main GC thread, then they finish up and the main GC thread continues on and releases control back to the application.
B
If you analyze this further, we would see three distinct phases during collection: two phases where the main thread is running exclusively, without worker threads, and the phase where the bulk of the work is happening, where the actual collection takes place; that's the second phase in this bottom diagram. In the first phase the main thread is just setting up, doing some reporting and spawning the worker threads themselves, and at the end main is doing some cleanup work, reporting and bookkeeping.
B
So you're probably wondering: what's the issue here, what's the big deal? Everything looks okay, and if we have more resources we should be able to use them, right? But let's take a look at this example, which tells a different story. We have the same workload running on the same system, but in one run we're limiting the parallelism to four threads, whereas in the other run we're letting it use all the available resources, which was the default behavior before adaptive threading.
B
So there's something clearly wrong here: the pause times are consistently three to four times higher with the 48 threads, so something's wrong. Intuitively, a larger system should perform better than a small one: a 48-core system should outperform a four-core system; it should either be better or at least the same.
B
It shouldn't be worse; you shouldn't be penalized for using a larger system, that just doesn't make any sense. But that's what we're seeing. So we have to ask ourselves and look at whether there's any cost or overhead associated with parallelism.
B
We know that there are additional requirements with multi-threading. We know that we have to synchronize: there are critical sections, we have to access global resources, and there can be races between threads, so we have mutexes and semaphores and all those different synchronization mechanisms in place. And we also have to manage threads: they have to be dispatched, they have to be suspended, and they have to be notified to wake up when they're idle.
B
So let's look more into this issue, at what exactly is being synchronized and why we have this overhead in the code, in the technology.
B
For example, with mark maps, the mark bits for multiple objects share one word, and different threads marking those objects can race in updating the mark map, so we have atomic operations there, which we can't get around; they're compare-and-swaps, atomics. GC threads are also frequently pushing and popping from the work stack, so we need a mutex there to control access. And then we also have an issue with the distribution of work: threads can go idle when they don't have anything to do.
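As a rough illustration of the mark-map race, here is a minimal compare-and-swap sketch; the names are illustrative, not the actual OMR identifiers:

```cpp
#include <atomic>
#include <cstdint>

/* Illustrative sketch only (not the OMR code): mark bits for several objects
 * share one mark-map word, so concurrent GC threads must update the word
 * atomically. Returns true if this thread won the race to mark the object. */
bool atomicSetMarkBit(std::atomic<uintptr_t> &word, uintptr_t bitMask)
{
    uintptr_t oldValue = word.load(std::memory_order_relaxed);
    do {
        if (0 != (oldValue & bitMask)) {
            return false; /* another thread already marked this object */
        }
        /* compare-and-swap; on failure oldValue is refreshed and we retry */
    } while (!word.compare_exchange_weak(oldValue, oldValue | bitMask,
                                         std::memory_order_acq_rel));
    return true;
}
```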
B
These are all different things to consider when we talk about synchronization between the different threads.
B
So in the example before, with the 48 threads versus the four threads running the same workload, there was obviously noticeable overhead, which resulted in the higher collection times, and given that the only varying parameter between the two runs was the number of utilized threads, we can very confidently say that the overhead was the result of parallelism. And I also mentioned before that parallelism is key in reducing pause times, but we're seeing these long pauses, which aren't acceptable.
B
So how can we explain this, and how do we reconcile these two ideas? For that, we consider two cases where there might be a problem with parallelism, where we might actually end up with detrimental parallelism. The two main reasons: one of them is that there's little work to be distributed to the different threads, and the second one is CPU saturation, that is, high CPU utilization.
B
In the first case, there's a limited amount of work that can be divided between the threads, so threads end up being underutilized but still incur overhead. Also, depending on the workload and the object graph, collection may only be parallelizable up to a certain number of threads, so there's an imbalanced distribution of work when too many threads are utilized.
B
So with this type of parallelism we end up incurring overhead without gaining any benefits. And in the second case, CPU saturation, the effectiveness of parallelism is limited by the availability of the threads. This can be true, for example, for a system running multiple VMs: in that case, system threads are shared among the different VMs and the different GCs. For example, if you have two VMs running, the same set of threads will be scheduled to both.
B
There will be a lot of context switching, and when a GC takes place, it will need the threads scheduled to it to continue on and to clear synchronization points. So, overall, these kinds of shared threads limit GC progression and, as a result, provide ineffective parallelism; and the other big cost is the context switching, which has a big impact on GC performance.
B
So, overall, unless the benefits of parallelism are greater than the overhead, parallelism will be detrimental and cause GC times to increase unnecessarily, and this overhead can be significant: it increases proportionally with the number of threads that are utilized.
B
So, coming back to this example: if we were to run it again with 8, 16, 24 or 64 threads, we would see that as we increase the number of threads, the issue is compounded; if we were to plot 64 threads here, we would see that the pause times are even higher than with 48 threads.
B
This kind of reminds me of, we could draw a parallel with, the example of too many cooks in the kitchen: when there are too many people working together on something, it can result in the final product being negatively affected, and I think that's precisely what's happening with these threads.
B
Knowing this, we have to ask ourselves what the right number of threads is. Are we using too many threads and getting bad performance? If so, how much do we decrease by? Or is it possible that we're not using enough threads and missing opportunities to parallelize the task further and gain benefits?
B
So it's kind of like the Goldilocks situation, if you're familiar with it: we're looking for that sweet spot where it's not too high and neither is it too low; we're looking for the perfect number. That brings us to sub-optimal versus detrimental parallelism, where what we're asking is: is there a net loss, or are we losing potential gains?
B
So, in general, we could say parallelism is most beneficial when done with the correct number of threads, and we can call this number the optimal thread count. Anything more than this number would result in unnecessary overhead; on the other hand, a thread count less than the optimal thread count would be considered sub-optimal, since there would be more opportunities to parallelize the task further and gain benefits.
B
These two tables show a comparison of two different workloads. The table on the left shows a workload running with 48 threads, eight threads and four threads on the same system, and we can see that as we decrease the thread count there's decreased parallelism, obviously, but the cycle times also decrease and the performance results actually increase.
B
On the table to the right, we're running a different workload, again doing the same experiment, running with 48 threads, eight threads and four threads, but here we actually observe that peak performance is obtained when we run with eight threads: with 48 threads we're getting too much overhead, and with four threads we're missing opportunities to further parallelize the task with more threads, so we're missing out on benefits there.
B
So here we can say that with 48 threads we have detrimental parallelism, and with four threads we end up with sub-optimal parallelism, and the point at which we reach peak performance can be referred to as the equilibrium point; in the table to the right that would be around eight threads, since we get the best performance there.
B
So, with all that being said, we need something to solve these issues that we see with parallelism, and we know that it's not so straightforward.
B
Before we get deep into adaptive threading, I think it would be helpful to draw a parallel with an example that we all might be familiar with, and that's Brooks's law, which states something like: adding more resources to a late project makes it even later. I think the idea of adaptive threading can be related to this, so we can draw a parallel. If we look at the graph, we see the months until completion of a project against the number of people contributing to the project.
B
As we add more people and resources to a project, the time to completion decreases, until a certain point where any more people added on become a net loss, because there are increased coordination costs, onboarding and ramping up. So there's this inflection point, which we would say is the optimal number of people to have on a project. In terms of GC pauses and threads,
B
we could say we have the same relationship, where adding on helper threads, increasing parallelism, results in decreased pause times only up to a certain point, after which each additional thread contributes to a loss, because there's the additional overhead without getting any benefits. So this inflection point can be seen as the equilibrium point at which we reach optimal parallelism: anything beyond this number results in increased cost, and anything before this number leaves potential gains on the table.
B
So, coming back to the Goldilocks example, where we're looking for the sweet spot, or the Goldilocks number, which is just right: adaptive threading is looking for that sweet spot based on observations from a completed cycle, and we want this number to be just right, so that we're not losing out on benefits and neither are we incurring unnecessary overhead.
B
So, essentially, adaptive threading is an optimization, answering questions such as when to adjust and how much to adjust by, and it does so to tune parallelism and seek that equilibrium we're talking about, where parallelism results in peak performance.
B
So it's a systematic approach designed to identify these detrimental and sub-optimal scenarios, determine the degree to which parallelism is sub-optimal and, as a result, give a recommendation for how we can actually reach that optimal thread count from where we are currently. Adaptive threading also needs to ensure that the changes in parallelism are not invasive, given anomalies or one-off observations.
B
So it needs to take everything into consideration, and one thing to note is that this optimal thread count isn't static: it changes over the lifetime of the application. So adaptive threading is a continuous process of re-evaluating and re-adjusting the thread count. For example, workload and load distribution properties can change as the object graph changes: you could have higher allocation rates, or the live set increases; or, in terms of CPU utilization,
B
CPU utilization can be higher or lower later on. So it's a dynamic system.
B
So let's compare this to the traditional approach that we had in place. Before adaptive threading, the only things we had in place to mitigate this were looking at the heap size to roughly estimate the number of threads to be used, and then manual tuning, where the user could set
B
the thread count themselves. But manual tuning requires tedious experimentation and analysis of the application's performance, and overall, traditional tuning always assumes that this optimal thread count is static and just doesn't change, whereas we know that it can change over the lifetime of an application.
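As a point of reference, both knobs exist as OpenJ9 command-line options; the names below are from the OpenJ9 documentation (the -XX toggle shipped alongside this work, so availability depends on the release):

```
java -Xgcthreads8 MyApp               # manual tuning: pin the parallel GC thread count
java -XX:+AdaptiveGCThreading MyApp   # let the GC adapt the thread count on its own
```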
B
So adaptive threading is a superior solution to these.
B
It's a systematic approach that's used to evaluate a completed cycle by looking at two things: the number of threads that were utilized, and overhead data from using that number of threads. In terms of overhead data, we're specifically talking about busy times and stall times for each thread participating in garbage collection. Busy and stall times are key: they're what drive adaptive threading. These measures hint at CPU utilization,
B
and they tell us about the object and workload distribution. Busy time is any time a thread is performing useful GC work which contributes to completing the cycle, like scanning objects, processing roots, copying or marking objects. Stall times are more interesting.
B
These are times when threads are doing non-useful or trivial work, or times when a thread is idle, not doing anything at all; these don't directly contribute to completing the cycle. It includes things like pushing and popping something to a shared list, acquiring a synchronization monitor, idling while waiting for work, and notifying idle threads. All of these can be considered stall times, where the thread is doing non-useful things.
B
One thing to consider is that different types of stalls have different characteristics and varying dependency on the number of utilized threads. What that means is that these different types of stall times respond differently when we change the number of threads.
B
Specifically, we have three different types of stalls that we can categorize: synchronization stall, resume stall, and idle waiting for work. Resume stall is mostly dependent on the OS and the platform, whereas idle waiting for work is more dependent on the object graph, the live set and the availability of work for the thread.
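As a rough illustration of the bookkeeping this implies, here is a minimal sketch with hypothetical names (the real OMR accounting lives in its GC stats structures):

```cpp
#include <chrono>
#include <cstdint>

/* Hypothetical per-worker accounting, one counter per category from the talk. */
struct WorkerTimes {
    uint64_t busyNs = 0;        /* scanning, copying, marking: useful work */
    uint64_t syncStallNs = 0;   /* waiting at a synchronization point */
    uint64_t resumeStallNs = 0; /* OS/platform cost of being woken back up */
    uint64_t idleStallNs = 0;   /* idle, waiting for work to show up */
};

/* Scope timer: charges the elapsed wall-clock time to one counter. */
class ScopedTimer {
public:
    explicit ScopedTimer(uint64_t &counter)
        : m_counter(counter), m_start(std::chrono::steady_clock::now()) {}
    ~ScopedTimer()
    {
        m_counter += std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - m_start).count();
    }
private:
    uint64_t &m_counter;
    std::chrono::steady_clock::time_point m_start;
};

/* Usage at a synchronization point (waitForOtherWorkers is hypothetical):
 *   { ScopedTimer t(times.syncStallNs); waitForOtherWorkers(); }
 */
```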
B
For these different threads, it would look something like this. Over here we have some examples of the different time measures and where the stalls are coming from: we have synchronization stall in red, resume stall in yellow, and the time to notify in purple, and these are all times which we would consider non-useful or trivial, because they don't explicitly contribute to completing the GC work. And then we have busy times, in green.
B
So in this example there's a synchronization point, and there are four threads which have to clear it. Worker three is the first thread to reach the synchronization point, and it waits there for all the other threads to reach it; this time that it's idle we would consider a stall time. Then the last thread, worker one, reaches the synchronization point and notifies the rest that we're all synchronized.
B
So it notifies them to wake up, and there's actually an overhead for it to go ahead and wake up the different threads, so we would consider that a stall and measure it. Then, as the worker threads are waking up, they have to reacquire the mutex, the monitor; they're waking up one by one, and so there's an overhead to resuming these threads as well.
B
So let's take a look at the model, to understand how this information helps us estimate an optimal thread count. An implementation of the model can be derived from finding the minimum of a GC time function, and this GC time function proves to be pretty good. What this function basically does is project the total duration of GC for m threads, given the busy and stall times observed while performing GC with n threads.
B
So if we have n threads, then we know the busy and stall times that resulted from using n threads, and we can project the total duration of GC if we were to use m threads instead. If we solve that for the minimum, we end up with something like equation one, which gives us the optimal number of threads as a function of the previously observed thread count and the busy and stall times resulting from it.
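The slides with the actual equations aren't reproduced in the transcript, but the following sketch is consistent with the description here and with the worked example later in the talk. Suppose the completed cycle used n threads, with observed aggregate busy time B and stall time S; the projected GC time for m threads is then of the form:

$$T(m) \approx B \cdot \frac{n}{m} + S \cdot \left(\frac{m}{n}\right)^{X}$$

The busy work divides across the threads, while the stall grows with the thread count, with X (a model constant, discussed below) capturing the non-linear dependency of stall on thread count. Taking the derivative, setting it to zero and solving gives equation one, the optimal thread count:

$$\frac{dT}{dm} = -\frac{B\,n}{m^{2}} + \frac{S\,X\,m^{X-1}}{n^{X}} = 0 \quad\Longrightarrow\quad m_{\mathrm{opt}} = n \left(\frac{B}{X \cdot S}\right)^{\frac{1}{X+1}}$$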
B
On top of that, we're basically using heuristics and weighted averaging to massage the thread count and to make sure it's not too invasive, using experimental data that we know gives better results. Equation one can also be rewritten, for simplicity, in terms of percent stall: instead of being written in terms of three parameters, we can reduce it down to two parameters. The final model is shown at the bottom here, outlined. But just stating the percent stall here is somewhat of an oversimplification, because coming up with the percent stall is actually more involved and requires closer consideration.
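With percent stall defined as the stalled fraction of total thread time, s = S/(B+S), we have B/S = (1-s)/s, so the same formula needs only two inputs (n and s), which is presumably the two-parameter form mentioned here:

$$m_{\mathrm{opt}} = n \left(\frac{1-s}{X \cdot s}\right)^{\frac{1}{X+1}} \;\xrightarrow{\;X=1\;}\; n \sqrt{\frac{1-s}{s}}$$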
B
Another thing to note is this constant X: it's a model constant that helps model the non-linear dependency of stall times on the GC thread count. I won't go into too much detail about these different constants, like H, W and X; I'll leave a link that goes into these in detail.
B
So let's see this in action. This table gives an example of how the model and heuristic approach works: it gives a matrix of inputs and the resulting recommended thread count. For example, if at the end of a cycle we calculate a percent stall of ninety percent after using 12 threads, then the model would recommend throttling the thread count down to eight for the next cycle.
B
Similarly, if we were to use 12 threads and we determined the percent stall to be 35 percent, we would increase to 15 threads. Then, in the next cycle, we would re-evaluate again: did adjusting the thread count actually help, and what effect did it have on the percent stall? And if it's positive, it could either be stable, and hold that thread count, or it could continue adjusting it over the subsequent cycles.
B
Here the model constants are X equal to one, the thread booster equal to 0.85, and the weighted averaging at 50 percent.
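Putting the model and the quoted constants together, a sketch like the following reproduces both rows of the example table; note that the exact way the booster and the weighted averaging are applied here is an assumption chosen to match the worked example, not necessarily the actual OMR code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

/* Model constants as quoted in the talk. */
static const double X = 1.0;        /* non-linearity of stall vs. thread count */
static const double BOOSTER = 0.85; /* thread booster heuristic */
static const double WEIGHT = 0.5;   /* weighted averaging with previous count */

/* Recommend a thread count for the next cycle from the completed cycle's
 * thread count and percent stall (hypothetical sketch, illustrative names). */
unsigned recommendThreads(unsigned current, double percentStall, unsigned maxThreads)
{
    /* Equation one in its two-parameter (percent stall) form. */
    double optimal = current * std::pow((1.0 - percentStall) / (X * percentStall),
                                        1.0 / (X + 1.0));
    /* Booster leans the result toward more threads (assumed applied as a divisor). */
    double boosted = optimal / BOOSTER;
    /* Weighted averaging keeps the adjustment from being too invasive. */
    double next = WEIGHT * current + (1.0 - WEIGHT) * boosted;
    return (unsigned)std::max(1.0, std::min(next, (double)maxThreads));
}

int main()
{
    /* The two rows from the example: 12 threads at 90% stall -> 8,
     * 12 threads at 35% stall -> 15. */
    std::printf("%u\n", recommendThreads(12, 0.90, 64)); /* prints 8 */
    std::printf("%u\n", recommendThreads(12, 0.35, 64)); /* prints 15 */
    return 0;
}
```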
B
C
B
If we come back to the diagram, look at the three different phases and compare them to this flowchart on the right, we can see how adaptive threading works at the different phases. During pre-collection we actually want to adjust the thread count based on the previous cycle's observations, so before the main thread spawns the worker threads, we want to see if there's any recommended thread count based on previous observations. Then, during the collection itself,
B
we want to collect all of the overhead data, and during post-collection we want to use that collected data to project an optimal thread count which can be used for the next cycle. Like I mentioned, the next cycle's pre-collection looks at the thread recommendation from the post-collection of the cycle that completed just before it. So let's take a look at each phase in a bit more detail, starting with post-collection.
B
So at post-collection time, what adaptive threading does is this: once the main thread completes, it checks whether all the worker threads have suspended or not, and if they haven't, it waits for all the worker threads to complete. Once they complete, it goes and aggregates all of the overhead data, and from there,
B
at this point, we know the total cycle time, and it can compute the percent stall by massaging some of that overhead data, and then compute the optimal thread count by providing that percent stall to the model. From there we can determine and set the number of recommended threads for the next cycle. So during the next cycle, during pre-collection, when the main thread goes to dispatch the worker threads and it invokes the parallel dispatcher,
B
it checks if there's been a recommendation, and if there's a recommended thread count, it forces the parallel dispatcher to use that when spawning the worker threads.
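As a sketch, the pre-collection side of that hand-off could look like this (illustrative names, not the actual dispatcher API):

```cpp
#include <algorithm>
#include <cstdint>

/* Hypothetical pre-collection check: use the previous cycle's recommendation
 * if there is one, otherwise fall back to the default (maximum) count.
 * recommendedCount == 0 means no completed cycle has produced one yet. */
uintptr_t workerCountForNextCycle(uintptr_t defaultCount,
                                  uintptr_t recommendedCount,
                                  uintptr_t maxThreads)
{
    if (0 == recommendedCount) {
        return defaultCount; /* first cycle: nothing observed yet */
    }
    /* Never exceed the platform maximum: recommendations are capped,
     * as mentioned in the Q&A at the end of the talk. */
    return std::min(recommendedCount, maxThreads);
}
```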
B
In terms of the collection itself, this is where the actual GC work is happening, where we're processing roots, scanning, doing some copying and whatnot, and here we're actually measuring the different stalls and the different idleness, trying to pick up on the utilization of the threads. Specifically, when we're scanning, threads go idle all the time, so we want to put things in place to take measurements there and track idleness.
B
So if we put all that together, we end up with something like this, where the pre-collection routine has this adaptive threading component implemented, and then in post-collection we have the model actually going and projecting the thread count.
B
If you were to see adaptive threading in action, I think these two figures illustrate it. They show how adaptive threading works to reduce parallelism overhead by trying to reach that equilibrium point, or optimal thread count. The figure on top shows the percent stall and the number of threads at any given cycle, and for each cycle the number of threads is determined by the number of threads utilized at the previous cycle and the percent stall that resulted from using that many threads.
B
For example, if you take a look at cycle number two, we see that the thread count is 30, and that's determined by adaptive threading taking a look at the thread count of 48 with a percent stall of 94: it brings the thread count down because it thinks the stall is way too high and is leading to detrimental parallelism.
B
The bottom figure shows a breakdown of the total cycle times: it shows the total cycle time, then the proportion of time that's considered stall time and the proportion that's considered busy time. Looking at this cycle time breakdown, we see that initially the cycle times are very high and the majority of the time is coming from stall, or parallelism overhead, and as we decrease the thread count we observe a significant drop in cycle times, indicating that the parallelism overhead is reduced.
B
Overall, we can see that the initial thread count of 48 from cycle one ends up at around six to seven, so it reaches a stable point around there, and in this particular case that results in a seventy percent decrease in cycle times and about a twenty-five percent increase in performance.
B
I think another interesting example is the multi-VM, or high CPU utilization, scenario. These two tables show multiple VMs running on the same system: the set on the left shows it without adaptive threading and the set on the right shows it with adaptive threading enabled. There are six VMs running on the system
B
at the same time, where VMs one to four are running the same GC workload for the same application, while five and six are running two different applications, so we have two different workloads there. When we compare these two tables, we can see that the average collection times are actually lower with adaptive threading; this is most apparent when taking a look at the row for VM 5 over here.
B
Another interesting thing to look at here is the thread distribution: for any given number of threads, this figure shows how many collection cycles utilized that number of threads. So for VMs one to four,
B
we can see that the default of 64, the maximum thread count, was used for 26 cycles, but we do have variability: anywhere from 40 to about 57 threads used on average for the majority of the collections. But I think the effect of adaptive threading is most apparent on VM 5, where we actually get the most benefit: we have a 77 percent reduction in collection time.
B
Over here you would notice that only three collection cycles used 64 threads, which would have been the default had adaptive threading not been enabled, and we can see that on average we're using seven threads. So this is clearly reducing the detrimental parallelism, and we can observe something similar for VM 6, and similarly for VMs one to four.
B
So that sums up adaptive threading and parallelism. In terms of future work: I think the model can be made more aggressive. Right now it's pretty conservative: it throttles the thread count down readily, but it's more reluctant to use more threads. I think it could be more aggressive, but that needs better detection to find anomalies, so as not to be too invasive.
B
Another thing is that this could be expanded to other collectors as well; right now it's only for scavenger, because that's doing the majority of the collections by default.
B
That would require continuous calculations and adjustments; that work would be more involved, and I think it would be a bit more complicated to get working.
B
Another thing is that we could use AI, or some sort of machine learning, for smart recognition to tune the model constants themselves, so the weighted averaging constant or the thread booster heuristics; I think those could benefit from some AI concepts and machine learning. That pretty much sums it up. Thank you, and if anyone has questions, I'd be happy to answer them.
A
B
Yeah, that's a great question. It would recommend more, but if you have, say, a 64-core system, that's what the default would be, and we would actually cap it at that. So it would try to recommend more, but we would cap it at the maximum default.
D
B
E
That equation that you showed earlier, or that formula: where did we get that from? It was several slides back... yeah, I think maybe these slides here.
B
So, the number of optimal threads: that comes from taking the derivative of this equation, setting it to zero and solving, so essentially finding the minimum. This time function is something that Alex came up with, and through experimentation it proved to be pretty good. It's just one implementation of the model, but it was sufficient. So if we take the derivative, set it to zero and solve for m, we get that.
F
But basically, what that first formula says is, I guess, that the time is directly or inversely proportional: we have this ratio, n over m or m over n, where one is the current number of threads and the other one is the projected one, yeah.
F
Yeah, so we have two components here: one is saying, for example, that the time is directly proportional to the busy times, and inversely, you know, for the stall times; however, there is some non-linearity, where this factor X comes in. You know, roughly; it wasn't really that straightforward a formula, maybe there could be some other variant, but you know.
F
C
B
So 100 percent of the code related to adaptive threading is actually in OMR. The only thing to consider is hooking up a VM to use scavenger, and I think that requires a bit more work than using the traditional mark-sweep collector that we have. So it's just about hooking it up to scavenger and using the generational collection from OMR, and this would be available.
C
Okay, that's very cool, that's a great answer. I guess it suggests the follow-on question: how hard would it be for someone to use the scavenger from the OMR GC? Maybe that's also kind of a question for Alex; I'm not sure if you have a pat answer to that kind of question.
B
Yeah, I'm actually not aware of another VM using gencon, or using scavenger or a generational style of collection from OMR; from the other talks I've heard, at CASCON and the like, they usually tend to use the traditional stop-the-world collector. But Alex, are you aware of the amount of work that's involved in actually hooking up another?
F
Yeah, so scavenger is mostly ready to be used by other, I guess, platform languages other than Java. People and projects have successfully incorporated scavenger, but probably the biggest, I guess, glue code to be written is around the write barriers; there is a little bit of glue code to be written around the write barriers.
F
You know, basically the generational barriers: we have to ensure that whenever a mutator thread mutates a reference, it notifies the collector about this mutation. Some examples could be taken from J9, but I think there is a little bit of code on the J9 side that could probably be better shared with OMR, that we don't share right now. This is something that maybe in the future we can do better.
F
A
Okay, if not, I'll thank you again, Salman, for putting together this informative talk. And, oh, the...
B
Oh yeah, yes, it has, yeah. I was going to post links to the different issues and things, but I can share that as well in the Slack channel.
A
Okay, terrific, yep. That looks really good. So thanks again, and thanks everyone for attending, and we'll talk again in two weeks then. Thank you.