From YouTube: Scalable Locking
Description
Currently, Eclipse OpenJ9 uses OMR’s Test & Test & Set (TATAS) locks a.k.a. spinlocks with compare and swap (CAS) for synchronization, which are known to be unfair. TATAS locks collapse on massively parallel systems during high lock contention where many threads attempt to acquire a lock simultaneously. This talk will cover the following:
1) Does transitioning to scalable locks, such as the Mellor-Crummey & Scott (MCS) queue-based spinlock, in OpenJ9 resolve the TATAS bottleneck?
2) Do features, such as lock cohorting, concurrency restriction, transactional lock elision (TLE) and scalable statistics counters, help further improve locking performance in OpenJ9?
A key concept here is mechanical sympathy, which refers to understanding how the hardware works and taking it into consideration when designing the software. With this in mind, we will see how OpenJ9's current locking strategy becomes a bottleneck under contention, learn how the current design goes against the principle of mechanical sympathy, and then dive into scalable locks and their associated features. The goal is not just to resolve the bottleneck seen in OpenJ9 locking, but also to make OpenJ9 locking more scalable, competitive and future ready.
This talk focuses on the scalability of locks. To evaluate a lock's scalability, we need to know what lock contention is. Contention is measured by the number of threads that are competing to acquire the lock at the same time. Low lock contention refers to fewer threads wanting to acquire the lock, whereas high lock contention refers to a substantially larger number of threads that want to acquire the lock.

Please refer to the graph on the slide to visualize how the lock's performance, which is on the y axis in terms of time to acquire a lock, varies with the lock contention, which is on the x axis and represented by the number of threads that compete for the lock at the same time. The OpenJ9 bottleneck, which will become clear later, arises during high lock contention, and the two main symptoms of high lock contention are a drop in throughput and very high resource utilization, which prevents useful work from being done.
The Java language abstracts locking, so a Java developer doesn't need to worry about locking in the Java code. The Java language provides a synchronized keyword for abstracting locks, and an example use case of the synchronized keyword has been provided on the slide. The Java Virtual Machine is responsible for supporting the synchronized keyword and implementing the locking features. An example of high lock contention in Java would be where a hundred threads use the synchronized keyword on a single object at the same time. Next, we dive into the OpenJ9 locking bottleneck; let's see how OpenJ9 implements locking.
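The slide's synchronized example is not captured in the transcript; a minimal sketch of the scenario described, a hundred threads synchronizing on a single object at the same time, might look like this (the class and method names are illustrative, not from the talk):

```java
public class SynchronizedCounter {
    private long count = 0;

    // The JVM implements the locking behind this keyword; under
    // contention, OpenJ9 backs it with a system monitor.
    public synchronized void increment() {
        count++;
    }

    public synchronized long get() {
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedCounter counter = new SynchronizedCounter();
        Thread[] threads = new Thread[100];
        // 100 threads contend on the same object's monitor.
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) counter.increment();
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(counter.get()); // 100000
    }
}
```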
OpenJ9 uses a data structure named the system monitor for locking when there is lock contention. This data structure is not only used with Java objects, but is also used by the VM, JIT and GC native threads. This data structure is maintained in OMR. It implements a type of lock which performs well at low lock contention but collapses at high lock contention. Next, we will study the type of lock used in the system monitor and why it collapses at high lock contention.
The system monitor implements a Test & Test & Set lock, or TATAS lock, which is a type of lock with a global lock state. The key word here is global: the global lock state is shared among all the threads that want to acquire the lock. We will see how this global lock state becomes the bottleneck. Before diving into the TATAS lock, we will study the performance bottleneck in the context of the Test & Set, or TAS, lock, which is a simpler form of the TATAS lock. A simple implementation of the TAS lock is shown on the slide.
TAS locks rely upon a compare-and-swap, or CAS, operation in order to update the global lock state, and threads spin indefinitely performing the CAS until they can acquire the lock. In practice we won't spin indefinitely: we use a technique known as spin-then-park, where threads only spin for a short period of time and then park themselves. Parking allows the processor to perform useful work instead.
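The TAS acquire on the slide is written against OMR's C primitives; as an illustrative translation only, a minimal TAS-style lock using Java atomics might look like this (the spin-then-park refinement is noted in a comment rather than implemented):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TASLock {
    // Single global lock state shared by every competing thread.
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // Spin, attempting an atomic update of the global state
        // (compareAndSet is Java's CAS primitive). A production
        // runtime would spin briefly and then park the thread
        // ("spin-then-park") instead of spinning indefinitely.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait(); // CPU hint; no backoff in this sketch
        }
    }

    public void unlock() {
        locked.set(false);
    }
}
```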
Here you can see the MOESI protocol, which is used on most modern architectures. In this protocol, a cache line can exist in any of the five listed states: modified, owned, exclusive, shared and invalid. How does a CAS operation impact the processor's caches? Consider the performance of an uncontended CAS, which would translate to one thread trying to acquire the lock: the global lock state would persist in one cache in an exclusive state, which is going to be very cheap to maintain. As the lock becomes contended, different threads would perform CAS on the global lock state.
This would lead to a lot of cache invalidations and memory bus traffic when the lock is heavily contended. CAS operations can saturate the processor's caches and buses, and the locking code will prevent other useful work from being performed due to the hardware saturation. This would be a scalability killer for an application which wants to scale by increasing the number of threads. So far we have covered the CAS operation; the details differ on other architectures, such as PPC.
Everyone should know that the L1 cache is smaller, faster and closer to the core, while the L2 cache is bigger than the L1 cache and further away from the core. In this example, threads 1 and 3 are scheduled on core 1, and threads 2 and 4 are scheduled on core 2. Thread 1 wants to acquire the lock. Thread 1 will get the lock's global state from the main memory through the L2 cache and then the L1 cache, and, let's assume, no one owns the lock at this point, so thread 1 will successfully acquire the lock.

As long as only thread 1 competes for the lock, the global state will persist in one set of caches in an exclusive state, which is going to be inexpensive to maintain. Thread 1 still owns the lock. Now thread 2 also wants to acquire the lock; it will execute the CAS operation in an infinite loop until it acquires the lock. This is the acquire function we saw a few slides ago. Every CAS that thread 2 performs will cause bus traffic and cache invalidations. This is not useful work.
Let's scale the previous example and consider a multi-core processor with 96 cores. Instead of only two cores competing for the lock, let's assume a thread running on each core wants to acquire the lock. The lock's global state would need to be maintained in all the ninety-six cache sets in the processor.
This is going to be very expensive: you're wasting resources on a very expensive processor just for acquiring your lock, and no useful work is being done by the application. The application's performance is going to be drastically impacted; most likely it will experience a scalability collapse. Again, I would like to remind everyone about mechanical sympathy at this point: know your hardware when you design your software. Neglecting this basic principle would lead to failure at some point.
OpenJ9 system monitors use Test & Test & Set, or TATAS, locks. The difference between TAS and TATAS locks is reflected in the acquire function, which is shown on the left side of this slide. In the acquire operation for the TATAS lock, there is an additional while loop which did not exist in the TAS lock. This while loop acts as a buffer for the CAS operation; it is supposed to reduce the frequency of CAS operations.
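As a sketch of the extra while loop described above (again an illustrative Java translation, not the OMR code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TATASLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            // The extra read-only loop: spin on a plain read until the
            // lock looks free. Reads are served from the local cache and
            // generate no invalidation traffic, so the CAS frequency drops.
            while (locked.get()) {
                Thread.onSpinWait();
            }
            // Only attempt the expensive CAS once the lock looked free.
            if (locked.compareAndSet(false, true)) {
                return;
            }
        }
    }

    public void unlock() {
        locked.set(false);
    }
}
```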
Here we see how the performance of a TAS and a TATAS lock compares. TAS and TATAS behave similarly; the difference is in the point of collapse. The TATAS lock scales a little better than the TAS lock before reaching its inevitable collapse. The slight improvement in the TATAS performance is due to the reduction in CAS operations, which resulted from the additional while loop we saw on the previous slide.
Now, on to variants of queue-based locks. Here I have listed seven variants of queue-based locks. Queue-based locks are highly valued for their performance benefits on modern processor architectures. They have been adopted by IBM in some of its products, and even the Linux kernel has started using queue-based locks over the past five years, going as far as to adopt the K42 variant. This demonstrates the value queue-based locks provide.
Each thread's queue node also contains a pointer to the next element in the queue, that is, to the next thread's node that requires the lock. To add elements to the queue, the atomic exchange operation is used. A question generally asked at this point is: what's the difference between an atomic exchange and a compare-and-swap operation? Both write to a memory address atomically. A CAS can fail if the comparison doesn't succeed, so it has to be repeated until it is successful, whereas an atomic exchange operation always succeeds. This summarizes the main difference between an atomic exchange and a compare-and-swap operation.
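The difference can be demonstrated with Java's atomic primitives, where getAndSet is the atomic exchange and compareAndSet is the CAS (a small illustrative example, not from the talk):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ExchangeVsCas {
    public static void main(String[] args) {
        AtomicInteger x = new AtomicInteger(1);

        // Atomic exchange: unconditionally writes the new value and
        // returns the old one. It always succeeds.
        int old = x.getAndSet(7);
        System.out.println(old + " " + x.get()); // 1 7

        // Compare-and-swap: writes only if the current value matches
        // the expected value, so it can fail and must be retried.
        boolean ok = x.compareAndSet(99, 3);   // expected 99, actual 7
        System.out.println(ok + " " + x.get()); // false 7

        ok = x.compareAndSet(7, 3);             // expected value matches
        System.out.println(ok + " " + x.get()); // true 3
    }
}
```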
When thread 1 wants to acquire the lock, thread 1 appends its thread-specific node to the queue's tail, which is shown as thread 1 or T1 on the slide, and the lock now points to thread 1's queue node, T1. Thread 1 notices that there is no one in the queue other than itself, and it ends up acquiring the lock.
B
To
you
all
so,
once
you
acquire
the
Lochner,
but
thread,
1
still
means
to
look
red
to
will
append
and
a
limit
to
the
Q's
tail,
which
is
going
to
be
similar
to
what
thread
wondered.
Look
will
point
to
thread
to
Z.
You
note,
because
in
a
queue
elements
are
generally
appended
to
the
tail
of
the
queue
aqua
pointed
t2
to
will
notice
that
there
is
another
threads
node
in
the
queue.
B
So
it
will
update
Iman's
next
field
to
point
to
its
key
node
and
it
is
going
to
busy
wait
or
spin
until
its
local
state
is
updated
to
true
by
the
current
owner
current
LOC
owner
when
it
releases
the
lock
step
forward.
Let's
say
it's:
3
also
wants
to
acquire
the
lock.
It
will
perform
the
same
steps
as
thread
to
a
pen.
It's
known
to
the
tail
of
the
queue
and
lock.
We
will
point
to
the
tread
to
thread
three's,
nerd,
3,
even
loaders
thread.
2
is
already
in
the
queue
waiting
for
the
lock.
Now consider the case where a thread wants to release the lock; in this case thread 1 releases the lock. While releasing the lock, thread 1 will update the local state in thread 2's node to true. Thread 2, which is waiting to acquire the lock, will note the change in its local state from false to true and then acquire the lock. Similarly, thread 3 will acquire the lock once thread 2 releases it, and at the end the queue will again be empty until other threads want to acquire the lock. This summarizes the basic working of the MCS lock.
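The steps above can be sketched as a minimal MCS lock in Java (an illustrative version in the style of the textbook algorithm; the OpenJ9/OMR implementation additionally handles out-of-order acquires and the park, wait and notify features):

```java
import java.util.concurrent.atomic.AtomicReference;

public class MCSLock {
    static final class QNode {
        volatile boolean mustWait = false; // thread-specific lock state
        volatile QNode next = null;        // pointer to the successor
    }

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

    public void lock() {
        QNode node = myNode.get();
        node.next = null;
        // Atomic exchange appends our node at the tail and returns the
        // previous tail, which is our predecessor (or null if empty).
        QNode pred = tail.getAndSet(node);
        if (pred != null) {
            node.mustWait = true;
            pred.next = node; // link so the predecessor can find us
            // Spin on our own node's state; only the predecessor will
            // ever write it, so no CAS is needed while spinning.
            while (node.mustWait) {
                Thread.onSpinWait();
            }
        }
        // pred == null: queue was empty, so we own the lock immediately.
    }

    public void unlock() {
        QNode node = myNode.get();
        if (node.next == null) {
            // No visible successor: try to reset the tail to empty.
            if (tail.compareAndSet(node, null)) {
                return;
            }
            // A successor is mid-append; wait for its next link.
            while (node.next == null) {
                Thread.onSpinWait();
            }
        }
        node.next.mustWait = false; // hand the lock to the successor
    }
}
```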
Why does the MCS lock perform better than the TATAS lock? The difference is in how the lock state is maintained. In the TATAS lock, there is a global lock state which is shared among all the threads. In the MCS lock, each thread has its own thread-specific lock state, and only one other thread, its predecessor in the queue, is ever going to update that local lock state.
Let's look at the acquire functions of the TATAS and MCS locks. This slide has the acquire functions for both the TATAS lock and the MCS lock; the TATAS acquire is on the left side of the slide and the MCS acquire is on the right side. In the TATAS lock there can be unbounded CAS operations while a thread spins to acquire the lock by updating the lock's global state, but in the MCS lock no CAS is needed while spinning, because only one thread will ever update a given thread's local state, so you don't need atomicity.
The performance data is taken from a locking paper, which is shown at the bottom of the slide. It was collected using a microbenchmark where threads compete for a critical section via a lock. It is written in C, and it doesn't reflect the performance of a Java application. The benchmark measures the lock performance while increasing lock contention.
The lock performance is shown on the y axis in terms of throughput of lock acquires per second, and the lock contention is shown on the x axis in terms of the number of threads that simultaneously compete for the lock. The blue line in the graph shows the performance of the TATAS lock and the orange line shows the performance of the MCS lock.
Just looking at throughput won't be a complete evaluation of the MCS lock. We also need to compare the worst-case space complexity, or the memory requirements, of the MCS and TATAS locks. The TATAS lock has a single global lock state, so its space complexity is proportional to the number of locks. In the MCS lock a queue is used, and each thread competing for the lock appends its own node into the queue, so the space complexity is going to be proportional to the number of locks multiplied by the number of competing threads.
For the MCS lock the space complexity depends on the lock contention, and in most Java applications only three to four percent of locks are highly contended, so we won't hit the worst-case space complexity for every MCS lock in the JVM. There is definitely going to be an increase in the memory requirement when transitioning from the TATAS to the MCS lock, but it is justified by the performance improvement from the MCS lock.
What is the current state of the MCS lock implementation in OpenJ9? We have implemented a basic MCS lock and incorporated it with the system monitor. Our implementation addresses special cases such as out-of-order lock acquires and releases, and support for the park, wait and notify features, which had to be implemented in the context of the MCS lock.
B
You
do
not
cover
these
special
cases,
so
we
may
not
see
performance
improvements
similar
to
the
Academy
graphs
or
performance
numbers
that
we
recently
saw.
Current
implementation
is
complete.
There
is
an
Omar
pull
request
open
the
MCS
implementation.
It
passes
all
the
Omar
and
open
j9
testing.
The
only
pending
task
is
performance
benchmarking.
After benchmarking, we can most likely merge this pull request after a code review. We plan to further optimize and improve the performance of the basic MCS lock. Let's dive into the future work: other ways through which we can improve the basic MCS lock implementation, in case MCS locks do not perform as well as TATAS locks in all workloads.
Does the MCS lock have similar performance to the TATAS lock under low lock contention? In the current OpenJ9 implementation, the TATAS lock takes two atomic operations, one in the acquire function and the other in the release function; this is the best-case scenario for the TATAS lock. The MCS lock, even in the worst-case scenario, will only perform two atomic operations, one in the acquire function and the other in the release function, so I speculate that the MCS and TATAS locks should have the same performance under low lock contention.
We saw this graph a few slides ago; I brought it back to compare the performance of the MCS and TATAS locks under low lock contention. The new addition here is the green circle and the green arrow, which points to the low lock contention area. This graph reiterates that the MCS lock has performance similar to TATAS under low lock contention.
This is where reactive locking algorithms come into play. These algorithms were introduced in an academic paper which is shown at the bottom of the slide. We will use reactive algorithms to address any poor performance of the MCS lock in low-lock-contention workloads. Reactive algorithms will split the system monitor code path into two: the simple, or default, code path would be to use the TATAS lock for handling the low-lock-contention workloads.
We will need instrumentation to measure the lock contention. As the lock contention becomes high enough for the MCS lock to perform better than the TATAS lock, we will transition from the TATAS lock to the MCS lock in the system monitor. The reactive approach is going to be the fallback solution in case my assumption about the MCS lock's performance in low-lock-contention workloads ends up being false.
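A minimal sketch of this reactive idea follows; the threshold, the counter, and the switching policy are all invented for illustration, and the queued path is only a stand-in where a real implementation would run the MCS acquire:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class ReactiveLockSketch {
    // Illustrative threshold; the policies in the reactive-algorithms
    // literature are considerably more sophisticated.
    private static final long CONTENTION_THRESHOLD = 16;

    private final AtomicBoolean locked = new AtomicBoolean(false);
    private final AtomicLong observedBusy = new AtomicLong();
    private volatile boolean useQueuedPath = false;

    public void lock() {
        if (!useQueuedPath) {
            // Default path: TATAS-style acquire, cheap when uncontended.
            while (true) {
                while (locked.get()) {
                    // Instrumentation: each observed-busy spin is
                    // evidence of contention.
                    if (observedBusy.incrementAndGet() > CONTENTION_THRESHOLD) {
                        useQueuedPath = true; // future acquires take the MCS path
                    }
                    Thread.onSpinWait();
                }
                if (locked.compareAndSet(false, true)) return;
            }
        }
        // Queued path: stand-in for the MCS acquire; a real
        // implementation would enqueue a per-thread node here.
        while (!locked.compareAndSet(false, true)) Thread.onSpinWait();
    }

    public void unlock() {
        locked.set(false);
    }
}
```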
To understand this issue, we need to look at the working of the MCS lock. In the MCS lock, only one thread, the next thread in the queue, is able to acquire the lock, which is in contrast to the TATAS lock, where all the threads can simultaneously compete for the lock. The one thread which is able to acquire the lock may be preempted, and resuming such a preempted thread requires an expensive context switch, which can negatively impact the lock's performance. This is a simple description of the lock waiter preemption issue.
You would need to know two terms before understanding what concurrency restriction is. The first term is the active thread set, which means the set of threads which are allowed to compete for the lock. The second term is the passive thread set, which comprises the set of threads that are not allowed to acquire the lock. Now, what is the objective of concurrency restriction? The objective is stated on this slide.
In order to achieve the objective of the concurrency restriction feature, which we saw in the previous slide, unfairness is used. It is achieved by occasionally scheduling the latest thread which wants to acquire the lock. The latest, or newest, thread will incur the least cost from a scheduling perspective: since it is already running on the processor, it won't have to go through an expensive context switch.
B
How
do
we
disrupt
order
in
the
MCS
lock?
This
is
accomplished
by
moving
an
element
from
the
tail
of
the
passive
queue
which
is
going
to
represent
newer
or
latest
thread,
and
then
we
will
move
this
element
to
the
head
of
the
active
queue
when
we
do
this
move,
the
element
or
the
thread
which
is
going
to
be
moved,
will
end
up
owning
the
look.
The way we make this decision is going to be random; we rely upon randomness to inject unfairness into the MCS lock. In this way, concurrency restriction aims to reduce the involuntary preemption rates by adding unfairness to the MCS lock's admission policy. In future, the basic MCS lock implementation will incorporate concurrency restriction in some form. This will allow us to handle the lock waiter preemption issue by inducing unfairness.
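As a toy illustration of that admission policy (the two sets, the promotion probability, and the use of strings to stand in for threads are all invented for this sketch):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

public class AdmissionPolicySketch {
    // Active set: threads allowed to compete for the lock.
    // Passive set: threads held back from competing.
    private final Deque<String> active = new ArrayDeque<>();
    private final Deque<String> passive = new ArrayDeque<>();
    private final Random random = new Random(42);

    void arrive(String thread) {
        passive.addLast(thread); // new arrivals wait in the passive set
    }

    // Occasionally (randomly) inject unfairness: promote the NEWEST
    // passive thread, since it is likely still running on a processor
    // and avoids an expensive context switch.
    void admitNext() {
        if (passive.isEmpty()) return;
        if (random.nextInt(4) == 0) {
            active.addFirst(passive.removeLast()); // tail of passive to head of active
        } else {
            active.addLast(passive.removeFirst()); // fair FIFO admission
        }
    }

    int activeSize()  { return active.size(); }
    int passiveSize() { return passive.size(); }
}
```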
Here we can see the performance of the MCS lock with and without concurrency restriction. Concurrency restriction and the performance data are both taken from an academic paper, which is included at the bottom of the slide. The benchmark used is a stress latency benchmark; it measures lock performance while varying the lock contention. Similar to the previous graphs, the lock performance is on the y axis and the lock contention is on the x axis.
B
It
is
very
clear
that
the
MCS
lock
with
concurrency
restriction,
achieves
and
maintains
a
steady
state
trooper
at
high
look
low
contention
in
one
of
the
previous
graphs.
We
saw
the
same
scalability
collapse
that
we
noticed
with
the
talus
lock,
but
with
concurrency
restriction.
The
scalability
collapse
no
longer
exists.
Another impact of concurrency restriction is that the number of active threads reduces from 32 to 5, which means only 5 threads are needed to maintain maximum occupancy of the critical section. The impact on the hardware is that CPU utilization reduces by a factor of three and cache usage drops by 98 percent, which is reflected by the fewer L3 misses. Overall, the lock performance with concurrency restriction increases by a factor of 16. I am impressed by these numbers, and it would be amazing to have such a feature in OpenJ9.
Hardware transactional memory instructions only take a few cycles, whereas using a software lock may take hundreds of cycles on a CPU, so transactions can yield better performance than a software lock. TLE aims to maximize the usage of hardware transactional memory: it leads to elision of the software lock, relying upon the hardware more and more and avoiding the software lock as much as possible.
We know that hardware transactions won't always succeed: in the presence of memory conflicts, transactions will fail, and in case of a failure a hardware transaction would need to be repeated. Hardware transactional memory instructions are cheap, so we can run them multiple times to evaluate the affinity of a critical section with hardware transactions.
If good performance cannot be achieved with hardware transactions, then we can fall back to using the software lock. This is the simple premise behind TLE. I would like to show you an abstract TLE design; it is taken from an academic paper which is referred to at the bottom of the slide. There are two primary code paths: one code path represents the software lock and the red-colored path represents the hardware transaction. Each path has provisions for instrumentation.
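Java exposes no portable HTM API, so the sketch below simulates only the retry-then-fallback control flow with a stub transaction that randomly aborts; tryHardwareTransaction, the retry count, and the abort probability are all invented for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReentrantLock;

public class TLESketch {
    private static final int MAX_TX_RETRIES = 3;
    private final ReentrantLock softwareLock = new ReentrantLock();

    // Stand-in for a hardware transaction: randomly "aborts" to model
    // memory conflicts. Real TLE would use the CPU's HTM instructions.
    private boolean tryHardwareTransaction(Runnable criticalSection) {
        if (ThreadLocalRandom.current().nextInt(4) == 0) {
            return false; // simulated transactional abort
        }
        criticalSection.run();
        return true;
    }

    public void execute(Runnable criticalSection) {
        // Fast path: retry the cheap hardware transaction a few times.
        for (int i = 0; i < MAX_TX_RETRIES; i++) {
            if (tryHardwareTransaction(criticalSection)) return;
        }
        // Fallback path: acquire the software lock.
        softwareLock.lock();
        try {
            criticalSection.run();
        } finally {
            softwareLock.unlock();
        }
    }
}
```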
B
B
B
The
tle
design
here
we
can
see
a
specific
use
case.
Their
transactional
log
elision
yields
better
performance.
The
performance
data
from
the
same
academic
paper
attract
ele
design,
is
taken.
The
micro
benchmark
uses
a
skip
list
base
set,
which
is
a
data
structure
implemented
using
linked
lists.
It
is
designed
for
file
searches
mostly
performs,
insert
and
remove
operations
on
this
data
structure.
The
performance
graph
shows
the
logs
performance
on
the
y-axis
and
the
contention
on
the
x-axis.
What can be deduced from this graph? The graph shows that TLE performs up to four times faster than just using a software lock; clearly the TLE improvements are coming from the hardware transactions. TLE has the potential to further improve OpenJ9's locking if it is implemented effectively. Within OpenJ9, we have begun the work on incorporating hardware transactional memory into OpenJ9's locking strategy.
At this point, we are very close to the end of the presentation. We covered a lot of topics today. We saw a bottleneck in OpenJ9 locking: the system monitor's usage of TATAS, which utilizes a global lock state, leads to collapse. Then we covered MCS, a queue-based lock, and how it can help with the throughput collapse which is seen with the TATAS lock.
B
We
will
rely
upon
reactive
algorithms,
then
we
come
in
currency
restriction
and
how
it
solves
the
lot
greater
preemption
issue
by
inducing
unfairness
in
the
admission
policy,
and
we
talked
about
transactional,
look
lesion,
which
combines
hybrid
transactional
memory
and
software
lock
achieving
better
lock
of
performance.
This
dog
heavy
focus
on
the
basic
principle
of
mechanical
sympathy.
You
can
deserve
better
software.
If
you
are
aware,
the
line
Hardware
behaves
before
concluding
I
would
like
to
encourage
everyone
to
employ
the
concept
mechanicals
beti.
Whenever
you
write
or
design.
That is the end of my presentation, but before closing I would like to acknowledge the help that I received from Vijay and Shelley in preparing this presentation. They made a sincere effort in improving the flow and organization of this presentation and also provided a lot of constructive feedback during the dry runs. I'm sincerely grateful for their help. I would be happy to address any questions.
Before other questions, I should say: I guess you will be very busy for the next five years with all this special work. But anyway, let's go to questions. Anyone has questions?
You have a queue, and you have contention: you have a new thread entering the lock while the owning thread is going to release it, so you have contention there.
If there is contention, how are you going to avoid the out-of-order thing? Basically, the thread that is going to release the lock is checking its node, whether the node contains a successor now, while the thread that is going to enter is about to write to that field. So you have the entering thread's store versus the releasing thread's read. You have an ordering there; how do you synchronize it? You have a problem there, right there.
Exactly the point I am trying to raise: you have a shared memory location, and you have multiple threads going to either modify or check it. You are going to check it yourself, or another thread may store into it. Now you have an ordering question there: when am I going to see your store versus how am I going to wait? You have a problem to sort out there.
Okay, so that concludes the talks today. I wanted to thank Shelley Lambert for organizing the events in Ottawa and currently the Toronto event. In the future, I look forward to working with the Ottawa team more, in terms of bringing some of the technologies in the J9 VM and GC teams to the talk series, so that the team can also benefit from the shared knowledge over there. Okay, so thank you all.