From YouTube: JProfiling 20180927
Description
Profile-driven optimization is very useful in improving runtime performance, especially in just-in-time compilation, where applications are compiled while running. In this talk, Rahil will first give an overview of the different profiling mechanisms, including those used in OpenJ9, and discuss their advantages and disadvantages. He will then explain the low-overhead patchable runtime profiling mechanism called JProfiling, how it works, and its advantages over the other mechanisms.
This profiling infrastructure can forward some vital information. For example, when a compiler encounters a particular virtual call or interface call, it needs to resolve the class to call the correct implementation of a method. If it is known at compilation time that, a significant number of times, that call would resolve to a particular class, the optimizer can inline it. That makes the call faster, which matters significantly in a statically compiled language, where this information is gathered through a training run that might not reflect the true runtime behavior of the application.
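The payoff described above can be sketched in Python. This is only an illustration of the idea, not OpenJ9 code, and all the names here (`Shape`, `Circle`, `total_area_devirtualized`) are invented: a cheap guard on the profiled receiver class selects an inlined fast path, with ordinary virtual dispatch kept as a fallback for the rare other types.

```python
# Hypothetical sketch of profile-guided devirtualization; all names invented.

class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):
        return 3.14159 * self.r * self.r

def total_area_devirtualized(shapes):
    """The profile said almost every receiver is a Circle, so the
    "compiler" emits a cheap type guard plus the inlined Circle.area
    body, falling back to virtual dispatch for everything else."""
    total = 0.0
    for s in shapes:
        if type(s) is Circle:                 # guard on the profiled class
            total += 3.14159 * s.r * s.r      # inlined Circle.area body
        else:
            total += s.area()                 # slow path: virtual dispatch
    return total
```

The guard is what makes the optimization safe: if the profile turns out to be wrong at runtime, the slow path still produces the correct answer.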
On the other hand, a dynamically compiled language can benefit the most from this infrastructure: because it compiles methods while the program is running, it can have the most accurate profiling information, using which the optimizer can generate code of better quality, provided that the past behavior reliably predicts the future behavior.
Before I jump into the OpenJ9 profiling story, let's take a look at the classes of information that any runtime profiling infrastructure tends to collect. Code frequency, or code execution frequency, aims to describe how frequently different paths of an application execute compared to one another. This helps us in block ordering and also provides us information about hotspots in an application where we can put more optimization effort. Value profiling information, on the other hand, aims to describe the different value sets at a specific program point, which lets us generate code tailored for the most frequently executed cases.
So how does OpenJ9 collect and utilize the profiling data? Well, OpenJ9's execution works in a mixed mode, where an application starts with interpreted methods and frequently executed methods are compiled to native code. So initial profiling information is gathered from the bytecode interpreter loop, where the interpreter is provided with a per-application-thread profiling data buffer, which it uses to collect raw data from the application. Once this data buffer gets full, the iProfiler, or interpreter profiler, consumes that raw data and parses it into a global data structure that can be consumed by the JIT compiler. Data collection in an interpreter is a very costly operation, so the iProfiler also controls data collection by turning it on or off as per requirements. Once the iProfiler parses the data into the global data structure, the JIT compiler uses that data in its compilations and also sends out queries to the iProfiler to request specific data.
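The buffer flow just described can be sketched in a few lines of Python. The names (`record`, `flush`, `GLOBAL_PROFILE`, `BUFFER_SIZE`) and the record format are invented for illustration; the real iProfiler works on raw per-thread byte buffers:

```python
# Minimal sketch of the iProfiler buffer flow; names and format invented.
from collections import Counter

GLOBAL_PROFILE = Counter()   # stands in for OpenJ9's global data structure
BUFFER_SIZE = 4              # real buffers are of course much larger
collection_enabled = True    # the iProfiler can toggle collection on/off

def record(buffer, event):
    """Called from the interpreter loop: a cheap append, flushing when full."""
    if not collection_enabled:
        return
    buffer.append(event)
    if len(buffer) >= BUFFER_SIZE:
        flush(buffer)

def flush(buffer):
    """Parse raw records into the global structure (the costly part)."""
    GLOBAL_PROFILE.update(buffer)
    buffer.clear()
```

Keeping the interpreter-side path to a bare append is the point of the design: the expensive parsing happens only when a buffer fills.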
So using this infrastructure we collect targets of virtual calls, which helps in deciding which method implementation to inline; branch directions and switch statements, which assist block ordering; and instanceof and checkcast runtime types, which help us generate better typecasts when we are compiling a method. Although most JIT compilations are driven by data collected from the iProfiler, there are some challenges associated with the iProfiler. First of all, the data collection is really costly.
So we only profile one in every N method invocations, where N is tuned as per the application's characteristics. Most entries in the data buffer are consumed by branch entries, so to reduce the time spent processing those buffers, we skip those branch entries in a sampling fashion and collect the branch bias information just before we are about to compile a method. The iProfiler also requires a separate thread pool to process the data, so that we can avoid introducing pauses in the application.
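The one-in-N invocation sampling mentioned above can be sketched as follows. The value of N and the helper names are invented for illustration:

```python
# Rough sketch of 1-in-N invocation sampling; N and names are invented.
N = 8                      # tuned per application characteristics
_invocations = {}

def should_profile(method_name):
    """Profile only every Nth invocation of a method to bound overhead."""
    count = _invocations.get(method_name, 0) + 1
    _invocations[method_name] = count
    return count % N == 0
```

A larger N lowers the profiling overhead at the cost of needing more invocations before the profile becomes representative, which is why it is tuned per workload.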
Third, and one of the main challenges with the iProfiler, is that the collection of records is shut off once a method is compiled to native code. But OpenJ9 can also add instrumentation in compiled code; that is called the JIT profiler. It makes an exception for very frequently executed methods: usually a method that consumes a significant amount of the application's running time can be chosen by the JIT compiler to undergo a special compilation phase that is called a profiling compilation.
In this profiling compilation, for a given method, we generate two method bodies: one without instrumentation code and one with instrumentation code. Code execution is interleaved between these two method bodies based on a sampling frequency. OpenJ9 uses compiler-controlled counter-based sampling, in which, at various points in the method body without instrumentation code, we have code that decrements a counter and, upon the counter reaching zero, transitions to the instrumentation code.
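The counter-based transition between the two bodies can be sketched like this. The interval, the buffer, and the function names are invented; in the real implementation the decrement-and-test is a couple of machine instructions inserted at sampling points:

```python
# Sketch of compiler-controlled counter-based sampling; names invented.
SAMPLE_INTERVAL = 100        # how many sampling points between transitions
_counter = SAMPLE_INTERVAL

def sampling_point(profile_buffer, value):
    """Inlined at various points of the cheap (uninstrumented) body."""
    global _counter
    _counter -= 1
    if _counter <= 0:
        # transition to the method body *with* instrumentation code
        profile_buffer.append(value)
        # reset on return, based in reality on the instrumentation's cost
        _counter = SAMPLE_INTERVAL
```

Almost all executions take only the decrement-and-compare, so the expensive instrumented body runs at a bounded, compiler-chosen rate.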
After returning from the instrumentation code, it resets the counter based on the size of the instrumentation code that it is running. After a certain number of transitions between these two method bodies, a recompilation is requested, so that the compiler can use the data generated by the instrumentation code to optimize the method body. Detailed information about this implementation can be found in Arnold and Ryder's paper, "A Framework for Reducing the Cost of Instrumented Code". Although it is really useful to have instrumentation code in a compiled body, there are some challenges associated with the JIT profiler which make it a viable option for only a handful of methods. First of all, in a method body with instrumentation code, to get the code frequency we put counters on all the blocks rather than placing them optimally, so as to reduce complexity during optimizations.
So far we have seen software-based profiling techniques. There are also some profiling mechanisms which use hardware-provided buffers for the profiling data. In OpenJ9 we use those hardware-provided buffers in a sampling mechanism, where we compile everything at a low optimization level, with, of course, a low invocation count, to improve the startup of a workload, and based on processing those hardware buffers we try to identify hotspots and compile those methods at a higher optimization level.
In general, there are many challenges associated with the use of hardware buffers in profiling, but the main challenge is the limited information in a virtualization domain, and that makes them unusable in cloud environments. The implementation is also hardware dependent: the hardware sampling we have in OpenJ9 has two different implementations, one for the Power platform and one for the Z platform. Collecting value profiling information and low-frequency information becomes very costly and complicated with this kind of infrastructure. So that is why we kept its use limited to the sampler only in OpenJ9.
So we have seen the software-based profiling techniques used in OpenJ9, and we have seen the hardware-based ones. Why do we need multiple profilers in the first place? To answer that, we need to look at the two main categories of workloads under which a JVM workload falls. In workloads with hotspots, a significant amount of the application's running time is spent in a handful of hot methods.
This handful of hot methods will be compiled to native code and might inline some other methods as well. So to get the real current context of the application, data collected from the iProfiler alone might lack the information about the inlined methods. The JIT profiler comes in here, because it can add instrumentation in compiled code, and it provides us the current information that we need to recompile these methods and help improve throughput of the workload.
In flat workloads, the running time is distributed pretty flatly across many methods. To improve performance in this type of workload we need to compile as many methods as quickly as possible, and startup time is really important here, because of the higher code cache consumption and the dependency on recompilation.
So we have seen the challenges that we faced with the different profilers. What do we expect from an ideal profiling infrastructure? Well, we want to have a single framework for both startup and throughput improvements. We want to have profiling code in a large number of compiled methods with minimal impact on compile time, so the cost of adding instrumentation code in the compiled body should be minimal.
Well, one of the main problems the JIT profiler had was that, in the method body with instrumentation code, it has counters on all the blocks to get the code frequency. JProfiling aims to minimize these counters. How does it do that? Let's take a look at an intuitive example. In this example, we could naively put counters on all the edges, but is there a better way we can optimize this?
Well, we know that if we put counters on blocks with a single successor or a single predecessor, we can effectively compute the frequency of the other edges and blocks, so that reduces the counters to half. Can we still do something better? Well, if somehow we know the method entry frequency or the exit frequency, we can leave out the counter from the frequently executed path and have a minimal impact of adding counters on the running time. So, with this intuition in mind, JProfiling applies a counter placement algorithm.
The algorithm is based on two papers: one written by Knuth and Stevenson, which describes a CFG reduction procedure for optimal profiling, and another written by Ball and Larus, which provides a comparison of different algorithms for profiling and tracing programs. Based on these two papers, we prepared a formulation based on CFG edge execution frequency profiling. Let's take a look at an example of how JProfiling minimizes the number of counters we need. This is a simple control flow graph with one back edge; it can represent a for loop or a while loop in any method.
Let's see how we can minimize the number of counters here. First of all, what JProfiling does is build a maximum spanning tree out of that control flow graph. To get the maximum spanning tree, it adds an artificial edge between the start block and the exit block, to express that the number of times you enter a method is equal to the number of times you exit the method. After that, it uses Prim's algorithm to obtain the spanning tree.
Once it obtains the spanning tree (here I have highlighted the edges which are not present in the spanning tree), it traverses all the edges, finds the ones that are not present in the maximal spanning tree, and puts counters to measure those edges. So in this case we just need to count the edges 4→5 and 5→1, so we put a counter to measure edge 4→5 and edge 5→1. Once it has put the required counters on all the paths,
it prepares a formulation, in terms of additions and subtractions of those counted blocks, to get the frequency of the uncounted edges.
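The placement scheme just described can be sketched in Python on a small loop CFG. The block numbering and edge weights below are invented, so they differ from the slide, and a Kruskal-style greedy is used in place of Prim's (either yields a maximum spanning tree): counters end up only on the non-tree edges, and every other edge frequency is then recoverable by flow conservation (in-flow equals out-flow at each block).

```python
# Sketch of maximum-spanning-tree counter placement; numbering invented.
# Edges are (u, v, weight), weight modeling expected execution frequency.
edges = [(0, 1, 10), (1, 2, 10), (2, 3, 9), (3, 1, 9), (2, 4, 1),
         (4, 0, 10)]   # (4, 0) is the artificial exit->entry edge

def max_spanning_tree(edges, nodes):
    """Kruskal-style greedy: keep the heaviest edges that join two
    components; any edge left out must carry a real counter."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    tree = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

nodes = {0, 1, 2, 3, 4}
tree = max_spanning_tree(edges, nodes)
counted = [e for e in edges if e not in tree]   # edges needing counters
```

For a CFG with V blocks and E edges, the spanning tree always covers V-1 edges, so only E-V+1 counters are needed, instead of one per block or per edge.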
A detailed implementation of these algorithms can be found in JProfilingBlock.cpp in OpenJ9. Now let's see how we can obtain the frequency of uncounted blocks. Let's take an example: we want to get the frequency of block 2, which is not counted. Well, we know that block 2 has a single predecessor.
So
if
we
get
the
frequency
of
edge
0
to
2,
we
can
get
a
frequency
of
block
2.
We
can
see
that
the
edge
we
got
to
2
is
a
single
successor
of
block
0,
so
to
get
a
frequency
of
edge
0
to
2.
We
just
need
to
get
the
frequency
of
block
0
log
0,
which
is
a
start
block.
That
means
the
frequency
of
log
0
is
equal
to
frequency
of
block
1
because
number
of
times
you
enter
the
method
is
equal
to
number
NGH
it
matter.
So it becomes intuitive that to get the frequency of block 2, we can just load the counter from block 7 and obtain the frequency for block 2. Following these steps, it can prepare a table to find the frequency of the uncounted blocks or edges, in terms of the block numbers containing the counters that we need to load and add or subtract together.
In this example there is no requirement to add or subtract any counters together, but we can assume that in a complicated control flow graph we would have some blocks whose counters we need to add or subtract together. So we have minimized the number of counters that we want to put in a control flow graph, and we have prepared the formulas required to obtain the frequency of the uncounted paths. How do we store this information? Well, JProfiling uses a persistent block frequency information data structure to store it.
This data structure contains an array of pointers to bit vectors that hold the information about which counted blocks need to be added or subtracted together to obtain the frequency of the uncounted blocks. In this array, if we have a null entry, that means there is no counter information available. If the entry is low-tagged, that means we can efficiently obtain the frequency from a single counter with a minimal number of operations.
From that entry we can find from which block number we need to load the counter to get the frequency. If the entry is neither null nor low-tagged, that means it contains a pointer to the bit vector, which holds the information about which counters need to be added or subtracted together to get the frequency. Detailed information about this data structure can be found in the J9Profiler .hpp and .cpp files.
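The three entry kinds just described can be sketched as follows. The encoding (a low-tag bit on an integer, a pair of lists standing in for the bit vectors) is invented for illustration, not the exact OpenJ9 layout:

```python
# Sketch of the tagged-entry lookup; encoding invented for illustration.

def make_single(block_num):
    """Low-tag bit marks 'frequency comes from this one counter block'."""
    return (block_num << 1) | 1

def lookup_frequency(entry, counters):
    """Recover a block's frequency from its table entry."""
    if entry is None:
        return None                       # no counter information available
    if isinstance(entry, int) and entry & 1:
        return counters[entry >> 1]       # cheap case: a single counter load
    add, sub = entry                      # general case: add/subtract sets
    return sum(counters[b] for b in add) - sum(counters[b] for b in sub)
```

The low-tag trick lets the common single-counter case avoid allocating a bit vector at all, which is where the high information density comes from.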
So the benefits of this algorithm: in a cyclic region it has at most one counter, and the implementation favours placing counters at lower-frequency edges and blocks, so it reduces contention while updating the counters. It can compute the frequency of any single block as a sequence of addition and subtraction operations, and the block frequencies are stored using a pair of bit vectors, which provides a high information density.
Now, one of the main parts of OpenJ9 that I have not covered in my example is exceptions. So how does JProfiling handle exceptions with this algorithm? Well, exceptions are special in OpenJ9: they represent non-local control flow changes, and if we decided to put a counter on all the exception edges, we would need that many extra counters to get the frequency of all the blocks in the program.
But there is a simple observation that we made about exception edges: they are there, but they usually do not occur. Using that observation, we can make a simplifying assumption and combine all the counters from the exception edges, putting them onto the catch block. Currently in OpenJ9 we know how to exploit the information on a catch block, and we do so, but we don't have any mechanism that exploits information about counters on exception edges. So we can take them as they are and put the counter on the catch block.
So how does JProfiling handle inlining? Well, inlining is special because it can change from compilation to compilation. So if we are compiling a method and find that a method was not inlined in the previously compiled profiled body, then JProfiling walks the call chain and tries to find information about that method.
This way, we are not limited when we want to apply certain optimizations, as we were when the method was compiled with the JIT profiler. So we have seen how efficiently JProfiling minimizes the number of counters that it requires. Let's take a look at how it collects value profiling information. Well, the main aim behind value profiling information is to identify common values and program properties at specific points during program execution.
This information can be the type of an object at a checkcast, instanceof, or virtual method dispatch; it can be a loop induction variable or the length of an array copy. Using this information, an optimizer can generate code tailored to the frequently executed cases. The main lookout in any value profiling is to detect consistency, because from the given value profiling information only a small set of values is of interest, so that the compiler can use that information to generate better code.
A lack of this consistency suggests that there are no commonly executed cases present in the method body that the optimizer can exploit. So, using this idea, JProfiling only profiles the most frequent N values, where N is a small integer, while keeping track of the uncounted cases, which provides information about the inconsistency. So how does JProfiling store the values?
Well, it uses a hash table to store the values. That hash table contains two arrays: one with the value information, which tells us which values we are profiling, and the other containing the frequencies of the corresponding values. It uses a bit extract as the hash function, which extracts bits out of a value into an integer, to efficiently generate an index into the array.
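A bit-extract hash can be sketched in one line of Python. The table size, the shift, and the function name are invented; the point is that the whole hash is a shift and a mask, which maps to very few machine instructions:

```python
# Sketch of a bit-extract hash; table size and shift are invented.
TABLE_SIZE = 8                        # power of two, so the mask is cheap

def bit_extract_index(value, shift=3):
    """Drop low alignment bits, then mask down to the table size."""
    return (value >> shift) & (TABLE_SIZE - 1)
```

Profiled values such as class pointers tend to be aligned, so shifting away the always-zero low bits before masking spreads them across the table.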
To update or add a new value to the table, it first generates an index into the array using the hash function and checks in the array whether that value exists or not. If the value exists, then it just increments the corresponding counter and continues running the main code. If the value does not exist, then it checks whether the table is full or not; that is, whether it can add new values to the table. If it cannot add new values to the table, then it attributes that value to the uncounted cases and continues running the main program.
If it can add new values to the table, it attempts to lock the table and checks whether the lock was acquired. If the table cannot be locked, then it again attributes that value to the uncounted cases and continues the execution of the main program. If it can lock the table, then it rehashes the table, to make sure that the most frequently profiled values are not kicked out in case of a collision, adds the new value to the table with an initial count, unlocks the table, and comes back to running the main program.
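The update path and the consistency check just described can be sketched as follows. The class and method names are invented, and a Python dict with a boolean flag stands in for the real raw arrays and try-lock:

```python
# Sketch of the value-profiling table update; names and layout invented.
class ValueProfileTable:
    def __init__(self, capacity=4):
        self.values = {}          # value -> frequency (stands in for the arrays)
        self.capacity = capacity
        self.uncounted = 0        # cases we chose not to (or could not) count
        self.locked = False       # models the try-lock on the slow path

    def record(self, value):
        if value in self.values:                  # fast path: bump the counter
            self.values[value] += 1
            return
        if len(self.values) >= self.capacity or self.locked:
            self.uncounted += 1                   # table full or lock busy
            return
        self.locked = True                        # "take the lock"
        self.values[value] = 1                    # insert with an initial count
        self.locked = False                       # release, resume the program

    def dominant(self):
        """Report the hottest value only when the profile is consistent:
        a large uncounted share means nothing worth specializing on."""
        if not self.values or self.uncounted > sum(self.values.values()):
            return None
        return max(self.values, key=self.values.get)
```

Dropping rare values into `uncounted` instead of blocking keeps the fast path cheap while still recording exactly the inconsistency signal the compiler needs.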
If the uncounted cases become significant, then it is really easy to clear out the whole table and start populating it from scratch, and the efficient hashing mechanism, the bit extract, can be done on most platforms with a single instruction. So this is a simplified version of how the trees look in our IL; I have just simplified the trees here.
If you have a method that spends much of its time in a loop, it can keep executing the profiled body without triggering recompilation, so we added a recompilation test in the loop to identify these cases. Because of a block splitter, the compilation time was increased in the global register allocator, so we reduced the number of blocks that we add for profiling a value.
It was also profiling the same value multiple times in an extended basic block, so we removed those duplicates as well. And it requires tuning, because we wanted to decide how much profiling is enough for us to get a properly optimized body. For performance results, I have put up the numbers from our benchmark runs on x86 and on Power.
Throughput improved by 5% when we switched from the JIT profiler to JProfiling, while on Z it improved by 2.3 to 2.5 percent. For future work, we want to enable this infrastructure for warm bodies; we basically want to have this profiling instrumentation code in a large number of compiled bodies.
There are some challenges that we have faced while enabling this: the profiling compilation has increased compilation time in the global register allocator, which we are trying to address right now, and we need a mechanism that allows us to patch in both directions, because currently in OpenJ9 we can only patch a branch to a nop; it's not possible to do the reverse.