From YouTube: Code Generators and Much More (Part I)
Description
In this talk, Julian will give an overview of the JIT code generator framework for the platforms we support, namely X, P, and Z. In particular, he will discuss which concepts and code are currently shared among the code generators and which are not. He will compare the architectures, reason about the philosophy of their respective designs, and explain how modern microprocessor differences result in differences in our JIT code generators.
So basically the whole talk is an introduction to the modern architectures of the three platforms we support, a high-level comparison of them and, following from that, the commonality and the differences among our three code generators, and then a lot of the code-generation details I'm going to talk about.
So the agenda is: the chip comparison and core comparison; the overall flow of code generation; the instruction-set implications on binary size, performance, execution time, and path length; addressing modes for code generation, mainly for the load/store instructions and memory operands; addressing modes for branches and calls; condition codes for loops, ifs, and whatever else; and registers and the register allocator.
Nowadays, when you talk about a chip, chip equals socket equals processor. In the past that was not the same, but nowadays it is basically the same. All three of these processors are fabricated on a 14-nanometer process: Skylake on Intel's 14-nanometer process, and POWER9 and z14 are fabricated on GlobalFoundries' 14-nanometer process. Actually, IBM's previous facility in Fishkill was sold to GlobalFoundries.
Comparing Skylake, POWER9, and z14 at the chip level: the core count on Skylake is up to 28 per chip, POWER9 is up to 12, and z14 up to 10. And look at the L3 cache — this slide contains all chip-level resources, not core-level private resources, and L3 is typically a chip-wide shared resource. Skylake contains 38.5 megabytes of L3 cache shared by all 28 cores. POWER9's is private: 10 megabytes per core, and although the L3 cache can be shared among the cores, its latency is higher — because it is private, when you need a cache line currently held in a neighboring core's L3, it takes on the order of 120 cycles; it's not that fast. z14 has 128 megabytes of L3 cache shared by the whole set of cores. Then the memory channels, the memory connection: Skylake has 6 channels, POWER9 has 8 channels, and z14 has 5 channels.
The 8 channels of memory connection on POWER9 can be connected in two ways. For the low-cost machine, memory is connected through DDR — double data rate — the industry-standard memory connection. For the enterprise machine, on the same chip, it can be connected through the DMI channel, the direct memory interface channel. The DMIs, the buffered memory connection, provide close to 2x the memory bandwidth, but they slow memory down: the memory latency is around 15 to 20 nanoseconds longer. You have the bandwidth, but the latency suffers.
For z14 the number is secret — maybe Jor-El knows — but you can probably deduce it from POWER9's buffered-memory numbers. Z is always buffered and X is always DDR, that's my understanding. To give you some perspective on how much this memory bandwidth is: it's okay, it's good for ordinary running code. But if you look at HPC, high-performance computing, that code demands
a lot of memory bandwidth. A Skylake core can actually pull around 40 gigabytes per second, so the roughly 90 gigabytes per second of the whole chip is probably only good for two cores. So you can imagine, running that kind of code, your other 26 cores sit idle — you cannot do anything; only the two cores can fill the bandwidth. In this perspective you can see that POWER9 and z14 are more scalable for that kind of workload. And the shared-versus-private L3 cache connection has performance implications here as well.
I have been going through these performance issues with many customers between X and P. If you're running single-threaded on the chip, then that single thread, in the X case, can use close to 40 megabytes of L3 cache; on Power you can use 10 megabytes. So sometimes the customer is saying Power is running slow, but my answer to them: you are not buying the machine to run a single thread, right?
The element size, plus a displacement immediate. Plus there is another addressing mode: IP-relative — instruction pointer relative — basically relative to the current instruction, with the offset being a certain amount. So X has IP-relative addressing. Those are the two modes of addressing, and there are a lot of sub-modes — really base plus index multiplied by a scale s — and you can combine the possibilities to get a lot of modes there. But if you look at Power, the instruction set, there are only two modes for the load and store instructions.
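To make the contrast concrete, here is a sketch in C of the effective-address arithmetic each family can encode in a single load or store. The function names and the D-form/X-form labels for the two Power modes are my annotations, not from the slide.

```c
#include <stdint.h>

/* x86: one load can encode base + index*scale + 32-bit displacement
   (scale is 1, 2, 4, or 8). */
uintptr_t x86_ea(uintptr_t base, uintptr_t index, int scale, int32_t disp) {
    return base + index * (uintptr_t)scale + (intptr_t)disp;
}

/* Power load/store mode 1: base + 16-bit signed displacement (D-form). */
uintptr_t ppc_disp_ea(uintptr_t base, int16_t disp) {
    return base + (intptr_t)disp;
}

/* Power load/store mode 2: base + index, no scaling, no displacement (X-form). */
uintptr_t ppc_index_ea(uintptr_t base, uintptr_t index) {
    return base + index;
}
```

Any scaled-and-displaced access on Power therefore needs extra arithmetic instructions before the load itself.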
The "c" there means conditional, for conditional branching. Power only provides 16 bits' worth of branching distance for conditional branches; for an unconditional branch it is 26 bits. For an indirect call you need to use special registers, not GPRs: on Power they are called LR and CTR, the link and the count register. On Z, the relative branch range is 32 bits, multiplied by 2 for the distance, because their instructions are always 2, 4, or 6 bytes — always a multiple of 2 — and the indirect branch is GPR-based, off a register.
So, the flow of code generation in our JIT compiler: code generation basically goes through instruction selection, register assignment, peephole optimization, and, finally, binary encoding. There used to be two more stages around register assignment: before it, a pre-RA instruction scheduler, and after register assignment, a post-RA instruction scheduler. Right now they are disabled, and disabled for a good reason: basically because all the cores nowadays are deep out-of-order machines, so the hardware itself provides the scheduling capability already.
A
Okay,
so
for
the
instruction
length
implications
of
the
X
and
Z
the
typical.
Finally,
operations
are
automatic
operation
instructions
destructive
attractive,
in
a
sense
that
the
example
there
is
a
eco,
a
plus
B.
Then
you,
the
AE
is
destructed
you
if
you,
if
we
want
to
keep
the
a
value
for
a
future
reference,
you
need
to
copy,
you
need
to
copy
the
register
or
you
need
to
later.
We
motorized
it,
and
so
you,
you,
probably
see
more
register
copy
instructions
in
our
X
and
Z,
but
I
know
skylake
microarchitecture,
actually
optimizing
this
copy
instruction.
This has a binary-size benefit, because if you look at the instruction to increment a memory operand by one: on Power you typically need to load it, add one to the loaded value, and store it — you need to use three instructions, and the encoding is twelve bytes. But on X you probably only need a two-byte instruction to do the inc. So the binary-size benefit is obviously there, and there is also the path-length question: basically how many instructions you need, in a dynamic sense.
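As a sketch, the same memory increment in C, with comments indicating the instruction shape each platform would typically use (the opcode spellings are illustrative):

```c
/* On x86/Z the whole read-modify-write can be a single memory-operand
   instruction; on Power it must be a load / add / store sequence. */
void inc_memory_operand(int *p) {
    *p += 1;            /* x86: add dword ptr [p], 1  (one short instruction) */
}

void inc_load_add_store(int *p) {
    int t = *p;         /* Power: lwz  t, 0(p)  */
    t = t + 1;          /* Power: addi t, t, 1  */
    *p = t;             /* Power: stw  t, 0(p)  */
}

/* Small driver: both versions compute the same result. */
int demo_inc(void) {
    int a = 40;
    inc_memory_operand(&a);
    inc_load_add_store(&a);
    return a;
}
```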
However, that doesn't mean Power takes more time to execute the increment, because the memory-operand instruction is internally cracked into 3 micro-ops anyway. So from a pipeline point of view, the 3 micro-ops ultimately issued versus Power's 3 instructions issued come out about the same. And x86 instructions are mostly of two-operand format, so they are mostly destructive.
We have a three-operand format for newer instructions like FMA — FMA means floating multiply-add, fused multiply-add. You need these resources anyway: register A multiplied by register B, plus register C, and the result needs to be put into a register. So even the three-operand format is still destructive there.
A
The
three
operator
probably
is
non-destructive,
but
I
didn't
go
through
the
percentage,
how
many
instructions
destructive?
How
many
are
not
and
the
power
is
mostly
of
the
operand,
so
it's
mostly
non
destructive,
but
for
the
VSS
we
have
said
the
vector.
Fma
is
three
operon,
so
its
destructive
any
questions.
Addressing-mode implications on instruction selection: here the limitation is mainly on Power, because the addressing mode is either base plus index or base plus displacement, and the displacement is limited to 16 bits — plus or minus 32K. And there is no architected IP register: you cannot refer to the IP register, so there is no IP-relative addressing. To give you an example:
A
Typically,
you
are
iterating
over
and
Java
array
on
power.
You
need
to
use
more
instructions
or
for
sure,
because
you
of
our
object
model,
you
have
a
object,
header,
each
byte
or
whatever,
and
then
you
have
a
index.
So
what
do
you
need
to
addressing
a
particular
element?
You
need
to
add
the
index
first,
then
you
need
to
multiply
something
shift
on
an
index
plus
the
base
and
a
part
of
the
header.
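The address arithmetic being described, sketched in C. The 8-byte header and 4-byte element size are illustrative numbers, not OpenJ9's actual layout:

```c
#include <stdint.h>

#define HEADER_SIZE 8u   /* illustrative object-header size        */
#define ELEM_SHIFT  2u   /* log2(element size), e.g. 4-byte ints   */

/* Element address when the object reference points at the header:
   base + header + (index << shift). On Power this costs several
   instructions, since no scaled-index-plus-displacement mode exists. */
uintptr_t array_elem_addr(uintptr_t obj_base, uintptr_t index) {
    return obj_base + HEADER_SIZE + (index << ELEM_SHIFT);
}

/* With an "internal pointer" from loop strength reduction, the base
   register already points into the data, so one base+index access does it. */
uintptr_t array_elem_addr_internal(uintptr_t data_ptr, uintptr_t scaled_index) {
    return data_ptr + scaled_index;
}
```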
So it's quite a few instructions to compute the address, unless loop strength reduction kicks in and you can have an internal pointer — basically, the base register can point anywhere into the object, into the middle, and then you don't need that long sequence to calculate the address. In OpenJ9 we certainly have a proposal for a new object model that will help Power for sure. What is that object model?
Basically, the Java object reference points at the start of the data, not at the start of the header, so the header is at a negative offset from your object reference and the data is at a positive offset from your object reference. But this is problematic for Z, because there the immediate offset is always positive — unsigned.
In order to address the negative offset you would need to use the index form: you need to move the negative value into a register and use the index form to address it. So that was the problem there. But nowadays the newer architecture allows both negative and positive offsets, so that problem with the new object model can be relieved.
There is probably some size implication here, though: I believe the negative-offset addressing is a six-byte instruction, versus four bytes right now, so it carries the implication of a bigger binary. Then there is accessing the stack frame: the stack frame can be more than 32 kilobytes — I have seen this, especially when escape analysis is allocating a lot of objects on the stack; then the stack frame becomes bigger than 32K.
There is a performance impact, because our stack-frame addressing binding is late — at the very last stage. In the instruction-selection phase you don't even know which accesses have offsets of more than 32 kilobytes, and only at binary-encoding time do you find that this thing has a more-than-32-kilobyte offset. Then what do you need to do? You already passed the register-assignment stage, so now you need to evacuate a register to calculate the offset, and that evacuation of a register certainly has performance implications.
You evacuate the register — typically to a negative offset of your stack — then you calculate the address and you address the stack, then you load that register back. There is a lot of performance penalty there. The same applies to accessing static variables and constant addresses — things like a J9Class and a J9Method.
For these, IP-relative addressing is really handy, but on Power there is no IP-relative addressing. We do have a lot of registers, though, so in the Power code generator, in 64-bit mode, we dedicate one register as the pseudo-TOC. So what is a TOC, and where is it? In the traditional ABIs — the ELF or XCOFF ABIs — they have something called the GOT or the TOC. GOT means global offset table, and TOC means table of contents. Either way, it is an area in memory.
This is mainly for position-independent code and shared libraries. The benefit for a shared library is that you don't want to modify the instruction sequence in a shared library to encode a particular global address — if you modify it, then your shared library cannot be shared by many processes.
So what they do is: the ABI has the GOT or TOC, you put the global address in the GOT or TOC, and you use IP-relative addressing to address the GOT or TOC to load the global address or constant address. But on Power
we don't have IP-relative addressing, so we dedicate a register — in C, or whatever environment, there is a dedicated register to point at the TOC — and then you can do the equivalent of IP-relative addressing. And indeed we have another one: because we have so many registers, we dedicate another register to point at the pseudo-TOC.
However, if you have the pseudo-TOC overflowing, then you have trouble: you need to revert back to a long sequence for materializing the address — in particular, in 64-bit mode you use five instructions to encode the long 64-bit address. And when you are using the pseudo-TOC, you also have slot invalidation: when class unloading is happening, you need to take different actions. So those are the peculiar things in the Power code generator — that's what's different here. Any questions on this? Yes?
Yes — the question is whether the pseudo-TOC is per method or global for the whole JVM. Yes, the pseudo-TOC is global for the whole JVM, and it is global in the sense that we have a hash table into the pseudo-TOC: we know which entry — for example, the java/lang/Object class pointer — is at, say, index 32. Then globally, whatever JIT code is running, if it needs the java/lang/Object class, it is just one access.
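The five-instruction 64-bit materialization mentioned above, modeled in C. Each step mirrors one Power instruction; the mnemonics in the comments are the usual lis/ori/rldicr/oris/ori pattern and are illustrative (lis actually sign-extends, which the real sequence accounts for):

```c
#include <stdint.h>

/* Compose a full 64-bit address from four 16-bit pieces, the way the
   Power code generator must when the pseudo-TOC cannot be used. */
uint64_t materialize64(uint16_t hh, uint16_t hl, uint16_t lh, uint16_t ll) {
    uint64_t r;
    r  = (uint64_t)hh << 16;   /* lis    r, hh        */
    r |= hl;                   /* ori    r, r, hl     */
    r <<= 32;                  /* rldicr r, r, 32, 31 */
    r |= (uint64_t)lh << 16;   /* oris   r, r, lh     */
    r |= ll;                   /* ori    r, r, ll     */
    return r;
}
```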
Your conditional branch is only sixteen bits, so that's up and down 32 kilobytes; the unconditional branch and call is 26 bits, which is up and down 32 megabytes; and Z's is even bigger — up and down 4 gigabytes. Now, the implication here: the limitation is mainly on Power, but it can happen on X and Z as well. Within a method you typically have conditional branches — so do you have enough branching distance for a conditional branch within a method? On Power it has happened:
32 kilobytes is not enough. So what happens here is: we have a pre-encoding phase to identify the potential long branches and then change them. For an identified candidate for a long branch, we reverse it: if you have a branch-if-greater, we turn it into a branch-if-less-than-or-equal — reverse the branch direction — and then the real branch uses the unconditional one, and the unconditional one can go up and down 32 megabytes. That's enough capacity, assuming there is no compiled method requiring more than 32 megabytes.
This could happen in OpenJ9 on all platforms, because in OpenJ9 the code cache used to be allocated separately, piece by piece, as you run. So you can imagine you have a piece of code cache that is more than 4 gigabytes away from your previous code cache; in that case, even though you can address up and down 4 gigabytes, it is still not enough.
Now the code cache is one big whole: the code cache area is reserved up front — 256 megabytes — and then every code cache is taken as one piece out of that reserved 256 megabytes, piece by piece. In that case, X and Z will not have the problem of not enough branching distance, but on P it can still happen, because your reserved area is 256 megabytes and P can only jump up and down 32 megabytes. So you still don't have enough. And on all platforms,
this can happen for helper calls, because the helpers are in a shared library, and how far away the shared library is from your code cache is platform-dependent — runtime-dependent, actually. So if the 2 gigabytes or whatever is not enough, then you still need bridging help from the trampoline. So what is a trampoline? Basically, you have a tiny piece of code to bridge the call when you don't have enough branching distance.
So, in your code cache, your branch jumps to the trampoline, and the trampoline is responsible for transferring your control to the faraway callee — but it creates the illusion that you are calling the callee directly. OK, on the next slide I can show you what the trampoline looks like. The trampoline actually saves encoding space: you can imagine that in your code cache you call the method foo a lot of times and you cannot reach foo.
A
If
you,
you
are
not
using
trembling,
you
need
to
encode
the
long
sequence
in
many
time
many
places,
but
you
have
a
trembling.
You
have
a
single
tramping
sitting
in
your
code,
cache
every
core
justic
job
to
the
trembling
be
trembling
is
the
thorough
gait
of
your
colleague
in
your
current
code.
Cache
also
is
required
trembling
for
atomic
touching
purpose.
If
you
have
method
recompilation
and
you
need
to
touch
your
core
to
the
new
target,
then
if
you
directly
encode
your
car
in
your
main
line
sequence,
then
you
cannot
patch
it
in
multiple
instruction.
Okay, the next slides are basically about binary encoding and the late creation of trampolines. In binary encoding you always need to keep in mind that dynamic instruction patching can happen — dynamic code patching can happen in a lot of places in a JIT environment. For the target method and class resolution: maybe the target method and class are unknown at the time of your compilation; you need to resolve them, and you resolve them by instruction patching.
So you need instruction patching for that. Recompilation requires patching as well; guard invalidation is patching; class loading and unloading and the pre-existence optimization can trigger patching; and compilation failure can also trigger patching. Instruction patching is a relatively complicated topic — it has i-cache coherency implications and the CMODX (concurrent modification of executing code) rules — and that is for part two; I will talk more about it there.
But for this talk: for instruction patching, the basic requirement is atomic behavior. For atomic behavior, you realize you need to align your instruction on the right boundary for the atomic patching to succeed. So this is basically a hardware-specific requirement on instruction alignment. Power is relatively simple, because every instruction is four bytes, so it is automatically aligned and you can patch it in a simple way.
However, because there are different instruction lengths on X and Z, there you probably need to take care of the alignment of what you are patching. Now for trampolines: our trampolines are created on demand, because most of the time, even on Power, not many applications generate more than 32 megabytes of binary.
So typically the 26 bits of branching distance is enough most of the time. For example, for a typical DayTrader run, the binary generated is probably around 14 to 15 megabytes; but for a big enterprise application it is typically 28 to 30-something megabytes, and in that case you can occasionally have a trampoline required. So the trampoline in that scenario is created on demand — only when, during binary encoding, you find:
"Oh, I cannot reach the target" — at that time you request the trampoline to be created. Now, for that to always succeed — because this is runtime behavior, you cannot allow the late creation of a trampoline to fail; if that fails, your whole runtime fails, right? It is not allowed to fail. So the space for each potential trampoline is reserved: you reserve the memory, and then later on you can always grab the space to create your trampoline.
Now, trampolines still need to be patched for recompilation, and the question of whether a trampoline can be atomically patched — modified in place — is platform-specific: on X and Z, yes, the trampoline can be modified in place; on P, no. So what does a trampoline look like? In the green portion, the trampoline basically composes the address and then jumps to the address. I need to emphasize: it jumps to the address, it does not call the address — because if you call the address, your return address is wrong:
the return address would end up after the jump, and that is not where you want to return. Your return address should be after the initial branch instruction that branched into the trampoline. So in this sequence you should not be trashing your return address — the return address was already established by the initial call into the trampoline — so the trampoline can only use a jump. It jumps to the target; it is not a call, so it does not create a return address again and thrash the real one.
On Power it takes several instructions to compose the address, and then you cannot patch them in one go. So what happens on Power is: we have a temporary trampoline. We create a temporary trampoline, redirect the initial call to the temporary trampoline, which jumps to the new target, and the temporary trampoline will be cleaned up on the next GC, when the world is stopped.
The question is where to put the temporary trampolines. In OpenJ9, right now, by default 5% of the code cache is carved out for the trampoline space, and within that 5 percent we have some number — I don't remember how many, 64 or something — reserved for temporary trampolines. In real life we have never seen more than 2 or 3 in use, so that is already enough. Any questions?
A Power trampoline is 24 bytes — I don't remember exactly, but it is up to 24 bytes — so that is sort of how the capacity of the trampoline area is calculated: from the size estimate of 24 bytes each. For a typical code cache, the 5% allows you more than enough — a typical method compilation on Power is two to three kilobytes or something, so you can calculate it; basically, you know.
Now, condition codes and condition registers. The difference between our X and Z versus P is that X and Z arithmetic instructions set the condition code by default, automatically, whether you want it or not. On P, arithmetic instructions don't set any condition register by default unless you ask for it: there is an instruction form called the record form — one bit in the instruction — with which you can ask for the condition status of a particular instruction. And on Power there are eight condition registers in total, architecturally.
For compare instructions and conditional branch instructions you can designate which CR to use, and for record-form instructions a specific CR is implied according to the instruction type: for example, for integer instructions CR0 is implied, and for floating point CR1 is implied.
Using multiple condition registers is actually a performance benefit, because if you don't have multiple condition registers, you need multiple conditional branches, and each one is potentially unpredictable — you have potential branch-predictor pollution and you can suffer multiple miss penalties. But on Power you can use the CR-logical instructions to AND, OR, XOR, or whatever, the condition bits; then you can land it all on a single conditional branch.
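A C-level view of the difference. With a single condition code you tend to branch per test; with multiple CR fields and CR-logical ops, the two compares can be merged and branched on once — roughly two compares, one crand, one branch:

```c
/* One branch per test: two conditional branches, each a potential
   mispredict. */
int in_range_branchy(int x, int lo, int hi) {
    if (x >= lo) {
        if (x <= hi)
            return 1;
    }
    return 0;
}

/* Both conditions evaluated into separate CR fields, combined with a
   CR-logical AND, then a single conditional branch on the result. */
int in_range_merged(int x, int lo, int hi) {
    return (x >= lo) & (x <= hi);
}
```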
So, registers, the global register allocator, and the local register allocator. In code generation we typically deal with virtual registers, and the real registers are called architected registers. The register assigner manages the lifecycle of the virtual registers and assigns real registers to the VRs. And then there is register renaming — rename buffers, the reorder buffer, or whatever — in the modern microarchitectures.
The GPR side has 180-something rename buffers, and the floating-point side more than 160, on POWER9. If you think about it, that is not surprising at all: you have 8-way SMT, and each thread's architected register count is 96 already — you have 32 GPRs plus 64 vector registers — and you multiply by 8; that is close to 800 architected registers, and maybe it is going to be more than 1000. And in the past, with transactional memory, you need to keep the old registers too, so it is even more. Now, about partial registers, starting on x86:
they already had partial registers — AH overlaps with EAX, and so on, right? That's the partial-register problem, and with the introduction of vector registers this happens even more frequently. x86 has the XMM registers and the YMM registers, and the registers are all overlapping; Power is similar: the floating-point
registers overlap with the vector registers — each is only a part of a vector register; the floating-point register partially overlaps the vector register. There are performance implications here, because of how the non-overlapping portion is defined: basically, undefined versus unchanged. There are performance — or pipeline-usage — implications in that choice. For example, on x86 you have the ZMM register, the 512-bit register, which partially overlaps with the 128-bit XMM register.
A
Now
you
need
to
be
abstraction.
Need
to
define
the
non
over
overlapping
part
is
undefined
or
unchanged
when
you
are
using
the
SMM
register,
the
lower
portion,
if
you,
if
you
have
the
inner
path
that
you
have
implementation
problem
here,
performance
problem
here,
you
if
you
didn't
carry
the
know
over
the
overlapping
portion
with
you
then
later
on.
You
need
to
combine
things
together,
but
for
the
for,
if
the
instruction
set
that
defined
it
as
undefined,
then
you
are
free.
A
You
are
free
to
go
because
it
is
undefined
anyway,
and
also
is
relevant
to
how
the
pipeline
is
used.
If
you,
if
he
is
unchanged,
you
need
to
carry
the
whole
512
bit
with
you
in
order
to
4/4
is
not
not
changing,
but
you
carry
that
over.
Basically
in
Prior,
you
need
to
use
a
wider
pipeline
to
push
the
instruction
through
and
I
can
tell
you.
Our
power
is
almost
always
undefined.
A
The
non-overlapping
portion
is
undefined,
so
the
hardware
implementation
is
free
to
go
basically
and
global
register
allocator.
This
is
different.
The
overall
framework
is
the
same
among
the
three
code
generator,
but
is
configurable
and
parameterize
able
to
each
code
generator.
Basically,
how
many
real
registers
you
are.
You
can
be
used
for
for
global
register
candidates
and
which
real
register
are
most
favorable
for
the
candidate,
and
this
is
configurable
in
each
code
generator
because
it
is
up
to
each
culture
and
platform
specific
to
decide
which
will
register
a
most
favorable.
All this information is carried into code generation through the GlRegDeps node. And for the local register allocator, the status of the real register set is controlled through register dependency conditions; those are used to honor the global register allocation requests, to guarantee no spills for certain code regions, and to enforce the linkage conventions.
Okay — my time... I need to go faster, I think. Linkage conventions and direct JNI. The private linkage convention: basically, the JVM has its own ABI. So naturally you might ask: why not just use the system ABI? There are historical reasons. I think the simplest answer is that in the past we were doing the compilation on the same
— we were doing the compilation on the same stack, the same thread, as the application thread. If you are using the same stack for the native code and the Java thread, then potentially, for every Java thread, you need to have a big stack. So in the past we separated them: a native stack is a native stack, and a Java stack is a Java stack, so the compilation cannot cause the Java stack to get that big. Now you have a different stack, so you have a different linkage already.
The linkage is basically: how the real registers are used — which are volatile (you can modify them), which are preserved, and which are reserved (reserved meaning you cannot touch them at all); how arguments are passed to your callee; how values are returned; and what the stack-frame shape looks like. That is the linkage. Also, on argument passing: if your arguments are passed on the stack, you have to store and then load them, with load-hit-store performance implications.
So what is load-hit-store? You have an older store and a younger load to the same memory, and if the load is issued later than the store, and you check the store queue and hit the store, then store forwarding needs to happen — and sometimes it is going to be rejected. Store forwarding on X and Z
is fast, but sometimes it can be rejected, and you suffer performance there too. On Power, typically, the load-hit-store penalty is higher. And store-hit-load is when you issue out of order: the load is issued first and the store is issued later; the store checks the load queue, and when it hits the load, you need to reject and redo — that is the store-hit-load penalty. This can also happen when you are short-circuiting out of complicated calls without shrink-wrapping.
Basically, you have saved a lot of registers, but you return right away — because of a condition, something is a null pointer, so you are going to return right away — and on that return you need to restore all the preserved registers, so there are a lot of load-hit-stores there. And then direct JNI: this is the bridge for the linkage-convention difference between the JIT private linkage and the system linkage.
You need to bridge that difference, even onto a different stack. You can dispatch JNI through the interpreter or a helper call as well, but the overhead is very significant. You can imagine how big it is: you need to copy the arguments from the Java stack to the native stack, and you even need metacode to set things up.
You need to decode the signature to know whether something is a floating-point argument or an integer argument or whatever; that decoding takes time, and the copying of arguments takes time. And you need the metacode sequence to set up the registers — for example, you need, say, R3 to contain the first argument on Power. That kind of overhead is high, and the JNI direct call itself has a lot of overhead.
You need to build the stack frame to anchor the call; you acquire and release VM access, which includes an atomic update; after your call, when you come back, you need to free the reference frame; before the call you need to prepare for a potential GC; and at the end you need to tear down the stack frame. This is pretty expensive, and right now I think we have a better, lighter mechanism in the works — possibly it is better. Any questions?
Instruction scheduling and reassociation. Instruction scheduling: basically, you have the constraint of keeping the same semantics of the instruction sequence, and you move instructions around to fit the pipeline better. There is not much benefit with the current cores' deep out-of-order execution engines — I have examples on the next page.
So I'm going to describe why instruction scheduling is not that beneficial on an out-of-order machine. In the generated-code column and the scheduled column I have a basic machine model: the load instruction has a five-cycle latency, and the arithmetic instruction is six cycles for floating point — which is typically true.
The arithmetic instruction is six cycles and the load is four to five cycles. The example is a currently pretty popular idiom for machine learning and whatnot — matrix multiplication or vector dot products — and this is the typical idiom: basically, you square each element and add them all together, accumulating into x.
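The idiom being discussed, written out — a sum of squares accumulated into x:

```c
/* Sum of squares: every add depends on the previous add through x,
   so the accumulation forms one long dependency chain. */
double sum_of_squares(const double *a, int n) {
    double x = 0.0;
    for (int i = 0; i < n; i++)
        x += a[i] * a[i];   /* load, multiply, add */
    return x;
}

static double demo_sum3(void) {
    const double v[3] = {1.0, 2.0, 3.0};
    return sum_of_squares(v, 3);   /* 1 + 4 + 9 */
}
```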
So if you unroll this loop twice, the generated code looks like: load a[i], square it putting the result in r2, add r2 to x; then load a[i+1], square it, and add it. Suppose you are not out of order, though your pipeline is pipelined — pipelined means that if the next instruction is independent of the previous instruction it can go, but if it is dependent, it cannot. So for this instruction sequence:
your load instruction is issued and the result comes back in 5 cycles; the next instruction — the multiply of r1 by r1 — is dependent on the load, so once the load is issued, the multiply cannot go; it has to wait there for five cycles. Although you are pipelined, you are not out of order, so you can roughly calculate that this sequence, under that model, will take 29 cycles.
After scheduling, the first multiply can issue, and on the next cycle, when the answer comes back, the next multiply can go; you can calculate that for that pipelined model it is around 23 cycles. So that is the 20-to-30-percent benefit from the scheduling. But on an out-of-order machine, you feed the generated-code instructions into the out-of-order engine and it will execute them as in the scheduled order anyway, because anything that can execute out of order will go out of order.
So that's why the scheduler is not providing much benefit. However, reassociation is going to help. Here I have the loop reassociated, unrolled eight times: you can imagine your execution stream looks like eight loads, eight multiplies, and a few adds. The reassociation is: when the eight loads and eight multiplies are coming back, you are not going to serialize on the x register; you are going to add r2 and r4 together first, going into a new register.
Basically, you reassociate the whole thing so you do not go sequentially through the one accumulator register: r2 and r4 are added together, r6 and r8 are added together — you have four such pairs, each independent of the others, so they can all go. After that you have two independent partial sums to add together, and those can go in parallel as well. Only at the last moment do you serialize on x: x plus t1, then x plus t2.
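That reduction tree, as a C sketch of the 8x-unrolled, reassociated loop. The accumulator names are illustrative; note that a JIT cannot do this freely for floating point, since reassociation changes rounding.

```c
double sum_of_squares_reassoc(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        /* four independent accumulators: the adds within one iteration
           do not serialize on a single register */
        s0 += a[i]   * a[i]   + a[i+1] * a[i+1];
        s1 += a[i+2] * a[i+2] + a[i+3] * a[i+3];
        s2 += a[i+4] * a[i+4] + a[i+5] * a[i+5];
        s3 += a[i+6] * a[i+6] + a[i+7] * a[i+7];
    }
    double t1 = s0 + s1, t2 = s2 + s3;   /* two independent partial sums */
    double x  = t1 + t2;                 /* serialize only at the end    */
    for (; i < n; i++)
        x += a[i] * a[i];                /* leftover elements            */
    return x;
}

static double demo_reassoc(void) {
    const double v[9] = {1, 1, 1, 1, 1, 1, 1, 1, 2};
    return sum_of_squares_reassoc(v, 9);   /* 8*1 + 4 */
}
```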
So, assuming two pipes for each instruction type and the same latency model: if you unroll 8 times but only schedule — 8 loads, 8 multiplies, and then 8 adds serialized — that will require 59 cycles. But if you reassociate, as the reassociated column shows, you can calculate that it is roughly 35 cycles. That is almost twice as good. Our JIT compiler right now cannot do this, but it is sitting on Andrew's radar somewhere.
Thank you. So let's give a round of applause for doing such a great job on a very complicated topic. We will end here today. If you have any more questions, let Julian or me know. Like we were talking about earlier, he will deliver the second part, so hold your interest for a month — if you really want to know earlier, we can also reschedule the talk, but right now it is scheduled as next month's talk. And if you have something that you would like to share, like I was talking about earlier...