National Energy Research Scientific Computing Center (NERSC) GPUs for Science 2020, 5 Aug 2020

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Understanding AMD GPU ISA

Description

Rene van Oostrum from AMD presents a talk on Understanding AMD GPU ISA. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Oisín Creaner

A

Good morning, everyone, my name, is renee van ostrom. I work in amd research on porting and optimizing applications for frontier and el capitan, and a small part of my work involves staring at gpu assembly and that's what I'll be talking about today.

A

Max's slides on gpu architecture take care of many of the prerequisites for making sense of my talk. So thanks very much max. You may wonder why anyone would want to read amd gpu assembly.

A

Well, I've found that it helps to get a better mental model of how our hardware operates, and also when, while working on performance, optimization seeing and understanding what the compiler does helps in making optimization decisions, the slide deck of this talk will be available for download, and it has a couple of extra slides that I won't show today and that give pointers to resources such as our public iso reference manual and the application binary interface. In case you want to dive deeper.

B

A

The topic than I'll be able to cover today and with that said, let's dive right in so consider the following hip kernel, which is coincidentally, also a valid cuda kernel. It computes y, is alpha times x, plus y, where x and y are vectors, and alpha is a scalar as simple as this kernel is. It contains elements that are common to nearly all kernel.

A

It has kernel arguments that are passed by value such as n, an alpha or by pointer to device memory such as x and y threads compute, an index of the data elements they are to work on using thread, block dimensions, thread, block id and thread id values for the index computation.

A

There may be more threads than data elements and the kernel guards against accessing out of bounds array elements. All the threads do the same computation, but each one operates on different values.

A

Here is the complete isotherm for our deck speaker and all let's go over some basics. First, in amd gpus, the assembly is executed by wavefronts, which are groups of 64 threads that execute in lockstep. Wavefronts have access to two types of registers.

A

Scalar registers are registers where there is a single instance per wavefront visible to all threads registers are 32 bits wide, but we can combine consecutive registers if we need 64 bits to operate on scalar registers, we have scalar instructions. If we look at the operands. Typically, the target register or memory address comes first, followed by one or more source, operands and load and store instructions may have an offset too.

A

The second type of register is the vector register, starting with a v for this kind of register. There is one instance for every thread in the wavefront, as you would expect. There is also vector instructions, and some of these may take a scalar register too, as one of their input operands.

A

Let's have a look at the first instruction in line two, this instruction loads, a d word, so four byte value from the memory location pointed to by the eight byte pointer in scalar registers, six and seven and it stores the d word in scalar register s0.

A

But how did we get a pointer in s67 that is specified by the application, binary interface for our kernel? The abi says that the first four scalar registers are uninitialized and the kernel can use them as it pleases.

A

The next two scalar registers hold a pointer to the kernel dispatch packet in device memory and we'll see in a minute what that is.

A

Dendrite is appointed to the region of device memory that holds the kernel arguments then scalar register s8 holds block idx x, which is the same for all threads in the wavefront and vector register v0 pulse threat, idx, which is unique to each thread in the wavefront. So here we see how scalar registers are used for values that are common to all threads in the wavefront.

A

Let's analyze, the iso line by line s67 in the instruction on line 2 holds a pointer to the kernel argument, region and device memory on the bottom right. We see what it looks like. The first argument is a four byte end. Then we have four bytes of padding and the remaining arguments are all eight bytes. The instruction in line two loads, the value at offset zero, so that is n and it stores the value of n in scalar register as zero.

A

In line three of the iso dump s45 points to the so called kernel dispatch packet, which is defined in the hsa specification.

A

An outline is shown on the right hand, side of this slide the instruction loads, a d word from the dispatch packet at offset 4, and that happens to be the work group size along the x dimension, which is called block dim x in our hip kernel.

A

But this field in the dispatch packet is a two byte value and since we don't have instructions to load, just two bytes, we are loading both workgroup size x and workload, size y and store it in s1, we'll deal with the unneeded workgroup size y. In a moment after executing the first two isa instructions in lines, 2 and 3, we have the value of n in scalar register s0 and, as one holds both blocked in x and block them y, where the letter is undefined, or at least that is what s0 and s1 hold.

A

When the data loaded by the two instruction arrives in the registers, this may take a while, and the hardware does not wait for the data to arrive before it issues, instructions that depend on this data.

A

It is the job of the compiler or the assembly programmer to make sure that data dependencies are satisfied, and that is where the scalar instruction s weight count comes in. The idea is as follows: whenever a memory instruction is issued, the counter is incremented and there are separate encounters for scalar loads and loads and stores to lds on the one hand and vector loads and stores, on the other hand, and lds stands for local data store. This is what encuda is called shared memory.

A

There is a bit more to the counter story, but what I'm showing here suffices for understanding our eyes at them. In any case, the counters are incremented when memory instructions are issued and decremented. When the reference data arrives at its destination, the s weight count instruction, halts execution of the wavefront until the counter has reached a specified value for lds loads and stores and vector loads and stores, instructions complete and the order in which they are issued, but for scalar loads. This is not the case. These may complete out of order, so for scalar loads.

A

We must wait until the counter is back at zero.

A

Once the data from the scalar loads has arrived in s0 and s1, we can deal with the possible garbage value and the high order word of s1. We take care of this with a scalar, logical and operation, with a bit mass that has all ones and the two lower order bytes. So the two lower order, bytes of s1 containing block dimx, are retained and the two higher order byte are reset to zero.

A

In line six, we multiply s8, which was initialized to hold block idx with s1, and we just loaded block dmx and we reuse s8 to store the result of the multiplication line. Seven is our first vector instruction: it adds v0, which is the threat idx value that is unique to all threats to the value in s8 so effectively. This completes the computation of I in the first line of the hip source code that is shown at the bottom right of this slide.

A

At this point, v0 holds for each thread the index I of the arrays x and y that it is to operate on, but we may have more threads than array elements, so we should disable any further computation for threads. That would access the arrays out of bounds, and that is what the last three lines of the first assembly block take care of line. 8 compares s 0, which holds n with v 0, which now holds I.

A

The comparison instruction is a vector instruction that yields a boolean result, so a 0 for false and a 1 for true for each of the 64 threads in the wavefront.

A

These 64, zeros and ones are stored in the vector condition, code or vcc, which is a scalar register pair, so threads, for which n is greater than I would have their vcc bit set to 1 and the other threads have their vcc that set to 0..

A

The execution mask is a 64-bit scalar register pair that determines which threads in a wavefront are active line. 9 saves the current execution mask in scalar registers, s01 and, and the execution mask with the vector condition code that we computed in the previous line.

A

In this case, the execution mask prior to the instruction was all once so after line 9, the threads for which I is less than n, and only those threads will continue to execute. The remaining threads are masked out as it is called they effectively only execute no ops line. 10 is an optimization.

A

This is a conditional branch that tests if the execution mask, is all zeros. If that is the case, there is no point in stepping through the remaining instructions and we can jump right to the end of the program.

A

Let's summarize, what we have done so far line two loads: the value of n into scalar register at zero lines. Three through seven compute the index I for each thread and stored in vector register, v, zero lines, eight and nine set the bit and the execution mast to one for threads. That should remain active and line nine jumps to the end of the program if there are no active threads.

A

Next, let's have a look at the second half of the assembly dump. This block is only executed by the active threads lines. 12 and 13 use the pointer, 2d kernel arguments again. This pointer is stored in s67, I'm showing the kernel arguments with our offsets in the top left of this slide line: 12 loads, 4 d words, so two 8 byte values, starting at offset a so it reads alpha and x in one instruction where alpha ends up in s01 and x in s23.

A

The next line reads: y and stores it in s45. You may recall that s45 was used to hold the pointer to the kernel dispatch packet, but we don't need it any longer. So we can reuse. These registers, after line 13, we have pointers to the start of the x and y arrays and scalar registers.

A

Every thread must read its own entries from x and y and store the result into its own entry of y. We will use the scalar pointers to the beginning of the array and the value of I for each thread to compute unique pointers into x and y for each thread.

A

Each thread has I stored in v0 a 32-bit register. The pointers are 64 bits, wide line, 14 and 15. First turn the 32-bit value of I into a 64-bit offset from the start of the arrays line. 14 does a 31-bit right shift with sign extension of the value in v0 and stores the result in v1, so v1 ends up with all zeros if the assignment of v0 is 0 or all ones. Otherwise, in other words, v01 is now the 64-bit representation of the value of v0.

A

Note that higher order bits are in the higher numbered register of the pair line. 15 does a left shift by 3 positions of the 64-bit value and v01, which comes down to doing a multiplication by 8., since the values in the array x and y are doubles. The offset of the I double from the beginning of the array in bytes is 8 times I, and hence the multiplication.

A

Lines 17 through 19 add the thread offset in v01 to the pointer to the first element of x in s23. The result, a pointer to x, I for each thread, is stored in v23 amd gpu. Isa does not have separate instructions for 64-bit integer edition. Instead, the 64-bit edition is executed as two consecutive 32-bit editions with carryout and carry-in.

A

These two 32-bit editions are in line 18 and 19, where vcc is used for the carry bits.

A

The instruction with the carryout in line 18 allows for a scalar source argument, but the one with the carry in in line 19 does not and that's why, in line 17, we first copy the scalar register to a vector register so that we can use this vector register as a source argument in line 19.

A

lines 20 through 22. Similarly compute a pointer to y. I for each thread we're almost there. We have pointers to x, I and y I in 2, 3 and v 4 5, respectively line 23 reads the value of x, I into v 2 3 reusing the letter register pair since the pointer to x.

B

A

Is no longer needed after the read line, 24 reads the value of y, I into v 4 5.. In this case registers v 0 1 are not reused.

A

Since the pointer to y, I is still needed to write the results back in line 27 line, 25 waits until the values read in the previous two lines have arrived from device memory and the target registers and are ready for subsequent use line. 26 does the core computation of the kernel it multiplies x I and v 2 3 with alpha and s. 0 1 adds y, I and v 4, 5 and stores the result. Alpha times x, I plus y, I and v 2 3.

A

line, 27 stores the results from the fma operation in the previous line into yi and device memory. Finally line 29 indicates to the hardware that the wavefront is done.

A

uh This is my final slide, showing a high level summary of the second half of our kernel. But since my time is about up I'll, just leave it up here where, while we open the floor for questions.

B

Thank you for that. um So those were some very in-depth and uh fascinating topics there, I'm just looking to um see if there's some hands up amongst the audience.

B

I've just allowed you to unmute your mic. uh Sorry ist, just like. If you mute yourself, you can ask the.

B

B

B

It it suck um it's the name, I'm seeing there you've just been. You raise your hand.

C

When do you use atomic instruction.

B

Sorry, can you speak up a little.

C

C

When do you use atomic instruction.

C

Like atomic add atomic multiplication.

A

Oh, I see what you mean um so in this kernel. We don't need them in general, uh you could use them, for instance, in in reductions.

A

um It's it's not pertinent to the the kernel for doing a dax fee.

B

Good, um we have another question here, and uh this is: how does the differing wavefront size in amd gpus versus warp size affect applications? Do users need to think about this explicitly.

A

It depends so if you're porting um between the two architectures uh say you're you're, going from cuda to uh to hip, and you have a code where the wavefront the warp size is hard coded as 32.

A

Then you need to be very careful about this, but other than that it should not be an issue in general.

A

Unless maybe you go down to to very low level instructions like intrinsics for assembly operations, it may turn up. But if you don't hard code, the warp or wavefront size, it should not be an issue.

B

Very good um another question here um wondering uh what was the sort of what what would you say is the the sort of core take-home message that you would take from the the walk through here. What what if there was one thing for, maybe a less experienced speaker should what should they take from from this? This.

A

um Well, one thing is that if you, if you look at the hip source for this kernel, it's basically an fma that is executed by many threats. And then, if you look at the isotherm that I'm showing here, you see that the fma instruction is in line 26 and and all the rest is scaffolding.

A

And that tells me that I'm I'm doing quite a bit of work before we get to the point, and this indicates that I maybe want to do more than one fma instruction for each of the threads.

A

Admittedly, you shouldn't figure this out by staring at assembly in the first place, as max already mentioned, you should always use profilers too to determine uh where your time is spent, but this kind of isotherms can be quite revealing in what is going on under the hood, and it helped me more than once.