National Energy Research Scientific Computing Center (NERSC) HPCToolkit Training for NERSC and OLCF Users, Mar-Apr 2021, 6 Apr 2021


From YouTube: 2 - Introduction to HPCToolkit

Description

Part of the Using HPCToolkit to Measure and Analyze the Performance of GPU-accelerated Applications Tutorial, Mar-Apr 2021. Slides available at https://www.nersc.gov/users/training/events/hpctoolkit-for-gpu-tutorial-mar-apr-2021/

A

So welcome, everybody, and thanks for the interest. As you saw from the introduction, using these systems is pretty complicated, and we've tried to automate a bunch of examples that you can try out. The whole goal of the workshop is for you to be able to work on this with your own codes, and so hopefully, by looking at our automated examples, you'll get a sense of how to do that.

A

So the detailed presentations are going to be given by Laksono Adhianto, one of my staff members, and Keren Zhou. Before they get a chance to talk, I'm going to tell you a little bit about our overall project and what we're trying to do. Once you see some of the complexity, you'll understand that, well, there are some issues, and so what we have is a work in progress here. But hopefully you'll be able to use it productively with your codes.

A

So let me first start with a brief acknowledgement. As Helen mentioned, our funding is mostly through the Exascale Computing Project. We also have some funding from Argonne, from the DOE tri-labs, AMD, and Total.

The team that's working on HPCToolkit is principally my group at Rice; there's a collection of research staff and PhD students. We also work closely with Barton Miller at the University of Wisconsin, who's the lead of the Dyninst project. As for the people that are here to help with the workshop: the first row is my staff, Laksono Adhianto, Mark Krentel, myself, and Xiaozhu Meng, and the bottom row is my PhD students, Aaron Cherian, Dejan Grubisic, Yumeng Liu, and Keren Zhou.

A

So, as you're all aware, the DOE's plan for exascale is heterogeneous platforms. Aurora, Frontier, and El Capitan are the forthcoming exascale systems, and those are all GPU-accelerated. Aurora will have Intel GPUs, and Frontier and El Capitan will have AMD GPUs, and so that's been a major focus of our work.

However, in this workshop, we're going to be working on NVIDIA-based heterogeneous supercomputers. We've been working on the Intel, AMD, and NVIDIA platforms all together. There's Summit, Sierra, and the forthcoming Perlmutter, and today you'll get a chance to use Cori GPU and Summit.

A

For these forthcoming exascale systems, there's a collection of node-level programming models, and what we're trying to capture with our tools are programming models that use any of these techniques. There's Intel DPC++ for Intel's GPUs, there's CUDA for NVIDIA GPUs, there's HIP for AMD GPUs, and there's OpenACC and OpenMP. Then there are also template-based programming models that have been developed at the national labs: RAJA at Livermore and Kokkos at Sandia. All of these are viable strategies for doing GPU programming at the node level, and our tools support all of them.

A

For global programming models, most people are using MPI, but there's also a collection of other global programming models. There's UPC++, which has been developed at LBL, with the GASNet global address space layer; it's a library-based model for shared address space programming. That and OpenSHMEM are also viable options for global sharing. There's also the Charm++ model that was developed at UIUC. HPCToolkit should work with all of these as well; we're agnostic to the node-level programming model and to the global programming model.

A

So there are a number of performance analysis challenges for our GPU-accelerated supercomputers. You worry about the computation: you need extreme-scale data parallelism to keep your GPUs busy. You're worried about the data movement costs within your nodes, between memory spaces, between the CPU and the GPU. There are communication costs between the nodes, and there's I/O as well. With HPCToolkit, our aim is to enable you to measure all of these things.

There are lots of ways you can hurt your performance. You can have insufficient parallelism, load imbalance, serialization, replicated work, data copies, synchronization, or bad locality. With HPCToolkit, you should be able to identify these kinds of problems with codes using each of those kinds of programming models.

The hardware and execution models are pretty complex. We've got CPU and GPU compute engines that have vastly different characteristics, capabilities, and performance concerns. There are multiple memory spaces, the CPU memory and the GPU memory, that have different characteristics. And then we also have asynchronous execution if you're launching asynchronous kernels on the GPU. All this is a pretty significant challenge for tools trying to gain some insight into what's going on. Our tools have to interpose themselves in a layer between the operating system and your application, and then monitor everything that's going on as your code interacts with the device.

A

So, some measurement-related challenges. If you're using extreme-scale parallelism, then any serialization within our tools is going to disrupt parallel performance, and so our tools have a collection of concurrent data structures inside them so that we won't hurt your application performance.

When we're measuring performance, we're very dependent upon third-party measurement interfaces. If we're going to measure something, we can measure it only because there's hardware support or there's software support. On the hardware side, on the CPUs, there's a performance monitoring unit that can measure all sorts of things like instructions, cycles, cache misses, instruction completion, and things like that.

A

On the software side, we're dependent upon a number of different interfaces that enable us to see how your application interacts with both the operating system and the hardware. In glibc there's a capability called LD_AUDIT that enables us to track dynamic loading of shared libraries. It's really a great design. However, it turns out that it's only been lightly used, and so we've been working very closely with Red Hat to address some bugs that we've discovered. All I can say is that it will work better in the future, once some of the bugs get cleaned out of glibc.

On Linux, there's the perf_event subsystem for measuring hardware performance, and we can also actually measure performance in the kernel as well. Now, as it turns out, Cori and Summit are both configured in a way that you can't actually measure things in the kernel, but on other clusters, or on your machine back at your institution, that's probably different, or you can get it configured differently.

A

For GPU monitoring, we're using instrumentation libraries from the vendors. Today, we're using CUPTI, the CUDA performance tools interface from NVIDIA. There's a roctracer library from AMD, and there are some instrumentation libraries from Intel that we use as well.

A

So to measure your parallel applications running on CPU and GPU, there are a couple of different measurement strategies that we're using. We're using sampling on the CPU: we're periodically interrupting the application and identifying where costs are incurred. I'll tell you more about that in a minute.

For GPU operations, we get callbacks when GPU operations are launched, and sometimes we get callbacks when they're completed. For GPUs, we're also using an event stream that tells us that a particular kernel started at time t and finished at time t'.
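A minimal sketch of how launch callbacks and a later event stream could be tied together, assuming each GPU operation carries an identifier that appears both in the launch callback and in the completion record. The function names, the record fields, and the use of an integer correlation id are illustrative assumptions, not HPCToolkit's or any vendor's actual API.

```python
from collections import defaultdict

# Hypothetical bookkeeping; real vendor interfaces expose richer records.
launch_context = {}          # correlation id -> CPU call path at launch time
gpu_time = defaultdict(int)  # CPU call path -> accumulated GPU nanoseconds


def on_launch_callback(correlation_id, call_path):
    """Runs synchronously when the application launches a GPU operation."""
    launch_context[correlation_id] = tuple(call_path)


def on_activity_record(correlation_id, start_ns, end_ns):
    """Runs later, when the asynchronous event stream reports the operation."""
    call_path = launch_context.pop(correlation_id)
    gpu_time[call_path] += end_ns - start_ns


# Launches and completions may be reported in different orders.
on_launch_callback(1, ["main", "solve", "launch_kernel"])
on_launch_callback(2, ["main", "init", "launch_kernel"])
on_activity_record(2, start_ns=100, end_ns=250)
on_activity_record(1, start_ns=300, end_ns=1300)
on_launch_callback(3, ["main", "solve", "launch_kernel"])
on_activity_record(3, start_ns=1400, end_ns=1900)

for path, ns in sorted(gpu_time.items()):
    print(" > ".join(path), ns)
```

The point of the sketch is only that the interval from t to t' arrives after the launch, so the CPU context has to be remembered at launch time and looked up when the record shows up.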

A

On NVIDIA GPUs, we also get PC sampling measurements, and so we can find out what machine instructions a kernel spent its time in and, if there are any delays, what those delays are.

Another challenge for our tools on the measurement side is that GPU kernels are launched very frequently. In order to not add a lot of overhead to your code, we had to work on tuning our tools so that we can measure when kernels are launched and attribute those costs to the program context, and do that quickly.

A

So there are a number of engineering challenges for performance tools, and I don't want to dwell on this a lot. I just want to say it's very complicated, especially on these supercomputers. We've seen applications with as many as a hundred shared libraries, applications that are larger than five gigabytes, and applications with features where exit is initiated by a non-initial thread, or that fork non-readable helper applications. Dynamic libraries are loaded all the time. There are things like threads being created in the init constructors in libraries, and processes that fork. When processes fork, these interact with the tooling substrates, and so let's just say that there's a lot of complexity in there.

All this is to say that, for your application, you may run into a problem with some of these things. We've been working with a lot of complex applications, and so we think we're on the right track. If you run into a problem, then just let us know, and we'll take a look at it and see if we need to make any adjustments.

A

So, compared to other GPU performance tools: other GPU performance tools feature a trace view that shows a series of events that happen over time in each process, thread, and GPU stream. There's also a profile view of GPU kernels that shows some performance metrics with program contexts. There is a collection of such tools. For NVIDIA, there's Nsight Systems, Nsight Compute, and nvprof. For AMD systems, there's rocprofiler, and there's Intel VTune. There are some third-party tools, such as TAU.

A

Now, how our tools differ from these. These other tools lack a comprehensive profile view for analyzing complex CPU calling contexts, including inlined frames, where GPU operations are invoked. You want to understand where, how, and why GPU kernels arose from instantiating nested templates, and that's particularly the case with template-based programming models like RAJA and Kokkos. You also want to understand the cost of GPU APIs, for example cudaMemcpy, that are invoked from many different contexts in your program. You need to understand where those costs are incurred, not just, "My program spends a lot of time copying data." You'd like to know where the copies are expensive, and then take some steps to reduce the cost, for instance by moving the data on, running many kernels while the data is on the GPU, and then moving the data off, instead of moving the data back and forth at every kernel launch.

A

There are also sophisticated calling contexts on the GPU. Using OpenMP target, or Kokkos, or RAJA, you can end up with GPU code that has a lot of different procedures, and so just knowing that I spend time in my kernel isn't necessarily enough. For a Livermore code called Mercury, they have about a hundred thousand lines of code that runs in a GPU kernel, and so, if you're just told that your kernel is slow, that doesn't really help you tune your application. So we collect fine-grained information inside the GPU kernels, and we can actually recover calling contexts within the GPU. If there are multiple functions on the GPU, we can attribute costs to the individual functions and show how those functions are invoked. You'll see more of that in a bit.

We can also collect and attribute loop-level performance information on both CPUs and GPUs. As far as I know, nobody does this on GPUs, and I don't think anyone does this on CPUs either, at least without adding loop-level instrumentation, which is very expensive. At best, the existing tools really only attribute a runtime cost to a flat profile of functions executing on the GPUs, and we do a lot more than that.

A

So let me give you an introduction to HPCToolkit's performance tools. I'll give you an overview of the components and their workflow, and then briefly mention HPCToolkit's graphical user interfaces. My colleague, Laksono Adhianto, will follow up with a talk in detail about how to get the most out of our graphical user interfaces when analyzing your application.

The second topic I'll talk about is analyzing the performance of GPU-accelerated applications with HPCToolkit. I'll give you an overview of our measurement capabilities and how we collect the data, analyze it, and attribute it. Finally, I'll finish with a few closing remarks, and then we can move on to the detailed presentations by Laksono and Keren on how to actually use our tools on your codes.

A

So for a long time, HPCToolkit has been doing binary-level measurement and analysis. We're observing executions of fully optimized, dynamically linked parallel applications. We can support multiple languages as long as you have compiled code, not necessarily JITed code like Java. We can also accommodate the fact that your application is typically using a collection of libraries that are only available in binary form.

We use sampling to collect measurements on the CPU. The reason for this is that it has controllable overhead. Also, if we were adding instrumentation to your code, we might just measure your program and not measure the time you spend in any libraries that are loaded. By using sampling, we measure where you are no matter what you're in.

The key difference that we're focused on today is measuring GPU performance. When we measure GPU performance, we're typically using vendor APIs. That's a blessing, because it's a level of abstraction that we get to rely on. It's a curse, because the level of abstraction that the vendors are providing, well, it doesn't have all the features we want, and we've encountered a lot of bugs.

A

In using these vendor APIs, we register callbacks to monitor the launch and completion of GPU operations. We also measure asynchronous GPU operations using these things called activity APIs that NVIDIA and AMD supply, which provide an event stream that says this function ran, it had the following characteristics, and it ran from time t to t'. Also, on NVIDIA GPUs, we're collecting fine-grained measurements using PC sampling. This is something that's been supported by the hardware since NVIDIA's Maxwell GPU.

In fact, back in 2013, we got deeply involved with NVIDIA and told them that their measurement strategy wasn't good enough and that we wanted some more detailed measurement support. I had some discussions with NVIDIA about adding some support for PC sampling, and then they got around to it and added it in Maxwell, and so we were very pleased. That actually provides some deep insight into what's going on, and you'll get a chance to use that today.

On Intel GPUs, we're actually using binary instrumentation to collect measurements. So those are two strategies for fine-grained measurement. For AMD, there isn't a strategy for fine-grained measurement yet.

A

Our tool associates metrics with both static and dynamic contexts, static contexts being things like load modules, files, functions, and loop nests, and dynamic contexts being call chains. So we're associating performance with loop nests, procedures, inlined code, and calling contexts, both on the CPU and the GPU. We're actually building heterogeneous call chains that go from the CPU all the way into your GPU code.

A

The user interface, as Laksono will show you, has a mechanism for computing derived metrics. You can compute your own derived metrics on the CPU and GPU to obtain insights: you can write metrics that capture waste or scalability loss, or, say, compute cache miss ratios, things like that, from the raw data that you collect.
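As a rough illustration of what a derived metric is, the sketch below applies a formula to raw per-scope counts to get a cache miss ratio. The scope names, metric names, and numbers are made up for the example; in practice derived metrics are defined in the viewer's interface rather than in a script like this.

```python
# Hypothetical raw metrics per program scope (procedure or loop).
raw = {
    "solve":      {"cache_misses": 2_000, "cache_refs": 40_000},
    "initialize": {"cache_misses": 9_000, "cache_refs": 30_000},
    "postprocess": {"cache_misses": 100,  "cache_refs": 10_000},
}


def derive(metrics, formula):
    """Apply a formula over the raw metrics of every scope."""
    return {scope: formula(values) for scope, values in metrics.items()}


miss_ratio = derive(raw, lambda m: m["cache_misses"] / m["cache_refs"])

# Rank scopes by the derived metric, worst first.
ranked = sorted(miss_ratio, key=miss_ratio.get, reverse=True)
for scope in ranked:
    print(f"{scope:12s} {miss_ratio[scope]:.3f}")
```

Sorting by a derived metric instead of a raw one can change which scope looks most interesting: here the scope with the most cache references is not the one with the worst miss ratio.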

A

And then our tools support top-down performance analysis, and you'll see more of that with the user interfaces with Laksono.

So let me just give you a high-level overview of how HPCToolkit works. The situation is a little bit more complicated for GPUs, but not too much; Keren has a more detailed diagram in his talk.

A

You start out by compiling your application and linking it the way that you normally do. We don't really need you to change your makefiles at all if you're building dynamic binaries. In her talk, Helen mentioned using the -fast option. That actually isn't something that we tested when we put together the examples for Cori. What we found in the past is that -fast can change whether you're generating a dynamic binary or a static binary. The examples that we're providing use hpcrun for measurement, and for that we're expecting dynamic binaries.

If you're actually producing static binaries, there's a way to measure those too. We have a tool called hpclink that will link our measurement code into your application, and so, if you're generating static binaries, there are some details in our manual on how to use hpclink to add instrumentation.

A

For dynamic binaries, what you do is just take the binary that you compiled and profile it using our hpcrun command. You launch your application using hpcrun, and it will collect call path profiles of the events of interest. It will also collect call path traces, and you'll see a little bit more about those in just a second. Where necessary, it's going to intercept interfaces for control and measurement. For instance, we want to know when threads are created, when threads are destroyed, and when the process exits, and so we have a little bit of instrumentation at those points in order to maintain control of your application.

A

So what do I mean by call path profiles? Well, what call path profiling does is attribute costs to the calling context in which the costs are incurred. On the CPU, we're sampling using Linux timers or hardware counters. The hardware counters can be used to count things like cache misses, and so we can say, "Interrupt me every time you incur a million cache misses," or we can say, "Interrupt me with a timer every thousandth of a second," for instance.

What happens is that when a timer goes off or a hardware counter overflows, it interrupts the application and the tool takes control, fielding the profiling signal, and we find out that we're at a particular machine instruction. So we're at some instruction in some routine C, and then we unwind the call chain to find that we were in C when called from B, when called from A, when called from main. That is a so-called call path sample, where we're attributing the cost of either the timer expiration or the hardware counter overflow to a particular instruction in a particular context.

A

So we gather CPU calling contexts using stack unwinding, and when we're launching GPU kernels, we unwind the call stack to find out where we are when we launch the GPU kernel. This is how we gather the information for a single call path. Then, over time, as the application executes, we're building a tree. Conceptually, you can think of there being a node at the top of the tree that corresponds to main, and then some subtree might correspond to a solver, another subtree might correspond to initialization, and another subtree might correspond to post-processing. The result of our measurements is that we have a tree with weights, where the weights correspond to whatever metrics we're measuring. So whether it's time or cache misses, we're attributing those costs in context.
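The tree of weighted call paths described above can be sketched in a few lines. This is an illustrative toy, assuming each sample arrives as a list of frame names from main down to the interrupted routine; the class and function names are hypothetical, and the real tool works on machine addresses rather than names.

```python
class Node:
    """One calling context: a frame reached through a particular call chain."""

    def __init__(self, name):
        self.name = name
        self.children = {}
        self.exclusive = 0  # samples that landed exactly in this context

    def child(self, name):
        if name not in self.children:
            self.children[name] = Node(name)
        return self.children[name]

    def inclusive(self):
        """Samples in this context plus everything it called."""
        return self.exclusive + sum(c.inclusive() for c in self.children.values())


def record_sample(root, call_path, weight=1):
    """Walk (and extend) the tree along one unwound call path."""
    node = root
    for frame in call_path:
        node = node.child(frame)
    node.exclusive += weight


root = Node("<root>")
samples = [
    ["main", "A", "B", "C"],
    ["main", "A", "B", "C"],
    ["main", "A", "B"],
    ["main", "init"],
]
for path in samples:
    record_sample(root, path)

main = root.children["main"]
print(main.inclusive(), main.children["A"].inclusive())
```

Because identical call paths share nodes, the tree stays compact even when many samples are taken, and inclusive costs fall out of a simple bottom-up sum.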

A

The nice thing about using sampling on the CPU to gather this is that our measurement cost is proportional to the sampling frequency, and not the frequency with which functions are called. Anybody that's using instrumentation as their principal measurement mechanism, where every time you enter or leave a procedure you're invoking a tool, can add a lot more overhead. By adjusting the sampling frequency, you can adjust your measurement cost with HPCToolkit.
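A back-of-the-envelope sketch of that proportionality, using made-up per-event costs (10 microseconds to handle a sample, 50 nanoseconds per instrumentation probe) that are assumptions for illustration only, not measured HPCToolkit numbers:

```python
def sampling_overhead_ns(samples_per_second, run_seconds, ns_per_sample):
    """Overhead grows with how often we sample, not with what the code does."""
    return samples_per_second * run_seconds * ns_per_sample


def instrumentation_overhead_ns(procedure_calls, ns_per_probe):
    """Overhead grows with call count: one probe on entry, one on exit."""
    return 2 * procedure_calls * ns_per_probe


RUN_SECONDS = 100
NS_PER_SAMPLE = 10_000   # assumed cost of one interrupt plus unwind
NS_PER_PROBE = 50        # assumed cost of one entry or exit probe

at_1000_hz = sampling_overhead_ns(1000, RUN_SECONDS, NS_PER_SAMPLE)
at_500_hz = sampling_overhead_ns(500, RUN_SECONDS, NS_PER_SAMPLE)
instrumented = instrumentation_overhead_ns(10**9, NS_PER_PROBE)

print(at_1000_hz / 1e9, at_500_hz / 1e9, instrumented / 1e9)  # seconds
```

Under these assumptions, halving the sampling rate halves the sampling overhead regardless of how many procedure calls the program makes, while the instrumentation overhead is fixed by the call count.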

A

If you think it's too costly, then you can just lower the sampling frequency and reduce your measurement costs.

So the second thing that we do is analyze your application binaries.

A

We have a tool called hpcstruct. It takes your application binary and recovers program structure information. It will give the most detailed information, and attribute performance to individual source lines, when you've compiled with a -g-like option. You can compile with optimization turned on, and then you can either use -g or, with the PGI compilers, use -gopt, which basically says record line map information but don't disturb optimization.

What the hpcstruct tool does is analyze the machine code that the compiler generated. It looks at the line map information that was recorded by the compiler, and it looks at debugging information, for instance information about inlining that was recorded by the compiler. It extracts loop nests: by analyzing the machine code, we're actually recovering loop nests and control flow in your application. We identify inlined procedures, and then it maps the structure of the control flow in your application binary back to what it was in the original source code. If there's something like code that's been hoisted out of a loop, we can tell, because by looking at the line information we can see that code that is both inside and outside of the loop came from the same place.

This binary analysis computes program structure information that tells us about the files in a load module, either your program or shared libraries. So we know the load module, the files, the procedures, the loop nests in each procedure, the statements in the procedure, and inlined code. We first collect our measurements, which attribute costs to addresses in machine code, and then we compute the program structure information, which tells us how the machine code relates to the application source code.
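The last step, relating sampled addresses back to source lines, can be pictured as a lookup in a sorted line map. The sketch below assumes a simplified map of (start address, file, line) entries with invented addresses and file names; real line maps come from the compiler's debugging information and carry much more detail.

```python
import bisect

# Hypothetical line map, sorted by start address. Each entry covers the
# addresses from its start up to the start of the next entry.
line_map = [
    (0x400000, "solver.c", 10),
    (0x400010, "solver.c", 12),
    (0x400028, "solver.c", 11),  # optimized code need not follow line order
    (0x400040, "util.c", 7),
]
starts = [entry[0] for entry in line_map]


def source_location(address):
    """Return (file, line) for the entry covering this address, if any."""
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    _, file, line = line_map[index]
    return file, line


# Sample counts keyed by instruction address, as measurement would record.
samples = {0x400004: 3, 0x400012: 5, 0x40002C: 2, 0x400030: 1}

cost_by_line = {}
for address, count in samples.items():
    location = source_location(address)
    cost_by_line[location] = cost_by_line.get(location, 0) + count

print(cost_by_line)
```

Measurement only needs to record addresses; the mapping to files and lines is applied afterward, which is why it adds nothing to the cost of the run.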

A

The binary analysis is done using a library called Dyninst, which is a toolkit from the University of Wisconsin. We're using their ParseAPI for parsing the machine code, SymtabAPI for analyzing the symbol table information and the line maps in it, InstructionAPI, and DataflowAPI for doing instruction-level analysis and slicing. Dyninst provides some native support for analyzing AMD GPU binaries, and there's some lightweight support that does enough so that we can actually analyze GPU binaries for NVIDIA GPUs and for Intel GPUs.

A

The next thing that we do, after collecting the measurements with hpcrun and analyzing your CPU and GPU binaries with hpcstruct, is combine the information from the program structure file and the call path profiles with a tool called hpcprof. There's also another tool called hpcprof-mpi. This is not just for analyzing MPI programs: hpcprof is for analyzing small-scale data, while hpcprof-mpi is itself an MPI program, and we use it for analyzing large-scale performance data. Rather than analyzing your performance data sequentially on your head node, you can launch a parallel job to analyze lots of performance data. For instance, if you had a job that had maybe a thousand MPI ranks or something, rather than serially analyzing all the profiles from those thousand MPI ranks, you can use hpcprof-mpi to analyze those in parallel.

The result of our analysis is what we call a database, which is a directory full of information about the profiles and the traces, related back to the source code.

A

Then we have a presentation tool called hpcviewer. This enables you to explore the performance data from multiple perspectives. You can rank-order the costs by particular metrics, so if you want to focus on cycles, or instructions, or GPU instruction counts, or GPU stalls, you can just say, "Okay, I want to put it in this sorted order, and then show me where those costs occur." You can also compute derived metrics in the viewer interface, as Laksono will show you. And then, with our trace analyzer that's in here, we can explore the execution behavior over time.
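Putting the four steps together (measure, recover structure, combine, view), the sketch below only assembles and prints the command lines; it does not run anything. The application name, the directory names, and the choice of events are placeholders, and the exact options should be checked against the HPCToolkit manual for the installed version, since they are assumptions here rather than something stated in the talk.

```python
import shlex


def workflow(app, app_args=(), events=("CPUTIME", "gpu=nvidia"), trace=True):
    """Build the hpcrun / hpcstruct / hpcprof / hpcviewer command lines."""
    measurements = f"hpctoolkit-{app}-measurements"  # placeholder directory name
    database = f"hpctoolkit-{app}-database"          # placeholder directory name

    run = ["hpcrun", "-o", measurements]
    for event in events:
        run += ["-e", event]
    if trace:
        run.append("-t")  # also collect call path traces
    run += [f"./{app}", *app_args]

    return [
        run,
        ["hpcstruct", f"./{app}"],
        ["hpcprof", "-S", f"{app}.hpcstruct", "-o", database, measurements],
        ["hpcviewer", database],
    ]


commands = workflow("app")
for command in commands:
    print(shlex.join(command))
```

For a large run, the third step is where hpcprof-mpi would be launched as a parallel job instead of hpcprof.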

A

So this is what our user interface looks like. Laksono is going to tell you a lot more about it, but I just want to point out a few things. There's a source pane that has some source code in it. There's a navigation pane, which is going to show you either top-down call chains, or bottom-up call chains (Laksono will explain the difference), or a flat view that shows you the load modules, files, functions, and loop nests where your application spends time. And then there are some metrics, and the metrics are whatever you collected; here, this was done using just time-based metrics.

The other thing that's important is that there's support for looking at three different views: the top-down, the bottom-up, and a flat view. So there are some view control tabs. There are also some tabs over here that control the metric display; Laksono will tell you more about those.

A

The thing that I wanted to point out about this: this is the top-down view, and what it is showing you is a view of a call chain in the code. You'll notice that in blue there are names of procedures, and in green there are names of procedures that are preceded by an I in brackets. This means that this code has been inlined, that we were able to discover that the code was inlined, and that we're able to attribute it back to the source code in the application.

So we have a procedure, we have a loop nest in that procedure, we've got a chain of inlined functions, we have another procedure, another chain of inlined functions, some inlined templates, a loop, some more inlined templates, an outlined OpenMP loop, and then finally lambda functions that are invoked by the RAJA template-based programming model. So this is RAJA sitting on top of OpenMP, and what we're able to do is reconstruct this complicated context in which your costs are incurred.

A

The thing that's nice about this is that when we're using call stack unwinding at runtime, what we're measuring is the call chain. The real procedures are main and CalcVolumeForceForElems, and then there's an outlined OpenMP routine. All of these other layers with the loops and the inlined code don't cost us anything to gather at measurement time. That all comes from combining our measurement data with the program structure information, where we identify where all the inlined code and loops are. So that comes at no extra cost when you're measuring your program.

A

So, to understand the temporal behavior of an application: I mentioned earlier that we'll get interrupted and then we'll unwind the call chain. Over time, for an individual thread, we may unwind in different places, and we may see call chains that differ to some extent over time. Maybe the top level of the call chain represents main, and then this green may represent some procedure called solve, and maybe it does some sort of preconditioning step followed by some main solve. These call chains represent the individual contexts that we ran over time. By actually keeping all of the call chains individually, what we can do is look at all of the different places where a thread was over time, and then we can do this for each of the threads in each of the MPI ranks.

Conceptually, there's a visibility plane. If you lift it all the way up to the top, you'll see that everybody was in main for the entire execution. Then you move down and look at a lower level of abstraction, and you might see that there's some sort of initialization phase, a solve phase, and a post-processing phase. Then you look a little bit lower, and you start to see detail inside the initialization phase, inside the solve phase, et cetera. You'll see this more when Laksono gives a demo.
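The visibility plane can be thought of as choosing one depth in every recorded call chain. A small sketch, assuming a trace for one thread is a list of (time, call chain) pairs with invented procedure names:

```python
# Hypothetical trace for a single thread: (timestamp, call chain from main down).
trace = [
    (0, ["main", "init", "read_input"]),
    (1, ["main", "solve", "precondition"]),
    (2, ["main", "solve", "main_solve", "kernel"]),
    (3, ["main", "post"]),
]


def view_at_depth(samples, depth):
    """Show, for each sample, the procedure at the chosen call-chain depth.

    If a chain is shallower than the requested depth, show its deepest frame.
    """
    return [(t, chain[min(depth, len(chain) - 1)]) for t, chain in samples]


for depth in range(3):
    print(depth, [name for _, name in view_at_depth(trace, depth)])
```

At depth 0 every sample shows main; at depth 1 the phases appear; deeper levels reveal detail inside each phase, all from the same recorded call chains.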

A

So here's just one static screenshot from our viewer to give you a sense of what's going on. This is actually for a code called FLASH that was developed at the University of Chicago. It uses block-structured AMR, and they simulate astrophysical flashes. This particular execution was from a detonation of a white dwarf star.

When you write your program in MPI, generally you're thinking it's an SPMD programming model: everybody's doing the same thing all the time. Now, what we have here is ranks and threads on the vertical axis and time on the horizontal axis. Looking across the ranks and threads, we see that different ranks are actually executing different amounts of work. This pink represents some particular action inside the application; it might correspond to setting up or pre-processing something. What we find is that on some of the ranks it takes longer than on others. So while your conceptual view is that everybody's doing the same thing, what you find is that they spend different amounts of time doing it. And then this blue in here is where some of them are waiting on a collective communication that occurs after this phase. With this view across ranks and threads and over time, you can see variations that give you some understanding of load balance, load imbalance, or serialization, which enables you to tune your program.
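To put a number on what the trace shows visually, one common way to quantify load imbalance (an assumption here, not a metric named in the talk) is the ratio of the slowest rank's time to the mean, and the wait each rank incurs at the following collective. The per-rank times below are invented.

```python
# Hypothetical time (in milliseconds) each MPI rank spends in the same phase.
phase_time = {0: 80, 1: 100, 2: 120, 3: 100}

slowest = max(phase_time.values())
mean = sum(phase_time.values()) / len(phase_time)

# Fraction by which the slowest rank exceeds the average; 0 means balanced.
imbalance = slowest / mean - 1

# At a collective after the phase, everyone waits for the slowest rank.
wait_time = {rank: slowest - t for rank, t in phase_time.items()}

print(f"imbalance: {imbalance:.2f}")
print("wait per rank:", wait_time)
```

In the trace view, the wait times are what show up as the band of a different color where ranks sit in the collective.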

A

now, in this view,. There's a cursor,.

A

And that shows where we are.

A

And then for that, we have an individual call chain.

A

And so this tells you the complete call chain at that particular point in time for that mpi rank.

A

You might think it's got to be very expensive to collect all this information.

A

Actually, every sample is only 12 bytes.

A

It's an eight-byte timestamp, and a four-byte identifier.

A

For a node in a tree. And then, by taking the node in the tree and unwinding, every node has a path up to the root.

A

And so by just identifying the node, we know the complete path. And so we're able to just kind of gather this information.

A

For free, and that's interposed on the data.

A

That we gather at runtime. Laksono will tell you a lot more about the trace viewer and using it.
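To make the 12-byte sample concrete, here is a minimal Python sketch of the idea just described: a record holding an eight-byte timestamp and a four-byte tree-node identifier, with the full call chain recovered by walking parent links to the root. The byte order, the field types, and the toy tree (`PARENT`, `NAME`) are assumptions for illustration, not HPCToolkit's actual trace file format.

```python
import struct

# Assumed layout for illustration: 8-byte unsigned timestamp, then a
# 4-byte unsigned node id, little-endian with no padding (12 bytes total).
RECORD = struct.Struct("<QI")

# Toy calling context tree: node id -> parent id (None marks the root).
PARENT = {0: None, 1: 0, 2: 1, 3: 1}
NAME = {0: "main", 1: "solve", 2: "exchange_halo", 3: "compute_flux"}


def pack_sample(timestamp, node_id):
    """Encode one trace sample as 12 bytes."""
    return RECORD.pack(timestamp, node_id)


def call_path(node_id):
    """Follow parent links up to the root and return the path root-first."""
    path = []
    while node_id is not None:
        path.append(NAME[node_id])
        node_id = PARENT[node_id]
    return path[::-1]


sample = pack_sample(1_617_700_000_000, 3)
timestamp, node = RECORD.unpack(sample)
print(len(sample), call_path(node))
```

Because the sample stores only a node identifier, the cost of keeping a full call chain per sample is paid once in the tree rather than once per sample.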

A

So now, let's get into our support for performance analysis of GPU-accelerated applications.

A

So I mentioned that hpctoolkit has a core.

A

For measuring gpu-accelerated applications.

A

And on top of that core, there are measurement interfaces.

A

For nvidia gpus, amd gpus, intel gpus.

A

And then there's an OpenMP layer,

A

And OpenCL, that are sort of independent of the individual GPUs. So in this talk, we're going to focus on.

A

The support for nvidia gpus.

A

And so we're principally using this NVIDIA layer.

A

Or an NVIDIA layer inside HPCToolkit that is leveraging NVIDIA's CUPTI interface.

A

And so CUPTI is NVIDIA's performance tools interface.

A

That gathers all the information about the GPU performance. So, some highlights of HPCToolkit's support for GPU-accelerated codes: we unwind the call stack to identify the CPU calling context.

A

For every time you invoke a gpu api.

A

So we can tell where your kernels were launched, or where your copies were incurred. And we map all that back to these calling context profiles.

A

And we're able to show you that and you'll see examples of this later.

A

So internally, HPCToolkit is employing some novel data structures for fast and non-blocking inter-thread coordination.

A

We've got the application threads that are launching things.

A

There's a GPU-monitoring thread that is gathering data off the GPU. And then we have to take that measurement data and attribute it back to the application threads.

A

And so it's important that we have fast data structures for doing this, because we don't want to disturb the execution of your application.

A

So hpctoolkit has support for binary analysis of gpu code.

A

To attribute fine-grain performance measurements.

A

So for NVIDIA GPUs, the fine-grain performance measurements.

A

Are in the form of PC samples. And so we can do this for NVIDIA, Intel,

A

And AMD GPU binaries. We use a novel technique to reconstruct an approximate GPU calling context tree.

A

To attribute the costs to functions on the gpu.

A

And I'll tell you more about that in a minute.

A

In a single run, we can gather a rich set of metrics.

A

That we derive from the pc sampling measurements.

A

That can give you some insight into your gpu performance.

A

And then we have a new version.

A

That is performing scalable analysis of sparse representations of performance measurements. And this is actually sort of a coming attraction. It's in a branch, but it's not in the versions that are installed on Cori and Summit at the moment.

A

So what are we doing at runtime?

A

So as the program runs... I showed you these calling context trees earlier. And so here we have some...

A

A couple of nodes in a calling context tree. And so this might be like main calls some function...

A

Solve, which then launches a kernel on the GPU.

A

And so we have CPU calling contexts, we have GPU kernels,

A

And then inside the GPU kernels, we have GPU machine instructions. And so when we measure something that's running on one of these heterogeneous supercomputers, we may be measuring things like time on the CPU.

A

We measure how much time we're spending.

A

In each of the functions here. We're doing this using sampling. And then, for a kernel, we have information.

A

At the kernel level about how many registers you're using,

A

And the time that was spent in the kernel. We also have information about the sampling frequency and whatnot. And so there's a collection of kernel-level information.

A

And then, for each of the machine instructions inside the kernel, we have some detailed information.

A

Like instruction stalls, and we can find out, like, what's the total number of instruction stalls?

A

So, like, how much time do we spend running?

A

How much time do we spend stalled? And how many of those stalls are waiting for memory?

A

And how many of those stalls are due to synchronization.

A

Or waiting for values from functional units? So there's a lot of detailed information we can measure using NVIDIA's PC sampling.

A

So what we get out of this is we get.

A

These code-centric profiles of gpu-accelerated code.

A

And so here we're showing a calling context that has,

A

On top, we have it on the CPU, and then inside a GPU kernel,

A

We have some contexts that involve loops and inline code.

A

And so we have fine-grain metrics like number of instructions executed. We have coarse-grain metrics like seconds that we spent executing a kernel. And so the kernel metrics just apply to the kernel, whereas the fine-grain metrics apply inside the kernel itself. And then we have derived metrics like GPU utilization.

A

So there's sort of three different things we can get:

A

We have fine-grain metrics, coarse-grain metrics, and derived metrics for GPU kernels.

A

So when we're measuring GPU performance at runtime, there's three categories of threads.

A

There's application threads; there's a monitoring thread, generally one per process, that's getting the information that's streaming off the GPU.

A

And then internally, there's some tracing threads.

A

That are used by our tool to gather traces for the GPU execution streams.

A

You may ask, why am I telling you this?

A

This seems a lot like internal detail inside our tool.

A

Why do you care? Well, if you're using a job launcher like jsrun,

A

That's very carefully controlling how many cores you get and the mappings of things.

A

We have some resources that we're using.

A

We're using some threads inside our tool,

A

And you may need to provision some extra cores.

A

To run our tool threads as well.

A

And so that's just a cautionary tale.

A

So I don't have any specific advice to offer you.

A

Other than that, this probably needs to be considered.

A

So a safe way is to sort of over-provision your job.

A

But there's probably a better way to do it. So the threads are interacting in various ways.

A

So on an application thread, we're creating a correlation that says: anytime I launch a GPU kernel, we gather the calling context.

A

Where the kernel was launched.

A

Then, we get information from the GPU and it's associated.

A

With a particular correlation that was recorded.

A

When we launch a kernel, we note that there's a correlation ID.

A

On the thread that launches the kernel, and then the monitoring thread says,

A

"Okay, I've got some measurements that need.

A

To be correlated with a particular calling context." And so these particular measurements belong to thread one with a particular calling context.

A

And so to attribute measurements, we're taking information from the gpu monitoring thread.

A

And we're feeding it back to one of the application threads.

A

And then, finally, we're recording traces, and so the GPU monitoring thread is getting the information.

A

About what kernel executions occurred and what time intervals they occurred in.

A

And then, it's handing them to some tracing threads that are actually writing this stuff in files as your application runs.

A

So I don't want to go into this in too much detail,

A

But there's your application threads,

A

And we're monitoring the application threads with some callbacks. So when GPU operations get launched and completed,

A

We get some callbacks. There's some tracing threads that are responsible.

A

For recording traces into trace files.

A

And then, internally, we're monitoring the information.

A

About what was launched where, and then we're monitoring information back from the gpu.

A

That says, so here's what happened.

A

And then we send the measurement data back to the application threads to say: record that you launched a kernel and that it took so long, and record that in your profile.

A

And so that's what's happening under the hood.
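The correlation step described above can be sketched in a few lines of Python. This is a single-threaded illustration only: the class name `CorrelationTable`, its methods, and the dictionary-based bookkeeping are hypothetical, and the real tool hands data between application, monitoring, and tracing threads through non-blocking structures that this sketch does not attempt to model.

```python
from collections import defaultdict


class CorrelationTable:
    """Hypothetical bookkeeping: correlation id -> launching thread and context."""

    def __init__(self):
        self._next_id = 1
        self._launches = {}
        # thread id -> {calling context -> accumulated kernel seconds}
        self.profiles = defaultdict(lambda: defaultdict(float))

    def record_launch(self, thread_id, calling_context):
        """Application-thread side: remember where a kernel was launched."""
        cid = self._next_id
        self._next_id += 1
        self._launches[cid] = (thread_id, tuple(calling_context))
        return cid

    def attribute(self, cid, kernel_seconds):
        """Monitoring-thread side: route a measurement back to its launcher."""
        thread_id, context = self._launches.pop(cid)
        self.profiles[thread_id][context] += kernel_seconds


table = CorrelationTable()
cid = table.record_launch(thread_id=1, calling_context=["main", "solve", "launch_kernel"])
table.attribute(cid, kernel_seconds=0.25)
print(dict(table.profiles[1]))
```

The point of the identifier is that the measurement arriving from the GPU carries no calling context of its own; the context is looked up from what the launching thread recorded.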

A

So one of the things that's important.

A

For understanding GPU performance is that we compute an approximation of GPU calling contexts.

A

So you can better understand your application performance.

A

So when you're using GPU code from, like, C++ template-based programming models, it turns out that the GPU code is complex.

A

And so with NVIDIA GPUs, I mentioned that we can collect these PC samples.

A

We can find out where you spend your time executing instructions. But those don't give us any calling context information; a sample just says, I got sampled and I was.

A

At machine instruction 65 at the following address.

A

And so what other tools do is they produce flat profiles.

A

Now, when we have complex C++ applications, here's what a flat profile looks like.

A

So this was from a code called the RAJAPerf suite.

A

So it's just measuring various raja operations.

A

And so what this was doing is it was launching a dot product operation, and then, while executing optimized code for the dot product operation, all of these functions were observed.

A

And so there's a whole collection of things that were generated through the template metaprogramming.

A

And so you just see all of this and you say, "Well, how does that relate to a dot product?"

A

That's what these flat profiles look like.

A

So in fact, for executing a dot product, there were 25 functions that execute on the GPU.

A

And I would argue this is not a good way to analyze your code. So what we did is HPCToolkit reconstructs.

A

Approximate gpu calling contexts.

A

I'll talk more about how it does that in just a minute.

A

And let me just show you what the result is. And so for RAJA, we see this is the CPU calling context,

A

And here it is launching a kernel.

A

And then we compute this...

A

We reconstruct this call chain on the GPU,

A

That shows that you're implementing.

A

A forall CUDA kernel and then a privatizer inside RAJA.

A

Then we can see that there's a dot operation which is invoking reducesum, which invokes a reduce template. And then, there's some details.

A

And finally, there's a loop down in the reduce template.

A

That's actually doing the work. And so what I would say is that seeing this information.

A

In context, this looks a lot more like the conceptual model of your code than what you see if you just get these flat profiles that have 25 functions in there. And so we believe that our approach for reconstructing these calling contexts, and using binary analysis to do that.

A

Is actually pretty important for understanding these complicated codes that people are developing as part of the doe exascale computing project.

A

[Participant] What's the difference.

B

Between those two lines in that single box that had the same label?

B

Loop at reduce.hpp:203: why are there two lines?

A

Oh, so there's inline code.

A

From line 203, and it just says, like, I've got some inline stuff, and I can see that there's a bunch of code that came from line 203. And we can see that on line 203, in fact, there are some nested loops.

A

Okay, so then the nested loops provide context.

A

For the stuff that's really happening. So you'll get a chance to see this.

A

When you look at the PC sample profiles for some of the sample applications.

A

So the one thing I want to say at the moment is that NVIDIA doesn't have great DWARF information.

A

That maps every machine instruction back to its full provenance, where it came from.

A

And so often, we see there's a machine instruction.

A

And it came from the following source line.

A

In a particular function, like it's in some function c, but what we really want to know is that there was a call to a, then a call to b, and a call to c,

A

And then all of them got inlined. And so the code for c got mixed into b and mixed into a.

A

And so in CUDA 11.2, NVIDIA built.

A

Some better line mapping information, and we have not yet used that inside hpctoolkit.

A

Because it involves making some changes.

A

To the way line maps are analyzed, and making some changes to the Dyninst binary analyzer. But hopefully the additional line mapping information.

A

That they've provided will enable us.

A

To compute better reconstructions of the inline code.

A

So that, I think, is a coming attraction. So here, it just says, like, this is inline code that came from this file. But we don't know what function it came from, because we just don't have that information; we just know that it came from somewhere. Whereas, with the new information that they're producing,

A

It should look like there's a call.

A

To an inline function on line 143; there's a call to this inline function. And then, inside that inline function, on line 723,

A

There's another call to another inline function. And so that's the level of detail that we expect to eventually get on the GPUs.

A

Does that answer your question, Steve?

B

[Steve] Yes, yes, thanks.

B

[John] Okay. More than (indistinct).

B

Okay, so one of the things that we do.

A

Is we reconstruct these calling context trees. And so the problem is that the GPU monitoring APIs.

A

Don't collect call paths inside the GPU kernels.

A

All we get are these flat PC samples. And so we might see some samples inside some function f.

A

And that function f might be invoked from different call sites. And so we need to decide how to attribute costs among each of those call sites.

A

So the solution we have is that we can do a reconstruction.

A

Of this GPU calling context tree using the flat instruction samples. And then we also use information about static call chains.

A

That we get from analyzing the binary.

A

So let me tell you just a bit about how this works, so you have some understanding of what's going on under the hood. So we analyze the GPU binary and we look at...

A

For every function,

A

We look at the other functions it calls. So maybe inside some function f, we can see that it calls functions g and h. And so we build this static call graph.

A

That says: f calls g and h. And then, maybe, h calls i and j.

A

And so we have a collection of static call graphs.

A

And then, when we measure the application,

A

We actually collect samples of call instructions.

A

So we can see that there's a call instruction,

A

Inside f to function g and we measured that it occurred.

A

And so now we know that g was actually called inside f at runtime. So we take our static call graph and then we initialize it with call counts.

A

That we get from either using sampling or, if we're using instrumentation, we can use instrumentation to tell us how many times each of these calls was invoked.

A

And then, it may turn out that we're sampled inside.

A

Some function g, and we never got any calls to g...

A

Any call instructions that call g that were sampled.

A

And so then, based on the information that we have in the static call graph, we say, "Well, I know that I was in here."

A

And then I look at where I was called from.

A

And so in this case, b2 is only called from one place.

A

And so we say, "I must've been called here." But if it was called from multiple places and we didn't see any of the calls, then we make an assumption that, "Well, I don't know where I was called from,

A

Let's say that I was..." We'll approximate and say, let me assume that each of my call sites reached me equally often.

A

I just talked about this. So the other thing that we do is there might be recursive.

A

Or mutually recursive functions. And so we can't really do a good job of figuring out.

A

Who called who in this case. We just know that there was a bunch of calls between one and another, and we can't tell who is the root and who is the leaf.

A

So if we're just collecting samples and we find that some of the samples are in d and some of the samples are in e, but d calls e and e calls d, we just say fine, they're a strongly connected component.

A

So that's a technique that refers to the fact that there were cycles in the call graph.
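For readers who have not met strongly connected components before, here is a small Python sketch that finds them with Tarjan's algorithm on a toy call graph in which d and e call each other. The graph, the function name, and the choice of Tarjan's algorithm are illustrative assumptions; the talk does not say which algorithm HPCToolkit uses.

```python
def strongly_connected_components(graph):
    """Return a list of sets, one per strongly connected component.

    graph maps each function name to the list of functions it calls.
    """
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the first-visited node of a component: pop it off the stack.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


# Toy call graph: a calls b and c, both reach the mutually recursive pair d/e.
call_graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["e"], "e": ["d"]}
sccs = strongly_connected_components(call_graph)
print(sccs)
```

Collapsing each component into a single node leaves an acyclic graph, which is what makes it possible to push costs from callees up to callers afterwards.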

A

And so we compute all of them...

A

After we compute the strongly connected components in the graph, then we say, "Okay, so for the strongly connected component.

A

We can see that there were costs that were incurred here and that we had some invocations from b and some invocations from c." And so then we take all the costs here, and we apportion them out and we say,

A

"Okay, so there's two calls here, and six calls there,

A

So we end up with two eighths.

A

And six eighths of the cost here." So what we're doing is, anytime a function is invoked, we're making an assumption that all invocations have the same cost.

A

So this is known as the gprof (indistinct). This is what the gprof profiling tool back in the '80s did for apportioning costs among callers.

A

And so, if we don't have any information about how much cost was incurred inside a particular function from a particular call site, we just make the assumption that, "Well, let's just assume that all calls are equal, and then we'll apportion based on that."

A

And so what we have is that we know precisely where calls are,

A

And then we're approximately attributing the cost among the calls, based on the limited information that we have from sampling call instructions.
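The apportioning rule just described is simple enough to write down. The Python sketch below splits a callee's cost among its call sites in proportion to observed call counts (the two-calls-versus-six-calls example from the talk), and falls back to an equal split when no call instructions were sampled for any site. The function name `apportion` and the site labels are hypothetical.

```python
def apportion(cost, call_counts):
    """Split a callee's cost among call sites, assuming every call costs the same.

    call_counts maps a call site to the number of calls observed from it.
    If no calls were observed at all, assume each site reached the callee
    equally often.
    """
    if not call_counts:
        raise ValueError("need at least one call site")
    total = sum(call_counts.values())
    if total == 0:
        share = cost / len(call_counts)
        return {site: share for site in call_counts}
    return {site: cost * count / total for site, count in call_counts.items()}


# Two calls from b and six from c: b gets 2/8 of the cost, c gets 6/8.
observed = apportion(8.0, {"b": 2, "c": 6})
# No sampled calls from any site: fall back to an equal split.
unobserved = apportion(9.0, {"x": 0, "y": 0, "z": 0})
print(observed, unobserved)
```

The call sites themselves are exact because they come from the binary; only the division of cost among them is an estimate.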

A

So Keren will show you some examples of that later.

A

We also have some support for openmp target operations.

A

And so we can reconstruct the full calling context of call chains. And so here's an outlined function inside OpenMP, and here it is launching a kernel on the GPU.

A

And so this particular call chain.

A

Is reconstructed with the help of a tooling interface that we built.

A

Inside the LLVM OpenMP runtime.

A

And I'll say a little more about that later.

A

In the second talk. So we can also reconstruct information.

A

From these template-based programming models and see when they offload kernels onto the GPUs, and attribute the cost of the GPU kernels back to the context where the templates ended up invoking things on the GPU.

A

So finally, when we're collecting.

A

These measurements on the GPU, we measure using PC sampling, and it turns out that you can't measure using PC sampling.

A

And also collect other metrics.

A

So NVIDIA's Nsight Compute will run like nine passes.

A

Over a kernel to collect multiple metrics. What we do is use a single pass of measurement, and then we compute some derived measurements from PC samples and other activity records to compute things like utilization of the streaming multiprocessors and GPU occupancy.

A

And we do that using a single pass. And so that means that our measurement overhead is significantly lower.

A

So here is a sample execution trace.

A

Of this Nyx code that's developed.

A

As part of the ecp program.

A

As I recall, it's a cosmological code, but I don't remember the details. So what we had here was an execution that was running on Summit, and we have these trace lines that are showing the activity on the GPUs.

A

And so at this level, it looks rather cluttered. And so I'm just going to zoom in on a little region in the execution.

A

And now, you can see that there's actually some structure.

A

So we have some information about the cpu trace lines.

A

And then we have trace lines for each of the GPU streams.

A

That are running on the NVIDIA GPUs. So there's streams on NVIDIA GPUs,

A

There's (indistinct) on amd gpus and on intel gpus.

A

That are the abstractions for where the gpu costs are incurred.

A

So, while tools like NVIDIA's have a trace-based interface.

A

Where you can look at these individual trace lines,

A

For individual GPUs.

A

With this tracing interface, we're trying to be able to look at things across an entire cluster, and then be able to gather some statistics about that, and look at how your program is using the cluster overall,

A

Instead of just how you're using one GPU or using one stream in a GPU.

A

So a coming attraction is that, when we added support.

A

For measuring GPU executions, we're collecting.

A

A lot of GPU metrics; in some cases, it's 100 or 200 metrics:

A

Various kinds of instruction stalls, various information about kernels, various information about data copies of different styles,

A

Whether it's host to device, device to host, device to device, et cetera. And so the problem that we soon came up against was that inside HPCToolkit, originally,

A

When we were collecting measurements for a CPU, we were using a small number of hardware performance counters; maybe you're looking at cycles, instructions, and cache misses. Now, we've got 200 metrics for GPU contexts.

A

So for a GPU machine instruction, we have information about all of these instruction stalls.

A

And so what we quickly found was that.

A

If we allocate space for all of these gpu measurements,.

A

For every point in one of these calling context trees, in fact, we have a whole lot of locations in the calling context tree that are CPU, that will have zeros for all of the GPU metrics.

A

And so we end up with a large amount of data where most of it is zeros. And so we've switched to collecting information.

A

In a sparse format. And now we have a new tool for post-processing this.

A

And putting out our data in sparse formats on disk,

A

And so the sparsity really matters.

A

When we were looking at Nyx and LAMMPS,

A

Nyx is an astrophysical code and LAMMPS is a molecular modeling code.

A

We found that using sparse data structures for Nyx.

A

Reduced the space needed for collecting the measurements.

A

By 21x, and then for LAMMPS, using sparse representations.

A

Of the result data from our performance analysis gave us a 337x reduction in space.
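A toy Python sketch shows why sparsity pays off when most calling contexts are CPU-side and carry zeros for every GPU metric. The sizes below (1000 contexts, 200 metrics, 10 GPU contexts with 50 nonzero metrics each) are made-up numbers for illustration and are unrelated to the Nyx and LAMMPS measurements; the dictionary-of-nonzeros representation is likewise an assumption, not HPCToolkit's on-disk format.

```python
def to_sparse(rows):
    """Keep only nonzero cells as {(context, metric): value}."""
    return {
        (ctx, metric): value
        for ctx, row in enumerate(rows)
        for metric, value in enumerate(row)
        if value != 0
    }


NUM_CONTEXTS, NUM_METRICS = 1000, 200

# Dense table: one slot per (context, metric), almost all zero.
rows = [[0] * NUM_METRICS for _ in range(NUM_CONTEXTS)]
for ctx in range(10):          # pretend only 10 contexts are GPU instructions
    for metric in range(50):   # each with 50 nonzero GPU metrics
        rows[ctx][metric] = 1

dense_cells = NUM_CONTEXTS * NUM_METRICS
sparse = to_sparse(rows)
print(dense_cells, len(sparse), dense_cells // len(sparse))
```

A real sparse format also has to store the indices alongside the values, so the saving in bytes is smaller than the ratio of cell counts suggests.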

A

So if you're using a very large-scale code.

A

And you're running it at large scales, we may find that the data footprint for hpctoolkit is quite large.

A

When you're collecting GPU measurements. And so I just wanted to offer this as an indication.

A

That things are changing and that within the next couple of months, we expect to commit this support.

A

For the sparse representations that will have a dramatic reduction in both the size of the measured data and the size of the analysis results.

A

So we use our tools ourselves.

A

For looking at our own execution performance. And so this is actually a trace.

A

Of our own hpcprof analysis tool.

A

And so we collected 64k profiles for amg2006.

A

This is a doe benchmark for algebraic multigrid.

A

This was not GPU-accelerated, but we ran this on (indistinct). So we collected traces for 64K ranks and threads.

A

And so using hpcprof-mpi, we did the analysis.

A

On eight KNL nodes, and inside every MPI rank,

A

We had one mpi rank per node, and 128 threads per rank.

A

And so we're doing a lot of...

A

We're doing MPI-based parallelization and threading ourselves. And we use our tools to analyze our execution time.

A

And we were able to analyze 64K profiles in 184 seconds.

A

And so what you can see is that we're using our tools to understand this. All these blue regions are synchronization, where we're waiting for one thread to finish something before we move to the next phase.

A

And so this is the kind of thing that you'll see when you're looking at some of the examples: that there are these phases, where you spend time waiting for someone to complete, and then there's a synchronization, and then you move to a second phase.

A

And so you'll be doing the same kind of analysis that we did of our own tool.

A

Of looking at our own tool's execution; you'll be looking at your application executions in the same way. So then, finally, a couple of comments about some status and ongoing work.

A

So for today, our focus is on NVIDIA GPUs,

A

Where we're measuring using the CUPTI interface. We get fine-grained measurement support with PC sampling,

A

We get tracing support with CUPTI, and then we do binary analysis of loops and inline code using information from nvdisasm.

A

So this is NVIDIA's disassembler.

A

We asked them for an API, and they refused to give us one,

A

Because they saw it as outside the scope of their contract.

A

And so we end up running this nvdisasm tool, dumping text files, and then parsing them ourselves to do binary analysis of the assembly code,

A

And the control flow using the dyninst performance tools.

A

Intel has their own story. They have OpenCL and Level Zero runtimes.

A

They do not have a layer like CUPTI.

A

Instead, there's callbacks as kernels are launched.

A

And completed inside OpenCL and Level Zero.

A

We're using instrumentation to collect measurement data.

A

And then, we're using this Intel Graphics Assembler.

A

That has an API for cracking machine instructions, and we use Dyninst for analysis. And AMD has a similar story to NVIDIA, where they have a tool layer called roctracer,

A

That's quite like the CUPTI layer.

A

It provides support for measuring kernel executions,

A

And one can build traces off that.

A

And we've been working with the University of Wisconsin.

A

To build some support for instrumentation of amd gpus.

A

Like Intel's Pin, or GTPin for their GPU instrumentation.

A

We've been working with the dyninst team to build some instrumentation tools for measuring gpu execution.

A

And then also, we've been working with the dyninst team on building an instruction decoder.

A

That will enable us to do binary analysis. So we already are able to do preliminary analysis of AMD binaries. And so the ongoing work is to improve the support.

A

For indirect control flow like switch statements and indirect branches, and (indistinct). So I want to make the comment here that the kind.

A

Of detailed performance analysis that we're trying to do, where we're coming up with costs that we're attributing to every GPU machine instruction.

A

This requires a lot of support across the entire software stack. So the hardware has to have support for fine-grain measurements and attribution. And so this is the responsibility of the GPU vendors.

A

And so NVIDIA's PC sampling approximately meets our needs.

A

So then, there's support for appropriate user interfaces.

A

For introspection and analysis. So Linux's perf_events interface; the dynamic loader.

A

So that we can track shared libraries, it provides us the LD_AUDIT interface. We've been working with Red Hat on LD_AUDIT.

A

And then elfutils is something that enables us.

A

To gather data out of binaries and look at line maps.

A

And so there's some extensions that are needed to deal with the binaries from CUDA 11.2.

A

And so we've been working with red hat on those.

A

There's also the GPU vendor software stacks.

A

For controlling that. So beyond just the hardware support, there's things like CUPTI that has interactions.

A

With the kernel driver and the runtime and tooling api.

A

And so we've been working with the gpu vendors.

A

To refine their definitions of these APIs to better support tools.

A

Also, if we're going to get good attribution.

A

Of fine-grain performance measurements, we need high-quality DWARF information from the compilers. And so we've been working with the vendors and the LLVM community in order to make sure.

A

That each machine instruction is associated with full call chains.

A

Inside the runtime, we need the runtime to maintain information and map computations back to the source view. So we're working with the OpenMP standards committee.

A

In fact, I was involved in leading the definition.

A

Of the effort to define the OpenMP tools interface.

A

That was finally made available in OpenMP 5.

A

And the implementations are still a work in progress.

A

And then finally, the performance tools have to gather the measurements using sampling.

A

And callbacks and reading event streams.

A

And map them back to the source. And so we attribute things precisely whenever that's possible. And in the cases where we have flat PC samples.

A

In the GPUs, then we want to attribute things to GPU calling contexts and loops in the GPU code.

A

And also using program slicing to attribute inefficiencies.

A

From where they're observed back to their cause.

A

And so, for instance, we might have a machine instruction like an add on a GPU.

A

That's adding values from a couple of registers.

A

While we might see that there's a memory stall on this add instruction, it isn't the add instruction itself that's causing the stall. The add instruction is where the stall is observed.

A

The stall is a result of some load that hasn't completed yet. And so we try to trace back through the control flow to find out what was causing the stalls.
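The add-versus-load example can be illustrated with a very small backward walk over straight-line code. In this Python sketch, a memory stall observed at an instruction is blamed on the most recent loads that define its source registers. The `Instr` record, the register names, and the restriction to direct definitions in straight-line code are all simplifying assumptions; real slicing has to follow control flow and longer dependence chains.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    pc: int        # instruction address
    op: str        # opcode name, e.g. "load" or "add"
    dest: str      # register written
    srcs: tuple    # registers read


def blame_memory_stall(program, stalled_pc):
    """Return the pcs of loads that directly feed the stalled instruction.

    Walks backward from the stalled instruction and finds the most recent
    definition of each source register; definitions that are loads are the
    likely causes of a memory stall observed at stalled_pc.
    """
    position = {ins.pc: i for i, ins in enumerate(program)}
    i = position[stalled_pc]
    wanted = set(program[i].srcs)
    culprits = []
    for ins in reversed(program[:i]):
        if ins.dest in wanted:
            wanted.discard(ins.dest)  # an earlier write to this register is dead
            if ins.op == "load":
                culprits.append(ins.pc)
    return sorted(culprits)


program = [
    Instr(0, "load", "r1", ("r0",)),
    Instr(8, "load", "r2", ("r0",)),
    Instr(16, "mul", "r3", ("r4", "r5")),
    Instr(24, "add", "r6", ("r1", "r2")),
]
print(blame_memory_stall(program, 24))
```

Here the stall shows up at the add at address 24, but the sketch reports the two loads at addresses 0 and 8 as its causes.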

A

And so Keren has built a tool... Keren Zhou on my team has built a tool called GPA that traces these costs back.

A

To their causes. And so we're using the University of Wisconsin's Dyninst.

A

Binary analysis infrastructure as the framework for building these kinds of tools.

A

So what I wanted to give you a sense of is just that.

A

There's a lot of moving parts in here. And so, if something doesn't work out exactly the way that you would like with one of your applications,

A

It's possible that it's not our tool's fault.

A

It may be a fault of something inside the OpenMP implementation, or some issue with the CUPTI measurement infrastructure.

A

Or certainly we found that there are issues with glibc; there's some bugs in the dynamic loader and whatnot.

A

And so any of these things can affect the usability of our tool.

A

Briefly, some ongoing work.

A

I mentioned this GPU Performance Advisor tool for NVIDIA GPUs. So this is something that's available on GitHub.

A

So this is actually an extended version of HPCToolkit that attributes... It analyzes instruction stalls and it uses backward slicing to figure out where the stalls came from. And it offers some advice about.

A

How to fix performance problems in your code.

A

And we're also...

A

Laksono is going to tell you about this integrated interface for looking at profiles and traces. And we've been doing some work on identifying serialization in both CPU and GPU traces,

A

And that's not in the version that's released on the clusters yet. Another thing that's work in progress is collecting GPU performance counter measurements to support roofline analysis.

A

So if you don't know something about roofline analysis, you probably want to.

A

So this is a way of understanding how close you are to the maximum possible performance.

A

That you could achieve with your code. So this is a way to know that your tuning is done.

A

And so it's very useful for assessing how well you're doing relative to what your program is trying to accomplish.
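For readers new to roofline analysis, the basic model fits in one line: attainable performance is the smaller of the machine's peak compute rate and its memory bandwidth multiplied by the code's arithmetic intensity (floating-point operations per byte moved). The Python sketch below applies that formula. The peak and bandwidth figures are hypothetical round numbers, not the specifications of any particular GPU, and the function names are made up for this example.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def fraction_of_roof(measured_gflops, peak_gflops, bandwidth_gbs, flops_per_byte):
    """How close a measured rate is to the roofline bound for this intensity."""
    return measured_gflops / attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte)


# Hypothetical machine: 7000 GFLOP/s peak, 900 GB/s memory bandwidth.
PEAK, BANDWIDTH = 7000.0, 900.0

memory_bound = attainable_gflops(PEAK, BANDWIDTH, 2.0)    # low intensity kernel
compute_bound = attainable_gflops(PEAK, BANDWIDTH, 10.0)  # high intensity kernel
print(memory_bound, compute_bound, fraction_of_roof(900.0, PEAK, BANDWIDTH, 2.0))
```

A kernel running at a large fraction of its bound has little headroom left unless its arithmetic intensity can be raised, which is the sense in which the model tells you when tuning is done.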

A

So we expect to have support for that eventually inside hpctoolkit.

A

Right now you can do this inside vtune or nsight.
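As a rough illustration of the roofline idea described here, the following is a minimal sketch, not anything taken from hpctoolkit. It assumes the simplest form of the model, where attainable performance is the smaller of peak compute and memory bandwidth times arithmetic intensity. The function names and the machine numbers are hypothetical.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, arithmetic_intensity):
    """Simplest roofline bound.

    arithmetic_intensity is in flops per byte moved to or from memory.
    The bound is whichever limit is hit first: peak compute, or
    memory bandwidth multiplied by flops-per-byte.
    """
    return min(peak_gflops, peak_bandwidth_gbs * arithmetic_intensity)


def fraction_of_roof(measured_gflops, peak_gflops, peak_bandwidth_gbs,
                     arithmetic_intensity):
    """How close a measured kernel is to its roofline bound (0.0 to 1.0)."""
    roof = attainable_gflops(peak_gflops, peak_bandwidth_gbs,
                             arithmetic_intensity)
    return measured_gflops / roof


if __name__ == "__main__":
    # Hypothetical machine and kernel numbers, for illustration only.
    PEAK_GFLOPS = 7000.0        # assumed peak compute rate
    PEAK_BANDWIDTH_GBS = 900.0  # assumed peak memory bandwidth
    kernel_intensity = 2.0      # assumed flops per byte for one kernel
    kernel_gflops = 900.0       # assumed measured rate for that kernel

    roof = attainable_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, kernel_intensity)
    frac = fraction_of_roof(kernel_gflops, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS,
                            kernel_intensity)
    print(f"roof = {roof:.0f} GFLOP/s, kernel is at {frac:.0%} of the roof")
```

A kernel that sits close to its roof has little headroom left at that arithmetic intensity, which is the sense in which roofline tells you that tuning is done.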

A

We're updating hpctoolkit to use

A

The new cuda 11.2, which has...

A

So it has more information about inlining and there's an emerging version of cupti that has a lower overhead.

A

Can't say much about that, but just look forward to improvements in the future.

A

And we're also working on improving the scalability of our measurement and analysis, improving the measurement on amd and intel gpus.

A

And also improving things so that you can support analysis.

A

Of machine learning frameworks that are based on python.

A

So finally, a few final remarks.

A

So it's nice to work with national labs and have early involvement in the big procurements.

A

It amplifies our ability to affect the vendor hardware and software in the near term. So I mentioned that we're dissatisfied

A

With nvdisasm as a way of getting information

A

About nvidia gpu binaries.

A

Well, nersc has not signed their final check on perlmutter yet.

A

And so they still have some leverage. So maybe they can work with us to request some things from nvidia.

A

So being involved in national procurements means that we can get some of the things that we need in order to build better tools. And it's not just something that we need.

A

If anyone's trying to build tools with these capabilities, they need certain features.

A

And so having some leverage always helps get these things.

A

One problem, though, is: if we're not involved in the procurement early enough, then a statement of work can get written that doesn't have anything about providing, say, an api to crack machine instructions

A

On nvidia gpus. And so that's what happened with the coral-1 project,

A

Where summit and sierra got delivered and we didn't get the api we wanted. I asked for it, but contracts had already been signed before I asked, and so nothing happened. Nvidia's now recognizing that maybe they ought to try

A

And meet some of the needs of the community a little better.

A

And then finally, I'd just comment on the software development challenges: building a set of tools that understand all of these things at the lowest possible level is pretty challenging.

A

And that my team is currently working on building tools for three different gpu software stacks.

A

For nvidia, intel, and amd gpus.

A

And that's ridiculous for a very under-resourced group

A

For doing this. I could use a team that's like three times the size.

A

We're also working on building capabilities ahead of what the vendor hardware and software are capable of.

A

And we're waiting for the hardware to catch up.

A

We understand some of what's coming and some of the features that are coming. And so we're actively working on that.

A

The software stacks for amd and intel are a work in progress at this point.

A

And, as I mentioned, relying on these vendor closed-source components is a challenge, in that the interfaces, like even, say,

A

The cupti api, it tells us like, "here are the things that you can call," but it doesn't say anything about, "oh, by the way, we're creating threads behind the scenes and we're creating lots of them." And so as a tool that measures thread creation and whatnot,

A

We need to know what's happening inside, or experimentally we determine what's happening inside and say, "oh, cuda or cupti is creating lots of threads just for measuring," and so we shouldn't measure the threads that they're creating for measuring. And so that leads to some exploratory development

A

Inside our tools that just makes it a little bit more difficult.

A

Okay. So that's what I wanted to say as like a high-level overview of what we're doing and why we're doing it.

A

And then laksono and keren are going to talk about the details of actually using these tools to look at some examples.

A

So I'd be happy to take a couple of questions and then probably a break is in order.

C

[Participant] Thanks, john. Anyone have any question, you can unmute yourself and speak.

D

[Participant] I have a question.

D

[John] Yes. [Participant] Can you elaborate on the workflow?

A

So the workflow, pretty much...

A

Well, you'll see this from keren in detail, but generally, what you do here is

A

You will compile your application.

A

You'll launch it with hpcrun, which collects some measurements.

A

And then, while it's running, you can analyze

A

Your cpu binary with hpcstruct.

A

As it runs, your application may be actually.

A

JIT-ing some gpu binaries. And so after the application has run,

A

Then you use hpcstruct to analyze.

A

The gpu binaries that are created. And so keren's going to cover that in detail.

A

So in general, you say: hpcrun and then hpcstruct to analyze some machine code and then you use hpcprof to combine the measurements.

A

And then that will produce data that you then look at with the user interface.

A

And so the details of exactly how to do that will become clear.

A

When keren shows you what that is.

A

And then actually the tutorial examples

A

All have this stuff automated. So we have a set of examples where you should be able to say make build, make run.

A

And if you're not on summit, then you can say make view.

A

And so the build part is going to compile your application.

A

And so on summit, you're going to compile it on the login node. On cori, you're going to compile on the backend nodes,

A

And we'll use sbatch to automate offloading

A

The compilation onto the gpu nodes.

A

Hpcrun gets run on a compute node.

A

Hpcstruct can be run either on a head node.

A

Or on a compute node. Hpcprof can be run on the login node.

A

Hpcprof-mpi, this is a parallel program,

A

And so this will get run on a compute node.

A

And on cori, you can analyze the data on a head node. On summit,

A

The warning is, please don't. You may crash the machine.

D

[Participant] So just to clarify, the one that needs to be inside the batch job will be just the hpcrun?

D

Hpcstruct and hpcprof-mpi can be run on the head node.

D

So it's not necessarily...

A

Yeah, so for the workshop.

A

The way things are organized is the software stack.

A

On the cori log-in nodes is completely different than the software stack on the gpu nodes.

A

And so one way that you can deal with that is you can log into one of the gpu nodes and you can build your code back there.

A

For the examples that we've put together, on the login node you can say make build.

A

And then what it's actually going to do is it's going to launch a batch job that's going to do the build on the back end nodes on cori,

A

Because they have different compilers and they're running a different version of the operating system. So on cori, we're doing the build on the compute nodes as well. So basically for cori, all of this is occurring on compute nodes.

A

On summit, this occurs on the log-in nodes,

A

This occurs on a compute node. And then we also do the analysis, the hpcstruct and hpcprof,

A

We're doing it on a compute node again, out of convenience.

D

[Participant] How many resources would the analysis routines take?

D

The hpcstruct and hpcprof.

D

Would it be a considerable amount of time?

A

So it depends upon the size of your binary.

A

So we've seen binaries as big as seven gigabytes.

A

And so, if you're analyzing seven gigabytes, you probably don't want to do it on a login node.

A

And actually hpcstruct supports multiple threads.

A

So if you log into a compute node, you can say hpcstruct -j 16 or something

A

And use 16 threads to analyze a large binary.

A

So if you're analyzing really large binaries,

A

Doing it on a compute node and using parallelism is preferable.

A

For analyzing gpu binaries, as I mentioned, we're using nvdisasm to do that.

A

And so, if you're doing surface level analysis, just to map back to line information, then that's reasonably fast.

A

If you're actually trying to analyze the control flow in the nvidia gpu binaries,

A

That can be very expensive. And so doing that for one gpu binary might take as much as 30 minutes or so.

A

And so, if you're running that long, you might not want to do that on a login node either.

A

And so you'll get a feel for this.

A

As you start working with some of the examples.

A

Or trying it on your own code.

C

I'm interested in julia, a dynamic language.

C

What would be involved in supporting jit compiled code?

C

It uses llvm orc jit.

A

So the main issue...

A

So llvm, I think, will actually do

A

A good job carrying information.

A

About the mapping between the machine code.

A

And the source lines that it came from.

A

So the main thing is in order to know what that mapping is.

A

We actually have to capture the jit compiled output.

A

So, if what they actually produce is something like an elf binary, then if we can capture it,

A

Then we could analyze it. So I haven't actually done anything with julia.

A

So I don't know exactly what some of the internal details.

A

Are inside their runtime system.

A

So what I can tell you from the high level is.

A

We probably have to do at least something.

A

We have to capture these jit compiled binaries.

A

So if they're not already captured, then we would have to capture them. But beyond that, we might be able to just use the rest of it without very many changes.

C

Okay, thank you. [Participant] John, I have a question

E

About the whole provisioning of the resources.

E

So if we over-provision resources with a few extra cores,

E

Do we need to take care of the pinning of these things,

E

Or do we just keep them idle, and hpctoolkit is going to figure out these are extra cores and will pin its internal threads to these cores?

A

That's a good question.

A

I think if you just use some extra cores, then using, say, jsrun or srun,

A

That the problem will probably take care of itself.

A

Okay, when the extra threads get created, they'll get created on other cores.

E

A

So I don't have any specific advice to offer you about that. [participant] okay, thank you.

E

And second question is about the overhead.

E

So hpctoolkit is nice because it has typically low overhead,

E

Depending upon sampling frequency, of course. How does this change with the gpu profiling support?

E

And let's say, especially if we're using more mpi ranks

E

Than total gpus available, for example, with something like the cuda mps service,

E

How is this profiling overhead, especially compared

E

To the cpu profiling?

A

Okay, so there's two comments about that.

A

One is if you're using more mpi ranks than gpus,

A

Then you can't actually use the pc sampling capability.

A

That is a limitation of nvidia's cupti stack.

A

If you want to do pc sampling, then you have to have only one mpi rank per gpu.

A

I understand that that's not a way that lots of application developers write their code, but you might be able to run some small-scale experiments that are just using

A

One gpu per rank in order to collect some data.

A

To find out some details about how you're spending your time.

A

So other than that, with multiple mpi ranks per gpu,

A

Things work out okay. I don't really have experience using mps,

A

But I have used multiple mpi ranks with a gpu

A

Without mps, and that works just fine for, like, the coarse-grain kernel-level profile.

A

So then you asked about cost. So I think that the cost is roughly

A

A factor of two when running.

A

One of the gpu accelerated applications.

A

And so that's on par with what nvidia's nvprof was using.

A

So it's much less expensive than nvidia's nsight compute.

A

Still on the order of around a factor.

A

Of two for just the kernel-level profiling.

A

And then for the pc sampling,

A

It can be more, and that's something that we're working with nvidia on. And so I would expect some improvements on that in the future.

E

[Participant] If I understood correctly, you said two x compared to the cpu profiling, is it?

A

Two x in total runtime.

A

[participant] okay.

A

And unfortunately, most of that overhead is not ours.

A

And so it's like a function of nvidia's.

A

Cupti measurement infrastructure at the moment. And so we can only reduce it so much,

A

'Cause a lot of it is inside the measurement library.

A

But that's changing.

F

[Participant] Sorry to ask this. So you're saying that nvprof was actually less overhead than nvcs, the new way of doing things?

F

Than nsight compute. [participant] yeah, sorry.

A

(Murmuring) nsight compute is doing pc sampling, but this may do as much as like 10 passes over our code.

A

And we're just doing one pass with pc sampling.

A

So we get much of the same information, but with less cost.

A

[Participant] That's because they do replay

F

More kernels, essentially. That's correct.

A

There are other ways where we can collect information.

A

So keren will talk about this a little bit, I think.

A

But we can compute utilization by just saying: so we're using samples, and this is how many samples we expected, based on the clock frequency. This is how many samples we've got. And so, if we get less samples than we expected, then we can infer that the sms were idle because they weren't collecting samples.

A

And so then we can get approximate information about sm utilization by looking at samples expected

A

Versus samples actually recorded, without actually measuring directly. And it turns out that these approximations are actually pretty useful and pretty accurate.
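The samples-expected versus samples-recorded idea can be illustrated with a short calculation. This is only a sketch of the reasoning as stated here, not hpctoolkit's actual formula: it assumes each sm would deliver one sample per sampling period while busy, and the function names, parameters, and numbers are hypothetical.

```python
def expected_samples(kernel_cycles, sampling_period_cycles, num_sms):
    """Samples we would see if every sm were busy for the whole kernel.

    Assumption: each sm yields one sample per sampling period.
    """
    if sampling_period_cycles <= 0:
        raise ValueError("sampling period must be positive")
    return (kernel_cycles // sampling_period_cycles) * num_sms


def estimated_sm_utilization(recorded_samples, expected):
    """Approximate sm utilization as recorded / expected, capped at 1.0."""
    if expected <= 0:
        raise ValueError("expected sample count must be positive")
    return min(1.0, recorded_samples / expected)


if __name__ == "__main__":
    # Hypothetical kernel: 1,000,000 cycles, one sample every 4096 cycles,
    # on a gpu with 80 sms.
    expected = expected_samples(1_000_000, 4096, 80)
    recorded = 9760  # hypothetical number of samples actually collected
    utilization = estimated_sm_utilization(recorded, expected)
    print(f"expected={expected} recorded={recorded} "
          f"utilization={utilization:.0%}")
```

A shortfall in recorded samples is read as idle sms, which is why no separate utilization counter or extra kernel replay is needed for this estimate.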

F

[Participant] So you don't need to do any replays at all, or you do just less replays?

A

So if we're just using pc sampling, then we don't actually do any replays.

A

Keren will say a little bit about this, but for this gpa tool, we actually use replays a little bit

A

By just using some cupti events.

A

That will collect some additional measurements using instrumentation, and that will do some replays.

F

[Participant] And you expect to use the same general philosophy for the tools for rocm,

F

Once they're actually available?

A

That's right. So we actually have something that's operational at present, but we don't have any support for fine-grained measurements.

A

We only have kernel level measurement.

A

And so a few weeks ago that was working great.

A

And then they released candidates.

A

For rocm 4.1 and broke our software stack.

A

And broke people's applications. And so does it work?

A

Well, it worked on rocm 4.0. It's not working on rocm 4.1 yet, and we're waiting for 4.1 to stabilize.

A

So we have a new release for 4.1 out.

A

We haven't installed it yet. We don't know. But in general it's like pretty much working.

A

Hpctoolkit is pretty much working on rocm when rocm is working.

F

[Participant] So if one were to basically take it out and build it for rocm, it should just build for rocm 4.0,

F

If that's what one has installed.

A

If you have rocm 4.0 installed, then yes, you can build it for rocm 4.0. Now, I'll caution you that the build instructions

A

For rocm are a little bit more complicated. So all of the build we do is with spack.

A

And my colleague mark is responsible.

A

For all of our support with spack. And so he is available. If you're trying to install with rocm, he could assist you with that. But the rocm build, it's not as easy as saying spack install hpctoolkit +rocm.

A

That doesn't work yet because amd has to fix some of their packages. So you actually have to build a packages.yaml file that says, "here's where rocm is installed, don't expect to build it with spack."

A

And as long as you do that, then you can do a spack install of hpctoolkit with a pre-installed rocm.

A

[Participant] okay, thank you.