National Energy Research Scientific Computing Center (NERSC) HPCToolkit Training for NERSC and OLCF Users, Mar-Apr 2021, 6 Apr 2021

From YouTube: 5 - Analyzing CPU Applications

Description

Part of the Using HPCToolkit to Measure and Analyze the Performance of GPU-accelerated Applications Tutorial, Mar-Apr 2021. Slides available at https://www.nersc.gov/users/training/events/hpctoolkit-for-gpu-tutorial-mar-apr-2021/

A

So, let's see. There are a couple of things that I wanted to talk about: some events for CPU measurement, which will affect the way that you analyze your code. I wanted to say briefly some things about the OpenMP tools interface, which we're actually using on Cori and which can improve your experience, but it's still a bit of a work in progress. I wanted to show you a key technique for understanding performance that you can use on either CPUs or GPUs: differential performance analysis. I also want to talk about kernel sampling, just a bit, because if you don't use it, or your system isn't configured with it, then all the tools will lie to you. And then finally, something about context recycling for dynamic threads, because I think we may have run across this in one of the applications we were working with this week.

A

Okay, so first, there are a couple of different ways that you can time things on the CPU. If you just say hpcrun and you don't give it any arguments to say what events you want to monitor, then the default event is CPU time. And that's the best for analysis of profile data.

A

So when you run an MPI application on a system like Cori or Summit, it often has a helper thread that's running a progress engine. If you measure with real time, then it will seem like the helper thread for the progress engine is running hard and taking up a significant fraction of your execution time. That's not really true. What happens is the progress engine will sit there and wait the majority of the time, and it'll spend its time blocked in a system call.

A

If you monitor things with real time, then every time the timer goes off (the default is maybe 300 times a second), it's going to interrupt your application and say, "Where are you now?" And it'll say, "I'm at this system call." It'll look like it's active at the system call, but the only reason it's active is because we woke the thread by probing it with a real-time signal. If you instead measure with CPU time, then you won't wake the application. It'll just remain slumbering. However, the last sample that you got before it blocked might be from somewhere completely different, so you don't have any idea where it blocked.

A

So my advice is, if you're tracing and you want to know where you're at, real time gives you better traces. What I mean by better traces is that if you're blocked at a system call, you'll have call paths shown in the window that show that you're blocked in the system call. If you use CPU time, then the traces may show something that looks misleading, because the last time we were able to sample the process, the last time it was running, it may not have been at the point where it's blocked right now. So that's just a caution: real time is better for traces, but CPU time is better for analyzing the profile data.
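The difference between the two timers can be illustrated with a small simulation. This is a minimal sketch, not HPCToolkit code: the thread timeline, the state names, and the 10 ms sampling period are all made-up assumptions chosen to keep the arithmetic easy to follow.

```python
def sample_thread(timeline, period_ms, clock):
    """Count samples per state for one thread.

    timeline:  list of (state, duration_ms, on_cpu) segments (hypothetical data).
    period_ms: milliseconds between samples on the chosen clock.
    clock:     "realtime" advances during every segment; "cputime" advances
               only while the thread is actually consuming CPU.
    """
    counts = {}
    elapsed = 0              # time accumulated on the chosen clock
    next_sample = period_ms
    for state, duration_ms, on_cpu in timeline:
        if clock == "cputime" and not on_cpu:
            continue         # a blocked thread accrues no CPU time, so no samples
        elapsed += duration_ms
        while next_sample <= elapsed:
            counts[state] = counts.get(state, 0) + 1
            next_sample += period_ms
    return counts


# A helper thread that works for 10 ms and then blocks for 990 ms.
helper_thread = [("progress_work", 10, True), ("blocked_in_syscall", 990, False)]

realtime_profile = sample_thread(helper_thread, 10, "realtime")
cputime_profile = sample_thread(helper_thread, 10, "cputime")
print(realtime_profile)
print(cputime_profile)
```

With a wall-clock timer nearly every sample lands on the blocked system call, which is useful for a trace but inflates the helper thread in a profile. With a CPU-time timer the blocked interval produces no samples at all.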

B

John...

A

Go ahead.

B

When it activates the blocked thread, when it goes back, does it go back into the system call or does it go back into the program? And if the program expected, "Hey, I should only unblock when I have something," will it get messed up?

A

So that's an excellent question. Normally, what we say is we want to restart the system calls.

A

And so in general, if you're in a system call like read or something and you get woken up, then the system call will just get restarted. There are certain system calls that don't automatically restart. One example is poll. I think another example is select. Mark, can you think of any others?

C

[Mark] poll and select are the two classic calls that don't automatically restart. There's also the issue of some code that doesn't handle being interrupted correctly. Like this thing with the cross-process memory that we saw: it had a problem because it was a third-party kernel driver, and it basically didn't handle the automatic restart correctly.

A

So, Richi, the answer is that in most cases it's fine, if the code is well written and can handle the fact that, while it's being profiled, it may see interrupted calls. But we did find a case with the Cray XPMEM driver, which is used for fast interprocess communication. If we interrupted it while it was in the middle of an XPMEM open, it didn't return that it was interrupted. It just returned as if it had succeeded when it actually hadn't. So we have a couple of workarounds inside HPCToolkit, where we know that that is a bug and Cray hasn't fixed it. For that case, we're just going to manually retry and see if we can force XPMEM to open okay.

A

In other cases... but let me finish here for a second. With select and poll, we actually have wrappers in HPCToolkit where we will reinitiate the call, since the system doesn't just automatically restart it. So we hide from you the fact that it's been interrupted and that the system doesn't automatically restart it. Our tool catches it and then restarts it.
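The idea of a wrapper that reissues an interrupted call can be sketched as follows. This is an illustration only, not HPCToolkit's wrapper: the names `call_with_restart` and `flaky_poll` are hypothetical, and the interrupted call is simulated, because Python 3.5 and later already retries most interrupted system calls on its own.

```python
import time


def call_with_restart(call, timeout, clock=time.monotonic):
    """Reissue call(remaining_timeout) until it finishes without interruption."""
    deadline = clock() + timeout
    while True:
        # Only wait for whatever is left of the caller's original timeout.
        remaining = max(0.0, deadline - clock())
        try:
            return call(remaining)
        except InterruptedError:
            # EINTR: a signal (for example a sampling timer) arrived mid-call.
            continue


attempts = []


def flaky_poll(remaining):
    """Stand-in for a poll-like call that is interrupted twice, then succeeds."""
    attempts.append(remaining)
    if len(attempts) < 3:
        raise InterruptedError()
    return "ready"


result = call_with_restart(flaky_poll, timeout=1.0)
print(result, len(attempts))
```

Recomputing the remaining timeout on each retry keeps the caller's overall deadline intact, so repeated interruptions cannot stretch the wait indefinitely.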

A

Okay, go ahead.

B

[Sharp voice] Well, I was going to say, the one case that I could see being problematic, and it would probably rarely if ever happen, is if you're trying to do an mmap and you can't get the kernel pages and you interrupt right in the middle of that. It'll probably return, "Hey, I couldn't do it," and then your program will probably say, "Well, I guess there's no memory."

A

I think that it will return... I haven't looked at the mmap return codes, but it should return EINTR. And if your code is well written, then it should say, "Okay, I'll try it again." Otherwise, the code is not friendly to profilers, and there's not so much we can do about that.

B

[Sharp voice] Yeah. I just see code doing that a lot, where they say, "mmap me this thing, and if I get a null, that means there was no memory, so I quit."

A

So I would have to look at that case to see what happens with mmap.

A

Okay, so these were some general rules of thumb. You can use CPU time if you don't want to interrupt the system calls, and so that's safe. All right. Besides the Linux timers, there are also other things that you can measure on the CPU. There's the Linux perf events monitoring subsystem, and this supports using hardware counters, so you can measure things like cycles, instructions, and cache misses. And then there are some software counters for measuring things like context switches and page faults. Perf has been a standard part of the Linux kernel since 2.6.31, so I think that on almost all systems that you're using, you should be able to use perf. There's a pretty good document that I came across when looking around for materials for this talk, which explains some events that you can monitor with perf and has some explanations and some examples.

A

So, I don't want to spend too much time on this, but just to give you a sense of the kinds of things you can measure: perf has a set of hardware event counters. So cycles, instructions, cache references, cache misses, branch instructions, and front-end stalls. Front-end stalled means waiting for, say, the instruction cache. The front end is related to the code, and the back end is related to the execution. So instruction TLB misses and instruction cache misses end up being front-end stalls. And then the CPUs can vary their frequency, so you can measure either cycles or reference cycles. The time for the regular cycles can change based on changes to the clock frequency; the reference CPU cycles are stable.

A

There's also a bunch of hardware cache events. You can look at the L1 instruction and data caches, the last-level cache, the translation lookaside buffers, and the branch processing unit. And this is a general thing that has a bunch of modifiers that you can add to it; the document that I referenced on the previous page lists the modifiers. So you can understand when you're reading and writing in the cache, or whether you're prefetching, or you can find out whether your access to the data hit or whether it missed in cache. I'm not going to try to give you a whole tutorial on how to use perf. In general, it's a lot about the memory. If you monitor your cycles and your instructions, to know how many instructions per cycle you're executing, and where your cache misses are, and your TLB misses, that'll take you a long way.
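As a rough illustration of how those few counters combine, the sketch below computes instructions per cycle, a cache miss ratio, and TLB misses per thousand instructions. The helper name `derived_metrics` and the counts are invented for the example; the dictionary keys follow perf's shorthand event names.

```python
def derived_metrics(counts):
    """Turn raw event counts into a few ratios that are easier to reason about."""
    cycles = counts["cycles"]
    instructions = counts["instructions"]
    return {
        # Instructions per cycle: low values often point at stalls.
        "ipc": instructions / cycles,
        # Fraction of cache references that missed.
        "cache_miss_ratio": counts["cache-misses"] / counts["cache-references"],
        # Data TLB load misses per 1000 instructions.
        "dtlb_misses_per_kilo_instr": 1000 * counts["dTLB-load-misses"] / instructions,
    }


# Hypothetical counts, not measurements from any real run.
example_counts = {
    "cycles": 2_000_000,
    "instructions": 3_000_000,
    "cache-references": 100_000,
    "cache-misses": 25_000,
    "dTLB-load-misses": 6_000,
}

metrics = derived_metrics(example_counts)
print(metrics)
```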

A

There are also some software events, and we can monitor these as well; perf supports all of these. So we can look at page faults and context switches and when threads migrate from core to core. All of these things you can monitor as well. In order to find out what you can actually monitor, you can say hpcrun -l, and it's going to give you a huge list of things, probably over a hundred. There are a few ways that you can name these events. You can name them by saying something like PERF_COUNT_SW_CONTEXT_SWITCHES, but perf also supports some shorthand names, so I think you can also just say lowercase context-switches. When you use hpcrun -l, you can see all of these events and their convenient aliases. For instance, PERF_COUNT_HW_CPU_CYCLES can be referenced by the shorthand, lowercase cycles, and so it's a lot easier. If you run hpcrun -l, you'll find out what all the aliases for these events are and what the shorthand is.

A

Now, the other thing that we do is we use PAPI, and PAPI supports this syntax where the events are named like ix86arch. So these are x86 events, and they apply to pretty much all of the x86 processors. So you can say that, colon colon, event name. If you're interested in x86 events, you can say hpcrun -l and grep for those. The other thing is that all of the perf events are also available as perf, colon colon, event name. This has all the built-in names, and you can refer to them either by the long names, like perf::PERF_COUNT_ whatever, or, I think, as something like perf::cycles. Why would you add this? Well, if you just want to distinguish exactly what you're specifying, you might use this qualifier, but typically, if I'm using perf events, I'm just going to use the shorthand, like cycles and instructions. And then particular processors will have some other sets of events, like bdw_ep, and those are Broadwell EP specific events. These are events that are available on a particular instance of an x86 architecture. So I'm sure that there are some KNL-specific events or some Haswell-specific events if we look on Cori. I mentioned that you can omit the colon colon, and you can also specify the event names in lowercase.

A

When we are using perf events, it also supports multiplexing. The situation is that the performance monitoring units in the CPU can support a small, fixed number of counters, maybe three or four counters per hardware thread. If you start to use more than that, then perf events will automatically multiplex them. What that means is that when the number of events being collected exceeds the number of hardware counters that you have available, the kernel will partition the events into sets of things that can be measured at the same time. Then it will monitor one set of events for a little while and then switch to another.

A

So for instance, if I said something like cycles, instructions, L2 misses, TLB misses, and L1 accesses, and I want to monitor all of those, well, it may decide that it can't monitor all of those cache events in one set. So it might monitor, say, cycles, instructions, and L1 misses, and then in the other set of events it has the other cache events. Then it'll alternate back and forth between them, and so you won't get true counts for exactly what happened during the application. My advice is that multiplexing is fine for casual execution analysis. If what you're trying to do is count exactly what happened, then what you probably want to do is multiple runs, where you use a small number of counters in one run and a small number of counters in another run, and then just analyze the databases together. The counts you get back from multiplexing can give you an idea of what's going on. Obviously, if you have TLB problems, they're going to show up as TLB problems. If you want to know exactly the number of TLB misses you have, then you want to measure with few enough events that you're not using multiplexing.
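One reason multiplexed counts are estimates is that each event is only counted while its set is scheduled on the hardware. The Linux perf interface can report how long an event was enabled and how long it was actually running, and a common way to extrapolate is to scale by that ratio. The sketch below shows only that arithmetic; the function name and the numbers are illustrative assumptions, and nothing here claims to be how HPCToolkit processes its data.

```python
def scale_multiplexed(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed count to the full enabled interval.

    raw_count:    events observed while the counter was on the hardware.
    time_enabled: time the event was enabled (same units as time_running).
    time_running: time the event was actually being counted.
    Returns None if the event never got scheduled.
    """
    if time_running == 0:
        return None
    return raw_count * time_enabled / time_running


# Hypothetical: the event was on the hardware for half of the run.
estimate = scale_multiplexed(raw_count=1_000_000, time_enabled=10_000, time_running=5_000)
print(estimate)
```

The estimate assumes the event rate was the same while the counter was switched out, which is why separate runs with fewer events are the safer choice for exact counts.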

A

Okay, so there are a couple of ways of controlling the sampling frequency for the perf events. The recommended way is automatic. We use frequency-based sampling, which samples at a certain rate per second, and we set that when HPCToolkit is installed. I think that the default frequency is 300 times a second on most architectures, and on KNL I think it's lower. I would have to check and see what our default frequency is on KNL. We found that if we're monitoring more than a hundred times a second on every hardware thread, then it slows down the program significantly. So the automatic setting is very convenient, and it's going to right-size the monitoring so that you're not spending too much time monitoring.

A

Now, there are two options with perf events. There's period-based sampling, where I can say that I want to monitor cycles and instructions and take a sample, say, every million cycles and every five million instructions; I can just specify a number. Or, if I want to monitor cycles at a frequency of 100 per second, or instructions at 200 per second, I can use the f modifier to say that I want to do frequency-based sampling. And then, if I want to just change the default frequency for a couple of things I'm monitoring, there's an option for that too, using this -c option. If you look at our HPCToolkit manual, there's a longer discussion of some of this.
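The relationship between a sampling period and a sampling frequency is simple arithmetic, sketched below. The 2 GHz cycle rate is an assumption picked for round numbers, and the helper names are hypothetical. With frequency-based sampling the kernel adjusts the period as the program runs, so the second function gives only the period that a steady event rate would settle on.

```python
def samples_per_second(event_rate_per_second, period):
    """Sampling rate produced by taking one sample every `period` events."""
    return event_rate_per_second / period


def period_for_frequency(event_rate_per_second, frequency):
    """Period that yields `frequency` samples per second at a steady event rate."""
    return event_rate_per_second / frequency


ASSUMED_CYCLES_PER_SECOND = 2_000_000_000  # hypothetical 2 GHz core

# Period-based: one sample every million cycles.
rate = samples_per_second(ASSUMED_CYCLES_PER_SECOND, period=1_000_000)

# Frequency-based: 100 samples per second.
period = period_for_frequency(ASSUMED_CYCLES_PER_SECOND, frequency=100)

print(rate, period)
```

A fixed period of a million cycles would mean about 2000 samples per second on such a core, well above the rates mentioned in the talk, which is one reason the frequency-based default is convenient.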

A

So, in order to measure dynamically linked executables, when you launch your program, you can just specify: here's a bunch of perf things I want to measure. Cycles, context switches, page faults, last-level cache misses. If you're using statically linked executables (these are relatively uncommon; it used to be the default on Cray and Blue Gene platforms), then you don't get to use hpcrun. You have to actually link our measurement library into your executable with the hpclink utility. Details are in the manual. And then you specify the things you want to monitor in an environment variable. Actually, what happens is hpcrun will set the environment variable for you. We use hpcrun for dynamically linked applications because it sets the environment variable and preloads the code into the address space. In this case, with statically linked programs, the code's already in the address space, because we put it there when the program was built.

A

Next I wanted to briefly mention the OpenMP tools interface. This is something that I've been working on for a long time, multiplexing with a lot of other things. It could have been done a long time ago, but we're still working on it, because it's just mixed in with a bunch of other things.

A

The issue with OpenMP is that there's a large gap between threaded programming models and their implementations, and so the user-level calling context for an OpenMP program's threads and tasks is not readily available. What happens is, when you start an OpenMP program, there's an initial thread, and then, when you enter a parallel region, it'll start up a bunch of worker threads. HPCToolkit uses call stack unwinding to discover where costs are being incurred. If you unwind the call stack in the worker threads, you'll find some OpenMP code regions, which might be parallel loops or tasks, and they've been torn out of the application. You see them in a calling context that says launch worker, launch thread, invoke task function (indistinct). I would argue no one wants to analyze their program this way, because you can't tell exactly where these things came from and you don't have a global view of the program. The problem is that when these activities were launched, at run time, we don't know what parallel region they were associated with. We just know some worker thread picked them up and started executing them. So in order for tools to do a better job, some runtime support is necessary. The challenge that we have is that the tools are just naturally going to see this low-level view of the OpenMP threads. We're going to see the master thread and the worker threads, and those are asymmetric. And then, if we just monitor the program, you'll see that the frames for the OpenMP runtime system are intermixed with the user code. So the idea was that if we could define a tooling interface as part of the OpenMP standard, then we could appropriately address these things.

A

We started this process in maybe 2012 or so, defining this OpenMP tools API. What it was supposed to do was specify that the runtime system had to respond to certain query operations, maintain certain information inside it, and provide certain callbacks. If this was part of the OpenMP standard, then a group such as mine could just build a tool on top of the OpenMP tools interface, and then you could use it with the GNU OpenMP compiler, with Clang, with the IBM XL compilers, and it would all look great. You would have this high-level global view of what the application looks like. So the design objective for this tools interface was to enable tools to measure and attribute costs to the application source and the runtime system. Let me just do a quick... well, actually, I'll demo this with the HPCG application that we included in the example codes. I'll get to that in a few minutes.

A

Currently, HPCToolkit has support for the OMPT interface based on OpenMP 5. There are prototype implementations of this in LLVM. My group actually has a copy of this: in our HPCToolkit project we have a fork of LLVM, and we have the latest implementation of the tools interface, so we have tools interface support for both CPUs and GPUs. This is something that we are currently working to upstream back into LLVM, and we're also working with AMD to add it to the runtime for their OpenMP compiler that they're planning to deliver for Frontier.

A

IBM has support: IBM's lightweight OpenMP runtime has support for OpenMP 5. When we ran some things on Summit, there were some indications that the support was there, but it didn't seem to be used by the tool, and so I'm a bit puzzled. I noticed also that the runtimes that were on Summit were not the same runtimes that I was exchanging with IBM. We work closely with Alexandre Eichenberger, and he sent us copies of their runtime, and it said LOMP, lightweight OMP. What's on Summit is something called xlsmp, and so I'm not sure what to make of it. All I can say is that for the examples that we ran on Summit, we didn't see the OpenMP tools interface being used there or being provided by the runtime system. I must say that we didn't look at this very carefully, because we were trying to get a set of examples working on CPUs and GPUs and on two different processors, so we didn't do any investigation. It may be there and it may work, and if someone's interested, we can take a look today.

A

In the examples that we have on Cori, we actually built a copy of the LLVM OpenMP runtime that we had developed, and we made some changes to it as recently as two weeks ago. In general, when you're running multi-threaded programs, you might see that a program root and a thread root appear in your profiles. Program root you would think of as the main program, and if it spawns worker threads, those end up being listed under thread root. Using the OMPT interface, the tool takes all of the time under thread root, and if it can figure out where it came from with the parallel regions, then it integrates it into program root. You see this global view that has the kind of full call paths that you would expect. Whereas if they're in thread root, then they're going to look like the ones that I showed you a minute ago, where they're these program fragments that were being launched by the runtime. So without the OMPT interface, you'll see a lot of this, and with the OMPT interface, almost all the time will be under program root, and so it's a little bit easier to analyze in that case.
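The effect of moving a worker's fragment from thread root into program root can be sketched as list manipulation on call paths. Everything here is a simplification for illustration: the frame names are hypothetical, and a real tool identifies runtime frames and the region's creation context through information the runtime provides, not by matching names.

```python
# Hypothetical names standing in for frames that belong to the OpenMP runtime.
RUNTIME_FRAMES = {"launch_worker", "launch_thread", "invoke_task", "invoke_microtask"}


def unified_call_path(worker_stack, region_context):
    """Attach a worker's user frames beneath the context that created the region.

    worker_stack:   frames unwound in a worker thread, outermost first.
    region_context: call path in the main thread where the parallel region began.
    """
    user_frames = [frame for frame in worker_stack if frame not in RUNTIME_FRAMES]
    return region_context + user_frames


worker_stack = [
    "launch_worker", "launch_thread", "invoke_task", "invoke_microtask",
    "outlined_region", "lambda_body",
]
region_context = ["main", "calc_forces", "forall"]

path = unified_call_path(worker_stack, region_context)
print(" > ".join(path))
```

Because every thread's samples end up under the same region context, the master thread and the workers appear at the same depth in the combined tree.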

B

[Sharp voice] A quick question. Does this fix the problem that I was bringing up the other day, where you're at different levels in the code? Will the OpenMP be right because of this? Remember, we were looking at the levels. So does this fix...

A

Yeah, so let me pull up and show you an example here. I think the example would help.

A

Okay, so here's a version of this code. (Indistinct) A little more code. It's a proxy app for their (indistinct) shock code. Here you can see that it says 1.4% under program root and 98.6% under thread root. This one was run on a machine we have that has two Broadwell processors, so it ends up with 72 hardware threads. So what we have is, this is like one thread's worth of data and this is 71 threads' worth of data. If I look under thread root, what I see is the following, and this was built with RAJA templates.

A

With the RAJA templates, well, there's a template for forall. What we see at the top is that it just brings us into the template library, and okay, that's not helpful. If we go look inside that, we see there's a loop, and then we look inside that and see that there's a call to something. And the thing that it called is a piece that was torn out of their code as a lambda function. So this is what it looks like when you have RAJA templates and we don't have the OMP tools interface. You see that it's all visible in this context: launch worker, launch thread, invoke task, invoke microtask. I don't think that that's helpful at all.

A

And then in the program root, where the main program is executing stuff, you see that there are call paths here. Then there are pieces of the runtime, where we entered the runtime to implement this forall, and it ended up going through a couple of frames in the OpenMP runtime system before we came out in this outlined function that's implementing the (murmurs). But at least here we get to see where it was invoked from. So if we look back... where am I going to find this?

D

Several levels of the template.

A

Here we go, right. So this is showing us that we invoke this RAJA forall, and then all the rest of the stuff in here is details of the RAJA template library and then the runtime library. And then this is what we're actually executing, and the lambda function gets executed at the bottom. So this is sort of a global view, but you still see the OpenMP runtime. The real problem here is that almost all of the performance is presented in this confusing way, where I have to go look under each of these outlined functions and find out what was actually called. So this is calc hourglass (murmurs). What about this? Oh, this is calc monotonic Q region for elems, okay. I would argue no one wants to analyze the program this way.

A

So to address this problem, we built the OMPT interface. Now I'll show you the same program where we collected the data with the OMPT interface. Okay, so now it says 47% of the time is in program root and 54% of the time is in idle, where OpenMP is just idle. And then under thread root, there's like 0.1% where we couldn't figure out where it went. We took a sample while we were in the middle of setting this thing up and couldn't relate it to any OpenMP anything. It wasn't idle waiting for work; it was getting to the point where it would go idle waiting for work. This is only 0.1% of the execution time, and so that's not a big deal.

A

The really nice thing is that now, with the OMP tools interface, you get this intuitive view that shows you a full call path that contains loops and inlined functions. We see calc hourglass control for elems and calc hourglass force for elems. Now we go into some RAJA templates, and then we come back out in the code at the bottom. So you get to see all of your program execution time attributed at the bottom of call chains that are meaningful to you as an application developer. And so that...

B

[Sharp voice] I was going to say, and also, in the depth view, everybody's at the same depth.

A

That's exactly right. So it exactly addresses the question that you were concerned about, where this calc hourglass force for elems was going to be at one depth here in the master thread, and then in the worker threads it was going to be underneath something like launch worker, launch thread, maybe three levels deep, where it said invoke microtask.

A

So if you have this integrated high-level view, then this makes it much easier to do scaling experiments, where you say, okay, the performance data is roughly in a tree, and what I want to do is run it on, say, 10 threads, 20 threads, 30 threads, and I can compare and difference them. I'll talk about that with the differential analysis in just a minute.
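Comparing two runs context by context can be sketched with profiles stored as dictionaries keyed by call path. This is a minimal sketch of plain differencing under assumed data: the call paths and costs are invented, and the functions `difference` and `largest_increases` are hypothetical names, not part of any tool.

```python
def difference(base, scaled):
    """Per-context cost change between two runs (scaled minus base)."""
    contexts = set(base) | set(scaled)
    return {ctx: scaled.get(ctx, 0) - base.get(ctx, 0) for ctx in contexts}


def largest_increases(base, scaled, top=3):
    """Contexts whose cost grew the most, largest first."""
    diff = difference(base, scaled)
    return sorted(diff.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical per-context costs from a 10-thread and a 20-thread run.
run_10_threads = {("main", "solve"): 100, ("main", "setup"): 20}
run_20_threads = {("main", "solve"): 60, ("main", "setup"): 22, ("main", "barrier"): 15}

worst = largest_increases(run_10_threads, run_20_threads, top=2)
for call_path, delta in worst:
    print(" > ".join(call_path), delta)
```

Differencing only works when the same call path means the same thing in both runs, which is why the integrated view matters: fragments scattered under thread root would not line up between runs.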

F

Is there a way to figure out why the threads are idle 54% of the time? That would seem to be what you'd want to work on.

A

Okay, so green is idle. The answer is we have a technique for that, and it was implemented previously in our tool using a feature that the OpenMP standards committee took out of the runtime interface. When I missed a committee meeting, people said, "Oh, this is useless," and they were wrong. So we're working on reimplementing it. Here's the thing: when you're in a thread and it's idle, just knowing that the thread is sitting there waiting for work tells you nothing. What you really want to know is what else is going on. So if there's serial code somewhere while I'm sitting waiting, then I would like to blame the serial code for my idleness, saying: that's not shedding enough parallelism.
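The blame idea can be sketched as follows: at each sample, count the idle threads and charge that idleness to whatever the working threads are doing at that moment. This is a simplified illustration with made-up samples and context names, and it splits the blame evenly among working threads as an assumption; it is not the implementation in HPCToolkit.

```python
def shift_blame(samples):
    """Charge idle threads to the contexts that are working at the same time.

    samples: list of snapshots; each snapshot lists one entry per thread,
             either a context name or None when the thread is idle.
    Returns blame per context, in units of idle thread-samples.
    """
    blame = {}
    for snapshot in samples:
        working = [ctx for ctx in snapshot if ctx is not None]
        idle = len(snapshot) - len(working)
        if not working or idle == 0:
            continue  # nobody to blame, or nothing to blame them for
        share = idle / len(working)  # assumption: split evenly among workers
        for ctx in working:
            blame[ctx] = blame.get(ctx, 0.0) + share
    return blame


# Hypothetical snapshots of a 4-thread run.
samples = [
    ["serial_setup", None, None, None],    # one thread works, three wait
    ["solve", "solve", "solve", "solve"],  # fully parallel, no idleness
    ["serial_io", None, None, None],
]

blame = shift_blame(samples)
print(blame)
```

In this made-up run the two serial phases each collect three idle thread-samples of blame, while the fully parallel phase collects none, which points the analyst at the serial code and not at the waiting threads.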

A

In a previous implementation of HPCToolkit on top of LLVM OpenMP, we had support for this blame shifting, and now we have it in a branch. But as I mentioned, we just reassembled the OpenMP runtime about two weeks ago, and so we haven't yet fully integrated our support for this blame shifting idea. I think I can show you an example of that.

B

[Sharp voice] So in this, John, do we even know what the master thread is? Who's the master thread? Because it looks like everybody's just doing random work and waiting randomly.

A

Well, let's see. The master thread is going to be thread zero. You can see that there's a little different stuff that's happening on thread zero. Okay, I see. So while this is happening, at least thread one is idle, and we can look and see that, yeah, things are basically idle for this period of time. I know I have an example, and it's just mixed in my terabyte, to show you an example of the AMG code. This morning I was running some examples on the cluster. Does anybody have an example of AMG2013? Did they run that on the compute nodes, on Cori KNL? Okay, maybe one of my group could run an AMG2013 example and get a database. I have an HPCG example, but...

G

How do I get it to you?

C

Well, John, I have some AMG2006 runs that I demoed on Theta. Do you want me to try to tar and copy a database for you? Actually, if you just open a viewer on it...

G

[Soft voice] I can give you the file. The directory (indistinct) get over to you.

A

Okay, and how are you going to give it to me?

G

[Soft voice] I'll just send it with a command called (indistinct). What is your user name?

A

Oh, I think I have one here. I see I did a run this morning, but I hadn't gotten to the analyze stage.

A

Doesn't that clock next you.? Okay, I think I'm going to have one second.

A

Hopefully this run worked, because I haven't looked at this data yet. Let's see.

A

Okay, so here is a case where, in this initial region in the code, we have a main thread that's active, and then these threads are idle.

A

I mentioned that we're still working on this OpenMP runtime. It says that they're in a barrier. Well, when you're in a barrier, you're idle. So this is a technical issue, where what we're reporting is kind of too close an approximation of what's in the state.

A

So what actually happens when threads go idle? Well, when a thread is active in a parallel region, it gets launched, and then at the end of the parallel region it enters a barrier. And then it never leaves that barrier until it begins new parallel work. So you can think of the way that it goes idle as: it enters a barrier at the end of a parallel region, and then the region gets torn apart. At the moment, we're representing that by just saying that it's in a barrier instead of saying that it's idle. But here's the thing.

A

So if you look at the traces, what you can see is... Mark, actually, if you do have the blame shifting data for the other AMG example, that would be great.

G

[Soft voice] John, I just gave you my data.

A

Okay, but that doesn't have the blame shifting thing.

G

Right I don't, of course.

C

[Mark] Okay, so it may take a few minutes, John. I think you should continue on.

A

Okay. So what we can see is that in this region of the program there's a main thread that's working, and all of these other threads are idle. What we would like to do is to be able to blame the idleness on whatever the master thread is doing. And so if there were, say, three threads working and five threads idle, then we would split up the blame and attribute it to whoever was actually working, to say: whatever you're doing, you're not shedding enough parallelism. So if you look at it in this view, all of the things that show up in barrier just mean that the thread is not doing anything useful.
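The blame-splitting idea described here can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not HPCToolkit's implementation: the function name `shift_blame`, the sample layout, and the context names are all made up for the example.

```python
from collections import defaultdict

def shift_blame(samples):
    """Attribute idleness to the code that is running while others idle.

    samples: list of dicts mapping thread id -> context name,
             or None when that thread is idle in that sampling interval.
    Returns a dict mapping context name -> accumulated blame
    (in units of idle thread-intervals).
    """
    blame = defaultdict(float)
    for sample in samples:
        working = [ctx for ctx in sample.values() if ctx is not None]
        idle = sum(1 for ctx in sample.values() if ctx is None)
        if not working:
            continue  # nobody is working, so there is nobody to blame
        share = idle / len(working)  # split the idleness evenly
        for ctx in working:
            blame[ctx] += share
    return dict(blame)

# Hypothetical interval: three threads working, five threads idle.
sample = {0: "setup_serial", 1: "setup_serial", 2: "pack_buffers",
          3: None, 4: None, 5: None, 6: None, 7: None}
blame = shift_blame([sample])
print(blame)
```

Each working thread picks up 5/3 of an idle interval, so the total blame handed out equals the five idle threads.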

A

So let me just go into the procedure color map and just say, if in fact we are in a barrier, then I'll just make that thread. Okay, now the execution looks a lot less efficient and pleasing.

A

In fact, if we go and look in detail at what's going on, the application has two phases. Well, it's a benchmark. So there's a setup phase and there's a solve phase.

D

Zoomed in a little too far.

A

Okay, so what we can see here is that there's a set of threads in a process. You notice that on the left-hand side there are these color bands. The leftmost color band says everybody in the same process, and then it alternates between two colors for individual threads. So everybody that belongs to the same process is shown with, say, this light blue, and then the darker color means that these threads are in another process. So what we can see is that if we look at the threads in the process, for the solver there is in fact a load imbalance: there are four threads that are working and then there are four threads that are underused.

A

And that pattern continues. The reason for this in this particular application is that they're doing multigrid, and they were dividing up the work among the threads. The application decided which thread was going to do which part of the grid. So if the application has k threads, then it cut the grid up into k pieces, and it turned out that those k pieces were imbalanced. And that shows up in rank after rank, iteration after iteration of the solver: there's imbalance, and the higher-numbered threads have too little work. So one of my students took a look at this a number of years ago and found out that you could improve the load imbalance by just telling the application:

A

Well, you know there are k threads. So rather than cutting it up for k threads, why don't you cut it up for 5k threads? And then, instead of using static scheduling, why don't we use dynamic scheduling instead? So you would get a chunk of work, and then when you run out, you go back for more. And it turns out that that was successful in dealing with this pervasive load imbalance.
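To make the static-versus-dynamic argument concrete, here is a small Python simulation. It is a toy model, not the AMG code: the per-item costs are invented so that the pieces at the high end hold less work, and the function names are hypothetical.

```python
import heapq

def static_makespan(costs, n_threads):
    """One contiguous piece per thread; return the slowest thread's time."""
    n = len(costs)
    bounds = [n * t // n_threads for t in range(n_threads + 1)]
    return max(sum(costs[bounds[t]:bounds[t + 1]]) for t in range(n_threads))

def dynamic_makespan(costs, n_threads, n_chunks):
    """Cut the work into n_chunks pieces; whichever thread is free
    first grabs the next piece. Return the slowest thread's time."""
    n = len(costs)
    bounds = [n * c // n_chunks for c in range(n_chunks + 1)]
    chunk_costs = [sum(costs[bounds[c]:bounds[c + 1]]) for c in range(n_chunks)]
    finish = [0.0] * n_threads          # min-heap of per-thread finish times
    heapq.heapify(finish)
    for cost in chunk_costs:
        earliest = heapq.heappop(finish)  # thread that becomes free first
        heapq.heappush(finish, earliest + cost)
    return max(finish)

k = 8
# Invented cost profile: later items are cheaper, so with one piece per
# thread the higher-numbered threads get too little work.
costs = [float(400 - i) for i in range(400)]

static_time = static_makespan(costs, k)
dynamic_time = dynamic_makespan(costs, k, 5 * k)
ideal = sum(costs) / k
print(static_time, dynamic_time, ideal)
```

With five times as many pieces handed out on demand, the slowest thread finishes much closer to the ideal than with one fixed piece per thread. Whether the finer pieces cost anything in cache behavior is a separate question that this model does not capture.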

A

So at the high level, this OpenMP view, instead of just telling us that we're in something like launch worker or launch thread, is giving us application context in all cases. Now, in the cases where it says omp barrier, it should probably just say idle rather than barrier, but that's sort of a nit and a work in progress. Okay, so...

F

I do have a question: does that approach result in better balance between the threads but worse cache performance?

A

We would have to use perf events to see whether it hurts the cache performance. My recollection, and this is a number of years back, is that it actually improves the overall performance. And if it improves the overall performance, then that would mean that there's a good trade-off between cache misses and load balance. Yeah, that's what I'm trying to say.

F

I just wanted to know. You know, it's clear that the overall time is reduced because the load balance is better, but the performance of each thread might be a little bit worse than the ones that were running before. That's all I'm asking.

A

Yes, and I think you're probably right. And so one would need to look at that with perf events if you wanted to get to the bottom of that.

D

Okay, let me switch back.

C

[Mark] Actually, John, if you still want my old 2006 databases, I copied a tar file to Cori for you. Okay, (indistinct). [Mark] So in my home directory you'll find two tar files, db-amg-something-or-other.

A

Or you mw parental on this machine.

C

No, on Cori. What machine are you on? Are you on Cori? Cori. [Mark] So you'll find those two. Probably the one that says dbamgompt makes sense. I haven't opened them, these are old. I haven't opened them in several years, but these were the things that I used as demos at Argonne. Yeah, I'll attempt to read them.

A

Okay, we'll come back to that.

C

[Mark] oh, hold on one second.

A

All right, I mentioned here, okay, so the idle time is reported as barrier. And then I looked at using this on Summit with the new version of the LLVM OpenMP runtime that we had built, and we were not able to, because the IBM and the PGI compilers hard-wire their libraries.

H

[Soft voice] John, I have a question. Yes. [Soft voice] The question: you were talking about going from static, that static scheduling was better than dynamic scheduling, and yet you continued to use threads and you got better performance, and the threads weren't idle as before. My question is the following. Did you also investigate... Huh?

A

So my comment there was...

A

[Soft voice] I haven't asked the question yet.

H

The question is: did you think of doing things sequentially within the process, or did you think of using some other threads, doing things sequentially within those other threads, as opposed to just using threads within the process of the first thread?

A

Well, what I would say is that we didn't look too hard at trying to optimize this. I mean, the focus of my group's work is mostly on building the tools. And so anytime we optimize a program, there's usually a reason for it. Either we're looking for an example for a paper where we exploited insight, or we're working with an application team that actually cares about what the performance is. And for the AMG benchmark, nobody really cared enough for us to actually go back and do that. So we didn't invest.

H

[Soft voice] okay, thank you.

A

I'm in some bizarre new line in my display here.

D

And I'm hoping that I don't know what that's from.

C

[Mark] well,, I fixed the permissions on the files,, but you should you probably just want to go. On.

C

yeah, I'll, come back to that later.

A

If someone wants to follow up on, on ompt, I don't know why. I have.

B

[Sharp voice] If you go into annotate, then you have to hit clear, "clear my drawings", and that'll usually get rid of it.

G

It's on pdf, right?.

A

It's on the PDF. I know. Actually, it is some sort of annotation that's happening right now. It's a Zoom annotation. I didn't mean to turn that on. I must've just clicked on it somehow. Okay, clear all drawings, there we go.

E

[Sharp voice] there, it is.

A

Thank you., I'm still in drawing mode.

D

Okay,, I'm just going to be careful,. I guess.

D

[Sharp voice] all right close, the annotations.

B

And you'll get out of drawing mode all right. Let's, let's see that.

A

All right, I don't want to spend too much time on this, because I'd rather spend time working with people on their applications. So let me just go through. I wanted to mention differential performance analysis, because this is something that's useful for both CPU and GPU codes.

A

So, you know, the problem is: when you're scaling to a large number of threads or GPUs and you notice that your efficiency is dropping off, your question is what's causing that. The goal is to have some sort of automatic scalability analysis that will pinpoint the bottlenecks, guide you to the problem, and quantify the magnitude of each problem. And then ideally you'd like it to diagnose what's wrong. Well, I think we have a solution for the first three, but diagnosing exactly what's wrong is really kind of complicated. It depends upon a lot of things in your code. We can tell you the problem is here and we can tell you how much it costs, but the ultimate diagnosis is really left to you. And so you'll see with an example what this really looks like.

A

So for parallel applications, often there are lots of layers of libraries, and the performance is context dependent. For instance, in some sort of climate code you might have land, sea ice, ocean, and atmosphere simulations. If you found that your program spent time in MPI wait, you would need to know the call stack to find out whether there's a problem with the atmosphere or a problem with the ocean. So we want to have call stack profiling. If you were just told that you spend time in MPI wait, that doesn't help you when you have such a complicated application.

A

So for us, what we want to be able to do is understand these bottlenecks no matter what they are, whether they're computation or data movement or synchronization. We don't know. And the constraint we have is that we have to have a low data volume. We can't be recording everything that the application does, because otherwise we perturb your execution too much. And if we perturb the execution too much, we're changing the behavior, and it also means the measurement is not useful in large-scale runs.

A

So the thing is that you have expectations for what your code is going to do. Maybe you're trying to do strong scaling, and so if I double the number of processors, then I expect linear speedup. Or maybe you're doing weak scaling, where you double the number of processors and you also double the problem size, so that the amount of work that's being done by each processor should be constant.

A

Okay, so the thing is, if you have these expectations, then you can put these expectations to work. You can measure your performance under different conditions, with either different levels of parallelism or different inputs or both. Then you express your expectations as an equation, compute the deviation from expectations for each calling context, correlate this with the source code, and explore the annotated calling context tree.

A

So let me give you an example here. Let's say that first we're running the application on, say, 256 cores, and we find out that we spend 400K units of time in some solver. And then we scale that out to 8K cores, and we're doing strong scaling. Under strong scaling, we expect that the total amount of time that we spend in the solver across all of the ranks is the same. If we spend more time, then that's wasted effort. Okay? Because for a strong scaling problem, essentially the costs and their distribution should be exactly the same no matter what number of processors you ran it on. So if I ran it on 10 processors and then I ran it on 20 processors, I would expect that it would run in half the time, but it would still spend equal amounts of time in the solver across the two executions.

A

So what we can do is take this insight that we spent different amounts of time in here. We can take the distribution of the costs for the large-scale execution, subtract off the distribution of the costs for the smaller-scale execution, and compute the differences. And so this difference here is wasted effort. If we take this and divide it by the total amount of time that was spent in the whole run, then it becomes a fraction of wasted effort. And the fraction of wasted effort is the same thing as scalability loss.
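Written out as a formula, the computation just described might look like the following. The symbols are assumptions introduced here, not notation from the slides: T_p(c) is the cost attributed to calling context c, summed over all ranks, in the run on p processors, and q > p is the larger run. Using the total of the larger run as the denominator is also an assumption, since the talk only says "the whole run" at this point.

```latex
\[
  \mathrm{wasted}(c) = T_q(c) - T_p(c),
  \qquad
  \mathrm{loss}(c) = \frac{T_q(c) - T_p(c)}{T_q(\mathrm{root})}
\]
```

Under ideal strong scaling, T_q(c) = T_p(c) for every context c, so the loss is zero everywhere. Any context with a positive value is doing extra work that only appears at scale.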

A

So I'll show you an example here.

D

Okay, close this one.

A

All right, so this is the FLASH code from the University of Chicago. We ran it on 256 cores and we ran it on 8K cores. And so in the user interface I can just explore and find out where the application spent its time. I can do that focused on where I spent my time with 256 cores or with 8K cores. Now, this was a weak scaling experiment, and what I have here is data from one rank. Under weak scaling you expect that the execution time is going to be the same. Okay? Now the fact that this 5 times 10 to the 8th differs from 6.71 times 10 to the 8th, that is scaling loss.

A

Okay, under weak scaling, if I double the problem size and double the number of processors, then I expect the times to be the same. So now I'll just apply this idea that I was just showing you on the slide. I'm going to take the inclusive time on 8K cores and I'm going to subtract the inclusive time on 256 cores. That's computing wasted effort. And then I'll divide it by the total time on 8K cores. So there's a distinction here, where you say point-wise or aggregate. Point-wise means at every point in the calling context tree; aggregate means the cost at the root, the total cost. So I divide by the aggregate cost on 8K cores. That's my fraction of wasted effort, and then I can multiply that by a hundred and get percent wasted effort. And percent wasted effort is the same as scalability loss.
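The derived metric being typed into the viewer can be mimicked in Python as a sanity check on the arithmetic. This is a sketch, not hpcviewer's formula syntax. Only the two root values (5 x 10^8 and 6.71 x 10^8) come from the talk; the per-phase numbers and the context names are invented for illustration.

```python
def scaling_loss_percent(small, large, root):
    """Percent scaling loss per calling context under weak scaling.

    small, large: dicts mapping context name -> inclusive time for one
                  rank in the small and the large run.
    The numerator is point-wise (per context); the denominator is the
    aggregate cost, i.e. the inclusive time at the root of the large run.
    """
    total_large = large[root]
    return {ctx: 100.0 * (large[ctx] - small.get(ctx, 0.0)) / total_large
            for ctx in large}

# Root values are the ones read off the screen in the talk.
# The "init" and "evolve" values are hypothetical.
small = {"flash": 5.00e8, "init": 0.80e8, "evolve": 4.10e8}
large = {"flash": 6.71e8, "init": 1.47e8, "evolve": 5.04e8}

loss = scaling_loss_percent(small, large, root="flash")
for ctx, pct in loss.items():
    print(f"{ctx:8s} {pct:5.1f}%")
```

The root comes out at about 25 percent, which is (6.71 - 5.00) / 6.71. Because the denominator is the same aggregate value for every context, the percentages of the children add up to at most the percentage at the root.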

A

And I'm going to write this as an inclusive metric, and I'm going to display it as a percent. Okay, so that tells me that I have a scaling loss of 25%. And I can see that 14% of the scaling loss is in the evolve phase and, sorry, 10% is in the init phase.

A

So I'm going to pick the init phase first, because I know something. I say, show me where the scaling loss is, and it takes me down here, deep into the computation. If I look around a little bit, I notice it says that there's a loop over all processors. Now, the line that actually came up is a couple of lines below this. The reason is the way we attribute loops: we say that the loop exists at the line where its first machine instruction came from. This was the first machine instruction that appeared in the loop, and so that's why we ended up with this mapping. But very nearby, we see a loop over all processors.

A

Well, if we have 256 cores versus 8K cores, we're going to go through this loop a lot more times with 8K cores. Okay, so that's a reason for the scaling loss.

A

But what's really going on? Well, if you look at this, this is amr refine derefine, amr morton process, find surrounding blocks. What happens is they're doing block-structured AMR. After some number of refinements in the block-structured AMR, you have a data block and you want to know: what are my neighboring data blocks? And because these were distributed with space-filling curves in Morton order, the data blocks are scattered all over the processors in the system. So the way that they find the neighboring blocks is they say: here are the blocks that I have; let me take that information and send it to my right neighbor, and then we'll circulate that. My right neighbor's going to do the same thing. Everybody's doing the same thing. So everybody's circulating the information about what they have around the ring. Basically, it's like an all-to-all communication where everybody's telling everybody else, here's what I've got. Then you select out and say, oh, this is who's got my neighboring block, and so that's who I have to communicate with. And so we end up spending all this time on calls to send-receive-replace as this data is cycling around the ring. And the length of the ring is proportional to the number of processors. That's how many times we have to cycle things around the ring. Okay, so, high-level point.
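A tiny simulation shows why the ring pattern scales with the number of processors. This is a sketch of the communication pattern only, with no MPI involved; the function names and the one-block-per-rank setup are assumptions made for the example, and the rank counts are scaled down so it runs quickly.

```python
def ring_exchange(blocks_per_rank):
    """Circulate every rank's block list once around a ring.

    Returns (steps, seen), where seen[r] is the set of blocks that
    rank r has learned about after the circulation completes.
    """
    p = len(blocks_per_rank)
    in_flight = [list(b) for b in blocks_per_rank]  # buffer held by each rank
    seen = [set(b) for b in blocks_per_rank]
    steps = 0
    for _ in range(p - 1):
        # Every rank sends its buffer to its right neighbor and
        # replaces it with the buffer arriving from its left neighbor.
        in_flight = [in_flight[(r - 1) % p] for r in range(p)]
        for r in range(p):
            seen[r].update(in_flight[r])
        steps += 1
    return steps, seen

def steps_for(p):
    steps, seen = ring_exchange([[f"block{r}"] for r in range(p)])
    assert all(len(s) == p for s in seen)  # everyone heard from everyone
    return steps

steps_small = steps_for(8)
steps_large = steps_for(256)
print(steps_small, steps_large)
```

Every rank needs p - 1 exchange steps before it has seen everyone's blocks, so going from 8 to 256 ranks multiplies the number of steps per rank by roughly 32, even though each rank only needs to find a handful of neighbors.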

A

Using this technique, with just one equation in a spreadsheet, and then saying "show me the call path", I came down upon this bottleneck. Now, if you see what's going on here, as I just explained it to you, you can understand why we can't tell you exactly what it is and exactly how to fix it. Okay? This is something that requires some deep application knowledge about what's going on in order to address the scaling loss.

A

So we talked to Anshu Dubey, who was one of the leads on the FLASH code. And the first thing she said is: that's not my code. You know, the thing that has a scaling loss, that's not my code. We've got this library from NASA called PARAMESH, and that's what PARAMESH does. What she did do, though, is she looked at: well, do we really have to ask everybody what they have? And what they found is that, through properties of the space-filling curve, there are only a few neighbors in the space-filling curve who could possibly have the blocks that surround me. By communicating only with that fixed number of neighbors, they could avoid this all-to-all communication, which in this case was cycling things around the ring, and they could turn that into much more scalable behavior.

A

So this is not good for just one thing. I mentioned that we found this one bottleneck. If we go back up to the top of the init phase, we see that most of it was in the domain where I was looking. There's also some in grid init. Let's go look and see where that comes up. Well, here's a call to MPI barrier, and then we see something that says "do iproc equals", oh, that seems suspicious. It seems just like the same sort of thing as before. Well, if we look a little closer, there's a whole bunch of reads, and then there's a loop that says: iterate over all the processors; if it's my turn, open the input file. Well, I'm going to bet that at some point they ran this thing on 20,000 cores, and everybody opened the input file, and they crashed the file system. And so somebody said, well, why don't we just take turns? I'll just put this little conditional in here and we'll all open it one at a time, and then we won't crash, and then I'll just get on with my work. And, you know, they left behind a little scaling bottleneck. It's only a little thing, it's like 1.97% for this run, but we can find it. Okay, so that was in the initialization phase.

A

So then we can look at the evolve phase, where the scaling loss is also high. I press the flame button and it drops down in the call stack until the cost scatters some. And so I'll just pick the first thing, which is the highest sorted cost inside the scope: where is my scaling loss largest? And then, if I look, I find that, oh, actually we're back in the same loop over all processors in find surrounding blocks. So in the initialization phase they're doing some of this refinement, derefinement, and finding surrounding blocks, and that happens in the evolve phase too. So basically a couple of bottlenecks here account for a significant fraction of the scaling loss. And despite the fact that this is a hundred thousand lines of Fortran, it didn't take too much to find them by just writing this equation and then clicking around. Any questions about this?

F

Did they solve the read problem? I think I have a solution. It can broadcast, right? [Bold voice] Exactly: the master reads, and everyone else receives the broadcast.

A

That's right. I mean, you have to look at this as, you know, a lazy Thursday where someone said, hey, I crashed the file system, let me fix that in like two lines of code. And they did, and then they never looked back.
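The difference between taking turns and the broadcast suggested from the audience can be put into a rough cost model. Everything here is an assumption for illustration: the function names, the unit costs, and the choice of a binomial-tree broadcast, which is one common scheme but not necessarily what any particular MPI library does.

```python
import math

def turn_taking_time(n_ranks, t_open):
    """Ranks open the input file one at a time, so the last rank
    waits for every other rank's open to finish first."""
    return n_ranks * t_open

def broadcast_time(n_ranks, t_open, t_msg):
    """One rank opens and reads the file, then a binomial-tree
    broadcast delivers the contents in about log2(n_ranks) rounds."""
    return t_open + math.ceil(math.log2(n_ranks)) * t_msg

# Assumed unit costs: one time unit per open and per broadcast round.
for n in (256, 8192):
    print(n, turn_taking_time(n, 1.0), broadcast_time(n, 1.0, 1.0))
```

In this model the turn-taking cost grows linearly with the number of ranks, while the broadcast cost grows with its logarithm, which is why the small conditional shows up as a scaling loss only in the larger run.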

A

Right, but they left it. Okay, so this scaling loss analysis worked well because everything is in a tree: everything is rooted at flash, and I can go find where the losses are in the tree. If I'm not using the OMPT interface and I have things broken up into master and thread root, then this top-down analysis doesn't work well on a tree, I'm sorry, on a forest.

A

But if you do the analysis bottom up, then you can actually get some of the same benefits. So instead of inclusive scaling loss, I compute exclusive scaling loss. My colleague Xiaozhu Meng, who is here, did this for a code, PIConGPU, which he was running on a collection of AMD GPUs. He was looking at scaling over one GPU versus scaling over 16 GPUs. And so then you can compute a scaling metric just the same way that I did, except that rather than computing inclusive scaling loss, you compute exclusive scaling loss. Then you look at it in the bottom-up view and say: show me where the exclusive scaling losses are the largest, and then show me how I got there. And then, even if you got there by different call paths because of those master and worker threads, you'll be able to find the places where the loss is big.
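Here is a minimal Python sketch of the bottom-up variant. It assumes profiles are given as exclusive costs keyed by call path; the paths, function names, and numbers are invented, and the code only illustrates the aggregation, not how hpcviewer builds its bottom-up view.

```python
from collections import defaultdict

def exclusive_loss_by_function(small, large):
    """Percent exclusive scaling loss per function, summed over all
    call paths that end in that function.

    small, large: dicts mapping call path (tuple of names, root first)
                  -> exclusive cost in the small and the large run.
    """
    total_large = sum(large.values())  # exclusive costs sum to the total
    loss = defaultdict(float)
    for path in set(small) | set(large):
        diff = large.get(path, 0.0) - small.get(path, 0.0)
        loss[path[-1]] += 100.0 * diff / total_large
    return dict(loss)

# Hypothetical forest: the same function is reached from two roots.
small = {
    ("program_root", "main", "solve", "exchange"): 10.0,
    ("thread_root", "worker", "solve", "exchange"): 10.0,
    ("program_root", "main", "solve", "compute"): 80.0,
}
large = {
    ("program_root", "main", "solve", "exchange"): 30.0,
    ("thread_root", "worker", "solve", "exchange"): 30.0,
    ("program_root", "main", "solve", "compute"): 90.0,
}

loss = exclusive_loss_by_function(small, large)
worst = max(loss, key=loss.get)
print(worst, round(loss[worst], 1))
```

Because the loss is keyed by the function at the end of each path, the contributions from the master's tree and the worker's tree are added together, so the hot spot stands out even though no single root contains all of it.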

A

So this technique applies not only to CPU code. You can use this differential analysis technique on GPU codes as well.

A

All right, last thing: kernel sampling. When you use sampling in the Linux perf events subsystem, you can sample not only activity in user space, you can also sample kernel-space activity. When a thread is active in the kernel, the user-level calling context is frozen. So what we do is attribute kernel activity to the point in the user-level calling context where we entered the kernel. Let me just show you a live demo of this. I don't think that Cori is configured to be able to do this, but let me show you this quick example.

A

So this is also included in the tutorial examples. I apologize that it was broken previously, but I didn't focus on it because our focus was on GPUs for this tutorial.

A

So here, this is a very simple program. It has a loop, and it calls do work. And do work calls malloc and allocates a large temporary, and then hops through the temporary 4096 words at a time and assigns the number 10.
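For readers without the tutorial sources at hand, here is a loose Python analog of the program being described. It is not the tutorial's code: it uses an anonymous `mmap` in place of `malloc`, strides one page at a time rather than 4096 words, and the sizes and names are chosen arbitrarily. The point it illustrates is the same, namely that each call gets fresh memory and touches every page of it once.

```python
import mmap

PAGE = mmap.PAGESIZE

def do_work(n_bytes):
    """Allocate a large temporary, write one value per page, release it."""
    buf = mmap.mmap(-1, n_bytes)   # anonymous mapping obtained from the OS
    touched = 0
    for offset in range(0, n_bytes, PAGE):
        buf[offset] = 10           # first touch of this page
        touched += 1
    buf.close()                    # mapping goes back to the OS
    return touched

n_bytes = 256 * PAGE
pages = sum(do_work(n_bytes) for _ in range(4))
print(pages)
```

The only user-level work is the one-byte store per page. Anything else that happens when a page is touched for the first time is done by the operating system on the program's behalf.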

A

Okay, and so if I monitor this with real time and find out where it spends its time, it spends 99.3% of its time on this line, where it's assigning the number 10. Anybody have any idea why we're spending our time there?

F

[Bold voice], are we computing cache?.

F

Remember for the page, faults.

A

Yes, so Rishi understands.

A

So the view here, I just measured real time. Okay, and it says I'm spending my time in this loop. Now suppose you were writing some complex solver and you found that you spent your time inside your solver loop. You'd be pretty happy, but maybe you're a little bit deluded about what's going on.

A

So now I'm going to measure with perf events. I'm saying hpcrun -e cycles. And an important part is that this particular machine, I'm doing this on a machine at Rice, is configured so that perf events is allowed to take kernel samples.

A

And now if we look at where we spend our time, where do we spend our cycles? We see that on line 11 and line 12, on line 12 we have page faults: the page fault handler, mapping pages, clearing pages. Okay, and so really what you're spending your time doing is not this assignment. You're spending all of your time clearing pages in the operating system. 90% of the time you're clearing pages in the operating system. Okay?

A

[Sharp voice] Is that a security thing? Because there's no...

B

I mean, malloc doesn't require cleared pages.

A

Well, the thing is that this was a large allocation. Okay, so n is large. When it was malloc'ed and then freed, malloc returned the pages to the operating system. And then when you go around the loop again and malloc them again, it gives you pages back, but the pages that it gets back from the operating system are cleared for security before you are given them.

A

Yeah, okay,, that's what I was asking.

B

So it's because of the security, not because of malloc, because malloc has no behavior there. If you just say, I want some pages, they're just random data. They could be random data. [Mark] Yeah, so it totally depends on

C

where the page was before. If it's a page from the same process, then your argument is correct. If a page came from a different process, then it has to clear it for security.

A

So I think, though, that when you release pages, I don't think the operating system keeps track of "there's a page on the free list and who had it last", just that there's a page on the free list. [Mark] It will keep track of who had it last.

B

Okay. [Sharp voice] Yeah, and also malloc will have its own free heap, and only if you, you know, got rid of a big mmap'ed page does the kernel even know about it.

A

Yes, agreed. So anyway, what this shows is all of the page faults, and that we're actually spending our time down inside the kernel. Then there's another example here, if I say make run-pdf. This is actually going to monitor cycles and it's going to monitor page faults. And so then we can see with perf events that, in fact, the reason for this is that we've got all of these page faults that are occurring, and the page fault handler is getting us pages and clearing them. Okay, all right. So what are you to do as an application developer? This is all nice to know. Well, actually, it's sort of terrifying to know.

I

[Calm voice] John, one quick question. Yes. [Calm voice] At the level of program root, there was a partial call path, something. What is a partial... in the previous view at least, what is that? Could you explain a bit?

A

What that means is we tried to unwind the call stack and we never made it all the way back to the root, all the way back to main. So we're working our way up the call stack, and then for some reason we couldn't get out to the caller's frame. Why can that happen? Well, what we found is that sometimes the compiler generates bad DWARF information, which is what tells us how to unwind the call stack. We try to use the information that the compiler provides. If the compiler provides us with junk, then sometimes we can't unwind with it.

A

The other thing that we do is on-the-fly binary analysis to try and figure out how to unwind. That's something that we do on both the Power architecture and the x86 architecture. So if there's no information from the compiler, then we'll try our own analysis. It works a lot of the time, but sometimes we just can't figure out what's going on, and so that's another case where we might be in a frame and not be able to get to the caller. [Calm voice] Okay, understood. And so also when you run a program, sometimes you might see program root.

A

Sometimes you might see partial call paths, and then other times you might see some, you know, strange routines. Those should all be listed under partial call paths. Consider it a bug in HPCToolkit if anything shows up here that's not program root or thread root or partial call paths or omp idle. If you just see names of random functions, it's a sign that our unwinder is having trouble and it couldn't even figure out how to put them under partial call paths. It seemed like the unwinds were okay for some reason, but it was a bad unwind and we didn't recognize it. Okay.

A

All right so now let me show...

F

[Bold voice] Is do work called multiple times in that example?

A

It was. There was a loop.

A

Could the programmer have done the malloc outside of.

F

do work and passed a pointer to the memory? Would that solve the problem?

A

That would solve the problem. I'm going to show you another solution.

B

[Sharp voice] I think the better idea is to use huge pages or use calloc.

A

So in this make fast example, we still see the cost down on this loop. But if we actually look at the cost here, what we'll see is that this cost is 1.5 times 10 to the 9th. If I bring up the other example...

D

Okay, I'll choose (indistinct).

A

So this is, in fact, exactly the same code. In the slow case it's 2.59 times 10 to the 10th, and here it's 1.5 times 10 to the ninth. Okay, so it's significantly faster over here than over there.

A

And the only thing that I did is I linked this with TCMalloc. TCMalloc does not turn pages back over to the operating system. It hangs onto them inside your address space, so even after they're freed, when we go back to reallocate them, it's just taking pages that are already allocated in the address space, and so it doesn't have to spend all that time clearing the pages. And so the point is that some allocators...

A

So I don't know whether, by default... They've been doing some work on malloc, and some work on making malloc thread-conscious.

A

But here we can see that there's definitely a difference between using malloc and, here, calling malloc while linked against the TCMalloc library. I'm using exactly the same user code. It just runs a lot faster because I'm using a better allocator, or an allocator that doesn't turn pages back over to the operating system.
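
To make the allocator point concrete, here is a toy Python model of the effect, under stated assumptions. It is not how glibc malloc or TCMalloc are implemented; the class ToyAllocator, the function pages_cleared_in_loop, and the 4096-byte page size are hypothetical names and values chosen for illustration. The only thing it models is that a page obtained fresh from the operating system has to be cleared, while a page the allocator kept after a free does not.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class ToyAllocator:
    """Toy model: counts how many pages must be cleared by the OS."""

    def __init__(self, keep_freed_pages):
        self.keep_freed_pages = keep_freed_pages
        self.cached_pages = 0   # freed pages still held in the address space
        self.pages_cleared = 0  # pages that had to come fresh from the OS

    def malloc(self, nbytes):
        pages = -(-nbytes // PAGE_SIZE)  # ceiling division
        reused = min(pages, self.cached_pages)
        self.cached_pages -= reused
        # Only pages that are new to the address space need clearing.
        self.pages_cleared += pages - reused
        return pages

    def free(self, pages):
        if self.keep_freed_pages:
            self.cached_pages += pages
        # Otherwise the pages go back to the OS and are forgotten.


def pages_cleared_in_loop(keep_freed_pages, iterations=1000, nbytes=64 * PAGE_SIZE):
    """Allocate and free a work array in a loop, as in the example discussed."""
    allocator = ToyAllocator(keep_freed_pages)
    for _ in range(iterations):
        block = allocator.malloc(nbytes)
        allocator.free(block)
    return allocator.pages_cleared
```

In this model, an allocator that hands pages back on every free pays the clearing cost on every iteration, while one that holds onto them pays it once.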

B

[Sharp voice] I just want to say, I often use calloc because, in Linux at least, it uses copy-on-write pages and it only touches the pages that I touch.

B

I find that works a lot better in many cases, too.

A

That's a good suggestion.

A

And so the only thing that I was really trying to convey with this example is that there can be hidden costs here that you know nothing about if you're only monitoring things and you're not seeing what's going on in the kernel.

A

Okay, and so if you're just using CPU time or you're using real time, you're only getting part of the picture. Even if you're using cycles, we're not monitoring inside the kernel unless we can actually show you call paths. I think one could argue that maybe we should change the way that we do that, so that even if we can't get kernel call paths, we should just show you that there's a giant pile of samples down here that come from calls into Linux, even if we can't tell what they are.

A

So, the way that Cori is configured (and I checked on Summit, and I think that the configuration is similar), there's something called /proc/kallsyms that says what the addresses of the routines in the kernel are, and on Cori they're all listed as zero. What that means is, if we collected any sampling data and we got program counter locations out of the kernel using perf, we can't interpret them. And so in that case, we're just throwing them away.
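
A quick way to check for the situation described here is to look at the address column of /proc/kallsyms. The sketch below is a minimal illustration, not something HPCToolkit ships; the helper names kallsyms_addresses_hidden and check_system are made up for this example. It assumes the usual "address type name" line format of that file.

```python
def kallsyms_addresses_hidden(text):
    """Return True if every symbol address in kallsyms-style text is zero."""
    addresses = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            addresses.append(int(fields[0], 16))
        except ValueError:
            continue  # skip anything that does not start with a hex address
    return bool(addresses) and all(addr == 0 for addr in addresses)


def check_system(path="/proc/kallsyms"):
    """Read the file on a Linux system; returns None if it cannot be read."""
    try:
        with open(path) as handle:
            return kallsyms_addresses_hidden(handle.read())
    except OSError:
        return None
```

If check_system() returns True, kernel program counter values from sampling cannot be mapped back to kernel routine names by an unprivileged tool.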

A

And saying, well, there's not much point. One could argue that maybe we should tell you if there's a lot of time in the kernel, even if we don't know what it is. That would probably be better.

A

Okay. So that's what I wanted to say about kernel sampling. It's useful because, on well-configured systems, you can get a lot of insight and figure out that actually the problem isn't with your application, or the problem isn't what you think it is.

A

So there was this LULESH code from Livermore that I showed you earlier. It was an early version that they had developed. What it would do is go down inside the call chain, allocate a work array, do something with it, and then free the work array. The problem that code had was exactly the same problem that I just illustrated here. In fact, we developed this capability after spending a lot of time trying to figure out what was going on, because with LULESH it was just pointing into the workloads and saying, you're spending time here. That looked good until you started looking at how many cycles were spent there and how many instructions were executed, and realized that the number of instructions executed was actually pretty low. And so it was spending a fair amount of time clearing pages.

A

All right, and I guess maybe there was one more thing I wanted to mention. Something that we do in our user interface is something that we call context recycling.

A

Some codes may create a large number of very short-lived threads. We were working with a code called DCA++ at Oak Ridge. They were running 160 MPI ranks, and that, over a relatively brief execution, generated 1.2 million thread profiles and traces.

A

And so what we found was that the trace looked like little snippets, all on the main diagonal. Basically, something would be born, it would do a little something, and then it would die. And so the whole trace looked like just a little snippet around the main diagonal.

A

And so this is a synthetic code that we wrote just to illustrate the problem a little better, where you create a bunch of threads and then the threads die, and then you go and create another bunch of threads. You get another trace line for each of the new threads, and then they die, and then you've got another trace line.

A

And we decided that having the traces along the main diagonal, and maybe having a million trace lines, was just not very effective. So what we did was this: when a thread runs for a while and then completes, we can reallocate that trace line, so that if another thread is born and the intervals are non-overlapping, we pack them in. What's shown at the bottom here is the packed-in version of what's shown on the top.

A

The white between them indicates that there's no thread active for that period of time, and so the thread activity is punctuated by inactivity. What that means is that a thread was born, it died, and then some other thread took its place on the trace line. Just to show you that this can be relatively useful, this shows DCA++ with 10 ranks and 12 threads each, executing with context recycling. Instead of having a million trace lines, we end up with just a modest number. There were so many threads because they were using, say, std::async to launch something on a GPU. The std::async would create a thread, and then the thread would die as soon as the computation was finished, and we just pack them in.
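
The packing idea can be sketched as a greedy interval assignment. This is only an illustration of the concept, not HPCToolkit's actual implementation; the function name pack_trace_lines and the (start, end) tuple representation are assumptions made for the example. Each thread lifetime is placed on the first trace line whose previous occupant has already finished.

```python
import heapq


def pack_trace_lines(lifetimes):
    """Assign each (start, end) thread lifetime to a reusable trace line.

    Returns (assignment, line_count), where assignment[i] is the trace line
    used by lifetimes[i]. Lifetimes that share a line never overlap.
    """
    order = sorted(range(len(lifetimes)), key=lambda i: lifetimes[i][0])
    busy_until = []  # min-heap of (end_time, line) for lines in use
    assignment = [None] * len(lifetimes)
    line_count = 0
    for i in order:
        start, end = lifetimes[i]
        if busy_until and busy_until[0][0] <= start:
            # The earliest-finishing line is free again: recycle it.
            _, line = heapq.heappop(busy_until)
        else:
            line = line_count
            line_count += 1
        assignment[i] = line
        heapq.heappush(busy_until, (end, line))
    return assignment, line_count


# Three waves of four short-lived threads each, as in the synthetic example.
WAVES = [(wave * 2.0, wave * 2.0 + 1.0) for wave in range(3) for _ in range(4)]
```

With the three waves above, twelve thread lifetimes fit on four trace lines, with gaps on each line where no thread was alive.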

A

So it's useful to know that we're doing this, because otherwise you might be confused. Now, for GPU streams, I don't think that we're packing multiple GPU streams into the same stream line. I think we keep our GPU streams separate. We haven't found many applications that create and destroy streams repeatedly, although I'm sure that someone's got the exception that we'll need to pay attention to. All right.

D

[Sharp voice], so that's about all.

I

Yes. [Calm voice] Just to understand the previous slide, can you go back, please? Yeah. So in the top, the x-axis is still the timeline, and the y-axis, are those ranks or not? I mean, I'm sorry.

A

Well, yeah, think of the y-axis as being threads here. [Calm voice] Okay. So what happens is that maybe 50 threads are born and then they die, and then another 50 are born and then they die, and then another 50 are born and then they die.

A

And each of these groups is on separate trace lines. But for the first thread of this group, why leave the display empty for the whole rest of the time if there's another thread that was born at some later time? Why can't we just take the trace line for, say, the first thread in the second group and put it up here? And so that's what we've shown here.

A

And that's what yields this view for DCA++, where all the blank spaces in here mean that the CPU thread was idle. It's showing, when you were active, what you were doing, and if you zoom in you can see that there's actually idle time between them. But I'd argue that this is a lot better than having a million trace lines.

I

[Calm voice] Okay, so in the previous slide, still in the second trace, does the y-axis still represent... those are 50 vertical lines in total, right? Is that correct?

A

I'd have to zoom in and see. It's actually 176. [Calm voice] So those are... When I said 50, I was just making that up. You probably can't see it as I'm zooming in, but here on the top it says 0 to 880, and on the bottom it says 0 to 176. Okay, so we collapsed 880 things down to 176.

F

[Bold voice] Would you like to have each horizontal line represent a core, or something in a given hardware context?

A

So that's a good point. There's a new version that we're working on that's in a branch. You'll notice that in our current version, when you're looking at things, all you can really see is rank ID and thread ID, but it's not really correlated with the hardware in any way. And so there's metadata there that's missing.

A

And so in the new version that we have in our branch, we keep track of what your node ID was, what your GPU ID was, and what your GPU stream was. Instead of just saying, here's a GPU stream line, now you can understand that it was on this node, on this GPU, associated with this rank, and the following stream.

A

Okay. So we have that metadata. As for your comment, for a CPU thread we could say it's on a node, it's on a core, and it's a hardware thread too. We have that kind of metadata now, and we're actually going to show you that metadata. If you select a trace line right now, what you'll see is that it was on, say, rank six, thread four. We're going to tell you that you were on such-and-such a node, with such-and-such a core and such-and-such a hardware thread ID.

A

Okay, so now, if we were using that sort of hardware-oriented view, then what we should see is all of the things that ran on core six on one line. So it's a good observation.

A

Let me see if I can pull up the example from Mark, and then we can figure out how to use the rest of the day productively.

D

[Mark] If you want to try, I fixed the permissions.

C

So in my home directory on Cori, look for the one, db dash amg. Probably the ompt one is the one you want. I haven't looked at it in at least two or three years, but from the way I named it, it sounds plausible. I'm expecting that this is AMG2006 run with the OpenMP tools library. A run on Theta, of course.

A

Yeah, so like,, like cori,.

C

[Mark] Yeah, it probably comes with a trace, so you want to see if there's a trace there. Yep, we will look at that.

D

It's more of a trace first.

A

I'm not using nx here, I'm just doing this directly on a virtual desktop at rice.

D

So, let's see we have,.

A

So here, remember I showed you, with the trace that I collected on Cori, that it was showing us that this was in a barrier. This AMG2006 is very much like AMG2013; it's just that AMG2013 is a little bit more modern version of it. In this case, what we see is that it does tell us that all this green here is OpenMP threads that are idle. If you look a little closer, what you can see is that there are periods... the colors are really sort of bad. There's low resolution between them. We assign colors randomly; sometimes it works and sometimes it doesn't. So what you can see is this brown computation, where we're offloading some OpenMP work to the worker threads, but once that finishes, while all the rest of this is going on, the OpenMP workers are just sitting there. They're not doing anything. And so, if I start looking at this application in sort of a top-down fashion...

D

Okay, it's a little slow in my lane.

A

So I'll find that there's main.

D

Too long to render.

A

And then there's.

A

Now I'm moving down one level, and what we should see is that the computation is divided up: there's setup, which is this green phase, and then there's a solve phase, which is different.

D

[Mark] Yeah, the interesting part of this database would be in the solve phase, looking at the load imbalance within each region.

A

Right, so what I wanted to show is right here. If you notice, there's this AMG setup. It's drawing, drawing. I'm looking for the level where we can see the difference between what code is serial and what code is parallel. If I zoom in, it'd be a little easier to see.

A

This is a case where we couldn't... it was just slow. I'm typing way ahead of the user interface. Let me go back to the profile.

A

So this is going to address what I think (indistinct) was asking about.

A

What we had done in the user interface was collect information about when you were idle. We used a strategy where a thread would say, I'm idle and I don't know why, so let me just increment a counter saying that my thread is idle. Then, when other threads took samples and they were active, they would say, well, I have to assume some blame: how many of us are active and how many of us are idle? I'm going to take some blame for the idleness that's elsewhere. And so then we're able to charge the idleness to places in the code. Then we can say, why don't I take the program and say, show me where the idleness occurs.

C

[Mark] Yes, when I demoed at Argonne, what I demoed was the parallel efficiency of different OpenMP loops.

A

Right, and so a lot of the idleness is coming from this AMG coarsen Falgout routine. If we look carefully in the trace (it's a little hard to do interactively and remotely), what we would find is that, in that setup phase, AMG coarsen Falgout never calls any OpenMP code at all. And so what's happening is that there's a bunch of OpenMP worker threads that are idle, and all of their idleness gets blamed on the serial code that's running in the other thread.
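
The blame-shifting idea described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the measurement code in HPCToolkit: it assumes we can see, at each sample tick, what every thread is doing, and the names blame_idleness, serial_setup, and parallel_loop are hypothetical. At each tick, the idle threads are counted and that idleness is split evenly among the code contexts of the threads that are active.

```python
from collections import defaultdict


def blame_idleness(ticks):
    """Charge idleness to the code that was running while others were idle.

    ticks is a list of samples; each sample lists one entry per thread,
    either a code-context name (thread active) or None (thread idle).
    """
    blame = defaultdict(float)
    for states in ticks:
        active = [ctx for ctx in states if ctx is not None]
        idle_count = len(states) - len(active)
        if not active or idle_count == 0:
            continue
        share = idle_count / len(active)
        for ctx in active:
            blame[ctx] += share
    return dict(blame)


# Four threads: three ticks of serial code on one thread, two fully
# parallel ticks, and one tick where only half the threads have work.
EXAMPLE_TICKS = (
    [["serial_setup", None, None, None]] * 3
    + [["parallel_loop"] * 4] * 2
    + [["parallel_loop", "parallel_loop", None, None]]
)
```

In this example the serial region collects most of the blame, which matches the observation that idle worker threads get charged to serial code running on the other thread.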

A

So I think that this addresses the issue that you had, Steve, which is: there's all this idle time, what am I going to do about it? By attributing the OpenMP idleness to the code that's actually running, we're identifying regions of serial code, or regions of code that are under-parallelized, where, if you can add more parallelism to them, that would result in less idleness. And so that gives you some advice about where you should tune your code. We're looking to integrate the support for this back into HPCToolkit and the LLVM OpenMP runtime.

A

The good thing about doing that is that these days Intel's generating code with the LLVM runtime and AMD's generating code with the LLVM runtime, so if we fix it there, it will be good on the forthcoming exascale systems. And a lot of people use Clang anyway. So, okay.

A

Okay, other questions?

B

[Sharp voice] I have a question. It's related to something I DMed you. Do you guys have a set of, like, here are the best counters for looking at different things, as opposed to here's a big list?

A

I don't. Maybe if you look at that thing that was linked in the slides, that might have some advice about that. We've spent so much time just trying to figure out how to get the data that we haven't focused on using it in that kind of way.

A

So Intel, in VTune, did build support for top-down analysis. There's a whole area here. This was started by Dave Levinthal and called cycle accounting, and he did some work on this for the Itanium architecture. There's a bunch of material available on the web about cycle accounting, looking at this top-down model where either you're stalled, and then you can ask, am I stalled in the front end or the back end? If I'm stalled in the back end, am I stalled waiting for functional units, or am I waiting for the memory hierarchy? If I'm waiting for the memory hierarchy, is it level one, level two, or level three? Or else I'm executing instructions. Are the instructions graduating? Are they useful, or are they speculative and being squashed? And so this top-down analysis is a tree structure designed for exactly the way that you should be looking at this.
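
The tree structure described here can be written down directly. The sketch below encodes only the categories named in the talk and rolls cycle counts up from the leaves; it is not Intel's VTune method or its official category names, the function roll_up is a made-up helper, and the cycle counts in EXAMPLE_CYCLES are invented numbers for illustration rather than measurements.

```python
# Categories as described in the talk; None marks a leaf.
TOP_DOWN_TREE = {
    "stalled": {
        "front end": None,
        "back end": {
            "functional units": None,
            "memory hierarchy": {"L1": None, "L2": None, "L3": None},
        },
    },
    "executing": {"graduating": None, "squashed": None},
}


def roll_up(tree, leaf_cycles, path=(), totals=None):
    """Sum leaf cycle counts up the tree; returns (subtotal, totals)."""
    if totals is None:
        totals = {}
    subtotal = 0
    for name, child in tree.items():
        node = path + (name,)
        if child is None:
            cycles = leaf_cycles.get(node, 0)
        else:
            cycles, _ = roll_up(child, leaf_cycles, node, totals)
        totals[node] = cycles
        subtotal += cycles
    return subtotal, totals


# Invented example counts, keyed by the path from the root to each leaf.
EXAMPLE_CYCLES = {
    ("stalled", "front end"): 50,
    ("stalled", "back end", "functional units"): 100,
    ("stalled", "back end", "memory hierarchy", "L1"): 50,
    ("stalled", "back end", "memory hierarchy", "L2"): 100,
    ("stalled", "back end", "memory hierarchy", "L3"): 300,
    ("executing", "graduating"): 350,
    ("executing", "squashed"): 50,
}
```

Reading the totals from the root down answers the questions in the order posed above: stalled or executing, then front end or back end, then functional units or memory, then which level.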

A

That whole approach is something that's very useful, and now Intel has codified this in VTune. They have this top-down analysis style available and they're supporting it. We can look that up on the web.

A

I think, seeing that...

B

[Sharp voice] I sent you a picture of that in Slack. I was asking you exactly that: is there something like that for, what do you call it, HPCToolkit?

A

HPCToolkit. So right now we don't have support for top-down analysis. What I can say is that a few years ago we were looking to build this, and then it became clear that 95% of the cycles were on the GPU. And so we just stopped working on the CPU side and began to focus almost exclusively on monitoring and analyzing GPU performance. I think you can understand why. Yeah.

A

Okay, any other questions? Or I was thinking that maybe... is Willem on? Well...

I

[Calm voice] John, a few questions I have are about OpenACC kernels. Are there any specific things that need to be considered for OpenACC applications? You mentioned the OpenMP tools interface and some of the caveats or challenges there. Is there anything specific for OpenACC?

A

So this is actually a code that we were working on with Willem, who was working on Summit. I worked with them some on Wednesday and we got pretty far, so let me show you what we collected.

A

So this is an OpenACC program, and this was running on... I'm not sure; I guess it ran with 36 ranks. And then it had a couple of CPU threads and a couple of GPU streams within a process.

A

If we look at what's going on in a process, I think this is the main thread, and then this is a progress engine thread. I can see thread root calls progress engine, and the next one says here's another progress engine calling polling. Those are thread two and thread three. The progress engine threads are not of interest, so I'm just going to filter out thread two and thread three. On the filter menu I'll just say: thread two, don't show it; thread three, don't show it.

A

And now we're looking at the main rank, and then there are some OpenMP... well, no, I guess some sort of slave threads. I don't know what they are, and they're launching some things. And then we have GPU kernel operations. They're very short, and we can find out what they are and the full call path where they came from. This is a Fortran code, and it's coming through OpenACC and invoking these kernels.

A

Okay, so what I wanted to show you back in the profile is, you asked about... This is still...

D

Closed database I've got two things over here.

A

Okay, so now we're just down to his OpenACC code. If we look at where it spent its time, and where it was executing GPU operations...

B

[Sharp voice] Well, isn't most of the time spent in thread root, not in program root?

A

Yeah, it is. And the reason for that is just the thing that I was talking about, where we have OpenACC slave threads. And so I would argue that this is an uncomfortable way to analyze your performance, so we're probably better off in the bottom-up view.

B

[Calm voice] Yeah, and I've noticed that OpenACC spins pretty hard when you're waiting for the GPU kernel. It's not light on the CPU.

A

So I was looking to see what I had done here. Actually, what you're seeing is a version of this code where I had applied some filters. So let me refresh here.

A

So if I look for where... sorry, I'm clicking around so much. I'm looking for where the GPU operations are called. What I see is all of this stuff where we're going through the OpenACC launch: pgi uacc launch, cuda launch, cuda launch three, all this stuff. I just don't want to see it. It doesn't add any value to what I'm looking at.

A

And then we also noticed that there were some places with unknown... here we go, unknown file, but acc device. They've got some stripped functions. And so we went and just put some filters in and said, anytime you see pgi uacc, just hide that function. Basically, any cost that is incurred here is just attributed to its caller. By saying star pgi uacc star, it's going to collapse all of these out. If you look at this mode gen meg something, it calls some OpenACC things that call GPU kernel. Now, when I turn on the filter, go back into the same place, sort by the right metric, and find the same place where we were, it looks like it calls GPU kernel directly. You get to see the OpenACC kernels, and we've hidden all the PGI library cruft.
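
What hiding frames does to a call path can be sketched with a glob match. This is an illustration only; it is not the viewer's filter code, and the frame names in EXAMPLE_PATH (including the spellings of the runtime functions) and the helpers hide_frames and merge_paths are hypothetical. A hidden frame is dropped from the path, so its cost lands on the nearest visible caller.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def hide_frames(call_path, pattern):
    """Drop every frame whose name matches the glob pattern."""
    return [frame for frame in call_path if not fnmatchcase(frame, pattern)]


def merge_paths(samples, pattern):
    """Re-aggregate (call_path, cost) samples after hiding matching frames."""
    merged = defaultdict(float)
    for call_path, cost in samples:
        merged[tuple(hide_frames(call_path, pattern))] += cost
    return dict(merged)


# Hypothetical call path from user code, through runtime glue, to a kernel.
EXAMPLE_PATH = [
    "main",
    "solver_step",
    "pgi_uacc_launch",
    "pgi_uacc_cuda_launch",
    "pgi_uacc_cuda_launch3",
    "<gpu kernel>",
]
```

With the pattern `*pgi_uacc*` the path collapses to main, solver_step, and the kernel. A narrower pattern such as `*uacc_cuda*` would keep the outermost launch frame visible while hiding the two beneath it.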

A

[Calm voice] Yeah, that's helpful.

A

okay, and then notice that there's some stuff down here.

A

[Calm voice] Well, I have a question.

B

If you're sending data that you shouldn't be sending to the GPU (I shouldn't say accidentally) because it's already there, won't you hide all of that stuff that's going on when you do this, and then you're not...

A

No, because here's copy in. We have put in these specific tags, like copy in, GPU sync, GPU kernel. These are synthetic tags that we put in, so you don't end up just showing up somewhere in an OpenACC library where, if I hid those things, they would just get swallowed. We attribute things to these synthetic tags. Okay, so copies get attributed to the tag. Now, where did this copy come from? Well, there's no line information right in front of it. That means that it must have come from this, right here. So it's...

D

I think, let's, let's look up in the code a little bit.

A

Let me not hide things and see if it looks different. So we have kernel launches at the bottom of this call chain.

B

[Calm voice] Right, and what you don't see here is that PGI is looking inside that loop. Well, you do have a copy in there, but it will put implicit copies in if you don't say anything.

A

Actually, this is a very good point. Notice that there actually is a useful line number where this pgi uacc launch is called. If we swallow this frame, then we lose this line number. So rather than blindly filtering out all of this stuff, I could say this one is useful, but I don't care about this one and I don't care about this one. So I could write a slightly more selective filter that...

A

[Calm voice] They have the same name.

A

No, this is uacc cuda launch. [Calm voice] Oh, cuda. I see, yeah, yeah, yeah.

A

So if I just edit this and say I want to filter uacc cuda instead of just all uacc stuff.

A

It's annoying (we will write this down as a bug report) that it's losing the settings. This was originally selected as my sorted column, and that selection disappears when I apply the filter. So we should get that back. So here...

E

Well, did I not specify this right?

D

[Calm voice], I think those were not applied.

I

I think the filters were not applied.

D

Does that keep recycling the same thing?.

D

Getting ready to fall, awkward.

A

Nothing like eating your own dog food.

A

Okay, so now we're actually tying the kernel launches back to OpenACC directives, and the copies also should be tied somewhere. If I want to go find the copies, let me look in the bottom-up view and sort by copies, and find where I spent my time copying. And then this tells me my copy outs were coming from a...

D

This is just copy in, and we've got something wrong there.

B

[Calm voice], it's probably implicit.

A

Yeah, right, so there might be an implicit copy out.

B

[Calm voice] Compile with -Minfo=accel. It'll tell you, for every loop, where it's implicitly doing things. But yeah, PGI won't throw an error; it'll just say, oh, you must need to copy this out, so I'll do it.

A

I see. Well, what I can tell you is that we haven't spent a lot of time looking at this; the first time we actually looked at OpenACC was Saturday. So this is basically what you get if you don't pay any attention to OpenACC at all in HPCToolkit. Is it possible that we have the wrong tags on here? I think we would have to write a little benchmark program that was copying in and copying out, so that we knew exactly what was there and could validate the tags. What we're showing seems right, but I can't guarantee it. What I can say is that anytime we're saying either copy in or copy out, it definitely is data movement, and I think that it's the right direction. It's just that I'm a little bit unsure here, because we do see copy in directly, and we're measuring things that say copy out.

B

[Calm voice] The rest of the trace looks like you're doing it right, because it says acc downloads and acc uploads. It's doing the right thing. It's just that there's no line in the code that says copy out, because it's implicit.

I

[Calm voice] Hi John, another question here. Along with the program root, we have the thread root. So in the case of OpenACC, these are OpenACC threads? There is no OpenMP in here, right?

A

I think that these are OpenACC threads, and let's see.

I

[Calm voice] I meant the thread root, the third one.

D

This is a partial. Yes, I wasn't paying attention to what I'm clicking on.

A

Create slave threads from nvomp.

B

[Calm voice] That's actually NVIDIA stuff, the lower level. It's not even ACC.

I

If that is the case, okay, so then out of 100% of the application time, only 17% is actually just in the code. Is it... how... well...

A

So remember, the reason why I gave this talk this morning was to say: when you're looking at profiles, be aware of real time in the profiles, because there may be progress engines that are just sitting completely idle, just sitting there waiting for some event to happen. We're repeatedly asking, what are you doing? What are you doing? What are you doing? And we see that they're inside the progress engine. If we measured using CPU time or cycles, that would give us a more accurate assessment of what's going on. So it may be that just having these threads sit there is not helping you, but it may not be hurting you either. Now, it may be the case that when you provision for it with, say, jsrun, if these things are running hard, then maybe you need a core for them. But if we measured using CPU time and found that they're actually spending all of their time blocked, then maybe we don't need to dedicate a core to them, and we can use the cores for something else, like packing more ranks or something.
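
The difference between real time and CPU time for a blocked thread is easy to see outside of any profiler. The sketch below uses only the Python standard library and is not related to how hpcrun measures anything; measure, blocked_wait, and busy_spin are names invented for the example. A thread that blocks accumulates wall-clock time but almost no CPU time, whereas a thread that spins accumulates both.

```python
import threading
import time


def measure(work):
    """Run work() and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def blocked_wait(seconds=0.2):
    """Block on an event that is never set, like an idle progress thread."""
    threading.Event().wait(seconds)


def busy_spin(seconds=0.2):
    """Poll in a tight loop, like a helper thread that is running hard."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass
```

Calling measure(blocked_wait) and measure(busy_spin) side by side shows why a real-time profile can make a blocked helper thread look expensive while a CPU-time profile does not.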

B

[Calm voice] My experience is that one core runs hot and the rest of the threads are idle.

A

Well, we should be able to see that by running it again and measuring it with something different. So this result that we got with Willem is a bit of a cautionary tale. First of all, we launched it and it got stuck, and the way it got stuck was this: for historical reasons, there's an option that has to be passed to IBM's Spectrum MPI that says, disable PAMI, disable CUDA hook.

A

They're trying to understand when memory is being allocated on NVIDIA GPUs, and NVIDIA refused to put in a callback interface that would notify them of that, so that they could just have Spectrum MPI register for a callback. So instead, they decided that they were going to do some violence inside the runtime, and if NVIDIA wouldn't tell them, they were just going to take the information.

A

NVIDIA will find their functions, like cudaMalloc, by opening NVIDIA's library and then doing a dlsym to find a pointer to the function that has the appropriate name. And so what IBM said is, well, we're going to override this dlsym function, and anytime anyone goes to look for a symbol in any shared library, we're going to check to see if they're looking for cudaMalloc. If they are, then we'll know what that symbol is, and we're going to wrap it so that we can watch the mallocs. It turns out that what they did will lock up just about every tool. We knew this in 2017 and we had a better solution for them, and I thought that the better solution was what they ultimately deployed. I come to find out now, monitoring things on Summit, that no, it's left in the same state it was in in 2017.

A

And so you have to give them this disable PAMI CUDA hook option, because they put something in their runtime that's basically kryptonite for tools. Tools have to use dlsym, and if they override dlsym, it causes trouble. So that was item one. When we ran, we found out that the program deadlocked, and we had to attach a debugger, and then saw it was deadlocking in this lib pami cuda hook library that IBM has. And we remembered that, oh, actually, we have to turn that thing off. And so we added that option.

A

and then the second thing that happened was it. It collected all the information and it made it all the way to fortran stop at the end of the program.

A

And then, at the end of the program, we said, cupti flash all which was saying.

A

Asking nvidia's cuda profiling tools, interface to flush all the activities out of the gpu.

A

And that call never returned.

A

And so the job timed out.

A

And it turned out that we had collected all the data.

A

And written it all down, except that the job never, the job never completed, because this call that we made to nvidia gpu infrastructure, just never returned.

A

Now,, maybe there's something that our tool did.

A

like there's, some interaction between the tool and nvidia's infrastructure that caused it not to return.

A

I don't know., obviously it seems like there must be something.

A

But we don't know what that interaction is, and we don't know how to fix. It.

A

And so we'll have to work with nvidia to try and track that down. we tried to get a reproducer for it, but we don't have a, a simple reproducer right now.

A

What we have is the whole application., so a good solution for us to use.

A

Is if we say flush, the data out of the gpu and the gpu doesn't respond in 10 seconds.

A

Then we should say, okay, fine., I give up,. Let me just record the profile data we've got and go home instead of, instead of expecting that it will terminate normally..

A

So I think we could be a little bit more defensive inside our, our,, our tools.
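
A minimal sketch of that defensive pattern, in Python for illustration only. It assumes the flush can be run on a helper thread; `flush` and `write_profile` are hypothetical callables supplied by the caller, and this is not how HPCToolkit itself is implemented.

```python
import threading

def flush_with_timeout(flush, timeout_s=10.0):
    """Run flush() in a helper thread; return True if it finished in time."""
    done = threading.Event()

    def worker():
        flush()
        done.set()

    # daemon=True so a flush that never returns cannot keep the process alive
    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout_s)

def finalize(flush, write_profile, timeout_s=10.0):
    """Try to flush, then write whatever data has been collected either way."""
    flushed = flush_with_timeout(flush, timeout_s)
    write_profile(complete=flushed)
    return flushed
```

The point of the design is that writing the profile never waits on the flush: if the flush hangs, the data already collected is still recorded, and it is marked as possibly incomplete.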

I

That's the question I have. So if I can ask one last question about the user interface, in terms of the color mapping.

I

So in one of your slides, you have that climate model,

I

and there were I/O routines that were called.

I

Let's say there is one, the ocean I/O routine, and there is, I don't know, the earth I/O routine, and it's calling HDF5.

I

And then I want to map basically everything from ocean

I

I/O with one color, and the earth

I

I want to be one color, independently.

I

There are multiple, you know, stack frames.

I

I don't want to see those stack frames, because I just want to have one. Is it possible?

A

Currently, with the filters that we have,

A

it would not be possible to do exactly what you want to do.

A

So this is actually something that is

A

under discussion, because we're looking at

A

what kinds of enhancements we'd like to make to this.

A

So right now, the way that one writes these

A

filters is, one writes,

A

say, something about a procedure name.

A

Okay, so it's like, here's a name for the procedure itself.

A

So there's no way to actually look at context like you were saying, you know, to say,

A

for the atmosphere things versus the ocean things, when I do I/O, I want to color it one way or another. We also discussed that maybe you want to be able to say,

A

if you're in the following load module, say you're in the MPI library, like you're in libmpi, so I just want to color all that stuff red.

A

You know, that would be helpful. Another thing that we thought of was if your code came from the following file

A

or from the following path. So for instance, using, say, the RAJA template-based programming model,

A

you may include the templates all over the place,

A

and so the RAJA code is kind of scattered all

A

throughout your code, and you might want to just highlight anything where it came from the file *raja*,

A

you know, where RAJA appears somewhere in its path.

A

You know, maybe you want to color all that stuff

A

one color. So I'll take as a note that

A

you would like to have some way to apply colors

A

to things based on context. I think that that's a good suggestion.

I

Yeah, that will be very helpful, because I've been wondering about this for quite some time.

I

And a classic example is, let's say we have the I/O, and typically when the process is waiting, we see, for example, GPFS, nocancel, and stuff

I

which is waiting for the I/O. And I want to draw the timeline.

I

I want to see what's going on.

I

And if I change the depth,

I

I basically get more and more from GPFS, which I'm not interested in, and a lot of other things.

I

So depending upon the I/O phase,

I

I just want to make something black or red, and I just want to, you know, understand. That would be very helpful.

A

Right, so one thing that you could do: based on the filters here, there are a couple of ways that you can filter things. So when I add a filter,

A

I can write in a pattern and say

A

I want to filter out things of this name.

A

So that's what we did with pgi_uacc_cuda_launch.

A

We said, don't show this name, just put the cost into the parent.

A

So that's filtering out the node itself, but you can also filter out descendants only.

A

So for instance, if I say MPI_Send

A

and then say filter out descendants only, then all the costs below MPI_Send

A

will just get folded upward into MPI_Send.
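
The two filter modes just described can be sketched on a small call tree. This is an illustrative model only, assuming glob-style name patterns; the `Node` class and both functions are hypothetical and are not the viewer's own data structures.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch

@dataclass
class Node:
    name: str
    cost: float = 0.0               # exclusive cost of this frame
    children: list = field(default_factory=list)

def total(node):
    """Inclusive cost: the node's own cost plus that of everything below it."""
    return node.cost + sum(total(c) for c in node.children)

def filter_self(node, pattern):
    """Hide matching nodes; their cost and children move up to the parent."""
    kept = []
    for child in node.children:
        child = filter_self(child, pattern)
        if fnmatch(child.name, pattern):
            node.cost += child.cost
            kept.extend(child.children)
        else:
            kept.append(child)
    node.children = kept
    return node

def filter_descendants(node, pattern):
    """Keep matching nodes but fold everything beneath them into the node."""
    if fnmatch(node.name, pattern):
        node.cost = total(node)
        node.children = []
    else:
        for child in node.children:
            filter_descendants(child, pattern)
    return node
```

In both modes the total cost of the tree is unchanged; the filter only changes which frame the cost is reported against.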

I

Does this filtering

I

get applied to the trace in the same way?

I

For example, in the profile, this is applied.

I

Is it possible to have this in the trace, which would achieve the same purpose that we just discussed?

A

I think that's an excellent question, and I don't know the answer, because this integrated user interface just got released in March.

A

Laksono, are you on?

A

Do the filters apply to the trace?

E

If you apply the

E

filter in the profile view and then you move to the trace view,

E

it will apply to the trace view.

I

So the colors:

I

I expect to see the same color now for the parent and all the descendants?

E

Yes, but there is an issue right now.

E

If you have already opened the trace view and then filter the tree, it doesn't take effect.

E

But if you haven't opened the trace view,

E

and filter just in the profile view, it will apply to the trace view when you open the trace view.

A

Right, right, so what you're saying is, if you don't touch the trace view and you put in the filters, or you always leave the filters turned on,

A

then when you go to the trace view, the filters will be applied.

E

Okay.

A

Now, the question that you were asking is: if you have, like, a calls b calls c, and I want to apply the filter that says roll that whole thing up and just report it as a in the trace view, then you will never see b and c.

A

It's not going to color them the same.

A

If you're using the filtering, it'll just make them disappear and only show the parent.
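
The effect on a trace can be pictured as truncating each call-stack sample. A small sketch, assuming samples are lists of frame names with the outermost frame first; the function name and sample data are made up for illustration.

```python
from fnmatch import fnmatch

def apply_trace_filter(stack, pattern):
    """Truncate a call-stack sample (outermost frame first) at the first match."""
    for depth, frame in enumerate(stack):
        if fnmatch(frame, pattern):
            return stack[: depth + 1]   # keep the matching frame, drop callees
    return stack

samples = [["main", "a", "b", "c"], ["main", "a", "b"], ["main", "d"]]
filtered = [apply_trace_filter(s, "a") for s in samples]
```

After filtering, samples that were in b or c are indistinguishable from samples in a, which is why the frames disappear rather than taking on a's color.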

A

Okay, and so sometimes that's useful.

A

So, in the case that I just showed with the PGI ACC,

A

you may never care about that frame, pgi_uacc_cuda_launch.

A

And so, just don't show it to me.

A

It's, you know, useless.

A

In other cases, you know, maybe you want to apply colors to things and still show them.

A

So I have cases, for the AMG

A

code, where I could say, well,

A

if we can see that you're inside the Intel runtime,

A

like, I match *kmp*.

A

The Intel runtime routines are all named kmp.

A

I say, match *kmp*

A

and just color that yellow. And then I can see some other things at the bottom, like, well, I was calling into barriers, but then at the bottom the barriers are calling sched_yield or something, so I said color that yellow too.

A

And so then all the things I consider, you know, sort of wasteful waiting can get colored the same way,

A

and then treated as an equivalence class

A

when I look into the statistics view.

I

So I think in the trace view, you would still like to see the call stack,

I

you know, when I click through the trace, but at least if there were a way to just color them the same way, that would help. Thank you.

A

So certainly you could color them the same way. Well, right now, your colors are independent of context.

A

So for instance, if you did read, you could say I just want to color reads blue. Okay, but then that would be true

A

whether you're doing the reads from ocean or from atmosphere. If you wanted to have context-sensitive colors,

A

then we could have something where maybe, in the view filter,

A

we could say, well, not only do you match a name,

A

but maybe you match a caller's name as well.

A

You know, sort of, do something context-sensitive.
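
One way to picture such a rule is a name pattern paired with an optional caller pattern. The sketch below is hypothetical: the rule format, the frame names, and `color_for` are invented for illustration and do not describe an existing viewer feature.

```python
from fnmatch import fnmatch

# Each rule is (frame pattern, caller pattern or None, color). A rule with a
# caller pattern only applies when some frame above in the stack matches it.
RULES = [
    ("*read*", "*ocean*", "blue"),
    ("*read*", "*atmosphere*", "green"),
    ("*kmp*", None, "yellow"),
    ("sched_yield", None, "yellow"),
]

def color_for(stack, rules=RULES, default="gray"):
    """Pick a color for the innermost frame of stack (outermost frame first)."""
    *callers, frame = stack
    for frame_pat, caller_pat, color in rules:
        if not fnmatch(frame, frame_pat):
            continue
        if caller_pat is None or any(fnmatch(c, caller_pat) for c in callers):
            return color
    return default
```

Rules without a caller pattern behave like today's context-independent colors, so several patterns can share one color and be read as a single equivalence class.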

A

Okay, I feel like I've talked too much already.

A

I mean, I know I'm answering questions. I'm happy to answer questions all day, but I would like to see if we can get some people's code working.

A

That's what my colleagues are here for.

A

So does anybody have any experiences,

A

one way or the other, where they were either successful, or they would like to report that they failed and they need some help,

A

to see if we can move forward?

B

So this is Rishi. I've gotten

B

HPCToolkit to work. Right now, I'm able to see more from Nsight

B

than I can from HPCToolkit, but I think that's just because of my naivety in how to read

B

what's going on in HPCToolkit in general. That's been my experience so far.

A

Okay, I think that maybe...

A

Let's see who here is on my...

A

If Karen is here...

B

Yes, I am.

A

Perfect, so you can take a look with Rishi

A

at what he finds out with Nsight.

A

And then if there are some things that you can find out with Nsight that we can't find out with HPCToolkit, then we can take that as a note on something that we ought to look at.

A

I know that, for instance, Rishi, we don't compute things like roofline models.

A

I mentioned that that's a piece of future work. So that's clearly a place where there are some things that you can get from Nsight

A

that you can't get from HPCToolkit yet.

J

If I may ask, one other thing that

J

the NVIDIA tools have tried to do, and it wasn't clear to me

J

whether you're tracking it, was unified memory movement.

A

So unified memory is an interesting case.

A

So if you take a page fault on the CPU side,

A

then we should be able to

A

see the cost of the page fault.

A

And I guess the unified memory can be attributed there.

A

But what Karen and I found was that for the GPU, the records in CUPTI

A

did not have an indication of

A

who caused the page fault. I think they just gave us the address that the fault was on, and so I don't think that there was a clear way to map it back to the code. Max, I see that Max is on. Do you know

A

if that's something that's been adjusted in the later versions of CUPTI,

A

or whether they're still lacking

A

a way to attribute unified memory faults back to the code on the GPU?

A

Or Karen, if you know this. I know that you've looked at this, but I don't know that we've looked at this recently.

K

I don't think so.

J

I don't think we can do anything on the GPU side.

F

Yeah, that's my recollection.

A

So I think that, as Constantina said,

A

I think that we're missing a critical piece of measurement infrastructure to be able to map that back to the GPU code, to understand when

A

we're causing page faults and data movement.

A

Now, the other thing that we could do, and this is also sort of a coming attraction, though,

A

is that if we're using the NVLink counters,

A

then we should be able to

A

see the volume of data movement, but again, if we can't attribute it back to the code,

A

then it's a little harder.

A

If there are serialized kernel launches,

A

and we read the counters before the kernel launch

A

and then after the kernel is finished, then we should be able to see data movement that was caused by the kernel. And that would show up regardless of whether it was unified memory data movement or otherwise.
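
The before-and-after counter reading can be sketched as follows. This assumes kernels run one at a time and that a monotonically increasing byte counter is available; `read_counter` and `launch_and_wait` are hypothetical stand-ins, not a real counter API.

```python
def attribute_traffic(kernels, read_counter):
    """Charge each serialized kernel with the counter delta across its run.

    kernels is a list of (name, launch_and_wait) pairs; read_counter returns
    a monotonically increasing byte count. Both are hypothetical stand-ins.
    """
    traffic = {}
    for name, launch_and_wait in kernels:
        before = read_counter()
        launch_and_wait()            # must not overlap with any other kernel
        after = read_counter()
        traffic[name] = traffic.get(name, 0) + (after - before)
    return traffic
```

The attribution is only to the kernel as a whole. If kernels overlap, the deltas mix traffic from several kernels, which is why the launches need to be serialized for this to mean anything.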

A

But we do not yet have

A

the GPU counters completely wired into HPCToolkit.

A

We're working with the PAPI team to address some issues

A

before we can wire it in, in a way that we can release.

J

Okay, thank you, John.

I

So unified memory transfers,

I

are those shown on the trace view or no?

A

I think that they should be shown on the trace view.

B

We don't show that at this moment.

I

Okay. Because I think it's fine that they're not mapped to the source code, but at least, you know, as an application developer,

I

I'd know at this particular time, you know, what is possibly causing it.

I

So if they appear on the trace view, that's at least, you know, something useful already.

A

Okay, we should take a note about that,

A

because we could certainly show them on the trace view.

A

I think that when we get the records, I would have to look at the

A

cupti.h file to make sure that the records actually have times in them.

A

But as long as I have... okay, they do have times in them, so in that case we could make them visible on the trace view. And that would seem like just a few minutes' worth of work, to just say it's

A

another set of things that we're tracing. I think at the moment we probably just have the unified memory events turned off.

J

And as a continuation of that,

J

to the degree that in the future things will be done

J

using HMM in the Linux kernel,

J

do you expect to have more flexibility

J

in figuring things out, or will you have to rediscover things from scratch and possibly run into similar problems?

J

Because I'd like to see, for example, with other types of GPUs that may be using that mechanism for unified memory, how we can get similar information.

A

So what I would have to say is that at the moment

A

we haven't put too much effort into the tracking of unified memory because, first of all,

A

for CUPTI, it didn't have any way of

A

attributing it back to code, and so that made it of less interest for the profile view.

A

And then second, we've been waiting for GPUs

A

from the other vendors who have implementations

A

that support unified memory. My understanding is that the released GPUs from AMD

A

do not, although they promised it for, say, Frontier.

A

So I think that the GPUs we have just don't support it.

A

And so there's no measurement mechanism for it either.

A

Do you know something different? Am I mistaken?

J

Well, I know that there have been things that have been put in the Linux kernel to support it.

J

And that's the mechanism that AMD has been putting in

J

kernel patches for, so I assume that's the mechanism

J

that they're pushing. That's why I asked specifically about HMM, because HMM, hopefully, again,

J

to the degree it gets adopted by more than one vendor, should be a mechanism to allow you to do things in an easier way.

J

I know that NVIDIA at some point was also working with HMM, but I don't know what their current plan is.

A

So Xiaozhu, could you just take a note of that, that we need to talk with AMD about HMM?

A

So, to your knowledge,

A

are there any AMD GPUs that support unified memory?

A

My understanding is that the unified memory support works

A

with Power, but it doesn't work elsewhere.

J

Let's clarify the two differences here.

J

Unified memory that's page-based works with pretty much everything in NVIDIA land.

J

And again, to the degree that I've understood it,

J

the HMM mechanism is the AMD mechanism.

J

And I have no idea exactly how USM is implemented for Intel, whether it's using something like that or they're going their own path.

J

I wish everybody could use something that would make everybody else's life easier, but I'm not sure. That would be page-based.

J

The fine-grained one, we are the only ones that implemented it, until such time as somebody else comes out.

J

I presume for Frontier

J

they may not want to take a step back from Summit, and will have something equivalent, but I don't know what the mechanism is.

J

We are the only ones that have that.

J

But basic information that is page-based, hopefully,

J

can be gotten anywhere.

A

Right, and so, if I recall correctly,

A

then it's like the CAPI interface is the thing that implements that, right?

A

It makes the accelerator, it has a...

J

No, no, the CAPI and then OpenCAPI stuff was a

J

Power-specific mechanism for generic accelerators.

J

But in fact, when we collaborated

J

with NVIDIA for Summit, we did something specific with NVIDIA that was not CAPI-based. So there is an actual engine specifically

J

for talking to NVIDIA GPUs on POWER9 CPUs.

J

So that was designed very specifically.

J

It has its own protocols, its own way of one-way caching, and things like that.

J

So, I mean, if some tool could actually

J

provide details on how things were done, that'd be great.

J

But that's a different story.

J

I was asking more for something that would work everywhere right now.

A

Yeah, so I would say this clearly is something that needs some attention, especially since everybody is moving to these more tightly integrated models.

A

I don't remember off the top of my head

A

what the standards are, but I think there's something called CCIX, perhaps.

J

Yes, CCIX was one effort to essentially make

J

the PCI bus allow coherent connection of accelerators.

J

I guess people call it "C6" for some reason.

J

And then now CXL seems to be the big

J

hundred-pound gorilla out there.

J

And it supports, in the CXL 2.0 standard,

J

full coherence with accelerators.

J

But there is no CXL 2.0 hardware out there.

J

And because of its generality, it may actually offer less than what you get on Summit, I'm not certain, but what you got on Summit was something very specific to Power CPUs and NVIDIA GPUs, as opposed to a generic standard,

J

while CXL is going to be supported by everybody.

A

So I guess my impression was that

A

there was more than one standard. Like, everybody's a member of everybody's standards committee, but I thought that there was one that was the horse that AMD was backing, and then another one that was the horse that Intel was backing.

A

And so it sort of looked like everybody was going to have their own standard, which is sort of less useful for us as consumers of these things.

J

No, I can't speak for the other vendors, but my impression is that everybody is on the CXL bandwagon at this point.

J

But for more performant solutions,

J

they may always implement their own thing, because if you have your own CPU and your own GPU, you can always put in hooks that are more specific.

J

To the degree that you can start tracking

J

cache coherence moves between CPU and GPU,

J

your tool would be like a dream.

J

But I have no idea how much work that would be.

J

At this point I would simply be happy if I can use HPCToolkit and see unified memory,

J

or whatever it's going to be called, whichever way it gets implemented, maybe HMM,

J

maybe some other way, but be able to see it.

J

Even in the trace view, as was mentioned before, that's more helpful than not seeing it at all.

J

Obviously, if there is a way to get attribution, that's even better, no question about it. The reason that I'm saying that is that what we discovered on the Summit experiments,

J

so to speak, is that the moment you start saying,

J

okay, let me put in more and more unified memory usage, you discover sometimes that you just stepped

J

in a trap, because you introduced a lot more motion

J

than you wanted to. So having that way to look at what you did and what caused the trouble would be great.

J

And sorry for asking a lengthy question.

A

I mean, it's a great question, but this is actually something that has to be addressed with the GPU vendors, because, for instance, NVIDIA has this PC sampling interface.

A

And so what you can find out is that here's

A

an instruction that is stalled and there's nothing else that's runnable.

A

And so then we'll log the stall reasons.

A

And it says I'm stalled on memory. And so we don't know whether it's stalled

A

waiting for something to come back from L2 cache or whether it's waiting for something to come back from L3 cache. So all we know is it's stalled on memory. Now, on the CPU side,

A

with AMD's instruction-based sampling,

A

and with, say, Intel's load latency facility,

A

or with Power's marked instructions,

A

we can find out where you got your data from.

A

And so that's something that we can measure.

A

But on the GPUs, right now with NVIDIA,

A

it basically just says you're waiting for memory. And so that isn't going to really tell you, you know, in detail,

A

where this data is moving from.

A

I think that that's what you want to know for this unified memory case. And so what we need is something

A

that's kind of like the equivalent of Intel's load latency facility inside the GPU.

A

And right now none of the GPU vendors supply anything like that. And so if this is the problem that we want to measure,

A

and this is the problem that application developers face,

A

then we ought to advocate that the next-generation GPUs have some mechanism to support it.

A

We've been just trying to fight a simpler problem, which is we just want some way to attribute things

A

to instructions in the GPU.

A

So, you know, I feel like we're halfway there

A

if we can at least say we're stalled here, right?

A

At least if we can say we're stalled on memory, that's part of it. We don't know which memory, but at least it's helpful, instead of just saying the GPU kernel ran for the following number of seconds.

A

So we've been advocating with all the GPU vendors

A

that we want some support for fine-grained measurements.

A

And I can't really say, you know,

A

what the status of that is. All I can say is that currently NVIDIA

A

is the only one that has support for PC sampling in a released GPU.

K

As an observer listening to this conversation,

K

I'm wondering if it would make sense

K

to present HPCToolkit,

K

or to present a motivation for HPCToolkit, where it's focused on drilling down into the internals of what the kernel is doing, or what the backend underlying libraries

K

behind the ROCm stack are doing,

K

rather than looking at the application. It seems like a totally different focus to try and look at the end of the call stacks

K

instead of at the top portion.

A

So, for instance, I don't know what set of slides

A

it's in. It's not in the current set of slides. It's probably a hidden slide in the current set of slides that I just showed, except that I could only find the PDF.

A

So I don't have the hidden slides handy.

A

What we showed was that, using the kernel sampling on an earlier version of NVIDIA's infrastructure,

A

we were actually able to show that you spent a lot of time clearing pages inside CUPTI.

A

So that was showing the interaction between the,

A

you know, the GPU software stack and the operating system.

A

And we felt that that was helpful.

A

And so then we turned around to NVIDIA and said, hey, you know, we noticed that a lot of your overhead is actually clearing pages.

A

And then they went back and took that information,

A

and then their subsequent version of CUPTI didn't do that. So those sorts of insights

A

into the software stack can be useful.

A

And I think that perf is good about that.

A

As long as we have kernel symbols, then we can actually use perf to see what the issues are with the GPU software stack and, you know, beyond the applications.

K

Yeah, and maybe sooner or later there'll be an example of comparing those sorts of information across different systems, to look at how they differ in how HMM

K

performs, for example.

A

Okay, any other questions, or should we work on some code?