National Energy Research Scientific Computing Center (NERSC) / HPCToolkit Training for NERSC and OLCF Users, Mar-Apr 2021 / 6 Apr 2021
From YouTube:
4 - Using HPCToolkit’s Graphical User Interfaces
Description
Analyzing GPU-accelerated Applications
A
Thank you. Good morning and good afternoon to everybody. My name is Laksono Adhianto, from Rice University. For the last 20 minutes, I have tried my best to pack in a presentation about hpcviewer.
A
So, it's divided into three sections. The first part is an introduction to hpcviewer, the second is how to use hpcviewer, and the last one is the demo. I will not do the demo remotely; somebody will be angry if I do the demo remotely. So I will do the demo on the laptop.
A
So, as John already mentioned, hpcviewer is the last part of the HPCToolkit workflow. You have already collected the database from hpcprof or hpcprof-mpi, and now the trick is how to analyze the data. The data can be very large and complicated, and it is hard to analyze. Here we will show some tricks for finding a performance bottleneck.
A
hpcviewer is built on top of Eclipse RCP, which is itself based on Java 11. I think somebody on Slack could not run hpcviewer; in fact, that is because they were running it on Java 8. It is operable on most platforms: Mac, Windows, and Linux, on x86, PPC, and even ARM. Windows users, pay attention: Windows allows you to run a 32-bit Java, and that doesn't work with the 64-bit hpcviewer. It is not supported anymore. The old version of hpcviewer supports a 32-bit JVM, but the latest and greatest hpcviewer does not. And it has not been tested yet on Apple M1, so I'm not sure if it works; you never know.
A
So, an HPCToolkit database has four dimensions of information. The first dimension is the call paths, which is the union of all functions, loops, and statements executed. The second, which John mentioned, is the profiles, which is a list of threads: MPI processes, OpenMP threads, pthreads, and even GPU streams. Keren already mentioned GPU streams and viewing them in the trace view. The third dimension is the metrics, which are hpcrun events. Every time you specify hpcrun -e something, this something is a metric.
A
There are two types of metrics; I think somebody asked about E and I. An exclusive metric, suffixed by E, is the quantity of the metric measured for a scope alone. The inclusive one, suffixed by I, is the exclusive metric plus the cost of its children, right?
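The relationship between exclusive and inclusive metrics can be sketched in a few lines of Python. This is only an illustration, not hpcviewer's implementation: the `Scope` class, the scope names, and the cost values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    """A hypothetical node in a calling context tree."""
    name: str
    exclusive: float                      # cost measured for this scope alone
    children: list["Scope"] = field(default_factory=list)


def inclusive(scope: Scope) -> float:
    """Inclusive cost = the scope's exclusive cost plus the inclusive cost of its children."""
    return scope.exclusive + sum(inclusive(child) for child in scope.children)


# Hypothetical tree: main -> f -> g -> h, with made-up exclusive costs.
h = Scope("h", 4.0)
g = Scope("g", 3.0, [h])
f = Scope("f", 1.0, [g])
main = Scope("main", 2.0, [f])

for scope in (main, f, g, h):
    print(f"{scope.name}: exclusive={scope.exclusive}, inclusive={inclusive(scope)}")
```

For a leaf such as `h`, the two values coincide; for the root, the inclusive value is the total cost of the run.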
A
And the time dimension is available if you run hpcrun with the -t option. So there's a lot of information in the database, and the trick is how to navigate it and how to find the performance bottleneck. It is not easy. Sometimes it requires good knowledge of the application itself, and sometimes it requires attention to detail. Next.
A
So, Suzanne and Helen already mentioned installing hpcviewer; it's already there. To install it locally, you can download it from our website. Mac users, please download it with the curl program to bypass Apple Gatekeeper. Right now, hpcviewer is not certified by Apple yet, but it is on our priority list. You can also build it from the command line, even on Windows. It is open source, so you can build it as long as you have Java and Apache Maven 3.6 or newer. On Linux, you can use spack install. And launching hpcviewer is very easy. It's even easier on macOS: you can type open hpcviewer.app. You don't need to go into Contents and so on; just type open hpcviewer.app or click the icon.
A
I'll do my best to be interactive, so if you have questions, please interrupt me. Any questions? No? Okay.
A
So, there are two modes in hpcviewer. The profile mode presents a summary of application performance from different perspectives: you can look at it top-down, bottom-up, or flat, which I will describe soon. And the trace view presents program traces in a top-down fashion. So, this trace view is new in hpcviewer. If you knew hpcviewer or HPCToolkit a long time ago, or even a few months ago, there were two independent applications, hpcviewer and hpctraceviewer, but now they are integrated into one application.
A
So, here I want to make sure that we are on the same page about the terminology. Some performance tools use the same words as we do, but with slightly different meanings, and some tools use different words for the same meaning.
A
The profile mode has three views, three perspectives. The top-down view presents the dynamic calling contexts in which costs were incurred. The bottom-up view presents the cost by looking upward along the call chain. And the flat view presents the cost based on the structure of the application itself. So, suppose you have an application where the main routine calls f, and then g, and another instance of g, and then h, like here on the left. The top-down view will present exactly that: where each function is executed. The bottom-up view, on the other hand, looks upward along the call path. For example, if I want to know who called h, the bottom-up view will show all the call chains from main to h. And the flat view presents the cost based on the structure of your application.
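As a rough illustration of how the three perspectives organize the same data, here is a minimal Python sketch. It is not how hpcviewer computes its views: the sample call paths and costs are hypothetical, and the flat view here groups only by function name, whereas the real flat view follows the program's static structure.

```python
from collections import defaultdict

# Hypothetical samples: (call path from main down to the sampled function, cost).
samples = [
    (("main", "f", "g"), 2),
    (("main", "f", "g", "h"), 3),
    (("main", "g", "h"), 5),
]


def top_down(samples):
    """Inclusive cost of every calling context (each prefix of a call path)."""
    costs = defaultdict(int)
    for path, cost in samples:
        for depth in range(1, len(path) + 1):
            costs[path[:depth]] += cost
    return dict(costs)


def bottom_up(samples):
    """Exclusive cost per function, keyed by its caller chain (innermost frame first)."""
    costs = defaultdict(int)
    for path, cost in samples:
        costs[tuple(reversed(path))] += cost
    return dict(costs)


def flat(samples):
    """Exclusive cost per function, regardless of who called it."""
    costs = defaultdict(int)
    for path, cost in samples:
        costs[path[-1]] += cost
    return dict(costs)


# "Who called h?": every bottom-up chain that starts at h.
callers_of_h = {chain: cost for chain, cost in bottom_up(samples).items() if chain[0] == "h"}
print(callers_of_h)
print(flat(samples))
```

In this toy data, `h` is reached through two different chains, which the bottom-up grouping keeps apart and the flat grouping merges.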
A
So, if you are familiar with Intel VTune, I think the bottom-up view is similar to the VTune callers view. And if you are familiar with Oracle PA, the flat view is similar to the top-down, something like that. So, different tools use different terms, but basically they are almost the same.
A
Now, the top-down view. As John already mentioned, there are two parts in hpcviewer: the source pane on the top, and on the bottom, the pane with the tree and the metrics. This is an example from NWChem. I don't know whether there are any NWChem developers here. So, in NWChem, you can see that the "flame" button drills down the tree to find a performance bottleneck. It goes from the program root to main, then a loop and the task, and then through the inlined functions until the ccsd routine. So we have seen this.
A
The bottom-up view is one of my favorites. It's very useful for finding the functions with the highest cost. To do that in the bottom-up view, you click the exclusive metric column, and it will sort based on the cost. So in this NWChem database, we can see that the six costliest functions are communication. The first two are from the Cray communication library, GNI_ something, and the other four or five are from the GASNet library. Well, this NWChem database was collected five or six years ago and may not match the current NWChem. Maybe NWChem is much more efficient now, but five or six years ago NWChem had significant communication costs.
A
So if we are interested in gasnet_barrier_try, which costs 43% of the total cycles of the application, you can select gasnet_barrier_try and click the hot path button, and it will drill down. And we find that gasnet_barrier_try is called by GA group synchronization, GA group synchronization is called by ga_destroy, which is called by nx_task. And nx_task is called by many functions in NWChem. So every time the code says "give me the next task, give me the next task," it calls ga_destroy, and ga_destroy causes synchronization, a barrier. So it's a very inefficient way to get the next task. Maybe people from NWChem can explain this better than me. You can interrupt me anytime.
B
(John Mellor-Crummey) So this is John. I'll just add that next task was the way that, using the Global Arrays programming model, they were dividing up the work. And the implementation of next task was, as Laksono said, pretty costly. And that's what we see here.
A
Okay. Thank you, John. Okay, so I have only 10 minutes. The flat view. So, the flat view is good for finding the most costly libraries, for example if you want to know the overhead of the communication library, the I/O library, or the OpenMP library. Okay?
And you can also create user-defined metrics. There's a question in Slack: is it possible to get a new metric? Yes, it is possible. For example, in the database you have PAPI total cycles and PAPI floating-point instructions (at the time, PAPI could approximate the floating-point instruction count on Intel). Let's assume that PAPI_TOT_CYC has metric ID 2048 and PAPI_FP_INS has metric ID 2050. To compute the cycles per floating-point instruction, the formula is $2048 divided by $2050. And there you can see the new column called CPI, right? So the dollar sign, like a cell reference in a spreadsheet formula, refers to the point-wise value of a metric at a node in the tree, while the "@" sign refers to the aggregate metric.
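A small Python sketch of what such a derived metric computes at each node, using the metric IDs assumed in the talk (2048 and 2050). The node names and values are hypothetical, and treating the "@" aggregate as the value at the root of the tree is an assumption of this sketch, not a statement about hpcviewer's internals.

```python
TOT_CYC = 2048   # metric ID assumed in the talk for PAPI_TOT_CYC
FP_INS = 2050    # metric ID assumed in the talk for PAPI_FP_INS

# Hypothetical metric values per tree node, keyed by metric ID.
metrics = {
    "<root>": {TOT_CYC: 1000.0, FP_INS: 250.0},
    "ccsd": {TOT_CYC: 600.0, FP_INS: 200.0},
}


def point(node, metric_id):
    """Like $id in a formula: the metric's value at this particular node."""
    return metrics[node][metric_id]


def aggregate(metric_id):
    """Like @id in a formula: here assumed to be the metric's value at the root."""
    return metrics["<root>"][metric_id]


def cycles_per_fp_ins(node):
    """The talk's example formula, $2048 / $2050, evaluated at one node."""
    fp_ins = point(node, FP_INS)
    return point(node, TOT_CYC) / fp_ins if fp_ins else float("nan")


def percent_of_total_cycles(node):
    """A formula mixing both sigils: 100 * $2048 / @2048."""
    return 100.0 * point(node, TOT_CYC) / aggregate(TOT_CYC)


print(cycles_per_fp_ins("ccsd"))
print(percent_of_total_cycles("ccsd"))
```

The point-wise form gives a different value on every row of the tree, while the aggregate is the same number everywhere, which is what makes it useful as a denominator for percentages.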
A
Another very useful feature in hpcviewer is the metric properties. You saw in Keren's presentation previously that sometimes the metric label is not very descriptive. For example, GPUOP: what does GPUOP mean? You can click View and then Show metrics, and you can find that the GPUOP metric is the sum, over ranks or threads, of inclusive GPU time for all operations, in seconds. Okay? So this is very helpful, because sometimes the metric label is not very descriptive. You can edit the label here in this window if you want, or, if you have a derived metric, you can edit the formula again.
A
So, the trace view, as shown by John and Keren. In the main view, from top to bottom are the ranks, threads, or GPU streams, and from left to right is the time. Here again is an example from the NWChem database. You can see different regions and different phases in NWChem. On the bottom is the depth view, where you can see the call stack across the currently displayed time range for a specified rank. So, if you see the cross-hair here, the depth view shows the call stack for this rank, 193. On the right side is the call stack at this cross-hair. So you can check the call stack as you move the cross-hair.
A
Another very useful feature is the summary view, where you can see the projection of the number of calls across the currently displayed time range. It is useful for seeing load imbalance. The statistics view shows the proportion of the number of samples. You can zoom in and zoom out, and you can also save the current region to disk so you can open it again in the next hpcviewer session.
A
One caveat in hpcviewer is that the colors are generated randomly. So if the color is blue today, tomorrow when you open the database again the color may be blue or green or red; it can be different. This can be troublesome if you want to compare two databases, because the color is different for the same function. We know about this problem and will try to handle it in a future release. But, as Keren showed previously, you can assign a routine or procedure to a specific color. You can map mpi_* to blue, so all the MPI functions are blue.
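The idea of mapping procedure-name patterns to fixed colors can be sketched with Python's standard `fnmatch` module. This only illustrates glob-style matching; the rule list, the `color_for` helper, and the choice to match case-insensitively are assumptions of this sketch, not hpcviewer's actual matching rules.

```python
from fnmatch import fnmatchcase

# Hypothetical rules in priority order: (glob pattern, color).
color_rules = [
    ("mpi_*", "blue"),
    ("kmp_*", "black"),
    ("*_yield", "black"),
]


def color_for(procedure, rules=color_rules, default=None):
    """Return the color of the first rule whose pattern matches the procedure name.

    Names are lowercased before matching (a choice made for this sketch).
    Procedures that match no rule get `default`, standing in for a randomly
    generated color.
    """
    name = procedure.lower()
    for pattern, color in rules:
        if fnmatchcase(name, pattern):
            return color
    return default


for proc in ("MPI_Barrier", "sched_yield", "kmp_barrier", "ccsd_t"):
    print(proc, "->", color_for(proc))
```

Because matched procedures always get the same color, two databases colored with the same rules can be compared side by side for those procedures.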
A
Okay? So that was the NWChem example. I will do the demo on my local machine. I don't have it here. So you can just click hpcviewer in the Finder, and it will ask you to open a database. We've already seen NWChem, so now I will show QMCPACK. Let's look at QMCPACK in the trace view.
A
So let's go back to depth zero. At depth zero you can see there are two phases. In the first, everything is idle except the main thread, which is perhaps QMCPACK initialization or something like that. The second phase is the computation, and there are a lot of parallel regions. If you go down, you can see finer and finer phases.
C
Not here, here.
A
You can see not just two phases or two regions; you can see one, two, three, four, five regions. As you go deeper in the call path, you can see finer and finer interactions in your program. And you can click max depth to go to the bottom of the call path. So it's already 3 p.m. I will stop my presentation, if you have any questions. 3:00 p.m. Central Time, maybe.
B
(John Mellor-Crummey) I would like you to show one more thing, which is the View menu.
A
Yeah, so I will show the color mapping. If, for example, you want to know the total cost of the OpenMP libraries, all the cost of sched_yield or kmp_ something, you can map kmp_* to black, for example, and also *_yield, and click OK. If you go to the statistics, it's all there. OMP idle costs 54% of the total cycles, or the total execution time. And the OpenMP library, like kmp_barrier and sched_yield, costs 30% of the total execution time. But in general, 64% of the execution time is idle or wasted. 64% is huge. So the majority of this application is idle. Any questions?
D
(Student) Oh, you had mentioned briefly how to see load imbalance easily. Could you review that one more time?
A
That is in the NWChem database; I will show the NWChem database. So, this was collected five or six years ago, and the application may be much more efficient right now. Here you can see that some processes have a very long ga_destroy and some have a very short ga_destroy. This one is very short. I will make it bigger. So, different processes have different execution times. Some are very short and some are very long, and you can see the load imbalance here very well, for example. And if you click the summary view, it's very clear: some processes have very long GA kernels and some have very short ones. Does that answer the question?
D
Yes. Thank you.
A
Okay. So also, sometimes the load imbalance is so small that you cannot see it on the whole application. You have to zoom in, and here you can see different execution times for different processes: some are very long, some very short. And sometimes you have to go back: what happened in the previous execution? In the previous execution there is a lot of waiting, armci wait, which is not the same across processes, and that causes the synchronization in the next phase to be very long: waiting for others makes the GA barrier very long. So you have to zoom in. Sorry?
C
Sorry if you have just explained this, but in the summary view, at the bottom, what does that represent? Yeah, that part.
A
Yeah, this is the summary view. It is the projection of the main view. What happens is that we count the number of pixels of each color in the main view, and we project the numbers here. So if, for example, you hover the cursor here, you will see that 2.1% is in an inlined function from ccsd. It's just the projection of the number of calls in the main view onto the summary view, and the statistics view shows a table of the summary view. Yes, basically like that.
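The projection described here can be sketched as a per-column count. This is a simplified model, not hpcviewer's rendering code: the main view is represented as a hypothetical grid with one row per rank and one column per time step, where each cell holds the procedure shown at the selected depth.

```python
from collections import Counter

# Hypothetical main view: rows are ranks, columns are time steps,
# cells are the procedure (color) displayed at the selected depth.
main_view = [
    ["f", "f", "g"],
    ["f", "g", "g"],
    ["f", "f", "g"],
    ["f", "g", "g"],
]


def summary_column(view, column):
    """Fraction of ranks in each procedure at one time step (one summary-view column)."""
    counts = Counter(row[column] for row in view)
    return {proc: n / len(view) for proc, n in counts.items()}


def statistics(view):
    """Fraction of all displayed cells per procedure (the statistics table)."""
    counts = Counter(cell for row in view for cell in row)
    total = sum(counts.values())
    return {proc: n / total for proc, n in counts.items()}


for col in range(len(main_view[0])):
    print(col, summary_column(main_view, col))
print(statistics(main_view))
```

A column where every rank is in the same procedure shows a single band; a column with a mix, like the middle one here, is a sign that ranks are leaving a phase at different times.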
B
(John Mellor-Crummey) So let me interrupt for just a second. Can you zoom in on that sort of gray area for just a minute, the area that we were just pointing at? So just zoom in and show that for most of the display. Which gray area? I'm sorry. Can I just take control of your laptop for a second?
E
Yeah, yeah, I'm access that you could specify?
B
(John Mellor-Crummey) So what I wanted to do was just show you here. What this shows is that, in the center, everybody is working on this particular computation. Okay? So a hundred percent of them are working on it. But then over time, as we start to leave this phase, there's a mix of the gray and the green. And so what you can see is that the fraction goes from being a hundred percent gray, to 50% gray and 50% green, to 2% gray. And so what you see here is an indication of load imbalance as we're shifting from one phase to another. Okay? It's a function of the imbalance that one of them is in, say, procedure f while the other is in procedure g.
C
Okay, perfect. Thank you very much. I mean, after using HPCToolkit for many years, I hadn't paid attention to this, and I didn't easily understand what was going on in the summary view. Oh, okay.
E
So how do you tell what each color is? Is it just hovering, or is it clicking, or what?
A
Sorry, can you repeat the question?
E
There are two colors, a kind of lightish blue and a green. How do you know which procedure it is?
A
Oh, either you click, or you leave the cursor there, and then it will show a tooltip with the function.
E
This is the number, the inline, right at the top?
A
Yeah.
E
On the statistics list. Thanks.
A
And the tooltip in the summary view also shows the ...
B
(John Mellor-Crummey) Yeah. So I switched it, so you can see that this represents the call path to get there. So if we were to select some function that actually has a name... okay, so it's all part of that one function. But then, if we move down another level... so if you click on this, what we see for the place that you click on is the call stack. So this is the call to get hash block, one call from here, whereas if I click over here, it shows me that this is the call to util get next val in this call chain. Okay? So these colors here represent a call chain. They're not a legend for what the colors mean in this particular view. The only thing we're looking at here is depth 13 in the view. And so we see depth 13 at one point in time, and depth 13 at another point in time on another rank. Is that explained?
E
I don't think I understood that.
B
(John Mellor-Crummey) Do you understand what I said now, or no? You didn't understand what I was talking about?
E
No.
B
(John Mellor-Crummey) Okay. When you click on a spot, what it shows you... so here I clicked on something that says rank 3.18. And now, at that point, this represents the entire call chain that's active at that point in time, at 300 and 320,000 milliseconds into the run. The call chain has this, and I'm looking at it at the middle depth. So if it was main calls a, a calls b, then I'm 13 levels deep, and that's the procedure that's being executed. But if you want to know where I really was, I wasn't in get hash block. I was all the way down: get hash block calls get block, which calls get, which calls nga get. And I was down inside the Cray GNI layer, waiting for a request to be satisfied as part of my call to get hash block. So you understand that this is a call chain, and that it represents that one place?
E
Yes.
B
(John Mellor-Crummey) Okay. Now, if I move over somewhere else, I get the call chain for what I was doing over here. And so the green refers to this highlighted color here. It's depth 13 in both places. So we're looking at the whole application as a hierarchy of views. So at the top level...
E
So you could change the depth at any time?
B
(John Mellor-Crummey) You can change the depth by just clicking on something here. So this is now looking at depth four, and then I can look at it at depth eight, and then I can look at it at the bottom. And it's showing you different levels of abstraction of what we're looking at.
E
I think I understand.
B
(John Mellor-Crummey) Okay. Now the summary view looks very weird.
C
John, a question about the summary view again. In the summary view, do the positions of the colors, for example green at the bottom, then blue, and then yellowish green, represent anything? You know, if the green is at the bottom, does that mean anything relevant?
B
(John Mellor-Crummey) I think the answer is no. Think of it as whatever color ID is assigned, and then we always show the colors in the same sort of order. And then the height of the band is sort of shallow here, because there's only a little blue, but it's bigger here, because there's lots of it, which means that there's a lot of this computation going on at this point in time. And it sort of peters out until there's almost nothing.
C
Okay, perfect. Thank you.
F
I have a question. How do you deal with the fact that different threads have different call depths, even if they're in the same function? So, if I do a pthread_create and then it calls a function, but my main program was seven layers down and it called the function too, they're at completely different layers, but they're doing exactly the same thing.
B
(John Mellor-Crummey) So that's an interesting question.
B
In some sense, you would like to normalize for that. I have a student, Lai Wei, who did a PhD thesis on a tool that would take the viewer data and auto-select which levels to show you at various points in time. It would pick the things that were interesting to show you, like where there was imbalance and whatnot. That's not the version that we're running here, but that would address the question that you have. The other thing that we often do is just hit this max depth button, and you look at it at the deepest possible level, because that shows you where you really were: what were you really doing? Okay? And then we look in the statistics pane, and it says, well, okay, GNI_CqGetEvent, so that was dealing with the request queue. So there's a lot of activity where threads are dealing with the request queue, and ga dla progress. So this is the driver for Cray's Gemini fabric, and this is the progress engine. So there's a lot of time being spent in the progress engine, a lot of time checking the request queue. So looking at the bottom level tells you what you're really doing, and that's always a good place to start: look at the bottom.
F
Okay, that makes sense. So if I just look at the bottom, it doesn't matter whether it's depth five or 58 on different threads. It'll just show me where I am.
B
(John Mellor-Crummey) Exactly.
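The difference between viewing at a fixed depth and viewing at the maximum depth can be sketched as follows. The call stacks are hypothetical, and showing the deepest available frame when a stack is shallower than the selected depth is an assumption of this sketch.

```python
def frame_at_depth(call_stack, depth):
    """Frame shown at a selected depth; shallower stacks fall back to their deepest frame."""
    return call_stack[min(depth, len(call_stack) - 1)]


def deepest_frame(call_stack):
    """Frame shown with 'max depth': where the thread really was."""
    return call_stack[-1]


# Hypothetical stacks, outermost frame first. Both threads end up in `work`,
# but the main thread reaches it seven levels down and the worker one level down.
main_thread = ["main", "a", "b", "c", "d", "e", "f", "work"]
worker = ["thread_start", "work"]

print(frame_at_depth(main_thread, 1), frame_at_depth(worker, 1))
print(deepest_frame(main_thread), deepest_frame(worker))
```

At depth 1 the two threads appear to be doing different things, while at the deepest level both show `work`, which is why starting from the bottom sidesteps the mismatch in call depths.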
G
Okay. Yeah. So John, remember that if you're using the OpenMP tools library, then the depths are matched up.
B
Correct. Yes. So we didn't talk about the OpenMP tools library; that talk's going to get deferred till Friday.
F
Okay. So what you're saying is that if you compile it a certain way and link it a certain way, magic happens and it works out?
G
If you're in an OpenMP region.
B
(John Mellor-Crummey) Yeah. I wish. The answer is there is no magic bullet, but it's better. Right, right.
F
Yeah. But what I'm saying is: if you used an OpenMP pragma, it doesn't matter if the main thread, you know, the master thread, was 15 levels down at that point, or that everybody in the worker pool is at level two because they just call a GOMP function, whatever. It'll figure it out, because it's part of the OpenMP library?
B
(John Mellor-Crummey) That's exactly right. Yes. Now, those are topics to cover on Friday. Yes.
A
Okay. Any questions?
F
I was wondering, for OpenACC: are there any things that you guys are doing specifically that apply to OpenACC, versus OpenMP GPU offloading or just normal OpenMP?
B
(John Mellor-Crummey) We're not doing anything specific for OpenACC. Basically, it just works, but you get the default view. Okay? And so I threw a LULESH OpenACC example into the examples, and I hope it works. It got thrown in over the weekend. If it doesn't work, I can take a look at it and fix it, so that it's something that you can look at before Friday. Or it may work already. Not sure.