From YouTube: NUG Monthly Meeting, Jan 2022
Description
Run-time configuration for climate models on Cori: throughput per job or throughput per year? Koichi Sakaguchi from PNNL will talk about experiences running climate models on Cori, including some observations about the impact adjusting sbatch parameters has on workload throughput.
A: Welcome, all, to allocation year 2022 (AY22) and our first monthly meeting of 2022.
A: I think most of us are familiar with the format of this meeting now, but in case you're not: it's intended to be quite an interactive session, more meeting than webinar, so please participate. If you've got a comment to make or a question to ask, I think we're a suitable-sized crowd that you can probably just unmute and speak up. If it starts to get very interactive, then we'll move to raising your hand.
A: We have Today I Learned, which is an opportunity to discuss interesting things you've done, experienced, discovered, or tripped up on recently while using NERSC systems. We'll have a few announcements and calls for participation, and then we have our topic of the day: Koichi has been doing some interesting work on how to set up his climate runs at NERSC to get good throughput, good queue time, and so on. He's got some interesting news and discoveries to present there. Thanks, Koichi.
A: Let's start out with our Win of the Month section, the opportunity to either show off an achievement or shout out one that you've heard somebody else has had. It can be big or small: anything from getting a paper accepted, to solving a bug that's been giving you grief for a little while and took some digging. Maybe you've got an achievement that's worth noting as a High Impact Scientific Achievement award or an Innovative Use of High Performance Computing award.
A: Tell us a little about it: what you did, what you achieved, what you learned, and any tips that others here might benefit from.
C: Hi all, this is Koichi from PNNL.
C: Thanks. For me, and maybe many others, the win of the month was the two training events from NERSC this month, which I really enjoyed. One was the Perlmutter training, and then last week we had the NVIDIA HPC SDK developer tools training. I learned quite a lot from those two training events, and I've just started trying to offload an existing Fortran code we use to generate a computational mesh, which takes a really long time.
A: That's good to hear, and some good tips there.
A
Yeah,
the
way
that
you
offload
to
a
gpu
might
not
be
the
same
as
the
way
you
optimize
a
parallel
thing,
parallelize
things
for
for
other
circumstances
like
openmp
or
mpi,
and
thanks
also
for
that
reminder
about
the
trainings.
A: I believe the slides and recordings are up now for those, so you can find them under the training section of www.nersc.gov. For those who didn't make it, we had two sets of trainings: one an introduction to Perlmutter generally and a lot of the tools, and the other an introduction to the NVIDIA programming environment and tool suite.
A: It's really good to hear that they were helpful. Thanks, Koichi.
A: Would anybody else like to shout out an achievement? I see Helen's just posted in the chat the links to those two trainings, so if you did miss them, or if you saw them and would like to go back and refresh your memory on some aspects: it was about three days of material.
A: The flip side of Win of the Month is Today I Learned, which is a similar kind of format, but it can be something that tripped you up that you might give the rest of us fair warning about, or something that might be good to add to NERSC docs. In the last week or two we've actually had a few merge requests, so you can also contribute to the docs by using GitLab to add comments or make edits and pass them to us that way. Somebody started to speak up, but I didn't quite catch who.
A: So, as well as achievements, if anybody has something they'd like to tell us about that they learned, or got stuck on and would like to learn, it might be something we can talk about here, or a tip for the rest of us to follow up on.
A: We definitely have a few announcements, and the biggest one, in case you missed it: welcome to Allocation Year 2022, which began yesterday with maintenance on Cori. There's a whole bunch of detail in the NERSC weekly email, and a few other places are linked from that weekly email; that will be a good place to begin. A few important differences are worth watching out for.
A: For PIs: did you remember to mark your continuing project members? And if you're not a PI and you find that all of a sudden you can't submit to your project or can't access something, take a look in Iris and make sure that you are still in the project.
A: If you are finding trouble with that, take a look at Iris and drop your PI a line; that's the first thing to try there. Another important thing: in the new year we have a new default Python module. All of last year the default was Python 3.8; this year it's Python 3.9. There have been a few other upgrades too; we're using, I believe it's called Mamba, which is an improvement on conda.
A: For Python users that's probably useful to keep in mind. The other thing you'll notice, when you're looking at your job costs in NERSC hours, is that for this year they've been recalibrated. We were using a NERSC-hour unit that was based, actually a few machines ago, on I think Hopper or possibly Edison. It was a Hopper node hour, which is to say an amount of work in units of what a previous machine could do on a single node in a single hour, from good times gone by, and those numbers were getting further and further away from present-day reality. So we've recalibrated, and what we're using this year is:
A: We've recalibrated it to what's expected to be a Perlmutter CPU node, which is a Phase 2 node, so we don't have those quite yet. A Perlmutter CPU node will have two sockets of the AMD EPYC processors, whereas a Perlmutter Phase 1 node has one socket plus GPUs. So the new hours are recalibrated to modern node speeds. If you're using Perlmutter, jobs are still currently free of charge and will be for a little longer.
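To make the charging arithmetic concrete, here is a minimal sketch of how a node-hour-based charge works; the charge factors below are hypothetical placeholders for illustration, not NERSC's published values.

```python
# Hypothetical illustration: charging jobs in units of a reference node's work.
# A machine slower than the reference node gets a factor < 1, faster gets > 1.
CHARGE_FACTOR = {"cori_haswell": 0.35, "cori_knl": 0.2}   # made-up values

def job_cost_nersc_hours(machine: str, nodes: int, wall_hours: float) -> float:
    """Cost of a job in (recalibrated) NERSC hours."""
    return nodes * wall_hours * CHARGE_FACTOR[machine]

print(job_cost_nersc_hours("cori_knl", nodes=140, wall_hours=8))  # 224.0
```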
A: Okay, another important announcement. It's been a few weeks now, but if you dig back through your email you should have seen some invites to participate in the NERSC annual user survey. This is really important for NERSC: we use it both to get feedback and to get a sense of how our users have been experiencing the machines and the services over the last year.
A: If you haven't already filled it out, please do flip back through your email from the last couple of weeks and look for one from nbriresearch.com (nersc at nbriresearch.com); that's the third-party survey company managing the survey collection for us. Click through from there.
A: Cori default modules did not change in the AY changeover maintenance that happened yesterday. There will be an update of those in the March maintenance; it should be a minor change. Actually, no, it might potentially be a fairly large change.
A: Tyrone has a question about the rollover not going through smoothly. I think there was maybe an Early Science request or a couple of situations where things didn't go smoothly immediately. From the system side, what was rolled over?
A: Some people may still need to nudge their PIs. And if you are finding that something's not working, as someone's pointed out, please do send us a ticket.
A
Are
there
any
other
announcements
or
cfps?
So
that's
what
I
have
from
nurse
side.
This
is
also
an
opportunity,
for
you
know
people
in
our
music
community
to
tell
us
about
things
as
well.
So
if
you're
involved,
for
instance
in
a
you
know,
a
conference
or
event
that
nurse
users
might
be
interested
in,
this
is
a
good
opportunity
to
spread
the
news
a
bit.
Does
anybody
else?
Have
anything
they'd
like
to
add
on
that
nature?.
D: I will be asking for this to go out on the weekly, but I thought I would go ahead and say it now. At LBL, a couple of different divisions are co-organizing something called the Monterey Data Workshop. In 2019 there was the first Monterey Data Conference; we've been trying to hold it again, but with current travel conditions it's been difficult, so we wanted to hold a workshop in April, allowing us to share some
D: good information on AI and machine learning. We're calling it the Convergence of HPC and AI, providing an opportunity for early-career scientists, or really just about anyone (though I think early career has been hit pretty hard recently), to talk about their work. It's very lightweight: not full papers, just abstracts for talks and panels, an opportunity to get together and exchange good ideas. So please do submit; submissions are due by the end of the month.
A: Thanks for that link. I think that conference was quite popular and had some interesting outcomes a couple of years ago.
A: Sounds great. Is the workshop virtual, in person, or both?
D: To be safe we're doing virtual. If, as it gets closer to April, things look like they're going to change, we might try to add some sort of in-person component to it, but I think that would be the add-on; the norm will be virtual.
A: Sounds good, so travel shouldn't be a barrier.
A: It should be a good opportunity for users to, I guess, show off and talk about what they're doing.
A: Any other announcements that people have?
A: I think we're slightly ahead of our usual timing, but that's all good. Koichi, would you like to start sharing your screen? Koichi Sakaguchi is from Pacific Northwest National Laboratory. He has been a very active NERSC user and a participant in the NERSC Users Group for a while now, and we've been looking forward to seeing his findings on some work he's been doing recently.
A: I know that there's an aspect of queue structure, or queue configuration, to it; I think it will be great.
C: Oh, sorry, give me one second to change the privacy setting on my machine; it's set not to give Zoom control of the live desktop. Let me come back quickly. Sorry about that.

A: No worries.
A: While Koichi is preparing his settings, we have a few minutes; I'd be interested to hear if anybody in the meantime has thought of something they'd like to flag as either a Today I Learned or a bit of a win.
C: Okay, thank you so much, Steve, and thanks to all the users and NERSC staff for checking out my talk today. I have 12 slides, and the first half introduces what climate models do, to give context for the second half of the presentation. Really, I'm talking about my experience of optimizing my whole workflow, and the important question for me is model throughput, or how fast the model simulates climate: throughput per job, or per year?
C: Briefly: I'm a research scientist at PNNL, studying mainly atmospheric turbulence in the planetary boundary layer and the related moist convection. This comes with clouds, which quite often form as a result of interactions between the surface and the atmosphere, and those are also linked between small-scale and large-scale atmospheric phenomena; that's my scientific focus. One useful model for studying this is the one I'm showing here as an animation: it uses a variable-resolution grid.
C: Many models have this functionality now. In this particular case I have refined 4-kilometer grid spacing over the US, to better resolve all those strong storms, and it's a global model, but the rest of the globe uses much coarser grid spacing to save some computational cost, while still allowing two-way interactions between the focus region and the outside.
C: We study climate, and this is sort of how we see the Earth system; I believe this just represents my own view. We want to know about the climate, particularly to simulate it and project the future, and we divide the Earth system into different components to facilitate understanding and numerical modeling. This is reflected in how the climate model is designed: for each component we have numerical models, for example atmosphere, land, ocean, sea ice, and the other components, and these are coupled by a so-called coupler.
C: We have this common infrastructure, CIME, which is used by two kinds of models: DOE's own model, E3SM, and the Community Earth System Model, CESM. As you can see, it's a really huge chunk of code, mainly Fortran for the current-generation model. I'll go into detail a little bit later, but it's a community code, developed by the community, with many users contributing, so it's very complex.
C: If we just focus on the atmosphere part, I'll quickly go through to give you an idea of what we mean by numerical models. A large part of the atmosphere is fluid dynamics, the Navier-Stokes equations, so we usually simplify the governing equations; given the scale, we can make some approximations.
C: That's a really big research topic. Even if you take the atmosphere component only, it's already very complex. If you go into the directory of this model, one directory has scripts to build the model and the makefiles, and in the source code each different process has its own directory. The one I just talked about in the previous slide is the dynamics, what we call the dynamical core, and these models actually offer multiple choices to the users.
C: For my research I'm using this experimental model, MPAS, and in this directory I have these files; it's actually a standalone model maintained outside and imported into this code, so we have an external directory with a lot more Fortran code. The physics, as we call it, covers the processes I just mentioned: unresolved dynamics, radiative transfer, molecular-scale processes, and chemistry each have their own subdirectories, and each directory has so many files. In this case we have 122 files, with each file representing some distinct hypothesis for a given process.
C: Has it changed? Oh yeah, okay, here we go. It is a community model, so each scheme is developed by a different group from the labs or universities, including students; as one such student, I wrote some Fortran code while knowing almost nothing about HPC or optimization.
C: The main point is that it's really not easy to optimize for a given project group, even though more systematic development effort is being done for the DOE's E3SM model. I'll touch on this a little bit later, but there are more refined processes for how to proceed with this model development.
C: For example, for climate we talk about global warming from CO2 and other greenhouse gases, which involves the carbon cycle: CO2 in the atmosphere is absorbed by the trees and the ocean, and those processes have time scales of millennia, so for one application we have to run simulations for thousands of years.
C: For these cases the grid spacing tends to be coarse. For studies that are more about societal impact, we usually integrate the model for 100 years, but then we try to address the chaotic part by running many, many simulations with slightly changed initial conditions: for a given model we repeat the same simulation 100 times, for example. That's also very expensive, but those grid spacings are not fine enough to realistically simulate some of the processes, particularly storms.
C: So then we have to go to higher resolutions, inevitably over shorter time periods, but in return we can get very realistic cloud fields. Of the cloud fields shown here, one is an observation, but all the others are just modern global model simulations. Given that, for one of my projects I was tasked to run one of those climate simulations using this variable-resolution global model. In terms of resolution:
C: it's a moderate grid resolution, with this many grid columns and this many vertical levels, giving about 4 million grid boxes. This is experimental code, so we didn't have OpenMP support; we just use MPI. For our project we planned to run this model for about 44 years, covering the historical and future periods.
C: Typically I submit a job requesting about three to ten hours. For this simulation I usually request eight hours, to be safe against node failures or slowdowns of the scratch file system, for example, so that we have enough time for two months of simulation per job. At the end of each job this kind of model code writes what we call restart files, so we can continue in the next job.
C: Given how many years we need to run and the model throughput estimate, we may need about 300 jobs to be submitted. If there were no queue waiting and you just ran as planned, with six hours for two months of simulation, that takes about 66 days. That's not too bad: maybe we can finish testing and analysis, and maybe we can report and write the paper in one or two years.
C: However, we need to consider the queue wait time, because it's a shared community HPC system. As my interest in this grew I contacted Steve to help me get some information. Before showing some queue wait time statistics: I do follow the best practices in the NERSC documentation. This is really our best friend.
C: In particular, a visibly huge impact came from setting appropriate Lustre file striping. Following the documentation, for a high-resolution simulation like the animation I showed earlier, setting appropriate striping reduced writing a single file, the restart file I just talked about, from two hours to just 15 minutes, and other I/O also got faster. That has a huge impact on my workflow and productivity.
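For readers who want to apply the same best practice, here is a minimal sketch of setting Lustre striping on an output directory before a run; the path and stripe count are hypothetical examples rather than NERSC's recommended values, so check docs.nersc.gov for current guidance.

```python
"""Sketch: set Lustre striping on a job's output directory before the run.
Assumes a Lustre scratch filesystem with the `lfs` CLI available."""
import subprocess
from pathlib import Path

def stripe_dir(path: str, stripe_count: int = 24) -> None:
    """Create `path` and stripe new files in it across `stripe_count` OSTs."""
    Path(path).mkdir(parents=True, exist_ok=True)
    # New files inherit the directory's striping; existing files keep theirs.
    subprocess.run(["lfs", "setstripe", "-c", str(stripe_count), path], check=True)

if __name__ == "__main__":
    # Large restart files benefit from striping across many OSTs.
    # Hypothetical scratch path and stripe count for illustration only.
    stripe_dir("/global/cscratch1/sd/username/run01/restarts", stripe_count=24)
```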
C: I also follow the other best practices, for example the sbcast option to copy the model executable to the compute nodes before the actual srun of the simulation, and requesting one switch to place the nodes closer together to reduce communication cost, plus some other tricks; you can go to the documentation for those.
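A minimal sketch of how those two practices might appear in a generated batch script follows; the node count, walltime, QOS, and paths are illustrative assumptions, not values from the talk.

```python
"""Sketch: generate and submit a Cori batch script using two of the best
practices mentioned above (sbcast the executable, request one switch)."""
import subprocess

JOB_SCRIPT = """#!/bin/bash
#SBATCH --qos=regular
#SBATCH --constraint=knl
#SBATCH --nodes=140
#SBATCH --time=08:00:00
#SBATCH --switches=1@30:00

# Copy the executable to node-local storage so every rank loads it quickly,
# then run from that local copy instead of the shared filesystem.
sbcast --compress ./mpas_model /tmp/mpas_model
srun /tmp/mpas_model
"""
# --switches=1@30:00 asks for all nodes on one switch, but gives up and takes
# whatever placement is available after an extra 30 minutes of waiting.

with open("run_mpas.sh", "w") as f:
    f.write(JOB_SCRIPT)
subprocess.run(["sbatch", "run_mpas.sh"], check=True)
```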
C: Given these best practices, I did as much as I could to run the simulation faster once I get through the queue. But the question is when do I get through the queue. What I'm showing is statistics of the average regular-QOS queue wait time on Cori Knights Landing, shown as a function of requested hours and requested number of nodes, from a single node up to a huge number of nodes, and you can see it really depends on those two quantities; it changes a lot.
C: My choice of 140 nodes and eight hours was kind of empirical, depending on this queue waiting time, but this makes it more specific. In the bottom plot I averaged some of those groups into a smaller number of groups so that I can see the numerical values more clearly. Again, requested hours is on the x-axis.
C: If we use, for example, 1 to 31 nodes (the dark blue), the queue wait time starts from a few hours if the requested walltime is very short, but it grows much larger, to more than 20 hours, if you request longer; and if you request 30 hours or more, you could end up waiting 70 hours. The hot spot is 32 to 63 nodes, which are very popular, particularly around 10 to 13 requested hours.
C: You might end up waiting 100 hours in the queue before your job gets in at regular priority. For my choice, for this particular resolution, it's about a 12-hour average queue wait time, and I can add that number to correct my estimate of the real time to get my job done. Now, instead of 66 days it's about 200 days, and if you also consider manual work and downtime, it becomes like a one-year job rather than a three-month, quarterly job.
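To make that arithmetic concrete, here is a small sketch reproducing the estimate described, using the approximate numbers from the talk; the wait-time figure is an empirical average, not a guarantee.

```python
# Rough campaign-duration estimate: wall time plus average queue wait per job.
SIM_YEARS = 44
MONTHS_PER_JOB = 2                  # each job simulates two model months
WALL_HOURS = 6                      # typical productive run time per job
AVG_QUEUE_WAIT_HOURS = 12           # empirical average for this node count / walltime

n_jobs = SIM_YEARS * 12 / MONTHS_PER_JOB                             # ~264 jobs
no_wait_days = n_jobs * WALL_HOURS / 24                              # ~66 days
with_wait_days = n_jobs * (WALL_HOURS + AVG_QUEUE_WAIT_HOURS) / 24   # ~198 days

print(f"{n_jobs:.0f} jobs: {no_wait_days:.0f} days without queueing, "
      f"{with_wait_days:.0f} days with average queue wait")
```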
C: This really affects how I even write proposals, how much we can promise when the PI asks me questions. And you cannot simply, for example, run my model code with more MPI ranks or more nodes; we can get a faster solution, getting it done faster, like two months of simulation in maybe one hour, but that oftentimes makes the average expected queue wait longer, so it doesn't always help. And then quickly, this is the result for Haswell, which you can see even from the color.
C: Those are pretty much the main points. Just one more statistic, on my understanding of simulation cost as we increase the size of the problem: our community's science pushes to higher resolution, as we saw previously, but higher resolution means more grid points and bigger problems.
C: The problem becomes bigger and bigger, so we have to use more and more nodes to get the job done in a realistic time. It's a weak scaling problem.
C: Given the current charging policy, if the problem size increases, I think the cost has to increase; the cost can stay constant only for perfect strong scaling. It's been a while since I wrote this slide, so let me think, hold on. Yeah: strong scaling means that for a given problem size, if we increase the number of MPI ranks by a factor of two, the run time shrinks by half.
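For reference, here is a minimal sketch of the two efficiency definitions being contrasted, with illustrative numbers only.

```python
def strong_scaling_efficiency(t1, n1, tn, n):
    """Speedup relative to ideal when ranks grow and problem size is fixed."""
    return (t1 / tn) / (n / n1)

def weak_scaling_efficiency(t1, tn):
    """Runtime ratio when ranks and problem size grow together (ideal: 1.0)."""
    return t1 / tn

# Perfect strong scaling: doubling ranks halves run time -> constant node-hour cost.
print(strong_scaling_efficiency(t1=6.0, n1=64, tn=3.0, n=128))   # 1.0
# Imperfect weak scaling: bigger problem on more ranks runs slower -> cost grows.
print(weak_scaling_efficiency(t1=6.0, tn=7.5))                   # 0.8
```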
C: I kind of empirically changed the number of MPI ranks, and how many hours to request, across different resolutions, and the best-fit line here is, of course, as expected, worse than a perfect weak scaling efficiency of one, with the gap getting bigger and bigger. Sorry, I forgot to mention that the x-axis is the number of MPI tasks, and the simulation cost is in terms of NERSC hours per simulated year.
C: The picture is the same if I swap the x-axis to the number of grid columns: going to higher and higher resolutions, the cost increases with a steeper slope than perfect weak scaling, as expected. That means our current community will need more and more NERSC hours in our proposals as time goes on, if we simply assume our model codes keep the same rate of increase as the models we use now.
C: That's pretty much all I have. The last slide is some challenges I already mentioned, plus my own thoughts. For Knights Landing, and it may be different on other machines, in general queues of less than three hours are much less crowded, and you can get through the queue within 15 minutes or so. My personal challenge is that for our community, one month of model time is a very convenient time scale, because we want to look at many different variables in 3D space, but I can't output those every time step or every hour.
C
So,
every
month
we
only
write
certain
statistics
like
mean
or
variance
for
each
month
and
that's
sort
of
our
convention.
So
if
it's
possible,
I
want
to
run
for
one
month
for
one
job,
but
that's
getting
more
and
more
difficult
and
if
it
doesn't
reach
you
a
month
at
the
end
of
each
job,
I
have
to
write
an
additional
file
that
keeps
track
of
those
statistics
for
all
those
expensive
cd
variables
over
hundreds
of
those
variables.
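The bookkeeping described here, carrying running statistics across job boundaries so the monthly means and variances can be completed later, can be done with a standard online-update formula (Welford's algorithm). A minimal sketch follows; the NumPy save/load file layout is a hypothetical stand-in for the model's actual restart format.

```python
import numpy as np

class RunningStats:
    """Welford online mean/variance for a field, resumable across jobs."""
    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)           # sum of squared deviations

    def update(self, field):
        """Fold in one sample (e.g. one model time step or hour)."""
        self.n += 1
        delta = field - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (field - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1)       # valid once n >= 2

    def save(self, path):                   # called when a job ends mid-month
        np.savez(path, n=self.n, mean=self.mean, m2=self.m2)

    @classmethod
    def load(cls, path):                    # called when the next job resumes
        d = np.load(path)
        s = cls(d["mean"].shape)
        s.n, s.mean, s.m2 = int(d["n"]), d["mean"], d["m2"]
        return s
```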
C: Maybe we could do better online calculation of statistics, offloading it to GPUs, but that's something we as a community have to think about, or even dimension reduction using some sort of machine learning. Also, our current numerical scheme is designed based on some sort of finite difference or similar.
C: That imposes some optimal number of ranks, because as we decompose the domain into more and more subdomains, the ratio of actual cells to halo cells becomes smaller and smaller, and that makes the communication less efficient.
C: However, as I said, GPU offloading is not so straightforward for this large community model, so we might switch to newer, emerging model versions that already have some GPU offloading using directive-based OpenACC, for some MPAS models, or SCREAM-like models, whose newer version is already written in C++ rather than Fortran so it can use Kokkos. And yet many model processes are memory bound.
C: So I don't know how much we can really push those. In the end, my experience shows that users, both developers and the people running simulations, are required to have more and more in-depth HPC knowledge, to debug, or when a job fails, to understand why the job failed. We need more and more expertise, so I really appreciate NERSC providing lots of training events.
A: Do you want to keep sharing for a moment? Thanks, that was a whole lot of really interesting stuff. Before I ask questions, there are a couple of questions already in the chat. One: can we get a copy of your slides, and can we post a copy of these slides with the meeting?
C: Yeah, maybe after a while; actually maybe by now, but I haven't got permission from the lab to distribute a copy yet. It's already submitted for approval, so once I get approval maybe I can put a copy up, or I'll send it to you. But it contains many of my personal opinions and understandings, so it's not guaranteed to represent the community.
A: Even so, I found it a really helpful overview of a lot of things. So I guess the answer for that one is: watch this space.
C: Not always. For lower resolutions I use most of the cores, like 64 of them, no hyperthreading, just purely 64 cores of the KNL, and I assign four cores for doing the, what do you call it.
C: Right, yes, thank you. But we found our experimental code does not scale well in memory; we just found out from our simulations that, if you remember, the restart file part has some redundancy in how it uses memory.
C: Specifically, each task was unnecessarily saving a global array, so it's not using memory efficiently. Once we go to higher resolutions we become more and more memory bound at the end of the simulations, and that forced me to spread simulations over an even larger number of nodes to get enough memory for the particular tasks in charge of I/O on each node. In the worst case I was only using 20-ish cores per Knights Landing node.
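A back-of-the-envelope sketch of why a replicated global array forces the node count up; all the sizes below are hypothetical, and the node memory figure is only approximate.

```python
# Why one replicated global array can dictate the node count.
NODE_MEM_GB = 90            # usable DDR per Cori KNL node (approximate)
GLOBAL_ARRAY_GB = 60        # hypothetical replicated array held by each I/O rank
LOCAL_STATE_GB = 1.5        # hypothetical per-rank share of the decomposed state

def max_ranks_per_node(io_ranks_per_node: int = 1) -> int:
    """How many MPI ranks fit per node once I/O ranks hold the global array."""
    io_mem = io_ranks_per_node * (GLOBAL_ARRAY_GB + LOCAL_STATE_GB)
    remaining = NODE_MEM_GB - io_mem
    return io_ranks_per_node + int(remaining // LOCAL_STATE_GB)

print(max_ranks_per_node())   # 20 -- far fewer than the 68 cores a KNL node offers
```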
A: Yeah, so using about a third of the available cores, because of other constraints limiting how much you can put on a single node.
C: Yes. Thanks, John. On OpenMP: like I said, this experimental code does not have OpenMP, so I don't have much experience with combined MPI and OpenMP, but the newer version does support it, so I'm going to test that. Thank you for your insight.
A: Before I ask any more, would anybody else like to ask a question or make an observation?
F: Hello. It sounds like scalability is a challenge for this existing version of the code. Do you know whether it is only memory bound, or also I/O bound?
C: I/O bound, yes. I also found that different subroutines write different kinds of output files, like I said, so again there's this challenge of going through this huge model code one by one. We do have a really nice parallel I/O library being used, and it's really doing a good job, but how different routines use that library varies widely. The regular output part is really well tested and uses the library in a good way.
C: The restart file component was not so good, but now that's taken care of. And, as I just mentioned on the last slide, the statistics we have to write when a job does not finish a whole month: that part of the output routine was also not really doing a good job of utilizing those libraries, and in my experience that was really taking a lot of time. It actually forced me to use a different model for certain applications.
C: Yes; if I stop the job at two weeks, this file keeps track of, for the average, just the running sum for each time step or each hour, and once the month is finished we get the statistics.
C: I don't have the exact answer. From the papers I learned that this particular part is I/O bound and synchronization- or communication-bound, but that study did not use the full set of components of the climate model, and my colleague did profiling of the CESM code. Sorry, I don't really remember the exact answer; I have to go back and ask my colleagues.
F: Another question: you're running on the KNL, right, Knights Landing? Have you run this on other machines and seen different scaling? What are the differences?
C: The scaling was similar, but Haswell, the Xeon, was usually about twice as fast: with the same or a similar number of nodes, or even fewer nodes, it could be twice as fast as KNL.
F: Some high-level calculations.
A: The model that you've assembled here for finding the sweet spot: there are a lot of variables in it, because there are aspects of what natural time periods the simulation breaks down nicely into, as well as the simulation's own scaling and the queue times for different parts of the machine. It looks like there's probably enough for a very publishable paper just on the performance optimization, or the throughput optimization model, here.
C: Actually, I don't know, and I haven't profiled that; indeed, I really haven't used profiling for it yet.
A: Okay, I think there's some really interesting discussion going on here, but I'm also aware that we're at the official end of the meeting time. What we might do is flick very quickly through the last couple of items, and then people who have availability can keep on chatting after this. But thank you again, Koichi and everyone, for a really interesting presentation and discussion.
A: A last couple of comments, quickly. Coming up we've got a few topics, looking at NERSC docs and also a preview of the NERSC annual user survey, but we're also very interested in more talks like what Koichi just gave, an overview of interesting work that our users are doing. I think we can see that there is so much interesting stuff and interesting discussion to come out of that.
A: So if you've got something that you'd like to present, drop me a line, either on Slack or via a consulting ticket. A quick look at last month's numbers: Cori had a couple of unscheduled outages, for a file system issue and a cabinet power outage, and also the scheduled outage.
A: This is how availability looks over the last three months; it really doesn't change very much over that time. The scheduled availability, which is what you get when you remove the scheduled downtime, has been up over 99 percent, and the overall availability, which does include the downtime, is still up at around 97 and a half percent. The capability metric is the fraction of the machine being used for large jobs, which is sort of a proxy metric for work
A
That's
you
know,
basically
very
difficult
or
impossible
to
do
anywhere
else,
and
we
have
a
target
of
sort
of
25
of
the
machine
for
that
and
you
can
see
over
the
last
year.
You
know,
for
most
of
the
year
we've
been
pretty
comfortably
above
that.
A: The bottom chart here is the currently active open user tickets, and you can see it sits pretty flat, which basically means we're keeping up with the inflow: tickets are coming in at a pretty good rate, but they're also being resolved at a pretty good rate.
C: Changing the stripe count of the file, whatever it was.

C: But to me, I was satisfied so far just by setting the Lustre striping. Sorry, for the restart file the temporary solution was again to just distribute over a larger number of nodes, to keep enough memory to write and read the restart file, following the rank-zero I/O approach.
C: Well, I don't remember exactly. There is one I/O rank on each node, so it was very inefficient; in the worst case I used only between 10 and 20 cores per node. That part is taken care of in the latest version of the model code by just not saving global arrays when we calculate or accumulate certain statistics. It's a global array, and some of the arrays being saved, like the coordinate variables, are 3D, and not only those.
C
We
have
this
grid
structure
of
this
can
be
destruction,
so
it
have
to
be
saving
coordinate
of
the
good
sales
center,
but
also
grid
cell
vertices
and
with
their
lines.
So
there's
so
many
global
coordinate
variables
that
was
saved
efficiently
in
some
part
of
the
modern
course,
and
then
that
is
the
root
cause
of
that
is
because
the
model
developed
10
years
ago
or
20
years
ago,
this
is
has
been
very
generations
ago.
C: I think so, but I hadn't thought about why, for the striping.
F: You can use some I/O profiling tools.
A: Darshan uses the MPI profiling interface, and it runs by default. You can remove it; there are some circumstances where you don't want it. It will probably add a very small amount of overhead, but more significantly, because it uses the MPI profiling interface, it's not necessarily compatible with other profiling tools, so while you're doing performance analysis it might be difficult to use. But we have documentation on it in docs.nersc.gov.
A: You can discover some surprising things. In one ticket a little while ago, working with one of our users, we found by looking at the Darshan output that, even though the job was really just reading a bunch of text data, because the Fortran defaults would have read it in fairly small chunks, a huge amount of extra data was being transferred to and from the nodes during reading. So there was a lot of overhead that wasn't necessary, and that profile helped us find that and improve it. That might be something interesting to look at for further tweaking of the I/O aspects.
A: Something that struck me while you were speaking, Koichi, is that there are so many different variables going into this model to work out the sweet spot. Were there any tools that you either identified as being useful, or that would be useful if they existed, that would make this easier for a different project trying to get a similar kind of information to what you found?
A: In the sense that you were looking at things like the queue wait time based on the number of nodes and the length of the job, and then you had an extra constraint that the job works well when you divide it into month-long chunks, whereas you've got more I/O and more overheads if it's in two-week chunks, so that adds an extra constraint, along with the scaling of the model itself.
A: There are sort of a lot of factors that go into what your ideal throughput per year is. I guess what I'm asking is: do you have any tips for somebody trying to do something similar with a completely different model? Maybe they work on molecular dynamics or in a different field, but they also want to find a good optimization.
C: Both speed and, from our discussion, efficient I/O, so that we have the freedom to choose the hours and the nodes for an acceptable throughput. In my case, with an experimental, big code and a lack of expertise in our project (our project members are all scientists; we don't have any dedicated software engineers), optimizing the model was a challenge, and then later we found this memory scaling issue, so we only had the freedom to change those variables available when we submit the job.
C: I thought so, but I haven't really quantified that yet. Most of the users of those kinds of model codes do not have the time, support, funding, or expertise to optimize the model code, students especially.
F: I have another simple question: every time you run, do you run the complete run, or do you do the stop-and-resume type of thing?
F: Because I think Cori has something for that; what's it called?
C: What's really painful for me is when I run a simulation for maybe eight hours: the job gets through the queue, finishes 29 days after spending six hours on 100 nodes, and then there's a node failure, or the model just crashes from, I don't know, floating-point instability or something. Then I have to go back and work out what the failure was, which takes time.
F: So the main issue is not exactly node failures, but rather the long queueing, right? So even if the machine is fragmented, you can still use it.
A: Yeah, you don't want your requested time to run out early, but you don't necessarily want to overshoot it either. Variable-time jobs, via the --time-min flag, can be interesting to experiment with. That allows you to set a maximum walltime that's significantly higher, while time-min becomes what the scheduler uses to find a slot: it will find the earliest slot that's at least as big as time-min, and you may get a bigger slot than that.
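A minimal sketch of what a variable-time request might look like in a generated job script; the walltime values, node count, and paths are illustrative assumptions.

```python
"""Sketch: a variable-time job request. Slurm may start the job in any free
slot between --time-min and --time."""
JOB_SCRIPT = """#!/bin/bash
#SBATCH --qos=regular
#SBATCH --constraint=knl
#SBATCH --nodes=140
#SBATCH --time=10:00:00
#SBATCH --time-min=04:00:00

# With --time-min set, the scheduler looks for the earliest slot of at
# least 4 hours (enough, say, to reach the next monthly restart write),
# while still allowing up to 10 hours if a larger slot is available.
srun ./mpas_model
"""
with open("flex_job.sh", "w") as f:
    f.write(JOB_SCRIPT)
```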
C: Right, so maybe I can use that variable time: for example, set the minimum time to what's needed to finish one month, even though I'm aiming for a two-month job. Then at the end of the first month at least I write the restart file, and even if the job ends in the middle of the second month, that's still a gain for me. But is there any way for the model code to communicate with the batch system, so that the model can, for example, see: "Oh, we've only finished 20 days, but of the requested hours only half an hour is left, so maybe it's better to write a restart now"?
A: It can be done in the script. From within the script you can run the squeue command to see how much time is left. It would take a little bit of coding and scripting, but in principle you could, at the end of a checkpoint, look to see how much time is left and decide whether to continue or to stop the job at that point.
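A minimal sketch of that check, assuming it runs from inside the batch job; squeue's %L output format gives the job's remaining time, and the per-month cost threshold here is a hypothetical placeholder.

```python
import os
import subprocess

def seconds_left() -> int:
    """Remaining walltime of the current Slurm job, queried via squeue."""
    out = subprocess.run(
        ["squeue", "-h", "-j", os.environ["SLURM_JOB_ID"], "-o", "%L"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # %L looks like [days-]HH:MM:SS (or MM:SS); parse the pieces.
    days, _, clock = out.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)
    h, m, s = parts
    return (int(days) if days else 0) * 86400 + h * 3600 + m * 60 + s

# After writing a restart: stop cleanly unless another month's run fits.
if seconds_left() < 4 * 3600:   # hypothetical cost of one more model month
    print("Not enough time for another month; exiting after this restart.")
```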
C: But that's at a higher level, not inside the model code, right? It's the higher-level script, obviously.
F: Yeah, there's some thinking that needs to be done, but I think that may provide a generally useful approach to solve that long queue-waiting problem.
A: Yeah, a custom solution, but the information is available.
C: Yeah, what we could do is maybe, when the time left becomes like half an hour, stop the model integration and then read the namelist again, or get some input text file or XML file, parse the XML file again and update the restart file settings, and then, if it's time, write the restart, or just make the model write the restart at that time. Let me think, and maybe I can talk with a software engineer as well.
A: Coming from the other direction, I guess, depending on how well the model responds: if, after it's written a restart file, it suddenly hits the timeout, then you only lose a few minutes. So you could put sort of a watcher in the script that looks for restart files, and when a new restart file appears, it checks the amount of time left and whether there's going to be enough time to finish the next one.
A
So
I
guess
the
other
consideration
that
you
would
need
there
would
be
presumably
the
time
for
a
a
new.
Would
you
yeah
continuing
on
is
not
the
same
as
the
initial
time
like
I
imagine.
The
time
from
the
start
of
the
job
to
the
end
between
the
first
restart
file
is
written
is
probably
a
little
bit
longer
than
from
between
when
the
first
real
style
is,
and
the
second
restart
file.
F: Yeah. I don't know whether you have time; otherwise I potentially can work with you on that, but we'll see.
F: That's something we also work on, but it depends on how this can be set up. Anyway, through discussion we can help each other.
A
It
sounds
like
it's
worth
if
you
get
touching
base
offline
and
hopefully
you've
got
enough
information,
I
guess
there's
a
there's.
A
direct
message.
Chat
option
in
in
zoom
swap
contact
details.
A: Thank you for this presentation and work, and for the discussion. There's some really interesting stuff here: some really interesting results and findings, and an account of what your experience has been using the system.