National Energy Research Scientific Computing Center (NERSC) New User Training 2018, 20 Apr 2018

From YouTube: New User Training: 01 Introduction to NERSC

A

All right, so welcome, everybody. I'm happy to welcome you here. Looks like it's a little bit of a sparse turnout this morning, so rainy. I appreciate everyone coming in the rain. So my name is Rebecca Hartman-Baker. I am the leader of the User Engagement Group, and so today I'm gonna talk with you. I'll just give you an overview of NERSC and what we do and sort of how we interact with users and how we expect that users would interact with us.

A

So here's my agenda. I'm gonna give you a little introduction to NERSC, who we are and why we're here, talk about the hardware that we have, the software, how to interact with NERSC, and then an overview of user responsibilities and expectations. Okay, so NERSC stands for the National Energy Research Scientific Computing Center. So that's why we call ourselves NERSC instead of the whole name, right? Too long. So it was established in 1974. It's the first unclassified supercomputer center.

A

Now the original mission of NERSC, and we weren't called NERSC back then, was to enable computational science to study magnetically controlled plasma experimentation. That's why we had a NERSC. Today, our mission is to accelerate scientific discovery at the DOE Office of Science through high-performance computing and extreme data analysis. NERSC is a national user facility, so we have users from all around the country and actually all around the world.

A

So we have more than 7,000 users and 800 projects, and our users use about 600 different codes. We have hundreds of users who are active daily on our machines. Our allocations are primarily controlled by the Department of Energy, so 80% goes to what we call our ERCAP, the Energy Research Computing Allocations Program. That sounds convincing, at least, if it's not right. And so we give out, well, they give out about, you know, ten-thousand to ten-million-hour awards. Actually, we have some hundred-million-hour awards.

A

Now users submit a proposal, and then DOE program managers choose from those proposals, so the proposals go to a specific DOE program manager. So if you're, you know, in nuclear physics, let's say, it would go to the nuclear physics program managers. So eighty percent goes to the ERCAP awards.

A

Ten percent goes to the DOE ASCR Leadership Computing Challenge, which we also call ALCC, which is an acronym within an acronym. So ASCR, I never can remember what it stands for, but it's like Advanced Scientific Computing Research, I think. And then the remaining 10% is in our NERSC reserve, and we use that for overhead. If we need to refund somebody's job that didn't run, we use our overhead for that. We use it for education and training.

A

We use it for Director's Awards. Like right now we have this scale science awards, so that comes through our reserve. Okay. So from the DOE point of view, this is how... I was right: Advanced Scientific Computing Research, ASCR's domain. Okay, so this is kind of a pie chart of how our allocations were distributed, based on last year's distribution of hours. So you can see they go to a wide variety of different areas.

A

So probably some of the biggest ones are BER and BES. So Biological and Environmental Research is BER, and BES is Basic Energy Sciences, and so those are kind of broken down into several different sub-areas in this pie chart. So you can see we have users from a wide variety of different scientific disciplines that are using our machines.

A

I mentioned before that we have over 600 codes that run. If you look here, the top code here is VASP, and VASP accounts for more than 10% of all of the hours that are run on our machines, but the top 10 codes make up half of our workload, and then the top 25 codes make up two-thirds of our workload. So, well, we do have a lot of people from a lot of different areas using our machines, and we have more than 600 codes that are being used.

A

Really, most of our usage is in these top 25 codes.

A

Okay, so we are very focused on science. NERSC users actually produce and publish more than any other center in the world. We have about 2,000 publications per year that cite NERSC. So yeah, we actually have probably more publications than any other center in the world, as far as we know. So we love our users and we want to help you all, but we need you to help us to help you.

A

So if you don't acknowledge NERSC in your publications, then nobody will know how useful NERSC is, and then we won't get as much money, and then we won't be able to provide services to you anymore. So be sure to acknowledge us in your publications. Also, we love user success stories. So if you have any user success stories, like really cool publications about your super cool science, then please send us your links to your publications. We might even interview you. We could make an article about it. It'd be super exciting.

A

It would be a win-win for everyone. So please, when you use NERSC, acknowledge us and then give us your success stories, because we want to hear about it. Okay, so we have a lot of systems here at NERSC. Our flagship system is Cori, so Cori is currently in the top ten of the most powerful supercomputers in the world.

A

I can't remember where we are right now exactly, but Cori has two different types of nodes, so it's got about 2,000 Haswell nodes and about 9,300 KNL nodes, and we'll talk more about these later. Edison is our other big machine. Edison has 5,576 nodes, and so both of these machines have scratch systems that they are attached to.

A

Edison is also attached to Cori's scratch system, and then Cori, of course, has a burst buffer, which you'll learn more about this afternoon. Very neat resource. So, in addition to having our machines there, we've got some clusters. We've got a cluster called Genepool that houses our PDSF cluster, our hei resources in there. We've got other resources like visualization and analytics resources. We've got data transfer nodes, we've got science gateways. We're all connected into ESnet, and then also, I should mention, we have our global file systems, very powerful, very large capacity, in particular.

A

Something remarkable is our HPSS system: more than 50 petabytes stored on it, 20 years of community data. And so we'll learn more in detail about all of these things as the day progresses. I just wanted to show you sort of a map of how things are all connected here.

A

So when it comes to our HPC systems, Edison is great. Edison is a large and stable machine. It's not the new hotness anymore, so it has shorter queues than Cori does, and it also has a lower charge factor. We'll talk more about what that means later, but basically it means you can get more CPU hours for cheaper if you go on Edison than if you go on Cori. Now Cori, like I mentioned before, has two different types of nodes. It's got the Haswell nodes and it's got the KNL nodes.

A

So Haswell nodes, these are ideal for throughput. So these are really nodes that we primarily hope that people are using to analyze data or other purposes like that.

A

We allow single-core jobs on those Haswell nodes, and we have longer wall time limits for some smaller jobs. Now the KNL nodes, those are really the new hotness. They are the best that we have. They're really good for performance. The issue here is that the KNL architecture, and I think we'll learn more about this later, the KNL architecture has a lot of very small, low-powered cores, but a lot of them, all right? It's got 68 cores per node versus these others that have maybe 32 cores per node.

A

So if you can exploit a many-core architecture like that, then these Cori KNL nodes are perfect for you, and this is where we like people to run all the really large jobs, because remember, we have more than nine thousand nodes of this. So if you can run across nine thousand nodes, that's awesome, that's where we want you to be. Okay, so I mentioned before we've got some pretty awesome file systems. So we've got different types of file systems at NERSC: we've got global file systems, local file systems, and we have a long-term storage system.

A

So we've got a home file system, and it's mounted on all of our machines. So if you log into Edison and you log into Cori, you have access to your home directory on both those machines.

A

It is not tuned to perform well in parallel jobs, so we encourage people to not run your parallel jobs from your home directory. You have a quota on that home directory, and we can't change it. We just won't change it, because the purpose of home is primarily for storing some data, such as source code or shell scripts, okay, or maybe some binaries. But that's about all that we really want you to use home for, because we've got other file systems that are way better for other purposes that you would be using.

A

So, in addition, we've got a project space for everyone. Like home, it's also mounted on all of our platforms. It has medium performance in parallel jobs. We can change the quota there. We can extend your quota. It has a snapshot backup, so it's got a seven-day history, just like home did; I forgot to mention that. So what that means is that if you accidentally delete a file, you can go back into the snapshot and you can retrieve that file within seven days. And the project system,

A

We really want you to use that for sharing your data within your research group, right, within your project.

A

Okay, then we've got some local file systems. So we've got our scratch file systems, so these are large temporary storage systems. There's a local scratch on Edison, and then Cori scratch, which would normally just be local to Cori, we've also mounted on Edison. So you can access Cori's scratch from either machine, but you can only access the Edison scratch file systems from Edison. These are optimized for read/write operations, but not for storage.

A

Excuse me. We do not back up the scratch systems, and in fact we have a purge policy. So if you leave your data on for 12 weeks on one system (it's eight weeks on another), if you leave it there without doing anything to it, without accessing it, without writing to it, we're gonna delete it. But scratch is really perfect for staging your data and performing your computations.

A

That's where we want you to do those things, and then after you're done, you need to clean up after yourself and put the output into a more appropriate storage system. So another one is the burst buffer. So burst buffer is sort of a temporary per-job storage, and it's really a high-performance file system made out of SSDs. So it is really, really, really fast for read/write types of operations. It's only available on Cori, that's one of the unique features of Cori, and it is perfect for getting really good performance from I/O-constrained codes.

A

So if your code does a lot of I/O, reading and writing, then you should consider the burst buffer. It would be really good for your performance, and this afternoon we'll have somebody talking more about the burst buffer.

A

Then, finally, we've got our HPSS system, so that stands for High Performance Storage System. It is archival storage for infrequently accessed data, so it's sort of a hierarchical storage system. So, on the front, we have these high-performance disk arrays, and that's kind of where your data goes when it first gets ingested. But then, after a while, if it hasn't been accessed, it goes into the back end, which is a bunch of tape drives. Now y'all may be saying: tape? Are you kidding me, Rebecca? Like,

A

Why would you use tape? Tape is actually really great. It's really low cost. It doesn't require any electricity or power to maintain. It just needs to be in a safe environment for tapes, and so that's why we use it. But for more information about HPSS, again, we'll have later presentations, you'll get to learn a lot more about it. Okay, so using NERSC file systems. This is an analogy that I like to use with people. So computing, it's kind of like baking, right? Like, you have these baking ingredients. That's your input.

A

You have this output, which is like a cake. Let's say we're gonna bake a cake. Okay, and the computer is kind of like the oven, right? That's where all the good stuff happens, right? Where you're actually taking these strange ingredients, putting them all together, putting them in the oven, and out comes a delicious cake. Okay, so I would liken the home and project systems to your pantry or your fridge, right? That's where you store your ingredients for your baking. HPSS is like your freezer.

A

That's where you have, like, the frozen blueberries or something that you don't use very often. Sometimes you need them, so you would bring them out. And then scratch is your kitchen countertop. Okay, that's where you're gonna stage all of your ingredients, you're gonna put them all together, and then you're gonna bake them in the oven, and then you're gonna stage them out onto the countertop again, right?

A

All right, I already said exactly this. So when you're baking, you take all of your ingredients out of the pantry, right, and you put them on the countertop. You're like, okay, I need my flour, I need my baking soda, you know, I need my buttermilk, whatever. I put them all out on the counter, all right, and then I dig into them and I mix them up in my mixing bowl or whatever. Then I put it in my cake pan, right? So it's the same thing when you're doing your computations, right?

A

So you've got your data. You need all of this data in order to learn what you're gonna learn from your computations. You put it all in there, you're all ready, you're all set for when it finally runs. Okay. So after baking, you really should clean up after yourself. So in this case, instead of it being your own kitchen, where you can attract all the roaches you want and nobody cares, right, here,

A

It's like it's a public kitchen, and you need to clean up after yourself, because we only have a finite amount of counter space, and so, you know, somebody else is gonna need to use that space. So it's okay to let your cake cool on the kitchen counter, but you need to leave the space clean for the next user. Okay.
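The countertop analogy above (stage ingredients onto scratch, bake, move the cake somewhere durable, wipe the counter) can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the `*.out` pattern, and the directory layout are hypothetical and do not come from the talk or from any NERSC tool.

```python
import shutil
from pathlib import Path


def stage_in(inputs, scratch_dir):
    """Copy input files (the 'ingredients') onto scratch (the 'countertop')."""
    scratch_dir = Path(scratch_dir)
    scratch_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the copied file inside scratch_dir.
    return [Path(shutil.copy2(src, scratch_dir)) for src in inputs]


def stage_out_and_clean(scratch_dir, results_dir, pattern="*.out"):
    """Move results (the 'cake') to longer-term storage, then clear the run directory."""
    scratch_dir = Path(scratch_dir)
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for out in sorted(scratch_dir.glob(pattern)):
        # Hypothetical convention: anything matching `pattern` is worth keeping.
        saved.append(Path(shutil.move(str(out), str(results_dir / out.name))))
    # Everything left in the run directory is treated as disposable.
    shutil.rmtree(scratch_dir)
    return saved
```

In practice the results directory would be somewhere like project space or an archive, and the computation itself would run between the two calls.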

A

So, after a while, if you don't clean up we'll clean up, but we're not gonna clean up in the way that you like, because what we're gonna do is we're just gonna get the trash can and we're just gonna dump everything into it. And that includes your cake. Okay, so don't make that mistake. Don't leave your output on scratch for 12 weeks and expect that it's gonna still be there when you come back because it won't because we'll get rid of it.
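To see which files would be at risk under a rule like the one described, a rough check could look at each file's last access and last modification time. This is a sketch under assumptions: the 12-week default is the figure quoted in the talk, the function name is hypothetical, and the real purge is run by the center using file system metadata, which may not match what `stat` reports on a given mount.

```python
import os
import time
from pathlib import Path

PURGE_WEEKS = 12  # figure quoted in the talk; the actual policy may differ per system


def purge_candidates(root, weeks=PURGE_WEEKS, now=None):
    """Return files under `root` neither read nor written in the last `weeks` weeks."""
    now = time.time() if now is None else now
    cutoff = now - weeks * 7 * 24 * 3600
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            info = os.lstat(path)  # lstat so broken symlinks do not raise
            # Reading and writing both count as access, so take the later timestamp.
            last_used = max(info.st_atime, info.st_mtime)
            if last_used < cutoff:
                stale.append(path)
    return sorted(stale)
```

Anything this reports is a candidate for moving to project space or the archive before the purge gets to it.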

A

Okay, so let's talk about software. So both of our machines are Cray supercomputers, and their OS is a version of Linux that is optimized by Cray. Now, on each machine, we provide compilers, three different compiling environments, and we'll learn more about these things in more detail as the day progresses. We have many libraries that are available. Some of them are provided by Cray.

A

Others are provided by us at NERSC, and then we also, at NERSC, provide a lot of applications. So we compile and support many different software packages for our users, and, like I said, there will be more details on this in the later presentations. This is sort of an overview. So one big thing that we do is we provide a lot of chemistry and materials applications.

A

So if you recall, I mentioned that VASP is our number one code, that uses more than 10% of the time. Actually, I believe last year it may have been 15% of all of the CPU cycles that were consumed were VASP. So because of that, it's pretty important for us to provide an optimized version of VASP, which is what we do. But then there's all these other ones too,

A

That chemists use a lot. Okay, so switching gears again, we're gonna talk about how to interact with us at NERSC. So these are three primary ways in which you will interact with us. So the first thing is you'll interact with NERSC consulting, and possibly with the NERSC operations folks, and then, hopefully, with the NERSC User Group. So here's our consulting team. We are composed of people from three different groups: User Engagement, that's my group, that's me at the top there, these little tiny pictures; the Application Performance Group; and the Data Science Engagement Group.

A

So we all comprise this consulting team. We all spend time talking to users, answering your tickets, things like that. So in 2017 we handled 7,400 tickets from 2,342 unique users. Okay. So it's a lot of work for us, a lot of different areas that people ask us about, so primarily software and running jobs. Those are our two big ones.

A

So here's our level of service. So we will reply to you within four business hours. If you send in a ticket, within four business hours we'll reply. We'll help you resolve your problem, and we'll keep you apprised of progress.

A

We will attempt to accommodate needs that don't fit within our operating structure. For example, sometimes people require a reservation, so we'll try to work that in if we can, and we are always happy to get user feedback and constructive criticism.

A

So the only thing we ask in exchange, really, is, again, help us to help you. Remember, I just said we had 7,400 tickets in a year, so provide us with specifics. What is the problem that you're having? What machine is it on? When did it happen? What modules were loaded? How did you try to fix it or work around it? So if you just send this message and you say,

A

My code died. Okay, well, what code? What happened when it died? How did you compile it? You know, what error messages did you get? You know, right, like, we'll probably reply and we'll ask you all those questions, but it sure would be easier if the first time, when you send us a message, you say: I was running VASP, and I submitted with this particular input file, and here's my batch script, and I got the following error, and it was on Cori, and it was job number such-and-such, right?

A

If you give us all that information at the front, we can help you a lot better than if you're just saying my job died.
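One way to make a habit of sending those specifics is to fill in a small template before opening a ticket. The sketch below is purely illustrative: the class, its field names, and the sample values are assumptions made for this example and are not a NERSC form or API.

```python
from dataclasses import dataclass, field


@dataclass
class TicketReport:
    """Hypothetical checklist of the details the consultants ask for."""
    machine: str            # e.g. "cori" or "edison"
    code: str               # which application was running
    job_id: str             # batch job number
    error_message: str      # exact error text, copied rather than paraphrased
    modules: list = field(default_factory=list)  # modules loaded at the time
    batch_script: str = ""  # path to the submission script
    tried: str = "nothing yet"                   # fixes or workarounds attempted

    def render(self):
        """Format the report as plain text suitable for pasting into a ticket."""
        lines = [
            f"Machine: {self.machine}",
            f"Code: {self.code}",
            f"Job ID: {self.job_id}",
            f"Modules loaded: {', '.join(self.modules) or 'none listed'}",
            f"Batch script: {self.batch_script or 'not attached'}",
            f"Error: {self.error_message}",
            f"Already tried: {self.tried}",
        ]
        return "\n".join(lines)
```

A report rendered this way answers most of the follow-up questions before they are asked, which is the point being made in the talk.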

A

Okay, so we've also got operations staff who are on site all the time, 24/7, 365 or 366 days per year, and they supervise the operation of the machine room. They make sure that bad things don't happen to these machines, which are worth tens of millions of dollars. So I mean, they're even there on Christmas, they're there on Thanksgiving, they're there on holidays, they're there at 2:00 a.m.,

A

all the time. So our operations folks, they answer the phone, and they will forward it to us consultants during business hours, if applicable. So you might talk to them if you ever call, although I see a lot of young faces around here, and I think young people don't like calling, which, that's good, I feel the same way. So operations, they know exactly what's going on with the machines, though, so they can actually be very helpful with some tasks. They can help you reset your password. They can help you kill some jobs.

A

They can make limited changes to your reservation, if you have a reservation that's running. So operations, don't discount them and just say, oh, I just must speak to a consultant, because operations can really help you out there. They're a bunch of smart folks down there. I'm always impressed whenever I talk with them. Okay.

A

Okay, so then we've got the NERSC User Group. So this is a community of NERSC users. As I said, we have more than 7,000 users. They're a great source of advice and feedback for us at NERSC, so we ask them their opinions and they tell us, and that's great. I mean, there's nothing better than being able to just opine and not having to actually do anything about it, right? And that's what they do.

A

So we've got an executive committee, that's three representatives from each office. So remember the DOE offices that I told you about, that use the machine? So we've got three representatives from each office, and then we also have three members at large. And then we also have monthly teleconferences hosted by NERSC, usually on the third Thursday of the month, 11:00 a.m. to noon. So we already had it last Thursday. It was the third Thursday of the month. Okay, so you, as a user:

A

What do we expect out of you? A couple of things. So, first of all, be kind to your neighbor users, right? Don't abuse the shared resources. So, for example, everybody logs into the login node, so don't just go on the login node and take all the resources and run something that's really computationally or memory intensive on the login nodes. That's not nice. Okay, use your allocation smartly, or wisely.

A

We have limited allocations, right? I mean, there's limited time and number of nodes, right? So if you blow through your whole allocation, there's not a lot we can do for you, so use it smartly and don't abuse it. Pick the right resource for your job and your data. So small jobs are really great on Cori Haswell. They're not so good, well, they're okay on Edison at this point. They're really not so good on the Cori KNL nodes; that's really for the big jobs. So that would be my recommendation.
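The earlier remark that Edison has a lower charge factor than Cori can be made concrete with a toy calculation. Everything numeric here is hypothetical: the talk gives no charge factor values, and the formula (nodes times wall-clock hours times a per-machine factor) is an assumed simplification that ignores any queue or priority multipliers.

```python
# Hypothetical charge factors for illustration only; the talk gives no numbers.
CHARGE_FACTOR = {
    "edison": 1.0,
    "cori_haswell": 1.5,
}


def charged_hours(machine, nodes, wall_hours):
    """Assumed model: hours charged = nodes * wall-clock hours * machine charge factor."""
    return nodes * wall_hours * CHARGE_FACTOR[machine]


def jobs_affordable(allocation_hours, machine, nodes, wall_hours):
    """How many identical jobs fit in an allocation under the assumed model."""
    return int(allocation_hours // charged_hours(machine, nodes, wall_hours))
```

Under these made-up factors, the same 10-node, 2-hour job costs less of the allocation on the machine with the lower factor, so more such jobs fit into a fixed award.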

A

Another thing is back your stuff up. So remember, if you leave your cake on scratch for 12 weeks, we'll throw it away, so don't do that. Acknowledge NERSC in your papers so that we can continue to get funding, and be sure to pay attention to security. So don't share your account with other people. Please don't do that. We'll have to disable your account if we find out that you're doing that. So don't do that. It's a bad plan. All right.

A

So thank you, thanks for listening to me, and welcome to NERSC. I'm really glad you're here. Do we have any questions?

B

A

Can you find the slides? We will put them on the training web page.

B

Mostly seen recording.

B

A

Right, did you hear that? So is that after the training days?

B

A

B

So committed, but the videos will need to be post processed.

B

But you can check the slides later today and tomorrow. Okay.

A

Thank you. Any other questions? Oh.

A

Yes, okay, so if you came in, if you didn't come in early.

B

A

We've got the name tags here and we'd like you to sign

B

In and then there's.

A

Also some refreshments, if you like, yeah.

B

You can do.

A

That break time, where you can rest up right now or something, but that's about it. Okay.

B

Any other questions.

B

A

That's a really good question. So the question was: what did it mean when I said if you haven't accessed your file in 12 weeks? So what this means is, if you read the file, that counts, you've accessed it. If you write to the file, that counts that you've accessed it. What other operations can you do?

B

A

It measures the last access.

B

A

Yes, we monitor the metadata, yeah, and that means

B

You might see empty trees there. For the files we're purging, they put a dot file in your scratch directory. It's called purged, with a date, so you go to that date and you see all your files listed that were purged on that day. Great.

A

Easter is considering as an expiration. You know, I didn't want to talk about touch. Technically, yes, but if we catch you touching, you will bathe you good. Yeah, just wanted to make that really clear. Yeah, that's why I didn't bring that one up. Mm-hmm.

A

Okay, any other last-minute questions here. Otherwise, we'll move on to something way more interesting.