National Energy Research Scientific Computing Center (NERSC) New User Training, June 2020, 16 Jun 2020

From YouTube: 07 Data Ecosystem Overview

Description

Part of the NERSC New User Training on June 16, 2020.

Please see https://www.nersc.gov/users/training/events/new-user-training-june-16-2020/ for the training day agenda and presentation slides.

A

Welcome back to the new user training. This is the afternoon session, and we're going to talk about the data science ecosystem at NERSC. Remember that if you have questions, you should post them to the Google Doc, and I'll add that to the chat here, if I can, in case people need that link.

A

While the talk is going on, feel free to post questions there. If we can answer them while the speaker is going, we'll do that; otherwise, the speaker will pick up the questions at the end. So the first talk that we're going to have is an overview of the data ecosystem at NERSC by Wahid Bhimji from the Data and Analytics Services group. So take it away, Wahid.

B

Hi, thanks Ronnie. Yeah, I'm just going to be giving a brief overview to start. So this is just the schedule for this afternoon. So I'm talking... We don't hear you? Oh. Well, we can, but...

B

So I'll carry on, and then someone can just tell Robin, because I assume it's him that's got the problem.

A

Still don't hear you, yeah.

B

But it's your problem. Yeah, OK, I'll keep talking. So actually, this schedule is wrong, because I'm talking for 10 minutes, and then Bill, I think, is talking for 20 minutes about workflows. Then I'll talk a bit more about file systems and, in particular, the burst buffer, which is a slightly different sort of file system. Then Lisa will talk about ways that we can use the file systems in terms of transferring data, and then Quincey will talk about HDF5.

B

After the break, we have more analytics topics: first of all Python and Jupyter, then Shifter, which is our container technology that I think you briefly heard about earlier, and then finally the interesting topic of deep learning.

B

OK, so here's just an overview of the whole of what comes within the idea of data at NERSC. Some of these technologies we'll be talking about in more detail in later talks, and some we won't be, but I'll just go through what they are. They've been put into categories. First, accessing data: either transferring it in, for which I recommend things like Globus, or interfacing with NERSC, where we have Jupyter, which you probably heard a bit about in the morning and will hear more about later.

B

This funny symbol here is NX, I believe, which is the NoMachine way of accessing NERSC, which you mostly heard about this morning. We support web portals, which we'll touch on here: I'll briefly mention an easy way of putting things on the web, but there's much more you can do, and we have people who can help with that as well. Many people use, for example, the Django framework.

B

This is an API we have. There's a refresh of this coming, called the Superfacility API, which I won't be talking about either, but there are ways to do that if you need. Then workflows: Bill is going to talk about this in lots of detail. There are some tools here, like TaskFarmer and FireWorks, that we've been supporting for some time, and some new ones that are coming, which we're happy to help you with. Then, in terms of data management:

B

These things on the left are file formats. HDF5 is a very common file format, widely used across different groups; Quincey is super active in it and will be talking later. NetCDF and ROOT are heavily used by the communities that use them, so we do also provide some support for those, but if you're not in those communities, you probably won't be interested in them. Then, in terms of databases, we don't have a talk on that, but you can look up our documentation.

B

The thing to know is that there is a form there if you want a database. We do host databases: both MongoDB, for sort of larger data sets, and MySQL and Postgres, the traditional SQL databases. Then the data analytics space is quite broad, and you'll hear a little bit about this in the Python talk and also in the deep learning talk. There are these things on the right, the deep learning frameworks, but we also support other tools as well, and Spark for distributed analytics.

B

Then, in the area of visualization, which is important also for traditional HPC applications, we support VisIt and ParaView. OK, that's software, but then there are also features, if you like, particularly on Cori. These are a bunch of things, some of them work we've done on integrating with our high-performance computing environment, which is really about running big, very compute-intensive jobs.

B

So, you know, the most basic thing here is having quite beefy login nodes, and many of them.

B

You're not supposed to run applications on these, but you can run short things interactively on there, and it's good to have a few nodes for that purpose.

B

If you have long-running things that you need, for example, to control workflows, then we have some separate login-like nodes that can be used. Again, there's a form on the web for this; you just have to say what your use case is and what your application is.

B

If you have bigger memory requirements, we also have some big-memory nodes; you just request them. Then, in terms of queues: maybe this box should go here, actually.

B

We also have other queues that are maybe suited to experimental data analysis jobs, such as the shared-node queue, if you don't need a whole-node application; that can be for serial jobs or any fraction of a node. Then there's a separate queue for transfers, which can be queued up as well.

B

Then we see that some users need responsiveness. We have queues, and it can take some time for a job to get through those, depending on its size, but some have needs for real-time things, so things that actually run when you submit them. This obviously requires dedicated resources on our part, so again, it's by request via a form, but we can support it. Another useful feature is the interactive queue. Here you can request up to 64 nodes per project for up to four hours, and you can usually get quite a quick response on that. Then, once your job is running, we have containers, which you'll hear about later. I'll talk about the burst buffer in more detail later, so I won't go through that.
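As a rough illustration of the interactive queue just described, the sketch below builds (but does not run) a Slurm `salloc` command line and checks the request against the limits first. The four-hour cap comes from the talk; the node cap, the `interactive` QOS name, the `haswell` constraint, and the helper name `interactive_salloc` are assumptions for illustration and should be checked against the current NERSC documentation.

```python
import shlex

MAX_HOURS = 4    # time limit mentioned in the talk
MAX_NODES = 64   # assumed node cap; verify against current NERSC docs


def interactive_salloc(nodes: int, hours: float, constraint: str = "haswell") -> str:
    """Return an salloc command line for an interactive session (hypothetical helper)."""
    if not 1 <= nodes <= MAX_NODES:
        raise ValueError(f"nodes must be between 1 and {MAX_NODES}")
    if not 0 < hours <= MAX_HOURS:
        raise ValueError(f"hours must be greater than 0 and at most {MAX_HOURS}")

    # Slurm accepts wall time as HH:MM:SS.
    total_minutes = round(hours * 60)
    walltime = f"{total_minutes // 60:02d}:{total_minutes % 60:02d}:00"

    args = [
        "salloc",
        f"--nodes={nodes}",
        "--qos=interactive",            # assumed QOS name
        f"--time={walltime}",
        f"--constraint={constraint}",   # assumed node type
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(interactive_salloc(nodes=2, hours=1.5))
```

Validating before submitting is only a convenience; the scheduler itself enforces the real limits.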

B

But another area that we've worked on in Cori is getting external data sets directly into compute nodes. So this just shows that, with all these things we have, we've learned from our experience on Cori and we're building on that for Perlmutter. For example, for I/O, we have this burst buffer, which you'll see is a fast system but requires management of your data: you have to move things into the buffer and out again. On Perlmutter, this data management will be easier.

B

Then analytics: this ranges from experiment software, which can be huge stacks that are difficult to manage, where Shifter can help, to things where you just want high-performance libraries on the system, and those again will be better on Perlmutter.
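The data management that the burst buffer requires, moving things in and out again, follows a general stage-in, compute, stage-out pattern. The sketch below illustrates that pattern with ordinary directories and the Python standard library. It is not the burst buffer's own interface: the function names `run_with_staging` and `uppercase_copy` and the directory layout are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, List


def run_with_staging(
    inputs: Iterable[Path],
    fast_dir: Path,
    results_dir: Path,
    compute: Callable[[Path], Path],
) -> List[Path]:
    """Copy inputs to fast storage, compute there, then copy outputs back."""
    fast_dir.mkdir(parents=True, exist_ok=True)
    results_dir.mkdir(parents=True, exist_ok=True)

    # Stage in: copy each input file into the fast directory.
    staged = [Path(shutil.copy2(src, fast_dir)) for src in inputs]

    # Compute: each call reads a staged file and returns the output it wrote.
    outputs = [compute(path) for path in staged]

    # Stage out: copy the outputs back to longer-lived storage.
    return [Path(shutil.copy2(out, results_dir)) for out in outputs]


def uppercase_copy(path: Path) -> Path:
    """Toy compute step: write an upper-cased copy next to the input."""
    out = path.with_suffix(".out")
    out.write_text(path.read_text().upper())
    return out
```

A call such as `run_with_staging([Path("a.txt")], Path("fast"), Path("results"), uppercase_copy)` would leave `results/a.out` behind. The sketch does not clean up the fast directory afterwards.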

B

Deep learning, in particular, will be enhanced by the hardware of Perlmutter's GPUs. Then, in terms of workflows, we're sort of moving in the direction of building this into the system, so it's easier to manage your workflows. And in terms of data transfer, this is the thing I mentioned at the end: we actually had to do quite a bit of work for external data transfer on Cori. This will hopefully be much better from the outset with Perlmutter's fabric, both for performance and for external access. OK.

B

I probably don't want to take up any more time. I mean, this is to show that it'll be awesome: improvements, with data coming in and out and flying all over the place, controlled in the way that you desire. So...

B

The last thing I just want to say is that, generally, we have quite a lot of staff supporting data use cases at NERSC, and we're here to help. As you know, normally this session is quite interactive, when we're able to meet in person with people. But to give you another way of contacting us, as was mentioned earlier, there's the NUG Slack. This is not an officially supported support channel at NERSC, but you can contact us there.

B

Normally we're hanging out there as well, so feel free to message us directly, with suitable etiquette, and provide any feedback and critique you have, and also any interesting collaboration suggestions. OK, that was me. It's just an overview. All right.

A

I'm ready to take over. OK.

B

Any questions? I can't...

A

There are no questions in the Google Doc. Thank you, Wahid, and sorry about the audio mix-up at the start.