National Energy Research Scientific Computing Center (NERSC) New User Training 2019, 14 Aug 2019

From YouTube: 7. Data Ecosystem Overview

Description

Learn about the data ecosystem at NERSC.

Slides for all sessions can be downloaded from here: https://www.nersc.gov/users/training/events/new-user-training-june-21-2019/

A

So, thank you all for coming back from lunch.

A

Hopefully it wasn't too hard to find the cafeteria and make your way back. I'm Prabhat; I lead the Data and Analytics Services group, and the afternoon session is going to be around the data stack. I'm sure in the morning Helen and Rebecca and a lot of other folks walked you through the broader system, the simulation side of things. We at NERSC certainly appreciate that data is extremely important for science going forward, so the afternoon is going to be all about the data side.

A

So this is the schedule; I did want to connect some names to faces as well. After I give you a quick overview, [name unclear] is going to be talking about data transfer tools: how do you move your data back and forth from a supercomputer?

A

Gellin is going to be chatting about file systems best practices: how do you store, move, and write data to a file system? Quincey Koziol will be joining us remotely, and he'll be chatting about I/O libraries. So there's a lot of emphasis on how you store and move data; that's a fundamental operation. In the part after the break, we're going to shift more towards analytics.

A

So now that you know how to store and manage your data, how do you actually analyze it? Increasingly, Python and Jupyter are key technologies in that space, and Rollin is going to walk you through those. Initially we had Shane Canon in place for Shifter, a particular container technology, but Rollin is going to speak on Shane's behalf.

A

And finally, at the end of the day, I'm going to be chatting about deep learning. So I think that's mostly the DAS staff that's in the room at the moment. You really should feel free to interrupt us at any point and ask us questions. You can engage with NERSC staff by sending tickets to consult@nersc.gov, or you can chat with us in person. So now that we are all here in a room, please interrupt us, ask questions, catch us in the break.

A

Catch us after the day is over, because we're really all looking forward to interacting with you today. All right. So I think I mentioned that data is extremely important for NERSC, and very often, if you look at an organizational structure, you can make out what the priorities are.

A

So at NERSC, we have the systems department that makes sure that our systems are performant and running all the time. You heard again from Rebecca in the morning on the HPC side of things. And we now have a data department whose charter is to make sure that our systems are responsive to the current and emerging data needs of the user community. I lead the DAS group; we manage the user-facing data stack.

A

We have the Data Science Engagement group, led by Debbie Bard, which has specific strategic engagements with different science communities. Damian Hazen leads the Storage Systems group; they manage HPSS, the archival system, and the file systems. And Cory Snavely leads the Infrastructure Services group. So even though today DAS will be presenting the user-facing data stack, there are several groups that have active roles in the data space.

A

All right, so hopefully this is clear to you by now, but we've tried our best to make sure that Cori, as a single unified system, can support both simulation and data workloads. I would say that maybe three or four years ago there was a genuine question mark at NERSC on whether we should have one system that does data and data analytics and a separate system that does simulation. But we made the strategic decision that a single system will do a good job of supporting both.

A

I did want to get a sense of the room. Who's a new user to NERSC, maybe just getting started? If you can raise your hands. All right, sounds good. And how many of you have already logged on to NERSC systems and are familiar with NERSC? Okay, all right! So you're predominantly new users. So we do have the Intel Haswell partition.

A

If you really do not want to modify your code, then the Haswell partition is where you can continue to run your jobs. But going forward, of course, truly leveraging many-core computing is important, and the Knights Landing partition is what is recommended for those needs. In a few slides I'm going to come to some of the data-specific features that we've configured on Cori, but first I do want to walk you through the stack.

A

So again, if you are a data user and there is some software or a service that you want to leverage, this is the production stack that we support at NERSC. Let's talk about data transfer and access for a moment. You have your data set in your lab, or there is maybe a remote instrument, and you'd like to move that data set to NERSC. Then we recommend that you use Globus and GridFTP.

A

Those are the two tools that you can use. Once your data is in place here, chances are that you want to share the data set with the rest of your community, so web portals become very important, and there are a range of technologies that you can use. More and more, beyond just sharing data with other users, it is also important to share code or your analysis scripts, and Jupyter is a key technology that you can choose to leverage for that.

A

Sorry, is there a problem? All right, so workflows.

A

Chances are that you need to move a lot of data, manage a lot of data, and analyze a lot of data, and you need to do this repeatedly, so you want to make sure that the entire workflow is automated. There are a few tools that you can use. FireWorks is a fairly sophisticated tool that understands all of the file systems and the queuing system that, hopefully, you heard about in the morning, and you can choose to use FireWorks to capture and automate your workflow tasks. TaskFarmer is another technology that we support here at NERSC.
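The idea of capturing a workflow so that it can be rerun automatically can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how FireWorks or TaskFarmer work internally; the task names and the `run_workflow` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for a real workflow manager.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "transfer": set(),
    "preprocess": {"transfer"},
    "analyze": {"preprocess"},
    "plot": {"analyze"},
    "archive": {"transfer"},
}


def run_workflow(graph, actions):
    """Run every task after all of its dependencies; return the order used."""
    order = []
    for task in TopologicalSorter(graph).static_order():
        actions[task]()  # a real workflow would submit a job or move data here
        order.append(task)
    return order


# Placeholder actions that only record that they ran.
log = []
actions = {name: (lambda n=name: log.append(f"ran {n}")) for name in workflow}
order = run_workflow(workflow, actions)
print(order)
```

Describing the steps as data, rather than running them by hand, is what makes the same sequence repeatable on the next data set.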

A

So if you have collections of embarrassingly parallel jobs, then TaskFarmer can in many ways take care of that important use case. Now, I'll note that many communities already have workflow tools pre-decided for them, and we try to work with those communities to make sure that their workflow tools will continue to work at NERSC. Now, data management, I think, is a key bit. It's one of those things which you maybe only learn about when you're in grad school or as a postdoc.

A

Someone has maybe already decided a data management scheme for you: you're going to be storing your data as CSV or TXT files, or in some other scheme. But the moment you start talking about big data sets, terabytes of data, tens of terabytes of data, or even hundreds of gigabytes of data, it is really quite critical that you pay attention to how you're storing your data sets.

A

So modern I/O libraries like HDF5, NetCDF, and ROOT have all of the good characteristics, I would say, of a data management solution, so you're welcome to use those; we support those at NERSC. And if it makes sense for you to use a database, then MongoDB, MySQL, and Postgres are what we use. These are all tools that are well supported at this point in time.
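One reason the storage format matters at scale can be seen even without HDF5 installed. The sketch below uses only the Python standard library to compare a text (CSV) encoding of floating-point values with a raw binary encoding; the data are synthetic and the variable names are illustrative. Libraries such as HDF5 and NetCDF add self-description, chunking, and parallel I/O on top of a compact binary layout, none of which is shown here.

```python
import array
import csv
import io
import random

random.seed(0)
values = [random.random() for _ in range(10_000)]  # synthetic data

# Text encoding: one decimal string per row, as in a simple CSV file.
buf = io.StringIO()
writer = csv.writer(buf)
for v in values:
    writer.writerow([repr(v)])
text_bytes = buf.getvalue().encode("ascii")

# Binary encoding: 8 bytes per double-precision value, no parsing needed.
binary_bytes = array.array("d", values).tobytes()

# Reading the binary form back is a direct copy and is exact.
restored = array.array("d")
restored.frombytes(binary_bytes)

print(len(text_bytes), len(binary_bytes))
```

The text form is larger and has to be parsed value by value on every read, which is the kind of cost that becomes noticeable once data sets reach hundreds of gigabytes.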

A

I'll mention quickly that, in terms of visualization capabilities, if you care about scientific visualization, then VisIt and ParaView are two tools that the DOE, the Department of Energy, community has been developing for a long time.

A

But if you care about information visualization, then of course matplotlib in Python and ggplot in R are reasonable choices. Now, frankly, I think a lot of the buzz in the entire data stack tends to be in the analytics area, and this capability in particular has been evolving very, very fast over the last five or ten years.

A

So you're not going to see C, C++, and Fortran here on this slide. I think we all recognize that people care about higher-level languages, so more and more Python is the recommended language if you care about generic analytics. If you're a statistician and you want to run sophisticated statistical analysis, then R is a tool that you can use. Julia is an emerging language that you may choose to explore. Spark is an interesting analytics framework that, again, you can also leverage.

A

Now, there are legacy tools like MATLAB and Mathematica that, of course, have been there for a while and will be around, so you're welcome to use those. And finally, there are a bunch of libraries in the deep learning space that I'm going to get to towards the end of the presentation. So this entire stack is in production. It's there, it's available to you, and you're welcome to use it. There is documentation. You can file trouble tickets with us.

A

We try to make sure that we work with vendors to ensure that every single technology in the stack is performant and scalable and runs well on our systems.

A

So beyond just the software and services, we've tried to make sure that there are a number of features on Cori which are data friendly. As I mentioned, we do have Haswell and KNL compute nodes. We do also have a large number of login nodes that you can use.

A

Rollin is going to go into Jupyter notebooks and the fact that we have dedicated nodes for Jupyter; soon we'll have dedicated compute back-end nodes for Jupyter as well. Perhaps there are some jobs that will run in serial but require a lot of memory, so there are some big-memory nodes that you can use. There are some dedicated workflow nodes where, perhaps, you need to let your workflow manager run for a long time, so those nodes can be used.

A

I think we also certainly appreciate that data users require maybe a single serial job running for a long time, multiple serial jobs running on a single node, and then also the capability to move data. So there are dedicated queues that you can leverage.
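The "multiple serial jobs on a single node" pattern, which is also the embarrassingly parallel case a task farm handles, can be sketched as a pool of workers pulling independent tasks. This is a minimal illustration under stated assumptions: `analyze` is a hypothetical stand-in workload, and a thread pool is used only to keep the example portable; a real CPU-bound farm would use separate processes or a tool such as TaskFarmer.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(task_id):
    """Stand-in for one independent serial job (hypothetical workload)."""
    return task_id, sum(i * i for i in range(task_id * 1000))


def run_farm(task_ids, workers=4):
    """Run independent tasks on a fixed-size worker pool; map id -> result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(analyze, task_ids))


results = run_farm(range(8))
print(sorted(results))
```

Because no task depends on another, the tasks can be handed to whichever worker is free, and the number of workers can be matched to the cores on the node.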

A

One of the unique things that we see in the data space is real time. Perhaps you have a cryo-EM microscope, or a telescope, or some other device, and it is time sensitive for you to move the data from a remote source to a compute node and do the analysis.

A

So now there are real-time queues in place that can let you do that. Interactivity, again, is really quite key. As a data user, maybe the prospect of waiting in the queue for three days to run your analysis is not very appealing, so you can use the interactive queue, submit your job, and hopefully you'll get a command shell on some compute nodes in a short amount of time. I/O, again, is key, so making sure that you can read and write data fast is important.

A

And I think what we are seeing increasingly is that GPFS and Lustre file systems are not keeping pace, so the burst buffer technology is something that you can choose to use. All right. So all of these are features that we've tried to configure on Cori to make sure that you, as a data user, are productive, but if there are things that you're still struggling with, please let us know.

A

All right, so I think I'll just make a few asks of you for the remainder of the four hours. Please engage with us. The reason we've set aside four hours today is to be able to talk to you and maybe educate you, but then also learn from you what is and what is not working well. And do tell us about your interesting science problems. Fundamentally, the reason we in the group and other staff at NERSC work at NERSC, as opposed to doing the same job in industry, is because we care about science.

A

So if you have any interesting science problems that you want to work on, that you want to have breakthroughs in over the coming years, then please tell us about them, and we can provide you with some pointers. All right, so I'm going to stop there, and while we do the switch, are there any questions or comments for me?

A

All right, so [name unclear], I think you're...