From YouTube: 03 - Overview of NERSC DL Stack - Wahid Bhimji
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
Wahid is a big data architect here at NERSC, and he's a physicist by training.
He was in the UK handling data challenges for the Large Hadron Collider before coming here, and, you know, over the years he's definitely developed a lot of expertise in deep learning as applied to high energy physics, for classification and generative problems.
So I'm just going to give a super brief introduction to NERSC, talk about the production stack at NERSC, the tools that we provide and why, and then a little bit of practical information. That's probably what you'll actually want to pay attention to, at the end.
Okay, so you heard this from the earlier speaker. NERSC is the mission HPC center for the Department of Energy, so we support the full range of Department of Energy science and, you know, a vast number of DOE scientists. The main machine we have on the floor is Cori. There was an Edison box here until recently, but that's retired now, so the only machine really is Cori. It's predominantly made up of around 10,000 Knights Landing CPU nodes.
So that's where the bulk of the flops come from, and when this machine was installed it was, you know, among the biggest machines by flops in the country, but that's dropped down now. So we have this combination of Haswell as well as Xeon Phi nodes, and these are all connected with a high-speed interconnect from Cray. And then, of course, large file systems: both a Lustre file system and a flash burst buffer SSD.
Then there's a GPU test system, and that just has a relatively small number of V100 Volta GPUs, okay.
So the reason we have a GPU test system is partly because we're expecting a big GPU machine, our next machine, Perlmutter. This should have about four times the capability of Cori, and, you know, a lot of those flops will come from the GPU-accelerated nodes, which are exciting to people doing deep learning, but there are also many other parts.
So I guess we're hoping that you guys are going to push science to use deep learning and therefore exploit the GPU nodes. But even if you don't manage to do that, there'll still be a large amount of work that needs to run on CPU nodes, and so we'll have a CPU partition that's, you know, as big as Cori but composed of AMD CPUs. And so this machine should fly for deep learning.
So we'll have optimized software to provide that, and a lot of the preparation for that is happening now. And then the GPU nodes will each be comprised of four of these next-generation GPUs. So the current test machine has these V100 GPUs; this will actually have the next generation of GPUs, and, you know, a lot of the details about that are secret, but it will at least have tensor cores, of course, and NVLink for connecting the GPUs together, so you can use them all together. And then this will all be connected again by a high-speed interconnect.
But one of the differences on this machine is that its interconnect is Ethernet-compatible, so that will make it much easier to also transfer big data sets from outside into the machine, you know, for supporting experimental science. So that's exciting. And then, currently, we have this relatively small burst buffer as part of Cori, but on the new system the file system will be flash-based. And this is coming in late 2020, okay. So, to get to the production stack.
So, you know, as Prabhat mentioned, machine learning in science is certainly growing. Last year we did a survey, and there are lots of interesting results from that survey that we could talk about another time. But one thing we saw is that the respondents to the survey came from across various types of science, so there's interest across science.
With your projects, you know, you're learning more about supervised and unsupervised learning and different techniques, and across those there are science examples already, and we have in-depth experience of some of those. So, given that interest and need for deep learning, we want to provide a platform, if you like, for doing that. So, you know, at the top here are the scientists, or actual experiments, and they should have both, you know, interactive ways of doing that.
So this is where Jupyter notebooks and stuff come in, but they should also be able to, you know, plumb into automated pipelines, and then they should sit on top of suitable methods and stuff. So that's where things like this school help, to, you know, push and encourage cutting-edge methods. But then these should sit on top of optimized libraries, so we work on making sure that libraries work well on the hardware that we have. We also try and get the best hardware to meet this need.
Okay, so to be more specific, here's kind of like the deep learning stack on an HPC machine, and there's a bunch of things like the hardware here and the libraries that you probably don't have to worry about as a user. You know, we will be talking a bit about the deep learning libraries and the distributed libraries in Friday's session on distributed training, but most of what the user sees is really up here in these high-level frameworks. And talking about high-level frameworks:
PyTorch is also a significant and growing framework of interest, and Steve supports PyTorch on our machines, and a bunch of people are talking about it this week, so I'm not going to criticize PyTorch much, and I know there are many great things about it as well. But a lot of the exercises we'll be doing are in TensorFlow, which is the easiest for this work, particularly. Okay.
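As a preview of that Friday session, the core idea behind most distributed deep learning is synchronous data parallelism: each worker computes a gradient on its own shard of the data, the gradients are averaged across workers (an "allreduce"), and every worker applies the same update. Here is a minimal toy sketch in plain Python; the linear model, the shard layout, and the learning rate are illustrative assumptions, not the actual NERSC setup.

```python
# Toy sketch of synchronous data-parallel training.
def local_gradient(w, shard):
    # Gradient of mean squared error for the model y = w * x on one shard.
    return sum(2.0 * x * (w * x - y) for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.05):
    grads = [local_gradient(w, s) for s in shards]  # computed in parallel, one per worker
    g = sum(grads) / len(grads)                     # allreduce: average the gradients
    return w - lr * g                               # identical update on every worker

# Fit y = 3x from two shards of (x, y) pairs held by two "workers".
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # prints 3.0
```

Because every worker sees the same averaged gradient, the model weights stay identical everywhere, which is what makes this scheme simple to reason about at scale.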
So now the practical part, so you've got no choice but to pay attention, okay.
So today's hands-on session, and the little one we have tomorrow, and the Thursday lunch self-guided one, will all be run in Jupyter at NERSC. For this, people have to have NERSC accounts. This is a different Jupyter site, Jupyter DL, and I'll show you it in a minute. Or, if you want, you can just run these exercises in Google's Colaboratory; we have the links for both. So you can just run them on Google's service if you want, but if you want to use this machine:
Okay, yeah, well, so, ultimately, at the moment there is a four-hour timeout. I mean, this is something we've just set up for the school to do the hands-on exercises, and I will come to what hours it's available and things like that. We have a reservation; in general, the service is at NERSC.
If you take that form, after I finish talking, to the registration desk, then you can get a training account that has a username and password. Right now, even if you have a NERSC account, if you want to use the reservation in this thing, you will need to use that training account, and you need to return the form to get the training account. But you can still run the notebooks on regular Jupyter if you want; you just won't get a GPU, and so they will run slowly.
So, you know, probably you want to do this. Just a comment: there is a little box for an OTP (I'll show you the Jupyter login); there isn't an OTP for these training accounts, so just leave that blank, okay. Then another practical thing: tomorrow's working lunch will make you work again, but again, it's a lightweight working lunch with an expert, so we have various rooms and various people, including the speakers from tomorrow.
Okay, so I'm just going to show you briefly how to run this Jupyter DL. So, as I mentioned, these GPUs are reserved during the hands-on sessions. Outside those hours the account will work, but we are sharing these GPUs with others, so you may not get a GPU, and in that case the server won't start up and you'll probably get an error message, but it will be pretty obvious what's happening; and similarly after 6 p.m.
Yeah, so there's an OTP box not filled in, is that what you said? Okay, yeah, and so you'll just put your training account and password here, and you don't use an OTP. So this is a bit small now, but basically this is the GPU option. So you just press Start, hopefully, and it takes a moment, because it's actually starting a batch job, and it might.
If you like, so, for example, gpustat is installed, which is a nice program for seeing what GPU you have; run that and you see GPU 0 or 1 here and its utilization. And then there are various notebooks, you know, that Steve will show you in the afternoon, that you can run. Okay, so one gotcha: occasionally, if you have multiple, if you just download...
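If gpustat isn't available in your environment, a similar check can be done from Python by shelling out to nvidia-smi. This helper is a hypothetical convenience, not part of the NERSC stack; it assumes only that nvidia-smi is on the PATH whenever a GPU node was actually allocated.

```python
# Hypothetical fallback for gpustat: list visible GPUs via nvidia-smi.
import shutil
import subprocess

def list_gpus():
    """Return the lines that `nvidia-smi -L` prints (one per GPU), or [] if none."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on PATH, so no GPU visible
    try:
        out = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return []  # nvidia-smi present but failed, treat as no GPU
    return [line for line in out.stdout.splitlines() if line.strip()]

print(list_gpus())  # one 'GPU 0: ...' line per device on a GPU node, [] elsewhere
```

This is handy in a notebook's first cell to confirm whether the batch job actually landed on a GPU before running the exercises.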
Okay, and then one other point: so, as I mentioned, this is running on the batch system. It will hold a GPU for four hours, and just killing the window isn't enough to kill the process. So if you do want to be a nice citizen, you should go to the control panel and stop the server, like this, and then, you know, it gives you some feedback: has it stopped yet? It's stopped, so then you can start again, but then you can log out, or what have you.
Okay, so that's that information. I don't know if there are any questions on that. Actually, maybe I should take a question, if there are any questions on using this. So, Steve? Oh, okay, you have a... yeah, yeah: don't do hardcore training on the CPU. I mean, it should restrict you to one core, and... right, yeah.
Okay, we're here to help. So, you know, there's a bunch of people: at the top are the organizers. Mikhailo isn't here today, but she'll be here tomorrow. And then we have a bunch of people who've kindly volunteered to help out as TAs. They have a range of skills: some are more systems experts, some are more deep learning experts. So, you know, you'll just have to ask whoever, and they'll find the right person to help you. So commit all these faces to memory.
These are on the websites; these are all TAs, and Torsten will be helping out at the scaling session and stuff, as he is an expert. Okay, so, conclusions: deep learning is awesome for science, and at NERSC, you know, we build on tools and hardware and, you know, algorithms and stuff to make sure we can run these on our machines at scale. There are various challenges we face in doing this; it's not just computational, but also methodological and practical challenges. So we do welcome, you know, new ideas and collaborations and whatnot.