From YouTube: Deep Learning at Scale on Perlmutter
Description
Part of Data Day 2022, October 26-27, 2022.
Please see https://www.nersc.gov/users/training/data-day/data-day-2022/ for the training agenda and presentation slides.
Steve Farrell: I'm a machine learning engineer, one of a couple of them in the Data and Analytics Services Group at NERSC. Broadly speaking, I support machine learning workloads on our NERSC supercomputers, and of course there are a lot of things included in that, some of which I'll talk about today.
So the title is Deep Learning at Scale on Perlmutter. I'm going to talk a bit about our offerings at NERSC, the kinds of things we do to support and enable cutting-edge AI or deep learning for science. I'm also going to include some material from a tutorial that we do regularly at Supercomputing and some other places with the exact same name, Deep Learning at Scale. And I have some fairly fresh plots from our machine learning at NERSC survey, which are nice for illustrating what the community is doing and how we think about supporting them. I'm not going to say too much about deep learning or AI methods in an introductory sense.
A
I
I
do
have
some
links
to
other
resources,
Outreach
events
that
we've
done
in
case,
that's
of
interest
to
you
I,
will
really
only
kind
of
touch
on
things
that
are
most
relevant
to
what
I'm
going
to
cover
for
how
you
deploy,
workloads
and
deploy
them
at
scale.
So
I'm
also
happy
to
answer
any
questions
that
come
up
of
course,
but,
as
we
probably
all
are
acutely
aware,
AI
is
is
really
kind
of
taking
over
the
world
in
a
lot
of
ways.
A
It's
it's
certainly
transforming
science
and
and
shows
a
lot
of
capability
to
to
keep
transforming
science,
AI
or
machine
learning
or
deep
learning.
I
may
use
these
interchangeably,
but
the
title
is
deep:
learning
I'm,
mainly
focusing
on
on
deep
learning
methods,
because
that
that's
the
kind
of
methodologies
in
AI
that
that
have
really
been
dominant
these
days,
but
these
are
powerful
capabilities
for
scientific
workflows.
A
Just
a
few
bullets
here
to
give
you
a
flavor
of
the
kinds
of
things
that
people
are
doing,
it's
not
exhaustive,
but
people
are
using
these
methods
to
help
with
analysis
of
large
data
sets.
Maybe
data
sets
that
traditionally
require
more
like
hand.
A
Labeling,
maybe
you
don't
have
an
analytical
way
of
doing
your
analysis,
but
now
you
can
automate
it
with
machine
learning
or
ways
where
you
had
traditional
approaches
to
analyze
that
data,
but
you
know
maybe
they're,
based
on
some
sort
of
assumptions
or
simplifications
and
machine
learning
methods
are
able
to
get
more
out
of
your
data.
A
Another
area,
that's
pretty
relevant
for
the
HPC
space,
is
acceleration
of
expensive
simulations.
Of
course,
the
dominant
types
of
workloads
on
HPC
still
today
are
these
in
a
large,
large-scale
simulation
workflows
and
a
lot
of
these
science
domains
are
really
Limited
in
the
kinds
of
science
they
can
do
by
how
expensive
those
simulations
are.
A
They
cannot
simulate
systems
large
enough
or
enough
systems
in
order
to
have
a
good
estimate
for
things
they're
trying
to
compute,
and
so
there's
a
lot
of
excitement
and
a
lot
of
work
going
on
in
in
trying
to
replace
either
simulations
completely
or
some
of
the
calculations
that
happen
in
simulations
with
faster
AI
methods.
A
So
science,
of
course,
in
the
doe
as
well,
are
very
enthusiastic
about
this.
There's
a
lot
of
research
going
on
a
lot
of
R
D.
The
landscape
is
evolving
rapidly
and
partially.
That's
because
it's
evolving
rapidly
elsewhere
too,
in
industry
and
stuff
like
that,
but
yeah
the
doe
has
been
taking
notice
and
you
know,
as
the
EC
is.
The
exascale
Computing
project
is
winding
down.
A
There's
some
anticipation,
hopefully
for
a
future
similar
scale
project
on
AI
for
science
and
while
the
things
are
still
in
some
sense,
new
AI
for
Science
and
rapidly
evolving.
Still,
we
do
see
that
some
areas
are
starting
to
move
into
maturity,
which
is
pretty
cool
to
see
and
these
workloads
increasingly,
they
need
large
comput
computational
resources,
even
in
the
cases
where
they're
replacing
very
expensive
simulations
still,
these
can
can
need
a
bit
of
compute
so
especially
as
it's
maturing
we're
we're
looking
at
folks
tackling
probably
like
larger
problems.
A
Larger
data
sets
because
these
methods
tend
to
be
more
powerful
with
larger
data
sets
so
they're,
looking
at
more
complex
problems,
they're
using
larger
models
to
get
even
better
results,
so
everything's
kind
of
growing
in
size
and
complexity,
which
means
the
computational
costs
grow,
and
you
know
we're
looking
at
basically
that
HP
centers
may
be
like
the
really
a
key
role.
A
This
is
like
a
very
broad
overview
of
how
we
articulate
our
AI
strategy
at
nurse.
So
how
do
we
support?
You
know
this
new
emerging
way
of
doing
science?
First,
we
try
to
deploy
optimize
hardware
and
software
systems.
We
also
work
with
Scientists
to
apply
AI
on
across
different
domains.
We
we
try
to
keep
up
on
The,
Cutting,
Edge
methods
and
tools.
We
have
ways
of
engaging
basically
different
research
groups
and
try
to
do
some
ourselves,
but
we
also
try
to
really
educate
and
empower
the
community
as
well.
A
So
we
do
a
lot
of
Outreach
seminars,
workshops,
training
events
like
this
one,
even
things
that
we
call
schools.
On the hardware side, you've already heard about Perlmutter, so I'm not going to say everything about it, but we call it a scientific AI supercomputer, though maybe not everybody else was calling it that at the time.
On the software side, we try to find a good balance between providing users with things that are well optimized for our systems, while also letting people have the flexibility they need to bring their own software environments and their own setups.
So we build optimized modules for the most popular frameworks. There are the usual Anaconda Python ones (we heard about Python earlier), but we also build and deploy PyTorch and TensorFlow with recommended libraries and backends for running on our systems. We also heavily support containers, particularly the optimized NGC deep learning containers from NVIDIA. Of course, we run things via Shifter, as we heard about earlier, and eventually Podman. Users can also bring their own images, or customize on top of our images or the NVIDIA images, and so on.
Of course, it's also fully possible for folks to just build their own conda environments and set up their machine learning software that way.
But it's not just about the frameworks; there's also a whole ecosystem that's growing, also rapidly evolving. There are a lot of other things that users doing machine learning like to use, for example hyperparameter optimization. This is where you're trying to train a model but you don't really know all of its settings up front, like the number of layers or the learning rate, so you search over them; a minimal sketch of the idea follows below.
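As an illustration of the general idea (not a NERSC-specific tool), here is a minimal random-search sketch in PyTorch; the toy model, the search ranges, and the train_and_evaluate helper are all hypothetical stand-ins:

```python
import random
import torch
import torch.nn as nn

def build_model(num_layers: int, hidden_dim: int) -> nn.Module:
    # Stack a configurable number of hidden layers (hypothetical toy model).
    layers, in_dim = [], 32
    for _ in range(num_layers):
        layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, 1))
    return nn.Sequential(*layers)

def train_and_evaluate(model: nn.Module, lr: float) -> float:
    # Placeholder objective: one gradient step on random data, then report loss.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = torch.randn(64, 32), torch.randn(64, 1)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return nn.functional.mse_loss(model(x), y).item()

# Random search over learning rate and number of layers.
best = None
for trial in range(10):
    lr = 10 ** random.uniform(-4, -1)   # sample the learning rate log-uniformly
    num_layers = random.randint(1, 4)
    score = train_and_evaluate(build_model(num_layers, 64), lr)
    if best is None or score < best[0]:
        best = (score, lr, num_layers)
print("best (loss, lr, num_layers):", best)
```

In practice, dedicated tools run many such trials in parallel across nodes and prune unpromising ones early.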
A
Jupiter
is
a
very
popular
service
at
nurse,
something
like
over
2
000
nurse
users
are
somewhat
regularly
using
Jupiter
and
the
Machine
learning
users
also
like
to
develop
their
things
in
Jupiter
a
lot.
So,
of
course,
we
support
that
we
provide
kernels
and
users
can
have
their
own
kernels
for
profiling
and
visualization.
We
recommend
Nvidia
profiling
tools,
but
people
like
to
use
tensorboard.
A
We
have
a
nice
way
of
of
launching
tensorboard
from
Jupiter
Hub
and
we
use
weights
and
biases
a
lot
and
encourage
folks
to
to
try
that
it's
a
great
way
to
log
experiments
and
also
to
do
hyper
parameter
optimization.
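As a hedged illustration (the project name, config values, and metric are made up), a minimal Weights & Biases experiment-logging loop might look like this:

```python
import wandb

# Hypothetical project and config values, for illustration only.
run = wandb.init(project="my-science-project", config={"lr": 1e-3, "epochs": 3})

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)  # stand-in for a real training loss
    wandb.log({"epoch": epoch, "train_loss": train_loss})  # logged to the dashboard

run.finish()
```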
A
So
we
we
do
see
the
the
AI
workload
is
is
growing
at
nurse.
It's
still
a
small
piece
of
the
pie,
but
we
anticipate
it
just
to
keep
taking
off
as
as
time
goes
on,
we
do
track
the
machine
learning
software
usage
to
some
extent.
This
is
not
all
fully
functional
on
Pro
Mudder.
Yet,
unfortunately,
because
it'd
be
really
nice
to
see
the
kind
of
uptake
we,
we
have
right
now
with
GPU
system,
but
we
we
generally,
we
can
track
things
like
module
loads
and
python
Imports.
A
That
might
have
been
mentioned
earlier
earlier
today
on
how
that
mechanism
Works.
But
we
have
some
data
here
that
goes
back
from
2017
and
we
can
see.
There's
you
know
a
pretty
steady
increase
in
the
the
number
of
users
there
more
than
six
times
grow
from
2018
to
2021.
A
We
also
track
Trends
and
engage
with
the
community
in
in
this
machine
learning
at
nurse
survey,
you
may
have
seen
some
emails
from
us
earlier
this
year.
We're
we're
still.
We
still
have
this
one
for
this
year.
Open
and
I
encourage
everybody
to
help
us
out
by
filling
out
that
survey
and
telling
us
about
your.
You
know
what
you're
doing
with
machine
learning
and
what
you
need,
but
the
survey
targets
the
communities
we
use
the
nurse
user
list
and
some
others.
A
So
it's
it's
it's
folks
that
they
may
not
all
be
using
nurse,
but
at
least
they're
like
potential
users
of
nurse
resources
and
definitely
doing
machine
learning
for
Science.
And
we
ask
things
about
you
know
the
kind
of
problems
they're
doing
the
kind
of
models
they
use,
the
the
kinds
of
compute
resources
they
need,
oh,
and
how
happy
they
are
with
the
things
we
have
in
Earth.
A
So
these
are
some
preliminary
results
from
our
survey
this
year,
I'm
not
going
to
spend
too
much
time
because
I
want
to
be
able
to
get
into
the
interesting
things
later
on
in
the
talk.
But
I
wanted
to
kind
of
show
these,
because
it
there
are
some
nice
and
usually
in
useful
insights
into
what
the
scientific
communities
are
doing
these
days.
A
So
we
asked
you
know
the
kinds
of
ways
that
machine
learning
fits
in
their
in
our
in
the
community's
workflows
and
we
see
most
people
are
are
in
this
first
category,
at
least
in
the
respondents
right
which,
which
could
still
be
a
biased
sampling
of
the
real
communities.
But
most
people
are
are
in
this
mode,
where
they're
doing
machine
learning
for
offline
data
analysis.
A
But
we
do
see
the
second
biggest
one
here
is
and
in
the
third
actually
are
related
to
combining
machine
learning
with
simulation
which
are
cool
to
see,
and
we
see
folks
wanting
to
do
machine
learning
for
more
real-time
or
online
data
analysis
and
a
little
bit
here
of
folks.
Looking
at
controlling
scientific
instruments,
a
lot
of
people
are
still
using
convolutional
neural
networks,
which
is
not
too
surprising
but
I.
Think
in
one
of
our
earlier
surveys
traditional
ml
was
kind
of
a
dominant
one.
A
So
now
we
see
some
turning
point.
I
think
where
also
down
here
on
the
left
more
folks
are
using
pytorch
than
at
least
claim
to
be
using
scikit-learn.
That
was
definitely
flapped
before
or
it
was
psychic
learn
then
tensorflow
and
pytorch.
So
you
can.
We
can
see
Trends
over
time,
which
is
pretty
cool,
yeah,
I,
don't
know
what
else
I
need
to
call
out
here,
but
you
can.
You
can
take
a
look
at
these
offline.
Of
course
those
lines
will
be
shared.
A
We
ask
about
the
scale
of
resources
that
people
need.
So,
while
still
it
looks
like
the
bulk
of
respondents
have
problems
that
are
not
very
large.
They
can
train
models
on
a
single
GPU
in
hours.
Relatively
small
data
sets
tens
of
gigabytes,
maybe
single
device
or
single
node
kind
of
scale.
But
we
do
see
these
Tails
here,
where
we
need
to
try
and
think
about
how
we
support
those
users
that
you
know
it
takes
months
or
even
years,
apparently
to
train
models.
A
They
have
terabytes
of
data,
hundreds
of
terabytes
of
data
they
might
be
able
to
run
on
hundreds
or
thousands
of
gpus
and
use
various
forms
of
parallelism
in
training.
Their
models
also
come
back
to
the
the
forms
of
parallelism
here,
a
little
later.
Okay, I already sort of said this, but we do see folks with large problems and potentially a need for large-scale training. Not too much to say on this one other than some takeaways: about half of people say they like to use Jupyter notebooks to develop their models, so that's something we have to take into account, and a lot of people are still using CPUs, like on Cori Haswell, up here on the upper right.
A
It
looks
like
more
people
are
using
CPUs
for
inference
than
than
gpus,
which
is
a
bit
interesting,
but
again
could
be
about
a
little
bit
by
a
sampling,
because
it's
the
nurse
users,
a
lot
of
them,
know
Corey
a
little
bit
more
on
the
kinds
of
Outreach
that
we
do
so
that
empowerment
aspect
of
our
strategy
we
for
for
a
couple
years
in
a
row.
We
did
this
deep
learning
for
Science
school
in
2019.
It
was
an
in-person
event
week
long.
A
It
was
really
great
a
lot
of
great
speakers.
We
had
a
good
Hands-On
sessions
and
posters.
You
can
find
all
the
videos
and
content
there
on
the
web
in
2020
because
of
the
pandemic.
We
switched
to
a
webinar
Series,
so
every
week
there
would
be
a
speaker,
fewer
Hands-On
things,
but
still
some
some
code
examples
and
we
did
record
all
those
talks.
A
You
can
also
see
those
we
had
a
lot
more
introductory
stuff
in
2019
and
then
in
2020
it
started
to
get
we
featured
more,
not
quite
Advanced,
let's
say
more
advanced
scientific,
relevant
topics.
A
I
mentioned
that
we
do
this
deep
learning
at
scale.
Tutorial
we've
been
doing
that
quite
a
while,
or
at
least
since
2018,
at
pretty
much
every
Super
Computing,
it's
some
ISC
conferences
in
Europe
and
and
some
others
last
year
at
SC.
That
was
the
first
time
we
got
to
use
Pearl
mutter
for
this,
which
was
pretty
fun
while
we're
doing
it
again
this
year.
So
if
you're
going
to
SC
feel
free
to
check
it
out
and
I
link
to
the
full,
the
full
video
there.
A
We
also
posted
videos
because
we
were
pre-recording
videos
back
then
other
things
we've
been
doing
not
too
long
ago.
There
was
this
Nvidia
organized
AI
for
science
boot
camp
and
they
sort
of
did
in
collaboration
with
us
and
we
opened
it
up
to
users.
So
that
also
had
a
good
bit
of
introductory
stuff.
Sorry
for
the
slack
pinks
here-
and
you
can
I
think
view
slides
on
that
web
page,
and
then
we
do
things
like
the
new
user
training
events
regularly.
A
Day-To-Day
events
like
this
here
you
are,
and
probably
others
that
I
may
have
forgotten
about.
Okay,
so
now
I'll
switch
gears
a
little
bit
and
start
to
get
into
the
content
from
the
tutorial.
So
this
that
we
usually
do
like
a
full
day
tutorial,
so
obviously
I
can't
cover
a
lot.
But
this
is
to
give
you
a
little
flavor
and
cover
some
some
aspects
of
that
that
that
hopefully
you'll
find
useful
or
interesting,
and
maybe
you
can
follow
up
and
ask
questions
or
go
check
out
the
full
material
if
you're,
if
you're
interested.
A
But
you
know
the
real
theme
there
is,
how
do
we
optimize
deep
learning
workloads
on
HPC
and
particularly
for
them
to
run
at
Large?
Scale,
really
try
to
optimize
like
time
to
time
to
solution
right
for
scientists,
because
scientists
need
fast
and
efficient
methods.
They
need
this
to
enable
rapid
development
and
testing
of
their
ideas,
but
not
just
that.
They
may
also
really
need
optimized
machine
learning
workloads
to
fit
within
their
production
workloads
to
fit
whatever
computational
constraints.
There
may
be,
maybe
there's
an
experimental
instrument
like
the
Large
Hadron
Collider.
A
That
needs
to
be
able
to
very
quickly
make
decisions
about
what
data
to
ride
out
or
folks
are
maybe
trying
to
replace
part
of
a
simulation
with
a
machine
learning
model.
But
if
it's
not
fast,
then
you
didn't
really
save
anything,
but
also
as
a
center.
We
need
to
think
about
how
we
optimize
these
workloads
for
all
users,
so
that
overall,
the
the
throughput
of
nurse
in
terms
of
science
is
optimized.
A
So
if
you
can
make
effective
use
of
modern
HPC
systems
like
promoter,
this
can
greatly
accelerate
these
workflows
and
and
I
think
the
situation
is
getting
a
bit
easier
with
software
and
methods
and
stuff,
but
it
can
still
be
non-trivial.
A
So
there's
still
kind
of
a
need
for
for
this
sort
of
tutorial
content
falling
bit
behind
so
I'm
going
to
try
to
go
a
bit
fast
here,
but
hopefully
I'll
be
able
to
at
least
get
the
important
point
across
Point
points
across
and
and
folks
can
ask
questions
where
needed
so
yeah.
So
deep
learning
is
very
powerful
and
it's
it's.
A
It's
showing
a
lot
of
promise
on
a
lot
of
different
application
areas
but,
as
already
said,
it's
computationally
intensive,
especially
if
we
look
at
training,
so
training,
big,
deep
neural
network
models
and
and
again
as
we
look
at
more
complex
problems.
Larger
data
sets
larger
models
that
compute
crust
costs.
These
are
actually
growing
with
time.
A
This
is
an
open,
AI
plot,
it's
actually
a
bit
old
now
it
doesn't
show
all
the
latest
developments
with
language
models,
but
you
can
just
see
that
there's
this
exponential
growth
in
the
amount
of
compute
needed
to
train
popular
machine
learning
models
out
there.
So
what
do
we
do?
How
do
we
make
effective
use
of
HPC
for
this
in
the
tutorial?
We
break
it
up
into
these
sorts
of
categories.
So
first
we
look
at
optimizing
the
performance
of
a
training
workload
on
a
single
device,
because
there's
really
no
point
in
scaling.
A
If
you
can
just
get
a
lot
of
you
know,
it
makes
sense
to
First
Look
at
that
before
you
just
try
to
throw
hundreds
of
gpus
at
a
problem
right,
give
you
much
more
efficient
in
the
end
and
then
and
then
we
talk
about
Distributing,
the
training
across
multiple
gpus
and
multiple
nodes
on
our
systems,
and
then
we
talk
a
little
bit
about
optimizing
now
the
distributed
performance
at
scale.
I
won't
talk
really
at
all
about
the
third
one
here,
and
I
really
only
have
a
little
bit
on
the
first
one.
A
So
this
is
this
is
some
content
mostly
developed
by
the
Nvidia
folks
that
we
collaborate
with
in
that
tutorial?
This
slide
comes
from
from
one
of
our
our
tutorial
last
year,
but
yeah.
So
in
the
tutorial,
when
we
look
at
optimizing,
this
training
example
on
a
single
GPU.
A
We
use
Nvidia
Insight
systems
to
do
this,
which
is
really
a
pretty
powerful
tool
using
a
profiler
as
it
says
here,
it's
an
essential
step
in
optimizing
any
code
and
insight
systems
lets
you
view
a
nicely
organized
well,
debatably,
I,
guess,
I
think
you
have
to
get
used
to
it,
but
it
gives
you
a
nice
view
of
the
timeline
where
you
can
kind
of
look
at
what's
going
on,
and
our
tutorial
example
is
really
nice
actually
because
the
kinds
of
things
that
you
might
see
in
the
real
world
you
can
see.
A
In
that
example,
you
can
see
things
like
gaps
that
come
from
data
loading.
You
can
see
things
like
GPU
not
being
utilized
super
well,
because
there
are
a
lot
of
many
small
kernels
being
launched,
and
then
we
were
able
to
talk
about
the
ways
that
you
you
improve
on
that,
so
it
can
yeah.
It
can
basically
shed
light
on.
So the profiler can basically shed light on what's going on in your data pipeline and in the GPU scheduling of kernels, and you can annotate regions of your code with these NVTX ranges. All of that is covered in the tutorial, but a rough sketch of how you run Nsight Systems and annotate your code is shown below. Then there are the kinds of things that are important for optimization; these are really just lifted from the tutorial, and it's a nice example because all of them apply there and we get good speedups.
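For instance, you might launch the profiler with something like `nsys profile -o report python train.py` (flags vary by version), and mark the phases of a training step with NVTX ranges so they show up on the timeline. A minimal PyTorch sketch, where the region names are arbitrary illustrations:

```python
import torch

model = torch.nn.Linear(32, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    x = torch.randn(64, 32, device="cuda")
    y = torch.randn(64, 1, device="cuda")

    # Each push/pop pair becomes a labeled region on the Nsight timeline.
    torch.cuda.nvtx.range_push("forward")
    loss = torch.nn.functional.mse_loss(model(x), y)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    opt.zero_grad()
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    opt.step()
    torch.cuda.nvtx.range_pop()
```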
Data loading is a frequent cause of performance loss for users, even for experts. Really, it's basically the first thing to check. In the tutorial we talk about ways to parallelize your I/O, and then, to take it further from there, you can for example use NVIDIA's DALI library, which has a lot of nice features for deep learning data pipelines.
DALI has nice features that parallelize and cache data on the fly, and it can also do a lot of your data augmentations and pre-processing on the GPU. The little plot in the upper right shows, for our tutorial, the kinds of speedups we get from the various stages of optimization just in the data pipeline: parallelizing the I/O, caching things in memory, and then going to DALI. We get over 2x performance just from that; at least, I think that's the end-to-end speedup. A generic sketch of a parallelized input pipeline is below.
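The tutorial's own pipeline uses DALI; as a simpler, hedged illustration of the basic idea in PyTorch (the dataset here is a stand-in), multiple DataLoader worker processes let I/O overlap with GPU compute:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    # Stand-in dataset; in practice __getitem__ would read and decode files.
    def __len__(self):
        return 1024
    def __getitem__(self, idx):
        return torch.randn(32), torch.randn(1)

# num_workers spawns parallel loader processes; pin_memory speeds up
# host-to-GPU copies. Good settings are workload- and system-dependent.
loader = DataLoader(RandomDataset(), batch_size=64,
                    num_workers=4, pin_memory=True)

for x, y in loader:
    pass  # the training step would go here
```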
Mixed precision is often a very useful way to speed up training. It helps you leverage the tensor cores on modern GPUs, and it can reduce memory usage and so on. The frameworks now provide pretty nice capabilities for this: they make it pretty easy to do automatic mixed precision, where the framework uses FP16 where it can, and they give you features to avoid the numerical underflow issues that can come about, by automatically scaling the gradients in the computations that risk numerical issues. A minimal sketch is below.
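In PyTorch, for example, automatic mixed precision with gradient scaling takes only a few lines (a minimal sketch, not the tutorial's exact code):

```python
import torch

model = torch.nn.Linear(32, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid FP16 underflow

for step in range(10):
    x = torch.randn(64, 32, device="cuda")
    y = torch.randn(64, 1, device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():    # run eligible ops in FP16
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()      # scale the loss before backward
    scaler.step(opt)                   # unscale gradients, then step
    scaler.update()                    # adapt the scale factor over time
```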
Then there are ways to reduce the overheads of launching kernels that are fairly small: just-in-time compilation, the NVIDIA Apex library, which has some fused operators, and the more recent NVIDIA CUDA Graphs support. We go through those in the tutorial as well; these are mostly just ways of fusing kernels together and getting better GPU utilization. A small example of the JIT idea follows.
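As a small hedged example of the JIT idea, TorchScript can fuse chains of elementwise operations into fewer kernel launches than eager mode would use (the function here is just an illustration, a tanh-style GELU approximation):

```python
import torch

@torch.jit.script
def fused_gelu_ish(x: torch.Tensor) -> torch.Tensor:
    # A chain of elementwise ops that the JIT fuser can combine into
    # fewer kernels than running each op eagerly, one launch at a time.
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * (x + 0.044715 * x * x * x)))

x = torch.randn(1024, 1024, device="cuda")
y = fused_gelu_ish(x)
```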
A
They
can
give
you
good
speedups
and
there
are
other
tricks
as
well
that
I
won't
cover
here,
but
you
have
to
check
out
the
tutorial
tutorial
to
see
them
all
in
the
tutorial.
When
we
put
everything
together
just
on
a
single
device,
we
get
something
like
a
six
times
speed
up,
so
that
can
give
you
a
sense
for
really
sometimes
how
how
useful
it
can
be
to
go
through
this
before
trying
to
distribute
across
many
devices.
But let's say you've done that; now you're ready to do some actual parallel training of models. There are different ways to parallelize the training of neural networks. Data parallelism, on the left, is the most common: you take your data samples and partition or distribute them across GPUs or nodes, you replicate your model so that everybody has the same copy, and you do some synchronizations at the right points in time. A sketch of this in PyTorch follows below.
Data parallelism is the easiest way to speed up training, but nowadays, more and more, we see folks turning to model parallelism. Sometimes it's because you need to; in fact, I think that's the most common case. If you have a model that's just too big to fit in memory on a single device, you essentially have to distribute that model across devices. You can do things like in the middle here, where every layer of the neural network is itself partitioned across devices, or something like on the right, which is called pipeline parallelism.
I should really hurry up now, so I'll mostly skip this, but it talks a little about the most common way of doing this, which is synchronous data-parallel scaling: you're trying to use more and more GPUs to parallelize further, at larger scale. There are different ways to think about it: you can hold your batch size fixed, or you can grow your global batch size as you bring in more and more processors.
But there are different trade-offs here. As you increase the batch size, it can become harder to train models; but if you keep the global batch size fixed, you run out of compute per GPU as you further subdivide it, and you can run into network bottlenecks. That's essentially what's covered there. But more generally, how does this actually speed up training?
If you look at stochastic gradient descent: essentially, you're sampling batches of data from your overall data set, you're computing a gradient, and then you have a step size that says how much you adjust the parameters of the model to get a little bit better. So to converge faster, to get to the answer faster, since we're taking a sequence of steps, we can take fewer, bigger, and/or faster steps.
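Written out, with $\theta_t$ the model parameters at step $t$, $\eta$ the learning rate (step size), $B_t$ the sampled batch, and $L$ the loss, one SGD update is:

$$\theta_{t+1} \;=\; \theta_t \;-\; \eta\,\nabla_\theta\,\frac{1}{|B_t|}\sum_{(x,y)\in B_t} L\big(f_{\theta_t}(x),\,y\big)$$

Data-parallel training effectively enlarges $|B_t|$: each worker computes the gradient on its own shard of the batch and the results are averaged.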
So what we're usually doing in practice with data-parallel training is pushing up to larger batch sizes, which actually let us use larger learning rates, so we're taking larger steps; and larger batch sizes also parallelize better across more processors. That's the way you do it, but there are limitations.
You can't scale to an arbitrary number of GPUs. It's a bit problem-dependent, but it's definitely not a free lunch, and this slide basically says that. There are some rules of thumb for how you can increase learning rates as you increase batch sizes: sometimes you can scale the learning rate linearly with the batch size, or use a square-root rule, which is more motivated by how the gradient noise scales. But really, the situation can be more complex and depends on the problem.
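As a compact statement of those rules of thumb, with a base learning rate $\eta_0$ tuned at batch size $B_0$:

$$\eta(B) = \eta_0\,\frac{B}{B_0} \quad \text{(linear scaling)} \qquad\qquad \eta(B) = \eta_0\,\sqrt{\frac{B}{B_0}} \quad \text{(square-root scaling)}$$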
For a given problem, the situation might look more like what's on the lower right, where the optimal learning rate depends on the batch size according to some empirical relationship like that. I'll skip the other parts here, and I think these slides too, which just dive a little further into the sources of the challenges you face as you go to large batch sizes. Essentially, folks have found that at large batch sizes you tend to be more likely to overfit.
You tend to end up in sharper minima in your loss landscape, and sharp minima are very sensitive to differences between the training and test data. (Hey, excuse me, can you go somewhere else?) I'll skip this one here; there are other tricks to try. One thing to call out is that there are more modern optimizers for training
deep neural networks, things like LAMB. This LAMB optimizer is particularly popular for the most recent state-of-the-art, really large language models, which are the largest models in the world these days.
A
If
we're
talking
about
scale
and
pushing
on
scale
ml,
perf
and
ml
comments,
this
is
one
area
where
a
lot
of
innovation
happens.
So
ml
Commons
is
an
organization
that
publishes
these
ml.
Her
benchmarks
they're
the
basically
the
standard
performance
benchmarks
for
machine
learning
in
Industry
these
days.
A
If
you
look
at
the
latest
results
now,
it's
kind
of
the
point
where
you
can
Train
resnet
50
in
like
12
seconds
and
they're,
pushing
up
to
4
000
accelerators.
We
got
involved
in
ml
Commons
to
help
develop
an
HPC
Benchmark
Suite.
So
here
we
drew
from
scientific
applications.
A
I
list
them
here,
but
I'm
not
I'm,
not
going
to
talk
about
them
in
depth.
But
these
are
interesting
things
you
make
applications
you
may
have
heard
about
before
we've
been
doing
some
releases,
so
ml
per
benchmarks
are
organized
with
these
submission
rounds.
Where
participants
come
from
all
around
the
world
on
their
own
with
their
own
HPC
systems,
they
measure
results
on
their
systems
and
and
things
get
published
during
super
Computing,
I
think
I'll
skip
the
rest.
A
Maybe
one
yeah
one
other
thing
to
say
here
is
that
this
has
been
a
really
valuable
experience
for
us
at
nurse.
At
the
last
submission
round,
which
was
published
at
supercomputing
2021,
we
got
to
use
Pearl
mutter.
We
had
really
nice
competitive
results,
leading
in
some
categories
or
like
close
to
leading
in
in
some
others,
and
it
was
a
really
great
opportunity
for
us
to
understand
the
performance
of
our
systems,
particularly
at
scale
and
ShakeOut
issues,
and
find
problems
that
need
to
be
fixed.
A
Then
I
just
have
a
few
examples
of
other
kind
of
state-of-the-art
large
scale.
Things
which
I'll
go
through
really
quickly
so
Megatron
touring
is,
is
it's
essentially
a
code
base
with
Nvidia
and
Microsoft
a
code
base
that
supports
really
really
large
language
model,
training
and
various
forms
of
parallelism?
There
was
a
bit
of
press
around
this
530
billion
parameter
model,
which
at
least
at
the
time
was
the
largest
I.
A
Don't
know
if
it
still
is,
but
it
was
state
of
the
art
and
in
some
natural
language,
processing
tax
tasks
and-
and
this
is
an
example
of
where
they
combine
all
forms
of
parallelism,
so
eight-way
tensor
parallelism.
That's
each
layer
of
a
model
is
partitioned
across
eight
gpus
on
a
node,
then
there's
that
pipeline
parallelism
across
nodes,
so
different
layers
of
a
model
are
now
across
35
different
nodes
and
then
on
top
of
that
they
also
have
data
parallelism,
replicated
up
to
thousands
of
gpus.
A
So
pretty
impressive
stuff,
and
you
can
read
more
at
those
blogs,
then
some
science
results
from
some
of
of
our
colleagues.
You
may
have
heard
about
these
before,
but
this
one
is
basically
doing
self-supervised
learning
for
Sky
surveys
to
detect
these
gravitational
lensing
events.
Peter
Harrington
is
one
of
the
authors
and
and
some
others
at
the
lab
and
yeah
I.
Think
like
an
important
takeaway
here
was
that
they
could.
They
could
do
pre-training
techniques
that
are
self-supervised
and
then
fine-tune
on
things
that
they
want
and
get
better
results
out.
A
Forecast
net
is
a
work
between
some
folks
here,
as
well
as
Nvidia,
and
maybe
some
others
too.
But
JD
was
our
former
post
operating
a
lot
on
this
and
then
current
post-doc,
Shashank
and
Peter
Harrington
work
a
lot
on
this
as
well.
So
this
is
basically
doing
weather
forecasting
using
some
fancy,
state-of-the-art
Fourier
operator
type
methods
and
basically
giving
really
state-of-the-art
results
in
terms
of
in
terms
of
machine
learning
methods
on
par
with
numerical
methods,
but
much
much
faster.
A
So
then
I
think
I'll
just
conclude,
since
I
I'm
actually
a
little
bit
over
time,
just
say
that
you
know
AI
for
science.
It
requires
super
computer
scale
capabilities,
we're
trying
to
deliver
this,
it's
great
to
see
all
the
growth
and
sophistication
and
maturity
in
science.
We're
excited
to
see
who
comes
next
and
feel
free
to
reach
out
if
you're
looking
for
jobs
or
want
to
collaborate.
That's
all
thanks.