From YouTube: 14 - Deep Learning
Description
Part of the NERSC New User Training on September 28, 2022.
Please see https://www.nersc.gov/users/training/events/new-user-training-sept2022/ for the training day agenda and presentation slides.
All right, yeah, so congrats everyone on making it to the end; this is the last talk. I will try not to go over too long, but today I just wanted to give you a little overview of some of the deep learning for science things happening at NERSC, and then obviously tell you about the deep learning stack, focusing on Perlmutter, because that's our latest exciting machine and, of course, it has lots of GPUs on it. So it's very exciting for machine learning and deep learning workflows, and I'll discuss how to use some of them.
So obviously deep learning is a very exciting and growing field. It can enhance various scientific workflows in interesting ways: it can help you analyze very large, complex data sets, and it can potentially help you accelerate some computationally expensive simulations.
We see a lot of enthusiasm among scientific communities in adopting deep learning for various applications; there's a lot of growth in machine learning and science conferences and workshops, and there's been some significant recognition lately for achievements in AI. So, you know, the 2018 Turing Award, and some Gordon Bell prizes recently, were awarded for achievements in machine learning and deep learning. Obviously HPC centers like us are awarding allocations to do this type of work, and we're optimizing our systems to be good at doing things like machine learning. And in a sort of broader scope, the DOE is investing heavily in AI for science as well: there are a number of different funding calls out there, and there's this popular AI for Science Town Hall series, which produced a very long report.
It's a very exciting field to be working in, and it's obviously sort of unique as well, because there's a lot of interest from the industry side. A lot of research is being driven by industry stakeholders, and that has led to a huge proliferation of different machine learning techniques out there in the scientific machine learning area.
So historically we've seen this pretty significant trend of deep learning just getting bigger and bigger every year, right: the models are solving more and more complex tasks, and they're requiring more parameters. For example, if you look at large language models today, they have hundreds of billions, even trillions, of parameters.
We also see this trend reflected in our user base. We do a survey of our machine learning users every two years, and we see that people are interested in training larger and larger models and tackling more and more complex scientific machine learning tasks as systems like Perlmutter become more available and accessible.
For doing deep learning on an HPC system, you really want to be able to take advantage of the fact that you're running on a supercomputer, right: you want to be able to run parallel training, and there are a couple of different ways of doing that. The most common one is data parallelism. This is where, if you have a batch of data that you're training on, you split it up into smaller batches and send those out to each of the different processors in your job.
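The idea can be sketched in plain Python. This is a toy illustration, not any particular framework's API: each "worker" computes a gradient on its shard of the batch, and the gradients are averaged before the update, which is exactly the role an allreduce plays in real data-parallel training.

```python
# Toy data-parallel step: fit y = w*x by gradient descent on a batch
# that is split across "workers", whose gradients are then averaged
# (the job MPI/NCCL allreduce does in real frameworks).

def gradient(w, shard):
    # d/dw of the mean squared error 0.5*(w*x - y)^2 over one shard
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, n_workers, lr):
    shard_size = len(batch) // n_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(n_workers)]
    grads = [gradient(w, s) for s in shards]   # each "worker" computes locally
    avg_grad = sum(grads) / n_workers          # "allreduce": average gradients
    return w - lr * avg_grad

batch = [(x, 3.0 * x) for x in range(1, 9)]    # the true weight is 3.0
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, n_workers=4, lr=0.01)
print(round(w, 2))  # converges to 3.0
```

The averaged gradient is identical to the gradient over the whole batch, which is why data parallelism leaves the mathematics of the training step unchanged.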
Model parallelism, on the other hand, is a little bit more complicated to set up in practice, so people generally opt for data parallelism. But, depending on your problem, you might want to do some sort of hybrid parallelism technique where you do both data and model parallelism. Probably the most common form of model parallelism out there is called layer pipelining, where you parallelize just across the different layers of your model, so you have the first layer on your first GPU, and so on.
So, like I said, data parallelism is by far the most common strategy for scaling deep learning out, especially if you're scaling across nodes or doing multi-node trainings.
We see the majority of our users opting to use it. The great thing about data parallelism is that the leading frameworks, like TensorFlow and PyTorch, both support data parallelism and pipeline parallelism natively, so you don't really have to do much extra work to get those functional and performant.
If you do want some extra performance, especially in the case of TensorFlow, you can also use Horovod. That's probably the most popular distributed training framework that isn't actually built into TensorFlow or PyTorch, and there are a couple of other ones on this plot as well. Basically, all of these support either MPI or NCCL back ends. MPI is what you would use if you're running on a CPU cluster; obviously, nowadays most people are running on GPU systems, so they're using NVIDIA's NCCL library for communication between GPUs.
One form of data-parallel training, or scaling up, that we see is weak scaling, where you try to converge your training faster by taking fewer training steps, but each of those steps is a bigger step.
The way this works is, if you look at what's happening in the way you train these models, you're using the stochastic gradient descent algorithm: you sample your data, you get an estimate of the gradient with respect to your loss function, and you try to take a step that decreases your loss function, or decreases your error, right. So if you add more GPUs to your job, you can get a larger global batch size, and what that gives you is, hopefully, a less noisy, better estimate of the actual gradient that you care about. So hopefully it's safe to take a larger step; that is, you can use a larger learning rate in the gradient descent algorithm.
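As a concrete sketch of the bookkeeping involved: the "linear scaling rule" with warmup is a common heuristic from the large-batch training literature, not a NERSC-specific recipe, and the numbers below are made up for illustration.

```python
# Linear learning-rate scaling with warmup: a common heuristic for
# large-batch data-parallel training (always tune for your own problem).

def scaled_lr(base_lr, base_batch, global_batch):
    # grow the learning rate in proportion to the global batch size
    return base_lr * global_batch / base_batch

def lr_at_step(step, target_lr, warmup_steps):
    # ramp linearly from ~0 up to target_lr over the warmup period,
    # since jumping straight to a large LR often destabilizes training
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

# e.g. 16 GPUs with a per-GPU batch of 256 -> global batch 4096
target = scaled_lr(base_lr=0.1, base_batch=256, global_batch=4096)
print(target)                       # 1.6
print(lr_at_step(0, target, 100))   # 0.016
```

A decay schedule (cosine, step, etc.) would typically follow the warmup; that part is omitted here.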
In this cartoon example, there's a diagram showing that on single-GPU training you might have to take three steps, three different gradient updates, whereas if you have more GPUs with a bigger batch size, you have a better estimate, so you can just take one big step. That sounds great; in practice, obviously, there are some caveats. This often requires a lot of tuning to get exactly right if you want to converge stably at large scale, and so there are a lot of different considerations and little tricks you can apply, where you change the learning rate throughout the training: maybe warm it up and then scale it up, or slowly decay it. You can use different optimizers; there are these special adaptive optimizers, for example. There are a lot of details there, so if you're curious, I encourage you to go check out our Deep Learning at Scale tutorial.
Now, on to what the deep learning stack looks like on Perlmutter. Our general strategy here is to give you functional and high-performance installations out of the box. We focus on the most popular frameworks, obviously, but we also want you to have enough flexibility to customize things for your particular use cases: maybe install whatever Python packages you need for your domain-specific data analysis steps, or maybe you have a special data pipeline that you need to set up to read your data files. So flexibility is also key.

The top three frameworks we support right now are TensorFlow, Keras, and PyTorch, and Keras is now basically folded into TensorFlow, so you can access all the Keras API calls through tensorflow.keras. To do distributed training with either of these, you can of course use whatever is already built into each of them, or you can use the Horovod library that I mentioned. For example, our TensorFlow installation uses Horovod to help do distributed training; that's there by default. And then the external tools that are really useful for deep learning, which we've heard some great info on already, are Jupyter and Shifter; I'll mention a few more details on those later.
Yeah, like I said, out of the box, I think the easiest way for you to get up and running with deep learning on Perlmutter is to just use the modules that we've already installed. So we have TensorFlow and PyTorch modules. Of course, TensorFlow and PyTorch are just Python libraries, right; the top-level language that everyone loves in machine learning is Python. So these modules are basically conda environments that we've built with optimized installations of the software stack for TensorFlow and PyTorch.
We have a couple of different versions available, so if you need a particular version, you can explicitly load it, or you can just pick up whichever one is the default. We've already heard a lot about how to customize your Python environments; one easy way to do it with these modules is to use the `pip install --user` method, and that works because we've automatically set the Python user-base folder for you. So anything you install on top of the TensorFlow or PyTorch modules won't pollute any of your other Python environments, which is convenient.

Another great option is to do a direct conda clone of any of these. If you do `module display tensorflow` or `module display pytorch`, it'll show you the path of the actual conda environment corresponding to that module. You can clone that into your own personal version and then do whatever you want with it afterwards.
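Put together, those two customization routes look roughly like this. This is a site-specific command fragment, so the package name is just an example and the clone source path is a placeholder for whatever `module display` actually prints; check the NERSC docs for current module names.

```shell
# Route 1: load the provided module and layer your own packages on top.
# pip --user works here because the module presets the Python user base.
module load pytorch
pip install --user einops

# Route 2: clone the module's underlying conda environment to own it fully.
module display pytorch   # note the conda env path it prints
conda create --name mytorch --clone /path/printed/by/module/display
```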
Obviously, I encourage you to come back to this presentation afterwards and visit all the links to our documentation. There's a lot more info there, and some code examples that you can copy for all of these different use cases.
We just heard a great presentation on Shifter on Perlmutter; that's our current solution for supporting containers, and it's great. It's what I use for pretty much all of my deep learning workloads. I think it's pretty easy to use and, as Laurie mentioned, it's very performant, especially at scale. Even our Top500 entry used a container to run.
You can see the currently available images on the system by running `shifterimg images`, and since they're shared across all users, there are actually a lot of PyTorch and TensorFlow containers already there waiting, so you might even be able to just grab one of those and start using it.
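For example, listing and pulling images might look like this. Again a site-specific fragment: the image tag is only an example, and `shifterimg` exists on NERSC systems, not on a generic machine.

```shell
# See which PyTorch images are already on the system
shifterimg images | grep pytorch

# Pull a new image from NVIDIA's NGC registry (tag is just an example)
shifterimg pull docker:nvcr.io/nvidia/pytorch:22.08-py3
```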
I won't go over these in detail, but I will just mention that, as Laurie said, the NVIDIA containers for deep learning on GPUs are by far the best starting point, I think. These are the NGC, or NVIDIA GPU Cloud, containers. They've already been set up as optimized images with PyTorch or TensorFlow and Horovod; they have optimized drivers and CUDA runtimes, and NCCL and cuDNN installations, so literally everything you would need, and there are a lot of different versions available.
We also provide some versions of these that are NERSC specializations of them. Those just have a couple of useful extra Python packages that we see a lot of our deep learning users wanting or using frequently. For example, in our PyTorch one we install the einops library, because that's a pretty popular library for doing tensor manipulations in models.
In these we also have a parallel h5py installation, which is convenient if you have some training that you're doing and then you want to do parallel I/O afterwards.
You can also build your own containers if you want; it's very easy to build on top of NVIDIA's NGC containers. In fact, that's exactly what we do, and we have some examples of how to do that linked from our documentation.
You can also use the `pip install --user` method here if you want; you just have to set the Python user-base path yourself manually. And then finally, Laurie already went over this, so I don't even need to mention it, but the NVIDIA NGC containers use OpenMPI, so you need to do that little extra step where you disable the mpich module for Shifter and use `--mpi=pmi2`.
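In a job script, that extra step might look roughly like the fragment below. The flag spellings here follow my reading of NERSC's Shifter documentation, so verify them there before relying on this.

```shell
# Job-script fragment for running an NGC container under Shifter:
# pick a Shifter module set that doesn't inject MPICH over the
# container's OpenMPI, and launch with the PMI2 interface.
#SBATCH --image=nvcr.io/nvidia/pytorch:22.08-py3

srun --mpi=pmi2 shifter --module=gpu python train.py
```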
Now, some general guidelines if you're doing distributed training, or maybe you have single-GPU code and you want to make it multi-GPU code. If you're working in TensorFlow, we recommend using Horovod to do this, and that's just because, if you're going beyond the single-node scale, say multi-node training on 16 nodes, it's much easier in our opinion to use Horovod than the built-in TensorFlow distribution strategy.
Yeah, it's easy to use with our Slurm scheduler, and it uses MPI and NCCL to coordinate communications and send data between processes. It's also great because it has lots of examples online, so it's pretty easy to just follow along and start working with it quickly.
TensorFlow also has some really good profiling capabilities built in. So if you want to improve the performance of your training code, and look at what part might be slowing you down, there's a really easy way to just import the TensorFlow profiler and use it. For PyTorch, we don't really need anything beyond the library itself: PyTorch has a really good built-in library for distributed training called DistributedDataParallel.
It just wraps whatever model you've already created and makes it really easy to do distributed training. They've spent a lot of effort optimizing this and making examples, so it's a great starting point, and this one doesn't even need MPI: it just uses NCCL for all communications between GPUs.
Just for some extra general tips here: as I said, we recommend using our already-provided modules or containers if you can. That's a very good starting point; it probably limits the amount of setup work that you have to do, and we've already tested these pretty thoroughly for functionality and performance. It also allows us to track who's using what, which helps us set up our support strategy for future systems, so that's nice.
If you want to track your trainings, I recommend using either TensorBoard or Weights & Biases; these are external tools, which I'll talk about in a moment, that help you track what's going on during training. Then, of course, for performance tuning, you can do things like checking the CPU and GPU utilization to see if there are bottlenecks; you can use something like `top` or `nvidia-smi` for that. That will tell you, for example, if your GPU utilization is really low, which may be an indication that your data pipeline is not very efficient. This is often the most common source of bottlenecks we see in our users' training codes: the CPU is trying to get some data off the file system and provide it to the GPU for the training step, so the GPU is just sitting there waiting, which is not very efficient. To speed that up, you can use some of the recommendations that are built into these frameworks; TensorFlow and PyTorch both have a lot of recommendations. You can use multi-threading in your data loader, and you can try to stage data or cache it, so I definitely recommend following their tutorials on optimizing your data loader.
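The reason a multi-threaded loader helps can be shown without any framework at all: while the compute step (a stand-in for the GPU here) works on one batch, a background thread stages the next one, so compute never waits on I/O. This is a toy sketch; real frameworks provide this via their dataset/loader APIs.

```python
# Toy prefetching pipeline: a background thread stages "batches" into
# a bounded queue so the training step rarely blocks on data loading.
import queue
import threading

def loader(num_batches, q):
    for i in range(num_batches):
        q.put([i] * 4)   # pretend to read a batch from disk
    q.put(None)          # sentinel: no more data

def train(num_batches):
    q = queue.Queue(maxsize=2)   # bounded: at most 2 batches staged ahead
    threading.Thread(target=loader, args=(num_batches, q),
                     daemon=True).start()
    total = 0
    while True:
        batch = q.get()          # loading overlaps with the work below
        if batch is None:
            break
        total += sum(batch)      # stand-in for the training step
    return total

print(train(5))  # 0 + 4 + 8 + 12 + 16 = 40
```

The bounded queue is the important detail: it caps memory use while still keeping a couple of batches ready ahead of the consumer.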
If you really want to do a deep dive, you can of course profile your code. You can use NVIDIA's Nsight Systems tool for that, or you can use the built-in TensorFlow profiler; TensorBoard also has a profiler that works with PyTorch if you want, so I recommend those as well. I guess I don't really need to say much about Jupyter, since we already had an excellent presentation on that. I will just point out that we already have our TensorFlow and PyTorch modules installed as kernels.
So if you start up a server and start a notebook on that server, you can just select, say, TensorFlow, and it should work pretty much out of the box; you should be able to import TensorFlow easily, and the same goes for PyTorch. Or you can use your own custom kernel if there are specific libraries that you need. I also touched a little bit on TensorBoard, which is different from TensorFlow: TensorBoard can be used with either TensorFlow or PyTorch, and it's a great tool for visualizing and monitoring your experiments.
As you're doing a model training, you can track the loss over time, and you can add custom metrics, so if there's some specific statistic that you care about, you can see what its value is. We have a little TensorBoard helper in Jupyter: if you have a Jupyter notebook, you can just import it and it'll give you a URL to visit, which should be where all of your data gets displayed in a nice, convenient little dashboard.
Now, beyond that, it's also very important in deep learning to do hyperparameter tuning. Hyperparameter optimization is a key stage of the deep learning process, and obviously it can be sort of embarrassingly parallel if you're just searching over a wide range of parameters, so it's a good fit for systems like Perlmutter where you have lots of resources available. Because there are just so many tools out there for HPO, we don't really ask that you use one in particular.
We generally support whatever people want to use. We don't install these as their own separate things; some of them are already there in the TensorFlow and PyTorch modules that we built, but they're also probably easy to set up yourself if you need some custom solution for HPO.
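The "embarrassingly parallel" point is easy to see in code: every trial is independent, so trials can be farmed out to separate processes (or, on a cluster, separate nodes and GPUs) with no coordination beyond collecting results. The objective below is a made-up stand-in for "train a model and return its validation loss".

```python
# Toy parallel grid search: each hyperparameter combination is an
# independent trial, so a process pool can evaluate them concurrently.
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def trial(params):
    lr, depth = params
    # stand-in objective: pretend the best settings are lr=0.01, depth=4
    loss = (lr - 0.01) ** 2 + (depth - 4) ** 2
    return loss, params

if __name__ == "__main__":
    grid = list(product([0.001, 0.01, 0.1], [2, 4, 8]))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(trial, grid))   # trials run in parallel
    best_loss, best_params = min(results)
    print(best_params)  # (0.01, 4)
```

Dedicated HPO tools add smarter search strategies (Bayesian optimization, early stopping of bad trials), but the parallel structure is the same.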
All right, I am going as fast as I can here; these are the last two slides, so hopefully we're not too much over time. I just wanted to also mention some additional resources, for people who are maybe newcomers, or who have some deep learning familiarity but don't know too much about applying it to actual scientific applications. The Deep Learning for Science school is something we put on a couple of years ago that has a lot of resources.
All of the lectures and demos are available, so I recommend visiting that and looking through it; there are some interesting topics there that definitely go beyond introduction-to-deep-learning-style material. I also mentioned the Deep Learning at Scale tutorial: that will give you a lot of detailed information on how to profile and optimize your code, and then how to start scaling it out across multiple GPUs and multiple nodes, up to maybe thousands of GPUs.
Yeah, so that's all I had for you today. Thanks for your attention, and thanks for your interest in deep learning. I hope you agree that there are a lot of good options for doing machine learning and deep learning on Perlmutter, and of course, file any tickets or reach out for any additional assistance you need. I'll just end with one more plug for the machine learning at NERSC survey, which is what we do every two years.