From YouTube: Scientific Deep Learning on Perlmutter
Description
Part of the Using Perlmutter Training, Jan 5-7, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/using-perlmutter-training-jan2022/
Peter Harrington: My name is Peter Harrington. I'm a machine learning engineer in the Data and Analytics Services group at NERSC, and this is going to be the last talk of today's training. In this one, you'll hear all about our scientific deep learning ecosystem on Perlmutter.
So I'm going to start with some background and a brief overview of different deep learning for science applications and how we see deep learning being used at NERSC. Then my colleague Steve will go into details on the actual deep learning software stack we have on Perlmutter: how to use the frameworks we have, how to get them to be performant, and how to do optimization on your models. Then, finally, I will come back at the end to discuss some additional useful tools.
So deep learning, as you may have heard, is very exciting. It's a subset of machine learning and AI that relies on deep neural networks for its computations. The reason these deep neural networks seem to work well is that they partition a problem into a sequence of steps, usually called layers: you feed an input in, and each successive layer processes the input and the features extracted from it to produce an output. So it can break a complex problem down into simpler, sequential computations.
The diagram on top is a basic, vanilla neural network, but of course, with all the research and exciting applications we've seen in deep learning, there are now all sorts of different neural network architectures designed specifically to process things like text data or image data, and these are being adapted even further in scientific domains.
These computational structures have actually been around for a long time — the first machine learning, deep-learning-type network was invented in the 50s — but only recently have we seen them really start to take off, and this is mainly due to two key factors.
So that's been extremely useful for deep learning. The other great advancement has been the advent and increasing availability of accelerators — GPUs in particular, linear algebra accelerators that can really help with the internal computations that happen inside these neural networks. As GPUs have become more and more advanced and widespread, we've seen an even greater ability of these deep learning models to process complex data and achieve complex tasks.
A
Besides
these
two
main
key
factors,
there's
also
been
a
lot
of
hard
work
by
the
community,
doing
algorithmic
advances
developing
different
optimizers,
different
ways
of
regularizing
or
normalizing
deep
learning
training.
So
all
three
of
these
combined
have
have
contributed
to
this
excellent
performance
of
deep
learning.
It can accelerate expensive computational simulations, and so we're seeing adoption on the rise in all sorts of scientific communities. There's rapid growth in machine learning and science conferences, we're seeing recognition of achievements in AI — like the 2018 Turing Award or the Gordon Bell prizes in 2018 and 2020 — and, importantly, we're seeing HPC centers awarding allocations for AI and optimizing next-generation systems like Perlmutter for AI workloads. This is a sign that the DOE is investing heavily in AI for science as a result of deep learning's success.
So, just to flesh out a little more explicitly some examples of what machine learning in science can look like: we have a whole host of really powerful feature extractors from the deep learning literature nowadays. These come from computer vision, trained on natural image datasets, but we can easily adapt them to something like sky surveys and use those feature extractors to help us process these very large datasets — maybe to help us find rare objects, or to build good classification or regression models.
A
Another
exciting
area
is
something
like
generative
modeling,
where
maybe
you
need
to
synthesize
some
some
high
resolution
or
fine
details
from
a
course
input,
and
this
is
something
that's
very
exciting-
for
applications
in
simulation
heavy
domains.
So
something
like
computational
fluid
dynamics
can
benefit
greatly
from
models
like
that.
Another exciting area is graph neural networks. These are models adapted specifically for graph-structured data — something like a social network, modeling the connections between people — and they can also be used to model the connections between, say, atoms in a molecule or lattice structure. So these graph neural networks, if you adapt them in the right way, can be really useful for something like catalyst research in materials design, but obviously the possibilities are sort of endless here.
And we at NERSC see this great diversity, obviously, because we're at Berkeley Lab, where there's a lot of different, exciting research happening across all sorts of science areas. One particular way we track this is with our machine learning surveys, which will be happening again this year. These surveys track the current use cases of our machine learning stack, and we try to identify areas to improve the user experience and performance, and maybe inform our strategy and anticipate future workloads.
From these surveys we see great diversity: cosmology, chemistry, biology, fusion — all sorts of research areas are applying machine learning. The dominant applications tend to be things like classification or regression problems, but we also see some exciting work with generative modeling, segmentation, and reinforcement learning.
So what actually goes into a deep learning workload? Obviously you need to train your model, and this phase is typically more iterative and interactive — sort of R&D.
You need to actually get your dataset, process it, set it up, stage it, and feed it to your network for training, and this can be very compute- and data-intensive, especially if your problem is a large-scale problem. One common use case we see in HPC is the model selection process, or hyperparameter optimization.
This requires a lot of resources because you need to search over the full model space for your best possible model. There are lots of different little knobs to tweak, and deep learning pretty much always requires some tuning, so this typically involves a lot of parallel training applications running concurrently — a great fit for HPC resources.
And then, finally, once you have a good, tuned model that's all trained, you hopefully want to actually use it for something useful, and that would be using it for inference — things like production analytics — so it tends to be more high-throughput.
This can be either offline analytics or even deployed in real-time data processing for real-time experiments. So yeah, there's a lot of different things you can do with deep learning.
A very common thread in modern deep learning is the need for scale — the ever-increasing need for scale. One obvious reason for that is you want to rapidly prototype your model. As I said, you need to tune things; you need to try a lot of different model configurations to get a good model, so you want a good turnaround time on your training, preferably in the minutes-to-hours range. But we see time and time again that a lot of these big models are taking days or weeks to train.
So it's very important to be aware of what scales you might need in a deep learning workload, and that is usually very dependent on the dataset size that you have and the type of data that you're processing. In scientific applications, we typically have pretty complex datasets — high-dimensional or multivariate.
If you look at trends in industry or traditional machine learning research, we're seeing a pretty clear increase in scale. The plot on the left shows the size of state-of-the-art natural language processing models in terms of the number of parameters — the unit is billions of parameters — and if we go to the most recent one, this Megatron-Turing model from NVIDIA already has over half a trillion parameters. So it's pretty gigantic.
We expect models — maybe not of that full scale, but close to it — are going to be used on Perlmutter, so it will also be interesting to follow up with this year's surveys. In general, as we tackle more and more complex tasks with our models, we typically need to grow the model size — the model capacity — and to do that, we need to scale up the training process in some way, because it's impossible to train a gigantic model on just a single compute node.
The most common way of parallelizing your training process is data parallel training. The way this is done is to just partition your dataset — or your batch of samples — across the different processors in your training job. Each of your processors, your GPUs, will have a copy of the model on it, and it'll be able to run the forward pass and training as normal.
Another option is model parallel training. Instead of partitioning your data across the different processors, you partition the model itself. This is useful if you have one of those gigantic models, like those language models with a ton of parameters: you can put some parameters on one processor and others on other ones. You feed your data into your set of processors as normal and then pass results around as needed.
One common subset of model parallelism is layer pipelining, where you split the sequential layers of your network onto different processors, but that one does take some consideration to make it efficient. In general, model parallel training tends to be less common — it's just a little bit more tricky to get set up — but there are some nice methods out there for getting it working.
So yeah, as I said, data parallel is the most common, and that's what we see our users using the most. Typically people just use the built-in, native parallelism that's set up in TensorFlow or PyTorch, but another leading non-native distributed training framework is Horovod, which we recommend. All of these use either MPI or NCCL — NVIDIA's communication library — for communication, and they all perform quite well.
Of course, there's no free lunch here, so it does take some consideration to actually scale up our training effectively. In data parallel training, what we usually want to achieve is weak scaling: converging faster by increasing the global batch size — increasing the number of samples that we're feeding to our model at each training step. In this way, we can take bigger steps, but take fewer of them. The idea is that with more GPUs, as we grow the batch size, we have more samples that we're looking at, so we have a better estimate of the actual gradient with respect to our loss function. What that allows us to do is take one large step, maybe with a larger learning rate than what we would normally be able to use with a smaller batch size. So with reduced noise, we can take larger steps.
Now, to get this to actually work in practice, there are some caveats. You need to be careful to make it converge stably; you have to tune things, especially the learning rate. You usually have to warm up the learning rate and scale it according to some scaling rule. There's been a lot of research in this area, so there are all sorts of tricks — adaptive optimizers, architectural adjustments, and so on — for actually scaling up training, and I definitely recommend visiting our Deep Learning at Scale tutorial that we gave at Supercomputing for lots of tips.
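To make the warmup-plus-scaling recipe concrete, here is a minimal sketch — not from the talk, and with made-up placeholder values — of the commonly used linear scaling rule combined with a linear warmup:

```python
# Minimal sketch of the linear scaling rule with warmup (illustrative values).
base_lr = 0.1          # learning rate tuned for the base batch size
base_batch = 256       # batch size that base_lr was tuned for
global_batch = 2048    # batch size after scaling out to more GPUs
warmup_epochs = 5

scaled_lr = base_lr * global_batch / base_batch  # linear scaling rule

def lr_at_epoch(epoch: int) -> float:
    """Ramp linearly from base_lr to scaled_lr during warmup, then hold."""
    if epoch < warmup_epochs:
        return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
    return scaled_lr
```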
Steve Farrell: Okay — just let me know if it's not showing up or if you can't hear me. It's good? Yeah, okay.
B
It's
fine,
so
peter
talked
a
bit
about
the
things
like
how
deep
learning
is
going
to
be,
transforming
our
scientific
workloads
sort
of
why
we
think
it's
going
to
be
transforming.
So
we
certainly
see
this
as
an
important
emerging
workload
at
nurse
and
hpc
in
general.
We're mainly going to be talking about the software and tools here today, but of course, the real overall vision has to include the hardware as well as methods. As far as procuring new systems goes, we're not going to talk about how we inform that today — that's a whole NERSC-wide effort — and methods will also be important too.
B
We're
not
gonna
talk
about
sort
of
maybe
how
we
come
up
with
new
methods
for
deploying
on
our
systems
today,
but
we
we
do
also
do
research
in
those
kinds
of
spaces
and
nurse
computers.
We
have
a
very
highly
diverse
user
base
in
terms
of
the
domains
and
the
applications,
and
we
do
know
that
machine
learning
and
deep
learning
can
potentially
transform
many
different
aspects
of
scientific
computational
workflows.
B
So
what
this
means
is,
you
know
again
we're
thinking
about
this
emerging
workload.
We
really
have
to
think
about
even
a
diverse
set
of
things
within
the
machine
learning
and
deep
learning
space,
the
kinds
of
things
that
we
might
need
to
support.
So
how
do
we
do
this
and
again
narrowing
in
more
on
the
software
and
tools
kind
of
thing?
Well,
we
we
try
to
deploy
optimized
software
installations.
We
work
closely
with
vendors.
So
of
course,
today
we're
talking
about
pro
mutter
we're
working
closely
with
hpe
and
nvidia
as
well.
B
We
do
a
bit
of
testing
and
benchmarking
of
our
system,
so
we
do
have
machine
learning
and
deep
learning
specific
tests
in
our
reframe
regression
testing
framework,
some
benchmarking
efforts
which
I'll
touch
on
in
a
little
bit.
We
do
our
best
to
put
out
good
documentation
and
do
training
events
like
this
one
today
to
to
help
educate
folks
on
how
to
use
our
systems.
So the first layer, of course, is the hardware. You've heard enough about Perlmutter over the past few days, I think, so I'm not going to give you the whole specs, but I'll just call out a couple of important things. Perlmutter is our first system at NERSC with GPUs and, not coincidentally, it's our first system that's really good for these new kinds of deep learning workloads. Most of that comes from the specific chip that we have in there — the NVIDIA Ampere A100 GPU — and the fact that we have over 6,000 of these really makes this a great system for deep learning.
So, of course, we've been very excited about it. I didn't introduce myself — sorry, I forgot. I'm Steve Farrell, the other machine learning engineer in the DAS group, same as Peter, so we both work on these sorts of things, on supporting the system at the software level. So yeah, we're very excited about Perlmutter.
It's been very exciting to see the ways that we're already using it, and I'm excited for the ways that everybody will be able to use it in the coming years. As I said, it's a nice system with over 6,000 A100s — in fact, NVIDIA, in some press releases, called it the world's fastest AI supercomputer. Obviously there's maybe a little bit of propaganda to that, but it holds up if you just consider the aggregate compute performance for deep learning of over six thousand A100 GPUs.
Okay, so now a little bit on our strategy for deploying the deep learning software stack. We try to take care to provide functional, performant installations of the most popular frameworks and libraries. We're not going to do optimized builds of every single tool or framework out there, obviously, but we do use things like our NERSC machine learning user survey to inform us on what our user base cares about — Peter mentioned this, but I'll plug it again.
So we try to support the things that are most popular, but we also want to enable flexibility for users to really do their own customization and deploy their own solutions. I think, particularly for this kind of user base and these kinds of workloads, this is important, because there are always new tools coming out every day. In terms of frameworks, what it comes down to today is that we have a deeper level of support for TensorFlow and PyTorch.
B
But
folks
can,
I
think,
pretty
much
deploy
whatever
they
want
and
in
terms
of
distributed,
training,
libraries,
we
support
things
like
uber's
horovod
and
the
native
pytorch
distributed
library,
and
then
things
that
peter
and
I
don't
work
as
much
directly
with,
but
involve
more
of
the
nurse
staff,
useful
services
and
tools
for
for
deep
learning
things
like
jupiter
and
shifter.
B
Okay.
So
how
do
you
use
the
deep
learning
software
stack
that
we
deploy
much
like
with
anaconda
python
or
compilers,
or
anything
like
this?
We
have
modules
that
you
can
simply
load
and
it's
you
know
almost
the
same
as
it
looked
like
on
corey
now
on
perlmutter,
so
you
can
do
muzzleload,
tensorflow
module
load,
pi
torch.
B
One
thing
that
I'll
mention
here
that
may
not
be
obvious
to
everybody
is
that
these
modules
are
actually
complete
python
installations.
You
don't
have
to
do
module
load,
python
and
then
module
load
tensorflow.
You
can
just
do
module
load
tensorflow
and
in
fact
that
also
means
that
you
can't
actually
compose
these
things,
so
you
can't
take
things
from
anaconda
python
and
from
our
tensorflow
and
from
our
pi
torch
modules.
I'd say use it sparingly, but you can use pip with the machine learning environments: if you just want to install one pip package on top of our modules, you can do that with pip install --user. We also set the PYTHONUSERBASE environment variable in those modules, so that directory will be unique for that module, and you'll still have those packages tomorrow when you load the module again. The environments that we install are actually installed with conda, which means you can clone them as environments.
You do have to get the path to where they're installed, which you can check with something like module show, but then you can do something like conda create --clone — or, of course, you can create your own custom environments from scratch. There's more information on these methods in our docs. These methods are a great way to do deep learning on Perlmutter, and we also support containers via Shifter, which is the current container solution on Perlmutter. It's easy to use, and it's also very performant — I don't know if this was mentioned already, but the first Top500 listing for Perlmutter was done with a Shifter container. You can check which images are currently available on Perlmutter with the shifterimg images command, which lists every image available.
If the container you want is not there but it's on Docker Hub, it's very easy to just call the pull command, just like with Docker, and you can run things interactively or in your batch scripts. If you do run in sbatch scripts, you can use the image argument from our Shifter plugin to specify the container at the sbatch level, and that just does some pre-caching of the container.
Is there a question? Or somebody just unmuted — okay. A little bit on best practices for using Shifter here: NVIDIA is really the go-to place for optimized containers for Perlmutter, because they obviously optimize for their own GPUs. They have these NGC containers for PyTorch and TensorFlow, which always have optimized versions, the latest versions of libraries, and many different versions — in fact, they put out a new container version every month — and we try to provide those already on Perlmutter. But we also have our own images.
I should have put the names here, sorry, but you can see them on our docs — something like nersc/pytorch, which is a little more similar to our modules in that we just install a few packages on top for our users. You can also build your own containers and do similar things with customization. One drawback with Shifter on Perlmutter is that you can't write to images, so to add things you either have to build a container, maybe on your laptop, or you can still use this pip install --user trick.
That's a workaround, so if you want to do that, make sure you set PYTHONUSERBASE appropriately — some path where you want your custom packages to go — and then, if you do pip install --user, you'll be able to write there. Maybe it'll be in your home directory, depending on how you set it.
Okay, so some more general guidelines on using the stack at NERSC before I get into the framework-specific things. I may have shown this already, but it'll come up a few times: that's the link to our documentation page for machine learning, and there are subpages for the different tools and frameworks. We do recommend that you use our provided modules or containers if appropriate. Sometimes they'll have features that may not be available if you just do conda install pytorch, or they may have newer versions of libraries, and they may be more performant — but of course you're still free to customize as you like. Here are some more pages on our docs. So, sometimes things are broken:
Perlmutter is still being deployed and issues come up, so refer to the current known issues page, or our machine learning known issues page, if you're having problems. If what's happening to you is not on there and you need additional help, please feel free to open a ticket at help.nersc.gov and we'll help you out.
Okay, I'll go really quickly through these, but we do have a dedicated page for PyTorch with some recommendations. If you're doing distributed training in PyTorch, we recommend using the DistributedDataParallel utility that's in the native PyTorch distributed library, and we recommend using the NCCL backend for optimized GPU communication.
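As a rough sketch of that recommendation — assuming the rank and address environment variables are provided by your launcher (e.g., Slurm plus torchrun or similar), which is an assumption rather than NERSC-specific setup:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group with the NCCL backend for GPU communication.
# RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are assumed to be set by the launcher.
dist.init_process_group(backend="nccl")

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(64, 4).cuda()        # stand-in for your real model
model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced automatically
```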
Did I lose the TensorFlow slide, or did I skip it? I must have skipped the TensorFlow one. Okay, really quick: there's another page for TensorFlow as well. For distributed training with TensorFlow, we recommend using Uber's Horovod library — it's just really easy to use and launch with Slurm, and there are some examples from Horovod here — but TensorFlow also has some native distribution strategies that are also good. In particular, I think there's the mirrored strategy, which would work well for filling up a single node. Let me make sure I did do these slides — okay, good.
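For reference, the standard Horovod-with-Keras pattern looks roughly like this — a minimal sketch of Horovod's documented usage with a placeholder model, not the talk's example code:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, launched e.g. via srun

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(4)])  # placeholder model
# Scale the learning rate by the number of workers and wrap the optimizer.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

# Broadcast initial weights from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
```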
So now I'll switch a little bit to talking about performance.
What I tried to cover there was more just the functionality kinds of things — how to get up and running so you can run your workflow, whatever that may be — but good performance for these workloads is essential.
So performance is important for those kinds of workloads, but also, if we think about production workloads — maybe folks are starting to use AI in their actual science production workloads — it's important to meet the computational constraints there; maybe they're doing some real-time computing with data coming from an experiment. And performance can mean a few different things. For you, developing a new model and trying to train it to solve a problem, it's how well you're utilizing the resources — and maybe, while Perlmutter is free for now, that's less important to you, but eventually you will be charged for your computing allocation on Perlmutter, and then you'll need to care about how you're spending your hours. And for NERSC as a whole, of course, we care about the overall system throughput for all of our users who are trying to deploy things at the same time.
So performance is important, and it's also true regardless of your type of workload — whether you're running on a single GPU or thousands of GPUs, whether you're using a Jupyter notebook or running things in batch scripts. Ideally, the deep learning frameworks would give you everything: maximal flexibility, ease of use, and the best performance out of the box. They've come a long way — they're pretty good — but of course it's not always the case. There can definitely be performance limitations or pitfalls.
So I would strongly encourage everybody: I think it's always useful to spend at least a little bit of time evaluating the performance of your workload, because you may find that you're actually not using the system very well and you could potentially have a lot to gain, especially if there's an easy fix you can make. That can really boost your productivity.
So first, a little bit on us: how do we evaluate system performance? We run various kinds of functionality tests and benchmarks. NVIDIA has things like the NCCL tests, which let us test NCCL's all-reduce bandwidth and things like that, and we run some unoptimized benchmarks — for example, models straight out of PyTorch's torchvision library.
B
The
plots
on
the
right
show
resnet50
scaling
on
promoter
with
just
synthetic
data,
so
no
real,
I
o,
but
still
shows
with
something
a
model.
That's
you
know
not
super
optimized
and
not
a
super
optimized
setup
you
you
can
get
pretty
decent
scalability
up
to
whatever
this
is
like
512
gpus.
B
One
thing
that
I
spend
a
lot
of
time
on
is
ml,
perf
hpc,
which
is
this.
I
co-chaired
this
group
and
we're
doing
deep
learning
benchmarking
for
hpc
science
as
part
of
ml
comments
and
what
we
do
there.
Basically
is.
We
have
benchmarks
and
we
measure
time
to
train
models.
We
also
measure
the
system
throughput,
so
training
many
models
concurrently
and
what
is
the
like
models
per
minute
you
can
achieve.
B
We
do
submission
rounds
so
far
annually.
We
did
one
last
summer,
which
was
actually
the
second
submission
round
and
we
use
real
scientific
or
these
scientifically
motivated
applications.
You
may
have
heard
of
some
of
these
before
deep
cam
is
a
climate
segmentation.
Application
cosmo
flow
is
3d.
Convolutional
regression
open
catalyst
is
a
is
a
newer
one.
We
added
this
year,
which
is
a
graph
neural
network
for
atomic
systems.
This last summer, we actually submitted results using Perlmutter Phase 1, and I was really happy to see that it turned out to be highly competitive. We had some leading results, and some second-place or sub-leading results, for the benchmarks — a pretty good comparison to NVIDIA's Selene system, which is a good thing to compare to. Of course, it's not exactly the same — they have more network cards and other differences — but we're pretty happy with how it all turned out.
Okay, so you're deploying your deep learning workload on Perlmutter — what are some of the things that might be hurting your application performance? It can, of course, come in at multiple levels, whatever you're running. At the single-GPU level, a common thing is that maybe you're spending too much time in Python code, which is inherently single-threaded and interpreted, so you want to make sure as much of the workload as possible is on the GPU or parallelized.
You may also have unusual things in your model architecture that use kernels that are not yet well optimized by NVIDIA — that's another problem, a little bit harder to solve if it affects you. Then, in the distributed world, at multiple GPUs or multiple nodes, network communication can become a bottleneck; in some cases you may be able to just tweak some settings and improve things.
We sometimes see in science workloads irregularly sized scientific data samples — for example, atomic systems of different sizes — and when you have these kinds of things and you're trying to scale up on an HPC system, load imbalance can be a real performance limiter. We've seen that. And then there's the file system.
Besides just what your code is doing for data ingestion, the parallel file system itself can be a bottleneck. The kinds of read patterns in deep learning training workloads — many small random reads — just turn out not to be super friendly to these large parallel file systems like Lustre.
So how can you start to diagnose your performance problems? Well, we do recommend that you start simple — for example, checking GPU utilization — and you can do that with standard tools like the ones folks have already been recommending, things like nvidia-smi, or I think Rollin had some Jupyter tools that track utilization.
There are things like Weights & Biases, which does experiment logging and hyperparameter tuning, and that can do some system monitoring for you as well. If you want to run nvidia-smi on a job you've submitted, it's fairly easy to just ssh onto one of your compute nodes and run it interactively, or you can do something like the snippet I have here, where you run nvidia-smi in your batch script in the background and get the results in a CSV file, as a function of time.
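The slide's snippet isn't reproduced in the transcript, but a sketch of the same idea — polling nvidia-smi in the background and logging to a CSV — could look like this in Python (the query fields are standard nvidia-smi options; the file name is a placeholder):

```python
import subprocess

# Launch nvidia-smi in the background, sampling once per second into a CSV file.
# Start this at the top of your training script, then train as usual.
monitor = subprocess.Popen([
    "nvidia-smi",
    "--query-gpu=timestamp,utilization.gpu,memory.used",
    "--format=csv",
    "-l", "1",                       # repeat every 1 second
], stdout=open("gpu_utilization.csv", "w"))

# ... run your training here ...

monitor.terminate()                  # stop sampling when training is done
```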
Then, if utilization is low, you know there's something going wrong, and you can investigate deeper to figure out what it is. How can you diagnose that? Continuing on: the next stage is to use some kind of profiler. Folks have already been talking about Nsight Systems — this is, of course, a very standard NVIDIA tool. It can do a lot: you can collect the execution data, and it gives you nice visualizations of the execution timeline.
You can annotate things in your model and in your training to see, in the visualization: this is where I was loading data, this is where I called the gradient computation, and so on. So it's definitely a very valuable tool, and it works well for deep learning workloads. There's also Nsight Compute, but I'm not really recommending that to you as a way to diagnose performance problems.
B
It's
probably
only
useful
if,
if
you're
really
an
expert,
but
that
can
give
you
lower
level
kernel
level
information
about
what
your
application
is
doing
and
then
there
are
framework
specific
pro
profilers
that
come
with
like
pytorch
and
tensorflow,
and
these
are
getting
better
all
the
time
they
give
you
a
nice
bit
of
information.
You
can
view
things
in
in
tensorboard
and
they
try
to
give
you
actually
high
level
recommendations.
B
I
mean
you
can
do
that,
because
this
is
domain
specific,
so
tensorflow
has
one
pytorch
has
one
nvidia
also
has
this
dl
prof
profiler
that's
kind
of
similar
in
some
ways,
so
we
you
know,
we
suggest
you
try
those
out.
They
may
be
a
good
place
to
start,
but
sometimes
we
notice
that
the
high
level
recommendations
they
may
not
quite
be
accurate.
B
They
can
be
misleading
sometimes
so
you
may
want
to
kind
of
mix
things
we
may
want
to
go
to
inside
systems
if
you
want
to
really
just
be
able
to
see
what's
going
on.
So
here
is
a
little
view
of
what
insight
systems
looks
like
as
you
visualize,
an
actual
deep
learning
training
application.
This comes from our tutorial — we're going to plug our tutorial from SC several times here, because it has a lot of great material on profiling and optimizing deep learning training, so definitely check that out. The example we use in the tutorial is basically the one we're using today for the hands-on, but the tutorial goes into a lot more depth; we're not doing the profiling and optimization stuff today, but for that example in the tutorial, it just works.
I apologize if that's hard to read, but the idea is they give you this view of the time breakdown of your application — how much time is spent in kernel launching, or loading data, and so on — and at the very bottom right of the PyTorch one is an example of a high-level performance recommendation. It says: you're spending all this time in the data loader, maybe you can tweak x, y, and z. So certainly try those out. Now, some other tips for improving performance.
Most of this is just lifted again from our tutorial. For the data loading stuff, there are some obvious things to tweak. There's the num_workers setting in PyTorch — sorry, this is quite specific, I realize — but for the DataLoader in PyTorch, you can choose how parallel the input file reading is, which turns out to be one of the most important settings for optimizing this. There are things like pin_memory, which is a setting that can help with the host-to-device transfers.
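In code, those two knobs sit on the PyTorch DataLoader constructor; a minimal sketch with a placeholder dataset and illustrative values:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16))  # stand-in for a real dataset

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,     # parallel worker processes reading the input data
    pin_memory=True,   # page-locked host memory speeds host-to-device copies
)
```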
If you can, you can consider staging your datasets onto the nodes. We don't have SSDs on the nodes on Perlmutter, but you can use the per-process memory — the actual RAM of your running processes — or /tmp, which is a RAM disk: it uses RAM, but it looks like a file system, and it's shared for that node, so all the workers on the node can share it. It's up to 126 gigabytes.
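A minimal sketch of that node-local staging idea under Slurm — SLURM_LOCALID is a standard Slurm variable, but the paths and the one-copy-per-node logic here are illustrative assumptions:

```python
import os
import shutil

src = "/pscratch/mydataset.h5"   # hypothetical dataset on the parallel file system
dst = "/tmp/mydataset.h5"        # node-local RAM-disk copy

# Have only one process per node do the copy (local rank 0 under Slurm).
if int(os.environ.get("SLURM_LOCALID", "0")) == 0:
    if not os.path.exists(dst):
        shutil.copy(src, dst)
# In a real job you would barrier here (e.g., dist.barrier()) so the other
# ranks on the node wait until the copy has finished before reading dst.
```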
B
So
if
you
have
a
large
data
set,
you
may
actually
need
to
partition
your
data
set
across
nodes
to
fit.
This
can
have
some
performance,
sorry,
some
convergence
convergence
implications
and
how
that
affects
how
you're
doing
like
global,
shuffling
and
every
epoch
and
things
like
that.
But
in
practice
we
see
that
usually
that's,
not
a
major
issue.
So
partitioning
your
data
set
across
nodes
can
work
in
practice.
There are also ways to optimize the host-to-device transfers, through CUDA streams and some other things. For tuning single-GPU performance, I always want to stress that this is important: it's not a good idea to just jump to "my training code is slow, so I'm just going to run it on hundreds of GPUs and then it's faster."
B
You
can
get
a
lot
out
of
just
looking
at
what's
going
on
at
the
single
gpu
level,
so
try
things
like
using
mixed
precision:
training,
try
it
compiling
your
model,
so
both
frameworks
have
different
ways
of
doing
these,
but
they're
fairly
straightforward
to
put
into
your
code.
Sorry,
I
don't
have
all
the
details
here,
but
we
can
help
you
if
you
need
help.
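As one example of what "fairly straightforward" looks like, here is a minimal sketch of PyTorch's automatic mixed precision (torch.cuda.amp), with a placeholder model and data:

```python
import torch

model = torch.nn.Linear(64, 4).cuda()            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()             # scales loss to avoid fp16 underflow

x = torch.randn(32, 64, device="cuda")
target = torch.randn(32, 4, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                  # run the forward pass in mixed precision
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()                    # backward on the scaled loss
scaler.step(optimizer)
scaler.update()
```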
B
Nvidia
has
some
stuff,
like
this
apex
library
for
pytorch
things
like
fused
optimizers
that
are
are
good
to
use,
fusing
optimizer,
fusing,
kernels
and
optimizers.
Basically
just
helps
you
so
you're,
not
launching
so
many
small
kernels
and
things
can
run
faster
at
the
distributed
kind
of
scenario
to
tune
that
performance.
There's
things
you
have
to
consider
like
this.
This
trade-off
between
efficiency
and
runtime.
It's
always
an
important
thing
to
think
about.
B
As
you
take
a
workload
and
scale
it
up
to
more
gpus,
generally
you're
going
to
be
trading
off
some
notion
of
efficiency
for
runtime,
so
things
may
run
faster,
but
you
may
be
burning
through
your
hours
faster.
So
you
kind
of
have
to
tune
this
for
your
needs
and
then
there
may
be
settings
in
the
communication
backend
actually
in
the
libraries
that
you
can
tweak.
Peter Harrington: Great — well, thanks, Steve. So by now, you should have a pretty good overview of what's available on Perlmutter for deep learning, how to get it working well, and the steps you can take to get it optimized and performant.
So, obviously this was covered very well by the previous talk, but Jupyter is an excellent service for deep learning, especially for interactive work. These notebooks are a very popular service, with hundreds of our users — it's a favorite way for people to develop machine learning code in particular.
I personally like Jupyter for interactive things — debugging, analysis, quick testing if you're developing some custom operation for your architecture, or something like that. It can be very useful for visualizing intermediate results and so on. And our Jupyter environment, thanks to the hard work of NERSC staff, is very flexible and, I think, easy to use, so you can run your workloads on it quite easily — you can use dedicated Perlmutter GPU nodes.
A
We
have
some
pre-installed
deep
learning
software
kernels
available,
which
are
just
based
on
our
module
installations
for
pi
torture
tensorflow,
but
you
can
also
use
your
own
custom
kernel
and
that's
quite
easy
to
get
set
up.
We
have
documentation
on
how
to
do
that.
A
Another
one
that
is
excellent
is
tensorboard.
This
is
probably
the
most
popular
tool
for
kind
of
tracking.
What's
going
on
in
your
deep
learning
training,
so
visualizing
and
monitoring
your
experiments
different
like
metrics
or
results
from
your
model
as
it
trains
both
tensorflow
and
pytorch
communities
are
very
enthusiastic
for
tensorboard.
They
both
support
it
very
nicely.
A
You
can
do
cool
things
like
not
just
visualize
things
like
you
know
your
loss
throughout
training,
but
you
can
also
plot
like
the
distribution
or
histograms
of
your
actual
weight
parameters.
You
can
see
if
maybe
your
weights
in
some
layer
are
collapsing
to
zeros
or
off
to
infinity
or
something
so
it's
easy
to
catch
little
debugging
tips,
and
then
you
can
also
visualize
custom
plots
that
you
make.
If
you
have
a
specific
like
science
metric
that
you're
trying
to
satisfy
you
can
plot
that
throughout
training,
so
overall,
very
helpful.
This is just a little Python module that we've written, available in the custom Jupyter kernels that we've installed: you simply import it, then load it and start TensorBoard, and that will give you a link that you click on in the notebook, which brings you to your TensorBoard server.
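The NERSC helper module's name isn't given here, so as a generic alternative, the standard TensorBoard notebook magics — available in any Jupyter setup with TensorBoard installed — look like this:

```python
# In a Jupyter notebook cell:
%load_ext tensorboard
%tensorboard --logdir ./logs   # point at the directory your training writes logs to
```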
Beyond those two, which are more useful for day-to-day interactive work: if you're doing HPO — hyperparameter optimization — there are a lot of different options out there. Again, this is a pretty critical step in deep learning development, because you pretty much always have to tune your model.
There have been a lot of different libraries and methods developed over the years to do this. It's usually, as we said, computationally expensive — you need to train a lot of models — so it's a good fit for something like Perlmutter, but with these libraries you can also save some time: they have some nice, more advanced optimization algorithms that will help you reduce the number of trials you have to run.
A
For
example,
if
you're
just
running
random
search,
you
might
have
to
run
100,
but
these
tools
contain
you
know,
alternative
methods
that
maybe
can
be
more
efficient
in
selecting
the
next
trials.
So
generally
we
we
can
just
you
know
you
can
use
whatever
you
want
to
and
if
you
run
into
trouble,
you
can
just
come
to
ask
for
any
help.
All
these
example
frameworks
I
have
listed
here
ray
tune
weights
and
biases
sig
opt
and
optuna,
there's
many
more
but
yeah.
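To give a flavor of these libraries, here is a minimal Optuna sketch — the objective below is a toy stand-in for a real training run, not part of the talk's material:

```python
import optuna

def objective(trial):
    # In a real study these suggestions would configure and train your model,
    # returning a validation metric to minimize.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 8)
    return (lr - 1e-3) ** 2 + depth * 0.01   # toy objective

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)        # smarter than pure random search
print(study.best_params)
```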
Beyond those additional tools, I just want to briefly mention some other outreach and resources for you to learn more about deep learning in general. This will be more targeted at people who are maybe new to deep learning, or new to deep learning applied to science. A great resource is our Deep Learning for Science School, which happened in 2019 and 2020.
A
The
2019
was
a
great
event.
It
was
actually
in
person,
so
there's
there's
videos
of
the
presentations
and
slides
and
code
exercises
all
online
in
2020.
We
ended
up
doing
a
webinar
series,
so
lots
of
different
topics
were
covered.
We
actually
got
into
a
fair
amount
of
depth
and
again
sort
of
targeting,
like
actual
scientific
applications
and
aspects
of
deep
learning,
workflows
that
are
relevant
for
science.
A
So
so
this
webinar
series
is
also
a
great
resource
and
all
those
talks
are
available
online,
as
we've
plugged
multiple
times
and
will
plug
again
right
now.
The
deep
learning
at
scale
tutorial
at
sc
is,
I
think,
a
great
resource
if
you're
maybe
familiar
with
ish
with
deep
learning,
but
you
you
aren't
really
familiar
with
how
to
get
it
to
scale
up
on
our
systems.
We've done this tutorial with NVIDIA, and we've done it with Cray in previous years. We've presented it at SC for a lot of the past few years, as well as at some other venues, and it has really good, detailed lectures and hands-on material — examples of doing distributed training, profiling, and optimization. Our SC21 material was all done on Perlmutter, so it's a good basis; we're actually going to use that for today's hands-on exercises. But, as Steve said, please go visit the full SC21 tutorial if you want to learn in much more depth about things like profiling, optimization, or just distributed training. And then, finally, I'll just plug our data seminar series. This is an interesting set of seminars that we host from time to time; it isn't specifically limited to deep learning, but covers a variety of data-centric topics.
As always, our documentation is a great resource. And it looks like we have a little bit of time left, so if you have questions, you should go ahead and put them into the Google Doc now. If you're interested in the hands-on exercises, which will be done in this afternoon's session — I think starting around 12:30 —
A
Then
you
can
stick
around
for
this
last
portion
of
the
talk.
So
in
this
last
portion,
I'm
gonna
just
quickly
go
over
some
background
for
these
hands-on
exercises.
The setup in these simulations is that we have dark matter, which is abundant and essential to structure formation, but we can't see dark matter at all, so we need to model the observables from actual visible matter. That comes from luminous gas, or galaxies, that are also forming in this large-scale structure. This so-called cosmic web forms mostly from dark matter coalescing, but then on smaller scales we have gas dynamics that are affected by hydrodynamic interactions — pressure, temperature, and so on actually affect the flux, or light, that we observe in the universe.

To get this observable field modeled correctly, we need to model both the large-scale structure and the fine detail of the gas dynamics. Unfortunately, the gas dynamics are very expensive to compute, so modeling the full combined system of N-body and hydro fields is very computationally demanding: it requires a complex multiphysics fluid solver that runs on HPC resources, and it can take many, many compute hours to resolve one of these simulations if you're running at high resolution for a long time. A simpler option is to just model the N-body — the dark matter — simulation. With this one you can ignore all the complex hydrodynamics, still capture roughly the large-scale structure, and get a decent estimate. So it's been a long-standing goal to reconstruct the hydro fields from the N-body input.
So that's the science background. What we're going to do here is try to do that reconstruction process with a deep neural network. We're going to use a U-Net architecture. U-Nets are nice models because you have a series of sequential convolutions that extract features and downsample the spatial size of the input sequentially.
A
So
it's
sort
of
extracting
more
and
more
global
features
by
the
time
you
get
to
the
bottom
of
the
unit
and
then
at
this
stage
you
want
to
start
up
sampling
back
to
your
original
spatial
resolution,
so
you
can
actually
make
a
prediction
right
or
or
make
a
reconstruction
of
the
hydro
field
that
we
are
interested
in
and
to
do
that,
we
just
have
another
series
of
convolutions
with
upsampling
in
them.
The
key
thing
in
a
unit.
A
That's
helpful
for
getting
these
high
resolution
features
resolved
is
the
skip
connections
which
basically
just
copy
the
extracted
features
at
each
spatial
scale
across
the
network
to
the
other
side,
so
that
the
information,
the
sort
of
high
frequency
or
high
resolution
information
isn't
lost.
When
you
go
down
to
this,
the
bottom
of
the
unit.
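To make the skip-connection idea concrete, here is a toy PyTorch sketch of a single U-Net level — 2D and tiny for brevity, whereas the actual hands-on model is larger and 3D:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One down/up level with a skip connection, just to show the pattern."""
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)  # downsample
        self.up = nn.ConvTranspose2d(8, 8, kernel_size=2, stride=2)      # upsample
        self.out = nn.Conv2d(8 + 1, 1, kernel_size=3, padding=1)         # after concat

    def forward(self, x):
        features = torch.relu(self.down(x))
        upsampled = self.up(features)
        skip = torch.cat([upsampled, x], dim=1)  # skip connection: reuse input-scale features
        return self.out(skip)

y = TinyUNet()(torch.randn(1, 1, 32, 32))  # -> shape (1, 1, 32, 32)
```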
With these big 3D simulations, the data volume is kind of a challenge. There are four input fields and four output fields for this example, and the spatial grid is very large, at least by deep learning standards — 1024³ or 2048³ is pretty big — and it's hard to fit that, plus a model, plus the optimizer utilities and so on for training, all on a single GPU.
A
So
what
we
do
is
train
with
smaller
crops
or
sub
volumes,
and
the
way
that
looks
like
is
is
pretty
simple:
we
just
select
a
crop
out
of
the
simulation
for
our
input
and
feed
it
to
our
network
and
compare
it
and
get
our
prediction,
and
then
we
compare
that
to
the
corresponding
crop
from
our
target
simulation
and
a
special
sort
of
addition.
We
have
here
for
this.
Workflow
is
using
some
extra
data
augmentations
by
just
in
addition
to
randomly
cropping
a
sample.
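A minimal sketch of that random-crop-plus-augmentation idea — the shapes, crop size, and flip axis are illustrative assumptions, not the values from the hands-on code:

```python
import numpy as np

def random_crop_pair(inp, tgt, size=64):
    """Take the same random sub-volume from the input and target fields.

    inp, tgt: arrays of shape (channels, X, Y, Z).
    """
    _, X, Y, Z = inp.shape
    x0, y0, z0 = (np.random.randint(0, d - size + 1) for d in (X, Y, Z))
    sl = (slice(None), slice(x0, x0 + size),
          slice(y0, y0 + size), slice(z0, z0 + size))
    inp_c, tgt_c = inp[sl], tgt[sl]
    if np.random.rand() < 0.5:                 # random flip augmentation
        inp_c = np.flip(inp_c, axis=1).copy()  # flip input and target identically
        tgt_c = np.flip(tgt_c, axis=1).copy()
    return inp_c, tgt_c
```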
The code today is all PyTorch-based. PyTorch is Pythonic — it's pretty easy to integrate with other Python code — and, as we said, it's generally performant out of the box with these optimized libraries from NVIDIA, and it has good support for distributed training. Today we're going to be using GPUs, so we're going to opt for the NCCL backend for communication during distributed training.
This link right here is the link to all of our example code. I'm not going to go through the README, because it has everything you need to know in it, and throughout the README we've also pointed links back to our original SC21 tutorial material if you want more details. For accessing Perlmutter for this hands-on, we're going to recommend that you use the JupyterHub, and again, there are instructions for that in the README.