SC20 Deep Learning at Scale Tutorial
https://github.com/NERSC/sc20-dl-tutorial/
We are not going to cover the basics of deep learning in depth today; however, we will provide some links and resources for introductory material that you can follow up on if that's your interest. We've done this tutorial a few times now, including two previous years at Supercomputing, and in past years we were able to provide hands-on examples with training accounts on our Cori supercomputer.

But due to the constraints of this year, we decided we're not going to be able to do that. We will still provide code examples, as well as some live demos, which you can then take and use on your own systems. So we're going to have examples and demos for doing performance analysis and profiling, for how to do distributed training, and for how to do hyperparameter tuning with the Cray HPO tool.

After that, the NVIDIA guys are going to talk for a little bit about performance optimization and profiling, and then switch to a live demo to demonstrate the use of some profiling tools. Then we'll have a break, and after the break, Mustafa will tell us about best practices for scaling deep learning. After that, we're going to have another live demo demonstrating how to do distributed training with the example code, and then we'll finish off the tutorial with Mike's talk about hyperparameter optimization, with some examples using the Cray HPO tool.

So, as I said, we're not going to go in depth into the basics of deep learning today, but here are a few links which you may find useful. The first one is a link to our tutorial material from last year at SC19. There we had a talk and some examples more tailored to introducing deep learning, and we also had hands-on examples, specifically in Keras and Horovod, whereas today we're using PyTorch.

If you prefer, or wish to additionally learn about Keras and Horovod, you can go check out that material. At Berkeley Lab, for the past couple of years, we've been running a Deep Learning for Science school. Last year, in 2019, we had an in-person, week-long event with quite a few lectures introducing machine learning and deep learning; you can watch the videos online on YouTube by following this link. This year, in 2020, we had a webinar series instead. There was one talk which was an introduction to PyTorch, which you may find useful, plus quite a few additional advanced topics. So again, I encourage you to check these out if you're interested. And beyond this, of course, there are many other great resources for learning deep learning online, many of which are completely free.

We are certainly in the middle of what you might call an AI revolution, and a lot of that is because of the rise of deep learning. AI and deep learning are transforming big tech companies from the ground up as they work them into every part of their business, but deep learning is also working its way into many recent technologies, many of which we interact with on a day-to-day basis.

For example, the way our phones understand what we say, how we search the internet, and, increasingly, the way we get around, with things like self-driving cars. Deep learning applications are also starting to show promise in healthcare and, of course, many other areas, including some fun stuff like art and games.

It's been referred to as Software 2.0, where, instead of writing programs by hand in some language like C++, we define a complex function and then fit that function to data, to learn some functional mapping of inputs to outputs.

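To make that idea concrete, here is a minimal sketch in PyTorch (the framework this tutorial uses). The toy sine-fitting task and the small network are purely illustrative stand-ins, not part of the tutorial code:

```python
import torch
from torch import nn

# Toy data: samples of the mapping we want to learn, y = sin(x)
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(x)

# "Software 2.0": define a flexible parameterized function (a small network)...
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# ...and fit it to the data instead of hand-coding the input-to-output mapping
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```
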
Some of the emerging, promising application areas of deep learning for science include analysis of large scientific datasets. As more telescopes and particle accelerators come online or get upgraded, the amount of science data out there grows and the datasets get more complex, and deep learning has shown promise at automatically analyzing large data volumes without hand-written analysis pipelines and without manual human analysis or sifting through the data.

Every day there are more and more papers being published on science applications of deep learning. There's a growing presence at conferences, including the machine learning conferences such as NeurIPS, where paper submissions and attendance are basically soaring, but also at the domain science conferences, where we see more and more submissions and dedicated tracks for machine learning and deep learning.

Last year we had an AI for Science town hall series at the national labs. There were over a thousand attendees across four of these meetings, and that culminated in a 300-page report on AI for science, which you can now go and read, and which talks about all the grand challenges and capabilities.

Now, I can't talk about everything that's going on in deep learning for science, but I'm going to briefly highlight a handful of examples, some of which were done at NERSC or with our collaborators and colleagues at NERSC. These show some of the interesting things you can do with deep learning for science, such as CosmoGAN, which demonstrates the use of generative adversarial networks for, essentially, learning to replace cosmology simulations.

These two demonstrate super-resolution methods. The theme here is that you might be able to run simulators at a coarse-grained, lower resolution and then use deep learning to enhance the results, making them equivalent to those of higher-fidelity simulators.

The exascale deep learning for climate analytics work, on which Thorsten and Josh were co-authors, shared the Gordon Bell Prize in 2018 and ran at scale on Summit.

The Exa.TrkX project is one that I'm involved in, where we look at graph neural networks for particle tracking. It's a nice example showing that in science we often have data that doesn't fit into images or sequences; maybe it has a more irregular or geometric structure, so we can utilize methods like graph neural networks to tackle those problems. The last one, Etalumis, was shown at SC19 last year.

So, as we start to tackle more complex tasks, in science or elsewhere, with bigger deep learning models and bigger datasets, the amount of compute we need to train those models grows. In fact, over time, the amount of compute needed to train deep learning models seems to be growing exponentially, at least for some of the more popular cases shown on this OpenAI plot.

One example system is one that's coming online soon at NERSC: Perlmutter, our next-generation system optimized for science. It is one of the early Cray Shasta systems. It's going to have a few times the capability of Cori, our current system, and we're working hard to make sure it has an optimized hardware and software stack for deep learning.

Perlmutter is going to arrive in two phases. The first, late this year, will have a lot of GPUs, specifically NVIDIA A100 (Ampere) GPUs, along with a single-tier, all-flash storage system. Perlmutter will also have a Cray Slingshot high-performance network, which comes with phase two in mid-2021.

We are very excited about this system and its capabilities for deep learning. So now we have our deep learning methods, we have our interesting science problems, and we even have HPC systems to throw at them. But how do we make sure that we make effective use of these HPC systems for our deep learning problems? This slide details the roadmap that reflects how we're structuring the tutorial and the things we're going to go through today.

First of all, we're going to assume that you start today with a model which is appropriate for solving your science problem and which trains, let's say, on a single CPU or GPU, so you have something which can at least learn on your dataset to solve the problem.

Now we want to try to scale up and use a large system for this. But before we throw hundreds or thousands of GPUs at a problem, it's of course always important to first think about how effectively you're using a single computational unit, i.e., a single node or a single GPU. So we're first going to talk about how you optimize performance on a single GPU: using profiling tools, tuning and optimizing the data pipeline, which can often be a bottleneck, and making effective use of the hardware with things like reduced or mixed precision.

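As a rough illustration of those last two points (not the tutorial's actual example code), here is a minimal PyTorch sketch; the dataset and model are placeholder stand-ins, and a CUDA-capable GPU is assumed:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset and model, just for illustration
dataset = TensorDataset(torch.randn(10000, 3, 64, 64),
                        torch.randint(0, 10, (10000,)))
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 64 * 64, 10)).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Data pipeline tuning: parallel worker processes and pinned host memory
# help keep the GPU fed with batches
loader = DataLoader(dataset, batch_size=128, shuffle=True,
                    num_workers=4, pin_memory=True)

# Automatic mixed precision: run eligible ops in FP16 and scale the loss
scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    x = x.cuda(non_blocking=True)  # async copy from pinned memory
    y = y.cuda(non_blocking=True)
    opt.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
```
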
Then, for hyperparameter tuning, we're going to show how you can do that with tools like Cray HPO, which can work seamlessly with the schedulers on HPC systems and which has sophisticated search algorithms for the tuning.

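We won't reproduce the Cray HPO API here, but as a generic illustration of what a hyperparameter search driver does, here is a minimal random-search sketch in plain Python; `train_and_evaluate` is a hypothetical stand-in for launching a real training run (e.g., as a batch job) and returning a validation metric:

```python
import random

def train_and_evaluate(lr, batch_size):
    # Hypothetical stand-in: in practice this would run a full training job
    # with the given hyperparameters and return its validation loss.
    return (lr - 1e-3) ** 2 + 0.001 * batch_size  # synthetic score

# Each entry samples one hyperparameter from its search space
search_space = {
    "lr": lambda: 10 ** random.uniform(-5, -1),        # log-uniform
    "batch_size": lambda: random.choice([32, 64, 128, 256]),
}

best = None
for trial in range(20):
    params = {name: sample() for name, sample in search_space.items()}
    score = train_and_evaluate(**params)
    if best is None or score < best[0]:
        best = (score, params)

print("best hyperparameters:", best)
```

A real HPO tool adds smarter search strategies (e.g., genetic or Bayesian algorithms) and handles launching the trials through the system scheduler.
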
Once again, this is just showing the agenda really quickly, but that's the end of the introduction, so I'll say thank you for listening, and I hope you enjoy the rest of the tutorial.