SC20 Deep Learning at Scale Tutorial
https://github.com/NERSC/sc20-dl-tutorial/

So first, what are hyperparameters, and why do we care about them? When you think about machine learning models, there are a couple of types of parameters to consider. First, you have your model parameters. These are the internal values of the model, like the weights in a neural network, that you learn by training your model against a dataset. The other type of parameter that we're interested in and concerned with here is the external values, the hyperparameters of the model, which determine the capacity of the model to learn and how the model learns.

So why do we care so much about hyperparameters? Well, it turns out that finding a good set of hyperparameters for your deep learning model can have a very big impact on everything from your final accuracy to the time to converge, and it can also help prevent two significant problems with machine learning models: overfitting and underfitting.

A
A
So,
given
this,
how
are
hyperparameters
typically
optimized,
so
one
method
is
the
manual
or
by
hand
method.
This
is
often
guided
by
intuition
and
various
rules
of
thumbs
where
hyperparameters,
essentially
selected
and
tuned
manually,
based
on
the
the
knowledge
and
the
intuition
of
the
of
the
machine,
learning
engineer
or
data
scientist
working
on
training
the
model.
Another technique that's starting to become more popular, with more tools appearing around it, is automatic hyperparameter optimization. Clearly, a brute-force search of your entire search space is intractable: in a typical model there are too many hyperparameters and too many possible values of those hyperparameters. So instead, these automatic hyperparameter optimization techniques focus on evaluating a subspace of possible hyperparameters.

But first, why are we talking about hyperparameter optimization at Supercomputing? Well, it turns out that, much as in the previous portions of this tutorial where we talked about why HPC systems are great for training deep neural networks, they're also great for hyperparameter optimization, because you're dealing here with a very large search space of many different types of hyperparameters.

There are many different varieties of hyperparameters as well, from integer to categorical to continuous, and evaluating these hyperparameters can be expensive. To evaluate a specific point within our search space of hyperparameters means training a neural network with those hyperparameters and evaluating the accuracy, or the time to convergence, or whatever your fitness metric of choice is for the optimization problem.

In addition, if you're going to be training at large scale on an HPC system, it's important to understand that the ideal hyperparameters also vary with scale. You don't necessarily want to do hyperparameter tuning at small scale and then train at large scale, because some hyperparameters, in particular the learning rate, can vary with your effective batch size, which also typically varies with the scale you're training at. In addition, your cost function is not necessarily smooth or continuous in many regions. This is a very complex search problem; it requires multiple…

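To make the batch-size dependence concrete, one widely used heuristic, the linear scaling rule, grows the learning rate in proportion to the effective batch size. This rule is an illustrative example rather than something from the tutorial itself, and the base values below are made up:

```python
# Minimal sketch of the linear learning-rate scaling heuristic: learning rate
# grows in proportion to the effective (global) batch size. Illustrative only.

def scaled_lr(base_lr, base_batch_size, per_gpu_batch_size, num_gpus):
    """Scale the learning rate linearly with the effective batch size."""
    effective_batch_size = per_gpu_batch_size * num_gpus
    return base_lr * effective_batch_size / base_batch_size

# Tuned at small scale: lr=0.1 with a global batch of 256.
# Moving to 64 GPUs at 32 samples each (effective batch 2048) suggests lr=0.8.
print(scaled_lr(base_lr=0.1, base_batch_size=256, per_gpu_batch_size=32, num_gpus=64))
```
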
So what are some of the techniques used for automated hyperparameter optimization? Probably the simplest, baseline technique for automated hyperparameter optimization is grid search. Here, essentially, you can think of each of the hyperparameters you're interested in tuning as an axis of some n-dimensional grid; you pick some step size along each axis, and you evaluate all the points at those individual grid points.

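A minimal sketch of that idea, with a stand-in scoring function in place of actual model training:

```python
# Grid search: each hyperparameter is one axis of an n-dimensional grid, and
# every grid point is an independent (parallelizable) evaluation.
# train_and_score is a placeholder for training a model and returning a fitness.
import itertools

grid = {
    "lr": [1e-4, 1e-3, 1e-2, 1e-1],           # step size along each axis
    "dropout": [0.0, 0.25, 0.5],
    "optimizer": ["sgd", "adam"],              # categorical axes work too
}

def train_and_score(lr, dropout, optimizer):
    return -(lr - 1e-2) ** 2 - (dropout - 0.25) ** 2  # placeholder fitness

best = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=lambda point: train_and_score(**point),
)
print(best)  # 4 * 3 * 2 = 24 evaluations; the count multiplies with each new axis
```
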
This is simple to understand and clearly easy to parallelize, because it's a bunch of independent evaluations and you know exactly which evaluations you're going to do. The disadvantage, though, of course, is the curse of dimensionality: as you start adding lots of hyperparameters, this quickly explodes in complexity, so it can be very computationally expensive. Nevertheless, this is a good baseline automated HPO technique that we can use to compare other automated HPO techniques against.

Another strategy that's often discussed and sometimes used is random search. With random search, you're just picking points at random. Once again, this is easily parallelizable: you can have multiple nodes going off and running different evaluations, each just randomly picking values of the hyperparameters.

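And a matching sketch of random search over the same kind of space, again with a stand-in scoring function:

```python
# Random search: each trial draws every hyperparameter at random, so trials
# are independent and trivially parallel across nodes.
import random

def sample_point(rng):
    return {
        "lr": 10 ** rng.uniform(-4, -1),           # continuous, log-uniform
        "dropout": rng.uniform(0.0, 0.5),          # continuous
        "optimizer": rng.choice(["sgd", "adam"]),  # categorical
    }

def train_and_score(point):
    return -(point["lr"] - 1e-2) ** 2 - (point["dropout"] - 0.25) ** 2  # placeholder

rng = random.Random(0)
trials = [sample_point(rng) for _ in range(24)]    # same budget as the grid above
best = max(trials, key=train_and_score)
print(best)
```
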
It turns out it's more efficient than grid search, and I'll get into a little more detail about why that is on the next slide. However, it still tends to be more computationally expensive than some of the more advanced techniques, because it's not really using any intelligence to narrow down the search space.

A
It's
just
randomly
picking
points
I
mentioned
on
the
previous
slide
that
random
search
was
more
efficient
than
grid
search
as
far
as
sampling
of
hyperparameters.
So
why
is
that?
Well,
if
you
look
at
the
simple
example
in
this
diagram
here,
where
we
have
one
hyperparameter
say
that's
more
important
than
another
hyperparameter.
There's also an interesting probabilistic result about this, which basically says that if you have any distribution with a finite maximum, and you take 60 random observations from it, there's a 95 percent chance that at least one of those will be within a five percent window, a five percent range, around the actual point that has the maximum value. So if your fitness metric is somewhat smooth, typically, this means there's a 95 percent chance that you'll find at least one fairly good set of hyperparameters.

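The arithmetic behind that claim: if a single random draw lands in the top five percent of the space with probability 0.05, then all n independent draws miss it with probability 0.95^n, and for n = 60 the hit probability already exceeds 95 percent:

```python
# Checking the "60 random samples" claim: the chance that at least one of n
# independent draws lands in a region of probability mass 0.05 is 1 - 0.95**n.
p_hit = 1 - 0.95 ** 60
print(f"{p_hit:.3f}")  # 0.954, i.e. just over 95 percent for n = 60
```
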
So the idea here, taking genetic algorithms and applying them to hyperparameter optimization, is that we treat each hyperparameter as essentially a gene, and a distinct set of hyperparameters that you may evaluate to train a model is considered an individual organism within a population.

The nice thing about genetic algorithms is that they're a great way to combine a couple of priorities that you have when you're doing an optimization problem: exploitation, which is narrowing in on promising areas of your search space, and exploration, which addresses the possibility that you may be stuck in a local maximum.

So you also want to go off and explore other regions of the search space as well, and genetic algorithms are nice in that they have dials you can tune to play with that trade-off between exploration and exploitation. For example, there's a mutation rate: when you combine two parents, there's a probability that each of the genes will mutate, which, of course, encourages further exploration of the search space.

In addition, you often have separate populations: you may have different founders for different populations, and those different populations will go off and search perhaps different areas of your search space. Then, periodically, you'll take good individuals from each of these populations and combine them, getting even further exploration by combining individuals which may have optimized very different areas of the search space. So here is the basic cycle within the genetic search.

For each population, you start with a founder and apply a number of mutations to get your initial population; then you repeat the following process for however many generations you want to search. You start by evaluating each of the members of the population and selecting the ones that are the most fit; from some set of those most fit members you apply reproduction to create new children; those children become your next generation; and you repeat the process.

A
Next,
we
evaluate
the
fitness
of
the
individuals
in
that
population
and
we
select
some
of
the
most
fit
individuals
in
this
case.
The
green
points
we
then
choose
pairs
of
those
particularly
fit
individuals
apply
reproduction,
which
is
crossover
so
combining
of
hyper
parameters
from
the
different
individual
from
the
different
individuals,
as
well
as
potentially
mutation,
depending
on
the
mutation
rate.
We select another set of parents and create another child, then another set of parents and another child, and repeat until we have produced our next generation. We then repeat the process, evaluating the fitness of that generation, and keep going, reproducing, evaluating fitness, selecting, and so on, until we hopefully end up with a population of individuals that all have a high fitness level.

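Putting the pieces of that cycle together, here is a minimal sketch of genetic hyperparameter search, assuming a toy two-hyperparameter space and a stand-in fitness function in place of real model training:

```python
# Genetic search cycle: evaluate fitness, select the most fit, then reproduce
# via crossover plus occasional mutation. All functions are illustrative stand-ins.
import random

rng = random.Random(0)
GENES = {"lr": (1e-4, 1e-1), "dropout": (0.0, 0.5)}  # hyperparameter ranges
MUTATION_RATE = 0.1

def random_individual():
    return {g: rng.uniform(lo, hi) for g, (lo, hi) in GENES.items()}

def fitness(ind):  # stand-in for training a model with these hyperparameters
    return -(ind["lr"] - 1e-2) ** 2 - (ind["dropout"] - 0.25) ** 2

def reproduce(a, b):
    child = {g: rng.choice([a[g], b[g]]) for g in GENES}  # crossover
    for g, (lo, hi) in GENES.items():
        if rng.random() < MUTATION_RATE:                  # mutation
            child[g] = rng.uniform(lo, hi)
    return child

population = [random_individual() for _ in range(16)]     # founder mutants
for generation in range(10):
    fit = sorted(population, key=fitness, reverse=True)[:4]            # selection
    population = fit + [reproduce(*rng.sample(fit, 2)) for _ in range(12)]

print(max(population, key=fitness))
```
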
The basic idea behind this is that you're typically applying genetic search within an epoch, taking checkpoints, then selecting the best values determined for each epoch, restoring the checkpoint from that, and reapplying genetic search to calculate the best hyperparameters for the following epoch. We can see an example of this here, where we had original values for the learning rate and for weight decay, and then, using population based training, we learned a new schedule over time.

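As a rough illustration of that per-epoch cycle, here is a compact population-based-training sketch; the training function, weights, and perturbation factors are all made-up stand-ins for real training and checkpointing code:

```python
# Per-epoch population based training: train each member for one epoch,
# checkpoint, copy the best member's weights with perturbed hyperparameters
# into the rest of the population, and repeat.
import copy
import random

rng = random.Random(0)

def train_one_epoch(weights, hp):
    """Stand-in for real training: returns (new_weights, fitness_score)."""
    new_weights = {k: v + hp["lr"] for k, v in weights.items()}
    return new_weights, -abs(hp["lr"] - 0.01)   # fake fitness, peaks at lr=0.01

population = [({"w": 0.0}, {"lr": 10 ** rng.uniform(-4, -1)}) for _ in range(8)]

for epoch in range(5):
    results = []
    for weights, hp in population:
        new_weights, score = train_one_epoch(weights, hp)  # checkpoint = new_weights
        results.append((score, new_weights, hp))
    results.sort(key=lambda t: t[0], reverse=True)
    _, best_weights, best_hp = results[0]
    population = [
        (copy.deepcopy(best_weights),                      # restore best checkpoint
         {"lr": best_hp["lr"] * rng.uniform(0.8, 1.25)})   # perturb: new schedule
        for _ in range(8)
    ]

print(results[0][0], results[0][2])
```
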
So, as I mentioned, there are a number of tools out there for doing automatic hyperparameter optimization. I've just highlighted three here that I'm most familiar with. There is, of course, the HPE Cray AI hyperparameter optimization library, developed by HPE Cray. This is currently available on the Cori system at NERSC and on many other HPE Cray systems.

A
It
has
a
python
front-end
python,
apis
to
interface
with
the
library
and
it
runs
on
the
back
end
using
chapel.
This
is
all
compiled
down,
though
you
have
no
need
to
download
chapel,
runtimes
or
anything
to
run
with
us.
We
just
utilize
chapel
for
its
high
performance,
distributed,
computation
and
and
ease
of
programming.
A
This
is
designed
really
for
doing
hyper
parameter,
optimization
on
high
performance
computing
systems,
as
you
would
expect
from
something
from
cray
there's.
Also
the
dpiper
library
developed
by
argon
and
available
on
their
theta
system,
also
built
for
hpe
systems
and
the
ray
tune.
Hbo
library
built
on
top
of
the
ray
framework
and
I've
got
links
here
for
for
both
of
those.
This library supports both distributed optimization, as in running many different trials of different hyperparameters in parallel on many different nodes of your system, and distributed optimization with distributed training, so you can not only run many evaluations but also have each of those evaluations run at scale. It's fairly simple to use, and the steps are common to most of the automated HPO libraries out there. You create a wrapper script, typically in Python, that imports your hyperparameter optimization library; you define the optimizer you want to use and any parameters to it; and you then define the hyperparameters that you're interested in tuning. Here we see the learning rate and the dropout rate, with default values and ranges to explore for each. You also create a training script, and that's what's referred to here by the evaluator. This is typically just the deep learning training script you would have used otherwise, with a couple of minor modifications, which I'll show on a later slide.

If you look here at the model training script itself, these are the ways you need to change it in order to interface with the wrapper script and the HPO library. You need to make sure that the hyperparameters you're interested in tuning are exposed to the hyperparameter optimization library in the wrapper script, which is done by adding them as command line arguments. Of course, make sure you actually use those hyperparameter command line arguments in your training. Then you just need to add something that prints out a figure of merit. This is the function you're trying to optimize, so it could be time to convergence, it could be accuracy, it could be whatever you want.

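A minimal sketch of those two modifications, with a stub in place of your real training loop; the "FoM:" output format is just an illustration of printing a figure of merit in a form the wrapper can parse:

```python
# Two changes to the training script: (1) expose the tunable hyperparameters
# as command line arguments, (2) print a figure of merit for the HPO library.
import argparse

def run_training(lr, dropout):
    """Stand-in for your existing training loop; returns validation loss."""
    return (lr - 0.01) ** 2 + dropout * 0.1

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--dropout", type=float, default=0.5)
args = parser.parse_args()

val_loss = run_training(lr=args.lr, dropout=args.dropout)
print(f"FoM: {val_loss}")  # the optimizer reads this value from stdout
```
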
A
And
then,
as
I
mentioned
before,
you
also
have
to
create
a
wrapper
script
and
we
see
in
here
we
have
our
parameters.
We
have
our
genetic
optimizer.
We
have
the
various
parameters
to
that
genetic
optimizer.
These
are
sometimes
referred
to
as
your
meta
parameters
to
distinguish
between
them
from
the
model
parameters
and
the
hyper
parameters,
and
then
once
you've
created
your
evaluator,
your
parameters
and
your
optimizer,
you
just
run
optimizer.optimize.
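For illustration, a sketch of what such a wrapper script might look like. The flow (parameters, evaluator, genetic optimizer, then optimizer.optimize) follows the slide, but the exact module names and constructor arguments below are assumptions rather than verified crayai signatures; consult the library documentation:

```python
# Sketch of the wrapper-script flow described above. Module path and argument
# names are assumed, not verified against the crayai API.
from crayai import hpo  # assumed import path

# Hyperparameters: name, default value, and the range to explore.
params = hpo.Params([["--lr",      0.01, (1e-4, 1.0)],
                     ["--dropout", 0.5,  (0.0, 0.9)]])

# The evaluator wraps your (lightly modified) training script.
evaluator = hpo.Evaluator("python train.py")

# Meta-parameters: settings of the genetic optimizer itself, distinct from
# the model parameters and the hyperparameters being tuned.
optimizer = hpo.GeneticOptimizer(evaluator,
                                 pop_size=16,
                                 generations=10,
                                 mutation_rate=0.05)

optimizer.optimize(params)
```
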
There are a couple of new features that we've recently been working on and have added to the HPE Cray AI hyperparameter optimization library. One of them is the ability to do extrapolation and early termination. This is useful to prevent the hyperparameter search from wasting time on evaluations that are clearly not going to do well. So we have the ability to stop an evaluation early if it doesn't meet a specific threshold; you can also set multiple durations and multiple thresholds that it may need to meet at different times.

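A small sketch of that threshold idea, with made-up intervals and thresholds:

```python
# Threshold-based early termination: check the running loss against one or
# more (interval, threshold) pairs and stop any evaluation that misses one.
def should_terminate(losses, schedule):
    """losses[i] is the loss after interval i+1; schedule maps interval -> max loss."""
    for interval, threshold in schedule.items():
        if len(losses) >= interval and losses[interval - 1] > threshold:
            return True
    return False

# Loss after 3 intervals is 1.2, above the 1.0 cutoff, so terminate early.
print(should_terminate([2.0, 1.5, 1.2], {3: 1.0, 6: 0.5}))  # True
```
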
There's also the ability to do extrapolation. In this example I show linear extrapolation; you can also do degree-2, degree-3, etc. extrapolation. The idea is that in this particular call we're saying that after three intervals, we want to extrapolate what our fitness metric will be after eight intervals, using a degree-one (linear) extrapolation, and we want to see whether the extrapolated loss after eight intervals is below the given threshold; if not, we terminate early.

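A sketch of that extrapolation logic using a simple polynomial fit; the observed losses and the cutoff are made-up values:

```python
# Extrapolation-based early termination: fit a degree-1 polynomial to the loss
# over the first three intervals, predict the loss at interval eight, and
# terminate if the prediction is still above a threshold.
import numpy as np

observed = [0.9, 0.8, 0.7]                       # loss after intervals 1..3
coeffs = np.polyfit([1, 2, 3], observed, deg=1)  # deg=2, 3, ... also possible
predicted = np.polyval(coeffs, 8)                # extrapolated loss at interval 8
print(predicted)                                 # 0.2 for this linear trend

threshold = 0.3                                  # illustrative cutoff
if predicted > threshold:
    print("terminate this evaluation early")
```
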
A
Another
thing
we've
been
working
on
recently
is
the
addition
to
to
the
hpe
creai
of
a
analytics
module
which
allows
you
to
better
study
and
examine
the
results
of
your
hyper
parameter
optimization.
So
you
can
study
the
relationship
here
between
either
scale
and
a
specific
hyper
parameter.
So
we
see
here
we're
examining
the
relationship
between
the
number
of
nodes
we
trained
on
and
the
learning
rate
with
the
color
there
indicating
that
the
fitness
metric.
This
could
also
be
changed
to
examine
the
relationship
between
two
different
type
of
parameters.