Description
More about this lecture: https://dl4sci-school.lbl.gov/richard-liaw
Deep Learning for Science School: https://dl4sci-school.lbl.gov/agenda
Hi, my name is Richard. I'm a software engineer at Anyscale. Previously, I was a PhD student at UC Berkeley, working on machine learning systems and cloud computing. During my PhD, I primarily worked on Ray, which is a distributed computing library in Python targeting AI applications. So today, I'll be talking to you about hyperparameter tuning.
Let's begin with some context about machine learning. Today, machine learning, and specifically deep learning, is experiencing rapid growth and adoption. This is happening not only in academia but also in industry, where more and more applications, such as speech recognition and autonomous vehicles, are leveraging deep learning.
Despite this rapid growth and adoption, all deep learning practitioners know about one dirty secret, and that is the reliance on hyperparameter tuning. Let me give an example. I'll talk quickly about convolutional neural networks. These neural networks are very powerful and are credited with many of the recent advances in computer vision.
Here on the screen, we have one of the original convolutional neural network designs, by Yann LeCun. On the bottom, we have a more modern design called AlexNet, also a convolutional neural network, developed after 20 years of research. What's interesting is that the basic ideas have remained largely unchanged over those 20 years. We've actually simply modified the shape of the neural network and the size of each layer, and as a result we have this new wave of deep learning research that is so active today.
A
These
hyper
parameters
can
clearly
make
a
huge
difference
in
the
performance
of
these
models.
Now,
what's
making
matters.
Worse
is
actually
two
common
trends
that
we're
seeing
the
first
one
is
that
models
are
getting
larger
and
larger.
With
the
most
recent
open,
ai
gpt3
model
containing
nearly
200
building
parameters,
these
state-of-the-art
models
are
not
only
larger,
but
they're
also
more
complex.
So,
as
you
see
on
the
screen,
we
have
another
famous
recent
language
model
containing
over
a
dozen
hybrid
parameters
that
you
have
to
tune.
Some of these techniques are applicable to many traditional machine learning methods, and some of them have much more significance and importance in the regime of deep learning. For the sake of clarity, I'll be using the word "trial" quite often in the rest of this talk. In the hyperparameter tuning literature, a trial typically means one configuration evaluation: essentially, one sample of the hyperparameters that we plan to evaluate.
So, specifically, on the screen we have some pseudocode: we're essentially taking a cross product across all of the listed values for each of the different hyperparameter dimensions.
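To make that concrete, here's a minimal sketch of grid search in plain Python. The search space values and the `evaluate` function are hypothetical stand-ins for your own training code.

```python
import itertools

# Hypothetical search space: a few listed values per hyperparameter dimension.
search_space = {
    "lr": [0.01, 0.05, 0.1],
    "batch_size": [32, 64],
    "num_layers": [2, 3, 4],
}

def evaluate(config):
    # Hypothetical stand-in: train a model with `config`, return val accuracy.
    return -abs(config["lr"] - 0.05) + 0.01 * config["num_layers"]

best_score, best_config = float("-inf"), None
# Grid search: evaluate the full cross product of all listed values.
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    score = evaluate(config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```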
However, on the right-hand side we have another technique: random search. Random search is able to provide good coverage over the hyperparameter space, allowing us to actually reach the optimal point of an important parameter, in exchange for the ability to do a structured analysis of the hyperparameter tuning grid.
So random search is just what it sounds like. You have a distribution for each hyperparameter, and you sample parameters from these distributions over and over again, eventually finding the best model. Again, there are a couple of benefits to doing random search. One is that it's easily parallelizable, because each evaluation is independent of the others. And second, it turns out that in high dimensions, random search is actually very hard to beat.
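As a minimal sketch, reusing the same hypothetical `evaluate` function from the grid search example, random search looks like this:

```python
import random

def sample_config():
    # One independent sample from each hyperparameter's distribution.
    return {
        "lr": 10 ** random.uniform(-2, -1),  # log-uniform over [0.01, 0.1]
        "num_layers": random.randint(2, 5),
        "batch_size": random.choice([32, 64]),
    }

# Because every trial is independent, this loop is trivially parallelizable.
results = [(evaluate(cfg), cfg) for cfg in (sample_config() for _ in range(50))]
best_score, best_config = max(results, key=lambda r: r[0])
```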
A
One
problem
with
random
search
is
that
you
lose
the
ability
to
have
explainable
hyperparameter
tuning
space
that
you
evaluated
and
second
there's
a
couple
things
that
we
can
do
better
since
random
search
is
actually
quite
ineffective
and
expensive
or
inefficient
and
expensive
after
all,
you're
trying
things
at
random.
So what if we used prior information from evaluated training runs to guide our tuning process? Well, this is what Bayesian optimization and other model-based optimization processes do. I'll spare you the details and the mathematics of Bayesian optimization; I'll just simply provide a very high-level overview of how this sort of model-based optimization works.
So we essentially construct an optimizer that is aware of the search space. In this particular example, we have a range for the learning rate, from 0.01 to 0.1, and we have a range for the number of neural network layers that we want to evaluate, from two to five.
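As one concrete, hedged example, scikit-optimize's `gp_minimize` can play the role of such a search-space-aware optimizer; the objective below is a hypothetical stand-in for a real training run.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# The search space from the slide: lr in [0.01, 0.1], layers in [2, 5].
space = [
    Real(0.01, 0.1, prior="log-uniform", name="lr"),
    Integer(2, 5, name="num_layers"),
]

def objective(params):
    lr, num_layers = params
    # Hypothetical: train a model and return the validation loss to minimize.
    return (lr - 0.03) ** 2 + 0.01 * num_layers

# The Gaussian-process surrogate uses past evaluations to pick each next sample.
result = gp_minimize(objective, space, n_calls=20, random_state=0)
print(result.x, result.fun)
```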
However, because it's inherently sequential, the benefit of parallelization decreases significantly as you add more workers. So now that we understand Bayesian optimization, there's actually still room to do better.
Well, there's a hyperparameter tuning technique that addresses this precisely, most famously known as HyperBand. HyperBand and its variants, including ASHA, successive halving, and so on, form a family of algorithms that are essentially early stopping algorithms.
What does that mean? It means that these algorithms aim to allocate resources to better-performing trials and to reduce the resources, or the time, spent evaluating bad trials. So let's quickly walk through some pseudocode. Similar to random search, we'll sample from the hyperparameter search space, and we'll evaluate the resulting trial, or model, for a maximum number of epochs (or steps, or iterations). At every step, all the trials are compared against each other, and if a particular trial is in the top fraction of trials at, say, five epochs, then we will continue training that trial. Otherwise, we're going to pause it and release the resources allocated to it, so that another, perhaps more promising, trial can make use of those resources.
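Here's a minimal sketch of that idea with a single halving round at five epochs; the `sample_config` and `train` helpers are hypothetical stand-ins for your own code.

```python
def successive_halving(num_trials=8, check_at=5, max_epochs=20, keep_frac=0.5):
    # Sample one configuration per trial, as in random search.
    trials = [{"config": sample_config()} for _ in range(num_trials)]
    # Train every trial for the first few epochs only.
    for t in trials:
        t["score"] = train(t["config"], epochs=check_at)  # hypothetical
    # Keep the top fraction; the rest are paused and their resources freed.
    trials.sort(key=lambda t: t["score"], reverse=True)
    survivors = trials[: max(1, int(num_trials * keep_frac))]
    # Promising trials continue training to the full budget.
    for t in survivors:
        t["score"] = train(t["config"], epochs=max_epochs)  # resumed in practice
    return max(survivors, key=lambda t: t["score"])
```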
So essentially, what's happening is: if a trial isn't performing very well, we're not going to evaluate it anymore, and if it's performing very well, if it's a very promising hyperparameter configuration, then we're going to keep evaluating it until the end. There have also been recent advances that have made HyperBand capable of being combined with Bayesian optimization. HyperBand is also nice because it's easily parallelizable, which actually improves its efficiency. But there's actually still some more room for improvement: it turns out that in deep learning, hyperparameter schedules matter a lot.
This is the idea behind Population Based Training (PBT). We will terminate low performers, similar to these early stopping methods, but the best trials we will continue training, and we will use them as templates to replace the terminated low performers. When these templates are used, they are essentially cloned, and their hyperparameters are mutated, so they are perturbed in some way. This effectively allows us to search over hyperparameter schedules, and it's also efficient, in that it terminates bad performers.
We'll start off with four different trials, say with four different values of the learning rate, from 0.1 to 0.4. We'll train them for, say, one epoch, and at one epoch we will run an evaluation across all of the trials that are running. So let's say it turns out that 0.4 is the worst-performing trial of the four.
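Ray Tune, which I'll introduce later in this talk, ships a PBT scheduler. Here's a hedged sketch of this four-trial learning-rate example, where `train_fn` is a hypothetical Tune trainable that reports `val_acc` each epoch:

```python
import random

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="val_acc",
    mode="max",
    perturbation_interval=1,  # run the exploit/explore step after every epoch
    # Low performers are replaced by clones of top performers, with lr mutated.
    hyperparam_mutations={"lr": lambda: random.uniform(0.1, 0.4)},
)

analysis = tune.run(
    train_fn,  # hypothetical trainable that reports val_acc via tune.report
    config={"lr": tune.choice([0.1, 0.2, 0.3, 0.4])},
    scheduler=pbt,
    num_samples=4,  # a population of four trials, as in the example
)
```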
Obviously this isn't perfect, but it actually performs quite well in practice. When DeepMind published this PBT work, they compared the technique against multiple previously published algorithms, and they found that, across the board, PBT was able to provide a non-trivial performance increase over the state of the art.
Why? Well, it turns out that in modern deep learning models, as presented at the beginning of the talk, there are dozens of hyperparameters that you can tune. And so we have here, again, this very famous language model, RoBERTa, and it has more than a dozen hyperparameters.
So this means that you want to effectively choose which hyperparameters you're searching over; again, choosing the hyperparameter space itself is an important decision. So you might be asking yourself: okay, well, I know that one, or two, or three of these things are incredibly important, but I have a list of 20. How do I choose my hyperparameter space? How do I choose the right hyperparameters to evaluate in the first place?
Well, my second tip for you is that you should make use of the available tools for visualizing and understanding your hyperparameter tuning landscape. A common tool that researchers use, especially at well-resourced places such as Google, is this parallel coordinates tool. It helps you visualize multiple dimensions at once, which is hard to do in, say, a 2D or 3D graph.
So here is a graphical representation of how that might work. Typically, these parallel coordinates tools allow you to filter out particular runs and identify different relationships between multiple hyperparameters at once. This, in turn, allows you to better inform how you structure your search in this sort of iterative process.
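As a hedged illustration, pandas can draw a basic parallel coordinates plot from a table of trial results; the data below is randomly generated purely for demonstration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Fake trial results, purely for illustration: one row per trial.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lr": rng.uniform(0.01, 0.1, 30),
    "num_layers": rng.integers(2, 6, 30),
    "val_acc": rng.uniform(0.6, 0.95, 30),
})
# Color each line by how well the trial did.
df["result"] = pd.cut(df["val_acc"], bins=3, labels=["low", "mid", "high"])
parallel_coordinates(df.drop(columns="val_acc"), class_column="result",
                     colormap="viridis")
plt.show()
```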
There are many tools that provide this sort of tuning visualization, such as TensorBoard, Weights & Biases, Comet, and Neptune.
So there are multiple things that you can do, and you will have to either engineer them yourself or look for a framework that does this for you. But typical things that you'd want to do to reduce overfitting and denoise your optimization inputs include: making sure you do cross-validation; making sure you evaluate the same hyperparameters across multiple seeds; and also considering different metrics in addition to accuracy, such as the gap between validation and training, model entropy, or even training versus validation loss.
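A hedged sketch of that kind of denoising wrapper, where `train_and_eval` is a hypothetical function returning per-run metrics:

```python
import statistics

def denoised_evaluate(config, seeds=(0, 1, 2)):
    """Score one hyperparameter config across several seeds to reduce noise."""
    runs = [train_and_eval(config, seed=s) for s in seeds]  # hypothetical
    val_acc = statistics.mean(r["val_acc"] for r in runs)
    # Also track the train/validation gap as an overfitting signal.
    gap = statistics.mean(r["train_acc"] - r["val_acc"] for r in runs)
    return {"val_acc": val_acc, "overfit_gap": gap}
```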
A
I'll
talk
about
ray
tune,
which
is
a
scalable
hyperparameter
tuning
developed
now,
primarily
at
any
scale,
but
previously
at
uc,
berkeley
and
ray
tune,
is
a
scalable
hyperparameter
tuning
library
that
works
with
any
machine
learning
framework.
So Ray Tune, specifically, is the library that handles the execution of hyperparameter search. It provides hooks to plug in different hyperparameter search algorithms, and it automatically handles the parallelism and scaling for you. Why is Tune special? Well, Tune is built with deep learning as a priority. Now, what does that mean? Specifically, Tune is built so that you can utilize and spread your training and tuning across multiple GPUs and across a cluster.
A
It
also
allows
users
to
tune
models
with
any
machine
learning
framework
and,
most
importantly,
tune
allows
you
to
run
high
performance
tuning
at
any
scale.
So
you
can
go
from
running
on
a
single
process
to
running
across
a
bunch
of
gpus
to
run
across
multiple
nodes.
All
without
changing
your
code,
as
mentioned
today,
hyperparameter
tuning
algorithms
are
very
important
to
leverage
so
tune
offers.
Ray Tune offers many algorithms to optimize your hyperparameter search, including all the algorithms mentioned today. Tune also integrates with a lot of open source hyperparameter optimization libraries, such as HyperOpt and the recent Ax library from Facebook, in addition to services such as SigOpt and others.
This increases the number of samples that you're going to take from the hyperparameter search space; specifically, we're setting that to one hundred. The parallelism that Tune will operate at is determined by the size of your cluster, so it automatically leverages all the cores available in your particular cluster.
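As a minimal sketch of what that looks like with Tune's function API (the training loop body is a hypothetical placeholder):

```python
from ray import tune

def train_fn(config):
    # Hypothetical training loop: report a metric back to Tune each epoch.
    acc = 0.0
    for epoch in range(10):
        acc += config["lr"] / config["num_layers"]  # placeholder "training"
        tune.report(mean_accuracy=acc)

analysis = tune.run(
    train_fn,
    config={
        "lr": tune.loguniform(0.01, 0.1),
        "num_layers": tune.randint(2, 6),
    },
    num_samples=100,  # take 100 samples from the search space
)
```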
A
Oftentimes
you'll
want
to
leverage
a
gpu
and
in
prior
torch
and
other
distributed
training
frameworks
or
hype
model
tuning
frameworks,
you'll
be
forced
to
handle
ugly
environment
variables
and
manual
device
placement
and
such
however
tune
is,
you
know,
built
for
deep
learning,
so
it
will
automatically
set
your
environment
variables,
isolate
your
training,
jobs
across
multiple
gpus,
allowing
you
to
paralyze
your
search
even
across
you
know,
multiple
machines
without
ever
setting
these
environment
variables
by
hand
due
to
narrow,
very
narrow
api,
essentially,
two
two
different
code,
two
different
api
calls
tune,
exposes
a
variety
of
features,
including
automatic
checkpointing,
and
specifying
different
tuning
algorithms.
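For example, as a sketch reusing the `train_fn` above, requesting a GPU per trial and plugging in an early-stopping scheduler is just two more arguments:

```python
from ray.tune.schedulers import ASHAScheduler

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(0.01, 0.1), "num_layers": tune.randint(2, 6)},
    num_samples=100,
    resources_per_trial={"cpu": 2, "gpu": 1},  # Tune isolates a GPU per trial
    scheduler=ASHAScheduler(metric="mean_accuracy", mode="max"),  # early stopping
)
```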
A
So
and,
as
we
mentioned
above,
it's
incredibly
important
to
analyze
your
hyperparent
tuning
run
afterwards.
So,
if
you
wanted,
you
can
provide,
you
can
capture
the
the
results
in
a
data
frame
which
is
provided
to
you
automatically
so
that
you
can
analyze
different
training
results
across
all
the
different
models
that
you've
trained
and
all
the
different
hyperparameters
that
you
evaluated.
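Concretely, as a sketch (column names follow Tune's `config/...` convention):

```python
# After tune.run finishes, every trial's results are available as a DataFrame.
df = analysis.dataframe()
top = df.sort_values("mean_accuracy", ascending=False)
print(top[["config/lr", "config/num_layers", "mean_accuracy"]].head())
```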
A
In
addition,
we
talked
about
the
importance
of
visualization,
so
tune
automatically
generates
tensorboard
files
so
that
you
can
visualize
and
understand
your
training
with
different
scalar
graphs
and
parallel
coordinate
plots
so
at
a
very
high
level.
So, at a very high level, to recap: in this talk, we motivated the importance of, and highlighted the complexity of, hyperparameter tuning. We gave an overview of some of the state-of-the-art techniques for tuning hyperparameters. And finally, we talked about Ray Tune, which is a library built on top of Ray to simplify and scale hyperparameter tuning.
A
As
a
final
call
out,
we
are
hosting
a
race
summit
which
is
going
to
be
a
free
online
conference
on
september
30th
to
october
1st,
covering
workshops
and
different
tutorials
and
keynotes
on
all
sorts
of
different
scalable
machine
learning
and
skillable
python
topics.
So,
if
you're
interested
please
check
out
racesummit.org
thanks
for
listening,
if
you
have
any
questions,
your
feel
free
to
reach
out
to
me
on
twitter
or
or
at
my
email
and
happy
to
take
any
questions
now,
thanks.