From YouTube: 17 - Hyperparameter Optimization - Ben Albrecht
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
All right, so we're going to dive right into hyperparameter optimization. I'm going to start with some background just to get all of us on the same page. This is relatively simple, maybe elementary compared to the material we've been covering already, but it makes sure we're using the same semantics. So what is a model parameter? Model parameters are values that live within a model and are determined from the data itself.
These are the model parameters that you're training. In a linear regression, they would be our slope and intercept, m and b. In a decision tree, they would be the splits in the tree that you're making in order to optimize your model to predict some data. And in neural networks,
they are the weights and biases that we've been learning about this week. That transitions us over to model hyperparameters. Model hyperparameters are values that are external to the model but that influence the model capacity; I'll elaborate a little more on what model capacity means in the next slide. But first: in linear regression we don't really have any hyperparameters, there are no knobs to turn in that procedure. In decision trees, we can think of the tree depth as a hyperparameter.
Okay, so what do I mean by model capacity? It's really a general term capturing how much we are overfitting or underfitting, both of which we want to minimize. It also covers our time to accuracy, how quickly the model trains to its desired accuracy, and, related to that, efficiency: the total CPU time required to achieve that accuracy.
We also have feature selection and model selection, both of which can be thought of as hyperparameters. So there's a spectrum of classes of hyperparameter optimization strategies: down at one end is what I'd consider traditional hyperparameter optimization, doing spatial HPO is more of a neural architecture search, and when you're doing all of these things together we consider it automated machine learning.
Okay, so for hyperparameters in deep learning: there are a lot of hyperparameters to deal with, a lot of knobs to turn. In fact there was a really good quote from earlier this week from one of the speakers: deep learning is a hyperparameter soup. That was from Josh in his talk, so I dedicate this slide to Josh.
And it's true, there are a lot of hyperparameters to deal with in deep learning. On the training side we have our optimizer, learning rate, momentum, and so on; I'm not going to read all of these off, but you've seen a lot of them throughout the week and have probably tried to tune some of them in your own models. We also have a lot of spatial hyperparameters to modify as well, so hyperparameter optimization in deep learning is a very high-dimensional problem.
In addition to the high dimensionality, there are a number of other challenges in deep learning hyperparameter optimization. We have the fact that hyperparameters can be continuous, categorical, or integer values, which poses some mathematical challenges. Computing gradients with respect to hyperparameters is pretty challenging to do; it's still an open research question.
The cost function being minimized only represents a sample of the performance, since we are typically optimizing our hyperparameters on only a subset of the data, as we should. Evaluations are expensive; that's just the nature of deep learning. And evaluation times can vary greatly depending on the choice of hyperparameters; one obvious example is the number of epochs we're using in our training.
So in many cases hyperparameters are selected and tuned manually, and that's okay a lot of the time; I'm sure many of you have been doing this in some of the hands-on sessions. It's typically guided by intuition or rules of thumb. That's fine in limited cases, but when you really want to find the optimal hyperparameters for your model, you want to move into the regime of automated HPO.
So let's go over some of the algorithms that decide how to explore the space of hyperparameters. I'll try to present these as a number of different categories of hyperparameter optimization. There is a class of HPO algorithms called exhaustive search algorithms; this includes grid, random, and genetic search, and I'm going to go through all of these in more depth.
There are also surrogate models, which really try to minimize the number of evaluations needed to reach the minimum, and early stopping algorithms, which exploit a property of hyperparameter optimization: you can approximate the value of a set of hyperparameters before the training is completed. And there are also gradient-based algorithms. I'm not really going to say much about these, but I just want to acknowledge that they exist and that interesting research is being done there.
Okay, the next strategy to discuss is random search. Random search, as the name implies, randomly samples hyperparameters from the hyperparameter space. It's embarrassingly parallel as well, and it exploits the fact that some hyperparameters matter more than others. This is an observation made in a famous paper by Bergstra and Bengio back in 2012, and pretty much every place you read on the internet that suggests using random search cites
this paper as the main motivation. There's a famous figure from that paper showing how grid search fails to explore the hyperparameter space when one hyperparameter matters much more than the other, whereas random search is much more successful with a smaller number of evaluations. In the HPO literature you'll see a lot of researchers refer to random search as frustratingly successful, because it's about the dumbest strategy you could possibly think of, but it's ridiculously good for how dumb it is.
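To make that concrete, here's a minimal sketch of random search over two hyperparameters. The search space and the train_and_evaluate stub are illustrative assumptions, not part of any particular framework:

```python
import random

def sample_config():
    # Hypothetical search space: log-uniform learning rate, uniform dropout.
    return {
        "lr": 10 ** random.uniform(-5, -1),   # sample the exponent, so lr is log-uniform
        "dropout": random.uniform(0.0, 0.5),
    }

def train_and_evaluate(config):
    # Stand-in for your real training run; returns a validation loss to minimize.
    raise NotImplementedError

def random_search(n_trials):
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):                 # each trial is independent -> embarrassingly parallel
        config = sample_config()
        loss = train_and_evaluate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss
```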
Okay, the next strategy I want to talk about is genetic HPO. I'd like you to think of genetic algorithms applied to HPO as an automatic, iterative, stochastic grid search with pruning. What that really gets you is the best of both worlds between random search and grid search, while benefiting from the knowledge learned in previous iterations. Genetic algorithms in general excel at optimizing many parameters of varying importance, a property we also had with random search, and genetic HPO is inspired by biological systems found in nature.
Genetic HPO is also embarrassingly parallel per generation, though there is a sequential dependency across generations. Again, the biggest advantage is that each generation benefits from the data of the previous generation, so the work you're doing continues to pay off in the next generation. Here's a nice figure from a paper applying large-scale genetic HPO to image classifiers, and what's important to note here is this:
these are populations over time, and you can see that their accuracy rapidly jumps and then slowly converges to a high accuracy. If we were doing a random or grid search, you might imagine a lot more of this empty space being filled, but because the genetic HPO is learning from previous generations, we are doing a much smarter search here.
Population-based training is also a genetic approach, and it trains its hyperparameters during the model optimization itself, so it's one of the early stopping algorithms I mentioned earlier. The process goes as follows: we select random sets of hyperparameters and train multiple models in parallel, and then every n epochs we take the best models and hyperparameters and copy those over the worst models.
You can think of the worst models as the low-performing individuals in a population that die off in that generation. Then, if a model was copied over, we randomly perturb its hyperparameters; that's the mutation we saw in genetic hyperparameter optimization. This is a figure from Google DeepMind, from their original blog post on this, which was late 2017 I believe. You can see here we have two models with two sets of hyperparameters being trained.
You can see in the performance curves that one does better than the other. So we exploit that fact and copy its hyperparameters over to the other one, and then we do an exploration with the copy, perturbing its hyperparameters, while we leave the original alone and let it continue. At a later time you can see that the exploration ended up doing better than the original parent.
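As a rough sketch of that exploit/explore step (this is not DeepMind's implementation; the population layout, the perturbation factors, and the surrounding train/evaluate loop are illustrative assumptions):

```python
import copy
import random

def exploit_and_explore(population, perturb=(0.8, 1.2)):
    """One PBT step: copy the best member over the worst, then perturb the copy.

    population: list of dicts like {"weights": ..., "hparams": {...}, "score": float}
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    best, worst = ranked[0], ranked[-1]

    # Exploit: the worst member inherits the best member's weights and hyperparameters.
    worst["weights"] = copy.deepcopy(best["weights"])
    worst["hparams"] = dict(best["hparams"])

    # Explore: randomly perturb the copied hyperparameters (the "mutation").
    for name, value in worst["hparams"].items():
        worst["hparams"][name] = value * random.choice(perturb)
    return population
```

In a full loop you would train every member for n epochs, re-score them, and then repeat this step.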
And you can then take what comes out of that and use it at a later time to reduce your time to accuracy. This is a nice figure from their paper, or blog post, where they visualize population-based training. The x and y axes are actually kind of meaningless on this figure; what's important is the color of the points.
Okay, jumping over to Bayesian HPO — yes? Let's see, if I recall correctly, I think these were just two separate models they were looking at. The easiest way to think about this is the learning rate: it's pretty common for us to train with a learning rate schedule, and in fact a lot of optimizers, like the Adam optimizer, try to find that learning rate schedule on their own. That's exactly what's happening here.
Yes, so you could use this to replace that; you wouldn't use a learning rate schedule with one of those optimizers. I'm not sure I know that number off the top of my head, but it was a pretty big experiment they did here. You can check out this source, or just Google population-based training or DeepMind PBT;
it will be your first hit. Okay, jumping over to Bayesian HPO. I always like describing Bayesian optimization with pictures and figures rather than math; it's easy for your eyes to glaze over when you see all the Bayesian math. The intuitive way to think about Bayesian optimization, for me, is to look at the following figure and ask: what number would you choose?
Say you have this data of random forest results with different numbers of trees, and you are tasked with choosing the next point to evaluate in order to find the minimal error. What area would you choose on the plot? You would probably choose somewhere down here, because we're already doing pretty well in this area, and that's exactly what Bayesian optimizers are doing. So, a more formal definition of a Bayesian optimizer:
it's a sequential model-based optimization that builds a surrogate model for the objective and quantifies the uncertainty in that surrogate model using Gaussian process regression. Lots of caveats here, there are tons of variations on this, but I'm just describing the most popular approach. Here's a nice way to visualize it: we've collected a few data points along this x-axis.
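Here's a minimal sketch of that loop with a Gaussian-process surrogate and an expected-improvement acquisition function. The objective, bounds, and candidate grid are placeholders, and this is only one common formulation of the idea:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, best_y):
    # Acquisition function: how much do we expect to improve on the best loss so far?
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (best_y - mu) / sigma                      # minimizing, so improvement = best_y - mu
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayesian_optimize(objective, bounds=(0.0, 1.0), n_init=3, n_iter=20):
    X = np.random.uniform(*bounds, size=(n_init, 1))           # a few random points to start
    y = np.array([objective(x[0]) for x in X])
    candidates = np.linspace(*bounds, 1000).reshape(-1, 1)
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)   # surrogate + uncertainty
        x_next = candidates[np.argmax(expected_improvement(candidates, gp, y.min()))]
        X = np.vstack([X, [x_next]])                            # evaluate the chosen point next
        y = np.append(y, objective(x_next[0]))
    return X[np.argmin(y)], y.min()
```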
Some properties of Bayesian HPO: it is ideal for optimizing objective functions with very expensive evaluations, which is true for a lot of deep learning models. However, it is best suited to a small number of hyperparameters; it has been shown to be relatively ineffective with more than about 20 hyperparameters, which is good to be aware of.
Also, the cost grows cubically with the number of evaluations you've done, so this can impact you if you do a large number of evaluations with a Bayesian HPO, though some efficient approximations exist to work around this. Again, just good to be aware of if you're employing one of these. Okay, the next strategy I'd like to talk about is Hyperband; now we're getting into some of the more recent developments in hyperparameter optimization.
So Hyperband is a successive halving algorithm combined with random search. The process goes as follows: you sample k sets of hyperparameters, you evaluate them for n epochs, and then you discard the lowest-performing half. Then you continue evaluating the remaining hyperparameters for n more epochs, discard the lower-performing half again, and run the good ones for even more epochs. This is visualized here.
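A rough sketch of that successive-halving core is below: random sampling plus repeated halving. The sample_config and train_more_epochs stubs are assumptions, and the full Hyperband algorithm additionally sweeps over different trade-offs between the number of configs and the budget per config:

```python
def successive_halving(sample_config, train_more_epochs, k=16, n=5, rounds=4):
    """Sample k configs, then repeatedly train each survivor for n more epochs
    and discard the worst-performing half."""
    configs = [sample_config() for _ in range(k)]
    for _ in range(rounds):
        # train_more_epochs is assumed to resume each config's model and return its loss.
        scored = sorted(((train_more_epochs(c, n), c) for c in configs), key=lambda p: p[0])
        configs = [c for _, c in scored[: max(1, len(configs) // 2)]]  # keep the better half
        if len(configs) == 1:
            break
    return configs[0]
```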
So this again is an early stopping algorithm that finds a nice balance in the explore/exploit problem. We start with a bunch of sets of hyperparameters, sort them by their performance after they're evaluated to some level, continue training only a certain number of them, and keep chopping the rest off until only one remains.
This works by assigning workers to evaluate hyperparameters at the bottom rung, and then, when a worker finishes its evaluation, it requests more work. If a set of hyperparameters qualifies for promotion to the next rung, it is chosen; otherwise the worker starts with a new set of hyperparameters at the bottom rung again. So this gives workers something to do.
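A simplified sketch of that promotion rule is below. It follows the general idea of asynchronous successive halving rather than any particular implementation; the rung bookkeeping and the reduction factor eta are illustrative:

```python
def next_job(rungs, sample_config, eta=2):
    """Decide what an idle worker should evaluate next.

    rungs: list of dicts, one per rung, each {"results": {config_id: loss}, "promoted": set()}
    Returns (rung_index, config) for a promotion, or (0, new_config) otherwise.
    """
    # Scan from the second-highest rung down, looking for a config worth promoting.
    for r in range(len(rungs) - 2, -1, -1):
        results = rungs[r]["results"]
        if not results:
            continue
        k = len(results) // eta                       # top 1/eta of this rung is promotable
        top = sorted(results, key=results.get)[:k]    # lowest loss first
        for config_id in top:
            if config_id not in rungs[r]["promoted"]:
                rungs[r]["promoted"].add(config_id)
                return r + 1, config_id               # promote: evaluate it at the next rung
    return 0, sample_config()                         # nothing promotable: start a fresh config
```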
Again, it gives workers something to do if their set of hyperparameters didn't work out, and you can see the resource efficiency is much nicer. Actually, on the next slide you can compare these side by side: up here we have synchronous successive halving, and I don't recall the year that algorithm was developed, but this is the more recent asynchronous successive halving from, I think, 2018, where we now have a parallel strategy for doing Hyperband.
Okay, the next strategy is Bayesian optimization and Hyperband; getting really creative with the names here, we're just combining two things, and that's BOHB. BOHB is essentially Hyperband, except instead of using random search to sample the hyperparameters, it uses a Bayesian optimizer to sample them. This is a pretty big improvement. It also supports a parallel formulation, as you can imagine, and this figure is from their paper, which they wrote a blog post about, where they show some pretty nice speedups.
Okay, some other strategies that I'm not going to go into in depth but just want to expose you to: there are Tree-structured Parzen Estimators, TPE. These are Bayesian approaches that handle categorical hyperparameters and tree structure, such as the connectivity between layers depending on the number of layers.
Okay, so now I want to give an overview of some of the different HPO software out in the wild. I am a developer on the Cray AI HPO framework, but I didn't think it would be fair to only present Cray HPO, since that would be a biased opinion. So I'm going to give a gentle overview of some of the HPO software out there, and then we're going to dive into looking at Cray HPO.
Unfortunately, as Steve was pointing out when we were discussing this earlier, development on this project has kind of fallen off, but it does seem there are still some people supporting it. There's HPOlib, part of the AutoML suite. This provides a common interface to a couple of different standalone packages that implement algorithms like SMAC and Spearmint, as well as Hyperband and BOHB. Unfortunately it looks like HPOlib is not receiving a lot of attention either, but
it is pretty useful as-is, I would say. Then there's Advisor, which contains a ton of HPO algorithms, which can be nice for just trying out different things. And lastly on this list is Cray AI HPO, the framework I'm working on at Cray. This is a distributed hyperparameter optimization framework intended for HPC users, although you can run it on your local machine as well. Currently we have grid, random, genetic, and PBT, and we are in the process of developing a Bayesian optimizer.
There's also HpBandSter, which implements Hyperband and BOHB; I believe that's the implementation from the publication. And then hypergrad is one of the gradient-based HPOs, which has a memory-usage trade-off from storing stochastic gradient descent intermediate results. It's kind of a research toy at this point, but it will be cool to see it mature.
SageMaker has an HPO suite, Azure ML has an HPO suite, and Google Cloud does as well. And then, if you recall the spectrum slide, on the far left we have some frameworks that deal with optimizing not only your traditional hyperparameters but also your topology, features, and choice of models. Here are a few examples of those frameworks. The first is a pretty big automated machine learning framework that tries to unify a common interface to a ton of underlying algorithms.
TPOT is another AutoML workflow, one that utilizes genetic programming. There's H2O, by H2O.ai, which notably supports population-based training; I think they're the only other main HPO package that supports PBT right now, and they also support distributed training.
This last one is receiving the most active development right now; currently it supports random search and Hyperband, and we'll see what they have in store, it looks like an exciting project. Okay, so next I want to transition over to some general practical tips for hyperparameter optimization, now that you have an overview of the available algorithms and the available software out there. Like I mentioned before, deep learning in general has long evaluations.
So the HPO process is going to take a long time; expect HPO runs to take anywhere from hours to weeks, depending on how long your training takes. Choosing the wrong search space for your algorithm can therefore have large consequences, so it's worth taking the time to plan your experiment: how you're going to search your hyperparameter space and which hyperparameters to use.
If you have distributed resources available to you, you should definitely utilize some kind of distributed HPO software package; there's no reason not to, with so many of these HPO algorithms being embarrassingly parallel. As mentioned in, I believe, Brenda's talk, you should use a development data partition carved out of your validation set to tune your hyperparameters. This is just good practice to make sure you're not overfitting to your validation set.
It's also important to remember that we're not trying to find the global minimum. Without some kind of cross-validation baked into your measure of hyperparameter performance, you're definitely going to overfit if you optimize too much, so you either need to bake in some kind of cross-validation or just be careful about optimizing too much.
Okay, on choosing hyperparameters: you want to utilize your domain knowledge about the model to focus on the important hyperparameters. It's important to start with the initial learning rate. The next good candidate is the learning rate decay schedule, such as the decay constant, and lastly, regularization strength, such as the L2 penalty or dropout strength, is a third candidate to consider. As was also mentioned earlier this week, be careful about pairing incompatible loss functions and activation functions.
Also limit your search space: starting from a coarse-grained search is a reasonable approach, and you can do a hierarchical search from there. You want to use a log scale for multiplicative hyperparameters such as learning rate, momentum, or regularization strength; for something like dropout rate, you would just want an absolute scale.
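For example, a small sketch of what sampling on those two kinds of scales might look like (the ranges here are purely illustrative):

```python
import random

def sample_hyperparameters():
    return {
        # Multiplicative hyperparameters: sample the exponent so values are log-uniform.
        "learning_rate": 10 ** random.uniform(-5, -1),   # roughly 1e-5 .. 1e-1
        "l2_penalty":    10 ** random.uniform(-6, -2),
        # Additive hyperparameters: a plain (absolute) uniform scale is fine.
        "dropout_rate":  random.uniform(0.0, 0.5),
    }
```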
All right, some tips on choosing an HPO strategy: grid search is bad, don't use it.
I also want to mention that it's not uncommon to mix and match HPO strategies. As I said earlier, you can do a hierarchical search, and in doing something like that it's perfectly reasonable to start with, say, a random or genetic search for a broad search that includes your topology, and then switch over to a Bayesian search
once you have locked in some initial hyperparameters and only want to tune a smaller number of them, because Bayesian optimizers do better with a smaller number of hyperparameters. Things like PBT and Hyperband, because they have that early stopping mechanism, cannot be used for topology search; but once you have your topology locked in, it can make sense to switch over to using PBT to acquire a reusable learning schedule. That was kind of throwing a lot at you — yes, question?
So if you're storing some kind of intermediate results, you should be able to look over your data, and plotting it is a really good way to visualize how different hyperparameters impacted the accuracy. That's a really good point, and I didn't want to just leave it at that. You know, best practice for hyperparameter
optimization is really still kind of an open research question, and there are lots of people working on it. So I just wanted to point to a couple of resources, some of which contributed a lot to the tips here. If you want to look more into what some of the more recent practices are, these are a couple of good resources to check out.
Okay, so with that I'm going to transition over to talking about Cray AI HPO. This is Cray's hyperparameter optimization framework. I call it an emerging hyperparameter optimization framework because it's still under active development and not 1.0 yet; we consider ourselves an alpha release right now, and we are reserving the right to make breaking changes to the interface, which is actually happening right now. It's portable, so it can run on your desktop or on a supercomputer.
It has a lightweight black-box interface: it defines an interface to just an executable on your file system, and that can be anything you want it to be. So it can be a Python script using any of these machine learning toolkits, or it can be,
you know, a Fortran program or something, if you're using Fortran for deep learning — I don't know, there are DoD folks here. Okay, so as I mentioned, it's distributed in HPC environments; it supports distribution out of the box. The mechanism it uses is to interface directly with the workload manager on the machine, and it supports two different types of distribution. We can do distributed HPO, where we're evaluating multiple sets of hyperparameters simultaneously, and we also support distributed model training,
where, say, you have an allocation of 64 nodes and each evaluation uses the distributed TensorFlow package, so each evaluation could be running on 16 nodes within those 64 nodes. So there are two different types of distribution that can be used simultaneously. An important feature is that we've tried to design the low-level interface, which hasn't really been exposed to the public yet but we plan to, to be fairly simple and generic so it can support anyone.
Just a quick blurb on Chapel: Chapel is a modern, productive, parallel programming language. It's open source, scalable from laptops to clusters to supercomputers, and it strives to be as performant as Fortran, as portable as C, and as elegant as Python, while doing all of this in a distributed parallel setting.
Cray AI projects utilize Chapel for a number of reasons, mostly for its modern language features: built-in shared-memory parallelism and built-in atomics in the language, great interoperability with Python and Fortran, and a lot of other great modern programming language features like generics, type inference, and memory management strategies.
Okay, so I'm going to walk you through the components of a Cray AI HPO workflow; if you're starting from scratch, this is what you have to do. There are two parts: the training kernel and the HPO driver. The training kernel is the model training program to be optimized. This is what you're starting with: say you have a Jupyter notebook from one of these handouts, and you have
some code in there that trains a neural network and then prints out the accuracy. That would be your model training program, your training kernel. The interface we define here is that you expose those hyperparameters through command-line arguments. So in the Jupyter notebook case, once you have your code relatively stable, you would
put it into a standalone Python script that you call, and you would maybe import argparse and expose the hyperparameters. Then we also need to expose the figure of merit, which you can think of as our cost function; this is the value to be minimized in our hyperparameter optimization, and it is simply printed to standard out with a unique identifier.
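Here's a minimal sketch of what such a training kernel might look like; the flag names, the "FoM:" tag, and the dummy loss are illustrative assumptions, not the framework's required spelling:

```python
# train_model.py -- a hypothetical training kernel for a black-box HPO framework
import argparse

def train(lr, dropout):
    # Stand-in for real training; return the validation loss to be minimized.
    return (lr - 0.01) ** 2 + (dropout - 0.2) ** 2

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)       # hyperparameters exposed
    parser.add_argument("--dropout", type=float, default=0.2)   # as command-line flags
    args = parser.parse_args()

    loss = train(args.lr, args.dropout)
    # Print the figure of merit to stdout with a unique identifier the optimizer can parse.
    print(f"FoM: {loss}")
```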
As I mentioned, your model training program can be written in anything: for example, Python plus any of your favorite frameworks, or Julia, whatever you want. Okay, the second part is your HPO driver. This is the program that actually imports the Cray AI module; it's the program being used to optimize the hyperparameters of the training kernel, and this one actually must be written in Python.
All right, so say you start with a Python script that you want to do HPO on — I kind of already went through this. You would modify that Python script to expose the hyperparameters and print the figure of merit, and then you would write your HPO driver, and I'm going to walk you through that now. Okay, hopefully that's visible, it's kind of dark in here. The first step is to expose the hyperparameters.
In this example we're exposing our learning rate and our dropout rate, and then we're using those flags within the script itself rather than plugging in hard-coded values. Then we print out, in this case, the loss value we're trying to minimize: we print our figure of merit identifier, which is "FoM" by default, though you can set it to whatever you want, so that the optimizer can pick it up. Now, jumping over to the driver code:
the training kernel is called train_model.py, and we're going to provide our hyperparameter flags. We exposed the learning rate and dropout rate, so we plug those in here into this hyperparameter list of lists, and the structure goes as follows: you have the flag itself as a string, the default value, and then your bounds for the search space. Then you set up your hyperparameter optimizer, here we're using the genetic optimizer, and you just call optimize on your parameters, and you can get your best figure of merit
that was printed out, and your best set of hyperparameters, from Python memory. We also log a bunch of data along the way. Okay, a quick note about the Params class: it accepts a list of lists, where each of those lists has the hyperparameter flag, the default value, and the search space, as I said. It's worth mentioning that the values can be integer, float, or string, and the search space can be a tuple of bounds or a list of values.
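Putting that driver description together, it might look roughly like the sketch below. To be clear, the class and argument names here are my paraphrase of what's described in the talk (the hpo submodule, an evaluator around the black-box script, the list-of-lists params, and a genetic optimizer), not a verified copy of the Cray AI HPO API:

```python
# hpo_driver.py -- rough sketch of an HPO driver, following the talk's description
from crayai import hpo   # the talk imports an `hpo` submodule; the exact API may differ

# How to evaluate one set of hyperparameters: run the black-box training kernel.
evaluator = hpo.Evaluator("python train_model.py")

# List of lists: [flag, default value, search-space bounds] for each hyperparameter.
params = hpo.Params([
    ["--lr",      0.01, (1e-5, 1.0)],
    ["--dropout", 0.2,  (0.0, 0.5)],
])

# Genetic optimizer; the grid and random optimizers follow the same pattern.
optimizer = hpo.GeneticOptimizer(evaluator, generations=10, pop_size=10)
optimizer.optimize(params)

print(optimizer.best_fom)     # best figure of merit seen
print(optimizer.best_params)  # the hyperparameters that produced it
```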
There's also the Evaluator class, which, as already mentioned, is used to describe how to evaluate a set of hyperparameters, and it has a lot of bells and whistles. Most importantly, there are different types of launchers you can use. As I mentioned, we interface with the launcher directly, so you could be using Slurm, PBS, nothing at all if you're just running on your desktop, or the Urika launcher.
The main thing I want to point out here is that when you are running distributed, you just set your nodes equal to some amount, and Cray AI is typically able to infer the workload manager you're using. So if you were on, say, NERSC and you set your nodes equal to 4, it would try to salloc 4 nodes for you and run its evaluations on there. Alternatively, you could do an interactive salloc, and once you're on the allocation you could run it there.
Okay, so a feature overview of the HPO framework: we have grid, random, genetic, and Cray PBT — I'll talk a little bit about why it's called Cray PBT in a little bit — and we have Bayesian on the way. Looking at the grid interface: I already walked you through most of how this interface works, but real quick, we're importing our hpo submodule and setting up our evaluator that's going to run some train.py that's
going to train, say, our neural network and emit a figure of merit. Then we have our list of hyperparameters with their default values and search spaces; here we just have a, b, and c, each searched from negative ten to ten, starting at zero. You can use the grid optimizer, which you really shouldn't use since we have better algorithms available, but it's just kind of a benchmark.
You can set your grid size, which is how you'd like to split up each hyperparameter's range; this would split negative 10 to 10 four ways. Your chunk size is how many evaluations to do before reporting back results, basically how frequently you'll get feedback. Then we call optimize on the params. Jumping over to random:
I'm using the same example here; the only difference is we use the random optimizer, and our arguments to the random optimizer are specific to it. Here we specify a number of iterations, so we're just doing a thousand iterations randomly sampling these hyperparameters. Then our genetic optimizer interface has a lot of different values you can set; the ones shown here:
you can set your number of generations — here we're doing ten generations — a population size of ten, and then four demes. A deme is a subpopulation that helps you avoid getting your whole population stuck in a local minimum: you can start your demes out in different locations and let them evolve separately, occasionally doing migration between them.
Okay, and then jumping over to a distributed genetic HPO example. We've switched up our hyperparameters here to something more realistic, and we have a few more arguments exposed: you can specify your mutation rate, crossover rate, and where you'd like to log your global results. But the key thing to note here is that we've just specified our number of nodes, and that's really all we need to do to enable distributed HPO, if you have an allocation.
If you recall, we support two different types of distribution. Say train.py is actually a distributed evaluation, distributed training. We've specified n equals four, just pretending that that is how you run this script with four nodes, and then we have to tell the evaluator that we're going to run it with four nodes, so that it knows to tell the underlying workload manager to run it over four nodes. And so with sixteen nodes and four nodes per evaluation, we're going to be running four evaluations at a given time.
Oh, sorry, actually just a quick correction: the driver script you're looking at here has to be a Python script, because it's calling the Cray AI library, but the training kernel can be whatever you want; it's just a black box. And yes, I think I understand your question — that part is kind of all put on the user here.
So, just showing some data collected with the Cray HPO framework: this is LeNet on MNIST. LeNet is a neural network primarily trained, or optimized, for image recognition, and MNIST is the classic hello world of machine learning, where you have handwritten digits that you're trying to identify. This is what that looks like in Cray HPO, and this is actually a particularly big run.
We're doing 250 generations with a population size of one hundred and only one deme, so we're not using that feature. This is going to do quite a few evaluations, but we're searching a pretty big space here. If I go back, we're actually searching the topology of these layers, specified through these arguments here, and we're also searching the momentum and dropout.
So this is just showing an example of how you can do a topology search in Cray HPO by exposing these hyperparameters via command-line flags and searching over this integer space. Now, it's worth mentioning that this looks clean on this side, but on the user side they do need to handle that inside their mnist.py.
Fortunately, I think this one's relatively simple, but there are cases where you have dependencies between hyperparameters; say we also wanted another hyperparameter that depended on, say, this hyperparameter here. We can't do that today in Cray AI; we don't support dependent hyperparameters.
There are a handful of frameworks out there that do support that, though. Okay, just showing some results: this is LeNet on MNIST with the genetic algorithm applied. This is with the original hyperparameters chosen from the paper and the accuracy they reached, and here's our genetic search reaching that accuracy in a much shorter training time after finding the optimal hyperparameters.
Okay, next I'm going to jump over to Cray population-based training. Cray's population-based training implementation has a few extensions to DeepMind's original PBT; if you want more information, it's in the paper linked here. The main extension is that we use reproduction with a probabilistic multi-point crossover between three parents instead of the usual two.
There's also a redesign of the interface underway. The way you enable PBT today is that you still use a genetic optimizer — the underlying algorithm is still a genetic optimizer — but we do this early stopping throughout the genetic optimization. The key way to turn on PBT today is to enable a checkpoint file or checkpoint directory, by passing this checkpoint argument to the evaluator, and then you also need to include a checkpoint variable, specified by this @ symbol, in your command-line
flags to your training kernel. Your training kernel needs to take these flags, take these values, and say: I have a checkpoint directory with a model in it, I know I need to load from that one; and then take this other one and say: I have a checkpoint directory path to another model, I know I need to save to that one. The optimizer framework is going to be handing these paths to the evaluator, but the training kernel needs to know what to do with them.
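So the training kernel might handle that roughly like the sketch below; the flag names and the load/save helpers are illustrative assumptions about what gets handed to you, not the framework's exact contract:

```python
# train_model.py (PBT-style) -- hypothetical handling of checkpoint paths from the optimizer
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--load_ckpt", default="")   # checkpoint directory to resume from, if any
parser.add_argument("--save_ckpt", default="")   # checkpoint directory to save into
args = parser.parse_args()

model = build_model()                            # assumed to exist in your script
if args.load_ckpt and os.path.exists(args.load_ckpt):
    model.load_weights(args.load_ckpt)           # resume from the model the optimizer chose

loss = train_for_n_epochs(model, lr=args.lr)     # train the next segment (assumed helper)

if args.save_ckpt:
    model.save_weights(args.save_ckpt)           # save so the optimizer can copy/promote it

print(f"FoM: {loss}")
```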
And that's all you need to do to turn on PBT today. We are going to be moving to a different interface where we have a standalone PBT optimizer that you use instead. Just to show some data from our PBT implementation: here's ResNet-20 on the CIFAR-10 dataset, which is — I forget the exact number — a very large dataset of images with 10 different classes.
We're going to take the original ResNet-20 hyperparameters and try to optimize those. What you can see here is that with PBT we're discovering an improved training schedule over the original learning rate and weight decay. This black line here is the weight decay from the original paper, and this jagged,
let's see here, sorry, this jagged red line is the learning rate from the original paper. You can see that PBT, optimizing the hyperparameters as the model is being trained, finds a different, more optimal learning rate schedule, as well as a separate weight decay schedule, for training our model. I guess this slide doesn't show that it's actually better, but on the next slide you can see that.
Here is the error of the original ResNet-20 approaching convergence down here, and here it is with the PBT learning schedule that we found; you can see that we drop much quicker. There is a point where the curves cross, but this is optimized to find the best accuracy overall, and we do end up reducing the error over the original ResNet-20 by 11%. And this just highlights that you can do this in a distributed environment.
Okay, and then I'm just going to talk about some ongoing work with Cray AI, some of which I've already mentioned. We want to continue improving the features and stability, and we want to support more launchers than we do today; today we have Slurm, PBS, and Urika systems. We want to support Jupyter integration: today you can run Cray AI in a Jupyter notebook, but unfortunately, like any Python program that calls out to a non-Python program,
you have this problem in Jupyter notebooks where anything written to standard out or standard error from the non-Python program does not get piped forward to the output in Jupyter. There are some solutions to that which we're looking into. The result is that if you do go and pick this up and run it in a Jupyter notebook, it's going to look like it's totally unresponsive, because it will be running your training for a very long time, and then it'll finish at some point and
you'll get some output. We want to continue to implement new strategies: we have Bayesian on the way, and we'd like to implement some of the more modern approaches from recent developments. And of course I would like to open source it; we would like to open source it as a team. Just to give you a bigger-picture idea of what we're doing at Cray:
this is just one AI workflow component of many that we're planning to develop. Cray also has plans to develop an AI workflow framework where hyperparameter optimization is just one stage. So this is Cray AI HPO; our next target is feature selection, which may not be as important in deep learning, but we hope it will be something useful to machine learning workflows in general.
I guess I could do a quick demo. Let me — well, okay, let me do my acknowledgments, and before you clap I'll do a quick demo. Quick acknowledgments: I'd like to acknowledge some people from the AI team at Cray: Alex Heye; Aaron Vose, who was the original author of the Cray PBT; Alessandro, who contributed the Bayesian optimization; Benjamin Robbins, my manager; and Zach and the Chapel team, who made all this possible. And then Steven at NERSC for providing a lot of user feedback.
Okay, and I'm going to jump over to a quick live demo of this, and then we'll take questions. All right, so here's just the quick random-search example I showed earlier; here we're specifying our seed and our optimizer, and we're going to optimize this set of parameters. This example is kind of our hello world example.
We show it not because it's anything interesting, but because it shows results quickly; HPO of machine learning and deep learning models in general takes a lot of time, so it's nice to demo something that evaluates quickly. Here we're just creating a sixth-order polynomial and trying to fit it to a sine wave in the range of 0 to 100.
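A toy training kernel along those lines could look like this sketch; the coefficient flags, the sample range, and the "FoM:" tag are illustrative rather than the exact demo code:

```python
# poly_fit.py -- toy "training kernel": how well does a 6th-order polynomial match sin(x)?
import argparse
import numpy as np

parser = argparse.ArgumentParser()
for i in range(7):                                  # coefficients c0..c6 as hyperparameters
    parser.add_argument(f"--c{i}", type=float, default=0.0)
args = parser.parse_args()

coeffs = [getattr(args, f"c{i}") for i in range(6, -1, -1)]   # highest order first for polyval
x = np.linspace(0, 100, 1000)
error = np.mean((np.polyval(coeffs, x) - np.sin(x)) ** 2)     # mean squared error vs. the sine

print(f"FoM: {error}")    # figure of merit for the optimizer to minimize
```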
So we're just printing out our baseline hyperparameters and running random search with a hundred iterations. I just ran a very short example, but we print out the best hyperparameters we found; this was the figure of merit it evaluated to, which was 1.2 times better than the original set of hyperparameters — so not a huge improvement, but at least it did improve. And then we print out the figure of merit and the full set of hyperparameters down here. I'll show one more quick example, and then we'll
look at our results in the CSV file. So I'm going to run the genetic example, and while that's running I'll just point out that for each generation it prints out the global best: the identifier of the best individual and its figure of merit, as well as how much better it has done relative to the initial set of hyperparameters, so this one is 1.7 times better. It also shows the global average over all the hyperparameters that were evaluated.
We get this set of hyperparameters listed, and then we get some information about the breakdown of the demes, so you can track how your populations are progressing and the best set of hyperparameters per deme. Then we get some timing outputs: if you're writing some large checkpoint files, it can be important to track how much time it's taking you to write and read those and see if they become a bottleneck at some point. And then this should be done now. Oh, it's not.
Okay, and then we get our best set of hyperparameters: this one found a 4.9x, almost a 5x, improvement over the original set of hyperparameters. And then, lastly, it prints out these files, so you can go and print one of them; it's just a big CSV file with a bunch of data: all of your hyperparameter values, the fitness, the figure of merit, and so on. There's also a global file with global information on all of the evaluations.