Description
More about this lecture: https://dl4sci-school.lbl.gov/richard-liaw
Deep Learning for Science School: https://dl4sci-school.lbl.gov/agenda
A
Thank you, Richard. There are a few questions in the Q&A, if you would like to take some of those live.
B
How about I just read you one right now: can Ray Tune also spit out the posterior of the hyperparameters, and the posterior predictive of the neural net using those hyperparameters?
B
Yeah, so this is a great question. I think the main focus of Ray Tune is to provide an execution framework, and again, as mentioned in the talk, Ray Tune integrates with all sorts of different optimization libraries, specifically these model-based optimization libraries.
B
These model-based optimization libraries typically build a predictive model over the hyperparameter space, and those models can typically be queried for posterior values. So, to answer this question: it depends on the particular library you're using for optimization.
B
Yeah, I can actually go through these myself. So, let's see, one question was: how can Slurm be used with Ray, and what is involved in specifying available resources to Ray? This is a great question. Essentially, you request Slurm to provide multiple nodes, and what you need to do is just start the Ray service on top of each one of them.
B
We have documentation about how to run Ray with Slurm, and I understand Mustafa has this as well. I personally don't use Slurm, I'm more based on the cloud, but there are many users that have gotten this working, and if you run into questions or issues running the examples in the documentation, feel free to reach out on the Slack.
A
Yeah, so we also have, as you mentioned, the NERSC Slurm repo, with examples of how to actually build a Ray cluster with Slurm. So I can refer you to that on Slack, if you send me a message on the lecture channel.
B
So
another
question
about
slurm
was
there's
a
time
limit
on
jobs.
How
does
ray
handle
that
property
for
restarting
of
slurring
jobs
right
so
right
now
we
don't
have
automatic.
Restarting,
however,
tune
has
automatic
checkpointing,
so
it
can
essentially
allow
you
to
in
certain
configurations
you
can
restart
the
job
from
exactly
where
you
left
off.
B
So I guess a more general question for Ray: does it handle node failures, like what happens when a node goes down? Typically Ray is fault tolerant. That means that if one of the worker nodes goes down, the Ray job continues to run.
B
However, if the entire cluster goes down, it becomes a little bit more difficult.
B
All right, let's see. Oh: as a rule of thumb, which hyperparameters often give the biggest bang for the buck to tune? So this is a great question, and I was actually pretty surprised when I saw one of the importance plots from a typical hyperparameter tuning run of mine. I think, by default, the most important hyperparameter is learning rate.
B
I
think
the
important
like
whatever
metric
was
roughly
like
70
of
the
performance
across
all
the
all.
The
different
parameters
that
I
was
sweeping
over
was
attributed
to
the
ring
rate,
and
it's
been
over
and
over
again
that
learning
rate
tends
to
be
what
really
decides,
how
how
well
your
your
mo
deep
learning
model
is
performing
so
yeah
and
I
guess,
for
random
forest.
You
probably
end
up
with
like
the
size
of
the
estimators
and
the
number
of
estimators
and
the
the
depth
of
the
tree.
B
Those typically are the most important hyperparameters.
B
But since there are so many machine learning models, I can only speak about probably the two most general ones. Let's see.
B
Yes, so one of the attendees asked: you mentioned Ray Tune was made with deep learning in mind, so does it work well with other machine learning models? And the answer is yes, you can.
B
You
can
essentially
provide
any
deep
learning
model
or
sorry
any
machine
learning
model
anything
that
actually
just
returns
a
a
sort
of
objective
function
and
tune
essentially
allows
you
to
orchestrate
and
execute
a
essentially
an
optimization
process
over
over
this
object.
This
objective
function
providing
a
python
object.
B
So,
oh
one
question
was,
or
I
guess
two
questions
were:
how
do
you
specify
the
algorithm
for
tuning
array
tune,
and
richard
mentioned
that
there
was
built
in
support
for
scikit
optimized?
I
was
wondering
how
this
is
specified
so
right.
So
this
is
a
great
question.
I
think
where
you
want
to
go,
is
you
want
to
go
to
the
documentation
page?
So
actually,
let
me
just
quickly
do
a
walk
through
the
documentation
page
that
maybe
that
would
be
helpful.
B
Right, so here we have docs.ray.io.
B
Right, so here's the Ray documentation, or maybe I should make it a lot larger. If you go down to tutorials and guides, there's a quick walkthrough of all the concepts that you might want to know, and what the attendee is asking about specifically, how to choose the optimizer, is what we call a search algorithm. So, for example, here we're using a wrapper around a popular library called hyperopt, and if we want to actually use this, you would do it similar to what's presented on the screen.
B
It's just a one-line extension of the Tune execution function.
B
Now, if you wanted to look at all the different algorithms that are provided for you, here's a list of all the different integrations that you can choose with Ray Tune, and each of them has its own documentation and different features that you can interact with.
B
Right, so another question was: how does Ray communicate across nodes? Ray does not use files; it opens sockets between the different nodes and communicates mainly through TCP. Ray does not use MPI underneath the hood, but Ray instances are able to communicate with each other.
B
Yeah, another question, I guess more specifically, was about the distributed aspect of Ray Tune. So specifically, how Ray Tune works is: we set up the Ray cluster underneath, on Slurm, and then on top of the Ray cluster...
B
You can execute your Ray Tune tuning run. Ray provides a really simple abstraction for creating actors, which you can think of as distributed Python objects, and you can interact with these distributed Python objects through the Ray API. Specifically for Tune: Tune essentially constructs a bunch of these different objects, which get placed across the Ray cluster.
B
These are the different actors, and you can communicate with the actors to retrieve the most recent training result or, say, change the hyperparameters on that particular object, and this allows us to easily implement population-based training and Bayesian optimization. So, again, to answer the question of how communication is handled between nodes in Ray Tune: the short answer is, using the actor framework provided by Ray.
B
All right, so now there are a couple of questions on the notebook, and perhaps I'll just do a walkthrough of a notebook on Colab. I guess we'll do the TensorFlow Colab notebook, and let me also just quickly post the...
B
So yeah, here it is: how do I post...
B
You don't change any of the code; you add one line of code, and there are tutorials telling you which line of code that is. All you need to do is provide an underlying Ray cluster, and the same code that you used for tuning on a single node can then be scaled across multiple nodes. So where I will go first is to the Tune tutorials, and specifically to this section down below, which is the Colab exercises.
B
It's a very simplistic example, but it overviews all the core features that you might be interested in using. So the first thing I'll do is uncomment the first section, which installs dependencies on Colab.
B
So, as a quick walkthrough: this tutorial will cover the process of visualizing the data, so you understand what we're working with; creating a neural network, similar to something you saw last week by this time, using TensorFlow; tuning this provided model by using Ray Tune; and analyzing the model by using some of Ray Tune's analysis objects.
B
And here we have a couple of different flower characteristics, and you can see that a couple of these characteristics are more representative, or allow you to separate the flowers better. So there are three different flowers, they all have different characteristics, and some of the characteristics are more telling.
B
It's a function that creates a neural network. Keep in mind that we're not actually instantiating this neural network; we're just defining it inside the function. This is important because, in order to communicate across nodes, Ray depends on serialization, and oftentimes machine learning models have trouble being serialized. To serialize essentially means that you are able to capture the model in a byte representation, transfer it across the network, and reconstruct that byte representation into a neural network.
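The serialization round trip described here can be sketched in plain Python. Ray uses its own serializer under the hood; `pickle` just illustrates the byte-representation idea, and the model dict is a stand-in for a real network.

```python
import pickle

def model_creator():
    # Defining the model inside a function, as the notebook does, keeps
    # things easy to serialize: the function ships code, not live state.
    return {"weights": [0.1, 0.2]}

payload = pickle.dumps(model_creator())   # capture as bytes
restored = pickle.loads(payload)          # reconstruct on the far side
```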
B
So here's another function that we defined. First, it essentially trains the model, and there is a nice feature in Keras, or TensorFlow, called a callback, which is essentially a hook that gets invoked every iteration, every time you do an update. Here we have a callback that helps us checkpoint the model, so that we can preserve it and save it to use after the training process.
B
So let's just quickly check that this works; we should see an accuracy of about 0.368. So that was mildly interesting. Let's now go to how we might use Ray Tune, with a callback, with the Keras model. So here we define a simple callback.
B
It
literally
take
it's
essentially
one
tune
call
which
allows
us
to
report
the
training
function
or
the
the
training
output.
You
can
call
this
method
anywhere
within
the
training
function
that
this
this
callback
happens
to
be
part
of
the
model
which
happens
to
be
invoked
within
the
training
function.
B
So here are a couple of exercises, but I'm just going to quickly add this in. Essentially what I'm doing is porting the same code that we saw above to use Tune.
B
So what is going to happen now is that this function is going to be invoked many times, in parallel, across all the available cores on your computer. And again, if you're on a cluster, then this function is going to be invoked, you know, 100 times, if you had a hundred cores on your cluster. So we'll define that for now, and then the second step, after we've converted the training function to use Tune, is to define a hyperparameter space.
B
So what we're going to do specifically is define the learning rate to have a uniform distribution over the log space from 0.001 to 0.1, then set some model architecture parameters, and then we'll also specify the number of trials that we're going to evaluate.
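A log-uniform draw like the one described can be written out in plain Python. The notebook itself uses Tune's built-in sampler for this; the helper below just shows the underlying math of "uniform over the log space".

```python
import math
import random

def loguniform(low, high):
    # Sample uniformly in log space, then map back through exp: this is
    # what a log-uniform distribution over [low, high] means.
    return math.exp(random.uniform(math.log(low), math.log(high)))

# Draws concentrate evenly across orders of magnitude between 1e-3 and 1e-1.
samples = [loguniform(1e-3, 1e-1) for _ in range(1000)]
```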
B
So hopefully this works out of the box. You might see a couple of warning messages, but most of them are harmless and disappear after a while.
B
So what you see on the screen is a self-refreshing tabular format that tells you the current progress of the hyperparameter tuning. It also presents all the different configurations that you're using, so all the different hyperparameters that you're trying, in addition to the corresponding accuracy of each one. Since we have two CPUs, we're actually just evaluating two trials at once, and each one takes about three seconds to evaluate.
B
So as this is going, it's actually outputting a couple of files to this results directory, which you can configure, and you can then use this results directory to visualize outputs. There are also a couple of log files, such as some CSV formats, that you can parse yourself.
B
So now we're done with the hyperparameter tuning run, and we want to identify the best tuned model.
B
So specifically, what we're going to do is again create this data locally, and then we're going to plot it and see that this is our test data.
B
So what we're going to do now is take this analysis object that was returned from tune.run, and we're going to leverage a couple of its calls: its dataframe, and also its ability to give you the best log directory of a trial.
B
So I guess a bit of context here: we saw above that there was a directory called root/ray_results/... let's see, there's only one directory here, this one with "iris" in the name.
B
So this is the experiment log directory, but if you look inside, there are actually 20 different folders, one for each of the different trials, with the different hyperparameters, that we ran. Within any single one of these...
B
We
actually
can
see
that
there's
a
couple
of
different
files
that
we
can
get.
In
fact,
one
of
them
is
the
model
that
we
saved.
So
what
we're
going
to
do
now
is
we're
going
to
just
use.
The
analysis
object
that
we
got
from
tune.run
and
we're
going
to
obtain
the
best
log
directory
corresponding
to
this
particular
metric
minimizes,
so
the
best
meaning
the
minimum
one
and
this
validation
loss
again
was
provided
through
the
the
tune.
Callback.
B
And then, in comparison to the ground truth, again, we saw that this is perfect.
B
So hopefully this one works: what you can actually do is use TensorBoard within this Jupyter notebook to visualize your results too. So what we're going to do is point TensorBoard to the experiment directory, which allows us to visualize all the different trials at once, and hopefully this works. All right.
B
You
try
it
for
yourself,
there's
no
black
magic
here
and
and
another
nice
thing
is
about
the
visualizations.
You
can
also
click
the.
I
think
this
should
work,
but
I'm
not
totally
sure.
B
Yes, there we go. Right, so Tune automatically takes care of the hyperparameter visualization, which allows you to essentially track which metrics matter and how the metrics correspond to each other. So if we just filter out a couple of these extra metrics that Tune provides, what we see is what the mean accuracy corresponds to.
B
You'll,
see
that
there's
a
lot
of
variance
across
these
different,
dense
layers,
and
my
reading
on
this
is
that
there
might
be
an
inter
there
might
be
some
relationship
between
dentist,
one
and
dance
two,
but
most
importantly,
the
learning
rate
is
what
decides
the
the
performance
of
the
model
so
yeah.
So
that
was
a
just
a
quick
overview
of
how
you
might
use
tune
for
a
typical,
a
very,
very
easy
hyper
primary
tuning
configuration
for
a
hyperion
tuning
run.
B
So, Mustafa, what do you think? What should we do?
A
I think we have a lot of questions left, actually. If you would like to answer some of them, like one or two questions, that's good, and we can also post the questions on Slack, and then you can answer them later in your own time.
B
All right, how about I do this: there are 16 questions, I'll answer eight, and the rest of them we can do on Slack.
B
All right, okay. How many nodes are okay before distributed Bayesian optimization won't be effective? Oh, this is an interesting question, and I would say the correct response is to count it in terms of the number of parallel trials before distributed Bayesian optimization won't be effective. The number of parallel trials sort of corresponds to how many you're willing to do at once. So let's say, for example, you have a hundred different trials that you want to run...
B
You
know,
evaluate
100
different
trials
if
you
did
say
like
100,
parallel
trials
at
once,
and
you
had
or
you
had
you
know,
100
parallel
gpus
that
you
could
access
then
running
100
trials
at
once
will
not
allow
you
to
leverage
a
prior
information
to
guide
your
search.
B
And if you do something in between, like, let's say you have 20 GPUs and you want to evaluate 100 different trials, there's going to be delayed feedback. So if you run 20 at once, in parallel, only on the 21st trial will you actually be able to leverage some prior information.
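The 20-of-100 example above works out as follows, as a toy illustration with the trial counts taken from the answer:

```python
# With `parallelism` trials launched blind at the start, the surrogate
# model first has completed results to condition on at trial
# parallelism + 1.
parallelism, total_trials = 20, 100

blind_trials = list(range(1, parallelism + 1))
informed_trials = list(range(parallelism + 1, total_trials + 1))

first_informed = informed_trials[0]
```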
B
So in some sense, you don't lose that much, depending on what your configuration is, but I think the only thing to keep in mind is that there is a delay in feedback, and the first couple of runs are not going to be able to leverage the model that you're building up.
B
Let's see. So: does Bayesian optimization, and other advanced methods, work well for non-convex problems? My understanding is that deep learning, for the most part, unless you're deliberately trying to make it a convex problem, is non-convex, and we've seen Bayesian optimization work well for many deep learning models.
B
In general, how do you decide how many trials to conduct with a given hyperparameter optimization algorithm, to ensure that you haven't missed the most optimal regions? So I guess there's always this illusion of optimality that we get in hyperparameter tuning.
B
Essentially, say I had a dozen hyperparameters, and for each of them I want to evaluate three different values. Then essentially I have this massive grid of hyperparameters, and it's 3 to the power of 12. That means that if I really wanted to find the absolute optimum, it would take over 500,000 evaluations, because there's no absolute guarantee that any particular parameter value that you choose is going to be optimal.
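The arithmetic behind that estimate:

```python
# 12 hyperparameters with 3 candidate values each: the exhaustive grid
# has 3**12 combinations, i.e. over half a million evaluations.
n_values, n_params = 3, 12
grid_size = n_values ** n_params
```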
B
What
that
means
is
you'll
start
with
the
very
core
screen
search
and
then
you'll,
slowly
narrow
down
the
search
until
you've
kind
of
identified
and
how
to
have
a
good
understanding
of
how
each
of
the
what
are
the
most
important
hype
parameters
and
what
would
what
would
be
able
to
be
done
to
in
order
to
optimize
performance
and
typically
you'll,
see
that
the
hyper
parameter
tuning
methods
are
only
going
to
provide
you
a
small
boost
over
like
some
defaults,
and
it
might
be
smarter
to
step
back
and
reevaluate
how
you're
designing
your
model
instead
of
trying
to
spend
so
much
money
or
a
lot
of
time
like
finding
the
optimal
hybrid
parameters.
B
I would say, in terms of research, the most beneficial thing the hyperparameter tuning frameworks can provide is an understanding of the relationships that you've designed your model to have, so an understanding of the relationships between the hyperparameters. And that's why the parallel coordinates plot is incredibly important, and that's why people are still doing grid search.
B
Does
ray
support
conditional
interactions
between
hyper
parameters?
Yes,
it
depends
on
also
depends
on
the
hyperparameter
tuning
library
that
you're
using
you
typically
specify
a
search
space
within
the
hyperparameter
tuning,
or
you
specify
your
conditional
operators
within
the
hyperimagining
space.
For
example,
you
might
say
hey.
I
want
four
layers,
but
I
want
one
to
four
layers,
but
if
I
had
a
fourth
layer,
then
I
want
to
have
the
fourth
therapy
from
50
to
100
width.
B
But
then,
if
I
had
three
there's,
then
this,
like
fourth
value,
doesn't
really
matter
so
a
lot
of
hypergram
tuning,
optimization
libraries
allow
you
to
specify
a
search
space
that
that
can
express
this
and
tune
it
sort
of
agnostic
to.
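The conditional space in that example can be sketched in plain Python. A real specification would use the chosen search library's own conditional primitives (for example hyperopt's nested choices); the sampler below is purely illustrative.

```python
import random

def sample_architecture():
    # One to four layers; the width hyperparameter for a layer only
    # exists when that layer is actually present, so a "fourth width"
    # is never sampled for a three-layer network.
    n_layers = random.randint(1, 4)
    widths = [random.randint(50, 100) for _ in range(n_layers)]
    return {"n_layers": n_layers, "widths": widths}

configs = [sample_architecture() for _ in range(100)]
```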
B
What, if anything, does the population-based tuning approach do when changing categorical values? So: how does population-based training perturb hyperparameters during training? What's proposed in the paper is that you have two types of values; if you have a categorical hyperparameter, then you can specify a list of different categories that you can choose from, and you can resample from that list every time you do a perturbation.
B
So
so
I
think
the
caveat
here
is
that
these
particular
parameters
that
you
can
perturb
are
typically
not
model
architecture
parameters.
The
reason
is
because
you
can't
easily
retrain
or
you
can't
leverage
it's
hard
to
change
the
model
architecture
during
training.
So
a
lot
of
a
lot
of
people,
just
just
don't
do.
B
Let's see: have you tried this notebook on GPUs? When I do something similar on GPUs with TensorFlow 2, I usually have memory accumulation problems, as the GPU memory doesn't clear after each parameter point evaluation. Yeah, so this is a great question. This notebook does work on GPUs, as far as I know, though I guess I'm typically using PyTorch.
B
I
would
say
the
reason
why
I
have
a
reasonable
prior
as
to
why
why
tensorflow
2
and
this
particular
notebook
would
work
in
practice
with
gpus
is
because
each
reactor
is
is
terminated
after
the
trial
is
done.
So
the
reactor
is
again
this
distributed
object.
It
runs
on
a
separate
python
process
and
the
memory
allocation
for
a
gpu
is
assigned
to
a
particular
python
process.
B
When
that
python
process
dies,
it
frees
up
the
memory
used
by
used
on
the
gpu
and
therefore,
typically,
we
don't
see
memory
leakage
across
off
across
different
tuned
trials
and
across
different
hybrid
point
evaluations.
B
Yeah, I think I'd be happy to answer more questions on Slack. Let me just get to some of the earlier questions, just in case someone feels that theirs was missed.
B
Yeah, so: does Ray Tune implement the semi-parallelized version of Bayesian optimization? The answer is yes; you can specify the maximum concurrency, and you can also connect it to a cluster and it'll automatically scale up the Bayesian optimization for you. And I guess I sort of answered this other question, which was: can you still obtain optimal convergence if you tune hyperparameters individually? So, yeah, again:
B
Typically,
your
hyperparameters
have
a
biased
weighting
of
importance
and
you'll
want
to
sort
of
tune
like
the
main,
most
important
parameters
that
you
can,
that
you
can
find
and
and
sort
of
the
interdependent
relationships
between
hybrid
parameters
matter,
but
probably
to
a
lesser
extent
than
the
most
important
default
hyper
parameters
or
like
the
most
important
hypercameras,
such
as
learning
rate
or
momentum.
B
So,
typically,
what
I
would
do
is
I
would
try
to
identify
interdependence
by
by
using
the
sort
of
parallel
coordinate
plots
and
if
I
still
can't
provide
or
yeah
I
would
yeah.
That's
probably
what
I
would
do
and
then,
if
I
identify
something,
that's
particularly
interesting,
a
a
you
know:
interaction
between
hyper
parameters,
I'll
probably
run
another
grid,
search
over
over,
like
a
selected
evaluation
of
the
hybrid
hyperparameter
space.
Just
to
test
some
hypotheses
about
the
interactions
with
the
hyperparameters.
B
How
do
we
know
who
is
the
best
performer
in
pbt?
This
is
mainly
just
you.
You
can
identify
the
the
lowest
performing
model
or
the
best
performing
model,
and
that
particular
model
is
corresponds
to
a
sequence
of
perturbations
through
through
the
training.
So
it's
not
a
single
trial,
but
rather
it's
not
a
single
high
priority
evaluation
but
sequence
of
high
primary
evaluations
and,
typically,
what
you
can
do
is
you
can
track
them
attract
this
over
time.
B
So
just
so,
I
guess
in
practice,
convergence
guarantees
are
a
good
prior
for
whether
or
not
the
optimization
method
is
going
to
work
in
the
first
place,
or
it's
going
to
be
useful
in
the
first
place.
Yes,
many
of
these
hyper-camera
tuning
model,
the
models
that
you're
treating
for
hyperparameters,
are
non-convex
and
and
so
convergence
guarantees
with
hype.
Software
for
lays,
like
optimization
methods
that
have
convergence
guarantees
or
rate
guarantees,
they're,
typically
not
they're,
they're
good
prior
but
they're,
I
guess
they're
not
definitive,
and
that
you
won't
necessarily
converge.
B
But
yeah,
what
do
you
think.
A
Yes, that sounds good. So save the rest of the questions, and you can answer them when you have time, on Slack. Okay.
A
This was very pedagogical, actually, at so many levels. I also enjoyed the demo that you ran. I think we also had so many questions and a lot of engagement from the attendees.
A
So
thank
you
again
richard
and
thank
you
everyone
for
joining
the
second
week's
lecture.
I
just
want
to
remind
you
again
that
we
have
a
lecture
every
week.
We
might
have
a
break
in
the
middle
on
some
days,
and
so
please
join
us
next
week
for
the
deep
generative
models
talk
by
aditya
grover
from
stanford
university,
and
so
just
so
that
you
know
we,
you
have
a
slack.
A
If
you
don't
know
about
the
slack,
we
have
slack
that
you
can
join
through
here
and
you
can
continue
the
discussion
on
particular
lectures
on
the
specific
channel
for
their
lecture
and
also
we
do
link
to
these
slides
and
the
video
later.
So
you
find
a
link
to
the
video
here
and
read
your
slides,
for
example,
and
there's
also
all
the
recordings
will
be
available
on
youtube,
hopefully
in
one
to
two
days
max
after
the
lecture
thanks.