Description
July 11, 2019 Jupyter Community Workshop lightning talk by Shreyas Cholia, Lawrence Berkeley National Laboratory
This came out of an LHC detector project, where they wanted to do some distributed deep learning, using convolutional neural networks to classify some of these images coming out of the detector, so without getting into the details, yeah.
The idea is to be able to use Jupyter to solve a couple of these classes of problems. So why interactive distributed deep learning? I think for a lot of projects this is kind of the next frontier in terms of being able to enable scientific discovery. It typically takes a while to train these networks, and you're...
You know, doing a lot of tuning and figuring out all the parameters and hyperparameters that go into a model. There's a lot of brute-force scans and automated optimization, and then our batch HPC systems have their own wait times and slow iteration cycles. Combine this with the fact that a lot of the new deep learning frameworks are Python-based, things like Keras and TensorFlow, and I think using Jupyter notebooks, Jupyter as a whole, as the environment to manage these things made a lot of sense.
So for this demo: this was part of an LDRD, and we presented this work at the ISC interactive computing workshop. For the things to get all this to work together, we ended up using IPyParallel to manage the tasks on the back end, and we're using qgrid from Quantopian to render an interactive table that you could use to flip through, and you'll see...
...a couple of movies in a second. bqplot from Bloomberg was really useful for doing visualization, and then we wrote this little thing called Kale that would let you have fine-grained control over the tasks themselves. So if you wanted to issue starts and stops and change the parameters, Kale would just wrap your individual tasks and then you could basically control those through the service. All right, so here's a little bit about how we set all this stuff up.
So at NERSC we have a JupyterHub web server that basically lets you spin up a notebook. There are other talks that will go into a lot more detail on the various other ways you can do Jupyter notebooks at NERSC, but for this particular effort we're spinning up on the equivalent of a login node. So it's got a lot of memory, a lot of CPU, and it's a shared resource, but we can spin up a lot of notebooks on there. So you spin up the notebook server process on here, you start up...
...the kernel, which runs the IPyParallel client, and you bring up a bunch of back-end nodes. On the compute side we had a little magic called %ipcluster that would let you do that. You just give it a few parameters and it spins everything up for you, and it lets you set up all of these nodes, which you can then control using IPyParallel. And because we're using IPyParallel, that also gave us the ability to use MPI on the back end.
Dask is probably better supported these days, and if there's a way to do this in Dask moving forward, that would be interesting. So yeah, this is basically just a couple of screenshots of how we set this thing up: you just describe your job, pass it into this magic, and bring up an IPyParallel client, which connects to the cluster on the back end.
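The pattern being described, fanning tasks out from a notebook to a pool of workers and collecting results, can be sketched with the standard library. This is only an analogy, not the NERSC setup: IPyParallel's `Client` and `view.map` play the role that `executor.map` plays here, except the workers are compute nodes brought up by the `%ipcluster` magic rather than local threads.

```python
from concurrent.futures import ThreadPoolExecutor

def task(x):
    # Stand-in for a training or analysis task that would run on a
    # back-end worker in the real setup.
    return x * x

# Fan the tasks out to the workers and gather results in order.
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(task, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

With IPyParallel the shape of the code is similar: you obtain a view over the engines and map your task function across the inputs, with the scheduling handled for you.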
So it's just connecting to these workers and you're off and running. And so we did two kinds of things. There's this distributed training, which was basically just "go off and do the training," and for that we used a tool called Horovod, which is out of Uber. They basically give you a bunch of primitives to do distributed deep learning, and they actually use MPI under the covers, so if you look at their primitives you'll see things like hvd.rank and whatnot. So you can...
...actually, you know, combine this MPI world with a more deep-learning, training-model world. And then you'll notice that we could actually just use IPyParallel to start the workers and then use Horovod to do all the communication between those, and there was really no overhead in terms of the infrastructure. All right, so that was maybe not quite interactive; there you're mostly using the existing stuff.
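The hvd.rank primitive mentioned above is how each worker learns its identity so the dataset can be split across workers. A pure-Python sketch of that sharding idea (no Horovod or MPI required; `shard` is a hypothetical helper standing in for what `hvd.rank()` and `hvd.size()` enable):

```python
def shard(dataset, rank, size):
    """Return the slice of `dataset` owned by worker `rank` out of `size`
    workers: every size-th sample starting at `rank`."""
    return dataset[rank::size]

dataset = list(range(10))
size = 4  # pretend hvd.size() == 4
shards = [shard(dataset, r, size) for r in range(size)]

# Every sample lands on exactly one worker, with no duplicates.
assert sorted(x for sh in shards for x in sh) == dataset
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

In real Horovod training, each rank trains on its shard and the gradients are averaged across ranks with MPI collectives under the covers.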
The more interactive piece was for parameter optimization. This actually involves setting up these workers and then trying to optimize the hyperparameters across a bunch of different possible models that you're trying to use. And so what we're doing is running each task separately and then seeing which tasks are doing better. You can get the loss and the accuracy, and you can sort through what's going on, as you'll see from this short little movie we have here.
All right, so the idea here is that you're basically running this across a space of hyperparameters, which you can see down over here. You've got a bunch of different values that you're trying out, you can flip through and see which models are doing better and which ones are not, and you can sort based on these things. So if you want to sort for the best model based on validation loss, or accuracy, or loss, you can do that.
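The kind of record the interactive table displays, one row per hyperparameter combination with its metrics, sortable by validation loss, can be sketched like this. Everything here is hypothetical: `train` is a toy stand-in producing fake deterministic metrics, not the talk's actual training task.

```python
from itertools import product

def train(lr, batch_size):
    # Toy stand-in: pretend the model does best near lr = 0.01 and
    # small batches, so the "metrics" are deterministic fakes.
    val_loss = abs(lr - 0.01) * 10 + batch_size / 1000
    return {"lr": lr, "batch_size": batch_size,
            "val_loss": round(val_loss, 4)}

# Run one "task" per point in the hyperparameter grid.
grid = product([0.1, 0.01, 0.001], [32, 64])
results = [train(lr, bs) for lr, bs in grid]

# Sort the table by validation loss to find the most promising model.
best = sorted(results, key=lambda r: r["val_loss"])[0]
print(best)  # {'lr': 0.01, 'batch_size': 32, 'val_loss': 0.032}
```

In the demo, each of these tasks runs on its own back-end worker, and qgrid renders the `results` rows as a live, sortable table in the notebook.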
So it's a nice way of running stuff in real time, getting those results, and then actually being able to do things like stop and start things in case some are more promising than others. And I actually have a second movie here. Okay, yeah. So this is the second little video, where it's the same thing. This is...
...a little bit more of a toy problem, but here we're actually starting and stopping jobs. So here you're actually stopping something that didn't look promising, and you can go and do things like tweak the parameters that you're running against. You can also get resource monitoring under the covers, and...
...yeah, you can change the parameters that you pass, the hyperparameters that you're running the job with. So you're stopping a job, redefining those hyperparameters, and starting it up again. It's a nice way of doing sort of interactive training, and all the setup and the pre- and post-analysis happens as a Jupyter notebook, so it's not just a one-off widget thing: it actually fits into a larger workflow. All right, so we took the same approach with the National Center for Electron Microscopy, where we're looking at a bunch of these images.
There's a thing called py4DSTEM, which takes these two-dimensional images, and then you can explore each pixel in that 2D image, and that gives you another two dimensions, and that's where the 4D comes from. They had all of this in a Jupyter notebook to do their analyses, and then we basically just put these hooks on the back end and allowed them to spread their tasks across an HPC cluster, and we got a really nice speedup there.
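The "4D" layout described above, a 2D grid of scan positions where each pixel itself holds a 2D diffraction pattern, can be sketched with toy dimensions. This is just an illustration of the data shape py4DSTEM works with, not its actual API:

```python
# Tiny toy dimensions: a 2x2 scan grid, each pixel holding a 4x4
# diffraction pattern. Real datasets are vastly larger, which is why
# distributing the per-pixel analysis across a cluster pays off.
scan_h, scan_w, det_h, det_w = 2, 2, 4, 4

data = [[[[0.0] * det_w for _ in range(det_h)]
         for _ in range(scan_w)] for _ in range(scan_h)]

# Exploring one scan pixel gives you its full 2D diffraction pattern.
pattern = data[1][0]
assert len(pattern) == det_h and len(pattern[0]) == det_w
```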
All right, so this was an extra slide; since we actually have some time I'll kind of walk through some of this stuff, and it might feed into other topics later in this workshop. This is also work with Dan Allan. So we're talking about doing curated notebook environments, where the idea is that you can browse these curated notebook environments and clone them into the user's workspace with the appropriate conda environment, so you might have reproducible notebooks.
In some sense it's a lot like Binder, but it's in this HPC world where you don't have that kind of back end, and really all you're trying to do is copy a notebook over, send it off with the appropriate environment. You want to be able to look at things easily and then create a copy in your workspace with the appropriate kernel. I think that's kind of the request we've been getting from users, and I think we're still very much in the prototyping and experimenting phase of that.
But it'll be useful to see what other people are thinking in this space as well. We're also playing with papermill from Netflix to do this parameterized notebook thing, where people want to run against different datasets and they just want to capture everything as a notebook, but also capture the parameters, so we're playing around with that.
We've got a couple of JupyterLab extensions that we have some students looking at. The Slurm extension lets you manage things: you can basically bring up Slurm as an extension in JupyterLab and submit jobs, release them, kill them, do things like that.
I think somebody had a request for something like this on the Discourse. And then we also have a resource usage monitoring extension, which is basically the nbresuse thing that we talked about, with a couple of graphs that let you display that. All right, that's all I have.