From YouTube: 14. GPUs cannot be shared but GPUs must be shared
June 12, 2019 Jupyter Community Workshop talk by Richard Darst, Aalto University
Let's see, so my talk is called "GPUs must be shared, but they cannot be shared." I'll explain what this means in a bit. I'm from Aalto University, which is actually close to Helsinki. I work for what's basically our HPC group, but we do a lot of different things; no one really knows what each of us does. We have two JupyterHubs here: one for high-performance computing stuff, and also a JupyterHub for teaching, which is using Kubernetes and all that. So these two things are basically what you've been hearing about for the past few days.
So, okay: in Jupyter, it's easy to share CPUs and memory. Basically, you know, you have the head node CPU...

[Audience] Are you sharing your slides, or no?

I don't have any slides. Okay, yeah?
Okay. So it's easy to share CPUs and memory. CPUs, well, sharing is what processors have done for decades now; memory is basically the same thing. But from what we can tell, GPUs cannot be shared: GPUs don't have memory isolation. I should say I'm not a GPU expert here, so if someone can correct me, and everyone else here, that would be great. Because GPUs don't have good memory isolation, they can only be assigned to one person at a time.
So that's not good for JupyterHubs, because people will start up a notebook and start doing some stuff, and then it will work and work and work. You know, they do something, then they wait five minutes while they're debugging, then do something else, and so on, and it's just hugely inefficient. With processors and memory that's fine, but not for GPUs. So we don't have GPUs available in either of our JupyterHubs, yet it's our primary request.
People are always asking, you know: can we get GPUs? And we want to say yes, but we just don't know how to do it without wasting huge amounts of resources. So we were thinking, thinking, thinking, until eventually my colleague came up with this quote and realized: GPUs cannot be shared, but they must be shared. So that was not me; and you know, actually he's onto something here, so we can just, you know, think of this as something normal.
Maybe I should clarify: the traditional way of Jupyter sharing would be to, you know, spawn everything on a single node, restrict the number of people that can be on it, and give everyone a limit on processors or memory. And that's just it, you know: the amount people need is less than what's available, and it mostly works out, except when it doesn't, and then you just deal with it and kill someone or something, or respawn them in Slurm and give them dedicated resources.
Yeah, okay, so how can we share things? I was thinking we can share at the notebook level: basically, people have a notebook, and they go and test it and debug it, and there's some magic command or something for submitting the notebook to a batch queue. It runs on the GPU and then gives you the results back, and you see those results, and then you do debugging, and then you submit it again. When you're done, you can even save all the state from the notebook, get it back, and then play with the variables and stuff that comes out. Or you can share at the cell level: maybe have some code that ships the contents of a single cell to a batch queue like Slurm, or whatever it is.
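The cell-level idea could be sketched as a small helper that wraps a cell's source in a Slurm batch script and hands it to `sbatch`. This is only an illustrative sketch of the workflow being described, not an existing tool: the function names, the ten-minute time limit, and the GPU request are all assumptions.

```python
# Hypothetical sketch: ship one notebook cell's source to a Slurm batch
# queue. Function names and #SBATCH options are illustrative assumptions.
import subprocess
import tempfile

def make_batch_script(cell_source: str, gpus: int = 1) -> str:
    """Build a Slurm batch script that runs the cell's code under Python."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --gres=gpu:{gpus}",  # request a GPU from Slurm
        "#SBATCH --time=00:10:00",     # assumed short time limit
        "python - <<'EOF'",            # feed the cell's code to Python
        cell_source,
        "EOF",
        "",
    ])

def submit_cell(cell_source: str) -> None:
    """Write the script to a temp file and hand it to sbatch (if available)."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(make_batch_script(cell_source))
    subprocess.run(["sbatch", f.name], check=True)
```

In a real setup this would presumably be registered as an IPython cell magic, so that putting something like `%%sbatch` at the top of a cell ships its body to the queue and polls for the results.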
Yeah, okay! Well, if it works, let's go on. So, yeah, I heard a rumor that someone's developed a thing which can take a GPU device, add it to a container, allow it to be used, and then remove it from the container dynamically while it's running.
yes,
my
question
is
basically
how
can
we
share
GPUs?
Which
of
these
do,
you
think,
will
work
so
I
started
working
on
something
called
notebook
script
which
works
on
the
notebook
based
sharing,
where
it
would
basically
run
the
notebook
like
a
shell
script
or
something
like
that.
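Running a notebook top-to-bottom like a script, the way that's described, can be approximated with nbconvert's `--execute` mode. This is just a sketch of the general approach, not the tool mentioned in the talk, and the `.out.ipynb` output-naming convention is an assumption.

```python
# Sketch: execute a notebook headlessly, like a shell script, using
# jupyter nbconvert's --execute mode. The "<name>.out.ipynb" output
# naming is an illustrative convention, not a standard.
import subprocess
from pathlib import Path

def nbconvert_command(path: str) -> list:
    """Build the command line that executes the notebook non-interactively."""
    out = Path(path).with_suffix(".out.ipynb")
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",   # keep .ipynb format, with outputs filled in
        "--execute",          # actually run every cell
        "--output", out.name,
        path,
    ]

def run_notebook(path: str) -> None:
    """Run the notebook; the executed copy lands next to the original."""
    subprocess.run(nbconvert_command(path), check=True)
```

A command like this is also what you would put inside a Slurm batch script, so the same notebook works interactively in JupyterHub and as a batch job.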
A
We
just
saw
the
lightening
talk
about
taking
single
cells
data
and
sending
it
to
a
remote
site
to
execute
and
come
back.
Also,
one
of
my
colleagues
has
made
something
which
tries
to
see
it
basically
do
what
I
said.
I
haven't
actually
tested
it
myself,
but
it
would
be
trivial
to
make
this
work
with
slurm.
If
it
does
what
he
says,
it
does
The
Container,
think
I,
don't
know
so
any
ideas.
Any
thoughts,
what
do
you
think.
[Audience] Well, the problem... you just buy tons of GPUs for yourself, or have the users pay for what they want to keep a lock on. But you know, there has to be a better way.
[Audience] So, yeah, bouncing off that: we have a little bit of this, mostly with virtual machines, where we pass GPU cards, the actual physical hardware, through to VMs, and we can bring up machines with them. But what you were talking about, dynamically adding and removing GPUs, would be harder to build; I don't think I know of anything that does that.
And really, I'm not convinced that this dynamic adding and removing of GPUs will even work in practice, because everyone would have to train their code to release the GPU whenever it's done, basically at the cell level: you run a cell, claim the GPU, and release it. And I don't really believe, or know if I should believe, that people will make code that's well-behaved enough for that.
Yeah, so my basic goal right now is this scripting thing: basically, train users to make their notebook a self-contained script. So you can use it interactively for debugging, but it can equally well be submitted to batch to run, which is going to be good even without GPUs, but we can extend it into providing GPUs here also.