From YouTube: 11. Parsl: Pervasive Parallel Programming in Python
Description
June 12, 2019 Jupyter Community Workshop talk by Daniel S. Katz, University of Illinois Urbana-Champaign
Yeah, okay, very exciting. Hi, I'm Dan Katz from the University of Illinois. I should, I guess, start by saying that I feel a little bit like an imposter here, because I'm not actually developing Jupyter or developing JupyterHub. But what we're doing is developing a system that works for Python, and I think some of the things it lets users do, in terms of launching jobs on supercomputers, interacting with those jobs, and moving data around, overlap fairly well.
The co-PIs, who are probably at the back, actually do a lot of the hands-on work; I do the architecture and work with user groups and try to make sure everything looks good.
Okay. So the reason that we're doing this is that software is increasingly assembled rather than written.
So Parsl has been written from the point of view of trying to express parallelism in such a way that programs can say what pieces could be run in parallel, and then at execution time we actually figure out how best to run those on the resources that are available.
So Parsl then has this idea that you define apps. That means that they return a future instead of returning a result, and the fact that they return a future means that you can go on without that result actually having been satisfied; you can run these in parallel, then, if the data dependencies allow that. And the other thing that's nice is that, in general, we try not to say anything about resources at this level. So these functions could run within the notebook itself, or they could run on remote resources; we don't say that at this point. We have a different way of saying that later on.
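To make that concrete, here's a minimal sketch of what defining and calling a Parsl app looks like (the function itself is just illustrative):

```python
import parsl
from parsl import python_app
from parsl.configs.local_threads import config  # run tasks in the notebook process

parsl.load(config)

@python_app
def double(x):
    # An ordinary Python function; the decorator turns each call into a task.
    return 2 * x

# Calling the app returns an AppFuture immediately instead of a result,
# so the notebook can keep going without waiting.
future = double(21)

print(future.result())  # .result() blocks until the task has run -> 42
```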
Oh, and this is all a standard Python install, everything normal: open source, on GitHub actually, contributors welcome, all that good stuff. And you can also try this on Binder.
If you want to, it's available and ready to go at parsl-project.org, and there's a "Try Parsl" link down at the bottom. I think most of the stuff I'm going to show is in tutorials that are on there. There's one thing I'll show, a multi-site execution, that's not, but we can make that available if anybody's interested in it.
So, just as one example of how you might do this: you have something like this, and that basically says that all these things can run in parallel, and this piece is going to have to wait until all those things finish. You don't have to actually say that within your code; that happens internally. Oh, and I guess I should have said, which is probably obvious, that all this stuff, because it's Python, runs inside of a notebook, which I guess is the connection here.
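As a sketch of that pattern (with made-up function names), the waiting is expressed implicitly by passing futures into the next call:

```python
from parsl import python_app

@python_app
def simulate(seed):
    # Stand-in for some independent piece of work.
    import random
    random.seed(seed)
    return random.random()

@python_app
def summarize(inputs=()):
    # Parsl waits for every future in `inputs` before running this task,
    # and passes in the resolved values.
    return sum(inputs) / len(inputs)

# All of these can run in parallel...
sims = [simulate(i) for i in range(10)]

# ...and this call implicitly waits until they have all finished.
print(summarize(inputs=sims).result())
```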
We have our code, and we have some calls; those calls, as we've seen, are to functions that are decorated. Those decorators basically tell us that here's a task that we should put into a DAG, a directed acyclic graph, and so we're building a DAG as we go along. We keep building the DAG as things happen, and as we can, we start to run tasks on some resources. As those things finish, they then let other parts of the DAG run, and other parts of the DAG also may run because we have gone through more of the code and pulled out more tasks. So that's basically the model of how this works. This is all execution-provider independent, so we can run on grids, clouds, supercomputers.
I don't actually want to get into a lot of the detail here; this is probably more detail than I should have gone into, but this is all, I think, fairly well documented. There's a Read the Docs page that talks about this and gives a little four-step way of setting up a configuration for a system.
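For instance, a minimal configuration for a Slurm-based cluster might look roughly like this (the partition name, node counts, and walltime are placeholders):

```python
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label='cluster_htex',
            provider=SlurmProvider(
                partition='normal',       # site-specific queue name
                nodes_per_block=2,        # nodes per batch job
                init_blocks=1,            # batch jobs to submit at start
                launcher=SrunLauncher(),  # how workers are launched on nodes
                walltime='01:00:00',
            ),
        )
    ]
)
parsl.load(config)
```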
You can say that particular apps want to run on specific resources, or, if you don't say that, an app will run on whatever resources seem to be the right thing available. So this gives you interactive supercomputing in Jupyter notebooks, because inside the notebook you're launching things that are going onto supercomputers. You're not doing this from the notebook level, you're not doing this from the JupyterLab level; you're doing this from the library. The Parsl library is capturing this and making it happen.
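Pinning an app to particular resources is just an argument to the decorator; leaving it off means the app can run on any available executor (the label below matches the hypothetical config above):

```python
from parsl import python_app

@python_app(executors=['cluster_htex'])
def heavy_analysis(x):
    # Runs only on the executor labeled 'cluster_htex'.
    return x ** 2

@python_app
def light_task(x):
    # No executors argument: runs wherever Parsl decides is right.
    return x + 1
```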
So one of the challenges, then, is authentication and authorization. We're using Globus Auth to do this right now, and this accesses Globus and other services. I don't know that there's a whole lot I can say about this; it's not really my area of things, but if anybody's interested in the details, I can certainly put you in touch with the people who are doing it. I think the key part here is that this is basically just using Globus's methods, and it works.
We also have transparent, wide-area data management. We have a File class, and so we can have a globus://something file, and when we run that on a remote system we stage the file into the remote system, work on it, and can stage it back if that's the right thing, depending on where the output should be. That all happens in the background using Globus transfer; we can also use HTTP and FTP for data staging if we need to.
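A hedged sketch of what that looks like (the endpoint UUIDs and paths are placeholders):

```python
from parsl import python_app
from parsl.data_provider.files import File

@python_app
def process(inputs=(), outputs=()):
    # By the time this runs, inputs have been staged to the execution site;
    # writing to outputs[0].filepath lets Parsl stage the result back.
    with open(inputs[0].filepath) as src, open(outputs[0].filepath, 'w') as dst:
        dst.write(src.read().upper())

in_file = File('globus://<endpoint-uuid>/path/to/input.txt')
out_file = File('globus://<endpoint-uuid>/path/to/output.txt')
fut = process(inputs=[in_file], outputs=[out_file])
```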
In addition to Globus, for the DOE people that are here, there's this activity going on called DCDE, which is a group that's trying to look at best practices and challenges in federating distributed computing and data systems across different platforms in the DOE. It's working towards a pilot, it's using OAuth and working with Globus, and there's a test deployment at Brookhaven. We're working as part of this.
So, just to give an example of what multi-site execution potentially means here: we can have a Parsl configuration that says we're going to run on two sites. When you actually load that configuration, it creates SSH channels; those SSH channels are used to deploy an interchange process onto the login nodes, and then to submit pilot jobs onto the compute nodes that then connect back to that interchange. Parsl submits tasks to the interchange, and Parsl uses Globus to stage data. So, sorry, this slide wasn't exactly animated right, but that's basically the order in which Parsl opens things.
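In configuration terms, multi-site execution is essentially a list of executors, each with its own SSH channel and provider; a sketch with placeholder hostnames and queue names:

```python
from parsl.channels import SSHChannel
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label='site_a',
            provider=SlurmProvider(
                partition='queue_a',
                channel=SSHChannel(hostname='login.site-a.example.edu'),
                nodes_per_block=1,
                launcher=SrunLauncher(),
            ),
        ),
        HighThroughputExecutor(
            label='site_b',
            provider=SlurmProvider(
                partition='queue_b',
                channel=SSHChannel(hostname='login.site-b.example.edu'),
                nodes_per_block=1,
                launcher=SrunLauncher(),
            ),
        ),
    ]
)
```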
Okay, so the other thing that's interesting, or one of the other things that we're doing, is that we recognize there's a bunch of different kinds of parallel applications that people are running. There are things that are high throughput, like the CERN work or like protein docking, where we've got something like thousands of tasks and hundreds of nodes, and we're really concerned about reliability and usability and elasticity on clouds and monitoring, and things like that.
We have extreme-scale workloads, where we're running on a single HPC system and we want to run on hundreds of thousands of cores, or maybe millions of cores, and we're really more worried about capacity. And then we have interactive and real-time workloads, where we're really more interested in rapid response. We have applications of all these types, and we realized that we couldn't really build a single execution environment suited to all of them as well as we would like. So, rather than doing that, we have three different execution environments right now, and you choose the one that matches the workload that you have. One of them is the high-throughput executor, which is designed for ease of use and for supporting clusters and clouds and fault tolerance: fewer than 2,000 nodes, something like 60,000 workers, maybe a million tasks, and task durations on the order of one hundredth of a second kind of work.
Okay, below that it doesn't work very well; above that it's fine, obviously. Then there's the extreme-scale executor, which is distributed MPI underneath, with a manager rank that communicates the workload to the other worker ranks. That goes above maybe a million tasks and 30,000 workers; tasks there have to be longer than a minute in order for this to work efficiently, and I'll show some more results on another slide. And then there's the low-latency executor, which basically doesn't do any fault tolerance and doesn't do a lot of other stuff, but works really quickly.
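In code, the choice is just which executor class goes into the config; a sketch, assuming the executor class names from the Parsl releases of that era (the regime comments paraphrase the slide):

```python
# High-throughput: ease of use, clusters/clouds, fault tolerance;
# up to ~2,000 nodes / ~60,000 workers, tasks down to ~0.01 s.
from parsl.executors import HighThroughputExecutor

# Extreme-scale: MPI-based, a manager rank distributing work;
# beyond ~1M tasks / ~30,000 workers, tasks longer than ~1 minute.
from parsl.executors import ExtremeScaleExecutor

# Low-latency: minimal fault tolerance and features, fastest dispatch.
from parsl.executors import LowLatencyExecutor
```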
These frameworks aren't all designed to do the same things, so this isn't exactly a fair comparison in some cases, but Parsl scales relatively well, while some of the others have good scaling only at somewhat smaller sizes, and that's part of the reason people choose Parsl. In terms of strong scaling results, FireWorks is in a fixed-worker mode here and is pretty bad, which is fine, it makes sense given its design. IPyParallel is also really bad.
Dask is pretty good for a while, and then it starts getting to be a little bit bad, while we continue to be good a little bit further out. Again, this is a little bit unfair, because it's certainly possible that somebody who's a Dask expert or a Dask developer could make this work better, but this is just kind of what we did with the frameworks we could use.
The size of machine in some of our testing was actually just the size of the allocation we could get and use, with a reasonable allocation that we had from the providers. Some of this was on Blue Waters, where we were using a materials science researcher's allocation, and he said you can use this much to do this test. So, I don't know, some of these things are limits of scale, and some are limits where we just haven't tried bigger.
Yeah, so, well, yes and no. Sorry, let me come back to that at the end; I can show you something that kind of answers what you're asking, but it's not exactly what you want, I think. So let me go on, though, for the minute. The other things that Parsl is doing, or many of the things Parsl does: it does resource abstraction, which I kind of talked about a little bit.
It does fault tolerance in the sense of retries, which I haven't talked about at all, and we do checkpointing and memoization. We basically do a hash of the function and a hash of the inputs, and if you have this turned on and you call the same function with the same inputs, we just return the answer rather than actually running the job again. Checkpointing basically does the same thing.
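A minimal sketch of turning both on (the checkpoint mode shown is one of several Parsl offers):

```python
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

parsl.load(Config(
    executors=[ThreadPoolExecutor()],
    checkpoint_mode='task_exit',  # write a checkpoint as each task finishes
))

@python_app(cache=True)  # memoize on a hash of the function plus its inputs
def expensive(x):
    return x ** 2

a = expensive(4).result()  # actually runs the task
b = expensive(4).result()  # same function, same inputs: cached answer returned
```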
I don't think it's a separate file; it's in the notebook, and I don't actually know how that's getting stored, so I'm not sure. That's a good question.
On elasticity: we're not really talking about clouds here, but we do have elasticity working, so you can increase and decrease the amount of cloud resources you have as you need them.
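The elasticity is expressed through block limits on the provider; a sketch using the AWS provider, with placeholder values:

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import AWSProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label='cloud',
            provider=AWSProvider(
                image_id='ami-xxxxxxxx',  # placeholder machine image
                region='us-east-1',
                init_blocks=1,   # instances to start with
                min_blocks=0,    # scale down to nothing when idle
                max_blocks=10,   # scale up under load
            ),
        )
    ]
)
```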
I didn't talk about monitoring at all. We have some relatively primitive monitoring, so you can see what's going on. One of the groups we're working with is the DESC collaboration.
That's one of the LSST groups, and they were very insistent on monitoring, and we've done some things for them in particular. They were one of the groups using Theta and Cori, and so we have some monitoring that is okay: it gives you some limited information, and it was enough for them to be satisfied with, but it certainly could be improved, and that's one of the areas where we're trying to improve with some student projects. Currently it's basically tasks and their utilization, like memory and disk and other things.
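Turning the monitoring on is also a configuration option; a sketch, assuming the MonitoringHub interface:

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.monitoring.monitoring import MonitoringHub

config = Config(
    executors=[HighThroughputExecutor(label='htex')],
    monitoring=MonitoringHub(
        hub_address='127.0.0.1',          # address workers report back to
        resource_monitoring_interval=10,  # seconds between resource samples
    ),
)
```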
Globus I mentioned, data management I mentioned, containers I didn't really mention. Jupyter integration is kind of obvious, and reproducibility and provenance are something we don't do very well at, but we do have some logging and some provenance, and this is another area that we could certainly improve a lot more than we've done at this point. Right. So we're doing this in a bunch of different applications, and I don't really want to go into these in great detail because of time, but again, it's kind of a mix of machine learning, big simulation, and interactive jobs.
From the Read the Docs discussion this morning: we do environment configuration, and execution environments are configured through Python, and we have examples of how to do that. It's based on where the app tasks execute, which executor you're using, where the main Parsl program executes, and which provider and which launcher you use. So this is really complicated, and it's kind of a pain; I don't really know of a better way to do this, so we have examples.
A
System
contains
setup,
mostly
work
except
then
we
also
have
to
say
that
the
users
have
to
actually
customize
those
that
can't
just
use
even
that.
So
this
is
kind
of
a
painful
thing,
so
so
just
to
mention
a
couple
things,
and
if
we
get
into
this,
that's
workflow
discussion
this
afternoon,
which
maybe
we'll
get
through
this
afternoon.
Rather than users having to understand all these systems separately, endpoints would be maintained, ideally, by the site people; they register those, and then people who use them can just pull them out. I was trying to figure out, when I was writing this yesterday, whether they do this; I'm not completely sure, and I don't quite think they do. I think they're doing something a little different, but maybe you could use this if we have it; again, I'm not 100% sure. The other part was describing the applications themselves.
There would be methods for this, so we could say, all right, we want to run this app, and in there the people that developed the app tell us what its inputs are and what its outputs are, and we can just bring this in, in a machine-readable way, rather than the user who's using Parsl having to figure this out for themselves.
A
So
that's
another
thing
that
I
would
be
interested
in
talking
about
if
anybody
else's
I'm,
so
over
parcels
providing
what
we
think
a
simple,
safe
scale:
flexible
parallelism
in
Python,
minimal,
new
concepts
works
in
Jupiter,
flexible,
it's
again
open
source
on
github.
Actually,
I
was
kind
of
happy
yesterday.
we're moving up quite quickly; people were really excited when we got to a hundred, and there was a tweet, that's fun. And so we're moving towards version one. I don't know when we're going to get to version one; it might be six months, it might be a year, but we're on the way there. So, right, and oh, sorry, the other thing is: if anybody is going to HPDC in a couple of weeks.