From YouTube: Raphael Dussin 2020 05 11
A
All right, so good afternoon, everybody. I'm going to talk to you about a software stack that's becoming very popular. I'm going to talk about it in the context of MOM6, since this is the ocean working group, but it's something that you could use for a lot of different ocean and atmospheric models. So if you don't know about xarray, I hope you're going to join us and be part of the fun, and if you already know about it, well, I hope you might still learn a thing or two.
A
What I'm going to quickly describe is a bit of the Jupyter ecosystem, xarray, and zarr, and then what I want to spend a little more time on is the demonstration, because talking about Python packages and such is fun, but I think it's actually more useful to use them and see how they work for real.
A
So first, for those who are not familiar with the Jupyter ecosystem: the whole thing started with the IPython notebook and then became a fusion of Julia, Python, and R, and that's where the Jupyter name comes from. The idea is that you're going to use your web browser as an IDE, and this IDE is going to communicate with your interactive Python, Julia, or R session. That server side can be running either on your local machine or on a remote server.
A
That can be your analysis machine or that can be in the cloud. So that's the idea with the Jupyter notebook. JupyterLab is just a more functional IDE that came after the Jupyter notebook; that's what I recommend and that's what I use. JupyterHub is an addition on top of that which can run multiple Jupyter sessions for multiple people on your server, and at NCAR you have a good example.
A
So next, a really cool piece of software that's going to help us do a lot of science very easily is xarray. We're used to thinking about our data as N-dimensional arrays, and then manipulating things based on indexes into those arrays.
A
What xarray does is add labels to all those dimensions, which means that you don't have to think about "oh, what is the index of my time dimension or my spatial dimension" — xarray knows about that. xarray was inspired by the data model used in netCDF: a dataset is similar to a netCDF file, that is, a set of different arrays, but you can also build a dataset that spans multiple files.
A
So you don't have that one-file, one-dataset restriction. Your data arrays have labeled dimensions and labeled coordinates, and you can actually use those labels when you apply methods, which makes plotting or computing more high-level. You're not thinking about indexes anymore; you're thinking about dimensions. I've got two examples that are pretty simple. One is just a plot where I slice a piece of data and tell it: okay, show me that part — and I don't have to know or care how my dimensions or arrays are stored, or anything. Same thing if I want to compute a climatology: I just tell it to average over the time dimension and everything happens; I don't have to worry about it.
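A rough sketch of that label-based workflow — not code from the talk; the file name and the MOM6-style variable and dimension names (thetao, time, z_l, yh, xh) are placeholders:

    import xarray as xr

    ds = xr.open_dataset("ocean_monthly.nc")   # hypothetical file

    # Select by label, not by integer index: nearest grid point to 40N, 30W
    point = ds["thetao"].sel(yh=40.0, xh=-30.0, method="nearest")
    point.isel(z_l=0).plot()                   # surface time series there

    # Climatology: just name the dimension to average over
    clim = ds["thetao"].mean(dim="time")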
A
One thing that I'd like to point out, since we're working with ocean models, is that if you have multiple sets of coordinates — for example, in MOM6 you're going to have cell centers and cell corners — xarray doesn't know the relationship between those coordinates.
A
So that's where xgcm becomes very handy, because xgcm is going to add that knowledge to xarray, and with that you're going to be able to perform differentiation or interpolation operations on a staggered grid.
A
So next is Dask.
A
One thing that's important to understand is that xarray can work on top of either NumPy arrays or Dask arrays. NumPy arrays basically compute eagerly: once you type your operation in, it's going to execute it. Dask, on the other hand, is going to delay it: what Dask does is build a graph of the operations it should do, and wait until the last minute.
A
So only when you actually ask for a result does it actually perform the operations. You could see it as a lazy student who just makes a big to-do list and then waits until the last minute, when he has to present something, to do all the computation at once and then show you the plot.
A
So what's going to make the difference between using NumPy or Dask under the hood is the chunks argument. If chunks — which are basically the small bits of data — are specified, then we're going to go into Dask and lazy mode.
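A minimal sketch of that switch, assuming a generic netCDF file (names and chunk sizes are placeholders): without chunks you get eager NumPy arrays, with chunks you get lazy Dask arrays.

    import xarray as xr

    # Eager: variables load as NumPy arrays when accessed
    ds_numpy = xr.open_dataset("ocean_monthly.nc")

    # Lazy: the chunks argument switches the backing arrays to Dask
    ds_dask = xr.open_dataset("ocean_monthly.nc",
                              chunks={"time": 1, "z_l": 35})

    # Nothing is computed yet; this only extends the task graph
    sst_mean = ds_dask["thetao"].isel(z_l=0).mean(dim="time")
    result = sst_mean.compute()                # now the work happens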
A
When we are in Dask mode, we can define a cluster, and this is going to allow us to do multi-threaded operations that can leverage all the computing power that we have on our local machine; or you can launch Kubernetes clusters on the cloud; or you can submit to a job queue — so your SLURM or PBS jobs — with the dask-jobqueue package.
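As a sketch of those cluster options (the queue name and resource sizes below are made-up placeholders), a local cluster and a dask-jobqueue SLURM cluster could be set up roughly like this:

    from dask.distributed import Client, LocalCluster

    # Local cluster: use the cores on the analysis machine
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)          # the client also exposes the dashboard URL

    # Or, on an HPC scheduler, with the dask-jobqueue package:
    # from dask_jobqueue import SLURMCluster
    # cluster = SLURMCluster(queue="batch", cores=8, memory="32GB")
    # cluster.scale(jobs=4)
    # client = Client(cluster)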
A
And what's really cool about that is that, because we're splitting our computation into small bits that are easier for the computer to handle, we can actually work on datasets that wouldn't fit into memory. That's the OOC — out-of-core — computation that you might have heard of, or might hear about in the future.
A
So one other thing I was looking at is zarr. Zarr is one of those new formats that's also becoming popular and that's very useful for cloud storage, and we were interested in seeing if it would be a good solution for our needs. So, first question: why bother? Well, it has pretty good compression, so it's going to save us a lot of space, and that's why it's interesting. It was first designed for cloud object storage.
A
So whether or not it's a good solution for more traditional infrastructure is still something we're looking into. What we've seen is that how you cut your dataset into small files or small chunks matters: the rule of thumb is that chunks should be around 10 to 100 MB, and I'm going to make another demonstration on that later on. There can also be different types of store. The most common one is going to be the directory store, on your left.
A
Well, basically, every chunk is saved as a file, or as an object in the cloud. You can also have a zip store — for example, on the right, that zip file is actually just a zip with all the little chunks in it. So why did I get interested in the zip store? Well, basically, each chunk is one file, so if you take a big simulation with 3D monthly fields and you break them down into chunks —
A
well, that amounts to a lot of files. That's what I get here, and on our file system the number of inodes is something that we have to be careful about. That's why the zip store is actually pretty convenient, because it turns 26,000 files into one. Then come the performance differences, and last I checked, I tried to do the exact same computation with the zarr directory and zip stores, and the performance is similar — so no loss in performance from zipping the chunks. So it looks pretty good.
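A hedged sketch of writing the same chunked dataset to a zarr directory store and to a zip store (file names and chunk sizes are illustrative, not the ones used in the talk):

    import xarray as xr
    import zarr

    ds = xr.open_dataset("ocean_monthly.nc",
                         chunks={"time": 1, "z_l": 35})

    # Directory store: one small file per chunk on disk
    ds.to_zarr("ocean_monthly.zarr", mode="w")

    # Zip store: all the chunks packed into a single file, which is much
    # friendlier to file systems with inode limits
    store = zarr.ZipStore("ocean_monthly.zarr.zip", mode="w")
    ds.to_zarr(store)
    store.close()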
A
One caveat, though, is that the zip stores are not as commonly used as the directory store, so there are still some bugs that you can find, and we worked on fixing some of those.
A
So now, let's move to the demonstrations. First, I'd like to say that we have the MOM6 analysis cookbook, which is a community effort, and if you are interested in xarray and trying to work with xarray, there are a lot of examples there. If you want to contribute some diagnostics that you haven't found in the cookbook, then please submit a pull request, and we'll be very happy to have your contribution.
A
So, just to give you a little tour — this is not very big, is it? — these are different notebooks showing a little bit of everything, from setting up your Dask cluster and getting started, to time and space operations, and then we move into more advanced topics like horizontal remapping, or flooding, or doing a comparison with observations.
A
So we're also participating in the documentation for xgcm, which I'm going to talk about a little bit more in the demonstration. So let's get started with that. Let me make it a little bigger for you guys — is that all right for you?
A
Okay, nobody's complaining, so I'm going to go ahead. All right, so I'm going to load the dataset that I've got here locally, but the next cell is this exact same dataset on the THREDDS server, so you can play a little bit with it. A little warning, though: if you try to run it with the Dask cluster, that's not going to work, and if you try to run it without the Dask cluster, it's probably going to take a long time. So that's the limit on how reproducible that notebook is.
A
So I'm loading my dataset — it all happens instantly because nothing is actually loaded into memory except the metadata — and this is my dataset: some MOM6 half-degree data. Let's look at what I have: I've got some grid metrics and I've got velocities, temperature, and salinity. Okay, the first bad idea that I want to show is going to be an introduction to broadcasting.
A
So that's one example of what you shouldn't do, and that's where I'm going to introduce xgcm, and where xgcm is going to be very useful in our case. xgcm is going to tell xarray what the relationship is between those different variables.
A
So, for example, on my x axis I'm going to tell xgcm that xh is going to be the center and xq is going to be located to the right of xh, and the same thing for y. I can also do the same for z: in that case, I know that I have one more z_i than z_l, so I'm going to specify it as outer rather than inner, and I can also tell it that it's periodic in the X dimension.
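A minimal sketch of that grid definition with xgcm, assuming MOM6-style coordinate names (xh/xq, yh/yq, z_l/z_i); the exact arguments in the notebook may differ:

    from xgcm import Grid

    grid = Grid(
        ds,
        coords={
            "X": {"center": "xh", "right": "xq"},    # xq sits to the right of xh
            "Y": {"center": "yh", "right": "yq"},
            "Z": {"center": "z_l", "outer": "z_i"},  # z_i has one more point than z_l
        },
        periodic=["X"],                              # periodic in X
    )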
A
So with that, I can pretty simply create my temperature and my salinity on the U point by using an interpolation over the axis X. I could do the same for V, but in that case I don't really have a use for it, so let's scroll past it. Now my array has the right coordinates — it's on yh, xq — so that's my U point, so everything's fine.
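The interpolation step might look roughly like this (thetao/so are MOM6-style placeholder names; ds and grid come from the sketches above):

    # Interpolate tracers from the cell center (h point) to the U point along X
    thetao_u = grid.interp(ds["thetao"], axis="X")
    so_u = grid.interp(ds["so"], axis="X")

    print(thetao_u.dims)   # now on (time, z_l, yh, xq): the U point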
A
So now, let's say that I want to compute some potential density. I don't have an xarray-based function for it, but I know how to do it and I've got an old Python function for it, so I can use that function — but there's more than one way of doing it. I'm going to define that function the same way that I would have done with NumPy, and so the first thing you think is:
A
Oh, let's just apply my function to my data array. In this case, that works fine. Why? Because the operations that I'm doing are simple enough that it's not triggering eager computation. But if you're using functions that are a little bit too complex, this could trigger the computation, and if you trigger the computation, then it's going to build the whole dataset, and that's probably not what you want — and it might not even fit into memory.
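One common way to keep such a NumPy-style function lazy on Dask arrays is xr.apply_ufunc; the sketch below uses a toy linear equation of state — not the actual density function from the talk — applied to the U-point fields from the previous sketch:

    import numpy as np
    import xarray as xr

    def sigma0_numpy(temp, salt):
        """Toy potential density anomaly (kg/m^3), for illustration only."""
        rho = 1025.0 * (1.0 - 2e-4 * (temp - 10.0) + 8e-4 * (salt - 35.0))
        return rho - 1000.0

    sigma0_u = xr.apply_ufunc(
        sigma0_numpy, thetao_u, so_u,
        dask="parallelized",             # wrap the Dask chunks, stay lazy
        output_dtypes=[np.float64],
    )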
A
So now, let's say, okay, I'm interested in the Denmark Strait overflow. I'm going to take a slice along my coordinate to have a quick look at what my region looks like, and from there I can say: okay, I'm going to take 23.5 as the longitude where I'm going to cut, and I'm going to cut between those two latitudes.
A
So when I'm doing that, I'm actually selecting from the whole dataset, and I can show what my transport would look like — and that's how it is. Everything plots very quickly, because it only takes into memory the piece that you need. So all of that is really cool for prototyping some diagnostics. So I had an idea for a diagnostic.
A
It's actually not a good diagnostic, but I'm going to show it anyway, just for the sake of computing something. Let's say I want to take the transport in layers that are heavier than a certain density. I can very easily do that by taking my transport and applying the where function, with the condition that the density has to be more than 27.8, and then just summing over the vertical and the Y dimensions.
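Stitched together, that (admittedly rough) diagnostic might look like the sketch below; umo is a placeholder name for the zonal transport, sigma0_u comes from the previous sketch, and the longitude sign and latitude bounds are illustrative:

    # Section at a fixed longitude and a latitude band
    sec_transport = ds["umo"].sel(xq=-23.5, method="nearest").sel(yh=slice(64, 68))
    sec_sigma0 = sigma0_u.sel(xq=-23.5, method="nearest").sel(yh=slice(64, 68))

    # Keep only layers denser than 27.8, then sum over depth and Y
    overflow = sec_transport.where(sec_sigma0 > 27.8).sum(dim=["z_l", "yh"])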
A
And so creating that cluster is going to give me back a dashboard, so now I'm ready to run my computation. I'm just going to start it, and I can take a look at my dashboard to see what's actually going on, rather than just waiting and staring at a line. Now I can see what's going on — that's my dashboard in action. It should take only 30 seconds, but I'm afraid that maybe the meeting is slowing the computer down a little bit; it should be pretty fast.
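In code terms that step is roughly just asking for the result and watching the dashboard (client is the Client from the cluster sketch earlier; overflow is the lazy diagnostic built above):

    print(client.dashboard_link)            # open this URL to watch the workers

    overflow_values = overflow.compute()    # triggers the whole task graph
    overflow_values.plot()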
A
It's not accurate, because I'm using the mean velocity with the mean density, so there's a lot of flow that I'm actually not capturing, but I get my results. One of the things that I wanted to highlight is the importance of the chunks. I didn't talk about it much, but when I loaded my dataset I made a choice for my chunks — one time frame and thirty-five depth levels — which is basically 55 MB per chunk.
A
So I tried different combinations, and what you can see is that if you take chunks that are too big, you can really degrade the performance of your computation, because you're taking something that's so big it barely fits into memory, and the computer really struggles with it. I don't even know if that one managed to go through. And then there's a sweet spot: basically anything between 10 and 100 MB is where you get the best performance.
A
But then, if you make your chunks too small, you're also going to start degrading your performance. So here the second message is that it's very easy to degrade the performance of your computation, so always try to think about the chunks and what the optimal size is, and maybe do a couple of tests before you deploy something into production, to make sure that you're actually using Dask in its sweet spot.
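A hedged sketch of that kind of chunk-size experiment (the numbers are illustrative; the point is to aim for roughly 10-100 MB per chunk):

    import xarray as xr

    too_small = xr.open_dataset("ocean_monthly.nc", chunks={"time": 1, "z_l": 1})
    sweet     = xr.open_dataset("ocean_monthly.nc", chunks={"time": 1, "z_l": 35})
    too_big   = xr.open_dataset("ocean_monthly.nc", chunks={"time": 120})

    # Inspect the chunk layout before committing to it
    da = sweet["thetao"].data                      # the underlying Dask array
    print(da.chunksize)                            # shape of one chunk
    print(da.nbytes / da.npartitions / 1e6, "MB")  # average size per chunk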
A
So what I like with Jupyter is that Jupyter is going to give you the same experience whether you're running on your laptop, on the cloud, or on your HPC system, so it always feels familiar. It's very easy to prototype your analysis and then deploy it in a more production workflow, and that's something that is very useful.
A
xarray allows you to very easily write some high-level diagnostics, and xgcm is also a companion tool that is very useful: it adds all that staggered-grid awareness to xarray. Dask gives you a tool that allows you to do very performant parallel computation, but be careful with the chunking. Zarr also gives good compression, but chunking is also important there. And what we're trying to do is really contribute to community software, and not build a one-size-fits-all package that does everything we think you should be doing.
A
Instead of doing that, we just contribute to tools that already exist — teaching you to fish instead of giving you a fish, kind of. So that's all for me.
C
Yeah, sorry — I was just trying to find the button. Thanks, Raph, for that nice demonstration.
C
There are a lot of these disparate examples out there for how to use Dask and xarray. One comment and one question. The comment that I would have — or I guess it's part of the question — is: are you using CMIP6 output for these data, or are you using output from one of your custom runs?
A
C
Because one of the things that I've been trying to convince folks up here at CSDMS to do is — you know, we have CMORized data, right, CMIP6-compliant data. So making these examples a little more concrete, where you can just download the data from ESGF, get these metrics, and get it all working, I think would be really helpful to kind of spread it. And then I had a second question.
C
The big thing that's always kind of been a pain for me is figuring out the right chunks. So do you have — I mean, as you saw, there are differences in performance — do you have any general rules of thumb about what chunk sizes to start with?
A
Yeah, what I was saying is that the rule of thumb here is that something between 10 and 100 megabytes per chunk is usually where you get the best results. In my case — you can see it in the notebook — the best performance I had was with my chunks in that range, so I would say around those lines. It's not completely well known what the best size is, so you kind of have to try; there's a lot of trial and error in that.
C
D
A
So the issue with OPeNDAP is that the Dask cluster is actually not going to work with OPeNDAP, and I think that's because OPeNDAP is serial by definition, so all the distributed computing is actually not working. You can try with the notebook — I've tried several times to get it to work with OPeNDAP, and in that case it doesn't work. So that's still something that's not working properly.
A
But if you were to compute things, say, on the cloud, where everything is stored in zarr with chunks, you would try to get your chunk size to also be around 100 MB, so that you can fit a bunch in memory but not overload your memory, and not have chunks so small that you need to make lots of I/O calls, which also have an overhead. So, yeah.
B
Any last questions for Raph? We put the GitHub — he has a GitHub link in his presentation. We have also copied that notebook to the GitHub site for the webinar series, and that's in the chat box right now, so if you want to play with this later, you can.