From YouTube: Day 4 - WCRP Digital Earth Workshop
A
Good morning, everybody. All right, so flus, colds, and no COVID so far, are taking hold of this conference in an unexpected way, in that two of our invited virtual speakers had to cancel because they're both sick with the flu. So Oli Fuhrer won't be here at nine o'clock and Laura Rosanna won't be here after the coffee break. What we have decided to do is ask Richard and Sherry, who were the contributed talks before lunch, to speak now. Then we'll have Milan Klöwer's invited talk at 9:45. Then we have a coffee break, and we'll go into breakouts straight after the coffee break. So your afternoon breakout session becomes a morning breakout session, we'll have lunch, and then, depending on how you're going in your breakout sessions, you can reconvene after lunch or not. It's up to you, how you feel, how far you've got.

So, basically, the afternoon is a bit at your discretion. Work for as long as you want, or as little as you want, and then have the afternoon off. That was the better option compared with finishing the meeting today, which we could also have tried, because there are a few people who are deliberately not here today who will be here tomorrow for the closeout of the meeting, the plenary discussion, and the reporting of the breakout groups.
A
Andrew will add something. I just want to remind you: send any of Andrew, Andreas, myself, Kathy, or Cath your acronym proposals for what we should call these models. "K-scale" and "kilometer scale" weren't appealing. I have to report that on a car ride Gretchen Mullendore, Cass, and I came up with a wonderful acronym, so there is competition. That's just to needle you all a bit. We have one and we love it.
C
Yeah, I'm going shopping. We were discussing it; I'm going shopping this afternoon for prizes, so this is going to happen. Just the other thing to note, if you have the afternoon off: obviously there are many opportunities in Boulder. There is actually an NCAR shuttle that we can get people scheduled onto; it will leave from right out here and will take you across town.
D
It's probably worth noting that there are a number of mountain trails that begin at the Mesa Lab. So if you take the shuttle up there, there are lots of beautiful trails that take you into the forest and the hills. I've been here all week, but my wife has been hiking and tells me the sumac is changing a beautiful red, and it's bright and lovely. I think the aspens are past peak, and there aren't a lot of aspens right here in the Front Range, but the sumac are changing and it's quite lovely. So if that's appealing to you, head to the Mesa Lab and grab a trail.
A
Oh, I saw it on TV, that's all; I don't know what they use. All right, so why don't we get started. Our first speaker today is Richard Loft, followed by Sherry, and they're doing a tag team. They have asked to do two times 12 minutes and then only take questions at the end. So that's how we're going to do it. Richard, go ahead.
E
Thank you very much. I'll try to live up to the opportunity to pretend to be Oli Fuhrer in some ways. So today, as was mentioned, Sherry and I are going to tag team a discussion about the computational science and software engineering aspects of the Earthworks project.

As came up on Tuesday, we were already looking at trying to port MPAS to GPUs, in a curiosity-driven way, to see if directive offload could actually work reasonably well for a model. The objective was to produce an NWP forecasting system with about 30 percent of the planet refined to three kilometers. Essentially, the Weather Company wanted three-kilometer resolution over the populated areas of the world, i.e. the land, and they also recognized that MPAS had local mesh refinement baked in.
E
It took about two years to get MPAS fully ported to GPUs as a stand-alone meteorological model, and when we were done, we were able to run on Summit, I have here about 4,200 GPUs, that's about 700 nodes of Summit, at full resolution, three kilometers. We had to take some shortcuts, one of which was that we didn't want to port the radiation because it was in progress elsewhere. So we adopted a lagged radiation, where we used the heterogeneous node fully, using the CPUs to calculate radiation.
E
Just a couple of results from that exercise: the left shows weak scaling. This is holding the amount of computation per computational element constant and scaling the resolution proportionate to how many devices you use, and these flat curves show that it weak-scales. Then the strong scaling studies: the orange, green, and brown curves show 10, five, and three kilometers, and the blue is 10 kilometers but run on NCAR's Cheyenne system.
E
And that's one year per day at that time on Summit. So for Earthworks, the premise in the proposal is: can we take this NWP success and translate it into an Earth System model? The target was global quasi-uniform resolution: put all the components on one mesh, at the same resolution. That obviates the need to do interpolation or some kind of regridding that potentially introduces errors.
E
The target is these global storm-resolving resolutions, focused here on the icosahedral-type grid with 41 million points. That's the target we settled on when the proposal was written in 2019, and the goal, the moon-landing-type goal for us in the proposal, was half a simulated year per day at 3.75 kilometers.
E
One of the things we have to do is leverage CESM; we want to run CESM at high resolution, as does the SIMA team at NCAR and the CESM core team. So we've been looking with them at some high-resolution results for an Earth system configuration that looks like atmosphere and land with a data ocean, and this just underscores what happens on a CPU-based system. When you run this, you see the atmosphere.
E
The blue piece of this pie dominates computational time; it's about 90 percent of the time, with long turnaround times. This is 512 nodes, about an eighth of our system (I say "our" because I worked at NCAR for 27 years), so an eighth of their system. To get a run in requires draining a large chunk of the machine, so you can wait in the queue for a week to try and get a big run like this done.
E
When we ran this, we actually revealed some issues with how the model scaled. This particular run spent as much time initializing the land surface as executing the model. That's been fixed, but you have to run at these scales in order to find these kinds of problems. And then there's the slow throughput: when you're actually running this test case, you run at about 0.08 simulated years per day at seven and a half kilometers.
E
So the Earthworks strategy, just using that as an example: what we've done is use regional refinement of MPAS to reduce the cost of tuning the climate parameterizations at meteorological length scales, so we're trying to avoid running fully refined models at ultra-high resolutions to tune them, for cost reasons. Target heterogeneous computing with GPUs, that's obviously the point here, but we need to accelerate things; these GPUs can accelerate this code to something more reasonable than 0.08, based on the timings.
E
You can't really get much throughput if you need to use a large fraction of the computer, so we need to go after really big machines, exascale machines, as the target platforms for these things, so that we can actually get some turnaround time. And I think it's important to at least mention that our project also recognizes the huge data problem that is entailed by this kind of approach.
E
You can read it if you want, but essentially in this plot the cool colors are CPU-based results, basically Intel processors, and the warm colors are GPU-based results, either OpenMP or OpenACC offload. You can see that CPUs and GPUs have very different characteristics, and they show up in the physics: the number of columns per node that's optimal for a CPU is essentially exactly opposite of a GPU.
E
Now, in this setup we have just embedded a GPU-based physics in with the CPU model, so you're actually not seeing all of the benefit on the GPU, because of the data transfer back and forth.
E
This is where the atmosphere dycore is, and you would think this would be the easy part, but MPAS 7 is quite a bit different from MPAS 6, because MPAS 7 is designed to go into the climate model. So there's some restructuring, but I think the key point here is that we have some latency issues and some computational issues that we discovered in MPAS 7, associated with MPI wait and with some computations and variables that we didn't capture as being GPU-resident, and we're working on this.
E
But this advantage seems to persist in the newer architectures that have come out since the old result, and we think 10 kilometers at four simulated years per day on 256 A100s is a reasonable simulation rate.
E
With the ocean, we've got six times faster than a Broadwell node on the system we tested on, which is at Nvidia. That equates to about 12 simulated years per day for a particular test case called EC60to30. When we take that down to three kilometers, we think we can get multiple years per day out of the ocean with a relatively small number of GPUs compared to the atmosphere, so we're in good shape, I think, with the ocean.
E
And then, with data, it's like: be careful what you wish for. We estimate the model could produce, with hourly output, one and a half petabytes per simulated year. That's a lot! So we plan to use data compression, parallelism in the form of Dask (the Dask Python library) for workflow, and chunk it out as Zarr chunks.
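A rough back-of-envelope check of that estimate. Only the 41-million-point mesh comes from the talk; the level count, variable count, and precision below are illustrative assumptions, not the actual Earthworks output list:

```python
# Rough check of the "~1.5 PB per simulated year with hourly output" figure.
columns       = 41e6        # ~3.75 km quasi-uniform mesh (from the talk)
levels        = 100         # assumed vertical levels
fields_3d     = 10          # assumed number of 3D fields written hourly
bytes_per_val = 4           # single precision
hours_per_yr  = 365 * 24

snapshot = columns * levels * fields_3d * bytes_per_val      # bytes per hourly write
per_year = snapshot * hours_per_yr
print(f"{snapshot / 1e9:.0f} GB per hourly snapshot")        # ~160 GB
print(f"{per_year / 1e15:.1f} PB per simulated year")        # ~1.4 PB
```

Under those assumptions the hourly 3D stream alone lands in the same ballpark as the quoted 1.5 PB, before any 2D fields or compression.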
G
Thanks, Rich. I'll be talking about the software engineering challenges that we faced in this project, so basically talking about how we got the results that Rich just showed. Oops, where am I? Sorry. So I threw a whole bunch of stuff on a slide to show how complex the software development project really is, and I'm going to start off by pointing out, let's see here, there it is, this first diagram up here in the top right-hand corner.
G
We are coupling MPAS sea ice with MPAS Ocean within the CESM infrastructure, which presented a whole bunch of software engineering challenges in order to get that working. The other challenge we have is that these models continue to develop, so they're not static; they're being developed orthogonally to our own project. So how do we coordinate porting code to GPU as the code is changing while we're trying to do our work? We have that extra complexity as well.
G
We are also coordinating this effort across many different organizations. The project itself is led through Colorado State University, but we're also coordinating the effort between many labs here at NCAR; we have the private sector, with help from Nvidia and Rich's new LLC; and there are some efforts within the Department of Energy as well. We also have a very complex software stack that we're trying to coordinate across many of the different systems that I show here. So how do we do this?
G
So, when you think of software engineers, I'm going to throw this out here: we're typically, stereotypically, introverts. We like our offices; we don't like to coordinate. But in efforts like this we just can't do that; we have to coordinate. So an important part of this project is getting everybody in the same room, or Zoom conference call nowadays, and making sure that all voices are heard and that they're equally important.

In all these conversations, I personally like to celebrate all successes, so every little accomplishment we make, we celebrate as a team, and keep everybody motivated and moving forward. It's also important as a group to create that clear vision and path, and everyone who is part of the project should understand how their contributions fit into the bigger picture.
G
We can transfer team members back and forth to help each other across the project, but it's also important to empower all team members. As the lead of the software engineering effort, I always try to find opportunities to create leadership roles for the team members, so they have a chance to grow their leadership skills, but also their technical skills.

We also need to coordinate our effort with the scientists. From our experience, the scientists have to be equally invested in the project. We can't go ahead as software engineers and say "we know best, this is what you guys are going to do." They need to be invested in these projects as well. We also need to be aware of the science planning that's going into it, just as the scientists need to be involved with where we're going software-engineering-wise.
G
We need to be in constant communication with each other. As a team, part of our team is embedded in the science meetings, and we also brief the scientists in separate meetings on our planning for the engineering effort. It's also very, very important to have an exit strategy planned: as we complete these projects, how is the software going to be maintained after the work has been completed? That's always been kind of a difficult one.
G
When we think about the software development, it's very important to have version control systems. As a project, we've set up a repo that sits parallel to CESM, and we did that because, even though all the changes we're making in regards to CAM and the GPU port are going right into CAM, we are also, remember, coupling MPAS sea ice and MPAS Ocean into this infrastructure, which doesn't fit in the long-term plans of CESM.
G
So we had to create a parallel repo for this work, and the way that this repo actually works is that it just contains an externals config file, which is basically a recipe of exactly all the different repos, all the software stack within Earthworks and within CESM. It pulls everything down from that recipe. So it's really lightweight, but all the information is there.
G
As a project going the GPU porting route, we're like all other institutions: which way do you go? We went the OpenACC route for most of our development, mainly because of the research that we have right now. From this plot, Rich showed a similar plot earlier, we are seeing better performance with OpenACC right now, so we're going in that direction.
G
But we are aware that OpenMP offload is still being developed, so it's still on our radar, and we have a separate research project that's looking into: OK, we went the OpenACC route; what if we have to switch to OpenMP offload? OpenMP offload has the advantage that we can run on Intel GPUs. Right now, with OpenACC, we can run on Nvidia GPUs and on AMD GPUs, but we can't run on Intel GPUs. So, with performance improving with OpenMP and the possibility of needing to run on Intel architectures, we've investigated a tool from Intel that I highly suggest if anybody else is in the same situation. We found that it gets you about 90 percent of the way there in performance as well as portability. We've actually tried it with some complex codes, like CM1, a cloud model that's developed here, as well as this PUMAS MG3 work.
G
I also wanted to give a shout-out to Project Raijin, which had a little bit of a blurb earlier. Basically, what this is going to enable us to do is seamlessly analyze data on unstructured grids. Right now we're in the situation where we have to regrid all of our data before we analyze it. This will provide a seamless transition to reading in the data and then automatically analyzing it within Xarray and Dask.
G
So we will be able to exploit that parallelization within our workflow very easily with this work; stay tuned for that. And it wouldn't be a computational talk without talking about testing infrastructure. During development, for GPU work, it is very, very critical to test often and worry about performance later. It's very easy to lose correctness when you're doing GPU work, so it's very important to be testing every iteration, just to make sure that we, as software engineers, aren't changing answers for you guys.
G
We did that with the, I'm sorry, the MG3 PUMAS work, even though it's not a standalone test, and we were able to iterate back and forth very quickly on the development and then put it back into the model. The CLUBB work we're just actually starting; CLUBB is very nice, it has its very own testing infrastructure and it's a standalone application on its own. So in order to do this porting, what we actually implemented,
G
actually we didn't, UW-Milwaukee did: Vince Larson and Gunther did this, where they added the multi-column capabilities, so we can iterate back and forth on the development quite easily with this multi-column capability they added. Now, after development, we have some CAM regression tests that have been added, so we can move on to a different part of the project and the science can still keep going, but they'll
G
let us know if something breaks. The way that they do that is with two different tests. We call one a smoke test, where it just tests: does it run? And we also have a bit-for-bit comparison test, which compares between CPU and GPU, so we can tell if our answers are different.
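A minimal sketch of what such a bit-for-bit CPU-versus-GPU check can look like; the file names and variable names below are hypothetical, and the actual CAM regression tests are more involved than this:

```python
import numpy as np
import xarray as xr

# Hypothetical history files from otherwise identical CPU and GPU runs.
cpu = xr.open_dataset("case_cpu.cam.h0.nc")
gpu = xr.open_dataset("case_gpu.cam.h0.nc")

for var in ["T", "Q", "PS"]:                       # illustrative variable names
    a, b = cpu[var].values, gpu[var].values
    # Compare the raw bytes: any differing bit anywhere counts as a failure.
    identical = a.tobytes() == b.tobytes()
    print(var, "bit-for-bit" if identical else
          f"differs, max abs diff {np.max(np.abs(a - b)):.3e}")
```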
G
So what are we promising with this? In December of this year, we're hoping to get out more of our CPU capabilities and the configurations that we have. Starting next year in May, we're going to be releasing more of our GPU offload versions of these packages and a 15-kilometer-resolution configuration, and then a year from now we're going to release more of our GPU offload capabilities at another resolution, while also looking at releasing our diagnostics and analysis packages, and then, finally, hopefully, May 2024.
G
So how are we doing? When we look at our CPU route, we're doing fairly well; we're pretty much on target. As Rich says, for the iteration process, as soon as we start getting to the higher resolutions, we can get about one job through the queue a week.
G
So that's really limiting our throughput, but we are making good progress, and, as Rich said, we're sitting at about 0.08 simulated years per day with the atmosphere-land configuration at 7.5 kilometers; we're hoping to speed that up by about 12 times. And where are we with our GPU effort? We're doing fairly well with that as well. As Rich mentioned, we have already completed the dynamical core, the MPAS 7 version, and we've also completed the PUMAS port.
G
Right now we're validating the radiation, RRTMGP; we've hit a couple of compiler issues with that one, but we're hoping to work them out. We're currently working on the CLUBB port, we're just actually starting that, so we're hoping to get it done within the next year, and we're also evaluating the other physics within CAM itself to figure out exactly how much work it's going to be and where the slowdowns are in other parts of the package that need to be ported to GPUs. I think that was my last slide, so, huge thanks.
H
Nikolay from AWI here; we're doing something similar. That was a great talk. My question is about your testing: you're developing code on GitHub and you have to test it, so probably every pull request gets tested somewhere. Is it tested on GitHub Actions? Basically, my question is: how do you test your code on the HPCs? Because our admins basically freaked out when we asked them about doing unit testing on HPCs, and they said no, no, it's never going to happen.
G
That's probably why we haven't heard anything back from them. Basically, our testing right now is the regression tests that they do for CAM: every time they're getting ready to release or do another tag in CAM, that's when they run those particular tests. When we're actually testing during development, for me in particular, I port a module, run the test, and see if the answer is different. So, unfortunately, we don't have any CI, continuous integration, right now.
G
Like I said, we have the email in; we'll see, and I'll let you know how that goes.
H
Yeah, that would be great to hear, because for us it's pretty hard right now.
E
Yeah, can I just say two things about this? One, we have the added issue of not just testing at one center, because we're targeting DOE exascale systems, so we've written essentially portability-test-type small allocation requests. When the testing gets to a certain size, they start to ask what the science headline related to it is.
E
There's some kind of donut hole there. And I guess the second issue is that part of a pull request is, a lot of times, a code review, which is a manual process, and that takes up some time. So we can't always just trigger an automatic test.
I
A quick question: when the Earthworks proposal was submitted, you indicated there was a target in simulated years per day, I can't remember what the exact number was, at 3.75-kilometer resolution. Now you're two or three years into the project; you know a lot more about the capabilities and all that. Are you on target, in a coupled system at 3.75 kilometers, to accomplish that goal by May 2024? Or can you sort of...
E
Yeah, from the results on the A100, I'm comfortable with the notion that we can hit that. There are some dependencies: we have to get CLUBB ported, and we have to get the radiation actually running on GPUs. One of the problems with the GPU thing, from my projection point of view, is that it's an L-shaped ROI curve. Essentially you don't get any speedup until everything's ported, so that there's no host-device traffic between the different components; once you get everything over, then you see this big benefit. That's why I say it's L-shaped. Managers like linear ROI: I invested a million dollars and I got this much improvement. That's part of the scary part in making these projections, but right now, looking at the ocean and looking at the atmosphere, I think we might.
F
Thanks, Rich and Sherry; that's actually the best overview of Earthworks I've seen so far. My question is: are you planning to host any workshops, similar to the CESM workshops or WRF workshops, or will it be as easy as running one of our standalone models? Eventually, once it's all done, can a grad student just go to GitHub and download it, or is it going to be more complicated, so that we need to bring people in and teach them how to do it?
C
Yeah, sure, if a grad student has an exascale machine, no problem. I mean, Earthworks is a configuration targeted at a specific thing, the global three-and-a-half-kilometer coupled model, and it is supposed to be a compset of CESM. So in that sense, yes, it will be possible, but the computational load is going to be excessive. There are going to be other versions of CESM using these components that will be easier: the regionally refined stuff, things like that. Again, there should be compsets; it's going to be a community model, so it is going to be there, accessible, able to be explored, but the computational resource for this is definitely the exascale part, though that doesn't mean there aren't other parts. If that answers your question.
E
And there is a plan to hold a workshop; I just don't remember the details of where. So we've got a workshop planned. I think one thing is that you need that version one out there, even at low resolution, so people can kick the tires on it, and then you have an opportunity to hold a workshop where people can test things after they've worked with it.
J
I'll try to be brief. Thank you for your talk; it's very informative. I liked that you made a comment about land, kind of like "oh dear, we have to deal with that," but given Martin Best's talk from Tuesday, and looking at being on different resolutions or grids or whatever, how might you think the land could be accommodated? Because whenever we look at Earth system components, it's ocean and sea ice and atmosphere, and oh yeah, land is always underneath there.
E
Well, it's exactly because of what Martin said about the land possibly getting much more expensive. I also think Tim Schneider mentioned that when they run sort of WRF-Hydro type experiments, the land consumes about 25 percent of the resources, whereas, when you look at the pie chart I showed at seven and a half kilometers for CESM, the land is five or six percent.
E
So if it grows, then the whole idea that we're just going to keep it on the CPU and not deal with it stops working, because you don't have enough CPU power to calculate it. At some point there's a possible bifurcation point where you say, okay, the hell with it, I'm going to port it to GPUs, and that's not factored into our current work plan. But it may have to be.
E
Okay, yeah, all of that: plants, birds, rivers, everything, all of that 25 percent. If it becomes bigger, then it can't be ignored. That's something I learned by raising children.
A
I'm a complete outsider, but I saw a version of the NCAR land model that had hillslope effects in the talks about Alaska, so you must have a version floating around where you can play and see what the consequences of making it more complicated might be. It might be one of those cases where people don't talk to each other, but there's stuff out there to try, to see how much more expensive it might be. We don't have time for any further questions, I'm afraid.
A
So our next presentation will be by Milan Klöwer, who is joining us from Oxford, I presume. There he is, and he will talk about data challenges when we go to these very high resolutions. We can see your presentation now, Milan, and after 35 minutes I'll shout into the microphone, because that's the only way you're going to hear me. All right, take it away. Thank you.
K
Perfect, yeah, thanks everyone, thanks for having me. Yes, I'm in Oxford right now, so I did not make the big trip over the Atlantic, but I do want to talk about different challenges around data. Being at Oxford, yes, we collaborate with ECMWF, but many of the things that I want to talk about are probably a bit more high level, a little bit less about technical implementation
K
details. There's a lot of the view from ECMWF spiced in, but also, because I'm not directly working at ECMWF, I'm approaching everything from a slightly more naive point of view, and I've actually found that very helpful for stirring up discussions and questioning whether we should always do things in this way or that way, and I hope, therefore, you will also have a lot of discussion for me after this talk.
K
There are a lot of collaborators, people I've talked to over the last year or so, in order to understand the different perspectives around data challenges, especially when we produce a lot of data. The very first plot I actually just put together yesterday, because I received some data on how
K
the archive has scaled over the last decades, and I basically want to ask the question of whether we will actually enter the Google regime, meaning: how long does it take us to have acquired as much data as Google had a couple of years ago in their archives? Which is, well, just an estimate from XKCD. But you can see that the archive at the moment, well, that was December of
K
last year, was beyond half an exabyte, and if we project that forward, and this is literally no more science than hand-drawing a couple of exponential curves in there, then by 2030 to 2040 we will be somewhere at several exabytes, and if we didn't do any compression and didn't do any cleanup, we would actually hit the zettabyte, which is obviously an enormous amount of data that I just can't imagine.
K
But you can also see that there's a big scope for compression. If we simply, relative to where we are now, had a tenfold compression, then we gain something on the order of almost 10 years.
K
If we had a 100-times compression, then we would gain probably almost 15 years or so in this race towards the Google regime, and this obviously poses a massive challenge and really motivates us to think better about how we store our data and how we make it accessible to our users.
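The "years gained" can be put on the back of an envelope: if the archive grows by a factor g per year, a one-off compression factor C buys roughly log(C)/log(g) years before the archive is back to the same size. The ~35% annual growth below is an assumed number, chosen only to illustrate the shape of the argument, not an ECMWF figure:

```python
import math

growth = 1.35                          # assumed ~35% archive growth per year
for compression in (10, 100):
    years = math.log(compression) / math.log(growth)
    print(f"{compression}x compression buys ~{years:.0f} years")
# ~8 years for 10x and ~15 years for 100x under this assumed growth rate.
```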
K
Because, on the other hand, while our archives have been exploding, we have actually not thought that much about how we represent our data in a bitwise way. Back when ECMWF started the forecasts, they used double-precision floats; the IEEE standard was actually only officially introduced a couple of years later, and only last year did they switch to single precision.
K
And while there was probably a bit more movement on the data compression side, I believe it did not receive as much focus as it probably should have, given the problem we'll be facing soon. So I want to give a little overview of the different aspects of data compression that are out there, because I feel like, in this community around high resolution,
K
you can often group it into two groups, or two schools. The first school is, let's say, the transformation school, really the physical perspective, meaning you ask the question: I have some data,
K
what is the best basis function to represent this data? So I just have some coefficients that are multiplied with these basis functions, and that represents my data really well. The idea is obviously that you somehow approximate the spatial structure, spatial or even temporal structure if you were to compress in time. I just gave the example of the spectral transform that you could think of for representing data in terms of spectral coefficients; there are also EOFs that people have probably heard of: you could just truncate them, and then you have some kind of data compression from that.
K
All of them, or most of them, are fairly expensive to compute, because you have to do a lot of floating-point operations in order to get into your transform space and also to get back. And because you represent your data with some kind of underlying basis function, it is often difficult to bound the error, to know a priori
K
what your error could be: if your data does not really fit the basis function that you're using, you can easily go beyond some bounds, and that's quite difficult for many of these approaches. Random access also isn't really easy to do. Say you want to know what the temperature in New York is: with spectral coefficients, in spherical harmonics for example, you would need to transform everything back in order to know one point, and that obviously makes random access a bit tricky. The second school, and that's what
K
I would like to highlight a little bit, is the school around precision and information theory. I will talk about this in a bit more detail later, but one of the underlying properties of that school is that you don't really think about the spatial structure in terms of the physical perspective, and most of the time transforming your data into the new encoding, whatever that encoding is, is relatively cheap.
K
If you understand, for example, how floating-point numbers work, or how a linear quantization or logarithmic quantization works, your error bounds are also relatively rigid, and random access is usually much easier, because it's more straightforward how to chunk your data into pieces, and so you can easily say, I just want to decompress this one chunk, and here you go.
K
Tensor trains, for example, back to the school of transformations, are not something that I've worked with, but I definitely would like to present them as a thing we should look out for in the future, because I think some of these approaches could be really promising.
K
The tensor train approach is basically the idea that you have lots of tensors that you multiply together in order to represent your n-dimensional array, which could obviously have spatial and temporal dimensions as well as an ensemble dimension, and so on and so forth. And at least the few papers I've seen on these so far basically claim to be really, really accurate at super high compression factors. So, for example, this picture that I've put here on the left.
K
I don't quite understand how this is possible to achieve, but they're really good at representing smooth data. Just as a comparison, there are other techniques which also belong to this group of transformations: for example, zfp splits your data into little blocks and then fits little basis functions into these blocks, and that's why, if you ramp up the compression a bit too hard, you get this block structure appearing. It's similar for SZ, which basically fits a couple of functions in there, and that's why you see these facets on what should otherwise be smooth surfaces. So tensor trains, I
K
think, in general, are really good to keep in mind if you want to represent smooth data. It is not clear to me how they generalize to data that has super sharp edges on one end of your field and is rather smooth on the other, and I think they're more on the end of being fairly expensive.
K
It is also unclear to me how they generalize to, for example, unstructured grids, because you then unravel your, let's say, two-dimensional field into a single dimension, and I think that takes away a lot of compressibility for these methods, because they can't really exploit the neighborhood of a given grid cell.
K
The next approach, which I also haven't worked on, but I saw one or two papers on recently, so I want to put it out there because I think it could be really promising for certain applications of data compression (we'll talk about these use cases for data compression in a bit),
K
is the idea that you could also use a neural network to compress your data. The idea is really that you say: I have some kind of n-dimensional array that has some coordinates, x, y, z, time, ensemble, whatever dimensions you can think of, and you train a network to return the scalar at a given coordinate, given the input coordinates. You start by defining an architecture for your neural network, you train it with some kind of loss function, for example minimizing the mean square error, and in the end you store the information of your original data set in the coefficients of the neural network. So decoding is literally just providing your x, y, z, t, whatever coordinates to this neural network; it does the inference, and you get out your one single point.
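A toy version of that idea, fitting a small network to value = f(lon, lat) on a smooth synthetic field and treating the trained weights as the "compressed" representation; this is only a sketch of the concept, not one of the published coordinate-network compressors:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Smooth synthetic 2D field on a 64 x 32 grid.
lon, lat = np.meshgrid(np.linspace(0, 2 * np.pi, 64),
                       np.linspace(-np.pi / 2, np.pi / 2, 32))
field = np.sin(3 * lon) * np.cos(2 * lat)

# "Compression": store only the network weights that map coordinates to values.
X = np.column_stack([lon.ravel(), lat.ravel()])
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
net.fit(X, field.ravel())

# "Decompression" is just inference, at training coordinates or anywhere in between.
print(net.predict([[1.0, 0.3]]))          # evaluate at an arbitrary coordinate
```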
K
The cool thing about this, and this is why I kind of like it, is that it basically interpolates automatically, because it tries to fit a function through all your dimensions, and so it doesn't really matter where you evaluate that function, as long as, obviously, your evaluation and training data set are coherent enough.
K
You could just pick a point that wasn't actually used to train your neural network in the first place. But I guess it also comes with a lot of difficulties that I'm quite excited to understand in the coming years, because I think more and more people are looking into this. In general, it is a method that is rather expensive in terms of compressing and decompressing things, but I've seen papers that claim that factors above a thousand x
K
should be possible. But it's really unclear to me how easily you can control the error and so on, so I want to put it out there, but I'm not so sure it's something we would use directly. Going to the very other end of the spectrum, and this is, for example, the standard that's currently used at ECMWF with the GRIB data format,
K
there is the idea of linear quantization. I just want to quickly mention that: you take your data set, you look for your minimum and your maximum, and then you split the range in between equidistantly, and this is where the linear aspect comes in, equidistantly with some bits. Let's say you have 24 bits available; you choose the size of every number representation, and then you split the range into two to the power of 24 quanta.
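A minimal sketch of that linear quantization, a simplified stand-in for what GRIB-style packing does, not the actual GRIB code:

```python
import numpy as np

def linear_quantize(x, nbits=24):
    """Encode x as unsigned integers on a uniform grid between min and max."""
    xmin, xmax = float(x.min()), float(x.max())
    nquanta = 2 ** nbits - 1
    scale = (xmax - xmin) / nquanta
    codes = np.round((x - xmin) / scale).astype(np.uint32)   # bucket indices
    return codes, xmin, scale

def linear_dequantize(codes, xmin, scale):
    return xmin + codes * scale

x = np.random.gamma(shape=2.0, scale=1.0, size=10_000)       # some positive data
codes, xmin, scale = linear_quantize(x, nbits=16)
xhat = linear_dequantize(codes, xmin, scale)
print("max abs error:", np.max(np.abs(x - xhat)), "<= half a bucket:", scale / 2)
```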
K
You then round your original data into these buckets, and these buckets represent your data. The problem is that you really have to know what kind of data distribution you're actually dealing with, because, and this is just one of my favorite counterexamples, if you, for example, try to compress nitrogen dioxide with this,
K
you will see a bit-pattern histogram as shown here at the bottom: most of your values go into the very first buckets, and most of the other buckets are basically empty, because your data is logarithmically distributed rather than linearly distributed, so it doesn't really fit in there. And if you then look at things like the entropy, you see that seven bits are effectively unused, simply because all the values are in the first buckets and almost nothing is in the other buckets.
K
You can directly see that, for example, your first bit doesn't contain any information, because you basically know it's going to be a zero, not a one, which would encode the second half of your range. An alternative is logarithmic quantization, where between the min and the max you distribute your quanta slightly differently, for example with log spacing. If you use the same data for that, it actually turns out to be better, because nitrogen dioxide is more logarithmically distributed.
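The same point can be made numerically: for a roughly lognormal variable, the Shannon entropy of the bucket indices (the average number of bits per value that actually carry information) is far below the nominal bit width under linear quantization, and much closer to it after quantizing in log space. The synthetic lognormal sample below only stands in for a trace gas like NO2:

```python
import numpy as np

def bucket_entropy(codes):
    """Shannon entropy (bits per value) of the quantization bucket indices."""
    counts = np.bincount(codes)
    p = counts[counts > 0] / codes.size
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)          # stand-in for NO2-like data
nbits = 16
edges_lin = np.linspace(x.min(), x.max(), 2 ** nbits)         # equidistant buckets
edges_log = np.geomspace(x.min(), x.max(), 2 ** nbits)        # log-spaced buckets
print("linear:", bucket_entropy(np.digitize(x, edges_lin)), "of", nbits, "bits used")
print("log   :", bucket_entropy(np.digitize(x, edges_log)), "of", nbits, "bits used")
```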
K
You can see that your first bit then carries approximately one bit of information, because it splits the histogram into more or less two equal parts. And so it's quite interesting that, and this is one of the challenges we're also facing, we basically need to know what our data looks like.
K
What are the statistics of our data, in order to choose an appropriate compression method? That's a big challenge, and I think this is definitely something that has to be worked on more, to automate it and to develop standards for how different compression methods should be used. And this is probably where I would like to criticize a bit: at some point, when they came up with their compression methods, like using linear quantization, they probably just thought, yeah, that's the easiest way.
K
And what do we do with that data if we compress it in a certain way? I think this really highlights one of the more underlying things that we have to understand when it comes to data compression: depending on the method we're using, there are different control knobs, different things we can choose, that make a compression method work better or not so well. The first one, and this is basically the linear or logarithmic
K
quantization, is really the idea that you choose the number of bits. You choose the size of your data set a priori, and then, at the end, you get some error out of it along with your compressed data, and this might be rather unsatisfying, because you kind of need to work your way back
K
in order to understand what error actually happened in your data set given a certain size. You might be, let's say, positively surprised if it's really small, but you would rather control the error than the size. So the second group of compression methods, which many of the transformations fall into,
K
is that you choose the maximum error that you're happy to tolerate, and often this is literally just one scalar that you choose. For example, for the zfp library that I mentioned earlier, which splits everything into these little blocks, you say, I want to have a maximum absolute error of x, and then press play, and it compresses, and at the end you see how small your data set is.
K
So you really go from this error to a size and then have your compressed data. However, all of the methods that fall into groups one and two are such that the compression step and the information-loss step are combined, so you can't really disentangle them, which is why I've mentioned group number three, which might be especially relevant in our applications.
K
The point there is that there are actually methods that let you choose the errors independently, literally for every number, and I'll give one example later, and you can combine them with basically any lossless compressor you can think of. So you take these two steps, introducing an error because you've somehow truncated your data, and choosing the actual compression step, and make them independent. This, I think, gives us a lot of flexibility, because we have much better control over the error.
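A minimal sketch of that separation for float32 data: first round the mantissa to a chosen number of bits (the lossy, error-controlling step, shown here as a simplified round-to-nearest, not the exact rounding that BitInformation.jl implements), then hand the result to any lossless compressor, zlib in this example:

```python
import numpy as np
import zlib

def round_mantissa(x, keepbits):
    """Round float32 values to `keepbits` mantissa bits (simplified round-to-nearest)."""
    assert 0 < keepbits < 23                      # float32 has 23 explicit mantissa bits
    drop = 23 - keepbits
    ui = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    half = np.uint32(1 << (drop - 1))             # for rounding to nearest
    mask = np.uint32(0xFFFFFFFF ^ ((1 << drop) - 1))
    return ((ui + half) & mask).view(np.float32)

x = np.cumsum(np.random.randn(100_000)).astype(np.float32)    # smooth-ish synthetic series
raw = len(zlib.compress(x.tobytes()))                          # lossless only
for keepbits in (15, 7):
    y = round_mantissa(x, keepbits)
    size = len(zlib.compress(y.tobytes()))
    print(f"keep {keepbits:2d} mantissa bits: {raw / size:.1f}x smaller than lossless alone, "
          f"max abs error {np.max(np.abs(x - y)):.2e}")
```

Because the rounding alone fixes the error, the lossless compressor downstream can be swapped freely without changing the values at all.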
K
We can even later say, oh yeah, we still want to keep the same error, but we want to have another compressor, which we know is not going to affect our errors. That probably makes all the scientists who are going to work with that data happy, because they know they still get the same errors as before, but maybe it's smaller, or decompresses faster or slower, and so on and so forth. So, in order to phrase this challenge a bit better,
K
I think we need to look at which use cases we usually have for data compression, and so I tried to come up with a couple of examples to highlight how data compression can be different, especially at really high resolution where, obviously, we have a lot of data to deal with. Case number one, just as an example, is the typical case of reanalysis data, meaning one institution, for example, has produced the data set. It's a certain data set
K
that is, well, one data set, but it's basically used by many, many different people, let's say ERA5 or something like this. So it's a data set where, if it was small, it would be really beneficial, because you could easily download it to different servers, to different institutional clusters, and people could reuse it quickly, because it is a data set that is compressed once but decompressed many, many times by different users.
K
Decompression speed is really the key there, but less relevant might be the compression speed, because you may say, oh, we actually would be happy to accept a slower compression speed for a much smaller file size. And then something like portability is obviously important, because it's going to be used by different people, and they may say, oh, I just want one grid location, but give me all the time steps, meaning you have random access into that data.
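In practice that access pattern is what chunked formats are for. A hedged sketch with xarray and a Zarr store; the store path and variable name are made up for illustration and are not specific to ERA5:

```python
import xarray as xr

# With a chunked store, "one grid point, all time steps" only touches the chunks
# that contain that column, not the whole archive.
ds = xr.open_zarr("reanalysis.zarr")                           # hypothetical store
point = ds["t2m"].sel(lat=40.7, lon=-74.0, method="nearest")   # hypothetical variable
series = point.load()                                          # reads only the needed chunks
```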
K
Case number two, what I've phrased here as the research simulation, means: I want to work on this project, I want to do this little experiment with my model. You run it, and what's important is the decompression speed, because you may read it out many, many times, and you also don't want much of an overhead
K
when you write the data. You want random access, but maybe size is not that relevant, because you're only going to keep it for a couple of months on your computer and then you put it into some long-term storage archive, which is then case number three, I
K
think, where really size is the absolute thing that matters, and portability may not be that relevant, because it might be just a file that is stored there and isn't actually meant to be touched that often anyway, so decompression speed is maybe not that relevant, and random access not that relevant either. And then, let's say, case number four, operational: I probably couldn't phrase it better than to say literally everything is important, because you produce a lot of data,
K
you want to distribute it to a lot of people, they want access on a frequent basis, and so on and so forth. So really the question that we should ask is: what do we compress? Because this is, I think, really the difference between the idealized cases that you see for a lot of different compression algorithms, where everything looks nice and smooth and you're like, hey,
K
we can compress that by a factor of three thousand. Yeah, but our data doesn't look as much like a textbook example as your paper claims. And this is one of the examples that I produced by just looking at the different variables that are in the Copernicus Atmosphere Monitoring Service, one of the services that is operationally produced at ECMWF; it's basically the chemistry forecast of the atmosphere. Here I basically tried to put as many histograms into one plot as possible.
K
You can see that some of them are, because it's a log scale, absolute spikes, so their range is really small and they're more linearly distributed. Other variables span many, many orders of magnitude; there are super multimodal distributions; some values all have large uncertainties somehow; and possibly many, many zeros: if you think about precipitation, there are many areas in the world where it currently doesn't rain. Then the problems we're also facing are that sometimes some fields are super smooth,
K
some really have strong gradients, and that may change from location to location, from vertical layer to vertical layer. Sometimes we store things on unstructured grids, or sometimes we even want to store things in spectral coefficients, and there's masked data, and so on. So, for the kind of data we're dealing with, the challenge of finding a really good compressor is all the bigger, and so I really want to ask this more underlying question, to lead over to the topic that I actually want to talk about,
K
which is the question of what information is actually there in a given data set. When you think in terms of information, there are definitely different dimensions. We're here at a kilometer-scale workshop, so resolution is absolutely important for us: the higher the resolution, the more data we will produce,
K
and the more information is in that data. But we also look at many different numbers of variables, and all of them come with numbers that have some kind of precision, and so, if we think about data compression, we kind of truncate this information space somewhere. I literally made a little bend around the resolution axis there, because Andrew Gettelman has said many times now that we do not want to question the usefulness of kilometer scale, so definitely keep the resolution; I do not want to stir any of that up.
K
But we can compromise somewhere in terms of precision, and I think this is really where I would like to define what I call the real information problem of lossy data compression. It is where you say: I have an original data set, I may have some research questions that I would like to ask, and I want to know what the smallest subset of this data set is which can still answer exactly that same question in a qualitative way.
K
So, for example, your hurricane is still going to make landfall at roughly the same time. And so the question really is, if we think about it in terms of compression: what compression error is okay to have? I just literally pulled a number out of the hat and said 303.25 times some unit, plus some uncertainty, and the question really is: depending on the application, on the use case, there might be certain digits in there that we trust and other ones that we do not trust and would rather throw away in order to save memory.
K
If you think about this in terms of, let's say, Kelvin, because it's some temperature, 303.25 Kelvin, and your use case is a weather forecast that you want to communicate to users on their phones, then you may say, okay, 303 is good enough. But obviously, if you think about something else, then maybe you want to preserve the 0.3. And maybe, if it's about, I don't know, millimeters of rain over a certain period, you say, wow, it's an extreme weather event,
K
so our model is probably not good enough to represent it anyway, so we could just truncate it to 300, and it's basically good enough, because the uncertainty of that forecast is masking that precision anyway. So really, this notion of what is an acceptable error depends on a lot of different things that I've just mentioned, and so the question,
K
the underlying question that we want to ask, is: is there any way that the uncertainty in our data can be estimated if it is unknown? Because for some of our variables, I'm pretty sure that people have a good idea of what the uncertainty is, let's say temperature, but there are a lot of variables where we don't know: is it 10 to the minus 5, is it 10 to the minus 3, or is it something else?
K
And this is what we wanted to tackle: find a framework that answers that question, at least approximately, given some data. To motivate that a bit more, imagine you have some data, represented as some numbers, 0.05-something, with some digits at the end. The question is: do you trust those digits at the end or not? Probably not, and if you don't trust them, for compressibility reasons you should throw them away. In terms of encoding this into bits, the question therefore is
K
where do you cut, where do you put the line to distinguish between the stuff that's real, that you would like to keep, and the stuff that you would rather throw away, because those are the high-entropy bits that are really not well compressible? Ideally, you do want to cut them off in order to get a smaller file size, which, by the way, also means that you can directly communicate your uncertainty within the data set without encoding the uncertainty explicitly, which I find is a great opportunity.
K
So, here in the vertical we compare all the bits that are sitting in the vertical, and you may end up with a bit stream of bits that are in adjacent grid points. Just as an example, in this bit stream you would have zeros mostly followed by zeros, and once a one appears, it either remains one or switches back to zero, and so you can put this into a joint probability matrix.
K
This is just an example here: you put it into probabilities of the transitions between zero and one, one and zero, zero and zero, and one and one, and then you can calculate the mutual information, so basically the question of what one bit tells me about the next, and whether there is any information in that.
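A simplified 1-D sketch of that calculation, the mutual information between one bit position in neighbouring values; it follows the idea described here, not the exact implementation in BitInformation.jl (which, among other things, also filters out insignificant information):

```python
import numpy as np

def bit_mutual_information(x, bitpos):
    """Mutual information (bits) between bit `bitpos` of adjacent float32 values."""
    ui = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    b = (ui >> bitpos) & 1                       # the chosen bit of every value
    a, c = b[:-1], b[1:]                         # pairs of neighbouring grid points
    # joint probabilities of the transitions 00, 01, 10, 11
    pj = np.array([[np.mean((a == i) & (c == j)) for j in (0, 1)] for i in (0, 1)])
    pa, pc = pj.sum(axis=1), pj.sum(axis=0)
    mi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            if pj[i, j] > 0:
                mi += pj[i, j] * np.log2(pj[i, j] / (pa[i] * pc[j]))
    return mi

field = np.cumsum(np.random.randn(100_000)).astype(np.float32)       # smooth-ish 1-D "field"
info = [bit_mutual_information(field, b) for b in range(31, -1, -1)]  # sign bit first
print(np.round(info, 3))  # large where neighbours share a bit systematically, ~0 for noisy trailing mantissa bits
```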
K
If there's no information, it basically suggests that we should throw those bits away, and so we came up with this bitwise real information content framework that we hope is going to be helpful in addressing this umbrella question in the future. Imagine you start with some gridded data; it doesn't have to be structured, it can even be unstructured, and you have some kind of space-filling curve that moves along,
K
as long as your next point is somewhere adjacent in space. Then, for every bit position, so let's say the sign bit, the exponent bits, the mantissa bits, you look at the bits that are in the surrounding grid points. If they're all identical, then your entropy is zero anyway, so the information also has to be zero, because entropy is the upper bound of your information. But the mutual information, for example if a lot of zeros are next to each other and a lot of ones are next to each other,
K
your mutual information is actually approaching one bit, and so you can do that for every single bit position, and then, in the end, you come up with one of these graphs, as you can see at the bottom, which maxes out at one bit; that's basically the maximum information you can have for a given bit position in your data. And then, for example, you see that
K
while this might look a bit different for other data, most of the data that I've looked at is qualitatively something like this. And so the question that you can then ask is: how many bits do I actually have to retain in order to preserve a certain amount of information? Here in this example, for instance, the purple dashed line is where we would cut if we wanted to preserve 99 percent of the real information in our data set and remove the rest for compressibility.
E
He's back.
K
Yep, that's fine, I'll just show a few more slides and then we can go over to the questions, if that's okay for you. Yeah, thumbs up, perfect, great. And so the idea is, then, that instead of thinking about error norms, I want to motivate people to think in terms of preserved information, because this is actually a quantity that moves with the statistics of your data better, and it's, hopefully, and we have to see whether we actually understand this metric in its full breadth, a metric
K
that is a bit more useful for data that you don't know much about. In this example we compressed, for instance, water vapour, and preserving 99 percent of the information gives us a compression factor of roughly 40 compared to double precision.
K
Whereas, obviously, if we were to go all the way to the right, throwing away about 20 percent of the information, we would actually get visual artifacts, so we're well away from that and want to avoid it. But also, in terms of high-resolution modelling, we've applied this technique to some satellite data, for example, here's brightness temperature, and it worked rather well.
K
However, and this is probably where we get into the Bad and the Ugly aspect of this workshop, every time you calculate this bitwise information content you end up with what is basically an average over the entire domain that you look at. Meaning here, for example, you would get one average information value covering both these patchy cloud features and, for example, the smoother sea surface temperature signal that you can see here over the Black Sea.
This obviously poses a problem, and we realized it especially when looking at fields like precipitation, which can be quite patchy in the tropics but rather smooth, at least at this resolution, in the mid-latitudes. If you then calculate the bitwise information content of that, you get one number suggesting how many bit positions to keep.
But in the end that number might be dominated by the tropics, and you may end up cutting off more precision in the extratropics than you actually want to. A similar problem that we faced was that some people, for example, used floating-point numbers, truncated them and expected an absolute error that was uniform.
But it actually isn't, because floating-point numbers are logarithmically encoded, so the little transitions in the amplitude of the error that you can see here are exactly at the 4, 8, 16 and 32 degree isotherms, and that is also one of the problems.
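That jump in absolute error at each power of two is easy to reproduce; a small check with made-up temperatures around the 4, 8, 16 and 32 degree thresholds, reusing the round_mantissa helper from the sketch above (keepbits=5 is an arbitrary choice):

```python
import numpy as np

# round_mantissa(): the mantissa-rounding helper defined in the earlier sketch
temps = np.array([3.9, 4.1, 7.9, 8.1, 15.9, 16.1, 31.9, 32.1], dtype=np.float32)
rounded = round_mantissa(temps, keepbits=5)
for t, r in zip(temps, rounded):
    print(f"{t:5.1f} degC -> {r:8.4f}   abs error {abs(float(t) - float(r)):.4f}")
# the worst-case absolute error doubles each time the value crosses a power of
# two, because the spacing between representable numbers (the ULP) doubles there
```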
K
If you think about this quantization that happens, you basically filter out gradients, which may cause problems here or there, I don't know, but it basically means that neighboring grid points that you really can't significantly distinguish get put onto the same level. I'll skip ahead. In the end we just want to say that we've been working on different implementations, so I've been writing this package called BitInformation.jl, which is my Julia reference implementation.
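What putting neighboring grid points onto the same level does to a weak gradient can be seen directly; a sketch with a made-up temperature profile, again reusing round_mantissa from the earlier sketch:

```python
import numpy as np

# round_mantissa(): the mantissa-rounding helper defined in the earlier sketch
field = np.float32(300.0) + np.linspace(0.0, 1.0, 1000, dtype=np.float32)  # ~0.001 K between points
rounded = round_mantissa(field, keepbits=7)

# near 300 K (which is about 1.17 * 2^8) keeping 7 mantissa bits means a
# quantization step of 2^8 * 2^-7 = 2 K, so the whole 1 K gradient collapses
print("distinct values before:", np.unique(field).size,
      "after:", np.unique(rounded).size)
print("fraction of neighboring pairs made identical:",
      float(np.mean(np.diff(rounded) == 0.0)))
```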
Other people have taken it up, developing Python, xarray, netCDF, GRIB and HDF interfaces, so this is all something we're working on to understand how it can be used and how it can be useful for people, and I hope it triggers some discussion around how we actually store our data. At this point I would like to end this talk, and I'm very excited to hear any of your thoughts. Thanks.
L
Hi Milan, Peter Lauritzen here from NCAR. A couple of comments: to make use of data compression, I think there's a lot we can do on the actual model side as well. For example, if you look at budgets in the model, they're very sensitive to any kind of truncation. So, for example, the energy tendencies I talked about yesterday: I'm subtracting two huge numbers to get something of order one out.
However, if I compute those inline in the model, then it doesn't matter if you start compressing the data. So I just want to call out that I think it's really important that we compute a lot inline in the model, and then we can really make use of data compression. You can make similar arguments for functionally related quantities in the atmosphere, such as chemical species. The other comment I have is that if you're a dynamicist, there are some diagnostics that require zonal means.
If you compute those after the fact, you need really high-resolution 3D data, and one issue as we move into these unstructured-grid models is that it's not trivial to do zonal means. That is, for example, something we're working on at NCAR, so that you can compute these kinds of eddy statistics inline, and again in that way massively reduce the amount of output, and then you could do compression on it afterwards.
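Peter's point about budgets can be made concrete with a toy example: if two large states are rounded individually and differenced offline, the small tendency drowns in rounding error, whereas the tendency computed inline in the model and only then rounded survives. All numbers here are made up, and round_mantissa is the helper from the earlier sketch.

```python
import numpy as np

# round_mantissa(): the mantissa-rounding helper defined in the earlier sketch
rng = np.random.default_rng(3)
state = 2.6e9 + rng.normal(0.0, 1e6, 10_000)        # e.g. column energy at step n (made up)
tendency = rng.normal(0.0, 1.0, 10_000)             # the small signal we actually care about
state_next = state + tendency                       # column energy at step n+1

# offline: store both states rounded to 7 mantissa bits, difference them afterwards
offline = (round_mantissa(state_next, 7).astype(np.float64)
           - round_mantissa(state, 7).astype(np.float64))

# inline: compute the tendency inside the model first, then round only the small number
inline = round_mantissa(tendency, 7).astype(np.float64)

print("true mean tendency:     ", tendency.mean())
print("offline, rounded states:", offline.mean())   # swamped by a quantization step of ~1.7e7
print("inline, rounded output: ", inline.mean())    # close to the truth
```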
K
I absolutely agree, and this is why I've been advocating a bit more for doing the compression directly within the floating-point format, simply because it is a format that we understand relatively well, where people have put a lot of thought into how to design rounding modes so that they are bias-free. Meaning that if we truncate in the floating-point format with such a rounding mode, you should, at least theoretically, be able to not distort any budgets that you're calculating, which in the end are just a sum over a lot of rounded values. But I do agree.
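The bias-free property is easy to check numerically: truncating mantissa bits pulls every value toward zero and biases a global budget, while rounding to nearest leaves the mean essentially untouched. The round-half-up used in round_mantissa above is close enough to unbiased for this illustration; proper round-to-nearest, ties-to-even removes the remaining half-ULP bias. The field is again a made-up stand-in.

```python
import numpy as np

# round_mantissa(): the mantissa-rounding helper defined in the earlier sketch
def truncate_mantissa(a, keepbits):
    """Truncate (toward zero) float32 values to `keepbits` mantissa bits."""
    raw = np.ascontiguousarray(a, dtype=np.float32).view(np.uint32)
    mask = np.uint32(0xFFFFFFFF - ((1 << (23 - keepbits)) - 1))
    return (raw & mask).view(np.float32)

rng = np.random.default_rng(4)
field = rng.uniform(200.0, 320.0, 1_000_000)                 # stand-in temperatures in K
true_mean = field.mean()

for name, f in [("truncate", truncate_mantissa(field, 7)),
                ("round   ", round_mantissa(field, 7))]:
    print(name, "error in the mean:", f.astype(np.float64).mean() - true_mean)
# truncation biases the budget low by a sizeable fraction of the quantization
# step, rounding keeps the mean essentially unchanged
```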
This is absolutely something that we should also look out for with, let's say, the more transformation-based algorithms: it might be really tricky with them to guarantee that any kind of means, averages and so on are preserved, and this is not immediately clear for those ones, whereas as long as you have a bias-free rounding mode, this should basically all tick the boxes.
C
Hi Milan, it's Andrew Gettelman. Thanks for your participation today, starting earlier this morning our time, in the afternoon your time.
Again, great talk. I had a sort of general comment and thought: thinking about the concept we have of stupidly parallel processing, maybe there's a form of stupid compression where we just don't output stuff. This is kind of along the lines of what Peter was saying, that we think ahead about what we're going to use from these simulations, from what we get out of a CMIP model.
At NCAR we actually call these MOAR runs, Mother Of All Runs, where we output everything, but maybe that's not the model we should think about. We should think about only outputting what we need to output, based on the analysis. It's almost like taking the information content one step further, through the analysis, to "do you need it at all?", and some of this again is what Peter was saying about doing things online. But I'm wondering if you had comments on that, and whether there are ways of taking this information content even further upstream.
K
Yeah, no, absolutely. I mean, this is where my little graph came in, where I tried to make the little bend around the resolution, because I think this is really what we have to tackle in the end: our information space is really high-dimensional, including what applications we use and who the end user is in the end.
And I do agree, if we have a clearly defined objective that our data is supposed to serve, for example give us the temperature forecast at a given location, then it is relatively clear what we want from the data compression. However, and this is probably the caveat I want to mention, we all somehow work in research, right? So there might be someone coming up with a research question a few years down the line, saying "I want to look at this".
And suddenly that is an application which requires something different from the data. I think this is the underlying problem that we are facing with all this data compression. That's why everyone is basically careful, everyone says "oh no, I don't want to throw this data away, because I may use it at some point down the line", and so yeah, this is really where we have to find some kind of common sense.
C
I mean, I guess one example of something we always do, which is the first thing we do: we usually compress with resolution. But remember, we have to think about resolution in 4D, and that includes resolution in time, and then we lose information, but that's a convenient way for us to do it. We can probably think about that with respect to high resolution as well: what times should we compress, or what times do we just throw out? So anyway, just a thought.
A
Thanks, Andrew. Christian Jakob, yeah, I have a question too. So you mentioned, you know, you had your precipitation example, which I really liked, where you saw that the extratropics behave differently from the tropics, and then you compress it as if it was a global field. In principle we don't have to do this. The one advantage we have is that we know the system that we're modeling.
K
I have a better solution, and this is exactly this method that I framed as round plus lossless, where you literally split the truncation and the actual compression into two different steps, meaning that really the first thing you do...
A
Yeah, but there may be other advantages in chunking in space, right? A lot of research questions concern the tropics, so many people download global fields and throw away 70% of them straight away because they're looking at a certain aspect. So I'm just saying there may be opportunities to be smart. The last question, though, goes to Pier Luigi, and then we'll move on.
B
Hi Milan, just following on from this idea of future uses of the data, and I think you hinted at it in your presentation: there are also mathematical considerations, so we may want to take first or second derivatives. It's not just what we want to do with the data, it's what operators we want to apply to the data. Is there a way to take that into account?
K
Yeah, this is basically what all the transformation-based data compression methods that I've outlined a little bit claim they're better at, because in the end they're basically fitting some kind of basis function to your data, and as long as that basis function is nicely differentiable, you end up with a compressed data set where, once you compute, let's say, the gradients, they don't really get much distorted.
I do see these advantages, but I find it tricky at the moment to really see it: I'm still looking for applications where someone calculates some gradients or something and it absolutely goes wrong if you use, for example, this round-plus-lossless method. So if you have a good application, please say so, like "oh yeah, look at this data set: if you compress it and then compute the gradient, you see the problem". Please share it with me.
I do see that it is absolutely a use case that's necessary for us, but I don't necessarily see yet that the gradient preservation of the transformation-based methods makes it absolutely worth considering them over the other class of compression techniques.
A
Thank you again, Milan, for the great talk. You can see from the reactions that everybody liked it. So we'll let Milan go home and have dinner, I suppose, and we'll go on.