National Energy Research Scientific Computing Center (NERSC) Data Day 2022, October 26-27, 2022, 4 Nov 2022

From YouTube: Evolution of Data Services for Science

Description

Part of the Data Day 2022 October 26-27, 2022

Please see https://www.nersc.gov/users/training/data-day/data-day-2022/ for the training agenda and presentation slides.

A

Thanks for coming today. I'm just going to give a brief introduction to data services and where we see things going at NERSC. NERSC supports a huge number of data services, across all different areas: from data transfer to data management, to visualization, to containers and data analytics.

A

So one way of representing this, which I produced for a blog a few years back, is like this, and I know it's more complex than this. But there's a need to get data from scientific instruments, store it on big file systems, interact with it in an efficient manner, and steer workflows either interactively or via workflow managers. This interfaces with services that might be standalone databases, or there might be a more flexible service platform.

A

And then you need to interact with the HPC system, and there's all kinds of policy and also technology needed to make that efficient, including things like containerization technology. Actually producing results is also a kind of data problem, so there are analytics tools, an increasing number of machine learning tools, and visualization. All of this is an ever-changing ecosystem.

A

So if you actually look at the blog, I've updated many of these entries. A lot of the picture is static, but some of it is changing very quickly, so I thought it was interesting to look back at one previous Data Day. In those days we had fewer technical problems, because most people were in the Building 50 auditorium in person, but many other things have changed as well in this time. So I thought it was interesting to look at this machine learning talk here. Who even remembers what Lasagne is? A well-named framework, but not that long-lived.

A

So TensorFlow was just kind of starting then, and it turned out to dominate, and PyTorch didn't even exist at that time. You'll see it's now rapidly growing, so things have certainly changed since then. This plot only starts from 2018, and 2016 was even further back, so there were probably fewer than 100 users of Jupyter then. Now there are several thousand, and in fact we can see that from daily usage.

A

Jupyter is actually more popular as an interface into our systems than SSH. Python has also grown rapidly.

A

Now basically everybody uses Python, I think you can say to a first approximation. And not only that: we also have these numbers on which batch jobs use Python in some way, and that's also a large fraction of the jobs that we're running. With deep learning, we saw a growth of 6x in just three years. As I mentioned before, PyTorch wasn't even on our radar in 2017, for example, and it has now pretty much overtaken TensorFlow. I mean, this is last year's plot, but it has now overtaken it. And then there's containers.

A

We also see similar growth there, with hundreds of users. But perhaps even more impressively, the Top500 result that Perlmutter submitted, which put it at number five, was in fact run inside a Shifter container. So these technologies really have rapidly become part of the mainstream.

A

So here's just an overall picture of where we are. We see these big numbers for Python, at 3,600 users, and these growing ones for deep learning frameworks, but we also see emerging technologies, Julia and others, that are just starting this growth.

A

So all of these services run on, or interface with, our big supercomputers. That's the core of what we do at NERSC, and you've probably seen various presentations on this.

A

But just in case you haven't: most of Perlmutter's compute power is centered in these GPU nodes, accelerated with NVIDIA A100s, and this is a great resource for deep learning, for example. But then there's also a large number of CPU nodes, which can be used for more traditional analytics, for experiment workflows, or for things that are difficult or impractical to run on GPUs.

A

But then I like this slide as well, because it also shows that there's other infrastructure on the system.

A

That includes things important to data and analytics, such as the all-flash file system and the fast connections out to external facilities and to larger file systems. Data services interact in and across these systems. So this shows, as I mentioned, that data comes in from outside, runs potentially on CPU nodes and on GPU nodes, and interacts with the file system, and this can all be driven by workflow integration. I just wanted to comment here that we're really at the start of Perlmutter, and we're going to be seeing increasing data capabilities integrated into Perlmutter. This is particularly true for workflow integration, where we expect to be able to bring containerized services closer into the system.

A

Okay, so this was already planned in the NERSC-9 review period, but for NERSC-10 we're just starting the planning now, and it's going to be even more workflow and data capable. So this is just the overview slide of what we're talking about with NERSC-10. It shows, firstly, that it extends out beyond the system, I mean out into ESnet and out to instruments, but also that it will have workflow services built into it. Okay.

A

So, as I pointed out, things have moved on. We now have all of these great data transfer tools. We have I/O libraries. We have performant file systems. We have these flexible, Python-based frameworks that put really sophisticated tools at your fingertips, and containerized services that enable complex stacks to be there and also to be portable across different systems. And we have all these tools for building services that sit, for example, on the side of the machine and drive things.

A

But there are remaining challenges, and I outlined some of this direction in a talk that's linked here, a longer seminar. This wasn't necessarily coordinated, but I think the talks in this meeting actually touch on many aspects of these challenges.

A

So one important area is I/O, where data volumes are still increasing faster than I/O can keep up with. I think this means both that we need developments in the way that we store data and in the way that we do processing on storage, and those are research aspects that I know Jean Luca works on as well. But another important area is actually just improving the I/O that we do, and I/O profiling is an important piece of that, which we'll hear about soon. And then there's this area of workflow services and bringing them into HPC systems.
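To make the I/O profiling point concrete: one very simple form of profiling is to time the same write using different request sizes. The sketch below does this with the Python standard library only. It is not one of the dedicated profiling tools covered at the meeting, and the function names, sizes, and the choice of unbuffered writes are all assumptions made for illustration.

```python
import os
import tempfile
import time


def time_write(path, total_bytes, request_bytes):
    """Write total_bytes to path in request_bytes pieces.

    Returns (bytes_written, elapsed_seconds).
    """
    chunk = b"\0" * request_bytes
    written = 0
    start = time.perf_counter()
    # buffering=0 sends each write() straight to the OS, so the
    # request size is what the file system actually sees.
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            n = min(request_bytes, total_bytes - written)
            written += f.write(chunk[:n])
        os.fsync(f.fileno())  # include the time to push data to storage
    return written, time.perf_counter() - start


def profile_request_sizes(total_bytes=4 * 1024 * 1024,
                          request_sizes=(4096, 65536, 1048576)):
    """Time the same total write for several request sizes."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for size in request_sizes:
            path = os.path.join(tmp, f"request_{size}.bin")
            results[size] = time_write(path, total_bytes, size)
    return results


if __name__ == "__main__":
    for size, (written, seconds) in profile_request_sizes().items():
        mib_per_s = written / (1024 * 1024) / max(seconds, 1e-9)
        print(f"request size {size:>8} B: {mib_per_s:10.1f} MiB/s")
```

The absolute numbers depend heavily on the file system and on caching, so a comparison like this is only a first hint at where a real profiler should look.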

A

And so we've got a bunch of talks about how that can be done with our Spin service, with workflow managers running close to the HPC system, and using APIs to do that. And then about using these productive languages: as I mentioned, Python libraries are pretty capable, but using them with large-scale compute is not a solved problem. There are various directions that can help with that, which we'll hear about. And then, you know, maybe Python isn't the right language.

A

Maybe Julia is the way to get performance and productivity.

A

It's up to the speaker to convince us of that. And then I mention containers here alongside this: there are many advantages of containers, but one of them is actually to help with scaling these tools to HPC systems. And then there's deep learning and analytics.

A

There, again, help is needed in getting these to run distributed on large systems, and that can require tuning.

A

So we're going to hear about that, with some demos on how to achieve this, from Steve. And then there's also this emerging tool, JAX, which not only helps to address the question of scaling Python onto GPUs but also, and I think this is an important part of it, potentially brings automatic differentiation to software written in JAX. I think this works towards this last point about enabling direct inference on experimental data by interfacing simulation with differentiation. So that's the agenda.
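As a rough illustration of the automatic differentiation idea mentioned above, here is a minimal sketch in plain Python. It does not use JAX itself. Instead it swaps in a hand-rolled dual-number class (forward-mode differentiation) so that it runs with the standard library alone; in JAX the gradient would come from the library rather than from a class like this. The `Dual` class, the toy `simulate` model, and the constant 4.9 are all assumptions made for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value together with its derivative with respect to one parameter."""
    val: float
    der: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __rsub__(self, other):
        return _lift(other) - self

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def simulate(k, t):
    """Hypothetical toy simulation: predicted signal k * t^2 at time t."""
    return k * t * t


def loss(k, times, observed):
    """Sum of squared differences between simulation and data."""
    total = Dual(0.0)
    for t, y in zip(times, observed):
        residual = simulate(k, t) - y
        total = total + residual * residual
    return total


def fit(times, observed, k0=1.0, learning_rate=0.01, steps=100):
    """Infer k by gradient descent, differentiating through simulate()."""
    k = k0
    for _ in range(steps):
        # Seeding der=1.0 makes the result carry d(loss)/dk.
        gradient = loss(Dual(k, 1.0), times, observed).der
        k -= learning_rate * gradient
    return k


if __name__ == "__main__":
    times = [0.5, 1.0, 1.5, 2.0]
    observed = [4.9 * t * t for t in times]  # synthetic "experimental" data
    print(f"fitted k = {fit(times, observed):.4f}")
```

Because the simulation is written only in terms of differentiable operations, the fit recovers a value close to the 4.9 used to generate the synthetic data without any hand-derived gradient.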

A

Obviously, we had a somewhat stunted start here with the AV problems, but there's definitely a lot to come, so stay tuned for it all. Okay.