From YouTube: Day 1 AI for Science at NERSC
A
Thanks, Troy. Thanks to the organizers at NVIDIA and my colleagues at NERSC for putting this on, and thanks to all of you for being here and listening. I'm Steve Farrell, a machine learning engineer at NERSC in the Data and Analytics Services Group. Broadly, my job is to support machine learning workloads on our NERSC high performance computing systems.
A
I will talk a little bit about the kinds of things that we do at NERSC. I won't go too much into introductory material; I'll say a little bit about AI for science and our perspective from an HPC center. Of course, there's going to be more introductory content coming up later, so I apologize if I gloss over things that may be of interest to you, but I'm happy to supplement with questions afterwards or discussions on Slack. And I think I'm ready to go here.
A
Yeah. Unfortunately, we can't actually use a NERSC system today. We were really hoping to be able to use Perlmutter resources for the hands-on material today, and in fact that's why we delayed this event from last year. But these kinds of systems are complicated, and I'll mention shortly that we're in the process of upgrading the system. It's very hard to predict how things are going to be.
A
We probably could have used Perlmutter today in the end, but there was a lot of uncertainty yesterday and we had to make a decision; I had to pull the plug. So we're very grateful that NVIDIA has resources they can spin up so quickly for events like this. So I'll say a little bit about AI for science, and then I'll talk about NERSC and about our AI strategy.
A
All right, so we're all here presumably because we're interested in science. We're working on science, we're probably all working on interesting problems, and we're aware of AI's potential to enhance our research and really transform the kinds of science that we're doing. In fact, as AI is being rapidly adopted across many domains of science, we're seeing that it can be applied in almost any science domain.
A
I don't really know of any domain where it has not yet been considered potentially transformative. But even within specific science domains, AI broadly can be applied to a lot of different aspects of our research workflows, including, but not limited to, the things that I have here, such as analysis of large datasets. Of course, AI is not limited only to large datasets, but that's where the modern techniques in AI, with deep learning and deep neural networks, really shine.
A
When I say analysis of large datasets, that can also mean a few things. We know that AI gives us methods that can learn directly from data, and in many cases these learned models can actually get more out of our data than we can with hand-engineered features, by learning the complex features that are needed to solve a specific problem.
A
AI can also help us in cases where maybe we don't really have a great traditional solution to a problem. Maybe instead we rely on hand labeling data or scanning through data, which is tedious and limits how much we can do, even with our grad student armies, right? But with AI we can automate a lot of that. So that's just a couple of things so far. Another big one is acceleration of expensive simulations, and this is especially relevant from the HPC facility perspective, but also broadly in science.
A
We rely a lot on having physical models of the world, on having simulations that can go from initial conditions to final conditions, or from first principles to some observed quantities. And very often the amount of science we can do is actually limited by the computational resources that we can commit to that.
A
Sometimes for these computations, like performing density functional theory on a very large system of atoms, the computational need just explodes, and that limits what we can actually do. Or consider our ability to model the climate of the Earth: we can do things at low resolution, and maybe we even have good physical models for the smaller-scale physics, but trying to model the entire Earth at the resolution needed is pretty much impossible with today's resources. So again, science is limited by that.
A
You may have seen a paper from DeepMind not too long ago where they had really great results on controlling a tokamak fusion reactor with AI, showing that they could, I think, even go beyond what an expert engineer could do. So AI is really being enthusiastically adopted by the science communities, both in the DOE and the NSF and beyond, and we see a recent AI wave here.
A
There are a lot of science domains, I think, that are still waking up to the capabilities of AI, but luckily we're also seeing, in a lot of areas, research moving from proof of concept to maturity. Things are actually getting sophisticated enough, mature enough, that they can be used to do scientific discovery or be used in scientific production. Of course, that doesn't mean the story is done here.
A
There's still a lot of work needed, which is why we're all here: so that we can learn more about AI and how to apply it to our problems. And as things keep growing, as things keep getting more sophisticated, as we tackle more and more complex problems, the computational needs of AI become quite demanding, and they're still growing.
A
So HPC centers like NERSC can play a really important role, not only because they provide those needed computational resources with large-scale high performance computing systems, but also because they provide the expertise for how to deploy those workloads. It turns out that it's still non-trivial to deploy, let's say, a massively parallel training run of one of the biggest cutting-edge, state-of-the-art deep learning models out there today. Hopefully that will get better over time, but that's the situation now. So, an introduction to NERSC: NERSC is the National Energy Research Scientific Computing Center.
A
We are located at Lawrence Berkeley National Lab, and by mission HPC center I mean that we cover the whole mission of the Department of Energy Office of Science: all the science domains that the Department of Energy funds and cares about can potentially get time on our systems. In fact, the DOE allocates most of the hours on our systems. So we have a very large and diverse user base, with lots of different kinds of science being done on our systems. And that brings me to the systems we have.
A
Okay, I'll get a little bit more now into our strategy, what we're doing to support and enable cutting-edge AI methods for science. There are roughly three categories here that should be fairly digestible. First, deployment: we try to deploy optimized systems for AI for science, both hardware and software. But we've found that that's really not enough. You can't just have a well-optimized system; we also have to be there in the weeds.
A
We have to make sure that we have the expertise, and also be on the front lines to push on methods and tools. So we do engage a bit with the community and with scientists. We have postdocs that we hire at NERSC to work on research problems applying AI for science. And then the third thing is empowerment: we do a bit of outreach, with seminars, workshops, training events, and schools, which I'll say a little bit more about, and of course this event today is one example.
A
In the deployment category, I'll say a little bit more about Perlmutter now. Sorry again that you're not able to use it today, but hopefully there's enough material in the presentation here that you'll be able to go back and try it out later if you already have an account. If you don't, then maybe you can request one; I'm happy to talk with people about how to do that if they need it. Perlmutter is a system from HPE.
A
Actually, it's a Cray Shasta system: when we first started procuring it, it was just from Cray, and then Cray was bought by HPE. Last year we deployed the phase one system, which was all of the GPU nodes, 12 GPU cabinets. Each node has four NVIDIA Ampere A100 GPUs, and in total we have over 6,000 of these GPUs, so it's pretty sizable. There's also a fairly substantial all-flash Lustre storage system.
A
That storage isn't available right now because of the upgrade, which is one of the problems. The phase two upgrade is what's happening now: it brings a whole CPU-only partition to Perlmutter, in addition to the GPU partition, that is, nodes without GPUs for workloads that either don't yet use GPUs or don't need them. It also brings an upgrade to the network, and this is actually the part that's really impacting the GPU nodes as well.
A
And one other thing to say here: NVIDIA was kind enough to call this the world's fastest AI supercomputer when we turned it on. All right, so part of our strategy is to track what's going on in the community and with our users. We do see a growing scientific AI workload at NERSC, and of course we anticipate that it will keep growing as people put their workloads onto Perlmutter, which is particularly well suited for these kinds of workloads.
A
We track these things in a few ways. One is that we can actually track machine learning software usage, at least to some extent, on our systems. Some of this is not yet working on Perlmutter, but that's still in progress. In principle, if somebody does module load pytorch or tensorflow, we can log that, and we have a way to log Python imports, so we can see what Python packages people are using.
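As a purely illustrative aside, here is a minimal sketch of one way such import logging could be done in Python. This is not NERSC's actual monitoring mechanism, and the log file name is hypothetical.

```python
import builtins
import logging

# Illustrative only: one possible way to record top-level package imports.
# Not NERSC's actual tooling; the log destination is a hypothetical placeholder.
logging.basicConfig(filename="python_imports.log", level=logging.INFO)

_original_import = builtins.__import__

def _logging_import(name, *args, **kwargs):
    # Record just the top-level package name, e.g. "torch" or "tensorflow".
    logging.info("import %s", name.split(".")[0])
    return _original_import(name, *args, **kwargs)

builtins.__import__ = _logging_import
```

Dropping something along these lines into a site-wide startup hook (for example, a sitecustomize module) would record which ML packages jobs pull in.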
A
For example, on the bar chart on the right you can see roughly six-times growth from 2018 to 2021 in TensorFlow and PyTorch usage, and hopefully we'll soon have another, much larger extension of that. We also put out a survey; we've been doing this about every two years, and there's one going on right now, where we ask the scientific community, including NERSC users, who I think make up probably most of it, what they're doing, what kinds of problems they're working on, and what their computational needs are.
A
We ask what kinds of software they're using, the tools they need, how they're using the systems, and things like that. As I said, there's one ongoing right now, and it would be really great, if you're applying machine learning to science, if you would help us out by filling out that survey; there's a link at the bottom. I did share these slides on the Slack presentation channel, and I can also put them in the Zoom chat or share them in any way that you need. I'll show some plots that come out of those surveys.
A
I don't have the preliminary ones from this year; the conclusions are not too different, but I can point out ways in which the trends are changing. One thing we see from the users out there, from the community, is a real need for large-scale resources and for parallelization, which basically motivates the need for HPC systems.
A
Our users can sometimes take days or weeks to train their machine learning models, and they can have large datasets of hundreds of gigabytes, terabytes, and these days even getting into petabytes. I don't have anything here on the different ways to parallelize machine learning workloads, like training workloads.
A
We actually have a whole tutorial on that, and I'll share a link later on. But just to say briefly, there are various ways to parallelize machine learning workloads on these systems, and we also ask our users about the kinds of things they're doing.
A
Data parallelism is the one that's most prevalent today, but as models get bigger we're already seeing a need for more kinds of parallelism, like model parallelism and things like that. And that comes back to the point I made before: it can still be non-trivial, still challenging, to deploy these kinds of sophisticated parallel workloads on HPC systems. So we do what we can to try to educate the community and make it easier.
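As a rough illustration of the data-parallel case, here is a minimal PyTorch DistributedDataParallel sketch. The model and data are stand-ins, and it assumes a launcher such as torchrun (or srun with equivalent settings) has set the usual process-group environment variables.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT are set by the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 1).cuda()        # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])  # wrap for gradient all-reduce
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        # Each rank would normally read its own shard of the dataset.
        x = torch.randn(64, 32, device="cuda")
        y = torch.randn(64, 1, device="cuda")
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()   # gradients are averaged across all ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Model parallelism, by contrast, splits the model itself across GPUs and generally needs more specialized libraries.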
A
We know that our scientists need performant and flexible software that enables their productivity. They need to be able to iterate quickly and try things out; you don't want to be bottlenecked because the software is slow. So they not only need things that run fast, but they also need flexibility: people need to be able to add whatever packages are relevant to their domain or their application area. At NERSC we deploy these things in a few different ways.
A
We enable our users to use either software we provide or to install their own. We do provide custom-built modules: users can do, for example, module load pytorch and get an installation that they know is built and optimized for our systems. But people can also build their own custom conda environments, and they can use containers. We support containers through our Shifter runtime, and a really important thing for this on Perlmutter is NVIDIA's offerings, the NGC containers, which tend to be very cutting edge.
A
They always have the latest NVIDIA GPU software stack, CUDA and cuDNN and NCCL and things like that, so we increasingly rely a lot on these containers and encourage our users to use them.
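For example, after loading a PyTorch module or launching one of those containers, a quick sanity check along these lines (just a sketch, assuming PyTorch is present in the environment) confirms that the GPU stack is wired up:

```python
import torch

# Quick sanity check of the GPU software stack in the current environment.
print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x  # exercises the CUDA math libraries with a matrix multiply
    print("Matmul result shape:", tuple(y.shape))
```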
A
I'll move quickly to catch back up a little bit on time here, but we know that scientists also need productive interfaces. Jupyter is a very popular service at NERSC; we have over 2,000 users, and users are actually able to use Jupyter on Perlmutter for their machine learning workloads.
A
You can request a GPU node, you can use software kernels we provide, or you can bring your own. On top of that, users also need systems and platforms for managing all their experimentation and exploration, to find which models are best for their problems, things like Ray Tune and Weights & Biases.
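As a tiny illustration of the kind of experiment management Ray Tune offers, here is a hedged sketch using the classic tune.run API; the objective is a toy stand-in for a real training loop, and the exact API differs across Ray versions.

```python
from ray import tune

def train_fn(config):
    # Toy objective standing in for a real model training loop.
    loss = (config["lr"] - 0.01) ** 2
    tune.report(loss=loss)  # report the metric back to Tune

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # search space for the learning rate
    num_samples=8,                               # number of trials
    metric="loss",
    mode="min",
)
print("Best config found:", analysis.best_config)
```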
A
We don't pick and choose any specific offerings here, but we do like Weights & Biases and Ray Tune for our own usage, and we try to make sure these things work, encourage people to try them out, and ask them to let us know if they have problems. And then, what do we do to make sure our systems and software are optimized? Well, one important aspect of that is benchmarking, and this is something that I personally spend a lot of time on: MLPerf.
A
MLPerf is the standard machine learning performance benchmarking effort from MLCommons; it's the industry standard these days. Working with a bunch of sites, we put together an MLPerf HPC benchmark suite that actually brings in scientific applications with the kinds of attributes that we think are important for pushing on HPC systems, things like 3D volumetric cosmology data, high-resolution climate images, or atomic systems for graph neural networks. This has been a really valuable effort for us.
A
It's also been pretty successful, with a couple of submission rounds. We have measurements from systems all over the world, 31 submissions I think in the last round, and we present results at Supercomputing. For us personally, participating has been great for helping us shake out the issues in Perlmutter, understand its performance characteristics and what it takes to get performance out of it, and then we can pass that knowledge on to our users.
A
Now I'll switch a little bit to the application side of things. This is mainly highlighting work that some of our awesome postdocs are doing right now, work that has interesting, sophisticated aspects. I'll just skip that slide, but this first one is the self-supervised sky survey work; my colleague Peter Harrington works with George Stein and some others on this. It looks at images of galaxies from sky surveys, where in this case we have a lot of data but not a lot of labeled data, and so self-supervised learning is a technique suited to that.
A
They're actually looking for, I can say this, strong gravitational lens candidates, where a galaxy gets distorted by gravity and can look like this sort of ring pattern here; it's just pretty cool. This next work is called FourCastNet. It's led by some of our postdocs: Jaideep was a former postdoc, now at NVIDIA, Shashank is a current postdoc, and Peter works on this as well. We work a lot with NVIDIA folks on this one.
A
This is taking atmospheric modeling with deep learning to the next level. It uses an interesting Fourier-transform-based operator, basically with an attention mechanism, to be able to do this at higher resolution than was done before with deep learning models, bringing the precision up to the level of numerical models while being much, much faster.
A
So again, this is a case that will potentially open the door to letting us really do better science in modeling the climate of the Earth. One other thing to mention about this: if you watch the GTC keynotes from Jensen Huang, he talks about this work. This last one is another interesting case; it's also one of our benchmarks in that MLPerf HPC suite.
A
This is from the Open Catalyst project, where they're trying to find new catalysts for energy storage, to help combat climate change. This is a case where you would use density functional theory, which is very expensive and slow, but you can replace that with graph neural networks to model the system and get a good speedup. We had a postdoc, Brandon, working on this, who's now at Meta, and it's a collaboration with CMU and Meta.
A
They put out a very large and diverse dataset for this, there's a NeurIPS challenge, and a lot of cool work is coming out of there. One thing that Brandon was able to show in this work is that larger models in this case do better, so they're working on scaling up to large systems. I'm basically out of time, so I'll just say really quickly: we do a lot, again, in empowerment and training.
A
We have done a Deep Learning for Science school at Berkeley Lab; we've had two iterations of this, one in 2019, which was in person, and then one in 2020, which was an online webinar series. You can get videos and slides and everything on these web pages. We also do a Deep Learning at Scale tutorial. The focus there is really on performance and how to scale up the training of a neural network model to a large system, and all the tricks you might need to use there. All the materials are available.
A
We have videos; you can check those out. It's accepted again for Supercomputing this year, so if you're there, please check us out. And then there are things like this boot camp, which we're doing right now. I think I'll just, very quickly, mostly skip the conclusions here. What I'm trying to say is that AI for science needs supercomputers.
A
We see that the field of scientific AI is growing and becoming more sophisticated, which is great to see. There's still work to do, though, and we're doing our best to contribute to that. Feel free to reach out to me if you have questions or want to ask about collaborations. We're also hiring, and there's a link down here with some openings for postdocs and engineers. That's all, thank you, and sorry for running a little over time.
B
Nope, that was perfect. Yeah, great to see what's happening. So now let's go ahead and dive into our boot camp; I'm going to go ahead and hand this over to Caleb.
B
Hey, good morning, evening, afternoon, everybody. I just wanted to say, Steven, that was awesome. I could have listened and looked at way more projects. So if you have a...