From YouTube: Introduction to Perlmutter System
Description
Jay Srinivasan (NERSC)
Introduction to Perlmutter System
Thank you, Jack, and thank you to Neil and Helen and all the other organizers for putting this together. It's very interesting to see the progression of this. We sort of put together the idea for Perlmutter back in 2015, and the landscape of GPUs at NERSC was very different then, and the overall push to get our user base ready for GPUs has been really spectacularly successful. So let me give a really short overview — none of this will be surprising to you all, but just to set the tone here.
Okay, and you can all see the slides.
Obviously, it's very fitting to talk about Perlmutter at GPUs for Science, because it does actually provide GPUs for science, and so in the NERSC pantheon of systems Perlmutter is right there in the middle, at least on this slide.
So, starting in 2021, we started deploying parts of Perlmutter, and we actually made the GPU-accelerated nodes the very first thing that was available — to our staff, then to some other users, and then to all users. And typically, as you can see from the progression here, we get a new system every few years.
NERSC-9, Perlmutter, is in some ways new and in some ways a continuation of what we initiated with Cori, where there were many-core CPUs; the parallelism and advanced architectures that were kicked off in that era continued on with Perlmutter.
So at a very high level, like Jack said, it is a system that's been optimized for science. By that we mean we don't just have one kind of thing on the system; we have in fact many different things, and we make sure that they work well together. We have CPU-only nodes, and we have GPU-accelerated nodes, which I'm sure are very interesting to all of you. We have a high-performance, all-flash file system. And then, from the user's point of view, the first things the user sees on the system — or even just getting onto the system — are things like the login nodes, or nodes that aren't necessarily optimized for high-performance compute but provide the basic foundational building blocks that allow you to compose your workflows.
In addition, to bring together the NERSC environment, we make sure that external file systems and networks can connect well and have a good path into the system. Here is a little bit more detail on this, which I hope should be familiar to people who've actually been on the system now.
We can broadly divide the system into three parts. One is all of these gray boxes, which are the supporting environment.
So: the login and access nodes and the service nodes. The way we manage the system is with a Kubernetes-based system-management orchestration, and that isn't directly visible to the users, but it does have some features that make the user environment, and access to the system, much more useful and interesting. Then we have the compute portion, which right now, like I said, is composed of GPU-accelerated nodes and CPU-only nodes. And then all of this is tied together with these Slingshot switches.
Slingshot is the high-bandwidth, low-latency network that HPE has put together for this system, and that is in fact what allows all of Perlmutter — and I'll show you the differences from Cori — to sit under one network. In addition, there's not a lot of point in having a system that nobody can get to and do interesting things on.
So we have a very resilient, high-bandwidth link to the NERSC network and to the world. The picture here on the lower left shows that we have Perlmutter, all of it on the Slingshot network, and then a very high-bandwidth, resilient connection out to the edge routers — this is multiple terabits per second.
And then from the edge router we connect to all of the other stuff that's outside of Perlmutter. Let's see — in terms of the specific hardware, obviously, on the GPU side, the most interesting ones are the GPU-accelerated nodes, and these have four NVIDIA A100 GPUs per node. You can actually see them over here in this little picture, which shows the actual motherboard.
In addition, to actually drive the node — since the GPUs are not yet able to do that themselves — we have an AMD Milan CPU. We also have DRAM on the GPU nodes: 256 gigabytes of it, which you can actually see here between these copper fins. And then, because we have four GPUs, we have four Slingshot NICs per node.
On the CPU-only nodes we have just the two CPUs, and because we then have a lot more real estate on the motherboard, we're able to give you 512 gigabytes of DRAM per CPU node, and just one Slingshot NIC.
The GPUs themselves are hooked up together like this, and they're connected to the rest of the node as shown in the picture on the top left: they have PCIe connections to the CPU and to the NIC, and from there to the outside world, and amongst themselves they're connected with NVLink. (A minimal sketch of how a job can line up with that four-GPU layout follows below.)
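As an illustration of the four-GPU-per-node layout just described — this sketch is not from the talk — the following Python snippet binds one MPI rank to each A100 using the node-local task index. It assumes mpi4py and CuPy are available in the job's environment and that Slurm launches four tasks per node; those are assumptions, not documented Perlmutter defaults.

# Minimal sketch (assumption, not from the talk): bind one MPI rank to each
# of the four A100 GPUs on a GPU-accelerated node.
import os

from mpi4py import MPI   # assumes an MPI-enabled Python environment
import cupy as cp        # assumes CuPy is installed for GPU work

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Slurm sets SLURM_LOCALID to the task's index on its node
# (0..3 if four tasks per node are launched, matching the four GPUs).
local_rank = int(os.environ.get("SLURM_LOCALID", 0))

n_gpus = cp.cuda.runtime.getDeviceCount()   # expected to be 4 on a GPU node
cp.cuda.Device(local_rank % n_gpus).use()   # pin this rank to "its" GPU

# Trivial on-GPU work so each rank exercises its own device.
x = cp.arange(1_000_000, dtype=cp.float32)
local_sum = float(x.sum())
total = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"ranks={comm.Get_size()}  gpus_per_node={n_gpus}  total={total}")

A job might launch this with something like "srun -n 4 python gpu_bind.py" so that the four ranks on a node line up with the four GPUs; the exact launch flags and module setup depend on the site configuration and are not taken from the talk.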
So, from a system point of view, let me just take a minute here and talk about the features. Shasta is the system-management framework — the software stack — that we have on this system, and this is a fairly typical picture, but there are some differences from standard large clusters or large supercomputers. We have a whole bunch of non-compute nodes on the left, and then we have all the compute nodes on the right. The non-compute nodes in this case are in fact managed with cloud-managed infrastructure.
This is very similar to what the very large cloud providers use, and there are certain benefits to doing it at this scale — the scale we have — as well. It gives you the service-oriented architecture that allows us to use a lot of the new developments in system-management capabilities that are out there to manage the system. The compute nodes themselves are essentially bare metal, so there aren't any directly user-accessible, cloud-oriented features on them.
But there is a bunch of value that we can leverage from this cloud approach. We have a system-wide API that users can get access to, to help control some aspects of their workflows, and all of the services themselves are resilient when we put them on this framework.
So, at a very high level, the user-environment nodes — the worker nodes and login nodes — are all controlled using Kubernetes; the compute nodes themselves are not directly controlled by Kubernetes, but the entire orchestration process is controlled using it.
In terms of differences from Cori — I think people are familiar with Cori — again, the main thing is that all of Perlmutter, which is this middle block here on the left of this slide, is under one Slingshot network. Compare that to Cori, where you had the two partitions, the Haswell and the KNL partitions; those, along with the associated service nodes, the boot nodes, and the nodes that help you get
access to the file system, are all part of the Aries network, but the access nodes, as well as the storage itself, are on a different network. So there's a network translation happening at some level on Cori between, say, the KNLs and the I/O nodes, or the KNLs and the paths to the outside world, whereas with Slingshot everything is on one network. You can see that in this picture here on the far right, which shows the network connections between what we call groups. Roughly speaking, each one of these little dots is a cabinet — on the compute side, to a first approximation — plus the rest of the cabinets, the service cabinets, and so on, and you can see the network connections that go between them, as well as the connections to the outside world. The sort of deceptively small connection here is, in fact, our high-bandwidth connection to the rest of NERSC.
In terms of the software, I think you'll hear more about this in the talks you'll be getting, but we have a very rich and robust programming environment, as well as support for various programming models and languages, and I won't go into great detail here.
It's just to note that there's very good coverage of things that are formally supported by the vendor, but also NERSC-supported, in the sense that there is staff — the teams that Jack and other groups, Rebecca's group and others, are all part of — strongly supporting all of these programming environments that we have on the system. That will help make the system very productive for science.
The pie charts on the left here show you the breadth of the user base — basically who is using the system, broken down by which DOE Office of Science office they get their support from — and you can see we have very good, broad coverage across both the GPU and the CPU node usage on Perlmutter over the last year, and that also translates to the kinds of science happening on the system.
To back up a little bit: we've had these three pillars that we've tried to support on Perlmutter — traditional simulation, data, and machine learning — and we've had successes in all three of these pillars even at the beginning of this rollout of Perlmutter; you can see some of the examples here on this slide. And then, finally, in terms of the future of Perlmutter — I'm almost out of time.
Perlmutter is just being introduced, so it will have a long and storied life ahead of it, but we're not just going to keep the system static. There are a whole bunch of improvements that the systems folks, as well as the user-integration folks, are making to this system to help it get better every single year, and we can bucket those into three large buckets.
One is operational improvements that help us keep the system up and running with the least amount of impact, and hopefully give you the most productivity. In terms of the user environment, we're going to be able to start giving access to container-based environments that allow users a lot more control over what they see when they log into the system and the kinds of things they can immediately do, which is a big boost for productivity. And then, on the access side,
we're going to have a whole bunch of API-driven interactions that we're going to be able to enable very soon, including things like new ways of interacting with the workflow manager through a set of RESTful interfaces, management of automated tools, GitLab runners for CI/CD, as well as data-movement operations; a sketch of what that kind of interaction could look like is below.
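As an illustration of the kind of API-driven interaction being described — not the actual NERSC or Slurm REST API — here is a hedged Python sketch that submits a batch script over HTTPS. The base URL, route, token handling, and payload shape are placeholders I've assumed for the example.

# Hypothetical sketch: submitting a batch job through a RESTful interface
# instead of an interactive login. Endpoint and payload are illustrative
# assumptions, not a documented API.
import requests

API_BASE = "https://api.example.invalid"   # placeholder base URL (assumption)
TOKEN = "<access-token>"                   # obtained out of band; placeholder

headers = {"Authorization": f"Bearer {TOKEN}"}

# A small batch script sent as the job payload; the SBATCH options are
# ordinary Slurm directives used here purely for illustration.
job_script = """#!/bin/bash
#SBATCH -N 1
#SBATCH -C gpu
#SBATCH -t 00:10:00
srun -n 4 ./my_gpu_app
"""

resp = requests.post(
    f"{API_BASE}/compute/jobs",            # illustrative route (assumption)
    headers=headers,
    json={"script": job_script},
    timeout=30,
)
resp.raise_for_status()
print("submitted:", resp.json())

The real interfaces (for example, Slurm's REST API or NERSC's own service API) differ in their routes, authentication, and payloads, so this is only meant to convey the shape of the interaction.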
So let me stop there. We're really thankful to the community of users here, who have helped the system to be very productive, and we look forward to giving you all Perlmutter to enable great science. Thank you very much. Let me just stop sharing.