From YouTube: Introduction to Perlmutter
Jay Srinivasan (NERSC)
Okay, so where does Perlmutter fit in the lineup of NERSC systems? Right between NERSC-8 and NERSC-10, so this is NERSC-9. As Richard mentioned, we've continued this transition of the applications and workflows, and we're starting to support new, complex workflows on the system, which is a mixed CPU and GPU system. We started deploying it in parts in early 2021, and we're now getting very close to rolling out its full capabilities.
At a very high level, this is what it looks like. There are CPU-only nodes, which use the AMD EPYC series; there are GPU-accelerated nodes with NVIDIA GPUs; we have an all-flash file system; and, from a user-environment point of view, we have workflow nodes and high-memory nodes as well.
There are also standard login nodes, all of which are hooked up to Slingshot, which is an Ethernet-compatible network.
In addition, the system pulls in other resources from NERSC: external file systems, high-bandwidth connections to archival storage and, of course, the other system that we have on the floor, which is Cori.
Here is a little more detail, and you can see some of the specifications of each of the pieces I just mentioned. I'll talk a little bit about the orchestration of how we set up the system, but right now this is our first system where the system-management portion is orchestrated using cloud technology.
The other thing Perlmutter has is that all of the compute nodes are connected to the NERSC network with a resilient, high-bandwidth link, and the little graphic on the bottom left shows what that looks like. Perlmutter has a multi-terabit-per-second connection to its edge routers, which in turn have a smaller, but still multi-terabit, connection to the NERSC network itself and then on to ESnet and the rest of the world.
One thing to note is that the system itself is divided largely into two parts: the management framework, which is the gray-colored rectangles here, and the compute node partition, if you will, which consists of the GPU and the CPU nodes. The compute nodes are all direct liquid cooled (water cooled, in this case), and the rest of the management framework is all air-cooled racks.
In terms of hardware capabilities, as I mentioned, there are the GPU-accelerated nodes, an example of which is shown right here; you can see the complexity of all of the cooling, the heat sinks and the various other connections on the node. They have four A100 GPUs per node.
Each of these GPUs has 40 gigabytes of high-bandwidth memory, which gives you 160 gigabytes of high-bandwidth memory on the node, and they're all linked together with NVLink 3, the latest generation of NVIDIA's link between the GPUs, as the picture here shows.
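As a quick sanity check on that arithmetic, here is a minimal sketch that tallies the per-GPU and per-node high-bandwidth memory. It assumes a Python environment with PyTorch and CUDA available on a GPU node, which is an assumption for illustration, not a statement about the system's default software stack:

```python
# Hedged sketch: tally the HBM of the GPUs visible on one node.
# Assumes PyTorch with CUDA support is installed; on a node like the one
# described above you would expect 4 devices at ~40 GB each (~160 GB total).
import torch

if torch.cuda.is_available():
    total_bytes = 0
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        total_bytes += props.total_memory
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB HBM")
    print(f"Node total: {total_bytes / 1e9:.1f} GB HBM")
else:
    print("No CUDA devices visible on this node.")
```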
In addition, to help drive the node itself, they have an AMD EPYC 7763, also colloquially known as Milan. So, on top of the high-bandwidth memory, these nodes have 256 gigabytes of DRAM per node, which you can see here and which is also actually liquid cooled. And because they have four GPUs per node, we've provisioned them with four network cards per node, and in this case these are the latest generation of HPE's NICs.
These are what's called Slingshot 11, which are capable of 200 gigabits per second each. This is how they're connected: the GPUs are hooked up with PCIe through the CPU, which is in turn hooked up with PCIe to the NICs and then out to the outside world, and the GPUs are all linked to each other.
On the CPU nodes, which unfortunately I don't have a picture of, we have two sockets of Milan, and in this case we have 512 gigabytes of DRAM, because we have a little bit more space when we don't have the GPUs in there, and they have one Slingshot card per node. So if you look at this node, you'd basically double the number of CPUs and then remove all of the GPUs.
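To keep the two node types straight, here is a small, purely illustrative Python summary of the figures just quoted; the dictionary below simply restates the numbers from this talk and is not an official specification:

```python
# Illustrative summary of the node figures quoted above.
# Numbers are as stated in the talk; treat them as approximate.
nodes = {
    "gpu": {
        "cpu_sockets": 1,    # one AMD EPYC 7763 "Milan"
        "gpus": 4,           # NVIDIA A100, ~40 GB HBM each
        "hbm_gb": 4 * 40,    # 160 GB high-bandwidth memory
        "dram_gb": 256,
        "nics": 4,           # Slingshot 11, 200 Gb/s each
    },
    "cpu": {
        "cpu_sockets": 2,    # two Milan sockets
        "gpus": 0,
        "hbm_gb": 0,
        "dram_gb": 512,
        "nics": 1,
    },
}

for kind, spec in nodes.items():
    injection_gbps = spec["nics"] * 200
    print(f"{kind.upper()} node: {spec['dram_gb']} GB DRAM, "
          f"{spec['hbm_gb']} GB HBM, ~{injection_gbps} Gb/s injection bandwidth")
```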
Let's see. The other thing, as I mentioned, was the overall system orchestration framework, and here you have a high-level picture of what that looks like. You've got all of the non-compute nodes, and then you've got the compute nodes. Typically what happens is you have some small number of non-compute nodes that manage and boot all of the compute nodes and handle the storage and so on.
In this particular case, like I said, we're using an orchestration framework that is relatively new, to HPC at least, called Kubernetes, which is a service-oriented architecture, and that allows us to put various services on here in a more or less resilient manner. It controls all of the booting and orchestration of the other services on all of the non-compute nodes. The compute nodes are booted using this framework, but they're not actually controlled using Kubernetes; they're just booted. They run an enterprise Linux environment, which in this case is SUSE Linux that has been vendor-modified with certain drivers and things like that. It is bare-metal booted, so we're not controlling it using any virtualization: the environment that you get on the compute node is not virtualized in any way, but we have additional vendor-provided features in there, like the programming environment, various other Cray Linux features, and so on.
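To give a flavor of what that service-oriented management model looks like, here is a minimal sketch using the official Kubernetes Python client to list the pods backing a set of management services. The namespace name is hypothetical, and this is only an illustration of the Kubernetes model, not the system's actual configuration:

```python
# Hedged illustration of the Kubernetes service model: list the pods in a
# (hypothetical) "management-services" namespace and report their state.
# Requires the official "kubernetes" Python client and a valid kubeconfig.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="management-services")
for pod in pods.items:
    print(f"{pod.metadata.name}: {pod.status.phase}")
```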
Let's see, in terms of differences from Cori: Richard pointed out that Cori, our previous system, is coming to the end of its life. At a very high level, both of these systems have a dragonfly topology, but obviously the network itself, the underlying network hardware and the protocols, are different. Cori has an Aries network, whereas Perlmutter has a Slingshot network. On Cori, though, you can see here that all of these additional capabilities, such as the login nodes, the GPU nodes that we added to Cori, and the storage nodes, are separate networks that connect through what are essentially gateways into the Aries network. You've got the KNL and Haswell nodes and so on, as well as the service nodes, all of which are part of the Aries network. On Perlmutter, everything is part of the one Slingshot network, indicated here by these two tiny switches, and then we have the CPU-only nodes as well as the GPU-accelerated nodes and the management network.
So you can see here a picture of Perlmutter's network, and the different colorings show the different groupings of these nodes. You have 24 of these compute groups, then 12 of these service groups with the I/O nodes, and then four for the service nodes.
In terms of software, we have a really rich set of programming environments that we support, as well as programming models and languages, all of which are pulled together by the community codes that we support as well. I won't go into too much detail since I'm running out of time, but that's what we have.
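As one generic illustration of the kind of parallel code those programming environments support, here is a minimal MPI "hello world" in Python. It assumes mpi4py is available in your environment, which is an assumption for illustration rather than a statement about the default modules:

```python
# Minimal MPI example, assuming mpi4py is installed in the environment.
# Launch with your site's parallel launcher across several ranks.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

print(f"Hello from rank {rank} of {size}")
```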
In terms of the science, this is just a teaser. I think Richard mentioned that our utilization has been very good.
If you look at the pie charts of usage of both the CPU nodes and the GPU nodes this year, you can see that the offices Richard talked about, in terms of where our user base comes from, are very widely distributed across the usage of the machine, both on the CPU side and on the GPU side. And here are some teasers for the kinds of science that we've supported on the machine, ranging from standard simulations to newer models in data and learning, as well as the cross-facility workflows that we're supporting in the superfacility effort, from DESI to LCLS and others.
For the future of Perlmutter, we're really looking forward to getting it into people's hands at its full capability, but we're not standing still. We're going to make a number of operational improvements, improvements to how you're able to access the system, as well as to the kinds of things you'll see when you do get on the system. We in part have continuous operations now, although we will continue to have maintenances and updates to the system, and we're hoping to do most of those non-disruptively for our users.
Not right now, but in the future, very soon, we're hoping to be able to give people access to dedicated containers when they log in. These will initially give you standard images, but we're also holding out the possibility of giving users the ability to customize their images in a few different ways; not a completely wild-west approach, but we hope to give users the ability to customize their images.
And then, as I mentioned, we have this really robust and resilient management infrastructure, where we'll give users the ability to run long-running services that can be managed using the Kubernetes framework, which allows for resilience and other services. From a user-access perspective, once you do get on the system, we're hoping to open up a much richer way of interacting with the services on the system, including things like RESTful interfaces to our workload manager and the ability to run against GitLab runners, which will help the CI/CD folks, as well as other data-movement operations.
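As a purely hypothetical sketch of what a RESTful interaction with a workload manager could look like, one might submit a batch script over HTTPS as below. The endpoint URL, token handling, and payload fields are invented for illustration and are not the actual API:

```python
# Hypothetical illustration of a REST-style job submission.
# The URL, authentication scheme, and payload fields are made up for this
# sketch; consult the facility's documentation for the real interface.
import requests

API_URL = "https://api.example.org/compute/jobs"   # placeholder endpoint
TOKEN = "..."                                       # placeholder credential

job_script = """#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:10:00
srun hostname
"""

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"script": job_script},
    timeout=30,
)
resp.raise_for_status()
print("Submitted:", resp.json())
```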
But let me just stop there.