National Energy Research Scientific Computing Center (NERSC) Data Seminars Series, 19 Jun 2020

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 2020-06-19 - Shane Canon - Status of Containers in HPC

Description

NERSC Data Seminars Series: https://github.com/NERSC/data-seminars

Title: Status of Containers in HPC
Abstract: Containers have quickly gained traction in HPC and Data Intensive computing. Containers provides users with greater flexibility, enables sharing and reproducibly, can make workflows more portable and can even improve performance. In this talk we will review some of these benefits, the status of containers at NERSC, and trends for containers in HPC. We will also discuss some of the use cases and success stories for containers at NERSC.

A

All right, so I'm just going to give an update on sort of the status of containers in hpc, so I'll talk a little bit about um a little bit of a background on containers and why they're interesting talk a little bit about it says shifter here, but it's really about containers at nurse and some of the activities we're engaged in talk about a little foray into containers and reproducibility, and then talk about some other related efforts. Kind of outside that, I think, are interesting and then talk about kind of future trends, future directions.

A

Okay, so just uh to get everybody on the same page, I think now containers have become pretty commonplace and people are pretty familiar with it. So at some point I should not have to give this just in case there's some that haven't, um you know, had a chance to try this stuff out. I always like to just start with this.

A

So can you know what the container ecosystem, especially really popularized by docker, has done, is provided a simple way to kind of build ship and run applications or services, and this originally really started out in kind of the web space where they needed to deploy applications, maybe at scale and needed a more streamlined way of doing that, and so the current innovation that docker provided is they took a lot of the capabilities that were already in, say the linux kernel and built a tool chain around that to make this process really uh seamless and productive, and so you can use a recipe that I'll show an example of later to build images.

A

Then you can push those to a registry like docker, hub or private uh registry and then pull those down and run those on different uh execution resources.

A

So uh you know these a container really is just uh it uses a combination of linux, kernel capabilities, like c groups and name spaces, to create these isolated environments, there's actually kind of a long history of containers, but again it was docker.

A

I think that really provided the tool chain that made it work well before that things were a little bit cumbersome and complicated, and then, since that time, you've seen sort of a whole ecosystem explode around containers, especially around orchestration, so we'll say a little bit about kubernetes later on, and then we've seen that come into the hpc space enabled by a number of hpc container runtimes, so shifter, which I'll describe a little bit later, was, I think, sort of the first kind of hpc centric runtime, followed very closely by singularity both developed at lbl, which is kind of interesting, and there have been others that have appeared since then.

A

So there's charlie cloud of flannel soros out of cscs, and uh you know there's a few others that I I don't list here so what's in a container, it's basically kind of uh you could think of it as almost like a snapshot of a host system. So it's got all the file system tree. The the linux operating system it can include include any linux. Any libraries binaries tools that you may need to support your application, of course, the user code itself.

A

It can even include data, but generally there are some kind of best practices around how big you want images to be so a lot of times, maybe that'll be separated out. It can also include things like runtime settings, so that could be things like environment variables, working directory, how to execute the applications.

A

So it's really trying to encapsulate all of the kind of relevant pieces that you need to to run an application correctly. It can also include other things that aren't as important to hpc use cases, but that can include things like network related things like what ports to expose or what user to run the application has.

A

This is uh the the way you would typically build an image is through a docker file. You can think of it as just like a simple recipe of how to construct the environment that and install your application to to execute that environment. So, just very briefly the kind of components of a docker file. These are the there are a few other directives, but these are kind of the common ones that you need.

A

So it's you start with a from line that basically says what is the base that you're going to kind of build off of and then there's uh you can add labels or kind of metadata to that image as well, so a common one that you do is you put the maintainer in this? Is more of a convention really and then there's these two directives run and add or copy.

A

Both of these add and copy work similarly run basically says inside this environment that I'm building up do this operation, so you're, basically creating kind of modifications, or you know, changes to that image to arrive at the final product. So you can run things in there to do things like install packages, for example, and then add or copy, is used to bring things from outside the container environment into the container environment.

A

So if you have a source code, tree checked out on your your local system and it had a docker file in the base of it. This would be sort of the typical way you would take that source code and put it into the container environment so that you could say, build it and execute it, and really it's this run and add copy you. Usually those are the key kind of verbs that you use in creating these images and then to build it.

A

You just use a command like below, so you do docker build you, give it some name, including maybe something about where the repository that you're going to push it to in a registry, for example, and then you would use the push command to actually push those contents up. It's I think it was inspired by git. So you see a lot of kind of get kind of concepts come into play in the the command line structure of docker, for example.

A

So why are containers an interesting idea for nurse? So I think everybody here is obviously familiar with nurse. I don't need to really give much background, but the key thing here is like because of the number of users and the breadth of projects that we have to support and the growing amount of experimental and sort of observational use cases and machine learning use cases. You know this creates a burden for us to support all of those variety of applications, and it's you know we can only.

A

We only have so much bandwidth to be able to install and manage applications for the users, so containers really provide a way to allow them to have a little more flexibility in what they want to to run at nurse and can really make them more productive.

A

So the kind of struggles that we hear of you know I'm having trouble building my software on the system. I have a bunch of dependencies that I need to get and I'm having trouble getting those installed, um maybe there's versions that they need, but we only support certain ones.

A

You know all of these kind of questions are start to lead to containers. This is a potential solution to to the issues and another one is around this idea of being able to kind of re-execute the same the same application or have the same environment available over a really long period of time.

A

So in a lot of domains they may need to you know kind of maintain that execution environment for for years, for kind of reproducing reproducibility reasons and containers can play a role in that I'll talk a little bit more about that later um and so for containers and science. Why, I think they're interesting is one: is this productivity aspect, so the ability you know I can when I create the image, I can choose what base os.

A

I want, if there's a an os that already has uh the the packages available for my application, and that makes it a lot easier for me to kind of get up and running.

A

The other thing is once I've built an image or if somebody else from the project has built an image, they can just share that with me, and I can just reuse the same thing. So I don't have to go into my environment. Try to build up the same things I can just uh you know we can share those across a collaboration. For example, another is reproducibility.

A

uh You know, because everything's kind of packaged up in that environment, it's kind of encapsulated inside that image. It means that uh you know, even if aspects of the system change, that image is going to stay the same, and so that can make it easier to reproduce those results later in time and then also portability is. I can take an image and as long as it's kind of architecturally compatible, I can potentially run that image across.

A

You know a lot of different resources, so I can take the same image and run time and run that you know at neuroscore. I could run it at. You know tack on stampede, for example,.

A

You know so why users would like containers why it's helpful to them is for many of the reasons I just alluded to, so it gives them more flexibility and power, they're, not dependent on, say a nurse staff member to go and install something they can address it themselves and they can pick the os, perhaps that their collaboration is already kind of baselined on and has created packages for, for example, and in many cases I'll talk about some examples where it can actually improve application performance in sort of a surprising way and, like I said things like reproducibility and sharing so now, just briefly talk a little bit about containers at nursk, and so specifically, I'm going to sort of talk about shifter and stem um so first off.

A

Why not just run docker, we've sort of shown? You know in the earlier slides how easy it is to build these things and then and then run run these images.

A

The primary reason that we have not allowed docker to date has been around security, so in docker it's kind of a all or nothing security model. So once you have the rights to to run docker, then it's trivial for you to kind of escalate, use that to get more uh privileges on the host system, so for a shared environment like nurse that would be a non-starter we can't allow.

A

You know, I'd show an example here where they can start to manipulate things in the host system, but this also means that they could get access to others, users data they could you know they could delete things. uh You know. Obviously this would be very dangerous. There's been some advancements in this space, but this is still kind of mostly where docker is these days another.

A

There is a system architecture doctor's, really designed around local disk, and if you on our systems, we don't have that, and so you need some ways to deal with that really want it to integrate and play well with our batch system. Since we that's really the key piece, that's managing access to the systems and trying to make sure that how they use the resources is the correct way and then there's some other things too. Around complexity and system requirements.

A

I would say the system requirements are less of an issue these days, but when we first started developing shifter, it was a serious constraint, so this led us to develop shifter. This was back in 2015. we'd actually done some exploration using a kind of a thin wrapper around docker directly on carver, but we so we saw the value of it, but we felt like uh we.

A

We saw some weaknesses in it too, and we wanted to be able to run those on our our really large hpc systems, our crate systems, so that would have been like uh hopper at the time. I think- and so we um this led us to uh start developing shifters.

A

So this is primarily doug and I that that started this and our goal here was we wanted to leverage as much of the docker ecosystem as possible, so especially the parts around building images and distributing images, but just customize the runtime to be more hpc friendly, and so we wanted to make sure it addressed these security issues, but it also was scalable and could get native application performance. So we wanted to make sure that uh you know it was done in such a way that you it would be amenable to hpc type applications.

A

So the you know, what does it look like to use shifter with an in a job? It's pretty straightforward. You can just specify what image you want to use as part of the your batch script, for example, and then, when you run the application, you still would use s run, but really all it is. Is you just do shifter in your app the path to your application? You know arguments.

B

Everything else.

A

After that is pretty much the same as it would be on any, you know when you're running any application, uh so your the process that you do is you use the shifter img command to pull that image down? What that actually does is it it pulls it from the registry, it unpacks it, and then it repacks it in a way, that's kind of optimized. For our you know our large systems, so it creates a single uh image uh out on the scratch file system and that gets mounted up at runtime.

A

So this is why we're able to get really fast, scalable launch with uh with shifter, and then you just submit it. You can also specify the image, as kind of either in the submission or kind of on the command line as well and there's the advantage of the first one is it's.

A

uh It does all that image prep before the job uh before it hands control over to the job script, so everything's already in place, so that makes launch a little more uh scalable and then we've we've also with shifter, uh figured out how to integrate and support mpi applications in a clean way.

A

So what the recipe that we um that we recommend for people developing images is that they take a stock version of of impitch of you know a fairly recent version and then um they build that just like they as normal inside that uh inside that image. And then when we run that that image uh in a container with shifter on, say quarry. We automatically at run time without them doing anything. We bring in the.

B

A

Libraries that are optimized for the cray network so like the aries, uh inpitch drivers, and so they automatically transparently get you know, sort of native performance for that application.

A

uh So this is kind of the model that we've followed from the beginning and we've seen applications we've seen some little pickups that we've had to address, but we have applications that we or images that we've created back in 2015.

A

I think or 2016 something like that, that we are still able to run those today. um So I think that shows that you know demonstrating at least to some of that kind of longevity that we want to provide.

A

B

um I was wondering if the exposure of the mpi libraries and the way that you did, that is something that.

B

Have you did you have to do something special for mpi, or is it done in.

A

B

Allows you to sort of access whatever.

B

Libraries are present on the um on craig's cray linux.

A

What it does is um we basically kind of package those libraries in a specific space on the system, and then shifter has a concept that its own kind of module concept in it and for a given module. It basically will mount those libraries into the container environment that it creates, and then it manipulates the ld library path, so those get picked up instead of the ones that were in the container itself or in the image itself, and we can do that for other apple.

A

You know other frameworks or tools as well, so this is similar to what we do for the gpu support as well, thanks yeah.

A

um So when we first started developing shifter, really the focus was on kind of improving productivity, making it easier for people to bring in applications to the systems than it would be otherwise, but because of the way we kind of did this unpacking and repacking and mounting on the compute nodes. We also discovered that it had a pleasant side effect that it improved some kind of perverse uh use case cases that that we had encountered on the system so one of those being python.

A

So when you look at what python does when you launch it, um it first has to walk and sort of create a namespace of all the libraries and modules that it has access to, and so that's a very metadata intensive operation because it has to walk through all the different packages to build that up, and every python process has to do that. So, if you're running tens of thousands of python processes, all of them are doing this.

A

It could require looking up thousands of files to you know, get that information and- um and that's all going back to say the luster metadata server, for example, because we take that image and we pack it up it's just. It can all be kind of uh handled locally on that node. So it doesn't. It just can look inside its cache uh on that node to do all this metadata lookups, and so that's when we see you know sort of the best, that's typically a shifter or something you know these kind of packed image models.

A

We this approach has been so successful that you know craze even kind of integrated aspects of it into their own system. Image management approach and I should point out that the way shifter does it is also similar that singularity uses the same model uh as well.

A

I just to show that containers aren't just for the small scale stuff um one example we have. This is from a few years back, but it's still one of the more interesting ones that really illustrates containers can help at scale. This was a cnb cosmic microwave background simulation run that was done by ted kisner and they had a milestone around doing these simulations that they needed to meet and at first they were just trying to run this. This is a python application, but then it called a lot of other optimized libraries underneath the hood.

A

So it was. You know it was tuned for hpc, but they were having this issue with exactly what the I was talking about with the python startup.

A

So by switching all that to you shifter, they were able to get their launch times down to you know a minimal overhead and it was really the key between them being able to achieve their milestones and not, and they had this added benefit that now they have an image that you know is a little more portable and they can go back to over time.

A

uh This this analysis was actually done by others, but it's a good illustration of sort of the different um most of the container usage at nurse to date has been dominated by the experimental uh and community data analysis community, especially high energy physics.

A

I will say that, and we've seen a growth from you know a percent back in 2014 to sort of the six to eight percent. I was looking at this just recently to see what the numbers are at they're sort of in that range. At present, I will say that this only captures jobs that will specify the image as part of their batch submission. So if they're using the command line to pick the image, then we we don't currently have a way to capture that.

A

So that's one of my to-do items is to figure out how to glean additional information, so this six to eight percent is kind of a lower bound. um It could potentially be higher. It's almost certainly higher. Just don't know how much and today we've seen something in the neighborhood of 7 000, unique images and over 900 unique users that have you know at least run shifter.

A

At least once uh you may have also heard about spin, so just as just to explain the difference between shifter and spin, so shifter is about running hpc jobs, so things that kind of have they need to run for a period of time and finish and spin is really about running what we call edge services, but maybe think of it as services that either people are going to access from outside of nurse things like science gateways or portals, as well as services that maybe hpc jobs are going to interact with.

A

So that could potentially be things like databases or it could be workflow engine services or api services that um the hpc job may kind of interact with. uh So you know shifter for h, you know compute intensive time, limited sort of applications and spend for things that need to be more kind of persistent.

A

uh You know run all the time kind of thing, and uh you know they they can do a whole talk or corey could do a whole talk and has done whole talks just on on spin, but uh that we've seen a a really good uptake in in the use of spin. I think there's something like 100 different services stacks that are running and spend today, and they span a variety of different domains from things like the materials project that are doing these pre-calculations and serving up.

A

You know, material compounds that are relevant to things like batteries to jji has lots of services running in it. There's plenty from the astronomy community as well. So anyway, it's been very successful and we're in the process of moving that from a rancher one system to a rancher two system which starts to bring in kubernetes, which I'll talk a little bit more about later.

A

So just uh you know, sort of a snapshot of some of the efforts that were around containers going on right now at nurse one is uh we're involved in the acp super containers project I'll talk a little bit about that in a few slides. We're looking at ways to have shifter require less privilege to be able to execute we're looking at alternatives to shifter. That would do the same thing we are still doing. You know small incremental improvements to shifters, so, for example, we just did some some fixes to make it better support.

A

Some of the newer image standards that we've run into- and I kind of alluded to this- we're looking at some potential. You know alternatives uh for the future.

A

One of those is podman and I'll say a little bit more about that later, but this is maybe one way that we could allow people to build images sort of on-prem at nurse, which has been one request that we've seen and I talked about the move with spin to go into ranch or two, so we're in the process of onboarding new users and migrating users to to rancher two I've got a few slides on reproducibility and the role of containers there.

A

These are, I gave a a talk at a reproducibility workshop at ecmwf, and so I thought it'd be interesting to share some of those points from that so kind of to start with. Let's just talk about ways that we see reproducibility fail, and here.

B

A

Really talking about computational reproducibility, um so one form is just me trying to do my own work.

A

Maybe I've done something in the past and I want to reproduce those results and the things that can get my way or maybe something on the system changed and that broke my application. I may go to try to recompile it or build it, and they. You know the codes, something stops working, maybe the compiler changed, and now that build doesn't succeed.

A

I might have trouble finding prerequisites that you know available when I previously did it. You know, or the system doesn't even exist and then, if you start to do that for somebody else's code, maybe I'm trying to reproduce somebody else's work. You know it gets even more complicated because you know I might not be able to figure out how to gather the software that they did. Maybe again it may not be available versions might not have been very well captured or documented and which can also cause problems. I mean I have access to the data.

A

I mean I have access to the right kind of systems for it. So containers uh clearly can't address all of that, but I want to talk about some of them that they do so.

A

These are kind of this is not necessarily comprehensive, but some of the different you know variables that can impact reproducibility, so it can go from the very bottom kind of base level of the hardware on up to you know closer to how you're running the application and application characteristics, and um these are the ones that in green and yellow that containers can start to help address.

A

So they can capture things like the operating system. Libraries, compilers tools, the app itself. Obviously things about the environment that it needs to execute in, like the environment, variables and depending it can also help with some of the runtime characteristics or data and inputs to clearly there's other things that it can't help address so just to kind of a visual of that same thing, but maybe present a slightly different uh way.

A

You know these are sort of the layers of the stack and where the container sort of fits into things- and you know some of the variables that I alluded to before that can start to impact things. The container you know again has the linux distribution, libraries tools, all of those things can be encapsulated in the image, and so that's how it can sort of help to address some of the reproducibility issues, and that still means that you need to follow good practices and how you perhaps create your your images.

A

So this one is trying to show some of the best practices that you would want to kind of keep in mind.

A

One is making sure that you know you're, starting with base images that have well-defined kind of versions and hopefully, are not changing so just like pulling from master on a github release. There's tagged versions, and you want to use those tag versions where possible, when you're installing packages, you want to specify what versions to use as much as possible and the other things is. You can set up environment variables and other characteristics to make sure that the behavior somewhat when someone else wants it is more likely to repeat your own experience.

A

uh You know, but just to point out, you know that these imagery, for is only as good as the weakest link. So the minute you start saying, pull something from uh get repo. That's not tagged or well managed you. You have the potential that if somebody tries to re build that image later they may fall, they may hit some issues.

A

One thing you can do is treat the image itself as an artifact and maybe and tag that image and kind of don't touch it afterwards, and there are even some registries that can provide some tooling or some controls to prevent you from doing that, so that you can make sure that that's a kind of a durable entity that you can reference later on. People are even starting to do things like put dois on these images, so that they're, really you know, there's this digital identifier, just like you would do for a publication, for example.

A

So that was very quick kind of.

A

Advantages of containers, I want to talk about some other related efforts out there, so this is a bit of a hodgepodge, so it's a potpourri for a thousand. I guess um so. First up I mentioned the super containers project earlier.

A

This is really focused on making sure that containers are a kind of a first class citizen of the excase systems and making sure that we're kind of looking at best practices and other things you need to do around the applications to make them work well in containerized deployments.

A

So parts of those have been things like documenting best practices. We've been offering training through things like sc, tutorials, ecp summit tutorials, and we've done some at iise as well. Also we're trying to build up a set of base kind of reference images that show kind of some of these best practices and demonstrate uh some level of portability.

A

uh A related effort from that is e4s, which is the kind of the sdk for the ecp software pieces that are being developed. um So there's a lot of other. This is it's a very it's a stack kind of based distribution for the ecp software.

A

There are other aspects of this. This is not just purely focused on containers, but containers are one component of that, and so they have images that they're creating that have sort of all the prerequisites and uh it's back sort of set up for the e4s software and we're starting to work up on best practices. Where you you can have a docker. I meant to include an example of this. You have docker files that basically are using.

A

This back builds from e4s to make it easier for you to create your full application environment for inside a container.

A

You know long term what we want to get to is also integrating this into ci pipelines, and so we don't. I don't think we have fully in-depth solutions just yet, and this slide came from andrew young who's at sandia. So that's the reason. It mentions air gap networks, which maybe is not as important for for a nurse, but putting that aside, the other pieces are pretty relevant.

A

You know we want people to be able to have a repo for their application that as part of a ci process, it's doing all the building and testing, but part of that is also building up images potentially for different architectures, and then those images could easily be pulled down and run on. You know the future exoskele systems or even the pre-exoscale systems uh like perlmutter.

A

uh So that was sort of some of the ecp related things. Another couple of points I want to talk about is where we're seeing containers used as part of workflows as well, and so there are some good examples of these standardized workflow descriptions that have emerged things like cwl and wdl, where that standard of how you express your workflow is, is maintained as an open kind of standard with a community behind it, and then there are tools that can implement those standards.

A

uh What's interesting is they're starting to integrate containers directly into the specifications. So as part of that description, you might say I need you know. I need to run this task, and this is the image I want to run it in, and you know the other arguments whatever that I need to use when executing that task, and this again you know alluding back to the reproducibility aspect. This makes it easier to get the same results, but also makes it more portable, which is a key kind of a goal of these of these things.

A

They can even do things where they captured that as part of the provenance of once you've executed a workflow, so they can say, like I know exactly what version of the container I used, and you can even do things where you rerun. If you try to rerun the same inputs, it can notice that I've already done that and it can kind of skip through things, for example, and then the other thing I wanted to mention is: there's a growing number of uh repositories and sort of optimized images that are starting to appear.

A

So two examples are the bio containers repository. So this has a lot of images for the bio community and then nvidia has their own registry, where they've already gone and optimized applications for gpus. You know using cuda, or this would also include things like machine learning, deep learning framework, so you'll find optimized versions, let's say tensorflow in there.

A

So I think this starts to get towards one of the value propositions that were made for images. Is you can start to have communities creating these rather than sort of individuals having to do it on their own?

A

And uh you know that the person that best knows how to do that optimized version of it can take care of it and everybody else can kind of reap the benefits of it. So I think that's clearly sort of the case with, for example, the nvidia containers all right and then um just uh to talk a little bit about sort of future directions uh for things sort of container relevant, so one is kubernetes.

A

I've mentioned kind of alluded to it previously, so you know this is not a distant thing, we're going to have at least kubernetes under the hood and our in pearl mudder. So it's part of the cray shasta software system, and I, if you look at the I was trying to be careful, make sure I didn't see anything that wasn't public, so this comes off of their web page and you can see them clearly talking about wanting to run these kind of converged, hpc and ai workflows.

A

So clearly this is um you know it's on craig's mind. I think it's on other vendors are also trying to capitalize on this and I'll talk a little bit about. Why why this is happening in a few slides, so I think initially this is really going to be. How much of this is visible to the end user. uh Early on on promoter, I think, is probably going to be minimal at first, but over time.

A

I expect we're going to be looking at ways to expose more of those capabilities, but this is something that we ourselves are going to have to to learn.

A

I just some humor on kubernetes, since I think a lot of people have probably heard it, but maybe it's not clear what what it is exactly. um I like this somebody mentioned this quote in a workshop. That was a virtual workshop. I was sending this week this one time I tried to explain kubernetes to someone, then we both didn't understand it, um but really what? If you go? Look up? You know wikipedia it's an open source, container, orchestration system for automating deployment and scaling and managing applications.

A

It is very tailored. It was developed initially by google. It is really designed around things like supporting web services, but it's pretty flexible and you're, seeing extended well beyond that.

A

um If you like, writing yemel, then you'll love kubernetes, because that's really how everything's kind of expressed this is just an example of like what a kubernetes ammo file might look like, and you see these kind of patterns repeated throughout. So there are different. You know kind of spec files that you need to generate and feed into kubernetes and then it'll use those to basically generate the state of for the services that you're describing.

A

But the idea here is everything that you need to know about how to deploy that application, for example, is captured inside these specifications. um So why is kubernetes interesting for hpc? um I think right now. It's still an emerging thing, but where I think we're heading is that you'll you'll start to see examples. Where um say somebody wants to run some application and maybe there's other services that they need to to start up as part of that larger workflow, and so that may be one place where you start to see that occur.

A

Another thing that we're already seeing is that these two, these second and third bullet points, are somewhat related. We're starting to see tools that, like they come from, maybe the cloud space so they've already designed it around kubernetes, and so that's just becoming sort of a common language that is being used. And if we want to be able to leverage that, then we need to be able to kind of integrate that into our systems and then you we are starting to see other kind of tooling workflow tools.

A

Things like that that are layered on top of kubernetes and again. If we want to be able to provide those to our users, then we need to be able to integrate those into the system so there, for example, things like argo, which I'll show a few examples of just a bit.

A

This could be a nice way for people to express their workflows, maybe even better in certain ways than some of the examples I showed before, and but there are a lot of different frameworks that we're starting to see that that use that so, just briefly, an example of argo, argo, probably is one example, something that looks more for me or something like that: we're used to with something like slurm, so it has concepts like submitting jobs and listing them and.

B

A

Them so it's somewhat familiar kind of conceptual model that we're we're used to and the specification it's again a yaml kind of model, but it's maybe a little less complicated than the ones we saw with kind of native kubernetes. So I think that you know whether it's argo or something else. I think you will see these higher level tools that layer on top of kubernetes maybe provide a you, know: language or syntax- that's really optimized for particular use cases, and it makes it easier for users to kind of get their their work done.

A

And then you know this is just showing when you submit jobs to argo it kind of. If you you know, if you use firm, this looks, uh looks somewhat familiar all right. The last thing I just wanted to finish with is talk about um you. I didn't say much of it. I've talked about the different run times. I think um you know the problem with having sort of these hpc specific runtimes.

A

Is it it kind of sets us, apart from the broader container and kind of cloud container community, and um I think ultimately, it'd be best? If that were not the case, and so what would an ideal hpc run time? Look like so one is, is you know we are worried about security, the less privileges that these run times have then the less worried we have to be about some. You know bug in the system that could be exploited, for example, so to the extent that they could have no special privileges.

A

This is this would be an improvement. We still don't want to give up on things like scalability, so we want to be able to scale to thousands of nodes and you know be sort of exoscale ready. We want them to be able to exploit interconnects and accelerators.

A

You know as much as possible in as portable way as possible and again we'd like them to be closely aligned with the broader ecosystem, because that means we're able to more quickly leverage, maybe innovations that are coming from say outside the hpc world right, and so I would argue that podman, I'm starting to see, show some signs of that. So I'm very interested in I've been sort of tracking this, I'm curious to see how it emerges.

A

We are going to explore it first as a place to just provide probably build utilities for nurse users, we're going to start with nurse staff, but probably expand it out. But what's interesting about is it's it's kind of part of the normal cloud stack, so it's not the sort of distant cousin. It's designed to be a drop-in replacement for docker, so you can actually alias podman for docker and you can do like alias docker equals podman and you could to first order.

A

You don't see any differences um and it's actually able to run with no special privileges. So that's the big win to date uh and it's got active developers, but with red hat being the primary developer of it ensues, I think is also involved so that it still is it's starting just some of the gaps, but we still need this issue of scalable launch, address and also clean ways to leverage interconnects and accelerators.

A

I think it's looking pretty encouraging and so that's something we're actually exploring actively right now, all right. So, just a few summary points dave, I would argue: they've already become a critical enabling tool for hpc, mainly for these kind of productivity and reproducibility reasons and you're starting to see. You know this growing ecosystem of tools outside of hpc, but we're starting to see those become.

A

You know pulled into our environments, and I think that you know this is only going to continue and my question. I even asked this at the workshop. It's like do. We think we'll be at a point in time in the not too distant future, where all applications that run on our hpc systems are containerized, whether the user knows it or not, and it could be that even with pearl mudder, we have some level of truth, so that would that was it I'll be happy to take any uh any questions.

A