From YouTube: Superfacility API
Description
Part of Data Day 2022, October 26-27, 2022.
Please see https://www.nersc.gov/users/training/data-day/data-day-2022/ for the training agenda and presentation slides.
As you might know, NERSC supports a large number of users and projects from DOE Office of Science instrument and observational facilities. An analysis a while ago found that a lot of these projects actually use NERSC to analyze external data, to create tools for data analysis, or to combine experimental data with simulation and modeling, and that makes Data Day all the more important to have.
We also found that these users who rely on NERSC for data analysis often come from external facilities, many of them user facilities of the DOE Office of Science programs. From this research, and from all the knowledge that LBL has built up in working with experimental facilities,
working across NERSC, the Computational Research Division, and ESnet, we learned that you can do better science and get better results if you start working together. The superfacility is an ecosystem of connected facilities, software, and expertise to enable new modes of discovery, and the Superfacility API is part of that.
The API was one part of the Superfacility Project, and the idea is to create an API at NERSC to embed HPC into cross-facility workflows. Initially it was tailored towards workflow developers at our facility partners, but it is now also a general-purpose API for all NERSC users and projects. I'm just going to run you through a model use case to motivate the creation of the API.
In many cases, experiments at external facilities use high-frame-rate 2D detectors for their science. These produce a whole lot of data that then needs to be processed, and hosting the data and computing on site has become increasingly demanding. So the idea is: why don't you use an ASCR HPC resource for this? To make this work, the facility needs to be able to treat the HPC center like a partner: they need the system status in machine-readable form,
and they need to know that if they switch from one HPC facility to another, they can use the same tools. There are other requirements too, like real-time turnaround; maybe the people from external facilities here in the audience can speak to that. But the usual steps to run such a workflow are often the same. First, you need to check if your HPC facility is up, so you check the status,
and check if you have enough hours to run the job. You need to move the data over to NERSC, start your actual analysis job, and then monitor it: is it working, is it done? You may need to tail the logs or something like that. Then you also want to gather some feedback, so you download some metadata or some initial results, and finally you probably want to move the data to the archive, or move it back to the site where it came from.
This use case motivates a lot of the endpoints of the API. So why did we actually go and make yet another API? Because APIs meet a critical need in the superfacility ecosystem. Automation is no longer optional: it allows for unattended operation, and it minimizes the human in the loop. Say you track your number of jobs, or you have an experiment set up
where many different users come through, and you want to run it through a machine user account and build a proper workflow where you integrate HPC into your pipeline. An API makes the HPC resource machine readable, and that allows for easier integration. It gives you this feeling of: okay, here's my great user interface, and I'm running out of resources on my local servers,
so now I want to switch over to the remote resource, and it should be as easy as selecting an HPC resource from a menu or just clicking a button, and then suddenly the data will be delivered to NERSC and the job will run at NERSC. That's something an API is really good for. The other reason why we made a new one is that there's so much good, standardized tooling out there
that there is really no reason not to build an API. Where we want to be in the end is this ecosystem of connected facilities: on the right are the experimental and observational data facilities, and on the left you see HPC. We want to get into the business of having these closely coupled workflows, where facilities run their workflows at the local site and can embed HPC as they see fit.
So what can you do today? You have a number of endpoints available. For example, status: you can query the status of NERSC systems and component health, and we will see that in the demo too. You can check your accounts. You can of course submit compute jobs and check if your job is done or not, the queue status. And you can transfer files. A bunch of these things actually run asynchronously,
so you have to check the tasks endpoint: every endpoint that runs asynchronously gives you back a task ID, and then you check on the tasks endpoint how that specific action is doing. We also have smaller utilities, for upload and download but also for executing arbitrary commands, and reservations tied to your account. The idea is that all NERSC interactions are callable, and the back-end tools behind these endpoints encapsulate the knowledge of complex operations
in detail. Say you want to check the status: you could go to the NERSC MOTD, or you could SSH in and ping specific services, but with the API you can just query the status endpoint and get the information you need, or submit a job instead of SSH-ing in and running sbatch.
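As a concrete illustration of that last point, here is a minimal sketch of querying the status endpoint with Python. The base URL matches the Swagger page shown later in the talk; treat the exact path and the response fields as assumptions, since they may differ between API versions.

```python
import requests

# Base URL as shown on the Swagger page later in the talk.
BASE = "https://api.nersc.gov/api/v1.2"

# /status is public: no token required.
resp = requests.get(f"{BASE}/status")
resp.raise_for_status()

# Assumed response shape: one record per NERSC system/service.
for system in resp.json():
    print(system["name"], system["status"])
```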
Okay, let's check how this would turn out for the model use case. First, checking the availability of the NERSC resources for the experiment: in this case you would go to the status endpoint. Moving the data over would go through the storage and transfer endpoints. Then you start the analysis jobs, of course, with the compute endpoint, and check in on the compute jobs via the tasks endpoint. Then you gather feedback, downloading small files, which you can do with the utilities endpoints or the storage endpoint again, and the same for moving data to the archive.
Okay, that was just the model case that motivated all the different endpoints. If you've been to the superfacility talk a couple of weeks ago (I think it was last week, anyway), you've seen that a bunch of our science engagements actually use the Superfacility API today. I'm going to show one science example, and that's the National Center for Electron Microscopy, which is also on the LBL campus.
If you look on the left, they have a bunch of electron microscopes, and there's one specifically, the 4D STEM detector, that produces a lot of data: it's a very high-frame-rate, very high-bandwidth detector built around FPGA modules. It pushes the data at about four times 120 gigabits per second onto receiver nodes and then onto an intermediate flash storage buffer. They built an app called Distiller, and that app orchestrates, over the API, a pull into the community file system or the Cori scratch, and from there it initiates a process called electron counting.
So essentially they have all this raw data and they reduce it to just the electron counts on the detector. It's a very drastic reduction, and this has also been done with the API; afterwards the much smaller results are saved to the community file system.
So with the help of the API and the NERSC ecosystem they were able to cut down the processing time, and this is now very regular: it's a production app, and it runs all the time for them. Just as an overview, this is essentially a catalog app,
so all your data sets are in there, and you can click on them and check them out, and I think I have a screencast too.
I've been showing this quite often. The screencast was made by Chris Harris from Kitware in collaboration with Peter Ercius. Essentially they're looking at one data set, which we'll see in a minute, and then you press this Count button, and that button triggers everything this app does to orchestrate the workflow over the API.
So it pulls the data over with bbcp, runs the electron counting code, and after a while, when it's done, you see the state get this check mark, and then all the work is done. I like this a lot because it shows what you can do with an API: the users are completely unaware that it's actually running at NERSC, because it's hitting the systems behind the scenes, and that's a feature I really like. All right.
I'm now going to go into how to use the API, talk a bit about its components, and show demos too. Superfacility API basics: if you go to the URL above, api.nersc.gov, with the path /api/v1.2, you actually land on this Swagger page, and the Swagger page shows you all the endpoints that the API currently offers, and you can actually go and try them out. In this particular form we can only try out things
that don't require authorization. So you can press Execute if you want to; it takes a bit of a while, but here you see the response. This checks whether the different services at NERSC are active or not, and you can see, for example, that Perlmutter here is currently in a scheduled maintenance, so it's unavailable, while Cori is active.
Maybe it's a bit small, but it's going to be in the demo anyway. So I invite you to try this out; it doesn't require any authorization whatsoever. Okay. The API is a REST API, with standards-based input and output, and authentication is through OAuth client-credential tokens, so standard bearer tokens, and we have a lot of documentation for this.
If you go to this webpage in the NERSC documentation, docs.nersc.gov, under SF API, you can see all the information we have about the API. And what's kind of cool: I just checked the latest logs in Kibana, and since May 1st we have 4 million logged requests. That is essentially one request every three seconds, and these are only the uncached ones, so I think the actual load on the API is even higher. But yeah, every three seconds it has to serve a new request.
Okay, so if you want to use the API: I showed you that you can just move over to the Swagger page, but that will only give you access to the public endpoints. If you want to use the authenticated endpoints, you need to get a Superfacility API client; you see it on the right, and I will show you in a minute how to get one in Iris when I go to the demo. I think that's all for this slide, so I'm just going to do this.
Let me launch this. Okay. In Iris, on the profile page at the very bottom, you see Superfacility API clients, so you go here and you make a client. If you have never used the API, you probably only get the read-only client, but there is another client that also allows you to execute code. That's important because all the POST requests require the read-write-execute client, and that includes submitting a compute job. Requesting it
will trigger a security review, but I already managed to do that, so I can have it. Let's give it a name, let's call it cori-demo2, and then we select a source IP range; we're going to call the API from inside the Cori nodes, and there's already a preset that you can get for that. Back in the screencast: let's go back and copy the key.
So then, what I have here is a Jupyter notebook that demonstrates the use of the API. You don't really need to use a Jupyter notebook; you can call the API from any place you want, but it's just convenient to show it in a notebook. We have some convenience wrappers that handle the communication; I'm just going to skip over this, it's not really important. All right, so the first thing we do in this particular part is exchange the client credentials for a token.
So now we've got an access token back. It's a bearer token, and we can use this token now to submit requests to the API.
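As a reference point, here is a minimal sketch of that client-credentials exchange, along the lines of what the NERSC documentation describes; the token URL, the authlib usage, and the key handling are assumptions here, so treat the SF API docs as authoritative.

```python
from authlib.integrations.requests_client import OAuth2Session
from authlib.oauth2.rfc7523 import PrivateKeyJWT

# Client ID and private key from the client created in Iris
# (hypothetical placeholder values).
client_id = "cori-demo2-client-id"
private_key = open("priv_key.pem").read()

TOKEN_URL = "https://oidc.nersc.gov/c2id/token"  # assumed token endpoint

# OAuth2 client-credentials grant, authenticating with a private-key JWT.
session = OAuth2Session(
    client_id,
    private_key,
    PrivateKeyJWT(TOKEN_URL),
    grant_type="client_credentials",
    token_endpoint=TOKEN_URL,
)
token = session.fetch_token()

# All authenticated requests carry the bearer token from here on.
access_token = token["access_token"]
headers = {"Authorization": f"Bearer {access_token}"}
```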
But first we start with the use case. We're going to follow this model: imagine we're an external facility, and we're trying to move data over and run our job. So the first thing we do is check if the system is up. We check for Cori, and it's up, and I have a bunch of other calls here that check other systems:
Cori, Perlmutter, the data transfer nodes, the community file system, Globus. All these services might be important for you if you want to run your workflow. So we execute this, and we see Cori is active, Perlmutter is unavailable, as it was just a couple of minutes ago, the data transfer nodes are up, the community file system works, and Globus works, at least according to our MOTD information.
Okay, for this demo we're going to select Cori, but if you want to be forward-looking, you can also check planned outages; status outages planned is the endpoint you need to check. And if you go for this, then we see that there's currently only one Cori maintenance planned, on the 16th of November.
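A sketch of those two checks, a single system's current status plus its planned outages; the exact endpoint paths follow the Swagger page naming and should be treated as assumptions.

```python
import requests

BASE = "https://api.nersc.gov/api/v1.2"

# Current status of a single system (public, no token needed).
cori = requests.get(f"{BASE}/status/cori").json()
print(cori)

# Planned outages for the same system, for forward-looking scheduling.
outages = requests.get(f"{BASE}/status/outages/planned/cori").json()
for outage in outages:
    print(outage)
```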
Okay. So now that we know that Cori will be available, and most of the services will be available too, we can start to run a transfer. Unfortunately, I couldn't activate my endpoint, so I'm just going to jump ahead; let's assume this one worked, apologies for that. In the video that will come with these slides you can see the transfer working. Okay, so now let's create a job.
This is just a simple sbatch script that we're creating here, and then we write it out with a cat command.
We use the command endpoint, utilities command, for this, and you need to be authenticated to run it. So we press this, and since this is a POST request, it will run asynchronously. So what we're doing is checking the task every now and then; let's see what you see here with polling. I polled about seven times to see if the task was done, and you see that at some point it changed to completed.
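Here is a rough sketch of that pattern: running a command through the utilities endpoint and polling the returned task until it completes. The form field and response fields (`executable`, `task_id`, `status`) are assumptions based on how the demo reads.

```python
import time
import requests

BASE = "https://api.nersc.gov/api/v1.2"
headers = {"Authorization": f"Bearer {access_token}"}  # token from the exchange above

# POST an arbitrary command; in the demo this is the cat command that
# writes the sbatch script. The call is asynchronous and returns a task ID.
r = requests.post(
    f"{BASE}/utilities/command/cori",
    headers=headers,
    data={"executable": "cat /proc/loadavg"},  # hypothetical stand-in command
)
task_id = r.json()["task_id"]  # assumed response field

# Poll the tasks endpoint every few seconds until the command completes.
while True:
    task = requests.get(f"{BASE}/tasks/{task_id}", headers=headers).json()
    if task.get("status") == "completed":
        break
    time.sleep(2)
```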
So at some point, when this file has been written, the task changes to completed here, and you can see on the left side that the file has been written. That's the beauty of using Jupyter here: your file browser gives you this feedback right away. But you don't actually have to use that feature; you can also just use the ls utility, and we have an endpoint for this in the API too.
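That check might look like the following; the path-in-URL form of the ls utility is an assumption from the Swagger page.

```python
import requests

BASE = "https://api.nersc.gov/api/v1.2"
headers = {"Authorization": f"Bearer {access_token}"}  # from the token exchange

# List a directory on Cori; this utility is synchronous, so the result
# comes back directly instead of through a task.
r = requests.get(
    f"{BASE}/utilities/ls/cori/global/homes/u/user",  # hypothetical path
    headers=headers,
)
print(r.json())
```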
So you can run this; this one is actually not asynchronous, it gives you a result right away, and you see that the file has been written. All right, now that we have our job file done, we can submit it to the queue, and if you paid attention before, I was using the real-time queue, so hopefully this will be done fast.
All right, so it's been submitted. It got a job ID back, and we saved this job ID from the response here, and we use it to query how the job is doing. So we look at this one, and if you pay attention to the left, you see that the Slurm output file has popped up, so the job's actually running now. But you can also do it by checking on the compute jobs endpoint with the Cori job ID.
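A sketch of that submit-and-poll flow; the `job`/`isPath` form fields and the response layout are assumptions modeled on the Swagger page.

```python
import requests

BASE = "https://api.nersc.gov/api/v1.2"
headers = {"Authorization": f"Bearer {access_token}"}

# Submit the sbatch script by path; 'isPath' marks the 'job' field as a
# file on the target system rather than inline script text (assumed fields).
r = requests.post(
    f"{BASE}/compute/jobs/cori",
    headers=headers,
    data={"job": "/global/homes/u/user/job.sh", "isPath": True},
)
# Submission is asynchronous as well: the response is a task whose result,
# once completed, contains the Slurm job ID.
print(r.json())

jobid = "12345678"  # hypothetical job ID taken from the completed task

# Query the job's state in the queue.
status = requests.get(f"{BASE}/compute/jobs/cori/{jobid}", headers=headers)
print(status.json())
```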
What you see here now is that the job has started. This is actually a tomography code that's running, and you see it going iteration by iteration. So now there's a job running in the background; it will take a minute, but I think I'm just going to press fast-forward.
So what you do in the end: you can read the file and extract whether the result has been written, and then we just go into standard Jupyter here, go and find the file, and visualize the results. All right, so you can do all of this over the API, but at this point I was getting lazy, so I'm just going to use Jupyter for this part, just to reaffirm that everything's working. All right.
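If you did want to stay on the API for that last step, the download utility is the natural fit for small files; this sketch assumes a download path similar to what the Swagger page lists.

```python
import requests

BASE = "https://api.nersc.gov/api/v1.2"
headers = {"Authorization": f"Bearer {access_token}"}

# Fetch a small result file for quick feedback; bulk data should go
# through the storage/transfer endpoints instead.
r = requests.get(
    f"{BASE}/utilities/download/cori/global/homes/u/user/result.png",  # hypothetical path
    headers=headers,
)
print(r.json())  # assumed: a JSON envelope carrying the (possibly base64) file body
```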
Right, I'll show this at the end. Okay, so we're coming now to the end of the API talk, and I'm just going to tell you a bit about what's coming up next. The next thing we want to do is redo clients and tokens to have more granular scopes. Currently you get read-write-execute or read-only; we're going to make it more based on what the individual endpoints actually do, and then you get kind of a traffic-light scale on which you can slide for your endpoint, and then you can select
an endpoint profile that matches how long you want to run it for and how many IP address ranges you need, and then you can see what kind of endpoints it gets. And vice versa: you can say, I want to use this particular function of the API, and the new client interface will tell you what the IP limitations and lifetime limitations are. The upside is we're going to have many more source IP ranges per client.
That was a popular user request. Also, just like today you can get a 24-hour SSH proxy credential to run your workflows, in the future you can have clients that are valid for a short lifetime but don't have to go through the review process, which otherwise takes quite a while. So I'm quite excited about this, and it should be right around the corner.
The next thing we want to do is use the SF API to retire NEWT. This is supposed to happen on a timescale of months, and the way it works is that there will be a login-based route to get tokens from, for browser-based apps and other web apps.
So NEWT would be retired, and the SF API will become the new backend for everything API at NERSC. Another exciting thing we're working on is a common API interface, because the Superfacility API is only good enough if it actually enables you to run
a superfacility-type workflow, meaning you can switch between facilities. What's important for this is that we get other facilities on board that adopt the same endpoints and methods. We are currently talking with CSCS, who developed the FirecREST API, which was kind of the blueprint for the Superfacility API; we based a lot of our design on the FirecREST API. We're also talking with the institutional computing services at LBL, the SDF at SLAC, and the Oak Ridge Leadership Computing Facility.