Description
Partial recording of Breakout 1 NUGEX Special Interest Group for Experimental Science Users
A
…them, so you can go back and watch the talks themselves and listen to any of the discussion that happened afterwards, if you are interested. So that's about all I have in the way of introduction, to say what we are. I will certainly say, and I'll probably ask again at the end: if you have a project that you're doing at NERSC that fits into this area, please, by all means, let me know. I would love to get another scheduled list of talks together.
A
So we could continue this series, maybe a little bit later in the fall; I think it was very useful. With that, I am going to go ahead and jump into my talk, which happens to be the first talk in this lightning-round series.
So what we've asked is that all the people who gave talks during the regular sessions over these couple of months, the external users, come and give a summary of those talks; they gave more extended versions during the meetings.
A
But here we wanted just to give some brief summaries, and we decided that we wouldn't have the NERSC staff repeat their talks, because a lot of that information may come out in the other talks they were giving in the plenary part.
So this is a talk that I gave initially at the CHEP conference in Adelaide last November. I recycled it for the special interest group, and I'm recycling it yet again here, but I've taken out some slides to try to make it shorter.
A
The experiment I'm going to talk about, where we are using NERSC, is called GlueX. It's being run in one of the four experimental halls at the accelerator.
The facility Jefferson Lab has is primarily centered on an electron accelerator, which is buried underground. You can see the access buildings here, giving you an idea of its shape. It's really two linear accelerators, coupled together with magnets, so the beam can go around a few times.
A
Three of the experimental halls are buried underground here, in these round mounds. The fourth one, where GlueX is housed, is up on the other side over here. I won't go into a lot of detail about the experiment itself, since we just don't have time, but I will say something about the scale.
A
Over on the far left, for our high-intensity running, we expect to produce around several petabytes of data a year from this experiment, and the data will be taken over several weeks; it might be taken over the 30 weeks of the year that the accelerator may be on. We'll acquire the data, we'll store it, and then, when we do processing of it, we may have to do a couple of passes on it, and we store some processed information. That itself will add up to a few petabytes of information.
A
CPU power is required to do the processing of this data, and this was an estimate of that at one point in time. I think it's actually gone up, because as time goes on people keep trying to improve the code, and most of that improvement makes it give better answers, not necessarily run faster.
A
So the way that we do this, we actually do off-site processing away from the lab in a few different places. We have our own scientific computing farm at Jefferson Lab, and it's got on the order of 10,000 cores in it. So it's not tiny, but it's not really enough to do everything we need to.
A
We do have a Docker container that we made; it's a very thin container. We use a one-line conversion to create a Singularity container out of it, or just import it into Shifter, so we don't have to do any modifications to the container itself. It's thin in that it only has a couple of system-installed packages in it. It doesn't contain our software at all, so we've been able to use the same one for, I think, a couple of years now without having to modify it.
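For context, those one-line conversions look roughly like the sketch below; the image name is hypothetical, and the exact commands depend on the Singularity/Apptainer and Shifter versions installed at the site.

```python
# Hedged sketch: converting a thin Docker image for use at HPC sites.
# "gluex/thin-base:latest" is a hypothetical image name used for illustration.
import subprocess

docker_image = "gluex/thin-base:latest"

# One-line build of a Singularity/Apptainer image from the Docker image.
subprocess.run(
    ["singularity", "build", "gluex_thin.sif", f"docker://{docker_image}"],
    check=True,
)

# At NERSC, the same Docker image can instead be imported into Shifter.
subprocess.run(["shifterimg", "pull", docker_image], check=True)
```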
A
The way we get our software is through CVMFS, the CERN Virtual Machine File System. It's basically a file system where you can publish your files, in this case binaries, and that can then be mounted and read as if it were local; it's a remote file system, kind of like NFS, except that it's read-only where you're operating on it from, but that's fine for what we want to do. We do all of our software builds using CentOS 7.
A
Our Docker container is based on CentOS 7. We put third-party software there, like ROOT, which is a product from CERN. All of our calibration constants go into an SQLite file that is also stored on CVMFS, and other resource files, like large magnetic field maps and material maps, also go there. So they're all published out that way, and they're all considered more or less static information.
A
The calibration constants database does get updated: every night at midnight we generate a new SQLite file from our MySQL database, which is the definitive source and is hosted at JLab. We don't want all of our jobs that are running off-site to reach back to the JLab database server, so we just distribute the calibrations this way. For data transport to both NERSC and PSC we use Globus.
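A minimal sketch of what that nightly export might look like on the CVMFS publisher node; the repository name and the mysql-to-sqlite helper are assumptions for illustration, not the actual GlueX tooling.

```python
# Hedged sketch of a nightly calibration export, run from cron at midnight.
# The repository name and "mysql2sqlite_dump.sh" helper are hypothetical;
# the real conversion is done by the experiment's own calibration tooling.
import subprocess

REPO = "gluex.example.org"                 # hypothetical CVMFS repository
SQLITE_OUT = "/tmp/ccdb_latest.sqlite"

# 1. Dump the definitive MySQL database into a single SQLite file.
subprocess.run(["./mysql2sqlite_dump.sh", SQLITE_OUT], check=True)

# 2. Publish it on CVMFS so off-site jobs read it locally instead of
#    reaching back to the JLab database server.
subprocess.run(["cvmfs_server", "transaction", REPO], check=True)
subprocess.run(["cp", SQLITE_OUT, f"/cvmfs/{REPO}/calib/ccdb_latest.sqlite"], check=True)
subprocess.run(["cvmfs_server", "publish", REPO], check=True)
```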
A
Down here at the bottom is a graph from when we first finally got high throughput on ESnet from JLab to NERSC. It took a little bit of effort from our IT and networking people, working with the folks over at NERSC.
A
But it all finally finished. For processing, all our data goes to tape; we don't have enough disk space to store it all. We have to have a workflow system that pulls it off of tape, through our data transfer node to the NERSC data transfer node, onto Cori, and then brings all the resulting files back so we can store them on tape. So I never submit a job to Slurm directly.
A
I only submit to our workflow system, which then submits to Slurm once the file is there, ready to go. So it's a little complicated. I guess I can skip over this slide; it just shows that we have multi-threaded processing that scales.
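As an illustration of that pattern, a stripped-down version of such a workflow step might look like the sketch below; the file path and sbatch script name are made up, and the real system has far more bookkeeping.

```python
# Hedged sketch: submit to Slurm only after the input file has been staged
# from tape to the local file system. Paths and script names are hypothetical.
import subprocess
import time
from pathlib import Path

def submit_when_staged(input_file: Path, batch_script: str, poll_seconds: int = 60) -> str:
    """Wait for the staged file to appear, then hand it to sbatch."""
    while not input_file.exists():
        time.sleep(poll_seconds)            # file still in flight from tape/DTN
    result = subprocess.run(
        ["sbatch", batch_script, str(input_file)],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()            # e.g. "Submitted batch job 1234567"

job = submit_when_staged(Path("/scratch/staged/run042_000.evio"),   # hypothetical path
                         "recon_job.sbatch")
print(job)
```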
A
This is from last year: we made most of our jobs run through backfilling. This is just a statement, and maybe it's a little controversial, but the scheduler at NERSC is extremely poorly matched to our job shape. Only two jobs at a time accrue priority, all others must go in through backfill, and it treats large jobs the same as small ones: if I want 64 nodes for 48 hours, it treats that as one job, just like one node for three hours.
A
That makes it hard for us to compete if we're doing single jobs like this. We were able to do a lot with backfilling, though: last year (asterisk) we were able to get about a thousand jobs per day through when we were running smoothly on Cori, which was plenty for us, and we were pretty successful using it.
A
So I guess I should jump to my summary now, since I'm at the end of my time. We are running at NERSC with large experimental nuclear physics data, and the backfilling saved us, but the asterisk there is that this is really no longer true in 2020. I think this has to do with what Sudip may have said this morning, that they got about 10 percent more out of KNL, and that's to our detriment, because now they don't have big holes for us to go in and fill anymore, and so it's hard for us to get much throughput this year.
A
So we're doing things to try to adjust for that. But okay, that's all I have, and I've kind of run out of my eight minutes. I suppose I should jump over now to the next person who's supposed to talk, and I think that's Stephen. Yes? So go ahead and take it over.
C
Okay, hi. I'm Stephen Bailey, the data management lead for the Dark Energy Spectroscopic Instrument (DESI). We're making a 3D map of the universe using NERSC as our primary computing center. I'm going to be focusing on the computing part, not the science part, briefly describing what we do at NERSC, some challenges we've had, and some successes we've had. So first of all, just the basics of what we do at NERSC on a nightly basis.
C
So we can analyze it during the day, and then that informs the following night's observing plan. We repeat this nightly for five years, and that builds up a 3D map of around 50 million objects. It's hundreds of gigabytes per night, and we expect that over the next five years to grow to a scale of around 10 petabytes, using around 100 million hours per year over the next five years.
C
Sorry, a spam call coming in on my phone; shutting that off. So then, on a monthly or yearly timescale, we have reprocessing runs that use the latest tagged code, starting from the raw data.
C
This uses the same code as the nightly processing, but it has very different scaling needs, and this is the primary reason why we're working at NERSC. If we were just trying to keep up with the data with 10 nodes, we would just buy 10 nodes and be done with it; it's the fact that we sometimes need to do a burst of processing, years' worth of data as rapidly as possible, that drives us to wanting to use an HPC center. But we also benefit from the one-stop shopping of having our daily processing there too.
C
So, where we sit among the big, large-scale user projects: horizontally is allocation in millions of hours, vertically is storage in terabytes. We're not the largest allocation and we're not the most data, but along the diagonal we're in the top five for big data plus big computing.
C
I wanted to give a shout-out to Debbie for emphasizing that for a lot of these projects it's about much more than just flops and I/O bandwidth. That's very true for us. We use all the different queues, all the different I/O systems; we use the workflow nodes, we use Jupyter; we have Spin services, multiple different Spin services, cron jobs. So we're everything that Debbie said, yay.
C
One of the key challenges that we face is queueing with complex dependencies. This is showing a cartoon version of the processing needs for one night of data, where each box represents a task that needs to be computed.
C
Vertically, the size of the box represents the time needed, and horizontally the number of nodes. So we have some calibration data that's kind of big, and then it gets collated together in a small job, and then some big jobs and a small job, and then a bunch of small jobs, a bunch of big jobs, and a bunch of medium jobs, and then it ends with kind of a big one. That's one night's worth of data, and in the naive version, each of these boxes represents a job.
C
So our first attempt was bundling each step over about a week's worth of data, where we take a bunch of these small tasks and put them together into one job, take a bunch of these larger tasks and put them into another job, and chain them together.
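As a rough illustration of that bundling-and-chaining pattern (not DESI's actual pipeline code), the sketch below packs many small tasks into a single Slurm allocation and chains the next bundle with a job dependency; the script names and task commands are made up.

```python
# Hedged sketch: bundle many small independent tasks into one Slurm job and
# chain a follow-up bundle with a dependency. All names are hypothetical.
import subprocess

def submit(script: str, dependency: str = "") -> str:
    cmd = ["sbatch", "--parsable"]
    if dependency:
        cmd.append(f"--dependency=afterok:{dependency}")
    cmd.append(script)
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

# bundle_small.sbatch might loop over tasks with "srun -n 1 ... &" and "wait",
# so hundreds of short tasks share one allocation instead of one job each.
small_bundle = submit("bundle_small.sbatch")
large_bundle = submit("bundle_large.sbatch", dependency=small_bundle)
print("chained jobs:", small_bundle, large_bundle)
```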
C
This gave us big, HPC-like jobs; it's the most efficient packing in theory, and when it works, it works great. But it still requires hundreds of jobs, and, as David mentioned, only two of them are priority scheduled, and the remainder don't backfill very well. Job B doesn't start aging until job A is finished, and it's coupling otherwise completely independent tasks, which resulted in a lot of fragility: one rank can take down all ranks and mess things up.
C
So our next attempt was to reshape it a bit and pack them all together into one job, accepting the inefficiency that for portions of the time we're not using all the nodes. This gets us a faster end-to-end turnaround for a subset of the data, and it decouples the independent data; it matches well what we do throughout the night, but it's not going to scale up to five years' worth of data processing, so we're still working on how to do this.
C
The special interest group talks were helpful for learning our options, but it also boils down to the fact that only two priority-scheduled jobs is a big limit on experimental data processing, especially when we're running these on behalf of hundreds of users; other projects might even be doing it on behalf of thousands of users. I'm wondering whether experimental facilities should be advocating for getting more than just two slots, and it is somewhat ironic that an HPC center has scaling problems with its scheduler.
C
So, speaking to the NERSC folks on the line: investment here with SchedMD could help improve the effective use of future systems, if Slurm itself could scale up better. But I want to end on some positive stuff, so, successes. One thing that's worked well for us is testing at NERSC. We have a simple but effective nightly cron job that just does a git pull of all of our repos.
C
It runs the unit tests to confirm not just that it works on some Travis CI configuration, but that it actually works at NERSC (a sketch of the idea follows).
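A minimal sketch of that kind of nightly check is below; the repository paths and the use of pytest are assumptions for illustration, not the actual DESI scripts.

```python
#!/usr/bin/env python3
# Hedged sketch of a nightly "does it still work here?" check, meant to be
# run from cron on a login or workflow node. Paths and test runner are hypothetical.
import subprocess
from pathlib import Path

REPOS = [Path("/global/common/software/myproj/desispec"),   # hypothetical paths
         Path("/global/common/software/myproj/desiutil")]

for repo in REPOS:
    # Update the checkout that production actually uses.
    subprocess.run(["git", "-C", str(repo), "pull", "--ff-only"], check=True)
    # Run the unit tests in place, on the machine where production runs.
    subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo, check=True)
```

It would be driven by an ordinary crontab entry along the lines of `0 2 * * * python3 nightly_check.py`; the time of day is arbitrary.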
This is especially important after an upgrade or something, and it also runs a basic integration test. Quarterly, we have software releases where we use Jupyter notebooks to orchestrate the end-to-end integration; some of that is running on Jupyter itself, and some of that is spinning off batch jobs, waiting for them to finish, and coming back.
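For illustration, that "spin off a batch job and wait for it" step, driven from a notebook cell, might look roughly like the following; the sbatch script name is hypothetical and real release runs track many such jobs.

```python
# Hedged sketch: submit a batch job from a Jupyter notebook cell and poll
# until Slurm reports a terminal state. The sbatch script name is hypothetical.
import subprocess, time

def run_and_wait(script: str, poll_seconds: int = 60) -> str:
    jobid = subprocess.run(["sbatch", "--parsable", script],
                           check=True, capture_output=True, text=True).stdout.strip()
    while True:
        state = subprocess.run(
            ["sacct", "-j", jobid, "--format=State", "--noheader", "-X"],
            check=True, capture_output=True, text=True).stdout.strip()
        if state and state.split()[0] not in ("PENDING", "RUNNING", "REQUEUED"):
            return state.split()[0]          # e.g. COMPLETED, FAILED, TIMEOUT
        time.sleep(poll_seconds)

print(run_and_wait("integration_step.sbatch"))
```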
So a question to NERSC is: how will they be supporting continuous-integration testing on GPUs?
C
We'll definitely want some sort of equivalent of that once the GPUs are deployed. Something that's a success, but more of a work in progress, and an idea I wanted to seed: we should be investing as much effort into easy recovery from problems, not just avoiding problems in the first place, as the various experimental facility groups were describing when they gave their talks in the previous series.
C
That's still a thousand failures, which is more than a human can easily handle if the recovery requires custom hand work. So something we've come to realize is that we want to make that easy to recover from, not just put all of our effort into avoiding the problem in the first place. I also wanted to give a shout-out to NESAP, which has been really great: a single full-time postdoc, plus some part-time senior consulting, has resulted in huge speed-ups for us, so thanks to the NESAP team.
C
So, with my last bit of time: we're making a 3D map of the universe using NERSC as our primary computing center. It's that yearly reprocessing that drives the need for HPC, but we're also benefiting from the one-stop-shopping aspect. We have various challenges; I've covered a few, and there are things I've not covered here, but I just wanted to say that queueing isn't our only challenge. We're also having successes working at NERSC, and that's going well. And I met my eight minutes.
E
Okay, hello everybody. My name is Michael Poat, and today I'm going to be giving a talk about physics data production on HPC and our experience with efficiently running at scale. I'm working for the STAR experiment at RHIC.
E
So we went a different route, with a minimal-size container containing just the operating system, the base OS and some of our RPMs, and with CVMFS serving our software. Additionally, in the past we used to have one node that would serve our database on Cori while all the other worker nodes would run STAR tasks.
E
We have thus combined the MySQL service to run alongside the STAR tasks as well, so we can have everything packed in one container, and one node can do one job without having to rely on a head node or another worker node. As for CVMFS on Cori: there's a FUSE restriction on Cori, meaning that you cannot mount CVMFS natively.
E
NERSC does provide these DVS servers that forward the I/O for CVMFS, but they don't support metadata lookups. So we wanted to test this out: we did a throughput test with 15,000 tasks on 240 nodes, and if you look at this little plot here, the flat curve is a good sign, showing the number of events completed per minute.
E
Our workflow looks something like this: we launch a master script to the batch system, and each node that runs in the job will run our container and immediately launch two scripts, one for launching the database service and one for launching the STAR software script. Both of these scripts have sleep delays that create a load-spreading effect. For the database payload, each node is copying about a 25 GB database, and each node is loading the STAR software through CVMFS; having the time delays allows each node to not copy the same exact file at the same time. Once everything is up and running, we can then launch our parallel root4star tasks, and one thing to mention here is this startup portion (sketched below).
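A toy sketch of that staggered per-node startup, assuming the script is launched once per node via srun so that the Slurm-provided node index can offset the sleeps; the delay value, paths, and helper scripts are made up.

```python
# Hedged sketch: stagger per-node startup so all nodes don't hit the same
# database payload and CVMFS files at the same instant. Values are made up.
import os, subprocess, time

node_id = int(os.environ.get("SLURM_NODEID", "0"))
time.sleep(node_id * 5)          # e.g. a 5-second spread between nodes

# Copy the ~25 GB database payload for this node, then start MySQL locally.
subprocess.run(["cp", "/staging/star_db.tar", "/tmp/star_db.tar"], check=True)  # hypothetical paths
subprocess.run(["./start_local_mysql.sh"], check=True)                           # hypothetical helper

# Load the STAR software environment from CVMFS and launch the payload tasks.
subprocess.run(["./launch_root4star_tasks.sh"], check=True)                      # hypothetical helper
```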
E
That startup portion is what we consider our job start efficiency, which I'm going to talk about on the next slide. So really we're focusing on our efficiency on Cori, to maximize the number of events per second per dollar.
E
So, to define a few things: we have our job start efficiency, which accounts for the real time to copy the database, load the environment, the sleep delays, etc.; then our event efficiency, which is the CPU over real time of the STAR event data reconstruction tasks; and then the total efficiency, which covers from the Slurm job start to the last task finish.
E
What we found is that, first off, with having our database being served, we initially had one head node that would basically only run the database, serving say 10 other worker nodes, whereas now we're doing the one-to-one model, where each node serves itself. This really makes a big impact: with the one-to-one model our total efficiency is 99.3 percent, versus 89.44 percent for the 1-to-11 model, which basically dedicates an entire node to the database.
E
So it's better to self-serve the database. The job start efficiency is only a 0.05 percent loss, and this is over a 48-hour job, so the bigger the job, the higher the value. Same thing with the event efficiency: the bigger the job, the higher the value, and it's 98 to 99 percent. And since our tasks require about one gigabyte of memory per task, we can't use all the CPUs on a Haswell node or a KNL node.
E
So we found that it's best to focus on packing the best number of tasks and on how efficiently we can use the machine with the software that we have to run. So, just to wrap it up: for our containerization model, we find it's best to keep the containers to a minimum and leverage CVMFS to serve our software. For the database side, since the Cori compute nodes are on a private network, we have to run the database locally; we're able to copy our database payload to NERSC on demand.
E
We find it's best to launch the database and environment scripts in parallel, to get everything set up as fast as you can and start doing the event processing, although we did find we need to have our time delays implemented for CVMFS. Overall, for the efficiency, the job start efficiency and the idle CPU we see when the tasks finish have a really small impact, especially if we run over the whole 48 hours; really, the head-node model introduced our biggest inefficiency, because we were paying for that node to just run a database. And looking forward:
E
Our next step is to ensure graceful termination, so the idea is to use signal handling in case the tasks would run past the 48-hour limit.
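A minimal sketch of what that signal handling might look like, assuming Slurm is asked to deliver a warning signal shortly before the wall-clock limit (for example with an sbatch `--signal` option such as `--signal=USR1@600`); the shutdown logic shown is only a placeholder.

```python
# Hedged sketch: catch a warning signal shortly before the 48-hour limit and
# shut the task down cleanly instead of being killed mid-event.
import signal, sys

def handle_warning(signum, frame):
    # Placeholder for the real behaviour: finish the current event, flush
    # output files, stop the local database, then exit cleanly.
    print("wall-clock warning received, terminating gracefully", flush=True)
    sys.exit(0)

# Paired with an sbatch --signal option so Slurm sends SIGUSR1 before the limit.
signal.signal(signal.SIGUSR1, handle_warning)

# ... the event-processing loop would run here ...
```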
There is the potential use of the burst buffer for our database content, and our event service is coming soon; that will allow us to start new events and new tasks when one finishes. And that's really the summary of the whole talk. Thank you.
A
Great, thanks a lot, Michael. So I guess we should move on to Jeff. You're ready? You can take over the screen.
F
Great. So I'm going to take a slightly different approach: we're looking at the challenges of integrating NERSC resources into an existing distributed and automated data processing model that's been around for many years. ALICE is a heavy-ion experiment at the LHC; we have a history of working with NERSC at PDSF, and I'm just going to go quickly here.
F
When NERSC introduced Cori, we put in some effort to try and make use of the system, and this is just a kind of hodgepodge of different things that we were working on to leverage the system.
F
We did some benchmarking of the resources and built a system that could handle serial jobs but combined them into something that fit better into the way NERSC processes work. But in reality, four years later, it's mainly used by local groups for one-off tasks and remains an outlier in the ALICE system, and that's because ALICE has this very specific computing model. So the point here was to try and figure out:
F
How do we tie directly into the NERSC system with the ALICE computing model as it exists? So, just briefly, what is the ALICE computing model? It's a distributed facility, a grid facility of about 80 sites that act together as one facility. It's 120,000 or more serial jobs; it runs 24 by 365, all the time. It has a 110-petabyte distributed file system, and there's software that ties all the pieces together. It really is a facility: you can log in, you can do ls.
F
You can edit files, you can move files around; it does act like a facility. The way you can achieve something like this is that no site is very distinctly different from any other site. That's how this thing is able to glue the pieces together: if every site was different, there'd be a lot of manpower going into maintaining this facility as a unique facility. So every site runs the different job subtypes, Monte Carlo simulations and organized data analysis.
F
So what we look at is: how do you link a facility into the ALICE grid? I'm skipping some slides that were in the other talk, but there are a couple of requirements. One is at the node level, and for the most part, particularly since CVMFS has been set up, and using Shifter for the per-node cache, this is working really well for loading at the node level. There are some issues with swap, with not having swap, but that's a small issue.
F
It's not much. The facility level works pretty well too: we have access to a workflow node, which is one of the critical pieces, a single point of contact between this facility and the rest of the ALICE grid, and the local resource management system, Slurm, works fine for us. What is not working well for us is that we need the facility to be optimally configured for serial jobs, and we need long-term storage that is grid-enabled, that doesn't go year to year but goes for the long term. So we can look at how to address those without disrupting the ALICE computing model.
And that's the point here. Now, this is just a simple cartoon of what happens with the ALICE computing model, and if you consider how this works out, these are just serial jobs; they could be other people's jobs in here.
F
The local resource manager schedules the job, an agent is launched, and the agents are all the same. They build a wrapper, and that wrapper goes and gets the payload, and the payload is defined in the central services, not at NERSC. So these are independent; they don't interact with each other, and these are the pieces that operate this in the facility from the node level.
F
Now, one thing we did, since we want to leverage whole-node and multi-node scheduling: we figured out something called a job runner, which is a very thin layer that combines the resources of the entire job, many cores and many resources, and then acts as a broker for those resources. So now it's the job runner that manages the resources and launches the job agents, but the rest of it is pretty much the same.
F
The job wrapper still goes out and gets the payload and runs the job. This was actually initially funded from an LDRD together with physics, and Zac Marshall helped us put this together. So the good news is that we did this initial deployment. Just to give you some scale reference: at the top left, the normal ALICE grid is running 130,000 jobs.
F
The bottom left plot shows the two production facilities we run in the US, including LBNL; they're running around 5,000 jobs. The NERSC allocation, if we ran 24 by 365, would be around 700 or 800 jobs. So we were able to deploy this system and retain the full automatic workflow of the grid. We're getting only about 100 jobs, but we're able to maintain the late binding, the auto clean-up, and the resubmit on failures; we don't have to do anything special for failures, it's automatic, and it's usable for serial, whole-node, and even partial-node scheduling. So this is the good news.
F
The low resource utilization rate is something we're looking at now, and there are several things there; I think we discussed this during the actual talk. I think the main thing is what other people have already said, about only two jobs accruing priority for scheduling while the rest just backfill, and we're using 48-hour jobs. So what we need to do is look at reducing the time to see if the backfill will work.
F
Making big, wide jobs is probably not the right way for our model, just because we like things that run really steadily; you can see from these plots of jobs running that that's typically what we prefer. But this gives us something to work with, and we're continuing on that. The other piece is the storage, and how we manage the storage.
F
Some work was done, also through the LDRD with physics, to utilize the fact that we do have a large grid storage element at LBNL, nearby NERSC but in another facility, and we can use something called a proxy cache to access the data directly from that storage, and we see some marked improvements with that.
F
That's something we're working on in the future, to really optimize that. Just as a summary: the effort was analysis and development, and Cori was a target use case, but it was also for ALICE's future, as we're getting into multi-core simulations and other HPC facilities really requiring whole-node and multi-node scheduling.
F
So this is what we're trying to connect in, without disrupting the ALICE workflow, and we've already seen some benefits. There's another computer at LBL, Lawrencium, that has whole-node scheduling requirements but allows opportunistic utilization; we didn't do anything special, we just turned it on, and it's running fine on Lawrencium. So this activity helps us use both NERSC and other sites as well.
D
All right, I'm speaking now, so if you can't hear me, let me know. We can hear you. Good.
D
So, what is our facility? If the slide will advance: we're LCLS at the SLAC National Accelerator Laboratory, a big, long linear accelerator where we create short, intense bursts of X-rays for doing photon science.
D
We operate 24 hours a day. Currently we send down these short bursts of X-rays 120 times a second, but next year we're supposed to go up to a million times a second, and that is what drives our increased interest in NERSC and other facilities in the US.
D
One of the big examples is that we do nanocrystallography, coming up with structures, and the experiment which we just turned on yesterday, for the first time in 18 months, was imaging COVID-related samples, trying to see which amino acids...
D
So, what about the real-time nature of what we're doing? This is a billion-dollar facility, and it runs 24 hours a day, seven days a week. Currently we generate about two gigabytes a second of data, and we're going to go up to 20 gigabytes a second next year, and that's a challenging data volume, 20 gigabytes per second. And that's just for starters; it's supposed to go up after that.
D
We get about 200 gigabytes per second coming off the detectors, but we reduce it by a factor of 10 in real time. And here's the key point, in green: things change all the time at LCLS; they're really kind of flying by wire.
D
So we need real-time feedback to steer the experiments, and the experiments change dramatically multiple times per week, so we have to be able to adapt very quickly to changing requirements. This real-time data analysis feedback is critical for running these experiments, so we have about one second of latency for our in-hutch analysis; this is done before the data even touches a disk.
D
We multicast the data currently, and we get it over InfiniBand before it hits the disk, so we can get one-second latency. That's not what I'm going to talk about here, because we're not expecting NERSC to provide one-second latency. What we're trying to get from NERSC is a few minutes of latency from disk. So this is what I'm going to talk about today: getting this few-minute latency.
D
So we've been looking into this, with help from Debbie and David Skinner and other people at NERSC, at the possibilities for getting a few minutes of latency. Reservations are a big one, but they're kind of inflexible.
D
The way that it's been described to me, this is like oversold first-class seats on airplanes, and if you're fortunate enough to get one of those first-class seats, you can take advantage of this pool.
D
Then there's this intriguing thing, the so-called flex queue, for jobs that can checkpoint. Density functional theory codes, I think, are the big example: codes like VASP and Quantum ESPRESSO will write out their wave functions every once in a while, so the jobs can be killed, and they get a discount.
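For reference, a checkpoint-friendly, flex-style submission might look roughly like the sketch below; the QOS name, minimum time, and script contents reflect my understanding of the documented flex queue and should be checked against current NERSC policy, and the application name is hypothetical.

```python
# Hedged sketch: submitting a checkpointable job to a flex-style QOS, where
# Slurm may grant anywhere between --time-min and --time of walltime.
import subprocess, textwrap

batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --qos=flex
    #SBATCH --constraint=knl
    #SBATCH --time=48:00:00
    #SBATCH --time-min=02:00:00
    # The application must checkpoint periodically and restart from the
    # latest checkpoint, since the allocation may end early.
    srun ./my_checkpointing_app --restart-from latest   # hypothetical app
""")

with open("flex_job.sbatch", "w") as f:
    f.write(batch_script)
subprocess.run(["sbatch", "flex_job.sbatch"], check=True)
```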
D
So this is sort of starting to feel a little bit like what we would want: we would want to be able to preempt these jobs. And then there's this effort, as I understand it, with DMTCP, with Zhengji Zhao, to make all jobs preemptable, but in the user domain, so you don't have to go do weird things inside the kernel; that's my naive understanding of this effort. But it does require the user jobs to do some work to become preemptable.
D
Okay, so the summary: of the options for us, the real-time quality of service is inefficient, so it's not going to be an option for us.
D
The preemption method that we use here at SLAC won't work at NERSC, so the flex queue is the closest to what we need, and it receives a discount. We've been talking with David Skinner, and my understanding is that NERSC has agreed to expand this flex queue idea and somehow provide us with a mechanism to preempt these preemptable jobs, so that we can get our few-minutes turnaround time and give real-time feedback to the experiments.
A
All right, so I'm not sure if Bryce is here; I'm trying to look for him in the list and I don't see his name showing up, so maybe he was unable to make it today. In which case, I guess what I should do is probably open this up.
A
There have been a few messages going through the chat window, but if anybody has any questions they wanted to bring up here and ask any of the speakers, or any of the NERSC folks who are on, this would be a great time.
H
Hey David, this is Katie Antypas. Thank you again for organizing this; this was really helpful, especially because I wasn't able to attend all the weekly sessions. I had a comment, and then I just wanted to encourage folks on one other item. So first: I guess I continue to hear about job throughput issues, and I thought we had some solutions that could work for people, that were helpful in bundling jobs.
H
At the same time, it's true that our scheduler, and I would say any HPC scheduler, will just get knocked over when there are millions of jobs individually going through, and so we have to find some way to meet in the middle, even if that means providing more assistance for people to change their workflows. I saw something in the comments that Shane is working on a Condor option.
H
The second comment I wanted to make is that the ERCAP season is coming up, and so I wanted to encourage all of you to make sure you know about the Community File System. The Community File System replaced /project; it's about 10 times bigger, about 75 petabytes, and the storage is actually allocated and approved by your program manager. So I would encourage you not to be shy in saying what you need: if you need 30 petabytes of storage, put in 30 petabytes. We don't want you to shrink your ask based on what you think we have, because we really need to know what your workflow needs.
A
Maybe just as a quick follow-up on that: for our particular workflow, we don't need the space for very much time, other than when we're trying to run it through. So I guess I haven't asked for really large disk space in the request before, because I thought, well, the file only needs to be there until the job runs and we get the results back, so we don't really need it as a year-round quota.
A
Is there anything in between, or any option for getting scratch space that exists for a temporary amount of time?
H
If you get a large quota but you don't use it, you're not hurting NERSC users; it's only if you're using that space and you never delete anything. So if you have a... I don't know if I dropped out, did that make sense? Yeah, that made sense. So if you have a 100-terabyte quota and you only use it once in a while, on average you're not taking space away from people, and we can apply an oversubscription factor.
G
But David, are there aspects of how scratch works that are the issue? Do you just need more space than you typically get on scratch, or what is it?
A
It's not that scratch doesn't work. We asked for a special allotment that got us up to 60 terabytes, and that number actually came from our bandwidth limit; it all gets tied together. How fast can we transfer data to NERSC, and then how many nodes would we be able to feed in a steady state? So how much disk space we need really depends on that bandwidth and on how many nodes we can expect to have at any point in time.
G
I'll just chime in; I mentioned this in the chat window, but yeah, I'm a NERSC staff member, but I also work on a couple of other projects, and one of those is NMDC.
G
I don't know what Bryce was going to... it was Bryce who was supposed to talk, or was it?
G
Yeah, and I know that they're using something called Cromwell. It's a way you can encapsulate your workflows using a sort of standard description language, and then there's a tool called Cromwell that can take those in and run them, and so he was probably going to talk about that.
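For readers unfamiliar with it, the pattern being described is roughly: write the workflow in WDL (Workflow Description Language) and hand it to the Cromwell engine to execute. A minimal, purely illustrative example, with a made-up workflow and an assumed local copy of cromwell.jar:

```python
# Hedged sketch: a trivial WDL workflow handed to Cromwell for execution.
# The WDL content and the path to cromwell.jar are illustrative only.
import pathlib, subprocess, textwrap

wdl = textwrap.dedent("""\
    version 1.0
    workflow hello {
      call say_hello
    }
    task say_hello {
      command { echo "hello from a Cromwell-managed task" }
      output { String out = read_string(stdout()) }
    }
""")
pathlib.Path("hello.wdl").write_text(wdl)

# Cromwell "run" mode executes a single workflow locally; in production it is
# usually run in server mode with a backend that submits tasks to Slurm.
subprocess.run(["java", "-jar", "cromwell.jar", "run", "hello.wdl"], check=True)
```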
G
We're using this for NMDC as well, which is the National Microbiome Data Collaborative, and I've hit some of the same issues that were brought up here by others. So even me, as a NERSC staff member, I see exactly the kinds of things that you're mentioning. For these particular workflows, what makes them challenging is that there's a kind of iterative aspect, where it'll do some work and then it will submit jobs.
G
Now, if we submitted to the real-time queue, that would probably mostly address some of these things. But the way that I've worked around this for NMDC, and JGI has a similar approach with a different piece of software, is that there's some intermediate scheduler: in my case I'm using Condor, and in their case JGI is using something they developed internally called JTM, which uses a RabbitMQ message bus. It's the same kind of thing, though: there's this intermediate queue, and then you submit jobs that basically pull work off of that, and that's not too different from how some of the HEP projects work as well.
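To make that pull model concrete, here is a toy illustration, not Condor, JTM, or any production system: worker processes launched inside Slurm jobs repeatedly claim task files from a shared directory, so the number of Slurm jobs stays small no matter how many tasks flow through. Directory layout and task format are made up.

```python
# Hedged toy sketch of the "intermediate queue" pull model: a worker started
# inside a Slurm allocation claims tasks from a shared directory until the
# queue is empty.
import subprocess, time
from pathlib import Path

QUEUE = Path("/global/cfs/myproj/taskqueue")        # hypothetical shared directory
CLAIMED = QUEUE / "claimed"
CLAIMED.mkdir(parents=True, exist_ok=True)

def claim_next_task():
    """Atomically claim one pending task file, or return None if none remain."""
    for task in sorted(QUEUE.glob("*.task")):
        target = CLAIMED / task.name
        try:
            task.rename(target)                      # atomic on the same file system
            return target
        except FileNotFoundError:
            continue                                 # another worker claimed it first
    return None

while True:
    task = claim_next_task()
    if task is None:
        break                                        # queue drained; let the Slurm job end
    # Each .task file is assumed to hold a single shell command to execute.
    subprocess.run(["bash", "-c", task.read_text().strip()], check=False)
```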
So I do think that, long term, we need to work with Slurm to figure out ways to have it deal with these things more effectively and directly, and I don't know exactly how that would be done.
G
Potentially some kind of more hierarchical method might work, so that you could have the idea of a subset of nodes that another scheduler can just focus on, and that might relieve some of the scaling issues that we have to wrestle with.
G
Another thing I think Slurm needs to deal with is really the idea of scheduling a workflow versus just scheduling tasks with a bunch of requirements: how do you treat a bundle of work as something that ages together, so that even if you don't know it all up front, it can still be scheduled effectively?
A
Great, very good points. So one thing I want to make sure I get in, full confession here: this time, when we started our most recent campaign, we found that we weren't able to backfill like we did last year, which had worked fairly well for us. I didn't hear exactly what the improvements were, but I guess Sudip was saying this morning that improvements on Cori KNL got utilization from the mid-80s up to the mid-90s.
A
I'm going to guess that maybe that's one of the reasons why I can't get anything in anymore. And what I've done has actually made the problem worse, because in trying to fit into smaller holes I've made my jobs smaller, so now I have 10,000 jobs at once that I'm submitting, which probably just puts an even larger burden on your scheduler.
H
Well, I think we should have someone follow up with you. I did notice Bryce just said he was here.
A
All right, sorry about that. We only have about three minutes left here, but I'll tell you what: why don't you go ahead and give your talk, and anybody who wants to stay on is welcome to; I'll certainly stay on and listen to it. If you feel like you have to duck out to the plenary session, then go ahead.
B
Is a slide showing up? Yes, I can see it. All right. So thank you, everybody, for inviting me to talk today about JGI and the pipelines that we have. For those of you who aren't familiar with JGI, we're a high-throughput sequencing facility that does DNA sequencing for researchers around the world, and we're also looking at metabolomics and other analyses alongside these DNA sequencers.
B
We have these sequencers at the lab that are producing tens of terabytes of data every couple of weeks, and then we're processing it through multiple pipelines, constantly, through NERSC. You can see the automation here, all these little outputs going into different boxes for our collaborators who are helping out.
B
So last year we had almost 2,000 users actively doing projects at JGI. We have about 16,000 different active projects, and we received about 24,000 different DNA samples over the year for 2019. We have almost 100,000 pipeline runs, and that's for RQC only; I'm the group lead for RQC, doing all these pipeline runs across 30 different pipelines. There are a few other groups at JGI who also run a number of pipelines, and we all use NERSC pretty heavily for that.
B
Here you can see the plots of our growth over time; this is just since 2013. It's fairly linear, but it's going up. Of course, 2020 is a little bit different for everybody, but these things go up because every few years there's a new sequencing technology that comes out that does things cheaper and faster and produces more data for us, and so we're able to accommodate more products over time.
B
Our compute usage over time, you can see, grew and then shrank, and really that's a sign that our product mix is changing a bit; we used to have products which required a lot heavier compute. One other thing that we did: there is something called BLAST, which is a way of aligning the sequences off the sequencer against a big database of samples to try to identify what they are.
B
That was a huge compute sink for us, and so we were able to replace it with something much faster and better that gave us essentially the same output. And of course, over time we're also looking at ways of replacing older tools with newer ones, like BLAST as I mentioned, to make things better for us. I'm seeing chats come up, but I'm not actually following them.
B
Okay, all right. For the pipelines that we run: for example, we run almost all the pipelines on the Cori Genepool partition, and we do this to meet our cycle time requirements.
B
The pipelines that we run are high-memory pipelines, because they're loading all the sequence data into memory; they can take anywhere from 16 gigabytes to three terabytes of memory, depending on the pipeline. It's really heavy I/O: all these files coming off of the sequencer servers are pretty heavy files, even when they're compressed, and we have a lot of variability in the runtime; it can be five minutes on some pipelines
B
So
more
than
seven
days
for
other
things,
you
can
see
here
on
the
box
plot
on
the
right,
even
for
some
of
the
same
pipelines,
there's
still
a
lot
of
variability
in
the
runtimes,
and
this
is
really
product
dependent,
not
even
product
pen.
It's
sample
depends.
Some
samples
are
a
lot
more
complex
than
others
and
take
a
lot
more
resources
to
run.
B
Also, we've purchased some nodes in Cori that are 1.5-terabyte memory nodes that we use to run some of our special pipelines, and those have slightly different characteristics, but yeah, we're using Cori pretty heavily for all of this.
B
So, working with NERSC: I think, like everybody, we've had occasional challenges. Like last year, the Cray upgrade caused some disruptions for our product cycle time at JGI, because all sorts of things started failing or running slower.
B
But we've worked with NERSC, and NERSC has agreed to help us by running some ReFrame tests whenever they're going to do one of these upgrades, so that we can potentially get ahead of some of these problems.
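For readers who haven't used it, ReFrame is a Python-based regression-testing framework for HPC systems; a minimal check along these lines might look like the sketch below, where the smoke-test script and its "PIPELINE OK" marker are hypothetical stand-ins for a real JGI pipeline check.

```python
# Hedged sketch of a ReFrame regression test that could run before and after
# a system upgrade to catch breakage early. Script and marker are hypothetical.
import reframe as rfm
import reframe.utility.sanity as sn

@rfm.simple_test
class PipelineSmokeTest(rfm.RunOnlyRegressionTest):
    def __init__(self):
        self.valid_systems = ['*']
        self.valid_prog_environs = ['builtin']
        self.executable = './run_pipeline_smoke.sh'
        self.time_limit = '30m'
        # Pass only if the smoke run prints its success marker.
        self.sanity_patterns = sn.assert_found(r'PIPELINE OK', self.stdout)
```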
Over time, it also seems like the Cori file system performance is unstable.
B
When DVS goes down, or DVS is slow, it does take us time to go into the Python and say: okay, what went wrong here? Was it actually a problem with the data, a problem with the Python, or was it something with the Cori file system? And a lot of times it seems to be more of a Cori file system issue.
B
As for Perlmutter GPUs: we did have a hackathon, I believe it was May 2019, where we worked at taking some bioinformatics code and trying to port it over to GPUs to see what kind of performance increase we would get. You can see here on the lower left a kind of timing slide, and the green here is essentially the NVIDIA GPU results; you can see they're not better than running on CPUs.
B
Just by turning on the optimization flag for C that isn't on by default, we already got a huge improvement without doing much to the code anyway. It was interesting that a lot of the bioinformatics software people tried to port over at the hackathon didn't see a huge amount of GPU acceleration.
B
That's a really nice feature that we have. And we have a program called MetaHipMer that several JGI staff are working on; it's an assembler that does a huge assembly, taking three terabytes of memory and multiple nodes to take a huge amount of data, assemble it, and get an assembly out of it. We've been able to get that running, somewhat, on the Cori cluster, and there's actually a paper published on that in nature.com.
B
So that's really the last of my slides; I was working through them quickly because I know I didn't have much time. But questions, or other things I can answer for people?
A
Well, you're very welcome. I guess I should also very much thank all the people at NERSC; Katie was the first person who initially suggested this, so I have to thank her for that, and for all the support we've gotten from the NERSC staff on this whole thing.
A
So I'll just take the last word here and say: if you know of any other projects, or any other person who may be working on something at NERSC that is relevant to this, please let me know; you can send me their name. I can prod them, anonymously or not, to come give a talk to us, and not bring your name into it if you want. But it would be good to get a few more talks together.