National Energy Research Scientific Computing Center (NERSC) NERSC 2021 Early Career HPC Achievement Awards Seminars, 6 Jan 2023

From YouTube: Raythena: A Massively Parallel Data Processing Framework for ATLAS Geant4 Simulation, M. Muskinja

Description

Raythena utilizes the Ray software (a high-performance distributed execution framework) to distribute the highly intensive ATLAS Geant4 simulation workflow across a few hundred HPC nodes. Geant4 simulation is the most computationally expensive step of the ATLAS Monte Carlo simulation chain and represents about 50% of the ATLAS computing budget. Conventionally, it is run on ‘grid’ sites and each simulation campaign takes a few months to simulate the desired quantity of proton-proton collision events. Raythena is a solution for running ATLAS Geant4 simulation efficiently on HPCs and it could significantly reduce the duration of simulation campaigns in the future. The goal of Raythena is to process as many events as possible with a given CPU-hour allocation on an HPC as fast as possible. An effective mode of operation at NERSC’s Cori was found to be running 100-200 Cori KNL node jobs in the flex queue. Raythena is a central application that orchestrates the workload management across all nodes using the Ray API. On Cori KNL nodes, 132 Geant4 processes were spawned on each compute node, amounting to more than 25,000 Geant4 processes running in parallel. Raythena handles communication with the ATLAS central PanDA database where it retrieves the input events and feeds them to the Geant4 processes. The Raythena framework was found to scale very well up to 100 to 200 nodes on Cori KNL with virtually no delay between the consecutive processed events.

A

This is how we discovered the Higgs boson almost 10 years ago, in 2012, which completed the so-called Standard Model. But there are other fundamental questions which remain unsolved, and we hope that the LHC can give us the answers in the next 10 years. Just to highlight some of these questions, one of them is: why is gravity so weak?

A

What we mean by this is, for example, if we have two protons, let's say one centimeter apart, then the electromagnetic force between them is much stronger than the gravitational pull, and that's very curious. A couple of possible answers that we have here are supersymmetry or extra spatial dimensions, which would sort of weaken the effect of gravity.

A

And then the other question is: what's the origin of dark matter? For example, we know that the great majority of the mass in the universe is actually dark, so we don't see it. It doesn't interact with us in any other way except gravity, and we hope we could maybe produce it directly at the LHC.

A

Another interesting question is why we have matter at all. The point here is that all processes that we know, all processes that we produce at the LHC, create equal amounts of matter and antimatter, and when matter and antimatter meet they annihilate and turn into light. So there must have been some process in the early universe which created more matter than antimatter. We just don't know what it is. And the last question here is neutrinos; neutrinos are very curious.

A

We know they have mass, we don't know why they have mass, and we also don't know if they are Majorana, so whether they are their own antiparticles or not. So these are some of the questions which we hope to answer, and I'll go through more details about the ATLAS detector itself. ATLAS is a very large machine that's really designed to detect the smallest particles. Here I'm just comparing it to the size of the NERSC building, for fun.

A

So you can see it's sort of roughly the same size, or a bit larger than this building if you look at it from the side, so it's really very large. It's 46 meters long and 25 meters wide, it weighs 7,000 tons, and it's equipped with strong magnets of 3.5 tesla, which bend the trajectories of particles so that we can measure their momentum. Slide 7 shows sort of how the collisions look. This is now the transverse plane in the actual detector.

A

So it's the cutout in the transverse plane, and roughly we have more than a billion proton collisions per second. That's a huge number, and we're not able to save them all to disk, because we have limited resources and there's no technology that allows us to do this. So roughly we save about 25,000 or a bit more collisions per second, which gives us around 20 billion events in the data-taking period between 2015 and 2018.

A

And this requires huge disk space, for example more than 100 petabytes, and because of this we also need huge computing resources to process all this data and get physics results. And to make things even worse, on slide 8: the way we interpret the data is that we compare it to theory, but to get the theory predictions we need Monte Carlo simulation. Monte Carlo simulation means that we create proton collisions one by one using random number generators.

A

So, for example, I have one illustration here. We create these proton collisions with generators, then we have to simulate them, meaning that we simulate the response of the ATLAS detector. For this we're using the so-called Geant4 toolkit, which I'll talk about more later on, and once we simulate the response, we can use the same reconstruction software as is used for real data collected with the ATLAS detector at the LHC to reconstruct the events, and then we can do statistical analysis and get physics results. Now, a crucial thing to mention

A

Here is that, to not have large statistical uncertainties, we actually need to simulate more events than we have actual collisions. As I said, we had about 20 billion collisions, and we also simulated about 60 billion events so that we are able to analyze this data.

A

So these are huge numbers, and the only way we can tackle this is using the so-called Worldwide Computing Grid. Usually we just call it the grid for short. This is a connection of around 170 sites in 42 countries around the world, and it gives us roughly 1 million computer cores.

A

This is just one screenshot from one particular day in the past, but the global transfer rates can exceed 60 gigabytes per second, and that's what really can meet the computing demands of the LHC experiments. Now, in more detail, on slide 10:

A

This shows which computing resources are mainly used in ATLAS, and you can see this sort of yellowish color corresponds to the so-called grid, so these are regular grid sites. And there's some small fraction of HPCs, starting to increase since 2020 or so. For example, you have this dark blue called HPC special, and this corresponds to, for example, Cori KNL, and then there are also other HPCs. For example, in 2021 we included the Vega HPC, actually from my home country, Slovenia.

A

So you can see a large spike here, and then a couple of months back there was also the addition of Karolina from Czechia. Both of these are part of the EuroHPC project. But you can see

A

Utilization of HPCs is increasing, so we need to be able to use them efficiently in order to do the analysis. So just to break it down here, on slide 11, what we are using the CPUs for: there are many components here, but the largest one, the blue one, is Monte Carlo simulation. This corresponds to what I was saying before about simulating the detector response with the Geant4 toolkit, and it turns out, depending a bit on the year and so on, that about 40 to 50 percent of our CPU hours are spent on this workflow.

A

So obviously it's important to speed this up if we can, and there's in fact a huge effort to make this run faster. I was also working on this; I actually achieved a 20% speed-up, but it's not in this talk, I just have a link to the paper here. But another solution that we can pursue here is to actually optimize this workflow to run more efficiently on non-conventional sites such as HPCs, and the focus of this talk is running

A

This particular workflow on HPCs such as Cori. Now, just a bit more motivation before I move on: this is a projection of the CPU needs.

A

So the plot shows years from 2020 to 2034, and the colored points, for example blue and red, show how many resources we are expected to need under certain scenarios, and then the solid black lines are the projections of how many resources we will have available. The point here is that we expect to have enough resources to tackle the computing issues, but we will need to be able to use them efficiently, and many of these could be HPCs in the future.

A

Okay, so first I'll give an overview of how we are running simulation on the regular grid sites, then I'll compare the difference to HPCs, so I can clearly illustrate the issues that we faced on HPCs.

A

So, firstly, we're using the so-called PanDA system, which stands for Production and Distributed Analysis system, and this is where we aggregate all of the tasks in the ATLAS collaboration. It's a central database; any user can submit tasks, and by users I mean around 2,000 physicists around the world, from any institution. So we all submit tasks to the PanDA server, and then the PanDA server delegates the workload to grid sites.

A

So here on the grid site, we have a software called Pilot which receives the job. But in order to do this we need HTTP communication between the sites and the PanDA server.

A

Okay, now once we are on the site, we will use Athena, which is the main software processing framework, and in this particular case, when running simulation, we're using the so-called AthenaMP, which is the multi-process version. The point here is just that we first have an initialization step in this Athena software, which basically loads the detector geometry, magnetic field and so on, and then we do a fork of this process and we start the so-called worker processes.

A

Now these worker processes can share some of the memory, that is the detector geometry and magnetic field, and then they start processing events in parallel. Events here are really different proton-proton collisions, and this is the smallest unit of processing that we can possibly have in this scheme. And then, in addition, we're using the event service mode here, where we can provide input

A

Events on demand from an external application, so the number of events that we want to process does not need to be determined in advance if we're using the AthenaMP event service mode. Okay, so this is how we're processing events on one node, and now coupling this with the PanDA server: we're going to submit multiple tasks to various grid sites, and then for each grid task we run a separate AthenaMP instance on one node. That's, for example, illustrated here. In this case we have an AthenaMP process with eight worker processes.

A

So at the beginning we have some initialization time. Then we start processing events in parallel, and here one thing to note is that the processing time of one event is unpredictable and it varies, so it can be from 2 to 30 minutes.

A

So really the runtime of the task is determined by the completion of the last event. In this case at around 15 we're done, but because of this variability we end up with some dead time here at the end, which is basically lost CPU, and we cannot really avoid this. This is not really a big issue on one node, but you will see this doesn't translate well directly to HPCs. So here on slide

A

17 I'll cover the issues on HPCs. Firstly, depending on the HPC, queue policies generally disfavor single-node tasks for large workloads. For example, if you want to run something on Cori KNL, there's no shared queue available, which you would typically use for a single-node job. Also, the charged CPU hours may be larger compared to multi-node jobs.

A

So already on HPCs we want to run multi-node tasks as opposed to single-node tasks. The second point is that networking may not be generally available on HPC compute nodes, and this could be an issue because in the conventional grid tasks we actually need HTTP to retrieve the task info from the PanDA server. And then the third point, which is perhaps the most important one, is that treating nodes as independent processing units creates large inefficiencies, so I have this illustrated below.

A

This is how we would just naively run our simulation on multiple nodes. For example, in this case we would start a job with eight nodes, and on each node we would just run an AthenaMP instance with, say, 32 cores, 32 processes, and we would instruct each one to process 1,000 events. And because this processing time varies, we would get much larger inefficiencies at the end.

A

So, for example, seven of the nodes would already finish processing, but then we have to wait for the final one to actually finish the job, and this would be lots of lost CPU hours while waiting for the last process to finish. That's something we would like to avoid if possible, and this is where our solution comes in; this is exactly what we wanted to solve. Now, this is the conceptual design of the solution, on slide 19.
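The cost of waiting for the slowest node can be put into numbers with a few lines of Python. This is a minimal sketch rather than anything from the ATLAS software: the function name `tail_loss` and the finish times are invented for illustration.

```python
def tail_loss(finish_times):
    """Return (idle_node_minutes, idle_fraction) for a multi-node job.

    finish_times[i] is the minute at which node i completed its last event.
    The job only ends when the slowest node ends, so every other node sits
    idle (but is still charged) from its own finish time until then.
    """
    end = max(finish_times)
    idle = sum(end - t for t in finish_times)
    return idle, idle / (end * len(finish_times))


# Hypothetical finish times, in minutes, for eight independent nodes.
node_finish = [500, 512, 498, 505, 530, 507, 495, 561]
idle, fraction = tail_loss(node_finish)
print(f"idle node-minutes: {idle}, share of the allocation: {fraction:.1%}")
```

In this made-up case a single slow node leaves the other seven idle for 380 node-minutes in total.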

A

So, first of all, we treat all nodes in the task as a single process, and then we create a central application that feeds events one by one to each node on demand. It looks sort of like this: we have one central application, which we call the driver, that has access to the shared file system, and then we have other nodes where we run this AthenaMP application on each of the nodes, and then, instead of treating them as independent, we control them centrally and we feed them events one by one.

A

So basically, by having this fine granularity at the event level, we really reduce the inefficiency that would occur at the end of the job.
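The difference between independent nodes and a central feeder can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not the Raythena scheduler: event times are drawn uniformly between 2 and 30 minutes (the range comes from the talk, the uniform shape is assumed), feeding is modelled as a greedy "next free worker" rule, and all function names are hypothetical.

```python
import heapq
import random


def feed_on_demand(durations, n_workers):
    """Greedy feeder: each event goes to the worker that frees up first.

    Returns the time at which the last worker finishes.
    """
    free_at = [0.0] * n_workers  # a list of equal values is already a valid heap
    for d in durations:
        # Pop the earliest-free worker and push back its new finish time.
        heapq.heapreplace(free_at, free_at[0] + d)
    return max(free_at)


def independent_nodes(durations, n_nodes, workers_per_node):
    """Naive scheme: every node gets a fixed, equal batch of events up front."""
    batch = len(durations) // n_nodes
    node_end = [
        feed_on_demand(durations[i * batch:(i + 1) * batch], workers_per_node)
        for i in range(n_nodes)
    ]
    return max(node_end)  # the job ends when the slowest node ends


def efficiency(durations, makespan, n_cores):
    """Fraction of the allocated core-time spent processing events."""
    return sum(durations) / (makespan * n_cores)


random.seed(0)
N_NODES, WORKERS = 8, 32
events = [random.uniform(2, 30) for _ in range(N_NODES * 1000)]

naive = independent_nodes(events, N_NODES, WORKERS)
central = feed_on_demand(events, N_NODES * WORKERS)
for label, makespan in (("independent nodes", naive), ("central feeder", central)):
    eff = efficiency(events, makespan, N_NODES * WORKERS)
    print(f"{label:>17}: job ends after {makespan:.0f} min, efficiency {eff:.1%}")
```

With a central feeder the idle tail is bounded by roughly one event length per worker, whereas with fixed per-node batches it also includes the spread in total work between nodes.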

A

So now, technically, how we implemented this was using the Ray framework. Ray is a very commonly used framework nowadays for distributed computing. It has a very simple Python API to express parallelism and data dependencies. For example, this is just a screenshot from their GitHub page. You can see it's widely used by the community, it has integration with many other libraries, and it's also developed by the RISELab at UC Berkeley, so we have a connection locally as well. Basically, how it works is illustrated in this plot below.

A

So in this case we have three nodes. These are separate HPC nodes, and we have a driver application on one of them, which controls the whole thing, and on this head node we also have the so-called global control store, which is basically using a Redis server to establish the connection between the various nodes.

A

So these worker nodes connect to this Redis server on the main node using a TCP connection. This is how we establish the Ray cluster, and then the Ray framework automatically schedules the assigned work on all available resources in the cluster. The really nice feature here is that we don't have to worry about scheduling, because Ray does the scheduling itself, and it does it in a smart way so that we utilize all available resources.

A

Okay, so slide 21 now shows the actual architecture of our solution here, called Raythena. Firstly, we have exactly one Ray process on each node in the cluster, so it's either a Ray driver, which is the main process, or a Ray actor.
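The split into one driver and one stateful actor per node, with events handed out only on request, can be mimicked with the Python standard library. The sketch below deliberately does not use the Ray API: threads and queues stand in for Ray actors and remote calls, and the names `NodeActor` and `drive` are hypothetical.

```python
import queue
import threading


class NodeActor(threading.Thread):
    """Stand-in for one per-node actor: it keeps state between events."""

    def __init__(self, name, requests):
        super().__init__(name=name)
        self.requests = requests      # shared channel from actors to the driver
        self.inbox = queue.Queue()    # events the driver sends to this actor
        self.processed = []           # state that survives from event to event

    def run(self):
        while True:
            self.requests.put(self)   # ask the driver for one more event
            event = self.inbox.get()
            if event is None:         # the driver has nothing left to hand out
                return
            self.processed.append(event)  # placeholder for the real simulation


def drive(events, n_actors):
    """Driver loop: hand out events one by one, only when an actor asks."""
    requests = queue.Queue()
    actors = [NodeActor(f"node-{i}", requests) for i in range(n_actors)]
    for actor in actors:
        actor.start()
    pending = iter(events)
    active = n_actors
    while active:
        actor = requests.get()        # block until some actor is free
        event = next(pending, None)
        actor.inbox.put(event)
        if event is None:             # that actor will now shut down
            active -= 1
    for actor in actors:
        actor.join()
    return {actor.name: actor.processed for actor in actors}


result = drive(range(20), n_actors=3)
print({name: len(done) for name, done in result.items()})
```

Which actor receives which event depends on timing, so the per-node counts can differ from run to run; only the total is fixed.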

A

So an actor is a special process in Ray which keeps its state, and then inside each of the actors, inside a Shifter image, we start this AthenaMP process, which is going to process our proton collisions. This process will have 32 worker processes on Haswell, or 134 on KNL. And then the driver process feeds events one by one, on demand, to the processes on the other nodes, and depending on how many nodes we have, this can be on the order of 1,000 events per

B

A

second that have to be coordinated across the nodes. Then we also write the output to the shared file system through the driver process, to keep the output centrally, and this enables us to do checkpointing. So we can stop this process at any time and we know how to continue, because we have the outputs already saved locally.
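The checkpointing idea, where saved outputs double as the record of what is already done, can be sketched as follows. This is an assumption-laden illustration: the file naming scheme (`event_<id>.out`), the helper names and the payload are all invented, and a temporary directory stands in for the shared file system.

```python
import tempfile
from pathlib import Path


def finished_events(output_dir):
    """Event IDs that already have a complete output file."""
    return {int(p.stem.split("_")[1]) for p in Path(output_dir).glob("event_*.out")}


def save_event(output_dir, event_id, payload):
    """Write one event's output; rename last so a partial file is never counted."""
    final = Path(output_dir) / f"event_{event_id}.out"
    partial = final.with_suffix(".tmp")
    partial.write_bytes(payload)
    partial.replace(final)  # rename is atomic on POSIX file systems


def run_job(output_dir, all_events, budget):
    """Process up to `budget` events that are not yet done, then stop."""
    done = finished_events(output_dir)
    todo = [e for e in all_events if e not in done][:budget]
    for event_id in todo:
        save_event(output_dir, event_id, b"simulated hits")  # placeholder payload
    return len(todo)


with tempfile.TemporaryDirectory() as shared:
    events = range(10)
    first = run_job(shared, events, budget=4)     # job stopped after 4 events
    second = run_job(shared, events, budget=100)  # next job picks up the rest
    done = sorted(finished_events(shared))
print(first, second, done)
```

Because the second job only looks at which output files exist, it needs no other state from the first job to resume.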

A

Okay, now this is an example of how this system performs. This is an example with two Haswell nodes, and the way to interpret this plot is that at the beginning you have this initialization step.

A

This is where we start up Athena and load the geometry and the magnetic field, and this is common to all worker processes. Once this step is performed, we actually fork this process; on Haswell we have 32 worker processes, and then they start processing events one by one, and this is what each of the blocks corresponds to here. So each of these blue rectangles is a separate event that we have processed, and you can see there are almost no white spots between the events, meaning that this is highly efficient.

A

There is basically no downtime, and we were able, in this case, to provide events fast enough.

A

Okay, we've also done various scaling tests here, and in this example we went up to 200 KNL nodes, and actually on KNL we're running Athena with 134 processes. So if you combine this, that's more than 25,000 processes running in parallel that we can control centrally from one application and feed with events.

A

Also in this case we did not find any bottlenecks, and this is one of the plots from Julian, where he shows, depending on how many nodes we have in the cluster, what the average latency between the events is. It's basically constant, at about 0.5 seconds, which again shows that we don't have any bottlenecks in this system up to 200 nodes. Good, so that's sort of the first part.

A

This was the proof of concept, where we demonstrated that we can run, on Cori, distributed computing frameworks such as Ray to parallelize the ATLAS Geant4 simulation on HPCs. We had no issues in scaling up to 200 nodes, and basically we solved this bottleneck of having to wait for the last node to finish by having a fine granularity at the event level.

A

Now, the next step of the project was to actually integrate this into the ATLAS production system, because this is what you would need to run this in production. The two main components which are needed here are the Harvester software, for the communication with PanDA, and then the Pilot software, which is an intermediate layer between the Ray actor and Athena. And again, this is what Julian really did great, and this integration work was part of his master's thesis.

A

How am I doing on time? Yes.

A

Okay, so now, firstly, in this integration, I'll start with Harvester. Harvester is the software which talks to the PanDA server and retrieves the job info from the central server and then starts the processing on Cori. This is an application that runs as a daemon on an edge node, so it's running all the time, and in this case we used cori21.

A

So we got special permissions to use this node, and the Harvester application constantly keeps an HTTP connection with PanDA, and we wait for some user or the production system to submit a task which is suitable for Cori. Then this Harvester application starts the Raythena job using Slurm, and here we have the Raythena application running as I was explaining before. But these two applications still need to communicate while the job is running, and this communication is implemented via shared file system communication.

A

So basically they are both creating text files with info and they are exchanging these text files. Then, on the other side, we also need to retrieve the input files from the grid. For example, the files which we're going to process are not already located on Cori, and here we were using the Globus software to actually retrieve the inputs from wherever the input was, at some other compute site, and then we also have to write the output back to the grid. All right.
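Exchanging small text files through a shared directory can be sketched as below. The talk only says that both sides write text files with info; the JSON format, the file names, the field names and the helper functions here are all assumptions made for illustration, not the actual Harvester or Raythena file layout.

```python
import json
import os
import tempfile
from pathlib import Path


def post_message(mailbox, name, body):
    """Publish `body` as <mailbox>/<name>.json without exposing partial writes."""
    mailbox = Path(mailbox)
    tmp = mailbox / f".{name}.tmp"
    tmp.write_text(json.dumps(body))
    os.replace(tmp, mailbox / f"{name}.json")  # readers see the old or new file


def take_message(mailbox, name):
    """Return and remove the message if it exists, otherwise return None."""
    path = Path(mailbox) / f"{name}.json"
    try:
        body = json.loads(path.read_text())
    except FileNotFoundError:
        return None
    path.unlink()
    return body


with tempfile.TemporaryDirectory() as shared_dir:
    nothing_yet = take_message(shared_dir, "job_request")   # poll before any post
    post_message(shared_dir, "job_request", {"task": "simulation", "events": 1000})
    received = take_message(shared_dir, "job_request")
    after = take_message(shared_dir, "job_request")         # already consumed
print(nothing_yet, received, after)
```

Writing to a temporary name and renaming afterwards means the reader never parses a half-written file, which matters when both sides poll the same directory.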

A

So this is Harvester, the first part of the integration, and then the other part of the integration is using the Pilot software. Pilot is something that's already been designed to handle communication with AthenaMP.

A

In addition to doing this communication, it also saves all sorts of diagnostics and log files, and this is useful for running in production, because then, if something goes wrong, we can easily investigate these log files and figure out what went wrong. And actually here we wanted to mimic exactly the behavior that regular grid jobs have, where Pilot is directly connected to PanDA, and for this reason we implemented this HTTP communication here.

A

So we have a Ray actor on one node, with the Pilot process on the same node, and we have local HTTP communication between them. This was to mimic how this is usually run on grid sites. Here, one difference compared to the original version, where we have this event-level granularity, is that Pilot is really designed to have a cache of N events, where N is the number of AthenaMP worker processes.

A

So this fine event-by-event granularity is now reduced to N events, but it's still very fine and doesn't produce any bottlenecks at the end of the process.
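A cache of N events sitting between the feeder and the workers can be sketched like this. How Pilot actually manages its cache is not described in the talk, so the class `EventCache`, its refill policy and the `fetch` callback are hypothetical.

```python
import itertools
from collections import deque


class EventCache:
    """Holds up to `capacity` events between the feeder and the workers."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch        # callable(n) -> list of at most n new events
        self.events = deque()

    def refill(self):
        """Top the cache back up to its capacity, if the feeder has events."""
        missing = self.capacity - len(self.events)
        if missing > 0:
            self.events.extend(self.fetch(missing))

    def next_event(self):
        """Give one event to a worker, or None when everything is exhausted."""
        self.refill()
        return self.events.popleft() if self.events else None


# A plain iterator plays the role of the central feeder in this sketch.
source = iter(range(10))
cache = EventCache(capacity=4, fetch=lambda n: list(itertools.islice(source, n)))

first = cache.next_event()
held = len(cache.events)                   # events committed to this node
rest = list(iter(cache.next_event, None))  # drain until the cache reports None
print(first, held, rest)
```

At any moment at most `capacity` events are committed to one node, which is the sense in which the granularity goes from one event to N events.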

A

Okay, now, when we put all this together, this is how it looks. At the edge node, we have the Harvester application, which retrieves the job info from the PanDA server and then does the batch submission, which starts up the Ray driver, which has the Redis server. The Redis server is used to have this TCP connection with all nodes in the cluster, and then on each of the nodes we have a Ray actor, which runs this Pilot application and has HTTP communication with the Pilot application, and the Pilot application

A

Eventually runs AthenaMP, which is our processing software for running the Geant4 simulation.

A

So now, this scheme allows us to run this kind of job through the central PanDA system, and we can run a multi-node HPC application. The nice thing here is that really any physicist in the ATLAS collaboration can submit such a job and get access to this kind of multi-node application. But because we added all these additional layers, it became significantly more complicated than the original application, and also now we're using multiple communication channels, which is maybe not the nicest: for example, shared file system communication, TCP, HTTP.

A

Okay, so slide 29 now shows you a diagram of how this looks in practice. This is an example where we have 50 KNL nodes and 134 processes per node, and I'm showing more things than before. So at the beginning, this small gray

A

Area is actually the initialization of the Ray process itself, and this is a very short period. Then, in addition to initializing Athena, we also need to initialize this Pilot process, which takes some time. But after everything is initialized, which on KNL can take on the order of 10 minutes,

A

Then we start processing events, and again you can see this has almost no white spots, so it's running really as efficiently as it can. In this example, you can see there are some white spots all at the same time. In this particular case

A

We assume it's something to do with the high load on the shared file system, but this is not something we would usually see. And then, at the end of the job, I'm also showing you these green lines, and these are events which we started processing but never finished, because the job terminated after a certain time, which in this case was two and a half hours.

A

So we terminated the job, and then every event that we were processing but didn't finish is lost, but this inefficiency is something we cannot really avoid, because we cannot process units smaller than one event. So all in all, this is highly efficient.

C

So this is about 20 minutes of initialization time? Is that what this means?

A

C

Exactly, yeah. What's happening in that?

A

So the blue one is easy to explain. The blue one is when we start up Athena, and it has to retrieve the detector geometry and load it in memory, and it has to retrieve the magnetic field and load it in memory, and it turns out this is quite slow on KNL. On KNL it usually takes about 10 to 15 minutes. It just has to fill up the memory with all the relevant data that we need to process the events.

C

Is there, like, a file-per-process input or something?

A

So it's accessing databases and all this kind of stuff. Usually we would do this via the network, but we wanted to avoid having to access the network. So here we had a Shifter image which had this database saved in the image, so the image was around eight gigabytes large, and then Athena has to access these databases and read the relevant information.

A

C

How much faster is it on Haswell?

A

So I believe I have one example from a different processor here, yeah.

A

So this is one example I have from Argonne, where we had Xeon processors, and here you can see it's a bit faster, but it's still a significant time. So that's something we unfortunately cannot avoid, and the only way to mitigate it is to actually run longer jobs, because once we have this initialized, we can continue running for as long as we want. So this we can mitigate by running the job longer.
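How much a longer job helps can be estimated with simple arithmetic. The sketch below assumes a fixed start-up cost per job and ignores every other source of inefficiency; the function name and the job lengths are illustrative, and 15 minutes is the upper end of the KNL start-up time quoted in the answer above.

```python
def useful_fraction(job_minutes, init_minutes):
    """Share of the wall time left for event processing after initialization."""
    if job_minutes <= 0 or not 0 <= init_minutes <= job_minutes:
        raise ValueError("need 0 <= init_minutes <= job_minutes and job_minutes > 0")
    return (job_minutes - init_minutes) / job_minutes


# Illustrative job lengths in hours; 2.5 h matches the example job in the talk.
for hours in (1, 2.5, 6, 12):
    share = useful_fraction(hours * 60, 15)
    print(f"{hours:>4} h job: {share:.1%} of wall time available after start-up")
```

Under these assumptions, a 15-minute start-up takes a quarter of a one-hour job but only a tenth of a two-and-a-half-hour job.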

A

Yeah thanks great question.

A

Okay, so slide 30 just summarizes the current status of this. We have successfully demonstrated running multi-node Raythena tasks through the ATLAS production system, and we've tested this up to 100 Cori nodes. In addition, we also tested this system on other HPCs, such as at Argonne, and the reason for this is that the whole system was designed to work on multiple HPCs, because we want one solution that we can use on any site in ATLAS.

A

We don't want to develop a different solution for every site. But we did actually run into some scaling issues once we did this integration with Harvester and Pilot, and one of the issues in particular was actually collecting the output binary files and log files. This job, you can imagine, produces lots of output files.

A

In fact, we produce one output file per event, and we can process hundreds of thousands of events here, and then we need to collect this output and send it out to grid storage. This in principle is performed by Harvester, which sits on the edge node. But in our particular case it created a very high load on the file system, especially once we started running 200 or so nodes, and basically it could just crash the entire node or break the whole node.

A

So that's something which is not yet solved and requires that we work together with the Harvester team. But in general we also need to do some more stress tests. For example, in some cases Ray may not be entirely stable, which is sometimes difficult to reproduce. So we need to figure out this kind of issue before this can be certified and ready for use in ATLAS production.

A

Also, one of the issues I would like to point out here is more a sociological one: it's actually quite difficult in the physics community to find interest in this kind of work. For example, I was the main developer on this, but I already transitioned to a 100% physics job, so I'm not really working on this to any large extent.

A

So that's one of the other issues as well.

A

Okay, so I'd like to give some outlook as well.

A

So one nice thing about this whole system is that Raythena actually establishes the connection between the PanDA server and a Ray cluster on an HPC, where we can have a cluster with 200 or so nodes. You can imagine swapping out this AthenaMP payload with some generic payload, which can be written by any user, and this basically means that any user in the ATLAS collaboration can submit a job which utilizes the whole Ray environment, which has integration with all sorts of beautiful libraries, for example for machine learning.

A

So essentially you could run something like distributed XGBoost for distributed machine learning on an HPC cluster of hundreds of nodes, and you could submit this kind of job from anywhere in the world through the PanDA system. So potentially it's very useful for any other application that requires lots of nodes to run efficiently. And just the last slide, to summarize: the ATLAS experiment at the LHC has recorded about 20 billion proton-proton events. To interpret this data we simulated about 60 billion events, and to process all this

A

We need enormous CPU power. In the future, this CPU power that we need will only grow, because the rate at which we collect the data is going to increase, and efficient utilization of HPC resources is crucial to meet this kind of computing demand.

A

So the presented Raythena work is a framework based on Ray for distributed ATLAS Geant4 simulation across multiple CPU nodes. We have shown scaling tests that work well up to 100 or so nodes, where we have more than 25,000 processes in parallel with great efficiency, with almost no downtime between the events that we process. That's thanks to the fine granularity which we could implement. It solves the bottleneck of having to wait a long time at the end for the last job

A

To finish. A bit more work is needed before we can have a production-ready version, and all in all, this integration of Raythena with the central PanDA server also opens up an exciting possibility of

A

Distributing other workflows, such as, for example, machine learning, on HPCs.

A

Okay, that's all thank you.

D

Oh, thank you. Yeah, that was a great talk, I really enjoyed it. It's a pretty impressive body of work and workflow that you put together here.

D

It's really exciting for us to see various enabling technologies being used, like the container capability in Shifter, and Globus, and various other things that you mentioned. And we're working more and more to design systems that will more easily support this kind of computing, this kind of workload, this kind of approach to things, like enabling direct network, Ethernet access to compute nodes, and things like that.

D

So yeah, and I appreciate your comment about the workforce as well. I mean, this is really work that enables so much; it's critical to enabling the whole system to be able to work, and so it's really important that people do it and that people really get recognition for it. So I want to recognize you for this, and I hope this work continues to get a lot of usage going forward.

D

But with that I'm happy to open it up to questions, I think.

D

C

Okay, all right. I have a question about the fact that you had the cluster running on one partition or the other, so you had one kind of node available to you, right, KNL or Haswell.

C

You couldn't pick to have the scheduler be one kind of node inside the compute partition, right? You couldn't set up something where, within the job allocation, you had one kind of node and then all your workers were some other kind of node that was maybe more optimal for doing the event processing.

C

So is it interesting to you to be able to have, within a single job allocation, multiple different kinds of nodes doing different roles for this kind of event processing? If you could have had that, would you have liked it better than what you had to do here, or is it not that big a deal?

A

Well, that's a very interesting comment. First I'll say the reason why we were using KNL here. Actually, this software for simulation, we found it runs quite slowly on KNL compared to Haswell. But still, given the cost of CPU hours on KNL compared to Haswell, we found it to be more efficient to run this kind of stuff on KNL, because we can have 134 processes, and it turns out that we can have a higher throughput on KNL, given some finite CPU-hour allocation.

A

But having said that, we also have this node in the job allocation with the Redis server, and perhaps indeed it's not really optimal to run this on KNL. In this case we had one entire node dedicated just to running this Redis server, because we didn't want to risk also running this huge CPU load on the same node. But certainly that's something which might run better if you ran it on Haswell. So for example, yes, you could have one Haswell node and then other KNL nodes.

A

Maybe this could actually be quite good. Thanks.

E

This is Heather. I'm going to ask you a question. So, were there any sort of resilience features set up for Raythena, so that if a node went down, or the shared file system was struggling, or some sort of infrastructure issue arose, it would know how to fail, alert the user that something was wrong, or recover?

A

Yeah, no, that's a great question. To a large extent we solved this by saving the output: as soon as we process one event (maybe if we go to one of these plots here), we immediately save it to a common location on the shared file system. So if for some reason the job closes unexpectedly, we know how many events we processed, and when we submit the next job we can just continue from where we left off.

A

In this sense, we have checkpointing, which works very well. Another question, I think, was what happens if we are running on, like, 200 nodes and one of the nodes dies for some reason. This is something we were debating, what we should do here. Currently in this case we would just shut down the entire job so that we don't have any inefficiencies, but that's something which one could study and optimize. Ray certainly has capabilities such that we could decide to keep running even if one of the nodes is down, but at the moment we just terminate the entire job.
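Illustrative sketch (not from the talk): the event-level checkpointing described above can be mimicked with one output file per event on a shared directory. The file naming, the `run_job` and `processed_event_ids` helpers, and the write-then-rename step are assumptions made for this example; Raythena's actual layout may differ.

```python
from __future__ import annotations

from pathlib import Path


def processed_event_ids(output_dir: Path) -> set[int]:
    """Return the IDs of events whose output file already exists."""
    return {int(p.stem.split("_")[1]) for p in output_dir.glob("event_*.out")}


def run_job(event_ids, output_dir: Path, max_events: int) -> int:
    """Process up to max_events unfinished events and return how many ran.

    max_events stands in for the job hitting its wall-time limit.
    """
    done = processed_event_ids(output_dir)
    ran = 0
    for event_id in event_ids:
        if event_id in done:
            continue  # already saved by an earlier job, skip it
        if ran == max_events:
            break  # the "job" ends here; later jobs pick up the rest
        # Write under a temporary name, then rename, so that a job killed
        # mid-write never leaves a partial file that counts as finished.
        tmp = output_dir / f"event_{event_id}.tmp"
        tmp.write_text(f"simulated event {event_id}\n")
        tmp.rename(output_dir / f"event_{event_id}.out")
        ran += 1
    return ran
```

Calling `run_job` repeatedly on the same directory continues from where the previous call stopped, which is the behavior the speaker describes for resubmitted jobs.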

D

Yeah, thanks, Heather.

B

Yeah, so I was wondering about the plots where you showed what happens at the end of the job, where you have these processes that get killed when the job ends. It actually looks to me like it's quite a significant fraction of time that you lose from this effect.

B

Are there any ideas that you thought about for how this could be improved in the future? Because it's not clear to me that it would necessarily be so easy to run these jobs for a huge amount of time, which would mitigate the impact of these events which start but don't finish.

A

Yeah, so the easiest way to mitigate this is to run for a much longer time, but this may not be possible, again due to queue policies. In this example we were using the flex queue, which has a very low cost, and we could use the flex queue because we don't really care how long the job runs, because we have checkpointing. But there's not really any easy way to solve this, because we are limited by the event-level granularity. Now, with lots of development, one could imagine using this kind of system for the next version of Athena, called AthenaMT, which actually breaks down this event-level granularity into smaller pieces.
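Illustrative sketch (not from the talk): a toy estimate of how much CPU time is lost to events that are still in flight when the job ends, and why longer jobs reduce that fraction. The uniform spread of event durations between one minute and one hour is an assumption loosely based on the range quoted later in the Q&A, not a measured distribution.

```python
import random


def lost_fraction(job_seconds: float, n_workers: int, seed: int = 0) -> float:
    """Fraction of worker time spent on events that never finish."""
    rng = random.Random(seed)
    lost = 0.0
    for _ in range(n_workers):
        clock = 0.0
        while True:
            # Hypothetical event duration: 1 minute to 1 hour.
            duration = rng.uniform(60.0, 3600.0)
            if clock + duration > job_seconds:
                # This event is killed at the end of the job, so the
                # time already spent on it is wasted.
                lost += job_seconds - clock
                break
            clock += duration
    return lost / (job_seconds * n_workers)


short_job = lost_fraction(job_seconds=2 * 3600, n_workers=200)
long_job = lost_fraction(job_seconds=24 * 3600, n_workers=200)
```

Under these assumptions the wasted time per worker stays roughly constant, so its share of the total shrinks as the job gets longer.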

A

So in the next version of Athena, which we're going to use in Run 3, so sort of in the next five years, it's going to be running multi-threaded as opposed to multi-process, and we will have a finer granularity of work than events. We divide every event into so-called algorithms, which we execute in a way that respects the data dependencies. In this case one could imagine having such a system that runs across multiple nodes and has a finer level of granularity, but that's something which would require some significant development. It could potentially make this a bit nicer at the end.
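Illustrative sketch (not from the talk): executing per-event "algorithms" in an order that respects data dependencies can be pictured as scheduling a small dependency graph. The algorithm names below are hypothetical, and this uses Python's standard `graphlib` module rather than anything from AthenaMT.

```python
from graphlib import TopologicalSorter

# Hypothetical algorithms for one event, mapped to the algorithms whose
# output they need before they can run.
DEPENDENCIES = {
    "track_hits": set(),
    "calo_deposits": set(),
    "merge_hits": {"track_hits", "calo_deposits"},
    "write_output": {"merge_hits"},
}


def execution_waves(dependencies):
    """Group algorithms into waves; each wave could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs exist
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(DEPENDENCIES)
```

Each wave is a unit of work smaller than a whole event, which is the kind of finer granularity the speaker refers to.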

D

Yes, great question. We've thought a little bit about things like this, and I guess this is a fine-grained view, I don't know, but one thing we could do with the schedulers is allow jobs to release resources they're done with after a certain amount of time, while other resources are still in use. That's one thing we've been thinking about.

D

Another thing: if you go back to slide 22, that packing, as it were, is really quite good in the grand scheme of things, I think. But is each one of those rows, as it were, kind of randomly given work as something finishes? You say you can't really predict the length of time for one of those, so is that a distribution that arose from a random, round-robin type of thing?

A

Yes, exactly. As soon as one of these is finished (this is called a worker process, and we have one base process here, the initial MP process), the worker process tells the base process that it has run out of work. The base process then talks first to the Ray actor, the actor talks to the Ray driver, and the Ray driver then sends an input event for this worker process. But this all happens very quickly.

A

So we've seen this happens in less than 0.5 seconds. Right, so as soon as we process one event, this one worker process out of thousands of worker processes requests an event, and it's going to get it very quickly, even with such a large number of nodes.
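Illustrative sketch (not from the talk): the pull-based pattern described above, where a worker asks for its next event as soon as it finishes one, can be shown with a shared queue and threads from the Python standard library. This is a single-machine stand-in and does not use the Ray API; `run_pull_based` is a name invented for this example.

```python
import queue
import threading


def run_pull_based(events, n_workers):
    """Let each worker pull the next event whenever it becomes idle."""
    pending = queue.Queue()
    for event in events:
        pending.put(event)

    # One result list per worker, created up front so threads never
    # modify the dictionary itself.
    processed = {worker_id: [] for worker_id in range(n_workers)}

    def worker(worker_id):
        while True:
            try:
                event = pending.get_nowait()  # request more work
            except queue.Empty:
                return  # nothing left, the worker stops
            processed[worker_id].append(event)  # stand-in for simulation

    threads = [
        threading.Thread(target=worker, args=(worker_id,))
        for worker_id in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


result = run_pull_based(range(1000), n_workers=8)
```

Because workers pull rather than being assigned fixed shares, slow events do not hold up the rest of the queue; every event is handed out exactly once.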

D

The length of time in one of these blue rectangles is not that much. So one thing I was wondering, I guess, is: the difference between the runtimes is based on some random input configuration, is that true?

A

So this sort of thing is very chaotic in this sense: these are typically going to be the same kind of physics processes, for example a Higgs boson decaying in a certain way, but depending on the angle at which it hits the detector, it may take significantly longer to simulate, because some parts of the detector are much slower to simulate than others. And it's very difficult to predict because, for example, initially we can start with, like, 100 particles, but then in simulation, because these particles interact with the detector material, we end up with millions of particles that we need to simulate, and all this is impossible to predict.

A

So yeah, we've seen cases where it runs at the level of minutes, but in some extreme cases it can run up to one hour.

D

I guess what I was getting at was, if you look at this as kind of an optimization problem, you have lots and lots of data that correlate some input parameters or configuration to how long it runs, and I was just wondering if there were any opportunities for analysis or some learning algorithm to use that to predict some of the runtimes.

A

No, that's actually very interesting. Certainly we could try this, because at the beginning we really don't have that many particles; the beginning is sort of well defined. Depending, for example, on the angle at which the initial particles travel, yes, then maybe the algorithm could learn that it would take more time than other events. Yeah, that's very interesting.
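Illustrative sketch (not from the talk): the simplest version of the idea raised here is to fit a model that maps an input feature of the event to its runtime. The data below are synthetic, the single angle-like feature and the linear relationship are assumptions invented for the example, and real simulation times may well not be this predictable.

```python
import random


def synthetic_events(n, seed=0):
    """Make up (feature, runtime) pairs with a known linear trend."""
    rng = random.Random(seed)
    # Hypothetical angle-like feature of the initial particles.
    features = [rng.uniform(0.0, 4.0) for _ in range(n)]
    # Invented relationship: 120 s base cost plus 300 s per unit, with noise.
    runtimes = [120.0 + 300.0 * x + rng.gauss(0.0, 60.0) for x in features]
    return features, runtimes


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


features, runtimes = synthetic_events(2000)
slope, intercept = fit_line(features, runtimes)


def predicted_runtime(feature):
    """Predicted runtime in seconds for a new event."""
    return slope * feature + intercept
```

If such predictions were reliable, one possible use would be to dispatch the events expected to run longest first, so that fewer long events are still running when the job ends.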

D

Okay, other questions? Heather, is your hand up again?

E

Okay, I have tons of questions, but I was wondering: was there something that you wish the HPC center, so NERSC in this case, did that would make your workflow integration much easier? I know, like you said, they were able to set up the HTTPS node to reach PanDA, but is there something that would have just made all of your lives easier?

A

Maybe, so let me just briefly go back to this plot I had here. Right, so we have these other HPCs, where there weren't too many problems; we just used them. But the thing is, these HPCs here, for example Vega, are basically used like a regular grid site. In this case they're just running one of these AthenaMP instances on one node, and if you want to run 100 instances, you just submit 100 tasks, right? And that's something which just doesn't really work at Cori.

A

But that's because of the queue policies, and because it's much more cost-efficient to run multi-node jobs. That's something which may not be doable, but it would certainly make it easier for us if we could treat the HPC like a grid site. But okay, that's maybe not very constructive.

A

So, other than that, we didn't really have any other particular issues. The fact that we have network access on compute nodes is actually really great. We could actually simplify this a lot because we have network access, but we deliberately made it resilient to not having network access, because, for example, some other HPCs don't have it.

A

But, for example, as I was saying, we have these 8-gigabyte Shifter images. We could avoid this by accessing everything through the network on Cori. So, for example, we could simplify it a bit on Cori. Yeah, I'm thinking if there's anything else that pops into my mind.

D

Let us know; we're interested, so we can try to smooth things out. Another question I had, I guess, was: you had some places where you either knew or suspected there was probably I/O contention or issues that were slowing things down. I don't know if you were able to experiment with the burst buffer, or if you think a flash file system would help alleviate some of that.

A

Right, so personally I did not try the burst buffer. Perhaps, I know, some other folks from the ATLAS group [unclear] tried it.

A

So we didn't really experience any I/O issues during the running of the job, which certainly creates a large stress on the file system, because we're writing into thousands of files at once. But really the biggest problem we had with I/O was here on the edge node.

A

So, basically we produce millions of output files, and then, concretely, what Harvester tries to do is zip them into one zip file and send them out to some other site, where they would eventually be merged into larger binary files.

A

Well, so we don't really yet have a good solution for how to get this output out of Cori and send it back to a grid site, because this seems to cause a large load on the file system that we cannot handle very well. One of the solutions we started investigating was to actually do some of this cleaning up and merging on the compute nodes, but that's not really ideal, because we want to use them to actually process the data, not to clean up and merge the data. So, I don't know what a good solution is, but one of the problems is how to actually aggregate this output and send it to the grid, and I think this is where we have some other issues.
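Illustrative sketch (not from the talk): the aggregation step described above, bundling many small per-event output files into one archive before transfer, could look roughly like this with Python's standard `zipfile` module. The `*.out` naming and the `bundle_outputs` helper are assumptions for this example and are not Harvester's actual implementation.

```python
import zipfile
from pathlib import Path


def bundle_outputs(output_dir: Path, archive_path: Path) -> int:
    """Pack every *.out file in output_dir into one zip; return the count."""
    files = sorted(output_dir.glob("*.out"))
    # ZIP_STORED skips compression, an assumed choice here that trades
    # archive size for less CPU work on the node doing the bundling.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)
    return len(files)
```

Bundling reduces the number of files that have to be transferred, but it still reads every small file once, which is consistent with the file system load the speaker mentions.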

D

Well, yeah, it's very interesting. We're near the end of the time. I want to thank you again, and, okay, unless anyone else has any questions, we'll close the seminar for today. And yeah, I really enjoyed it.

D

So thank you, and thanks, everybody. Have a good rest of your day.

A

Thank you, bye-bye.