From YouTube: 2021-02-23 - The Superfacility Team - The CS Area Superfacility Project: Year in Review 2020
Description
NERSC Data Seminars Series: https://github.com/NERSC/data-seminars
Title: The CS Area Superfacility Project: Year in Review 2020
Speakers: The Superfacility Team
Abstract: The Superfacility project has been busy in 2020! In this talk, we will present lightning updates from each area of technical work, highlighting the progress and achievements we've made in the past year.
Bio:
The Superfacility project includes staff from NERSC, ESnet and CRD.
A: Welcome to our data seminar today. Today we have a slightly unusual format for the data seminar: the work leads for the Computing Sciences area Superfacility project are going to give a series of lightning highlights of the work we've been doing in the last year, highlights of 2020.
A: This is part of our 2020 year-in-review set of talks. The first one, covering the science highlights, was given at the NERSC all-to-all yesterday afternoon, so what you're going to hear is a series of updates from each of the technical work leads in the project, talking about their work.
A: For those of you who are unfamiliar with the Superfacility project, I'm just going to give you a brief overview of what we're doing. Computing Sciences supports many users and many projects from experimental and observational facilities, both within the Department of Energy and from external experiments, and this is something we've been doing for a long time.
A: Computing Sciences has a long and very illustrious record of working with experimental science teams: helping them get their data where it needs to be, helping them analyze their data, and helping them use supercomputers.
A
So
this
is
something
that
we've
been
doing
very
well
for
years,
but
we're
seeing
that
this
is
an
increasing
need
from
the
experimental
science
teams
to
be
able
to
really
handle
large
amounts
of
data
and
large
quantity
in
computing,
and
the
needs
of
this
community
is
starting
to
really
challenge
us
and
challenge
the
computing
capabilities
of
these
science
teams.
So
the
needs
go
simply
beyond
providing
a
nice
network,
some
strong
compute
resources
and
a
lot
of
storage.
A
They
have
rather
complex
needs
that
we've
seen
over
the
years
through
various
requirements.
Reviews
and
the
supervisility
concept
within
computing
sciences
aims
to
address
these
needs.
So
we
want
to
link
up
experimental,
experimental
detectors
to
high
performance,
networking
and
high
performance
computing
and
also
give
them
the
tools
and
capabilities
and
expertise.
They
need
to
do
the
full
workflow
end-to-end.
So
this
involves
yes,
high-performance
networking,
computing
storage,
but
also
the
ability
to
move
and
manage
data
between
sites
and
the
ability
to
have
real-time
or
short
turnarounds
and
interactive
access
for
their
computing.
A
They
need
to
have
resilient
workflows
that
run
across
multiple
locations
and
a
whole
ecosystem
of
edge
services
that
persists
over
long
periods
of
time.
You
know
months,
maybe
even
years
including
workflow
managers,
visualization
tools,
databases,
web
services
and
a
lot
more,
and
this
is
something
that
kathy
ellick
recognized
many
years
ago
and
has
been
part
of
the
cs
area
strategic
plan
for
the
last
five
years
and
around
about
a
year
ago,
we
instituted
the
super
facility
project.
A: Now this is a project with a small "p", in inverted commas, because it's not a DOE project; it's a project internal to Computing Sciences. It was designed to coordinate, where it makes sense, and to track all the work being done across Computing Sciences to support experimental science, because one thing we've noticed very much is that a lot of different science areas and a lot of different experiment teams are facing similar kinds of challenges. For us to be able to scale up our support of all these science teams, we need to take a coordinated approach, so that the solutions, tools, and technologies we're developing are applicable across multiple science communities.
A: This was always designed to be a three-year project, not because the science will go away in three years, but because we wanted to have a defined end goal, and a concerted push and effort to develop the technologies we need for sustainable support of this workload.
A: The goal is that by the end of 2021, so more or less by the end of this year, we will have at least three of our science application engagements able to demonstrate automated pipelines that analyze data at scale without routine human intervention. Automation is a really key part of this.
A: We've identified specific capabilities that we want to be able to support, including real-time computing, high-performance networking, data movement and management, automation driven by APIs, using Jupyter as one of our compute methods, starting to use federated identity for authentication, and hosting this whole ecosystem of edge services with our Spin service. On the right of this slide I have the eight science engagements that we're particularly working with. This is a selection of science teams who have been chosen because their needs span a whole spectrum of different scales: they each have slightly different needs in computing, workflow, storage, and networking. By working closely with these teams and gathering requirements from all of them, we're able to design tools and technologies that will meet the needs of a large part of our user base, and that's something that's very important.
A: Okay, so we've now been running for two years, and this last year has been really very impressive on the part of the superfacility team. The Superfacility project has achieved a lot this year, under difficult circumstances.
A: The project has four work areas in our org chart: applications, which gathers requirements and handles deployment of applications for the users; the scheduling and middleware work area; an area really focusing on automation, automated systems, and automation in the network; and then a work area around data management. With that, I'd like to start going through our updates. One thing I'll ask is that everyone hold questions till the end.

A: We have a lot of highlights to get through, and I want to make sure everyone has a chance to present; then we can go through questions at the end. All right, so first up we have Laurie talking about our NESAP for Data work.
B: Hi everyone. If you're familiar with NESAP, it's a program where we work with science teams to try to optimize their applications for our systems. One part of that is NESAP for Data, which has been focusing on a lot of the projects Debbie just described.
B: We have a lot of postdocs who are wrapping up, and also a new class coming in. Just to give you a quick overview: we had Yongsan, who was working with ATLAS and CMS; he's now off to NVIDIA. In our middle class of postdocs we have Sheen and Daniel, who are working with LZ and DESI respectively, and upcoming we have Felix, who's not pictured; he'll be working with ExaFEL.
B: We also have Nestor, Nick, and Lipi coming in: Nestor will be working with TOAST, the CMB experiment; Nick with JGI resilient workflows; and Lipi Gupta will be working with the ALS. So we're looking forward to all the contributions that the NESAP for Data program can make in the superfacility space.
A: Okay, thanks Laurie. Next we have Kelly, updating us on outreach.
C: Thanks. Outreach this past year has taken sort of a two-pronged approach. This first slide describes one of the prongs, which is the broader outreach to the larger community in the HPC and user space.
C: Last year, Bjoern Enders gave a great talk on best practices for running at NERSC as the kickoff meeting for our NERSC User Group special interest group for experimental facility users.
C: Following that introductory talk, the community posted a series of talks, one from each experimental facility, sharing the opportunities and challenges they faced at NERSC, and really came together to learn from each other.
C: Following that, NERSC presented a demonstration series of superfacility tools, delivered virtually, in which a number of the technologies that will be discussed later in this presentation were demonstrated to the community. And then finally, we had a number of great talks and workshops at the 2020 Supercomputing conference, including a presentation at the XLOOP workshop and a State of the Practice talk. The next slide describes the other outreach prong that we've been working on this past year, which is outreach to the science partners.
C: How can NERSC help the science partners make better use of the superfacility tools? What existing documentation do they already have? How can we help them, help their users, or help their staff use NERSC more effectively?
C: Out of this, we've collected their existing documentation and done a comprehensive review of it, and as a follow-up we're doing targeted meetings with interested research group areas at experimental facilities like the ALS, the Advanced Light Source.
A: Great, thanks Kelly. Next, Bill is going to tell us about work in the area of policies.
D: At the beginning of the year, the state of this area was that I had just accepted the role as the lead of the policies area.
D: Currently, we have designed a consensus-vetted setup for what I want to call campaign users. This is an implementation of an idea that has been floating around for a few years now called data users, which would be a way to bring in additional user-facility-associated researchers into NERSC in a streamlined fashion.
D: This has been a strategic talking point for a number of years: how does NERSC grow from being an institution with thousands of users to tens of thousands of users or more? These campaign users are one of the ways that I'd like to see that made real.
D: The goal would be to leverage some of the other superfacility work, such as federated ID, together with shared user accounts that would belong to a research project at a partner user facility rather than to individual humans. We can make use of a lot of the existing infrastructure that manages quotas, identity and access management, automation, and APIs, to end up with a system that can be easily connected to existing user onboarding systems at other user facilities.
A: Great, thanks Bill. All right, next we're going to hear about Jupyter and some of the work being done there.
F: Okay, so this is Rollin. I'm going to talk briefly about Jupyter in the Superfacility project. We think the Jupyter notebook is a rich user interface that has really great potential to make interacting with supercomputers easier and more productive, to help attract new kinds of users, and to expand the application of supercomputing to new science domains, especially experimental and observational science facilities like those that are the focus of the superfacility initiative.
F: Of course, this takes work. Jupyter didn't come out of Berkeley ready to be directly installed on supercomputers and just ready to go; it takes work from people like us in the Superfacility project, and from other people at NERSC, to make this possible. This past year we've been able to support about 2,000 unique users using Jupyter on Cori, which means about 25 percent of all interaction with Cori goes through Jupyter.
F: Many of the users on Jupyter are from the EOD space: LSST, DESI, and LCLS are all projects that are using Jupyter.
F: In the past year, we've made changes to improve the stability of the system, basically by changing the deployment: moving to Rancher 2 in Spin and leveraging CI/CD practices for supporting the service. We've also introduced a number of extensions that are useful at NERSC, but also at other HPC centers, like interacting with Slurm through JupyterLab.
F: That's the picture on the right there. We've also continued our robust community engagement, working with other HPC centers and other facilities to adapt Jupyter to HPC and large-scale science facilities. Treyas is going to take the next slide.
G: This has been really useful for them in figuring out how to hone in on some of their parameters for the tomography workflows. We're also building various tools in the ecosystem for them, like a slice viewer that lets them dig through a set of 3D images and flip through a bunch of them with different parameters. This is part of the work we're doing between the Superfacility project, NERSC, and CRD, and it's been really useful and really helpful; we presented this work at the XLOOP workshop at Supercomputing this year as well.
A: Okay, so next up, Bjoern is going to tell us about workflow planning for NERSC-9.
H: Yeah, hi, thanks. This is a new area. Basically, we want to make sure that NERSC-9 is ready for automated workflows and can be easily integrated into our partner facility pipelines. To this end, we collect requirements from our superfacility partners and make sure we have adequate milestones in the integration area.
H: We also took some steps to reach out to workflow tool developers and made ourselves visible at workflow-related events, in contact with ExaWorks, for example by taking part in the Workflows Community Summit. This is really a work in progress, so if you have an automated workflow that you want to make sure runs on NERSC-9, or any kind of offer you want considered, please feel free to reach out, so we can make sure your requirements are reflected in our planning.
A: Thanks. Next I'm going to talk a bit about workflow resiliency, building on from that. In 2020 we started really actively working on enabling our science partners to run reliably across multiple facilities, and I emphasize we've started working on this; it's also a work in progress, but one that I'm quite excited about. There are three main areas that we've been focusing on.
A: One is that we have an ALCC award of time at all the ASCR compute facilities, at ALCF and OLCF as well as at NERSC. This is work that Katie Antypas has been leading; it's a group that's exploring container technologies and data management tools, and seeing how those technologies can run across the different facilities.
A: We've already learned some very useful pain points, and that gives us good pointers to where we might want to focus future efforts and future work in trying to run workflows across facilities. We also had a nice demonstration this year: Jupyter notebooks are actually one way you could consider having a workflow that could run at multiple locations, and we've been able to demonstrate that, with notebooks running at NERSC and also running at SLAC for the LCLS experiments.
A: So again, this is very much active work in the next year as well.
I: A year ago, we had developed a proposed architecture and we had some isolated proof-of-concept implementations, really just to kick the tires and make sure that we were familiar enough with some of the components, and that there was a strong likelihood they would be useful for our application. A year later, here's our status.
I: We created a set of proposals and worked closely with our security team to refine them and adjust our risk tolerance appropriately for NERSC. We then went through a fairly comprehensive security review of those policies and have made adjustments based on the recommendations we received.
I: Pretty much all of the features are ready, except for the final multi-factor step-up: we want to make sure that if the home institution doesn't implement multi-factor authentication, we have an opportunity to require it of the user. The code for that is being finalized, and then we expect to have everything put together.
J: Yes, hi, it's Gabor, and I work on the API. The idea of the API is that anything you can do by logging into the machines, you should be able to do from a script via the API. A year ago, most of the API calls just returned fake information, so it was sort of a proof of concept that didn't do anything, and the authentication was a homegrown scheme that we wrote ourselves.
J: Today, we have standards-based authentication that uses the Connect2id server, which implements the OIDC standard for authentication. Many of the APIs are now functional: you can check on system health (or center health, rather), you can run jobs and retrieve their results, and you can move data around using the storage APIs.
J: We had a security review of the API and the authentication around it, and the changes that were recommended out of that review were implemented. We also had a UX review that looked at how people use the API and how to make it friendlier for users, and we've implemented a lot of those suggestions as well.
J: We've started to reach out to other centers and facilities that have similar APIs, like CSCS and TACC, and we're hoping that in the future there will be collaboration between the centers, and maybe a set of APIs that are common to all of them.
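The "anything you can do by logging in, you can do from a script" idea boils down to authenticated HTTP calls. The sketch below is a minimal illustration only: the host name, endpoint paths, and payload fields are made up for this example and are not the documented Superfacility API; only the bearer-token pattern reflects the OIDC-based authentication described above.

```python
"""Sketch of driving a center via a REST API with an OIDC bearer token.
All endpoint names here are hypothetical."""
import json
import urllib.request


def make_request(base_url, path, token, payload=None):
    """Build an authenticated request: POST if a payload is given, else GET."""
    data = None
    headers = {"Authorization": "Bearer " + token}
    if payload is not None:
        data = json.dumps(payload).encode()
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(base_url + path, data=data, headers=headers)


# Check center health (GET) and submit a batch job (POST) from a script,
# instead of logging in interactively.
health_req = make_request("https://api.example.gov", "/status", token="abc123")
job_req = make_request("https://api.example.gov", "/compute/jobs", token="abc123",
                       payload={"script": "#!/bin/bash\nsrun hostname"})
```

In a real client, the token would come from an OIDC token endpoint rather than being hard-coded, and the response of each call would be read with `urllib.request.urlopen`.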
K: We actually launched Spin in May of 2018 and had a long pilot phase. About a year ago we were still concluding that pilot phase, and we were just on the brink of introducing a major new version of the system that it runs on. We had about 135 users at that stage.
K: Now, a year later, we're in full production on a new instance of the underlying system, Rancher 2. This is essentially a complete redeployment, and the system is based on Kubernetes, so it has a lot of modern features and is very robust. It introduces a new web user interface that makes it easier for users to get started, and this also dovetailed nicely with the CARES Act-funded large memory nodes.
K: We built our production cluster this summer for the Rancher 2 instance based on large memory nodes, so we could support a wider variety of workloads in this new Rancher 2 instance in Spin. Meanwhile, we kept the support and training going and added some new features there: we did five workshops over the last year, and we added office hours every other Friday.
K: The workshop materials and the documentation were redone, and we've added support from other groups at NERSC as the popularity of the service has continued to grow, so APG, DAS, DC, and UEG all have folks who are helping out with support and training, and we're up to over 200 users.
K: As of the last workshop, just in February, over 40 NERSC projects have expressed interest in Spin or already started working in it, and a few of the highlights are shown there on the right.
L: Sure. As these systems get larger and we have more complex workflows, one of the efforts we realized we need is to make them more automated and easier to operate. So this is a monthly meeting group where we discuss best practices and what's going on across the whole CS area.
L: Topics include things like monitoring and tuning system power; we've had conversations with HPE presenting their work, collaborations with ALCF, and presentations on Splunk capabilities via ESnet. So it's a really good set of discussions, and I think the most exciting thing is that this will springboard how we do things on Perlmutter. We're really excited to get to work on that.
M: Hello everyone. Network resource reservation and advanced networking capabilities are covered in the SDN technology area. As of January 2020, NERSC, ESnet, and SLAC have deployed a dynamic multi-point OSCARS circuit infrastructure: projects like ExaFEL at SLAC can automatically initiate an OSCARS circuit to NERSC and tear it down after the data transfer is completed.
M: We have extended our data center to provide an aggregate 400 gigabits of bandwidth to the NCEM user facility at the Molecular Foundry, and last year we deployed a Slurm plugin to allocate bandwidth-balanced compute nodes on Cori. We've also started a major project to overhaul the NERSC core networks.
M: This will provide more bandwidth and programmable resource reservation capabilities for superfacility scientific workflows. As part of that effort, we have started deployment of 400 gigabit Ethernet at NERSC.
A: Okay, all right, thanks Ashwin. Next, Tom is going to tell us about work being done in the SENSE project.
N: Yes, hello. I'm talking about the next-generation work we're doing in the network infrastructure, which is focused on multi-domain orchestration and automation for network services. It expands on some of the OSCARS services that we heard about just now. The objective is that, through the ESnet-developed SENSE project, there's the ability to orchestrate multi-domain network services, where those services include not only the network infrastructure but also the networking stack inside end systems like DTNs.
N: We made a lot of progress this year in understanding how that type of multi-domain, automated, orchestrated service could be used in the context of superfacility and distributed infrastructure. One of the prototype research projects we've worked on this year is with ExaFEL and the LCLS, in terms of moving data from SLAC to NERSC and integrating that with workflows. There are two main focuses for this effort: one is the distributed infrastructure itself, and the other is the integration with the application workflow.
N: We've done some initial work there and we're making more progress, and it's a little bit broader than just one type of workflow or one type of infrastructure. If we go to the next slide, I'll talk about that a little.
N: Actually, one more; yeah, that's the one, exactly, great. In terms of the idea of distributed network infrastructure and orchestration, we are very focused on how domain science workflows can integrate with this. We now have an API-driven, multi-domain network services capability, and, as we heard about the NERSC Superfacility API, there's now also an API available into the computational resources. So a lot of the work we're trying to do is to figure that out:
N
How
can
the
main
science
workflows
utilize
these
different
apis
and
how
can
these
apis
work
together
to
enhance
workflows
so
we're
looking
at
multiple
workflows,
but
one
of
the
one
communities
that
we've
been
working
closely
with
is
the
lac
community
and
they
have
various
systems
like
russia
and
fds
and
things
above
that
that
are
trying
to
develop
the
intelligence
of
how
to
use
these.
So
that's
a
large
focus
for
this
effort
is
to
integrate
these
types
of
capabilities.
What
the
demand,
the
main
science
workforce.
A: Great, thanks Tom. Next we're moving into the data management work area of the project. Continuing the theme of data movement, Lisa is going to tell us about the work being done on data movement in this work area.
O: Hi. We've had a pretty productive year in 2020 in the data movement space, moving data across all the centers quite a bit. Going into 2020, we only had the standard Globus endpoints available, where users could read and write as themselves to and from our various file systems, and we had a very early prototype of the GPFS-HPSS interface, called GHI, deployed; it's a new way of interfacing with HPSS.
O: As of the end of 2020, we've deployed a new service in Globus that lets groups that use collaboration users read and write via Globus as the collaboration user on the file system. This is very useful for the superfacility partners, who make pretty extensive use of collaboration users, and so having this is much more convenient: they no longer have to transfer data in and then open tickets asking NERSC staff to do chowns and permission changes.
O
For
all
of
our
all
of
our
data
throughout
nurse
with
these
tools,
and
so
the
early
testing
of
ghi
is
done,
the
gps
hpss
interface
and
then
during
2020,
we
worked
to
dramatically
improve
the
usability
and
security
of
the
interface
and
we've
done
successful
testing
with
several
facility
groups
and
other
teams
and
there's
some
set
of
published
documents
on
how
to
use
this
system,
and
you
can
go
and
check
it
out
yourself
if
you
want
to
kick
the
tires.
O: We've also worked with Slurm's developers on some capability for batch-system data movement, so that you can migrate data automatically from CFS or HPSS to scratch before your job starts, integrated into the batch system so that it will hold your job until the data is ready. The target for that is fall of 2021 and Perlmutter.
A: Thanks Lisa. Next, Mariam will tell us about work being done on networking tools.
P: Hi everyone. This year we were focused more on network analytics tools, and there are two main tools we're currently working on that I'm highlighting today. The NetPredict tool, which was launched last year, is essentially real-time machine learning in a Google Maps style: it predicts what the network traffic on ESnet is going to look like in the future, so that we can plan big data transfers accordingly.
P: What we've done is update that tool with a graph neural network behind the scenes, which produces the predictions of the network traffic, and we've published a paper on this. Initially the tool was deployed on Google Cloud Platform, but now we're working with the Spin team to move it onto Spin, so that we have much more control over what's going on behind the scenes.
P: The second tool is a new one, the net preflight tool, which we've been developing specifically for the superfacility team, in collaboration with Bjoern and my postdoc Bashir. Bjoern identified this problem: when we do a DTN-to-DTN transfer, there's no way of knowing what the network performance is before you do the transfer, and tools like iperf and perfSONAR usually require a client to be running on the other end to get the correct information.
P: What we've done is develop a tool where you don't need any client deployed: you can use socket programming and a set of file transfers to calculate the current bandwidth and the traceroute before you do the transfer. Our current tests have shown results comparable to iperf, and we are writing this up as a paper. We'll be presenting the tool to ESnet first to get feedback, and then we'll hopefully release it to the rest of the community.
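Stripped to its essence, the socket-programming approach means opening a TCP connection, pushing bytes, and timing the transfer. The sketch below is a toy loopback illustration of that idea only, not the net preflight tool itself (which also derives the traceroute and works against a remote endpoint without special instrumentation).

```python
"""Minimal throughput measurement over a plain TCP socket (loopback demo)."""
import socket
import threading
import time


def _sink(server_sock, nbytes):
    """Accept one connection and drain nbytes from it."""
    conn, _ = server_sock.accept()
    with conn:
        remaining = nbytes
        while remaining > 0:
            chunk = conn.recv(min(65536, remaining))
            if not chunk:
                break
            remaining -= len(chunk)


def measure_throughput(nbytes=8 * 1024 * 1024):
    """Send nbytes over a loopback TCP connection; return MB/s achieved."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    server.listen(1)
    t = threading.Thread(target=_sink, args=(server, nbytes))
    t.start()
    payload = b"x" * 65536
    start = time.perf_counter()
    with socket.create_connection(server.getsockname()) as c:
        sent = 0
        while sent < nbytes:
            c.sendall(payload[: nbytes - sent])
            sent += min(len(payload), nbytes - sent)
    t.join()
    server.close()
    return nbytes / (time.perf_counter() - start) / 1e6


rate = measure_throughput()
```

On loopback this mostly measures the local stack, of course; the interesting numbers come from running the sending side against a real remote DTN.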
A: Thank you, thanks Mariam. Okay, next Annette will tell us about the data dashboard work.
Q: Hi, yeah. The data dashboard is integrated into the My NERSC user interface. At the beginning of 2020, what we had was one tab called the data dashboard, which offers features for people to go in and see how they're doing against their quotas, in terms of their data storage and the number of inodes they're using.
Q: We also gave them a tool for browsing through their directories and seeing the metadata for all the files in each directory, to get a sense of what's where, and then we introduced a tool that lets them identify their largest files and directories, and those with the most inodes, in order to find what they might want to archive off to free up space in their quota. That was the state at the end of 2019.
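The largest-files feature amounts to a scan like the sketch below. This is illustrative only: a production dashboard would be driven by pre-collected file system metadata rather than a live `os.walk` over a parallel file system.

```python
"""Sketch of a 'largest files' scan for a storage dashboard."""
import os


def largest_files(root, top_n=5):
    """Return the top_n (size, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:top_n]
```

The same walk can accumulate per-directory byte and inode counts, which is essentially what the directory-browsing view presents.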
Q: What we've implemented in 2020 is a tool called the PI toolbox, and this is a way for PIs to go in and actually fix things that may be blocking them: things they might otherwise have to file a ticket in ServiceNow and go to a consultant for. What we're enabling them to do initially is to go in and set the permissions on files, or change the group that owns them, and they can basically go in and make changes regardless of who actually owns a file.
Q: They can give everything the same group and all-group-readable permissions super quick, with just one click, and then it runs in the background. We're also moving toward setting it up so that they can do a change of ownership on a file; that will just be another button that we add into that interface.
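That one-click "make everything group-readable" action comes down to a walk-and-chmod like the sketch below. This is an illustration of the operation only, not the PI toolbox itself: the toolbox runs such changes server-side with delegated authority, which is what lets a PI fix files they don't own.

```python
"""Sketch of a bulk 'make group-readable' fix-up over a directory tree."""
import os
import stat


def make_group_readable(root):
    """Add group-read permission to everything under root; return count changed."""
    changed = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if not mode & stat.S_IRGRP:
                os.chmod(path, mode | stat.S_IRGRP)
                changed += 1
    return changed
```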
Q: We were able to present all of this at the SC20 State of the Practice track, and now we're working on developing what will probably be yet another tab in My NERSC that will allow people to do things like file transfers; we're calling this the petabyte data portal.
A: Thanks Annette. Now Quincey and Suren are going to tell us about work being done around HDF5.
R: Thanks, great. Over the last year, we've spent a fair amount of effort improving the support for experimental and observational data in HDF5.
R: On the other hand, HDF5's ecosystem is quite large, but it can't access XTC2 data. So, in order to meld those two together, we devised a prototype VOL connector, a plugin for HDF5 that allows HDF5 applications to directly read XTC2 files. It supports the most common XTC objects in those files, as well as the most common HDF5 API routines, so we can enable MATLAB, and the other tools built around HDF5, to read XTC2 directly through the HDF5 interface. There are next steps over the remaining course of the project.
R: Next, we focused on enhancing HDF5's support for streaming data. It's a common experimental use case, but it's just not very well supported in HDF5, and likewise for variable-sized data: many records come out of cameras or other sorts of instruments that are not nice multi-dimensional arrays, and that's been a weakness for HDF5.
R: So over the last year, we've enhanced HDF5 to support streamed data very directly, through improvements to the basic parts of HDF5, the infrastructure that everyone uses; up-to-10x performance improvements are already baked into the library for the next release. We've also written up and done some prototyping around a more optimized API interface for HDF5.
R: We want to improve and productize the streaming API, as well as the variable-length data storage, so that we can really raise all the boats by improving some of the storage infrastructure that HDF5 provides for everyone. Lastly, we've spent some time working on querying the metadata in HDF5 files. There's a tremendous amount of metadata already built in and baked into self-describing formats like HDF5, and we wanted to bring that out with the metadata indexing and querying (MIQS) project, in order to allow applications to really extract the science data and enhance the ability to deliver science discoveries.
R: Teams are looking at data they've already produced; they just want to explore it to get more knowledge out of it. The next steps there are to explore the semantic relationships: not just standalone pieces within the file, but the relationships between them, so that applications can query and build more science discoveries out of the relationships between objects, not just across individual objects.
A: Great, thanks Quincey. All right, our last technical work area update is from Chris, who will tell us about work being done in advanced scheduling.
E: G'day. So where were we with advanced scheduling in January 2020? We had NRE, non-recurring engineering, with SchedMD, who maintain Slurm, because we have an issue where we want to be able to accommodate experimental workloads without causing disruption to the existing workloads.
E: Now, we're going to use reservations for this, and there's an issue in scheduling where reservations in place can cause what's called a shadow on the workload. So the idea was to have something that would allow a reservation to say: I will allow other people to use these nodes, as long as they agree that they will be preempted within a certain amount of time when my experimental workload arrives.
E: That's now in Slurm 20.02. We have done testing with it to check that it works; the issue we have is integrating it with how we charge on these two systems. There's also a test configuration on Gerty, which is the test system for Cori, for NERSC staff to experiment with. Slurm 20.11 was released at the end of November and is currently on the test systems for Perlmutter.
E: Coming in 20.11, we have scrontab. The idea here is that crontab workflows often tie people to particular login nodes; that means that, should that login node suffer a hardware failure, those tasks won't run, and we need things that are resilient in the face of that. So the SchedMD folks have implemented scrontab, a command that looks very much like the crontab command and allows you to specify jobs to run at certain times. It has another nice effect, too.
E: You no longer have the case where, if you have something running every hour and your task for some reason takes an hour and ten minutes, you end up with two running at the same time; with scrontab, the next one only starts when the previous one has finished. Thanks also go to the people who did this work.
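For illustration, a scrontab file looks much like a regular crontab, with Slurm batch options supplied on `#SCRON` comment lines ahead of each entry. This is a hedged sketch: the QOS name and script path below are made up for the example.

```shell
# Edit with: scrontab -e
# Slurm stores these entries centrally, so they survive the loss of any
# single login node, and a new run only starts after the previous one ends.
#SCRON --time=00:10:00
#SCRON --qos=cron
0 * * * * /path/to/hourly-task.sh
```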
A: Great, thanks Chris. All right, so we've really done a lot of work in the last year, and we're already thinking very carefully about what we're planning to do in the next 12 months.
A: In the last year of the project, our main focuses are really around getting the science teams up and running on Perlmutter, and being able to demonstrate that they can run an automated pipeline, whether that's on Cori or Perlmutter, running their analysis without human intervention. We're also thinking very carefully about a sustainable support plan: how do we transition the tools to long-term support and ensure that they're production-hardened?
A: That's another area we're working on this year. All the work we're doing here within the Superfacility project is pretty groundbreaking, and by doing this work we're ensuring that we're well placed to take advantage of the future directions for ASCR infrastructure; these are discussions that are going on between the program managers and people in ASCR.
A: We are very well placed to take advantage of, and to lead in, the area of developing components for the framework for geographically distributed workflows, and I think that's something that's going to be very exciting for the future. So I'll just finish by saying many thanks to the superfacility team: this has been a very challenging year, and they've been doing fantastic work. I think you'll all agree they've done a lot of really impressive work this past year. So thanks to everyone.