From YouTube: NUG meeting 15july2021
Description
Video recording of the NUG monthly meeting, July 15, 2021
A
Okay, I think we've probably got around about a quorum, so let's begin. A heads-up and reminder first that we are recording the meeting; we'll make the video and slides available on the meeting webpage at www.nersc.gov shortly afterwards.
A
Yeah, okay, so we'll follow our normal format, which I think people are fairly familiar with now. The idea of this is to be an interactive meeting, which is to say: please participate. You can either raise your hand or just unmute yourself and speak up when you've got a question or a comment to make; there'll be lots of opportunities for that. We've got around about 20 people at the moment, so that's a comfortable enough size, I think, that we can just speak up.
A
So that's a good place to post comments and to continue the conversation; we tend to use the webinars channel just to keep it as a separate place from the general channel.
A
So our agenda will follow our normal pattern. We'll start out with a Win of the Month and its flip side, Today I Learned; we've got a few announcements to make, and there's an opportunity for participants to make announcements there too. Then, for our topic of today, we have Norm Bourassa from NERSC's Building Infrastructure Group, who's an energy efficiency expert and has done a lot around NERSC's setup for energy efficiency in its computer room.

They'll talk us through a little bit about some of the things that we're doing, and we'll have an opportunity for discussion there, and then we'll finish up with some looking ahead to what's coming up and a quick run through our numbers for last month. So, starting out with Win of the Month: the aim of this segment is to basically share and celebrate the achievements in our community, and they can be big or small: getting a paper accepted, solving a challenging bug.
A
On the NERSC side, and I'm pretty sure this happened since our last gathering: the ISC June Top500 list came out, and Perlmutter came in at number five, which is the same spot at which Cori debuted, actually, also number five on the list.
A
A bit of a quiet month, so we can step along, and maybe also combine this with the flip side of that, which is Today I Learned.
A
Of course it's great when something works, but a lot of the path to getting something working is finding a lot of things that didn't work first. So this is kind of an opportunity to swap ideas and notes and share stories about things that were difficult, things we got stuck on, dead ends that we hit, things that seemed like they ought to work but didn't.
A
The idea here is that we can learn from each other and bounce ideas off each other about how to solve things as well. It's also an opportunity to talk about a new tip for using NERSC systems that you might have come across recently, or just something interesting that you learned or read recently that might interest others.
A
For me, a lot of the learning in the last month has been experiences using Spack: using Spack to set up a bunch of software environments, which we have both on Cori as well as what we're setting up on Perlmutter. It's a very powerful tool; it does a lot of clever things. There can also be some challenges, like working out what didn't work when a build failed.

It's an interesting experience diving in and looking at how it does things, but something that's been really helpful there is the community on Slack. So if you're using it to install software, there's a Spack Slack, which has quite a helpful group of users.
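For anyone new to it, a minimal generic Spack session looks something like this (the package name here is just an example, not a NERSC-specific recommendation):

```bash
# show the resolved spec (compiler, variants, dependencies) before building
spack spec -I hdf5

# build the package and everything it depends on
spack install hdf5

# list what is already installed in this Spack instance
spack find
```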
A
You can go back and check that for links to things and so on. An important thing that will affect people: in the latest maintenance we updated Slurm, and there is a slight change in Slurm's behavior, which is a --overlap flag for when you're running multiple sruns on the same node. Most typical, straightforward use isn't impacted by this.

The most common use case, of course, is srun -N <nodes> -n <tasks> my_executable, but there is a reasonably common case where sometimes you want to run multiple programs on the same node: you're starting one srun on half of the CPUs, for instance, and another srun on the other half of the CPUs, either as part of a workflow or as two things that are working together. For those, it used to be fairly simple: do one srun, put it into the background, do the other srun, put it in the background, wait for them all. You now need to let Slurm know that it's allowed to start these runs on the same node, and so there's a new --overlap flag for that. We have some examples in our docs of how to use that; this is just a short form of what's changed there.
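As a minimal sketch of the pattern described above (the program names, node count, and task counts are placeholders; see the NERSC docs for the canonical examples):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:30:00

# Two job steps sharing one node: since the Slurm update, each step
# needs --overlap so Slurm will let them run on the same resources.
srun --overlap -n 16 ./program_a &
srun --overlap -n 16 ./program_b &
wait   # block until both backgrounded steps have finished
```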
A
The other important announcement is that we have a Cori OS update planned for September. It's a minor update, but we will be changing the PE, which is the programming environment, kind of the default set of modules that you get when you log in. One expected impact of this is that statically linked things will need to be re-linked. Dynamically linked things should be fine, because with dynamic linking they'll get linked at load time to the appropriate updated version of a library; but statically linked things will need to be rebuilt.
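If you're not sure which category a given binary falls into, the standard Linux tools will tell you (the binary name is a placeholder):

```bash
file ./my_app   # reports "statically linked" or "dynamically linked"
ldd ./my_app    # lists shared-library dependencies, or prints
                # "not a dynamic executable" for a static binary
```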
A
There are some CFPs coming up that we know about; the links to these are in the weekly email. There's a workshop on accelerator programming using directives, and a parallel applications workshop on alternatives to MPI+X, which I think covers things like GASNet and global arrays sort of stuff, and, also at SC21, the SuperCheck workshop on checkpointing.

There are a few training things coming up. The ECP webinar series, if you haven't seen it: there's some really interesting stuff there, and some good little tips for using HPC and for scientific computing, and after the webinars are complete they upload links to the recordings onto the website.
A
Links to the website are in the weekly email, so take a look. The next one coming up is on multi-institutional scientific software development, and some lessons learned, best practices and so on; that's in August.
A
Tomorrow, I believe, there's a training hosted by NVIDIA, I think about CUDA multithreading with streams; this is useful for preparing for Perlmutter. There's also, in about another month, in late August, a four-day CMake training where we're partnering with Kitware. So if you develop, or even if you just build and install applications, that could be worthwhile.
A
And one other kind of interesting announcement that we have, if you haven't seen it already: the E4S, which is the Extreme-scale Scientific Software Stack (part of the ECP project, I think), has an updated version, 21.02, that is to say February 2021, now available on Cori. I think everything's built for Cori Haswell.
A
It may or may not be available for KNL as yet, but you can use the specs there as a starting point for that. There's quite a thorough set of packages and libraries: ADIOS, HDF5...
A
I think PETSc might be part of it, SLEPc, so there are some good libraries available there. To get at it, it's sort of a two-step process: you'll need to module load the E4S stack first, and that will put other modules in your path with the specific packages. It also sets up a Spack environment for building things on top of it.
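Something along these lines (the exact module name and version string are assumptions; check module avail e4s on Cori for the real one):

```bash
module load e4s/21.02   # step 1: load the E4S stack metamodule
module avail            # step 2: the stack's packages now show up as modules
module load hdf5        # then load whichever package you need
spack env list          # the stack also provides a Spack environment
                        # you can build your own packages on top of
```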
A
That's the announcements that I have from NERSC's side. Does anybody have any other calls for participation that they'd like to bring up?
A
Yeah, we're racing through today's meeting; normally we don't arrive here for another 10 minutes, but I think I saw that Norm has got some time to bounce through additional topics.
C
Well, yeah, we can go below the fold; I kept the other slides in there for below the fold if needed. Here we go. Oh, hang on a sec here, I've got to show you.
A
By way of introduction: Norm's in the Building Infrastructure Group and has done a lot of work around energy efficiency, and has some quite interesting stories and clever tricks, though tricks might not be the right word.
C
So I will say, I tried to find that Slack channel, webinars, but I didn't see it, so if any questions end up coming through there, I'll have to rely on you to relay them.
C
So, as I said, my name is Norm Bourassa. A little word on my history: I've been an employee at Lawrence Berkeley National Lab since 2000, so 21 years now. I initially came to Lawrence Berkeley Lab after having worked in an area called energy engineering: energy efficiency for commercial buildings.
C
In the wake of us coming to the new CRT building, now called Shyh Wang Hall, here on campus, I joined NERSC at 50% time in 2017 to help with the energy efficiency and energy performance of the building; I'll go over those reasons later. And in early 2019, just before the pandemic (well, actually before that it was already 100% time), I transitioned fully over to the division, and now focus my time on making sure that the building performs well.
C
I'll go over the reasons for some of that later, but I first want to talk about something that all of the users here in our community are probably pretty well aware of: the first level of energy improvement just comes down to the processing capability of these scientific computing platforms. This is an aspect of the generational improvements of our systems that users may not have appreciated. But if you look, let's just go as far back as Edison.
C
If you look at its power consumption for its compute throughput: when we deployed Cori, we got roughly five Edisons with really only a doubling of the power. So if you think of that, this is Moore's law stuff, but that's energy efficiency right there: that computational throughput for less power consumption is energy efficiency at the first order, and we're getting the same thing with Perlmutter.
C
So it's important not to lose track of that. One thing you will notice: we are getting a doubling of power but a five times improvement, and it's actually going to come in slightly under five times, so we can see the softening of Moore's law. And, believe it or not, we're actually looking at NERSC-10 now, and it looks like there will be even more of a softening of that. But this is an important aspect: our computational technology is providing core energy efficiency improvement.
C
That's not to negate the need to make sure that the infrastructure and other support aspects of our operational services are paid attention to. So I've got three basic levels of energy efficiency at the support and infrastructure level. The next level that I always like to emphasize is that once you start a compute job,
C
it should complete. Because if it doesn't complete, if it gets to 70 percent or to 30 percent, anywhere before it gets to 100, and it crashes, that's just pure waste out the window. This is one of the reasons why we put so much emphasis on helping our users with their code, and we do the best we can to ensure that once jobs are deployed they have a high probability of succeeding.
C
Then site-specific facility design is next. NERSC is located here in Berkeley, and we happen to have a very mild climate, and we're able to produce an HVAC system that doesn't use chillers, basically the vapor-compression air conditioning which we're all very familiar with. (Oh, the lights turned off on me; I'm the only one on the floor right here.)
C
So this is an important aspect: being able to just dissipate the heat to the environment without very energy-intensive HVAC equipment. I'll talk about that later. And then the last thing is, once you have that equipment deployed, having high-resolution monitoring tools and data analysis tools to adequately determine whether those systems are performing optimally is a perennial challenge.
C
NERSC has invested a lot, both in staff and in systems deployment, to be able to monitor how our systems are performing, and to analyze and improve them: this positive feedback loop. And indeed we're helping to set the standard for the state of the art in scientific computing. A few quick words on our building: we're a four-story, 150,000 square foot building, and basically 40,000 square feet of that is offices.
C
We don't count that in our efficiency metrics. We basically have a power supply capability of 21 and a half megawatts, which is usually about double what our expected draw is, and our two systems, Perlmutter and Cori, are capable of drawing a peak of 10 megawatts.
C
As I said, we run year-round compressor-free air and water cooling systems. We're LEED Gold rated. We have an annual average PUE of 1.08; I'll talk about that further.
C
Right now in Berkeley, NERSC, or Shyh Wang Hall, is approximately 40% of the campus energy demand. So we are basically designated here as the significant energy user on the Berkeley Lab campus. That gives us a lot of extra attention and help from the lab directorate in identifying energy efficiency measures, and I'll talk a little bit more about that.
C
We collaborate deeply with my former division, the Energy Technologies Area, where there is a data center efficiency center for general IT for the private sector, and their experts come in and help me. We also have a specialist energy engineering consultant called kW Engineering.
C
This gentleman right here comes in and does consulting for all of the buildings on campus, with special attention for NERSC because of our high energy demand. What are the targets that we use to know that we are actually operating efficiently? We have two metrics: the power usage effectiveness (PUE), and a subset of it, which is the IT power usage effectiveness. I'll talk a little bit more about that; basically, PUE is the facility-wide one.
C
So, for example, if the compute energy is, say, six megawatts and our total facility is 6.8 megawatts, that 0.8 represents the facility overhead: that's how much electricity we consume for all of the services over and above the compute. IT power usage effectiveness eliminates some of the non-support stuff that doesn't matter as much, looks just at the IT inside the cabinet, and peels out the HVAC; it's a way of understanding the efficiency of the HPC directly. Another one that's becoming much more important is water usage effectiveness (WUE): our water cooling facilities use these large cooling towers, which you can see a picture of down here.
C
Currently, with Perlmutter, we're projected to evaporate somewhere around 60 million gallons of water a year, so it's a lot of water. That was a large increase from the Cori-only, or the Cori-and-Edison, era, when we were down around 12 to 15 million, and that's because Perlmutter, while it's a more efficient system, is using 100 percent liquid cooling, and so it hits the cooling towers harder and we're evaporating a lot of water.
C
The preliminary analysis numbers for NERSC-10, on the other hand, have that water use ballooning to around 180 million. So we are in the process of really seriously looking at our water usage effectiveness, and we're starting to evaluate new technologies for the NERSC-10 era that will use a lot less water.
C
Water usage effectiveness is a different type of metric: it's not unitless like the other two. We look at the amount of water and the cooling plant energy, and you end up with liters per kilowatt-hour.
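Written out, the two headline metrics are roughly (a loose formalization using the example numbers from a moment ago; in the usual definition WUE's denominator is the IT equipment energy):

```latex
\mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT}}}
  \qquad \text{e.g. } \frac{6.8\ \mathrm{MW}}{6.0\ \mathrm{MW}} \approx 1.13

\mathrm{WUE} = \frac{\text{water consumed [liters]}}{E_{\text{IT}}\ [\mathrm{kWh}]}
```

PUE and its IT-focused variant are dimensionless ratios with an ideal value of 1.0, while WUE keeps units of liters per kilowatt-hour, which is part of why it's so site-specific.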
C
We have been monitoring that metric for a little bit over a year now, and this is site-specific, so you really can't compare one site to another. We're in the process right now of determining, for our current cooling plant with Perlmutter, what our efficiency point is, and we will be developing energy efficiency measures and improvements on that. Any questions on this, by the way? Don't feel shy about just interjecting with questions; I actually prefer to have more of a conversational approach to these topics.
C
How do we, as I mentioned in that third bullet, pay attention to how the systems operate? Well, we have deployed this system we call OMNI. This is an old flow chart of everything that's happening; it's actually not very current anymore, but it's still illustrative of how detailed it is. Out of all of the hyperscale and DOE data centers, we have the most high-resolution instrumentation system. Most of the time, in most other facilities and historically, when a new project is occurring, or when we know that there's a deficient area of performance in a data center, a project plan is put together, a data monitoring plan is put in, the systems are deployed, there's a period of gathering data, then we decide what to do, then a project is designed, the work is done, and then, post-work, we look at the data again and see how well we've improved things. Well, that's a lot of time delays, and in the period when you're gathering the data you've got inefficient operation and wasted energy.
C
We operate on a different ethic, where we say: we don't know in advance what we will need to measure, and by the time we notice we need to get more eyes on the performance of a piece of equipment, it's too late. So we have decided that we have the capability of just gathering everything, and when we see that we have a problem, we will have that data already.
C
We can go into implementing corrections and adjustments immediately, and thus OMNI was born. That's what we have, and we keep the performance data indefinitely.
C
We use the OMNI system, and the data that we have becomes this triumvirate of support for our optimization of the entire facility, together with Sustainable Berkeley Lab (that's the LBNL directorate that helps leverage their resources to help us improve NERSC), and it's a positive collaboration which is rapidly becoming a template for energy efficiency improvements in the entire DOE, and indeed the hyperscale scientific community at large. One example of an area that's emerging in all of the large top-50-type data centers is operational data analytics (ODA). What our OMNI platform does is co-mingle HPC telemetry and HVAC infrastructure monitoring data into a common database that we can then analyze together, time-synchronized, and be able to deploy solutions.
C
This is a very, very powerful tool. This tool right here, SkySpark, allows us to look at all of our... it basically has the Cray blower fan performance data from Cori in this same platform along with the HVAC, and it allows us to establish a baseline. This is showing a scatter plot of the cooling plant performance in both the baseline period and a targeted analysis period. So what we have here is the baseline fan power and pumping power.
C
The scatters are showing where we should be performing. In this example we've done settings changes, and we're looking at that analysis period versus the baseline, and this plot is showing that we are actually burning a lot more energy here in the cooling tower fans. So this helps us zero in on the settings that are in the cooling plant, and we can do this in real time and iterate and dial things in. Yes?
A
I noticed, just this past weekend, this being used in real life, in fact, and probably Norm can comment with more detail. We had a power outage over the weekend for power maintenance work, and when that happens, when we don't have the main feed of power, we can actually keep a fair bit of stuff running on backup power, but Cori compute is just a bit too much for it.
A
So you might have noticed that you could still use things like the DTNs to move data around, and I was watching the internal Slack channel a little bit, because the one time that people were a little bit concerned about was the middle of Saturday afternoon, when the temperature was forecast to be at its peak, and the big question was: do we think, if the forecast is right, that we have enough cooling capacity that can be driven by the backup generator to keep those things running and cooled? There was a bit of chatter during that time on the internal Slack channel as the operators were watching what I imagine were these charts that Norm's showing right now.
C
This is the chart exactly, and I happen to have it up here. These are the environmentals in the racks that you were talking about, and this is what we were chattering about. These are deployed sensors that show us the air intake temperatures. As a matter of fact, we're going to go to the last seven days, and we can show everyone the outage that was over the weekend; so, this period right here.
C
These were the temperatures right in this period here, where we were just operating on that one air handler, and we were talking about these temperatures as we were down to just the two backup air handlers, and we were able to make sure that this common-area, air-cooled equipment was operating correctly. That's exactly right, and this is the OMNI system that was still operating during the power outage.
C
Yes, exactly right. That's an example, and this Grafana here is an example of one of the ODA tools that we use. For example, I think we've got this one right here: this is where I can view the actual performance of all my air handlers, and this one shows me the cooling distribution units for Perlmutter; I'm able to see the inlet water temperature and the outlet water temperature.
C
So we have a whole host of instrumentation that allows us to watch the performance of these systems, and this is one of the other ones. Getting back to the slide deck: I mentioned that we have the capability of HVAC high-resolution telemetry as well as HPC high-resolution telemetry, and I've got one example that I'm going to work through that you might be interested in. This is Cori, an XC40 Cray system, part of the Cascade line.
C
This is something that users may not have known about Cori: it's a combination, a hybrid system. It uses cooling water for 70 to 80 percent of the heat extraction from the processors, and then the balance of that, 20 to 30 percent, is from air that is blown through all of the compute cabinets in a cascade fashion. This is why, if you've ever been in the room with Cori, a lot of people like to use earplugs: it's a honking noisy machine.
C
Another blower fan brings the velocity back up; rinse and repeat, all the way down through the something like 15 cabinets, until it exhausts out into the room.
C
Well, we interacted with Cray way back during the Edison phase to tell them: hey, these fans consume a lot of energy. It was a significant portion of Edison's total energy consumption at the time, roughly 12 percent, and for Cori it was somewhere in the range of 350, almost 400 kilowatts of Cori's energy use, which at that time was around 2.75 megawatts. But anyway, they only gave us three fan speeds in Edison: idle, nominal (as I've called it), and maximum, basically 2500, 3200, and then 4000 rpm. From interacting with us, they developed a dynamic fan speed control feature, where it would monitor the processor temperatures in the row, and if there was one hot-spot node in the row, with a processor running hotter than the rest, then for every five degrees C of processor temperature change up, it would modulate the blower fans up by 150 rpm, until it got to the 4000 rpm maximum. This allowed some power responsiveness on these blowers, up and down. That was great.
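A sketch of that control rule as described (the reference temperature and the exact mapping are illustrative assumptions, not Cray's actual firmware logic):

```python
MIN_RPM, MAX_RPM, STEP_RPM = 2500, 4000, 150
REF_TEMP_C = 60.0   # assumed temperature at which fans sit at minimum
STEP_TEMP_C = 5.0   # each 5 degC above reference adds one rpm step

def blower_rpm(hottest_processor_temp_c: float) -> int:
    """Fan speed for a row, driven by its hottest processor."""
    steps = max(0, int((hottest_processor_temp_c - REF_TEMP_C) // STEP_TEMP_C))
    return min(MIN_RPM + steps * STEP_RPM, MAX_RPM)
```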
C
It provided us roughly seven percent energy savings just out of the box. But because we're compressor-less, our water temperature kind of fluctuates with the outside air temperature; what we call the wet bulb dictates how cold the water the cooling towers can make is. We found the problem was this static cooling-coil exiting-air temperature (basically, the servo control loop that says: okay, this cabinet air temperature should be 22 degrees C), if that's a static set point. So we decided to develop an active script on Cori that we call dynamic setpoint.py. It's a system management workstation script which looks at the cooling water temperature and actively adjusts that cabinet cooling temperature set point, to make sure that we are not just driving that valve wide open trying to chase an unobtainable cabinet air temperature.
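Very roughly, the idea looks like this (all names, offsets, and bounds here are illustrative; the real script runs on the system management workstation against live sensor and control points):

```python
APPROACH_C = 3.0                    # assumed achievable air-over-water approach
MIN_SET_C, MAX_SET_C = 20.0, 27.0   # assumed safe bounds for cabinet air

def cabinet_air_setpoint(cooling_water_supply_c: float) -> float:
    """Choose a cabinet-air setpoint the cooling coil can actually reach,
    instead of chasing a fixed target the water temperature can't support."""
    target = cooling_water_supply_c + APPROACH_C
    return min(max(target, MIN_SET_C), MAX_SET_C)
```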
C
This way we have been able to get a much more agile, seasonal performance out of the dynamic fan speed control feature in Cori, and we're shaving off these points here. These represent cooling water pump energy, and these are really, really big cooling water pumps (they're 125 horsepower each), so when we're circulating all this water with those cooling water pumps, if we can knock those pump speeds down by a couple of percentage points, it translates into some energy savings, in a small way.
C
Power delivery demands from jobs starting up on these huge exascale systems are probably going to demand even more interactive communication between the cooling plants, the power system deliveries, and the HPC system. An exascale system that starts up a job that's going to use, say, 75 percent of the system may all of a sudden, in the blink of an eye, demand five megawatts more of power, and indeed that translates into cooling demand right away. A cooling plant can't respond to that instantaneously.
C
It is going to need some sort of pre-warning, where the job scheduler says: okay, cooling plant, get ready, in 10 minutes we're going to need five megawatts' worth of cooling capability. This, in a small way, represents the future world we're moving towards of power-responsive cooling plants. Just in closing, a couple of words on what the Building Infrastructure Group does for energy efficiency in the future, for Perlmutter and now NERSC-10.
C
We interact heavily with the design teams to make sure that we incorporate energy efficiency concepts in the actual design, right down to the owner's project requirements, as well as all of the review during construction and commissioning, as well as engaging with equipment vendors, where we see that some new technologies coming from manufacturers, which might be four or five years out, could benefit us.
C
I mentioned earlier in the talk that NERSC-10 could potentially raise our water evaporation up to 180 million gallons a year, which actually exceeds the capacity of the pumps feeding Lawrence Berkeley Lab, so we have to find alternatives. We've started engaging with what we call dry coolers. These are basically just huge automotive radiators: big fans that blow through radiator arrays to help cool down a closed loop, which then goes into the cooling plant. Now, they take more surface area, but they don't evaporate water, so we're engaging with those manufacturers to see if we can use that technology for NERSC-10 to help with our water evaporation. There are other NERSC activities in planning, too.
C
We are looking at machine learning methods to help us optimize our settings, especially the balance between the cooling tower fans (our most energy-intensive component in our cooling systems) and the pumps, which are a little bit more efficient. Right now we are kind of hitting the fans more, and the settings between the two are difficult to tune manually; they're kind of seasonal.
C
They need to be set one way for one cooling or heating season versus the other, and it's difficult to find an algorithm that is fully agile across all of the different types of conditions. So we're in the process of planning out the data that we need in order to deploy some machine learning models that will be much more agile in that regard.
C
HPE also has some products that we're evaluating, and we regularly do outreach and collaboration with the various centers around the world (like I said earlier, the top-50 centers); we exchange ideas, they help us, we help them, and we try to stay on the cutting edge of everything. Then, in closing, I always like to put the word out for everybody that's on the team, and for this summer we also have two summer students, engineering students, Gabriel O'Reilly and Nicholas Ventura, who are both helping us in various corners as well. So with that, this time I'm going to put you guys on the spot for some questions.
A
That's really interesting, thanks Norm. So, a couple of questions and comments that your presentation brought to mind.
A
I thought it was really interesting that you found almost a law of unintended consequences there, where adjusting the fan speed to improve the efficiency of Cori triggered an interaction with the cooling system that then kind of undid the good that the fan speed change was doing; and so by coupling the information together, you're able to get them to cooperate instead of stepping on each other's work.
C
Yeah, it's like we have these offsetting savings: we got savings in Cori, but then we looked at it holistically, with the second-order effects that might be elsewhere (this is very common in building science energy efficiency; there are second-order and third-order interactive effects all the time), and it looked like they were roughly offsetting. Now, this is a unique situation in our facility, and every facility is a prototype.
C
We are also site-specific. A lot of the other Cray XC deployments in other centers will have chillers, meaning their chilled water loop has a set point, and the air conditioning equipment makes sure that set point stays tightly within a narrow window, so they wouldn't have had this issue, because that chiller is just going to use whatever energy it takes to maintain that cooling water temperature.
C
So in that situation, the dynamic fan speed control out of the box with the Cray XC system actually works very, very well. But in a situation where the cooling water temperature diverges a lot due to the outside conditions, that static set-point cabinet air temperature does get into trouble. This was a very agile solution, and the CSG group here worked on it.
C
I've got to send all sorts of thanks to them: to Owen James in the OTG group, who developed the initial Python script, and then several people; Adida Gower is now the expert in the CSG group that helps with it. I actually presented this way back at NUG at Supercomputing 19, I think it was, in Denver, or in Dallas, and presented it as, you know, something that could potentially be a future improvement. Cray opted not to go that route, but that's because very shortly thereafter pretty much all of their capabilities were focused in on Shasta. But yeah, that is not uncommon in energy efficiency.
A
Yeah, so it's never quite that simple. The other thing that you reminded me of, and either Norm or possibly somebody else who's on as well can clarify: for NERSC users, the sacct command has an output option where you can get consumed energy, and for completed jobs it shows a value. I think that's getting the energy from somewhere in the OMNI system; do you know?
B
Yeah, I think it uses the Cray CAPMC or something like that.
C
So, while you're looking for that: for our power numbers, we've got several flavors of them. I don't think my slide deck actually says this, but we've got master meters, which we call ION meters, that we use for the campus numbers for our total power consumed. They are revenue-grade and highly accurate, and they're at the substation level. And then the SEDC meters that look at Cori alone are on-board power-sensing meters that go through the SEDC channel and into OMNI; I believe those use the Modbus protocol, which we use to pull them into OMNI. They're not revenue-grade, but they're very highly accurate. And for Perlmutter we deployed new meters, which are called trend-point interval meters: highly accurate, just a tiny bit less accurate than the ION revenue-grade ones, and very, very high resolution, so much so that we decided to submit our Top500 measurement as a Level 3 because of the high sampling rate.
A
We might be pushing on time, you're right. So, for people interested in seeing energy use via sacct: sacct -e (lowercase e) shows you the list of fields that it can display, and the fields are called ConsumedEnergy and ConsumedEnergyRaw.
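Concretely (the job ID here is a placeholder):

```bash
# list every field sacct can report, and pick out the energy ones
sacct -e | tr -s ' ' '\n' | grep -i energy

# per-step energy for a completed job
sacct -j 12345678 --format=JobID,JobName,Elapsed,ConsumedEnergy,ConsumedEnergyRaw
```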
B
One thing I just want to make a point of: if you want to do in-depth analysis, that number may not always be accurate. It's okay for getting a rough estimate of what you are using, but if you need more information, I think you can get in touch with us and one of us will be able to help you get exact job usage values, with more certainty.
C
Yeah, one of us can help you with that, and in fact we helped another NERSC user, a graduate student at UC Berkeley, recently in that regard. The number does not capture all of the line losses and stuff; those are SEDC numbers, on-board numbers from Cori, right. So there's a...
A
A degree of approximation, but as a first approximation it's what you can get.
C
Oh, sorry, I will point you to where a PDF of this deck is, just in case anyone wants it. Okay.
A
Sounds great, and we'll post that on the website fairly soon. So we are at the top of the hour, and people probably need to head to the next thing, so I'll just very quickly rush through the next couple of items. Coming up: ERCAP season is coming up, so we'll probably aim to have a topic of the day around ERCAP, perhaps for the August webinar. We're always interested in topic suggestions and requests, especially if participants would like to show off their work.
A
Let us know. Last month's numbers: we didn't have a regular scheduled maintenance in June. We had a couple of brief, I think not even complete, outages, but system degraded events.