From YouTube: NUG Monthly Meeting 16 Feb 2023
A
So, for those who have been to these meetings before: we follow a reasonably predictable and very interactive format and schedule. We have a reasonably large turnout today, so for speaking up, maybe use the raise-hand icon under Reactions.
A
It's not obvious, yeah. You should be able to hit the Reactions button and raise a hand, and that way we can manage how many people are speaking at once. So please do raise a hand and contribute. We also chat in the meeting chat and in the NERSC Users Slack at the same time, and keep the conversation going there after the meeting as well.
A
So we'll follow our usual agenda pattern, with one minor addition today. We start out with Win of the Month and Today I Learned; these are opportunities to talk about things that have gone well, and things that haven't gone well, or that you've stumbled across, that are interesting and beneficial to other NERSC users.
A
We have Lipi Gupta from NERSC, who's going to talk a little bit about the user community survey that is in flight at the moment; yeah, a number of people have filled it out.
A
We have a handful of announcements and calls for participation, and there's also an opportunity, if there's an event that you know of that other NERSC users might be interested in, to let people know about it. Then we'll go into our topic of the day, which is going to be Cori's retirement. So Rebecca from the NERSC UEG is here, and she'll give us a bit of an overview of the plans for Cori's retirement, coming up fairly soon.
A
Let's kick things off with Win of the Month. The aim of this segment is an opportunity to show off an achievement, or shout out somebody else's achievement that you know of, and this can be big or small: having a paper accepted somewhere, solving a bug. It's always interesting to hear how you solved it, and I think it makes good tips for other users as well.
A
You may have either made, or know of, a significant scientific achievement that might be a candidate for one of the science highlights that we present to DOE really frequently, or even a High Impact Scientific Achievement award or an Innovative Use of High Performance Computing award.
C
Kevin: I don't have anything as fancy as, you know, an award or anything, but last week I got the first test for stream-triggered communication working. So now I have something that's running and working and getting cool results, and I get to test it even more, and I have something to actually present at SIAM in... no, two weeks. Yes, so it was a pretty big win for me. (So this is interesting: stream-triggered communication.)
C
So the various vendors are potentially building their own versions. The one I tested, since the code is stable and available, is NVIDIA's ACX, which is a library that sits on top of an MPI and essentially maintains and manages a thread that puts a little trigger into the stream. When you reach that point in the stream, the thread then takes over and makes your MPI call at the appropriate point in time. So it's probably not going to be the final state.
A
So this is interesting. As a usage model, does the application poll the stream occasionally, or is it for applications that are, you know, reading a constant stream of input and block until they've got the next one?
C
The current implementation is primarily focused on GPUs, so that's literally a CUDA stream, and you let the CUDA scheduler and everything handle it. The general model is that it's anything that can be represented as a stream, so a CPU stream object (that, I'm not sure, is very well clarified at this point yet), but something like that could also be used to control and manage it, where the user essentially fires it and forgets it, or you have a little bit of control over it.

C
You know, where you're at, and the handling of it as you go. But that's all in progress and fun stuff to watch.
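Kevin's description of the library can be sketched in a toy form. This is a plain-Python analogue, not NVIDIA's actual API, and every name in it is made up for illustration: work items are enqueued on a "stream", and a managed progress thread blocks until it reaches a trigger entry, at which point it issues the communication call on the application's behalf, so the application can fire and forget.

```python
# Toy analogue of stream-triggered communication (all names hypothetical).
# A helper thread consumes "stream" entries; a ("trigger", ...) entry is
# where the real library would make the MPI call at the right point.
import queue
import threading

stream = queue.Queue()   # stands in for a CUDA stream
log = []                 # records what "executed", in order

def progress_thread():
    while True:
        item = stream.get()
        if item is None:            # sentinel: stream torn down
            break
        kind, payload = item
        if kind == "kernel":
            log.append(f"ran {payload}")
        elif kind == "trigger":
            # In the real library, the managed thread would issue the
            # matching MPI call (e.g. a send) here, not the application.
            log.append(f"comm {payload}")

t = threading.Thread(target=progress_thread)
t.start()
stream.put(("kernel", "compute_halo"))   # enqueue compute work
stream.put(("trigger", "send_halo"))     # fire-and-forget communication
stream.put(None)
t.join()
# log now holds ["ran compute_halo", "comm send_halo"]
```

The point of the pattern is ordering without polling: the communication happens exactly when the stream reaches the trigger, and the application thread never has to watch for it.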
A
Yeah, so it's almost like a channels model of communication to the streams? (Yep, something like that, yeah.) It would be good to see that get take-up.
A
So I think I've got something now for Today I Learned as well, which is that I at least began to learn about stream-triggered computing. Thanks, Kevin, that's really interesting. So, anybody else got something they'd like to shout out?
A
We actually have some extra things in the agenda today, so I might move on to the flip side of the coin of Today I Learned. I guess the charge question for this segment is: what surprised you that might benefit other users to hear about, and might help with our documentation, for instance, as well?
A
So, yeah: not everything works on the first shot; in fact, very few things do, and in the process of doing research and achieving something, you tend to learn a lot. The goal here is to actually talk about those things. They might not have worked, but that doesn't mean they're a failure; that means they're something we can learn from, and that others can potentially benefit from as well. But it doesn't even have to be something that you got stuck on.
A
It can be something that you stumbled across that was an interesting topic to read more about, that other users might be interested in: for instance, stream-triggered computing, which was more or less completely off my radar until Kevin talked about it just now.
D
I ran into this when I was trying to debug preemptible jobs, and I learned that there's a flag, --time-min, which I sort of naively assumed meant: this is the minimum amount of time I want the job to run, the minimum amount of time I can tolerate having the job run. But what it actually means to Slurm is: this is how short you are willing for your job to be. So if I asked for a time-min of 10 minutes and the job could fit in 10 minutes, it would only give me 10 minutes and nothing else. It wouldn't keep going and keep submitting it afterwards; it would just say: your time-min was 10 minutes, you got your 10 minutes, you're done. So I was pretty confused by what it was doing for a while, until I actually went and read the documentation. So today I learned that it means Slurm's time-min, not my minimum time.
A
So have you seen times when you got longer than the time-min in the schedule? Because I wonder if this is related to how busy the system is as well.
E
I have, in the preempt queue. In the preempt queue you say how much time you want (I didn't use the time-min), but then you can say: I need, you know, five hours, and it gives me two and a half hours, because it preempted me at two and a half, not because of the time-min.
A
Okay, good; a good reason to dig into checkpoint/restart and other options like that. Thanks, Lisa, that's a good tip.
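The --time-min behaviour described above shows up in a batch script like the following. This is a hypothetical sketch: the --time and --time-min flags are from Slurm's sbatch documentation, but the QOS name, paths, and application are made up, and preemptible-QOS details vary by site.

```shell
#!/bin/bash
# Hypothetical preemptible job illustrating Slurm's --time-min semantics.
#SBATCH --qos=preempt        # preemptible QOS (site-specific name)
#SBATCH --time=05:00:00      # maximum walltime we would like
#SBATCH --time-min=00:10:00  # Slurm may grant as little as 10 minutes
# If a 10-minute backfill window exists, Slurm can start this job with
# only 10 minutes of walltime, and it will NOT resubmit the job to make
# up the remaining time; --time-min is a floor on what Slurm may grant,
# not a guaranteed minimum that gets topped up later.
srun ./my_app --checkpoint-dir "$SCRATCH/ckpt"
```

Hence the checkpoint/restart advice: a job that may receive anywhere between its --time-min and --time, or be preempted partway through, needs to be able to save and resume its own state.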
A
If nothing's jumping out, then we might move along to talk a little bit about our user community of practice and the community survey. Lipi, would you like to tell us some more about this?
F
Yes, I would be happy to. I think you can actually skip this; I think something went wrong in the... yeah, there we go, okay, great. Okay, well, let me introduce myself: my name is Lipi.
F
I am actually a postdoc at NERSC, but I started out as a user: I was using NERSC resources to do my thesis work when I was in graduate school, and that's how I learned about NERSC and ended up applying for a postdoc. So if any of you are in that boat and are interested in learning about the postdoc opportunities at NERSC, I would be happy to talk about that. But also, because I was a user, I've been really interested in user engagement.
F
Now that I'm part of NERSC, we're currently in the process of wanting to create a really active community, in particular a community of practice. I think at last month's meeting Rebecca told you a little bit about what a community of practice is, and it requires a couple of things. First of all, it requires a shared domain of interest. So likely you are a NERSC user, but also, hopefully, interested in research computing and high performance computing for the purpose of doing science.
F
So many of us share that domain of interest. That's how I pivoted: I was doing my PhD in physics, and now I'm at NERSC learning about high performance computing and not doing quite as much physics, because I became interested in that. There's also an actively cultivated and maintained sense of community, and I think the key word here is really "active".

F
We have to be much more involved in how we're not only cultivating this community but also maintaining it: having different programs and events taking place within the community, so that people can practice that shared domain of interest and participate in it. And exactly that is the last thing: active practice of the shared domain of interest, so that the community is involved with creating or collaborating in some way, or sharing information, and this is happening often and in various forms. So we're in the process now of creating that user community of practice. Next slide.
F
Yeah, so one of the things that Rebecca had shared is that we are looking for a lot of community feedback. We have a lot of ideas; many of us were NERSC users. I was a NERSC user, and I have a lot of ideas about how, if these things had been available to me when I was just a user, my experience at NERSC, my ability to do science, my ability to use resources would have been better or different in some way. And so we want people who are currently in that position to give us feedback about a lot of different things. We're wanting to hear: what might a user community look like to you? What do you think might be missing from your current user experience?
F
What kinds of trainings and programs would help you feel like you're actively participating in this community of practice? One way we want to collect this information is through focus groups: the idea being that we want to gather people who are interested in talking to us over Zoom, in small groups, to discuss with us: what are your ideas? What are your challenges? What are the reasons this kind of involvement would not be helpful to you, or what are things that could be really helpful to you? So it'll be an opportunity to discuss things with us directly. But we also want to collect some feedback that doesn't require participation in one of these focus groups, so you could participate in our survey.
F
So, if you go to the next slide: we really need you to make this happen, and we're going to do kind of an in-class exercise. We don't want to ask people to spend time outside of this meeting to do this, because we know that sometimes that's a barrier to participation. So Steve has kindly allowed us to take a couple of minutes here during this meeting for everybody to go into this survey and complete it. Most of it is something that you can just provide a yes-or-no answer to.
F
There are spaces in there to fill in some information if you want to; there aren't any, like, paragraphs of information requested. And it's also a great place to let us know if you're interested in a focus group. So everybody go ahead: you can use the QR code, or you can use the link that Rebecca is putting in chat, and we're just going to let you fill that out for a couple of minutes, so that you don't have to worry about it later and we can hopefully get a ton of feedback right now.
F
I'm also happy to answer any questions that people have, either about the survey, or about user community engagement or anything like that, or about being a postdoc at NERSC, whatever. But I will let people fill that out, if there are no questions.
A
Thanks. Let me, I guess, first of all ask: are there any questions?
A
Would people like to either use the phone camera or click on the link, and we'll spend maybe about five minutes on the survey and then come back to see where people are at.
A
And
I
guess
yeah:
if
you
have
any
any
questions
along
the
way,
raise
a
hand.
F
And thank you again for taking the time to do this. This is really informative to us, because we want to make sure that the events, programs, trainings, whatever we're thinking about helping put together, are actually going to be useful to the users. So this is a really important part of the process for us. So thanks.
F
Yeah, so we're asking a lot of questions about current engagement, in the form of participating in, maybe, these meetings, trainings, Slack. You could check it out; you're absolutely welcome to. I think you should be able to click through it, because I don't think any of the questions are actually required. Otherwise I can even share it with you in another format, but the idea is, yeah, questions about current engagement.
F
We have a NERSC user Slack, so we want to find out: is the Slack useful or helpful to you? Do you use it? If you do use it, how do you use it? And then some questions just to find out who is filling out our survey, so we get a sense of who engaged with us, even in the survey. And then also some opportunities for people to think about: oh, if this program existed, would I participate? Would I be interested in it?
F
And I did put my email address in the chat. If anyone doesn't like surveys, which is valid, but wouldn't mind sharing your thoughts: I am in the NERSC user Slack, so you can message me, or you could email me if you have a thought. You could email anyone in any NERSC user space, or anyone at NERSC, and they can forward it on to me. If you have an idea, I'm open to hearing about it.
F
Thanks, Gregory, yeah, I agree. I think people get survey fatigue, and they also have a hard time scheduling it in on their own time. So we thought this would just help people do it, and then we'd have that great data. So thank you.
F
Steve, I think giving people maybe one more minute is a good idea, and then I think we should move on. Let's say another, just one minute.
A
Sounds good, we'll give another minute, and I guess if you haven't finished it by then, you can probably keep on going while we go through some announcements and calls for participation.
A
Okay, it's been probably six or seven minutes now, so hopefully people were able to get most of the way through it, and can continue either during the meeting or afterwards. Thank you all for participating and working through that.
A
So we have a handful of announcements and calls for participation. There are some that were announced in the weekly email, and you can easily go back and see those and click on the links. There are some that might be of particular interest to people: if you are a student, or have or know students, you or they might be interested to know that NERSC has a bunch of summer internships available.
A
So, yeah, we're looking for interns for the summer period. There's a list of projects and some more information at this link here, and these slides will go up on the web page afterwards as well. But if you go to the most recent weekly email, there are items for all of these. There's a couple of CFPs that we know about: the AY23 Research in Quantum Information Science on Perlmutter call for participation is now open.
A
A couple of webinars and seminars are coming up through ECP. The ECP IDEAS series has a talk on exascale particle accelerator and laser modeling on March 15th, and actually this link to the best-practices webinars also links to their previous webinars, and there's some really interesting content in there.
A
Another event that ECP is doing is an HPC Workforce seminar on strategies for inclusive mentorship. And on kind of a workforce note: NERSC has actually got quite a few positions open at the moment, and NERSC users often turn into really good NERSC staff. So we encourage you to take a look at the careers page (there's a link to it in the weekly email), and consider joining NERSC as a staff member.
A
Relevant to today's topic: we have some training and office hours around migrating from Cori to Perlmutter. There's a training session scheduled for March 10th (there's a page on that on the website), and we'll have office hours coming up, starting next week, for several sessions. Just before passing on to Rebecca to talk more about Cori's retirement and that migration: does anybody else have any announcements or calls for participation that other NERSC users might be interested to join or want to know about?
A
If not (and if you think of something along the way, feel free to drop a link in the chat), we might move on to our topic of the day, which is about Cori's retirement. Rebecca is the leader of user engagement at NERSC. Rebecca, do you just want to say "next" at the appropriate moments, and I'll move through the slides?
H
Oh, same thing, different day? Okay, all right. So everybody is here for this exciting topic about Cori's retirement. I'm going to try to give you all some background information so that you can understand everything that's going on and what our plans are for the retirement of Cori. So first we're going to talk about the life cycle of a supercomputer.
H
Then we're going to talk about why we're going to retire Cori, and the Cori retirement schedule, and we're also going to talk about Perlmutter. So that's sort of our outline. Next slide, please!
H
So this is basically an overview of the life cycle of a supercomputer. The first thing that happens is we design the machine (I'll go into a few more details about this in subsequent slides), but the idea is we've got to actually figure out what we're going to get and how we're going to do it. The next step is to actually build that machine, and that is primarily done by our vendor, but with sort of a collaborative approach as well.
H
So, as you may have noticed, we take monthly maintenances on our machines to make sure that they're still in tip-top shape for you all, and that they're still working. Eventually, at the end of the life cycle, we start thinking about retiring machines, and so we then retire them and decommission them, and then, after the machine is turned off, it gets recycled. Again, I'm going to talk more about all of these phases; this is the general overview. So, Steve, feel free to push it again.
H
That's my one animation in this presentation. I put the machines at the various different places where they are in this progression. So if you look down at the bottom right there, we've got Cori: we're in the operate-and-maintain stage, but we're getting to the retire stage on Cori. We are still in test-and-validate, and getting to the operation stage, for Perlmutter. And then I put N10; that stands for NERSC-10, which will be our tenth machine that we acquire. We are in the process of designing that one right now, while we're also trying to do all these other things with the existing machines. Next slide.
H
So the next step is to start building the machine, and the building of it begins in the vendor's factory. They actually will assemble the machine, then they'll test it a bit, make sure that it more or less functions, and then they will disassemble it and send it to NERSC. That's what they provide; on our side, we provide all the necessary power, water, cooling and so on for the machines.
H
Now, I used to work in Australia, and we got a machine there, and they actually shipped it on an airplane to Sydney, and then they shipped it on a road train (that's a truck with lots of long trailers behind it, like five trailers) to us in Western Australia. That was a pretty exciting journey, but in our case I think they mostly just come by truck. Okay, next slide, please.
H
Okay, so the testing part. I alluded to this before: testing actually begins in the factory. They do some factory tests, and often, under normal circumstances, we actually go and look at the machine while they test it, and we help make sure that it is at least providing the initial functionality that we would expect it to be able to provide in that environment, before they bring it to NERSC. So then, once they do bring it to NERSC, they reassemble the machine.
H
We test it a lot further: we do a lot of hardware, software and network testing, and we also let friendly users onto the machine. For example, we had an early science period with Perlmutter, where we let on everybody who was participating in our early science program, and they checked out the machine and kind of broke it in. They were what we call friendly users: they understood that maybe everything wasn't working quite properly, but they were there to help us too.
H
Now, the vendor: I wouldn't want to be a supercomputer vendor, because they have to put forth millions of dollars for all of the parts in this machine, and then, after we accept the machine, we pay them for it. So they're fronting a lot of the expense of these machines. There are some milestones where they'll get some partial payments, but the bulk of the payment comes at the end, when we accept the machine, and in order for us to accept it, it has to pass a lot of tests.
H
So we have functionality testing, performance testing, stability testing, reliability testing; we do very thorough testing, and this includes a 30-day stability test, where we have all of our users on there just pounding away at the machine, and it has to remain in service for a very high percentage of the time, with users on it, during this 30-day period, before we can pay them money. All right, next slide.
H
Now we've accepted our machine; it's all going well. When we're operating our machine, it's around-the-clock operation: we have staff on site 24/7, 365 days a year. There's somebody there on Christmas, somebody there at 2 a.m. on a Sunday; every day there's somebody there. And while we're doing that, we're also doing regular maintenance of our machine. We have actual on-site vendor staff at NERSC from HPE, and they perform physical maintenance of the machine.
H
So if there's a node that's gone bad or something, they'll pull it out and they'll replace it or fix it. They'll replace cables that have gone bad; you name it, anything physical, they will do it to repair the machine. And then, in addition, we do regular upgrades of the system software. For example, there may be security issues that we need to be sure to patch before they become a problem.
H
There may be bugs in the software that we will also upgrade or patch in order to fix those problems, and sometimes we actually get more functionality when we introduce new software onto the machine. All right, next slide. Okay, so let's stop for a second here and talk about reliability. When we get the machine, there's what we would call a shake-out period, where there's faulty new hardware that somehow wasn't detected, or things just fail.
H
You tend to find this in system reliability, which apparently is actually an area of study that people study, so that's kind of cool. They call it an "infant mortality" failure; I hate to use that term.
H
So I'm not going to say it again, but early on, parts will fail on a machine, and also late in its life parts will fail. So there are the early failures, then there are the wear-out failures, and then there are also just totally random failures during the lifetime of the machine. This is what's called the bathtub curve; if you look over here, there is a curve of the failure rates, and yeah.
H
Thank you. And so you can see it's kind of higher at the beginning, then it slopes down and it's pretty flat through the middle, and then it comes back up at the end. So in the middle period, the middle age of the computer, it has a pretty low, constant failure rate; but then, as time goes on, we start to get more of these wear-out failures, and so the failure rate goes up again. All right, next slide.
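The bathtub curve just described can be sketched numerically. This is an illustrative toy model, not data from any NERSC machine, and all parameters are made up: a decaying early-failure term, a small constant random-failure rate, and a growing wear-out term sum to a hazard rate that is high at the start, flat in mid-life, and rising at end of life.

```python
# Toy bathtub-curve hazard model (all parameters hypothetical).
import math

def hazard_rate(t, early=0.5, decay=2.0, random_rate=0.05,
                wearout=0.5, onset=8.0):
    """Failure rate at time t (arbitrary units)."""
    return (early * math.exp(-decay * t)      # early "shake-out" failures
            + random_rate                     # constant random failures
            + wearout * math.exp(t - onset))  # wear-out failures late in life

# Sample the curve at the start, middle and end of life: high, then
# low and flat, then rising again.
rates = [hazard_rate(t) for t in (0.0, 4.0, 10.0)]
```

The shape, not the numbers, is the point: mid-life is the cheap, reliable part of a machine's lifespan, and the rising right-hand side of the curve is what eventually drives retirement.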
H
Okay, so we retire our machines at the end of their useful life, because after a certain point, as you've seen with the bathtub curve, their failure rates really begin to rise, and machines become a lot harder to support and much less reliable. Another reason is that, as new technologies come out, they tend to be more energy efficient and provide more compute power for the same amount of energy. So that's another good reason that we like to get a new machine.
H
So once we retire the machine, we actually return it to the vendor; that's how we do it. And what they do with the machine is recycle it in some way. Sometimes they resell it: for example, my understanding is that there's a part of Edison that is in Texas now, having a second life. Other parts of it...

H
...they will use for spare parts for similar models that are still in operation. And then, if they can't do either of those things, they will just take out all of the valuable metals or other components, remove those, and recycle the valuable metals and things like that from the machine. All right, next slide.
H
So why do we need to retire Cori? Well, Cori has reached the end of its useful lifespan. You know how dog years work, like there are seven dog years to one human year? I think there may be something like 12 supercomputer years to one human year, so about one supercomputer year per month. So Cori is in its 80s or 90s, maybe even 100 years old, and so it's really reached the end of the expected lifespan of a supercomputer.
H
This model of supercomputer is no longer being produced. That means, of course, that the processors haven't been produced for years, and the memory also; but there are also other components, like the cabinet components (fans, electrical parts), that were all custom-made for this machine and are no longer manufactured. So if something fails, we have to rely on remanufactured replacement parts, assuming that we can even find them.
H
We also have observed more frequent failures in the individual components of the machine. It may not be as visible to you all, but occasionally lately we've been having whole cabinets going down because of the rectifier in them. It's an electrical part (I don't exactly know what it does), but the rectifier has been going down, and then we have to replace that, and that's very disruptive to users. That's another reason why the reliability is going down.

H
And failures from now on are becoming more and more difficult to recover from, because we don't have parts that we can actually put in the machine; we've run out. Of particular concern to us is the scratch system: there are no spare parts available for the scratch system, so if something fails in the scratch system, we could lose user data. We've already told you all that, so it shouldn't be a surprise to you. And in order to recover from a failure, we may end up internally repurposing parts of the machine.
H
So if something fails, we may end up having to shrink the machine, because we don't have that part anymore, and so we just have to shrink everything down; and we may shrink it even more, so that we can keep a spare part. It just kind of depends on what happens. Also, recoveries may take longer than they normally would, because, again, we don't have replacement parts.
H
So here is our current plan for retiring Cori. On March 31st we're going to remove all of the auxiliary components from Cori. The large memory nodes, those are nodes that we acquired in, I guess, 2020, that have a large amount of memory on them; those are still very useful, and they still exist.
H
It won't be a fast process, but that's our plan for those. And then there's also a GPU partition on Cori, just a very small partition of GPUs; we acquired those probably five years ago or more, and those nodes are just going to be retired, because they have obsolete GPUs that aren't as good as the GPUs that we have in Perlmutter. So we don't need to keep those at all; that part we're just going to retire.
H
Then, at the end of April, we're going to retire Cori as a whole; let's call that date T. We will let you have access to the scratch system for another week after T, but after that we'll power down the machine altogether, and then the next month we will start to remove Cori from the machine room. Next slide, please.
H
So let's talk about Perlmutter, because that's kind of the elephant in the room for everyone. We haven't yet completed the testing of our final configuration of Perlmutter. Perlmutter has 14 GPU cabinets, 12 CPU cabinets, and the network is Slingshot 11 (that's what SS11 stands for).
H
We finally reached that configuration earlier this month. I'm not sure when we're going to be able to start testing this final configuration, but it's going to happen soon. Now, we are not going to retire Cori until Perlmutter is thoroughly tested and working for users. I mentioned testing before, and we've already done some of it. Functionality: the system provides a lot of basic functionalities, so we can check that box. Performance: we were able to achieve certain performance levels on certain benchmarks, so we can check that box.
H
The two that are really remaining right now are stability and reliability. The system needs to remain up during our testing period (I told you there's this 30-day window where it needs to basically remain up), and then, for reliability, we need to have fewer hardware and software failures than what we're seeing right now. Next slide.
H
So we know that Perlmutter is not meeting our expectations, and it's not meeting yours either, yet. We understand how important it is that Perlmutter is a reliable machine, in order for you all to continue to make scientific progress. So we meet with HPE, that's our vendor; we meet with them every day to address bugs and issues. Right now we have some HPE experts at NERSC, today and this week, to focus on resolving these issues that we are all experiencing, and we're optimistic that this collaboration will improve Perlmutter's reliability. Next slide.

H
So we're working together with HPE to address the stability of the Slingshot network; we know that's not working right yet. The I/O performance on Perlmutter's scratch system and on our Community File System: those are also not working well. And the node hardware reliability: we have also seen some issues there. In this collaboration we have developed some new processes; we've developed some new methods, so as to have a methodical process.
H
We've made some configuration changes to the Community File System, and to the CFS client on Perlmutter, in order to stabilize the network communication and performance. Some of those things seem to be panning out for us.
So that's good too. We're also rolling out fixes for some Slingshot network bugs that we and HPE discovered and that HPE has been able to fix. We're rolling those out this week and next week, and we're confident they will make a big difference in the performance of the machine.
So, in summary: in the supercomputer life cycle, Cori has reached the end. No new parts are being manufactured for it, which makes the upkeep especially challenging, and we plan to retire Cori at the end of April. Perlmutter's reliability issues are being addressed at top priority by us and our vendor HPE. That's all I have at this time; I'm happy to answer any questions.
Hi, Alex Copeland here from JGI. I'm glad to hear you say that Perlmutter is not meeting your specs with respect to reliability; that has certainly been a question for all of us trying to use the system. So you have these aspirational goals for what you hope to address on Perlmutter to get it up to spec, and in contrast you have what seems to be a hard calendar deadline for the retirement of Cori. I'm wondering whether that deadline is actually also more aspirational, and whether we should understand it that way.
So what we're saying is that we really want to retire Cori on April 30th; that's our goal. But the prerequisite is that Perlmutter has to be in a state where we feel comfortable retiring Cori, so if we have to, we will extend that date. I guess the reason we provide the April 30th date is so that you can know Cori is guaranteed to be there through that date.
So if Perlmutter were suddenly perfect in every way tomorrow, we wouldn't retire Cori the next day; we would wait until April 30th. Does that make sense?

It does, and if I can ask a follow-up: the measurements or metrics you've discussed for reliability and stability on Perlmutter seem a little bit squishy, a little vague. For example, your last comment, about Perlmutter being perfect tomorrow: from the point of view of a user, that doesn't seem like something you could possibly determine in a day. We would like to see stability measured in months, not in days.

Absolutely, absolutely; that was purely hypothetical. No, so we have a really big, long document. How many pages is it, Tina? Hundreds of pages, I think.
We generally want to see some stability before we start that test, because we want the expectation to be that the system is going to pass the 30-day test. During that time there has to be at least a seven-day window where the system does not have any failures, other than maybe a node falling out or something along those lines, but nothing that isn't recoverable or that takes out the majority of the system in what we would call a system-wide outage. That is a term we use when some portion of the system, as defined in the SOW (the statement of work), fails; it's not like the whole machine has to fail for it to be considered a system-wide outage. So somewhere within that 30 days we have to have that seven,
H
Seven
days
of
reliable
running
as
well
as
there
is
a
Only
One,
system-wide
Outage
allowed
during
that
time
frame
the
understanding
you
know
as
we
bring
in
these
systems
that
are
new
we're
using
new
technologies
that
we
aren't
expecting
them
to
be,
maybe
as
stable
as
they
will
later
on
in
their
life.
So
we
that's
why
we
have
the
ability
for
them
to
have
one
system
white
outage
during
that
30-day
window.
That
number
decreases
as
the
age
of
the
system
increases
based
on
our
requirements.
Thanks, Tina. So does that help, Alex? Well, it's certainly moving in the right direction. I actually have some other questions, but I'd like to give other people a chance if they have questions. Sure, okay, thanks. Anybody else have questions?
Okay, I see Vivek has his hand raised; go ahead. Hey Rebecca, thanks for presenting this. I was just curious: are there any suggestions or maybe recommended best practices for users who rely on Perlmutter as a development resource during these kinds of periods of downtime? Maybe some kind of workflow that replicates the conditions on Perlmutter, so we can be productive during the downtime?
That's an excellent question. I might defer this to some of my other colleagues who are also here. A lot of things are quite similar from Cori to Perlmutter in, say, the user environment; of course the architecture is different, and if you are using the GPUs on Perlmutter, obviously that's different. I think, maybe, are you wondering in the realm of performance, how you can gauge the performance of your code? Is that what you're trying to think about, or what specifically?

I mean, there are certain codes that I would love to test.
For example, on a single node of Perlmutter, because the CPU on Perlmutter is a nice fast CPU with lots of memory and I can't really do that on my laptop. But during these periods of downtime I can't do that. So if there is some kind of option, it would be great to know about it, and if there isn't, a suggestion might be:
if there were some dedicated set of nodes set aside just for development purposes, not for large-scale runs, that would be really useful as a user.

Okay, so your question is mainly about mitigating the downtimes? Mitigating the downtimes, and whether there's an option, because it's not like I need more than one node; I just need the resources of a single node.
So if there were some kind of option to get that, and especially something that replicates Perlmutter's environment, so I don't have to spend time changing things when I do a production run and then switch everything back for Cori. You understand?

Yeah, I think that's a really good suggestion that we'll have to think about. Some of these maintenances are things that affect the entire
spine of the system; this one in particular, we're working on the thing that runs the whole network for the system. When those things happen, we end up having to take the whole machine away. But I think we can definitely look at options for keeping up the portions where we're not doing brain surgery, basically.
That would be really useful for the people who rely on it for development. And I wanted to say thank you for giving high priority to addressing some of the issues that have been going on with Perlmutter; appreciate it.

Great. So, just looking at the time, we are at the top of the hour; however, I suspect that people still have a few more questions.
So if people need to move on to your next thing, then please feel free, and thanks for joining us today, but we'll keep the meeting going for a little bit longer to extend the Q&A opportunity.
If you have a question... Yeah, I have a question, actually. So, for Cori, the CPU part is larger than the GPU part; the CPU portion is large and the GPU nodes are, I think, not that large in number. On the new machine, can I expect the GPUs to be the major portion and the CPUs to be the smaller portion, or are they kind of mixed? Because some of the codes will not run on GPU and some of the codes will not run on CPU, right?
So, yes, I believe there are more CPU nodes than GPU nodes on Perlmutter, but in terms of the number of cabinets, we have 14 cabinets of GPU nodes and 12 cabinets of CPU nodes. Because you can fit two CPU nodes in the same space as one GPU node, and there are four GPUs per GPU node, there are overall more GPUs. Correct, that's true, there are more GPUs than CPUs, but in terms of the number of nodes... yeah, that kind of depends on how you think about it, I suppose. Okay.
Okay, so the data transfer nodes are independent of Cori, but they do have the Cori scratch file system mounted, which may be why you're thinking that they could be related to Cori. They currently do not have the Perlmutter scratch file system mounted on them. Ultimately, I think we would like to have Perlmutter scratch on them, but I don't think we have the time at this point to be able to do that, so for now they won't have it. Tina?
Can you comment? I think there are some technicalities there for trying to mount the file system outside of the Perlmutter network, so I think we're looking at other options, like maybe bringing some DTNs inside the Perlmutter network or things like that. We haven't quite gotten to a firm solution on it yet, but we are looking for some ways to improve access to the file systems from external nodes.
This is a question about using Perlmutter's GPU nodes. Looking at our usage this year, I was wondering if there are any plans for a shared queue on the Perlmutter GPU nodes; there is a shared CPU queue right now, to my understanding. The motivation behind this is that we've noticed that even with GPU codes, sometimes there are calculations that won't scale to using all four A100s, and I think it's a shame to have the rest sitting around.
Yeah, I can probably talk a little bit more about that. We had an issue with the ability to actually partition the memory on the system, so a user could overrun their allocation of memory, which could impact other users on the node. That required some fixes from the vendors, and unfortunately it was one that crossed multiple vendors, so we have been working with them.
We have a fix, but it's not going to be fully implemented until a little later. The dates keep shifting on when the releases are coming out, but it should be here before June, somewhere in that time frame.
Yeah. All right, any other questions? If there's nobody else, then I'd like to go back one more time to the reliability question, if we have time for it. Sure. So, given the reliability of Cori, and I haven't looked at this data in a while, but I think the last time I plotted it, there was only one month in its entire lifetime that didn't have an unscheduled outage; that was February 2022, I believe.
I wonder about this 30-day test that you have planned for Perlmutter, especially given the fact that the system is similarly complex and similarly uses new and untested technology: is that 30-day test window really going to be enough to give you confidence that the system is going to be reliable well after the test is over and passed, and is there any flexibility to change it and make it a longer time window?
Given the newness of these systems, there is always going to be, I think, especially in the early phases... I'm actually a little surprised at the numbers; I'll have to go back and look, because I thought we had more time than that without unscheduled outages on Cori. But no, we do not have that flexibility, because we've already signed the agreement with the vendor as far as the time frames for those tests.
So we cannot extend that date, but we do work constantly on trying to improve the reliability of these systems.
We're hoping to get Perlmutter to be definitely more stable over time. I think the advantage that we will have over Cori in the longer run is that we do have the ability to do rolling upgrades on the system. So our hope and expectation is that, as we stabilize the system, we will be able to do even our upgrades without taking the system away from our users. Part of what we put into our SOW is to be able to do those kinds of rolling updates.
So, one last remark, and that is that on your slide 16, where you describe the process for selecting a new system, I was struck by the fact that, at least as far as I could see, there wasn't something that I recognized as basically getting input from the users. It seems to be mostly communication between you and the vendor.
I just wonder if that was different, and you did specifically get some sort of user input into this process. I bring it up here because this is, after all, the user group meeting. I wonder whether you wouldn't get a strong request for things like reliability as a primary objective, something that you spec out in the beginning more rigorously, with tougher requirements for the vendor, so that those constraints end up balancing
whatever else is pushing for throwing some new and spiffy tech into the system.

So, I think I probably started here at step one instead of step zero, which of course is getting user requirements. We have regularly done requirements gathering from the users, and then we kind of translate that into our requirements for the machine.
Okay, I don't recall having seen that. Maybe it's in some part of the user survey, or...
We have had, and we published, I guess... We have gone through three rounds with all the different program offices in the Office of Science and held what we call these requirements reviews. The last set also included Oak Ridge and Argonne and was called the exascale requirements reviews, and we do draw heavily from the findings in those reviews when putting together requirements for a next system.

I guess I was thinking of something a bit more grassroots: that you ask your user group, and people that the user group is in contact with.
Or ask the users of the system directly through one of your surveys. But this is really just a comment, in my opinion, a suggestion; take it or leave it, I suppose.

So, I understand what you're saying: we could put in a cluster that uses known technologies and is very stable. But part of what these supercomputer centers do is try to make sure that the US is keeping up in the technology realm.
It does result in some instability, especially in the early phases of these systems, and we do our best to try to stabilize those as the system ages. But that is part of what we do: we try to bring in newer technologies that are still in their relatively new phases, so that we can help the vendors get them to become more reliable and useful.
You know, I was just talking with the NERSC-10 team, which will be our next system team, and they are in the business of putting together the draft technical specs for that system. Those are just drafts, and we will schedule a time to present those draft specs to the user group, and then people can comment on them and provide input in that way.
All right, any other questions?

Sure, if the NERSC staff isn't running off, I'd like to voice one of my concerns regarding the retirement of Cori, and that's the loss of the workflow nodes.
I understand that scrontab is supposed to be the solution. I'm concerned that there are at least two known issues. Unless it's been fixed recently, there's the issue where one failing instance basically cancels that line, comments out that line in our scrontab. So any of the reliability issues that you've been poked on today can end up causing me to lose all my automated runs over vacation, because I wasn't around to edit my scrontab to remove that comment.
The other one is the lack of time zone support, and the fact that we have to do things in Coordinated Universal Time, which does not align with daylight saving time. If the US drops daylight saving time, great, my problem is solved; I subtract seven or eight hours, whichever it is. But the inability to have the time of scheduled jobs adjust seasonally to the work day of our team is annoying, and it's going to require either semi-annual edits, where
somebody has to go in and change the scrontab to deal with it, or writing some sort of weird kludge that starts the job an hour early and does a sleep 3600 for part of the year so that it can start at the same uniform local time. Just having regular cron on something like the Cori workflow nodes would fix that for me. I think it was mentioned earlier that having some dedicated subcluster for things like the data transfer nodes, Globus endpoints, things like that, is also something you're looking into, so I'm
just wondering: is it possible to bring back the workflow node concept that, for me, has worked very nicely on Cori? Sorry, very long question.

No, that's a great question. I'm hoping Lisa can maybe... Yeah, maybe I can respond. Yeah, Paul, thanks for the question, and I know that there have been some really painful behaviors with scrontab as it gets rolled out that I run into as well. The one you mentioned in particular, canceling the whole job, is one they're repairing.
We actually have a couple of bugs open with Slurm, with SchedMD, about this behavior, because it's not the right behavior. And actually next week our queue committee is meeting to talk about how we can implement some solutions for that particular problem, so I think that should get better.
Ultimately, we are going to look at some kind of longer-term solution, perhaps involving containerizing user instances, that would handle these sorts of long-running things. I think that is going to be actually kind of exciting, kind of forward-looking.
But in the meantime, we're going to work on making scrontab less painful to use. So I definitely hear you about that, and about the time zone.
Yeah, as I said, there's a "sleep 3600 depending on the current time zone" hack that I'm aware of, but I haven't had the heart to go in and implement something that disgusting quite yet.
Right, yeah, we're definitely looking for solutions in that particular area. The time zone one is actually pretty tough, because the insides of Perlmutter are all in that time zone and Slurm has to talk to it, and it somehow doesn't translate well from the outside.
So we're still working on a solution there. But regular cron works just fine: I have crontabs on machines at Oak Ridge that work just fine scheduling everything in Pacific time for me. It's not like the technology to convert doesn't exist, but I understand it may not be at the top of SchedMD's priorities, and it's not for NERSC to fix; it is definitely for that vendor to fix. I understand the roles there.
Yeah, and, let's see... I mean, the reason we're going with scrontab is that it offers a lot of advantages, in that your stuff will always work even if the favorite node where your crontab lived is offline. You can run into the same problem where, if you're off on vacation and cori21 goes offline for some reason and we have to take it out, all of your crons are going to be disabled and we won't really have any way of moving them. Whereas with this one,
the batch system should move them automatically to running nodes. So I think it offers a lot of advantages for users, but I do agree that we have a long way to go. Right now it's like in the uncanny valley of cron, where it's like cron, but not enough like it, so the differences really stand out. Yeah, and I personally am holding off as long as I can on migrating
my nightly regression testing from a crontab on cori21 to a scrontab on Perlmutter, because I'm hoping that these issues can get worked out. But yeah, come April, I'm going to have to do it. All right, well, thank you.
I see Leon Sean has a hand up. Yes, this is Benson. So, my question: is it possible to use container technology, so users can move their codes from Cori to Perlmutter?
Perlmutter's CPUs are compatible with Cori's Haswell nodes, at least in terms of instruction set. I'm less sure about the specifics of the operating system, because there are differences in things like the networks and so on. It could be that things just work because of dynamic linking.
But I haven't tested it. I noticed recently that there is no Intel compiler on Perlmutter, I think, for all users; I mean, I had to download it for myself and install it in my home folder.
So that might be useful, right? Okay, so access to the Intel compiler is a factor in the difficulty of migrating? Yeah.
Your staff told me that the Intel compiler is not bundled in your system, or something like that. Also, yeah, we don't have a programming environment for the Intel compiler. I believe the oneAPI compilers can be downloaded and installed in $HOME, but I haven't explored it too closely.
All right, Francois.

Hi, yeah, thanks so much for the presentation; it's actually very useful. I have a question regarding something you had on your slide, which says that the performance part of the acceptance of Perlmutter has been completed. So I assume there have therefore been benchmarks run that show the actual performance, and by actual I really mean what can be reached, as opposed to the nominal performance that we see in the specs. Are these benchmarks available? I'm thinking in particular about the interconnect bandwidth, because those are numbers that would really help guide us in what strategy we will take: what kind of specialization to make, whether we want to work on single node, single node multi-GPU, or multi-node, and such decisions.
Well, actually, technically we have not done all of the benchmarks, because we haven't started acceptance testing yet. I think we do have some numbers, but I'd have to go look to see what we have available. Okay, so, Francois, why don't you just submit a ticket and we'll try to figure it out, because I don't have the answer to your question right now, I'm afraid.

Yeah, actually, I do have a pending ticket.
It's been there for some time, where I specifically ask about the reachable bandwidth of internode communications between GPU nodes, which, according to the diagrams on the website, should be 100 megabytes, sorry, 100 gigabytes per second, but I don't know if that's actually reachable. Okay, okay, well, I'll ask people to follow up; we should have some numbers for that.
Okay, John.

Yeah, just following up on the question about the Intel compilers. We had an exchange in the chat, but for people who didn't read that: we had an issue with a code that had vectorization directives that weren't being respected by any of the existing compilers on Perlmutter, and we found that using the Intel compiler gave us a three times speedup compared to the supported compilers, and this is for a CPU code.
I realize Intel is not supported, but it seems like the existing compiler suites are sorely lacking in vectorization, because these are explicit directives; this wasn't asking anything especially hard of the compiler, it was just ignoring them. So we built with Intel, and of course you have to use the MPI that comes with it, but otherwise we got a factor of about three speedup for our CPU code on Perlmutter.
We were told that, because these are AMD chips, Intel is not supported. Right, we certainly do not have plans to support the Intel compiler at this point, but it's not that we're totally closed to making any changes. It's just that compiler support is a combinatorial issue; it explodes what you have to do. Maybe the existing compilers... So, the question from Paul in the chat: what does the Intel compiler vectorize?
We have vectorization directives that tell the compiler to use the vector registers, to loop over, say, four floating-point operations within one register. You look at the diagnostic output from the existing compilers and they just say "unsupported directive" or "ignoring this directive", and you run it with the Intel compiler and it actually uses them. And so there's a three to four times speedup, because you're getting more calculations done each cycle, as it's
packed together in the register. I believe the Perlmutter registers aren't as wide as the Cori ones, but there are still vector registers there that are ignored by the compilers. Okay.
Well, Perlmutter's CPUs have got AVX2, which is the same as what Cori's Haswell nodes have got; they don't have the AVX-512 that the KNL nodes have. Right, right, exactly. But you can still vectorize over them, and the compilers, of course, are not identical between Cori and Perlmutter. So, for whatever reason, the code does not compile with that support on Perlmutter.
Right, so none of them succeeded in supporting these directives, but Intel did. Do you know whether that was the Intel classic compiler in particular, or did the Intel LLVM one work as well? Pretty sure it's the LLVM one; I'd have to check, but it was basically just the latest Intel installation package, including MPI.
Yeah. Did you mention that you have a ticket open? It's probably closed now; we had a long chat with one of the consultants, and they weren't able to really help with the existing compilers, and they said they can't support the Intel compiler. We finally tried the Intel compiler because one of the code developers said that they had good success with Intel on these vectorization directives. And just on Paul Hargrove's comment, yeah.
Sorry, that should have said CPU, not GPU, because these are NVIDIA GPUs; it should have said AMD CPUs. On the SIMD question: the Intel classic compilers at least include a disclaimer in their documentation, which most users don't see and aren't aware of. They actually had to put it in because of a lawsuit, because their compiler does not do some SIMD-type optimizations on AMD, so I was actually kind of surprised to hear you state that it did do it.
But then you said it's the LLVM-based one, the newest one, in which case that warning does not apply; it is definitely related to their classic compilers. So I mixed up at least two things: I said GPU in that text where it should have said CPU, and also, if this is not the classic compilers, then that disclaimer is not present. Yeah, I can dig into that. Sorry, go ahead. Oh, I just remember from back in the day,
there was actually that lawsuit, because the compiler would check what type of CPU it was running on and then would not optimize if it was not an Intel CPU. Exactly, yeah. Even though the CPU was capable of accepting that optimization, it still wouldn't do it. And I think that's still the case with MKL, and there is a module on Perlmutter for enabling a workaround for it, basically tricking MKL into thinking that it's running on an Intel CPU.
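As a historical note (hedged: this applied to older MKL releases, roughly 2020.0 and earlier, and the variable was removed in later versions), the commonly cited workaround of this kind was an environment variable forcing MKL onto its AVX2 code path regardless of CPU vendor. A sketch of what such a wrapper module might set:

```
# Force older MKL releases to use the AVX2 dispatch path on non-Intel CPUs
export MKL_DEBUG_CPU_TYPE=5
```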
Yeah, you know, I can open a new ticket describing our experiences in more detail. I can't navigate quickly enough into what I had done several months ago to specifically answer all the questions about what the directive was, or be 100% sure on which version of the Intel compiler suite it was, without taking another look. But just to comment, our preference, of course, would be to see the supported compilers doing a better job on vectorization. Yeah, thanks for that feedback.
All right, do we have any more questions? So, I'm looking at the time and we've hit 12:30 now, so we've gone a fair bit over our normal time. Thanks, everybody, for a great set of questions, and thanks especially to Rebecca and Tina, who I think needed to drop out, and to Richard and Lisa for a lot of help with answering some of these questions. Maybe for further questions
we can continue the discussion asynchronously via Slack. Right, yeah, thanks, everyone. So, thanks all again; we'll wind up here and we'll see you at next month's meeting. Thank you, and thanks, Rebecca, again for your presentation describing the life cycle of a supercomputer.