From YouTube: NUG Monthly meeting 15 Oct 2020
A: Those of you who joined last month's meeting, or read through the entire message in the emails, will know we've gone to a slightly different format from the previous webinar format: we're going for a much more interactive format.
A: So please participate: raise a hand or just speak up. I think most people are able to speak; keep an eye on the participants list. Just a heads up, we're recording the Zoom, and we'll put the recording up with a link to it via the webinars channel and on the www.nersc.gov page about the monthly meeting. I'd also encourage people to chat away in the NERSC Users Slack.
A: We have a #webinars channel that's intended for this. The advantage of using the Slack is that we can continue the conversation beyond the Zoom meeting, and also sort of record things; after the meeting, in the next day or so, I'll add a summary to that channel of interesting things that came out of this meeting.
A: So our regular agenda for this meeting: we start out with a Win of the Month and then a Today I Learned section (we'll explain those in a minute), go through some announcements, and then the topic of the day for this month is going to be the cscratch1 crash, which I'm sure captured people's imagination and attention at the beginning of this month. Then we'll just go through some upcoming meetings and last month's numbers.
A
So
the
idea
of
the
win
of
the
month
section
is
for
our
users
to
show
off
an
achieve
achievement
or
shout
out
an
achievement
of
somebody
else
that
you
know
of
you
know
things
like
how
to
having
a
paper
accepted,
solving
a
bug
that
had
been
giving
you
some
grief
yeah.
It
can
be
something
quite
big
like
a
scientific
achievement.
Yeah.
A: This is a good source to nominate something as a candidate for a science highlight, or for the High Impact Scientific Achievement award or the Innovative Use of High Performance Computing award. What we're interested in hearing is what you did and what you achieved: tell us your success, and what was the key insight that came from it?
A: I think people can just unmute themselves and speak. Does anybody have a Win of the Month they'd like to show off?
A: So I have one, as a bit of a shout out to one of our users. The new docs.nersc.gov site (I guess it's not that new anymore) is hosted on GitLab, and users can contribute: if you've got something missing or a correction to make, you can make a merge request and submit it. And during our cscratch issues, our first user-contributed merge request came in through GitLab, so a shout out to Tauren Bechtel,
A: and I hope I'm pronouncing that correctly, who spotted some missing information on our current issues page and posted a merge request that we were then able to just merge in. That was a timely contribution to the docs, and we encourage everybody else to do the same when you see things that could be added to or corrected in our docs. What's the site? nersc.gitlab.io; I'll paste that in the chat.
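For anyone who hasn't contributed before, a minimal sketch of that workflow (the repository URL and branch name here are illustrative, not the exact NERSC setup):

    # Fork the docs repo in the GitLab UI, then:
    git clone https://gitlab.com/<your-username>/nersc.gitlab.io.git
    cd nersc.gitlab.io
    git checkout -b fix-current-issues      # topic branch for your change
    # ...edit the relevant Markdown page...
    git commit -am "Add missing info to the current issues page"
    git push origin fix-current-issues
    # Finally, open a merge request against the upstream repo in the GitLab UI.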
B: Hi, this is Koichi. Can you hear me? Hi Steve. I don't have any particular, you know, innovations, but this one's a small thing I just started doing more: it's actually using the NERSC Users Slack.
B: I've been using it often in the past several weeks, and I found it quite nice to see what's going on in the community, particularly when I have some issues, like logging in to the machine; then I just look at the Slack to see how other people are doing, and whether it's a problem unique to me or everyone's having it. So I wonder how many users are using this NERSC Slack, and do they usually use it inside a web browser, or is it more typical to download the app?
A: I'd be interested in what other people do, but personally I use the Slack app. I quite like it, and I have quite a lot of Slack organizations and channels lined up on it. It is good to see that the general channel and a few others seem to be reasonably active.
A: I heard that, apparently, a lot of the activity is actually in direct messages, you know, private messages between users. So users are finding it a good way to communicate amongst user teams, basically.
C: Yeah, I'm Pieter Maris, and we're using the Slack channel as a private channel among NESAP users a lot, for hackathon preparation for Perlmutter; I'm a PI on a NESAP project for Perlmutter.
A: It takes a bit of a run-up, but you can have quite a good effect with these intense activities. Yeah, that's true.
A: That's great, thanks. Okay, so then the other side of the coin is Today I Learned. This is an opportunity to talk about something that happened, or that you discovered, that surprised you and that might be of benefit to other users; incidentally, it might also give us some tips for improving our documentation, things that we could call out or make more obvious. This doesn't have to be a success.
A
This
can
be,
and
I'm
stuck
on
something
yeah
so
yeah,
something
that
you
got
stuck
on
a
dead
end,
something
that
you
really
thought
was
the
case
and
on
further
debugging
turned
out
not
to
be
true
and
that
that
leads
to
a
tip
new
tips,
you've
discovered
for
using
nurse
or
or
or
something
external
that
you've
discovered.
That
might
benefit
other
nurse
users.
You
know
a
good
presentation,
for
instance,
yourself.
D: So this... When we had to move to the GPFS file systems because of the cscratch issues and whatnot: we have one file system dedicated to JGI, but it's basically a read-only file system, and I found that reading from that was about 10 times faster than reading from projectb or from the community file system.
D: There was, you know, a clog, maybe, in projectb or the community file system that was causing the I/O to be much, much slower than what we were used to seeing on Lustre. But the read-only file system was about the same speed as Lustre.
A: Interesting; that's a good tip. And that's one of the JGI-specific file systems, right?
D: Right, right: /global/dna is our read-only file system. You can't write to it from the nodes, but you can read from it, and it was much, much faster than projectb.
A: Yep, that's a good tip. I guess the equivalent, in a way, across the wider NERSC community: if you are building software for your group to use, the global common file system (the /global/common/software/<your-project-name> directory) is similarly mounted read-only on the compute nodes, and it should give you a little bit of performance when starting your software, particularly on lots of nodes. I think part of the reason is to do with the better ability to do caching on the nodes.
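As a rough illustration of that pattern (the project name, Python version path, and package below are placeholders, not a NERSC-prescribed recipe):

    # Install a package your whole project can share into global common:
    module load python
    pip install --prefix=/global/common/software/myproject/env mypackage
    # Jobs then pick it up read-only on the compute nodes:
    export PYTHONPATH=/global/common/software/myproject/env/lib/python3.8/site-packages:$PYTHONPATH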
D: Somebody said something in the chat to that effect: if the GPFS file system is mounted read-only, then it can use all the servers to distribute the load, which makes a lot of sense.
A: Could we use more read-only file systems?
E: So, if you don't mind me jumping in, I can at least speak to it a little bit. The reason this works well for JGI is that JGI has one place that can write to that file system, and we've communicated very carefully that no one should ever change a file that might be accessed in a job that's running. That's because of this very aggressive caching, and the non-awareness that a given I/O thread will have relative to another one.
E: It is a lot faster; it's much, much faster. POSIX I/O is damaging in some ways to HPC performance, so it's worth it, but it takes a great deal of care and planning and preparedness. I think it's a great idea; I've just always been confused about how to best communicate it. It's been a little easier with JGI, but yeah.
F: So, you know, [Globus] is a web service out of Chicago or something for moving data between, you know, scratch and CFS, for example, that's far faster than rsync or just copying. But a limiting factor until very recently was that it didn't work with a collaboration account.
F
It
only
worked
with
individual
accounts
because
of
how
you
have
to
authenticate,
but
yesterday
lisa
got
us
set
up
so
that
we
can
now
use
our
desi
collaboration,
account
to
authenticate
with
globus
and
move
the
data
around
with
a
custom
endpoint,
and
so
that's
really
great
and
opens
up
more
possibilities
for
us
to
use
it
for
doing
productions
between
the
two.
So
thanks
lisa
and
if
you're
a
user
and
haven't
used
globus
to
move
data
internal
to
nurse,
consider
it.
A: Cool, that's a really good tip, actually, even for users who are not using a collaboration account. The mechanism of using Globus to move data around, even internally to NERSC, is very valuable. Globus is a really nice tool, actually; I quite like it. It's great for moving stuff around between sites as well, and being able to set up a kind of fire-and-forget for a large transfer is pretty handy too. Thanks, Steven.
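A minimal sketch of such an internal transfer with the Globus CLI (the endpoint search string, IDs, and paths below are placeholders, not the exact NERSC endpoints):

    globus login                          # one-time browser-based authentication
    globus endpoint search "NERSC"        # look up the endpoint ID by name
    # Fire-and-forget copy from scratch to CFS through the same endpoint:
    globus transfer --recursive --label "scratch-to-cfs" \
        <endpoint-id>:/global/cscratch1/sd/<user>/results \
        <endpoint-id>:/global/cfs/cdirs/<project>/results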
B: Can I ask a kind of question about Globus? [Sure.] This is Cherien. So I'm using Globus to transfer files externally, from other systems to NERSC, and during this transfer, if I look at the event log on the Globus website, it sometimes shows some errors, like a timeout error, or "file system does not allow append." But in the end, when I just wait long enough, all the files are copied and there are no problems in any of the files.
B
So
eventually
these
are
doing
nothing
from
my
side.
It
just
works,
and
I
am
just
curious.
Then
what
are
those
error
messages
and
it's
coming
looks
like
it's
coming
from
the
nas
endpoint,
and
this
is
the
case
when
I
use
globus
to
mask
hpss-
that's
hpss,
endpoint
yeah.
I
wonder
if
this
is
common.
I
did
transfer
quite
an
amount
of
data
in
the
last
few
weeks
and
many
times
I
see
these
errors,
but
eventually
it
just
works.
Fine.
I
wonder
if
that's
the
case
for
other
people
or
most
of
the
users.
G: ...But because of the way that HPSS works, the software, you can't do that: it's a single stream, and so you can't resume an interrupted transfer that gets blocked, for HPSS, for instance.
G
So
one
of
the
things
that
we
recommend,
if
you're
trying
to
transfer
lots
and
lots
and
lots
of
data
to
hpss,
is
that
you
do
a
two-step
transfer.
You
go
to
cfs
or
scratch
first
here
and
then
use
hsr
h-tar
to
put
them
in.
But
that's
you
know,
that's
if
you're
doing
like
lots
and
lots
like
terabytes
and
terabytes
hundreds
of
terabytes
of
data
transfer,
because
these
resumes
can
be
kind
of
bothersome
for
large
data.
G
But
otherwise,
if
it's
just
a
small
amount
as
as
you've
seen,
globus
will
just
try
again
and
it
just
takes
a
little
bit
longer
and
it
eventually
gets
in
there.
A: So I think I got two take-homes from that. One was that when you see these messages, don't panic: the transfer succeeds in the end; it's not broken. The other is that if you're seeing them a lot, which is more likely to happen with really large transfers, and particularly to HPSS, doing a two-step transfer, where you first transfer to cscratch or CFS, should avoid them.
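For reference, the second step of that two-step pattern might look like this (archive and directory names are made up for illustration):

    # After Globus lands the data on scratch or CFS, bundle it into HPSS:
    cd /global/cscratch1/sd/<user>/incoming
    htar -cvf run42.tar run42/        # create a tar archive directly in HPSS
    hsi ls -l run42.tar               # confirm the archive arrived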
A
So
I
guess
one
more
today
I
learned
from
the
nurse
side
of
things,
and
I
guess
this
was
well.
I
certainly
learned
it
and
I
guess
others
did
as
well
when
we
were
working
with
corey
in
read-only
mode
and
yeah.
I
think
that
our
users
helped
a
bit
in
discovering
this
yeah.
We
we
found
a
few
things
where
we
have
dependencies
on
scratch
and
in
at
least
some
of
the
cases
we're
able
to
mitigate
those
some
some
of
them
were
yeah.
A
We
can
flag
to
look
at
in
more
detail
so
yeah,
for
instance,
shifter
use
a
c
scratch
for
some
of
its
staging
of
images
and
that
one
that
one
will
take
a
a
little
bit
more
of
a
run.
Up
to
you
know
to
work
around.
I
guess
for
time,
says
c
stretches
not
available.
I
think
you
know
a
few
people
noticed
when
we
first
went
into
the
kind
of
debug
mode.
Where
see
scratch
was
unavailable.
There
were
some
issues
with
logins
hanging,
and
our
systems
group
was
able
to.
A: So we have a few of these from the NERSC side, and then we'll kind of open the floor for announcements from users. This is an opportunity: if you're hosting a conference, for instance, or a meeting or event that other users might be interested in participating in, this is a good opportunity to announce it. First of all, there are a few that were in the weekly emails; you should be able to dig these up in the email or on the announcements page at www.nersc.gov.
A: Those of us who were at last month's meeting would have seen Zhengji's presentation about checkpoint/restart on Cori, and during that she made the first announcement of a new conference, the First International Symposium on Checkpointing for Supercomputing, SuperCheck21. There's a CFP for that, and you'll find links to it through the weekly email. Also, if you're interested in being part of the HPC community in a wider sense:
A: The SC21 steering committee has a call for planning committee volunteers that's currently open. And we have a very new announcement (we'll send something around with a little more detail shortly): we'll have a pause of the job queue, and cscratch will be temporarily unavailable for a few hours on Monday morning, while we add a new ADU. We'll actually talk a little bit about what that means very shortly; it's to help with debugging, following the outcomes of the file system crash that we experienced at the end of last month.
A: Another big one: I think you've probably seen this a little bit already on the NERSC Users Slack, and it's also been in the weekly emails, and this will soon become the default, but we really encourage people to try it out. We have a new help portal for our ticket system. It gets you fairly quickly and easily to tickets, to documentation, and to common requests. It's got a fairly decent search bar, a much more user-friendly interface, and when you do open a ticket,
A: it goes through to a much more usable form for doing so. So we think this is a great improvement, and we're really keen to have users try it out and tell us any issues or glitches that you notice. We're hoping to make this the live destination for help.nersc.gov quite soon now. It's at nersc.servicenowservices.com/sp ("sp" for service portal); probably easiest is the link: there is already a link to it in the general channel.
A: That's all the ones that I know about at the moment. Does anybody else have any announcements or calls for participation they'd like to make?
A: Okay, if not, then we'll go on to our topic of the day. I think this is a very topical one that people will be interested in, and I'm going to pass over to Doug Jacobsen, who's the leader of our Computational Systems Group.
A: He heads up the group that does the systems administration, if you like, for Cori and associated systems, and he has quite a lot of knowledge about, you know, what happened and why, at least to the degree that we know. I'll pass over to you, Doug. Are you able to share, or do I need to enable something?
E: Oh, good thought. All right. So: cscratch1 crashed, as you know, and it caused the system to be down quite a lot this last time. Now, I am happy to say that we found a way to remain in some form of operations, and we'll talk about what that means through most of this. But just to work through the slide a little bit, to give you an idea of what actually happened:
E
So
the
structure
of
the
lustre
file
system
is
such
that
it's
comprised
of
about
two
is
comprised
of
258
servers
and
248
of
those
store.
Your
data
and
six
of
them
store
your
metadata
or
manage
the
file
system
in
some
way,
and
then
two
of
them
manage
the
whole
rest
of
that
cluster.
The
actual
file
system
is
a
cluster
all
of
its
own.
That's
the
name.
Luster
sounds
like
cluster
right
anyway.
E
So
when
we
look
at
the
structure
of
the
file
system,
we
have
sort
of
you
can
sort
of
the
easiest
way
to
think
about.
It
is
to
divide
it
into
two
groups
of
nodes.
One
is,
is
metadata
nodes
data
about
your
data
and
the
rest
is:
is
object,
storage,
nodes,
your
data
and
so
the
metadata
nodes.
These
are
the
things
that
provide
all
of
the
structure
of
the
file
system
that
you
see
when
you
change
directories
to
global
c
scratch.
One
sd
slash
your
username.
E: That's really important, because it is keeping track of what all the data on the file system is and where it is. One of the interesting features that we'll talk about today is that with Lustre you can stripe the data. This is to say you can say: you know, I don't want my entire file to be on the first OSS; I want, you know...
E: ...[to spread it over, say, four]. One benefit is that you get better performance, because now four servers are talking to you instead of just one. But also, if you're doing a sequential read of the file, you're actually being a good citizen in many ways, because if you put one single large unstriped file on the system, that OSS will talk only to you, and it can be completely taken over by your process trying to read that file. So striping also gives all of your neighboring users a chance to get their data as well.
E: So you get better performance, and everybody else does too, when we do striping. There's a reason I'm telling you about this, but okay.
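For reference, striping is controlled per file or directory with the lfs tool; a minimal sketch (the stripe count of 4 and the path are just examples):

    # New files created under this directory will be spread across 4 OSTs:
    lfs setstripe -c 4 /global/cscratch1/sd/<user>/mydata
    # Inspect how an existing file is striped:
    lfs getstripe /global/cscratch1/sd/<user>/mydata/bigfile.h5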
E: So what was happening in this crash is that our metadata server was crashing. Now, you'll see that we have a big metadata server and, apparently, a little metadata server. It's called an ADU, which stands for Additional DNE Unit, which, as you may have noticed, has yet another acronym wrapped inside of it; let's just not worry about that.
E: The point is that these are other metadata servers. So when we created your scratch directories here at NERSC, about 30 percent of you ended up with all of your data on the primary MDS, and the rest of folks ended up getting sort of shunted off to these other ADUs; we have four of them. Unfortunately, in order to get to your directory, in order to resolve /global/cscratch1/sd up to that point, you've got to go through this one [the primary MDS].
E: So no matter what, even if you're on one of these ADUs, you still have a dependency on the primary MDS, and it was the primary MDS that was crashing. So it's very, very disruptive. Okay, all right, apparently you don't get to click... okay. So, what actually happened. Actually, I happened to be on call for almost this entire incident, so that was interesting, at least for the two production crashes. So on Thursday, September 24th, it crashed. And, you know, we've seen this before; well, we've seen the metadata server crash.
E: It's not great, but you know, it's a very complex system comprised of very complex software, and software sometimes has bugs. So we filed the normal critical case that we would, and unfortunately, when we tried to bring it back up (we do have failover capabilities for all these servers), it couldn't fail over automatically. That's because one of the safeguards that makes sure all the system data is secure and safe and well written out was damaged in this crash.
E: That's the file system journal. And so that forced us to go through several different runs of what's called an fsck, a file system check and repair. Each fsck takes eight hours to run, and to do it correctly for our system, we have to run it three different ways, three times.
E: So it basically takes 24 hours of checking, and if any errors come up, you know, those have to be manually resolved, and so it ends up taking about 36 hours to recover from a metadata server crash where it can't fail over. It's quite disruptive. So we were very happy on Friday: we got to come back into production. And then on Sunday it repeated again. At the time, both of these were right around noon, and we were assigning a lot of significance to it being around noon.
E: That ends up not being, as far as we know, the issue. And so at this point we were nervous, because it was exactly the same crash: when we looked at the Linux kernel stack trace, it was identical. And so the concern was that some unknown portion of the workload was sort of reproducibly touching this, and would potentially put user data at risk, and so we didn't feel that we could bring the system back into production.
E: So we went through the repair again, but we didn't have any immediate direction forward. On Tuesday we worked out a plan with HPE (HPE is our vendor; they used to be called Cray) wherein the systems team, my team, would basically completely rewrite all aspects of how the system is accessed. So we put in a big day: we changed all the Slurm queues, we changed all the login policies.
E: We changed every aspect of how people get in, in order to create this debug-mode capability, which we'll have at the ready if we ever need it again, though it's our goal not to. At the same time, HPE put together a special debug kernel that would help them zero in on what the problem was. And then, finally, some of my team, and a lot of HPE, were dedicated to trying to look at the actual crash memory dumps, to try to understand what might be causing it.
E: [The plan was then to] take action in order to try to crash the machine. So it made it kind of weird: normally our goal is to keep Cori up as well and as efficiently as we can, and now our goal was to crash Cori as well and as efficiently as we could. And it took about another 36 hours, from Wednesday when we came back until Friday at 8 a.m., to crash the system again; so we reproduced it.
E: Unfortunately, the journal was damaged again, and so it took another 36 hours to repair. Almost immediately after they repaired the journal, we brought it back up, on Sunday.
E: What we did is we made Cori available without cscratch1 at all. That's not to say it was as useful as normal, but at the very least some useful work could be done with the machine, and I believe we may have made that time available as free time, so that's always a nice thing. And what we changed when we got in, after we repaired it again, is that we removed everyone from the machine. We did that so that we could be sure that, now that (with your help, with all of the NERSC user base's help) we had identified the correct workload, we could isolate it down to a particular synthetic workload.
E: At the same time, HPE gave us some additional capabilities with that debugging kernel, to try to ensure that the journal would not get corrupted. That did not happen, unfortunately, and so it once again took another 36 hours to repair the disk; actually a little longer that time, because of some complications that came up. But in any case, on Thursday the 8th, we decided to return to normal production. What we did is we took our reproducer workload and we ran it without the secret ingredient that we had figured out.
E: [We ran it at] the typical load that the system operates at, and so load alone is not the crashing, sort of, associated issue. That gave us the confidence we needed: basically, as long as we avoid the thing that hurts, we won't crash. And we've been working very carefully with a number of people to try to make sure that we avoid that behavior. Okay. So what did we get out of all this? There is no root cause yet; we don't know exactly what is causing this.
E: We're sure of a couple of things. One is that this is a bug, and it's a bug at the kernel layer, the Linux kernel layer. Now, that could be in a kernel module like Lustre, or it could be deep in the I/O subsystem itself. The Cori scratch system runs a modified version of Red Hat Enterprise Linux 7. It is modified, but it's not super modified, so it's a well-tested kernel, highly regarded as being reliable; that should be generally okay.
E: However, like I told you, we were doing some pretty deep investigation of these memory dumps, and what we found, very consistently, is that the specific invalid value was changing a little tiny bit, but it was always a very specific pointer in a very low-level I/O data structure: a data structure that was, you know, basically about to be written to disk.
E: That does sort of imply a couple of different methods for how it might be getting modified. The details don't actually matter here, but the point is that it's very deterministic: it's always the same crash; it's not random. And when we looked at the pages that were modified as being associated with those I/O structures,
E: they did reveal a particular application as being correlated, and that really sped us along in terms of identifying that reproducer stack. What the application was doing is performing a lot of unlink operations in parallel, and the particularly unique item here is that this was happening when the files were striped over many, many OSTs; so, basically, all of them. And this was not an intentional configuration.
E: However, and I just want to stress this: there's nothing wrong with what this application was doing. It just happens to be tickling this bug, so no problem there; we're going to solve it. We're going to fix the bug. We're still not going to recommend that workload, though. But the specific mechanism of how this workload is causing the pointer corruption is unknown.
E: We do not know; neither does HPE. But they're working on it, and they've got a lot of people dedicated to this project. Throughout this incident I was talking to them every single day, sometimes twice a day, as were a lot of people at NERSC, and at this time we are still meeting with them three times a week while we're sort of moving into the next phase.
E: We'll talk a little bit about that. But based on what we learned, it's somehow tied to using many, many OSTs, and if I'm forced to speculate, there's a couple of different paths that have been discussed. One is that when the stripe count is extremely high, extended-attribute inode blocks must be used, and it may be that this is related:
E: That in some way, when doing deletions, pulling in these additional extended-attribute blocks may generate the kind of race that we're seeing. Another possibility is that there's a big, dramatic increase in the complexity of messaging when deleting files, when you have to talk to a lot of OSTs and maintain locks.
E: So, you know, very naively: we think that it's completely safe to use one stripe or two stripes, and there's clearly some risk using 248 stripes. It's hard to know what the distribution is within there, but everything that we know tells us that using our largest recommended size, the so-called stripe_large that's in our documentation, which is a stripe count of 72, should be safe. However, for the time being, we're asking people to not stripe higher than 72.
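If you want to check your own stripe counts against that guidance, a quick sketch (path illustrative):

    # Default stripe settings on a directory:
    lfs getstripe -d /global/cscratch1/sd/<user>/mydata
    # Just the stripe count of every file under it:
    lfs getstripe -c -r /global/cscratch1/sd/<user>/mydata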
E: We are working on a file system scan to detect the high stripe counts, so that we can directly contact users, but also to identify damaged inodes that were triggered during this. I am happy to say that almost everything we're finding is in my directory, where we ran the reproducers, or in some of the other known places that were associated with this issue. There have been a couple of other user files, and we'll talk about what the recovery will be for them. But essentially the message we want to send is: contact us.
E: If you see something weird with the file system, HPE has options and can work with us on each and every thing. And it's very important to understand that there is no user workflow that's actually causing this; the error is deeper in the system, and, like I said, for now our goal is to avoid these conditions. So, there have been a couple of after-effects, and one thing I want to be clear about is that we actually don't know that this is related to the crash.
E: But people have noticed that, since we came back into production, sometimes login nodes will hang. And by "hang": perhaps you were trying to access a file on cscratch1, perhaps you were trying to ls, perhaps you were trying to submit a job (because submitting a job actually talks to Lustre to check your quota), and in some cases that just doesn't work anymore.
E: What we found is that the Lustre client can only handle a limited number of simultaneous change requests to the metadata server, and it's possible that accessing one of these damaged files may be, you know, basically deadlocking one of those potential RPCs. That's unclear; it could just be a different bug. So we don't know if it's related or not, but they happened one after the other, so it's hard not to assume that they're related to each other. As a mitigation:
E: What we're doing is we are monitoring the number of RPCs in flight, and then using a number of techniques to try to identify a login node that's impacted, and we'll try to reboot it as soon as we identify that it's failed. The idea being that, at that point, we know that login node is no longer useful for submitting jobs, no longer useful for accessing scratch, no longer useful for data transfer, and so it'd be better
E: if we just crash it as soon as possible and get that debugging information over to Cray. The correction, you know, will sort of depend on what the problem is. If it is related to the damaged files, then the check and repair of Lustre, which is at a different layer than the check and repair of the metadata system, will complete over the next week or so, and in that time we should know more.
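For the curious, the client-side RPC state being monitored is visible through lctl; a rough sketch (exact parameter names can vary with Lustre version):

    # Cap on concurrent metadata RPCs for each MDC device on this node:
    lctl get_param mdc.*.max_rpcs_in_flight
    # Histogram of how many RPCs have been in flight:
    lctl get_param mdc.*.rpc_stats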
E: [If you have one of these damaged files:] if you try ls -l, you might see all question marks for the mode of the file or the size, and if you try to delete it, it says it's not a file or directory, which is kind of weird. You can move its parent directory around, but you can't move the file; you can't rename it. We are working with HPE to repair and recover those, but the best thing to do for now is just to rename the directory they're in, or just ignore that they exist.
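Illustratively, such a damaged entry behaves something like this (a hypothetical listing, not output captured from Cori):

    $ ls -l
    ls: cannot access 'results.dat': No such file or directory
    total 0
    -????????? ? ? ? ?            ? results.dat
    $ rm results.dat
    rm: cannot remove 'results.dat': No such file or directory
    $ cd .. && mv rundir rundir.damaged   # workaround: rename the parent and move on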
E: If you have any questions, please go to ServiceNow, help.nersc.gov, to file a ticket. Okay. So, I want to point out that, like Lisa, many people have been working on many different aspects of this; I'm just sort of reporting it. Okay. So, debugging this has been extremely challenging, and we have to fix it, because, you know,
E: we've identified one way this bug can happen, but we don't know that this is all the ways that this bug can happen. And so, what are we doing right now? One thing that we've been doing the whole time is trying to reproduce at a smaller scale. As it's become increasingly clear that the number of OSTs seems to matter, it's becoming very unlikely that we can reproduce this on our small test system, which only has four OSTs.
E: So that's probably not going to work, so we're going to have to use Cori. The next problem is that it's a very long debug cycle: it takes about 36 hours for us to boot the system, crash the system, and then repair it, but it takes even longer for HPE to analyze all the results, at the level of verbosity and complexity of the debugging kernels that they're building. So crashing the whole MDT is not actually a useful activity.
E: At this point, it's disruptive to you, and it's not getting us the type of iteration time that we need, and with such a complex system, we need to basically be able to do a lot of very rapid, well-designed experiments, to try to start teasing out what the fundamental issue is.
E: So that's been really important, that level of engagement; I really want to say thank you for that. So: we are running instrumented kernels, in case something crashes now. But, more importantly, HPE is sending us a new metadata server... oops... a so-called ADU. This is one of the small ones, which we will use to isolate just this workload.
E: This should allow us to take about 200 nodes of Cori, we suspect, plus that ADU, plus a little tiny slice of all the OSSs, and then crash just that ADU. And since no other user data will be on it, that will have two important aspects. One is that it will be very quick to repair, because it won't have, like, two billion files on it. Second, it won't have any of your data on it, so you won't notice it, and it won't cause logins to hang or anything else.
E: So that's our plan. In order to get that done: we actually just today, right after the Great ShakeOut completed (you know, after the HPE engineers were able to get out from under the table), were able to install the new ADU. It just came today, and it's been installed into our data center and into cscratch1. But we can't add it to the system until we can quiesce the whole of cscratch1, and so on Monday morning, at seven a.m.,
E: we're not going to reboot Cori, but we're going to stop all the jobs, we're going to kill all of the login sessions, we're going to unmount cscratch1, and we're going to add this new ADU. And then, at that point, we can begin the debugging experiment alongside all of everyone else's work.
B: Can I just ask a very quick clarification question about just one acronym I saw in the slides a few back: RPC. What does RPC mean? I think you mentioned that when we... yeah, sorry.
E: [An RPC is a remote procedure call.]
B: And that happens whenever we submit a job or access the scratch system, actually?
E: Yes, absolutely. Any time you ask to open a file, that's going to send an RPC to the metadata server, and the metadata server will then send more to the OSSs. When you go to submit a job to the system, sbatch will submit an RPC to the Slurm controller software to add that job. So actually, I tend to think about file systems and Slurm in a very similar manner: they don't solve the same problem, but they use very similar techniques.
C: I understand there are good reasons that, when you log in to NERSC, it tends to send you to the same login node that you were on last time. But if that login node is hanging, and you haven't noticed it and rebooted it yet, is there a way that I can clear the history, so I get a random login node?
E: So I'm going to answer both of your questions: I'm going to answer the question that you asked, as well as the question that it implies to me. The short answer is that it's based on your IP address, and so, unless you can change your IP address, it's not going to be easy to change to a different login node.
E: However, what we can do, and what we have done: the load balancer, that's a particular piece of hardware (I'm nervous about touching that particular piece of hardware, because every time we do, it seems to generate new excitement for us), and what it does is it talks to each of our login servers every couple of seconds and says: are you up? Are you up? Are you up?
E: And we've never done much with that up-or-down check until now. So we identified two different ways that we can see if the mount point to scratch is hung, without blocking that process. One is looking for a well-known login process that can hang (that's the thing that you're noticing you're getting hung on) and whether we see five copies of those that are still in the process table.
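A toy sketch of that first check, framed as a load-balancer health probe (the process name is hypothetical and the threshold of five is taken from the description above; this is not NERSC's actual probe):

    #!/bin/bash
    # Count lingering copies of a login-time process known to hang on a bad mount.
    count=$(pgrep -c -f "login-quota-check" || true)   # hypothetical process name
    if [ "${count:-0}" -ge 5 ]; then
        exit 1    # nonzero exit: tell the load balancer to mark this node down
    fi
    exit 0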
E
So
that
way
you
even
if
you
were
assigned
to
it
once
we
once
it's
marked
down
in
the
load
balancer
no
new
sessions
will
go
there
like
I
said
I
think
it's
happening
every
two
seconds:
okay
and
then
the
second
thing
that
we
did
is
we
also
we
were
able
to
identify
a
signature
at
the
lustre
layer
itself
so
that
we're
not
waiting
for
users
to
necessarily
you
know,
walk
you
know,
walk
in
the
front
door
and
find
an
unwelcome
environment.
E: That's now responding to activity that was already on the login node and already, you know, failing, and we're marking the node down in that case. And then, finally, what we've done is we've engaged our operations staff. You know, we have site reliability engineers that work 24/7 at NERSC, and they are...
E: they are now monitoring very carefully for this new health check, and they are empowered to reboot these nodes as soon as they notice that it's happening. It still has to be a manual process, because we need to collect debugging information in order to solve the problem. But during daylight hours, from 7 a.m. to 9 p.m., this is an urgent issue for us, and overnight it's handled only a couple of times.
A: So we've officially got about one minute left. Doug indicated that he can stay on the meeting for a little bit longer; I can stay on a bit longer, and a few others might be able to as well. What we might do, though, is: I'll just re-share this screen, we'll quickly run through the last couple of items and then return to Q&A, and people are free to stay and go according to their schedules.
A: So, the last couple of items. Coming up, for the third Thursday [next month], we're interested in topic requests and suggestions, but I think we can take this offline; please make suggestions or requests on the NERSC Slack, particularly the webinars channel. And the other item that we finish up on is last month's numbers. For September, you can see that Cori's scheduled availability took a bit of a hit, for reasons that we've been talking about; HPSS and CFS were still all very good. In the timeline...
B: Okay, can I ask another, more general question? So, this Cori issue is quite special, and hasn't really taken place so frequently in the past, I believe, but something similar could still maybe happen in the future. So I was wondering if this incident can affect the future planning of computing systems in particular. Right now, my understanding is that, given the space at NERSC,
B: we can install two big systems, like Cori or Perlmutter, I guess. So when we change one of the two, like with Perlmutter, we have to decommission the old one and move it away to make space, and then move in the new system. So while we are doing that, we have only one system, and if something like this happens to that one system, then it affects users' productivity quite a bit.
B
But
if
I
don't
know,
if
you
have
enough
space
to
have
another
more
intermediate
systems,
that's
not
maybe
too
big,
but
many
of
our
application
doesn't
really
need
that
much
big
systems,
even
though
that
helps
queueing,
but
our
job
itself.
That
doesn't
really
need
too
big
systems,
so
maybe
during
the
daytime,
during
the
you
know,
standard
time,
that's
more
like
for
analysis
or
smaller
jobs,
and
then
we
separate
maybe
scotch
system,
just
like
eddie
anderson
corey
used
to
be
and
in
that
direction
that
might
give
more
resiliency
to
the
overall
system.
B: So that's sort of what I was feeling. But having said that, I really appreciate how you guys took Cori back to production, at least, even without access to the scratch space. Actually, those two weeks made me able to do some analysis and put up the presentation which I gave this week; we're having a PM meeting this week for one of the Office of Science programs, and it really made one nice slide in the last three days before the conference.
E: Right. What I can say is that, you know, clearly this was a major bug, and it's a bug at a very deep layer of the machine. It is challenging to predict exactly... We're working very carefully with our vendor, and internally, on sort of crafting our longer-term plan as a result. Resiliency for Perlmutter is a central goal for that system.
E: The way that I'd like to phrase it is that our goal for Perlmutter is continuous operations. So, short of, you know, a facility outage (like we have to shut down the power because of a PSPS event, or something, or one of these facility things), our goal will be to try to keep Perlmutter online to some level. Now, I also recognize, though, that this particular bug is...
E: I'm hoping that it's a once-in-a-lifetime-per-system type of bug. So basically we're trying to balance, sort of, you know, risk versus resiliency, because we can't be resilient to all possible things.
E: Exactly for the reasons that you talk about, Cori and Perlmutter are fundamentally different systems as well. Edison and Cori, while they had different processing elements (Ivy Bridge versus KNL) and, of course, very different scales, the way that we operated them was identical.
E: In fact, we used identical software for both machines at the system layer, and so it really gave us a nice A/B test strategy: bring it to Edison, let it hang out for a little while. That said, this isn't thought to be a result of an upgrade; this is thought to be more of a bug that's always been there and was expressed by a change in the workload.
E: So these kinds of things will be hard, at this scale, to avoid in the long term. So anyway, I hear you, and I just want to assure you that we're thinking about different ways of dealing with these types of things in the future.
I: Hi, this is Ramesh from Argonne, and I'm doing some computations on Cori. During the course of my conversations on chat, I was actually chatting with somebody at NERSC when you had this problem with your file system initially, about a month ago, and we were actually just comparing notes, because, being from Argonne, I'm familiar with a system which is fairly similar to Cori: the Theta system. Over here we have recently upgraded our operating system, and before we upgraded the operating system...
I: So the suggestion that I had made was that perhaps some of you might want to actually contact the ALCF folks, just to see if there was something that they did which could help with the situation that you're facing, which also primarily seems to stem from the OOM killer associated with Lustre, the file system. So it's just a suggestion; I thought I'd just mention it. I'm sure you're far more conversant with what you're doing than anything I might suggest, but I'll tell you anyway.
A: Thanks very much. So, you're talking about the login hangs, or the cscratch issues, or a different issue?
I: Well, there were login hangs; some of the symptoms I didn't notice at my end. Of course, let me also preface this by saying that I'm a recent user of Cori: I've been on Cori for about a month now, because we have a DOE BER stimulus-funded COVID-19 project, which has some fairly aggressive timelines, and I'm trying to get some of those computations done on Cori, because I just can't do it on Theta.
I: Another problem that I had initially noticed was that I was actually not able to submit my job at all, and that was actually a problem with, I think, the job scheduler, which in turn was affected because there was a problem with the file system.
E: Very close. However, the job scheduler doesn't actually talk at all to Lustre; what does is sbatch itself.
E: That is a rare occurrence, when it's unavailable, but because of the issues we've been having, it's not been so rare. A thing that I haven't talked about, but, you know, we are working on: removing, at the very least, the login hang if cscratch1 is not available.
I: Okay, okay. The other thing I thought I would mention was that initially, when I had gotten onto Cori, I was experimenting a bit with file striping, and of course I had actually used a striping script that you have. I didn't stripe as wide as 72; it was far more nominal than that, just to see how my write speeds would improve. And I was wondering if it's still okay to do that, or if you don't want us to use those scripts at all, with regard to striping.
A: The scripts are still good; they're still good. The scripts only go up to stripe_large, which is a striping of 72, so we kind of ask that you don't stripe more than that, because there's an increased risk with increasing the stripe further. But we're pretty comfortable with that.
A: So, NERSC actually only informally monitors the NERSC Slack; it's not actually an official support channel. We kind of encourage users to chat via it, because it's good for interaction and it enables our user community to help each other a bit, and we do sort of check in on it occasionally to see what is going on.