From YouTube: Ceph Science Working Group 2021-03-24
Description
A
Let's kick off. I'll start with my little spiel here. I put a link to the pad in the chat of the video conference; there you can add topics or any notes you want during the call. If you haven't been on this before, this is just an informal conversation between a bunch of us Ceph users in HPC and HTC.
A
And, I guess, big clusters in general. Keep in mind that the meetings are recorded and posted to the Ceph YouTube channel, and just feel free to speak up if you have any topics, thoughts, or anything you want to talk about.
A
So the top two things that I always ask: has anybody had any recent outages, or any serious bugs recently, that would be useful to be aware of?
B
Well, based on the IRC channel discussions, the MDS issues with CephFS are the kind-of-annoying thing; at least it sounds like people are having issues when they're bringing their MDSs back up on a cluster, and things like that. I think someone here mentioned the same kind of issues last time.
C
Yeah, that was probably me. I was talking about how the rejoin step on an MDS can use a lot of memory. We did this again recently, and for an eight-gigabyte-RAM MDS it used something like 60 gigs of RAM when we failed over to a standby.
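A minimal sketch of keeping an eye on that, assuming the standard Ceph CLI and an admin keyring; the daemon name 'mds.0' is a placeholder:

```python
# Read the configured MDS cache limit (rejoin can transiently use far more RAM
# than this on the standby that takes over) and ask one MDS for its current
# cache status.
import subprocess

def ceph(*args):
    return subprocess.check_output(["ceph", *args], text=True).strip()

print("mds_cache_memory_limit =", ceph("config", "get", "mds", "mds_cache_memory_limit"))
print(ceph("tell", "mds.0", "cache", "status"))
```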
B
Well,
there
is
no
backlog
on
oil
or
irc,
so
there
was
just
someone
that
had
some
issues
with
with
a
box.
Johnny
might
have
some
maybe
pointers
about
that,
but
there
was
someone
having
issues
with
with
a
server
with
a
cluster
which
is
totally
down
and
it
was
mds
related
stuff.
Also,
there.
A
C
Did anyone see any bugs there? I think we're planning to upgrade to 14.2.18 soon. It looks pretty good; I don't know if anybody has had real big problems with it. There's one known issue with the interface that it binds to: for some people it binds to the loopback interface, because of some commit that changed.
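A quick, read-only way to spot that symptom, sketched on the assumption that checking the registered OSD addresses is enough; the field names are the standard ones from 'ceph osd dump':

```python
# List each OSD's registered public address and flag any that came up on
# loopback.
import json, subprocess

dump = json.loads(subprocess.check_output(
    ["ceph", "osd", "dump", "--format", "json"], text=True))

for osd in dump.get("osds", []):
    addr = str(osd.get("public_addr", ""))
    note = "  <-- bound to loopback?" if "127.0.0.1" in addr else ""
    print(f"osd.{osd['osd']:<4} {addr}{note}")
```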
C
A
E
We don't have any Nautilus clusters; we're going Luminous to Mimic to Octopus at the moment. We've done it for the test cluster, so I don't think we're going to land on Nautilus at all.
E
B
B
I was wishing and hoping that the bug I was already referring to on the etherpad would have gotten some fixes, because it's been there quite a long time, and Rafael even pointed out to me that there is a bug fix on the master branch. It's not a point-and-click solution for Nautilus yet: the structure for the user tenant is different and it would require some coding, and frankly, I don't have the time to make that code change to backport that fix.
B
Well, we have five different clusters, and there are legacy client interfaces that we need to support, so I'm not confident going to, like, a master release or Octopus yet with that system, or systems. That's why I was asking about mixing and matching different versions. So, has anyone...?
E
So the upgrade instructions usually have the RADOS Gateway being the last thing to be upgraded.
B
C
On our side, our plan is to upgrade to the last Nautilus, so 14.2.18, because that will let us immediately do some moving around of the hardware that we have (we have some new machines) and move a lot of PGs around, and then from there, yeah, I think Octopus is okay now, right? I don't have big fears of Octopus. There were some scary S3 things in Octopus, like objects disappearing.
C
E
C
F
We are still on Nautilus, only backporting some features or bug fixes currently, and we're staying on Nautilus 14.2.11, so pretty old.
F
There are a lot of dependencies on Python, because on Focal, Ubuntu is using Python 3, and Nautilus presumes Python 2 everywhere, so we have to change every line and dependency inside the build process; but as of last week it's running the unit tests.
F
Because we have some systems running on Ubuntu 16 and 18, and from 18 it's straightforward to go to Octopus, but from 16 we can't, because it's not compiled by the community. So we decided to upgrade all of the operating systems in all clusters to Ubuntu 20 and then upgrade the clusters to Octopus.
C
C
F
Oh, it's going to be stable, maybe, but we will see; as far as I know it's still a really new version, I think.
B
C
F
Oh,
we
started
now
to
lose
with
second
stable
release
and
we
find
the
back
corrupting
osdisk.
F
F
F
A
A lot of useful tooling, more cephadm advancements, and just a whole lot of small stuff for performance improvements, really, is what it seems like.
E
B
A
I think they said that's coming in Quincy, where cephadm will talk to an agent that runs on every node, and those get pushed up to cephadm instead of doing some polling method like right now. So I think that's supposed to improve things a lot, but that's a...
D
C
H
D
I
D
I
Bring it on, Matthew. I saw that and I had my thumbs up the whole way, because we'd really like to keep everything really simple, so we can see what the OSDs are doing, and we hardly use containers anywhere in our infrastructure. We're scientists and we just have lots of data, probably like you guys, but it adds another layer of abstraction which complicates management from our perspective, and, you know, I was depressed this morning trying to use ceph-deploy on an Alma 8 box.
J
I've got a question as well: with the new version, do you have to use cephadm, or can you control it by some other means? We're using Nautilus on managed Linux systems and we don't use ceph-deploy either, so we've got our own scripts to do all that, and yeah, if that had to change too massively, that might be quite tricky for us.
E
E
A
And it seems like there's still a bit of hesitation from a lot of us with big clusters to use containers for everything.
B
I can try a containerized dev cluster with our fresh eight-rack setup, but I'm pretty sure that we won't put that in production, because we've been having issues over the years, and debugging a containerized environment is pretty hard. If there are real big issues at the hardware level, the additional abstraction layer between the hardware and the Ceph cluster will maybe give us too much complexity at the scale we are installing the systems at, versus a small scale like a 15-node dev cluster.
H
I think the biggest issue we had was that maybe one of the versions generated too many logs within the container; it might have had a debug mode on where it was just creating lots of logs, and so you'd have to restart the containers periodically to clear them. But I think that's been fixed.
H
H
These nodes have 256 gigabytes of memory on them in total, and they have 16-core CPUs.
B
B
Because I'm pretty sure that, well, we are running bare-metal OSDs, and we've had issues with memory constraints earlier. For example, we had a downtime, and there were some issues with a PG log that grows too fast, or too much, and it required a lot of memory on those nodes. So that's why I'm interested in containerized OSD memory.
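For reference, a minimal sketch of checking the memory-related OSD options that usually matter here; the commented 'set' at the end is purely illustrative:

```python
# Read the current OSD memory target and PG-log trimming bounds from the
# cluster configuration database (read-only by default).
import subprocess

def cfg(option):
    return subprocess.check_output(
        ["ceph", "config", "get", "osd", option], text=True).strip()

for opt in ("osd_memory_target", "osd_min_pg_log_entries", "osd_max_pg_log_entries"):
    print(f"{opt} = {cfg(opt)}")

# Example only: cap the OSD memory target cluster-wide at 4 GiB.
# subprocess.run(["ceph", "config", "set", "osd", "osd_memory_target", "4294967296"])
```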
A
Just to have it easy, so you're not so OS-dependent. You can have everything, all your libraries are there, and there are fewer variables involved in running your daemons, yeah. It makes troubleshooting easier for the developers, because they know how the image was built; it's not, you know, somebody's cluster that could be built in any way, with any number of variables involved.
B
B
F
F
A
It looks like somebody wanted to talk about the manager and the balancer steering and whatnot.
C
A
On to the other thing, yeah. So another thing from Sage's talk was that, instead of doing a Cephalocon, which obviously isn't really possible this year, they're doing a Ceph Month, where they're trying to do a couple of talks, two or three talks a week, instead of burning people out.
A
You know, Zoom meetings, basically. It sounds interesting; I haven't seen any, like, requests or talks from people yet.
C
B
E
I have to admit, not sitting in front of Zoom all day appeals; I spend too much time in front of Zoom already. So I think, from that point of view, given there's no need to have everyone together, and so no need to put it all in one day, actually spreading it out and not having so much Zoom fatigue in a day is quite a good idea. Actually, that does seem like quite a good idea.
F
B
A month of Zooming, for example, with Ceph, even though I would like it; but I would add something more to that. The IRC channel is like a single thread, and, well, Ceph is not single-threaded, we have a lot of threads, right? What if, along with that, we had some kind of Flowdock or Rocket.Chat, or some custom board with that...
B
...Zoom days, so we could discuss different topics in threads afterwards along with that Ceph Month. Which tools did you say, Rocket.Chat or Flowdock? Well, in Europe, Rocket.Chat and Flowdock are like Slack, right?
B
B
Have
chat
rooms,
chat
rooms
with
a
trading,
so
you
can
go
with
that
thread
that
I'm
I'm
I'm
really
interested
about.
This
discussion
related
the
blue
plus
door
acceleration,
for
example,
and
it
led
to
different
thread,
even
though
the
original
speak
was
like
how
to
handle
nodes
so.
C
C
Do you think it would be really popular to have people submitting talks, or should it be that each day maybe one developer, a Ceph developer, presents what's new in Ceph, and maybe one or two users present some related talks in that area? Like, do you think it should be a regular call for papers, or should the board pick known people and invite them, invited talks, let's say? What do you guys think?
A
Call for papers, because the board doesn't know offhand all the interesting stuff people are doing; there could be new groups out there other than the same faces that we keep seeing every couple of months, basically.
B
Yeah, there should be some marketing saying that you should put your name forward if you have a good paper, even if it's not the best paper you could come up with. If it's still something like, say, you are dealing with the genomics area and you have a certain genomic pattern that you need to feed into a Ceph cluster, and you are using this kind of speed-up parameters and handling sensitive data checks your way, that might be good, but it will go under the radar quite easily.
B
C
So when you say marketing, do you think that there should be voting on the proposed topics? Because I think in a normal Cephalocon we would have many, many tracks and many, many papers or talks, yes, but we probably won't have the same capacity for Ceph Month, right?
C
So should we have people voting, or how? Or do you think we should try to have as many as possible, and, you know, we could have parallel sessions and do it like a regular conference, but...
E
I think you need some sort of review. I don't think it needs to necessarily be a popularity contest, but presumably the Ceph board knows enough people who know about Ceph to be able to review the proposals.
B
A
A
You know, sometimes you see that at conferences, and I think I've seen that at Cephalocon before, but I mean that format can also be done virtually, and if people don't have a 30- or 40-minute talk, do a flash session for an hour with a few people, or something.
C
C
E
I think having some sort of shared focus is quite useful. I mean, this call sort of grew out of a meeting at a Cephalocon, if I remember rightly, where we thought it might be useful for those of us running in a similar sort of problem space to catch up from time to time, and I think maybe that topical focus is quite useful. That's not to say you couldn't have other similar meetings with slightly different topical focuses.
A
Yeah, I agree, having the shared topic is nice. But for another group, like, you know, if they wanted to split off and do a cloud-users group where they focus more on RBD usage, or I don't know what, we would just have to figure out what those topics are and see if there's enough of a community within the Ceph community that wants to get together every once in a while and discuss it.
C
A
A
A
All right, so, what was that, going back to before we got way off track: the manager, balancer steering.
B
B
That's great when you have a lot of stuff going on, but it's a bit annoying that if you have the upmap manager process running and it sees that, well, I have one host with, like, four percent of PGs now misplaced, it will right away try to put them all in a queue, and then my upmapping, pre-upmapping, is kind of in vain: it's taking them into a backfill state again quite fast.
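A minimal sketch of throttling that behaviour, assuming the standard balancer module options; the 0.01 ratio and the 300-second interval are illustrative values only:

```python
# Show balancer status, then limit how large a fraction of PGs the balancer
# may have misplaced at once, and how often it evaluates, so a new host does
# not get queued for backfill all in one go.
import subprocess

print(subprocess.check_output(["ceph", "balancer", "status"], text=True))

subprocess.run(["ceph", "config", "set", "mgr",
                "target_max_misplaced_ratio", "0.01"], check=True)
subprocess.run(["ceph", "config", "set", "mgr",
                "mgr/balancer/sleep_interval", "300"], check=True)
```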
C
E
A
E
B
B
B
For the next bullet point, about the Beast versus civetweb experience: on our side we are still running civetweb on certain clusters, mainly because it was there, we haven't changed that, and we didn't see much change in performance on that user load; but in some environments I would go with Beast if possible, because it's...
E
C
F
We switched to Beast last week, due to the corruption bug, and after switching in production we observed a little, very little, performance improvement, but nothing major; latency is the same on objects.
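For reference, a minimal sketch of what switching an RGW instance to the Beast frontend looks like via the config database; the section name and port are placeholders for a real deployment:

```python
# Point an RGW daemon's rgw_frontends option at Beast instead of civetweb;
# the daemon needs a restart afterwards to pick up the new frontend.
import subprocess

rgw_section = "client.rgw.gateway1"   # placeholder daemon section name
frontend = "beast port=8080"          # or e.g. "beast ssl_port=443 ssl_certificate=/etc/ceph/rgw.pem"

subprocess.run(["ceph", "config", "set", rgw_section, "rgw_frontends", frontend],
               check=True)
```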
F
E
F
C
C
Do you guys force HTTPS, or do you allow HTTP? Personally, we keep HTTP enabled, because if you have applications streaming, like, if you want to really stream at gigabytes per second, you can't... I don't think you can encrypt that fast. I find that there's a throughput limit; I did an OpenSSL test this week, and one CPU can only encrypt at about 300 megabytes per second. So am I speaking nonsense here, or has everyone switched to secure, or...?
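A back-of-the-envelope check of that claim, using only the numbers quoted above (not a measurement of any particular cipher or CPU):

```python
# If one core encrypts at roughly 300 MB/s, how many cores of pure crypto
# would a 10 Gbit/s RADOS Gateway front end need to saturate the link?
per_core_mb_s = 300            # quoted single-core OpenSSL rate
link_mb_s = 10_000 / 8         # 10 Gbit/s expressed in MB/s = 1250
print(f"~{link_mb_s / per_core_mb_s:.1f} cores")   # ~4.2 cores
```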
E
Not us, because for a lot of our use cases people are sharing signed URLs and doing external sharing, and we didn't want the risk of people using the wrong protocol by mistake, so we now promote connections to HTTPS as well. For a while we had no listener on HTTP on port 80 at all, and now we have a redirect in the HTTP proxy.
E
But we have dedicated RADOS Gateway machines to do the TLS and that sort of thing.
F
C
I have no idea. When I googled around I found that there was a website called "Is TLS Fast Yet?" or something like that, and it gave me some OpenSSL commands to run, and it seems like they've focused on the latency aspect. Like, there used to be, you used to have to retry or something, I don't know, I don't know how this works, but there used to be many, many hops, many round trips, to actually negotiate the connection.
C
And now it's like zero-RTT or something like that. But anyway, they focus on the latency but not the throughput aspect, and I mean, our RADOS Gateways are running 10 gigabit, they have 10-gigabit interfaces, and I would just be curious to know if people can saturate that with HTTPS, like at one gigabyte per second from one RADOS Gateway.
C
B
E
So, with our RADOS Gateway service, the last time I benchmarked it we got 12 gigabytes per second read performance, and that's across our gateways.
E
Yeah, okay, and I spent some time tuning, because there's HAProxy in front, which gives us some reliability, and I spent a bit of time tuning how much CPU we give HAProxy and how many threads we give it, and the number of RGW threads, that kind of thing, and that made quite a difference before we...
B
E
HTTPS, because of the way the network is: the proxy might not be on the same machine as the RGW it's talking to, and we want to keep that traffic encrypted, so the TLS termination is done by the civetweb process, and with that benchmarking we were basically using all the CPU on our gateways.
E
E
E
Anyway, I've got somewhere, I've got my notes of exactly which parameters we used for COSBench; I can try and fish that out.
C
C
C
C
Oh sorry, that's me. Well, we had an incident around February 18th where CephFS got really slow for everyone, and we don't really know why; and by slow I mean it took 10 seconds to create a file. Then, well, we kicked out some clients. We don't really have very good monitoring to know which clients are heavily loaded, but the whole time we noticed that the MDS was just spinning, like the CPU was flat at 140 percent.
C
So I guess we've known for a long time that the MDS is single-threaded, but I'm just wondering if there's anyone else running big CephFS who notices this, and have you already been through a cycle of testing and tuning and getting the best possible MDS-optimized hardware? And then the second thing is...
C
B
E
I
We use CephFS for our HPC work, but we're quite lucky in that most of the files we deal with are microscopy images, which are some 100 MB and upwards in size. So we only really need a single active MDS on a five-petabyte cluster, and certainly it works very well for us. You know, it's not quite up to BeeGFS, which is what we use for the small-file stuff, but it's certainly very, very usable.
J
J
So we've got the automounter set up, so individual home directories get mounted via automount, and all the various things. I think we've got about 900, all right, I think so.
J
I think it's 900 clients, something like that, and yeah. At some stage I wondered if we should have multiple MDSs, and I just noticed the load goes up and figured, well, actually, I'll deal with that if there's a problem; but yeah, it just works. Okay, I did notice once that it slowed down for a little while, I think when some people copied a lot of stuff around and the cluster was quite busy, but otherwise it's fine.
J
C
By the way, since you're using it for home directories, does that mean that on those 900 machines you're root, but the users are not root? Is that right? (That's correct, yes.) Well, yes; the reason I'm asking is the cephx key: you can't let the users get the cephx key, right?
J
So all the machines inside the server room are trusted; we've got general compute boxes, multi-user boxes, remote desktops and so on, and we have one client per mount, and that's mounted as CephFS. Then outside the server room we deliver the home directories via Samba and NFS, but we'll probably get rid of NFS if I get around to it, yeah.
I
We use the CephFS keys to export some of the CephFS directories, for people who have instruments that need to write directly to Ceph, and that works quite well. And we're doing a bit of experimentation with the Windows driver, on a test cluster again, because we've got capture devices which run Windows, and Samba tends to bottleneck a bit on write speeds for us; in our testing we've been getting 800 megabytes a second writes from a Windows machine using the ceph-dokan driver, and they've recently added cephx to that.
I
So you can just export a folder from CephFS to a Windows box, which I think is quite interesting, and it's maybe going to open up a way of storing, you know, some of the images which we capture.
C
I
But you just create a key for a subdirectory and give it to them, and they can't mount the root of the cluster to do anything nasty, so it just seems to work quite well.
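A minimal sketch of what "create a key for a subdirectory" looks like with the standard 'ceph fs authorize' command; the filesystem name, client id and path are placeholders:

```python
# Create a cephx key whose CephFS capabilities only cover one path, so the
# holder cannot mount or modify anything above it.
import subprocess

fs_name, client, path = "cephfs", "client.alice", "/home/alice"   # placeholders

keyring = subprocess.check_output(
    ["ceph", "fs", "authorize", fs_name, client, path, "rw"], text=True)
print(keyring)   # hand this keyring to the user, who mounts only that path
```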
I
That's okay, yeah. The people who have got Linux boxes with their own root, we've got a handful of those, and we trust them with the cephx key; but it's not the root cephx key, it's just their home directory key, effectively. So they can access it from the cluster and they can access it from their own box, as long as we know they've got a fairly up-to-date client and they're not going to use some neolithic version of Ceph.
C
I
Just sharing it that way, yeah: you're not sharing the root, you're just saying, here's the key for your directory and your subdirectories. And we've had somebody who wants to run their own Samba server for their own group, and, well, why not? You know, if it's from your own subdirectory tree and you're not doing much else, it works, yeah, it's good.
H
I do have a question on CephFS for the other CephFS users: some of our clients are generating a message that says "client failing to respond to cache pressure", and I was wondering if that's a known thing and if there is a workaround. But I believe that's...
G
I
To a degree, I think the MDS is being a bit chatty, that's my own take on it, and we've never seen anything really nasty happen as a result of it.
J
Yeah, it just uses more memory than it is supposed to, so we do see that. For NFS we use Ganesha NFS, and the problem seems to be that Ganesha does some caching of metadata itself, so it's holding onto bits, and when the MDS tells Ganesha to release some bits it doesn't respond to that. And yeah, we just muted that particular complaint; the machine that runs the MDS has plenty of memory and it's fine.
C
There are a couple of buggy cases like that: Ganesha does that, and also if a user mounts CephFS, then lists, like ls-es, the directory, and then mounts CephFS again on top of it, then for some reason those caps, those bits held by the original outer CephFS mount, never get released to the MDS; the MDS will ask for them to be released, but they never are. So there are weird things like that that can cause this, but otherwise, yeah...
C
Usually when we see that, it means the client is really busy, they're stat-ing a lot of files or creating a lot of files, and the memory on the MDS is too low, so it's asking the client to recall, like, drop some of its cache. But the client is kind of... the client is allowed to do what it wants; there's no throttle on the client, so it just keeps grabbing more caps. This is all actually in 14.2.18, though.
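A minimal sketch of answering the "which client is holding all the caps?" question that comes up here, using the standard MDS session listing; 'mds.0' is a placeholder for an active MDS name:

```python
# Dump the MDS sessions and print the top clients by number of caps held.
import json, subprocess

sessions = json.loads(subprocess.check_output(
    ["ceph", "tell", "mds.0", "session", "ls"], text=True))

for s in sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)[:10]:
    host = s.get("client_metadata", {}).get("hostname", "?")
    print(s.get("num_caps", 0), host, s.get("id"))
```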
C
C
C
It slows down, so that client would then start suffering; its ls would slow down. I'm going to try to find the documentation they wrote for that after the release came out, and I'll add it to the minutes of this call as soon as I find it; it might be quick.
G
G
G
G
G
I
We find it quite important to use fairly recent kernels to get the best performance out of CephFS, so some of the stock Red Hat Enterprise Linux 7 kernels were sub-optimal, and we ended up using some of the mainline (ml) kernels to get a better CephFS kernel module in place, and that improved performance for us.
G
Yeah, I actually tried the ml kernel as well, and I could still sort of replicate the problems. It's really only some directories; just as a test, I moved one very problematic subdirectory over to a pool which is not sharing snapshots with the main CephFS data pool, and then the problems immediately went away. So it's definitely got something to do with the number of snapshots I'm having on the pool. (How many snapshots are you using?) It's currently around six to seven hundred snapshots.
H
G
Yeah, the mistake I made was in the beginning. I'm very new to Ceph, so I was just using the subvolume abstractions from OpenStack, although I'm not using any OpenStack at all, and I'm currently migrating from an old installation that is just using iSCSI and XFS for home directories for the different work groups here, with different quotas.
G
So I was doing different subvolumes for each of the home directory subtrees, more or less, and wanted to do around 50 to 60 snapshots for all the home directories, and since snapshots are not supported on subvolume groups, I just had to do them for each subvolume itself, and that just adds up. That's something I'm going to change in the next few days, by just migrating everything into a tree that more or less completely ignores the OpenStack abstractions.
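A minimal sketch of the per-subvolume snapshot loop being described, using the standard 'ceph fs subvolume' commands; the volume, group and snapshot names are placeholders:

```python
# Snapshot every subvolume in a group individually, since snapshots cannot be
# taken at the subvolume-group level.
import datetime, json, subprocess

vol, group = "cephfs", "homedirs"                       # placeholders
snap = f"daily-{datetime.date.today():%Y-%m-%d}"

subvols = json.loads(subprocess.check_output(
    ["ceph", "fs", "subvolume", "ls", vol, "--group_name", group,
     "--format", "json"], text=True))

for sv in subvols:
    subprocess.run(["ceph", "fs", "subvolume", "snapshot", "create",
                    vol, sv["name"], snap, "--group_name", group], check=True)
```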
I
We limit ourselves to about between 30 and 60 snapshots on the whole file system, and we've never had much of a problem with it.
J
Yeah, the dragons are there, yeah. Likewise, we've got, well, three weeks' worth of snapshots, and we do nightly snapshots now; that's no issue at all.
J
So the way I've set it up (actually the most difficult bit for our system was deciding how to split up the CephFS) is that at the top level I've got scratch and backed-up directories; everything underneath backed-up is backed up and gets shipped off every night, and we've got different snapshot schedules at those top levels. So the backed-up directories get snapshotted every night for two weeks, and the scratch one every night for one week, and that's just done at the top level, and it's super nice.
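A minimal sketch of that nightly snapshot-and-expire scheme using CephFS's .snap directories (the standard mechanism); the path and the two-week retention are placeholders for this particular setup:

```python
# Create tonight's snapshot of the backed-up tree and remove any nightly
# snapshots older than the retention window.
import datetime, os

top = "/mnt/cephfs/backedup"     # placeholder top-level directory
keep_days = 14
today = datetime.date.today()

os.mkdir(os.path.join(top, ".snap", f"nightly-{today:%Y-%m-%d}"))

for name in os.listdir(os.path.join(top, ".snap")):
    if name.startswith("nightly-"):
        taken = datetime.date.fromisoformat(name[len("nightly-"):])
        if (today - taken).days > keep_days:
            os.rmdir(os.path.join(top, ".snap", name))
```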
J
I
Yeah, we had a similar idea, but we've been doing the snapshotting on the root of the file system. We also have a second cluster, and we have a whole lot of home-brewed scripts to do a type of geo-replication for backups; we keep snapshots on the backup cluster for about a month, and we only keep a very limited number of snapshots on the primary cluster.
I
We're looking forward to the proper CephFS geo-replication stuff coming in Pacific, so we can get rid of all our hacky scripts.
I
Snap trimming is just another load you get on the cluster, a bit like balancing; it's one of these things you just need to keep an eye on.
I
I
B
I
Yeah, 45Drives have released a sort of geo-replication tool as well, which is based on, I think, something that calls rsync. It's been rough around the edges, but there are some people out there trying to do something. We wouldn't release our scripts, because they're like granny's knitting; you know, they're very messy.
H
And it's interesting here, because it sounds like these snapshots are over large sections of the file system, so they would be covering hundreds of terabytes of data, and that's working; but just maybe avoid having too many snapshots at a time, that sounds like the recommendation.
K
Yeah, I think avoid too many of them; they're supposed to be...
I
...stable since Mimic, and there's been a lot of code written and tested since, so we've never seen anything go really badly wrong with them.
C
I said I think I would like to use multiple file systems per Ceph cluster, to avoid this kind of thing. I would like to give big users their own dedicated MDS without relying on directory pinning, subtree pinning; subtree pinning works really well, but it brings another category of problems: when you need to upgrade, you always need to decrease down to one MDS for the upgrade, and this is always painful.
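For context, a minimal sketch of the subtree pinning being referred to, done from a mounted CephFS via the ceph.dir.pin extended attribute; the mount point, directory and MDS rank are placeholders:

```python
# Pin a directory subtree to MDS rank 1; writing "-1" would remove the pin.
import os

mount = "/mnt/cephfs"                            # placeholder mount point
big_user_dir = os.path.join(mount, "big_user")   # placeholder directory

os.setxattr(big_user_dir, "ceph.dir.pin", b"1")
```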
I
I
I
We use 8+2 erasure coding, and the raw capacity is about five petabytes, so we get not far off four petabytes usable out of it. There were some guys at NASA who were using erasure coding, whom I spoke to at Cephalocon Barcelona, and I think they were using 8+2 as well; it seems a good balance. We would use 8+3, but we have a replicated cluster to store everything on.
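The arithmetic behind "about five petabytes raw, not far off four usable" with 8+2 erasure coding, for reference:

```python
# usable = raw * k / (k + m), ignoring any operational headroom
raw_pb, k, m = 5.0, 8, 2
print(raw_pb * k / (k + m))   # 4.0 PB
```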
I
So that's what covers my paranoia: we've got a second copy of the data elsewhere in the building.
B
B
H
Yeah, we're using 7+2 for erasure coding, and we did try to compare 5+2, 6+2 and 7+2 for performance when we first set up the cluster, and they all seemed pretty similar. And then we only had ten nodes at the start, so that limited us to 7+2.
A
C
Let's say that CERN and Fermilab are working together to come to a decision between them, because, I mean, CERN and Fermilab used to do Scientific Linux, as you probably know, and then both just went with CentOS 8. So now, together, they'll find a common solution, but there's some kind of committee of, like, 50 people, or 100 people even, I don't know exactly, that are all putting forward all the different use cases.
C
C
They have, they've already given us an upgrade path where we just set one thing, and then the next time we update we get the Stream repos.
C
Yeah, yeah, they will. The details will be worked out, but they will definitely do a QA thing where they mirror upstream: they put the upstream mirrors into, like, next week's version, and you can have a few QA machines getting the latest RPMs, and then, if everything looks okay, you can just roll ahead next week and all of your machines will get it. Actually, they use CephFS for this.
C
They put all these yum repos on CephFS, and they do snapshots with hard links: they have a tool where, for each day, they take a snapshot, like a hard-linked snapshot, with the date in the directory name, so if people want the state of the yum repo at an exact date, they can do that.
C
J
Yeah, we're in the process of moving from SL7 to Ubuntu at the moment, so I guess that's probably where that's headed, for those reasons, although I don't know; we get our Linux to a large extent from Informatics and will do what they do, basically.
A
A
All right, cool. The next one is in May, whatever the fourth Wednesday of May will be. I'll send out my usual email to the ceph-users list, and if you have put your name and contact information into the pad, I also send a private email to the group along with a calendar event in it. So if you're not on it and you want to be on it, add your information.