From YouTube: Ceph Science Working Group 2023-01-31
Description
Join us for Ceph Science Working Group meetings. We alternate the third and last Tuesday of each month at 14:00 UTC: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A
You know, big clusters, small clusters, cloud research, whatever: we get together in a chat for an hour or so about anything, Ceph issues, upgrades, whatever. These meetings I record and post to the Ceph channel, usually within a few days or so. I'm not a presenter, I just organize them, and hopefully we end up having a good chat that I try to prod along a little, I suppose. So, with that, anybody here who hasn't chatted before or joined one of these want to say hi, or say what science you do?
B
I just started talking, so, you know... I think I sat through one of these before and watched in the background but didn't talk, but I'll talk this time. Garen Atterbury; I work with the University of Nebraska-Lincoln, well, the University of Nebraska system. Our research computing group is called the Holland Computing Center. We have a few Ceph clusters of various shapes and sizes.

B
The biggest one: we have a Tier 2 site for the CMS project, which is part of CERN's Large Hadron Collider. That used to be a large Hadoop file system where we just used HDFS, but we migrated to Ceph, as did most of the other US sites for this project.
B
Yeah, we went with CephFS for this particular cluster. The majority of the data on the 17-raw-petabyte one is in this CephFS file system. In the past, when it was HDFS, we only ran the HDFS file system component; we didn't actually do MapReduce and the Hadoop things on top of that. It's honestly been a pleasant experience: the file system comes along happy as can be. I was planning on upgrading to 17 at some point, I just haven't gotten there.

B
We have a few other Ceph file systems as well, one that's roughly a petabyte that was, well, it was intended to be mostly an object store to back a Nextcloud instance.

B
Largely it's unused, or underutilized I should say; not a lot of activity on that project. And then we also use Ceph via Rook. We have new Kubernetes clusters that we've spun up for various reasons, so one has a bunch of pure NVMe nodes running Ceph; it's actually triply replicated. I know a lot of people do double because of the reliability of NVMes.
B
In theory, anyway. But that's another one we have, and then, because we've had success with Ceph, I guess we're planning on building a new, roughly five-to-six-petabyte Ceph cluster that will largely be general-purpose storage for our campus and our researchers. The intent, well, my intent, is that it's the start of a storage platform. We have a chance to get some brand new hardware, use that, and utilize it in various ways.

B
The majority will be CephFS again, but we also have people wanting to do some block device things, which would be backed by it as well, as well as likely the Nextcloud, so the object store side. So yeah, I guess I'm here just because I thought this was interesting. We're a scientific research computing center, all the stuff we do is research oriented, and our success with Ceph in the past few years has led to that becoming the default path that we're trying to take going forward.
A
I've got a question. So for your big 17-petabyte cluster, do you process directly off of that for the physics data for CMS?
B
Yeah, yes and no. The CMS workflows are mixed. The majority of the access to it is largely bulk data storage. In the past, when we had HDFS, we actually had a mix: the data nodes were also the worker nodes of the cluster, so big servers with good CPUs and then disks in them too. As we moved to Ceph, the servers with OSDs are separate; they don't run any of the CMS workflows.

B
There are cases where that happens, and we do see heavy usage at times that pretty much just maxes out whatever our capabilities are, but a lot of it is remote transfer in and out over the WAN, to and from other sites, just because of the nature of how that project works. The working set size is fairly small, hundreds of terabytes or less; the majority of it is data that just happens to be staged around various sites across the planet for replication and redundancy, and sites remotely access that and stream reads from our file system when workflows demand data that we happen to have. I think that answers your question, maybe.
A
Yeah, I think so. I know at UW Madison here there's the physics group... I think we're a Tier 2 site somewhere on campus as well, but I think they do a lot of processing in place on their cluster.

B
Here too; actually, I know the people that work on it fairly well. They're still running HDFS and plan to continue with HDFS for a while.
C
Yeah, great, thanks, Garen, and thanks everyone. I was invited to this meeting by Thomas Bennett, who was also in the same institution. I'm from the South African Radio Astronomy Observatory; we're based in Cape Town in South Africa, and we have Ceph clusters in operation there. There are a few of them, but we're using Ceph to provide object storage for the data products of the MeerKAT radio telescope.

C
As far as the detailed technical details go, I'm not going to try and pretend I know exactly how it's set up. Thomas Bennett has all of those details and he's on the call; I'm not sure if you know him. This is part of a transitioning process: Thomas has now moved out of SARAO and is moving into a private consultancy, and hopefully he'll still be supporting our Ceph cluster.

C
So I'm here for continuity and, of course, to maybe learn a thing or two as we go along. Thanks for hearing me out.
F
Anybody else got a "hello, here's my cluster" type of thing they want to do? What do you think?

G
Okay, I think I could say a word or two. I'm Pieter from CSC; you've most likely heard about our clusters already, but just to say hello: I haven't been in this meeting before. I'm basically trying to make sure that we and the other guys have the resources and capability to run our machines.
D
I can continue with the CSC side. I don't remember if I told you a year ago, or within the year, about our plans to make supercomputer-compatible S3 authentication. At that time we were planning to use token-based authentication: there would be a secret and a key, and then a token, for every user, and the token would expire. Then we could put S3 credentials on the supercomputer side, on those batch queue systems, without worrying that we were leaking too many credentials over time; so, temporary keys for certain usage on the supercomputer side. That was really promising at the start, but it fell apart because of client-side tools: their handling of S3 credentials together with the token failed.
D
Some of the tools understand the token principle, but the majority of the tools our customers were using were failing. So now we've been developing a method to auto-expire keys on S3. If you're familiar with the Ceph API you can expire keys quite easily, but with S3 they last forever: if you have a key and a secret, people tend to put them places and forget that they have, for example, leaked the keys.
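A minimal sketch of what that kind of auto-expiry can look like with plain radosgw-admin, assuming admin access on an RGW node; the user ID "alice" and the handling around the batch job are illustrative, not CSC's actual tooling:

    # Mint a short-lived S3 key pair for a batch job; we choose the key material
    # ourselves so we don't have to parse it back out of the user-info JSON.
    ACCESS_KEY=$(openssl rand -hex 10)
    SECRET_KEY=$(openssl rand -base64 30)
    radosgw-admin key create --uid=alice --key-type=s3 \
        --access-key="$ACCESS_KEY" --secret="$SECRET_KEY"

    # ... hand the pair to the job on the supercomputer side and wait for it to finish ...

    # Revoke the pair afterwards so a leaked copy is useless.
    radosgw-admin key rm --uid=alice --key-type=s3 --access-key="$ACCESS_KEY"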
A
I think that'd be an interesting talk or whatever.

E
So, if I could just ask: you were saying that the tokens weren't working for standard tools. Which standard tools were you trying that weren't supporting tokens?
D
We were trying multiple tools: everything from rclone to s3cmd to Cyberduck, things like that.

D
Most of them worked just fine, but not widely enough across the different tools. So that was a good intention, and really the way it should be done; I still think that should be the way. And we found some bugs in the RADOS Gateway internal code already with that, and we have made progress on fixing those token problems on the back end, on the radosgw side.
D
And the size of the cluster is 40 petabytes. We have eight racks of machines, and when you are adding a node it starts failing: the timeouts get way too long when you are adding a node with the orchestrator.
H
Do you remember what I think the limiting factor was, regarding cephadm, like the automatic check or something? We tried to extend it, but after extending it, it would do something else funky.
B
I was asking because I'm at just over a thousand OSDs on Pacific with the orchestrator, after we'd gone through various other bugs with the orchestrator to get to that point, mostly the ones with nodes that had too many OSDs in them; I was happy to see that one get fixed. So yeah, I was just wondering how much further I have to go before I have to start worrying.
D
But, for example, with Quincy, if you're running RADOS gateways, there is a quality-of-service component in Quincy, and from my light testing I really like it, because I can suppress some of it.
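For reference, the Quincy rate-limiting feature being praised here is driven from radosgw-admin; a rough sketch, with the user ID and limit values made up for illustration:

    # Cap a single RGW user's request rate (Quincy's per-user/per-bucket QoS).
    radosgw-admin ratelimit set --ratelimit-scope=user --uid=heavy-user \
        --max-read-ops=1024 --max-write-ops=256
    radosgw-admin ratelimit enable --ratelimit-scope=user --uid=heavy-user
    radosgw-admin ratelimit get --ratelimit-scope=user --uid=heavy-user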
E
Yes, I know that Luca from the Pawsey Supercomputing Centre gave us some information on their experiences with Quincy at scale, and I see there is actually a link here in one of the blogs: it's called "Quincy at scale", testing with the Pawsey Supercomputing Centre. I see they've got 4,320 OSDs, 69 petabytes, which they deployed with Quincy. I don't know what they were using before, but I can just copy and paste it into the chat.
A
All right, I guess we'll just hit some things on the topics list: upgrades that people want to talk about that have gone really good, really bad, or somewhere in the middle.

A
My contribution to that is, I have a cluster that is on a mix of CentOS 7 and 8, and one of my team members is working on that conversion so that we can actually upgrade past Octopus; Pacific would be really nice one of these days. He figured out how to do an in-place conversion,
A
Where
you
know,
if
we
Kickstart
using
in
place
the
only
destroy
the
OS
drives,
don't
touch
any
of
the
osds
and
then
when
it
comes
up
it
just
it
has
a
process
that
you
figured
out
to
bring
the
osds
up,
discover
them
with
stuff
volume,
bring
them
up,
and
you
know
it
takes
60
Minutes
of
host
or
something
I
think
I
saw
him
if
he
joined
the
call.
So.
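The reactivation step being described maps onto ceph-volume roughly like this; a sketch only, assuming a non-containerized deployment where the Ceph packages, /etc/ceph/ceph.conf and the bootstrap-osd keyring have already been restored after the reinstall:

    ceph-volume lvm list                      # discover the untouched OSD LVs and their metadata
    ceph-volume lvm activate --all            # recreate tmpfs mounts/systemd units and start every OSD
    ceph osd tree | grep "$(hostname -s)"     # confirm the host's OSDs rejoin the cluster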
A
So yeah, I hope that answers their question. Doing a full node drain would have been very painful with that many; we definitely skip that part and just do the in-place conversion, and it's going to save us tons of time at the end of the day.
I
I'll just do a quick recap: my name's Bruno Canning, I'm working at the Wellcome Sanger Institute, south of Cambridge here in the United Kingdom. We are one of the largest genome sequencing and genomics research facilities in the world. We're running a 51-node Ceph cluster with 60 drives per storage node; it was procured about five, six years ago now.

I
So it's a little long in the tooth: six-terabyte drives for each OSD. The purpose of the cluster is mainly two-fold: to serve as a back end for OpenStack, so our users can create VMs with bespoke analysis workflows, and they can do so, obviously, as root; and we also operate a RADOS Gateway service, which is essentially just a data drop for users.
I
We have other data storage facilities on site, like iRODS, which gives us metadata annotation of our data as well, and we operate compute farms too. So, our Ceph cluster: it was running Bionic, Ubuntu 18.04, and we upgraded to Focal, which is 20.04. We had to go to Focal because we're on Octopus and we wanted, I think we're required, we can't go straight to Quincy; I think there's a change in the OSD code such that we have to go via Pacific first. And so, our upgrade:
I
We started with our monitor hosts, one at a time, because we wanted to be very careful with these and observe them. The entire cluster is three-way replicated, and we have four failure domains, so it spans four racks.
I
The other thing is degradation in data redundancy. One thing I did notice, which is perhaps related to the Ceph orchestrator problem observed by you guys at CSC, is that when it's time to bring the OSDs back online, if you set noup, Ceph will then patiently bring all the OSDs up, but they won't actually start peering. You can bring up, you know, 14 times 16 OSDs and they're all ready to go, and then, in one command,
I
You
are
you
the
no
upper
flag
and
you
don't
get.
You
obviously
get
a
huge
amount
of
peering
happening,
but
you
don't
get
the
state
where
there's
a
long
delay
between
the
first
OSD
coming
online
of
the
upgraded
hosts
and
the
last
one,
because
this
will
this
will
cause
backfilling
and
then,
as
more
rsds,
come
online
Seth
realizes
that?
Oh,
some
of
the
backfilling
it's
already
embarked
on
is
is
actually
not
not
required.
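What that pattern looks like on the command line, as a sketch (a package-based host here; an orchestrator-managed cluster would start the daemons its own way):

    ceph osd set noup                  # OSDs may boot but won't be marked up yet
    systemctl start ceph-osd.target    # start every OSD on the reinstalled/upgraded host
    # ... wait until all of the host's OSDs have registered ...
    ceph osd unset noup                # now they all come up and peer in one go
    ceph -s                            # watch peering settle without spurious backfill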
I
So, smooth upgrades on the whole. I think it took us, we weren't active on it every day, but we did it in about 16 days, and that includes weekends when we weren't working.
I
At least, I was a little horrified when one of my colleagues suggested that I could take an entire failure domain off, but he said, well, look: we had a power failure in the data centre some time ago and the cluster just kept on running, kept on serving I/O, and then, when the power was restored to that failure domain, it catches up, backfills, and it's HEALTH_OK again. So, you know, it is scary to take that many nodes off in one go.
B
You
have
a
question
since
talking
about
Crush
map
back
when
I
was
first
starting
with
Seth,
and
we
had
a
mix
of
node
sizes.
Some
were
small
40
terabytes,
some
were
800
terabytes
and
it
turned
out.
I
had
not
not
enough
of
each
type
to
do
what
I
was
expecting
to
do
and
I
had
all
sorts
of
balancing
problems
that
went
through
all
of
that
all
resolved
after
having
a
sufficient
quantity
to
meet
the
you
know
erase
your
coding
profile.
We
had
set,
of
course,
and
everything
seems
fine.
B
So
my
generic
question
is
whether
there
have
been
any
developments
there
or
are
there
any
cases
with
some
of
you
that
have
larger
clusters
and
perhaps
a
more
mixed
environment
where
there
are
still
challenges
as
far
as
keeping
this
utilization
balanced
across
the
large
things,
or
is
that
largely
solved?
And
you
know
happy
easy
times.
A
I
think,
even
with
our
art
festival,
you
know
we
had
six
eight
ten,
twelve,
whatever
14
terabyte
drives
on
at
once.
You
know
24
each
and
ever
since
you
know
they
started
putting
the
effort
into
the
manager
and
the
balancer
and
up
map
back
in
the
day
yeah.
It
was
a
pain,
keeping
things
balanced
across
a
cluster.
But
now
it's
pretty,
in
my
opinion,
it's
pretty
trivial.
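The built-in machinery being referred to is just a couple of commands; a minimal sketch (upmap mode additionally requires clients to be at least Luminous-capable):

    ceph osd set-require-min-compat-client luminous   # upmap needs Luminous+ clients
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status
    ceph osd df tree    # check that per-OSD utilization converges over time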
D
If an OSD fills past seventy-odd percent it's too much, and there's small data on some OSDs that's really hard to move out properly, because in a way it's in the right place, but it makes some of the operations really, really slow.
J
There's a third-party one; I think I've got the link to it, actually, I'll put that in the chat. I've seen it on the mailing list a few times, recommended by various people.

J
Yeah, I'm a bit foggy on the details between the different balancers right now, but I seem to remember that one of them may have been better for more complicated CRUSH maps, for example.
B
One
you
listed
is
actually
the
one
that
I
used
and
it
was
recommended
to
from
people
at
CERN
on
some
of
their
early
days
with
dealing
with
balancing
issues.
But
when
I
had
16
nodes
that
had
40
terabytes
and
three
nodes
that
had
you
know
upwards
of
800
terabytes,
you
know
I
had
to
use
this
tool
in
order
to
get
anything
resembling
equal
space
utilization
and
it
it
got
out
of
hand
quickly.
I
use
this
Tool
too
many
times.
B
Essentially
in
my
I
had
you
know
revamped
smile
log,
and
it
was
like
wait.
How
do
I
undo?
All
of
this
now
took
a
while
to
actually
learn
how
all
the
things
were
pieced
together,
but
it
certainly
works,
and
you
know,
has
its
use
cases.
But
ever
since
we
had
we,
we
have
enough
nodes.
Now
that
are
eight
plus
three
Erasure
encoding.
You
know
that
the
balancer
can
figure
out
and
do
the
right
thing
with
the
ability
to
put
data
in
the
right
places.
E
Okay, yeah, because I had some issues in Luminous at one point where, yeah, there was a bug, I can't remember which, but it basically gets very confused.
B
Another
question
so,
when
I
introduced
the
Seth
clusters,
we
had
I
neglected
to
mention
the
really
old
original
ceph
cluster,
which
backs
a
openstack
instance,
which
is
still
running
Joule
in
the
case
of
you
know,
don't
touch
it,
don't
talk
about
it.
It
has
anyone
gone
the
whole
Jewel
all
the
way
up
to
latest
things
with
some
of
their
clusters
were
there
any
major
gotchas
between
versions
that
I
should
just
it's
not
worth
trying
her
plan
is
to
replace
it.
You
know
new
Greenfield
solution,
but
I'm
just
curious.
A
I'd say do it. My cluster, my big one, started on Hammer, I think, and it runs Octopus now; I've hit every version. I prefer to just hit each one. I know you can skip some versions in some cases, but if you just read the release notes and do all the steps, you know, I've never had a problem coming all the way from Hammer up to Octopus.
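As a hedged outline of the per-release loop being described (the exact steps differ by version, so each release's notes still take precedence):

    ceph osd set noout                    # avoid rebalancing while daemons restart
    # upgrade packages and restart mons first, then OSD hosts, then MDS/RGW ...
    ceph versions                         # confirm every daemon reports the new release
    ceph osd require-osd-release octopus  # finalize with the release just installed
    ceph osd unset noout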
B
All
right,
you
know,
I
had
gone
back
and
looked
at
some
of
the
you
know,
older
release,
notes
But,
as
time
has
gone
on,
finding
the
finding
documentation
that
is
correct
or
looks
correct
and
official
for
the
old
versions
has
become
more
challenging.
So
it's
like
this
really
applicable
still.
Is
it
a
is
something
yeah.
I
Yeah, I was going to say, if it's been in operation for that long, you've obviously not encountered... yeah. The people to speak to about that would be the Rutherford Appleton Laboratory in Oxfordshire.
I
I know those people; I used to work there some time ago, and I think we had some issues with the first deployment, so we tore our cluster down and built it completely from scratch. I think that was the Jewel release, and they certainly got as far as Luminous. I'm a bit out of touch with them, but I should imagine, I mean, they're running it as a production system, so I'd imagine if they're not on the latest they'll only be one main release behind. Tom Byrne is, I think, the principal operator there; he'll certainly know more.
B
I'm going to ask another question if nobody's going to stop me. We've done the orchestrator on RHEL 8, well, actually Alma 8, systems, and we're going to deploy a new one here, as I mentioned at the beginning. I just quickly glanced and it looks like RHEL 9 packages and all of that exist. Are people using Ceph and the orchestrator on RHEL 9 and it's all happy, or is there still a little work to go there?
D
Well,
we
haven't
upgraded
the
nine
version
of
a
railover
events
yet,
but
there
is
a
one
big
thing
which
is
difference
between
teaming
and
bonding.
So
in
case
you
are
using
a
teaming
on
a
Centos
Rocky.
Whatever
eight
person,
the
nine
version
is
in
combat
with
income
of
it
all
with
that,
so
so
the
now
the
sit
they
are
shifting
back
to
bonding
bonding
will
break
your
teaming.
If
you
have
done
teaming
on
on
a.
A
I see somebody put some stuff in about bugs, if anybody has hit that. But if you want to talk us through, let's say, some kernel client issues...
K
Yeah, okay, I just wanted to put these bugs in because they appear to be slightly annoying for our use case on HPC, because we use CephFS for most of the data, including home directories and stuff, and so basically there's this performance degradation. There is already a bug report over there.

K
It's not clear whether it's a kernel bug or a CephFS one; the tendency is that maybe it's more the kernel, because nothing much was changing on the CephFS side. It seems that it creates much too much load in the system, so I'm not quite sure what happens in the client or whatever, but effectively it works two times slower, right, in the benchmark, and still it works.
K
Okay, the stress on the client node is much higher when they're coming to work on it. So I'm not sure if somebody's pushing for that, but even the nine release experiences this bug; Red Hat has, let's say, an older kernel, which does work quite well.
A
The
other
one
is
about
yeah,
so
if
you're,
seeing
like
the
2x
slower
for
reading
and
writing
from
suffer
fast
like
at
the
same
time,
you're
seeing
high
like
CPU
utilization
and
like
CIS
time
or
something,
you
said,.
K
At
random
times
those
standby
viewers
brush
yeah
and
example,
last
week
this
enabled
we
had
a
lot
of
streaming
was
not
working
actually,
so
we
when
we
removed
when
we
disabled,
active,
replay
everything
started
to
work
again,
although
it
took
an
hour
recover
and
it's
looks
stable
this
way,
that
might
be
some
bugs
in
the
latest
release
it's
pretty
new
technology
anyway.
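For reference, disabling standby-replay as described is a single file system flag; "cephfs" is a placeholder name here:

    ceph fs set cephfs allow_standby_replay false   # standby-replay MDS daemons drop back to plain standbys
    ceph fs status cephfs                            # confirm the MDS states afterwards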
K
The primary wasn't crashing, but the last time it happened I was away, so I'm not sure exactly what happened. But we got this stable only after disabling it.
A
All right, well, I guess they're just asking about any experience or observations of write amplification on NVMe clusters. I don't know if they're trying to talk and are muted, or we can't hear them or something right now. But anybody?
I
I, um... we have been stung on Octopus, with large deletion campaigns by our users, which create a very large garbage collection to-do list, and when the garbage collector in RADOS Gateway starts operating, although the OSD is running just fine, it gets marked down by its peers, and then, I think, as soon as one of our OSDs goes down we're actually rebalancing, backfilling, sorry. Then the OSD comes back and says, oh yeah, no problem, I'm still here. But after a while what can happen is you get an avalanche failure: eventually the OSD will be marked down properly, and you get an avalanche failure where the next OSD becomes a problem, and then the next one.
I
And perhaps... well, the solution from our vendor was to get NVMe storage devices and move the RADOS Gateway pools over to that.
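One hedged way to implement that vendor suggestion is a device-class CRUSH rule for the latency-sensitive RGW pools; the rule name here is made up and the pool names are the stock defaults, so adjust for the actual zone:

    ceph osd crush rule create-replicated rgw-on-nvme default host nvme
    ceph osd pool set default.rgw.buckets.index crush_rule rgw-on-nvme
    ceph osd pool set default.rgw.log crush_rule rgw-on-nvme
    # data then backfills onto the NVMe OSDs while the gateways stay online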
F
Anybody else have any topics they want to throw out there?
B
I will at least admit to a failure of monitoring where, after a power outage, many of our large nodes, which are Western Digital external JBODs with 60 disks, some of them came up before the hosts came up and some of them came up after the hosts came up, and therefore some set of hosts had no disks. That of course happened on a weekend, with me, like, not paying attention, and Ceph happily tried to correct and rebalance and solve itself for about three days before somebody noticed and said, what's going on here? But through it all, data was accessible, because the nodes that died were separated out and few enough that it didn't impact data availability.
A
All right, if nobody has anything else for today, I've just got one quick thing left, and that is: hopefully everybody saw that Cephalocon 2023 in Amsterdam is happening in April. I am planning on submitting an in-person birds-of-a-feather session for one of these things. Hopefully a bunch of you will make it.
A
If
you
see
the
emails
we'll
have
one,
if
not
just
assume
we'll,
do
it
at
supplicon
yeah.
If
you
want
to
be
on
the
private
reminder
list
for
this
I
take
the
sign-in
emails
and
do
that
because
things
are
easily
pissed
on
the
Southwest
sometime.
A
Yeah, exactly; it's only a couple of weeks after when the next one would be anyway. So just wait for that, and we can have a nice birds-of-a-feather session and then continue it with some beers and food afterwards or something.

F
Thank you, yeah. That's everything from me; hopefully see some of you at Cephalocon.