From YouTube: 2019-09-19 :: Ceph Performance Meeting
A
There's this other OSD PR for changing the OSD op queue cut-off default to high. I have not looked in depth at that, although there are a couple of people in the community who, I gather, are seeing better results with it.

I've got a new PR that doesn't really affect performance in any real way, except that it makes it easier to diagnose where we are spending time in do_osd_ops. That is currently a giant switch statement with thousands of lines in it, and the PR breaks it up into smaller chunks, which on one hand makes things easier to follow and, on the other hand, also means that when you're doing wallclock profiling you can tell what you're spending time in, which is nice. So I'm going to run that through QA today, hopefully.
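The refactor described above is easier to picture in miniature. Below is a hedged sketch in Python (not the actual Ceph C++ code) of the general pattern: replacing one giant switch over op types with a dispatch table of small per-op handlers, so a wallclock profiler attributes time to a named function per op. All names here are hypothetical.

```python
# Toy illustration of splitting a giant per-op switch into small handlers.

def handle_read(op):
    return f"read {op['oid']}"

def handle_write(op):
    return f"write {op['oid']}"

def handle_delete(op):
    return f"delete {op['oid']}"

# Dispatch table: each op type gets its own small function, so profiles
# show time in handle_write() etc. instead of one multi-thousand-line frame.
HANDLERS = {
    "read": handle_read,
    "write": handle_write,
    "delete": handle_delete,
}

def do_op(op):
    return HANDLERS[op["type"]](op)

print(do_op({"type": "read", "oid": "obj1"}))
```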
Our closed PRs: we have this rgw change for max usage trim entries. I did not look closely at that, but Casey did, and it looks good.
A
And then the rados bench sequential and random read PR: this was a PR that fixes some kind of bad behavior in rados bench. I looked at it and generally thought it was good, and Kefu looked at it and had some additional feedback. It looks like that must have worked out well, because people have merged that as well.

In terms of updates, we've got some updates to Eric's changes for doing smarter filtering in the OSD for rgw bucket listing.
A
Let's see. Oh, this one for immutable small objects within the onode: the idea here is that you're inlining really small objects. It does look like it's faster in the tests that were run, but those tests were only run for maybe 900 seconds, and that still wasn't that many objects. I would very much like to see those tests run for much longer periods of time, potentially, if possible, up to something like 64 million objects.
A
I think that will give us a much better idea of what happens when RocksDB starts to move stuff around between multiple different levels and you get a kind of higher rate of amplification, which is what happens when you really fill it up with stuff. So that will be interesting to see.
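As a concrete way to run the longer test suggested above, here is a minimal sketch assuming python-rados, a reachable cluster, and a hypothetical pool name; it just writes a very large number of tiny objects so RocksDB has to compact across multiple levels.

```python
# Hedged sketch: stress the onode/RocksDB path with many small objects.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("testpool")  # hypothetical pool name

payload = b"x" * 128  # a "really small" object; size is an assumption
for i in range(64_000_000):  # the 64 million objects mentioned above
    ioctx.write_full(f"smallobj-{i}", payload)
    if i % 1_000_000 == 0:
        print(f"{i} objects written")

ioctx.close()
cluster.shutdown()
```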
A
There's Sam's tracepoints PR; I think he's waiting on review for that one. The finisher PRs are still failing testing, so he's going to go back and take a look at that one. Adam's related PRs are still in the works; it looks like they're still getting updates, so that's good. Sorry about my phone. The MDS cache memory limit is still being worked on; it needs another rebase, but hopefully it's close. And the other Adam's sharding work:
A
That is going somewhat better. He discovered the issue that he was having that was causing rare segfaults, so he fixed that. But the other problem that he's hitting now is that RocksDB is corrupting itself in kind of unfortunate ways, and it's happening after not too much work: it's like two and a half hours of work and you can get it, and I think he was even getting it faster. It's possible that we are actually seeing this, very rarely, on master.
A
It looks similar to some reports that we've gotten, just kind of random ones, regarding RocksDB corruption, but it may be that he's hit upon a way to make this happen much faster, or it might be something totally different. He's still trying to figure out how to get this passing tests.
B
Let me just introduce ourselves: my name is Rob, and with me is Luke. We basically do automated threat hunting at the firm we work for, and we run our own Ceph cluster. So it's not, you know, enterprise-wide for the entire firm; it's just something that the two of us and another person run to do a lot of our data analysis and data storage and things like that.
B
So we've been doing stuff now for, I think, about two years. But recently (and we don't know if it's really recent, because has anything changed on our setup, or is it new bugs or new performance issues?) we've been seeing quite a few performance issues, mostly on the rgw side. We've tried everything, at least everything we could think of, anyway.
B
I'm sure we haven't tried everything; otherwise, you know, hopefully there's a solution. So our hope was that we could kind of go through what our setup is, what our use cases are, and the bad behaviors we're seeing.

A
Yeah, absolutely. Did you want to start with the setup first?

B
Cool. And please, obviously, if there are any questions or more details, please ask away. We tried our best to gather as much information as we could before this call, and we've been doing a lot of testing over the last probably three weeks, and we both came to the conclusion that it's probably best to talk to some folks that know more than we do. So we're hoping you have some ideas for us, because we're definitely a bit frustrated. So, our setup: we have 120 nodes in our cluster.
B
Each node has one OSD each. Each OSD is basically a RAID 5, 24-drive setup; those are just spinning disks. We wish we had better, but that's what we have. Total size of the cluster ends up being roughly, like, three and a half-ish petabytes, so, you know, a decent amount of storage.
B
One thing to note, though, is that our boxes are not dedicated just for Ceph; it's actually multi-tenant, so there are other things running on those boxes, other jobs and other workloads. We don't want to make it sound like these boxes are dedicated just to running Ceph; they are not. Those boxes are running Red Hat 7.6, and the kernel is 5.2.3.
B
Unfortunately, we have at least put in for more; we're hoping to get NVMe so we can do things a little bit better for the metadata, but right now we do not have it. Believe me, we wish we did, and we are pushing hard to try to get them.

A
To be honest, I don't think it'd make a whole lot of difference; I figured I'd ask since you're right there. It's been a little while since I've really gone through and tested hardware RAID controllers, so it's possible something's changed since then, but the last time I tried it, it didn't seem to do much of anything, at least for our purposes.
B
We do heavy ingest and searching as well and, for the most part, I can't characterize the performance as perfect, but at the same time there have been no major issues there. I'll also let you know that we are using RBD, and that side's been pretty solid, no major concerns there at all. But yeah, let's go into the...
D
We have another pretty active bucket that has like 100 gigabytes and 500,000 objects. I should say that, in all the Ceph config, the only thing we've actually changed from the defaults that came with 14.2.4 is the dynamic resharding threshold, to 150,000 versus a hundred thousand. That's the only change we've made.
B
We did that only when we were trying to improve performance; it wasn't something we would normally have done. We had it at a hundred thousand; we were just grasping at things that might be causing our issues, so we tried that, but I don't think it really affected anything, positive or negative.

E
Okay.

B
It is a couple, you know, a terabyte or two of data, but the number of objects for a given day is only in the 100,000 range, in that area, and that ingest has been going great. The area where we're having issues is when we actually try to use the data to do, basically, big data analysis. What we're doing there is basically writing distributed Python jobs.
B
I know it doesn't really matter which language, but they're distributed analytic jobs, and most of those jobs work on a day's worth of data. Stage one of the job is basically reading a bunch of data for the day from one of those buckets we just talked about; at that point we're probably doing about a hundred thousand requests a minute, and that's loading that data into our Python workers.
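For reference, a minimal sketch of that read stage, assuming boto3 against an RGW S3 endpoint; the endpoint, credentials, bucket, prefix, and worker count are all hypothetical placeholders:

```python
# Hedged sketch: list a day's worth of keys, then fetch them in parallel.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw1.example.internal:8080",  # one of the RGWs
    aws_access_key_id="ACCESS",
    aws_secret_access_key="SECRET",
)

def fetch(key):
    return s3.get_object(Bucket="ingest-bucket", Key=key)["Body"].read()

keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="ingest-bucket", Prefix="2019-09-18/"):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# max_workers is the throttle the speakers describe tuning down
# to keep the request rate manageable.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(fetch, keys))
```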
B
We then obviously do some actual analysis and processing, and then, once we have our results, we do a pretty large amount of writes to basically take those results and put them back into S3; at that point you're probably also in the sixty to a hundred thousand requests a minute. And what we are seeing is that, basically, when we run these jobs... not every time; that's what we have been joking about: the only thing that's been consistent is the inconsistency. There are times we run this and things run just fine and things finish.
B
They finish fine and latency looks great. But then, often enough that it's not a fluke, what we will see is the latency degrade. Our median latency is usually like 0.02 seconds and things are great; but when things start heading into bad territory, our job will still finish, though our latency maximum will be about 50 seconds. And when things go really bad, the latency maximum will go up to 120 seconds and things start looking real bad.
B
Then, in those cases where things are really bad, the median is about two to four seconds, so we see basically a degradation there. And then, ultimately, what ends up happening is we usually see a bunch of timeouts happening first and then, eventually, we actually see segfaulting RGWs, where it just dies. We've tried to scour every log we can to figure out what's going on.
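A hedged sketch of how the latency distribution described above (median around 0.02 s, maximum spiking to 50 or 120 s) could be captured per run; the endpoint and object path are hypothetical, and this assumes the requests library:

```python
# Time each S3 GET and report median/max afterwards.
import statistics
import time
import requests

latencies = []

def timed_get(url):
    start = time.monotonic()
    try:
        return requests.get(url, timeout=130)
    finally:
        latencies.append(time.monotonic() - start)

for _ in range(100):
    timed_get("http://rgw1.example.internal:8080/ingest-bucket/some-key")

print("median:", statistics.median(latencies))
print("max:", max(latencies))
```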
B
With that segfault, we even upped the logging to ten on the RGWs, and basically what we see in the log is that things are running fine and then, all of a sudden, it just stops in its tracks and dies. We've had some jobs so bad that, again, it's not consistent, but we've had times where we ran a job and, in a short period of time (at that time we were using about 10 RGWs), we had like 30 restarts in that short period of time.
D
Yeah, I think that's good. And also, I mean, we've kind of throttled the amount of client access to those buckets to get it to actually perform. So, you know, if we hit it with as many workers as we would want for the distributed reads, I think we would get this behavior sooner. So this is at the manageable level, where most of the time it finishes; we are still getting these segfaults, but we'd really like to be able to do even more.
B
Surprisingly, the writes don't seem to be affected that much; I don't see our ingest affected deeply by this, but reads from it have been affected for sure. And, bottom line, we did try civetweb, because we're using Beast right now, and things were probably worse with that; we just ran one or two tests and things were crashing right away. So we did try moving away from Beast, in case that was the problem we were seeing.
B
That was the reasoning before we lead into why: we were hopeful that maybe it's a bug or something. And our IOPS, just so you know, don't really spike during this. That's one thing we were kind of surprised by, but it doesn't appear to be an IOPS issue.
B
Our IOPS usually stay at about 1,500 to about 2,000, and during these events we don't see a huge spike in IOPS or anything like that. And we have seen some cases where we start getting weird random errors from the RGWs, like 506s and 404s; we've had times where our requests come back with Unicode when we requested a bucket. It just seems like things get into a really weird, bad state where you can't really tell...
B
...what's really going on. And then, other times, which is kind of interesting, we'll be running a job, this behavior will happen, we keep the job running, and it kind of calms down at some point, which is also kind of weird. So again, it's one of those things where I can't really tell you what exactly triggers it. And again, we've sometimes rerun the same exact job with the same exact data; we tried to be scientific about it, where we basically blew away the bucket that we were writing to.
B
We ran the same day's worth of data, which should be pretty much the same workload, and we made sure that we're hitting the RGWs in a very uniform manner, so no one RGW is being overloaded. And we've absolutely had times where you run the same exact job with the same data, with the same empty bucket we're writing to, and in one case it ran just fine and in another case we had RGW crashes happening.
A
This is in Kubernetes? Okay, that throws another twist into this. We have seen behavior that sounds a lot like this in Kubernetes, and it is not RGW-specific at all, because it can happen on the OSDs and other things too. If you run out of memory in Kubernetes, then, unless you guys have found a way around it, you don't get swap space, and we've seen cases where...
D
If we were going beyond the memory limit, we would definitely expect to see logs about something like that, and when I log into the pods running the RGWs, you know, the load is like 2 or 3 and the memory usage is not at all high. Unless there's some other limit, below the system limit, that it's hitting and that is causing this, and we just don't see it in any logs. I mean, if some log pointed to that, we would love it; we'd love to have some evidence for that.
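One place that evidence can hide: when the kubelet OOM-kills a container, the application logs show nothing, but the pod's last container state records the reason. A hedged sketch using the kubernetes Python client, with a hypothetical namespace and label selector:

```python
# Check whether any RGW container was previously terminated as OOMKilled.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("ceph", label_selector="app=rgw")
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        if term is not None:
            # reason == "OOMKilled" would confirm a memory-limit kill
            print(pod.metadata.name, cs.name, term.reason, cs.restart_count)
```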
B
That's for a given RGW, and it's consistent among them; we aren't seeing, for instance, one RGW running higher than the rest by any, you know, measurable metric. Basically they all look kind of uniform and, like we said, the load is at two or three; I don't think they're being taxed that hard, but maybe we're wrong.
B
On the actual box, the machine that the containers are running on, things are fine as well. We've definitely seen, in other cases, not with Ceph but with other parts of our stack, issues like that where, as you say, everything just feels sluggish and overloaded. We haven't seen that at all with this; it's kind of like something triggers the bad performance and then, all of a sudden, a segfault, yeah.
B
Threshold-wise, anyway: if there's anything, even in Beast, that we should be configuring, we couldn't really find a good document of what's even configurable in Beast. So again, all of that is going to be vanilla, which maybe for our size isn't appropriate, but we just didn't know. We did at least find a Red Hat document about some OS settings to change, and we did try those, but they didn't seem to really have an effect one way or the other.
A
I guess, from a high level, one thing I would say is: okay, the way you guys are set up, you've got 24 disks in a RAID 5 and they're all sitting behind one OSD. That concept kind of makes the OSD itself a point of contention, right, rather than having, like, 24 OSDs running on 24 individual disks. You've got one that is sort of faster in a way, but also, with RAID 5, you're going to be introducing extra latency for small operations.
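A back-of-envelope sketch of that RAID 5 small-write penalty: each small random write turns into a read-modify-write (read old data and parity, write new data and parity), roughly four disk ops per client write. The per-disk IOPS figure below is an illustrative assumption, not a measurement of this cluster:

```python
# Rough small-random-write capacity: RAID 5 array vs one OSD per disk.
disks = 24
iops_per_disk = 150            # assumed for a 7.2k RPM spinner

raid5_write_iops = disks * iops_per_disk / 4   # read-modify-write penalty
jbod_write_iops = disks * iops_per_disk        # 24 OSDs, no parity cost

print(f"RAID 5 behind one OSD : ~{raid5_write_iops:.0f} write IOPS")
print(f"24 one-disk OSDs      : ~{jbod_write_iops:.0f} write IOPS")
```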
A
So, you know, it'll be faster than one disk, certainly, but you're not going to get the kind of throughput, especially IOPS throughput, that you would from those 24 disks each running with their own OSD. You're also going to be really limited in terms of memory: by default you'll have like four gigs of memory for that one OSD, but you may want to consider, since it sounds like you have tons of memory, bumping that way up.
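A hedged sketch of that memory bump, assuming python-rados on a Nautilus cluster; this is equivalent to running `ceph config set osd osd_memory_target ...`, and the target value here is an assumption to be sized against the host's actual free memory:

```python
# Raise osd_memory_target from the ~4 GiB default via a mon command.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

cmd = json.dumps({
    "prefix": "config set",
    "who": "osd",
    "name": "osd_memory_target",
    "value": str(16 * 1024**3),  # e.g. 16 GiB for one big multi-disk OSD
})
ret, outbuf, outs = cluster.mon_command(cmd, b"")
print(ret, outs)
cluster.shutdown()
```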
A
So that would be another thing to check: how pegged that container, or the OSD, is. But definitely, the more memory you give it, the more onode cache and the more RocksDB cache you'll have for caching omap data. And if you're doing something like thrashing the cache, constantly reading new SST files just because you've got lots and lots of objects in rgw, then the more of that you can keep cached, the better. So that definitely would be worth looking at, at the very least.
B
So it's highly worth looking at changing our OSD structure; do you think that's at the top of the list? And would you recommend doing the one disk per OSD, or somewhere in between, where we maybe make a few smaller RAIDs? We're open to suggestions; we were just kind of trying to go the simple route, so to say, but that doesn't mean it's the best route. So do you think it's somewhere in between, or do you think we should go all the way to one disk per OSD?
B
We've been doing all our testing on our kind of spare data center, so to say. The only real workload happening on that data center in general is that we are ingesting data into our S3 and into Elasticsearch, but no one's really searching anything and no analytics are running on that thing. So I would say it's kind of in our quote-unquote steady state, just kind of humming along. We even have, like, Jenkins that runs different jobs and stuff.
B
We even looked, when one of these bad behaviors happened, to see whether something happened to be running in the cluster that we weren't aware of. So, honestly, we believe things to be kind of calm in general on this cluster, and at least when we were doing top and checking loads on the boxes that had the RGWs, we didn't see a spike on the actual box that I can think of.
B
While these have been happening, we check our... We do our ingest using NiFi, Apache NiFi, and when this was really bad, I looked at our ingest, both into rgw and into our Elasticsearch on the RBDs, and in both cases our ingest was just fine, which is kind of surprising.
B
Even in a really, really bad state, our ingest on both Elastic and S3 wasn't greatly affected; maybe there's a timeout or two, like a minute, maybe, but nothing that's causing things to be backlogged or anything like that. Which, yeah, is really weird to us: why is that not being affected?
B
We try our best; that's everything we probably should have mentioned. So we have an analytic that, again, doesn't trigger it every time, but I would say if you run it ten times, you're almost guaranteed it'll happen at least once. So what we tried to do is write a very, very simple thing that we thought was kind of replicating what we're doing, something that, you know, does some reads and writes to the same buckets, and just...
B
...instead of doing, you know, actual analysis, we're basically taking data and just moving it around and things like that. And when we tried doing that, it was much harder to make it happen. It's not that it never happened, but it wasn't a case of "we just pounded it hard enough and it falls down"; it didn't necessarily seem like that was what was going on. So we're a little confused by what the actual environment is that causes this thing to be triggered.
D
I mean, whether it's just some of the timing, the kind of mix between reads and writes and the actual numbers and sizes of the objects going around, something seems to hit it. I mean, you know, I'll get an rgw segfault probably on average every third run, but I'll get a slowdown probably every other run. That's pretty significant.
D
If you're using, like, Python requests, I'll get hundreds and hundreds of timeouts trying to get to the S3 interface, and then, when those hundreds become five hundred or a thousand, often that corresponds to one of the RGWs that falls over. But it's pretty common that I get the slowdown; over the job I'll have, you know, hundreds of retries, from a Python requests perspective, trying to access that S3.
B
Once one... sorry, once what we call a timeout happens, we switch which rgw we go to, and things like that, trying to isolate the bad one. But any of that code we wrote doesn't stop it from happening, because obviously other requests are still going to go to that bad one, or maybe it's not a bad one, just the one that does bad things.
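A minimal sketch of that client-side mitigation, rotating to a different RGW endpoint after a timeout; the endpoints, timeout, and retry count are hypothetical, and, as the speakers note, this only routes around a misbehaving RGW rather than fixing it:

```python
# Rotate RGW endpoints when a request times out.
import itertools
import requests

ENDPOINTS = itertools.cycle([
    "http://rgw1.example.internal:8080",
    "http://rgw2.example.internal:8080",
    "http://rgw3.example.internal:8080",
])
current = next(ENDPOINTS)

def get_with_failover(path, retries=5):
    global current
    for _ in range(retries):
        try:
            return requests.get(current + path, timeout=30)
        except requests.exceptions.Timeout:
            current = next(ENDPOINTS)  # switch to the next RGW
    raise RuntimeError(f"all retries timed out for {path}")
```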
E
On 14.2.4... I mean, the main thing I would think of right now is to try to open a bug. And maybe, if it's feasible, try pulling in a debug build or something like that and see if we get a more detailed core dump out of it, or a stack trace.
E
It is five minutes from now, and it is also on Bluejeans; I can give you the number if you like, yeah.
B
We'll definitely call in, in a minute; that'd be great. Unless you all had anything else: you know, we're definitely going to try the memory change, and if we can do a rebuild, we'll try the OSD change. Are there any other settings or anything at all you all can think of? Or, I guess, maybe the rgw folks might have some ideas of what we could tweak inside of Beast, yeah.
B
We will try some things out and, if you all don't mind, one way or another, whether we have positive or negative or some news, we'll call in again and just let you know, just because it's always nice to have a little closure, I find. So we'll let you know. And maybe you'll come up against something and say, oh, we started seeing something similar. And absolutely, if we ever come up with a simple thing that can replicate it at all...
B
...we would absolutely be happy to share what we can; we do wish we could find what that thing is. But again, we really appreciate all your time; at least we've got some leads to track down a little bit, and, you know, we'll report back to you all.
A
I'd say, just quickly, since I know you've got to get to the next meeting here: if you could document what you're seeing in the tracker, create a new bug there, even if you don't know what it is yet, and just write down some of the things that you've told us now, that way we can keep it in the system and have a record.
B
We really appreciate it; thanks again. We'll try the next one again; thanks for the leads, and we hope to talk to you soon with some good news. Thank you all.
C
No, I'm just sitting in, trying to see what you guys are up to with performance. I'm from Seagate, doing smart autonomous storage devices, looking at Ceph and embedding stuff and those kinds of issues, and just kind of looking at what you guys are doing in the way of performance.
I just thought I'd sit in for the next couple of weeks, kind of listen in, and see what you guys are up to.

A
Oh, neat.
C
I definitely work in all that space; I attend the NVMe key-value meetings for Seagate and also work with the Kinetic team here doing key-value inside the drives. There are some optimizations that can be done in the drive that can't be done outside, so that's my interest in pulling the data management into the drive: because it's so tightly connected. But Ceph's memory footprint is a little big, and we're trying to figure out...
C
Yes, we're trying to figure out where those pieces are and where we might do some optimizations. Also, I work with UCSC, and I don't know if you've ever heard of CROSS; it's the Center for Research in Open Source Software. We have a program there called eusocial storage, and that's based around the idea of a system that gets rid of the OSD server, right, so that you can have these drives be OSDs, essentially, and just allow you to rack and stack the drives.

A
One thing, though, if you guys are interested in that: I think, going forward, one of the big questions we have is that right now we keep the PG log, basically the log of recent operations on a PG, in memory, and we write it out to disk. But that's not going to work long term, especially as devices keep getting faster and faster; at least in my opinion it's not, because you can't keep a log of any real length of time in memory.
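A toy sketch of the constraint being described: an in-memory log of recent operations has to stay bounded, so old entries get trimmed as new ones arrive (in Ceph the trimmed history is persisted; this sketch just drops it). The length bound is illustrative, not Ceph's actual default:

```python
# Bounded in-memory op log: appending past maxlen evicts the oldest entry.
from collections import deque

pg_log = deque(maxlen=3000)  # assumed bound on in-memory entries

def record_op(version, op):
    pg_log.append((version, op))  # old entries are "trimmed" automatically

for v in range(10_000):
    record_op(v, f"write obj-{v % 7}")

print(len(pg_log), "entries in memory; oldest version is", pg_log[0][0])
```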
C
And then the final piece that makes the memory situation even worse is that, you know, the ultimate goal for eusocial storage is to leverage things like the object classes to do computational storage, so memory becomes even more of a premium, because you need to save some for processing other things.
C
When you start talking about SSD transfer rates: if I can distribute these data loads across, you know, a thousand DDR buses instead of just one DDR bus per proc on the host, I get a huge benefit, right. So anyway, that's what we're looking at right now. It's definitely futuristic, but we're trying to move the ball slowly in that direction.
A
Yeah, definitely; feel free to stop by for the meetings that we've got here. You know, I'm not sure exactly what we'll be covering in the next couple of weeks, but some of the things we've been going over recently: switching to the smaller min_alloc size for the OSD, or for bluestore rather, and some of the rgw things that have come up recently.