From YouTube: Running Ceph on Flashcache - Paweł Sadowski
Description
Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Paweł Sadowski, OVH DevOps
The first case is cache pollution. Cache pollution happens when you put data into the cache that you don't need. For example, when deep scrub is running, it reads all of your data and fills the cache with data that won't be needed again until the next deep scrub. This is the problem with most caching solutions: you have to tell the caching mechanism that this data won't be needed soon.
So there's no need to cache it. Deep scrub requires large reads, and those large reads destroy the hot data set in the cache. The same story happens during backups, and also during recovery. During recovery not only reads but also writes happen, so they also push the hot data set out of the cache. So what can we do to avoid that situation?
All the layers on top of the I/O path would have to support it: it has to be supported by the OSD, it has to be supported by the caching device, it has to be supported by the kernel, so it's hard to make that change. So what can we do? Flashcache has a mechanism that is called PID blacklisting, so you can disable caching for I/O from a particular process.
It has some advantages: if you have a backup tool that is running locally on the machine, you can just pass its process ID to tell Flashcache that I/O from this process shouldn't be cached. But it's not that easy to find the thread ID that is actively doing the work, and PID blacklisting works only with direct I/O. So when you have writes that were delayed and are later flushed from the system buffers, they come from the kernel, so you don't know the process ID. So we couldn't use that solution.
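For illustration, PID blacklisting is driven through ioctls on the cache device. Below is a minimal Python sketch assuming the command layout from flashcache's flashcache_ioctl.h (ioctl type 0xfe, FLASHCACHEADDBLACKLIST as command 200 taking a pid_t); verify those values against the header shipped with your flashcache build before relying on this.

```python
import fcntl
import os
import struct

# Assumption: FLASHCACHEADDBLACKLIST = _IOW(0xfe, 200, pid_t), per the
# flashcache_ioctl.h layout assumed here; check your flashcache source.
def _IOW(ioc_type: int, nr: int, size: int) -> int:
    # Linux _IOW encoding: direction=write (1) in bits 30-31, payload
    # size in bits 16-29, type in bits 8-15, command number in bits 0-7.
    return (1 << 30) | (size << 16) | (ioc_type << 8) | nr

FLASHCACHEADDBLACKLIST = _IOW(0xFE, 200, struct.calcsize("i"))

def blacklist_pid(cache_dev: str, pid: int) -> None:
    """Ask flashcache not to cache direct I/O issued by this PID."""
    fd = os.open(cache_dev, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FLASHCACHEADDBLACKLIST, struct.pack("i", pid))
    finally:
        os.close(fd)

# Example: exempt a locally running backup process from caching.
# blacklist_pid("/dev/mapper/cachedev", 12345)
```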
Another mechanism that Flashcache supports is the skip-sequential threshold. When you do large reads or large writes in sequence, Flashcache can detect such a situation and can skip caching of such I/O. All of the operations that I mentioned before, like deep scrub, recovery, and backup, are usually big sequential operations, so it's easy to detect them and skip caching for such I/O.
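As a sketch of what that tuning looks like: Flashcache exposes the threshold as a per-cache-device sysctl, skip_seq_thresh_kb. The cache-device name and the 1024 KB value below are assumptions; list the real knobs with `sysctl -a | grep flashcache` on your host.

```python
from pathlib import Path

def set_skip_seq_thresh(cache_name: str, thresh_kb: int) -> None:
    """Sequential runs larger than thresh_kb bypass the cache (0 caches all)."""
    # Same knob as `sysctl dev.flashcache.<cache_name>.skip_seq_thresh_kb`,
    # reached here through procfs; needs root.
    knob = Path(f"/proc/sys/dev/flashcache/{cache_name}/skip_seq_thresh_kb")
    knob.write_text(f"{thresh_kb}\n")

# Example: skip caching for sequential runs larger than 1 MB, which
# catches the big deep-scrub, backup, and recovery I/O.
set_skip_seq_thresh("sdb+nvme0n1", 1024)
```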
The second story is about matching the block size on the storage layers. When we were tracking latency on our clusters, we noticed that client latency was becoming higher and higher for an unknown reason. We found out that the client was doing some snapshots and forgot to remove those snapshots, and multiple snapshots on the same image were causing large sets of data to be copied.
C
Those
fences
such
effect
that
each
right
to
the
new
right
that
will
copy
that
will
try
to
copy
and
write.
Some
blood
will
have
to
also
update
the
exact
tributes,
and
this
took
way
much
too
much
time
like
300
milliseconds,
and
so
after
adding
this
this
option,
we
saw
that
the
latency
has
dropped
to
the
normal
normal
levels.
So
this
this
option
was
not
was
not
mentioned
in
the
documentation.
So
we
that
was
one
of
the
first.
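The transcript doesn't name the option here. One plausible reading of "matching the block size on the storage layers" is the cache block size chosen when the Flashcache device is created; a hedged sketch of that follows, with placeholder device names (flashcache_create's -b flag sets the block size).

```python
import subprocess

def create_cache(cachedev: str, ssd: str, hdd: str, block_size: str = "4k") -> None:
    """Create a writeback flashcache with an explicit cache block size."""
    # -p back selects writeback mode; -b sets the cache block size so it
    # can be matched to the block size used by the layers above and below.
    subprocess.run(
        ["flashcache_create", "-p", "back", "-b", block_size, cachedev, ssd, hdd],
        check=True,
    )

# Placeholder devices; adjust to your SSD partition and backing HDD.
create_cache("cachedev", "/dev/nvme0n1p1", "/dev/sdb")
```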
We used another file system for the OS, so we were able to log into that server, and that's why we could find out what was happening. So, how Flashcache performs in our case: we get over 50% more I/O from the HDDs. Writes are more localized, so they are merged before being sent to the device, which means that we have fewer seeks on the hard drives, and we use NVMe for the journal.
The only change we did: we noticed that the old kernel doesn't allow you to increase the disk queue depth on the HDDs; it's fixed at 128 by default. On the new kernel we could increase that to 8000, which allows the kernel to reorder the I/O to the device, which improves the performance. So reads are also merged before being sent to the device.
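A minimal sketch of that queue-depth change, assuming a rotational device at /dev/sdb; the knob is nr_requests under /sys/block/<dev>/queue/.

```python
from pathlib import Path

def set_nr_requests(dev: str, depth: int) -> None:
    """Raise the block-layer request queue depth for a device (needs root)."""
    # A deeper queue lets the I/O scheduler reorder and merge more
    # requests before they reach the spinning disk.
    Path(f"/sys/block/{dev}/queue/nr_requests").write_text(f"{depth}\n")

# 128 is the usual default; 8000 is the value from the talk.
set_nr_requests("sdb", 8000)
```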