From YouTube: Ceph Performance Meeting 2022-03-03
Description
Open Cache Acceleration Software: https://open-cas.github.io
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A
Okay, well, it looks like we've got a couple people here already, so we'll just maybe get started. So, Mikhail, feel free to share your screen and I'll turn it over to you.
B
All right, so let's start then. Hello, everyone. Once again, my name is Mikhail Sadinsky. I'm a cloud software architect at Intel, based in Poland, and my work is focused around caching, around storage caching, including Open CAS, and this is the topic that I want to show you today: what Open CAS is and what the features are.
B
All right, so a quick agenda. First I will do a very high-level Open CAS overview: I will talk about what Open CAS is and how it can be used. Then I will go through the features from a user or administrator standpoint. And finally, I will do a quick intro to the Open CAS architecture, a bit of technical detail, but not too much. All right, so let's start: what Open CAS is.
B
The important thing to note here is that you don't need to provision the backend device in any way. You can take your existing HDD with data, let's say, attach it to Open CAS, and all your data will still be there and can still be accessed; there is no data loss during this procedure. After you finish that quick startup, that quick configuration, all your data will be visible from a new block device like cas1-1, as you can see on the screenshot here from lsblk.
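A minimal sketch (not from the talk) of the quick setup just described, assuming the casadm utility from Open CAS Linux is installed; the device paths and the long option names (--start-cache, --cache-device, --add-core, --core-device, --cache-id) are assumptions based on the casadm tool mentioned later in the talk.

```python
import subprocess

def casadm(*args):
    """Run a casadm command and fail loudly if it returns non-zero."""
    subprocess.run(["casadm", *args], check=True)

# Start cache instance 1 on a fast device (placeholder path).
casadm("--start-cache", "--cache-device", "/dev/nvme0n1", "--cache-id", "1")

# Attach an existing HDD with data as a backend ("core") device; its data is
# untouched and becomes accessible through the new virtual device, e.g. /dev/cas1-1.
casadm("--add-core", "--cache-id", "1", "--core-device", "/dev/sdb")
```

After this, lsblk should show the new cas1-1 device, as on the screenshot the speaker refers to.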
B
I mean a one-to-many configuration, in which case you use a single cache device to accelerate multiple backend devices, and in this case the cache is shared. So if there is a need for more cache for one backend device, it can be borrowed from another one. It's simply one cache area accelerating multiple hard drives, or multiple other types of backends.
B
It's also possible to stack one cache instance on top of another. I don't have this on the slide, but CAS devices, I mean those cas1-1 or cas1-2, are regular block devices, so they can be used as input for another cache instance. So, for example, you can have a very, very fast cache on the very top, and you can use that very fast device to accelerate another cache instance that uses a slower cache device, and you end up with a kind of tree of cache instances.
B
And, as I mentioned, it's fully transparent; there's no data loss on the backend devices. How does it fit into Ceph, then? Since Open CAS exposes regular block devices, you can simply deploy an OSD on top of the CAS device. Instead of deploying it directly on a hard drive, you configure the cache instances and then start to use those exposed virtual devices instead of the original ones. And there are two possibilities again: the first picture is for one-to-one and the second one is for one-to-many deployments. In the first one,
B
Both approaches have pros and cons, but since in Ceph the traffic across OSDs should be more or less evenly distributed, the better choice seems to be the one-to-one deployment, because the cache metadata is not shared and that gives some performance optimization. A shared cache is more for cases where the traffic between backend devices is not distributed evenly.
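A hedged sketch (not shown in the talk) of what the one-to-one layout looks like from the Ceph side, assuming ceph-volume is available and that /dev/cas1-1 is the exposed CAS device sitting on top of this node's HDD:

```python
import subprocess

# Create a BlueStore OSD on the virtual CAS device instead of the raw HDD.
subprocess.run(
    ["ceph-volume", "lvm", "create", "--bluestore", "--data", "/dev/cas1-1"],
    check=True,
)
```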
C
B
No, we have more than one caching mode; we have four. We have write-through, we have write-back. As for data loss: at this point I mean that when you start a new cache instance, you don't need to do any kind of provisioning on the backend, so you can take an existing block device with existing data, plug it into an Open CAS instance, and you will still have access to this data through that Open CAS instance.
B
Okay, so let's get to the Open CAS features, what Open CAS can offer you. So, cache modes. Cache modes are the simplest feature that every caching software should have. We currently have four caching modes. Write-through is probably the simplest one: in this case, your data is always in sync between the cache and the backend storage, and because of that it accelerates only reads; writes are always targeted to both the cache and the backend storage, so there is no acceleration of writes. As for write-back:
B
Both reads and writes are accelerated, but in this case some data might be out of sync between the cache and the backend storage. So in case you want to stop the cache instance, you should first flush the dirty data, the data that is out of sync with the backend storage, and then you can safely detach the backend device from the cache instance. There is also background flushing, background synchronization.
B
With the write-only cache mode, we insert data into the cache only when handling write requests; read requests are handled in a pass-through manner, as they are served directly from the backend storage, of course assuming that the data for a given LBA is not in the cache already. In case it is in the cache, we of course return the data from the cache, because it's the most recent copy. And then there is write-around.
B
Write-around also accelerates only reads, so in this manner it's similar to write-through. The difference is that data is inserted into the cache only during read handling; writes are targeted directly to the backend storage, so it's kind of the opposite of write-only.
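A hedged sketch of switching between the cache modes described above at runtime; the --set-cache-mode command and the short mode names (wt, wb, wo, wa) are assumptions about the casadm interface, not taken from the talk.

```python
import subprocess

def set_cache_mode(cache_id: int, mode: str) -> None:
    # mode: "wt" write-through, "wb" write-back, "wo" write-only, "wa" write-around
    subprocess.run(
        ["casadm", "--set-cache-mode", "--cache-mode", mode,
         "--cache-id", str(cache_id)],
        check=True,
    )

set_cache_mode(1, "wb")  # accelerate both reads and writes; dirty data allowed
```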
B
It applies to the write-only and write-back cache modes, because flushing is responsible for synchronizing data between the cache and the backend storage. In Open CAS, flushing happens in two different ways: you can force Open CAS to flush all the data manually using the command-line utility, but Open CAS also flushes the dirty data in the background, and this background flushing is controlled by a cleaning policy. Currently we have two cleaning policies.
B
The second one is the ACP cleaning policy, which stands for aggressive cleaning policy and is best for HDDs, because this flushing policy tries to sequentialize data as much as possible. It simply selects the dirtiest region of the LBA domain in the cache and flushes that first: it divides the whole LBA range of the backend device into chunks, then flushes those chunks in order of their percentage of dirty data, so there is the highest possible probability of forming sequential regions.
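A hedged sketch of the two flushing paths just mentioned, assuming casadm's --flush-cache command and a --set-param interface for the cleaning policy; the policy name "acp" is taken from the talk, the exact flags are assumptions.

```python
import subprocess

def casadm(*args):
    subprocess.run(["casadm", *args], check=True)

# Force a manual flush of all dirty data for cache instance 1.
casadm("--flush-cache", "--cache-id", "1")

# Select the aggressive cleaning policy (ACP) for background flushing,
# e.g. when the backend devices are HDDs.
casadm("--set-param", "--name", "cleaning", "--cache-id", "1", "--policy", "acp")
```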
B
But it's not always the best strategy, because there might be cases where you have some important data in the cache, but from time to time very random traffic comes along that just touches blocks which are never used again. For that we have the nhit promotion policy, which inserts data into the cache only when a given LBA, a given block, was accessed more than a specific number of times within some time window. You can configure how many times a block should be accessed in order to be inserted into the cache.
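A hedged sketch of enabling the nhit promotion policy described above; the --set-param names ("promotion", "promotion-nhit") and the --threshold option are assumptions about the casadm parameter interface, not something stated in the talk.

```python
import subprocess

def casadm(*args):
    subprocess.run(["casadm", *args], check=True)

# Switch promotion from the default ("always insert") to nhit.
casadm("--set-param", "--name", "promotion", "--cache-id", "1", "--policy", "nhit")

# Require a block to be accessed at least 3 times before it is inserted.
casadm("--set-param", "--name", "promotion-nhit", "--cache-id", "1", "--threshold", "3")
```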
B
So, for example, you can classify data based on request size or LBA ranges. If you are using a file system, then you can classify based on file-system-related attributes like file name, directory or file size, and on some process-related things like PID or process name. And recently we also added support for the write lifetime hint, which is available in kernels starting from 4.12 or 4.13, I don't remember exactly, and it is supported, for example, by RocksDB.
B
It simply allows an application, a user-space application, to specify some priority, the expected lifetime of the data in a given file or LBA, and we can utilize this write lifetime hint in Open CAS to, for example, put data that is expected not to be touched for a long time, static data, directly on the backend storage, while very dynamic data that is very often overwritten is put into the cache with the highest priority.
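A hedged sketch of how a user-space application can attach the write lifetime hint mentioned above to a file; the constants are copied from the Linux uapi headers (F_SET_RW_HINT, RWH_WRITE_LIFE_*) and require a 4.13-or-newer kernel, and the file path is a placeholder.

```python
import fcntl
import os
import struct

F_SET_RW_HINT = 1036          # F_LINUX_SPECIFIC_BASE (1024) + 12
RWH_WRITE_LIFE_SHORT = 2      # data expected to be overwritten soon
RWH_WRITE_LIFE_EXTREME = 5    # data expected to stay untouched for a long time

fd = os.open("/mnt/data/hot.db", os.O_WRONLY | os.O_CREAT, 0o644)
# Tell the kernel (and, through it, a hint-aware cache) that writes to this
# inode are short-lived, so they are good candidates for caching.
fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", RWH_WRITE_LIFE_SHORT))
os.close(fd)
```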
E
Sorry, you said, I think you said that you are able to say which, I don't know, file name or directory will be kept out of the cache. Can you also say which one is the only one that will be in the cache, and all the others would be out of the cache? So one directory you want to be kept, and all the rest are not?
B
Yes, you can. Based on those attributes you can build some classification rules. So, for example, you can say that you want to cache directory A, directory B, directory C, directory D, but you don't want to cache anything else; or you can say that you don't want to cache directories A, B, C and D, but you want to cache everything else.
B
You can also assign different priorities to each directory. You can specify that directory A is more important for you, so you assign priority one to it; directory B gets priority two, and so on. You can even use some logic operations and combine those attributes together, so you might want to cache, for example, all files that are smaller than, let's say, 40 kilobytes and that are placed in directory A.
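A hedged sketch of the kind of classification rules just discussed, written as an Open CAS IO class configuration; the CSV columns, the rule syntax (directory:, file_size:, '&' as logical AND) and the --io-class --load-config command are assumptions based on the Open CAS documentation, and the paths are placeholders.

```python
import subprocess

ioclass_csv = """\
IO class id,IO class name,Eviction priority,Allocation
0,unclassified,22,0
1,directory:/mnt/data/a&file_size:le:40960,1,1
2,directory:/mnt/data/b,2,1
"""

# Class 1: files under /mnt/data/a smaller than 40 KiB, highest priority.
# Class 0 (everything unclassified) is not allocated in the cache at all.
with open("/etc/opencas/ioclass.csv", "w") as f:
    f.write(ioclass_csv)

subprocess.run(
    ["casadm", "--io-class", "--load-config", "--cache-id", "1",
     "--file", "/etc/opencas/ioclass.csv"],
    check=True,
)
```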
B
…avoid caching it, and you can configure when the sequential cutoff should be triggered. We have three options, actually: one is that you want sequential cutoff to be active only when your cache is full; the second option is that you want sequential cutoff always, even when you have a lot of free space in the cache; and the last one is to disable sequential cutoff entirely. And you can also configure
B
the minimum size of such sequential streams. It means you can configure how much sequential data has to be sent to a given block device in order to be treated as a sequential stream, and we can track multiple sequential streams that way, so you can have multiple applications writing to the same block device.
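A hedged sketch of configuring the sequential cutoff behaviour described above; the --set-param names and options (policy always/full/never, threshold in KiB) are assumptions about the casadm interface.

```python
import subprocess

def casadm(*args):
    subprocess.run(["casadm", *args], check=True)

# Treat a per-core stream as sequential once 1 MiB has been seen, and bypass
# the cache for it even while there is still free cache space ("always").
casadm("--set-param", "--name", "seq-cutoff",
       "--cache-id", "1", "--core-id", "1",
       "--policy", "always", "--threshold", "1024")
```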
F
B
F
B
F
B
All right, okay, let's get to the next one: manageability. As I said at the very beginning, Open CAS can be controlled using a command-line utility. This is good if you just want to play with Open CAS, to test it, but in a production environment it's not a very friendly method to use. So we have a configuration file that you can use: you can dump your whole cache layout into it, and then the system startup scripts will re-create the cache configuration on startup.
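A hedged sketch of persisting the cache layout in such a configuration file so that startup scripts can re-create it on boot; the /etc/opencas/opencas.conf path, the [caches]/[cores] section layout, and the device paths are assumptions based on Open CAS for Linux, not details given in the talk.

```python
config = """\
[caches]
## cache id   cache device                     cache mode
1             /dev/disk/by-id/nvme-CACHE_DEV   wb

[cores]
## cache id   core id   core device
1             1         /dev/disk/by-id/ata-HDD_DEV
"""

with open("/etc/opencas/opencas.conf", "w") as f:
    f.write(config)
```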
B
And we also provide statistics for the cache, to monitor the cache. We have about 30 counters showing how the cache is used and by what kind of data. We collect those counters at different levels: you can query for statistics at the whole cache instance level; you can query at the single backend device level, if you are caching multiple devices with a single cache device; or you can check statistics at the IO class level,
B
if you are using IO classification. And you can either output it on the screen in a human-readable format, as tables, or export it to CSV files for some machine processing.
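A hedged sketch of pulling those statistics at the three levels mentioned (whole cache, single core device, single IO class) and exporting them as CSV; the --stats selectors and the --output-format option are assumptions about the casadm interface.

```python
import subprocess

def stats(*selector):
    return subprocess.run(
        ["casadm", "--stats", "--cache-id", "1", *selector,
         "--output-format", "csv"],
        check=True, capture_output=True, text=True,
    ).stdout

whole_cache = stats()                      # whole cache instance
one_core    = stats("--core-id", "1")      # a single backend (core) device
one_ioclass = stats("--io-class-id", "1")  # a single IO class
print(whole_cache)
```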
B
H
B
All right, so let's get a bit more into technical details, a bit into the architecture: how Open CAS organizes its cache space. As a typical cache, we divide the cache space into cache lines.
B
You can configure what cache line size Open CAS should use; we support from 4k to 64k, in powers-of-two steps. It is important to select the optimal cache line size, because all the caching operations we perform, like mapping, eviction and so on, are performed with cache line granularity.
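A hedged sketch showing that the cache line size is chosen when the cache is started; the --cache-line-size option (value in KiB: 4, 8, 16, 32 or 64) is an assumption about casadm, and the device path is a placeholder.

```python
import subprocess

# Start cache instance 1 with a 16 KiB cache line, to roughly match a workload
# whose average request size is around 16 KiB.
subprocess.run(
    ["casadm", "--start-cache", "--cache-device", "/dev/nvme0n1",
     "--cache-id", "1", "--cache-line-size", "16"],
    check=True,
)
```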
B
We don't perform any padding or prefetch. So, for example, if you have a 64k cache line and you request 4k of data, you send a 4k request, then we would read only 4k of data from the backend and put that 4k into the cache, even though the full 64k cache line is mapped to the 64k region on the backend.
B
That prevents inflating the bandwidth to the backend device, and it's also important during flushing: we flush only the dirty sectors. So if you have a 4k cache line, for example, and only one 512-byte sector is dirty in that cache line,
B
we would flush only 512 bytes, not the full 4k cache line, so we avoid write amplification. Because of that, it is important to set the correct cache line size: if your cache line size is too large, you might end up with suboptimal cache utilization, like in this example. You can see that this is a 4k cache line size example, the smallest one, but you see that some sectors are mapped but invalid.
B
Only 20 contain valid data in the cache, and if your workload is, for example, a 4k random workload and you set a 64k cache line size, then your effective cache utilization would be 1/16 of your cache size. On the other hand, if the average request size of your workload is, let's say, 64k and you set a 4k cache line size, nothing bad happens:
B
your cache would be utilized perfectly, but Open CAS would need to track more cache lines than necessary, so it would consume more DRAM for metadata and more CPU cycles for the caching logic. So the optimal configuration is to match your average request size with your cache line size.
B
It's because our whole caching logic is contained in a caching library that we call the Open CAS Framework, which is platform independent, and currently we have two CAS products. One is Open CAS for Linux, Open CAS for the Linux kernel, which I mostly described in this presentation. We also have Open CAS for SPDK.
B
In order to integrate it with some storage stack, you need to wrap the Open CAS Framework into either a driver or your application and provide a top adapter and a bottom adapter. The top adapter is a layer on top of the Open CAS Framework which is responsible for accepting requests from the storage stack, so it needs to understand how the given storage stack sends requests: in the kernel it would be bio-based, in SPDK it would be bdev.
B
There is no data copy; only the IO request description is transformed. We don't copy data, the data always stays in the buffers that the application or the upper layers send to us. On the other hand, there is the bottom adapter, which is responsible for sending requests to the cache and the backend device. Whenever the Open CAS Framework wants to perform some IOs, and it has to perform some IOs, it has to put data into the cache, get it from the backend storage and so on, it doesn't perform them directly, because it doesn't know how to do that.
B
And two examples of how it works, which I partially already described: on the left-hand side you have the architecture for Open CAS for the Linux kernel; on the right-hand side it's for SPDK. So the top adapter, which is part of the Open CAS kernel driver, accepts requests from the block layer in the Linux kernel and translates them into Open CAS Framework requests. The Open CAS Framework performs all the caching logic: it decides where to put data, what to do with the data and so on.
B
Basically everything that a storage cache software needs: the caching engines, support for IO classification, cache partitioning, all the policies that we have (eviction, promotion and cleaning), the background cleaner implementation, and metadata handling. Statistics are also collected and managed inside the Open CAS Framework, and there is an API for management; all of these items are exposed from the Open CAS Framework through its API.
A
Out of curiosity, I know that there was some concern about showing direct benchmarks, but in the past it looked like you guys have seen some advantage from the way that you handle your promotion and evictions.
B
We compared this with dm-cache, and we found that dm-cache, especially for smaller request sizes, generates very high write amplification and read amplification.
B
It's because dm-cache organizes its data in chunks, but those chunks are much, much larger. I don't remember the minimum chunk size for dm-cache, whether it's 32k or higher, but that's not that important. More important is that when you, for example, read or write data, when you send a request that is smaller than the chunk size, dm-cache has to handle the whole chunk, because dm-cache doesn't track
B
A
One of the things I remember when looking at dm-cache is that there was a maximum number of chunks as well. I believe if you exceeded that, then your chunk size was automatically incremented by two; I don't remember what the limit was, maybe a million chunks or something like that, and then your chunk size was automatically increased. Do you recall that, is that correct, am I thinking about that right?
B
Right, yeah, there is a limit on the number of chunks, so you have to choose: if you have a big cache, you may not be able to choose the smallest possible chunk size because of that requirement. So even if the minimum chunk size is 32k, that limit might force you to use, let's say, a 64k or 128k chunk size, which would further increase the write and read amplification.
B
F
B
Memory consumption is directly dependent on your cache size, so on the number of chunks. This is why I mentioned that you should match your cache line size with your average request size, because if you choose too small a cache line size, then you would consume more DRAM than is really required.
B
We are in progress, we are doing some benchmarking right now. We have some preliminary results, but they are not ready to present at this point. We are doing extensive benchmarking right now, and we should have some data soon.
A
There was some benchmarking that was done versus dm-cache maybe five or six months ago. Are those results something that we can eventually share with the community, or would those still be governed by NDA?
B
Yeah, let me check that. I don't have those results right now, but let's check, and we can potentially follow up in one of the next meetings. Okay.
A
Yeah, thank you, Mikhail, thank you for presenting, this was really interesting. It would definitely be really interesting to see some of those earlier results, or the new results that you guys have been working on, especially with hard drives. So yeah, I would absolutely be interested in having you guys do a follow-up. Well, thank you very much. You're welcome.
A
I don't have anything else for people, so does anyone have anything for the last 10 minutes that they want to bring up or talk about before we wrap up? All right, well then, thank you, Mikhail, and thanks everyone for coming. Have a great week and we'll see you next week. Thank you, have a good one.