Ceph CDS G/H, 25 Jun 2014

From YouTube: CDS G/H (Day 2) - RBD: Database Performance

Description

https://wiki.ceph.com/Planning/CDS/CDS_Giant_and_Hammer_(Jun_2014)

25 June 2014
Ceph Developer Summit G/H
Day 2
RBD: Database Performance

A

Looks like we have jyg, one of the other people. Mt wong was here a bit ago, but he dropped.

A

Yeah, well, it looks like we're ready to move to the database performance one. So Luke, are you able to hear us all right?

C

Hi, can you hear me?

C

Okay, so by the way, we are waiting for Luke to come. He just finished coming from another meeting. Okay, yes, he's in now.

A

Great, okay, no problem. We can wait a minute, that's fine.

B

Thanks, guys, for giving us a chance to talk about our idea. The blueprint that we put up is basically to look at:

B

Can we get databases and whatnot to run on RBD specifically? I think a lot of us are probably aware that a lot of discussion has come up on the mailing lists in the last couple of days and weeks. One of the things we hope to achieve out of this blueprint is maybe a set of documents or best practices, where different kinds of databases can have a different set of documents on what to do and what not to do, along with benchmarks and so forth.

B

So generally, that's what we hope to achieve with this particular blueprint. And obviously one of the hidden agendas, the ultimate goal, is to see how we can leverage Ceph's capability to run databases over a really wide area network. Rather than using multi-site replication, can we just do it in one single cluster covering the whole wide area network, for example?

B

So that's the goal. I know it's quite ambitious, but that's the thinking behind why we put up this particular blueprint.

E

Cool. So you said that you're talking a little bit about testing over wide area networks as well as LANs.

E

I think it might be good just to get a kind of baseline for a lot of different workloads, to see where things are at.

B

Yes, basically, that's what we would like to look at. But I think the first step we are trying to do right now is just to understand how a database behaves, for example, in a virtual environment with RBD underneath it.

B

We recently did some benchmarks with Postgres in the little testbed that we have, and we found that the performance is not really what we were expecting. Obviously, we don't have the luxury of having SSDs and whatnot; we are just doing it on SATA drives for our OSDs. So what we really hope, moving forward, is: are there any other people within this community who are willing to share experiences and best practices?

B

Not just Postgres, which we are using, but MySQL, NoSQL, or other workloads. Obviously, once we have some of these things tied down, then maybe we can move forward towards some use cases for wide area networks as well. So I think our first step is at least to figure out what works, what doesn't work, and what needs to be done.

A

Yeah, I think the database performance is going to be interesting because, in my limited understanding of what databases are usually doing, they usually have a journal file where they have a stream of small writes that they're fsyncing or using direct I/O to lay down. And for that high rate of small writes, with the general striping strategy,

A

Those are going to pile up on a single object, which could potentially be a problem, at least for high-performance databases. This is one of the reasons why we added the fancy striping, so you can shard that across a lot of objects. On the other hand, you don't want to do that for the full image, because it'll tend to amplify your writes when they span stripe boundaries.
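To make the striping point concrete, here is a minimal Python sketch of a RAID-0 style mapping from an image byte offset to an object number, in the spirit of the stripe unit / stripe count / object size layout described in the Ceph documentation. The function names and the "fancy" layout values (64 KiB stripe unit, stripe count 8) are assumptions chosen for illustration; they were not given in the session. It is only meant to show why a sequential stream of small journal writes lands on one object with the plain layout and on several objects with fancy striping.

```python
def object_for_offset(offset, stripe_unit, stripe_count, object_size):
    """Map an image byte offset to (object number, offset within that object)."""
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit        # index of the stripe unit in the image
    stripe_no = block_no // stripe_count    # row of stripe units across the object set
    stripe_pos = block_no % stripe_count    # which object of the set gets this unit
    object_set = stripe_no // stripes_per_object
    object_no = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object


def objects_touched(start, length, io_size, **layout):
    """Set of object numbers hit by back-to-back writes of io_size bytes."""
    return {object_for_offset(off, **layout)[0]
            for off in range(start, start + length, io_size)}


KiB = 1024
MiB = 1024 * KiB

# Plain layout: one 4 MiB object at a time (stripe unit equals object size).
default_layout = dict(stripe_unit=4 * MiB, stripe_count=1, object_size=4 * MiB)
# Hypothetical fancy layout, values picked only for the example.
fancy_layout = dict(stripe_unit=64 * KiB, stripe_count=8, object_size=4 * MiB)

# 1 MiB of sequential 4 KiB journal-style writes.
print(len(objects_touched(0, MiB, 4 * KiB, **default_layout)))
print(len(objects_touched(0, MiB, 4 * KiB, **fancy_layout)))
```

With these assumed values the same 1 MiB of journal writes is spread over eight objects instead of one, which is the sharding effect mentioned above; the cost is that a single larger write is more likely to cross a stripe unit boundary and be split.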

A

So I assume that big databases like Oracle and so forth will let you specify a separate device for the database journal from all the other data. At least that's my assumption; I actually don't know if that's true. I don't know that Postgres and MySQL do that at all, though, since they're designed to run on traditional file systems.

E

I'm not sure how much of an issue that's actually going to be for spanning multiple objects, just because the kernel will break up larger requests into much smaller requests in any case, so I'm not sure how large the requests can actually be.

A

Right. I think for mid-size and small databases, it's not going to make a big difference in any case. Having the journal pound on one object for a while and then move to the next object for a while is not going to be too limiting, especially since it's pure write, so you're not doing any reads.

A

I think for high-performance databases it could be problematic, and you'd want to very carefully choose your stripe size and probably configure the database to lay out its journal events aligned to that stripe width and so forth. But I don't know if Postgres or MySQL will do that.

B

Okay, good. So with that in mind, has Inktank or any of the other guys out there done some similar work that may be helpful for some of us here to build on, and maybe come up with some packages or work packages or whatnot?

A

What do you mean by packages? What kind of packages, what kind of work?

B

Not software packages, but more like items that we can work on in sequence, so that we can feed back to the community and say: okay guys, we have done something like this, and we think this is a good set of best practices that you can start off with. And maybe we can get more discussion rolling later on.

E

Deck in the chat here is saying that they've been running Postgres infrastructure on RBD and they're happy to help in trying to improve that, and also that Postgres actually can put the journal on a separate mount point.

D

We did some tests running MongoDB on RBD, but without much luck. Performance was really bad, especially since MongoDB can be configured to run multiple shards on different VMs with replica sets. So you run into problems like having multiple VMs writing to the same pool, so with replica level three, for example, you get the data written nine times or something like that, and that's not really performing.

D

So you need to split these over different pools and make sure the pools are not hitting the same hardware. And for testing, we ran the database directly on real hardware, traced it with the block tracing tools, and replayed it on the VM to see what the difference really is between real hardware and RBD.

D

Maybe in this case you can run that to see the performance differences.
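The "written nine times" figure mentioned above is simple multiplication: each database-level replica writes its own copy, and each of those copies is replicated again by the storage pool. The sketch below just spells that out, assuming three database replicas in separate VMs and a pool replica level of three; the function name is hypothetical.

```python
def physical_copies(app_replicas, pool_size):
    """Copies of each logical write that finally land on disks.

    app_replicas: database-level replicas, each in its own VM (assumed).
    pool_size:    replica level of the pool backing each VM's RBD image.
    """
    return app_replicas * pool_size


# Three database replicas on a pool with replica level three.
print(physical_copies(app_replicas=3, pool_size=3))
# Same database without application-level replication.
print(physical_copies(app_replicas=1, pool_size=3))
```

Splitting the VMs across pools on different hardware, as suggested above, does not change this product; it only keeps the copies from contending for the same disks.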

A

I wonder if the way to focus in on what the specific issues are and how to best address them is to get people to focus on one particular database and then do a series of experiments to narrow down exactly what we can do to improve it, whether it's the latency on the write-ahead log, for example.

A

So one thought off the top of my head would be that if Postgres lets you put the write-ahead log on a different device, try putting just the write-ahead log on RBD and see what the effect is, or put the database on RBD but put the write-ahead log on a local disk.

A

You know, figure out whether it's the log that's latency sensitive, which is what I'm guessing, or whether it's the background activity on the rest of the file system that's the problem, or mostly at least. And then we can figure out whether, you know, maybe the solution is that when you want good database performance, you have two pools in RADOS, where one of them is backed by disk.

A

One of them is backed by pure flash, and you create a series of small RBD devices on pure flash just for the logs, and then you do the rest of the database elsewhere. Something like that.

E

Or maybe you enable RBD caching and the cache can coalesce those journal writes.

A

Yeah, maybe it's as simple as that. Right, it's issuing barriers, right?

A

Could you say that again, Sam?

F

If it's journaling in the first place, it's probably issuing an annoyingly large number of barriers.

A

Probably, at least on the journal, yes.

E

Deck is saying more things in the chat.

E

Apparently, it's not just the log, since they have synchronous commit turned off, so it's not actually waiting for the fsync, but it's still slow, especially when there are backfills or other failures.

F

Oh yeah, well, that's the one where we have to perform writes to the degraded objects.

A

So this is sort of a side discussion, but while we're talking about Postgres: there was a long discussion at the Linux Storage and Filesystem Summit about three months ago about Postgres performance in general on just Linux, where they described in detail what the issues were, and it basically comes down to what they want.

A

If I'm remembering correctly, they want fast synchronous writes to the journal, and then they have all their other table files that they're writing out asynchronously in the background, but at some point they need them to be durable, and the only way they can do

A

That is via fsync. And they have this problem where they have to tell the kernel to sync, and that shoves a whole bunch of dirty pages down as I/Os and then slows down their write-ahead log, even though they don't actually care when the table data becomes durable; they just need to know that it is durable, or whatever. So I think Postgres struggles in general just because the kernel isn't exposing the right type of interfaces for it.

A

That's very...

F

Analogous to what we have in the filestore, right.

A

Yeah, exactly. It was a very interesting discussion.

A

Yeah, I was trying to dig up the link here, but I'm failing at the moment. There should be LWN coverage somewhere.

A

Sorry, Luke, you're breaking up.

B

Yeah, our network seems to have had some problems the last couple of days. Speaking of tools, are there any suggested tools that maybe we can use or agree on when we start doing more tests or benchmarks?

A

I don't know, are you talking about just standard benchmarks or something that we can use?

B

I mean, when we're running benchmarks, some of the time we really don't know what kind of tools we can look at further. We have the benchmark tools which we are using right now, but we don't really know what other tools we can use for tracing, for example, to figure out what is being done at what level. We're just doing pure benchmarks.

B

So we are just getting numbers, but we don't really know what happened behind them. So is there any suggestion on what kind of things we can look at?

A

Yeah, so maybe we can just make a list in the etherpad. I mean, the things to check are like: is librbd caching enabled? That's rbd cache = true.

F

We could try reducing...

A

I was gonna say, I think this is... go ahead.

F

I was going to say: are you looking for statistics, for example, ratios of writes going into RBD versus writes coming out of RBD?

F

That kind of thing, so you see what the cache is doing? Or are you asking about insight into the RBD configuration?

B

Well, I think more than just RBD. Maybe we can also look at things happening within the VM, like what PostgreSQL is doing and whatnot, from both levels, I guess.

E

Yeah, so inside the VM they could use tools like blktrace to see exactly where Postgres or MySQL or whatever database it is is writing to the block device.

E

That's going through another level, the file system, as well, but it will give you some idea. And there's another tool called seekwatcher which you can use to visualize what kind of patterns are hitting the disk, which can be interesting if you see, like, the journal being written over and over again while there are random writes going on in other areas.
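As a rough illustration of what can be pulled out of such a block trace, the Python sketch below tallies write sizes from text in the default blkparse line layout (device, CPU, sequence, timestamp, PID, action, RWBS flags, start sector, `+`, sector count, process). The sample lines are made up for the example, the helper name is hypothetical, and the parsing assumes the default output format with 512-byte sectors; a custom blkparse format string would need a different pattern.

```python
import re
from collections import Counter

# Assumed default blkparse layout:
# dev cpu seq timestamp pid action rwbs sector + nsectors [process]
LINE = re.compile(
    r"^\s*(\d+,\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+(\d+)\s+(\w+)\s+(\w+)"
    r"\s+(\d+)\s+\+\s+(\d+)"
)

SECTOR_BYTES = 512  # blktrace reports sizes in 512-byte sectors


def write_size_histogram(lines, action="Q"):
    """Count write requests by size in bytes for one trace action (Q = queued)."""
    sizes = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # summary lines and other events without a sector range
        act, rwbs, nsectors = match.group(6), match.group(7), int(match.group(9))
        if act == action and "W" in rwbs:
            sizes[nsectors * SECTOR_BYTES] += 1
    return sizes


# Hypothetical sample lines, not taken from a real trace.
sample = [
    "  8,0    3        1     0.000000000   697  Q  WS 223490 + 8 [postgres]",
    "  8,0    3        2     0.000120000   697  Q  WS 223498 + 8 [postgres]",
    "  8,0    1        3     0.004000000   712  Q   W 9000000 + 256 [postgres]",
    "  8,0    1        4     0.005000000   712  Q   R 1000 + 8 [postgres]",
]

for size, count in sorted(write_size_histogram(sample).items()):
    print(size, count)
```

A histogram dominated by small synchronous writes at nearly consecutive sectors would be consistent with the journal pattern described in this session, while large scattered writes would point at background table flushing.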

C

How about adding more OSDs? Will that improve the performance of a database running on RBD?

A

Yes and no. The absolute latency of a single I/O probably won't change much, but if you have lots of I/Os that are contending for the same disks, then they'll be spread out, and so you'll get sort of the minimum latency on all requests instead of having some be slower.

A

So, I mean, my guess is that the thing that's really slowing us down is just that the lower bound on the latency of a write is higher than on a local disk, because we're going over the network and we're replicating to multiple nodes, and then those are all going to disk.

A

So the thing I'm most interested in is figuring out how to quantify how much of it is that and how much of it might be something else. Maybe it's the barriers: it's periodically doing a flush and we have a whole bunch of dirty data, and that's the thing that's slowing it down. It's kind of hard to say. I wonder, actually, Sam or Josh,

A

I wonder if doing a debug ms = 1 type trace at RBD might be good enough, just to get a sense of what the write sizes are and what the latencies are.

F

Yeah, we could get things like 99th or 95th percentile latency. That alone would be pretty valuable.
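Once per-request latencies have been extracted from a trace, the percentile figures mentioned above are straightforward to compute. The sketch below uses the nearest-rank definition of a percentile; the function name and the sample latencies are hypothetical, and extracting the latencies from the logs is deliberately left out because the log format was not discussed.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(pct / 100 * n), done in integer arithmetic to avoid float rounding
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical write latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = [2.1, 2.3, 2.2, 2.4, 35.0, 2.2, 2.5, 2.3, 2.1, 120.0]

for pct in (50, 95, 99):
    print(pct, percentile(latencies_ms, pct))
```

The median hides the outliers in this made-up data set, while the high percentiles expose them, which is why tail latency is the more useful number for a workload that waits on every journal write.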

E

Yeah, that'd be pretty useful. Maybe there needs to be some more post-processing to analyze that, but that's not too hard. It might also be useful to set debug rbd = 20 just to see when flushes happen, how often we're actually seeing them, if we're investigating the cache behavior.

A

Sorry, what was the second thing, Josh?

E

debug rbd = 20, to see when flushes are happening.

A

Oh yeah. Is there a debug level lower than twenty?

A

Like ten, maybe?

E

I don't think so.

E

It basically logs each write and read and other operations when they begin. Okay.
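The three settings named in this part of the discussion (rbd cache = true, debug ms = 1, debug rbd = 20) can be collected into a client-side configuration fragment. The Python sketch below only generates that fragment as text; putting the options in a `[client]` section and the helper name are assumptions for illustration, and where the resulting logs end up depends on the local setup.

```python
import configparser
import io


def client_debug_conf(rbd_cache=True, debug_ms=1, debug_rbd=20):
    """Render a [client] section with the options mentioned in the session."""
    conf = configparser.ConfigParser()
    conf["client"] = {
        "rbd cache": "true" if rbd_cache else "false",
        "debug ms": str(debug_ms),
        "debug rbd": str(debug_rbd),
    }
    out = io.StringIO()
    conf.write(out)  # writes "key = value" lines under the section header
    return out.getvalue()


print(client_debug_conf())
```

Debug level 20 logs every read and write as it begins, so a fragment like this is something to enable for a short benchmark run and remove afterwards.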

C

One more thing: how about other parameters? We're now planning to experiment with things like RBD and the hardware as well, in this case maybe SSDs, right? How about other parameters like network parameters, for example, or kernel parameters?

D

I'm not sure if that's really helping. If you already get the full performance of the network, I would wonder if it helps to tune the kernel network parameters there.

D

I mean, we played around with that, but that didn't change anything, except the usual problems, maybe with 10G or something like that.

D

If you get the full performance over the network, then I wouldn't expect anything, except maybe stuff that is caused by the virtualization layer in general.

D

You can run into trouble and not get the full performance there, so maybe you have issues there. But apart from that, I wouldn't expect that tuning the kernel parameters would help.

B

In our case our OSDs are formatted with XFS. Is there anything that we can do along the lines of different mount options for those OSDs?

D

Yeah, you can tune it a little bit, but...

D

But that ends up more in a general question about performance at this point.

D

Yeah, you can tune it, but that's another general question.

A

Yeah, I'm not sure that that's the dominating issue here. I imagine it's more in the RBD and RADOS behaviors, and just the general desire to make I/Os faster, obviously.

D

The real question for me is: what are your expectations for the performance of the database compared to real hardware? Do you expect to get the same performance, or can you live with 50 percent or something like that? What are your expectations?

B

At this stage we have not set a goal for the expectation yet, but what we do hope is to get something that is close to the performance of running PostgreSQL on a local drive. That would be a good start for us. If we can see something along that line, then we can

B

We can move forward and say: okay guys, now we think we have something that is very close to running on a local drive, or at least a local SATA drive, nothing fancy. Then we can start to see what else we can do, and maybe be better than a local drive, things like that. So we are keeping our goal quite low at this moment in time.

A

Yeah, and I think that aligns pretty closely with what most people are expecting when they want to deploy databases.

D

Is that even possible? I mean, if you also put the network layer and the replication in place, then I would expect it...

A

I mean, it depends on if you're...

D

You're, better.

A

Yeah yeah, but.

D

But if you have SATA drives in place, then I wouldn't expect it to be faster than, or even the same speed as, a local drive.

A

I mean, in theory, there are some things that we can do, right, because we can reorder writes.

A

So when you're writing to a local disk, then you have to write to the offset that the file system says, whereas, if you're writing into an object store, we can just lay out all objects like sort of sequentially on disk, regardless of what their semantic location is.

A

So in theory, we can do okay. We might have to pay for it later on reads, but for writes at least. But yeah, there's definitely a handicap, because we're replicating and we're over the network, and we have a couple of things in between.

A

I think the hope is that once you have the Ceph OSD journal on SSDs, that should help give you faster turnaround on the writes, assuming you're not saturating your disks.

A

So you can do okay. Can I ask: on the Ceph cluster that you're running on, were you using just hard drives, or were you using SSDs for journals?

B

We are just using SATA drives for both the data and the journal, so we haven't looked at using SSDs yet. But I think one of the ideas later on is maybe we can separate the journal and the data first, and once we have a bit of budget for the project we hope to get some SSDs in as well.

A

So that will definitely tend to help, using SSDs for the write journals, because putting the journals on the disks is hard on the disks.

D

Do you use a RAID controller in between the disks or is it pure disk?

B

Controller, but we don't actually RAID our disks per se; it's just a pass-through, so I'm not sure what effect a controller will have on the performance.

B

If that's the question you are referring to.

D

Yeah, we ran some of our clusters with JBOD too, and switched over from JBOD to RAID 0 with write-back caching enabled on the RAID controller. The RAID controller has, I guess, a gig of RAM or something like that, and that's improving the performance for us at least by a factor of four or something.

D

Okay, so if you don't have SSDs, maybe it helps to enable write-back on the RAID controller cache, if you have a cache on the controller.

D

Yeah, battery backed there, for sure.

A

So I guess in the etherpad we have a whole list of things to experiment with.

A

I think you should definitely follow up this discussion on the email list. Describe what the test was and we can suggest what the next thing to try might be, and seeing what effect these different things have will lead us down the road of figuring out if it's the journal or if it's the caching that helps, or whatever.

A

I think lots of people will be interested in this too.

B

Yeah. Would you suggest we use Firefly for this? Because currently we are doing it on Emperor.

A

Yeah, I would suggest Firefly over Emperor, yes. Cool.

B

All right, cool. We'll follow up with some of the suggestions and we'll post some of the results later on the mailing list.

F

Cool, okay. Does that cover everything you needed, Luke?

B

Yeah, I think so. We'll work on it and come up with some results, and we'll share them with you guys.

A

All right great sounds great.

E

All right, I think, we'll move on to the next session. Then.

A

All right, which is a break. A break we just had.