Ceph CDS Jewel, 4 Aug 2015

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: CDS Jewel -- Hadoop over Ceph RGW

Description

http://tracker.ceph.com/projects/ceph/wiki/CDS_Jewel

A

Aight on to the next session here looks like we're going to have a hadoop over steph rgw status, update so uh yawn and jen. I guess take it away.

B

Okay, thanks Patrick, so this is actually a status update from BP in in finales in Inverness, we gave some Hadoop solution over a relic into a with SSD cache and scenes over the past few months. We have some updates here and we would like to share with you guys so. The content for today's status update is like the first. We are going to recap: the design of a Hadoop over rails gateway with SSD cache, and then we are going to update the status since infamous we have. We actually have some tips center of the code.

B

Ready and- and last thing is. We have actually get some poems testing with poop over Swift, and we we saw some bottlenecks here and that that may be also the bottleneck of Hadoop over a toast gateway.

B

B

So this is a general design of a hadoop over rails gateway PP. So actually there are three paths in in this product. The first one is rather skidaway FS, which is a additional plugin for Hadoop compatible file system. We we actually have done some tip scent of the code. There and second party is a rather skate web proxy, which is restful service based on our PI, some whiskey, and it can give out the location of the data based on the object name and a container name in this third parties, rather skate away with SSD cache.

B

Contrary we are using a some vanilla, rather skate away and and the cast here as the with SSD cast year as the backhander storage, so we actually haven't touched any rather skate way code there, okay, so the general procedure of the huddle prong is the first thing is scheduler word: try to ask the audible service where the block location and then a jabra proxy will return the closest to active roller, skate, or instance, and then scheduler allocates the tasks on server.

B

That is near to a date, and then tasks were trying to access state from the closest real estate way.

B

Okay, so so this is a detailed status update since employees. The first part is a rest into a proxy pad. We, we actually have done a as small demo, based on passing whiskey module, which is which accepts some rest for requests like like the curl command. Here you can use some Cory Cory the data location, with some restful request.

B

And relics gateway will give out to the closest.

B

Well, as proxy able to give out that closes the relish gateway instance based on the container and the object name.

B

Okay, second parties are rather skate, WAV file system part. We we have a seventy percent of the cold down here.

B

It's actually a fork of Swift ifs and with that rascally can talk to a single, reduce gateway directory without ending without much modification, but it is only able to talk to single rattles gateway instance that actually limits a solution scale. So we modified the part of the code and now with rather skate web proxy, and let us get right of screen to FS can talk to multiple retro skate way instances.

B

We, we also add some block level location where read feature in this, and we are going to go through that in the following apple, slides and the last parties are we, we have done some baseline performance tests with HDFS and Swift, and there's about twenty percent performance digression in sweb side.

B

Okay, this is a detail updates for a GW FS load. So in general there will be a new file system URL with a GW prefix, and with this protocol we can running we can. We can make Hadoop talk to relish, kill a cluster. Basically, it's it's it's a fork of a swift ave s, but their needs their needs to some underneath there is.

B

There are some modifications on the on the code and whereas good way is able to talk to multiple rows right on schedule, instances, and also we we we, we have add some new block concept to edge wfs, because in Swift this actually no block block level concept there. So all the order get and put would happen, add object level. This would make make the all the catch read and put go through the proxy side. Don't practice I.

B

So so so, with this new block concept, we actually can do some improvement on the read on the gas side based on the block location we can choose which rather skate, for instance, to read from basically we are using some ranch get API. But, however, for the put we, we still, we still go through a single rattles gateway instance, since there is no multiple put and there.

B

Okz, like I, said that we have seventy percent of the call it down here and there. There are still some issue on the objects: ladders in five gigabytes and since, if the object is larger than five gigabytes, there will be some zero byte manifest file and lots of small small chunks. So we still trying to resolve this issue, but for the objects smaller than five gigabytes this, this Angela BFS is actually working now.

C

I have either classroom, did you disable your algebra striping, or how do you know that you access the Gateway that is cloak closer to the data because with striping, which is by default, four megabytes it'll be closer to the data of you know of this of the first block, maybe or you know how.

B

Do you little data? Is yes, so we actually have done some special configuration in the idw cast a layer or the straps eyes had been configured as 64 megabyte Alexis. So we also in increase the max chunk size and to make sure or the chunk size r equals M, though this we, we may make missing that much call 64 mega byte blocks here, and then we we actually get some a pack in a GW proxy side. The first thing is, we would use some Leigh brothers and get X attr API to get the manifest file.

B

That is, that is a exit here from the head object. So you actually nosey all the object. Brock names from the manifest file, like the Hat object name, is actually the object name, and then you know the rest of the blocks, like header of your name, plus manifest plus cash. There a dash one dash two, that's three right so.

C

This is a very each chunk. You go to different gateway them. Yes,.

B

Exactly all right and all in the soul, okay, let's let me go, go back so so like like just yah, yah yah hula, just to ask we actually using some a wrench get that, for example, the first first get a for the task for the first huddle map is using range, get from zero bytes to 64 megabytes, and the second task is using reg, get to get 64 megabytes to 110 th megabyte.

B

B

I'll fall put side, we we have to figure out the right way to do this. Maybe we still need some time to study the multiple upload API there. Maybe this is a potential solution. Well,.

C

For put, if you using multi-part upload, then the name of the the objects and any of you you're going to use a 64 megabytes channel, then the name of the objects are going to be named after you know it's going to be the back tidy. Then it's going to be the name of the objects that the the uploaded ID daughter, some kind of running- number: okay, okay, that's going to be so, and you have the upload the upload ID, because you you initiated the the f load around.

D

But it's only gonna work using.

C

Multiple upload, not regular upload.

B

Yeah, actually it's um it's not a big problem for the first fourteen, since we we actually know the workouts here are very much read my workload, so there's not much a pull traffic I! Think alright, okay. So this is a algebra proxy pot, and this.

D

Is for general purpose to dupe, it would be a read, mostly workload.

D

I'm, sorry: what I'm sorry are you saying that for general purpose Hadoop, it would be a read, mostly workload. Yes,.

B

Yes, this, this is Pato. uh Okay,.

B

So far for the algebra proxy part, this is a passing whiskey demo that accepts rest for requests and give out the leg closest relative instance make sure we we generate a top load file of the cluster which actually a this is a so, for example, 0 40 s DS there, and we we use some top-notch file that covers red. For example, Raiders gateway is mapping to the first 20 SD and the rest gateway to is mapping to the second 20 s DS.

B

So this is a actually at top notch of the entire cluster, and yet just just like the so actually I have said this before so. The first step is a rest, get away we'll try to get the manifest from the head object. First, using some Python liberal dose epi with cat exit here and then read as proxy would try to use some oil to follow the crush map and catch the location of each each book. That is actually a staff.

B

Osd map command in unit aster and and the last step is algebra proxy, would try to go through the top lodge file and look up which, which rattles get ye see closest to the data I OSD. So so, with this, we actually we can get the closest rather skate, wait instances you.

B

Okay for the SERP at rwe, actuate haven't made any modifications on the algebra website, so this is a some vanilla version of a DW, except we have done some a special configuration there. For example, we we use some dedicated chunk size and.

B

Also, we increase the head object size to make sure all the data as jumped equalized there. We use 64 megabytes here and based on the.

B

Account continue up the name and relevé could get and put the content there. We actually using a wrench get to get each chunk.

B

There's a configuration of cash TR there. We we actually using some rights through mode as a start point to bypass data consistency, issue.

C

D

C

You are aware that, when using multiple to upload the head object side is going to be 0.

B

Oh yeah yeah, that's that's a yeah! That's a problem!.

B

Yeah this is uh yeah. This is actually the same issue. We we match in the objects larger than five gigabytes. These. There is also a zero byte manifest file, and currently we have some some solution in at wfs. That is due to additional translation, I'm, making the zero byte manifest file and the real real date data chunks. That is.

C

B

C

Two different things: you mean that the head file is zero bytes, oh, that the manifest itself has some issues manifest.

B

B

C

Yeah I'm not sure why that would happen.

B

So yeah, so the background is we actually using a swift API so for four objects larger than five gigabyte? We we actually follow the Swift way so.

C

I said sorry fusing the Swift, then I make cloud objects. Yeah exactly I, see ah alright.

C

Yes, so for the multi-part upload, you have an issue there that the first object is just going to hold the manifest and not anything else and the race that is going to be the it in the table objects now, maybe for you can tweak rgw for larger than five gigabytes subjects to just do the same as it does with with regular objects. Maybe the regular object upload can work for you for larger than five gigs.

B

Ok, I see yeah the we actually have discussed of this solution before, but we thought it might be a use, some vanilla or upstream rattles gateway, frost.

C

Yeah, well, you can change that would make it configurable, it's probably a tree ville, so you can make the change in you know and make it ready for five gigs and we can probably can probably get it upstream, pretty easy. It's it. I'll complete like few lights codes.

B

B

Okay, so this is some awesome results. We have done on Hadoop over HDFS forces over Swift we there. There are three different deployment considerations. The first one is out of over HDFS. This is a typical setup and then we have a devolver Swift with a list and point middleware. That is a special configuration for a devolver Swift that makes a hadoop can do some local local, where read but part for a cure for the right. It's it's. It's actually going through the proxy server anyway.

B

So there's no locality in right side and the last configuration is we actuate disabled as a list item point middleware so or the catch and input where goes through swift proxy server. That is the original or typical Swift deployment.

B

So, on the right side, we have some performs numbers so as HDFS as the baseline- and we can see this time point middleware is impact is huge and this about 44 lips and decoration without the standpoint middleware, and we we have done some analysis on this and we found the rename renamed Oh Haddie's, quite big in swift, swift side, because a in HDFS there's a simple methylated change and in name node that is HDFS, rename and I mean surf to rename there's some copy and delete process. This is quite heavy. Actually I guess this this.

B

This will also be a issue for real skin 28 /, a huddle over Redis gateway. Since we we are actually using the Swift IP idea. Well,.

C

The greatest gateway itself that, when doing a copy and the copies within the same placement target or storage policy, is not actually copying the data, it's adding another reference to the data. Oh that's.

B

Great! Okay! So that, oh all that that is great news.

C

Yeah so doesn't matter whether you're using the Swift front and / or the s3 one.

C

B

Okay, okay, are we trying to uh taking you that's to see the poems for Hadoop over red ass gateway later okay? So this? This is a next step. So for for the development side we have a steel, stirrups and left and I think we can use the development quite soon and then we are going to complete the promise test work. That is a double over whether skateway, with with local block-level read. I think it's it's going to be down in dry or august and based on zip, performs or swift.

B

We are going to need to resolve a bit lemme show but I. If, if the said your hula said is is okay, then we we may need to investigate the copy implementation rather skate away side, and then this might not be an issue for a rather skate way, and then we have. We actually have a cadre power on github we're counties, it's private, but we are we're trying to open source the code. I think it's going to be happen very very soon.

B

So this is the updates from a loop over real estate wayside and.

B

Here is earning costumes.

C

um Yeah I, just it will be interesting to see the performance numbers of the gate gateway and whether you know what kind of battle next you're gonna hate eat there.

C

What another question this I have is is back at listing something. That's actually you actually need listing objects in buckets.

B

No, we actually don't rely on this feature.

C

So you might find that.

C

And if you disable and and you need to patch the Gateway for that, but if you disable the bucket index rights, when you write an object okay, then it should give you some performance boost for for object, rights.

B

Can we do this? I mean I'll schedule right now, uh its.

C

B

C

Mother, it will require a code. You know you need to remove some some code there. It's not okay, it's not just configure mobility. um Ok,.

B

C

We were discussing that quite a lot. Call it a blind buckets. Basically that package don't hold on the the index data, and that sounds like a use case for that specific feature.

C

We don't have it implemented, but the path and the code that actually does the the bucket index updates is pretty trivial. Think thurs.

C

Not many places in the code that you actually need to change, it might be something that will make sense to make it configurable. The problem with this specific feature is that certain things aren't going to work like a multi-site multi-region, but it's not it's not anything that you, your you're actually interested in. So it might be that you you can explore that.

B

Okay, okay I see, but according to our experience, the most potential bottleneck for Hadoop er runnin on Swift. The network is going to be the first bottleneck, because, mostly all the objects are quite quite large and yes network is.

C

Yeah right so in that case well, it'll be interesting to see what what's the effect effect. It might be that wouldn't change anything, and in that case it would make sense to do it, but it's something that ish you can try to explore.

C

B

B

C

Another thing and that's something that went in right recently: there is a way to bump up the number of librettist connections between the gateways and the backend there's a new configurable to do that. That's another configurable that you can try to look at I'm, not sure. Actually, if you seen it went into hammer, probably not.

C

But you can try to look at at that change and it's pretty it's. It's not very interested, and you might want want to take a look at all. The people reported that they had some performance loose stuff out of that.

B

Okay, so what do we mean? The number of red, oh scandalous, patch yeah,.

C

Is it something that you don't you remember we did it not here. Did you take.

B

A look at that, it's all zip it yeah, it's actually in master, only right, it's not available in hammer yeah, it's not in here, okay, so so I guess we we need to some. Do some pack pot work? First, yep, okay! Okay, thank you. Our definitely look. Look into that.

C

Yeah, it's the HW number 8 of hand off you.

C

C

You hmm yeah I I've, no more questions.

C

A

You alright, that by wrap that one up then, or do you have any more young, oh.

B

That's all okay,.

A

Great. Thank you very much.