Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Varada Kari, Flipkart System Engineer
Good afternoon, everyone. I am Varada, from Flipkart, an e-commerce company in India, and I have some teammates here with me. I am going to talk about our journey so far with Ceph: how we deploy and use Ceph internally at Flipkart, what challenges we faced, how we take care of all our internal workloads, and what kind of QoS mechanisms we have deployed in the RGWs. We mostly use it as an object storage service.
Flipkart is arguably the largest e-commerce company in India right now. We have 80 million products across 70-plus product categories, around 100,000 sellers, and 30 million daily visitors to Flipkart, with 8 million shipments per month; at peak load that roughly doubles to around 16 million shipments. To cater to all these needs of Flipkart, we have developed a highly scalable and reliable storage infrastructure based on Ceph, and we are using it as an object storage service.
It has been in production for approximately the past three years. We have two data centers with around 20,000 hosts each, and it is a completely virtualized infrastructure: we run our own infrastructure, and the virtual machines are carved out for the Ceph use case. As I mentioned, it is mostly a scale-out object storage service on Ceph.
We mostly use the S3-compatible RGWs, and our key focus, or the key feature of Ceph for us, is mostly durability.
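Since the service is consumed through the S3-compatible API, internal teams talk to it with ordinary S3 SDKs. A minimal sketch of what such a client looks like against an RGW endpoint — the endpoint URL, credentials, bucket, and key below are placeholders for illustration, not Flipkart's actual values:

```python
import boto3

# Hypothetical RGW endpoint and credentials -- placeholders only.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.internal:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Create a bucket and store a small object -- the dominant pattern on the
# low-latency workload described below.
s3.create_bucket(Bucket="invoices")
s3.put_object(Bucket="invoices", Key="order-1234.pdf", Body=b"...")

# Read it back.
obj = s3.get_object(Bucket="invoices", Key="order-1234.pdf")
print(obj["ContentLength"])
```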
What we offer for Flipkart: one workload is mostly a low-latency platform, with smaller object sizes but a need for low latency, and the objects are large in number. The other workload is mostly the backup use case, where most of the critical databases — MySQL, Redis, Elasticsearch, Kafka — are backed up onto one of our own Ceph clusters, which takes care of all the backup workloads; this one is mostly bandwidth optimized, with medium to bigger objects, since we are backing up whole databases.
For this we have one big cluster of around two petabytes of raw storage, and it is growing every day. It has around 750 OSDs right now, with a mix of SSDs and hard drives, and it holds around 1.5 billion objects — more than that by now. We have 200-plus internal customers who use the service, and we grow by about 50 million objects every month; at peak we grow around 100 million per month. So I am going to give an overview of our tech stack, how we deploy Ceph in our clusters, and the challenges we faced to arrive at this organization of the stack.
We have one endpoint behind which multiple clusters sit; there are multiple clusters inside a single endpoint, and depending on the user and the SLAs, requests are mapped to different clusters.
We have one load balancer, a software load balancer based on HAProxy. It is actually owned by a different team; that is the reason it sits at a separate layer right now. It reaches out to a bunch of Nginx nodes, which work as a request forwarder for us, and then on to the RGWs. To meet the SLAs, we have split the RGWs into two instance groups, what we call the read group and the write group. There is another load balancer attached to these, again an HAProxy-based software load balancer with two different endpoints, to handle the read traffic and the write traffic: Nginx forwards each request, looking at the headers, to either the read or the write RGW group depending on whether it is a read or a write request. These RGWs then reach out to the OSDs, which are VMs in our virtualized infrastructure.
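The talk does not show the actual Nginx or HAProxy configuration; conceptually, though, the split amounts to classifying each request and forwarding it to the read or write RGW pool. A rough sketch of that idea in Python — the upstream group names and hosts are invented for illustration:

```python
# Sketch of the read/write split done at the Nginx layer: requests are
# classified (here simply by HTTP method) and forwarded to separate RGW
# instance groups. Upstream endpoints are hypothetical placeholders.
READ_METHODS = {"GET", "HEAD", "OPTIONS"}

UPSTREAMS = {
    "read": ["rgw-read-1:7480", "rgw-read-2:7480"],
    "write": ["rgw-write-1:7480", "rgw-write-2:7480"],
}

def pick_upstream_group(method: str) -> str:
    """Return which RGW group should serve this request."""
    return "read" if method.upper() in READ_METHODS else "write"

assert pick_upstream_group("GET") == "read"
assert pick_upstream_group("PUT") == "write"
```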
Each node here is a VM. We have Mons, cache-tier nodes, index nodes, and then the OSD nodes, which have a combination of SSDs and HDDs. These are all carved out of bare-metal hardware — again a Debian-based hypervisor host for us — with different VMs carved out from each hypervisor.
This slide just shows the clients we handle every day in the Flipkart workload. We have a bunch of internal customers who use it as a repo service: all the Debian packages built from internal code, whatever has been developed, are packed into a Debian and pushed into this cluster.
There is also a CDN serving the public interface of Flipkart, which accesses all the invoices, images, and other details of the products; this constitutes the user-facing part of it. And we have our own health monitoring system, based on a time-series database, OpenTSDB, with a visual UI on top of it using Grafana, which we will see in more detail in the later slides.
Any instance, as I mentioned, is a VM for us, and any instance has the following properties: a host OS, which is Debian for us; specific network guarantees depending on the instance type; a number of virtual CPUs; a type of disk — SSDs, hard drives, or a mix of both; and the amount of main memory we want to give to each VM. When we translate this to the Ceph workload: for the monitors we need more CPU and RAM to handle the database, the LevelDB workload and so on; for the RGWs we need more CPUs and more network bandwidth, and we have a group of these RGWs, all of the same type.
When it comes to the OSDs, we have SSD-only instances, which are used for the cache tier and the index nodes, but with different configurations. One is a low-latency flavor with larger-capacity SSDs; the cache tier is mostly disk-bound, so we do not need too many cores there, nor too much RAM, and we use a different VM configuration for it. For the base OSDs we have higher capacity, but it is a hybrid flavor: SSDs and hard drives together in the same VM instance.
Putting everything together, we want to guarantee certain QoS aspects to the internal customers reaching out to the Flipkart cluster. To guarantee this QoS we have around three to four layers of guarantees. One is embedded into the RGW code right now: we look at the headers, figure out what kind of request it is, and then take a call on whether we should serve it or not.
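The throttling logic itself lives in our RGW changes and is not shown in the talk; as a minimal sketch of the idea only (the limits, user extraction, and helper names are assumptions for illustration, not the actual RGW patch), the serve-or-reject decision is essentially a per-user token bucket:

```python
import time

# Minimal per-user token-bucket sketch of header-based throttling.
class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # user id -> TokenBucket

def should_serve(user: str, rate: float = 100.0, burst: int = 200) -> bool:
    """Return False when the request should be rejected (e.g. with HTTP 429)."""
    bucket = buckets.setdefault(user, TokenBucket(rate, burst))
    return bucket.allow()
```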
At the fourth level of guarantee, what we are trying to do is at the CRUSH level: we physically separate the hardware into different PDU-level groups mapped to different pools, so that if one pool has a problem — a down OSD or some hardware failure — it does not affect the whole cluster.
That is the fourth level of QoS, which we want to guarantee at the cluster level. Coming to the health monitoring, or administration, part of it: as I mentioned before, we collect different kinds of stats from each component in the cluster, and we mostly categorize them into two groups. One is the system side, where we collect all the system-specific stats — how much RAM is used, what the disk capacity is, how much latency each component sees, and so on. Then we have what we call periodic stats, which we run on each of the nodes — OSDs, RGWs, Mons — and we have dedicated admin nodes to run all of these stat collectors together.
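The collectors feed these data points into OpenTSDB, and Grafana reads them back for the dashboards shown later. A small sketch of what one such push could look like over OpenTSDB's HTTP API — the host, metric name, and tags here are invented for illustration:

```python
import json
import time
import urllib.request

# Push one data point to OpenTSDB's HTTP API (/api/put). Host, metric
# name, and tags are hypothetical; Grafana then graphs these series.
def push_metric(metric: str, value: float, tags: dict) -> None:
    point = {
        "metric": metric,
        "timestamp": int(time.time()),
        "value": value,
        "tags": tags,
    }
    req = urllib.request.Request(
        "http://opentsdb.example.internal:4242/api/put",
        data=json.dumps(point).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example: report OSD disk utilisation collected by a periodic stats job.
# push_metric("ceph.osd.disk_used_pct", 71.4, {"host": "osd-vm-12", "osd": "37"})
```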
When it comes to the Mons, we collect the DB stats: how much space they are using, what the quorum status is, and so on. When it comes to the RGWs, we capture how many requests we are serving per user, so that if any user is breaching the contract, or if there is a cluster issue that is causing slow requests for the user, we will be alerted. And if there is a down OSD or a down node, or any failure in an OSD — such as going out of memory — we capture all of it and report it, so the on-call person gets a call, a mail, and an SMS to attend to the cluster immediately.
this
is
one
of
the
graphs
to
show
how
many
requests
what
we
going
to
cater
so
the
green.
Actually,
so
this
is,
you
can
see
on
the
right
hand,
side
I,
don't
know
the
scare
what
kind
of
HTTP
error
status
we
are
actually
HTTP
status.
We
are
actually
reading
to
the
user
so
to
hundreds
or
404s
or
429
or
5-0
threes.
What
kind
of
history
deepest
status
is
returned
to
the
user
from
GW?
We
capture
that
the
green
actually
is
reflecting
the
200s.
A
What
we
have,
we
typically
cater
around
eight
key
requests
in
in
average,
and
then
the
lower
bots
talk
about
different
error,
different
codes
and
error
reports
there
so
and
we
can
actually
see
on
the
right
hand,
side
again
there.
We
can
actually
go
and
look
at
it
the
level
of
what
is
operation,
what
each
user
is
doing
and
the
user,
and
what
bucket
you
can
actually
go
and
look
at
how
many
ops
are
being
sent
at
the
peak.
A
Okay,
so
that's
that's
how
we
use
it
so
far
and
then
the
journey
is
not
so
smooth
so
far,
so
we
have
different
challenges
to
reach
it
to
this
level
of
organizing
our
own
tech
stack
and
then
handling
all
the
challenges,
the
problems.
So
we
started
with
this
with
a
hard
drive
based
cluster
and
then
we
start
to
kind
of
grow
and
to
the
need
right.
So
the
first
problem,
so
we
have
a
lot
of
writers,
was
coming
in
which
we
do
around.
A
We
used
to
do
around
2
to
3
K
before
it's
a
hard
drive,
blaze,
cluster
and
suddenly
one
of
the
hosts
is
failed,
and
then
we
started
having
rebalancing
that
triggered
a
lot
of
splits
and
merges
on
the
file
store
back-end
and
we
are
not
able
to
serve
any
of
the
I
where
it
was
so
it's
kind
of
an
outage
for
us,
so
we
were
not
able
to
save
any
I.
Was
at
that
point
so
then
we
started
thinking
about
okay.
This
is
not
going
to
handle
it
now.
I
have
to
do.
A
A
So
to
have
all
these
things,
then
we
have
separated
them
at
nginx
layer
to
have
different,
write
and
read
groups.
That
is
the
first
challenge.
Then there was another outage, with the metadata pools — again a learning for us. Some of the customers went beyond a hundred million objects in a single bucket, which means a huge index per bucket, and we also have a usage pool which logs all the operations for each user — the same problem again. One of the disks in that pool failed, a lot of rebalancing happened, and the cluster was down on its knees; you can see the trend there — for a couple of hours or so we were not able to serve any of the IOs. The mitigation: we went back and started sharding all these buckets manually, and we came up with a shard count per bucket. This was on Hammer, so we did not have dynamic resharding at that time.
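Hammer has no dynamic resharding, so the shard count has to be picked up front (for new buckets, via the rgw_override_bucket_index_max_shards option) and oversized buckets have to be re-created with it. The exact numbers used are not given in the talk; a hedged sketch of the common rule of thumb — roughly 100k index entries per shard — looks like this:

```python
import math

# Rule-of-thumb bucket-index shard count: ~100k objects per shard. The
# target value is a common guideline, not the exact figure from the talk.
OBJECTS_PER_SHARD = 100_000

def bucket_index_shards(expected_objects: int) -> int:
    return max(1, math.ceil(expected_objects / OBJECTS_PER_SHARD))

# A bucket expected to hold 100M+ objects, like the one behind the outage,
# needs on the order of a thousand index shards.
print(bucket_index_shards(100_000_000))  # -> 1000
```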
The third challenge was an application error from a user: he was in a loop where, whenever he tried to put an object into his bucket, he also issued a create-bucket — a create-bucket followed immediately by a put-object, in a loop. That again caused lock contention on the LevelDB, and the cluster was again down on its knees. We were not able to take any requests because of the lock contention on LevelDB, plus the sharding work that was going on, everything together. So there was no other way: we looked back again and said, okay, we have to enable rate limits on the bucket-creation requests as well. Rate limits were then applied to bucket creation and deletion — how many a user can do per minute or per hour, depending on what he has signed up for. That is mostly the usage side of it.
Then we faced some more problems with coordination and some missing features. For example, we were leaking objects from incomplete multipart uploads. You can abort the multipart upload, but we were not able to clean up all the incomplete objects; there is something like 'radosgw-admin orphans find', but we are into some billions of objects, and at that scale it did not really work for us.
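On the client side, one mitigation is to periodically list and abort incomplete multipart uploads per bucket; aborting is the standard way to release the parts, though in the case described above orphaned RADOS objects still needed server-side tooling. A sketch, reusing the boto3 client from the earlier example (the bucket name would be a placeholder):

```python
# Sketch: abort incomplete multipart uploads left behind in a bucket,
# using the boto3 client `s3` from the earlier example.
def abort_incomplete_uploads(s3, bucket: str) -> int:
    aborted = 0
    paginator = s3.get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        for upload in page.get("Uploads", []):
            s3.abort_multipart_upload(
                Bucket=bucket,
                Key=upload["Key"],
                UploadId=upload["UploadId"],
            )
            aborted += 1
    return aborted
```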
The second thing is again on the multipart side: there was a data loss caused by a race in the multipart path. One of the objects was a bigger object, so there were multiple parts to it. When the user did a complete-multipart, it took a long time to update the metadata — longer than the client's timeout window or something — so he tried to complete the multipart again with another POST. Then there was a race, and we ended up deleting all the objects other than the head object. So there was a data loss for us, and we have another fix to address these issues: taking a lock and making sure the two completions do not race at the same point in time. To account for the lost data we have our own tools again, not the usual orphans find.
Then we said, okay, we will do something else. Luminous came along, which would address the FileStore problems of splits and merges we had been facing, and we thought, okay, we will use BlueStore for this. Again, the path from Hammer to Luminous is not so smooth: there is no direct path. We have to go to Jewel, then upgrade from there to Luminous, and then each FileStore OSD has to be destroyed and added back with a BlueStore backend. This introduced some wear and tear on the hardware — hard drives started failing, SSDs started failing — so we are still coming up with different ways to upgrade from Hammer to Luminous right now.
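The per-OSD conversion is destructive: the FileStore OSD is stopped, destroyed, and re-created with a BlueStore backend, and backfill then refills it. A hedged sketch of that sequence on a Luminous cluster — the OSD id and device path are placeholders, and real deployments usually drive this through their own tooling (ceph-disk on older releases, ceph-volume on later ones):

```python
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hedged sketch of converting one FileStore OSD to BlueStore on Luminous.
def convert_osd_to_bluestore(osd_id: int, device: str):
    run(["systemctl", "stop", f"ceph-osd@{osd_id}"])
    # 'destroy' keeps the OSD id so the new BlueStore OSD can take its
    # place and be refilled by backfill.
    run(["ceph", "osd", "destroy", str(osd_id), "--yes-i-really-mean-it"])
    run(["ceph-volume", "lvm", "zap", device, "--destroy"])
    run(["ceph-volume", "lvm", "create", "--bluestore",
         "--data", device, "--osd-id", str(osd_id)])

# convert_osd_to_bluestore(37, "/dev/sdc")
```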
Meanwhile, being on Hammer, we have occasional PG-incomplete problems which result in data loss for us — this has happened a couple of times now — because of the cache tier we have: whenever one of the OSDs goes down and comes back again, there is a complete acting-set change, which triggers some other problem, and we lose some of the objects. So we had to go back and develop offline tools to figure out which objects we lost and try to reconstruct them or re-ingest them into the cluster.
From all of this journey, what we see is that there are a lot of missing pieces to put together. What we lack right now is end-to-end QoS from Ceph. We are running a live service, an e-commerce business, so we have to be enterprise-ready, but there are some missing pieces we need to add to reach that level of guaranteeing the service in terms of reliability. So, end-to-end QoS — and we are actually looking at some of the issues there; there are a bunch of issues I could list, but we do not have the time to go through them. One of the measures is: how do you predict OSD failures?
So we have actually introduced our own piece of software there: we have some per-user configuration, and ultimately what we want to do is tie it up with the user quotas — how much a user can actually do, how many GETs and PUTs he can do, how many bucket creates he can do. That is the reason we introduced it.
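Tying the per-user limits to quotas could lean on the quota support RGW already ships, with the request-rate limits staying in the custom layer described above. A sketch of setting a per-user object quota with radosgw-admin — the uid and numbers are placeholders:

```python
import subprocess

# Sketch: set and enable a per-user quota through radosgw-admin. The uid
# and limits are placeholders; the GET/PUT rate limits stay in the custom
# per-user configuration described above.
def set_user_quota(uid: str, max_objects: int, max_size_kb: int) -> None:
    subprocess.run(["radosgw-admin", "quota", "set",
                    "--quota-scope=user", f"--uid={uid}",
                    f"--max-objects={max_objects}",
                    f"--max-size-kb={max_size_kb}"], check=True)
    subprocess.run(["radosgw-admin", "quota", "enable",
                    "--quota-scope=user", f"--uid={uid}"], check=True)

# set_user_quota("catalog-service", max_objects=200_000_000, max_size_kb=-1)
```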