From YouTube: Venkat Kolli -- Ceph on All-Flash Storage
The use cases that we designed this for are primarily very large capacity workloads, be it Big Data or large content repositories. Typically, when you see an all-flash storage system, you tend to see it in a small, very limited capacity but with really high performance. That's not what we are addressing with our flash. We are really going after very large capacity, scale-out systems like Ceph. Typically that's deployed on hard drives, but we think we have a cost point, and we're able to bring the cost down to be competitive enough to offer the flash advantages for these large capacity workloads. So these, the Big Data, large media repositories, etcetera, are what we really designed the system for. So again, a very brief spec and overview of the system before we actually go talk about Ceph and how Ceph works with flash. This InfiniFlash is the system.
It's able to pack 512 terabytes of flash in a 3U chassis. It's very high density and very large capacity, and you don't necessarily need to use all 512 terabytes, especially when you use it in a scale-out cluster. Typically, most customers don't pack it up to full capacity on a single node, because you're going to be scaling out, so you can start off with smaller capacities for each InfiniFlash node and scale out.
At the same time, there are customers who are looking at this for multi-petabyte deployments. In that case the density, 512 terabytes in 3U, becomes critical for them. So that's what it is designed for, and the way we achieve that is with these cards. These are not SSDs. These are specially designed cards, built specifically to achieve that scalability and density. It is SAS-based flash, and you can see that there is a huge capacitor bank at the bottom.
That gives you the power-fail protection capabilities of a full enterprise-grade SAS-based flash device.
The one thing that we purposefully designed is to not include the servers in it, not include the compute in it. The reason for that is we want to make it a more disaggregated solution, and that is primarily important for Ceph because Ceph could be used for multiple workloads and different types of data, be it object data, block data, or, in the future, file data.
This is based on the current 6 gigabit backplane that we have, and soon, in a couple of weeks, we're going to be upgrading that to 12 gig; then the performance almost doubles. Now, please remember these numbers, and you'll see, when we talk about Ceph, how they relate to the performance that we are able to get out of the raw box.
We optimized Ceph, and in fact Ceph is so optimized there that it's almost getting the full raw performance out of the box, and that is quite interesting. Given the capabilities of Ceph, all the features and everything, that it is still able to fully utilize the box is quite remarkable, and that's something we're very proud to be involved in, making Ceph quite efficient that way. And just very quickly, from the availability standpoint: this comes up all the time, because when you have these large systems, you want to know what the data availability is.
Flash obviously is a lot more reliable than hard drives, but also every single component in this box is completely hot swappable, and you can have it fully redundant. You see that with the fan trays and the PSU configuration in the back; you can see it here. Again, you have multiple redundancy against failure here, and the MTBF, the mean time between failures, is about 1.5 million hours.
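As a quick aside on how an MTBF figure translates into a yearly failure probability (my own back-of-envelope, not from the talk), a constant-failure-rate assumption gives:

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760

def annualized_failure_rate(mtbf_hours: float) -> float:
    """AFR under a constant (exponential) failure-rate assumption:
    probability that a unit fails within one year of operation."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# the 1.5 million hour MTBF quoted for the chassis
print(f"chassis AFR: {annualized_failure_rate(1.5e6):.2%}")  # ~0.58%/year
```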
That is quite high, and again it's because it's an all-flash solution. So that's basically what the InfiniFlash system is. Coming to what we have done with Ceph: again, we sell this as a solution with Ceph. We have fully optimized and tuned Ceph configurations that come with the system. And again, one thing that we really want to emphasize is that SanDisk, from very early on, has been committed to open source.
Anything that we do with Ceph is one hundred percent committed back to open source. We use the open source Ceph; we're not making any proprietary internal branches or any proprietary extensions of Ceph. Everything you'll run on InfiniFlash is the open source Ceph, and it's a good thing that we make our living selling the boxes, not the software, so we don't have any reason to do otherwise. We have been fully committed from very early on.
Imagine our surprise: we were proud of this big, huge all-flash system, and it was doing no better than a hard drive system. I think at the time we were able to get about 10 to 15K IOPS with the SSD journals and hard drives.
Even when you put that in an all-flash solution with a similar configuration, it's really not that much different. So one thing that we quickly realized is that there's a lot of optimization required within Ceph, and also a lot of tuning with the hardware is required, because most of the deployments we had seen at the time, and most of the experience in the community at the time, was Ceph optimized for hard drives, not for all-flash. Yes, SSDs have been used for journals from very early on, but nobody had it as an all-flash solution.
When we started looking at it, we quickly determined that it's really the OSDs that are becoming the bottleneck here. And by the way, I know some of you are new to Ceph, so if some of the terms that we'll be using here are unfamiliar to you, please feel free to stop us. But mostly, these talks are generally oriented to the community, to people who are users of Ceph today.
So if there's anything that we need to slow down and clarify, please do stop us. One thing we found when we started this is that the OSDs were turning out to be the bottleneck. We were able to get about a thousand IOPS around that time, and that's using about 4.5 cores per OSD. So we thought about another direction: should we be using more OSDs, as in multiple OSDs per SSD, or in our case per flash card? Remember, our flash card is terabytes in capacity; it's a huge capacity card that we have.
So we thought about using more OSDs per card, but we quickly moved away from that idea, not just because it is not optimal, but because having the failure domains and the CRUSH rules match that is going to be a nightmare. You will not like it, and it is going to be very hard to manage. So in the interest of the usability and the manageability of the solution, we quickly moved away from that idea, and we started working on enhancing the data path, primarily focused on the reads at that time. When we started this, we thought that there was a lot of room for improvement, especially, and more impactfully, on the reads, and I'll talk in the roadmap about what we are doing now with the writes.
The data that I'm going to show the numbers from today is primarily with these read changes and without the write changes yet; the write changes are going to go into Jewel, the upcoming release. On the read side, what we found out is that there's a lot of context switching happening in Ceph, and it did not matter much when you were using hard drives, because the latencies of hard drives are pretty high, so no matter how much context switching was happening, it didn't matter much.
But when you put it on a low-latency medium like flash, that context switching becomes very expensive. So we took on a lot of queuing optimization, and we removed a lot of the lock contention, making the locking more granular, so we could speed up a lot of this. Doing a lot of lock optimization and queuing optimization really made a lot of difference everywhere in Ceph. The other aspect is the socket handling.
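To illustrate what "making the locking more granular" means in general (a minimal sketch in Python, not the actual Ceph C++ changes): instead of one coarse lock serializing every operation, the structure is sharded and only the shard a key hashes to gets locked, so unrelated operations stop contending.

```python
import threading

class ShardedMap:
    """Sketch of lock sharding: unrelated keys rarely share a lock."""
    def __init__(self, num_shards: int = 16):
        self._shards = [dict() for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _index(self, key) -> int:
        return hash(key) % len(self._shards)

    def put(self, key, value) -> None:
        i = self._index(key)
        with self._locks[i]:        # lock one shard, not the whole map
            self._shards[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].get(key, default)
```

With one global lock, every reader and writer serializes; with sharding, contention drops roughly by the shard count for uniformly distributed keys.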
There are many other things, a whole bunch of things, that have been done to get to the performance I'm going to talk about in a few minutes: primarily enhancing the cache lookups, and everything regarding handling the copy mechanisms, all of them.
A
So
there
are
many
minor
things
that
went
in
that
really
made
it
quite
a
bit
of
a
difference,
and
the
net
result
is
that
so,
with
the
current
testing
that
we
are
done
right,
which
is
currently,
we
are
testing
it
on
hammer.
We
are
able
to
get
to
or
80
ki
ops,
/
OSD
remember.
This
is
when
we
started
this
about
thousand
die
offs
is
where
we
were
at.
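For scale, a bit of arithmetic on those two numbers (the cores-per-OSD figure at 80K IOPS is my assumption, carried over from the earlier measurement):

```python
baseline_iops_per_osd = 1_000   # where they started
tuned_iops_per_osd = 80_000     # on Hammer, after the read-path work
cores_per_osd = 4.5             # quoted for the baseline; assumed unchanged

print(f"improvement: {tuned_iops_per_osd / baseline_iops_per_osd:.0f}x")
print(f"baseline IOPS/core: {baseline_iops_per_osd / cores_per_osd:,.0f}")
print(f"tuned IOPS/core:    {tuned_iops_per_osd / cores_per_osd:,.0f}")
```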
The more interesting thing is really looking at the CPU usage: it has become very, very optimal in terms of CPU usage as well. This is one thing that you'll find when you see the performance numbers coming with flash: you'll quickly realize that now the CPU matters a lot, because with flash you completely remove the bottlenecks from the storage media, and all of that bubbles up; the bottleneck is now actually the CPU, especially with small blocks, especially if you're going to be using something like OpenStack workloads, which is 4K writes or 4K I/Os. This is where the CPU matters a lot, and I'll show you a couple of numbers that show how much difference the CPU makes; I'm pretty sure that makes Intel very happy.
So most of these changes, the read performance changes, have gone in through these releases. Most of them are in Giant, and Hammer obviously inherited everything from the Giant release, so if you're going to be using Hammer, you will see the same performance coming out of Ceph when you deploy it on our flash.
So, quickly getting to the numbers, talking about the system itself, the one that we used: the one thing that we wanted to compare is how Ceph would perform without any tuning versus with all the tuning and all the changes that we have done for flash. In this test configuration we used the InfiniFlash IF100; that is the model number for the InfiniFlash system.
This is a 512 terabyte, fully populated system, and we used two OSD nodes. These are dual socket with 12 cores each, so 24 physical cores for each OSD node, and four block drivers, or RBD clients, running against this, using a 40 gig switch. There is a little more detail on this slide, just if you want to know the configuration that was used for this particular test, so you have it for your reference. So, very quickly, this is a test that we have done with Giant.
A
This
is
a
very
first
release
that,
with
all
the
changes
that
I
mentioned
in
the
previous
flight
and
a
couple
of
slides
ago
right,
this
is
the
the
net
result
of
what
we
were
able
to
get
right
now
to
compare
this
again.
This
is
a
lot
more
lot
of
data
in
here,
but
let
me
just
quickly
walk
through
what
it
is.
So
this
is
done
for
an
8k,
a
random
read,
io
workload
right.
What we are really comparing between the red bars and the blue bars is essentially taking Ceph as it is, without any sort of tuning, without making any changes, with just the defaults, against the tuned setup on the same hardware. Both of those tests are run on InfiniFlash; one is without any tuning, and the other is essentially with the complete tuning and with all the changes that went in.
If you look at the read side, the blue bars are basically what you get without any sort of tuning on all-flash, and the red bar that you see there is about 250K IOPS, compared to about 10 to 15K IOPS. So the net lesson that I want you to take away is that tuning Ceph for the particular hardware matters a lot, and that's especially true with a high performance system like flash.
So you really need to figure out exactly how, and I'll talk about some of the tuning parameters that we used that made a lot of the difference here. Essentially, having Ceph tuned for the particular hardware that you're using makes a huge, tremendous amount of difference. I don't think this is going to be quite as true with the current release, the Hammer release, because with Hammer a lot of changes went into the defaults as well.
The difference is not going to be this dramatic, but it still is going to be highly impactful, so that's something that you would have to consider. The same thing holds for the latency as well: the multiple bars that you're seeing are for different queue depths, and obviously, using an all-flash system like InfiniFlash, a large high performance system, you get much better latency, even at a 16 queue depth.
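One way to reason about those queue-depth bars (my own sketch, not slide data) is Little's Law: sustained IOPS times mean latency equals the outstanding queue depth, so at a fixed IOPS level, deeper queues necessarily mean proportionally higher mean latency.

```python
def mean_latency_ms(iops: float, queue_depth: int) -> float:
    """Little's Law: concurrency = throughput * mean latency."""
    return queue_depth / iops * 1000.0

for qd in (1, 4, 16, 64):
    print(f"QD {qd:2d} at 250K IOPS -> ~{mean_latency_ms(250_000, qd):.2f} ms mean")
```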
Okay, very quickly: the other thing that I really want to show, as a showcase test, is how it behaves as a cluster. How does the performance scale when you put it on multiple InfiniFlash nodes? In this case we are taking the same capacity, the 512 terabytes that I showed you earlier, and now we split it into different nodes in a cluster, because, after all, at the end of the day, this is a scale-out cluster.
We use the same servers, so the key difference to note is that in the previous case you had the full 512 terabyte capacity running with the two OSD nodes and a few RBD client nodes against one InfiniFlash system. Now you actually have six OSD nodes powering this cluster. At the end it's basically 385 terabytes, the total capacity of this cluster, with the six OSD nodes, and running about five gateway nodes for clients.
That's the same configuration and the same type of servers running this. Now, if you look at the performance that you get out of it, running the same 4K blocks, we are now getting almost 900K IOPS on a 385 terabyte capacity cluster. If you recall that million IOPS figure at the raw box level, we're almost there running Ceph, getting that performance out of it.
The big difference, however, really is that we spread that same capacity out into different InfiniFlash nodes rather than measuring it on a single box. But still, if you look at per-terabyte performance, with Ceph you can actually get to almost a million IOPS with a 385 terabyte flash cluster. So that's basically the net of all the optimizations and all the changes that went into Ceph, and how it is able to perform.
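Dividing those cluster numbers out (using my reading of the node counts above):

```python
cluster_iops = 900_000   # 4K random, as quoted
osd_nodes = 6
capacity_tb = 385

print(f"IOPS per OSD node: {cluster_iops / osd_nodes:,.0f}")
print(f"IOPS per terabyte: {cluster_iops / capacity_tb:,.0f}")
```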
Okay. Just a couple of other points from the latency standpoint: we are averaging around two milliseconds of latency with Ceph at the 4K blocks, and if you look at the two nines consistency latency, it's around 10 milliseconds; three nines is about 20 milliseconds, two nines is around 10 milliseconds.
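For anyone newer to the "two nines / three nines" phrasing: those are the 99th and 99.9th latency percentiles. A minimal sketch of extracting them from per-IO samples (synthetic data here, purely illustrative):

```python
import random
import statistics

def percentile(samples, p):
    s = sorted(samples)
    k = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[k]

random.seed(0)
# synthetic stand-in for benchmark samples: ~2 ms mean with a long tail
latencies_ms = [random.lognormvariate(0.5, 0.7) for _ in range(100_000)]

print(f"mean : {statistics.mean(latencies_ms):5.2f} ms")
print(f"p99  : {percentile(latencies_ms, 99):5.2f} ms")   # "two nines"
print(f"p99.9: {percentile(latencies_ms, 99.9):5.2f} ms") # "three nines"
```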
One thing, obviously, as I mentioned before: if you're working with a small block workload, the CPU makes a huge amount of difference. We are still CPU bound at the 4K blocks, so if you increase the number of cores or the number of servers powering this, you'll actually get a much higher improvement in the IOPS of the solution. But obviously, as you get to the larger blocks, the CPU doesn't matter as much.
Now it's the bandwidth; that's where the network starts getting critical. We are testing this with the 40 gig, and as you get to a 64K block workload or anything higher than that, 40 gig almost becomes the requirement for this kind of solution, or else you really will be severely constrained at the network level.
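The block-size crossover is easy to sanity check: required network bandwidth is block size times IOPS (the IOPS figures below other than the 4K one are made up for illustration):

```python
def gbit_per_s(block_kib: int, iops: float) -> float:
    return block_kib * 1024 * 8 * iops / 1e9

for block_kib, iops in [(4, 900_000), (64, 200_000), (256, 60_000)]:
    print(f"{block_kib:3d} KiB x {iops:>7,} IOPS = "
          f"{gbit_per_s(block_kib, iops):6.1f} Gbit/s")
# even 4K at 900K IOPS is ~29 Gbit/s; 64K workloads blow far past 10 GbE
```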
We have a lot of customers that deploy 10 gig, and again, we'll talk about the network in the next session, but one thing that you should know is that for most typical workloads, when you're using a high performance medium like flash, the network really needs to be sized properly, and 40 gig is going to become a critical requirement once you reach a certain capacity. Okay, so moving on from the reads to the writes.
The write path still is a big constraint on write performance, and when you put it on all-flash, that becomes a major issue; we saw a lot of spikiness. The first thing we tried was to use NVRAM front-ending the flash, to make the journals even more efficient.
Typically the strategy is that you put the data on hard drives and you put the journals on flash, on SSD. We took the next step of putting the data on the flash and putting the journals on NVRAM. The one thing that we found is that there is obviously some improvement there, but there's a lot of spikiness that comes with it.
That happens because of the big batch processing at the backend, so you don't get any consistent performance, and for those of you who have been running storage systems for long: inconsistent performance is worse than consistently bad performance. Most customers and most workloads would prefer to have some kind of consistent performance, even as bad as it is, rather than something that's very spiky and unpredictable.
So that approach quickly went out the window, and we went to work primarily to eliminate this kind of heavy batching. We modified the buffered writes; again, this primarily benefits flash, but most of the work went into figuring out how to handle those buffered writes. All of these changes are in the Jewel release right now, and in our early testing, what we found is that we are about 2.5 times, purely on the writes, above what Hammer is getting.
So when you see the Jewel release (again, this is all on all-flash, by the way; most of these things would not matter much when you deploy it on hard drives), you will see that the writes are going to be about 2.5 times where we are with Hammer, and the latency is also cut roughly in half. Okay, so there are a few other things, quickly.
What we are doing: we are working with Mellanox very closely on RDMA, so there's a significant reduction in CPU usage, even further than what you are seeing out there. Remember, as I said before, the CPU becomes the critical factor when you put Ceph on all-flash, so it becomes key for us to have some of these techniques, like RDMA, to minimize the contention. We're also working very closely on the new backend store, NewStore, to make more optimizations there. And SanDisk has developed a key-value store that's aimed at any open source system running on all-flash.
That key-value store can be used as a backend. One of the key things, also, that makes a huge amount of difference, although it is not quite part of the Ceph code, is the memory allocation of the underlying OS: TCMalloc, jemalloc, and the async messenger. These are some of the key tunables I mentioned earlier that make a huge amount of difference. Right now we apply most of these changes manually, and together they make a difference of almost 3x in the improvements.
So one of the things that we're going to be doing is working with the other Ceph providers; we primarily partner with Red Hat as one of our premier Ceph providers. We will make those settings the defaults, and make them available as tuning scripts, so when you're deploying it, it becomes easy and you don't have to do these things manually. Okay, so very quickly, to recap what we do with Ceph:
Our key focus with Ceph is to get highly tuned and highly optimized performance for all-flash, but also to make it a lot more usable when you deploy it on a large scale system like InfiniFlash. It basically starts off with open source Ceph, with all the changes that we have done, again part of the open source community Ceph, and with the SanDisk enhancements there are a whole bunch of things that we built around it.
That is primarily to make the installation easier. Patrick talked about ceph-deploy and the other new provisioning tools, so we are going to be working with the new provisioning tools. Currently, our installation is based on ceph-deploy; it's a modified, enhanced ceph-deploy that is specifically tuned for InfiniFlash.
A
There
is
to
make
self
more
consumable
and
more
easier
right,
getting
safe
more
to
beyond
the
smarter
folks,
like
you,
you
know
who
can
handle
this
by
yourself,
but
most
of
the
system
administrators
out
there
are
nearly
not
as
capable
and
that's
one
thing
that
scares
them
heavily
right.
When
you
look
into
self
as
powerful
as
it
is
to
you
know,
make
it
you
know
easily
digestible
and
easily.
Consumable
is
a
key
part
of
our
strategy
and
that's
true
with
the
Red
Hat
are
pretty
much
all
the
other
safe
providers
that
are
out
there
right.
There are many things that we are doing; I'm not going to walk through all of them, but they are primarily on the usability side and on the planning side: how you actually get the right configuration. Remember, a tuned configuration matters a lot for performance, so how do you get this out of the box using InfiniFlash? And lastly, the supportability aspects: a lot more log collection and diagnostics are built into the Ceph on InfiniFlash.
Our team is about 25 engineers focused purely on Ceph. Half of that is primarily working on the performance enhancements that I mentioned earlier, and the other half really is on the heavy amount of testing that we do, obviously for InfiniFlash; again, we are one of the early vendors of an all-flash system. So one of the things that we had to do is to really enhance the Teuthology test suite to make it more relevant for an all-flash system.
We did quite a bit of work there, and a lot of contributions from SanDisk happened on this automated Teuthology test suite as well, and we still continue to do that. There's a very heavy amount of testing that we do, and you see some of the numbers there in terms of the hardening, the scale testing, and the failure testing. This is, again, just to make it more enterprise-ready.
For the customers who just want to have a more assured solution with the hardware and software combined, that's basically what we're going to be providing. One of the other key differences with InfiniFlash, compared to what you see out there in terms of Ceph deployments, is that InfiniFlash differs slightly because, typically, when Ceph is deployed, it's deployed on converged nodes.
That is where you have the hard drives attached within a CPU complex, within a node, and you scale with those nodes, so you have the CPU and the drives and the storage more or less in a fixed ratio. It's much easier to deploy, because you just need to replicate the nodes, and Ceph can take care of all the balancing and everything. Now, coming to that hardware:
The key thing is that the very large scale customers are all moving away from this hyper-converged model, because one thing they find is that these converged nodes tend to get quickly unbalanced based on their workload characteristics. When you need more processing power, you have to bring in, add in, more nodes, adding more capacity with them even if you're not going to be using that capacity, and vice versa.
Sometimes you change your application policy to add more protection, so you need a lot more capacity, but you don't need all the processing power that goes with it, and at very large scale it becomes expensive to carry those resources that you're not using. So for them it is very important to tune the configuration very specifically to the amount of compute that they need, the amount of storage that they need, and the amount of network that they need.
Within the same hyper-converged cluster, it is not possible for you to tune your system to the exact workload that you are planning on. And if you think about it, most of the clusters that are built out today are built in an OpenStack environment, where they're building for private clouds. A lot of the enterprises are building their private clouds, and when you're building for a private cloud, you're not optimizing for any single workload.
Your goal really is to make it a universal infrastructure where you can actually have multiple workloads coexisting. So it is very important for your cluster to be balanced and to be able to tune for and handle the different types of workloads, and that's where this disaggregation becomes really critical. That's one of the reasons why we specifically, by design, did not build the CPU and everything into the storage, so that it could be balanced the right way.
Also, with techniques like this, where you disaggregate, if you look at the whole system cost, including the opex and the capex, you will actually be very surprised; just as our largest customers are finding, it comes out much cheaper than hard drives for the performance that you're getting. The other way that some of our customers are deploying it is in combination with hard drives. This is not the typical way.
This is, again, one of our customers: they're keeping low activity data on hard drives that are backed behind InfiniFlash. So that's one way of using it with hard drives. And there's one customer that's actually trying this out: keeping the primary copy on flash and keeping the secondary copies on the hard drives, with the primary affinity set to favor the flash copy.
So most of your reads keep happening just on the primary copy, on flash, and because you have a lot of bandwidth coming out of it, they're completely fine with that. Obviously, your writes are going to be limited to what the hard drives can handle, but even there, if you're able to eliminate the reads from the hard drives and keep only the writes going to them, the hard drives are just doing sequential writes, and hard drives can handle pure sequential writes fairly well.
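A small simulation of the mechanism being described (my illustration, not the customer's actual configuration; in a real cluster you would set this with Ceph's `ceph osd primary-affinity` command): replicas with zero primary affinity are avoided when electing the read-serving primary.

```python
import random

# one flash replica plus two HDD replicas per object
affinity = {"osd.flash": 1.0, "osd.hdd1": 0.0, "osd.hdd2": 0.0}

def pick_primary(replicas):
    weights = [affinity[r] for r in replicas]
    if sum(weights) == 0:           # degenerate case: fall back to random
        return random.choice(replicas)
    return random.choices(replicas, weights=weights)[0]

random.seed(1)
reads = [pick_primary(list(affinity)) for _ in range(10_000)]
print(f"reads served from flash: {reads.count('osd.flash') / len(reads):.0%}")
```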
If a hard drive is just doing that one thing, it can do okay, so your performance is really not going to be that bad compared to using a hard drive system for both reads and writes simultaneously. When you relegate the hard drives to just the writes, for protection, as this customer is doing, that also works fairly well for them. So this is, again, a TCO story.
As a quick note about where this customer is and how they measured: this data is coming from the customer. They are planning a hundred petabyte cluster, so this is going to be one of the largest Ceph clusters. This is not Yahoo, by the way; it is a customer that doesn't want this to be announced.
In the next year and a half to two years, they plan to go to a 100 petabyte cluster, and they did the TCO analysis on the traditional hard drives, which they were planning to use before InfiniFlash. Based on commodity hard drives, around 45 million is what they had budgeted, and this is including the three-year opex.
Basically, the acquisition cost and three years of opex combined. They compared that with InfiniFlash using different techniques. One is with the full replicas; obviously, it's more expensive. The second bar that you see is three full replicas on the object storage running on InfiniFlash. But the other key thing that's very interesting (and we can reach out and talk offline) is erasure coding and how it works with flash. How many of you are familiar with erasure coding? Good.
This chart is basically showing the total data center footprint. They were planning to have around 95 racks for this hundred petabyte deployment, based on their earlier hard drive model, and with the InfiniFlash erasure coding model, which is what they're going to do, they are able to get to 18 racks: from around 95 racks down to 18 racks. They eliminated a complete data center build-out with this, and that's basically a huge amount of savings. This customer really doesn't care much about performance.
For them, performance is not the key criterion; they are quite happy with what they are able to get with hard drives today. Failure rates are a different story: their experience is that they have 35 hard drive failures per week at that cluster size, and when they compare that to our AFR, which is less than 0.15%, it's going to be one SSD, one card, failure per week. So it's 35 hard drive failures to one card failure, and the big difference, again, is not the cost.
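Some back-of-envelope math on those failure numbers (the drive and card populations are my assumptions purely for illustration; the talk gives only the failures-per-week and AFR figures):

```python
def failures_per_week(population: int, afr: float) -> float:
    return population * afr / 52  # expected failures per week

hdd_population = 25_000           # assumed: ~100 PB of 4 TB drives
implied_hdd_afr = 35 * 52 / hdd_population
print(f"implied HDD AFR: {implied_hdd_afr:.1%}")  # ~7.3%/year

card_population = 12_500          # assumed: ~100 PB of 8 TB flash cards
print(f"card failures/week at 0.15% AFR: "
      f"{failures_per_week(card_population, 0.0015):.2f}")
```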
The drives are obviously all covered under warranty for them. But once you have a hard drive failure, for Ceph to rebalance the cluster is going to take them more than a week with the workload that they're running, because they have not optimized the network of that cluster for rebuilding; it's primarily built for higher throughput. So that whole rebalancing of the cluster, and the spikiness that comes out of it, is a huge issue. That's a big difference as well.
So opex, primarily, is one thing that I talked about: the failure rate and the power savings. I don't know if I mentioned this: with InfiniFlash, it's about 470 watts of power for a fully populated 512 terabytes. If you can imagine, that's almost equal to what a single-socket server costs in power. So the primary opex savings are a few things: one is the power savings, power and cooling, then the whole data center space, and all the labor costs involved in handling the media failures. Those are the three key factors that factored into their opex savings with all-flash.
They figured out that if they're going to be using flash, they're also going to be using SMR drives for the secondary copies, because those are purely there for protection; they're not trying to do any reads or any I/O on them. They felt that even if they have failures in those media drives, it really is not going to be very impactful, so they figured they would use those SMR drives, with the last copy being just passive, write-only.
So currently it is standard, running on XFS. The one thing that we talked about that's coming up in the roadmap is that, as Ceph moves to NewStore, the new backend, we have this new key-value store that's optimized for flash, so we want to move to that, and right now we're in the process of making it open source; it's currently proprietary software from SanDisk.
Yeah, so if you're talking about the OpenStack Summit in Tokyo: SanDisk did that presentation; it was Allen Samuels, yes. We did a presentation on just erasure coding. I strongly suggest you go look it up; it's all on the web. The gist of it, again, is the same thing that I was mentioning before: when you use all-flash storage, your whole erasure coding dynamics change compared to a hard drive system.
One is that with just a twenty percent overhead you're able to get, in fact, much better protection than full two-copy or three-copy replication. And even with full copies, most of our customers, a hundred percent of our customers, are just using two copies with flash, because flash failure rates are much, much lower.
So even when you compare that to a two-copy solution, with two times the overhead in raw capacity, with just twenty percent overhead under erasure coding you're getting much better protection. And the key thing there is that, obviously, when you use erasure coding, the big downside with current hard drives is the rebuild mechanics: how long it takes to rebuild, and how CPU-heavy that is. That's what the flash takes care of and eliminates.
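The overhead comparison is simple arithmetic (the k+m profile below is my example; the talk only quotes "twenty percent overhead"):

```python
def ec_overhead(k: int, m: int) -> float:
    """Extra raw space as a fraction of user data for a k+m EC profile."""
    return m / k

def replica_overhead(copies: int) -> float:
    return copies - 1.0

print(f"EC 10+2 : {ec_overhead(10, 2):.0%} overhead, survives 2 losses")
print(f"2 copies: {replica_overhead(2):.0%} overhead, survives 1 loss")
print(f"3 copies: {replica_overhead(3):.0%} overhead, survives 2 losses")
```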
So erasure coding is actually much more suitable now for the active working data, rather than just the archival data that erasure coding is currently mostly used for. That's the key difference, and it makes a huge amount of difference. Obviously, today, with Ceph, only the object store, the RGW gateways, can natively support erasure coding, and we're working on a blueprint to make that work for blocks as well.
Again, the gist of it is that mostly we are focused on the writes, coming up in the near term in the Jewel release. That's targeted at Q1, but I think it is going to slip a little bit beyond Q1. The next focus area really is to drive down the total cost when you deploy on all-flash; I talked about the erasure coding part of that.
The other aspects that we are working on with the community are compression and dedupe. The one thing with the dedupe is that it's a little bit iffy, because we are not sure about the kinds of workloads that get deployed on Ceph. We know a lot of all-flash vendors out there make a big deal out of dedupe; in fact, the prices they quote come with huge dedupe assumptions of many times the raw capacity, 6x or 8x dedupe savings.
We don't go and do those shenanigans; everything that we quote is basically per raw device. But one thing that we want to make sure of is: is dedupe really effective for the type of workloads that get deployed on Ceph? Right now we hear from some customers that yes, some of their VMs have a high dedupe affinity.