Description
From the 2021 OpenZFS Developer Summit.
Slides: https://docs.google.com/presentation/d/1HCswW3mvc2Nnn0EdkNWAptQ_T1W9kjjBwEDDXwSco7k
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
A
Okay, yeah, as Matt said, we wanted to talk about ZFS on object store, but before we do that, I wanted to take us back in time. I even wore the oldest OpenSolaris t-shirt I could find, going back to 2005.

A
Some of you may recall things that went on in 2005, like Hurricane Katrina, the cancellation of the NHL season, and the first flight of the Airbus A380. But there were other things that were more relevant to us: ZFS was introduced into the open source community by the OpenSolaris project, and if anybody ever saw that presentation, it talked about blowing away 20 years of obsolete assumptions.

A
And so, as part of this, I said: hey, it's the 16th birthday of ZFS, let's celebrate that. Let's go get some reaction from some ZFS developers and see if they knew what was so exciting and how they felt about it. And sure enough...

A
We actually had some that were really excited about the project back then. Today things have changed quite a bit, but I wanted to talk a little bit about some things we looked at when ZFS first came out and what they really meant to file systems. In particular, it addressed this brutal-to-manage problem that existed, and still exists today, with different file systems.

A
For those that were doing system administration in those days, before ZFS, you had to deal with partitioning volumes, labeling them, modifying a bunch of /etc files, dealing with inode limits, and deciding how big to make things. It was really problematic, and so it was a wonderful breakthrough when ZFS came out and said: hey, we're going to go tackle this problem and solve it.
A
In addition, that meant breaking some of the rules people knew back then. The way we would make file systems bigger was to create this shim layer with volumes, introducing the abstraction of a virtual disk. So now there was a volume manager involved, another headache we had to deal with, and of course it had its share of problems: we would leave storage stranded.

A
If I had storage in this particular file system that I needed somewhere else, I couldn't move it very easily. It was just a pain. And then, of course, I still had to deal with partitioning and growing and shrinking, all of that done by hand. So when ZFS came out and introduced the pooled storage model, it was great.

A
It was a breakthrough in the way we were going to manage storage. It introduced the malloc/free abstraction, where I just add storage to the pool and all my file systems get to use it, and it's wonderful. I didn't have to do any of these things.

A
All these system administration tasks just went by the wayside, so it really was a game changer. But if we look at 2005, there were a lot of things back then that have really changed. YouTube's first video was published in 2005.
A
Apple was just switching to Intel. Git had just come out as an open source project. Netflix was shipping DVDs. And of course, ZFS was released.

A
Now over 11% of all internet traffic is Netflix traffic, and ZFS has proliferated and is available on Windows, FreeBSD, Linux, and illumos. What's interesting is that one thing you don't see here is what was going on with Amazon. Amazon was selling books and videos; now they're a huge retailer, a cloud platform, and a video streaming service. And it's this cloud that has really changed the way data centers look and how they've evolved. So again, let's look at how things have changed.

A
Back in the day when ZFS was coming out, we were talking about servers directly attached to storage. Then over the course of time we started seeing virtual machines propagated everywhere, with VMware using NAS and SAN as a way to distribute storage to all the virtual machines running on our servers. They introduced the concept of a virtual disk, an abstraction layer where we now had something like a VMDK.

A
That was really block-emulation storage under the covers. And the cloud has proliferated this at a much larger scale. Now we have virtual switches and virtual machines, instance types, storage still abounds, and we have a lot of virtual disk abstraction layers. But it's changed the way we deal with things today, and it's made life significantly easier for most people.

A
We can now do on-demand infrastructure, which is wonderful. As developers or system administrators, or any company that needs compute on demand, you can issue an API call or click a button and all of a sudden you get a machine, an entire network set up, your own virtual private cloud configuration, and you only pay for what you need. So I don't have to worry about going and buying a huge rack of storage.

A
I just create some EBS volumes and I'm good to go, and if there's a service I need, there's a whole catalog of managed services that I can put around it and just leverage. So I get instant scalability and unlimited virtual block devices. But unfortunately it's not all rainbows and unicorns.
A
You still have the possibility of stranded storage: I can't move things out of this EBS layer onto another virtual machine. I've now introduced capacity limits, which are things we were trying to get rid of, and I have a limited number of controllers and drives that I can put on a virtual machine.

A
What's interesting is that cost has been a big factor. The cloud allows me to pay as I go, but it also means I don't want to overspend. So the way I build up my virtual machine is: I assign a small number of gigabytes to it, fill it up, buy some more, fill it up, buy some more, fill it up. That leads to imbalanced space and potentially some performance issues.

A
So as part of this project we wanted to step back and ask: is there a fresh perspective on solving some of these things that we had solved in 2005 and are now having to solve again? And we thought to ourselves, what about object storage? The cloud has proliferated this abstraction of abstracting your disks out to objects, where you just have a key and can get access to any object, and that was kind of interesting.

A
It was totally scalable, with unlimited capacity, all great things which, especially for a file system that was designed to be scalable and have unlimited capacity, seemed to be a perfect marriage. And it was cost effective: you didn't have to pay for the huge infrastructure that EBS volumes provide, because these weren't always NVMe drives sitting behind some storage controller.

A
They might actually be spinning metal, and it was a very simple API to connect to. In a sense, this was another way of looking at pooled storage: what object store has done is pool all the drives together, expose them through a very simple API, and make that accessible to the application.

A
I don't have to go shrink anything. I just remove objects, and I'm only paying for the space that I'm using. So instead of allocating a 100-terabyte drive, putting 20 terabytes of storage in it, and still paying for 100 terabytes: if I put 20 terabytes in S3, that's all I pay for, and I can scale as much as I need.
A
So we wanted to look at that and say: okay, could we actually overcome this? Are there ways to overcome the latency concerns? We could provide a read cache layer; well, ZFS already has an L2ARC component, so great, we could leverage that. We need a way to buffer writes so that we don't have to wait for the long latency when we issue a synchronous write; hey, we have a solution for that too, we can put in a SLOG. And then lastly, we need to talk to S3.

A
But if I'm doing something where I need 4K or 8K blocks, it's going to be expensive. And L2ARC, as good as it is working with SSDs, we needed something that could scale, and it has a pretty high penalty in its memory footprint. FUSE itself has some issues. In addition, one of the things we discovered is that if you try to destroy a pool that's built in this model...

A
And then, when we stepped back and looked at our problem at Delphix: we want to be able to handle extensive random reads and writes. We're dealing with databases, which is one of the no-nos for using object store. So we needed to be able to have small blocks, in our case 8K, which when compressed average around 3K, and we wanted to take advantage of the lower-cost storage: again, pay for only what you allocate.

A
We also have customers that aren't system administrators, so the thing that was so wonderful about object store is that it's got a simple management layer. You don't have to worry about adding storage, removing storage, moving this over here, rebalancing that. Again, a lot of the principles that ZFS was built on, objects kind of mimic. And having that simple management layer also meant we wouldn't have the user errors that can often crop up.
A
So everything in the kernel still talks about ZFS blocks, and it's when we ship these blocks to the object agent that we actually convert them into S3 objects. And then the ZettaCache, which is also a userland component, is also talking blocks. So we have the ability to cache blocks and ship objects, and the rest of ZFS remains more or less untouched.

A
So then you're probably saying, well, why Rust? The big thing here is that Rust is fast. It's got a lot more semantics, it's much richer than the C programming language, and it gives us a lot of additional libraries that we can take advantage of. But the biggest feature is its safety net.

A
Of those 1700 lines we've had to deal with data races, things in the I/O pipeline, stalls, deadlocks, and they've taken us a long time to go solve. We haven't had any of those issues in the userland Rust code, so it's a big benefit to us. But this talk isn't about Rust.

A
The biggest thing is that the agent is independent. I mentioned that we wanted some fault isolation, so having the ability for it to die and restart was key for us, which also means the kernel has to be able to resume, and to detect when the agent has died. So if you're in the middle of a transaction group, you need to be able to say: oh, I noticed the agent didn't finish pushing out all my transactions.

A
I can't assume what it knows, what it went out to do, or how far it got, so I just need to push it all again, and it picks up and remembers where it left off. So in a way there's this concept of out-of-band communication that knows when to send data, what parts of the data to send, where it needs to restart, and so forth.
A
When we allocate a block, we're never going to reuse that block ever again. This let us do some very interesting things and simplify things like sync-to-convergence. We no longer have to worry about multiple passes over the data, waiting to do things like "don't compress during this pass" or "overwrite blocks during this pass". All of that got simplified: we're able to simply say, we're going to allocate a block and ship it to the object agent.

A
It's the one that's responsible for taking offsets, or DVAs, which is what the rest of ZFS uses, and converting them to block IDs, which is what the object agent wants to talk about. And it's in this layer where we do all our resume logic, so if we get a disconnect, we simply pick up and resume from there.

A
As for the ZFS object agent, it will take these block IDs and convert them to object IDs; Matt will talk a little bit more about the details of how it does that. It's also totally responsible for all the object management: it knows which objects are currently cached, which objects it needs to go fetch, how it maps from block ID to object ID, when it has to do consolidation, when it actually has to do frees, and it's responsible for all the S3 communication and authentication. And then the new ZettaCache is a simple implementation of an on-disk LRU cache; this afternoon there'll be a whole talk that describes the details of it.
B
We call it object ID 2, it contains block ID 2, and we put it in the object store using a key that looks like this. Everything goes under zfs/, and then this number is the pool GUID, so all the objects related to this pool will be under there; "data" is for data objects; and then object ID number two. And then the next block that the kernel allocates...
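To make that layout concrete, here is a minimal sketch in Rust (the agent's implementation language) of the naming scheme just described. The struct and function names are illustrative assumptions, not the agent's actual API.

```rust
// Sketch of the key layout described above: everything for a pool lives
// under "zfs/<pool guid>/", and data objects under "data/<object id>".
struct PoolKeys {
    pool_guid: u64,
}

impl PoolKeys {
    // e.g. data_key(2) -> "zfs/<pool guid>/data/2"
    fn data_key(&self, object_id: u64) -> String {
        format!("zfs/{}/data/{}", self.pool_guid, object_id)
    }
}
```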
B
Maybe
it's
bigger.
We
put
it
into
a
bigger
object.
Give
it
black
head
the
next
block
at
e3
in
the
next
object:
id3
put
it
in
there.
So
this
is
really
simple.
As
long
as
your
record
size
is
pretty
big,
then
this
is
going
to
probably
work.
Just
fine
and
you'll
have
performances.
Kind
of
theoretically
should
be
similar
to
s3
backer.
B
B
So
if
all
you
care
about
is
large
blocks,
then
that's
pretty
much
it
like
talks
over.
You
know
I'll
go
home,
but
so
why
why
why
that
caveat
like?
Why
does
this
only
work
for
big
blocks?
Well,
it's
because
we
need
to
use
big
objects
to
get
good
performance,
so
this
graph
is
showing
us
how
throughput
changes
maximum
throughput
changes
as
we
increase
the
object
size,
and
we
want
to
be
up
here
at
you
know
where.
B
In order to get good overall throughput, you need to either increase the object size, which is moving to the right on this graph, or you need a larger queue depth: more PUT-object operations going on concurrently at the same time. That's like switching from this red line, which is 100 concurrent PUTs, to this blue line, which is a thousand concurrent PUTs. So we saw that, for the parameters we cared about, around one megabyte makes sense as a starting point.

B
Maybe two megabytes would be a little bit better depending on the queue depth, but the key thing here is that we want to get good throughput when we're writing without having a ridiculously large queue depth. I mean, 100 is already pretty large.
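As a rough model of the trade-off shown in the graph (an assumption for intuition, not a number from the slides): with Q concurrent PUTs of S-byte objects, each taking latency L, sustained write throughput is about

```latex
T_{\text{write}} \approx \frac{Q \cdot S}{L}
```

So at, say, Q = 100 and S = 1 MiB with an assumed L of 50 ms per PUT, you would expect on the order of 2 GiB/s, and halving the object size halves the throughput unless the queue depth doubles.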
B
You
run
into
limits
like
number
of
file
descriptors,
that
you
can
actually
have
open
in
a
process
and
other
other
things
that
impact
efficiency.
When
going
to
really
really
huge
queued
ups,
all
right,
so
that
works
fine
as
long
as
you
have
big
objects.
But
sorry
as
long
as
you
have
big
blocks,
you
can
just
say
one
block
per
object.
But
what,
if
you
are
storing
databases-
and
your
database
has
an
average
of
three
kilobyte
compressed
block
size?
B
Then
that's
not
going
to
work.
Three
kilobyte
object
sizes
would
have
very,
very
poor
performance.
So
what
we're
going
to
do
is
combine
a
bunch
of
blocks
into
one
big
object.
So
in
this
example,
we
have
like
around
300
blocks
being
combined
into
one
roughly
one
megabyte
object
and
again
we're
going
to
store
it.
B
As we're writing, the kernel is going to be sending us a bunch of blocks, in sequential block ID order: first it gives us 123, then 124, 125, et cetera. The agent is going to batch those up until we have about a megabyte of data, and then do a PUT-object to store it all as one large object.
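A minimal sketch of that batching step, assuming simplified types (the real agent is asynchronous and tracks much more state):

```rust
// Accumulate blocks arriving in increasing block-ID order until ~1 MiB,
// then the caller PUTs one object and starts a new batch.
const TARGET_OBJECT_SIZE: usize = 1 << 20; // ~1 MiB

struct PendingObject {
    blocks: Vec<(u64 /* block id */, Vec<u8> /* data */)>,
    bytes: usize,
}

impl PendingObject {
    fn new() -> Self {
        PendingObject { blocks: Vec::new(), bytes: 0 }
    }

    // Returns true when the batch is full and should be PUT as one object.
    fn add(&mut self, block_id: u64, data: Vec<u8>) -> bool {
        self.bytes += data.len();
        self.blocks.push((block_id, data));
        self.bytes >= TARGET_OBJECT_SIZE
    }
}
```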
B
And the object contents are self-describing. If we look at what's actually in this object in the object store, it has the data and it also has a description of what the data is. It says: I have block ID 123, it's three and a half kilobytes, and it's at this offset within the object. And the block IDs we're talking about here are allocated sequentially and never reused.
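The self-describing layout could look roughly like this; the field names and exact encoding here are assumptions for illustration, not the agent's on-disk format.

```rust
// A directory of (block id, offset, size) entries plus the concatenated
// block data, so any reader can locate a block without outside metadata.
struct BlockEntry {
    block_id: u64, // allocated sequentially, never reused
    offset: u32,   // byte offset of this block within `data`
    size: u32,     // compressed size, e.g. ~3.5 KiB
}

struct DataObject {
    entries: Vec<BlockEntry>, // the "description of what the data is"
    data: Vec<u8>,            // all block contents, back to back
}
```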
B
So
once
we've
written
block
id
124,
the
kernel
is
never
going
to
come
back
and
say:
hey
like
change
the
contents
of
block
of
d124,
it's
just
always
increasing
for
forever.
B
So
hopefully
this
the
right
code
path,
kind
of
makes
sense
here,
you're,
just
batching
it
together,
speeding
them
all
up.
But
what
about
reads
so?
If
you
want
to
do
a
read,
the
kernel
is
going
to
send
over
send
up
a
request,
saying
please
get
the
data
for
this
block
id
and
we
need
to
figure
out
which
object
has
that
data
inside
of
it.
So
the
way
that
we
do,
that
is
by
keeping
a
mapping
in
memory
that
maps
from
the
object
id
to
the
minimum
block
id.
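One way to realize that lookup (a sketch, not the agent's code) is to key an ordered map by each object's minimum block ID. Since blocks are packed into objects in increasing block-ID order, the object holding block B is the one with the greatest minimum block ID that is at most B.

```rust
use std::collections::BTreeMap;

struct BlockMap {
    // min block id of each object -> object id
    by_min_block: BTreeMap<u64, u64>,
}

impl BlockMap {
    // Find the object that contains `block_id`: the entry with the
    // greatest key <= block_id.
    fn object_for_block(&self, block_id: u64) -> Option<u64> {
        self.by_min_block
            .range(..=block_id)
            .next_back()
            .map(|(_, &object_id)| object_id)
    }
}
```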
B
There is obviously some memory cost to this. In the current naive implementation it's about 16 bytes of memory for each one-megabyte object, which comes out to, if you have 100 terabytes of data in your storage pool, about one and a half gigabytes of RAM. Maybe that's acceptable.
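The arithmetic behind that figure:

```latex
\frac{100\,\mathrm{TB}}{1\,\mathrm{MiB}\ \text{per object}} \approx 10^{8}\ \text{objects},
\qquad
10^{8} \times 16\,\mathrm{B} \approx 1.6\,\mathrm{GB\ of\ RAM}.
```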
B
I
mean
100
terabytes
is
you
know
pretty
big,
at
least
for
databases,
but
we
think
we
can
do
a
lot
better
than
that
because,
like
if
you
look
at
this
table,
you
can
see
you
know
the
object
ids.
We
we
out,
we
allocated
eight
bytes
for
that,
but
really
each
one
is
just
sequentially
the
next
number.
So
maybe
we
can
just
get
rid
of
that
entirely.
The
block
ids
they're,
not
sequential,
but
they
can't
differ
by
you-
know
2
to
the
64..
Maybe
they
differ
by.
B
You
know
2
to
the
10
at
most
or
something
so
maybe
we
can
get
away
with
just
like
10
bits
of
information
encoding,
the
delta
between
each
next
entry
here.
So
I
think
all
things
considered,
we
should
be
able
to
get
this
down
to
about
a
quarter
of
that
which
is
even
more
reasonable,
400
megs
of
ram
for
each
hundred
terabytes.
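A sketch of that compression idea (the exact encoding is an assumption; the talk only commits to roughly 10 bits per delta): drop the consecutive object IDs entirely and store each minimum block ID as a small delta from the previous one.

```rust
// Encode: keep the first min-block-id in full, then only the gaps.
// Assumes deltas fit in 16 bits; a real encoder would fall back to a
// wider encoding for large gaps.
fn encode_deltas(min_block_ids: &[u64]) -> (u64, Vec<u16>) {
    let first = min_block_ids[0];
    let deltas = min_block_ids
        .windows(2)
        .map(|w| u16::try_from(w[1] - w[0]).expect("delta fits in 16 bits"))
        .collect();
    (first, deltas) // ~2 bytes per entry instead of 16
}

// Decode: rebuild the full list by accumulating the deltas.
fn decode_deltas(first: u64, deltas: &[u16]) -> Vec<u64> {
    let mut ids = vec![first];
    for &d in deltas {
        let next = ids.last().copied().unwrap() + u64::from(d);
        ids.push(next);
    }
    ids
}
```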
B
I
mean,
of
course,
we
need
to
keep
this
mapping
on
disk
as
well
or
in
the
object
store.
I
should
say
persistently
so
that
if
we
crash
we're
able
to
regenerate
this
this
in
memory
table-
and
by
crash
I
mean
if
the
object
restarts
so
there
might
not
be
a
full
system
crash,
but
it
might
just
be
that
the
object
process
the
agent
process
is
restarted.
B
Then
we
need
to
read
this
back
into
memory.
So
how
do
we
store
the
mapping
on
disk?
B
Basically,
we
store
in
a
log.
The
log
is
just
logically
speaking.
It's
like
an
array
of
entries.
The
entries
are
going
to
tell
us.
What's
the
object
id
and
what's
the
block
id
associated
with
it,
the
minimum
block
id-
and
this
is
going
to
be
split
into
a
bunch
of
objects.
B
B
The
pool
guide
now
we're
in
the
object
block
map
namespace
and
then
this
is
object,
zero
of
that
and
that
that'll
contain
maybe
the
first
ten
thousand
entries
in
the
next
ten
thousand
entries
in
object
id
one.
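A sketch of that chunked log, with assumed names (the ten-thousand-entry chunk size comes from the description above, but the exact "object-block-map" key format is a guess):

```rust
// One log entry per data object: which object, and the lowest block ID
// it contains.
struct MapLogEntry {
    object_id: u64,
    min_block_id: u64,
}

const ENTRIES_PER_CHUNK: usize = 10_000;

// e.g. entry #23_456 lives in chunk 2 of the log:
fn chunk_key(pool_guid: u64, entry_index: usize) -> String {
    let chunk = entry_index / ENTRIES_PER_CHUNK;
    format!("zfs/{}/object-block-map/{}", pool_guid, chunk)
}
```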
B
So
you
know
what
what
comes
down
to
is
basically
every
transaction
group
we're
going
to
be
appending
a
new
object,
or
maybe
a
couple
objects
to
this
log.
That
indicates
what
what
other
objects,
what
data
objects
were
allocated
in
which
blocks
they
contain
and
then
periodically
we're
going
to
have
to
these
objects
might
be
kind
of
small,
so
you
may
end
up
with
a
lot
of
them
and
so
periodically.
We
might
want
to
condense
these
by
rewriting
the
this
log.
B
B
All
right,
so
that
kind
of
covers
like
what
happens
when
we're
writing,
what
happens
when
we're
reading?
There
is
one
other
operation
that
we
might
want
to
care
about,
which
is
freeing.
So
how
do
we
reclaim
space?
That's
no
longer
needed,
so
there's
a
couple
problems
that
come
up
when
we
think
about
this
when
you
have
a
bunch
of
blocks
within
one
object.
B
So
in
this
example,
we
have
one
object,
object
at
e4,
I
mean
it
has
a
bunch
of
bunch
of
blocks
and
the
kernel
says
I'm
going
to
free
block
id
125..
So
what
the
free
means
is.
It
just
means.
I
promise
that
I'm
never
going
to
read
this
block
id
again
and
so
fyi
I
you
can
do
whatever
you
want
with
that,
preferably
maybe
you
use
less
object,
less
less
data
in
the
object
store
so
that
amazon
charges
you
less.
B
B
So
this
works
and
we're
overwriting
the
object
in
place
so
we're
basically
replacing
its
contents
with
the
new
contents
and
because
the
object
contents
are
self-describing
there.
It's
pretty
cool
that
there's
no
race
conditions
here
so
like
if,
if
the
reclaim
is
going
on
in
the
background-
and
we
don't
have
any
locking
and
somebody
else,
some
other
thread
is
trying
to
read
block
id
128.
B
Then
they
can
do
a
get
object
and
if
they
get
the
old
object
contents
or
if
they
get
the
new
object
contents,
it's
going
to
work
either
way.
So
because
the
object
is
self-describing,
whichever
one
they
get
will
tell
them
where
to
find
block
id
128
within
the
object
and
that's
fine,
but
it
does
cost
us
a
lot
of
throughput
so
to
process
this
one
free
of
three
kilobytes.
B
Maybe,
for
example,
we
were
doing
it
over
a
logical
overrate
of
some
random
block
of
our
database
file
and
we
so
we
we
wrote
a
new
three
kilobytes
somewhere
else,
some
new
object
and
we're
freeing
this
old
one
because
there
wasn't
a
snapshot
of
it.
So
we
wrote
three
kills.
We
in
order
to
write
this
three
kilobytes,
we
had
to
write
the
three
kilobytes
of
new
data
into
some
big
new
object,
but
then
we
also
had
to
read
a
whole
megabyte
and
then
write
almost
another
megabyte.
B
So
you
know
we're
talking
about
600x
io
inflation.
With
this
naive
implementation.
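Where the 600x comes from, using the roughly 3.5 KB block from the earlier example:

```latex
\frac{1\,\mathrm{MiB}\ \text{read} \;+\; {\sim}1\,\mathrm{MiB}\ \text{rewritten}}
     {{\sim}3.5\,\mathrm{KB}\ \text{freed}}
\approx 600\times\ \text{I/O inflation}.
```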
B
So
the
solution
to
this
is
batching,
so
when
we
get
when
the
kernel
says,
I
no
longer
need
block
at
a125.
B
We
say
that's
great,
I'm
gonna,
remember
that,
but
not
do
anything
about
it
right
now
and
I'm
gonna
wait
until.
Hopefully
we
get
a
bunch
of
other
frees
of
blocks
that
are
also
in
the
same
object
and
then
I'm
gonna
process,
those
all
at
once.
So
maybe
we
get
you
know
200
out
of
the
300
blocks
in
this
object
are
freed
eventually,
and
we
now
we
can
read
this
megabyte
and
then
write
out
a
smaller.
B
You
know
about
half
a
megabyte
less
than
half
a
megabyte,
so
we've
improved
the
situation
a
lot.
You
know
it's,
it's
170x
better
than
the
naive
implication,
implementation
on
the
previous
slide,
but
there
is
still
some.
I
o
inflation
because
we
do
have
to
read,
read
this
and
then
write
it
out,
even
though
it
processes
a
lot
of
freeze,
there's
still
some
costs
associated
with
it.
B
So
the
key
to
getting
good
performance
out
of
this
reclaim
process
is
to
make
sure
that,
when
we're,
when
we're
rewriting
an
object,
it
has
lots
of
freeze.
So
how
do
we
find
the
objects
that
have
lots
of
freeze?
Well?
B
There's
a
bunch
of
like
tunables
here,
and
we
can
probably
be
smarter
about
this
in
the
future,
but
right
now,
there's
like
a
tunable
that
says
like:
what's
the
how
that
basically
tells
us
like
how
long
do
we
wait
until
we
reclaim
and
that's
just
like
a
percent
of
the
total
pool
size,
so
we
say
like
hey,
you
know,
whatever
your
pool
size
is
having
ten
percent
of
it
free,
but
not
yet.
B
Reclaimed
is
okay,
basically
means,
like
you
know,
you're
going
to
be
tank,
paying
10
more
for
your
storage
costs
than
if
we
were
extremely
aggressive
about
reclaiming
the
free
space.
But
the
trade-off
is
that
the
reclaim
is
much
more
efficient
and
uses
less
throughput.
B
So
how
do
we
keep
track
of?
We
keep
track
of
the
blocks
that
have
been
freed,
but
not
yet
reclaimed
in
this
reclaimed,
log,
which
is
basically
just
an
array
of
like
this,
is
the
block
id,
and
this
is
the
size.
The
sizes
are
used
to
go
in
to
find
the
objects
that
have
the
most
free
space.
When
we
go
to
do
the
reclaim.
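A sketch of how the reclaim log could drive that selection (assumed types; `object_for_block` stands in for the block-to-object lookup from earlier): sum the freed bytes per object, then reclaim the objects with the most freed space first.

```rust
use std::collections::HashMap;

// One entry per free received from the kernel.
struct FreeEntry {
    block_id: u64,
    size: u32,
}

// Aggregate freed bytes per object; callers would then sort descending
// by freed bytes to pick the best reclaim candidates.
fn freed_bytes_per_object(
    log: &[FreeEntry],
    object_for_block: impl Fn(u64) -> u64,
) -> HashMap<u64 /* object id */, u64 /* freed bytes */> {
    let mut freed = HashMap::new();
    for e in log {
        *freed.entry(object_for_block(e.block_id)).or_insert(0u64) += u64::from(e.size);
    }
    freed
}
```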
B
And
so
the
reclaim
log
you
know
as
we're
getting
freeze
we're
appending
to
it.
Then
we
need
to
load
that
reclaim
log
into
memory
in
order
to
find
which
objects
have
the
most
free
space,
so
that
uses
up
some
memory.
Actually,
we
want
to
be
able
to
have
like
as
much
outstanding
freeze.
As
you
know,
you
want
to
set
that
tunable
suit
without
having
a
impact
on
the
memory
requirement.
B
B
So
you
know
this
this
example
here
is:
you
know
to
process
to
load
10
million
freeze,
but
we
can
have
a
bunch
of
we
have
as
many
logs
as
we
need,
so
that
no
log
has
more
than
10
million
freeze,
for
example,
to
limit
the
memory
used
to
co.
You
know:
limited
memory
is
to
a
constant,
regardless
of
the
pool
size
and
the
amount
of
outstanding
reclamable
space.
B
All
right
so
there's
a
couple
more
problems.
First,
off
the
object
block
map
memory
usage
is
going
to
keep
growing
and
growing
over
time.
So,
as
we're
doing,
writes
we're
allocating
new
objects,
we're
appending
new
things
to
the
object
of
block
mapping,
we're
keeping
track
of
those
in
memory
in
the
table.
That
looks
like
this
and
the
issue
is
that,
if
you're,
if
you
so,
if
you're,
just
writing
and
you're,
not
really
freeing
much,
then
it's
not
a
big
deal.
B
But
if
you
have
a
lot
of
churn
like
say
because
you
have
a
database
and
the
database
is
like
randomly
overwriting
lots
of
different
blocks
in
the
in
the
database
files
all
the
time
and
like
writing
its
logs
and
then
overwriting
its
logs
at
the
database
level,
then
you
have
a
lot
of
churn,
meaning
like
a
lot
of
the
data.
That's
present
in
the
pool
now
won't
be
present
in
the
pool
you
know
a
month
from
now,
but
it'll
be
replaced
by
new
data.
B
So,
for
example,
you
know
if
you're
writing
at
100
megabits
per
second
on
average,
then
this
table
might
grow
to
be
like
12
g
need
12
gigabytes
of
memory
for
every
year
that
you've
been
running
this
pool,
and
you
know
we
have
customers
that
that
are
writing
at
this
rate,
at
this
average
rate,
with
our
block
based
solution
and
that
have
pools
that
are
you
know
many
years
old.
B
So
that
seems
like
not
a
great
use
of
memory
and
the
the
other
problem
is
that
the
objects
are
going
to
get
small.
B
So
as
we
as
we
rewrite
each
object,
omitting
the
free
blocks
it's
going
to
get
smaller,
and
then
maybe
we
do
that
again
and
again
until
it's
like
very
small,
and
that
means
that
if,
if
we
have
it,
if
we
need
to
like
do
a
table
scan
or
read
through
a
whole
file
getting
all
of
its
blocks
now
we
have
to
read
a
whole
lot
of
small
objects,
which
has
the
same
problem
as
large
objects
to
a
little
bit
lesser
degree.
B
It's
also
latency
dominated
the
latency,
is
kind
of
at
least
20
milliseconds.
So
that's
that's
less
than
half
of
the
puts,
but
it's
still
pretty
significant
and
again
you
want
to
be
using
large
objects
to
get
very
good
throughput
with
reasonable
q
dots.
B
How
do
we
address
that
problem?
We
address
it
with
object
consolidation,
so
when
we're
processing,
when
we're
doing
reclaim
and
processing,
freeze
we're
going
to
look
at
several
adjacent
objects
in
the
freeze
that
are
associated
with
all
of
them
and
we
we
might
see
okay
in
object,
id
4
we're
freeing
most
of
it.
The
yellow
blocks
are
the
ones
that
are
being
freed
and
the
blue
blocks
are
the
ones
that
we
need
to
retain,
but
maybe
you
know
there's
only
a
few
of
them.
Those
don't
add
up
to
a
megabyte.
B
They
add
up
to
like
some
small
fraction
of
a
megabyte.
So
let's
look
at
the
next
sequential
object.
Id
so
object
id5
and
see
what
needs
to
be
freed
there.
Okay,
a
bunch
of,
is
being
freed.
We
need
to
retain
a
little
bit
here.
Let's
add
those
to
this
object
and
as
long
as
it's
not
a
megabyte.
Yet
we
keep
on
doing
that
accumulating
more
and
more
you're.
Consolidating
the
the
retained
blocks
from
more
and
more
objects
into
this
one
object.
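A sketch of that accumulation loop (assumed types): walk adjacent object IDs, count only their retained bytes, and stop once the survivors add up to roughly a full object.

```rust
const TARGET: u64 = 1 << 20; // ~1 MiB, the size of a "full" object

struct ObjectInfo {
    object_id: u64,
    retained_bytes: u64, // bytes of blocks NOT being freed
}

// Returns the run of adjacent objects whose retained blocks will be
// merged into one new full-size object.
fn pick_consolidation_run(objects: &[ObjectInfo]) -> &[ObjectInfo] {
    let mut total = 0u64;
    let mut end = 0;
    for (i, obj) in objects.iter().enumerate() {
        total += obj.retained_bytes;
        end = i + 1;
        if total >= TARGET {
            break;
        }
    }
    &objects[..end]
}
```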
B
That's
going
to
replace
them
all
so
to
do
this,
you
know
we
can
kind
of
figure
out
all
of
this
based
on
the
in-memory
metadata
without
having
to
read
any
of
the
data
objects,
and
then
we
can
then
we
need
to
read.
You
know
all
three
of
these
data
objects
and
then
write
just
the
one
new
large
one
and
then
eventually
delete
these
object.
Ids
four
and
five
that
are
no
longer
needed
at
all.
B
I'm
sorry
object,
ladies
five
and
six
that
are
no
longer
needed
after
we've
persisted
the
object
to
block
mapping
change
so
in
the
object
of
block
mapping,
we're
going
to
remove
the
entries
for
object,
ids,
five
and
six
in
the
on
this
log,
we're
gonna,
add
a
free
type
entry
saying
that
these
are
no
longer
needed
and
then
we're
going
to
remove
them
from
the
in-memory
table.
B
So
now
you
know
there'll
be
some
object
id
seven.
You
have
an
entry
for
object,
four
and
object
seven,
and
then
you
know
we
know
that
all
of
the
blocks
between
them
are
in
object.
Four,
because
five
and
six
aren't
there.
B
Cool,
so
this
this
works
pretty
well,
there's
a
lot
of
work
that
we
can
still
do
in
the
future
to
improve
this
even
further.
So
like
one
of
the
ideas,
is
right
now
we're
kind
of
batch
processing,
the
the
reclaim.
So
it's
like,
we
start
reclaiming.
We
do
it
as
fast
as
possible.
We
stop
doing
the
reclaim.
We
wait
until
you
hit
the
high
water
mark
and
then
do
it
again.
B
We
could
probably
improve
on
the
behavior
of
that
by
chipping
away
at
it,
like
every
transaction
group,
a
little
bit
little
by
little
and
then
having
like
some
kind
of
feedback
mechanism
that
tells
us
like
how
fast
we
need
to
be
chipping
away.
Based
on
how
close
you
are
to
getting
to
the
high
water
mark,
you
might
also
want
to
have
different
kind
of
controls
over
like
when
this
happens.
B
You
know
at
night
or
on
the
weekends
you
might
want
to
have
some
like
minimum
efficiency
setting
where
you're
like
yeah-
I
I
don't
really
mind
paying
for
that
extra
storage
space
for
a
while,
but
I
really
don't
want
to
impact
my
network
throughput
by
doing
lots
of
guests
and
puts
so
you
know
I
want
to
say,
like
only
do
it
when
you're
able
to
free
like
90
of
an
object
or
something
like
that,
but
you
know
absent
those
future
performance.
How
does
it
actually
work
today?
B
So
this
is
a
graph
showing
the
x-axis.
The
x-axis
is
time.
This
is
about
30
seconds
here
and
the
the
y-axis
is
well.
The
the
green
red
and
blue
are
network
throughput,
and
then
the
the
magenta
here
is
the
amount
of
space
that's
allocated
in
the
storage
pool,
so
what's
happening
is
we're
going
along
cheek.
Studio
sync
is,
is
putting
data
to
the
object
stored
at
about
700
megabits
per
second,
then
we
start
a
reclaim
and
then
we're
going
to
be
reading
and
writing
data.
B
We
can
do
that
very
quickly
and
then
we
go
back
to
that's
done
and
we
reduce
the
amount
of
allocated
space.
So
the
workload
here
is
random,
writes
about
15
000
iops
into
a
one
terabyte
file,
completely
random
locations
with
record
size,
8k
and
the
average
compressed
block
size
is
3k.
So
this
is
what
I
would
consider
a
very
demanding
test.
B
B
So
on
this
on
this,
you
know
very
demanding
workload,
we're
seeing
that
we're
still
able
to
process
freeze
at
about
1400
megabytes
per
second
versus
the
the
writes
are
about
700
megabytes
per
second,
so
it
kind
of
makes
sense
that,
like
half
the
time
we're
reclaiming
and
half
the
time,
we
don't
need
to
be
reclaiming.
B
So
you
know
for
our
workload,
which
is
this-
you
know
pretty
demanding
workload
the
because
we
aren't
reclaiming
all
the
time
we're
able
to
keep
up,
and
we
can
see
that
the
impact
on
the
rate
that
we're
ingesting
data
with
with
txt
sync
writes.
It
goes
down
a
little
bit,
but
not
not
that
substantially,
while
we're
in
the
middle
of
the
reclaim.
B
So
you
know
the
kind
of
conclusion
for
us
was
you
know,
there's
spare,
there's
spare
network
bandwidth
and
we're
able
to
do
this
in
parallel
with
ingesting
the
rights
still
at
a
good
throughput.
So,
even
despite
not
having
some
of
these
some
of
that
future
work,
it
looks
like
it's
good
enough
for
our
workloads,
and
you
know
for
more
typical
workloads
where
you're,
like
the
record
size
is
the
default,
and
most
of
my
data
is
in
large
objects.
B
It
is
in
large
files
and
they
have
the
you
know
record
size
of
128k
or
maybe
even
more
mostly.
What
I'm
doing
is
accessing
files
sequentially
really
this
this
solution
is,
is
vastly
over
engineered
for
that
kind
of
use
case
and
the
really
the
impact
of
reclaim
is
going
to
be
negligible
on
that
kind
of
use
case,
but
we
wanted
to.
B
Obviously
we
wanted
to
make
this
work
for
our
use
case,
and
it's
also
really
satisfying
that
you
know
this
works
for
a
wide
variety
of
use
cases
we
don't
have
to
say,
like
oh,
like
zfs,
on
object
store
as
long
as
you're,
just
using
it
for
archival
or
backup
or
whatever.
This
is
really
it's
usable
for
general
purpose,
workloads,
I'm
even
including
some
of
the
most
demanding
ones.
Obviously,
there's
performance
concerns
there,
but
the
the
ultimate
the
bottom
line
is
that
performance
is
very
good.
C
All right, okay. So, as you've been told, the ZFS object agent is a userland thing, so we'll start by kicking it off, and we'll let it run in that window. So we are invoking the ZFS object agent.

C
As George mentioned, we have a new vdev type, the s3 type, and we have to specify the bucket, the place where all the objects will reside, and in this case I'm going to add in a log device as well. So the pool is going to get created, and that's the pool for you. If you notice, just like the devices are specified, the bucket is specified here, and you can see the NVMe log device there as well.

C
I started off the agent by hand, but you probably want to be using some system management service like systemd to run it. So let me pause the video just a second. Again, as George and Matt mentioned, the agent is designed such that it can be killed and restarted, or it can crash and come back up, and things keep going just fine. So it's resilient.
C
So
that's
exactly
what
I'm
doing
here,
I'm
killing
off
the
agent
the
pool
is
still
online
and
I'm
going
to
start
it
off
as
a
systemd
service,
so
the
systemd
service
is
on
is
is
going
and
let's
take
a
look
at
the
log
on
this
window,
the
debug
logs.
While
we
proceed
with
the
demo.
C
C
So
matt
talked
about
workloads
where
you
have
a
8k
record
size
compression
is
turned
on
and
you
have
a
large
file
and
you
do
random
writes.
Let's
do
just
the
same
thing.
Okay,
let's
create
a
large
file
and
we'll
we,
we
have
a
z,
first
file
system
with
eight
k,
record
size,
compression
turned
on
and
let's
go
do
some
random
bytes,
so
we're
going
to
first
create
a
large
file
and
I'm
going
to
speed
this
up.
C
All
right
and
next,
let's
do
some
random
rights,
and
then
we
are
going
to
use
iostat
to
see
how
it
looks
and
iostat
has
been
improved
or
enhanced
with
the
dash
option
so
that
all
object
stories
things
are
listed.
So
if
you
look
at
the
first
column,
that's
that's
the
traffic
between
the
kernel
and
the
agent.
C
The
second
column
tells
you
about
the
traffic
between
the
agent
and
s3,
the
the
object
store.
The
metadata
traffic
is
captured
and
reclaimed
happens.
Man
went
over
that
quite
a
bit,
so
reclaim
happens
as
well.
So
the
the
thing
to
notice
here
is
that
we
are
doing
about
100
meg
of
throughput
and
the
the
the
throughput
between
the
agent
and
the
s3
is
also
roughly
close,
but
the
number
of
objects.
The
the
operations.
C
There
is
a
significant
difference
about
150
operations
from
the
agent
to
the
s3
bucket,
but
about
36k
of
operations
is
what
the
kernel
is
doing.
C
C
All
right,
let's
kill
that
workload
and
let's
look
at
something
else,
so
import
export
now
export
is,
is
fairly
straightforward.
It's
it's!
You
export
a
pool
and
you're
done
nothing
special
about
this,
but
import
as
you
might
imagine,
likes
equal,
create
it
needs
additional
parameters,
it
needs
to
be
told
they
have
to
go
import
stuff
from
the
the
endpoint,
the
region
and
the
bucket
the
s3
bucket.
So
the
dash
d
option
specifies
the
the
bucket
think
of
it.
C
It
takes
a
little
bit
of
time,
we'll
talk
about
that
in
a
bit
about
five
seconds,
and
we
should
be
done
so
we
are
done.
Let's
take
a
look
at
the
z
pool
list,
so
pool
one
is,
should
be
there
yeah
it's
there
and
it.
We
can
also
do
a
search
import
where
you
tell
zfus
import
to
go,
look
for
tools,
and
so
let's
try
that
on
our
bucket
and
interestingly,
we
have
a
pool
that
we
did
not
create
on
this
vm
cool
2..
So
why
don't
we
just
go?
C
Pull
two:
let's
go,
try
and
import
that
and
we
hit
a
failure.
Specifically,
it
says
that
the
pool
can't
be
imported
because
it's
currently
hosted
on
a
different
host.
It's
imported
on
a
different
host.
What
exactly
happened
here?
Hey
paul!
Can
you
walk
us
through
this.
D
Sure thing. So this is an example of a feature called multi-modifier protection, or MMP. ZFS already has this feature implemented in the kernel; it's useful for situations where you have storage area networks or network-attached storage, where multiple systems could potentially try to import the pool at the same time.

D
We re-implemented this in the agent for a very important reason, and that's that it's extremely simple, in the cloud and object storage use case, for multiple systems to be accessing the same bucket and the pools within that bucket, and with some pretty easy misconfiguration steps it's very simple for multiple systems to try to import the same pool at the same time.

D
So for that reason we decided to re-implement MMP in the agent, protecting both the ZFS data and the agent's own metadata. The checks take slightly longer, as Manoj pointed out, during the import process; it takes anywhere between five seconds and 20 seconds in a contended use case. But it's designed in such a way that it is almost impossible for multiple systems to end up succeeding at an import.
C
All right, thanks Paul. So let's do a little bit more of the MMP stuff. Let's go over to the host where pool2 is online; yeah, it's right there. And instead of cleanly exporting it and importing it on the other host, let's do something more interesting: let's just power off this other VM. We power it off, and then we are going to see if we can import the pool safely, with MMP, on the other VM.

C
All right, it took us about 34 seconds here, but the pool has been safely imported and it's online.
C
So
let
me
pause
the
video
again
to
talk
about
something
else.
The
if
you
noticed
we
had
multiple
pools
on
the
same
bucket
and
that's
that's
that's
part
of
the
design.
That's
that's
very
much.
How
we
wanted
this
to
be
the
bucket
is,
can
be
shared
by
multiple
pools
across
multiple
hosts
and
each
pool
has
its
own
name,
space
into
which
all
the
objects
grow,
so
they
don't
clog
or
each
other
and
you
could
have
a
shared
bucket
or
you
could
have
different
lockets.
C
C
We
have
a
new
object,
endpoint
property,
the
region,
the
credentials
profile,
we'll
talk
about
that
in
just
a
bit,
and
so
those
are
the
new
properties
that
an
object,
store,
based,
vm
z,
pool
next.
Let's
talk
about
zip
will
destroy,
zip
will
destroy
for
object,
store
based,
z,
pools
are
slightly
different
different
because
we
we
or
the
customer
is
paying
for
these
es3
objects
and,
unlike
a
block
based
z,
pool
where
you
destroyed
the
block,
stay
there,
it's
marked
as
destroyed,
but
the
blocks
just
stay
there.
C
We
don't
want
that
behavior
for
object
store
when
we
want
when
we
say
destroy,
we
want
the
objects
to
go
away
so
that
we
are
not
consuming
space
on
s3
right.
So
when
you
destroy
a
pool,
what
happens?
Is
the
agent
kicks
off
a
background
task
that
goes
and
cleans
up?
All
the
objects
destroys
them
and
reclaims
the
space,
so
zee
pool
status.
C
Why
the
tool
is
being
destroyed
will
tell
you
that
it
will
list
the
pools
that
are
destroyed
and
once
the
pool
has
been
completely
destroyed,
you
have
an
additional
flag.
The
dash
dash
list
destroyed
option
that
can
give
you
a
list
of
our
tools
that
have
been
destroyed
so
that
you
can
confirm
that
your
pool
has
indeed
been
destroyed,
and
if
you.
C
Can
use
the
clear
and
it
will
go
away
now.
I
very
cleverly
did
not
deal
with
credentials
so
far,
and
let's
now
talk
about
that.
So
if
you
have
used
the
aws
cli,
then
they
take
as
credential
sources.
They
can
take
credentials
from
a
variety
of
sources,
and
for
this
demo
I
used
the
aws
identity
and
access
management
rules
that
are
associated
with
an
instance
profile,
and
that
means
that
or
what
that
gives
you
is
that
you
don't
have
to
deal
with
credential
rotation
and
things
like
that.
C
It's
done
for
you,
you
could
do
that
and
all
the
various
sources
that
the
aws
cli
takes
as
input.
The
zfs
object
agent
can
as
well.
So
you
can
use
instance
profile
if
you
like,
or
you
can
use
environment
variables,
aws
credentials
and
environment
variables.
If
that's
your
preference
or
you
can
specify
it
in
the
dot,
aws
credentials
files.
C
C
A
So what I wanted to talk about now is a little bit about performance. I mentioned before that we actually started off looking at s3backer and L2ARC and some of the existing components.

A
So it made sense for us to see where those things stand today. We ran some tests and performance numbers just using s3backer, trying to get its raw throughput.

A
You can see that we're right around the 3x to 6x range in the performance numbers, so we're getting a lot more read throughput and a lot more write throughput. That's great because S3 has a lot of throughput to give, and you're going to be limited by how much bandwidth your instance type has.

A
But what we really care about are the IOPS. So again, I wanted to start off with s3backer and what that gives us in the configuration we had, and I started off without an L2ARC; this is just going straight to the S3 device. You can see the random reads are just horrible and the random writes are all right, but nonetheless this would not be something that would satisfy us. And the latency for s3backer is actually really bad...
A
When
we
talk
talk
about
reads
so
we
needed
to
kind
of
look
at
like
you
know,
this
was
kind
of
a
non-starter
for
us
again,
looking
at
the
comparison
with
just
using
zfs
and
our
implementation
of
object
store.
We
see
that,
at
least
in
this
case
we're
able
to
lower
the
latency
of
random
reads
from
346
milliseconds
to
89
milliseconds,
so
still
not
great,
but
a
big
improvement
overall
on
kind
of
where
we're
headed.
So
we
kind
of
knew
we
were
on
the
right
trajectory.
A
But,
as
I
mentioned
in
the
talk
like
we
needed
something
better
and
that's
where
zetta
cash
is
going
to
pick
up.
So
we'll
talk
more
about
that
this
afternoon,
but
is
there
more
to
like
what
we've
solved
here
like,
as
you
start
thinking
about
kind
of
some
of
the
techniques
that
we
use
to
implement
object
store?
Can
we
go
beyond
that?
So
I
wanted
to
kind
of
like
throw
out
some
kind
of
crazy
ideas
of
like.
Where
do
we
go?
You
know.
Can
this
be
leveraged
in
other
ways?
A
You
know,
could
we
define
an
object
to
now
like
define
a
track
on
a
single
media
drive
and
then
use
some
of
the
techniques
of
like
actually
knowing
a
block
to
object,
mapping
to
define
and
move
regions
around
when
you're
actually
going
to
do
writes
you
know,
maybe
this
replaces
the
need
for
bp
rewrite.
We've
talked
about
that
for
many
years.
A
Could
we
abstract
the
block
pointer?
Now
we
have
the
ability
of
having
the
block
pointer
just
be
more,
you
know
virtualized,
and
it
doesn't
matter
where
its
location
is
because
again
using
the
object
map,
could
we
actually
determine
where
its
new
location
should
be?
If
we
need
to
move
things
around
and
there's
many
more
there's
many
things
that
you
could
think
of
the
other
thing
to
think
about
here.
Is
you
know
what
would
you
do
with
zfs
if
you
had
unlimited
storage?
B
All
right
so
there's
a
question
about
scrub
and
then
a
couple
of
kind
of
related
questions
about
scrub
and
how
does
xeeple
status
dash
v
work
at
a
high
level
I
mean
zfs
is
still
maintaining
a
checksum
of
each
data
block
stored
in
the
block
pointer
in
the
indirect
blocks,
and
so
you
can
run
zeufo
scrub.
B
It'll
go
read
every
data
block
off
of
the
object,
store
and
verify
that
the
checksums
match
and
if
they
don't
match,
then
it'll
be
reported
in
the
zepal
status
v,
just
like
normal
right
now.
B
The
performance
of
that
is
not
great,
because
we
didn't,
we
don't
have
the
like,
sequential
scrub,
optimization
hooked
up.
It
needs
to
be
updated
to
know
about
like
block
ids
versus
v
devs
and
offsets,
because
we
kind
of
shoehorn.
The
idea
of
the
block
id
like
into
the
block
pointer,
but
like
a
naive
interpretation
of
the
block
winner
that
doesn't
know
about
that
would
see
like
would
think
that
there's
overlapping
allocations.
A
Yeah,
it
might
be
worth
mentioning
that
too
that,
like
we,
although
we
didn't
talk
about
it
here
with
zetta
cash,
is
we
have
taught
scrub
to
know
about
zettacash,
yeah
and
b
have
the
capability
of
scrubbing
the
blocks
that
are
going
to
be
stored
on
the
cache
you
know
for
those
that
are
kind
of
familiar
with,
like
the
way
the
cache
normally
plums
into
the
rest
of
the
stack.
It's
like
this
is
kind
of
it's
living
below
the
I
o
pipeline.
A
So
it's
in
a
different
location,
which
meant
that
we
had
to
kind
of
treat
it
a
little
bit
differently,
but
it
actually
worked
out
really
nicely
for
us.
B
Yeah,
I
think
that's
a
great
point
of
it
does
integrate
with
the
is
that
a
cache
that
we'll
hear
about
this
afternoon.
B
There
are
a
couple
questions
that
haven't
been
answered
yet
here
so
from
thomas
wagner.
Are
there
any
active
use
cases
for
s3?
Already?
I
assume
you
mean,
like
anybody
using
zfs
on
object,
store
on
s3.
We
we
haven't,
put
this
into
production,
yet
we're
still
working
on
implementing
it.
But
you
know
the
use
case
is
the
one
that
we've
described
of
you
know.
Storing
databases
in
our
dell
fix
your
data
virtualization
product.
A
I
I
found
people
both
on
linux
and
freebsd
that
have
been
using
this
they're
using
it
primarily
for
backups
right
like
because
you
know,
as
I
mentioned
like
any
of
the
literature
you
read
object
store
is
just
not
really
designed,
for
you
know
for
low
latency,
you
know
high
transaction
type
of
applications,
and
so
the
applications
that
are
out
there
are
not
pushing
that
limit.
We're
kind
of
venturing
in
that
and
you
know,
from
what
we've
been
able
to
find
and
the
performance
numbers
we're
getting.
A
B
Cool. The next question, from powwow, is asking: what happens if you allocate an object in S3, but then the system crashes, or maybe the agent crashes? Is the object leaked, or do we somehow figure out how to delete it? Yeah, that's a great question. You can imagine something here where we're writing out these data objects as part of a txg.

B
First, let's cover the case where the system crashes. Let's say we're in the middle of a txg and we've written out object IDs five and six; those are part of the next txg. The whole system crashes, the kernel crashes, we come back up, and we open the storage pool. How do we find objects five and six to delete them?
B
We
do
so.
We
do
find
them
and
delete
them
and
the
way
that
we
do
that
is
so.
I
simplified
the
the
key
here.
A
little
bit,
we've
actually
like
forward
padded
like
padded.
Each
of
these
object
ids
with
like
it's
actually,
you
know:
zero:
zero,
zero,
zero,
zero,
zero,
zero,
zero,
five
and
the
reason
is
that
then
it
lets
us
easily
find.
We
know
what
is
the
last
valid
object.
Id
that's
basically
stored
like
in
the
equivalent
of
the
uber
block
like
in
a
per
txt
data
structure.
B
So
when
we
open
the
pool
we're
like
okay,
the
last
valid
txg
is
192.
in
that
in
that
txg
the
last
valid
object.
Id
is
four,
and
so
we
can
do
a
list,
objects
and
list
list
all
the
objects
that
are
in
this
prefix
after
object
after
the
one.
That's
zero,
zero,
zero,
zero,
four
and
that
list
will
contain
you,
know
five
and
six
and
then
we'll
delete
them.
So
that's
actually
very
quick.
An
efficient
way
to
take
care
of
that.
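A sketch of that recovery scan (assumed helpers; the listing presumably maps onto an S3 list-objects call with a start-after key): zero-padding makes lexicographic key order match numeric object-ID order, so one listing finds everything written past the last committed object.

```rust
// Zero-padded key, so "00000005" sorts after "00000004" lexicographically.
// The padding width here is illustrative.
fn padded_data_key(pool_guid: u64, object_id: u64) -> String {
    format!("zfs/{}/data/{:08}", pool_guid, object_id)
}

// `list_keys_after` stands in for the object store's listing API invoked
// with start-after = the last key the final committed txg knew about.
fn find_uncommitted(
    pool_guid: u64,
    last_valid_object_id: u64,
    list_keys_after: impl Fn(&str) -> Vec<String>,
) -> Vec<String> {
    // Everything listed here was written by an uncommitted txg: delete it
    // (system crash) or reconcile it with replayed writes (agent crash).
    list_keys_after(&padded_data_key(pool_guid, last_valid_object_id))
}
```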
B
Now the agent-crash case: the kernel doesn't know anything about objects at all. It's still running, but it doesn't know about them, and the agent has been restarted, so it doesn't know what was in flight either. In that case we do the same thing: the last txg that was written out has last object ID four, so list what objects come after it. But those objects...

B
...we don't necessarily delete, because we might be in the middle of writing them as part of this txg, and we may have already told the kernel that the writes to those completed, because they were persisted into the object store, and as far as the agent was concerned everything was done, so it told the kernel: great, those writes are done.

B
You don't need to hold on to that memory anymore. And so the kernel forgets that memory. The kernel doesn't have any way to replay those writes, because they aren't in flight anymore from its point of view. But then there might be other outstanding writes, from the kernel's point of view, that do need to be replayed.

B
So basically we have to stitch together what was left in the object store, that is, what objects are there, with the ZIO writes that were outstanding when the agent crashed, which the kernel can replay into the agent, and fill in any missing objects that hadn't been persisted.
E
Perfect. First of all, awesome talk, awesome functionality, which I've been waiting for since the leadership meeting where it was introduced. At that leadership meeting I shared my use case for this, and that would be the ability to import a pool on multiple hosts and mount datasets on those multiple hosts; I have volume migration of containers in mind.
A
So one question: in the case where you're using it, are you envisioning moving the storage pool to another host, or how are...
E
You
no
you're
replicating
the
data
completely.
I
have
in
mind
the
one
storage
pool
imported
on
multiple
hosts
and
data
sets
that
are
dedicated
for
each
host.
B
Yeah, I think that's a neat idea. It doesn't fit exactly into what we've done here, because there's still pool-wide metadata that all of the hosts would need to access and update, like, for example, the last block ID that was allocated. So maybe during the hackathon, or the discussion afterwards, we can kick around...

B
...some ideas of what could be done there, but I don't think it just works, that's for sure.
E
Okay,
second
question:
have
you
in
mind
optimization
for
the
use
case
when
we
have
our
own
implementation
of
s3?
Let's
say
min
io
when
there
is
no
impact
on
the
amount
of
the
api
calls
you
are
allowed
to.
E
D
We
got
a
related
question
on
youtube,
actually,
which
is
do
we
have
any
tunables
around
controlling
api
calls
to
limit
costs
for
the
services
that
provide
that.
B
Yeah,
we
haven't
thought
extensively
about
like
what
exactly
you
would
want
to
optimize
differently.
We
we
do
have
extensive
tunables
to
control
all
the
stuff,
so,
like
average
block
size
average
object
size.
You
know
the
reclaiming
how
all
that
stuff
works.
B
So
I
hopefully
you
know
the
fact
that
s3
is
like
more
restrictive
in
terms
of
like
you
have
to
pay
for
every
little
thing
means
that
it'll
work,
fine,
you
know,
and
we
you
know
kind
of
tried
to
make
it
work
well
in
that
environment
it
should
work
well
in
less
restrictive
environments
as
well.
If
there's
some
other
cloud
provider
that
like
charges
you
for
something
that
amazon
doesn't
charge
you
for,
then
that
would
be
interesting
to
know
about
so
that
we
could
think
about.
D
A
And definitely with MinIO, we've tried that out and it works just fine. But yeah, there might be cases where we could make a different design decision, knowing that we don't have to pay the per-egress/ingress, per-operation type costs. That's definitely something to think about.
D
Yeah
and
then
a
related
question
that
we
also
got
on
youtube,
was
a
number
of
people
were
asking
about
stuff
at
the
v?
Dev
object
store
layer,
you
know:
do
we
support
clouds
other
than
s3?
D
Would
it
be
possible
to
have
multiple
clouds
backing
a
pool
in
like
a
mirroring
configuration,
and
is
it
possible
to
have
disks
combined
with
object
storage
as
the
back
end,
and
I
answered
all
those
there,
but
I
figured
I'd
just
repeat
the
answers
here
right
now.
We
support
any
cloud
that
use
provides
the
s3
object,
storage,
api,
which
is
a
number
of
them,
but
we
do
plan
to
add
support
for
a
few
more
things
in
the
future.
D
Like
azure,
I
think,
has
their
own
api
and
we
would
like
to
add
support
for
that
as
well.
Currently,
there's
no
capability
to
mirror
between
multiple
object
stores
as
the
back
end,
but
again,
this
is
it's
not
precluded
at
all
by
the
design.
It's
just
something
we
would
need
to
actually
like
work
on
and
implement
and
then
the
question
around
having
disks
combined
with
object,
storage
and
like
possibly
as
a
tiering
solution.
D
It's
an
interesting
idea
like
having
disks
as
your
primary
store
and
then,
following
you
know,
migrating
data
back
to
object,
store
to
save
space
and
stuff,
like
that.
We
haven't
done
any
design
work
on
it,
but
it
is
definitely
something
that
would
be
interesting
to
work
on
in
the
future.
I
think.
B
Cache
talk
yeah,
it's
kind
of
related
to
that
yeah.
We
we,
you
know
we
kind
of
need
multiple
tiers,
but
we're
you
know
we
designed
it
as
a
cache,
rather
than
a
tiered
kind
of
thing
where
you
know,
tiering
usually
means
that
the
data
might
live
in
exactly
one
place
and
you
can
like
move
it
from
here
to
there
and
it
doesn't
exist
here
anymore,
which
we
didn't
see
as
a
requirement
for
our
use
cases.
B
B
Rahul
asked
how
how
does
reclaim
happen
if
you're,
using
like
s3,
tiering
or
lifecycle
policies,
so
the
free
space
reclaiming
that
we're
doing
is
doesn't
interact
with
those
s3
level
things
so,
basically
like
we're
kind
of
assuming
that
only
one
copy
of
an
object
is
retained
in
terms
of
life
in
terms
of
like,
if
you're,
using
s3,
like
lifecycle,
like
keeping
you
know,
keeping
old
versions
of
it
versioning,
we
don't
take
advantage
of
that
versioning
and
we
kind
of
assume
that
you
don't
have
it.
B
You
could
probably
use
that
to
like
roll
back.
Your
pool
really
far
or
something
but
we
haven't,
we
haven't
tested
that
out
in
terms
of
tiering
the
tiering.
I
mean
it's
gonna
kind
of
just
work
and
do
what
it
does
like
moving
stuff
from
the
from
the
s3,
like
normal
tier
to
glacier
or
whatever.
B
Of
course,
if
you
know
if
we
go
and
do
reclaim
and
that
like
needs
to
read
some
old
object,
then
that's
going
to
bring
it
back
from
the
glacier
tier
to
the
main
tier,
so
you'd
probably
want
to
configure
like
the
reclaim
and
the
like
movement
policies
such
that,
like
normally
you
wouldn't
like
normally
stuff,
wouldn't
get
moved
into
the
glacier
tier
until
like
the
freeze
had
already
most
of
the
freeze
had
already
been
processed.
B
I
suspect
that
it'll
kind
of
gen
like
if
you
have
those
kind
of
workloads
that
for
which
the
turing
is
useful,
then
it'll
probably
just
work.
Fine
anyways,
because
you're,
probably
using
big
files
and
big
blocks
and
the
reclaim
is
really
not
an
issue
in
those
cases.
In
theory,
I
think
we
probably
could
add
some
smarts
to
say.
Like
you
know,
before
doing
you
know,
you
have
different
reclaimed
policies
for
stuff,
that's
in
the
glacier
tier
versus
the
normal
tier
and
like
we
can
query
and
find
out.
B
F
I suppose I could. I don't really know much about S3, so I don't know if there's a particular limit to it, but in all your examples you had the object IDs counting up; in your example there were four, five, and six. And then you have the object mapping to take the block number and map it to which object ID it goes to. Is there any particular reason you couldn't, instead of using four, five, six, use the lowest block ID, like 123, 345, 346, and so on?
B
I think that might work... until you do object consolidation.

B
Because... let's see.

B
Yeah, since we're always consolidating to the left, it might work. That's interesting.
B
I actually had the design that way in an earlier version, and now I'm trying to remember; I'll have to look through my notes and see why I changed it. It might be that I just changed it for ease of comprehension, because it's a little bit hard to wrap your head around: oh, there's this object whose ID is 346, and that tells me something about its contents, versus this more abstract...

B
...more cleanly abstracted layering. But yeah, we should go look at that again and see if there's a memory savings we could get by doing that. Okay.
D
There
was
one
other
question
from
youtube,
which
was
about
the
mmp
stuff,
which
is,
would
it
be
possible
to
have
one
modifier
and
also
have
readers
operating
in
parallel
with
the
modifier
and
the
answer
that
is
yes,
except
that
once
blocks
start
to
get
reclaimed,
you
can
run
into
some
issues,
so
you
could
potentially
use
something
like
checkpoints
for
that
which
would
prevent
the
reclaims
from
happening
and
should
make
it
possible
to
do
reads
of
things
like
snapshots
safely.
D
Even
if
you
know
the
active
system
could
be
destroying
those
snapshots
or
whatever
and
checkpointing
is
implemented
for
the
object
store
as
well.
We
do
have
that
working.
B
Yeah,
the
behavior
would
be
like
at
one
level
kind
of
similar
to
having
a
block
based
pool
that
has
you
know
one
writer
and
multiple
readers
where
it's
like
yeah,
like
as
long
as
the
reader
starts
from
a
given
snapshot
from
a
given
txg
and
no
and
those
blocks
aren't
freed
or
overwritten.
Then
it'll
continue
to
work,
and
you
know
you
could
use
checkpoints
to
ensure
that
that's
the
case
where
you
say
like
okay,
like
multiple
readers,
open
the
pool
from
the
checkpoint
that
that's
totally
safe.
B
But
if
you
want
a
more
like
arbitrary
thing,
then
you
might
get
check
some
errors
or
on
critical
metadata
and
things
might
blow
up
for
object
store
it's
kind
of
similar,
but
it's
a
little
bit
safer
because
you
know
you
aren't
going
to
get
like
check
some
errors
on
weird
things.
You
do.
You
know
the
only
error.
That's
really
possible
is
like
I
read
I
I'm
doing
a
read.
The
block
should
be
in
this
object,
but
it's
either
not
in
that
object
or
the
object.
Id
doesn't
exist
anymore.
B
So
it
should
be
like
a
little
bit
easier
to
handle
that
error.
Maybe
but
you
would
still
have
that
problem
in
the
kind
of
general
case
if
you
weren't
using
the
checkpoint,
like
pulsing.