From YouTube: Ceph Performance Meeting 2020-09-24
A
All right, I will start with the pull requests and people can chime in as we go. Let's see — not a whole lot again this week; we've got two new pull requests. One is from Adam — the other Adam — that is replacing my old PR to allow for separate RocksDB block caches on a per-column-family basis. The idea behind this is that, by allowing for multiple RocksDB block caches, we can distinguish between the block cache that services onode column families and the one that services omap column families. By doing this we can avoid double-caching onodes in both the RocksDB block cache and the BlueStore onode cache, while preserving cache for omap entries.
A
So my old PR was basically in place before we did the column family sharding and needed a refactor and rework, and since Adam had already done all the column family work, I asked him if he would mind taking a look at it, and he's re-implemented it. I think there's still a little bit of work to do there, but hopefully that will be fairly quick and we can get this in. It should result in basically much better cache behavior under the same memory envelope in the OSD.
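For illustration, here is a minimal sketch — an assumption, not the actual PR — of how separate block caches can be attached to different column families with the stock RocksDB API. The CF names and cache sizes are placeholders; the point is that the onode CF can get a deliberately small block cache, since BlueStore already caches decoded onodes itself, while the omap CF keeps a large one:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/table.h>
#include <vector>

int main() {
  rocksdb::Options db_opts;
  db_opts.create_if_missing = true;
  db_opts.create_missing_column_families = true;

  // Two independent LRU block caches; the sizes are illustrative only.
  auto onode_cache = rocksdb::NewLRUCache(64 << 20);   // small: BlueStore caches onodes
  auto omap_cache  = rocksdb::NewLRUCache(512 << 20);  // large: keep omap blocks hot

  rocksdb::BlockBasedTableOptions onode_tbl, omap_tbl;
  onode_tbl.block_cache = onode_cache;
  omap_tbl.block_cache  = omap_cache;

  rocksdb::ColumnFamilyOptions onode_cf, omap_cf;
  onode_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(onode_tbl));
  omap_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(omap_tbl));

  // Hypothetical CF names, just to show the per-CF wiring.
  std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
      {"onode", onode_cf},
      {"omap", omap_cf},
  };
  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db = nullptr;
  auto s = rocksdb::DB::Open(db_opts, "/tmp/percf_cache_demo", cfs, &handles, &db);
  if (!s.ok()) return 1;
  for (auto* h : handles) delete h;
  delete db;
  return 0;
}
```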
A
Let's see — the only other new PR was one for ceph-volume: retrieve device data concurrently. It was labeled with the performance tag. I don't actually know very much about this one, but presumably retrieving data concurrently would be a good idea, so I assume that's faster.

Otherwise, two updated PRs this week. There is a PR for allowing dynamic levels in RocksDB. This is probably a good idea, but could be kind of complicated for the user to set up properly. This PR basically just makes it possible to do this: the way our code was structured before, you really couldn't use that option in RocksDB; with this PR you now have a way to actually try it. We still don't know if it actually works well, and we still don't know how a user would properly set it up, but at least this is the starting point. So I think it's a good change. It would just be really good if we could demonstrate a case where switching over to dynamic level sizing is really beneficial.
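For reference, the underlying RocksDB knob is a single option; a minimal sketch is below. How Ceph plumbs it through (for example via the bluestore_rocksdb_options config string) is the part the PR adds and is not shown here:

```cpp
#include <rocksdb/options.h>

// Sketch: enable RocksDB's dynamic level sizing. With this set, per-level
// target sizes are derived from the actual size of the last level rather
// than fixed multipliers of max_bytes_for_level_base.
rocksdb::Options dynamic_level_options() {
  rocksdb::Options opts;
  opts.level_compaction_dynamic_level_bytes = true;
  return opts;
}
```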
A
Theoretically it should be, but we should show it. The other updated PR is the D3N cache changes for RGW. I think that went through a QA test — I don't know if it actually passed or not — but Matt did a couple of additional reviews on it, so it looks like there may be some additional work that needs to be done.
A
Then we should get that in. Do you have the PR number?
A
Did you say that, with range delete — or delete range — you're seeing good behavior? Is it extra compactions that you're doing?
B
Yeah, so the idea is to perform the range delete followed by a synchronous — well, actually, the idea is to queue it, and then perform this range delete coupled with a compaction of the same range in a background thread. This compaction thread can also batch these range deletes and compactions, so instead of multiple operations it pretty often performs them in a single one. For now, from what I can see, this means that we don't have these large tombstones sitting in RocksDB for a long time — they are compacted shortly after they are issued — and so far I can see pretty nice results.
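A minimal sketch of the pattern being described, using the standard RocksDB calls (the actual patch queues and batches these in a background thread, which is not shown):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Sketch: issue a range tombstone, then immediately compact the same range
// so the tombstone is consumed right away instead of lingering until it
// reaches the bottom level.
void delete_and_compact(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                        const rocksdb::Slice& begin, const rocksdb::Slice& end) {
  rocksdb::WriteBatch batch;
  batch.DeleteRange(cf, begin, end);        // writes a single range tombstone
  db->Write(rocksdb::WriteOptions(), &batch);

  rocksdb::CompactRangeOptions cro;
  cro.exclusive_manual_compaction = false;  // coexist with automatic compactions
  db->CompactRange(cro, cf, &begin, &end);  // consume the tombstone now
}
```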
A
Excellent. Okay — anything else, guys, that I missed? I miss stuff all the time, so feel free to chime in.
E
Yeah, sure. I mean, I hadn't read much beyond the abstract when I suggested it, so I don't know — but let's quickly summarize. Essentially, they're looking at trying to make more realistic workloads for evaluating RocksDB performance. They took traces from three different applications at Facebook that ended up using RocksDB under the hood and that did fairly different things with it — some of them used multiple column families in different ways.
E
So, as I recall, they had a few different aspects they focused on. One was key locality — having certain hot ranges of keys, which makes a lot of sense if you have certain prefixes that you're using in different ways. They also have the concept of varying intensity in terms of queries per second or operations per second, which exhibited a strongly diurnal pattern for a lot of workloads, so they ended up modeling that based on a sine-like periodic function. Essentially, they ended up comparing the workloads they generated — based on fitting the different characteristics of these workloads to more specific models for each characteristic — with a generic YCSB benchmark.
E
The workload that was closer to their traces had characteristics that much more closely matched what ended up happening at the storage layer — in terms of read amplification and write amplification — than the more randomized, less realistic YCSB workload they saw.
E
I guess there are two aspects, at least, that got to me. One was that just looking at the trace data itself was a bit interesting because of how small the keys were — I think even their very large values were 10K, which is still fairly small compared to a lot of the keys and values we have. And I think the concept of trying to very closely match a trace with a synthetic workload, using different models for different components, makes a lot of sense.
G
So what tool were we using so far to benchmark RocksDB for our code? Were we using something like...
E
Typically we're using higher-level benchmarks, rather than benchmarking RocksDB directly — something that goes through one of the high-level protocols, like RGW or RBD.
A
Yeah — I don't think anyone's ever tried to figure out how to simulate an OSD workload such that we could just run it directly against RocksDB in a standalone way.
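As a thought experiment, a first cut at such a standalone driver might look something like the sketch below. This is purely hypothetical — not Ceph code, and the key shapes, sizes, and trim depth are made-up — but it replays an OSD-like pattern of onode updates, PG-log appends, and trims straight against RocksDB:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  if (!rocksdb::DB::Open(opts, "/tmp/osd_sim", &db).ok()) return 1;

  const std::string onode_val(500, 'o');  // made-up sizes
  const std::string pglog_val(200, 'p');
  const uint64_t trim_lag = 3000;         // pg-log length mentioned later in the discussion

  for (uint64_t ver = 0; ver < 100000; ++ver) {
    rocksdb::WriteBatch batch;
    char okey[32], lkey[32];
    // Each "client write" updates one onode out of a hot set...
    snprintf(okey, sizeof(okey), "O_%016llx", (unsigned long long)(ver % 1024));
    batch.Put(okey, onode_val);
    // ...appends a pg-log entry...
    snprintf(lkey, sizeof(lkey), "L_%016llx", (unsigned long long)ver);
    batch.Put(lkey, pglog_val);
    // ...and trims an old pg-log entry, which creates tombstones.
    if (ver >= trim_lag) {
      char tkey[32];
      snprintf(tkey, sizeof(tkey), "L_%016llx", (unsigned long long)(ver - trim_lag));
      batch.Delete(tkey);
    }
    rocksdb::WriteOptions wo;
    wo.sync = (ver % 32 == 0);  // periodic sync points, like kv-sync commits
    db->Write(wo, &batch);
  }
  delete db;
  return 0;
}
```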
A
I kind of question how much you actually need that. In the paper they kept talking about how different the workloads they were running were from the real workloads, right — like, they were seeing much different cache hit rates. So, okay, yes, of course that means maybe you need to focus more, from a performance perspective, on how well the caching is working, or maybe you need to focus on some other area, depending on what the workload is. But as I finished this paper and read through it, I was kind of left with a "well, what have you actually accomplished?" taste in my mouth. Okay — they showed that they can figure out better ways to represent these workloads. Great. To me, that feels like the very first step in what should be a much bigger, more interesting paper. And maybe that's what this is — maybe they are going to do that — but so far, I guess it felt like 17 pages was a lot to cover what they actually did here.
A
Yes, yes, I agree. I don't want to totally discount what they did, because it is useful.
A
One thing I would have liked to see: they mentioned YCSB using a Zipfian distribution and how poorly that modeled — I don't remember which workload it was, but one of the workloads they had — it just wasn't a good representation of it. But if I remember right, YCSB lets you control the way the Zipfian distribution is laid out — basically, it lets you change its scaling factor. And I didn't see any reference in the paper — maybe I missed it; it's a long paper, or dense at least — that they actually tried adjusting the existing models to better match what their application behavior was.
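For context on that scaling factor: YCSB's Zipfian request distribution is governed by a skew constant (theta, around 0.99 by default, if I recall). The toy sampler below — not YCSB code — shows what varying that constant means for key popularity:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Toy Zipf sampler over n ranked keys. theta is the skew knob YCSB exposes:
// theta -> 0 approaches uniform; larger theta concentrates traffic on a few
// hot keys. This is the parameter the paper apparently never explored.
class ZipfSampler {
 public:
  ZipfSampler(size_t n, double theta) : cdf_(n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += 1.0 / std::pow(i + 1, theta);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
      acc += 1.0 / std::pow(i + 1, theta) / sum;
      cdf_[i] = acc;  // cumulative popularity of ranks 0..i
    }
  }
  // Draw a key rank: invert the CDF with a uniform sample.
  size_t operator()(std::mt19937& rng) const {
    double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
    return std::lower_bound(cdf_.begin(), cdf_.end(), u) - cdf_.begin();
  }
 private:
  std::vector<double> cdf_;
};
```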
E
I thought that was kind of what they were talking about in terms of fitting the models to their applications' behavior. They described trying to fit different distributions to their traces, which would imply varying the Zipfian parameters, in addition to varying the Pareto parameters or the parameters of other distributions.
G
I thought that's what they did in section 7.3, when they were comparing the benchmarking results. They said that they configured YCSB to fit the ZippyDB workload.
H
As far as I can remember, I think YCSB has something called "read latest," and you can also give it a distribution of how much you want to read based on time, and also proportions — that much I think everybody knows. So I didn't see that being used.
E
I think in their paper they had a very brief paragraph on that, but they didn't go into as much detail as I would have liked. By the end of section three they described just looking at the...
H
Yeah, I mean, I'm not super familiar with it, but I have used it in the past, and I know that they have a bunch of tunables that you can tune to model workloads — that's the reason it's been such a popular tool, in the past at least. It just seems like they decided, "okay, we're not going to experiment too much with this," and, as Mark said, "we want to just create something that works better for us."
A
Yeah, that was my understanding too. I'm not super familiar with YCSB — I've used it a couple of times — and it seemed like you had the ability to tune things a lot. So I have very much the same feeling you do. Well, I don't know — maybe they really did look into it and determined it wasn't sufficient, but it would have been nice if they had fleshed that out a lot more.
E
Yeah, I suspect there may be some pieces here that are more in depth with respect to the key-value pair ranges and sizes; they mentioned trying to contribute some of that to YCSB in the...
E
Analyzing — yeah, the tracing stuff has actually been in for the past year at least, but I don't think we've really tried it out, unless you've done some experiments, Mark.
A
Yeah, I mean, historically my view on a lot of this was just to make it as insanely fast and easy as possible to set up a Ceph cluster and run tests. But presumably you could go this way, right? You create some kind of — or maybe a variety of — different workloads, and if you can just run them directly with db_bench against RocksDB, that would potentially be even faster and more useful. Maybe.
E
Yeah, I guess I was thinking it could be useful for analysis, in addition to running a more isolated benchmark.
A
Sorry — I kind of started, honestly, tuning out a little bit by the end of the paper. Did they give kind of an overview of what all the tracing is capable of?
A
Tracing — okay, yeah, I see it here on page 212, which I guess is the same thing as page five in the PDF. Looks like that's where section three is.
A
I'm a little surprised that they said the lock being used to serialize all queries doesn't cause any performance overhead in their observations.
E
I guess another note that was important for the tracing aspect: they looked both at a large scale — like a 14-day period of time — when analyzing things, but also decided to take a trace over every single day as well, since the workloads can change quite a bit over that period. So if they have different things running on different days — I guess that's an important thing to keep in mind if we're thinking about doing tracing in the future.
A
I would think so, right — unless you start really changing things, like making really small object sizes or something. I'm guessing that a lot of RBD workloads are going to look a lot like other RBD workloads if they have the same kind of access pattern. Like, if you have reads happening, then depending on how much cache you have and how many onodes are being read from cache, you're going to have more or fewer RocksDB key lookups for those onodes — but those key lookups are always going to look kind of similar, I would think. Is that a good assumption, or am I wrong on that, do you think?
I
I guess you're wrong, Mark. I've seen RBD workloads, and they are very different — even from OSD to OSD working similarly at the same time. What's especially predominant is that from time to time there are streaks of very small writes. So basically, the rest of the time it's just large writes that keep coming slowly, and then there's a fast burst of small writes to different objects. That's what I've seen, and the patterns are really...
I
Of course, they translate into a bit more order on the RocksDB side, especially because we're often using the same objects, for example. But you couldn't infer that if I had a streak of small writes for some time, then it will continue for a long time or evolve into larger writes — it's just unpredictable. Sometimes it abruptly stops, and sometimes it turns into large writes.
A
My assumption was that if you had an RBD workload, you would always see some amount of — well, except in the perfect case — when you're doing writes, you'd be doing onode reads to check whether the onode exists, and potentially you'd always see fixed-size keys coming in, because RBD has fixed-size keys for the block layout. But with RGW you'd see very different behavior, right? Because it could be all over the place.
A
You have many different key sizes; you have potentially crazy kinds of workloads, depending on whatever the client is doing. I guess my hope was that RBD would be more consistent, but maybe that's wrong. I don't know.
E
I think, if we're trying to do more benchmarking with more realistic workloads, the idea of being able to control the model in a more fine-grained way might make sense for fio, for example — I'm not sure if it has similar kinds of options in terms of controlling locality.
A
You can really try to better model what it would look like if you were doing most of your reads out of cache — and we've done that in the past. The thing I don't like about it is that what it ends up doing is really de-emphasizing the performance impact of the storage, and mostly just focusing on how fast you're doing reads from cache, which is kind of completely uninteresting as far as actually making things better.
A
Maybe it's interesting for the customer, because if they can say, "well, 90% of my reads are coming from cache anyway," then it doesn't matter if the storage is not super fast, or slow, or whatever. But yeah, I don't know — I mean, from the perspective of giving the customer a valid...
E
Yeah, I think it sounds like you're describing, in general, when you'd want to run a benchmark with a more realistic workload versus when you'd want to do a more targeted benchmark.
A
Yeah. And Jens has actually done, as far as fio and the block layer go, a lot of the things that they talk about in this paper — being able to record traces and replay traces on block devices — and fio has some support for this kind of thing built in. fio does give you a lot of control over what kind of access patterns you want to create, and over the layout of files and things. So, I mean, yes, you have a lot of control. I don't know...
A
One second — okay, so this is the result of actually using the MLPerf tests that NVIDIA ran to showcase performance on their DGX-1 AI/ML reference architectures, and this was run in our Alias lab, which has very slow GPUs.
A
It took forever, and it was not super interesting. The gist of it is basically: you see a big spike of reads — not even that big; this is kind of pathetic, to be honest — but there's a spike at the beginning in both this GNMT test, which is an English-to-German or German-to-English translation, I think, and also the SSD test, which is an image workload. I don't remember the exact details of the algorithm — it's doing some kind of image recognition work — but it's very similar in both cases: it's mostly hitting cache, and it's super uninteresting for...
A
Exactly — but this is the benchmark that's used; this is like the industry standard that everyone doing GPUs uses for this kind of thing, and it's completely uninteresting on the storage side. It does not even remotely represent what a very large AI workload that real people are running would look like. And maybe this is where tracing helps — maybe this is what we'd use it for, right? Maybe this is a justification for why to actually do this: getting access to these data sets is really, really difficult.
E
And I guess the common problem in the past has been the difficulty of gathering traces from large-scale users of things like this — but if we can do that, we can replay them at even smaller scales.
A
To play devil's advocate a little bit, though: take this SSD test. It's basically going and performing image analysis on what used to be a large data set of images. I don't remember how big it is — maybe upwards of 50 gigabytes, something like that. It's small enough that it now fits in main memory, so you just hit page cache. But say it's 50 gigabytes of images that are, I don't know, on average between 64K and 128K in size.
A
Do we actually need to run this workload, or is it enough to say that you're just doing random reads of images into memory, and for whatever amount of memory you have, you get a certain cache hit rate, and some of these are going to be read from disk?
E
Yeah, it could be — and I think where we're actually stressing the storage, it is interesting, though, you know.
A
What I'm saying is: we can generate load — it's not that we won't be able to generate load — it's just that all this may end up being is how quickly you can load files that are moderately large, between 64K and 128K, into memory, given some amount of page cache on the client and buffer cache on the OSD.
A
Even if you had a large data set, though, right, all it might end up reducing to is: how quickly can you read files in a range of this size, given a certain cache hit rate, a certain amount of memory, and a certain likelihood of hitting the same file?
E
Yeah, I agree — I mean, I think there definitely are workloads that are simple enough that you don't need more accurate modeling to get a general idea of how they would perform, sure.
E
I think that's kind of one thing that got me into the paper: RocksDB is a fairly complex system, with the LSM tree and the different things it does to try to manage it. So that's one reason why more accurate workload modeling might be quite important there, compared with something like...
E
Yeah, yeah, exactly — that's what I'm getting at. That LSM structure is very complex, whereas a single object lookup in Ceph isn't going through that same kind of structure, and every single write isn't going through nearly as complex an I/O path.
E
They talked a lot about the sizes of the data sets and the different sizes of keys and values; I don't think they did a lot of analysis on...
A
What I was thinking is: if you end up in a situation where the database itself is being hit more strenuously, then as the database grows you end up with more time spent in compaction and more likelihood of hitting situations where you're blocking — like, a specific level is being compacted — and so your cache misses could potentially become a lot more severe.
E
We've got a few minutes left. Do you want to dive into the other discussion we started in the stand-up — with the guinea pig in there?
F
Okay. So here's what we're trying to do. At the moment, an object's onode goes to RocksDB in column family A, and we've got the PG log going to column family B. For the PG log entries, we never want them to go to disk; we just want them to be on the same write-ahead log as the object's onode, which is why we put them in RocksDB in the first place. The problem is the way we deal with them: we create log entries, but when the onode reaches the disk, we then remove them — we create a delete entry, which becomes a tombstone — and these things tend to stay forever, because usually you have to propagate them all the way to the last level before you can remove them.
F
So instead we just keep updating the value with the same key, but we never remove anything, and we hope that that way we're going to get away from the tombstones and all these problems. Because, in theory, our memtable is going to be very small — we have something like 3000 entries — and with just 3000 entries we assume that we could keep the memtable in memory and never go to disk.
F
Then I started reading the RocksDB documentation, and I realized that our design is not going to work the way it is. What happens is that a write-ahead log can never be removed until all the memtables that have entries in it have been flushed to disk. And because we never delete anything, the write-ahead log becomes bigger and bigger and bigger, and at some point RocksDB takes the memtable for the PG log and flushes it — so everything goes down to disk anyway.
F
So our solution was doing worse than before. My understanding now is that if we want to get performance, we need the write-ahead log to be deletable, so we need to find a way to allow it to be deleted while still never flushing the memtable for the PG log. The way a write-ahead log gets deleted is: every time a memtable is flushed, you need to make sure that all the memtables that have copies in that log have also been flushed, which means they've moved to a higher version. Does that make sense to everyone so far? I actually wrote a lot of this in my email today, so I think it will be easier to follow alongside that. So we need to find a way to fake the memtable flush. I was thinking maybe we could do everything that is done for the memtable flush — except actually flushing.
F
What happens in a memtable flush is that we create a new write-ahead log, we increment the memtable sequence number, and somehow the old write-ahead log is updated with our new version, so it knows it can ignore everything from us, because we've moved all the old versions — they've been flushed. Now it only has to wait for everybody else to flush and report that they've moved to the new sequence number, and then it can be discarded.
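For reference, that is the bookkeeping a normal flush triggers today; a minimal sketch of the existing API is below. This is the path the proposed "fake flush" or discard would have to emulate without the actual I/O:

```cpp
#include <rocksdb/db.h>

// Sketch of today's normal path: manually flushing the pg-log column
// family is what lets old WAL files be released. Flushing advances the
// CF's log number, and any WAL older than every CF's log number can go.
void release_wal_via_flush(rocksdb::DB* db,
                           rocksdb::ColumnFamilyHandle* pglog_cf) {
  rocksdb::FlushOptions fo;
  fo.wait = true;           // block until the memtable actually hits disk
  db->Flush(fo, pglog_cf);
}
```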
F
So I'm trying now to find the code doing the flush, and hopefully we'll be able to do these things. But it's a dirty hack, because the code was never intended for this to happen. I'm still trying to find the logic that could allow this. Maybe we could design a mode which says "memtable discard." By doing a memtable discard, we'd also need to remove the memtable, which might not be the end of the world. So instead of using a single memtable for the PG log, we could use two of them, and every time we fill one of them, we just call this discard — every time we want to trim everything there, we call this discard — and that will emulate the flush immediately, but without actually writing everything. Does that make sense? Because, I mean, the other way is...
F
If we break the PG log into two separate memtables, then we could use double buffering, and whenever we finish with one of them — and everything up to that point that we know has been flushed, meaning the onodes we opened PG log entries for up to that point have been flushed — we could call the equivalent of trim, but call it discard. I mean, there must be a way for people to say, "you know what, forget about this memtable, I don't want it anymore."
F
And if you don't want it, you could tell the write-ahead log, and you wouldn't have to write everything, so the whole thing gets freed. So I'm still trying to find out if there's an actual way to do that, because otherwise it would mean we'd have to keep creating and discarding memtables.
F
Yeah, so that's why it has to be a discard — can it just be hacked in, exactly. And if we know it works, we can even create a mode. In some filesystems, when they do a write, they have a flag to say "I'm writing to a temporary file, so keep it in memory but never put it to disk" — I need to remember the name for this; there's a name for this — so we could maybe call it something like a temp file, like a tmpfs.
F
Yeah — sorry. So, instead of using the skip list, we could use a hash table, which is going to give us faster access for reads and writes — actually, sorry, it's going to give us faster access for update, read, write, whatever — and we don't need a skip list. But that's just a secondary improvement; first we need to have this.
F
There is memcache, for example, right? Memcache doesn't need anything to be persistent. So we could have something like that — a temp memtable which it's possible to discard. And sometimes things get discarded because, you know what, we found some better place to store them, so we moved them somewhere else. I mean, what happens if your memtable has been migrated to somewhere else? You don't need to store it.
A
Yeah, I just wanted to mention very briefly that even if it's a side effect, it could actually be a really important one.
A
What's the value? The question, in my mind, is: historically, having the PG log helps you avoid the situation where you're going into actually looking at objects on disk for recovery, right?
E
Double buffering — where we potentially don't even keep as much of the PG log itself, but focus more on the dups, because those are very much time-bound.
A
I can't, probably — not in the way that you want. I've got a piecemeal understanding of certain parts of the RocksDB code, but not what I think you're hoping for. Adam, you've looked at it as well — how about you?
F
Okay, so I'm going to be the guinea pig — I'm going to try to do it my way. I think I understand the semantics, and I also read about other databases, because write-ahead log behavior is common to all key-value databases. I read about it in other databases' design documents, and they also mention similar behavior.
F
What I found was under LevelDB, some other RocksDB works, and somebody else, but I think all of them are from the same family, so that behavior makes sense. I'm still trying to find out if there is a way to do a per-column-family "discard this memtable," and if that can be done, then I don't have to do anything. If not, then I have to add this code to do a memtable discard, because the thing I suggested in the email is pretty ugly, right?
F
So the question is — and I cannot make any judgment call here — what Sam told me is that the reason they put the PG log in RocksDB is not because they couldn't use another write-ahead log; it's because they wanted it to be consistent with the object's onode. They want it all to be in complete sync. I don't know how critical that is. I didn't try to think whether there is a better way — or, sorry, another way — to do it while using a separate write-ahead log.
I
Then we aggregate writes to RocksDB, and at certain synchronization points we just tell it that the batch we've already written needs to land safely on disk — so we force it periodically. I mean often, but periodically.
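A minimal sketch of that pattern with the existing RocksDB API — aggregate into a batch, then force the WAL to disk at a sync point:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Sketch: everything queued in the batch is committed in one transaction,
// and sync=true fsyncs the write-ahead log before returning, which is the
// periodic "land safely on disk" point described above.
void commit_sync_point(rocksdb::DB* db, rocksdb::WriteBatch* batch) {
  rocksdb::WriteOptions wo;
  wo.sync = true;
  db->Write(wo, batch);
  batch->Clear();  // start aggregating the next interval's writes
}
```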
F
We could, at the start of every iteration — somebody, the dispatcher, is poking at all the queues, and it can say, "oh, I can see there's a request here, here, and here." So it could just take the list, create the write-ahead log entries, send them off, and only then go and start processing the writes. So it could do it ahead of time, really. It's going to be a "write-ahead-ahead log" — read-ahead plus write-ahead. Yeah, so you do read-ahead into the future.
I
I don't believe we can. How would we then recover from sudden loss, restarts, events like that? I'm not sure we can do that. If we do it that way — writing all the writes ahead, as a future — how can we then synchronize properly? Right now we just blindly trust RocksDB to batch it into one transaction, and we are synchronized only because of all those mechanisms that are already there.
F
You could just build your history, right? I mean, you could read from the queue and realize we're going to have that many writes — let's put them in the write-ahead log before we even start processing them. You might have some cases in which, later, you decide, "you know what, you don't need to do this write," because of some reason — there was some mistake in the write — but it's very uncommon. So you could send a tombstone — you could just do an undo in the write-ahead log for that action. But in the vast majority of cases you don't reject writes, right? The reason to reject writes is usually that you're running out of resources, or there's something wrong with the request permissions, I don't know — but you don't expect this to happen.
F
So, again, I'm still not really sure I fully understand the semantics of the write-ahead log. Sam was pretty firm on this: he wants a consistent, system-wide view between all the column families — onode and PG log, everything needs to be consistent in the same write-ahead log. I'm not sure why you'd want it, but he wants it. And I don't understand the system well enough to say we don't need it — actually, I don't understand it at all.
F
So there might be an extremely good reason why everything has to be together, but maybe it is possible for us to batch them together — just make a plan of what we're going to do. Looking into the future, we could poke all the queues, construct descriptions of our future work, and send them to the write-ahead log, and then, in our own good time, start processing them — and then do that again from time to time: every time you reach some point, you start looking into the future and building your write-ahead logs. But I need Josh and Sam to see if there's any merit to this concept.
C
Josh, Sage and I talked about...
E
For the PG log, I think there are some existing structures that aren't specific to PGs — like the SnapMapper, I think, was one of them.
E
But I think the general idea with batching into the write-ahead log is to avoid those extra IOs, essentially, and the single write-ahead log is a really simple way to keep things consistent in a single transaction as well. It's possible to do it in other ways, with multiple write-ahead logs.
F
But the dispatchers are working per-PG, right? The dispatchers don't see all the future — every dispatcher just sees one PG. Is that correct?
A
Right now, maybe not. We know that in the past the kv-sync thread has been a source of contention — originally in BlueStore it was very slow; it was the primary bottleneck we had. We've improved it dramatically: now we can do about 70,000 IOPS in a single OSD before the kv-sync thread becomes a major problem, and we're not there yet for a real cluster.
F
Adam,
I
think
we
could
remove
the
code
during
the
pg
logic
stage.
I
mean
just
don't
send
anything
to
the
pity
log
so
that
we
could
that
way.
We
could
see
how
expensive
is
the
pg
log.
F
So we could measure the performance impact. Of course, we're going to lose recoverability, because if we don't write it, or if we wait, then the semantics of recovery are not going to work — but we don't care about that for now; we just want to see how expensive it is. Look at the gain.
A
Yeah — the goal is to give us the upper bound on what we could do if we also improved the in-memory processing of the PG log, right? Like, if we could make it much, much faster, then that's the upper bound of what we'd get.
A
Well, yeah — and the question, though, right, is: those are the numbers that we saw with the code at the time this testing was done. The next question, though, is okay: every single bottleneck that you remove, everything you make faster — sometimes it could be that we'd see better results with this if there wasn't something else in the code that was slow, right? Yes, but it could also...
E
I was asking: is it worth re-running this kind of experiment again with today's code?
A
Could
be
been
a
little
while,
since
we've
looked
at
this
stuff,
I
wonder,
though,.
F
That's just Tuesday, so that gives you some lead time to do things — and, you know what, even researching RocksDB itself is very interesting. So I don't know; I'm spending time there. Worst-case scenario, I'm just going to learn things, and maybe find a better place to use them, or other places to use them.
A
Is all of this kind of pointless, right, because Crimson is the real future for making all of this kind of thing good? This research helps us, but what I want to avoid is for us to end up doing a bunch of work studying a problem that we may never really get around to fixing. So if we're serious about really changing RocksDB — you know, really changing some of this stuff — then maybe we should do it, but it's a lot of work, I think you should know.
G
So I think it's worth evaluating how long the changes you talk about are going to take.
I
There's a class you can modify — there is a plugin system for memtables; you can insert your own. Okay — but from what I remember, you would need to actually include a lot of other stuff, because there are a lot of dependencies there. But the top level of the memtable does have an interface.
F
The version is updated on the WAL — on the write-ahead log — so they can be trimmed implicitly and we could start again. So, again, I'm hoping that this thing is doable — in theory it sounds like it could be done — but I'm really not that familiar, sorry, not familiar at all, with the RocksDB code. The theory makes sense to me, though.
E
Yeah, I think the theory makes sense, and I think, as an aside, it's worth evaluating how much effort this is and how much benefit it could provide. Some of the other things we talked about with the metadata are also pretty interesting, because they would apply both to BlueStore and to Crimson.
G
There's one thing I wanted to mention before we wrap things up, regarding papers we might discuss next: there's a systems and storage conference happening in two weeks. It was actually supposed to happen in Israel, run by IBM Research, but it's happening virtually.
G
I'm going to link it here. Registration is free, and there should be some pretty interesting talks there. So — okay, can you send it by email?
G
Yes, I'll send it — I'll send it here, and I'll also send it by email for anyone; I'll send it to you, Gabi, and the rest of the team, but it's also here in the chat if anyone's interested. Personally, I know a lot of the people organizing this event, and I looked at some of the accepted papers, so it should be pretty interesting. If you guys are interested, it's happening on the 13th through the 14th of October.
A
And then maybe, if people attend any presentations there that are particularly interesting, we can add them as discussion topics.
G
For a future meeting — yes. I plan to attend at least one of the days, so yeah. But I figured, if anyone else is interested, it seems like it's at pretty convenient times, both for people from the US and for Europe — the US, even China. So, yeah.
I
But as a last note: I thought about what Gabi was saying about skip lists. I think we do have some prefixes that we never actually iterate through — the deferred writes, maybe. We could really drop the skip-list memtable for those and just go for some hash table. I never thought about it before, but that seems feasible.
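A minimal sketch of that idea with the stock RocksDB API: hash-based memtable representations already exist and just need a prefix extractor, trading ordered iteration for cheaper inserts and point lookups. The 8-byte prefix length is an arbitrary placeholder, not a Ceph key format:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/memtablerep.h>
#include <rocksdb/slice_transform.h>

// Sketch: replace the default skip-list memtable with a hash-bucketed one
// for data that is only ever point-looked-up, never range-scanned.
rocksdb::Options hash_memtable_options() {
  rocksdb::Options opts;
  // Non-skiplist memtable reps do not support concurrent memtable writes.
  opts.allow_concurrent_memtable_write = false;
  opts.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
  opts.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory());
  // NewHashLinkListRepFactory() is the fully hash-bucketed alternative.
  return opts;
}
```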
A
Yep — key comparisons to maintain ordering are one of the highest things I see in our wall-clock profiles around the kv-sync thread and the write-ahead log, so that could actually be a big win for us.