From YouTube: Ceph Developer Monthly 2020-10-07
A
Should we move on to another topic and, you know, get back to the workshop once our internet and things start working?
B
Yeah, I think a good idea is Scott Peterson on the second topic; let's see in the list here, I...
B
Maybe we can start with the third topic then, more efficient tracing. Gabi, do you want to take it away?
B
D
I'm first trying to explain why we even need this tracing. The tracing is trying to fulfill something that we don't have in our system. If you look at a simple system, like a standalone application, you can use a debugger. If you want to find out why something is not working as it should, you use a debugger like gdb, and the debugger instruments the operation: you get step-by-step execution, and then, online, you can decide if you want to skip a function.
D
If you want to step in, if you want to skip multiple iterations and so on; you can monitor and modify values in the program variables, and again, online, you can decide which parameters you are interested in, and you can also collect a predefined variable. And the problem with this mode is that it requires user intervention.
D
The debugger is blocked waiting on user input.
D
So it's very nice if you debug a simple user application with a single flow of execution. But when you go to debug high-performance systems, which can do tens of thousands to millions of IOs per second, there's no way a debugger can support that kind of flood; operations time out while it waits on the user, because in gdb you always have to decide what you'll do next. You have to watch something, you have to see it, and then, okay, you step to the next step.
D
But when we do things as humans, it takes seconds to operate, while a machine doing millions of IOPS has microsecond response times, which of course we cannot compete with. So the system would slow down, distributed protocol behavior will be broken because the flow-of-execution timing is different now, race conditions and timing bugs will be hidden, and in many cases operations time out, especially network operations.
D
So the way to address this is to use online tracing. Online tracing allows the system to progress without waiting for user input; the system collects and keeps changes in variables, and the user can review them postmortem. You just collect everything: execution can be replayed later from the collected data, and then the user can analyze the data and find what went wrong.
F
D
D
G
D
E
D
D
Allocating this buffer, copying the data over there, staging it here, working here, all these things: you need something to combine them, and the streams themselves don't have that, so there are some workarounds for this. If you want to control the amount of tracing, you need to build the system again; you need to change it and recompile. So people added these things they call trace levels: trace level 1 up to 20, and with every level you show more event traces. That's what you do at development time.
D
But in reality, when you have something not working, it might be something you put on priority 20, and if you turn on 20, it means the whole system is going to be flooded with information. So again, it's not flexible enough.
D
D
D
So I'm trying to build a set of requirements for my online tracing system. Requirement number one: it must be simple to operate; if it's not simple, people don't use it. It must be lock-free, otherwise it won't scale. If we're talking about machines now, you could have 50 or even 100 cores running on the same machine; if all of them had to take locks each time you want to trace, to put in an event trace, it just won't scale.
D
An online tracing system should not impact the normal execution. Usually the simplest kind of event tracing is something you do at debug time: at compile time you put in something that is a macro, and by compiling it out you remove all of them, so there's zero impact on performance.
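A minimal sketch of that compile-time approach (not the actual Ceph macros; the TRACE_ENABLED switch and the macro name are invented for illustration):

```cpp
#include <cstdio>

// Hypothetical build-time switch: define TRACE_ENABLED to keep trace points;
// leave it undefined for a release build, where they vanish entirely.
#ifdef TRACE_ENABLED
  #define TRACE(fmt, ...) std::fprintf(stderr, fmt "\n", ##__VA_ARGS__)
#else
  #define TRACE(fmt, ...) ((void)0)  // compiled out: zero runtime cost
#endif

int handle_io(int op) {
  TRACE("handle_io op=%d", op);  // disappears from release binaries
  return op;
}
```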
D
D
Data collection should cost nothing when you're not running; when you do run, data should be collected as an efficient binary trace. But even though you collect it in binary, it should still be displayed in a human-readable format: binary collection for efficiency, human readability so it's useful for us. And we need online filtering of event collection, which is something we don't have in the previous model.
D
I would like to be able to filter by event type. For example, let's say I find there's a problem in my scrubber code. So I say: this time when you run, even though scrubber is priority 20, I'm now trying to debug a scrubber problem, so please collect only scrubber events, nothing else; or maybe scrubber events and pg log events; or scrubber events, pg log and...
E
D
BlueStore. So you can decide which events you want to collect, and you should have a lot of events, so the granularity of filtering is good. The systems I've been using used two bytes for event types, so you could define, in theory, 16k event types; in reality I think people use something like 600.
D
For example, I need to be able to filter by pool id, pg id, onode id, device id, lba, block id, you name it. Or I could say: show me everything which has them; or show me only pool number five, only pg number 17, only onode id x, and so on. So you should be able to control how much you collect; that's limiting by filtering at collection time.
D
D
D
Again: there's filtering when you collect the trace, because you want to minimize the trace size; but then there's filtering when you view the trace. You try to look for something and say: okay, now please show me just events related to SCSI; or I just want to see events on the fibre channel, or on the network, show me tcp events, whatever. And of course the language shouldn't be complicated.
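A small sketch of what such view-time filtering could look like, assuming decoded events carry a few indexed fields (the Event struct and its field names are invented for illustration):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative decoded trace event with a few searchable fields.
struct Event {
  uint16_t type;     // e.g. scrubber, pg log, tcp...
  int64_t  pool_id;
  int64_t  pg_id;
};

// View-time filter: keep only events matching a caller-supplied predicate,
// e.g. "only pool 5, pg 17", without re-collecting the trace.
std::vector<Event> view(const std::vector<Event>& trace,
                        const std::function<bool(const Event&)>& pred) {
  std::vector<Event> out;
  for (const auto& e : trace)
    if (pred(e)) out.push_back(e);
  return out;
}

// Usage:
//   auto v = view(trace,
//                 [](const Event& e) { return e.pool_id == 5 && e.pg_id == 17; });
```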
D
So, trying not to elaborate too much: it must be simple. It should be a simple process to add a new tracing point, similar to adding a printf; if it's more complicated than that, people are just not going to add event traces. I mean, you could force its use in some very critical locations, but you want people to be generous with event tracing, and remember, because we can filter events at runtime, it's good to have more of them.
D
So it should be easy to do this, and I'm actually jumping to the last item here; one thing that makes it easy is that it should be globally available, without parameters. In our system, every time you want to put in an event and you don't have it in the object, you need to pass the cct object down the line. I've done it in one file and I had to change...
D
D
D
For example, let's say you're debugging the osd. You could say: okay, since I'm debugging the osd, I don't care about everything else; let's start with the default osd events. You could create an event set which most osd developers need, and for this event set you could then later say: okay, remove this one, maybe also add that one, and so on and so forth. You should be able to add multiple event filters in a single command.
D
You don't want to start issuing one command after another, and of course, if you could have a gui for this, that's the best. It should be easy to follow and search: the collected trace should be human-readable, with a strong logical query language. Again, you don't want people to need crazy scripting to be able to search a trace; the trace should be easy.
D
Lock-free operation: this is actually well understood, and you can see it in other tools like Jaeger and others. What we do is that every cpu core runs an independent stream of events, completely lock-free. The events are collected into a private buffer that can be written by only a single core, so you don't need to lock it.
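A toy sketch of the per-core idea, one writer per buffer so no lock is needed (the Record layout and sizes are invented for this sketch):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// One trace record; layout invented for this sketch.
struct Record {
  uint64_t ts;       // timestamp, used later for cross-stream merging
  uint16_t type;     // event type
  uint16_t subtype;  // event subtype
  uint32_t payload;
};

// Per-core ring buffer: written by exactly one core, so writes need no lock.
struct CoreStream {
  static constexpr size_t N = 1 << 16;
  Record ring[N];
  std::atomic<uint64_t> head{0};  // atomic only so a reader can snapshot it

  void emit(uint64_t ts, uint16_t type, uint16_t sub, uint32_t data) {
    uint64_t h = head.load(std::memory_order_relaxed);
    ring[h % N] = Record{ts, type, sub, data};  // cyclic: old events overwritten
    head.store(h + 1, std::memory_order_release);
  }
};
```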
D
D
D
Now, the timestamps are added to be able to merge the events and get ordering between two independent streams, but we also get something else from them; it's not something we waste, because timestamps are useful by themselves. You can judge how long an operation took. It's not going to be a performance-quality timestamp, but if you see that some step is taking a crazy amount of time, then you should be suspicious.
D
So the streams are merged in post-processing based on timestamps. You work on multiple streams, one stream per core, and you merge based on timestamps: you have multiple pointers, and when you advance, you are advancing all of them in parallel.
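That post-processing step is essentially a k-way merge keyed on timestamp; a sketch, reusing the hypothetical Record from the previous sketch:

```cpp
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Merge per-core streams into one time-ordered stream: keep one cursor per
// stream and always advance the cursor with the smallest timestamp.
std::vector<Record> merge(const std::vector<std::vector<Record>>& streams) {
  using Cur = std::pair<uint64_t, std::pair<size_t, size_t>>;  // ts, (stream, idx)
  std::priority_queue<Cur, std::vector<Cur>, std::greater<>> heap;
  for (size_t s = 0; s < streams.size(); ++s)
    if (!streams[s].empty()) heap.push({streams[s][0].ts, {s, 0}});

  std::vector<Record> out;
  while (!heap.empty()) {
    auto [ts, pos] = heap.top();
    heap.pop();
    (void)ts;  // ordering key only
    auto [s, i] = pos;
    out.push_back(streams[s][i]);
    if (i + 1 < streams[s].size())
      heap.push({streams[s][i + 1].ts, {s, i + 1}});
  }
  return out;
}
```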
D
D
D
That way, the compiler will optimize the code and it won't affect production code, because everything is outside: there's a branch, predicted false, which the cpu can execute in parallel, and if this thing happens, then it jumps outside to the tracing code, but the normal code path is not affected by this. So the code would look like this: if the trace level is bigger than some level, then run the tracing code.
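A sketch of that runtime check, with the branch hinted as unlikely so the hot path is barely touched (trace_level, emit_trace and the macro name are placeholders, not real Ceph symbols):

```cpp
#include <atomic>
#include <cstdio>

std::atomic<int> trace_level{0};  // changed online by the operator

// Out-of-line slow path: only reached when tracing is switched on.
void emit_trace(int level, const char* what) {
  std::fprintf(stderr, "[trace %d] %s\n", level, what);
}

// Hot-path check hinted as unlikely: production code pays one predicted-
// not-taken branch and nothing else.
#define TRACEPOINT(lvl, what)                                          \
  do {                                                                 \
    if (__builtin_expect(                                              \
            trace_level.load(std::memory_order_relaxed) >= (lvl), 0))  \
      emit_trace((lvl), (what));                                       \
  } while (0)
```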
D
I'm going to make a short stop here, because I want to show what I'm actually aiming for. So I'm going to jump to the end.
D
So that's actually what I'm looking for. When people think of event tracing... I think the event tracing that we've got in Ceph, and also in many other systems, is not really event tracing; it's a combination of a few things. In some cases it's just a system logger, so you can see a system starting up...
D
...nodes join, oh, somebody disconnected, a network link is being brought up, the link is down, these kinds of events. They are just logging things; it's not something suitable for high performance. My example is Wireshark. Is everybody here familiar with Wireshark?
D
It captures packets going over the network, and then later it can display them and help you find some problem with your protocol. Similar products have existed for many years.
D
D
Wireshark has lots of filters. When you run Wireshark, you can define what you want to collect. You can tell it: okay, I want you to collect ipv6 only, coming only from this internet address and going to this ip address; or maybe I can say I only want to see rdma over tcp. So you can define what protocol you want, you can define what peers, and the deeper you get inside...
D
...you have more and more filters. You can say: okay, if it's iscsi, I'm only interested in seeing read requests; and maybe only read requests coming to this lba.
D
I didn't use Wireshark for this, but I'm assuming it can do that; the tools I used were able to do that. So you have a crazy amount of information, and you need to be able to filter.
D
D
But still, after the data was collected in binary format, you can display it in a human-readable format. There's actually a very nice, very user-friendly gui to see everything, and you can get inside and you can search: show me packets coming from this, coming from here; show me packets with this kind of problem; show me packets with this kind of header; and so on and so forth. And of course you can view and search and everything.
D
So that's the example, or that's the model, I'm trying to emulate here. I'm not trying to compete with the system logger, and I'm not trying to emulate Jaeger, which is another tool to collect system state. With Jaeger and friends you can see that everything is running and that the network state is stable or not stable; you can see that a device is up or down, an osd up or down; but you're not going to get the granularity needed to see every single io.
H
Hang on: Jaeger is a structured event tracer, it's not a logger.
D
D
D
Maybe the wrong name; it's just a name I'm giving this, I'm just...
H
D
Sorry, I don't get you; maybe the sound here isn't... I'm trying to increase the sound.
H
D
D
Yeah, but I'm more concerned about a debugging utility here. The event tracer I'm talking about is something to debug your code. It's not something to show the customer, or to take a snapshot of the system and say: oh, I can see something that looks suspicious here. The event tracer I'm talking about is a debugging tool for developers.
D
B
D
D
D
It's simple, it's easy, and it doesn't take a long time to write. You don't need to be some amazing engineer to be able to do it; everybody should be able to do it on day one, there's zero training required here. Then there's the optimized mode, which I've seen used in the past. I don't know if people are still using it, because it's extremely efficient, but it's not so comfortable to use.
D
So what happens is this: you define an event and a sub-event. Sorry, there's an event type; for example, the event type is going to be a pg log event and the subtype will be trim, and then there's a central location where you have a json-like file, and you can say: for event type pg log with subtype trim, you're going to see this; and here you type what the parameters are going to be, what the names are going to be, and so on and so forth.
D
So it's very efficient, but it's not so comfortable. Later I'll try to show some examples, but the idea is you write some text description for this, indexed by event and sub-event, and you give a list of parameters; for each one you say what the type is, so it knows how much it has to read, and what the logical name is, so later you'll be able to search. So, for example, you could say: in this event I'm...
B
D
...going to have the osd number, which is 4 bytes, and the name for search would be osd. Then I'm going to have the onode id, which is another four bytes, and you would search by onode. Then I'm going to have the timestamp, which is eight bytes, and you would call it timestamp, and so on and so forth.
D
So by creating this file, in post-processing, when they reach this event, they know exactly how to display it, without paying any overhead: they just dump the data, which is prefixed by the event type and subtype, and then they jump to this table. So it's extremely efficient, but it's not crazy...
D
...uncomfortable. I've been working like this for many years, but I can say that in many cases you would see people being lazy and not adding events, because every time they want to add an event, they need to open this crazy json document and add the appropriate line. And then you also need to develop some utilities to make sure that everything matches: if you're passing five parameters, the json file has five parameters; and you also try to do some kind of sanity checking on them.
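To make the shape of that scheme concrete, here is a sketch of a binary record prefixed by event type and subtype, plus the kind of out-of-band descriptor such a json file would carry (all names and values are invented for illustration):

```cpp
#include <cstdint>

// Every record starts with type/subtype, so the post-processor can look up
// how to decode the rest from the central descriptor file.
struct TraceHeader {
  uint16_t type;     // e.g. PG_LOG
  uint16_t subtype;  // e.g. TRIM
};

// What one entry of the json-like descriptor file expresses, in struct form:
// for (type, subtype), the parameter names and byte widths, in order.
struct FieldDesc { const char* name; uint8_t bytes; };
struct EventDesc {
  uint16_t type, subtype;
  const char* text;  // human-readable template used at display time
  FieldDesc fields[3];
};

static const EventDesc kPgLogTrim = {
  1, 2, "pg log trim osd=%u onode=%u ts=%llu",
  {{"osd", 4}, {"onode", 4}, {"timestamp", 8}},
};
```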
D
D
So there's marshalling and de-marshalling, but it has to be efficient; later, in post-processing, they will come and see that this is an object of this type, and then they will call this function, which could be just an ostream operator or something like that.
D
D
So, the ways to make it efficient. The simplest one is that every time you trace an event, you also trace its printed format, which is inefficient, because you're going to repeat the same thing again and again. A solution I've seen is that you keep the instruction pointer and the file, so you can say: in this object...
D
H
D
If the binary is changed, because maybe you changed one line and recompiled, it's not going to work. So it works, it's fine, but there are some problems with it. Something else I discussed, which I don't know if it was ever implemented, but I suggested it in the past at my previous job, is to keep a single translation table per file.
D
The first time, you do a hash search into the hash table; the next time you see it, you find that it already exists, so you just store the index. The index is probably going to be 2 bytes, which should suffice; I don't expect to see more than 64k event traces per type. Remember, it should be indexed per type.
F
D
You create a table which can be stored in a separate file, because the event log is just moving forward, while this table is probably kept in memory; at the end you dump the whole table at once. It's not dumped together with anything else; it's something which actually should go into the header.
D
So you keep a translation table from the format to an index, and then you just store the index for every event. I mean, the event could appear in the trace thousands of times; actually, I've seen it hundreds of thousands of times, if the event is a very common one. Say the event fires every time you get an io request: you don't want the format stored again and again. You store it only once and you keep the index, and then in post-processing...
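A sketch of that interning idea: hash the format string on first sight, store a 2-byte index in the stream, and dump the table into the trace header at the end (container choices are for brevity, not performance):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Intern each format string: the first sighting does a hash lookup and
// assigns a 2-byte index; every later record stores just the index.
class FormatTable {
  std::unordered_map<std::string, uint16_t> index_;
  std::vector<std::string> formats_;  // dumped once into the trace header
public:
  uint16_t intern(const std::string& fmt) {
    auto [it, inserted] = index_.try_emplace(fmt, formats_.size());
    if (inserted) formats_.push_back(fmt);
    return it->second;  // 64k distinct formats per type is plenty
  }
  const std::vector<std::string>& header() const { return formats_; }
};
```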
D
D
So that's how you do the printf. That allows you to do printf-like, comfortable event tracing, but still be very efficient in the way you store things.
D
D
So say you decide that you're going to have 1k event types. It's crazy, it's a lot, 1k; I have never seen something go so big, but say you decided you want 1k event types. You define a bitmap, and every bit represents one event. You could start with an empty bitmap, or you could start with a default event map, or you could create an event set for every group or for some kind of io flow, like for BlueStore.
D
D
I want you to add events for pg log, for scrubbing, for tcp, and for BlueStore. You should use human-readable commands; don't let customers or users manipulate bits. And how this thing works: every event trace call starts with two parameters, event type and subtype.
D
Oh, okay, I didn't talk about that yet. Okay: every time we record an event, we put the event type and subtype. When you collect events, you always need to check; when you do online collection, the machine is always running. If event tracing is active, then you jump outside and you start to look: this was an event type like this. Then you check the bitmap: if the bit is set, you collect the event; if not, you skip it. And this thing you can do online, and you can change online.
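A sketch of that online check, one bit per event type, flippable at runtime (the 1k figure comes from his example; names are invented, and the unlocked hot-path read is acceptable here only because the flag merely gates tracing):

```cpp
#include <bitset>
#include <cstdint>
#include <mutex>

// One bit per event type; flipped online by operator commands such as
// "collect scrubber + pg log", with no rebuild or restart.
static std::bitset<1024> g_enabled;
static std::mutex g_enabled_mutex;  // only the rare control path locks

// Hot path: a single bit test. A racy read can at worst drop or keep one
// extra event around the moment the operator flips the bit.
inline bool should_collect(uint16_t type) {
  return g_enabled.test(type);
}

void enable_event(uint16_t type, bool on) {
  std::lock_guard<std::mutex> l(g_enabled_mutex);
  g_enabled.set(type, on);
}
```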
D
That's one of the biggest flexibilities of this model. When you do event tracing based on severity, from level 0 to level 20, and the event type you're looking for happens to be at level 20, by opening that you open everything and you get a flood. Here, you can just decide which event types are interesting for you; and it might be, as I said, scrubber, for most people the least interesting event, but if someone is debugging that code, for him it's the most interesting thing.
E
D
D
And another thing; actually, I should probably have used another slide for the last two bullets. You should also allow collecting by subsystem. For example, when you're running Ceph, you could say: okay, collect osd events, and then I'm going to tell you what else, but I just need the osd, or just BlueStore, or just RocksDB, or whatever. And then you can also limit event collection based on objects: just collect events if they belong to this pg; I don't care about other pgs, I just want this one.
D
You also need a way to view and search the trace. You want to view the trace forward and backwards, just like you'd search a file: if you're reading a page, you should be able to go up and down, you should be able to trace in reverse and forward.
D
The renderer should have a current location, so you can always say: okay, I've seen up to here, now show me the next page. It's a paging concept: you can see one page after another, and it always knows where it is. You can say: now scroll up, scroll down from the current location. You don't need to repeat everything each time, like I've seen people do: you grep for something, and then you say, from this point you filter, and then you do it again; it's complicated.
D
It should just remember where you are, and then you can say: search for the next event. You can give it the current location; usually there are three location macros, head, tail and current, so you can search from the beginning until you see something, search backwards from the end until you meet something, or, from here, search forward or backward until you see something. Then, when you view the events...
D
...you can do second-stage filtering. When you collected, you filtered; but when you view it, you say: okay, we collected a lot of information, because I didn't set up the event traces, somebody else did, but I'm only interested in this event, so just show me those; I don't want to see everything else. So you can tell it: show me only this, or show me everything except this, or show me the next object...
D
...the next appearance of an object. I'm looking now for an onode: show me the next time I can see the onode; or show me the next time the onode has this value; or show me the next time the onode id is this one, or is within this range. Or not the next time, but: I want to see only onodes between x and y. And of course, you can do combinations with logical operators.
D
D
How do you do this, how can you do the filtering when you actually implement it? The first thing: every trace event starts with event type and subtype. The event code always dumps them in binary format as the first thing in every event record. It could be two bytes for the event type and two bytes for the subtype, or maybe one for each; it depends how many events you want to create.
D
D
So it's location-based: the way for the event trace renderer to understand what it's looking at is based on position; the first bytes are always going to be event type and subtype. And you can use the same concept with more information. Let's say that in the osd, whenever we collect an event, we need some common parameters: we always need the pool id, the onode id, and so on and so forth.
D
So you can create a macro or function, call it the osd tracer, and that macro is going to ask you to insert pool id, onode id, and so on and so forth, and it will pack all this information and dump it in a single well-known location. So they always know that the pool id is the next four bytes, and the onode id is the four bytes afterwards.
D
So that's similar to what I explained before, where you create a json format to describe it, but here it's actually useful; and I've actually seen this used more commonly than the previous one, because you can create a family of event tracers, and this family could be defined at the beginning of every project.
D
The project manager or project lead defines the common macros for the project, and they're going to have like three or four json lines explaining how the event is going to look. Usually the beginning is going to look like this: event type and subtype, of course; then there are going to be pool id, onode id, block id, and so on; and after this there's going to be something else, which is, like before, a printf format which you can render at view time.
D
So usually what people do is description-based: for every parameter you want to be searchable, you start by giving it a name, say lba, and then you give the value; you attach the name before every such parameter.
A
C
C
D
So that's an example of how you could use it. You do an event trace, and of course you start by saying, if that's a pg log event, you're always going to start with the event type and the subtype; I'm going to use the formatting here, and it's going to be: this event is a trimming log. And then I'm adding some parameters. Now, the parameters: this is not printf, it looks like printf, but of course it's not printf. Every parameter can be a tuple; it doesn't have to be, but it can be. So you do it like this...
D
I'm going to give you a value, and that's going to be the name. So if you're going to search, you're going to search for onode, and that's the value, and the value is a four-byte unsigned integer; then the second one is going to be pool.
D
Actually, no, because they need to know that those are going to be here, so you don't need to describe those parameters. But okay, the last one, you say it's a character string, and here you don't give it a name; it's not interesting to search. Maybe it's some information you want, but you're not going to search by it; you keep this kind of information only for display, not for something you're going to search by.
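A sketch of a printf-looking call where each searchable parameter is a (name, value) tuple and unnamed values are display-only (this API is entirely hypothetical):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One parameter tuple: a search name (empty => display-only) plus the value.
struct Param {
  std::string name;  // "" means keep for display, never index for search
  std::string repr;  // rendered value; a real tracer would store raw bytes
};

struct TraceEvent {
  uint16_t type, subtype;  // always first, as in the binary record
  std::string text;        // e.g. "trimming log"
  std::vector<Param> params;
};

// Looks like printf at the call site, but every argument is self-describing.
TraceEvent pg_log_trim(uint32_t onode, uint32_t pool, const std::string& note) {
  return TraceEvent{1, 2, "trimming log",
                    {{"onode", std::to_string(onode)},
                     {"pool", std::to_string(pool)},
                     {"", note}}};  // display-only, unnamed
}
```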
D
So, back to this: for every parameter you want, you can add a type, but there's going to be, of course, a limited number of types, because you don't want to type everything in the world; not everything is going to have a type, that would be crazy. And the way you do it: at the beginning of the event records there's going to be a header, and the header is going to have all the event types.
D
D
This is similar to the json thing I described before, but it's done automatically, on the fly, by the event tracing code. The json thing, when I was using it, was not very comfortable; people just don't like the idea of writing the code here and then jumping to another file and writing something. When you do it in this kind of code, it's much more intuitive and it's easy; it's not complicated, you just do it and it's good.
D
D
So if you enumerate all kinds of operations, and you've got remove, get, put, delete, whatever, the index is actually an index, but they're going to use the enumeration name. And other things are even more interesting: when it can see that you're using something of some type, it's going to create the type on the fly, so that later you'll be able to search by it. Say there's going to be a type like an onode-id type; then you wouldn't need to spell out this onode name everywhere.
Yeah,
I'm
here
the
network
is
slow,
so
I
can
hear
you
good
to
see
you.
I
Okay, one comment that I think might be a good idea to follow: we should strive to create the output, either when collecting or immediately on the spot in another process, in a formatter, in one of the formats for which there are already existing tools. You talked about Wireshark; there is, for example, KernelShark, and there is Trace Compass, for example.
I
I
Filtering, whatever; so it might be a good idea to emit a format that is known for tracing. That's the most important comment. And another one I would suggest, I think we mentioned it, is a feature that I had in the tracer I used at my previous job, which was temporary streams.
I
D
I'm familiar with that. Actually, something I forgot to put in my presentation is the event to stop the tracing. Usually tracing uses a cyclic buffer, so everything is going to overwrite everything else; but there is a showstopper.
D
You need to define: when something happens, I want you to stop the event trace, because otherwise you would collect forever. For example: when you see that onode, or when this sanity check is about to break, then stop the event trace.
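A sketch of such a stop trigger: writers append to the cyclic buffer until a sanity check fires and freezes it, preserving the events leading up to the failure (the flag and function names are invented):

```cpp
#include <atomic>

// Once frozen, the cyclic buffer stops overwriting itself, so the events
// that led up to the failure survive for postmortem analysis.
std::atomic<bool> g_trace_frozen{false};

inline bool trace_active() {
  return !g_trace_frozen.load(std::memory_order_acquire);
}

// Called from a sanity check, e.g. "this onode count went negative".
inline void trace_freeze_on(bool bad_condition) {
  if (bad_condition)
    g_trace_frozen.store(true, std::memory_order_release);
}
```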
I
J
H
Here's the goal, okay: let's say that we're collecting logs from a customer, and we have it set up so that the rate of generation is about, you know, 20 megabytes per minute or so. That's a lot, but it's not impractical at all to keep weeks of that, 20...
D
H
I understand that, Gabi, I'm just asking. Okay, so my question is: the idea is to write these to a file and roll the file, right? The idea isn't to roll a circular buffer in the middle, after the thread generates the event but before it's written to disk? Yes, it's in a circular buffer, but as long as the write-out process keeps up with the generation, it's not a problem. Is that the idea?
D
Yeah. So before we get there, just one critical thing to understand: when you trace into memory, performance is going to suck, but when you trace into a file, performance is going to be extremely slow. So usually, when you try to debug a distributed system, you aim to keep as much as possible inside your memory.
D
So that's why you do all this crazy filtering. When you put stuff in a file, yeah, sure, you could keep it... sorry, when I was talking about a cyclic buffer, I was talking about memory.
H
E
H
This is what the developer thought they needed, right. So the related question is: suppose the amount being collected is larger and we actually can't keep up with writing to disk.
H
H
Nearly as difficult, so they're...
H
D
H
H
B
H
So it would be... and these tend not to be high-volume events anyway, so I don't actually expect it to be a problem; but there's a difference between "it usually won't be dropped" and "it never will be". So I was wondering if you've given that any thought.
D
E
H
E
H
D
I'm sure people will come up with a lot more interesting ideas. I spoke last week with this guy, Tom, about his event tracer, and he said: yeah, I did it eight years ago, but since then I've done it for many other systems, and each time we do it, it becomes better. And he said the best one he knows now is the one the guys at Vast Data have, because, you know, they started with everything he'd got and then they added more. So yeah.
H
H
Some of the tricks that are present here are already present in the ooc log, and we do use it to capture all of the operations of a certain type pretty frequently. Now, it's incredibly non-granular, you have to use grep to search it, and it's a giant pain in the ass. It's also inefficient for all the reasons you've outlined, but we use it for these things anyway.
H
D
D
H
Yeah, I wanted to talk about that too. So, while a highly featureful log viewer would be great, I think the first step probably shouldn't be that; it should be a utility that just processes the binary trace, with a set of filters, into a text file. And my reasoning for that is that a log viewer is a great example of a place where you can generate kind of an unbounded set of features, and one way or another...
H
We will need a way to process a trace, with time and other event filters, into a text file, simply so that it can be compatible with other tooling. So I suggest we develop that part first. It's annoying, admittedly, that it won't have the features you outlined; those do seem good.
D
Well, actually, you can also develop something which takes the binary and then dumps it into text.
H
What I'm saying, what I mean, is that the utility that embeds those filters and time bounds, just a command-line utility that emits a text file, should be the first step, I think.
I
I
H
B
I think that could be a second step, though, right? It could be something understood by existing trace viewers, like trace...
B
H
B
H
So the other thing I wanted to ask about is the ceph context pointer that's passed around; there's a reason for that, it's not for funsies. Everywhere in the existing daemons there is a global ceph context and you don't need to pass it around. The exception would be any code that interacts with library code, because in those cases there could be multiple libraries within the same process with different config parameters.
H
Because there are utilities that interact with that code. In principle, though, if those were only ever called from the daemon, yeah, we could use the globals. It's a little complicated, but that's why; I'm not saying that it's not ergonomic, I just wanted to bring that up. Okay.
D
H
H
D
H
D
Okay, and the last thing I forgot to mention: I'm maybe pushing people to consider an existing tracing solution which does everything I explained, but there's one problem with it, actually a big problem: it's not supported. It's something which was made open source eight years ago and was never touched since. I spoke to the guy and he said he thinks it's like one or two weeks of work and it will be working again: fixing compatibility, libraries whose versions have changed, and so on.
D
No, no, this solution was developed for Infinidat. Infinidat is the third company made by the same guy, who made Symmetrix, and at every company he started...
D
...he was recreating the event tracer and refactoring what he knew. So the first one was made 30 years ago, which is probably the one I'm familiar with; the third iteration was done 10 years ago, and the guy who wrote it for him insisted on making it open source. So eight years ago it used to work, and it was their solution, and since then libraries changed, compiler versions changed, clang versions changed, and so on and so forth. So it's not that it never worked; it just...
D
It was not updated to build with the latest kernels and stuff like...
D
...this. By the way, he also mentioned he's going to speak with his friends at Vast Data, which apparently got the latest iteration of this design, and try to convince them to make their tracer open source. I don't know if he's going to succeed, but if that happened, then we would be getting the best of the...
D
A
Yes, yeah. I mean, just to sum it up: it sounds like we're kind of on board with your ideas, and the next step would be to just create a small prototype with some isolated piece of Ceph. Would you agree?
D
D
So I think his project was more advanced than the stuff I was describing, because it was just another refactoring of the stuff I'm familiar with, and I don't know exactly what is stopping it from functioning. I think it's just a matter of building it with the correct tools, but I don't know. Right, and...
D
D
E
H
B
I want to go back slightly to the idea of the Jaeger tracing as well. I think that's kind of a separate use case, and I'm not sure whether it would be fully captured by the same kind of mechanism, as the Jaeger use case involves some hierarchical...
D
D
I think the high-performance one is not going to synchronize; it's not built to synchronize with thousands of machines like that.
B
H
The way Jaeger works is it's just a bunch of different independent processes generating events; they're later aggregated based on a couple of well-known ids that are generated in a deterministic fashion. The generation part's not functionally different from what you're describing; the only difference is that you send them, with some degree of rapidity, to an aggregator.
H
But I understand if we don't want to do that immediately, or if we simply want to leave that under Jaeger; but I disagree that they're different use cases. The addition of the span id or a client id thing will be interesting in both; the specific span id may or may not be interesting in the on-disk log, but I suspect that it won't be detrimental, and that the rest of the event will be interesting.
H
In fact, I suspect that absolutely every trace event that would be useful for Jaeger would have an event right next to it for emitting to this log. This is why I am arguing that it would be ideal if they were not separate systems. I cannot think of a thing I would want to emit to Jaeger that I wouldn't also want in the disk trace.
B
Yeah, I agree it would probably be a strict subset, but it might be... I wouldn't say it's necessary for, like, the minimal start here; it might be a later step to incorporate the Jaeger aspect into this system as well.
H
B
H
I want the original source-level annotations to have this in mind; that's all.
H
H
B
Okay, anything else on this topic? Should we try going back to Varsha now? Do you think your screen sharing, your connection, is working?
F
F
F
Then the other way is with cephadm. Cephadm uses the ceph orch apply command, which deploys the ganesha daemons, but again, it does not provide an interface to manage the exports or the config. In both Rook and cephadm, the ganesha daemons are deployed with a minimal config, but they do not manipulate the exports; again, here...
The
ganesha
demons
are
deployed
with
a
minimal
minimal
conflict,
but
they
do
not
manipulate
the
exports
again
here.
F
...an export config object needs to be created. Then the other way is the dashboard. With the dashboard, the thing is, we need the nfs-ganesha cluster already deployed; it does allow us to manipulate the exports, but it also requires setting up a rados option, like the pool and namespace settings of the dashboard, before the exports can be created.
F
Next, let's look into the way we can create the nfs cluster: we can deploy the nfs cluster using the volumes plugin.
F
As you can see here, with the nfs volumes plugin we can create the cluster as well as manipulate the exports.
F
The cluster creation is very simple: it requires the cluster type and the cluster id to be specified. Let's look into the code for this. In other deployment ways you actually also need to create a pool before an nfs cluster can be deployed, but with the volumes plugin we create a pool for the user, and all the ganesha daemons share a single pool.
F
...dashboard. And for creating an export, we ask the user to specify the cluster id, the fs, the pseudo path and then the path; those are the minimum things to be specified for creating an export. But along with that, we also provide a way in which users can create their own export.
F
F
F
F
And in case the manager is restarted, we read all the exports from the rados pool with this particular method; but this does not happen every time. It only happens when the manager is restarted, or the first time an export is created; otherwise this method is not called. And along with export creation, we also create a user for every export.
F
F
So we create a user with the following caps, and this user is also removed when we delete an export, and it also gets removed when we delete the cluster. So basically, when we delete a cluster, all the exports get deleted, and along with that the pool objects are also deleted, but the pool itself is not deleted.
F
And we have the tests; the nfs tests need to be run with the rados suite. Basically, we are testing with teuthology against cephadm.
F
Getting it started: we can deploy an nfs-ganesha cluster in two ways, you know; one is with cephadm, and the other is without cephadm, using the test orchestrator.
F
F
F
From here, if cephadm is...
F
F
F
F
F
F
F
F
F
One is the integration of the dashboard with the volumes module. Currently, the exports created by the dashboard cannot be detected by the volumes module, and vice versa. Similarly, there are a couple of things which are different in the dashboard and different in the volumes module, but in the future we want the dashboard to use the volumes module to create the exports. Both of them create exports in almost the same, similar way, but the options provided are a little different.
F
We have covered the discussion in this particular tracker ticket; have a look. And second is compatibility with Rook. Currently the Rook module itself is in a very bad shape; most of the commands are not working, and that needs to be fixed. Also, the volumes module is compatible with cephadm; we also want it to be compatible with Rook, and there are no tests for Rook in teuthology, so we're also looking into adding tests in teuthology for Rook.
B
F
F
So that needs to be done, and there are a couple of other things that need to be figured out, how we want to do them; but we want to have an option in teuthology so that you can test either with Rook or with cephadm, and most of the other tests should be like a generic template which will go with either of the orchestrator backends. Yeah, that's it.
F
G
B
J
Okay, so my name is Scott Peterson; I work for Intel. A couple of you know me; for those of you that don't, I've previously worked on a persistent-memory-based HA write-back cache for rbd, which took like two years longer than anyone expected it to, and it's still not quite done.
J
The latest thing we're doing is this thing we call adaptive distributed nvme fabrics namespaces, and that's a mouthful of a name because that's how naming works in the nvme universe; and if you call it adaptive distributed namespaces, then Word keeps autocorrecting that acronym, and so we had to add another word in there. So, this crowd understands the basic goal here.
J
I just presented this at Storage Developer Conference. We all get, in this universe, that this picture on the right is how storage works in the cloud; the picture on the left is how nvme over fabrics is typically used. Basically, it's a really fast patch panel: you can connect a drive to a host, but that's not how people want to use their storage.
J
So our basic problem statement here is: well, nvme fabrics has a lot of features that people love, except for this patch-panel, you know, point-to-point-type connectivity thing. And so we asked ourselves: what would we have to do to nvme to be able to connect to anything? And it turns out all it takes is we add this yellow box here. I don't know if you can see my pointer; you probably can't see my pointer.
J
J
Okay, so the idea is that we take this point-to-point protocol, and in all the entities on the fabric we add one component called a redirector. Its job is to examine every io, look at the start lba for it, and, instead of sending it to the target, choose one of the targets based on a table of hints that it has accumulated from these places.
J
The basic idea: what this diagram is doing is walking you through the sequence of two ios. For the first one, the blue one, this host sends it to what we're calling here the wrong place. We're saying that the first io, number one, should have gone over here to storage node two; the host doesn't know that yet, so it sends the io to this storage node. That storage node knows that it should be over here, so it forwards it.
J
It also tells the host: hey, next time, send it there; and then subsequently the host does that. This is the basic idea: a host can learn from making mistakes. In practice, as we'll see towards the end here, it turns out that, the way this system actually worked out, they tend to learn it all when they connect; and this learning in response to ios, although we still like the idea, turns out to be harder than you'd think in the environments that we want to run this thing in.
J
So why are we talking about this for Ceph? Well, because... I'm kind of blasting through this; let's go through this brand-new slide here, which is just to summarize our whole idea. Instead of connecting hosts to one thing, we're now going to connect them to many things. If you're making a distributed storage system across many nodes, then there are going to be nvme fabrics targets in all of those nodes, and your hosts are going to connect to all of them.
J
You will then, as the architect of that storage system, arrange, as it says here, to basically let your storage system complete any of these ios no matter where they arrive, and then you will try to make that not happen. That's the basic idea: you can handle anything, and then you try to get your hosts to be smart enough to send it to the right place, so that you can avoid that second fabric hop associated with the gateway.
J
I don't know how much time we actually have here, so I'm trying to do this fast and also skip over the things that Ceph people already understand. So if there's anything unclear about this, go ahead, raise your hand, tell me to go back to the previous slide; that's fine, I'm expecting it. None of this is new to most of you, except these terms here.
J
J
Those things use this concept of location hints, where the redirectors next to storage are informed by some cluster manager; in Ceph's case it's actually a lot simpler. Things are enabled to communicate that to hosts in a standard or abstract way, so the hosts don't have to have any specific knowledge of the backend.
J
They just look at those hints and they do what they say. And then there's this term called the distributed volume manager, which is that boundary for all that closely coupled stuff in a storage cluster. This is the details of a redirector, but I think we've covered most of this; the distributed volume manager is the thing, basically, that decides where things are placed and how it will communicate that to hosts.
J
We expect that there could be many types of location hints, but obviously, in order for these things to communicate, you have to establish some that they all understand, and so we're working to come up with an initial set that probably works for most cases. The idea is that this will be extensible: a storage system may send a hint, an old client may not understand that hint, and then it will just ignore it; but we hope to, you know...
J
...jump-start this with a set of a few that probably work in most cases. They basically fall into the categories of simple, which just says this range goes here, and a group you could call algorithmic, which is essentially a function: you say this range of lbas has this function applied to it to select the target. The simplest of those is striping.
J
We all know how striping works, so a striping function hint would say: here's a set of targets, here's an lba range, here are the parameters for the stripe function; run this function on the start lba, look up the target, send it there. That's how striping would work. And the hash hint is what you use for Ceph, obviously. Now, our original idea was that this would work with other storage systems that use consistent hashing as their basis for placement.
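A sketch of what such an algorithmic striping hint might carry: parameters plus a pure function from start lba to target (all field names here are invented; the actual hint encoding lives in an nvme log page):

```cpp
#include <cstdint>
#include <vector>

// An algorithmic "striping" location hint: an lba range, stripe parameters,
// and the ordered target list the function indexes into.
struct StripeHint {
  uint64_t start_lba, end_lba;  // the lba range this hint covers
  uint64_t stripe_blocks;       // stripe unit, in blocks
  std::vector<int> targets;     // handles for the nvme subsystems (targets)
};

// Run the function on the io's start lba to pick the target.
int pick_target(const StripeHint& h, uint64_t lba) {
  uint64_t stripe = (lba - h.start_lba) / h.stripe_blocks;
  return h.targets[stripe % h.targets.size()];
}
```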
J
Chief among them was Gluster. I had never used Gluster, but I did spend a lot of time working in Ceph, as some of you know, so it took me a while to get back around to Gluster and discover that, while it uses a hash function for placing files, when it does block storage your whole block device is one file, so it doesn't actually spread it over bricks.
J
Maybe somebody's got a patch for Gluster that lets it do that, but until Gluster shards its block devices over bricks, this hashing isn't going to be much help there. But we did make it work with Ceph. So, I thought I would be able to get through this much faster; I'm sorry I have to drag you through basically everything I said at SDC, but some of these concepts I kind of have to explain.
J
An adn host sees any storage system as a dvm, and they all have a common set of behaviors that lets these hosts be as dumb as we can get away with. The basic idea is that everything complicated is the dvm's problem, which sounds like we're creating a lot of work; but when you consider that lots of distributed storage systems already exist and they already work, they've already solved all of these problems.
J
What we're really doing is just naming the important things that they all have to do and deciding how we're going to communicate that to hosts. Now, in the nvme fabrics world, there is this thing, this box you can barely read, sorry, wrong screen, this box right here with the mouse wiggling over it, labeled discovery service. This is how an nvme fabrics host, in theory, can be told the address of a discovery service, and then it can ask that discovery service...
J
...what subsystems exist on the fabric; subsystem is what nvme fabrics calls targets. It tells it the addresses and things, and if they have the right credentials, then they'll be able to connect to those things. So the idea is that an adn dvm will have one of those inside it. These hosts will be told about that discovery service, the discovery service will reveal all the nodes in that storage system, and they will then commence connecting to all of them. Once they connect to them...
J
So, in this little example here, you can see we chose to say that this hypothetical dvm has four nodes, because that's an interesting number: it's greater than two. A two-node storage system you think of as having just HA, or active-passive, and we want to highlight that this is intended for slightly more complicated things. So see all these yellow boxes here: this represents a volume provisioned from these ssds on all four nodes; it's in some way a little bit more complicated than just mirroring.
J
So let's consider that. Let's say this host is supposed to use this yellow volume here. This distributed volume manager knows this host's identity, so it will let it connect in the first place; when it connects, it will tell it about this yellow volume on all of these targets, so this host will see it everywhere. Now, this host has a redirector in it, so when it sees these yellow volumes, it notices that they are adn-enabled; basically, they have redirector features.
J
Then it basically lumps them all together as targets for its internal redirector, and at that point this volume is available to it and it can commence sending io to it. Now, it doesn't know where the extents are yet, theoretically. So what should happen next is these targets should all start informing it: the owner of this extent should say, this extent is here. In fact, it can happen any way at all.
J
All of these targets could tell it everything; it turns out that doesn't really take much data, so that's probably what will usually happen. And that even helps when, for instance, say there's some temporary fabric problem and this host can't connect to this server, but it can connect to the other three: they'll all tell it that it should send io for this extent here, but it doesn't have a path there, so it will send it somewhere else.
J
Remember that a dvm is required to complete any io that arrives anywhere inside it, so whichever one of these other guys gets the io for this extent has to complete it by forwarding. If you're building one of these things from scratch, that's one of the things you have to provide as a dvm architect. The beauty of building a distributed volume manager adapter for Ceph, obviously, is that it's a lot simpler, because inside, these things are all just sending it to rbd, and that already handles this.
J
So I want to quickly get to the point where we're talking about the Ceph-specific versions of this, so rather than go into all those details any more... yeah, okay, I think I just said all these things. I forgot this, actually; I probably should have put this slide at the end, I'm sorry about that. This basically highlights that if you were going to build an add-on package for Ceph that provided adn connectivity, it would need to do these things; the bottom line is it's just a daemon that runs everywhere.
J
J
We also see a couple of other advantages. First of all, we have constructed an adn reference implementation in spdk, and it has all the components you need to connect it to rbd images. It's obviously not production-ready; it's just a data path and the minimal configuration tools.
J
You need to set it up for one image, and I can show you in a minute that we did that and we proved that it worked. We also showed that an adn host uses fewer cpu cycles than that same host running, at least, kernel rbd. For various reasons, when I did this poc I stacked the adn stuff on krbd rather than the rbd bdev; long story short, I needed to show multiple things with one experiment, and that's how it had to work.
J
If you were going to build this for real, you would use the rbd bdev that spdk provides. At the bottom of this slide you can see this url, which is an rfc patch in the spdk Gerrit, and the one under it is a readme file that explains this.
J
So what I really wanted to get to in this presentation is, you know, how this connects to Ceph, and how we get away with not having librbd in our hosts. I think it's summarized by showing you what an adn host knows about the dvm when it's talking to Ceph: what does it actually see? This json file over here on the right of this slide represents what it gets. So, remember from the previous slide...
J
...all these hosts are going to see a bunch of targets; they're going to see the namespace that is exposed to them. We expect that your targets would use namespace masking, right: when a host connects, the target would say yes, you're authorized to connect, and it would look up in a table and say, here are the namespaces you should see; so when it lists namespaces, those are the only ones it would see.
J
Any storage array will do the same thing. I have to point out that the spdk nvme fabrics target doesn't actually have that feature yet, but lots of people want it and it's proceeding independently.
J
So when your hosts connect to these nvme fabrics targets in Ceph nodes, they're going to get a location hint that describes their volume, and it's going to be the hash hint, which looks like this json structure on the side. The important parts: this thing called label isn't actually something that the host will use; it's in this blob of information so that when we need to regenerate it, when something changes in the cluster, we can make it consistent.
J
So there's no magic here. What it tells the host: the namespace bytes and object bytes tell it how to turn an lba into an object number, the object name format tells it how to make the name of that object, and then there's the hash it uses; for Ceph that's going to be the rjenkins hash and the ceph_stable_mod function.
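A sketch of the lookup chain that hint enables. The stable_mod logic below mirrors Ceph's ceph_stable_mod; the rjenkins stand-in and every structure and field name are invented for illustration (the real hint encoding lives in nvme log pages):

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Stand-in for Ceph's ceph_str_hash_rjenkins(); the real hint mandates it.
uint32_t rjenkins_hash(const std::string& s) {
  return (uint32_t)std::hash<std::string>{}(s);
}

// Same logic as Ceph's ceph_stable_mod(x, b, bmask).
static inline uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  return ((x & bmask) < b) ? (x & bmask) : (x & (bmask >> 1));
}

struct HashHint {                    // illustrative contents of the hint
  uint64_t object_bytes;             // rbd object size, e.g. 4 MiB
  std::string name_format;           // e.g. "rbd_data.%s.%016llx"
  std::string image_id;
  uint32_t pg_num, pg_num_mask;
  std::vector<uint16_t> pg_to_nqn;   // hash table: pg bucket -> nqn index
  std::vector<std::string> nqns;     // nqn table: up-primary targets
};

// lba -> object name -> hash -> pg bucket -> nqn of the up primary.
const std::string& route(const HashHint& h, uint64_t lba, uint32_t block_size) {
  uint64_t objnum = (lba * block_size) / h.object_bytes;
  char name[128];
  std::snprintf(name, sizeof(name), h.name_format.c_str(),
                h.image_id.c_str(), (unsigned long long)objnum);
  uint32_t pg = stable_mod(rjenkins_hash(name), h.pg_num, h.pg_num_mask);
  return h.nqns[h.pg_to_nqn[pg]];
}
```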
J
J
The hash table is, of course, just a simplified version of the pg table. Inside Ceph you get a pg and then you do the rest of crush to figure out which of those things is where your data should go. For adn, since we're only talking about block, you really can only deliver io to the up primary anyway.
J
So that's all we tell it. We determine, with the ceph cli tools, which osd is the up primary for that pg, what host it is in, and then what the name, the nqn, of the adn target in that host is; and that's all it needs to know. So this hash table basically contains buckets which are indexes into the table of nqns.
J
B
J
Basically, every redirector in the osd nodes will have this entire hint. Now, obviously, every rbd image in the same pool will have exactly the same hash table, exactly the same nqn list; but this is the whole thing: the basic guiding principle of adn is loose coupling, and, you know, they'll do their best to get accurate placement information, but if they don't, it needs to work anyway. So that means all the redirectors in the osd nodes send the entire thing.
J
H
J
I have the file somewhere; I can tell you. So this pg table size, you know, that's a concern, because this all has to fit in a log page; but one of the reasons that the hash table is just indexes into the nqn table is because of that, because there's really no bound on a pg table size right now. There are reasonable constraints; it doesn't do the user any good to have an enormous number of pgs.
J
J
H
J
So, how these things are communicated: I don't really have the details here, I wasn't expecting to go into this detail, but hints are sent to hosts in an nvme log page; it's a page that has a bunch of small structures in it. The hash hint is one of those small structures, but because it has these enormous tables, that hint contains other log page ids: the nqn table is in a separate log page, and the hash table is in a separate log page.
J
So this means that if you're a host using several volumes on the same Ceph cluster, they're all going to have the same tables, so you only need to read them once. Those log pages are designed with a digest: they have a header, and the header has a digest.
J
So if you've already read, say, the hash table for one of your ADN logical namespaces, and it had all 320,000 buckets, and you've remembered what the digest for that version of that table is, then when you connect for another volume and you read the header out of the log page and it's got the same digest, you just skip reading the rest.
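In other words, the digest lets a host cache the big tables across volumes. A minimal sketch of that idea, with placeholder read functions rather than a real NVMe log-page API:

```python
# digest -> parsed table, shared across every volume on the same cluster
table_cache = {}

def get_table(read_header, read_body, parse):
    digest = read_header()                        # cheap: header only
    if digest not in table_cache:                 # first time we see this version
        table_cache[digest] = parse(read_body())  # expensive: e.g. 320,000 buckets
    return table_cache[digest]                    # otherwise reuse what we read before
```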
J
Right, and the same thing for the NQN tables. So, when you connect this to the NVMe over Fabrics world, there are some corners of the envelope that maybe aren't going to work so well. If you've got 10,000 nodes, that's a lot of NVMe over Fabrics connections, and if you're trying to use an RDMA transport, that might not work well.
J
Obviously, this works with NVMe over TCP as well. But anyway, there are tuning knobs there. It's also true that if you've got images in separate pools, then hosts really only need to know about the targets in the nodes that have OSDs they will actually use, so that could be managed too. All right, so let's just quickly go to how we proved this. Does anybody have any questions about how we abuse CRUSH here?
D
Could you explain how you have used the NVMe protocol? I mean, it's not a simple read/write; there is a read with some parameters which you...
J
Sorry, what do you mean? We basically only support reads and writes; we don't do other operations.
D
Sorry, go ahead. But before that, how do you pass the information associated with a request? I mean the...
D
And all this snap and clone, they are based on some kind of sequence number coming from the RBD?
J
Yep. So here are the results of our PoC. We're comparing conceptually these three cases, and as you can see, this is three runs. This network at the top is my Ceph network; this network at the bottom is my NVMe over Fabrics network. We do it here with just RBD; obviously, all the traffic is right here.
J
All these traces in Grafana are all my eight OSD nodes and my one client. In this case, we left the client with NVMe over Fabrics; we went to one of the OSD nodes, which then just delivered it with RBD. So you can see that almost all the traffic is still on the Ceph network, and it's also here again on the NVMe over Fabrics network. And then in the ADN case, we put in the hash hint. Now, in this PoC, the redirectors weren't actually capable of passing the hash hint in log pages yet.
J
That's basically the whole idea, though. So we also wanted to prove, or actually see, whether this really saved any CPU overhead, in my little toy virtual cluster, which was just barely large enough to show any of these benefits, but a whole lot easier to set up than a real one, where you need to actually get somebody to pay for 8 or 16 nodes and let you keep them for six months. The differences are... wrong screen.
J
Sorry. So this is RBD at 30K IOPS, and this one over here is ADN, same workload, same 30K IOPS, and these two red arrows indicate that the CPU utilization on that client dropped by that much. Not a huge amount, but measurable; whether that matters to you as a customer depends on who you are. And again, 30K IOPS is not a lot, and these are really small virtual nodes. We probably don't have time to go into it here, but this actually represents two kinds of CPU utilization: there are normal OS thread stats, and, because ADN is an SPDK app...
J
We have to extract thread idle time from SPDK, because SPDK threads always look busy, and then transmogrify that into unified CPU utilization stats. So that's what all these caveats here are: this is a derived number, combined from the two, and I have the scripts that'll show you how you can do that, if you care. I'm only mentioning it because someone who just reads this slide will say, hey wait, why isn't that pinned at 100%? It was, so we looked inside.
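A sketch of that kind of derivation: SPDK's thread_get_stats RPC reports per-thread busy and idle cycle counters. The RPC itself is real SPDK; the exact field names and the blending below are assumptions, not their actual scripts:

```python
import json, subprocess

def spdk_busy_fraction(sock="/var/tmp/spdk.sock"):
    # SPDK poller threads look 100% busy to the OS; ask SPDK how much of
    # that spinning was real work versus idle polling.
    out = subprocess.check_output(["rpc.py", "-s", sock, "thread_get_stats"])
    threads = json.loads(out)["threads"]
    busy = sum(t["busy"] for t in threads)   # field names: assumed
    idle = sum(t["idle"] for t in threads)
    return busy / (busy + idle) if (busy + idle) else 0.0

# Unified view: substitute this fraction for the OS-reported 100% on the
# SPDK cores, then combine with normal OS stats for the remaining cores.
```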
J
I didn't measure it like this, I didn't show it on this graph, and of course I've lost the Grafana, but here's the line: we did see, it said, seven percent of one core on each OSD node, and these are eight-virtual-core nodes. So yeah, there was a measurable CPU utilization increase on those nodes. Now, it wasn't quite as much as we expected. Let me see if I left that slide in.
C
So the redirector that lives on the OSD node, that's now using SPDK, is running at 100% CPU?
J
It's being used the same way; it's not an absolute cost, it's just a difference in capacity, exactly. If you can have it, put it in the list of poll functions that the poller runs, great, fine, as long as it happens. And all this was done with kernel networking; I didn't use DPDK networking here. It just wasn't important for this PoC, and you may or may not want to do that in a real Ceph cluster anyway. So it's orthogonal: you could use user-mode networking or not. We expect most people will not.
H
So, fio plus RBD, minus RBD plus this ADN thing, that's like, yes, 25 percent?
J
We are saying it saves you half of one core at 30K IOPS; 30K was the unit we were pitching to the customer that was interested in this. I'd have to think about it before I would make the general percentage statement that you just made.
J
The other thing to note here is this sort of tortured PoC. I think I left that slide out, but even with just plain RBD, versus ADN, it wasn't actually just plain RBD; it was always going through an SPDK app.
J
So we had an SPDK app that exposed NBD, and we ran fio in user mode to get to that, and then we either went directly to kernel RBD out of the bottom of the SPDK app, or we went over NVMe over Fabrics to another node, or we went through a redirector to go to one of the eight nodes. So all of the NBD overhead was always there, and all of the overhead of leaving the SPDK app to go to kRBD was always there.
J
The only thing that changed was what we did in the middle, in the SPDK app. So we think that cancels out all these other things; you would not actually build it that way, but that's the RBD case. The ADN case, however, was direct to whatever PCI...
J
So we think it saved some, and we're not quite sure why, and we were also surprised by the overhead numbers. We were even more surprised, I think I left it out, by what the latency did. I put that in here; back to here, the latency, paradoxically...
J
Right, latency got lower with ADN than with just going to kRBD. We don't understand why: exactly the same amount of RBD work had to happen, we just did it on eight nodes. So is that the answer? And this was, well, this was queue depth 64. So maybe that is the answer: maybe they just used hotter caches, and so they got better performance, and when you shove queue depth 64 at one node, it fell off the cache cliff. I don't know, but that was the result. Now, I wouldn't...
J
I wouldn't bet on it; I wouldn't count on this in production, right? This is probably an artifact of this test setup, but that's what the numbers said, so we're just reporting them. So that was weird. Was there another question in there that I skipped over, or did we... let's see, we've...
B
...talked about yet, regarding the redirection: with any of these kinds of setups, where you're sending I/O from the client to a separate system and then going into RBD from there, if you were changing the, call it a gateway if you want, but the thing receiving and translating that into RBD...
B
Is there any kind of handling of when that changes? Trying to avoid sending the same I/O request again, or making sure that I/Os that are already in flight don't cause correctness problems if they're still...
J
Yeah, let's jump back to this diagram. This is the first one we talked about, with the two different I/Os: the first one without a clue and the second one with a clue. What we don't show here is the responses. So the idea is that if you're a host, you initiate an I/O to one of your targets. If it has to forward it, it initiates an I/O to the target it knows it should really go to.
J
The I/O is still in flight the whole time, so when it completes here, it completes back to here, and then it completes back to here. So if, for instance, this guy died while this was forwarded, then yeah, in theory, there could be an I/O in flight here from this...
J
...taking it over, and there's a possibility for... yes, if it was already in flight. So there are a couple of things we haven't really, well, okay. From the host's point of view, it didn't finish. The application doesn't know about any of this, right? You gave the I/O to this redirector. The redirector knows there are multiple alternative targets for it, and its job, if it tries it here and that doesn't work, is to try it on one of the other ones.
J
No, that's... so we did that in the RWL; we had to do overlap detection for ordering, because we were replicating and we needed to make sure that the replicas are all identical. So we can't let that race happen twice and resolve differently in different places. Now, our basic position is that if you're an application and you issue concurrent writes to the same place, that's a bug.
J
You probably shouldn't have done that, and most block devices in JBOFs and storage arrays don't give you any kind of guarantee at all.
J
Yeah, basically it's a bdev I/O, and it's not complete until it completes to one of these targets. If it doesn't complete, then they try it on a different target. So the only race happens out here: I tried it here, this guy died, and this guy had it already; he's got it in his queue, or it's in flight.
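So host-side failover is completion-driven retry. A toy sketch of that rule, with invented names, just to pin down the behavior described:

```python
class TargetFailed(Exception): pass
class AllTargetsFailed(Exception): pass

def submit(io, targets):
    # A bdev I/O is not complete until some target completes it; on a
    # failure, try the next alternative target for the same bucket.
    for target in targets:                     # preferred target first
        try:
            return target.submit_and_wait(io)  # hypothetical initiator call
        except TargetFailed:
            continue                           # e.g. that node died mid-flight
    raise AllTargetsFailed(io)
```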
H
So the redirector here: if we move, for RBD at least, the movement of an RBD image from redirector one to redirector two means unmounting in one place and remounting on the other, right?
J
It changes, right. And so I contrived examples where, you know, at first I gave it a hint where I deliberately mangled this hash table, and you start up I/O, and then, when you go back and look at this graph here, it doesn't look beautiful like this, because the host has the wrong idea.
J
Okay, but then, when you update the hint, the host applies it and starts sending I/Os to a different place. So yeah, the idea is that there is no correctness issue, because all of the targets in the OSD nodes can complete any of the I/Os. If you send it to the wrong target, it still completes: you still give it to librbd, it still does the right thing with it, it just takes longer.
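That is why a stale hint is a performance problem rather than a correctness problem. A sketch of the target-side dispatch implied here, with invented names:

```python
import logging

def handle_io(io, local_pgs, librbd_submit):
    # Any target can complete any I/O, because librbd routes it correctly
    # over the Ceph network either way; a wrong guess by the host only
    # costs an extra network hop (latency), never correctness.
    if io.pg not in local_pgs:
        logging.debug("I/O for pg %s hit the wrong target; completing anyway", io.pg)
    return librbd_submit(io)
```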
H
I'm not worried about caching, because the application is where the caching should be done. That's fine, right? I agree, but, for instance, a particular image will have operations from many different clients. So if we wanted to do something clever, like remembering only the most recent client ID that did an operation on a particular object, that's impossible. Yep, yep. So there are disadvantages to...
H
...sure, there are disadvantages, and there's more. Another way of doing the redirector would be that you maintain the exclusive mapping and actually move it, and your redirector, instead of simply running the RBD operation, forwards it to the one that does have the RBD image mounted. That is an optimization you...
J
We could do that, yes; it's more complicated. So Ceph is kind of a beautiful example of a very simple DVM because, like we said at the beginning, normally a DVM's job is, if the I/O comes to the wrong place, you've got to get it to the right place somehow. In the systems where we considered doing this, like a pairwise HA JBOF with a PCI bus in between them...
J
...that's pretty easy in that case, because everybody can see all the SSDs, right? The minute it becomes a bunch of boxes, a bunch of commodity servers on a fabric, suddenly you've got another fabric hop involved. But in Ceph you don't have to do that. Also, a normal DVM needs to very closely manage these, what we call egress redirectors, to make sure that they all have perfect knowledge of where everything goes, because they're the source of truth, and in the Ceph case, in this implementation, that's kind of relaxed.
J
We could be kind of sloppy about it, because all we care about is performance. What you're talking about is tighter integration, where you're more tightly integrated with other RBD performance features, and then suddenly that would matter, and we think the architecture allows it, but that's not what I built now. The bad news is it kind of gets worse, as you guys have all probably realized by now: this thing is only talking about the hash function for what you have.
J
If you have a clone stack, the hash hint only talks about the top, right, not everything under it. So that means, if you've got a clone that differs by one object from its parent, all the I/O is going to go to the wrong place, and Ceph is going to have to forward it, because ADN doesn't know about that right now. If you decided to do striping over RBD objects, I don't know how popular that still is, but obviously that's not going to work here either.
J
I should mention that, because that's actually my straw-man proposal: you should turn on the shared object cache in all your egress redirectors. The reason I don't usually say that out loud is because then people say, wait a minute, I have to have SSDs in there, and they're just caches for stuff that's on my other OSDs; you're just trying to sell me SSDs? Yes, yes, we are. But yeah, you're right.
J
It would, and it's probably the cost equation for the customer, right, to decide whether they care that much, or whether they really just want to move their caching.
J
Oh, okay, well, sure, if you've got enough memory for that, yes. But to really handle clones and stripes, you would need a much more sophisticated way of combining the hints in the host, and we've deliberately drawn a line here and said no: let's get this out there and see what people think with this sort of minimum viable system, right?
J
We can all think of ways that you could combine these hints, and I can tell you I've spent some time thinking about it, and there's no super simple way to do it. There are ways to do it, but they're all kind of over the line of complexity that you'd like to have in a version-zero something. So we're not doing that, and if you are a person who wants to look at whether this works for you, you'll have to understand those limitations.
J
Basically. And, as you say, turning on caching in your targets will help tremendously.
J
It'll certainly make sense if you're trying... okay, so the basic motivator is: if you're a cloud provider and you're billing people for cycles in the hosts, then if you're spending those cycles on anything in your infrastructure, that's money. Yeah, you can account for them separately, you can, but...
J
That's right. You know, we would have been happy if all your OSD nodes were two-socket Xeon servers, as planned. Obviously, that's totally obvious.
J
The poster-child case is that you'd like to do bare-metal provisioning. You want to be free to give it whatever storage works for you. It makes sense, or it's at least handy, if it can see the world as a bunch of NVMe devices, and that interface means it can be very performant if you put very expensive storage behind it, and with this it'll still work...
J
...if you put something more capable and flexible and a lot cheaper behind it, though. Okay, fair enough. Now, it's tempting to say, well, how would you migrate things? How would you unify that kind of system, so you can have an NVMe JBOF and Ceph and maybe migrate volumes back and forth?
J
That's clearly out of scope. But did you consider...
J
SPDK is BSD-licensed, so we can't just bring Ceph code in here. There's something I'm not prepared to talk about yet that would be very flexible: basically, hosts could define their behavior.
J
Yes, every time the OSD map changes. But you've got, you know, hundreds of images and maybe thousands of clients, and you really only need to generate these two things once when the map changes, and because those log pages have digests, hosts...
J
Yes, how big is it? When we talk about putting this in pieces of hardware, that's a real problem. So if you're talking about, you know, hardware-accelerating this, you get to a point where you've got to make a choice and say: well, if your PG map is, you know, a thousand entries, we can definitely do that; if it's half a million, sorry, you're going to go to the slow path, we're going to have to do this in the software database. When you think about a hardware offload for ADN...
J
...you always have to have the software path there for the things that the hardware can't do. You know, if this becomes standardized in NVMe, there'll be a table somewhere that says here are the hint types you can have; in version X of the spec there'll be N of them, and then in the next version there'll be more. Your hardware, or the firmware, or device, or your OS may not understand those. So there...
H
...are other approaches to this too. I've suggested in the past that it's sort of irrational for an RBD device to be spread across an entire OSD cluster. We could constrain the placement to just a subset of the PGs: if, for instance, to get the parallelism you wanted on a particular entity, on a particular RBD image, you needed 64 or 128 nodes, then you could just constrain it to that.
J
Yes, yes, that would work, and it would be really easy: you could completely hide that in whatever generated the hint, right?
J
We're trying to resist the temptation to be clever about this, though, right? Ideally, like Jason said, we'd like to be able to run it in a very constrained environment and basically abstract the storage to hosts, so you can have a bunch of dumb bare-metal tenants. It seems to us that the attack surface is reduced when you do that too: you no longer have to have your hosts be authorized to make RADOS connections to things.
J
They can only do NVMe things, so they don't have to have the keys; they can't really view the state of the cluster at all, they just see block I/O. That may or may not be important to you as a customer. So let me just go back to... I mean, we're probably way out of time. Sorry I was late.
J
So, where we're at is: this reference implementation is out there. We only have the SPDK building block now. You could imagine you might like to try this with a kernel ADN client; well, we haven't written one of those. There could be one, but, you know, kernel development is so much fun that we're getting all the bugs out of it this way. And the beauty, from my point of view as the only guy working on this right now, is that I built one of these and you use it everywhere.
J
So I didn't have to make a client and a target; there's only one thing. So, like I said, it's out; there's an RFC patch.
J
There are also some NVMe ecosystem issues. ADN essentially does what you'd call a shared namespace: we have a namespace which you can see from multiple subsystems, and although I didn't think NVMe prohibited that, it turns out it actually does. So one of the things that breaks in NVMe is you can't do reservations on it. You can't do other things that have been defined since I started this, basically NVMe over Fabrics multipathing, or actually it's NVMe multipathing.
J
It's called ANA; that won't work with these. But my position is that you almost don't need it, because you've got multiple targets, so you can handle loss of a path. It's not as performant as ANA, which really has two paths to the same thing, but I was going for a different use case here. Reservations may or may not be important. Now...
J
...this reference implementation doesn't implement that, of course. The real issue, if you're an NVMe person: the things that a distributed or shared namespace presents problems for are the NVMe commands for namespace management, the things that, you know, create and destroy namespaces in actual SSDs. Now, if you're making a distributed storage system, you're not going to manage volume creation that way; it doesn't make a lot of sense.
J
So the option is that these targets just won't support those commands; they're optional commands anyway. But the bottom line is: there are standard extensions in progress for what they call dispersed namespaces, and I apologize for ADN's name; I didn't know about the dispersed namespaces technical proposal when I started this. Actually, it started well after this, and it specifically addresses all of these issues.
J
For other reasons, storage vendors would like to be able to expose a namespace from multiple subsystems; their sort of headline use cases are migrating between arrays and, you know, administrative maintenance tasks like that. If you're doing that, you need that capability at least temporarily. And so, if you're an NVMe person, you look at those standards and you see dispersed namespaces.
J
You might think, oh, that's the same thing. No. Where this is probably going is: if that TP proceeds and is adopted, then ADN would naturally stack on top of it. You'd say, well, you use ADN with a dispersed namespace and it adds these capabilities, which are essentially just a bunch of log pages with their contents defined and the recommended host behaviors.
J
So that's kind of where we are now. The code is there so that people who might use it, and might want to reconcile it with the ecosystem issues, can contribute and comment and tell me what works and what doesn't. But we don't have a package for Ceph yet, and it isn't really time to do that, I don't think. And in fact, since that's not part of the part of Ceph that I have worked on... where did I... let's go back to this slide.
J
So at some point I would consult you, just like this, to say: well, how should this be integrated into Ceph? If somebody wanted to create a package that added Ceph ADN capability, how would you do that? Would you use the iSCSI thing as a model? Is there a better model? That's sort of my question to you: what should we be looking at there, or targeting?
C
I mean, it's a standalone product, right, a project. So you have tcmu-runner and you have the ceph-iscsi packages, and they're not tied in; Ceph separately provides the core functionality, and tcmu-runner and ceph-iscsi are just clients to whatever librbd/librados package gets installed with them, because there are no one-to-one feature tie-ins. This is not like the Octopus release of iSCSI; it's iSCSI that might be using Octopus clients. So they're distributed and kept separate on GitHub.
J
Okay, so that's one way to think of it: an ADN adaptation layer could be a separate package that just uses normal Ceph libraries and does its thing. That would kind of limit us to the architecture we have here, where we just use RBD in the OSD nodes. If you want to do anything trickier...
C
And then the other question I have is on the fabrics back end: how many namespaces can you really have per target?
J
I can't remember if it's a 32-bit number or a 16-bit number, but obviously the real limit is going to be an implementation limit of whatever framework you're using. You are free to have more than one target, or more than one subsystem, so you can pack them a number of ways, right? The reason we talk about it as one is that that's the minimum number, right?
C
...in use concurrently that are now mapped across every single OSD node, right. And I think, really, unless you get to the point of what Sam mentioned, like, oh, a given RBD image can be pinned down to a given subset of PGs, right, but then you start doing that and it's like, well, is this really then just a gateway? Yeah, you know.
H
You would, so, yeah. The two major pieces would be somehow restricting the number of PGs a node can map to, or changing the overhead of the hints. In addition, you probably can't, in general, map every image on every OSD; that's extreme. So you would need some mechanism to only map on the OSDs they're supposed to be the target for, with, you know, support for moving them...
H
...to a certain set of PGs. So, those two.
H
It gets more complicated; Ceph gets you right across a lot of that, right. But you were asking what the architectural changes would be that you would make as it becomes more complicated, and I think this is what they would be. For the simple version, though, I don't see a reason why it would necessarily need to be included in ceph.git; an external project seems fine. This doesn't have a hard dependency on any particular version of Ceph, and it doesn't know anything about the Ceph internal protocols. As it becomes more useful...
J
Yeah, the things you might do to eliminate or reduce the RBD overhead in the OSD, you know: integrating it into Crimson somehow, in some way that makes sense, or at least maybe optimizing that transport connection to say, well, if you're talking to your local OSDs, maybe we don't need all this, no encryption.
H
Even that could be done within the client. You just tell the client, by the way, you're likely to... yes, the local connections.
J
That's cool. So, these things, you know, there are no immediate plans to try and do any deep integration; we figured we would wait until there was a clear need and the feature set, you know, shook out. So, unless there are any more questions, that's basically it. I did point you at the...
J
Got a second... where did it go? I wanted to point out this README file here at the bottom. Oops, wait for that... there, clicked on it. Rather than a PowerPoint, some people would rather read an explanation of how it works, with pointers to the scripts that do all this stuff. It doesn't sound like any of that's a big mystery: Ceph people understand that you can get all this information out of the Ceph CLI tools in JSON form, and that's what we did.