From YouTube: Reconsidering tracing in Ceph - Mohamad Gebai
Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Mohamad Gebai, SUSE Software Engineer
So this is what I'm going to be mostly talking about during this talk: how we can use tracing to do an analysis that allows us to make such a decision. I will then sneak in a small comment, a small analysis, about logging in general and in Ceph, and finally I'm going to do a small tracing demo, which I have already pre-recorded just in case.
So we have this hypothesis that changing the back end of bufferlist might improve performance. Bufferlist is simply a class that chains buffers together. It's a list of buffers, and it is currently implemented using a linked list from the standard library, which is just a bunch of nodes that are referenced by pointers; they're not contiguous in memory.
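As a rough sketch of the structure being described (the names here are illustrative, not Ceph's actual code):

    #include <list>

    // Illustrative simplification: a bufferlist chains buffers, and the
    // chain is a std::list, so the nodes are scattered across the heap.
    struct buffer_ref { char* data; unsigned len; };  // stand-in for a buffer handle

    class bufferlist_sketch {
      std::list<buffer_ref> _buffers;  // current backend: pointer-linked nodes
      // Hypothesis: a contiguous backend (std::vector, std::deque) might
      // be friendlier to the CPU caches.
    public:
      void push_front(const buffer_ref& b) { _buffers.insert(_buffers.begin(), b); }
      void push_back(const buffer_ref& b)  { _buffers.push_back(b); }
    };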
How do we measure the gain, if any? I mean, such a change might be really, really small and insignificant. Doing a rados bench or running fio might not really expose the benefit or the regression, or it might be covered by something else: as important as bufferlist is, a lot more goes into a workload than that. I think we can use tracing for that.
Ceph currently supports tracepoints, LTTng tracepoints. You can define a new tracepoint that way: here I have defined a tracepoint that is called bufferlist push_front. Quite easy. This is the current implementation of the push_front method of bufferlist: it simply calls an insert on the standard library container that is used. I have instrumented the code as such, just calling the tracepoint before and after the insert, and we'll see what happens between these two tracepoints.
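The shape of such a tracepoint definition and of the instrumented method, assuming made-up provider and event names (the usual TRACEPOINT_PROVIDER boilerplate of an LTTng-UST provider header is omitted):

    /* Tracepoint definition, in a tracepoint provider header. */
    TRACEPOINT_EVENT(
        bufferlist,                  /* provider (assumed name) */
        push_front_enter,            /* fired just before the insert */
        TP_ARGS(unsigned int, len),
        TP_FIELDS(ctf_integer(unsigned int, len, len))
    )

    /* A matching bufferlist:push_front_exit event is defined the same way,
     * and the method is instrumented around the container call: */
    void buffer::list::push_front(ptr& bp) {
      tracepoint(bufferlist, push_front_enter, bp.length());
      _buffers.insert(_buffers.begin(), bp);
      tracepoint(bufferlist, push_front_exit, bp.length());
    }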
The cool thing about LTTng is that it allows you to add context to events: every time an event is recorded at runtime, you can have that event carry extra information about the context of your system. In this analysis, I have added cache misses, L1 data cache load misses, and TLB misses. The TLB is simply a cache structure for resolving memory addresses from virtual memory to physical memory, and a TLB miss might be very expensive.
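Those counters are attached to the events when setting up the tracing session; with the lttng CLI it looks roughly like this (the exact perf counter names depend on the CPU and the LTTng version, and the event name matches the provider assumed above):

    lttng create bufferlist-bench
    lttng enable-event --userspace 'bufferlist:*'
    lttng add-context --userspace --type=perf:thread:cache-misses
    lttng add-context --userspace --type=perf:thread:L1-dcache-load-misses
    lttng add-context --userspace --type=perf:thread:dTLB-load-misses
    lttng add-context --userspace --type=perf:thread:instructions
    lttng start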
We are doing low-level profiling here: on x86-64, that's up to four additional accesses to memory for one TLB miss. I'm also looking at the number of instructions that it takes to do a push_front or a push_back. The number of instructions in itself is not really relevant, it's not really important, because it can change from one compiler version to another.
The approach I took for measuring the impact of changing bufferlist from a list to a vector is the following: a micro-benchmark that creates a thousand bufferlists, and in each bufferlist I append a thousand buffers. So we have a thousand bufferlists, and each bufferlist has a thousand buffers. I have an L1 cache size of 32K and an L2 cache of 256K, and during the micro-benchmark I elected to disable C-states, turbo boost, hyper-threading, and whatever might come in and interfere with the frequency of the CPU.
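The benchmark itself can be as simple as this sketch (reusing the illustrative bufferlist_sketch type from above; the CPU frequency settings are done outside the program):

    #include <string>
    #include <vector>

    // A thousand bufferlists, a thousand buffers appended to each, so the
    // instrumented push_back fires a million times per run.
    int main() {
      constexpr int kLists = 1000, kAppends = 1000;
      static std::string payload(64, 'x');  // buffer size; varied in later runs
      std::vector<bufferlist_sketch> lists(kLists);
      for (auto& bl : lists)
        for (int i = 0; i < kAppends; ++i)
          bl.push_back({&payload[0], (unsigned)payload.size()});
      return 0;
    }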
When you're done, this is what you get: hundreds of thousands of these lines, which don't look really interesting, and they look pretty boring to be honest. But if we dissect one, we see that it's actually easy: you have an event, each event has a timestamp, you have where that event was recorded, and then you have your context: the number of cache misses, instructions, L1 data cache misses and TLB misses.
This is the average latency of push_back for different data structures. Vector is in red, list is in green, and the double-ended queue is in purple. But the averages aren't really interesting; instead, let's look at individual push_backs. In all of these graphs, the y axis is always the latency, so it's really how much time it takes to do one push_back. The x axis is, in the upper left, the number of instructions; in the upper right, L1 cache misses, which is what we're interested in; the bottom left is the number of TLB load misses; and the bottom right is the number of branch misses.
If you look at the number of instructions for a list, all of the green dots have the exact same number of instructions every time, because the list always works the same way: you create a node, then you update the references to it, and it really is the exact same number. It's really interesting, because I, at least, always had the impression that I can't really see what's happening in the CPU, because it's so obscure and there are so many random optimizations. But the number of instructions is predictable.
You can actually predict how many instructions the next call will take. For the green dots, as I said, there's only one code path that is taken. For the purple ones there are two, which means that a push_back using the double-ended queue either takes three thousand instructions or it takes four thousand instructions, with the bulk being on the three thousand instructions, which is the fast path.
The fast path means there's enough space in the current memory block to append a new buffer; and sometimes, when there isn't, a slow path has to be taken, which is the four thousand instruction path. For the vector implementation, you have many code paths that are taken. You have one that is prominent, at a little over two thousand instructions, which is again the fast path, meaning there's enough memory, there's enough space to add a new element; but there are many other code paths that might be taken.
At first I wondered how that's possible: I know that there are only two code paths possible, either the fast path or the slow path. But if you think about it, you have to move elements when you're crossing boundaries, and the more elements you have, the more instructions have to be executed.
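You can watch those boundary crossings happen with a stock std::vector in a tiny standalone demo (not from the talk):

    #include <cstdio>
    #include <vector>

    // Appends are cheap until capacity runs out; then one reallocation
    // moves every element already stored, so the slow path gets more
    // expensive as the container grows.
    int main() {
      std::vector<int> v;
      auto cap = v.capacity();
      for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        if (v.capacity() != cap) {  // the slow path was just taken
          cap = v.capacity();
          std::printf("realloc at size %zu, new capacity %zu\n", v.size(), cap);
        }
      }
      return 0;
    }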
So it's pretty straightforward, and you can confirm that by looking at the number of branch misses: you have either zero branch misses or you have one. You have zero branch misses when you have enough memory to append a new element, which is what the CPU expects because it's the hot path; but sometimes you have one branch miss, when there isn't enough memory left.
The upper right graph is the number of L1 misses, and again, we want to be to the left and down: left meaning few cache misses, and down meaning low latency. We have pretty much that for everyone, but for the list, for the green dots, we have plenty of outliers with many L1 misses.
The number of TLB misses doesn't really seem to have much impact. I mean, the more TLB misses you have, the higher your lower bound will be, but it's not really significant, so we'll just ignore it for now. The next step was to make the buffer size vary: I'm still appending a thousand buffers to a bufferlist, but I'm making the buffer size larger; let's see what happens as I increase it.
What else can we look at? I showed only four PMUs, performance monitoring units; there are many more that you can look at. A cool one is looking at the CPU cycles, because if you have the instructions and you have the CPU cycles, you can calculate the IPC, and you want that to be close to 1. You can look at the branch prediction misses, which I already talked about, and at page faults. In the micro-benchmark, page faults aren't really interesting, because nothing really happens.
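Computing the IPC is then just a ratio of two context deltas; a minimal sketch, assuming cycle and instruction counters were attached as contexts:

    #include <cstdint>

    // IPC over the interval between two events, from the instruction and
    // cycle counts carried in their perf context fields. Close to 1 keeps
    // the pipeline busy; well below 1 suggests stalls such as cache misses.
    double ipc(std::uint64_t instr_begin, std::uint64_t instr_end,
               std::uint64_t cyc_begin,  std::uint64_t cyc_end) {
      const std::uint64_t cycles = cyc_end - cyc_begin;
      return cycles ? static_cast<double>(instr_end - instr_begin) / cycles : 0.0;
    }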
The next step would be to be able to really run Ceph itself with different backends for bufferlist. What I did for this was take code from Adam Emerson's and Jesse Williamson's repo, where they have started to do this. Ceph crashes if you start Ceph with these changes, but you can run the unit tests, and this is what I used for this experiment. From these runs, the double-ended queue seems to really work best, but to really get to a conclusion we need to see how real Ceph uses bufferlists.
A small word on logging. Currently, this is how we do logging in Ceph: we use the dout function, we give it the severity and the string, and you get logs. There have also been talks about making this a little bit more efficient. So what I did was change this for an LTTng tracepoint, and by doing that we take out all the locking, because LTTng is a lockless tracer.
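Roughly, the change looks like this (dout/dendl are Ceph's real logging macros; the tracepoint provider and event names here are made up):

    // Today: severity plus a stream; the string is built and a lock is
    // taken on the hot path.
    dout(20) << "got op " << *op << dendl;

    // The experiment (sketch): emit an LTTng-UST event instead; no lock
    // is taken, and formatting/I/O are deferred to trace collection.
    tracepoint(ceph_logging, log_line, 20 /* severity */, "got op");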
It uses per-CPU buffers, so there's no locking, and it's pretty efficient. And I didn't see any difference; I don't see any difference between LTTng and dout, our logging in Ceph, which uses locking, and which, I think, uses a linked list to keep track of the logs and then flushes them at a later time. I didn't see any improvements, which means that the bottleneck here is the string manipulation. That is just what I think it is; I wasn't able to really go a lot further than that, but these are the early results.
You can also instrument functions if you want, and all you need to do for that is recompile the code with GCC's -finstrument-functions flag, which Ceph supports. So for this, the next demo, I recompiled Ceph with basically just GCC's flag and nothing else, and I traced the kernel and the function entries and exits.
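What the flag does, concretely: GCC emits a call to two well-known hooks at every function entry and exit, and lttng-ust ships a helper library implementing them:

    /* Hooks inserted by -finstrument-functions around every function.
     * liblttng-ust-cyg-profile implements them and records
     * lttng_ust_cyg_profile:func_entry / func_exit events. */
    void __cyg_profile_func_enter(void *this_fn, void *call_site);
    void __cyg_profile_func_exit(void *this_fn, void *call_site);

The helper is preloaded into the instrumented binary at runtime, along the lines of LD_PRELOAD=liblttng-ust-cyg-profile.so.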
This is Trace Compass. You have three views. The upper one is the control flow view: it shows the state of all of your processes on the system through time. Green is running in user space, yellow is blocked, orange is preempted. So you see swapper is preempted often, which is good, because swapper is the idle task on Linux, so we want it to be preempted whenever we can. It's really a list of all your processes; this is a vstart cluster, so you see all the Ceph threads as well.
It does an open, a write and a close back to back, so we can kind of guess what's happening here. If we look at the open system call, it's opening /proc/self/comm, which holds the name of the process. It does the open system call, it gets a file descriptor in return, number eight; it then does a write on fd number eight, and then closes the file descriptor, right? This might not be an issue in itself; it might be exactly what it's supposed to do.
But there might be something to be done here on the critical path, the critical path of the rados bench, because it goes through all of these kworker threads, and the orange depicts the preempted state; in other terms, the time between when the process is created and when it is scheduled on a CPU. You can see it's greater than the amount of time it actually runs, so there's a lot of wasted time here. We might be able to get away without creating all of these kworker threads; I haven't done a lot of analysis on this.
It's really cool, because sometimes you'll see loops repeating based on the colors, and sometimes inefficiencies really jump out when you're looking at this, sometimes not. You really need to know the code and know how things work to really make sense of it, but it can be quite powerful, and obviously all three views are synchronized, so you can see that nothing is happening here for the time it was preempted, which makes sense.
The cool thing with LTTng traces is that there's also a library, libbabeltrace, if you want to do automated analysis. So, for instance, we could easily write a script to compute how much time is spent in a specific function or a specific thread. I'm going to cut this short. This presentation is done thanks to Jesse Williamson and Adam Emerson, for unknowingly providing code for this demo.
I also want to thank the Ceph core team, which has been really inclusive. As a developer, you see it: even when you're a newcomer, a newbie, you really feel the openness, which has to be mentioned; I think it's really great, and it shows it's clearly an active effort. I also want to thank Leo for his help this week; I have heckled him with my presentation.