From YouTube: GMT 2018-07-18 Performance WG
B: So the first thing I wanted to talk about: there was this ticket, MESOS-8418, which...
B: So he said it's fixed for him, but I just realized that, you know, probably before each meeting I need to do a quick audit of what tickets have been filed with the performance label, because this one did actually have the performance label on it. So I'm just thinking about how to avoid this happening again. I'm probably going to, either in the meeting or before the meeting, just go over those tickets and make sure that we're aware of them. It probably would have shown up in the dashboard, actually, that I have here somewhere... here.
B: No, okay. What was the current status of it?
The current status... I should probably mention that I did a short-term fix, which just moves the verification in the read helper to happen only after a read actually fails. So it goes directly to the read, avoiding the verification, and if the read fails, it then does the verification to provide the better error message that the verification was providing. So it doesn't break any tests or anything, but longer term...
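A minimal sketch of the verify-on-failure pattern described here, with hypothetical read/verify helpers (illustrative only, not the actual Mesos code):

```cpp
// Sketch of the short-term fix: try the read directly, and only pay for
// verification when the read fails, keeping the richer error message on
// the failure path without slowing the success path.
#include <iostream>
#include <string>

struct Result {
  bool ok;
  std::string payload;  // data on success, diagnostic on failure
};

Result read(const std::string& path) {    // fast path: just read
  if (path.empty()) return {false, "read failed"};
  return {true, "<contents of " + path + ">"};
}

Result verify(const std::string& path) {  // slow path, better errors
  return {false, "verification error: bad path '" + path + "'"};
}

Result readHelper(const std::string& path) {
  Result r = read(path);  // go straight to the read
  if (r.ok) return r;
  return verify(path);    // only verify after the read has failed
}

int main() {
  std::cout << readHelper("state").payload << "\n";  // fast success
  std::cout << readHelper("").payload << "\n";       // verified failure
}
```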
E: Can you all see my screen? Yep? Okay. So we've been looking into state.json. Briefly, there are two problems with state.json. The first is what people usually think of when we say state.json is slow: it takes time to get a response from state.json, which means the Mesos UI is slow, and some other tools, like the DC/OS UI, are also slow.
However, there is another problem, which probably has a higher priority: the isolation problem. That is, state.json impacts the overall performance of the master actor, which means that in very unfortunate scenarios the Mesos master actor can spend 90% of its time just crunching data for state.json.
So there are several aspects of this problem, especially if we talk about how to improve. If we look at the core of the problem, delivering state information to consumers, the answer to a slow state.json might be to use the v1 SUBSCRIBE endpoint. This will give you a first snapshot of the cluster state, and then it will give you updates.
However, we still want to tackle the isolation problem first, and this is right now our first goal. So we did some tests internally in order to gather some data. The first was on a smaller cluster, with 22 agents and around 390 frameworks.
The reason for this is that the time each subsequent state.json request needs to be answered includes its waiting time in the queue, and that includes the time that all the previous state.json requests sitting in the queue when the request arrived took to be processed. If you look at the bottom picture, the blue lines: this is exactly the amount of wait time for a specific request.
So the gray is the total time for a state.json request to be answered, measured, let's say, from a client perspective. The red dot is the time that a specific state.json request spends in the master queue until it is scheduled for processing. And the crunching here is the time a request takes from first being processed to the moment it is handed over to the process that sends it to the client. The crunching, the blue points, is a little bit confusing, because it includes an extra trip through the master queue due to the authorization, where we do an extra dispatch, and it obviously also includes the time that previous state.json requests take to process.
Queuing, that is correct, and that's why we see this echo: the blue includes that one extra dispatch after authorization. That's where we see that the blue times differ and are multiples of a specific number, let's say eight hundred milliseconds; that's because of this extra trip through the master mailbox. So this is a...
The first is when the request is put on the queue for the first time; then it is processed by the master, and the master schedules it on the authorizer. Once the authorizer creates all the objects, approvers and other authorization stuff, the request is dispatched for the second time. And due to these two trips through the master mailbox, in this processing time, in which we include this second dispatch, we see...
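A minimal sketch of the two-trip pattern being described, using a plain function queue as a stand-in for the master's mailbox (illustrative, not the actual libprocess API):

```cpp
// Answering a /state request costs two passes through the master actor's
// mailbox: one to start authorization, and a second, after authorization
// completes, to build and send the response.
#include <functional>
#include <iostream>
#include <queue>

std::queue<std::function<void()>> mailbox;  // stand-in for the master queue

void dispatch(std::function<void()> f) { mailbox.push(std::move(f)); }

void handleStateRequest() {
  dispatch([] {  // trip 1: dequeue the request, kick off authorization
    std::cout << "authorizing...\n";
    // Once authorization finishes, the continuation goes back onto the
    // mailbox and waits behind everything enqueued in the meantime.
    dispatch([] { std::cout << "serializing and replying\n"; });  // trip 2
  });
}

int main() {
  handleStateRequest();
  dispatch([] { std::cout << "another actor message in between\n"; });
  while (!mailbox.empty()) { mailbox.front()(); mailbox.pop(); }  // drain
}
```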
B: So when I time how long it takes for my state.json to get processed, I have to do another trip through the queue. I already did a trip through the queue to start my timer, and I have to do another one once the authorization is processed. And in that second one, another state.json could be finishing its continuation, which would impose all that extra crunching time on our current request. Does that make sense? I thought I would put that more elegantly, but there it is.
E: Yeah. So it is unfortunate that the waiting time in the queue the second time we pass through the master queue is included in crunching. But it's a nice picture, I think, because it gives a very good intuition about what is happening: the processing time is actually the time to process a single request, multiplied by the number of requests sitting in the queue. So we have a second test on a big cluster...
E: Okay, shall we move on? Okay. So this second test was on a bigger cluster, with more than 100 nodes, 12,000 tasks and more than 1,000 frameworks, and we'll come back to this slide. Here the graph is slightly different. What we measured here was the time that the request spends in the queue, including authorization, so the two trips through the master queue. Then the red, which is almost invisible,
is the crunching, and I'll explain why the crunching is zero. The blue is the time we spend to serialize the response, and the blue is depicted twice on the lower graph: once on top of the gray and once separately. So we can see the pure serializing time, as well as the serializing time put on top of the gray, the waiting time. So first, why the crunching is zero: it's because in the v0 state.json response there is no crunching at all; there are no intermediate objects anymore, and there is no filtering.
On the cluster we've been testing on, the v0 state.json was used. For v1 GET_STATE and v1 SUBSCRIBE, where this assumption is not true, where we actually do have crunching, where we convert the master state into intermediate protobufs that we then serialize, no clients and no tools were using these endpoints, so there is no data here for them. Now, if you look at the time spent per request: of the total time, thirty-nine percent on average, across all state.json requests and across the lifetime of the cluster, was spent waiting.
Think about it as a ratio of integrals. On the image below, if you calculate the area of the gray, and of the gray plus blue, of that ugly figure, and if you, for example, take only the serializing, so the area below the serializing line, that dotted blue line, and you divide one by the other, you'll get what I'm talking about.
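Roughly formalized (notation ours, not from the slides): with w_i the waiting time of request i, t_i its total latency, and w(tau), t(tau) the corresponding curves over the cluster's lifetime T, the 39% figure is the ratio of the areas:

```latex
\[
  \text{waiting share}
    \;=\; \frac{\sum_i w_i}{\sum_i t_i}
    \;\approx\; \frac{\int_0^{T} w(\tau)\,d\tau}{\int_0^{T} t(\tau)\,d\tau}
    \;\approx\; 0.39
\]
```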
The second idea would be to pool state requests together and process them in parallel. The idea behind this approach is that if at a certain point in time we observe, let's say, six state requests in the master queue, we dispatch an internal call, let's say "respond to all state requests", onto the master queue at the moment when we observe the first request. Then, until we reach that dispatch in the master mailbox, we accumulate all six requests, and then we block the master actor.
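A minimal sketch of the pooling idea under these assumptions (a plain function queue standing in for the master mailbox; names like onStateRequest are hypothetical, not Mesos code):

```cpp
// The first request schedules one "respond to all" job on the master's
// mailbox; requests arriving before that job runs are pooled and answered
// together, each on its own worker thread, while the master actor blocks.
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::function<void()>> mailbox;  // stand-in for the master queue
std::vector<std::function<void(const std::string&)>> pooled;

std::string renderState() { return "{\"state\": \"...\"}"; }  // expensive in reality

void onStateRequest(std::function<void(const std::string&)> reply) {
  bool first = pooled.empty();
  pooled.push_back(std::move(reply));
  if (first) {
    // One dispatch serves every request pooled until it is dequeued.
    mailbox.push([] {
      std::vector<std::thread> workers;
      for (auto& r : pooled)
        workers.emplace_back([&r] { r(renderState()); });
      for (auto& w : workers) w.join();  // the master actor blocks here
      pooled.clear();
    });
  }
}

int main() {
  for (int i = 0; i < 6; i++)  // six requests arrive before the dispatch runs
    onStateRequest([i](const std::string& s) {
      std::cout << "request " << i << " answered: " << s << "\n";
    });
  while (!mailbox.empty()) { mailbox.front()(); mailbox.pop(); }  // drain
}
```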
The drawback here might be that if the master queue is long and we have a lot of state requests sitting in there when we gather them and are about to process them, let's say we have 30 or 40 requests to answer, then spinning up 40 worker threads and blocking the master actor until they all finish might not give us a 40x speed-up; it might still take more time than we spend processing one single request.
So this is why we would like to estimate the cost of creating a deep copy of the master state. An additional approach, alternative but probably additional, would be, instead of trying to respond to these pooled state requests in parallel, to create a deep copy of the master state, let the master continue doing its job, and answer all the pooled requests, either together or one by one, it doesn't matter, on a separate actor or on a separate thread.
E
However,
it
makes
sense
only
if
creating
a
deep
copy
is
way
cheaper
than
answering
one
single
request.
So
we
would
like
to
get
a
better
idea.
How
expensive
will
it
be,
then,
obviously
to
get
some
numbers
how
we
improve?
We
would
like
to
add
a
benchmark
to
somehow
measure
the
master
actor
load
so
that
we
can.
So
we
thought
that
we
can
see
that.
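A minimal sketch of the deep-copy variant, with illustrative types (not the actual Mesos state):

```cpp
// Copy the state once inside the master, then answer all pooled requests
// from the immutable snapshot on a separate thread, so the master actor
// is blocked only for the duration of the copy.
#include <functional>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

struct MasterState { std::string blob; /* frameworks, agents, tasks ... */ };

std::string render(const MasterState& s) { return s.blob; }  // serialization

void answerPooled(const MasterState& live,
                  std::vector<std::function<void(const std::string&)>> pooled) {
  MasterState copy = live;  // the deep copy: the only work the master does
  std::thread worker([copy, pooled] {
    for (const auto& reply : pooled) reply(render(copy));  // off the master
  });
  worker.join();  // a real actor would detach; joined here to keep it simple
}

int main() {
  MasterState state{"{\"frameworks\": []}"};
  answerPooled(state, {
    [](const std::string& s) { std::cout << "reply 1: " << s << "\n"; },
    [](const std::string& s) { std::cout << "reply 2: " << s << "\n"; },
  });
}
```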
E: Okay. With this approach we would reduce the master actor load by, let's say, 30 or 40 percent. An extra improvement that sounds very easy would be to avoid the first trip through the master queue: when we need authorization, do not send the authorization through the master queue, but onto a separate actor that has a shorter queue. There is one other suggestion that came recently from Ben Mahler, but I think Ben will talk about it separately.
D: I was just wondering, along the lines of making a good deep copy of the master state, whether you guys had considered the possibility of incrementally pushing updates to a copy of the master state, in batches or something, to avoid a big one-time cost for the copy.
E: However, it sounds like a non-trivial, and definitely not a very easy, task to incorporate those updates. Pretty much, let's say we need specific structures in order to answer each state.json request; they are actually the frameworks and the slaves, and these structures contain a lot of different collections inside. Now, each time the master updates anything in those collections, we would have to issue an update to the separate actor, and I think just finding all those places, and making sure that the next time anyone modifies the internal state of the master that we need for the separate actor to respond to state requests, they remember that they need to send an update, or that it happens automatically, is definitely not a trivial task. So I would say the global idea that we probably agree upon at this time is that we should move off from v0 state.json to v1 GET_STATE and SUBSCRIBE.
However, because there are a lot of people using state.json, and the v1 GET_STATE and SUBSCRIBE calls are not very well tested yet, we decided to go for smaller improvements first, then think about sending updates later, and probably move to these two v1 APIs later. Does that answer the question, Greg? Yeah.
B: So you know, you have these writers here, a boolean writer and a number writer and so on. Right now, these things do the serialization: this one writes true or false to a stream; this one knows how to write a number.
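A minimal sketch of the typed-writer scheme being described (illustrative interfaces, not the actual code under review):

```cpp
// Small typed writers that each know how to serialize one kind of value
// directly to an output stream, with no intermediate objects.
#include <iostream>
#include <ostream>
#include <sstream>

struct BooleanWriter {
  void write(std::ostream& out, bool b) { out << (b ? "true" : "false"); }
};

struct NumberWriter {
  void write(std::ostream& out, double d) { out << d; }  // real code handles formatting
};

int main() {
  std::ostringstream out;
  BooleanWriter{}.write(out, true);
  out << ',';
  NumberWriter{}.write(out, 42);
  std::cout << out.str() << "\n";  // prints: true,42
}
```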
B: What it does is it spins up a bunch of synthetic agents with tasks and has them register with the master, and then it issues a state query: it does a v0 state query, a v1 state protobuf query, and I think some other variants of that, like maybe converting to JSON and so on. But the baseline here is v0 state, and this is a boxplot where the lower end here is the minimum.
B: I also tried turning on the single-instruction-multiple-data options, but it looks like that is actually a little more inconsistent and maybe a little slower. There's actually not a ton of those instructions used in the writer path; there's just one spot where they're used. Those are probably more useful in the parsing path, and this is the serializing path, so it didn't make much of an impact here; it kind of looks like it might actually slow it down a little bit. But it looks like this does provide a quick...
F: If it's tests, I guess it's probably all fine, but it probably needs a little bit of checking into. Okay.
B: UTF-8: I think the only impact this has on us is the files endpoint, where we will sometimes spit out binary data if people are trying to look at log files. In the past there were some issues around actual Unicode characters, because we don't handle them properly, but this would fix that part of it, I think.
If it's not UTF-8, it'll do something; you know, I don't know if it's doing the right thing, but it's doing something. We're also doing the wrong thing, so it's kind of hard to compare the approaches. I'm not sure yet to what degree I'll have to change things or just adjust tests and so on, but other than that, the integration is pretty much done.
B: Yeah, I think with this, along with the single-dispatch change, we should have a pretty substantial improvement for users; like, if we compared those graphs that you had to what it would look like with things taking half the time in the queue as well as half the time serializing. It should help us get a little bit more headroom.
B: Okay, so hopefully we can get this all into 1.7; I think that's mid-August or early August, I hope. So yes, this is...
B: Okay, we have only 12 minutes left, so I don't know if we're going to get through all this, but maybe I'll just call this the lightning round. Actually, how do I do this? So I'll just mention a few brief things for which we don't have any slides or anything interesting to show. The first was around JSON parsing.
B: First of all, we don't really use this in a lot of performance-critical places. The master does use it in some endpoints; if an operator posts JSON, or a scheduler posts JSON, then we do JSON parsing. It's not a lot of use, and it's generally not a high load from what users are doing today, but I did notice when I was writing a test that the conversion cost essentially doubles the time it takes to parse.
B: I'll mention this other one, which is that Greg and I have been making some improvements to metrics scalability in libprocess. This is mainly in support of per-framework metrics. There were some pathological performance issues; for example, we were doing an std::list size() call that made things N squared, since size() was still order N in some standard libraries. So we fixed all that. If we have time (but I don't think we'll have time) I'd show more, so I'll just punt on showing anything for that right now.
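For context on the std::list size() remark: before the C++11 ABI, libstdc++ computed list::size() by walking the list, so calling it inside a loop turns the loop quadratic. A sketch of the usual fix, tracking the count explicitly (illustrative, not the actual libprocess patch):

```cpp
#include <cstddef>
#include <iostream>
#include <list>

struct CountedList {
  std::list<int> items;
  std::size_t count = 0;  // maintained explicitly: O(1) size queries

  void push(int v) { items.push_back(v); ++count; }
  void pop()       { items.pop_front(); --count; }
  std::size_t size() const { return count; }  // never walks the list
};

int main() {
  CountedList l;
  for (int i = 0; i < 5; i++) l.push(i);
  std::cout << l.size() << "\n";  // prints 5 without traversal
}
```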
B: The next thing to mention is that there's been some various allocator performance work. I think the short of it (and please correct me if I'm wrong, Ming) is that in the scale test we were doing, we saw the allocation run time go from, I think, 15 seconds to 3 seconds, something like that. Five seconds? Okay, so something like 15 seconds to five seconds. And after gathering more perf data, there are still some really obvious improvements to make, so we're thinking we can get that much lower.
B: I asked them to include a few folks that have been interested in it, Greg and Gastón among them. I don't know, James, if you want to be on those reviews as well or not, but if you do, just let me know and I can ask. Okay.
B: The last thing, or the last few things. So Dario is on the call; I don't know, Dario, if you want to chat about this briefly, or if you have any interesting numbers to share, but there was an additional queue added recently, specialized to be multi-producer single-consumer, which is our use case for, I think, event consumption, right? Yes.
C: So the reason this has been added: there has been some work on getting lock-free queues into libprocess, and the original queue that was used there, I mean, it is really fast. The problem with it is that it does not guarantee the order of the elements to be the same when dequeuing as it was when enqueuing; I mean, it guarantees that per thread.
C: The messages could be out of order from within the actor context, which is a huge problem. To fix that, a sequence number was added, and so reordering had to be done when dequeuing the messages, which obviously ate into the performance of the queue. So I came up with this MPSC queue. It is heavily inspired by the implementations in JCTools and by the posts on 1024cores about a queue like that.
C: It's a FIFO queue structure and it guarantees the ordering; I think, yeah, it's linearizable, and it is mostly wait-free. I can share some numbers. On my machine (I'm working on a MacBook, four cores, eight threads, I think 2.5 GHz) I am getting, on a contended queue, using seven threads to produce and one to consume (it's a single-consumer queue), a total throughput of around 54 million messages, or operations, a second; that is around 38 million enqueues and 15 million dequeues.
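A simplified sketch of this kind of queue, after the well-known 1024cores non-intrusive MPSC design (illustrative; not Dario's actual implementation):

```cpp
// Producers link nodes onto `head` with an atomic exchange; the single
// consumer walks from `tail`. FIFO order is preserved across the queue.
#include <atomic>
#include <cstdio>

template <typename T>
class MpscQueue {
  struct Node {
    std::atomic<Node*> next{nullptr};
    T value{};
  };

  std::atomic<Node*> head;  // producers push here
  Node* tail;               // consumer pops here
  Node stub;                // dummy node so the queue is never empty

 public:
  MpscQueue() : head(&stub), tail(&stub) {}

  ~MpscQueue() {
    T ignored;
    while (dequeue(&ignored)) {}
    if (tail != &stub) delete tail;
  }

  void enqueue(const T& v) {           // safe from any number of threads
    Node* n = new Node;
    n->value = v;
    Node* prev = head.exchange(n, std::memory_order_acq_rel);
    prev->next.store(n, std::memory_order_release);  // link after the swap
  }

  bool dequeue(T* out) {               // single consumer only
    Node* next = tail->next.load(std::memory_order_acquire);
    if (next == nullptr) return false; // empty (or a producer is mid-link)
    *out = next->value;
    if (tail != &stub) delete tail;    // free the node we are leaving
    tail = next;
    return true;
  }
};

int main() {
  MpscQueue<int> q;
  q.enqueue(1); q.enqueue(2);
  int v;
  while (q.dequeue(&v)) std::printf("%d\n", v);  // prints 1 then 2
}
```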
C: Anything interesting here? I have some slight optimizations in flight that especially improve non-contended dequeues, and also dequeuing on an empty queue, which, I mean, is not a very interesting case, it's just a side effect, but it would improve dequeuing on an empty queue by an order of magnitude. Dequeuing on a non-contended queue, so when there are no producers running at the moment, would also improve throughput slightly; not quite as much, but...
C: It's the current non-blocking, lock-free queue implementation, but I can run some benchmarks and see how much better it would be. The queue itself, as I said, is hard to test against, because of the reordering that is implemented for the previous implementation and is not done anymore in this implementation. So...
C: It's just the current design; there's nothing that would prevent us from doing it at runtime. I mean, obviously, if you use a fixed implementation there's potential for better compiler optimizations, but I think that's the only thing; like, no virtual calls or anything like that. Right, yeah.
B: That was the only concern at the time. If it turns out that we want to actually have multiple queues, and you can select from them depending on whether I'm in a low-overhead situation or a high-throughput situation or whatever, then probably we have to have the runtime option. But I think the original thinking was that we would eventually get towards having this queue be the default and not necessarily have the other options. But maybe that's not how we should do it; I'm not sure.
B: There's lots of content for the next blog posts. If anyone's interested in writing one, let me know and I can assist; otherwise I'll probably try to figure out how to get some of these into blog posts, maybe like one mega-post with subsections, like what's in 1.7, or maybe just do it piecemeal, and I'll reach out to people for help with that.