Ceph Developer Monthly, 4 Aug 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Ceph Developer Monthly 2021-08-04

Description

https://tracker.ceph.com/projects/ceph/wiki/Planning

A

Okay, so hello, everyone, my name, is shreya sharma and I'm a gsoc intern with the dashboard team. For the past two months, I've been working on the project reporting sufficient from the safe dashboard. The problem statement was that when a safe user encounters an issue, they gathered all the relevant information and reach out to the developers using either the mailing list, or they file an issue with the self issue tracker.

A

The problem with this approach is that out of the users facing this issue, some of them opt out of reporting it due to inconvenience and out of the users that do report the issue, but all of them get a reply back well, uh as we can see in the funnel diagram that already out of the users that faced an issue most of them reached out, but out of them only some like most of them get a response back either from the developers or from the users and even out of those.

A

If the problem is genuine, they are. You asked to report it on the sufficient tracker, but again some people opt out of doing that due to inconvenience. So, at the end out of the users that face the issue, we don't have a lot of issues reported.

A

So this was the problem statement that we started it. Our solution to this was to provide the users an option to report the issues directly from the safe dashboard and the sep cli.

A

So this would widen the funnel as it would be convenient for users to report issue. So we expect more people to report issues now. This will also provide less engines while working since they don't have to you collect the information from terminal and you like go to gmail or they prefer or send an email file in issue tracker. They can simply report the issue from the cli itself or the safe dashboard.

A

This would also mean ultimately, a better experience and to report an issue. We take the bare minimum information required to report it. That is the project name where they encountered the bug or the feature request they have for the project. The tracker type, which is a bug, feature the subject and the description of the issue.

A

Our initial project workflow decision was that when our user encounters an issue that he can either use a cli command or the dashboard user interface, you would reportation set dashboard and the issue will ultimately be created in the safe issue. Tracker.

A

So for this, our initial approach was to use the python redmi library, but after the research we found out that it is not available in saint os. It's not a package there, so we decided to go for thor requests but uh even like the selfish tracker is based on red mine and is well documented, but even the curl calls that were given in the documentation did not work when I tried them uh using the current calls.

A

All I got was server and I was not able to track down the problem, so I so our mister suggested me to use wireshark what we did was we create a center python redmine. We use this library to send a request to create an issue and we try to fire fireshark to lock the request, but python redmine library does not let us use the http request and https request would be encrypted. So we were not able to use viral shark either, but ultimately we found some python logging libraries that we used.

A

We did not log the complete you first, but we were able to see that it sent the project id as a query parameter and not in the this request body. So we were able to overcome this order and now we're using python law request to create an issue.

A

Then uh a prerequisite for filing an issue is an account on the safe issue tracker like. Ideally, it would be better if we do not ask the users to create an account on the safe issue tracker, but the main reason that we, authentic, like we verify the account, is to avoid spamming. So we require the user to give issue tracker api key to avoid spamming and to authenticate the user when the request is sent, so they must have an account of the sufficient tracker.

A

They can go to the my account tab and get their api access key and store it on their machine which, and they would then be able to create an issue using the seftra api key.

A

Well, so we have two options to report issues the cli and the dashboard user: the interface using this fcli. We first set the api key using the command set dashboard, set safe, tracker, api key and then giving the file name. Where the api key is stored. We can then create an issue using this command safe, dashboard feedback create. We would then give project name, tracker type, subject and description. This is an example and we'll be having a look at the live demo soon.

A

Then we have the dashboard user interface in which the user can go to settings and he would have an option to raise an issue when he clicks on. There is an issue he'll be presented with this model, so he has to select the project name from the list of drop down project names. He will select the tracker name from the list of drop down trackers, which would be bargain feature he has to get the subject and the description also.

A

We decided to go for the model and not for the separate page, because when the user reports an issue, he should be able to see the screen where he encountered it. So it's easier for him to collect the relevant description.

A

And now, let's have a look at the dinner.

A

So this is my terminal. I can set the extractor api key here.

A

I had my api key stored in the file called the domain key. I have sent given the command safe, dashboard set, safe tracker, api, key right, mine key and my safe tracker. Api key has been updated. Now I can use it to create an issue.

A

This is the combined safe dashboard feedback. Create, will then give the project name where they're encountering the issue so I'll write file system here then the uh tracker type. I will give it bug. We can give the subject title as live demo and the description as testing and testing cli.

A

So I have given the command.

A

And we get the message back that the issue was successfully created. The issue id is 52056.

A

Now, if we go to my page, we can see that this issue was created in cep file system. It has the title as live demo and the description as testing cli.

A

Now we can also use the safe dashboard to create an issue.

A

So here, if we go to the reason issue option, we can select the project as it's a dashboard create a bug. The issue title is live demo and description is the test.

A

So again we get the message that the issue was successfully created and issue id is 52057.

A

So going back to this page, we can see that the issue was created in dashboard.

A

The future scope of this project is the ability to attach images. uh If, let's say a user is encountering a bug or an other that he did not expect in the terminal, so he should be able to add attach it with the issue. We also want to give the user an option to give description from a file.

A

Let's say the description is mighty liner, so it would be easier for the issuer to pass the file name, as it would be easier for user to pass the file name as an argument to take the description for there, and we can also add the duplicacy to it using the python. What do you mean plugin, which is a plugin for that mine?

A

So these are the future scopes of this project. This is all about my project. Thank you for that ending and I'm open to questions now.

B

Thanks a lot sria and yeah to the other teamlets, if you see some issues creating your project with a title demo, uh yeah don't be alarmed. This is just a proof of concept.

B

Does anyone have any questions for ria suggestions.

C

This looks great, I'm really excited about this. um I'm curious if there's a way- um maybe maybe more- of a security issue, but um first just to kind of have a pre-installed redmi key that folks could use that they would not need to create their own tracker account.

A

But we can do that, but the main reason for this was to avoid spamming. So if we give the users a red main key, then like it's up for discussion, if that's what the community wants, we can do that we can create a account that can be used by all the users to report the issue.

D

B

We could provide uh an anonymous or a public api key. Let's see that that's it's linked. Probably we will start having the you know, classical spam uh attacks that we had in the past, but maybe we can update that. I mean on a regular basis, just away that or but definitely I think it would be great for every user to be able to report this without and having to go through the process of creating an account getting approval and so on so yeah.

C

Yeah yeah, exactly like that, you mentioned in your presentation. I got some major stuff that often stops users from reporting things. You have to wait for the account creation type be approved and all that.

E

F

We are reporting.

E

From the dashboard or the cli, then the chances of spamming is less right, that someone is not going to intentionally spam. If I mean.

G

Well, it's that's right, but it is the key that is uh that we embed, uh um I mean you need to have that key embedded somewhere right. You know for the for the dashboard to use it.

G

If that key uh leaks, and then uh you know, we would basically need to disable it, and uh until uh everybody is upgraded, uh it would be just plain impossible to create an issue by the dashboard which is probably worse than you know, not having the functionality in the first place, because when the report, the issue button doesn't work, uh the user, who is already uh you know, suffering from uh uh their question misbehaving, uh would be even more pissed off than before.

G

So um I I, I think that maybe we could um do something smart there and kind of um require uh like find a way uh to verify that the issue is coming from the dashboard instance that is backed by a real cluster uh like with some entropy there.

G

But there's still like an issue that uh the the the the key is still going to be embedded uh and uh we will still need to uh you know, have a way to uh secure it, uh because uh otherwise, uh eventually we would get spam because people uh like we get spam even in the downstream uh uh issue reporting systems.

G

uh So uh people actually do like register and go through the process of uh you know, setting up the account uh which is much more cumbersome than than creating a sec tracker account and we still get spam there. So I need to be careful.

D

I wonder if obfuscating the key at some level would be sufficient. I mean I. I would assume that most of these places, where they're trying you, when they're spamming to redmine they're, probably like searching for any redmine instance, that's open or they're searching for red nine keys. But if it's obfuscated such that they actually have to decipher and reverse engineer some code instead, specifically in order to spam to specifically the set tracker that might be sufficient to.

G

You know the fact that it wouldn't be just be just obfuscated, but also like in order to de-obfuscate it. You actually need like a work in that cluster, uh like uh you know, and not just not just a dummy deadboard instance, or something like that. uh That would probably be enough.

D

Yep I have like two small comments. um uh This looks great, um but the dialogue where you actually submit the issue. The description should probably be a text box instead of just a single line, so you can actually have a virtuous description um yeah that one and then also once it's successfully submitted, um instead of just showing the issue number, it could just link to the tracker.

A

H

I would add, uh it might be worth to add the source so that we know on our side uh how many users actually use this.

H

um So if we create a another source um at the back end and we populate it, um it'll be easier to filter out afterwards and see how useful it is, which hopefully, it's very useful.

C

Yeah, that's a great idea might help with filtering spam. If you have somebody we're about to work against this as well.

F

A

Can you, please repeat that sure, I'm not clear what you mean yeah.

H

Yes, um so at the back end of redmine, uh we can create a unique source for this specific type of issue creation, which means it comes either from the cli or the dashboard in an automated way, and then we can pre-populate it in your project and when the issue is created, you can see uh and in red mine issue itself the source field populated, and we can choose a name for it. I don't have a good idea now.

A

Not that this sounds great, that's a great way to analyze how many users are actually using it and will be able to get an idea about if you're, getting like how this project turns out. If it's actually being used for good.

H

Yes, that'd be a great indication um to see um my real data, so yeah.

A

Just adding two years, if we are pre-populating the data, then uh uh since we are doing that from cli, we can also like um I mean this project only integrate like the system, specific uh things of the user and maybe the core dump or some stuff related to debugging.

A

That might be helpful um automatically uh for in a formatted way, read mind and like uh here we said that we can have a specific field uh for uh this project in tracker, and maybe we can streamline the input better.

A

That way, spamming could be avoided as well and have the user define the uh and describe the problem um along with all the metadata that might be useful for a developer to investigate the bug.

A

But yeah it's a later scope thing I think yeah we had this discussion in us, stand up about how this can be done, or I don't remember if it was a one-on-one metamate, but we had this discussion about how we can correct my the metadata and since I'm almost done with the front end and the cli once I clean up the code, I'll definitely do some research into how we can collect the state.

A

Awesome next project congress.

E

One question uh like I see the cli command, you are using it. It's like called feedback, I think, but wouldn't it make more sense to. You know, use it as like create issue instead of feedback, because what the user is actually doing is like creating the issue rather than giving a few feedback.

A

Like we can change it, if touch can discuss on it and if that's what everyone wants, we can change it.

H

And just to reiterate, um seiju also mentioned that, um in the comments uh it'll be great to add um the affected version uh for that specific issue. This way uh it makes it um more informative for the developers to understand so.

A

I'll try everything back.

H

I have just another quick question: please um I put um in the chat the link to the python redmine project. I'm just curious. If this is uh the library that did not work well with standards for you.

A

Yeah, the problem is by fit python. Guideline liability is that it is not packaged for centos.

H

What what centos version was that um I need to do.

B

Yeah, I think we were looking for that in uh april 8. I think, but by the time that we checked that we didn't see the package for epilate, have you found that in innate or are you running 9 or.

H

um um I think eventually I I I also use that, but I think I installed it with um with beef, eventually um yeah, so I I didn't know it wasn't packed fact for that um interesting thanks for mentioning that.

B

Yeah, that's a common issue that we are facing, but I mean so far. We still need to rely on packages, but let's see, if we in the future, we can somehow get rid of it or find a smarter way of installing fresher dependencies.

B

I have a question and well it's not only for serious for everyone um based on the reception. uh From this feature, I was wondering if it would make sense to move part of the functionality to a separate module or something because right now, this would require the dashboard to be running and I'm wondering if it would make sense for users not running the dashboard to still be able to report issues so maybe having some of this functionality in a standalone module and the dashboard just consuming the cli interface or something.

C

Yeah, I think that makes a lot of sense.

B

B

Any more questions, suggestions for sria.

C

I just want to say thanks again. This is great.

B

Thanks a lot yeah.

I

Okay, um so my name is aryan and I'm I I've been working with steph uh under google summer of code for the past three months, and I was my project was visual regression testing of dashboard, uh so today we're gonna discuss like uh so. This is what we are going to discuss today. uh We are going to discuss types of testing and we are using in this what we are already using in the safe dashboard and even even after we have like all these kind of testing. Why do we need a more yes.

I

And yeah, so, even if we have all these uh already existing testing, why do we need more tests and the work? And what exactly is uh visual impression, testing the workflow for it and the different tools available for it? The criteria I use to select the best tool for this? Have dashboard and we'll discuss the most popular tools and why they are not why some of them are not useful and we'll discuss. Pixel, graphics, bitmap matching the algorithm how it works.

I

The techniques we use in that.

I

Three kind of tests we're using uh give me a second in my.

I

Yeah so currently in the safe dashboard, we're using three kinds of testing we're using uh at the bottom we're using unit testing, and it's really good. If you want to test small functions and small components and unit tests, should focus only on the described purpose and shouldn't try to test other things in the same blocks and uh yeah we're using on the second level we're using integration testing uh where we combine all these unit tests and uh see how they work together.

I

The purpose is to expose false in the interaction between the integrated units and on the top level we're using end-to-end testing, which is designed to mimic the behavior of the user when interacting with the application.

I

Also, the test should verify that certain items are displayed, as a user would see them when clicking through the ui, but uh this is where it fall sometimes falls behind yeah. So, instead of dashboard, we already have this whole testing therapy setup and running. But the question is: why do we need visual regulation testing?

I

The first thing is visual: bugs are rendering issues and rendering virtualization is not what functional testing tools are designed to catch. Functional testing measures, functional behavior and that's what it does so over visualization testing helps us is also. It helps us in technical.

B

Sorry argent uh I'm not first intro about it. I think at least my site, I'm not receiving the representation, I reconnected, but it's still not there. So I'm not sure if I mentioned that. Okay.

B

I

No, I can, I can share again yeah.

I

Okay, is it available now.

B

Yeah, it is now.

I

Now sorry for that yeah uh yeah I'll just begin from here, so instead we already have this whole testing limit that we see here. So the question is: why do we need visual regression testing? So the first thing is visual. Bugs are changing issues and rendering violation is not what functional testing tools are designed to catch function, testing measures, functional behavior and that's what it does uh so. Visual regression testing also helps us in picking up css bub, which we don't have any testing, for.

I

It also helps us in it's also um where sometimes user is not able to see critical buttons uh or are they or they are in on unclickable positions and stuff. The other thing, uh the other very useful feature of this kind of testing is. It helps us in making very informed decisions on breaking changes like when you have, when you're doing large, refactoring or just upgrading frameworks or libraries which, when you're moving from bootstrap v4 or to these bootstrap files and stuff.

I

Let's look at some of the examples here in the first image we have, we can see that the login page is awkwardly uh positioned. This is a css buck, resulting in a user, uh possibly not being able to access the elements, uh but this will pass all our all our uh testing pyramid, like all the tests, because uh the elements are still existing and the second image is of the official material gui website. Material ui is a very well very popular ui framework developed by google.

I

You see that the two buttons are awkwardly aligned and you probably won't have issues clicking it, but it isn't a very good experience for the user and also hurts on the reputation of the framework itself.

I

uh Now. Is this an example of how visual regulation testing can help understand how framework or library upgrades can affect the webpage like?

I

In this I tried to upgrade the step dashboard bootstrap version 4 to bootstrap version 5, which caused an unintentional change like, as you can see here in the dips, you can see you can easily catch them and with some of the visual testing tools, you can even see what's what went wrong in the divs itself, like you can just click the diff mask and see that the form group class has been removed, which is exactly what will happen when you move migrate to bootstrap version 5..

I

This is the breaking change, so it's really good for that. So to understand now, let's move ahead and understand what visual regression testing is so to understand what visual diction testing is. We need to first understand what regression testing is. The regulation testing is used to verify that any system changes do not interfere with the existing features and our code structure is the reason. Recreation testing uh is a part of almost every two suite in software development.

I

It is common for devs to change and change or add a section of code that and having it unintentionally disrupt something that was previously working. Just fine visualization testing applies the same logic, but confines testing to the visual aspects of the software. So only the visual aspect, in other words it. It checks that the code changes do not break any aspect of the code of the software's visual interface.

I

A visual regression test checks also checks that the user see will see what, after any code, changes have been executed by comparing screen shots taken before and after the code changes. This way, yeah. Okay. So now, let's talk about how visual the regression tests generally work, so the work of the workflow. For that usually looks like this.

I

uh You initially run the tests and since you already don't have any existing snapshots in your code base, you go here and you take new screenshots and the screenshot will generate fast with, but because- and these state is called as baseline screenshots and whatever snapshots you capture here, when we consider baseline screenshots then uh to test, then you run the tests again when in like in your in your feature, branch or something, whenever you're making a change to the code piece, you run the test again, it captures the screenshots.

I

And now, since you already have the baseline screenshots in place, you could you tip them and when you tip them they're, either equal or they're, not equal and if they're equal, then it's fine, your code looks fine and then it doesn't have any visualization, so it will pass and if it doesn't, it will fail.

I

uh So now, let's talk about the different tools you can use for visualization testing. There are plenty of tools you can find on the internet to implement these tests and to select one two for the step dashboard. I have to come with come up with a robust criteria, so this is the criteria I came up with there, um so the criteria is, uh it has to have like minimum manual effort, easy cyprus integration, because that's what we're using uh entirely for our end-to-end testing it should be.

I

It should have a very good open source support and it should be able to ignore browser platform and india leasing, corporates which we'll discuss uh in detail in the next slides. It should also support responsive testing, because we do use some kind of responsive testing like we. We do have some responsive features and functions.

I

um It should also be able to very well handle dynamic data, because uh dashboard has a lot of moving parts and capture screenshots uh moving screenshots is a big challenge. The tools must have a really smart way to do that, you could also introduce reports with comparisons like this and a really good.

I

We want three before screenshot the baseline screenshot, the checkpoint screenshots and the tiffs in the report. We also need a very good cicd integration so that everyone can easily access these tips and see what went wrong with the test.

I

I

We don't have a lot of. We don't have a lot of tools that passed all these criteria, but so we'll just discuss some of the most popular tools and then we'll discuss the tools that work with the best and text all the boxes. For us, the first in our list was phantom css, which is also the first visual test testing tool based on javascript, um and this is where all the visualization testing tools started to come up with.

I

uh So what phantom css is does is it takes screenshot capture it captured by caspases and compares them with baseline images using resemblances, and it generates those diffs with the phantom css, but it's really outdated. It was unmaintained since it has been unmaintained since 2017.

I

It only works with uh it doesn't have a browser support like you, can't see it actually in the browser you can just see if the divs failed or not in the headless mode. It doesn't have good test management tools and no reporting tools.

I

Next, on our list is uh faith which is developed by british forecasting council uh pvc built for uh performing visual dips and responsive websites. There was a successor to the phantom css.

I

It also is really good on blogs and stuff, and it also allows authors to test various viewport bits like very good, with responsive and also has this really cool feature that um it will ignore all these platforms uh platform, specific nkl, using differences. It is the most part of the false positives, and ever since it was written by the bbc it it received quite a bit of publicity and yeah.

I

So these are the pros and cons here it doesn't. It has a really good nti leasing. Support like uh it won't have a lot of false positives, but even this was not really good for us, because it wasn't able to handle the dynamic content well and no reporting tools and was only able to test full page screenshots where we needed to test uh small components as well individually.

I

The next on our list was type resume snapshot, which is the first well the first intuitive approach that uh intuitive tool that we use uh since we already use the endpoint for our entwin test, we're always using cyprus and it plugs in really well with our infrastructure.

I

It is really a fully open source and it isn't, and all of the preferred ones were open source as well, uh but the biggest problem was uh it was generating a lot of false positives. uh It wasn't able to ignore browser, specific anti-aliasing differences. I wasn't able to handle dynamic data.

I

It was also unmaintained since a long time at the cyprus different tendencies, the cyprus dependencies have a lot of conflicts with our already existing uh surplus dependencies.

I

uh Also, the biggest reason was: I talked about india losing offsets and all that, so, let's talk this is one of the most uh east will have like the same uh problem with them. uh The problem is that it all uses pixel, matching and- uh and one does not simply do with map comparison. uh It uses pixel by pixel matching with not comparison, we'll understand why.

I

Let's see how bitmap comparison actually works, pixel by pixels with map comparison work in bitmap comparison, a bitmap of screen is captured at various points of a testing and its pixels are compared to a baseline, bitmap and comparison stays and they hydrate through each pixel pair and then check of the color hexors color. Hex codes are same if the color codes are different. It raises the visual work and if they are same it passes them.

I

While these are very fast, they have a lot of problems, so the biggest problem was. It was prone to the yeah they're prone to a lot of false positives. The biggest one of them was.

I

I wasn't able to handle a lot of dynamic data uh when you have dynamic content that changes like our step, dashboard where you want to check to, and uh you want to check um in a in a dynamic environment, you want to check the layout is laid properly and the alignments and everything does not overlap and everything.

I

The elements don't overlap. The pixel comparison tools can't test for these cases and dynamic content is probably not the biggest downside of this pixel by pixel. Matching the biggest choice was this: this happens, as you can see in this picture like you can see, uh they look almost identical to human eye, but two pixel by pixel matching algorithms cannot generates a lot of tips, so this happens because of a lot of reasons, and these are not even dynamic.

I

Images like it's not even moving is static content, and this happens because um of a lot of reasons like uh fun, smoothing, algorithms, image, resizing graphic cards, different graphic cards and even rendering algorithms generate a lot of pixel differences.

I

uh As a result, a tool that expects exact, expect, exact pixel matches between two images uh can always be filled with pixel differences so to our rescue. We have ai for virtual regression, testing and, unlike pixel by pixel comparison, ai powered, automated visual networking tested tools do not need special environment that remains static and that remains starting to ensure accuracy and hence have a high degree of high degree of accuracy and even with dynamic content, because the comparisons are based on relationship and not simply on pixels.

I

So, let's look at some of the ai powered visualization testing. First, one is percy and percy uh is one of the most popular tools that is used a lot in these days in the industry um it has. It is also known as visual, resisting these tools are also known as visual regression testing as a service.

I

um What they have. They use automated uh artificial intelligence to compare their and learn over time like if the tips are failing, or should the dips fail or not, um it has so. Percy has an sdk for cyprus and others.

I

I said it was really good for us, like you, can easily just plug it in and it works. You just have three methods now we'll see how the test sample tests in the next slide. It has good uh gate integration.

I

Some of the constraint was uh since it wasn't open source. uh Some of the constraint was, uh it has 5000 screenshots per month and uh screenshot from a screenshot is rendering of a page or component in an individual browser over different, responsive widths, combinations and everything. And if you want different browsers, uh it counts as one screenshot too, and it has 30 bit. It only has 30 day build history, which is probably fine, but no I'm not really so in our.

I

In our experience uh it was able to very well handle the dynamic data and the moving parts of our dashboard.

I

It has really uh intuitive dashboard, intuitive ui and very easy to understand, but I was still logging in by lagging behind in a lot of things that apple tools was able to fill those gaps so yeah it is a bit more powerful than percy. It has exhaustive documentation and easy. I see the integration, um but some of it. Even it has some of the constraints uh it happened. It only supports like 100 100 checkpoint screenshots per month.

I

uh Checkpoint screenshots are taken after you take the baseline screenshots, but this is on the free plan and for the safe dashboard we're using an open source plan that apple tools offered us and in our experience it was able to really handle the really well.

I

It was really good with the dynamic uh content it was able to generate ignore regions using code and even with the qe, we were able to add the ignore regions. It is very accurate.

I

Had the minimum number of false positives and it has really good thinking some integration that we're using, so this is one of the sample tests uh that it is merged, actually, not always uh so it is really simple. You only have three methods in amplitude, you have cy dot, eyes, open and see why dot check window and then eyes close. So if you're for starting the app tools you use eyes open and to check window. This is how I check the content.

I

You have is a check window and it should be placed like where you want to capture the screenshot look at which point of the test you want to capture screenshots, and you just close this and you you see why that is when you're ending the test- and this is another test um this I put this test, because uh this is the test for dashboard component.

I

It and dashboard component has a lot of moving parts. It has a lot of charts which are constantly moving, so what we can do to mitigate that is just use. We only use a selector, we create an ignore region and we just pass in a css selector.

I

It's really simple to just add ignore regions, and this is the articles config like you, can see, um you can give test concurrency, so it and in free plan they support like 10 test concurrency, and you can just provide like an array of uh different browser, widths and just different browsers and different responses with different viewport width and orientation. You can even test it on mobile devices and everything.

I

So we have a lot of dynamic content. I already talked about it, but in our safe dashboard, things are supposed to keep moving, and this is one of the, and this is one of the dynamic element we have in our dashboard canvas element, uh a chart element so to stabilize these screenshot baseline screenshots and mitigate false positives. There are two methods that we can handle. This one is that we create an ignore regions, ignore region that I already use in that we are already using in the dashboard component and.

I

And you can create these ignore regions by either coding them, like. You can just add a css selector or you can just give the position where the ignore region should be, or you can go in the testing and using qe. You can just marquee mark the ignore region and the next. The other method is by changing the match level.

I

Now, by changing the match layer to layout, we can switch to different match levels within components and tests. uh Instead, it can check you can check content, font, layout, color and position of elements and content. It can check the and it is very similar to restrict, but can check, also check for colors uh it can. It will ignore color changes.

I

uh The last option is layout match level. uh With this match level. The eyes matching engine ignores differences in the actual content text and graphics, color and other style changes.

I

This much level is most effective and used to validate pages with dynamic content and right now, in our record, which we're using uh tech matching, but uh even exact matching has worked really well for us and uh I recommend, like trying out different match levels, to see what you really need in my experience, like uh layout option is just used very good for uh like when you have a blog for blog page or a new site, or something where you have different blogs coming on coming in every day and something you can only just see, you can only check the layouts and stuff so as we are reaching the end.

I

uh Let's discuss what uh visual testing is not it. uh What visual testing is not is, it does not verify the logic functionality uh and you can't replace it with functional tests. Functional tests are needed and which should then can't be used to stick them and they're. Also not typical fully automated test suite. You don't always want the passing of failing value.

I

Sometimes when you, when you make some changes to your uh mate, make some changes you will have to go to the apple tool. You'll have to go to the session and manually check the tips.

I

Let me before this, let me demo, you.

I

So this is one of the full requests that I made for this demo in this I have migrated the login component to bootstrap version 5, but we can see that.

I

Let me show you in the input version 5. Our login page will look something like this, but all our tests are passing here. You can see so this is really hard to catch using uh already existing solutions. So we have to use visual testing and you can see. Visual testing was able to actually check the devs easily.

I

You can see it was able to find out this and to see what exactly went wrong. You can also use the root, cause analysis and you can select yeah.

I

You can click on the mask and see what all things are not there in the screenshots. So this works really. This works really well with uh our testing infrastructure and is able to catch the visual test really well.

I

What's next, for me,.

I

So next I'm gonna write more tests for the uh different uh dashboard components and uh I'm gonna and I'm gonna be uh writing uh contribution. You can help in that as well and uh I'll be writing documentation for that. So you can documentation to write how to write more tests and you might have seen and uh next, as we also I'll, also be working on the ci cd pipeline, integrate that in integrated well with the jenkins pipeline.

I

All right! So that's the end. If you have any questions.

B

No questions for rajan suggestions.

C

I guess I'm kind of curious, um not I'm not being aware of um that word development, so much um how much, how how often or what proportion of the bugs do you find tend to be these kinds of visual bugs versus more functional, bugs.

I

um Sorry, can you repeat the question I.

C

Well, in terms of like your front development with the soft dashboard, you haven't had a sense of like how what proportion of bugs would be visual bugs, as opposed to functional bugs.

I

um There are a lot of bugs like one of the biggest thing that it can help in is uh detecting new. When you're upgrading uh new frameworks like you have in since it's a javascript project, you have to keep moving and you have to upgrade it really fast, like you're, using cyprus version, 4 and cypress version 5, and uh now, with subscribers version 7, and also we're using bootstrap version 4 and since we'll have to constantly upgrade our framework in libraries.

I

uh We can see what changes are gonna come up in ahead of time and before merging it and see if something's gonna break or not it does. It really helps in a lot of.

C

Thanks that makes sense, yeah.

B

C

B

C

See how it would help done with uh remembering new versions all the time and nothing is bugs from even being merged in the first place.

B

Yeah, the dissolved ones are really hard to catch. So that's, I think this addition is going to help us a lot, especially uh currently we have uh tried not to migrate to the latest stuff, like the latest angular, the latest bootstrap, ui, etc. Just to avoid this kind of regressions, but with having this kind of safeguards, I think we can more safely.

B

uh I mean keeping in updated with these libraries because every time that there's a new twitter and senior bootstrap ui, I think we have to perform a massive refactoring and check everything that everything's okay, so yeah, that's that's, really tricky and also it sometimes works well with some devices. But then you get a report from someone running some strange browser, that's very broken, so yeah, that's. Hopefully at least we will cover uh firefox and chrome with this, and that will be a starting point, but I mean.

I

Best thing is: we can see like different browsers and stuff. If someone says that the dashboard isn't working in chrome, we can just open it in our api tools session and see how it's how it looks like in their browser.

C

Awesome, so is this going to replace the existing kind of functional ui testing in some sense, like there's, like some other that that exists in the end-to-end tests? Is that correct.

I

um No, it is, it won't be able to replace the functional tests at all. uh It is more. Visual tests are more compliment to that complement to um functional tests, because uh visual test visual education tests only as for the visual aspects of it and not the elements and everything, and if even if they are clickable or not, and the different states of the elements and stuff, it's not gonna more of complement to the function, type.

G

I had to step out for a bit, so I may have missed this, uh but uh are aren't we going to run uh into the uh limits imposed by the um um you know by the uh by the cloud offering there if we incorporate this uh into our regular jenkins tests, which are you know, quite quite heavy and run quite often.

I

Yeah, so we did calculate on the number of screens, screenshots we'll be catching like, based on the number of uh pull requests we already receive on the dashboard um and after tools. uh Open source plans gives us 10 000 screenshots per month, which is pretty liberal and like previously, we thought there was 5000 scene shots for one, but uh even then it was really good.

I

So yeah, I don't think we'll be hitting that very often still seeing like um we'll check this month as well like, uh if I don't think so, will be, and even if you want to migrate to different, uh uh let's say you want to migrate to percy from aptitudes. It's just changing two lines of code in every test. So it's really easy to migrate to other tools.

I

B

Yeah, let's see we did a, I mean a prior calculation, more or less the amount of screenshots that we would take and it might work. We will probably need to tune that a bit, but let's see in any case well, I had some conversation with the appleton's team and they I mean initially granted the open source plan, let's check if they might be flexible in the future. I know that at least the ansible shower team.

B

They are using uh apple tools as well for testing the the ui and they are using the open source plan so and I'm sure they are having quite a I mean high throughput and of gears and so on so yeah. Let's see this is just a poc for the diamond so but yeah, let's explore this and.

B

Okay, more questions.

B

Okay, so thank you very much sergeant really nice work.

B

And we have now the last uh demo from the dashboard uh interns, so in this very case, um whatever are going to talk about the chef manager and their experience trying to or catching that. So uh what berry are you ready for the demo or this presentation.

J

Can you hear me.

B

Yeah yeah, okay,.

F

All right, hello, everyone- um this is wadalkari and presenting with me here.

F

So we're going to present what we have been exploring, analyzing and developing developing in the last four months. We started early april, I believe so with caching and yeah, we going through you what we came up with the ways we have done it and how how we did it so um why we have started this issue. First, we have been receiving feedbacks of some users.

F

um The module commands were taking too much time up to seconds. The cpu were reaching the peak, and sometimes the model commands were unresponsive and also we could help the support team to debug the underperforming. Self-Manager api calls that behaves like a black box without any further information.

F

So how we did it, we have divided the work into two. I was working on. The python side and perio was working on the c plus plus side. On the plus python side, I have been exposing the api methods through the cri benchmarking, the api methods and multi-threading on the c plus plus side.

F

What we did was um manager well pressing, the manager model, testing, injecting fake data, osd map or pg maps and so on, and we have work on shared dtf cache from the c plus side, because we have thought it would be better to have a shared caching than caching each module and, for instance, and civilizing data. We have been using file format and just information. We have tested both and came up with different results.

F

So um how we expose the api method through the cli with well uh with this command we can expose. Well, we can see all the commands the mine is with dash. H is the help command, so we can see all the api methods, as you can see uh the method and description about how to use it.

F

So if we expose the, if, if we expose the osd the or the get method osd map, we can receive this in a json format, as you can see, with manager, api git osd map, and now I'm going to talk about different types of benchmarks based on and performance, diversified performance component level and system level. Benchmark component level is distinguished, specific component, which we have used and system level benchmark, evaluate the overall performance of running application.

F

This can be divided into categories based on composition, synthetic benchmark evaluate a particular capability of a sub subsystem which undergoes a component level benchmark and application benchmark measure the overall performance of system, but we have to take in and consider some rules to do the synthetic benchmark which understand the composition of the benchmark, appreciate the factors contrapting the results and determine if the benchmark functions are typical of your workload and environment.

F

So, um with this command we can expose the benchmark.

F

The api call will benchmark an api method, uh safe manager, api benchmark, get osd map number of calls, is the 1000, for example, and number of trades is the last parameter we passed. The osd map is the data we are passing. The the get is that I model on.

F

Regarding this we have we have. uh We have seen some design flows well, as I said before, how to debug the manager modules for the developers. Safe managers take quite long time that you take quite long time. The cpu reached the peak and we have realized. There is a missing of pagination to filter, unnecessary data from the maps.

F

So uh for for further analysis and data, I will leave you with very so to explain more what we have found.

J

Hey so for now, where we will talking about the results that we got from caching, the monitor module and if you don't mind having a headache or you're immune to it, you can look at the pr that you can see on top, and so basically, this is the structure that we currently have for the manager demon. And basically, you can see that we have some maps on the left side in the manager.

J

These are inside the cluster state and we get the information from them from the manager get method.

J

So basically, what we do is called mhr get osd map the get python format with the pi formatter and it converts it converts it to a pi object and returns it to the monitor module.

J

So, for example, with osd maps, there is a problem that is basically that we have two key pair, give value pairs that increase with the number of hosts and, as you can see, one osd looks more or less like this, and it's basically three kilobytes and one thousand of the country, it's more or less three megabytes.

J

So that's that's going to cause some problems to solve this. We decided to go with a cache in the ziplockplus site, so we can share the cache between every module, so we don't have to store every call in every module and basically we put the title guys there and to do this we had to solve some problems. The first one was that the some monitor modules were modifying the pi objects and as an example, there were some modules that deleted a value and then another module tried to get that module.

J

So since we are using pi objects and the this object is shared between modules, if it's cached, it causes some signal errors to solve this, we tried different solutions and we went for the first one, which is the copy, and basically because this is the this is way we can do it right now, and the others to the mutable and copy and write were a bit tricky since in python, you have to traverse all the object to make it immutable of a copyright.

J

So we first try to benchmark the copy methods that we could gather. The first one was the copy. This is this one with python and it was a bit slow in comparison with, for example, json and picol.

J

They sound a bit faster, but because, since it's a binary serialization tool, it's much faster, as you can see so this time we tried to to see how much worse it would go from the default implementation that we had to pickle copy implementation, and it was 71 percent slower than the current implementation.

J

Then we cast it and it was 31 faster than the current implementation, which is a bit better, but not good enough. So what what we thought about was changing the formatter to adjacent formatter. We we went with the json formatter, basically because we already have the implementation and it was faster to implement this case.

J

So whenever you get a dvd map in this case with the json formatter, we serialize it with the deformator and then we deserialize in the python module.

J

So we write again the same benchmarks and with the json formatter visualization. Without caching, we got a result of 33 percent slower time than the default implementation. That's with threesome, 3000, thursdays and well. We cast after that and we got 56 percent faster implementation.

J

And basically, you can see here the overall improvement with the caching and with json formatter.

J

Also, we try to implement a pineal serialization into what I like another formatter, that used binary civilization, and here you can see there are tons of them and we went for message pack since it's it has a support for a centos and m federer and piccolo didn't have 4th century's. So we didn't go with that at the end, we made it work with death, but it didn't go fast enough.

J

We think that the problem was probably because the when you stream data with message pack, there wasn't super with poor dynamic containers, so we had to make a quick fix for that and at the end, didn't work as fast.

J

Also it's not it is it isn't actively maintained. So that's a problem.

J

Okay, now we are going I'm going to show you a demo of this. So basically, here we have one thousand of this. As you can see, this is this one thousand three come from an injection tool that we developed and basically I'll, show you how it looks like to.

J

Inject some load to the monitor module and without cache, and also with the json formatter, and so basically we're going to run this. It will generate some load on the monitor module and, as you can see, this takes a while to load and yeah. They will finish now and now we are going to try the same with the same load but with the gaussian.

J

But it should work well.

J

About future improvements, uh we think that we need some fascination, because we are retrieving the whole. The map, for example, instead of if we wanted to get, for example, 10 noise. This only we have to get the 1000 towers this to bearing some problems.

J

Bringing some realization- and this realization with json and another binary format, they can bring us to the couple, the monitor modules, so we would go, we would have to use more soup interpreters basically also, we thought that it should be automatic performance should be automated and as an improvement to these formatters, we should bring binary resolution fertilization that we couldn't do successfully.

J

Also, there are some desserts, this realization for json that are more efficient, that we haven't tried.

J

As a conclusion, at first, we saw that json formatter with occasion is 71 percent slower, but if we cash it with 3 000 like this, we get a boost in with 56 percent faster than the pi formatter and, moreover, we we can decouple the modules if we use json formatter, for example, as we we've seen some problems with interpreters and to finish the I. We think that the question elevates the heavy load problems, but it doesn't solve the entire problem.

J

So that's basically, if you have any questions.

K

I had a good question, um so the automated uh performance testing bit that you talked about, do you have any ideas or thoughts about you know how you want to implement that or where you want to implement that.

J

Yeah one I I want to implement that in the monitor module. Basically, so it's easier to debug performance problems.

K

Yeah we've talked about this- I think in the past, like you know, have something run in the background, which would uh um you know which tells us how how the performance is over a period of time and that's something we could also integrate with the telemetry performance channel in future.

K

That's so the telemetry performance challenge is essentially like a telemetry module. It has a performance channel that we are working on so uh currently uh we are trying to capture more like osd, specific and pg, specific stats, etc.

K

So if we have something, that's running performance, benchmark uh benchmarking, the manager in the background- maybe that's something we can incorporate.

K

L

Do you have any ideas, what kind of tests that you'd like to run.

J

Well, we rewrote more testing to the manager that these were the ones that we needed, but there are some missing for different maps and yeah, basically, that we haven't tested all the cluster state maps that there are the different parts of the modules we only have tested with osd maps and pc maps for the monitor, get method so very strong to improve there.

C

If you're late, I was wondering if, if um you'd look as well at um instead of catching trying to reduce the um that the fields you needed to be cached for these structures, like the pgmf, for example, has tons and tons of information that I'm not sure the dashboard needs. All of that all the time.

J

C

um I guess the question is: have you thought about um trying to reduce the amount of data in some of these structures like the like? The pg map or ost map um have many different fields that we may not need to use. All of them.

J

Yeah, I guess that imagination, we should perfect related to the creative scene data that we sent to the modules.

J

Basically because, as you said, there are tons of information that we might not need.

L

Josh we've seen a similar thing in the rgw cls classes in the osd, where we need to filter based on a very small set of fields, and we end up decoding a much larger data structure and then re-encoding after we filter on those different fields. It might be a good, it might be beneficial for us if we can re-structure the way we think about these, so that we really focus on specific fields that we need and figure out how to kind of avoid dealing with lots of extra data.

L

That's not useful, for whatever particular need is.

C

Yeah, it was in that a kind of an ad hoc way with the pg map and the manager for other modules. We had added extra methods to get just particular pieces that they were concerned with, but it might be worth thinking about any more general approach.

D

um On one of your earlier slides and when you're talking about pi objects, one of the problems was that um they were modified. I'm wondering what instances did you identify where they're actually being modified they're being fetched?

D

If we could, in principle eliminate those, then.

J

Right yeah, uh I don't remember exactly where you were, but I think it was in two modules, but I cannot give you the specific answers right now.

J

There weren't tons a lot of them.

B

Yeah yeah, the thing here is that you cannot prevent uh in the future this from happening. So that's the issue with with python that you cannot really create immutable objects and it has to be a best effort from the developers to you know, modify these structures. Otherwise the consequences will be catastrophic here.

D

Is there a way to like poison them or mark them, or something so that, if or mark them immutable and they're a way to like mark up by object, immutable or something yeah? That's something.

C

They're, like overwriting stuff, like the methods that um accessories and and and methods to detect when enabling just modified.

B

Yeah, the thing is that, as you are, are returning a nested structure, anything within that structure would be modified. So if you want to do some that kind of thing, you need to traverse the whole structure, which is smaller, the same penalty as you're paying when you are doing a copy that you need to traverse the whole structure. It's deeply nested, so perry was trying that using immutables and also copying right approach. But in the end I mean, if someone is accessing a a list.

B

Nested in you know three levels inside of a dictionary, you still need to grab that list and replace that with a frozen list or some kind of smart data structure too.

D

If we did, if there was a optional mode, that would do all this extra work and be probably super slow, but that would um either go detect modifications. Then we could do that in all the testing and then in production. You would turn that off.

B

Yeah, they could think of the json approach and it can be replaced with a b zone or binaries on or whatever some other more compact interfaces that we can decouple the I mean at last. We can decouple the manager and the sub interpreters thing, so we can have modules running outside the manager maybe remotely, so that might be a first step towards that yeah. Okay, coming back to the previous question from mark and just one possibility, the api is not very fine-grained, so we are just returning these big chunks of data.

B

One possibility would be maybe to allow passing some uh x paths like uh you know expressions, so you can filter things like the same as the jq tool or the uh xpath expressions, for example, where you can define uh filter out fields like uh only return these fields or that's not a simple thing, but it allows us to you know just return the data that the uh consumer- and this is actually instead of returning everything.

C

Yeah yeah, I guess it might be a little tricky if you, if you wanted to avoid the um like jason encoding overhead in the first place. It applies that kind of filter, yeah.

B

Let's see that you just want the first level in uh next dictionary, so you just return the first level of data, so maybe we can use some syntax just to say that or maybe you just want to access some specific uh property in a dictionary and if we can agree- and we can use some existing language for that- and there are a couple of them, but we can use our custom. We just want to extend this behavior.

G

I think you mentioned that this is a ttl cache uh which I assume stands for time to live, and I think you mentioned 10 seconds. That's what I've heard, but uh you know structures like hd maps, uh don't usually change that often does it mean that we, you know, do the work of recaching uh every 10 seconds, or am I missing.

J

Well, you can could set it to more seconds, but I did it with 10 seconds, because usually it doesn't take that much longer for the benchmark.

D

Would it be easy to change the um the cache validation or whatever to instead just see if the epic matches? If so we use it and if it doesn't then generate, I'm not sure how the code is structured, but.

G

That might be yeah. These are versioned data structures. Each of them has a unique version, and uh sometimes they like there is a burst. So you could see um you know in five seconds. You could see like five new versions, but uh under you know normal cluster operation. The same map would persist, for you know.

M

G

Persist for hours, if not days, so uh this is uh you know, throwing out the uh the uh the cached uh object and every 10 seconds is definitely not ideal.

B

Well, yeah, actually, the object won't be. Map won't be recalculated after 10 seconds only if someone actually request that, so the cache entry will become still and only some some module is requesting that data will be recalculated or reconstructed but yeah. I agree that we might improve this further.

B

Any more questions for very well.

B

Nope, okay, so thank you very much folks for yeah.

C

Thank you. Okay, thanks great to see all the artwork that youtubers haven't done anything this work summer, excellent stuff.

C

uh So the next topic is uh jager tracing. I think uh this was brought up on the mailing list by ubal. I'm not sure if you all is here.

M

um Sure I can introduce that, but this is really in the scope of the rgw. This is what we've done so far. There's also egg tracing in the scope of the osd, and I guess deepika can talk about that and we need to um sometime merge those two efforts. Of course.

M

um So I mean in both cases uh we use jaeger tracing, um there's a little bit of wrapping that we need to do around the tracing. It's not too much um the uh at least in the rgw. We have uh two levels of uh disablement.

M

One is at compile time, and this is for people that would uh build without jaeger and or for systems. That cannot support that. For some reason um and the more interesting level is to disable at runtime, so the runtime disabling um inside jaeger. They have something like a no-op tracer, so it does. It has all the apis that regular tracer has, but it does nothing.

M

So we allow the configuration parameter to change in runtime to enable disable to switch between the normal eraser and the real tracer. So there's no code change in those cases because they share the same apis.

M

um I I don't know exactly how configuration runtime configuration changes are done on the osd, but I think it should have the same mechanism there or similar one.

M

Then, um currently in in the rgwd, we started from tracing on the operations. So all the operation puts object, the departure that object. All these operations um are going to create uh a trace, and then they can have uh one or more spans in them for some, which means, if they're branching, then then this could be supported inside um and um that's going to be the basic implementation.

M

We the we're going to add some tags, um which is something that is supported in the um jager those tags later on they're stored in elasticsearch, so they could be used for searches like if you want to get all the traces for a certain bucket or a central user or object. Let's start with something or whatever condition you can get, you can get that um a more interesting use case uh that we're going to have.

M

Is you the inaugural, there's, an ability to uh serialize and visualize the uh the trace itself, so it could be picked up by another process or by the same process in a different thread or different place.

M

The first use case we're going to have for this is, is in the case of multi-part upload, so in case of multiple upload you have like the in it and then you have each chunk. You have an object put, and then you have the complete uh so we're going to use those serialization dc relations so that in the ui it would look like just one big trace all the different operations, although they can happen on different latest gateways at different times, and so on.

M

um Another step that we're probably going to use the serialized industrialized function is passing the trace to liberators and then sending that over to the osd, and then you can have a trace. That starts from the front end operation in the redis gateway and trickles all the way down through the three liberators to the osd.

M

For for all the operations, and um even more interestingly, at least in devastation. We use lots of cls.

M

So there's like more complex code running on the osd doing some extra work and you can correlate those traces as well.

M

The last piece that we're gonna, hopefully um use tracing for and and get better uh better understanding and and debuggability of the system is with the multi-site thinking, which is something which is extremely difficult to debug.

M

Today, um again, it would require serializing and visualizing the traces into uh into the object only into the um the bucket index log, and then you would be able to to figure out uh which object that was uh first handled by one rgw um when it was fetched by knowledge w in a different site with different zone and- and this would give us better debuggability of this complex system.

M

So that's like the the main thing there. There is a different trajectory of development that is dealing with deployments so um the first step uh it's going to be just the communication for a manual deployment.

M

This is, for you know just explaining how we expect the agents and the collectors and so on, to be deployed um mainly pointing to the existing documentation. So there's not um another title deployment would be cfdm based one, and this is something that I think deepika started, but it's still working progress.

M

As far as I know, and the uh the last option of deployment would be um openshift based or rook, based where um that would be actually from jaeger perspective, that would be the easiest, because jaeger is actually geared toward kubernetes or openshift deployments, so they already have an operator and everything. So that would be actually the easier part to do. um That's uh that's another direction that we would need to progress in, and um um there were also thoughts.

M

This is like more advanced about conditional tracing uh tracing would be heavy on the system, uh so it probably won't be turned on always by default, and we want to be able to add a mechanism that would allow users in the field not only to disable and enable tracing as a whole, but also enabling disabled tracing for specific operations, or you know only for multi-site syncing and not for the operations or only for cls only on the osd only on the rgw, so it's kind of all kind of combinations of of conditional operation.

M

This is pretty much in a nutshell.

C

The overall plan, in addition to the conditional aspects, um thank you document, um there's like some workplace, that's a little sampling already in jager. Is that uh sufficient for our purposes or we need to add some extra little samples against itself.

M

I don't think so, because sampling is really saying: okay from from the entire uh tracing, we don't need all the traces we're going to just sample 10 and that's gonna reduce the load, but I think that, usually, when you come to debug a system, then then I mean, for example, if you, if you want to use tracing in order to get some kind of estimates on your on the different latencies of the different steps in your process. So samsung would be sufficient and great solution in this case.

M

um But if you want to debug uh multi-site, because you have problems, let's say one of the buckets is not syncing and you want to debug multi-site. Then sampling won't help. You you'll have to actually um disable tracing for everything, so it won't stand in your way and just enable tracing for for bucket syncing, ideally for a specific bucket and then there'll be low impact on the system. But you would get all the information that you need for debugging, so yeah.

C

And that makes sense you want a lot of different different approaches for different use cases. Yes,.

A

The reason I added it in cdm was just to bring the consensus between the development of rgw and osd and how we can proceed forward um and like what is needed from ost front and rgw, so that we have a complete development of tracing functionality in general and what all features um do we want to target as priority for quincy?

A

Maybe now in general, as josh, suggested that there are sampling strategies, we can also have sampling strategies from what I saw that they would collect the spans locally, but they won't transmit it and they would only transmit the ones that we choose to and in case of failure. We can. We also have those spans collected in the local demon, but not transmitted, but uh still available for us to uh look into if needed, uh while also giving us the thorough picture of.

A

If there is some abnormality in the system but yeah, I that's again a scope for later and uh testings um for from our ost purpose um perspective, um uh we have added uh general traces uh for the key uh uh input. Output uh um output uh uh functions that um we have in osd and uh uh we also are targeting more of uh uh using the existing methods.

A

Like mark point, uh marker points uh that we have in places where um we generally want to record the matrix specific to that process, uh kind of adding that rapper and also we have contributions uh in uh booster for tracing as well.

A

uh So that's kind of uh the progress from osd front, of course, uh uh for encode and decode method, decelerating and serializing uh from predos and architect, blue and connecting uh the ost and the rgb traces is uh uh the next thing uh that I would definitely like to do or see. uh Somebody who work would pick it up.

A

uh Apart from it uh any other uh developer, specific matrix that people would uh love to see like in a monitoring perspective, I would have be welcome to have that feedback, and also one thing is that these traces um are not just local. If somebody has uh the traces uh suppose, uh user has the traces um they can also.

A

Convert those traces in json format and uh those json uh format based traces could be rendered anywhere, as it is um for even a developer to take a look into what happened at uh with the system. At that particular point of time, I added a blog link that might help somebody to get started with using uh tracing and would love the contributions there as well um uh feel free to add any suggestions.

M

Yeah, I think, from from the um thinking the two efforts uh there is one small thing would be to to have like one common wrapper and everything for the tracer instead of having one in rjw and one in under common, but that should be pretty easy because they're almost identical uh the two, the two rappers um and for the uh serialized disciplines. I think that um one approach could be that you you try to uh to deserialize.

M

um If you have, if there is something coming in the radius message and use that as your parent span and then, if any client would you know, would serialize their span and send it over because you always you always downstream from everybody right, the osd is always downstream for everybody.

M

uh So if, if anyone is sending you a parent span, then you can digitalize it and use it as your parents, pen and then in the ui, then you would see the entire flow end to end, and if you don't get it then well, maybe nobody sent it to you. Maybe there's no need or whatever and and then you just start your own parent span and continue from there. And hopefully we will do that, of course on our side as well in the rgw, but that could be any client, not necessarily rgw.

C

Yeah, you can imagine using the same thing books that you'd want to have from I'll use slide mod, w also for sfs or rpd, mirroring debugging no systems too.

M

Yeah makes sense.

A

It would be, I think we already have zipkin-based traces there to begin with. If they are good enough, you can just start by porting some of them and asking the developers and experts for only those additional ones that are optimal. I think that's a low-hanging fruit, um yeah I'd.

C

Imagine we could probably replace the existing like zipkin raiders api, with uh the same form kind of calls with the acre tracing instead.

M

Yeah and what you pass is just a binary object, so it doesn't really matter yeah exactly.

A

Also, you will, uh if we can do the disabling and enabling part all from a common perspective, or even like, based on a component, that we want to enable or disable tracing. For specifically, um I would like that, like.

E

A

Have two different uh places to um support enabling disabling.

M

So I don't know enough about how configuration works in rgw versus the osd, so I'm not sure in the rgw we have access to osd configuration parameters and vice versa. So that's that's. Maybe the question.

C

You can get them from the monitor, like you, can ask the monitor for the configuration from like any attitude's perspective. um But I guess um I would push in my mind would be. Is this something that we want to control directly with like a self-configuration option or some other kind of format of configuration.

M

I think it should be set configuration uh just something that somebody goes in the command line or I don't need to have that in the dashboard and just set it to true or false in runtime I'll. Try again, I don't know a lot about how this configuration mechanism works, so I'll, try to sync up with and see if we can have like a convert solution.

C

Sure sure that sounds good, because we've been talking about like things like the conditional enablements and and conversely, like the sampling kind of approach, it's more than just like on off option so finally needs more thought.

C

In terms of, I think that what you described sounds like a great plan and like the deployment aspect, is especially key to making this usable and for both developers and for actual users. Of course, um I think those are the other kind of the first steps in my mind as well. I just getting that the basics there, I think longer term movie fantastic to expand this, to include like background operations too, not just the I o path, so you can use ladybug and and understand.

C

The behavior of the the system as well like rgw's, got lots of garbage collection and life cycle things going on in the background there was, he has plenty of like scrubbing and recovery and other other other things going on should be very helpful to um get more uh observability into.

M

Yeah, once we have a good infrastructure in place, then if we like you know anyone that writes a feature or fixes a bug and want to add tracing there, because it helps them to do better work, then they can do that. So I'm going to be kind of scale out the work uh to other people. That would just follow the examples that we said at the.

A

It beginning be like the outlooking, the logarithm.

A

C

Yeah yeah that'd be great.

A

I think uh even now people can do that, but yeah we would need again uh the macros or uh disable jaeger, or at least only in developer perspective. We can use it, but yeah just working out the foundations would help people to use it and scale ourselves.

C

As another aspect, that's missing, I guess from the osd side today- is the uh integration with uh crimson, since that's going to require a bit of modifications for how we use uh bigger, to avoid locking and whatnot and blocking um and I'd. Imagine that we could probably keep the same interface so that wouldn't affect this, like the general uh common flow for how you define the traces, but the implementation might need to be different for like rooms internally.

M

So the way that we try to approach that on the rgw because, like we have uh like the front hand, has multiple threads and they're using we're using core routines there. um The way that we approach that hopefully that will be useful is um that the uh so the traces are just objects, so there's no there's no blocking or anything, but the tracer, which is what sends the tracers the um to the agent um is, is something that has a lock in it.

M

It just has a lock when, when it does descending at the at the end, so when you call finish on a span, it goes to the tracer and send the spin over um so to avoid contention on those logs. So we cannot avoid the lock, because unless we want to change the the jager code um but to uh to avoid contention logs, what we do is that we define the tracer instead of just being a global.

M

It's a thread: local global variable, which means that um there'll be a log taken, but I mean there'd be no contention because there's only one thread: that's gonna use it at a time and different threads. That's going to use it! We're going to use a different uh client, so uh it'll be a little a little load on the system. So, instead of one client per daemon connects to the agent you can have, I don't know 100 clients, but that they they shouldn't, cost too much and it would save on the on the locking. So maybe.

C

M

Approach could be taken with crimson.

C

Yeah, that's a good idea, I think probably something similar could could apply there and yeah. I'm glad you've already looked into this a bit because that's actually much simpler than I was thinking it would be.

A

We have new ideas to work on anything else. Specific person.

A

I think we can move to next, stop with them.

C

All right, well, thanks deepika and you all very excited to see the tracing progress. I think it's going to be incredibly helpful for users and developers as well.

G

C

All right, so the next topic is uh telemetry crash reports. um You read: do you want to introduce this one.

H

um Let me share a link to the ether path um yeah. So, as some of you know, um users who opt into telemetry, sometimes they usually also opt into the crash channel that we have and they're sending crash information about all the crashes that occurred in their clusters and we have a crash dashboard. Maybe I'll share my screen.

H

In one moment,.

H

Can you see my screen yeah, yes,.

C

H

Thanks so um here we can see the dashboard for all the crashes um and we have a search page um that we can go over all sorts of um specific searches according to versions and demons dates and everything.

H

um So our goal here is to make those crashes trackable by developers and also by users.

H

So whenever a user sees a crash in their cluster, they can go and search tracker for the crash, signature and they'll be able to see the status of that crash. Maybe it's not even a bug. um Maybe it is a real issue and there's a fix for it. So for that we have a new um telemetry crashes, bot that tries to find a corresponding issue for each crash signature.

H

um So, for example, let's see yeah, we do have maybe I'll say just a few words about the search page here I I guess most of you have already seen that, but just in case so you can see um the crash signature, which is a new signature that we created on the back end that groups more uh crash events together, you can see once the first time we encountered that crash in telemetry once last time. How many crashes um uh in.

C

That sorry to interrupt, could you increase the font size a little bit it's difficult to see.

H

Sure, yes, sorry.

H

Is this better now.

C

Yeah I can make out the numbers.

H

Yeah, okay uh cool, so so the crash crash count um corresponds to the time frame. Here, um that's the total crash count and that's just in time frame. So um now we just chose new fingerprints um in this time window. But if we change it, you can see that, for example, this signature has total number of 206 crash counts, but in the last 30 days we saw that only 13 times.

H

You can see the number of clusters that were affected by this crash and all the all the versions for the demons that are reporting this crash um yeah and if you scroll sideways, you can also see um the assert function and the assert condition if they exist- um and here we have here, we have a signature uh that the original signature is reported by the user. So that's the client-side signature, and here we have um status of that signature.

H

So we can click on that and for for the signature page, um we have more detailed information. You can see the pressure currents um by diversion. You can see the um the sanitized back trace that helps to create the signature itself uh by the way. If you click on on each of these frames, you can see all the crashes that occurred that had this frame in their back trace, so that can be very useful as well.

H

um In addition, you can see the the graph for the daily occurrences and, of course this is just for the last two years. You can change the time frame as well.

H

You can see all the affected clusters, their their size, all sorts of other information about them. You can also click on on the specific cluster and see um more graphs for that specific cluster. Like a cluster x-ray- and um here you can see an actual uh crash example um where you you have the exact the exact, um maybe that's too small, too. The exact crash dump here.

H

But if we go back to the integration part with redmine.

H

um We want for each one of these signatures, a corresponding tracker issue, so, for example, this signature, um the bot, opened a new issue for it. It populated all the affected versions. So far, all of the signatures that were received from the client, the signature that we generated on the server side and then it populated the description with a link um to the dashboard and um um with all the other details. So, for example, the asserts um the sanitize the vectorize and I crushed crashdown for example.

H

Now the thing is that um there are about 2 000 crash signatures, so we had um a sort of a test run, a few, uh a few dozens of signatures that we opened tickets for, and I think all in all, it went um pretty well and we want to know if we can go ahead and open the rest of the issues or decide on a cadence that we want to open issues every week.

H

um Josh also had uh an idea, because most of the signature signatures eventually are going to be classified as uh radius or booster or manager, issues uh that, since we have a crash, cue and then crash trash, uh where is now we can just go ahead and open all these issues um and just work on them.

H

Whenever, whenever yeah.

H

I just want to give another example here, um so the bot will not always just open a new issue.

H

um It scans all of um of tracker for the crash signatures, so here, for example, um the bot found a bug that was already resolved um that had the crush signature populated somewhere in the ticket, um but you can see that the affected versions was 1424 and there yeah there are no backboards here, but if there were backwards we would um go ahead and take their target versions as well.

H

um So we encountered new crash events in telemetry that were um of a newer version than the one that was fixed. For example, you can see the 15 version here so since that issue was classified as resolved, it means that it was a real uh self bug. So we decided to open a new issue and to uh relate that to the close one.

H

um This means that um in case you've seen tracker issues that are not safe, bugs. Please use other status to close it other than resolve or need more information. um So you can either use closed. uh Not a bug rejected, won't fix um a status that indicates that we don't actually need to provide a fix for that.

H

So this is one thing and the other is when you mark an issue as a duplicate issue, tracker will let you do that uh um without changing its status. So you can. You can say that um this issue duplicates a different issue, but the status will still.

H

Stay as a new status, so please make sure you're also changing the status as well to a duplicate and that the original um as a status I did it either than duplicate so um again, either it's open or it's closed, but not not a duplicate status.

H

Oh yeah, another thing that I encountered was in case. There are several issues that um are marked as a duplicate to an original issue.

H

um Please make sure that they all um um point to the original and not uh uh in a chain um like um like in the list, um duplicate to a duplicate or duplicate to an original that makes life easier. I guess for everyone.

H

Are there any questions so far or what do you think of the idea of just go ahead and opening all those issues.

D

I wonder if we should um find a brave lead who is willing to devote some time to do triage and then open, like all the issues for like that one, the project, maybe that would make sense.

C

I guess my thinking there is that that's the same amount of work for together, regardless of whether the issues are open or not. So we might as well open them up, and if someone has the same bug, they can at least they can find it in the tracker.

D

Yeah, I guess yeah.

H

My other question was um that does the bug queue uh exclude these um tickets because I saw that on the crash queue we are looking uh in the filters um we are relying on the the source, but I'm not sure that um here we will. um I mean we're not.

C

D

A scrub and we want to ignore crashes, we can add it manually. Maybe I don't know.

D

I'm not sure if this should be distinct overlapping. I guess.

C

I guess in the long term we want them to be overlapping, because at that point we're just kind of looking at all the new incoming bugs but network during this import phase. Maybe we want to maybe treat them distinctly if you don't want to just go through all the.