Ceph Ceph Tech Talks, 16 Aug 2023

From YouTube: Ceph Tech Talk: Making Teuthology Friendly

Description

Presented by: Devansh Singh and Medhavi Singh

Join us monthly for Ceph Tech Talks: https://ceph.io/en/community/tech-talks/

Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute
What is Ceph: https://ceph.io/en/discover/

A

Hello everyone, my name is Devansh and I'm a Google Summer of Code intern working at Ceph. The project I was chosen for is Making Teuthology Friendly. I'll go through the agenda of today's presentation; here are the points we'll cover.

A

First is an introduction to what teuthology is and the other services we have been working on. Then come the problems with the teuthology command-line interface, the solution we thought of and how we have been implementing it, the features that Medhavi and I have worked on, the benefits of the solution, and some future improvements.

A

I'll start with what teuthology is. Teuthology is a testing framework for Ceph, written in Python. It is used for running Ceph's tests, and it orchestrates operations on remote hosts over SSH. A typical job consists of multiple nested tasks, each of which performs operations on a remote host over the network. So teuthology uses... sorry.

A

The test results are stored in paddles and are available on pulpito, a dashboard where the users who scheduled the tests can see the results. So what is pulpito? Pulpito is a web dashboard used for monitoring the teuthology test infrastructure, including the tests, the queues, and the available test nodes.

A

As we all know, there are a lot of nodes available. To schedule a test, you have to lock the nodes, SSH into the teuthology server, and then use the teuthology-suite or teuthology run command.

A

You can schedule multiple jobs or a single job. Pulpito provides real-time information about the health and performance of the infrastructure, making it easier for administrators to keep an eye on the system. This helps to quickly identify any issues or failing tests, so the team can take appropriate action to address them. That's what pulpito is used for; it is one of the services I mentioned we have been working on. So what are the actual problems with the CLI?

A

The teuthology CLI has various commands along with a lot of flags. It has a learning curve and takes a veteran to schedule jobs correctly. For a new developer working on Ceph, scheduling jobs with teuthology can be overwhelming, because first you need to SSH into the server, then you have to lock nodes, and then you have to schedule jobs.

A

This can interrupt the productivity of the whole team. The complexity of the teuthology CLI can also lead new users to schedule bad runs that hold up the queue and block other jobs, because the jobs run one by one. If someone schedules a job wrongly, it holds up the whole queue until they kill the job. That's one of the problems; I'll discuss some more.

A

As I said on the previous slide, the commands often require users to understand and enter complex command-line syntax, including various flags, options, and arguments, which can be overwhelming for new users. Some options repeat, and you have to enter them every time you use the CLI. Users might also struggle to discover the available commands, the options, and their meaning, leading to frustration and difficulty in effectively using teuthology's capabilities.

A

Some options aren't documented very well and don't have much information about them. So someone who is new to testing with teuthology might face difficulty and have to ask someone who has been using teuthology for a long time just to schedule a simple job. Users also often need to manually configure a wide range of parameters, such as repository URLs, branch names, and suite options, which can lead to errors from typos or misconfigurations; typos are a very common human error and can happen to anyone. We can overcome these problems with the implementation we have been working on. So what is the solution? The solution is a next-generation pulpito.

A

The current pulpito just provides a dashboard for the test results, along with some more information about the nodes. The next generation of pulpito that we have been working on will provide an improved user experience, with a more intuitive user interface that makes it easier for admins and developers to navigate, configure, and monitor teuthology tests. One of the main features of this next-gen pulpito is to let users schedule and kill jobs from the dashboard, which will enhance productivity and make working with teuthology a joy for both seasoned professionals and people who are new to it.

A

On to the implementation. We made an API for teuthology and integrated it with pulpito next-gen, so that users can schedule or kill jobs just by interacting with the user interface, without SSHing into the teuthology server. The main features we have been working on in the new pulpito are GitHub authentication; scheduling jobs, which is the part done by Medhavi; and killing jobs, which was implemented by me.

A

I'll now show a small demo. Right now I'm running containers for all the services: pulpito, teuthology, the teuthology API, and paddles. Here you can see the next generation of pulpito that we have been working on, and you can see my GitHub username, as I have authenticated using my account. Right now no runs have been scheduled, so we are going to schedule a run using the feature that was implemented by Medhavi.

A

As you can see here, we have a table with two columns, key and value, which represent the flags and the flag options that are set in the actual command. Here you can see a text area that shows the actual CLI command. This is for the professionals who have been using the teuthology CLI for a long time, so that they can check whether this is the command they actually wanted.
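
A minimal sketch of how key/value rows like the ones in this table could be rendered as the command shown in the text area. The function name, the default program name, and the example flags and values are illustrative assumptions, not the actual pulpito code.

```python
import shlex


def build_command(rows, program="teuthology-suite"):
    """Render key/value rows as one shell-safe command string.

    A value of True means a bare flag; False or None means the row is skipped.
    """
    argv = [program]
    for key, value in rows.items():
        if value is True:
            argv.append(key)
        elif value is False or value is None:
            continue
        else:
            argv.extend([key, str(value)])
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(argv)


# Hypothetical rows, as a user might fill them in.
rows = {"--suite": "dummy", "--ceph": "main", "--dry-run": True}
print(build_command(rows))
# -> teuthology-suite --suite dummy --ceph main --dry-run
```

Building the string from an argument list, rather than concatenating text, keeps values with spaces correctly quoted in the preview.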

A

It's a functionality just for that. I'll go ahead and schedule a sample run so that you can see how it actually works. As I said on my previous slide, there are some things you have to set every time, like the Ceph repo and the suite repo. Most of the time it's the same thing, but you have to write it over and over.

A

Here we have some default values for those, so that you don't have to write them every time. Using the checkboxes, you can add the flags that you want, and you can even delete them. I'll delete this and add a row for the suite that we want to run; for that I'll choose the repo option, and there's a dummy suite, so we are going to run that. Here you can see we have two buttons.

A

One is for a run and one is for a dry run. We are going to use the Run button right now. As this feature is still being implemented, I'm going to show you the logs of the Docker container. Here you can see that the teuthology API is actually scheduling the job.

A

Once we clicked the button, it took all the flags and options and sent them over to the teuthology API. The teuthology API was made by Vallari; kudos to her. While this is scheduling the job, I'll tell you about the GitHub authentication. To schedule or kill a job, you need to be authenticated using GitHub and be a member of the Ceph organization. If you're not a member, or you are not logged in, you won't be able to schedule or kill a job.

A

Only the person who scheduled a job, or an admin, can kill it; this is where the authentication comes in. As you can see in the logs, the job has now been scheduled, so we'll go over to the runs and see whether the scheduled job shows up. As you can see, the job we just scheduled shows up here. We used the dummy suite on the main branch and, as we are running in a Docker container, it's on the test node. So let's see more information about the job that was just scheduled.
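
The permission rules described here can be sketched as two small checks. This assumes a hypothetical `User` record with an organization-membership flag and an admin flag; how the real services derive those values from GitHub is not covered in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    # Hypothetical fields; None stands for "not logged in" below.
    username: str
    is_org_member: bool = False
    is_admin: bool = False


def can_schedule(user: Optional[User]) -> bool:
    """Scheduling requires a logged-in member of the organization."""
    return user is not None and user.is_org_member


def can_kill(user: Optional[User], run_owner: str) -> bool:
    """Killing is limited to the run's owner or an admin."""
    if not can_schedule(user):
        return False
    return user.is_admin or user.username == run_owner


owner = User("alice", is_org_member=True)
admin = User("bob", is_org_member=True, is_admin=True)
other = User("carol", is_org_member=True)
print(can_kill(owner, "alice"), can_kill(admin, "alice"), can_kill(other, "alice"))
# -> True True False
```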

A

All right, you can see that currently it's queued. This is the feature that I have been working on: this button kills all the jobs in the currently scheduled run. In the future we plan to add kill buttons for individual jobs; I'm working on that right now. In this demo, I'll show you how to kill the whole run at once.

A

Once you click the kill button, it sends a request to the teuthology API and waits for it to complete. If it gets a 200 status code, it says that the run has been killed successfully; otherwise, it shows that there was an error with the run, and you can go through the logs to see what the actual error was.
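
A rough sketch of that status handling. The `send_kill_request` callable is a stand-in for the real HTTP call, whose URL and payload are not given in the talk, and the run name and message wording are made up for illustration.

```python
from http import HTTPStatus


def kill_run(run_name, send_kill_request):
    """Send a kill request and turn the status code into a user-facing message.

    send_kill_request is a hypothetical callable that performs the request
    and returns the HTTP status code as an integer.
    """
    status = int(send_kill_request(run_name))
    if status == HTTPStatus.OK:  # 200
        return f"Run {run_name} has been killed successfully."
    return f"Error killing run {run_name} (HTTP {status}); check the logs."


# Stubbed responses instead of a live API.
print(kill_run("demo-run", lambda name: 200))
print(kill_run("demo-run", lambda name: 500))
```

Passing the request function in as a parameter keeps the message logic testable without a running API.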

A

So this was the demo of what we have been working on. These features aren't complete right now, but we are close to completing them and pushing them to production. Those were the three features I told you about, and we saw the demo. So what are the benefits of this solution? With a user interface for scheduling and killing jobs, as I said before, developers won't have to SSH in every time and write out the whole commands.

A

They can just use the interface and its tables for easier scheduling of jobs, which will also be more productive for teams in the long run. On the scheduling page there is also an option to save the configuration, so that if a user has a config they use most of the time, they won't have to write it again. It is stored in local storage, and they can fetch it from there and schedule a job using that same configuration.

A

Those are the benefits of the solution we have implemented. We also have some future plans. The first is rewriting paddles using FastAPI. Paddles currently uses Pecan, which is an unmaintained framework, so we thought we could rewrite it in FastAPI: FastAPI has a big community and a lot of built-in features, and it uses type annotations, so we can make the code a bit better than plain dynamic Python.

A

We can also work more on the user experience of pulpito next-gen by making it visually easier to schedule a job. You saw that the table currently isn't that visually pleasing, so we can improve it. Beyond that, we can add more user-centered features to pulpito next-gen.

A

What I was thinking is that we can add options for users to see their current jobs, their scheduled jobs, and their past jobs, so that they don't have to go down the queue every time to find what they have scheduled. This will make it easier for them to see the results of their jobs, and if they want to kill one, they can kill it directly.

A

That was it for the presentation. We did run into a lot of problems during these implementations, but I'd like to thank our mentors, Zach, Aishwarya, and Junior, who helped whenever we needed it. It was a great time working for Ceph, and I'll keep contributing to it, because it's a great project and I really like working on it. That's it from my side, and it's Q&A time, so if anyone has any questions, I'm open to them.

B

Great job. I really like the way the UI is looking; it looks a lot more user-friendly.

A

Yeah, thank you.

C

Yeah, really good job. I thought that was a super informative talk. I guess a question for you is: was there a biggest challenge that you could name in this work?

A

Okay, so one of the biggest challenges we faced recently was connecting the back end with the front end, because we were having some issues with the cookies that were set by the teuthology API. We had to figure out how to actually send the cookies back in the requests. We got some help from Vallari and figured it out.

A

Any other questions?

B

I know you said this, but can you explain again where we would go to find the logs of a scheduled run, to see if there was an error and to help us report any bugs that might come up?

A

Yes. In the current implementation we were running in a Docker container, so to see if there was an error we had to open up the container. In the future, what we were thinking is that we could show the error log on a new page: if the run fails, a link would show up that redirects you to a page where you can see the logs that teuthology returns.

B

Okay, great thanks.

D

I think in the future the scheduling page would have a separate section to see the logs for scheduling. In the presentation, Devansh had to switch to the container logs to see what was actually happening on the teuthology side of things, so in the future I think that screen will be integrated into the schedule page.

A

Yeah, a log screen can be integrated so that even the user can see the actual logs printed by teuthology.

D

Yeah, so they would know what...

A

...is actually happening, exactly.

C

And if we want to get fancy with an idea like that, we can even look at having teuthology produce something more structured, say in the context of the teuthology API, so that we could grab a JSON blob and use that to render a UI that exposes an error in a nicer way.

A

That's a nice idea, yeah.

A

Use JSON to render a UI so that the user doesn't have to go through the logs to find out what the actual error was.
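
As a sketch of that idea: if the API returned a structured error instead of raw log text, the front end could reduce it to a single readable line. The payload shape below (`ok`, `error.type`, `error.message`) and the example values are purely hypothetical; no such format exists in teuthology today as far as this discussion goes.

```python
import json


def summarize_error(blob: str) -> str:
    """Reduce a structured result payload to one line a UI could display."""
    data = json.loads(blob)
    if data.get("ok", False):
        return "Run scheduled successfully."
    error = data.get("error", {})
    kind = error.get("type", "UnknownError")
    message = error.get("message", "no details provided")
    return f"{kind}: {message}"


# Hypothetical payload, for illustration only.
blob = '{"ok": false, "error": {"type": "SuiteNotFound", "message": "suite dummy does not exist"}}'
print(summarize_error(blob))
# -> SuiteNotFound: suite dummy does not exist
```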

D

I think we also explored some way of saving commands that are regularly used. So, for example,

D

People like Yuri, who have commands for different branches for releases. You can save those commands, storing them for now in a local cache, but ideally I think we can store them in some kind of actual database. That's another thing we explored.

A

Yeah, as I said in the presentation, the current implementation uses local storage for storing configurations that a user uses multiple times. In the future we could use a database such as PostgreSQL to store them.

B

Yeah, I really like that idea. There are a lot of commands that I just reuse by looking back in my history, and I...

B

...whatever command is in my history, so that'd be great.

D

Yeah. Laura, since you and I are regularly doing reviews on runs, I think we should also explore what kind of features we can add to pulpito next-gen to make reviewing runs easier.

D

I think that's another use case that we might want to explore in the future.

B

Absolutely, and Zach told me about this issue page for the...

D

Yeah I didn't see actually getting close to tap. That happens to me too.

B

Sorry about that. I was going to point to this link here, but for some reason the chat box isn't working for me and won't let me link it. Anyway, I was looking at the Ceph pulpito-ng repository, under the issues page. Would you recommend that we raise issues there from now on?

A

Yeah, you can raise issues in the pulpito-ng repository, and we'll figure out how to implement or debug them.

D

Yeah, if it's a UI thing, I think that's worth it. But obviously we also have the tracker, which is also good. I think the tracker is more for, let's say, teuthology problems rather than interface ones.

B

Really great job, really good presentation.

A

Thank you, I really appreciate it.

D

Great job guys.

D

Any more questions from the audience?

D

Okay, I think you guys did a great job. Thank you so much.

A

Thank you, everyone, for joining and listening to the presentation. Let us know if you have any further feedback, and we'll try to work on it.

A

Yeah, good job, guys. Thank you.