From YouTube: KubeVirt CI Infrastructure Overview
Description
Brief overview of the infrastructure we use to run KubeVirt CI jobs, how it has changed over the last few months, and the short/medium-term improvements we are working on. Includes a small demo of our usage of Bazel GitOps rules to manage components in CI/CD pipelines.
I'd like to tell you a little bit about the infrastructure that we use to run the KubeVirt CI tests. I want to give an overview of how this infrastructure has evolved since I joined the team, how it looks today, and what we plan for it in the short-term future.
We have what we call the Fenix cluster, which runs OpenShift. It runs the Prow control plane, which receives events from our repositories on GitHub, and it runs other workloads we are interested in. It runs the CI jobs, not only for KubeVirt but also jobs related to Jenkins, and also a Prometheus stack composed of Prometheus, node exporter and Grafana, with no Alertmanager and no Loki. These workloads run on the worker nodes.
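To give a rough idea of what that stack looks like in configuration terms, here is a minimal, illustrative prometheus.yml for a setup like this one, scraping only itself and the node exporters and with no Alertmanager configured; the target addresses are placeholders, not the real worker nodes:

```yaml
# Illustrative prometheus.yml for a stack like the one described:
# Prometheus scraping itself plus node exporters, no Alertmanager section.
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node-exporter
    static_configs:
      - targets: ['worker-1:9100', 'worker-2:9100']   # hypothetical worker nodes
```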
There are 12 virtual machines, 10 bare metals that are attached directly, within the same infrastructure, to the cluster control plane, and nine external bare metals. We call them external because they are not in the same infrastructure; they come from IBM. These are powerful bare metals that we use to run the end-to-end tests. There is also a secondary cluster.
What we call the IBM cluster also runs part of the KubeVirt CI jobs, mainly the ones that are not end-to-end tests: unit tests, generation of documentation, some linters, and also tests from other projects like CDI. All these jobs are scheduled from the control plane and run on these three virtual machines.
So what problems did we have at that time? First of all, the main problem was unstable test results, where the instability came from the infrastructure.
This is mainly because of the very old version of OpenShift that is running there. It is currently not supported, and we got a lot of problems because of it: for instance, issues with the old CNI plugin version that is running there, or connectivity problems, not only during tests but also between components in the Prow control plane and external resources, and a lot of things like that. We also had issues creating pod sandboxes, and plenty of problems in terms of observability as well.
Also, regarding the code that we use to deploy the components, we needed to run some Ansible playbooks locally. The first time I did it, I broke everything. You needed to put the secrets in place, execute locally and hope for the best, because we didn't have tests either.
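Just to illustrate the kind of manual flow this implied, a hypothetical session could look roughly like this; the vault password file, inventory and playbook names are made up, not the real project files:

```bash
# Hypothetical example of the old manual flow: secrets placed locally by hand,
# then a playbook run from a laptop, with no automated tests around it.
export ANSIBLE_VAULT_PASSWORD_FILE=~/.vault-pass   # secrets put in place manually
ansible-playbook -i inventories/production deploy-ci.yaml
```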
To address that, the idea was to migrate the Prow control plane to this IBM cluster, which is much more modern and is managed by the provider, so we don't need to self-manage it. All the connectivity and sandbox creation issues would be resolved with this, and as part of the migration we can also bump the Prow version to a more recent one. We also want to migrate the bare metal machines to a new cluster.
A
In
this
case,
it
needs
to
be
self-managed,
because
these
customer
methods
can
be
attached
to
a
provider,
managed
cluster
and
also
update
the
the
operating
system
of
the
bare
metal.
That
is
also
very
old
and
also
increase.
The
the
capacity
have
more
tests
to
pre
use
more
machines
to
execute
end-to-end
test
so
that
we
reduce
the
the
pressure
on
each
individual
node
and
also
we
are
able
to
split
the
suite
and
have
the
not
a
single
monolithic
suite.
That suite is ever-growing and very hard to maintain and evolve. We also want to improve the observability, so that we get metrics from all the components and make them accessible to everyone interested, in the form of alerts, status pages or anything else, and to reduce the chances of breaking things when the infrastructure code is changed, through automated tests and deployments.
So this is how things look today. The Fenix cluster looks mostly the same. The IBM cluster now has more components: we have a Prometheus stack here as well, a bit different from the other one because it includes Alertmanager and Loki for log aggregation, and there are additional observability tools like ci-search, which most or at least some of you have already started using, and others that we haven't started using yet.
A
But
we
still
have
the
same
capacity
in
the
cluster
and
we
have
this
this
new
workloads
cluster.
This
is
a
self-managed
cluster,
with
two
new
bare
metals
that
are
capable
of
running
end-to-end
tests.
We
we
have
this
week
deployed
this
new
cluster
and
we
are
starting
running
new
lanes
on
this
on
this
cluster
so
yeah.
A
Additionally,
all
the
new
components
have
been
deployed
using
these
gitobs
rules,
these
baseball
rules
for
github
rules
and
yeah.
We
will
see
a
small
small
demo
later,
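For context, a BUILD file using this kind of GitOps rules for Bazel could look roughly like the sketch below. It assumes rules along the lines of Adobe's rules_gitops, and the load path, target name, namespace and manifest layout are illustrative rather than the actual project-infra code:

```python
# Illustrative BUILD.bazel snippet; the load path, rule attributes and names
# follow Adobe's rules_gitops from memory and may differ from the rules we use.
load("@com_adobe_rules_gitops//skylib:k8s.bzl", "k8s_deploy")

# Bundles the manifests for one component; the rules generate runnable verbs
# such as `bazel run //services/monitoring:prometheus.apply` to apply them.
k8s_deploy(
    name = "prometheus",
    namespace = "monitoring",              # hypothetical namespace
    manifests = glob(["manifests/*.yaml"]),
)
```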
So we still have problems here: the Prow control plane is still in the old cluster, we have low capacity, or rather the same capacity, in the IBM cluster for the increased workloads, and on the observability side we have more metrics, but they are not aggregated, so we need to check in different places for the information. There are also data retention issues.
We can't look back at metrics as far as we would like; there are limits on how much data we are collecting. There is also an issue with dynamic provisioning of persistent volumes in the new cluster that should be fixed very soon. So what would we like the system to look like in the future?
We would like to migrate the Prow control plane to the IBM cluster, have all the Prometheus stacks connected and the data aggregated, and increase the capacity of the IBM cluster to run these more production-level workloads, and also migrate the VMs to the new cluster.
Here I put some links to some of the components that we plan to use; this one is very interesting for the metrics aggregation and the virtually unlimited data retention, and so on. Now let me show you really quickly this small demo about how we test and deploy these components. The code is under project-infra.
What you are going to see is the same test flow: when the code changes, it is executed in a pre-submit, that is, we have a Prow job that executes these tests, and if things go well and the code is merged, then we have a post-submit that deploys the component. So we have continuous deployment for this component.
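As a sketch of what such a pair of jobs can look like in a Prow configuration, assuming illustrative job names, container image and repository paths rather than the actual project-infra definitions:

```yaml
# Illustrative Prow job config: a pre-submit that tests the infra code on every
# PR touching it, and a post-submit that deploys it once merged. Names, image
# and paths are placeholders.
presubmits:
  kubevirt/project-infra:
    - name: pull-project-infra-test-monitoring
      run_if_changed: '^github/ci/services/'
      decorate: true
      spec:
        containers:
          - image: quay.io/example/bazel-builder:latest   # hypothetical image
            command: ["/bin/sh", "-c"]
            args: ["bazel test //github/ci/services/..."]

postsubmits:
  kubevirt/project-infra:
    - name: post-project-infra-deploy-monitoring
      run_if_changed: '^github/ci/services/'
      decorate: true
      spec:
        containers:
          - image: quay.io/example/bazel-builder:latest
            command: ["/bin/sh", "-c"]
            args: ["./github/ci/services/prometheus-stack/hack/deploy.sh"]
```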
So let me show you quickly how it looks. It's under services; let's see, for instance, these are the components that we are deploying for the Prometheus stack. This is how it looks for most of the components. Let's take a quick look at the deploy script here. You can see that we call this custom apply verb that is created by the Bazel rules, first for deploying the CRDs and then for deploying all the components, and then we have this command that we created for waiting for each of the components.
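A hedged sketch of a deploy script along those lines is shown below; the Bazel target labels and resource names are placeholders, and where the talk mentions a custom wait command, plain kubectl is used here as a stand-in:

```bash
#!/bin/bash
# Illustrative deploy script; target labels and resource names are placeholders.
set -euo pipefail

# Apply the CRDs first via the .apply verb generated by the GitOps Bazel rules,
# so the component manifests that depend on them can be admitted afterwards.
bazel run //github/ci/services/prometheus-stack:crds.apply

# Then apply the remaining components of the stack.
bazel run //github/ci/services/prometheus-stack:everything.apply

# Wait for each component to become ready (stand-in for the custom wait command).
kubectl -n monitoring rollout status deployment/grafana --timeout=5m
kubectl -n monitoring rollout status statefulset/prometheus-k8s --timeout=5m
```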