From YouTube: KubeVirt CI Infrastructure Overview
Description
Brief overview of the infrastructure we use to run KubeVirt CI jobs, how it has changed over the last few months, and the short/medium-term improvements we are working on. Includes a small demo of our usage of Bazel GitOps rules to manage components in CI/CD pipelines.
I'd like to tell you a little bit about the infrastructure that we use to run the KubeVirt CI tests. I want to give an overview of how this infrastructure has evolved since I joined the team, how it looks today, and what we plan for it in the short-term future.
We have what we call the Fenix cluster, which runs OpenShift. It runs the Prow control plane, which receives events from our repositories on GitHub, and it runs other workloads we are interested in. It runs the CI jobs, not only for KubeVirt but also jobs related to Jenkins, and also a Prometheus stack composed of Prometheus, node exporter and Grafana, with no Alertmanager and no Loki. These workloads run on the worker nodes.
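To give a rough idea of what that stack looks like in configuration terms, here is a minimal, illustrative prometheus.yml for a setup like this one, scraping only itself and the node exporters and with no Alertmanager configured; the target addresses are placeholders, not the real worker nodes:

```yaml
# Illustrative prometheus.yml for a stack like the one described:
# Prometheus scraping itself plus node exporters, no Alertmanager section.
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node-exporter
    static_configs:
      - targets: ['worker-1:9100', 'worker-2:9100']   # hypothetical worker nodes
```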
There are 12 virtual machines, 10 bare metals that are attached directly, within the same infrastructure, to the cluster control plane, and nine external bare metals. We call them external because they are not in the same infrastructure; they come from IBM. These are powerful bare metals that we use to run the end-to-end tests. There is also a secondary cluster.
What we call the IBM cluster also runs part of the KubeVirt CI jobs, mainly the ones that are not end-to-end tests: unit tests, generation of documentation, some linters, and also tests from other projects like CDI. All these jobs are scheduled from the control plane and run on these three virtual machines.
So what problems did we have at that time? First of all, the main problem was unstable test results, where the instability came from the infrastructure.
This is mainly because of the very old version of OpenShift that is running there. It is currently not supported, and we got a lot of problems because of it: for instance, issues with the old CNI plugin version that is running there, or connectivity problems, not only during tests but also between components in the Prow control plane and external resources, and a lot of things like that. We also had issues creating pod sandboxes, and plenty of problems in terms of observability as well.
Also, regarding the code that we use to deploy the components, we needed to run some Ansible playbooks locally. The first time I did it, I broke everything. You needed to put the secrets in place, execute locally and hope for the best, because we didn't have tests either.
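Just to illustrate the kind of manual flow this implied, a hypothetical session could look roughly like this; the vault password file, inventory and playbook names are made up, not the real project files:

```bash
# Hypothetical example of the old manual flow: secrets placed locally by hand,
# then a playbook run from a laptop, with no automated tests around it.
export ANSIBLE_VAULT_PASSWORD_FILE=~/.vault-pass   # secrets put in place manually
ansible-playbook -i inventories/production deploy-ci.yaml
```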
To address that, the idea was to migrate the Prow control plane to this IBM cluster, which is much more modern and is managed by the provider, so we don't need to self-manage it. All the connectivity and sandbox creation issues would be resolved with this, and as part of the migration we can also bump the Prow version to a more recent one. We also want to migrate the bare metal machines to a new cluster.
A
In
this
case,
it
needs
to
be
self-managed,
because
these
customer
methods
can
be
attached
to
a
provider,
managed
cluster
and
also
update
the
the
operating
system
of
the
bare
metal.
That
is
also
very
old
and
also
increase.
The
the
capacity
have
more
tests
to
pre
use
more
machines
to
execute
end-to-end
test
so
that
we
reduce
the
the
pressure
on
each
individual
node
and
also
we
are
able
to
split
the
suite
and
have
the
not
a
single
monolithic
suite.
That suite is ever-growing and very hard to maintain and evolve. We also want to improve the observability, so that we get metrics from all the components and make them accessible to everyone interested, in the form of alerts, status pages or anything else, and to reduce the chances of breaking things when the infrastructure code is changed, through automated tests and deployments.
So this is how things look today. The Fenix cluster looks mostly the same. The IBM cluster now has more components: we have a Prometheus stack here as well, a bit different from the other one because it includes Alertmanager and Loki for log aggregation, and there are additional observability tools like ci-search, which most or at least some of you have already started using, and others that we haven't started using yet.
A
But
we
still
have
the
same
capacity
in
the
cluster
and
we
have
this
this
new
workloads
cluster.
This
is
a
self-managed
cluster,
with
two
new
bare
metals
that
are
capable
of
running
end-to-end
tests.
We
we
have
this
week
deployed
this
new
cluster
and
we
are
starting
running
new
lanes
on
this
on
this
cluster
so
yeah.
A
Additionally,
all
the
new
components
have
been
deployed
using
these
gitobs
rules,
these
baseball
rules
for
github
rules
and
yeah.
We
will
see
a
small
small
demo
later,
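For context, a BUILD file using this kind of GitOps rules for Bazel could look roughly like the sketch below. It assumes rules along the lines of Adobe's rules_gitops, and the load path, target name, namespace and manifest layout are illustrative rather than the actual project-infra code:

```python
# Illustrative BUILD.bazel snippet; the load path, rule attributes and names
# follow Adobe's rules_gitops from memory and may differ from the rules we use.
load("@com_adobe_rules_gitops//skylib:k8s.bzl", "k8s_deploy")

# Bundles the manifests for one component; the rules generate runnable verbs
# such as `bazel run //services/monitoring:prometheus.apply` to apply them.
k8s_deploy(
    name = "prometheus",
    namespace = "monitoring",              # hypothetical namespace
    manifests = glob(["manifests/*.yaml"]),
)
```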
So we still have problems here: the Prow control plane is still in the old cluster, we have low capacity, or rather the same capacity, in the IBM cluster for the increased workloads, and on the observability side we have more metrics, but they are not aggregated, so we need to check in different places for the information. There are also data retention issues.
We can't look back at metrics as far as we would like; there are limits on how much data we are collecting. There is also an issue with dynamic provisioning of persistent volumes in the new cluster that should be fixed very soon. So what would we like the system to look like in the future?
We would like to migrate the Prow control plane to the IBM cluster, have all the Prometheus stacks connected and the data aggregated, and increase the capacity of the IBM cluster to run these more production-level workloads, and also migrate the VMs to the new cluster.
Here I put some links to some of the components that we plan to use; this one is very interesting for the metrics aggregation and the virtually unlimited data retention, and so on. Now let me show you really quickly this small demo about how we test and deploy these components. The code is under project-infra.
What you are going to see is the same test flow: when the code changes, it is executed in a pre-submit, that is, we have a Prow job that executes these tests, and if things go well and the code is merged, then we have a post-submit that deploys the component. So we have continuous deployment for this component.
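As a sketch of what such a pair of jobs can look like in a Prow configuration, assuming illustrative job names, container image and repository paths rather than the actual project-infra definitions:

```yaml
# Illustrative Prow job config: a pre-submit that tests the infra code on every
# PR touching it, and a post-submit that deploys it once merged. Names, image
# and paths are placeholders.
presubmits:
  kubevirt/project-infra:
    - name: pull-project-infra-test-monitoring
      run_if_changed: '^github/ci/services/'
      decorate: true
      spec:
        containers:
          - image: quay.io/example/bazel-builder:latest   # hypothetical image
            command: ["/bin/sh", "-c"]
            args: ["bazel test //github/ci/services/..."]

postsubmits:
  kubevirt/project-infra:
    - name: post-project-infra-deploy-monitoring
      run_if_changed: '^github/ci/services/'
      decorate: true
      spec:
        containers:
          - image: quay.io/example/bazel-builder:latest
            command: ["/bin/sh", "-c"]
            args: ["./github/ci/services/prometheus-stack/hack/deploy.sh"]
```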
So let me show you quickly how it looks. It's under services; let's see, for instance, these are the components that we are deploying for the Prometheus stack. This is how it looks for most of the components. Let's take a quick look at the deploy script here. You can see that we call this custom apply verb that is created by the Bazel rules, first for deploying the CRDs and then for deploying all the components, and then we have this command that we created for waiting for each of the components.
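A hedged sketch of a deploy script along those lines is shown below; the Bazel target labels and resource names are placeholders, and where the talk mentions a custom wait command, plain kubectl is used here as a stand-in:

```bash
#!/bin/bash
# Illustrative deploy script; target labels and resource names are placeholders.
set -euo pipefail

# Apply the CRDs first via the .apply verb generated by the GitOps Bazel rules,
# so the component manifests that depend on them can be admitted afterwards.
bazel run //github/ci/services/prometheus-stack:crds.apply

# Then apply the remaining components of the stack.
bazel run //github/ci/services/prometheus-stack:everything.apply

# Wait for each component to become ready (stand-in for the custom wait command).
kubectl -n monitoring rollout status deployment/grafana --timeout=5m
kubectl -n monitoring rollout status statefulset/prometheus-k8s --timeout=5m
```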