From YouTube: Zero downtime KubeVirt updates
Description
KubeVirt has a very precise and resilient method for ensuring zero downtime updates occur.
In this session I'll cover the general strategy behind how we approach updating KubeVirt from a developer's perspective as well as discuss future improvements to our update process.
Attendees will come away with an understanding of how KubeVirt's update process has been designed, how it is tested, and what future enhancements are coming soon.
Presenter: David Vossel, Senior Principal Software Engineer, Red Hat
A
Let's see. All right, can somebody give me a confirmation that they can see my slides? Yes? Excellent. All right, let me get started here. My name is David Vossel, I'm one of the core contributors to the KubeVirt project, and I'm going to talk about how we handle updates for KubeVirt. All right, so back in, what I want to say was, 2018.
A
A
A
A
A
So the angle here is KubeVirt updates. I want them to be so reliable and mundane that I'd be willing to update KubeVirt in production in the middle of business hours and do that with confidence, and the only way I'd be willing to do that is if we can make some sort of guarantee around zero downtime.
A
So, to kick things off, let's talk about our commitment to zero downtime updates for KubeVirt and what exactly that means for the project.
A
A
The first commitment is that the CRUD operations performed on KubeVirt objects like virtual machines are going to remain possible during the update, so people should be able to perform lifecycle events on virtual machines, like starting and stopping them, throughout the entire update. Now, the timing of how quickly those lifecycle events are processed could be delayed. But the big point to drive home here is that invoking these actions should not be impacted, so our API, by design, is going to be capable of responding even in a degraded state.
A
The second commitment is that VMI workloads will remain undisrupted throughout the entire update process, and what I mean by this is we're never going to require performing a destructive action on your virtual machine workloads as part of the update process. So if you're running a database, for example, in a KubeVirt VMI, that database is going to remain uninterrupted.
A
A
...commitment to zero downtime. And all that said, we do expect some disruption to occur during the update, and that is primarily limited to anything that involves a persistent connection between our control plane components. So the first one is the in-flight live migrations: they'll probably get terminated. It really depends on timing, and the reason for this is that virt-handler is responsible for processing the live migration streams.
A
So that's one of our components, and when we roll out new virt-handlers, any in-flight live migration streams are going to get closed, which is ultimately going to cause the migration to terminate. If you're using pre-copy live migration, this isn't really a big deal. Your VMI is just going to keep running wherever it was running and nothing's lost. Another form of disruption that we expect is that any console or VNC connections to active, running virtual machines are going to get reset.
A
So again, this is a persistent TCP connection that's being tunneled through our virt-api component and our virt-handler components. When we roll out new versions of these components, it's going to reset those connections, and practically there's not a lot we can do about that.
A
Okay, so what I want to do next is walk through the installation and update process and talk a little bit about how it all works. So if you're installing KubeVirt for the first time, you're probably going to find some documentation on kubevirt.io about first installing the KubeVirt operator and then posting a KubeVirt custom resource to the cluster, which in turn kicks off the installation of KubeVirt. So what's going on here is we have a top-level controller, virt-operator, which is capable of orchestrating the KubeVirt install and update process. But virt...
A
A
A
You know, the virt-operator deployment and anything that deployment needs to work. So I'm talking about service accounts, webhooks, CRDs, whatever. And then we have the KubeVirt custom resource, so kubevirt-cr.yaml, and that's going to contain the very basic instructions that will signal virt-operator to begin the installation process.
A
B
A
You will get version 0.38 of KubeVirt. Now, advanced mode (again, I totally just made up this term) is that you can pin a KubeVirt custom resource to a specific KubeVirt release, and one way of doing this is by setting the image tag. Let me hide this. I don't know if you guys can see this or if only I can see this. Okay, so you can set the image tag on the KubeVirt custom resource, and that's going to tell virt-operator to install a previous or a specific version of KubeVirt.
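As a rough illustration of pinning, the sketch below builds a KubeVirt custom resource with an image tag set and prints it as JSON, which `kubectl apply -f -` accepts. The image tag field is the one described in the talk; the API version, resource name, namespace, and the tag value `v0.38.0` are assumptions made for the example, so check them against the release you are actually running.

```python
import json


def pinned_kubevirt_cr(image_tag, name="kubevirt", namespace="kubevirt"):
    """Build a KubeVirt CR dict pinned to one release via spec.imageTag."""
    return {
        "apiVersion": "kubevirt.io/v1",  # assumed API version
        "kind": "KubeVirt",
        "metadata": {"name": name, "namespace": namespace},  # assumed defaults
        "spec": {"imageTag": image_tag},
    }


if __name__ == "__main__":
    # Example tag only; use the release you actually want to pin to.
    print(json.dumps(pinned_kubevirt_cr("v0.38.0"), indent=2))
```

Leaving the image tag unset corresponds to the "easy mode" described earlier, where the installed KubeVirt version simply follows the operator.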
A
B
A
...people to use different operator and KubeVirt versions that can kind of float independently of one another, but really the big one here is rollback. So an operator can't transition from a release version of KubeVirt that's more recent than the currently deployed operator. The operator has to be greater than or equal to both the current version of KubeVirt and the desired version of KubeVirt when performing the transition.
A
I'd have to have at least version 0.38 of the operator installed, and this has to do with permissions: the new operator has to know about everything that's required, and has to have permissions to do everything required, for both releases, and we only do that in a rolling-forward type of fashion. So you can only roll back if you have the most recent version of that operator.
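The version rule just described can be written down as a small check. This is only a sketch of the rule as stated in the talk, not virt-operator's actual code; the function names are made up for the example, and it assumes plain numeric versions such as `v0.38.0` (release-candidate suffixes are not handled).

```python
def parse_version(version):
    """Turn 'v0.38.0' or '0.38' into a 3-part tuple of ints for comparison."""
    parts = [int(p) for p in version.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))  # pad so '0.38' compares equal to '0.38.0'
    return tuple(parts)


def operator_can_transition(operator, current, target):
    """True if the operator is >= both the current and the target KubeVirt version."""
    op = parse_version(operator)
    return op >= parse_version(current) and op >= parse_version(target)


if __name__ == "__main__":
    # Rolling back from 0.38 to 0.37 with a 0.38 operator is allowed.
    print(operator_can_transition("v0.38.0", "v0.38.0", "v0.37.0"))
    # A 0.37 operator cannot drive a move to 0.38.
    print(operator_can_transition("v0.37.0", "v0.37.0", "v0.38.0"))
```

This is why a rollback needs the newer operator to stay installed: it is the only one that knows about both releases.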
A
Okay, let's quickly take a look at the typical installation process, and I'm going to show how you can follow that process by observing things on the KubeVirt custom resource's status section. This is kind of teeing us up for observing an update in a minute. So if you're installing KubeVirt from scratch, then you're probably going to want to do something like this, where you first post the operator manifests, then post the KubeVirt custom resource, and then wait for the KubeVirt custom resource's status to indicate that virt-operator has completed the install.
A
So one way you can track the progress of the install is by observing the changes to the conditions on the KubeVirt custom resource. That wait command here at the bottom is telling us that the install has completed once a condition called Available is set to true. So let's take a closer look at what you can expect to see in the KubeVirt custom resource's status, both during and after the installation is completed.
A
B
A
A
Okay, so now we have KubeVirt installed. Let's take a look at what it entails to trigger and observe an update that's in progress. If you took the easy mode for installing KubeVirt, where the operator and KubeVirt release versions are always going to remain in sync, then triggering an update is as simple as applying a new KubeVirt operator manifest, which you can find in the asset list of our official KubeVirt releases. I had a screenshot of that earlier. So in this example, we post a new kubevirt-operator.yaml.
A
Then we wait for the Progressing condition to get set in the KubeVirt CR. So that's indicating to us that the operator is now initiating the update process. And then finally, we observe that the Created condition is set to true, which signals to us that the update has completed. So we can have an indication of when the update has started and when it has ended. So if you're actually looking at the KubeVirt custom resource's status, you can see Progressing is going to be set to true again. We have a reason there; it's going to say update in progress.
A
That's telling us that we're transitioning to a new version, or a different version, of KubeVirt. We're going to see something new here: the observed version of KubeVirt does not match the target version in our KubeVirt custom resource status. So that's telling us what the last version was that was installed and where we're headed. It's telling us that a transition is taking place, and this is what we're transitioning to.
A
If we wait some time, we're going to see that the Progressing condition is going to be set to false, with a nice reason there. We're going to see that the Created condition, the one that we were waiting on in our kubectl command, is going to be set to true, and we're going to see that the observed version is going to match the target version now, so there's no transition taking place.
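To make the status walkthrough concrete, here is a minimal sketch that classifies a KubeVirt custom resource status using the signals from the talk: the Progressing condition, a "finished" condition, and whether the observed version matches the target version. The dictionary keys `observedKubeVirtVersion` and `targetKubeVirtVersion`, the function name, and the choice of `Available` as the default finished condition are assumptions for the example; compare them with the status your cluster actually reports.

```python
def update_state(status, ready_condition="Available"):
    """Classify a KubeVirt CR status dict as 'updating', 'ready', or 'unknown'."""
    # Index conditions by type, e.g. {"Progressing": "True", "Available": "False"}.
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    observed = status.get("observedKubeVirtVersion")  # assumed field name
    target = status.get("targetKubeVirtVersion")      # assumed field name

    # A transition is underway if the operator says so or the versions differ.
    if conditions.get("Progressing") == "True" or observed != target:
        return "updating"
    if conditions.get(ready_condition) == "True":
        return "ready"
    return "unknown"


if __name__ == "__main__":
    mid_update = {
        "conditions": [{"type": "Progressing", "status": "True"}],
        "observedKubeVirtVersion": "v0.37.0",
        "targetKubeVirtVersion": "v0.38.0",
    }
    print(update_state(mid_update))
```

Passing `ready_condition="Created"` follows the condition named in the update example instead of the one used for the initial install.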
A
Okay, so that's what it looks like from an administrative perspective to install and update KubeVirt. I want to talk about the details of what virt-operator is actually doing behind the scenes and how it orchestrates the update. So in order to guarantee minimal disruption, virt-operator is going to be performing the KubeVirt update in a very specific and controlled order. So there are several phases involved here.
A
The first phase is going to be involved with installing all the prerequisites required to successfully launch the new, updated KubeVirt components. So this is going to involve things like installing new CRDs, meaning any new APIs and updates, if there are new APIs that didn't exist previously.
A
We're doing this thing where we merge the old RBAC permissions with the new RBAC permissions, and the reasoning for this is, briefly, that when we're performing the rolling update of our components, both the old and new versions are going to be running at the same time, and they both have to operate and serve requests correctly.
A
So we have to continue with both sets of RBAC permissions in order for that to work. And lastly, we're going to install any other prerequisites. That can include things like new webhooks, service endpoints, Prometheus rules and anything else. There's a few other ones I didn't even list here. So the next phase, after our prerequisite phase, is we're going to roll out virt-handler. So virt-handler is our daemon set that lives on all the compute nodes.
B
A
...virt-controller, and then after that we have virt-api. It's important that virt-api is the last component in this chain, because virt-api is what enables new user-facing features, and by rolling out virt-api last, we can ensure that all the other components involved in these new features are available before anyone intends to use those new features. After virt-api, we're going to wait until we're certain that all of the old components, all the old deployments, all the old pods involved with our control plane are down.
A
And then we make any backwards-incompatible changes to our API. These would be things like introducing new versions to existing APIs and potentially deprecating old versions. Anything that might confuse our old components, we would do here. And lastly, we're going to do cleanup. So remember back in phase one I mentioned we have to merge both the old and new RBAC permissions into the cluster at the same time. This is the phase where we're going to clean up those old RBAC permissions and any other temporary objects used during the install process.
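The ordering and the RBAC handling described above can be summarised in a short sketch. It only models the sequence as told in the talk and is not how virt-operator is implemented; the phase descriptions, the function names, and the representation of a permission as a `(resource, verb)` pair are all illustrative assumptions. The position of virt-controller between virt-handler and virt-api is inferred from the partial sentence in the recording.

```python
# Update phases in the order described in the talk.
UPDATE_PHASES = [
    "install prerequisites (CRDs, merged RBAC, webhooks, services, Prometheus rules)",
    "roll out virt-handler",
    "roll out virt-controller",
    "roll out virt-api",
    "wait for old control plane pods to be gone",
    "apply backwards-incompatible API changes",
    "clean up old RBAC and temporary objects",
]


def merged_rules(old_rules, new_rules):
    """Phase 1: union of old and new permissions so both versions keep working."""
    return set(old_rules) | set(new_rules)


def rules_after_cleanup(new_rules):
    """Final phase: only the new release's permissions remain."""
    return set(new_rules)


if __name__ == "__main__":
    old = {("virtualmachineinstances", "get"), ("pods", "delete")}
    new = {("virtualmachineinstances", "get"), ("pods", "patch")}
    for number, phase in enumerate(UPDATE_PHASES, start=1):
        print(number, phase)
    print(sorted(merged_rules(old, new)))
    print(sorted(rules_after_cleanup(new)))
```

The point of the union is that during the rolling update an old pod and a new pod may both be serving requests, so neither may lose a permission it depends on until the old pods are confirmed gone.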
A
A
Okay, wow, time is creeping up on me here. So that's a quick overview of the install and update process as it is today. If you're pretty familiar with the KubeVirt architecture, you might have noticed I never mentioned anything about updating the components that live inside the VMI pods. They're actually...
B
A
...the guest virtual machines, and we have a lot going on in there. So there's a KubeVirt component called virt-launcher, and then we have libvirt and QEMU in there as well. So how are those components updated? And the short answer is, well, they're not. So as of today we don't do anything. We don't touch your VMI workloads.
A
...vulnerability in any of these components within the VMI pod, then we need a path for automating the update of these components. And back in the very beginning of the presentation, I made a disclaimer that we're not going to touch your workloads unless you tell us to, and here's where I explain that. So I'm working on a new feature for virt-operator that lets us declare a strategy for updating VMI workloads, and this is an opt-in feature that's configured globally on the KubeVirt custom resource.
A
By using this new workload update strategy API, you can tell us what methods to use to update your VMI workloads. So right now I have two methods you can use. The first one is non-destructive: it's the live migrate method, which is going to cause a VMI workload to migrate into an updated pod with all the new components. So again, non-disruptive. The next one is evict, and most likely that's going to result in a VMI getting shut down.
A
So if the VMI is managed by a VM object with the run strategy set to Always, then the VMI will restart in an updated pod automatically, but that's a destructive action, or it's disruptive, because it's going to actually interrupt the workload and restart it. So if you want your workloads to update in a non-disruptive manner, set this live migrate method, and that will guarantee your workloads are only impacted by live migration.
A
And if you don't have live migration enabled in your environment for some reason, then evict is really the only option you'd have if you want this behavior.
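One way to picture how the two methods combine is the decision sketch below. It is an assumption-laden illustration rather than the feature's real logic: the method names `LiveMigrate` and `Evict` are simply how the two options from the talk are spelled here, and `pick_update_method` and `vmi_is_migratable` are hypothetical names.

```python
def pick_update_method(configured_methods, vmi_is_migratable):
    """Return the method to use for one VMI, or None to leave it untouched."""
    # Prefer the non-disruptive option whenever it is allowed and possible.
    if "LiveMigrate" in configured_methods and vmi_is_migratable:
        return "LiveMigrate"
    # Fall back to the disruptive option only if the admin opted in to it.
    if "Evict" in configured_methods:
        return "Evict"
    # Opt-in feature: with nothing configured, the workload is not touched.
    return None


if __name__ == "__main__":
    print(pick_update_method([], True))
    print(pick_update_method(["LiveMigrate"], True))
    print(pick_update_method(["LiveMigrate"], False))
    print(pick_update_method(["LiveMigrate", "Evict"], False))
```

Configuring only the live migrate method therefore means a VMI that cannot migrate keeps running in its old pod, which matches the guarantee that workloads are only ever impacted by live migration.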
This feature is pretty much done. I think we'll see it in the March release of KubeVirt, but, you know, we'll see. Reviews are welcome. There's the link to the pull request. All right, so I'm going to close things out. When I said I want to make KubeVirt updates boring, one of the primary ways we can ensure it stays boring is by having a solid test suite that exercises the entire update process. So right...
B
A
On every PR, we have functional tests that execute to verify updates work between the latest official KubeVirt release and whatever code is present in that pull request. So this means that if there's anything in a pull request that breaks the update path from the previous KubeVirt release, it's not going to make it into our code base. So this test is actually gating code from being merged.
A
These tests also verify our commitment to zero downtime for your VMI workloads. So in the test cases we're doing things like: start the VMI workload, perform an update, then validate that the VMIs are still running and we can still lifecycle-manage them after the update has completed. Okay, I tried to cover a crazy amount of ground in 20 minutes. There was a lot more I wanted to say, but yeah, I think I'm out of time.
A
Hopefully this at least gives everyone a sense of how we approach updates in KubeVirt and some of the techniques we're using to provide our zero downtime guarantees.
A
B
A
If I have a few more minutes, then yeah, I'll keep talking. Maybe I did such a good job in my presentation that nobody had any questions because it was all perfect, so...
A
I wanted to get into that and I didn't get to. During the presentation I talked entirely about updating KubeVirt.
A
But, you know, that's one update path that you have when you're running KubeVirt. You also have to deal with cluster updates. So if you're updating the cluster, then how do we avoid disruption to our virtual machine workloads when we have to do things like, for example, update a cluster node? That is configured a little bit differently, but we have a way of doing that. We have something called the eviction strategy, and it's something that exists on our VMI API.
A
A
Usually, I mean, they can be stateful as well, but certainly virtual machines are usually stateful, so we want to preserve a running virtual machine even in the case of an update of the entire cluster. So using the eviction strategy, we can make virtual machines portable across nodes by setting that live migrate eviction strategy. So when a...
A
A
B
A
...will break, including console. I think there's two questions in this one. So, how does the update process handle VMIs with host disks or container disks? Which update process are we talking about? I'll answer both. So, the KubeVirt update process: again, we don't touch your VMI workloads unless you tell us to. So if you don't tell us to do anything, then nothing's going to happen to those, okay, in the KubeVirt CR update.
A
So by default, nothing's going to happen to those; they're just going to keep running. If you wanted to migrate them, then you're not going to be able to migrate with a host disk, something that's local to the specific node that it's running on. It's just not eligible for migration. So your only option, if you actually need to update the VMI components like when I was talking about workload updates, is evict, which is probably going to restart your virtual machine. And the second part of that question was: will the change in IP break existing connections?
A
It's certainly going to break console, because console doesn't even go through the API, I'm sorry, the IP of the VMI. It's kind of going through a back channel.
A
As far as the API, excuse me, the IP change: well, if your IP changes, then yes, but it really depends on what network you're using. So there's lots of options for setting up a virtual machine with different networks and things like that. Some of them will work successfully in a migration and some of them won't. So if you're using the bridge connection with the pod network, then that's one of the ones that won't work. You're going to get a new IP address. It's going to probably not be successful.