From YouTube: Kubernetes SIG Node 20220504
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
Cool, so hello, and welcome to the first meeting to kick off the SIG Node reliability project, mostly focused around improving the testability of the kubelet.
A
Yeah, we have quite a few people here today, which is exciting. I guess everybody here saw my emails, but let me go over and make a bit of an introduction about what I want to start with.
A
This is to start thinking about how we can improve the reliability of the kubelet, partially through testability and refactoring where necessary, to make it easier to land changes in the future. Where we are today, we have a lot of pretty scary regressions that land with basically every Kubernetes release while we land features. To avoid ending up in a place where we will eventually want to rewrite the kubelet, I'd rather we get the test coverage there today.
A
Yeah, so let's see if I can show my screen.
A
I cannot show my screen without restarting Zoom, so can somebody volunteer to share the document that I linked to in the email?
A
Okay, so let's just go on assuming that everybody can see the document; the link's in chat if people are interested. It's kind of a difficult thing to kick off, but what I want us to do today is to talk about the ways we think we can improve the kubelet, the areas where testing is missing, from people who have experience with different parts of the kubelet, and then sort of define some goals for what we want to try and achieve in 1.25.
A
Cool. So Sergey gave us some notes that are a pretty good starting point; they cover some of the problems we have with our existing basic unit test coverage and some of the e2e gaps that we have today, with some specific things we're missing. But to quickly summarize: basically, we have a lot of coverage... well.
A
The coverage we have today mostly focuses on the happy cases: we have a lot of tests that assume everything is fine, or that set everything up in a way that is guaranteed to be fine.
A
But we don't really test what happens when things in the kubelet start to fail. We don't have any tests that cover, say, what happens if the CRI fails to respond at a given time, and we've had bugs in the past where, say, you restart containerd at the wrong time and the kubelet will never fully reconcile something, for example. Those are the kinds of things we don't really have any way to even reproduce and test today.
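(A minimal sketch of the kind of failure injection being described: a test double that wraps a CRI runtime client and fails a configurable fraction of calls with a deadline error, so kubelet reconciliation under CRI flakiness could be reproduced in a test. The `flakyRuntime` type is a hypothetical illustration, not existing kubelet test code.)

```go
// Hypothetical test double: wraps a real CRI runtime client and makes a
// configurable fraction of calls fail as if the runtime never responded.
package critest

import (
	"context"
	"math/rand"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

type flakyRuntime struct {
	runtimeapi.RuntimeServiceClient // underlying real client, used for passthrough
	failureRate float64             // fraction of calls to fail, in [0, 1]
}

// ContainerStatus randomly simulates a runtime that did not respond in time.
func (f *flakyRuntime) ContainerStatus(ctx context.Context, req *runtimeapi.ContainerStatusRequest, opts ...grpc.CallOption) (*runtimeapi.ContainerStatusResponse, error) {
	if rand.Float64() < f.failureRate {
		return nil, status.Error(codes.DeadlineExceeded, "injected CRI timeout")
	}
	return f.RuntimeServiceClient.ContainerStatus(ctx, req, opts...)
}
```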
C
You know, a failure, you know, a timeout on the client side or on the server side, and how that would affect the overall solution. So yes, definitely, each of the APIs, especially the primary ones like pod run, container create, and start, we need to go over in much more detail.
C
How did we do it from the dockershim perspective? When we did the container runtimes, sometimes in the implementation there was a gap, a knowledge gap, between what the kubelet wanted or expected and what we were doing.
C
Another place I think is important to talk about here is the life cycle from release to release: the expectation that pods can still be running after you've upgraded either the kubelet or the container runtime, or the runtime engine, right, or a virtual machine, and even the CNI, right. The expectation for that needs to be covered in some test cases.
C
So I mean, that's why I'm here. I think we definitely need to work in these areas a little bit: this whole bigger life cycle, as well as the details of the APIs, what happens if there's a failure. And it doesn't just take the kubelet; it's the kubelet and cAdvisor, where we're still using cAdvisor for some part of the solution for metrics and monitoring.
B
So my one thought here is: maybe we could just kind of agree on what it means to be unreliable. I think, historically, I would look at this as: the kubelet is unreliable when it is violating a published invariant in the Kubernetes API.
B
So the longest-running issue I'm aware of, that we've had intermittent challenges with, is a pod either incorrectly or correctly appearing to toggle between a terminal and non-terminal state, which seems to be the most common. I'm trying to think through, I don't know who else is on the call historically here, but that in some cases has been a reliability issue on both the control plane side of Kubernetes as well as, I guess, the kubelet data plane side, for our community as a whole.
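(The invariant described here, that a pod must never leave a terminal phase once it reaches one, is easy to express as a test assertion. A minimal sketch, assuming a watcher has recorded the sequence of phases it observed for one pod; the helper below is illustrative, not existing test code.)

```go
// Asserts the published invariant that PodSucceeded and PodFailed are
// terminal: once observed, the phase must never change again.
package invariants

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func isTerminal(p corev1.PodPhase) bool {
	return p == corev1.PodSucceeded || p == corev1.PodFailed
}

// CheckPhaseTransitions walks the phases observed for one pod and returns
// an error on the first terminal -> non-terminal toggle.
func CheckPhaseTransitions(phases []corev1.PodPhase) error {
	for i := 1; i < len(phases); i++ {
		prev, cur := phases[i-1], phases[i]
		if isTerminal(prev) && prev != cur {
			return fmt.Errorf("invariant violated: pod left terminal phase %s for %s", prev, cur)
		}
	}
	return nil
}
```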
B
I think that, to me, is a key thing, just something that we could all agree on. Then there are subtler definitions of unreliability that I'm not sure how folks feel about, like when we mention what happens when a runtime restarts, or that type of thing.
B
It's clear that system deployers of Kubernetes in some cases have made mitigations to those issues, right? Even if it's at the systemd unit file level: if your runtime restarts, let's just restart the kubelet to be safe anyway, right? These types of tricks can happen, and in some cases that means we haven't been clear about our own invariant of expectation there. But I don't necessarily know if we violated one, right?
B
I think it's worth talking through. And then there's been a weird, squishy area in between, particularly around CNIs, where there are expectations with respect to when the kubelet should communicate back to the control plane or not about particular state transitions. I don't think we've ever documented any expectations around that, and I'm not aware of us actually violating anything with respect to what we've published.
B
But I'm wondering if in this group we're looking at ensuring, first, that we are meeting the invariants we advertise, or are we trying to also maybe expand the set of invariants we want to give confidence around? Like thinking about when the kubelet should advertise back to a control plane that a pod has an IP address. That's something we've never made any statement around, so I'm kind of just curious as a group here.
B
If we can do the common definition of reliability, though, these are the dimensions I could see us looking at. So I have an idea about that, but I'll let Clayton go first.
E
I'll keep this short. A couple of things I was thinking about as we went through some of the cleaning up of parts of the life cycle, where we'd fix something and it would expose a bug, or it would interact with another subsystem and then show up as another bug, that same long propagation.
E
At that time it was fairly easy to capture tests that would have caught the issue, but I think it's hard for us to keep a pipeline of that going and work through it. We could record it at the time, we'd discuss it, there were a lot of really good discussions, and then they would die out a little bit.
E
You know, three months, three weeks later, someone would write another test. So that might be a way to go, if we can start to build up some of those lists. And the second point, I saw it kind of get hit, but it's kind of Derek's question about what counts as unreliability: a lot of the latent bugs are very observable in a large system where certain invariants are violated, but super hard to pick out from a unit test or even a node e2e, where, once we knew what the problem was...
E
It was super obvious. Seeing the problem was one act of filtering out reports from users that are of subtle things, or something that's a small flake in a small part, because we put the kubelet into a stressed state and we're trying a bunch of different things.
E
I think there's a space, and I don't want to call it chaos testing, but there's a space for us to have a little bit more complex of a workload over a single node, without going all the way to a full system, or potentially clusters that start in a specific state, like a lot of our performance e2es. Those kinds of variances showed up a lot in pod shutdown, where we would never clean up pods.
E
And so I think we can do a little bit better in that middle ground: a little bit higher than node e2e and a little bit less than full e2e.
A
Yeah, so the way I was thinking about this is: for a lot of the kubelet today, we don't have documentation of what our expectations are, because there are no tests for a lot of things. So it's less about whether the kubelet fails in a lot of latent and unexpected ways today, and more about giving us confidence to know that when we make changes in the future we're not going to cause regressions. A lot of that is writing tests that essentially document those invariants, and then also introducing some of the failure testing you were talking about into places like gRPC interfaces and stuff.
E
Great, and I think the invariant is a great way to frame it. Describing the invariants that should hold in the system is actually surprisingly easy to capture.
E
Certainly on the OpenShift side we spent a long time assessing the sequence of transitions, and there are some really obvious ones that pop out. I think we can do a lot better, and this actually shows up in controllers and kube-scheduler invariants, where we documented them, built Kubernetes, and then never went back and really focused on: are we always satisfying our invariants? So I completely agree, Danielle.
B
One of the easy ways to fix a flaky e2e was to extend the timeout period, because the pressures of the infrastructure you were testing on were difficult. So I'm wondering if there's a strong desire to try to push timing invariants or timing guarantees here when we talk reliability, versus going behavior first and making timings secondary, which is where I would lean. But I know I've written tests that I had to later go back and push the timeout out a little further, because the pressures on people in a given moment make that the easiest fix.
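(One common behavior-first pattern, polling for the expected state instead of asserting after a fixed sleep, keeps timing secondary in the way just described. A minimal sketch using the wait helpers from apimachinery; the pod-phase lookup is illustrative.)

```go
// Poll for the behavior we care about instead of sleeping a fixed interval,
// so slower infrastructure only stretches the wait, not the result.
package e2eutil

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func WaitForPodPhase(c kubernetes.Interface, ns, name string, phase corev1.PodPhase, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err // give up on hard API errors
		}
		return pod.Status.Phase == phase, nil
	})
}
```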
F
I think Clayton covered mostly what I wanted to say. I just want to mention one thing from the past.
F
We did the pod spec auditing, so for every pod-related field and its invariants we added tests. That's a couple of years ago; we haven't done that for the last many years. So I noticed that many new features add new fields or updates to the pod, and even new invariants, and we don't have specific tests, especially for the non-happy paths; we didn't add those tests. So I think maybe we can start from there.
F
When we did the pod auditing we added the first round of node e2e tests, but we are missing tons of things. Also, there are tons of deprecated features where we only simply removed the tests, but we didn't revisit them; we removed certain tests because they tested a deprecated field, but actually they covered more semantics. Another thing is, a couple of years ago when we did that pod spec auditing, we saw the next...
F
...one where I wanted to drill down to the node level, and we never did that; we just ran out of time and didn't continue. The third one is: we do have the stress tests, but the stress test itself, from when we added it, always gives us a lot of errors, because we are overstressing, so people treat it that way. We had the goal back then to make it a little bit more reliable, right.
F
So it's kind of how to make sure... because the node e2e has basically tried to remove the uncertainty, right, like trying to remove the scheduler, to make sure the node-level test is more predictable.
F
So, like, what's the input, what's the output. But obviously we didn't completely finish that one, because we still have the API server, we still have the control plane there, and we didn't guarantee the resources for those because of the node-level problem. So the stress test is never really run continuously.
F
We have a lot of problems there, actually. Especially, you mentioned the unhappy path earlier, right: you overload the node, and because we didn't test those, the stress tests are disabled and only run manually when people qualify a new Kubernetes version. So there are a lot of things to enhance there.
B
Maybe the last comment I'll add here is, in my experience we've also hit reliability issues when we lack a feature or any reasonable sense of defense on the node, and then the kubelet is blamed for that source of unreliability. So maybe just getting transparency among this group with respect to issues that I personally sometimes look at as best effort for the kubelet, but not guaranteed.
B
Like what the kubelet does in the face of exhaustion, I often view that as best effort, and I would want to get consensus on whether we all agree on that, because, like, the accounting was slow and we lacked core features for it. I/O contention I also view as best effort.
B
Maybe just also clarifying: are those the unhappy paths that we want to tackle in this group? Because they're often the root cause of an unreliability issue, particularly between kubelet and runtime communication, where I feel like we're defenseless right now, and we should just be honest about that in some ways. Whereas the measured failure rate of pod life cycle invariants being invalid is a different thing, I guess. Every one of us has had different experiences.
B
And I only state that because we have to kind of measure what we mean by reliable, right? Absent that measurement, it's hard to focus our problem. I definitely agree on the testing side that there are areas ripe for improvement, even in unit testing, though I have sometimes found unit testing to not be as valuable when it came to the kubelet. And I'm also not sure, as an unspoken but expressed feeling here: is it the externally measured view of reliability, or the internal confidence?
B
We want to gather as a community with respect to forward-looking changes, because I can definitely agree that there is confusion between, like, pod workers and the core kubelet iteration loop. Well, it works, and you could externally measure a dimension of reliability for it, but it is confusing to many new folks that come in. I think back, Dawn, do you remember when I asked, I said to Vish...
B
Oh, we should draw a map, or some document that says how everything in the kubelet works, and at some point there wasn't uniform agreement that you even needed that; it was like, oh, just look at the code, it's obvious. So is this more a symptom of the latter than the former, is what I'm also wondering, because there was definitely a bit of pushback. Yeah, okay.
A
We have the problem that when anything touches that bit of the code heavily, we end up shipping either regressions or behavior changes that we don't understand and that are then hard to triage. So improving both our coverage of how we test that, and also doing some level of refactoring to have better unit tests.
B
Yeah, one broad area I could see that we also haven't covered is that container runtimes are not always reliable in how they report back to the kubelet. That has definitely happened, and our reliability is only as good as the systems we invoke, I guess. So we should figure out where we want to focus our defense a little bit; like, cAdvisor metrics collection often stalls, right?
B
It's like a therapy session, right? When a runtime reports back context deadline exceeded, you're like, okay, well, the kubelet's defenseless here too. Or one list-containers call says something exists and the next one says it doesn't. All these things can happen, and so I just want to try to understand: are we trying to look at the total system here, or is it really to the point you were talking about, Danielle, which was confidence in the evolution of a core control loop, which I can totally get behind?
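(A sketch of the kind of defense being discussed: treating a DeadlineExceeded from the runtime as "state unknown" rather than "container gone", and retrying with backoff. The function and its retry policy are illustrative, not the kubelet's actual code path.)

```go
// On a CRI timeout the container state is unknown, not absent: retry with
// backoff and only act on a definitive answer from the runtime.
package runtimeclient

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func listWithRetry(ctx context.Context, rt runtimeapi.RuntimeServiceClient) ([]*runtimeapi.Container, error) {
	backoff := 500 * time.Millisecond
	for attempt := 0; attempt < 4; attempt++ {
		resp, err := rt.ListContainers(ctx, &runtimeapi.ListContainersRequest{})
		if err == nil {
			return resp.Containers, nil
		}
		if status.Code(err) != codes.DeadlineExceeded {
			return nil, err // a real error, surface it
		}
		time.Sleep(backoff) // timeout: state unknown, try again
		backoff *= 2
	}
	return nil, status.Error(codes.DeadlineExceeded, "runtime did not respond; keeping last known state")
}
```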
A
I will let David and Mike reply first.
G
I can go ahead. Yeah, I totally agree with a lot of the discussion here. I just wanted to add one point I think has been touched on, which is that we also need to step back. We have tons of jobs and different variants of tests running today, and I think a lot of it has been a little bit lost: what are we testing, and why are we testing it?
G
For example, what I'm trying to say is we have a lot of different variants: different containerd versions, different kernel versions. We have cgroup v1, we have cgroup v2, both of which we support. So I think we need to step back a little bit and think, okay, what are the valid configurations that we support on the OS level and on the containerd level, and then exactly what tests do we need to run?
G
For example, one of the big things coming forward is that a lot of the OS distros are moving to cgroup v2 by default, but as a community we probably still want to continue to support cgroup v1, right? So we need to think, okay, which test combinations do we actually care about, which containerd versions do we want to support, and what's the policy moving forward: okay, we always want to test this containerd version with this image, and so forth.
G
So I think that's also something that has been a little bit ad hoc, especially updating OS images and so forth, where I think we should define a little bit better what our policy is and what we want to test.
A
We actually have an open issue for that, with some discussion going on mostly between me and Dennis right now, I think. But yeah, making sure we have cgroup v1/v2 parity, and then making cgroup v2 the default, for example, is something that I hopefully want to try and get done early this cycle.
G
Yeah, exactly. For example, if you sort of multiply it out, you have kernel versions times containerd versions times systemd cgroup drivers times cgroup configurations; it's kind of a lot of stuff, so maybe it's not reasonable to test every single configuration. So maybe we should just say, okay, for cgroup v2, maybe we only support one driver, and we're going to test that, making some type of claims around what we really want to test and support.
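(The combinatorics being described fall naturally out of a table-driven test matrix, which also doubles as documentation of which configurations are claimed as supported. A minimal sketch; the configuration fields and the supported set are illustrative, not an agreed policy.)

```go
// Enumerates the node configurations the suite claims to support, so the
// supported matrix is explicit instead of ad hoc per CI job.
package matrix

import "testing"

type nodeConfig struct {
	cgroupVersion string // "v1" or "v2"
	cgroupDriver  string // "systemd" or "cgroupfs"
	containerd    string // containerd version under test
}

// Hypothetical supported set: one driver per cgroup version, two containerd lines.
var supported = []nodeConfig{
	{cgroupVersion: "v1", cgroupDriver: "cgroupfs", containerd: "1.5"},
	{cgroupVersion: "v2", cgroupDriver: "systemd", containerd: "1.6"},
}

func TestSupportedConfigurations(t *testing.T) {
	for _, cfg := range supported {
		cfg := cfg
		t.Run(cfg.cgroupVersion+"/"+cfg.cgroupDriver+"/containerd-"+cfg.containerd, func(t *testing.T) {
			// Each subtest would provision a node with cfg and run the
			// node e2e suite against it; elided here.
			t.Log("would exercise node e2e against", cfg)
		})
	}
}
```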
C
Backing up to the question Derek had, or the point he made, that it's the container runtime too, right: I agree, and part of it is the way we do parallelism and execution for instantiating pods and containers and pulling images, those sorts of things. We've got this direct control loop, where the kubelet's telling the container runtime what to do, and there's some expectation that these things happen synchronously without timeouts. But containerd will accept as many parallel requests as the kubelet asks it to, right up to running completely out of resources and you getting the context timeouts I was talking about earlier. So we have another model that we could provide, that might be better and faster than the status update loop that we currently have right now, the sync loop: we could probably just push you guys a stream.
C
Some events that'll notify you what we've actually changed in status, so you can have a better expectation of what's happening in the container runtime, beyond what you've asked for and what we've replied in the synchronous call, which becomes asynchronous when you get a timeout, okay, Derek? So maybe it's an architectural decision to go down that route, but we've got the events from the execution of the containers when they change status.
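(A rough sketch of what the event stream being proposed here could look like; this is a hypothetical illustration of the idea, not an existing CRI API.)

```go
// Hypothetical shape of a runtime-to-kubelet event stream: instead of the
// kubelet polling container status, the runtime pushes state changes.
package crievents

import "context"

type ContainerEventType int

const (
	ContainerCreated ContainerEventType = iota
	ContainerStarted
	ContainerStopped
	ContainerDeleted
)

type ContainerEvent struct {
	ContainerID string
	Type        ContainerEventType
	Timestamp   int64 // nanoseconds since epoch
}

// EventSource is an interface a runtime could expose; the kubelet would
// range over the channel and reconcile on each event instead of re-listing.
type EventSource interface {
	Subscribe(ctx context.Context) (<-chan ContainerEvent, error)
}
```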
C
Yeah, and to be fair, containerd is parallel in its execution much, much more than Docker was. So when you were on dockershim and then you moved to containerd, all of a sudden things changed a little bit: we started accepting more of the requests than you would have been able to get through before.
F
I think I'm the next one who has their hand up here. So I agree with Michael in general, but I think the one problem is how Kubernetes integrates with the container runtime, the whole integration test issue. We know this problem; this is why we did the CRI design and implementation, so we introduced the CRI tests. But I don't think that we are actively, let alone proactively, maintaining those tests.
F
So for all those, on the SIG Node side we do have the CRI tests; containerd even took over those CRI tests as pre-submits for some time. I haven't checked recently whether we still have that, but I didn't see many new tests added over time, and even at that time those tests...
F
I remember back then only one engineer worked with me, and what we did is definitely not comprehensive, so we really haven't covered the e2e. But on testing philosophy: to really have high reliability you need the mix, right, so you need integration tests, you need stress tests, and you also need the isolated environment for component tests.
F
I do think, on the container runtime side, we didn't keep our promise before. We thought every single new container runtime version would actually have the suite of CRI tests associated with it, and then we'd publish those things, kind of like the conformance tests we did for Kubernetes, right? Those kinds of things, I have to say, we didn't achieve.
F
We didn't even keep a performance dashboard; after a couple of years we gave up. So once in a while we receive a new container runtime version and maybe it's using more resources, with unexpected results. Those kinds of things we definitely can do better on. It's not like we never did it, and it's not a new effort; it's just that the previous effort we already made, we didn't actively maintain.
A
CRI-O and containerd do both run node e2es on every PR at least, and I think that helps catch a lot of stuff.
F
But last time I checked, when there's a new feature we haven't proactively added any tests, so the new version basically didn't capture problems. Even when we found a container runtime problem, there should be a regression test, right? That's just engineering practice we failed at: add the regression test there, right? We did that on the node, or at the Kubernetes level, when we had a refactor and then found a problem. But we didn't here, because I think people think about the container runtime...
F
...as another component, but actually, for the CRI and the container runtime's implementation of the CRI, that interface is owned by SIG Node, because this is our component and that's our responsibility. We can find compatibility issues and all those kinds of things. I just want to point that out, because we do have the effort.
E
This is a response, but one of the things we were talking about, reconciliation, plays into Derek's point a little bit. So I am trying to clear out time to go and document some of the flows that were touched, specifically around places where the kubelet invariants weren't handled because the control loops weren't accurate. Admission is a key part of that: the admission life cycle does not resiliently handle changes to the kubelet.
E
Well, so at least there is a draft KEP sitting in my editor that is at least halfway through trying to document some of those assumptions.
E
So that's something I'm trying to commit to get done, and I will certainly take feedback from folks around some of the other things I heard mentioned, like better documentation, better visualization of the flows, an opportunity for us to describe invariants and where we should be doing reconciliation but are not, getting to a point where we can have some productive discussion on places in the kubelet that are sources of issues. So this is more of a side thing; it's just on my plate.
A
I am excited to read it. So I guess now we're starting to run out of time; we have 15 minutes left. Do we want to quickly summarize the main areas we've identified and then prioritize what we actually want to do short term? Because, you know, it's open source, we have a limited amount of hours in the day and effort that we can actively spend, so let's sort of try and figure out what we want to do in the next release cycle versus in the next couple.
A
I will take that as sounding good to everyone. So, if I'm remembering correctly, and it's getting late for me, so maybe I'm not, we've basically identified the few areas that I kind of thought we would, which is to say: we want to document what exists today through our tests, but we also want to improve the testing that we have that has been underinvested in.
A
Some of that is things like the CRI tests and some of the contract testing around how we interact with CRIs, with a hopeful future case of potentially introducing some failure testing, like a gRPC proxy thing that will just randomly sleep or whatever; we can figure something out. Failure testing is always a bit dirty, but we'll figure it out. And then, as reviewers and approvers of parts of the kubelet, we start requiring that as we fix bugs and everything else in the future.
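(The "gRPC proxy thing that randomly sleeps" could be as small as a client interceptor installed between the kubelet and the runtime socket in a test build. A minimal sketch; the probability and delay bounds are made up for illustration.)

```go
// A gRPC unary client interceptor that randomly delays a fraction of CRI
// calls, approximating a slow or stalled runtime for failure testing.
package faultinject

import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc"
)

// RandomDelayInterceptor sleeps before roughly `probability` of calls.
func RandomDelayInterceptor(probability float64, maxDelay time.Duration) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		if rand.Float64() < probability {
			time.Sleep(time.Duration(rand.Int63n(int64(maxDelay))))
		}
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}
```

This would be wired up with grpc.WithUnaryInterceptor when dialing the runtime endpoint in a test harness, leaving the production dial path untouched.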
B
No, the one thing I was trying to think about how to ask is: there are definitely features in the kubelet that are what I would classify as cluster-to-workload-specific marriages of desired outcomes, and those are often the ones that are not as extensively tested as we'd like, and they might use certain exotic resources, not the 95 percent case of resources.
B
One thing I was curious about is if anyone has found reliability issues in those areas that we want to bring forward, that maybe none of us are aware of. Like if someone wanted to say, CPU isolation with a particular variant is just not working for me, or I'm seeing incorrect results with this particular workload, or entire features like huge pages with a particular page size are not working for me.
B
I'm just curious if there's anything in that realm or dimension that we should focus on, or maybe kubelet plus a particular device we're seeing issues with, just so that maybe we can step back and ask: is there anything we can do as a broader community to get more attention to that area?
B
Does anyone have any concerns or awareness of those things? I mean, as a vendor I know we're using those in particular areas, but I only see what we see; I don't know what others are seeing.
A
The only one I've seen reported on upstream Kubernetes recently is the one where, on cgroups v2 when using cgroupfs, something failed in cAdvisor. But I think David has his hand up and worked on that.
G
Yeah, I mean, to answer that slightly differently, just as a vendor also: I think one of the biggest issues we see is disk related, so slow IOPS, slow disks, causing all types of issues from the container runtime to the kubelet to other things. Those are hard to combat because we don't have full control over the disk, but I think...
G
We've been looking at different I/O schedulers, for example, and saying, okay, can we prioritize the kubelet and container runtime over user workloads? So if a user workload suddenly uses up all the IOPS, we want to reserve some for the system. That's something we're experimenting with, but perhaps doing more of that in the community, figuring out best practices around what we can do to provide more isolation to the kubelet and container runtime even under I/O pressure and disk pressure, I think that would be valuable.
A
I think that one would be super valuable in combination with documenting the permutations of stuff we support. Like: does the kubelet even explicitly require a given kernel version? Not really; if you have the features it kind of works, maybe, but we don't document exactly what we need, and when and where, especially moving forward with cgroups v2.
F
Those kinds of things we still could do. So, for example, we have kube-reserved and system-reserved; we could reserve bigger amounts and then try to see how effectively we can avoid the kernel OOMs, because that's really a big loss, right? So with kube-reserved we reserve some of the things; maybe Kubernetes could do something.
F
So you could adjust those kinds of values and see: if customers configure Kubernetes properly, can the kubelet take action and do the handling? Those tests actually can help once we switch to cgroup v2 and we have the better user-space OOM handling, which is in the KEP: we propose that user space will handle it, because we think we can do a better job if we protect the node agents correctly, right?
F
So that kind of reserving, not just for kernel threads but for the user-space agents, and what we can do there: I don't think we've put in much. So when I think about those e2e tests, I do think about them generating new ideas, new features. It's not that new, because we already did it in the past, but still: how do I make it generic enough to fully benefit the open source community?
F
That's a much harder request, because community users actually vary a lot, right, so we have to provide a more generic solution. On those kinds of things, I also want to go back to what Derek asked earlier: we do have customer cases once in a while where, say, the CPU manager may not work as expected. But I think a lot of the time customers go another way and work around the problem, so that feedback might not really bubble up...
F
...or get reported back to open source, because a lot of the solutions sit next to what we are trying to enhance in CPU management. The problem is customer cases are sometimes quite customized, and I think the open source solution actually needs to be more generic, because customers, after they understand CPU management, basically build some custom solution to work around those problems.
B
Yeah, and then maybe some other illuminating examples where we could publish expected invariants: we've encountered node reliability issues in the face of exec liveness or readiness probes, where there's almost a misunderstanding with respect to how much memory an exec probe actually consumes, from some of the underlying issues.
B
Like, I think with runc you couldn't launch something with less than a particular megabyte count, for example. So we've seen systems where probe usage basically becomes the dominant CPU and memory consumer on the node, and that's a surprise. And it's also a surprise because we haven't actually done a perfect job; I think we probably vary by runtime now on who gets charged for a probe. Does the workload get charged for the CPU consumption, or does it go under the system's charge?
B
Is an image pull confined to the pod sandbox that is trying to use that image, or is the CPU associated with pulling that image charged to the node as a whole? We've done things as a vendor to try to...
B
But a lot of those things, in my experience, are critical to actually keeping stuff reliable, and maybe we can do a better job of getting uniform best practice on some of that. Particularly, the kubelet is blamed when noisy neighbors cause problems, and rarely is the workload blamed. I know myself, with my Red Hat hat on, the more I've been able to focus the problem on the workload and away from the kubelet, the better.
A
Cool, it sounds like we have a big bucket of work we now know we need to do and agree needs to be done, which is progress. When the recording for this is available I'll go through and write up notes and start filing a couple of issues for things we want to think about, and then hopefully by next week's SIG Node meeting we can sort of prioritize those a little bit.
B
Cool, this is good to see; like, plus one.
D
Oh, I was just lurking here, but I heard the one issue that is near and dear to my heart getting mentioned without me even posting it. So, the liveness and readiness exec probes: you spawn a hundred probes that do liveness and readiness checks, even if they're just doing, you know...
B
Yeah, yeah. So, I mean, we've definitely done things to mitigate some of that.