From YouTube: Kubernetes SIG Node 20230509
Description
SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq
A
B
Yeah, so last week we had this device manager bug that was fixed, and I have the link added in the agenda doc. The aim was to ensure that the recovery flow is correct, and what we've added as part of that is extra checks to make sure that an application pod requesting a device is admitted only if the device plugin has registered itself to the kubelet and the underlying devices are healthy.
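The admission rule described above can be sketched as a small decision function. This is purely illustrative — the names and data structures are hypothetical, not the kubelet device manager's actual code:

```python
# Illustrative sketch of the admission rule discussed above: a pod that
# requests an extension resource is admitted only if the device plugin for
# that resource has registered with the kubelet AND at least one of its
# devices is reported healthy. Names here are hypothetical.

def admit_pod(requested_resource, registered_plugins, device_health):
    """Return (admitted, reason) for a pod requesting `requested_resource`.

    registered_plugins: set of resource names whose plugin has registered.
    device_health: {resource: {device_id: is_healthy}} as reported by plugins.
    """
    if requested_resource not in registered_plugins:
        return (False, f"device plugin for {requested_resource} not registered")
    healthy = [d for d, ok in device_health.get(requested_resource, {}).items() if ok]
    if not healthy:
        return (False, f"no healthy {requested_resource} devices")
    return (True, "admitted")
```

The point of the fix, as described, is the first branch: right after a node reboot, before the plugin re-registers, the pod is rejected at admission time rather than started against stale checkpoint state.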
B
The cherry-pick deadline is the end of this week, and Sergey suggested in one of the review comments that it's best that I have a discussion with the community members and gather feedback. So I'd really appreciate everyone's feedback: if there are any concerns about backporting this fix, or if everyone's okay with it, please let me know.
B
A
C
Yeah, so actually what you summarized — what happened — is much like the old bug that we addressed a long time back in open source. That's why, when I earlier saw this bug first reported against OpenShift, I thought it was an OpenShift integration issue instead; but it is an open source issue. So this morning, after I saw you ask for a cherry-pick, I did ask GKE internally, because I saw some new changes there early last year, and of course this regression.
C
So that's why I wonder — it is a little bit of a concern for me, backporting this to the existing GA releases. So I want to ask for just a little bit of time, until I get the report back from internal production. Basically, internal production is just using open source Kubernetes, right? So I want to understand why I never heard this reported from production, and then I can process this whole thing. Absolutely no problem at all.
B
No problem — just one more thing I'd like to point out. I think the reason that it's affecting us is primarily in single-node deployments. Typically, if you have multi-node deployments, you can drain the nodes; but the use case that we care about is specific to single-node Kubernetes/OpenShift deployments, and that's where we don't have the ability to drain the node, and that's where it's affecting us the most. And I remember as well, David mentioned that in the case of GPU plugins he noticed something like that.
D
B
Yeah, so in the scenario that was reported to us, it was that the underlying device wasn't provisioned properly. But in the end-to-end test case that we've provided as part of the PR itself, what we do is intentionally prevent the device plugin from registering itself to the kubelet; because of that, the underlying devices are essentially not healthy, and we should see the application pod that is requesting the device —
B
It should fail at admission time. And this is, of course, after the node has been rebooted and the kubelet has been restarted — that's when you would notice this. You know, in steady state, all your application containers and device plugins, everything is up and running; you reboot the node, the registration does not occur, and the application pod which is requesting the device should fail.
E
Yeah, that all totally makes sense. But just to be clear: the device needs to be unhealthy after reboot — is that accurate, or...?
D
B
Yes. So, you know, in steady state everything is up and running, everything is allocated; and when the node is rebooted, there could be a scenario where your underlying devices haven't been provisioned. It could be because, say with SR-IOV NICs, you need to ensure that the device driver is up and running; or, for whatever reason, the device plugin pod appears after your application pod — because essentially at recovery time, or node reboot time, you have no control over the ordering of pods.
C
Thanks also for pointing out the single-node case, because from the summary I didn't realize it was single node, and we definitely didn't fix the single-node issue in the past. This is why, when you described the kinds of things that we fixed — we tried to make sure the device plugin does not expose the capacity until everything's ready — but not for the single node.
C
That case, I think, we didn't handle — and not just single node, actually: single node, and also node reboot, where the kubelet just simply reads from the checkpoint and claims it has the capacity. Because we did try, after the node died or rebooted, to re-register and recreate after the node restart — so there was some prior work — but the single-node case I don't think we handled, so thanks for bringing that up. So, okay, I will approve that one after all. Right, thanks.
A
All right — thank you, Swati! So next on the agenda, we have Karthik with the dynamic node resize proposal.
F
Yeah, hi everyone. So recently we submitted an enhancement proposal, node resize. Basically, we are trying to solve a problem where we want to dynamically change the node compute capacity without restarting the kubelet. The current workaround is that if you increase or decrease the node capacity, you need to restart the kubelet.
F
So we want to avoid that and make it completely dynamic. The use cases we are trying to solve are a situation where adding a new node to a cluster takes considerably more time than updating existing machines, and also where a user cluster wants to manage a limited number of nodes. So there are a couple of use cases we are trying to handle with this proposal. We have done some observations, created a Google doc, and shared it across the SIG.
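To make the trade-off concrete, here is a toy model — not from the KEP — of what a dynamic resize would have to do: when observed capacity changes, re-derive node allocatable the way the kubelet does at startup (capacity minus reservations and eviction thresholds), instead of requiring a kubelet restart:

```python
# Toy model of recomputing node allocatable when capacity changes at
# runtime. The formula mirrors the usual allocatable derivation
# (capacity - kube-reserved - system-reserved - hard eviction threshold);
# function and field names are illustrative, not the kubelet's code.

def recompute_allocatable(capacity_mib, kube_reserved_mib,
                          system_reserved_mib, eviction_hard_mib):
    """allocatable = capacity - reservations - eviction threshold, floored at 0."""
    alloc = capacity_mib - kube_reserved_mib - system_reserved_mib - eviction_hard_mib
    return max(alloc, 0)

def on_capacity_change(new_capacity_mib, reserved, eviction_hard_mib):
    """What a dynamic-resize kubelet would publish in node status after an
    out-of-band capacity change, with no restart required."""
    return recompute_allocatable(new_capacity_mib,
                                 reserved["kube"], reserved["system"],
                                 eviction_hard_mib)
```

Today this recomputation only happens when the kubelet starts, which is why the current workaround is a kubelet restart.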
F
G
I had a quick question on the discussion doc that you sent. You enumerated some things as non-goals, and I'm curious if you could explain the motivations for them not being goals. Is there something unique about your environment where you're not using devices, or not looking at swap, or is your workload not needing any isolation or affinity around CPUs and memory? I'm just kind of curious.
B
F
Yeah, so initially, when we started with that, our goal was only the dynamic resize itself, and we considered those as non-goals. But later, through the discussion and the feedback, it became clear that we have to make them goals, and we have to improve the resource managers' initialization and reinitialization depending on the changes, as it would affect those components as well. So recently we have moved them to the goals in the enhancement.
G
Okay, and then just to follow up: is your goal to be able to increase any resource as well as decrease any resource? Or, given the removal of those from the non-goals, is there any constraint around the resizing you're wanting to do?
F
H
Yeah — pod admission is a big problem for this PR, this KEP, and maybe some other KEPs. Like one thing we discussed — I commented somewhere here — in the AppArmor case, for instance: we had the case where AppArmor was disabled and then the kubelet restarted. What do you do with pods that need AppArmor? That was another kind of early-admission situation that needs to be covered.
F
H
But besides admission — this KEP is also targeted at automatic resize, like: whatever cAdvisor tells you, you do. So the first comment I put in the document was about throttling this situation. If cAdvisor is flaky, or your resources are quite dynamic — I don't know, maybe some flapping is happening because of a missed error in cAdvisor — then we should at least threshold how we change the node size. But ideally, I —
H
— think we may also want approval-based resize: cAdvisor can detect the new size, but then you may need to go through some API call to actually do the resize, when possible. I think that would be better for some cases, especially when there is some virtual environment that actually has a bigger size, but somebody wants to control it from their side and make it smaller and bigger based on demand — and maybe they have information about the virtual hardware that cAdvisor doesn't know.
I
Well, there are a couple of corner cases in how cAdvisor, if I remember correctly, publishes information about a machine. It will not showcase the problem scenario when, say, one of the threads of a core goes offline; or when — say, with CXL memory — a new memory region appears on a node, dynamically attached; or, even worse, when something disappears. When we are talking about VMs, where we can do memory ballooning, it's easier; but once we start juggling cores with different numbers of threads, it will be a whole can of worms we're opening, I would say, with the current implementation of the set of managers we have.
F
I
G
Maybe one of the things that would help me — my information may be out of date, or I may have just forgotten this — given that the feedback loop is based on whatever cAdvisor has observed, and obviously the changes that increase or decrease resources on the node happen out of band: is there anything you could point to in your KEP that describes the state of the art, whether on Linux or Windows hosts, with respect to what resources can be dynamically added or removed without a reboot?
G
I think that would be helpful in just understanding the scope of restrictions — I don't have that from memory. And so maybe — Dawn, you have your hand raised; maybe you do, I don't know. I'm sorry if I got that wrong. Thank you.
C
I think, everyone — there is a lot of complexity here; I totally agree. That's why I can lower my hand here, but I do want to bring some memory back to some folks here. I do have concerns as well — maybe I'm more concerned about this one. There could be a lot of non-goals there, but the complexity is embedded in this. Without that kind of resource management, without pod admission, I don't think this can fly or be beneficial to many users.
C
I also worry about people, even if we make this work. I think a couple of years ago something like the same proposal was sent to Kubernetes — I think we worked together and rejected that one, because of the real potential harm to Kubernetes. A similar proposal is kind of the dynamic node, or maybe virtual node — the virtual kubelet. It's quite similar: the node size can be expanded without limit.
C
So I worry that once we start on this one, it goes in that direction. I'm okay with building something on top of Kubernetes, outside of Kubernetes, hiding those implementation details and treating the entire cluster as a virtual giant pool of resources — but not at the Kubernetes layer itself. I think that's the complexity.
C
You've heard many people mention the different complexities, right? So I don't want to repeat that — how many resources, and how you're going to manage the dynamics. But at the Kubernetes common layer, what we are doing is purely from the Kubernetes open source perspective, and so I really worry that we couldn't handle it very well — abusive usage could even harm the health of Kubernetes in general. So that's just my comment here. I want to say that I personally don't want to —
C
J
Yeah, I had two questions. One touches on what Dawn brought up, which is: is there an alternative that deals with this in a less invasive way than fixing it in the kubelet, by pushing it to whoever is managing the node, who wanted to do this more aggressively — like replacing a node with another node? And it's possible to orchestrate some of that today.
J
So some of those efforts — an implementation path like that might actually help in other ways and be simpler to implement, because it could be done on top and improved through testing basic changes. So that's one thing to consider; I'll make that comment in the proposal. And the second one was a concern about the dynamic resizing: a lot of the third-party plugins — device manager, or CPU — are already a little fuzzy and have bugs around dynamism.
J
F
Yeah, so we want to change those plugins to accommodate these changes as well, along with the dynamic resize. That's —
B
F
J
I mean, I'm generally in favor of improving how we deal with dynamism in the kubelet — not to say that I don't agree with Dawn's concerns — it's just that there's a lot of dynamism in the kubelet that's just broken today, and it's all subtle stuff. Anything that helps, anything that forces us to go and test it and be better, I think is a positive for the kubelet.
I
Yeah, so I just wanted to answer Dawn's comment about the use cases. Our main use case, besides virtual machines, is some changes in hardware that are coming to market, or already on the market: multiple hardware vendors have implemented a feature called CXL memory, and one of the properties of this thing is the possibility to dynamically attach or detach memory.
I
So what matters is: when you have a node which is under memory pressure, you can, say, ask a management infrastructure, "please attach another 10 gig of memory," and from a top-of-rack pool of memory it will be dynamically increased — and the same when we're decreasing. It's something that will probably be common in the next couple of years, but in order to make sure this feature is available, we need to start to work on it now.
I
Or another example — Swati or Francesca might also say something like this — there are a bunch of telco customers who say, "I want to run my workload on the full core, exclusively," and for them, shutting down the hyperthread of a core is, I wouldn't say a common situation, but a desired situation.
I
C
On top of those things — it's unlimited; there's never a boundary. But Alexander, your use cases are actually different here, right? Your use case is attached memory; and even a couple of years ago we had GCE — you also have the case of resizing a node. So the problem is: we could have a range, right? Up front, for those cases, you have to arrange how much can be dynamic — it's not an unlimited resource.
C
There is still a range, right? So we could handle it from there: when a node is reported, you basically know its minimum size — this is what I told the GKE team back then, but they never came back to me, unfortunately. That's what I thought: you do know your real minimum and maximum, you know your latency to grow from the current minimum, and what resources are available to grow to the maximum. If you give me that information, then we can handle it — it is more manageable.
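The bounded-range idea suggested here can be sketched as a small data structure — purely hypothetical, since no such API exists today: the node advertises a minimum, a maximum, and a latency to grow, and resizes are only accepted inside that envelope:

```python
from dataclasses import dataclass

# Hypothetical sketch of the bounded-resize envelope suggested above: a node
# advertises min/max capacity and how long a grow takes, so the scheduler and
# resource managers can treat the resource as predictable rather than unlimited.

@dataclass
class ResizeEnvelope:
    min_cpus: int          # current/minimum guaranteed capacity
    max_cpus: int          # hard upper bound the platform can attach
    grow_latency_s: float  # advertised time to grow from current to max

    def accepts(self, requested_cpus: int) -> bool:
        """A resize request is valid only inside the advertised range."""
        return self.min_cpus <= requested_cpus <= self.max_cpus
```

The design point being made is exactly the `accepts` check: with a declared envelope, dynamic resize stays predictable and schedulable, rather than an unbounded resource.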
C
It is predictable, so that we can give that signal to the scheduler and to the resource management; even on the node we can handle it better. I did share this with the internal GCE team, but they never came back to me, so I share it here, since we have more use cases — really common use cases — we can start from there to discuss. But I'm really concerned that we go down the path of the next unlimited resource, where everything is hidden and you basically treat Kubernetes as the scheduler, and I —
C
— don't think we can take on those big jobs we want to handle. We could handle them at another level: work with the kernel, which handles those scheduling jobs much better, to manage the resources or latencies and throughput — even throughput we cannot handle well. So we can only do part of the job, right? That's kind of my concern — abusive use could go crazy. This is why I try to rephrase it: if we have the real need, if the device vendors have the real potential, in the future —
C
— we could start talking about those kinds of things. But I want to start from the whole of Kubernetes: how Kubernetes can manage it, how an admin can manage those kinds of things. And on top of that, serverless or even crazier, nodeless kinds of things — I think those can be built on top of us, not in open source Kubernetes, not in the team here.
I
Just one comment about provisioning potential capacity: the problem is that the kubelet trusts what cAdvisor tells it, and cAdvisor doesn't know anything about potential capacity. It just says, "I see it how it is right now, and this is my node capacity." So for an admin there is no way to say, "I have some additional potential capacity" — and actually this KEP is really about retriggering cAdvisor.
A
Great — I think we can move on to the next topic.
J
Sure. So the previous two topics touch on some of the things I tried to summarize in the attached spreadsheet — I hope folks have seen it; I sent it around a while back. If it's your first time seeing it: there are a number of intersecting challenges in the kubelet about how the kubelet takes input, and then how different subcomponents react to that input. Admission, obviously, takes desired pods and turns them into pods that the kubelet has actually decided to run.
J
There are a couple of bugs on static pods that have some short-term implications, but a number of the things we talked about even today — like admission and resizing, as well as, in 1.27, the in-place resizing of pods — made it obvious that we have some challenges in describing and reasoning about what the correct way is to add some of these features. And as we get more of these dynamic features, it figures it's a good time. So for 1.28 I was going to focus — David Porter's been working on this —
J
— along with me. I was going to work on getting some clarity in the kubelet around the difference between what someone has asked the kubelet to run and what the kubelet is actually running — that's been worked on for a couple of releases, and it'll help the static pod stuff that Ryan had asked about — and then working with in-place resizing to improve kubelet admission to be a little bit less —
J
Are there any other key issues around admission and state management in the kubelet — any of the issues you might have seen in that deck — that are important, that either folks would like to work on with David and me, or that need additional capabilities in admission, for instance, that might justify coming up with a bigger design than just the KEPs themselves? It's kind of a general question to anyone who has changes coming in 1.28, or potentially in 1.29.
J
H
I don't think it's dynamic enough, but the pluggable resource management KEP may affect admission, and it will be pluggable. So it's not static enough — but yeah, I don't remember the last state of that KEP well enough to understand if it will affect your work at all.
J
Certainly. I would probably say anything around dynamic resizing of nodes would need to be fairly — we would want to make sure that we get the design of anything like that into the KEP. Are there any things folks are working on — in CPU manager, device manager, plugins — beyond what you mentioned, Sergey? Anybody know of anything that's going to go in that has some dynamism?
C
It depends on the reviewers — bandwidth, approval, all those kinds of dependencies. But we can definitely share with you our current planning — what people have committed to say they are going to be working on. There are several. Oh yeah — I'll share it with you manually first, so you can see there are a bunch of people working on that. Look at the 1.28 list: there are many things people say they might want to work on, but it's not clear which we will definitely deliver or lock down the resources for. Yeah.
J
Yeah — the one that, if I had caught it — if I had realized this earlier; David and I have been working on some better descriptions of this — if I had realized that in-place resizing was going into 1.27, we probably could have helped with some of the review. And so some of what I'm looking to do is improve the review process for complicated changes.
J
State changes to the kubelet, that is — by having better docs, better explanations, and better code structure. Because a lot of how the kubelet deals with state is "go ask someone who hasn't worked on the project in five years, or somebody who's very busy," and I think that's the thing I'd like us to get out of by the end of 1.29: you can go read a doc and understand what the kubelet is doing when you want to change how data flows around the kubelet.
K
D
E
One of the things — yeah, one of the things I was looking at, that we managed to tackle, is the issues around termination grace period: pre-stop hooks — blocking pre-stop hooks, basically — are not factored into the termination grace period. I think that's something nice to also help with.
J
That actually reminds me of one more. So, one of the things that David and I have noticed is that there are a lot of subtleties in the kubelet's behavior that aren't well captured by e2e tests that stress them. Along the lines of the termination thing: the context cancellation in the kubelet — you know, when a delete request comes in, or the pod exits —
J
— something that can come in and say, "hey, we've stopped this and we're moving to the next phase." One of the things that was pretty clear there is that it's really hard to explain how to test this, and so some of the contributions from folks need a lot of hand-holding just to get better tests in place.
J
So if there are other people who are looking to build better testing as part of their KEPs or their use cases, I think this is also a fruitful angle for us: how we can catch more issues with fewer tests, and catch them in things like the restart issues, instead of catching them one by one as each new component is added — putting it on the KEP author to run more aggressive testing around new features, testing that'll catch things like kubelet restart or early termination, stuff like that. So if any of those things interest you, come by.
H
Yeah, I wanted to note that we did a few improvements in testing for sidecars specifically — it's mostly about the lifecycle, less about restarts — but you can look at that KEP for what's mentioned here. I think it's a good transition to the next topic from Zach, another "dynamics of a change of behavior" proposal. Zach, do you want to speak? Sure.
L
Thanks — hey, I'm Zach. I have never presented here before, but I do know many of y'all. I have been trying to bug Sergey and David for some improvement in CrashLoopBackOff behavior, and they suggested I come here and present why I care about CrashLoopBackOff and what options we have, with the possibility of working towards a KEP, perhaps for 1.28 or later. So there's a doc linked in the agenda where I tried to summarize the issues. But CrashLoopBackOff, for those of you not familiar, has basically sat since the beginning of time in Kubernetes with a static policy: you know, a container crashes, we back off, and we restart it after 10 seconds; then comes the next crash.
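For reference, the kubelet's policy here is an exponential backoff — roughly a 10-second initial delay doubling per crash up to a 5-minute cap, with the counter reset after the container has run successfully for 10 minutes. A simplified model of that policy (approximate; the real logic lives in the kubelet's backoff helpers):

```python
# Simplified model of the kubelet's CrashLoopBackOff policy: ~10s initial
# delay, doubling per consecutive crash, capped at 5 minutes; the crash
# counter resets after 10 minutes of successful running. Approximate and
# for illustration only.

INITIAL_S = 10
CAP_S = 300          # 5-minute cap
RESET_AFTER_S = 600  # 10 minutes of successful running resets the counter

def backoff_delay(crash_count: int) -> int:
    """Delay before restart after the Nth consecutive crash (N >= 1)."""
    return min(INITIAL_S * 2 ** (crash_count - 1), CAP_S)

def next_crash_count(prev_count: int, ran_for_s: float) -> int:
    """The counter resets only if the container ran long enough before exiting."""
    return 1 if ran_for_s >= RESET_AFTER_S else prev_count + 1
```

The two-minute-match problem below falls directly out of `next_crash_count`: a container that always exits after 120 seconds never reaches `RESET_AFTER_S`, so every restart keeps escalating toward the 5-minute cap.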
L
This actually presents a challenge. I'm now in a role with the games team, which is mostly responsible for Agones — the thing that hosts live game servers. But previously I was in a role maintaining GKE Kubernetes control planes, which were heavy static pod users. In both of those roles I had issues with CrashLoopBackOff, and I'll —
L
— explain why. In the previous role — and this isn't unique to static pods at all, but any time you have a cold start with a long chain of dependencies — so you have C depends on B depends on A — if that cold start is a little slow —
L
— you know, A takes a little while to start up, and all of a sudden B gets backed off more and more, and then that can easily cascade to C getting backed off more and more. Especially in the Kubernetes control plane you had, you know, etcd; eventually, I think we actually introduced a dependency on something that etcd itself also depended on, yada yada, and then the kube-apiserver depends on that.
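The cascade described here (C depends on B depends on A) can be illustrated with a tiny simulation under the same exponential-backoff model: each dependent crashes immediately until its dependency is up, and the accumulated backoff pushes its own readiness further and further out. All numbers are illustrative:

```python
# Toy simulation of cascading CrashLoopBackOff in a dependency chain.
# Each service retries on a 10s-doubling backoff (capped at 5 min); every
# attempt made before its dependency is ready crashes instantly, and the
# first attempt after the dependency is up succeeds, taking `startup`
# seconds to come ready.

def ready_time(dep_ready_at: float, startup: float = 5.0) -> float:
    """Time at which a service becomes ready, given its dependency's ready time."""
    t, delay = 0.0, 10.0
    while t < dep_ready_at:          # this attempt crashes: dep not up yet
        t += delay                   # wait out the backoff, try again
        delay = min(delay * 2, 300.0)
    return t + startup               # first attempt after dep is up succeeds

a = 45.0             # A's cold start is a little slow: ready at t=45s
b = ready_time(a)    # B's retries at t=10,30 crash; the t=70 attempt works -> 75s
c = ready_time(b)    # C's retries at t=10,30,70 crash; the t=150 attempt works -> 155s
```

A 45-second slowdown in A turns into B being ready at 75 seconds and C at 155 — the amplification the speaker describes for etcd and the kube-apiserver.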
L
We actually saw reliability issues where our SLOs were getting impacted because the kube-apiserver wasn't coming up, in the kind of rare case where these initialization chains took a little too long. In a different part of the world, Agones has an interesting case where, in order to not put undue load on the Kubernetes control plane, we would like to be able to restart the container within the pod rapidly.
L
So, in the Agones case, each of these little live game servers is effectively a shared simulation between players: you're logging in, you have some, say, Street Fighter match, and you're both connecting to this thing — and then it recycles the entire pod, or potentially the container, depending on how they have it set up. Right now we have a kind of hacky workaround in Agones where you can restart within the process, and we allow that within the lifecycle in Agones; but there are a lot of people that don't want to restart within the process, because they're basically saying, "hey, we run a containerized workflow anyway —
L
— why am I restarting within the process when I could just bounce the container?" But the problem with bouncing the container is that it only works for games whose session length is longer than whatever it is — 10 minutes — the backoff reset timer. So again, coming back to, say, Street Fighter or something: if you're playing a two-minute match against someone, that container is going to stop at that two-minute mark and have no opportunity to be recycled without backoff.
L
So that's kind of the impetus — the doc talks about it a little more. I wanted to explore different options for changing this.
L
On the complicated end, I feel like we could introduce an API that is straight up: "for this pod, I want this exact backoff behavior." I think that might take a little bit of tuning to understand where the right balances are — like, you know, what's the minimum backoff the kubelet could support, or anything like that.
L
A super-simple approach might actually be ripping off some of the work done in the recent Jobs KEP. It might even be something like just having two static policies — or maybe one tunable policy, or something of that nature: one policy for containers that exited cleanly — strictly looking at just exit codes — and maybe another policy for containers that crashed or had some other infrastructure failure, or something of that nature.
L
Anyway, that's roughly it — I'm trying to think whether I presented any other options in the doc. Yeah, so I actually wanted to explore that a bit here and see if y'all had ideas, or other approaches that we should be considering.
C
So, the background for it — sorry, did any other people lower their hands? Okay, sorry. So I just want to say that the backoff window — the backoff algorithm — I think was initially just oversimplified. The backoff is just to protect the node, right? Because you worry about those — exactly; you were there, we discussed this before, and we all know that's the problem. We want to optimize and enhance those things. So there's one thing —
C
Actually, you are really triggering me to think: if, on purpose, there's a certain process — like the cascading workflow now — and the restart actually is a proper restart, maybe it shouldn't be punished by the backoff. We only really want to punish the crash loop of a container we don't trust, because we need to keep it from attacking our node, right?
C
So, if it's a proper restart — that's the way, yeah — you could propose something like, maybe, the first crash... but how are we going to track that this is the first time, right? We do have some restart counting, though not done properly, I have to say. But we could have some policy there to say: certain things maybe don't need backoff — like the first restart, the first crash — or maybe the exit code indicates a proper restart, and we shouldn't punish it with the backoff, those kinds of things.
C
J
And so I was in favor of this. There are a couple of other use cases — the DaemonJob proposal also got me thinking about this, which is: basically, if you want to do anything that applies policy on nodes... The DaemonJob proposal, for those who weren't here, was basically that a DaemonSet can be used to enforce policy on nodes, and in practice that's how a lot of people use it: you want to force-pull images to all the nodes —
J
— you want to go change some state, you want to deploy a network plugin. For a lot of those, there's maybe a separation between "you apply a policy to the node" and "you run something on the node," but a lot of the applying-policy-on-the-node variants —
J
I think we can dramatically improve what we do today. Dawn kind of triggered another thing, which is: there are things that we don't account for — like a crash-looping pod is using system resources that are sometimes poorly accounted to the pod.
J
It would be awesome, as we go through this, if we could come up with a justification quantifying the actual impact of restarts on the node — even at a really high level: what does it actually cost to restart a container, in terms of kubelet time and kube-apiserver time? What does the pod do, and what does the container runtime do? That's a good input to maybe just an operational parameter, which is how much — you know, how much — yeah, Derek's probably... I was thinking about —
J
— systemd as well: there is an administrative policy, which is how many resources you want to accept as overhead — and that's the real key reason CrashLoopBackOff exists, as the worst-case policy. Even moving up to another level of policy — just "how many resources per second do you want to fairly divide among workloads" — and leaving that up to the admin: if we can keep it below that, I think that's another option as well.
L
For sure. Sergey, I think you had a hand.
H
Yeah, I really like this angle of improving this behavior for all pods: better accounting of resources, and making it cheaper to restart, will definitely help everybody, I think. In this specific case —
H
— it also resonates with another KEP, where a container intentionally wants to — instead of saying "it's a benign failure, please restart me as fast as possible" — maybe also indicate "it's a very bad failure and you need to terminate the entire pod, because it's such a bad failure that I can't run here any longer." So I think, from an API perspective, there may be two features coming together: where a container is saying "there is this failure" —
H
— a specific exit code, or a specific type of failure, is benign — or it's very critical. So I think there is an opportunity for an API change that can be done with a reasonably narrow scope.
D
A
E
Yeah, one of the things I think we should think about here is: is it up to the admin to define what is a crash versus not, or is it something we observe? In the sense that, you know, I see two ways we can progress here. One is, say, we look at the exit code: if it's a zero exit code, maybe it was successful —
E
Maybe we don't limit it, or something like that. But there might be a different case where, you know, you're deploying some application where you don't control the exit code, and you want to force it not to be delayed by the crash loop backoff. So I think it might be useful to see what we're trying to solve, what scenario we're trying to solve, versus the assumption that we do have control over the actual application being run and can modify it.
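The exit-code idea being discussed could look roughly like the sketch below. The rule type and action names are hypothetical; this is not the actual Kubernetes API, just an illustration of routing restart behavior by exit code.

```go
package main

import "fmt"

// RestartRule is a hypothetical policy entry: specific exit codes map to a
// restart action, instead of every non-zero exit entering crash loop backoff.
type RestartRule struct {
	ExitCodes []int  // exit codes this rule matches
	Action    string // "restart-fast", "backoff", or "terminate-pod"
}

// decide returns the action for a container exit code, falling back to
// today's behavior (crash loop backoff) when no rule matches.
func decide(rules []RestartRule, exitCode int) string {
	for _, r := range rules {
		for _, c := range r.ExitCodes {
			if c == exitCode {
				return r.Action
			}
		}
	}
	return "backoff" // default: current crash loop backoff
}

func main() {
	rules := []RestartRule{
		{ExitCodes: []int{42}, Action: "restart-fast"}, // benign failure, retry quickly
		{ExitCodes: []int{1}, Action: "terminate-pod"}, // fatal, give up on this pod
	}
	fmt.Println(decide(rules, 42))
	fmt.Println(decide(rules, 0))
}
```

This also shows the second case raised above: an application whose exit codes you do not control simply gets no rules, and falls through to the default.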
J
Adding one more point: static pods should probably not be put into crash loop backoff at the rate they are today.
J
This is a bigger discussion, but it gets into some of that chain of dependencies. I'll be sure to add that comment when we get to it. But thinking about it now, there are very few cases where the full backoff makes sense for a static pod, because it's the administrator's responsibility to make sure that... the point of the static pod is to be running, and our full backoff...
L
In that scenario... sorry, I do see a hand up, David, but I'll briefly respond to Clayton here. That's an area where, if we had just...
C
J
Yeah, if we dramatically improve it, it may not even be necessary. I just keep finding ways that static pods don't do what people use them for, which is running critical system components that must come back, sometimes in order to make sure that the cluster comes back. So if we pick a good default... I'll just add that as, I think it'll be a goal on whatever KEP comes up: with static pods, you don't have to think about static pods doing the right thing, they just do the right thing by default.
H
Like, one option is to improve the situation for everything: just confirm that we don't waste too many resources on restarting, and just remove this five minutes for everything. And another option is to have an API that allows you to toggle how much it is, like whether it's benign or not.
L
H
B
C
I hope we address this properly instead of just simply getting rid of that five minutes, right? Thanks. That's... maybe, I think, if you want to do that, you don't need to come to the SIG, and that's one of the bad things, because you could do that while using, I don't know, the kubelet config. We could expose that one, right? So then you could handle that in the node config, but that's a knob that simply will apply to all the content, all the pods. I'm not sure.
C
L
C
So that's why I hope we address this properly and introduce a different policy. I think this is a good thing to start this conversation. We all agree on this part. This is not the first time we've heard those complaints, but it's the first time people have given us more detail about the use cases, the workflow, and the process. Let's leverage this situation and make this correct.
C
G
L
G
It's been a really useful, interesting conversation. One thing I was curious about: I did put it in the chat and I didn't speak up, but I was trying to think it through mentally.
G
If you had total control of your Linux host, what would you have wanted the behavior to be? So I just went down to the init system we're all commonly working on here, systemd, and asked what Zach's unit file would look like. And I kind of wonder, in your scenario, if you had the ability to specify a start limit burst budget for your pod, whether that would have been what you would have used in your use case. And either way, if you weren't familiar with that field, if you want to Google it and come back and say, "oh man, that's the missing thing"...
G
...I would have wanted. I'm just curious if that is the scenario that we're lacking in kube right now.
G
So basically, if you have a systemd unit with a restart policy of always, and you are failing on startup, systemd will try to rapidly restart you as fast as possible, over and over again, up to that burst value. And then, once that burst value has been reached, let's say it was 10, you then get put into the backoff period; you get put into that start limit interval-seconds wait. Whereas in kubelet we put you automatically into a backoff period, and it feels like what you're wanting is...
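The systemd behavior described here can be sketched as a unit file. The values and the `myapp` binary are illustrative stand-ins; the relevant directives are `Restart=`, `RestartSec=`, `StartLimitBurst=`, and `StartLimitIntervalSec=`.

```ini
# Illustrative unit: systemd restarts the service immediately (RestartSec=0)
# until StartLimitBurst failures occur within StartLimitIntervalSec, after
# which systemd stops attempting restarts and the unit enters a failed state.
[Unit]
Description=Example service with a restart burst budget
# Allow up to 10 rapid start attempts within a 60-second window.
StartLimitBurst=10
StartLimitIntervalSec=60

[Service]
ExecStart=/usr/local/bin/myapp
Restart=always
RestartSec=0

[Install]
WantedBy=multi-user.target
```

Note one difference from the spoken description: once the limit is hit, systemd does not resume on its own backoff schedule; it gives up until the interval window passes or an admin runs `systemctl reset-failed`, which is part of why the comparison with kubelet's always-on backoff is interesting.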
L
I think that's possible, but for the case where it's crashing, you know, every couple of minutes, it might... I'd have to look at how systemd handles that, because it's possible it's maybe that plus the tunable for when the timer resets, because if the timer doesn't reset and I crash every couple of minutes... sorry. But I think these are the kind of details we can flesh out. I actually...
H
J
L
Yeah, and I don't know if any of the cases we have would be particularly grumpy about the base, like the 10-second one. I mean, on the kcp side I certainly didn't like the 10 seconds when it came up, but it's better than the hot loop for sure, so I think there's some discussion there for sure.
A
Yeah, great. We are over time; we still have some topics, we can carry them forward to next week. Thanks for joining, folks.
H
And among the topics there are some linked cherry-pick PRs that need approval. If somebody has time, please go through them. Yeah, I'll...