From YouTube: Kubernetes SIG Node 20210915
Description
Meeting Agenda: https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
B: First item, from archon: he cannot be here. Apparently there is some holiday in Israel, yeah.
C: Dynamic kubelet config has some interesting side effects right now. I've been analyzing a lot of our test failures lately, and it seems like a bunch of other flakes are caused by dynamic kubelet config restarts being flaky. Yes.
C: I have spent like 30 hours of my life in the last two weeks debugging those kinds of things, so I think pulling them out of serial right now is going to be a waste of our time.
A: I think what we can probably do is, within test-infra, have a default kubelet config that we write to disk on the node. We can define that in test-infra, and then, if we need to change it in a test, we can just overwrite that file and tell the kubelet to restart, or whatever; of course, you know, the test process runs elsewhere. So that's effectively what dynamic kubelet config does anyway, just jankily.
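A minimal sketch of that approach, assuming a default KubeletConfiguration file that test-infra writes to the node before the suite starts; the file path and field values below are illustrative assumptions, not the actual test-infra defaults:

```yaml
# /var/lib/kubelet/config.yaml: hypothetical default written by test-infra.
# A test that needs different settings overwrites this file and restarts
# the kubelet, instead of going through dynamic kubelet config.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
failSwapOn: true
serializeImagePulls: false
```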
A: It is true that there's a bunch of stuff you can't actually set when you invoke the tests, because if a thing is not a command-line flag, you have to actively go and change the code, you have to go and add a dynamic-config whatever, to change how the kubelet in the test harness runs. So that sucks, and that's something we should fix, because if something gets added to the kubelet config and there's no corresponding command-line flag added, which is pretty common, then you can't set it. So that should just be fixed overall, and I think there's a giant refactor we need to do this release, and we should get it done.
B: Yeah, I also plan to look at that, so maybe, if I have time before you come back, I will share my progress.
A: Do you want me to file an issue for that and cc you and Danielle? Because, having dug into this a ton in the last release, I think I probably have all the context for what we need to get rid of.
A: Okay, yeah. I have now actioned myself on this one. I think we shouldn't just pull the tests out, because we still need to test those things, and there's a lot of stuff in the test suite that relies on this. We just need to refactor it; we need a better way. We're definitely limited.
A: I think, right now, from having not invested in this in quite a long time: because everything in the test suite assumes that we're configuring things via extra command-line flags passed to the kubelet, there's no way, without literally changing the test code, to set kubelet config stuff, and that's not great.
B: So once we do that, and we have... I'm curious, Daniel, whether you looked at which other tests were failing. Was this test in serial, or were they regular? I mean: did they execute in parallel with other tests? Because once...
C: They already do restarts as part of dynamic kubelet config; a bunch of stuff already restarts. A lot of the flakes in other tests that happen because of dynamic kubelet config partially also happen because, after the restart, stuff can't always connect to sockets and so on. But by moving away from dynamic kubelet config, where a bunch of stuff happens out of band, we can start having useful logs for that within tests, to actually start finding some of those problems.
C: That has a couple of things that will probably help with that one. Once the eviction PR that I've had open for a bit lands, we should be able to get rid of the eviction job and move that into serial. Plus, there are a couple of very wasteful ways we use resources right now: despite the fact that only one test uses GPU devices, we run every test on a node with GPU devices. So we might actually benefit from splitting some stuff out a little bit, mostly to be less wasteful.
A: So I recently looked at this issue. Yeah, Aaron suggested that we just use the normal GCE pool, not a separate node pool.
B: Yeah, when I talked to Ben, he said the most we can do is combine tests together, more and more.
C: So much better than multiplayer Outlook. Multiplayer Word, though, like Office 365 multiplayer, is never fun.
A: Yeah, sometimes that gets a little flaky. So I guess, going back to our next issue here: migrating prow jobs to community infra. Now, do you want to tell us about this? Because I am not on the up and up for this one.
E: So we have... we need to migrate all the prow jobs for all the SIGs. So, basically, I'm going SIG to SIG to start the migration process.
E: Yeah, oh, so the question about this: we basically have a project in the Google org with a specific GCE image, but we don't know if it's still...
C: As far as I can tell, we haven't actually used the node e2e images project in at least two years. I might have missed something in test-infra, but everything uses stuff from upstream projects directly.
C: I mostly lost like three hours trying to go list stuff in that project before realizing I didn't need it in the first place.
A: Danielle, what email should I use when I add you to the Google Doc?
G: Sure. So, last time, I was making some changes to the... I think it was related to the node feature tags. We discovered this job, which apparently is never run, and then I did some digging; in the last comment you'll find that this is just a duplicate job. We already have one that runs as a pre-submit.
A: I haven't looked at this one, so I don't know, but if they run the same test... oh, you say one is a PR job and the other is a periodic job. That's normal. So, in...
A: To be able to run something on a PR, you have to have a separate job, because that's how prow works.
A: And I guess the one thing we want to check, if we do get rid of both, is: I think that's the only thing that pulls in stuff tagged NodeAlphaFeature. I don't know how many of those exist, but those should probably be tagged Feature, not NodeAlphaFeature. And, yeah, it looks like there's a bunch of stuff running in there.
A: I think I'll comment on this issue and discuss, because I agree: I don't think this should be a separate job. I think we should just use the same alpha job that we use for the rest of the project, because there is already one, and I think it runs on every PR by default, like it's a required test, whereas this one is not required; this is a separate pre-submit that clearly runs. And I guess the other...
A: I'm curious why the pre-submit failed the way it did, because it looks like it was an infra problem.
G: Yeah, yeah, it kept failing, and instead of just trying to find out the problem, we just decided to figure out whether or not it really is important, and whether we should keep it or not.
A: If, in theory, the periodic is identical to the pre-submit: I think it's generally good practice for us to have a pre-submit for every periodic, because we are finding that, you know, we want to, for example, run, say, the kube serial test; that periodic was not defined, and it can be really annoying if you think "I want to see if this test passes" and you can't actually run it on the PR; you have to merge it and wait for the periodic to run. So, in this case...
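For reference, a sketch of how that pairing looks in prow's config: a PR-triggered presubmit and a scheduled periodic are separate stanzas even when they run the same test content. The job names, image, and interval below are illustrative, not the real job definitions:

```yaml
# Hypothetical presubmit/periodic pair running the same test content.
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-node-serial   # runs on PRs, on demand
    always_run: false
    optional: true
    spec:
      containers:
      - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest
        command: ["runner.sh"]

periodics:
- name: ci-kubernetes-node-serial       # same tests, on a schedule
  interval: 4h
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest
      command: ["runner.sh"]
```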
A: Because I think this is true, I'll write a comment on this one and update it, saying as much. I hadn't seen this issue, so...
A: So it looks like we're almost at the end of our agenda, other than bug triage. Is there anything else that anybody wants to discuss that we didn't add to the agenda?
A: So Sergey and I are the sub-project leads of node CI, so you could ping either of us. You can probably also just post something in #sig-node, because I'm sure it's not just us who want to know about it. Hopefully so.
A: Sergey, do we want to go check and see if there are any action items from previous meetings that we forgot to follow up on? I see one comment. Oh, I see, yeah.
B: Still, yeah, I still need to, like, go to the architecture thingy and...
C: And also the eviction test PR.
B: Yeah, I'm not sure whether Imran finished the PR; I haven't seen updates there.
A: I think we should triage-accept this one. You can also assign me.
B: I think we did something similar for the last release, when we stopped calling probes on certain events, right?
A: Maybe... let's not. Well, I think Clayton assigned it to himself, so we can just triage-accept this one.
A: They didn't include a lot of stats about their system. I mean, that to me sounds like something where the kernel is preempting.
B: Yeah, I thanked Francesca, but you will ask.
A: I don't think that's a bug; I think that's just the reality of whatever version mismatch they're using.
A: All of the... I didn't pull this one onto the board. Everything that is... oh, maybe I did. Anything that's a linked repo, so I think testing, the website, etc.: anything that files a bug against node goes to our board, so that way we don't forget about them.
A: We should triage it. I think we do want to fix this. That's broken in the docs, or at least it's wrong on...
A: No, that's the CNCF Slack, slack.k8s.io, I think.
J: So, essentially, if you run a lot of pods on a single node with lots of subPaths and two EFS volumes, the node will eventually crash. I noticed two things. One was, we'll get this error message: "unable to attach or mount volumes."
J: The second was, we saw a lot of dangling mount points. So, if you scroll up, I think there's a wc -l, even more at the top; I don't remember where, but yeah, this part shows a lot of dangling mounts. And at the very end, if you look at it, what I found out was: right now, my theory is that, basically, because a lot of pods are being scheduled with subPaths at the same time, what is happening on Linux is that /proc/mounts is getting modified constantly, and because of that we are getting inconsistent reads for these mounts, and that essentially takes down the entire node, you know.
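A reduced sketch of the kind of workload being described: each subPath mount becomes its own bind mount on the node, so many such pods starting at once keep /proc/mounts churning. The names and paths here are made up for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: subpath-churn          # hypothetical reproducer
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: shared
      mountPath: /data/a
      subPath: a               # each subPath is a separate bind mount
    - name: shared
      mountPath: /data/b
      subPath: b
  volumes:
  - name: shared
    persistentVolumeClaim:
      claimName: efs-claim     # assumed EFS-backed PVC
```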
A: So that's not a SIG Node thing; that's a SIG Storage thing! SIG Storage owns all of this code. So, yeah: node on here and storage on here, which is probably right. We probably shouldn't triage this; we need SIG Storage to look at it. I don't know if you want to add them on GitHub or something, Sergey.
B: Every five minutes, so just all at once, and stop.
A: Why would the daemon set pod still be in the running state? The node's gone. And, while it is shutting down, I imagine that it doesn't get pulled from the endpoints returned by that service until the node is, like, gone gone. I would not be shocked if there's some intermediate state where that sticks around.
A: When you terminate a node, or even if you drain a node, daemon set pods don't get removed, because they're daemon sets; they never go away from that node. So, like, yes, even when you terminate the node, the thing is going to stay running.
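That behavior follows from how DaemonSet pods tolerate taints: the DaemonSet controller automatically adds tolerations like the ones below to every DaemonSet pod, so taint-based eviction never removes them from a cordoned, drained, or shutting-down node. A sketch of the standard auto-added set:

```yaml
tolerations:
- key: node.kubernetes.io/unschedulable   # set by kubectl cordon/drain
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
```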
A: Our graceful shutdowns and... oh, I'm sorry, I'm simultaneously writing a comment. Graceful node shutdown, is that alpha or beta? It's beta, right? Oh, it's beta.
B: It's beta, but critical priority is alpha; so the extra feature on top of it is alpha, but...
A: So when you shut down a node, that will happen... because they're apparently using 1.21. I don't know what version that went beta; let me double-check on the website.
B: 1.22 is beta? Okay.
A: Oh, according to the blog post, it says that graceful node shutdown went beta in 1.21, so...
A: Somewhere, something is looking at all of these pods that are part of the daemon set, saying: here are their IP addresses. But the node goes away at some point, because the node gets shut down at some point, and they should get reconciled. But there's, like...
B: One... the issue says that the pod remains in the running state after the node is gone. Yeah.
A: That's normal. Okay, although I am surprised, like...
C: It's on, but they need to configure it, yeah. So...
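On the "need to configure it" point: with the GracefulNodeShutdown feature gate on by default in beta, the grace periods still default to zero, so graceful node shutdown stays inactive until they are set in the kubelet configuration. A sketch with illustrative values:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 30s              # total time pods get on node shutdown
shutdownGracePeriodCriticalPods: 10s  # tail reserved for critical pods
```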
A: This is expected behavior, so I'm just gonna close this bug.
A: I think we should mark this triage/not-reproducible, just because it's so old, and then, if somebody can reproduce it, we'll deal with it and take that label off. But I think, as is, we should close it, because this isn't a supported version.
A: That seems probably like a bug. It would not be SIG Cluster Lifecycle; it would be us, and maybe Apps, but probably just us.
A: But yeah, I think this is just node, so I think we should remove the cluster-lifecycle label, and I don't know who wants to look at this.
A: Okay, well, they definitely need to add that... oh, there's a server version here. No.
A: And they're using Rancher on AWS, so I suspect... So, when Clayton did the pod lifecycle refactor, we found a few different places in the kubelet where init containers were just ignored; they weren't factored in properly in terms of the lifecycle. This may be fixed in 1.22, but you can assign me, or you can cc me, and I'll put a note on this one. I don't think I'll handle the bug, but...