From YouTube: Kubernetes SIG Node CI 20230726
Description
SIG Node CI weekly meeting. Agenda and notes: https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#heading=h.2v8vzknys4nk
A: Hello, hello. It's July 26, 2023, and this is the SIG Node CI working group. Welcome, everybody. We are currently in test freeze, so I think we need to look at the dashboards and see what's happening and how prepared we are for the release, and we have a few items on our agenda.
A: Yeah, and the first item is actually not related to the release. I asked last time if anybody wants to participate in bug triage, and maybe at least lead this meeting, so we'll have more people oriented with that, and Mike said that he wants to try bug triage today. Mike, are you still okay with that?
A: Thank you, Mike. If anybody else is interested, please reach out. I hope that we'll have more people pitching in, doing triage and watching the status of the CI jobs. Our next item is Francesco. Do you want to talk about it?
C: Yes, thank you. I will be quick, because at this point in time it's more of an announcement. I have a backport, and while reviewing it David pointed out: hey, how come the GPU lane is failing? A very fair question. I'm investigating that. I filed a very trivial PR which cannot possibly break the GPU lane, because it's literally one comment, and it seems the lane is already broken, which is interesting. So I will report my findings.
C: If anyone has experience with the GPU lane, and/or remembers past incidents, just reach out to me so we can make progress and fix that, because my feeling at this point in time is that the lane is broken for pretty much all the stable branches, because I see no difference between the configuration of all the stable branches. So I'd really appreciate it, folks; I'll share as soon as I get more information.
A: Thank you. Yeah, along the same lines, I was trying to understand what we did for the release, and we did enable a few more test lanes, specifically for multi-NUMA. We have CI jobs and we have a standalone job, but the standalone jobs are still failing. I was watching one lane today: there was a fix that didn't actually fix the situation. So it's still failing.
A: I really hope that we can get them back to green. Well, not back to green exactly: we'll make the test definition green, so it will run successfully the same way as the CI jobs. I think the same goes for the arm jobs, and the arm jobs are interesting because they just got broken a few days back, on 7/21.
A
What
is
this?
They
were
running
green
and
then
end
of
21.
First,
they
start
failing,
which
is
very
unexpected,
so
you
see
like
a
bunch
of
tests
were
here
and
then
suddenly
we
have
failures.
A: Another thing that is failing is one of the new tests that we added in this release for a new status. It's specifically for, I think, containerd 1.7. Yeah, I think this is containerd 1.7, and the failing test is this new one.
A: It was added in this release, but it was passing until just recently, so very surprisingly it started failing just maybe a few days back. It may be related to the containerd version; maybe they changed something in their branch, but I don't know, it's something that requires investigation. And then, yeah, I also looked at a few other things, and I think all the other failing tests are known ones. Yeah, I remember release-blocking also has one test failing here.
A: Yeah, this is device manager. I don't know if anybody has context for this test specifically, but it also started failing just recently.
C: Things are bad enough, but just to give a bit of context: yeah, the flakes are pretty bad too, and we should improve that, yeah.
D: Sorry, I wasn't muted. Are you noticing more failures in containerd-related tests? There was a PR merged, I'll give the link, which overrode vendor directories for containerd. Let me just give you that.
D: There was a discussion going on about whether the fix should come from a copy from containerd or... wait a minute, where's the chat here? Yeah, yeah, so this PR overrides a lot of files in the containerd folder in the vendor directory, and this code may or may not even exist in containerd, so we need to check why this PR was merged. There is a... if you go to the... if you scroll down.
A: Around the same time as a different one; as I said, this one is broken too, I'm thinking.
A: It started failing on 7/21, you can see here, and... which one? There was another one that started failing around the same time, I think.
E: I might be familiar with that, containerd 1.7 as well. That fixed the arm64 jobs. There were also failures expected on the containerd 1.7 tests, because we merged a breaking change... we merged breaking changes a few days ago, which...
E: ...should be resolved now, because they got backported properly. It's not this... okay, it's basically the bucket change, right? We changed the buckets where you pull the blobs, the containerd binaries, from, but it's all been backported now. So all of that should go green today.
A: Yeah, but so which tests is it supposed to fix? Arm?
E: Yeah, arm will be fixed because of that fix, but the containerd 1.7 e2e test had a separate bug that was fixed.
A: Let's see, arm is still failing. The last run was this morning, and it started failing seven...
E: That looks like whatever was installing containerd didn't finish properly. Interesting.
A: For the other jobs, I think there are no new failures. I glanced over, and I think they're all known failures, like serial still failing and eviction failing. I don't remember what serial's issue is, yeah.
A: Okay, anything else related to tests, or maybe related to the upcoming release?
A: Okay, then: Dixie.
G: Hey, a few weeks ago I presented the OSS test guidance talk. I've also attached the PR link to the doc, to the meeting notes here. I would like to request everyone to take a look at it and drop in if you have any suggestions before I merge it... before I go ahead and get it reviewed and merged. There was one suggestion from one of the reviewers that we add dashboards for each architecture and each cloud provider. So initially we had decided that we'll have just four dashboards.
G: One was sig-node default, the other one is sig-node kubelet, then sig-node containerd; I think three main dashboards. So I'm thinking maybe I'll multiply that by two, one for each cloud provider, GCE and EC2, and then inside each dashboard, whatever tests, whatever sub-dashboard we have, we can run for each architecture.
G: So if anyone has any thoughts about it, yeah, please comment on the PR. I just wanted to let everyone know that we had initially decided on three main dashboards; I'm planning to multiply that by two, one for each cloud service provider, and inside each dashboard we'll have tests for each architecture. Yeah, that's all from my end.
A: Yeah, Muhammad, I don't know what you mean here, but I think the goal here is to make the dashboards as neutral as possible from a vendor perspective. So naming dashboards after the cloud will defeat the purpose. Like, we will save money by running some tests in one cloud and some in other clouds and being flexible there, and if you start having two dashboards, you'll start saying: let's run all the tests in all the clouds, which, if...
E: Yeah, I think I was referring to a tab, right? I'm not sure how you do it. Let me see, one second.
E: If you look at the kOps testgrid, for example, which is quite nice, they deliver a dashboard for every permutation that they test, right? Because effectively we'll have a lot of permutations to test, based on the operating system, the container runtime, cgroups, the architecture, and what cloud the test is going to run on, because we're going to be doing tests with different OSes on different clouds, because some don't exist on the other ones, and then there are different mutations of the same OS in different clouds.
A: I think they need to test those compatibilities, right? I think what we're trying to do in SIG Node is to test mostly kubelet behavior, rather than compatibility with clouds and operating systems. That's why we want just a few operating systems of different flavors, and to try to be as neutral as possible. We even discussed that we don't want devices specific to a cloud.
A: Yeah, I mean, there is a trade-off between a generic image and an image that is easier to use on a given cloud, right? So I don't know; there is no right answer for what to use, but we try to be... I mean, we want to make it easier on ourselves to use these tests, so being exact...
A
Is
sometimes
hard,
so
we
need
to
understand
how,
like
what
kind
of
permutations
we
need.
E
Yeah,
it
sounds
good
when
dealing
their
cops
does,
which
is
very
nice,
is
all
of
that
you
see
on
your
screen
right
now.
Is
it's
not
generated
by
hand,
because
that's
some
that
grid
has
like
700
different
jobs
on
there
600
sorry,
so
they
don't
generate
it
by
hand.
They've
got
generated
this
all
that
that's
one
of
the
things
I'm
thinking
of
applying
here,
because
we're
gonna
have
as
many
permutations
as
they
do.
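A job-matrix generator of the kind described here can be sketched in a few lines of Go. The dimension values and the `ci-node-...` name template below are hypothetical placeholders for illustration, not the actual SIG Node or kOps configuration.

```go
package main

import "fmt"

// generateJobs builds one job name per permutation of the given
// dimensions (OS x runtime x cgroup version x architecture x cloud).
// The name template is a made-up example.
func generateJobs(oses, runtimes, cgroups, arches, clouds []string) []string {
	var jobs []string
	for _, osName := range oses {
		for _, rt := range runtimes {
			for _, cg := range cgroups {
				for _, arch := range arches {
					for _, cloud := range clouds {
						jobs = append(jobs,
							fmt.Sprintf("ci-node-%s-%s-cgroup%s-%s-%s", osName, rt, cg, arch, cloud))
					}
				}
			}
		}
	}
	return jobs
}

func main() {
	jobs := generateJobs(
		[]string{"cos", "ubuntu"},
		[]string{"containerd"},
		[]string{"v1", "v2"},
		[]string{"amd64", "arm64"},
		[]string{"gce", "ec2"},
	)
	fmt.Println(len(jobs)) // 2*1*2*2*2 = 16 permutations
	fmt.Println(jobs[0])
}
```

The point of generating rather than hand-writing the matrix is that adding one value to any dimension fans out to all the new jobs automatically, which is how a 600-job grid stays maintainable.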
G: Yeah, I think having the automated... like, automated code to generate the job names, that looks fine to me. Maybe we can discuss offline, yeah, to finalize the dashboards; that might be helpful. So does that sound good to you? Like, after...
G: So yeah, let me think about it a bit, and then after the meeting we can finalize which ones are needed. Maybe we could run, like you said, a few jobs on one provider and a few on the other, just to make sure that there is some neutrality there. And yeah, we can also discuss what OS images we should finalize there, and yeah, then...
A: Okay, okay, let's discuss it in the PR. I think we started... I hope we will finalize it by the opening of the 1.29 release, so we can start reshuffling tests early, you know, at least, yeah.
G: We can finalize it this week; it's just one or two permutations here and there that are missing, okay.
E: I mean, there's no rush. I'm busy hashing out some kubetest2 stuff so we can use it more widely, but there's a lot of legwork I need to do. I think that will take up a fair amount of time for the rest of this release cycle.
A: And then let's go into triage really quick.
A: Last week you said you'd take a look at this. I don't know why I didn't move it out of triage.
G
Oh
I
think
I
missed
it.
Yeah
I'm
fine
with
other
people
taking
a
look
at
it.
A: I was just saying that to myself as well, sorry for the noise. I think it's already been looked at, so I think I'll just take a look at the update.
C: We talked about this last week, if I remember correctly. So this one is basically a reproducer for a behavior, and we need to figure out whether it's correct or not. So there is not enough specification in the API here. So I think we need to triage it. Okay, Maria?
E: Yeah, so this one is quite safe and straightforward. I'm making changes... well, let me see... in the latest provider code, where we keep the make test-e2e stuff. I need to deprecate that thing, basically the makefile. I've got rid of the defaults on this; I need to set the defaults on a job, and then clean it up after 1.28 is released.
A: Okay, can I trust you that it's... so yeah, it's good.
A
Jobs
are
on
our
ad,
so
I
didn't
even
mention
that,
hopefully
we
can
get
yeah.
E: So Dims is building the EKS AMIs, and they're missing a few flags. That's a new one, but the core jobs are green and have been for some time.
E: ...that's already there; I'll need to fix all that stuff downstream and come back.
A: Yeah, I'm not sure if we strictly need it; we'll test it around. I don't want to risk breaking anything.
C: I'm fine. I did half of the review, not the full review, so there are two things from my side. First of all, I agree with you: this looks not too bad, looks okay, but probably not the best time to merge it. And once I'm sure that no existing tests break, it looks good to me, and then we can evaluate whether we can merge it or not. So it looks half good; let's see.
A: I already have Patrick, so I don't think we necessarily have to...
A: "CPU set with restartable init container"... oh okay, so this is the sidecar issue that we discussed. It's beyond this forum; it's a product change.
A
All
right,
that's
to
be
table
driven
testing
will
be
the
best
to
get
the
best
practices
here.
B
I
kind
of
feel
like
table
tested
writer,
for
if
you
have
like
several
combinations
of
the
symptoms
with
a
slightly
different
variants.
A
So
yeah
I
think
Muhammad
this
one
that
you
mentioned
right.
A
Having
do
you
know
if
it's
ec2,
related
or
arm
related.
A: A quick question, Todd: is that running the 6.1 kernel version?
F: I'm not sure which kernel the arm test is using.
A: Yes, that's the API. It's a little bit fragile, I remember, and maybe I'm wrong, but I think this test is huge. In one test it has checks for all sorts of metrics simultaneously, or runtime elements.
A
Runtime
element
by
the
way
may
be
related
to
some
parameters
like
we
may
not
pass
the
right
parameters
to
find
a
container
G
group,
but
I,
don't
remember,
I
mean
I,
have
very
vague
memory
about
this.
That
may
be
very
much
configurations
problem.
A: But this one I showed earlier started failing on the 21st, so I think it was... I mean, we can double-check whether these tests were passing on the 21st. If they were, then it's an EC2 problem or a configuration problem.
A: If not, then we need to... I mean, I hope it's not a wider issue. I hope it's just a configuration problem.
A: Okay, yeah, we're at the end of this segment, and I want to move to bug triage. And for that, Mike, do you want to drive it, as we discussed? I can make...
B: Okay, yeah, I think I have the cookies now, thanks. Apologies in advance if there's too much noise; it's really bad. Let me know and I'll move to another place.
B
Right
interesting,
so
we
have
our
issues
allocate
device
for
parts,
foreign.
C
Think,
basically,
information
saved
at
the
moment.
I
think
it
could
wait
the
128
one
two,
whatever
that
stream
I
think
thanks.
A: It's an endpoint controller, I think.
B: Good. What do we do with the part that we removed from here?
B
Okay,
just
moved
it.
Let's
see
that's
transition
time
last
transition
time
of
already
condition
roll
back
to
previous
okay.
A: When the node lifecycle controller loses connectivity to the node, it will update the ready state for all the pods to the latest time, so it will basically mark all the pods as unreachable, because the node is unreachable.
B: Okay, that makes sense, but the thing is: why is it... it looks like a bug that it's using the earlier time.
A: I think we can triage it as needs-information and ask for node controller... node lifecycle controller logs, to understand why the node lifecycle controller believed that the pod's ready state needed to be updated.
A
Yeah
I'm
not
sure
what
this
proxy
doing
so
he's
saying
that
the
prophecy
actually
breaks
a
connection
between
node
and
API
server,
yeah.
F: That's what I was assuming it was doing, yeah: an HTTP proxy from the worker node to the kube API server. Get everything going and kill the proxy, so you break the connection but leave the kubelet running, and then restart the proxy so the kubelet reestablishes the connection, and then you see this weird lastUpdateTime or lastTransitionTime issue. Yeah, okay, so I...
A: So time one is when it became ready; time two is... what is time two?
B: Because they're independent, right? And beyond that, the controller manager really doesn't have any idea; it just uses whatever it had before.
A: Yeah, I think we may fix it as part of another bug that we have with probes. Today, when the kubelet starts, it starts with an empty slate, so it will say that everything became not ready, even though it was already ready before. So what you want to do is read the last status from the API server when the kubelet starts, so it will know the status. But I don't know whether you want to change that.
A
I
think
we
can
requestify
it
as
a
feature
request
and
ask
a
topic
started.
What's
the
side
effect
like
what
is
wrong?
Be
Beyond
I
mean
it's
inconsistency
in
data,
but
how
is
it
affecting
the
customer.
B: Is it like this, with a dash?
B: Okay, the last one: "volume file system is corrupted, causing the program to get stuck when accessing files", meaning the program freezes... thank you... when the volume is corrupted.
A: For this kind of bug I'd typically leave it to SIG Storage to deal with.
A: Yeah, his pull request is marked... I mean, as long as... can you see in the bug whether there is a discussion from SIG Storage? There may be a reason they just bailed out.
A: If the version is okay, then I think we can just put it as a...
F
So
I
think
he's
saying
like
if
you're
like
a
corrupted
file
system
to
where
you're
file
system
access
hangs
like
which
what
happens,
if
like
NFS
disconnects
for
whatever
reason
like
cubelet
hangs.
That's
like
that
happens
all
over
the
place
like
basically
anytime.
We
try
to
root
a
file,
it'll,
hang
and
I.
Don't
know
if
there's
a
whole
lot
that
can
be
done.
A: Yeah, that's why I said it would be fun to write a test for that. How do we even do that?
B
Seen
some
cases
with
that
as
well
I,
don't
know
it
seems
like
a
diamond,
is
a
good
solution
for
those
cases,
but
I
don't
know
if,
like
it
entirely
a
reliably.
F
Yeah,
the
problem
is
it
blocks
in
the
system,
call
oh
okay
and
which
consumes
a
real
OS
thread
and
go
so
I've
been
a
lot
time
out,
but
you
still
are
blocking
an
OS
thread.
You
start
considering
those
it's
really.
A: Is this even something that's typically a regression in this version, or something that we have to fix in this version? This has probably existed since the beginning of the kubelet's existence, so I think it's more... I wouldn't say a feature request, but maybe a long-term bug.
A: Accepted, and let's ask SIG Storage to try to suggest anything.
A: And write something like a comment saying: SIG Storage, what do you think about this fix? Or something.
A
Don't
don't
touch
storage
thing.
B
Yeah,
it
looks
like
we're
done
with
the
three
hours:
do
we
go
for
the
other
ones.
A
Some
things
out
of
mysterious
there
is
a
link
for
that.
On
top
like
on
the
second
box,
there
is
a
link
on
the
recent
updates
and
industriage
and
now
down
below.
A
Yeah,
those
are
everything
that
was
recently
updated,
so
you
can
see
if
somebody
replied
on
this,
so
we
don't
have
time
for
that.
I
think.
Maybe
next.
A
Okay,
any
other
topics
for
the
day,
thanks
Mike
for
Allegiance.
Maybe
next
time
we
can
extend
the
experiment.