From YouTube: Kubernetes SIG Node CI 20230125
Description
SIG Node CI weekly meeting. Agenda and notes: https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#heading=h.2v8vzknys4nk
GMT20230125-180539_Recording_1542x1120.mp4
A: I put this agenda item on the list. I don't have specific action items yet; it's just one of the bugs we just covered. I spoke about it during the CI Signal meeting yesterday. It's about HTTP probes. If you have too many containers (in the example we investigated, it was about 300 containers on a node, all of them doing liveness probes every second), then we can run out of resources on the node, because every HTTP probe takes a socket and waits for completion.
A: And then, after the callback, after the response is received, it keeps the socket open for another 60 seconds because of the TCP standard (the TIME_WAIT state). I'll be changing it to one second, making the problem less significant.
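To make the TIME_WAIT mechanics concrete, here is a minimal Go sketch of the kind of mitigation being described; this is an assumption about the approach, not the actual kubelet patch. Setting SO_LINGER to zero makes Close() send an RST, so a probe's socket does not sit in TIME_WAIT for 60 seconds afterwards:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	transport := &http.Transport{
		// Probe connections are one-shot; do not keep them alive for reuse.
		DisableKeepAlives: true,
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			conn, err := dialer.DialContext(ctx, network, addr)
			if err != nil {
				return nil, err
			}
			// SetLinger(0) closes with RST instead of FIN, so the kernel
			// does not hold the socket in TIME_WAIT (~60s) after the probe.
			if tcp, ok := conn.(*net.TCPConn); ok {
				if err := tcp.SetLinger(0); err != nil {
					return nil, err
				}
			}
			return conn, nil
		},
	}
	client := &http.Client{Transport: transport, Timeout: 2 * time.Second}
	resp, err := client.Get("http://127.0.0.1:8080/healthz") // hypothetical probe target
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("probe status:", resp.StatusCode)
}
```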
But we still have this noisy-neighbor problem of a sort, and we discovered this issue through different channels.
A: We saw people complaining about it, so that's one channel, but we couldn't understand why it was happening. We started thinking maybe some application was doing crazy things with resources, or maybe it wasn't listening on the socket, because the error was a client error: the kubelet was reporting that liveness probes fail with a connection timeout while trying to connect to the application. So naturally you would expect that the application is not responding, that the application cannot reply on the socket, or something like that.
A: But you know, it's a kubelet-side problem: the node exhausted all the sockets and it just can't create another one. And then we also discovered it in our stress tests in GKE; we've been running some of them, and surprisingly it didn't fail before, but it started failing, and I'm not sure what changed.
A
We
like
also
recognize
that
maybe
you
know
a
customer
issue,
it's
it
looks
more
and
more
like
a
infrastructure
problem,
so
we
dig
a
little
bit
different
Antonio
found
this
problem
because
he's
very
familiar
with
networking
stocks.
So
it
was
obvious.
A
I
mean
it
wasn't
obvious
for
him,
but
like
it
took
some
time
to
figure
it
out
and
then
I
I
spoke
with
scalability
team
in
the
past
six
collaborities
running
all
sorts
of
scalability
tests,
but
they
never
run
any
notes
collability
test,
so
they
don't
run
like
authors.
Collability
efforts
are
directed
to
towards
scheduler
and
API
service
collability,
so
they
want
to
make
sure
that
they
can
handle
a
lot
of
nodes
with
each
node
will
host
a
lot
of
ports,
but
every
port
that
they
run
is
collability.
A
Test
is
extremely
small
and
using
the
same
image
as
everything
else.
It
doesn't
have
any
probes
at
all
defined
so
like
all
the
primitive
premature
probe,
so
they're,
mostly
tasting
like
and
like,
maybe
sometimes
they
have
configma
but
config
map
for
the
sport
is
typically
designed
to
test
API
serious
collability
like
how
many
config
Maps
we
can
host
and
like
how
fast
they
will
be
downloaded
this
kind
of
things.
A: So this has been SIG Scalability's focus for many years now. They knew about node scalability questions, but they never dug much deeper into node efforts at all. So...
A: What I was saying is that I feel we have this lack of testing in SIG Node. Another thing we're working on is gRPC probes. For gRPC probes we also implemented a couple of conformance tests, and that's all good, the functionality is working. But now, looking at this problem, I'm thinking: what kind of scalability testing do I need to run? How many containers is too many to run?
A
Like
should
I
run
skull,
businesses
with
like
2
000
containers
like
five
thousand
Canada,
whatever
the
limit
that
they
need
to
support
yeah
and
what
kind
of
resources
we
can
exhaust
on
node
with
grpc
connections.
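For scale reasoning, it helps to see what a single gRPC probe does. Here is a minimal sketch of the standard gRPC health check a probe performs (the target address is a made-up example, and this is not kubelet's actual prober code); each such probe holds a TCP connection and an HTTP/2 stream while it waits:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Each probe dials the container's gRPC port; with hundreds of
	// containers probing every second, these connections add up per node.
	conn, err := grpc.DialContext(ctx, "127.0.0.1:50051", // hypothetical target
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx,
		&healthpb.HealthCheckRequest{Service: ""}) // "" asks for overall server health
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("serving status:", resp.Status)
}
```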
A
No
excuse
me
yeah,
so
I
don't
have
answers
yet.
I
just
I
feel
that
there
is
a
lack
of
testing
that
we
have
you
also
like
in
like.
We
also
have
this
talk
test
that
something
that
Ryan
could
continuously.
We
have
this
test.
It's
constantly
red.
We
made
like
a
few
attempts
already
to
fix
it.
I
think
we're
getting
closer,
but
yeah
Brian
is
he.
You
know
shaking
ahead.
A
Yeah
I
mean
I,
get
it
to
the
point
when
it
I
I
knew.
The
problem
is
some
leak
and
log,
so
I,
like
I,
was
trying
to
figure
out
which
lock
I
need
to
rotate
but
I
get
to
this
point,
but
then,
like
I,
went
to
parental
leave
and,
like
I
came
back
and
it
gets
broken
again,
I
completely
passed
it
so
I.
Don't
know
like
this.
A
This
is
a
little
bit
dysfunctional
but
yeah,
something
that
we
also
need
to
do,
but
we
have
at
least
some
I
mean
we
have
a
place
to
fix.
We
don't
have
like.
We
actually
have
a
test
that
we
need
to
look
into
to
fix
it.
For
stress.
We
don't
have
anything
I.
C: I have a question for you: I see your link in the pull request here. Does this mean you started a test to do the stressing?
A: It's very... I mean, it's good enough for that: it's failing without the fix and it's not failing with the fix. But frankly speaking, I didn't even try it on Windows. Will it work on Windows or will it not? I have no idea. I mean, probably it does, because we run unit tests on Windows. Yeah, Ryan?
D: I was going to mention, I was just looking at the patch too. I think it's a really good patch.
A: Yeah, that's why I mentioned gRPC: we were thinking of GA for gRPC probes, and right now, with this scalability problem, we probably need to look deeper and try to test it at scale.
A: Yeah. So, I don't know if anybody would be interested to look into stress tests, into what kind of stress tests we want to run during functional validation. It would be interesting, and I would definitely support that. I just don't know who has energy right now, and whether we want to tackle it in this release or the next one.
A: Yeah, maybe we can start with understanding what we want out of stress tests. I think probes are the obvious thing you may want to test, but you also may want to look at how many config maps one pod can handle. I don't know what else to stress; like, how fast you can create and delete pods. Maybe you can just keep pounding with creation and deletion and see how the kubelet behaves.
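A rough sketch of that create/delete pounding idea using client-go; the namespace, pod shape, and iteration count are assumptions for illustration, not an agreed test design:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	pods := kubernetes.NewForConfigOrDie(config).CoreV1().Pods("default")

	// Keep pounding: create and immediately delete pods in a loop and
	// watch how the kubelet keeps up with the churn.
	for i := 0; i < 100; i++ {
		name := fmt.Sprintf("churn-%d", i)
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyNever,
				Containers: []corev1.Container{{
					Name:  "pause",
					Image: "registry.k8s.io/pause:3.9",
				}},
			},
		}
		if _, err := pods.Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			fmt.Println("create failed:", err)
			continue
		}
		if err := pods.Delete(context.TODO(), name, metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete failed:", err)
		}
	}
}
```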
C: I have a practical suggestion for what would help me, and probably others in here. It would make it easier for beginners to contribute if there's already a place where these kinds of things fit in, so we can add test cases. If there's not, a good first step would be to create a place where we can begin to add test cases; find a place, something like that, and tell everybody in here, so we can jump in and make test cases.
A: That's fair; once you have an example, it's easier to extend it with other examples, right? Yeah. But even thinking about it, I'm still not super clear myself: do you want a big machine and test a heavy load of pods, or do you want a smaller machine and just test the limits of the kubelet? I haven't settled it in my own mind. That's why, even to create this place, we need to do a little bit more thinking.
What exactly do we want to validate, and why? Does anyone know if anything was discussed for evented PLEG? Evented PLEG is an improvement where... today, the kubelet lists all the containers periodically from the runtime to see if there are any changes that need to be reconciled.
A: Whereas with evented PLEG, it opens a streaming connection to the container runtime, and whenever something changes, the runtime reports back through this channel, through this stream, saying there is a change on this container, please update it.
A: The goal of that KEP was to improve performance and minimize memory usage, and I was wondering whether, during that KEP, somebody was looking into stress testing it.
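For reference, here is a rough sketch of the streaming model just described, written against the CRI container events API; the runtime socket path is an assumption, and the exact generated names should be double-checked against k8s.io/cri-api:

```go
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Connect to the container runtime's CRI socket (path is an assumption).
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	// Instead of listing all containers every second, open one stream and
	// let the runtime push lifecycle events as they happen.
	stream, err := client.GetContainerEvents(context.Background(),
		&runtimeapi.GetEventsRequest{})
	if err != nil {
		panic(err)
	}
	for {
		ev, err := stream.Recv()
		if err != nil {
			fmt.Println("stream closed:", err)
			return
		}
		fmt.Printf("container %s: %v\n", ev.ContainerId, ev.ContainerEventType)
	}
}
```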
F: So, Sergey, we are trying to add evented PLEG CI jobs in various repositories right now. There is one that was recently added as a pre-submit job, but unfortunately it's failing due to some cluster authentication issue; we're trying to fix that. That's the node e2e job that uses evented PLEG. Then we are trying to add evented PLEG CI jobs in CRI-O, and I believe someone might add one in containerd as well, and similarly we are trying to add an evented PLEG job in OpenShift CI. So that's that.
F: Right now we have a pre-submit job. There's a small problem: when you launch it, it kind of doesn't get the right cluster to launch the job on, so the job doesn't get launched. We're trying to debug what could be going on there, but as soon as that's fixed, we should have an evented PLEG node e2e job running soon.
A: Makes sense. Also, as a side note: when looking at the probes, I discovered that today we don't start the next probe while the previous probe is still waiting for a response. That was quite surprising for me, because I thought we would keep creating new connections over and over, even if older connections were still waiting on a timeout. Yeah.
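A tiny illustration of that serialization, as an assumption about the behavior rather than the real prober code: the probe runs synchronously inside the worker's ticker loop, so the next probe cannot start while the previous one is still blocked.

```go
package main

import (
	"fmt"
	"time"
)

// runProbe blocks until the probe completes or times out.
func runProbe(id int) {
	time.Sleep(3 * time.Second) // simulate a slow or unresponsive target
	fmt.Printf("probe %d finished at %s\n", id, time.Now().Format("15:04:05"))
}

func main() {
	ticker := time.NewTicker(1 * time.Second) // periodSeconds: 1
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		<-ticker.C
		// The call is synchronous: even with a 1s period, the next probe
		// does not start until this one returns, so slow probes do not
		// pile up new connections.
		runProbe(i)
	}
}
```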
A: I'm putting this PR up as an example of where stress tests may be needed, and evented PLEG is another place where I believe stress tests will be very important, mostly because under stress we can find all sorts of race conditions and missed events. Maybe a missing event will cause some pod to not be removed, or something like that. So I was wondering whether a specific stress test was discussed and what it would look like.
F: Yeah, gotcha. I just wanted to make sure that the evented PLEG has no relation to the changes in that pull request. No?
A
Like
if
there
is
Improvement
already
said
like
cap,
that
is
intentionally
doing
stress
testing,
can
it's
part
of
our
CI?
I
would
like
to
understand
details
like
how
exactly
we
decided
to
stress.
Does
this
aspect
and
not
that
aspect
kind.
F
F: That is what will trigger a lot of events: you need to have a lot of pods, and they need to continuously change states from one state to another. That would be a good setup for that kind of test.
F: Right now we have a fair number, around 250, that we can easily cross, and we want to see whether we can go beyond that number easily. So I'm going to take the existing pod lifecycle test and then start scaling up without evented PLEG, and then with evented PLEG enabled, to see whether we see any performance difference, because there is a lot of optimization that happens in the container runtime as well. So that kind of works with the existing framework.
F: So even though the kubelet is requesting the pod statuses every second, it's not like the runtime is actually going and hitting the disk; at least in the case of CRI-O, there is a lot of caching inside and it can respond from the cache. So in that case, how would evented PLEG actually make a difference there? All of those questions need to be answered. But yeah, stress testing is required. Considering that evented PLEG was designed for improving performance, a stress test is definitely required.
A: Yeah, I wonder if we can even reach the scale, the stress on the system, where the kubelet is not able to process all the events. Oh my God, if we can get a stress test to that level... yeah.
A: Yeah, and once you have it, I would be interested to learn how it's done.
A: Thank you. Okay, if there are no more agenda items, I would love to go into our board.
A: We have some issues to triage, but I wanted to start with "waiting on author". I think we haven't looked at "waiting on author" for a while, and I cleaned it up a little bit for some obvious things.
C: Oh, it's... I put a comment on it. Divya... anyway, we'll see.
A: Okay, Peter? Peter.
E: Yeah, this one, the test is still failing. I keep not having time, or not remembering, to look at it. So I would say it's a "waiting on author" kind of situation. Okay.
A: So, someone wants to add a new feature to query logs from a node using the kubelet, so you don't need to SSH to the node to gather, I think, journal logs in this case. Jordan seems to be having a problem with the overall approach: we don't want to make the kubelet be a log-forwarding solution for the node. It's a little bit scary from a security perspective.
A: Yeah, this one. Yeah, Brian, you mentioned this, right? This is something you reviewed.
A: Yeah, since all the comments were replied to, I would put it in "needs reviewer", if anybody is interested to take a look.
A: Right. Okay.
F: So, the issue it's trying to fix here: we have Fedora CoreOS as the operating system that we use for some of our jobs, and the boot configuration (I forgot what the equivalent is called on Ubuntu, but you know, the configuration the system comes up with) is called Ignition. So far we have been manually editing those Ignition files to update things there, but this PR essentially uses a feature of Fedora CoreOS where you can define a human-readable YAML file and automatically generate the machine-readable Ignition file from it.
F: Yeah, this is the node e2e evented PLEG job that I talked about some time back, which is not able to find the GCP project, and this PR is attempting to fix that. But...
I have a couple of questions around that. If you go to... yeah, this portion of the changes, I'm not sure whether it's required or not, so I'm talking to the author directly on Slack. Maybe I should put some comment here, or a hold, because we have a periodic job which doesn't have this, and it still works fine. So I'm not sure whether this is required or not.
A
Okay,
do
you
need
this
in
the
chat
yeah.
F
F
A
A: Yeah, it's not node-specific, so I would take it out of our board.
A: David is working on that, yeah. I remember way back we had skip logic that tries to skip a test when we are outside of a pre-configured environment with a test handler, and I think...
B: This is related to the topology manager graduation to GA. One of the items that we have to have for a feature to graduate to GA is some sort of metrics, and there were none for the topology manager, so I've added two, and I've added some end-to-end tests as well.
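For context, kubelet metrics of this kind are defined through k8s.io/component-base/metrics. Here is a minimal sketch of declaring and registering such a counter; the metric name and help text are illustrative guesses, not necessarily what the PR adds:

```go
package main

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// A counter in the style of kubelet admission metrics; the name and
// help text are assumptions for illustration.
var admissionRequests = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem:      "kubelet",
		Name:           "topology_manager_admission_requests_total",
		Help:           "Number of pod admissions handled by the topology manager.",
		StabilityLevel: metrics.ALPHA,
	},
)

func main() {
	legacyregistry.MustRegister(admissionRequests)

	// Incremented on every admission the topology manager evaluates.
	admissionRequests.Inc()
}
```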
A: Okay, that makes sense, with these metrics.
A: Yeah, this one I need to put back into triage.
A: Yeah, the bug is saying "we have an environment where it fails, please help us". That's not generally helpful; we need a little bit more information.
A: Okay, we still have a little bit of time, so let's try to understand.
A: This "one container"...
D: In the e2e exec, right up in this block that's at the bottom of the screen, there's a block that he labeled.
D
Here,
right
here
right
here,
there's
an
exec
with
GC
test
B.
A: Yeah, but I'm looking for the YAML for this, this B.
A: I'm just curious what "one container" means. Is it that the job has one container completed already?
A: It doesn't say anything about what triggered it.
A: No, for this one we don't need it, because when we call it, it sounds like the call can get stuck forever, and then we just accumulate those calls. Maybe we shouldn't call again if one call is already in progress.