From YouTube: Kubernetes SIG Testing 20180605
Description
SIG Testing meeting, June 05.
A
Yes, thanks Eric. So, to give some context to this discussion: we've been discussing this for a few weeks now, I guess with Ben, Eric, Mitra, and a few others. So I added a couple of scalability presubmit jobs that now run against every single PR, but they are non-blocking presubmits.
A
So the plan, and what I've now actually proposed, is to make those presubmits blocking. The reason behind this is that we've been catching way too many scalability regressions using those tests, and it turns out that not having those presubmits blocking is significantly hurting our schedules, and in terms of engineering productivity as well. Because what happens?
A
So it takes a while to debug these, and sometimes we kind of fall short of manpower, and this ends up leaving these regressions in for a long time, close to the release, and then we end up pushing the release by a couple of days or something while debugging these issues. We were seeing this pretty much every release. We usually managed to get these issues fixed on time, but this is happening time and again.
A
So because of this, we came up with a proposal to add a set of formal processes which will actually make the scalability testing much more streamlined, so that we don't have to waste so much time on this, and we don't block the entire release on pretty much two people from SIG Scalability debugging these issues. Also, given how important scalability is for the project, this is a really serious production concern.
A
So one of the lessons we learnt there is that having presubmits would actually stop about 60% of these regressions, which has more effect than just stopping those regressions, because it also means our large-scale 5,000-node tests don't see these regressions, and those runs are costly: if we lose one of those, we've pretty much wasted a 17-hour-long run of these performance tests. So yes, we want to catch them early on, and so I added a couple of these presubmits.
A
One of them runs our scalability test, or performance test, on a 100-node GCE cluster, and the other runs it on a 500-node Kubemark cluster (Kubemark is our way of running simulated clusters). So yes, over the past month or so, I and a few others were working on making these jobs really green, and I've spoken to Ash and a few others, and it seems like the level of health is quite good, and the recent signal of these jobs is in line with their long-term success rate.
B
The bigger concern for us was around the timing of these tests, especially if you're going to make them presubmit blocking, right? The slowest job right now takes about an hour. We spoke about this offline with Ben, and there are efforts to make these faster so that they meet the 40-minute time that our presubmits could take. Just adding more tests that are longer is going to add so much more time for every single run. That's all. Yes, so we just wanted to...
A
Not necessarily. So the way it works is that our 5,000-node test, which is the highest scale, usually ends up catching way more regressions, because, of course, it's at a higher scale. Oftentimes there are regressions which are not caught at lower scale, because things are not stressed as much at lower scale. There are a few regressions which happened in the past which were just not possible to catch at the lower scale of 100 nodes; you actually need to have it running at the 5,000-node scale.
A
And yes, but of course we cannot make the 5,000-node cluster run as a presubmit, because it's super long and you just don't have the quota and so on, so we just have CI running for it. But that's our highest officially supported limit according to Kubernetes, so that's the highest we test for. But yes, that one is a CI job.
A
So maybe I can quickly explain. In general, the doc proposes adding these processes in three phases. The first one is for presubmits, having blocking presubmits; the second is having post-submits block on failure; and the third is actually dealing with scalability issues potentially early on in the development process, which is right during feature development.
A
So that is where design reviews come in. This is also the order in which we were thinking of introducing this, but we don't want to actually add all of them at once, because that can end up being significant red tape for developers. So there is an exit criteria mentioned in the doc. I don't know if you're actually looking at the same doc right now, but let me also share this.
A
Blocking merges... well, we have the scalability jobs as release-blocking already, so blocking merges is... but the thing is, I would not prefer discussing it right now, because it's kind of dependent on this step. If we see that just having these blocking scalability presubmits is already putting us in a good enough shape, we may actually not need to do that. So that's something we would want to do at a later point, if things are still really bad after these presubmits. So that is exactly it.
A
If you open that doc that I shared, with the scalability stories, there is also a column there showing which SIGs are involved. Scalability is hugely a horizontal effort, in that we do catch these issues, and we kind of narrow them down enough that we know where the cause is, and then we try to redirect them to the right SIGs, and usually those SIGs are then following up on these issues, and we are from time to time communicating with them.
A
A regression went in, and then these tests start failing; the large-cluster tests start failing. In the meanwhile, there is some other regression which comes up, and we don't have a good point to know where this regression came in, because the large-cluster tests were already failing because of the other regression that came in earlier. For this reason it's not working. And trust me...
A
This happens quite often: we see multiple scalability issues stacking on top of each other, and then this actually makes debugging those issues exponentially harder, because sometimes we need to run bisections on these large-scale clusters, and we need to run multiple rounds of those, and things like this. So yes, so basically...
A
This is actually letting us performance-test every single change, which is of quite a lot of value to developers, because often, as I've seen in various discussions on various PRs earlier, people are wondering about what the performance impact of their change can be on really large clusters, what effect this change can end up having. So I think these presubmits also empower developers by actually knowing what the performance impact of their changes is.
E
That's valuable, but I would want to know what actually takes so long in the hour-and-20-minute test, and whether we can cut that, because that's at least a 33 percent gain in the time to run the tests. And we don't just run the tests on each PR; we also have to potentially rerun them before merge, in batch merges. So that's going to impact the merge rate, maybe not massively, but at least somewhat, and the feedback time for developers. Can we cut that down somehow?
A
We probably can try to cut it short by a few minutes, but there are some bottlenecks. The way it works is: you basically have to start this big cluster, and it takes some time to start the cluster itself, and that is not something we can really optimize much further, because we've actually already done a lot of things there in the past, like parallelizing the startup and so on for large clusters. That is one front on which we can improve, but I think that is not the limiting factor here.
A
The limiting factor is actually our tests. So we run two kinds of tests, and each test has its own value. The first one is called the density test, where we basically densely fill a cluster with pods at a very high rate, and that is basically for measuring our pod startup latency SLO. We have two SLOs in Kubernetes right now that we officially support, and one of them is pod startup latency: it has to be within 5 seconds.
A
The 99th percentile of pod startup latency has to be within 5 seconds. The other test is the load test, where we actually exercise a different aspect of the system, in that we basically create many different kinds of API calls and we measure the API call latency, and that's our second SLO. So yes, these two tests are needed for measuring those two SLOs.
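A minimal sketch of what these SLO checks amount to, assuming the tests boil down to comparing a percentile of measured latencies against a threshold; the helper and the sample values below are hypothetical stand-ins, not the actual kubernetes/perf-tests measurement code:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of samples using
// the nearest-rank method. Hypothetical helper, for illustration only.
func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted))*p/100.0+0.5) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	// podStartup stands in for the per-pod startup latencies the density
	// test collects; apiCall for the per-request latencies of the load test.
	podStartup := []time.Duration{2 * time.Second, 3 * time.Second, 6 * time.Second}
	apiCall := []time.Duration{50 * time.Millisecond, 200 * time.Millisecond}

	// SLO from the discussion: the 99th percentile of pod startup latency
	// must be within 5 seconds.
	if p99 := percentile(podStartup, 99); p99 > 5*time.Second {
		fmt.Printf("pod startup SLO violated: p99 = %v > 5s\n", p99)
	}
	// The load test checks API call latency the same way; the 1s threshold
	// here is a placeholder, not the real SLO value.
	if p99 := percentile(apiCall, 99); p99 > time.Second {
		fmt.Printf("API call SLO violated: p99 = %v\n", p99)
	}
}
```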
C
Could we separate them, and run each one separately?
A
We can try to do that, but that has a shortcoming, in that this means, basically, we would need 200 nodes for every single PR, and that means we cannot have the same parallelism that we have right now. Right now, I made sure that we have as much parallelism as we need, so that this job doesn't reduce the number of parallel runs, and splitting would actually basically mean doubling the quota for these tests.
A
I think for the hundred-node GCE job, the test itself actually runs for, I think, something like 25 or 30 minutes; I need to check.
I can try to speed it up, but there are some hard limits to that, because in general the Kubernetes control plane has, at different places, some rate limits, specifically with respect to the control-plane QPS for talking to the API server. So, for example, the scheduler has a limited QPS, so you cannot schedule faster than that.
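For context on these limits: Kubernetes components rate-limit their own clients when talking to the API server, and components such as kube-scheduler expose this via flags like --kube-api-qps and --kube-api-burst. A minimal client-go sketch of the same knob; the kubeconfig path and the numbers are placeholders rather than any component's actual defaults:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client config from a kubeconfig file (placeholder path).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}

	// client-go throttles requests on the client side; these two fields are
	// the knobs behind flags like --kube-api-qps / --kube-api-burst.
	// Placeholder values, not real defaults.
	config.QPS = 50    // sustained requests per second to the API server
	config.Burst = 100 // short-term burst allowance

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	fmt.Printf("created %T throttled to %v QPS\n", clientset, config.QPS)
}
```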
A
Yes, that's a good question. For the 100-node test, it's not really too different from our normal pull GCE job, which runs the normal set of e2e tests. But for Kubemark: Kubemark in general is like an overlay cluster. You start a base cluster, and you create this simulated cluster on top of it. So there is this extra overlay step in Kubemark, which is, I think, 7 minutes extra for creating the Kubemark 500-node cluster on top of the base cluster.
A
That's a very good question, Ash, and I would say that we would need a really intelligent way to do that. We can probably do this at some point. At this point, I can see some obvious PRs that we can cut this short for, like, I don't know, non-code changes; but for code changes it's really non-trivial, because I've seen cases in the past where a very simple change, that no one would suspect, created an effect. Just for example, I can actually recollect one...
A
Yeah, so I agree with you that Kubemark-500 is probably something you'd want to improve, or think about what to do with. But at this point, what do you think about adding just the other presubmit? Because that's still really useful for us, and at the moment I don't think it's going to slow down our merge rate. I will try to improve this time, I'll try to cut it short, and yeah...
E
To be clear, I think these are valuable tests, and we should block on them if they're catching scalability failures. I just want to make sure that we're not going to let the test time just spiral out, and it is a little worrying that this one is already starting considerably higher, at least a third higher, than our worst-case presubmit right now.
A
I think at this time of the release, when we are... have we already entered code freeze? Well, yeah, anyway: I see that the merge queue is not actually that big right now, so I think we might want to kind of introduce these as a beta test: introduce these and see how it works. And another thing which actually just came to my mind is something I brought up earlier in some discussion...
E
Yes, except that it still ends up in as bad a situation, right?
A
No, because most of these regressions actually come in during the development phase. For example, I've been debugging the regressions which came in during the active code period, which is the first one and a half months or so. So yeah, they come in in the first half of the release, the first half of the quarter, and then we debug them for the remaining half of the quarter, slowly removing them one by one.
E
As Steve brought up, there are things we can do to improve the merge throughput, though they might affect quota again. But not getting feedback in a reasonable time is probably going to affect developer experience a lot more than a variation in the merge rate, which already happens at times due to other things, like a job just being totally broken for some sudden reason. One question about this, though: I think there's a pretty common developer workflow where you push changes very quickly to GitHub. On a couple of our repos we're seeing developers make changes on the order of every 3 minutes or every 5 minutes, so they do that 10 times, and so they'll see a whole slew of jobs get started and then very quickly cancelled.
A
Yes.
D
Well, but still, cancelling a hundred-node job is non-trivial, right?
A
So, Steve, this was exactly the problem. This is exactly the reason why this job was failing almost all the time when I initially added it: what happened was, there was no proper garbage collection mechanism. We didn't have a janitor for these presubmits, because they are all running in a single project with this huge quota, so we cannot really clean it up well, and because of that, such aborted runs were leaking clusters, and I had to manually delete them all the time, and that ended up creating a big mess.
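For what it's worth, the general fix for this class of problem is a periodic janitor that reaps resources leaked by aborted runs; kubernetes/test-infra grew tooling along these lines (e.g. Boskos and its janitors). A minimal sketch of the idea, with entirely hypothetical types and helpers standing in for real cloud-provider calls:

```go
package main

import (
	"fmt"
	"time"
)

// cluster is a hypothetical record of a test cluster created by a CI run.
type cluster struct {
	name    string
	created time.Time
}

// listClusters and deleteCluster are hypothetical stand-ins for real
// cloud-provider calls (e.g. listing GCE resources in the shared project).
func listClusters() []cluster {
	return []cluster{
		{name: "pr-12345-density", created: time.Now().Add(-3 * time.Hour)},
		{name: "pr-67890-load", created: time.Now().Add(-10 * time.Minute)},
	}
}

func deleteCluster(c cluster) error {
	fmt.Printf("deleting leaked cluster %s\n", c.name)
	return nil
}

// janitor deletes any cluster older than ttl. Aborted presubmit runs cannot
// clean up after themselves, so a periodic sweep like this keeps a shared
// quota pool from filling up with leaked clusters.
func janitor(ttl time.Duration) {
	for _, c := range listClusters() {
		if time.Since(c.created) > ttl {
			if err := deleteCluster(c); err != nil {
				fmt.Printf("failed to delete %s: %v\n", c.name, err)
			}
		}
	}
}

func main() {
	// In CI this would run on a schedule; the 2-hour TTL is arbitrary.
	janitor(2 * time.Hour)
}
```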
C
Since we're hitting the end of the meeting, I don't think we're going to be done today in terms of making a decision here. The biggest reservation that I think I'm hearing is that at least the Kubemark-500 job is slow. So maybe we could identify, for the GCE-100 job, out of the current timing, what is spent doing what, and what our options are to speed it up in the future?
A
What I would suggest is: maybe let's add these jobs for now, because we are actually not under merge-queue pressure right now, and there are not many PRs in the submit queue, and then I will in parallel work on trying to improve these times. And if it really becomes a pain in the ass, we can just disable the Kubemark-500 one later. But at least let's start catching these regressions and stopping them, because that is a lot of saved engineering productivity.
J
It was very opaque as to what the problem actually was: the logs were not clear, and the JUnit reports that were shipped up to Gubernator were not clear as to what the actual problem was. There was just a bunch of pods that were going into CrashLoopBackOff. So that's a difficult thing for somebody who's not a scalability engineer to figure out: okay, what is actually the failure here?
A
The first point is that we are not going to fail this test for most trivial or simple PRs or simple changes. Given the scalability signal, for the ones which actually failed, it means that something is actually wrong with respect to scalability in that PR, and someone from the scalability team might need to look into that PR to see what is wrong, because there might actually be some serious scalability consequence or impact from that PR.
A
But even if someone from scalability doesn't look into it, we still should stop that PR from going in. And with respect to understanding this: yes, that is something I personally feel is going to develop over time once we introduce this process, because then people will start caring about more than just their normal presubmits passing, and actually about the quality of solutions and the quality of the changes, because scalability will then actually be something that people start caring about.
J
This was the particular thing that was failing to start, and that was caused by the regression. So I agree we would catch regressions if we make this blocking, but on the flip side we need visibility, either so that individual PR authors can identify what the problem is and correct it, or by having scalability engineers who are able to really dig into it themselves. The other thing that I wanted to bring up, which is kind of related, is the timing. As far as making them blocking, I think really...
J
We would need to engage with the release team there, and, you know, we've obviously got some people here who are members of that release team, but that would be my other concern, in particular as we're going into code freeze. Yes, the merge queue is smaller, but if there's a critical fix that needs to go in for something else, and then all of a sudden it gets blocked, that would also be a bad situation to run into so close to the release.
A
E
A
E
B
A
B
F
The
to
the
point
about
like
allowing
developers
to
better
understand
failures
and
their
tests
or
failures
and
their
jobs,
there
is
some
work
being
done
on
sort
of
better
understanding
of
logs
and
better
sort
of
viewing
of
things
that
went
wrong
and
the
things
that
went
right
with
certain
job.
That's
going
to
be
coming
out
very,
very
soon
for
review
by
the
community.
So
if,
if
you
guys
are
interested
in
and
like
a
better
understanding
of
how
those
things
are
happening,
take
a
look
at
that.
K
I'm also really doubtful that we have enough quota for this. Even if you just flipped it to a non-blocking, skip-report job, you'd probably find we don't have the quota for hundreds of nodes, because you'd need to run this presubmit over 200 times a day, and I think you might expose a lot of issues before.