From YouTube: Kubernetes SIG Testing - 2020-07-28
A
Okay, good morning, good afternoon, good evening, everybody. Today is Tuesday, July 28th, at least in this time zone, and welcome to the Kubernetes SIG Testing bi-weekly meeting. I am your host, Aaron Crickenberger.
A
So we're all going to adhere to the Kubernetes code of conduct, which basically means we're all going to be our very best selves and not be jerks. This meeting is being publicly recorded and will be posted to YouTube later. So today on the agenda we have a proposal that Jordan and Ben and myself have been working on, suggesting some policies to maybe improve the stability of Kubernetes CI in the short term, and we'll see where the conversation takes us from there.

A
Jordan, did you want to speak to this, or do you want me to speak to it?
C
I can give the one-minute blurb if you want. So, CI issues are not new to Kubernetes; this has been an ongoing problem for years at this point. It is more or less of a problem at different points in time. We see it crop up a lot, usually around code freeze, when lots of things are trying to get in, there's lots of load, and lots of people are very concerned about how quickly their pull requests get merged.
C
This release it was particularly bad. There were two or three real issues that were causing lots of flakes, but they were lost in the noise of other test failures, and so it was hard to find them, it was hard to get the fixes that resolved them in, and it was hard to get signal on release-blocking jobs. There have been a lot of things said over the past few years about needing to do better and hold test owners and component owners accountable for their tests.
C
It's very hard to do that when it can legitimately be said that a lot of the test failures are not the fault of the component owner or the test owner. So this proposal is trying to put in place things that will improve the infrastructure aspect, so that when we see a test failure we can actually open issues for that component owner and test owner and say: your test is failing, and we have a reasonable degree of certainty that it's not an infrastructure problem.
C
You need to fix your test, or else. The "or else" is to be determined, but we envision things like disabling jobs that are permanently failing, or not allowing SIGs to merge features if their test health is demonstrably bad. Like I said, to be determined, but the first step is getting to a point where we can actually reasonably rule out infrastructure issues and say the owner of this test or component is the one who needs to resolve this.
A
How about I mute myself. So, does anybody disagree with Jordan's statement of the problem or our stated goals?
D
Can I just, not to disagree with any of that, add some overall context? It seems kind of bad that things got this bad, right? It seems like there should have been alarm bells ringing earlier and focus brought earlier, just at a very high level, you know, to figure out whether it's infrastructure or, you know, discipline of the component owners or test owners or whatever.
A
I agree with that; it's why I'm happy there are so many people here, because I think we could use everybody's help in figuring out how to motivate everybody properly. How do we align our incentives such that people are inherently motivated to do the right thing? We should be encouraging people to brush their teeth every day; we shouldn't have to drag them to the dentist to get a root canal every year and a half or so.
A
So I can run through the specific policies we've proposed here and see what the group thinks about these. I'm also open to suggestions from the group about what they think could be done as follow-on to these, but I'd at least like to understand if we can get sign-off on some of these things. Does that seem fair to everybody?
A
Yep. So it seems like the biggest problem we're running into at the moment, for whatever reasons, is resource constraints, because the majority of our jobs schedule themselves as best-effort or possibly burstable pods.
A
So these are generally listed in sort of order of importance. I think the most important thing is to get to a place where all release-blocking and all merge-blocking jobs actually declare the resources that they need, such that they can become Guaranteed quality-of-service pods.
A
So when pods have non-zero resource requests for memory and CPU, and they have limits that match, the scheduler sort of guarantees that wherever it schedules that pod, it is guaranteed to get those resources. It's also going to be the highest-priority pod, the one that's left on a node as long as possible should there be any reason to evict pods off of it.
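[Note: a pod is placed in the Guaranteed QoS class only when every container sets CPU and memory requests exactly equal to its limits. A minimal sketch with made-up values, not one of the actual job configs:]

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-ci-job        # hypothetical name, for illustration only
spec:
  containers:
  - name: runner
    image: registry.example.com/test-runner:latest   # placeholder image
    resources:
      requests:
        cpu: "4"              # requests and limits match exactly...
        memory: 16Gi
      limits:
        cpu: "4"              # ...so the pod is classed as Guaranteed
        memory: 16Gi
```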
A
So a PR just landed yesterday afternoon to make a best guess at what the CPU and memory limits might be for release-blocking jobs. Those would be the jobs that run on the sig-release-master-blocking dashboard, the jobs that run periodically.
A
We would enforce this, and all of the policies we're suggesting here, with the tests that run against job configs submitted to the test-infra repo. So you wouldn't be able to land a job config and have it be on a release-blocking dashboard unless it actually had the appropriate resource requests and limits. Our suspicion... well, let me stop here. Does that sound reasonable to everybody?
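[Note: a rough sketch of what declaring resources looks like in a Prow periodic job config; the job name, interval, and image here are placeholders rather than a real job:]

```yaml
periodics:
- name: ci-example-e2e                  # placeholder job name
  interval: 6h
  annotations:
    testgrid-dashboards: sig-release-master-blocking
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest   # placeholder tag
      command:
      - runner.sh
      resources:
        requests:                       # without these the job pod is best-effort
          cpu: "4"
          memory: 16Gi
        limits:                         # matching limits make it Guaranteed QoS
          cpu: "4"
          memory: 16Gi
```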
A
I will show you just real briefly how I took the guesses on the release-blocking pods. So this cluster is a GKE cluster; it's set up in Google Cloud, and with that we get Google Cloud monitoring and logging, based on the metrics that I got out of the box with the built-in monitoring.
A
I'm
not
saying
I
I
liked
this.
This
was
still
very
manual,
but
I
tried
to
see
if
I
could
gather
metrics
that
showed
like
what
a
given
job
was
using
in
terms
of
cpu
or
memory
usage
over
time.
Let's
see
if
this
still
actually
works.
A
I wanted to show this and then say that I'm using this with an account that hasn't even activated its free trial of Google Cloud, and the reason I'm getting away with this is because I'm a member of a group called k8s-infra-prow-viewers, which will give you read-only access to all of the things that are necessary for running Prow over in the WG K8s Infra land.
A
But
I
can't
offer
this
capability
for
the
proud
build
cluster
that
runs
in
google.com,
but
we
can
happily
provide
this
to
anybody
who
wants
to
work
with
the
publicly
funded
and
available
crowdfill
clusters.
This
is
the
cluster
I've
been
moving.
This
is
why
I've
been
trying
to
move
all
the
release
blocking
jobs
over
to
that
cluster.
A
So
if
you
want
to
see
what
I'm
seeing
I'm
open
to
anybody
pr
themselves
into
this
group
so
that
they
can
see
this
level
graph,
which
shows
that
the
kind
ipv6
job
is
using
roughly
whatever,
like
14
gigs
of
ram
at
its
peak,
at
least
as
as
far
as
these
metrics
think,
I
don't
know.
If
these
are
true,
I
don't
know
how
to
trust
them
and
I'm
super
open
to
suggestions
on
other
ways
to
collect
this
kind
of
data.
But
that's
the
best
answer.
I
have
to
alvaro's
answer.
G
So could I ask you, Aaron: how would people request that access that you're talking about there?
A
So, like I said, this is only available for the kubernetes.io cluster, which only runs periodics and release-blocking jobs. You get no visibility into what's going on with pre-submits. It's why I think we should be incentivized to try moving the rest of the release-blocking jobs over to that cluster, and also moving the merge-blocking pre-submits over to that cluster, or another cluster dedicated to it.
A
I'm
open
to
suggestion
from
folks
whether
they
think
this
should
be
a
dedicated.
You
should
have
like
a
very
clear
cluster
for
periodics
and
then
another
one
for
birch
blocking
I'll
caveat.
All
of
this.
That,
again,
all
this
infrastructure
is
in
a
place
where
the
community
is
able
to
run
this,
and
so
I
would
really
like
to
see
other
people
involved
in
spinning
up
these
clusters
using
the
code
and
infrastructure
that's
publicly
available
there.
A
Ideally,
at
the
end
of
all
that,
we
should
have
a
cluster
that
is
appropriately
scaled
and
sized
for
all
of
the
release
blocking
jobs
in
the
event
that
that
doesn't
seem
to
be
enough,
or,
as
we
continue
to
move
things
over.
We
think
that
we
should
also
start
to
enforce
some
better
accountability
and
hygiene
on
all
of
the
other
jobs
for
all
of
the
other
repos
or
all.
A
For
kubernetes
that
are
not
required,
merge,
walking
jobs
or
they're
in
their
periodics
that
just
happen
to
be
informative.
We
think
every
job
should
have
contact
info
associated
with
it.
This
is
something
we
I
mandated
a
while
ago
for
the
release
blocking
jobs
so
that
everybody
would
get
emailed
when
they're
when
their
jobs
broke
and
then,
in
the
interesting
case
of
jobs
that
run
like
a
bunch
of
tests
owned
by
a
bunch
of
different
cigs.
The
release
team
has
sort
of
handled
that
with
the
ci
signal
team.
A
There's
part
of
me
that
wishes
that
less
toil
was
involved
with
that,
but
I
think
that's
also
been
a
relatively
effective
model
for
those
catch
all
jobs
for
jobs
that
are
not
catch-alls
and
are
owned
by
specific
sig
use.
That
states,
I
don't
know
group
as
the
email
address,
or
maybe
they
have
a
dedicated
group
for
alerts.
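[Note: in Prow job configs, that contact info rides along as TestGrid annotations on the job; a minimal sketch, with placeholder names and addresses:]

```yaml
periodics:
- name: ci-example-periodic             # placeholder job name
  interval: 24h
  annotations:
    testgrid-dashboards: sig-example                       # dashboard(s) the job reports to
    testgrid-alert-email: sig-example-alerts@example.com   # who gets mail when it breaks
    testgrid-num-failures-to-alert: "3"                    # consecutive failures before alerting
  spec:
    containers:
    - image: registry.example.com/job-image:latest         # placeholder
      command:
      - runner.sh
```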
A
But
the
idea
is
that,
ultimately,
if
you're
going
to
use
community
resources
like
crowd,
you
should
have
a
point
of
contact
and
then,
if
yeah,
so
we
know
that
alerts
are
being
sent
to
you.
You,
the
point
of
contact
is
being
alerted
that
the
job
is
continuously
failing.
If
the
job
still
never
gets
back
to
a
good
state,
then
we
should
probably
just
go
ahead
and
disable
those
jobs.
So
we
free
up
resources
and
stop.
A
Just to give you a taste of what the state of things is like: this used to be visible on a dashboard, but we do have a query that runs every day, and you can see, for example, that the ci-kubernetes-node-kubelet-serial job has been failing continuously for 934 days in a row. And, not to single out node, there are a bunch of jobs owned by a bunch of different SIGs that do a bunch of different things that have been continuously failing.
A
So I'll stop reading to everybody. Personally I'm more interested in this being a bit of a dialogue. I'm sure there are lots of people here who have opinions or suggestions on what we could do next, so I'm interested in hearing first if everything that's proposed here seems like a good, reasonable first step.
A
Third, when Jordan mailed out this proposal, he talked about the idea that, you know, maybe we shouldn't open the main branch of development back up for anything-goes. Maybe our focus after 1.19 should be continuing to address this test and CI signal pain until we feel like it's actually been properly addressed, as opposed to, you know, having a great angst for like a week or two and then everybody kind of moving on and spamming retests.
H
Hi Aaron, I have a quick question about all the failed tests you just showed us, like the tests that have been failing for several hundred days, several years maybe. So are those the ones that are triggered when a PR is submitted?
A
Some are, some aren't. The naming convention that we stuck to was: if the job has "ci" in front of it, it's most likely a periodic, and if the job name has "pr" or "pull", then it's a pre-submit. The majority of these look like CI jobs, but there are also some failing pull request jobs.
A
They're defined in the test-infra repo; all of the jobs for all of the 150-plus repos across the six different Kubernetes orgs live in the test-infra repo.
H
Okay,
okay,
so,
based
on
my
understanding,
I
think
my
suggestion
here
is
that
we
should
make
a
list
of
the
long
failing
tests,
and
maybe
volunteers
can
get
work
on
some
of
them
and
can
see
there
could
be
batches
and
we
can
see
how
much
resource
we
can
free
up
by
doing
the
most
by
fixing
the
most
frequently
fading
ones.
A
I
I
I
like
that
idea.
However,
like
I
personally,
I
would
love
to
get
a
little
more
confidence
about
like
what
actually
is
the
resource
usage
here,
and
what
would
we
free
up
by
doing
this?
There's
also
a
part
of
me
that
almost
wants
to
default
to
you
like.
If
it's
been
failing
this
long
and
nobody's
noticed,
is
it
really
that
worth
bringing
back
from
the
dead
like
it?
It
may
very
well
be
worth
a
sanity
check.
I
know.
F
The
other
thing
is
that
this
has
been
periodically
brought
up,
that
we
have
jobs
telling
us
long,
and
yet
we
have
ones
that
have
now
reached
something
like
three
years.
I
think
the
other
thing
to
remember
is
that,
even
if
we
delete
a
job,
that's
not
by
any
means
permanent
everything's
recorded
and
get
if
someone
comes
back
later
and
thinks
that
this
was
valuable,
they
could
run
in
the
effort
to
bring
it
back.
But
just
as
a
matter
of
policy,
we
shouldn't
leave
things
failing
for
this
long
and
just
running
and
wasting
resources.
A
So
question
link
to
the
list
that
was
shown.
The
link
is
in
the
proposal
document
which
you
can
get
to
from
the
sig
testing
meeting
notes.
It
also
lives
in
the
testing
for
repo
under
directory
called
metrics.
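[Note: each config in that metrics directory pairs a BigQuery query over the scraped job results with a jq filter that shapes what gets published. Roughly along these lines; the metric name, query, and filter below are illustrative stand-ins, not copied from a real config:]

```yaml
# Sketch of a test-infra metrics config; values are illustrative.
metric: long-failing-jobs
query: |
  -- stand-in query over the public job-results table
  SELECT job, COUNT(*) AS runs
  FROM `k8s-gubernator.build.all`
  WHERE result = 'FAILURE'
  GROUP BY job
jqfilter: |
  [.[] | {job: .job, runs: .runs}]
```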
A
I think, to the point made earlier: any time this continuously-failing-jobs thing has been brought up, nobody has thought to make a giant umbrella issue with a huge checklist and a clearly defined set of work items, such that people go off and do it. We have noticed that that does actually work for simple mechanical things like golint and shellcheck failures and stuff, but I feel like, a, this work is less mechanical, and b, this isn't about one-time fixes.
C
This
is
also
the
reason
that
the
permanently
failing
job
item
was
after
the
contact
info
item.
I
I
suspect
that
a
fair
number
of
these
jobs
are
failing,
because
nobody
knows
that
they're
running,
and
so
the
goal
is
not
to
disable
jobs
that
are
actually
providing
useful
signal.
It's
to
provide
to
disabled
jobs
that
are
not
providing
useful
signal
that
nobody
is
really
looking
at
or
paying
attention
to.
F
I want to point out very quickly that we've talked a lot about resources and we've talked a lot about failing tests, and those are not always necessarily related. While this will all be super useful, getting things to a healthy state will require another series of steps after this: once we have more confidence in the CI, we're going to have to look at the tests.
F
There
are
cases
where
we
totally
fail
to
run
the
test,
or
something
like
that.
There's
also
a
lot
of
flaky
tests
that
are
just
the
tests
of
flaky
or
some
piece
of
kubernetes
isn't
quite
reliable
enough
or
something,
and
that
probably
will
need
its
own
proposal
right
now.
It's
a
little
difficult
to
look
at,
though,
because
ci
is
such
a
mess.
C
Yeah,
I
I
have
a
long
list
of
follow-up
ideas
around
particular
tests
or
particular
ways.
We
could
analyze
the
end-to-end
and
integration
and
verify
test,
runs
and
identify
particular
problems
for
cigs
to
go
fix,
and
so
my
goal
is
to
get
good
enough
signal
at
the
ci
infrastructure
level
so
that
we
can
run
some
of
these
reports
and
then
fan
out
to
the
sig
owners
to
say:
hey,
look:
your
end-to-end
tests
are
consuming
40
of
the
end-to-end
run.
C
Please
reduce
that
to
this
much
either
by
moving
things
to
post
submit
or
by
speeding
up
tests
or
whatever.
Similarly
for
integration,
you
know
here's
a
package
that
takes
10
minutes
to
run.
Can
you
reduce
that
or
parallelize
things
or
collapse,
your
coverage,
but
yeah?
At
that
point,
I
think
we
can
fan
out
and
lean
on
the
community,
like
the
sig
owners
to
drive
that
work.
G
So
so
so
jordan,
how
would
how
would
people
be
able
to
action
such
decisions?
Like
I
mean?
How
would
they
be
able
to
see
or
how
would
you
propose
that
that
the
people
are
able
to
view
reports
indicating
resource
usage?
Is
that
something
that's
there
already
or
something
that
would
have
to
be
built?
I.
C
Mean
for
it
depends
so
for
simple
things
like
test
runtime,
you
can
go
look
and
test
grid
and
show
a
graph
of
how
long
every
test
took,
and
so
you
can
see
like
this
test
takes
eight
minutes
on
a
typical
run,
like
that
seems
excessive
for
a
pre-submit.
There
are
other
tests
where
you
can
look
at
the
graph
and
see
like.
Usually
it
takes,
you
know
half
a
minute,
but
sometimes
it
takes
10
minutes.
A
The other fun thing about TestGrid: it just does this by default, I guess, I don't know if this is a feature, but it sorts by the longest duration. So I can tell you right now that "Basic StatefulSet functionality should perform rolling updates blah blah blah" is the longest test that's currently in the default e2e suite, and it's hovering around six minutes, which is, yeah, actually a little bit over what we said.
A
Well, hopefully I know which SIG to notify, because that's in the test name, and hopefully that SIG has some group of people who are responsive to pull requests and issues related to, you know, test triage, whether that be correctly identifying tests as slow or flaky, or fixing test failures.
A
This
is
something
I
went
through
sort
of
last
december,
I
think
to
we.
We
kicked
out
a
number
of
really
really
slow
tests
and
properly
tagged
them
as
slow.
A
C
Yeah, I'll drop a link in the meeting notes to a doc that just had random ideas. I agree on wanting to automate this. As an initial effort, if people are wanting to help, identifying the top 10 or the top 20 or whatever issues and kind of getting those knocked out manually is great, but I want something where we actually can run a report and say: all right, for these omnibus jobs, what percentage of this job is being consumed by each SIG, and what do we think is reasonable? How long of a test do we want to allow in a pre-submit?
C
How
long
of
a
test
do
we
want
to
allow
per
package
for
for
integration
things
like
that,
and
just
say
what
are
our
goals
if
we
want
pre-submits
to
be
able
to
finish
in
30
minutes,
40
minutes,
30
minutes
20
minutes?
Maybe
what
do
we
have
to
do
to
get
there
and
and
then
having
tools
that
we
run
against
the
test?
Result
say
which
cigs
are
out
of
out
of
bounds
for
where
we
want
to
be
and
publishing
those
metrics.
C
So
I
think
for
sig
testing.
What
we
can
do
here
is
sort
of
figure
out
how
to
get
that
data
out
of
the
current
runs
and
the
current
test
grids
and
get
it
in
a
form.
That's
consumable
for
cigs
to
show
really
clearly
like
these
are
the
sigs
that
are
doing
good.
These
are
the
sigs
that
are
not
doing
good
and
then
that
can
be
input
both
for
those
six
to
prioritize
their
work,
but
also
for
enforcement
things
that
are
not
doing
good.
C
What
what
should
the
consequence
be
if
your
tests
are
unhealthy
and
you're
eating
more
than
your
fair
share
of
the
community's
time?
Should
you
prioritize
that
over
features?
I
would
expect
probably
so.
B
So,
if
there's
a
stated
kind
of
like
you
need
to
get
this
done
by
this
time,
and
it's
not
to
you
know,
make
their
lives
harder.
It's
just
because
it's
you
know
fair
to
the
community
and
also
you
know
they
should
probably
know
what
they're
testing
and
that
sort
of
thing.
So
I
think,
for
any
sigs,
just
having
kind
of
like
this
defined
threshold
for
different
areas
of
quality
makes
it
a
lot
easier
to
encourage
them
to
do
things
and
also
makes
it
clear
to
them.
B
F
C
I
I
think
I
think
it
looks
like
features
get
held
up.
I
that,
that's
probably
I
mean,
and
that's
not
really
like
a
sig
testing
specific
call,
that's
going
to
be
like
testing
and
release
and
architecture.
Maybe
I
don't
know
but
like
right
now
there
are
preconditions
to
adding
things
for
api
changes
like
you
have
to
have
a
proposal
you
have
to
have
the
it
has
to
have
a
migration
plan,
and
things
like
that.
C
I
think
it
is
reasonable
to
say
if
a
component
is
unhealthy
based
on
the
test
reporting,
it's
not
appropriate
to
be
adding
new
things
and
making
changes
to
that
component.
If
we
aren't
certain
that
the
current
function
is
working
properly,
then
we
shouldn't
be
slamming
more
things
into
that
area.
We
should
be
building
confidence
that
the
current
level
of
function
is
working
so
that
the
goal
is
not
to
be
punitive
to
like
actually
release
a
high
quality
thing.
F
Sure
I
just
one-handed
seems
like
that
probably
needs
its
own
discussion.
I
I'm
super
on
board
with
that.
E
C
A
Yeah, frankly speaking, I'm not sure either. I feel like KEPs and production readiness reviews are really great injections of human thought into the process, but I also feel like it's possible for a component to decay at a faster cadence than that, within a release cycle. So how can we appropriately... you know, it's the usual question of, well, how do we decide to revert an enhancement or whatever, that kind of thing.
C
I
I
think
the
first
thing
we
have
to
do
is
just
get
visibility
like
what
what
are
the
standards
and
if
sigs
aren't
meeting
those
standards
get
visibility.
To
that.
I
remember
in
some
releases
we've
had
ci
signal
reports.
You
know
kind
of
the
last
month
of
the
release
that
says
here:
here's
the
top
three
flakes
and
what
sigs
owned
them
and
whether
progress
has
been
made,
and
that
was
like
an
email
that
went
out
to
the
community
once
a
week
and
so
like.
C
On
the
one
hand,
it
adds
visibility
like
maybe
maybe
no
one
knew
that
it
was
going
on.
On
the
other
hand,
it
adds
a
little
bit
of
kind
of
you
don't
want
to
be
on
the
bad
sig
list,
I'm
okay
with
that.
Even
even
if
we
don't
have
a
super
clear,
like
iron,
clad
things
with
this
label
will
not
merge
and
there's
a
bot
enforcing
it
like.
Even
if
we
don't
have
that
story.
C
Yet
just
saying
like
here's,
here's
how
we've
sliced
up
the
resources
we
have
and
here's
how
cigs
are
consuming
those,
and
I
think
that
would
do
a
lot
to
move
the
needle
like.
If,
if
there's
an
email
going
out
to
the
community,
once
a
week
saying
these
things
are
out
of
slo
for
these
things,
I
think
that
would
help.
F
So
one
thing
I
feel
like
we
didn't
quite
get
yet
was,
I
guess
maybe
nobody
had
anything
to
say,
but
do?
Are
there
any
objections
to
the
things
that
are
actually
in
scope
for
the
current
proposal,
as
opposed
to
continuing
to
think
about
what
we'll
do?
Next
after
that.
A
I took a best guess for the release-master-blocking stuff, and I feel like perhaps I did you all a disservice by doing that. I probably did more than my fair share of taking a look at jobs and assessing what was reasonable, and so I feel like it should be on folks here to decide what limits and such they want to set for their pre-submits, and we should go through that together as a community.
A
That
would
be
one
suggestion.
Another
question
I
haven't
heard
raised
by
anybody
here
necessarily
is
like
what
is
the
definition
of
success.
A
We
proposed
a
couple
of
things
that
we're
planning
on
doing,
but
it's
unclear
to
me.
We've
decided
like
what
metric
or
means
of
measurement
we're
going
to
use
to
describe
how
awful
things
are
now
and
how
much
better
they
are
as
a
result
of
what
we've
done.
A
I can speak to that with a little more screen share, if you're interested. So I'm going to browse over to the kubernetes test-infra folder.
A
Eventually
and
I'm
going
to
click
on
the
metrics
directory,
and
so
what
this
is
is
most
of
the
test
results
that
land
in
buckets
that
are
visible
by
testgrid.com
also
end
up
getting
scraped
into
a
publicly
accessible
bigquery
database
that
anybody
is
capable
of
querying.
A
The job flakes query, the complete query, looks something like this. It took me about an hour and a half to two hours to really pick this apart, and it's not perfect, but it does show the flakiest jobs for a given week, and it tries to call out the flakiest tests. So here I can see that pull-kubernetes-e2e-gce has a consistency of 84, so 84 percent of the time it'll run appropriately.
D
Yeah,
I've
looked
at
this.
You
know,
particularly
during
the
past
few
weeks,
when
things
were
really
bad
and
these
numbers
were
like
80.
Something
and
you
know
we
had
jobs,
were
failing
50
90
of
the
time
it
seemed
like
it
to
me
right.
A
So
there's
there's
like
imperfections
in
this
process,
so,
for
example,
data
doesn't
actually
make
it
into
this
pipeline
unless
a
job
has
actually
finished
so
for
all
of
those
jobs
that
we're
failing
to
schedule
because
of
pod
pending
timeout,
which
we
think
is
resource
concerns
like
we
weren't.
None
of
that
data
makes
its
way
here.
This
only
really
becomes
more
useful.
A
What's
the
percentage
chance
of
you
actually
landing
your
pr
in
an
ideal
world
where
every
test
would
pass
to
begin
with
and
like
maybe
that
could
possibly
be
a
metric.
F
I
have
a
really
simple
one:
I'd
like
to
see
I'd
like
to
see
for
the
kubernetes
blocking
and
release
blocking
jobs,
because
that's
an
easy
set
to
say
like
those
as
opposed
to
I'm
not
sure
what
all
is
in
ci.
I
would
like
to
stop
seeing
pod
error
state
in
the
ci,
as
in
the
ci
failed
to
run
like
it
didn't
schedule.
F
It
didn't
finish.
The
init
containers
that
sort
of
thing
there's
still
the
like
the
tests
themselves
need
to
be
healthy
and
that
we
may
continue
to
see
regardless
of
this
work,
and
that
will
be
the
next
step.
F
But
we
can
look
at
the
data
that
is
served
and
browse
a
website
when
you
look
at
a
job
product
case
that
I
have,
and
there
is
a
state
that
gets
recorded
when
we
like
fail
to
schedule
a
pod
or
when
the
pod,
just
like
errors,
or
it
times
out
during
pending
those
things,
reflect
infrastructure
problems,
resource
contention,
inability
to
schedule.
D
Yeah, I agree, there are different sources of problems here. I was kind of moving back to the, you know, main point, which is: adding all of them up, we need to be concerned, but even more importantly, breaking them down so they can be diagnosed and fixed.
F
The
thing
I
want
to
see
is
those
gray
triangles,
so
the
circles
are,
I
think
we
stopped
running
it
because
there's
a
like
a
new
version
of
the
code
to
test
and
we're
moving
on
whenever
you
see
one
of
those
alert
triangles
and
gray,
that's
one
of
the
entries
where
it's
just
we
didn't
manage
to
run
it
successfully
one
way
or
another
at
an
infrastructure
level,
and
though
that
data,
that's
a
json
endpoint.
C
I think, yeah, so that was why I copied the SIG leads list into the thread: because there is going to be work for everybody to go do, in adding contact info to their jobs, and turning down or fixing perpetually failing jobs, and then diving into particular tests that are flaking and dealing with those. So getting the leaders of those areas involved now, to know this is coming. And, I mean, it shouldn't be a surprise; everybody knows everyone.
G
Just a question I have: is it possible to leverage OWNERS files in order to attach ownership and responsibility to tests?
C
Sometimes
sometimes
an
entire
package
is
owned
by
a
sig
for
unit
tests,
that's
more
likely
to
be
true
for
and
actually
for
e
to
e
tests.
That
was
done
as
well.
Most
of
the
ede
tests
live
in
a
package
that
is
sig.
Specific
integration
tests
are
sort
of
all
over
the
map.
So
that's
that
was
one
of
the
items
in
my
follow-up
suggestions
like
we
need
strong
association
of
tests
with
cigs
individual
tests,
not
a
whole
job.
A
It could be worth investigating something like that, because for each of these tests, we as humans can clearly sort of look at the sig-whatever prefix in the test name, but for things like individual integration tests, you might need some other kind of, sort of, manual parsing of that input.
G
Sorry
I
took
it
across
you
there
just
as
a
background
task
on
I've
been
kind
of
dawdling
on
us
on
working
on
bubbling
up
an
error
for
where
a
label
was
configured
via
owner's
files
and
as
part
of
that
work,
I've
been
doing
some
front-end
work
to
make
it
easy
to
look
at
those
files.
And
so
when
I
have
that
finished
and
skinning
up
say
the
flakiness
jason
that
you
you
have
there
coming
out
your
queries
and
making
that
presentable
and
reportable.
That's
something
that
I
could.
A
Anybody else have anything they want to say on this topic, or shall we call this basically agreement that we should move forward with this, and look to see some action-required emails coming out soon?
F
We
did,
but
it
wound
up
being
all
aaron,
which
is
something
we
want
to
avoid.
A
Well,
so
I
I'm
sort
of
trying
out
the
hot
qos
guarantee
thing
for
the
release
blocking
jobs
and
seeing
if
they
sort
of
flake
any
less
randomly.
I
think
it
would
be
cool
to
see
what
or
how
the
ci
circle
team
currently
takes
an
assessment
of
like
how
flaky
things
generally
are.
If
we
could
sort
of
see
like
did
that
substantively
improve
things.
A
If,
if
you
wanted,
you
could
go
ahead
and
start
printing
in
resource
requests
and
limits
for
the
precipitates
as
well,
and
we
could
just
all
find
out
together
what
happens
like
we're
theorizing
in
the
dock,
that
you
know
once
we
start
guaranteeing
resources
for
enough
jobs.
We
suspect
that
some
of
the
other
1500
jobs
for
some
of
the
other
repos
will
not
get
scheduled
for
random
reasons
or
might
be
kicked
off
of
a
pod
for
random
reasons.
I
I
was
thinking
like
okay.
If
I,
if
I
just
pull
a
number
out,
say
400
look
at
the
things
that
have
been
failing
for
400
plus
days.
A
number
of
things
stand
out
to
me,
like
hey,
we're
doing
scalability
testing
against
kubernetes
1.13,
except
it's
been
failing
for
over
400
days.
F
That's probably mislabeled there. Like, for example, there's a Google dashboard that I have on my agenda to remove that has things that are running against the, like, stable-1 tags or whatever.
F
That is also something that should probably be removed from CI entirely, for other reasons. We have some CI that got added at some point that's using the upstream tests and nothing else upstream; I was pretty against that, and now that it's not being maintained, I think we have a clear path to just remove it.
I
Oh, and to be fair, some of them are from my employer also, VMware, so I don't want to sound anti-Amazon there.
F
There
yeah
there's
this
google
dashboard
in
tesco.
That
has
a
bunch
of
nonsense
in
it,
but
I
I
took
a
look,
and
most
of
what's
in
there
turns
out
to
just
be
like
the
the
dashboard
isn't
very
good.
The
the
things
that
are
running
are
actually
not
running
because
of
that
dashboard.
They're
just
included
in
that
dashboard
as
well.
A
So I'm proposing, since it sounds like everybody here is in agreement with all of the policies we're proposing: once we get down to implementing the policy that contact info is required on all jobs, I'm proposing that by some deadline, if a job doesn't have contact info, nobody wants it, so we're going to get rid of it. So how long do we give people to claim their jobs?
G
I'd be inclined to suggest maybe four weeks, but maybe not the four weeks of August; maybe the four weeks of September. I would like to see it sooner, but I was thinking maybe, like, over the next two weeks: please get back to us or start doing this; otherwise, after the next two weeks, others are going to start removing things. I feel like I shouldn't just go out and start removing other people's things, but given a lot of time, if they're not starting to move, maybe I could sort of help by initiating a PR.
A
I
think
it's
reasonable
for
us
to
remember
that
humans
are
involved.
I
just
also
sort
of
feel
like
some
of
the
some
of
the
jobs
that.
F
So it's something we'll think about, sorry. To get people to put contact info on everything, I would give plenty of time. For "this job has been failing for hundreds of days", I would say we're deleting it, and if you're interested in it, you can come undelete it. We should just go ahead and move forward with deleting, and if someone's interested, it's really not a big deal to revert; removing something from git is just some YAML, no big deal, there's not going to be code conflicts or anything like that.
F
They're
separate
chunks
of
configuration,
and
we
can
just
notify
people
that
this
is
being
done
and
if
they
miss
that,
because
they're
out
they
can
come
back
at
it
later.
F
It
should
not
be
a
big
deal
if
something
that's
been
failing
for
900
days
is
gone
for
a
week,
and
then
someone
actually
wanted
it
back,
but
in
terms
of
like
action
required
things
that
give
people
more
time.
These,
I
would
say
we
already
had
action
required.
You
shouldn't
have
left
it
failing
this
long.
A
Okay,
well,
that
puts
us
that
puts
us
over
time.
I
really
appreciate
everybody
showing
up
and
again
all
questions
comments
and
concerns
on
how
to
improve
this
situation
are
welcome.
This
is
only
gonna
work
if
we
all
put
in
the
work.
A
So
that's
a
happy
tuesday.
Everybody
thank
you.
A
Hi, Howard. Can we get to your agenda item in two weeks? Sorry, it wasn't scheduled ahead of time. So if you catch Ben and I offline, or on Slack, maybe we can chat about it in the SIG Testing channel.