From YouTube: Kubernetes SIG Testing - 2020-08-11
A: Hi everybody, today is August 11th and you are watching or attending the Kubernetes SIG Testing bi-weekly meeting. I am your host, Aaron.
A: We've got a couple of demos to show, and then I wanted to have a follow-up about the Kubernetes CI policy discussion we had last week. So first up: is Howard Zhang here, or is anybody here able to speak to having the Prow pod utility images on ARM?
C: I was building some a while back for the Prow images themselves, maybe not the pod utility ones, but for Raspberry Pi.
A: Okay, this was something we didn't have the opportunity to get to last meeting, and I think it was handled on Slack offline. But if not, we can come back to it if they show up to the meeting today. Next, there's an entry here — I don't know who this is from — about Triage Party and the Kubernetes PR dashboard.
D: Yeah, that was me. We don't have to cover it if we have other things, but I had talked to the Triage Party folks, and Ben, just about maybe making it a drop-in replacement for the PR dashboard, since Gubernator is getting older and older and less used.
A: That would be really cool. I know Arno has been involved in setting up a Triage Party instance for the release team. Arno and George, do either of you have anything to say about how Triage Party is going for y'all right now?
B: I can also describe that.
B: Yeah, so I spoke to Thomas last night. I basically did an audit of what the Triage Party setups for minikube and Skaffold are monitoring and how they display information. Basically, you can set different time limits for what you want to watch — PRs, comments, etc. — so it actually helps you organize the overall bucket of work into those categories so that you can take action on PRs, issues, and so on, and then you can also have tabs for specific types of issues.
B: So if there are, say, tools where you want to just monitor the issues on a specific tool, you could probably set up a tab to do that for the release team. That work is basically to figure out how we can prioritize the work, not just in terms of urgency; we're also looking at impact — what is the highest-value work that we could possibly do — and that's a discussion that we have to have.
B: The tabs enable you to do that, so you can watch different themes in one place. We haven't used it yet in meetings, because we have a lot of things in the backlog to sort through, but once that is in a reasonable place, I'm basically dropping a roadmap for how we would use Triage Party, and then what types of labels we would need to use it as best we can, including those impact labels.
A: So maybe Grant knows more about this, but the question I have is whether it would be possible to tailor that dashboard to an individual. To speak briefly to what Gubernator's PR dashboard does right now: I can go to gubernator.k8s.io/pr — I can hold myself accountable here and do this live, I guess — and it'll show me what PRs are on deck for me to review. So if I go there...
A: That's me, spiffxp. I have apparently 18 pull requests that need my attention, and it does this with a little bit of a state machine that mostly looks at PRs, the labels that are applied to them, and who has most recently commented on them. The thing is, it's personalized to me. This isn't me looking at SIG Testing's review queue; this is looking at my individual PR workload, and this seems easier for me to manage than GitHub's.
A: So it's unclear to me whether Triage Party supports that kind of workflow, and I don't know how many people use this — I kind of live or die by this PR dashboard, for what it's worth. One of the reasons I think Grant's been looking at it is because the rest of Gubernator we basically don't touch anymore, and so I guess it's kind of trying to figure out if we want to rewrite this or if we can use Triage Party.
D: I just synced up with Thomas about this and I created an issue, and he said they were planning to try to do something user-centric, and I was hoping to get to that work. So I was going to put a PR up against Triage Party to add OAuth with GitHub and have this user-centric dashboard, and then I was also talking to him, on Ben's recommendation, about multi-tenancy.
D: So we don't have to have multiple Triage Party instances per team. We could have a single instance with multi-repo and multiple tenants. It'd just be easier to have everything in one place and be able to see SIG Release, SIG Testing, and so on, each with their own Triage Party within the same instance.
A: That could definitely be useful. Okay — augmenting Triage Party sounds like a really neat idea, and maybe a useful way for us to align our efforts.
A: Next up on the agenda: Shane presented a design doc to us a little while back about a secret rotation thing he's been working on to help us manage Prow secrets, and Shane wanted to present a demo today. Shane, are you available?
E: Okay, so I'll get started. Hello everyone, I'm Shane, and today I'm going to talk about my project: integrating Kubernetes Secrets with Google Cloud Secret Manager. I'm going to include a live demo in this presentation, so I'm just going to start it right now, because it's going to take some time to run. Okay. So, let's start with the background. We utilize Kubernetes to facilitate the CI process, and within these processes, sensitive data is needed, stored as Kubernetes Secrets, and then consumed by whichever pods need it.
E: These secrets need to be securely managed, and also rotated. What we propose here is to incorporate Google Cloud Secret Manager and have it work as an independent secret-managing platform with a separate lifecycle from any given Kubernetes cluster. So, in the case that the cluster dies for any reason, these secrets can still be restored; and in the case that a secret gets stolen at any point, it will be invalidated after a while, if the secret is being rotated.
E: This is the basic architecture of the project. It is an integration of Google Cloud Secret Manager and Kubernetes Secrets. After being provisioned, a secret will be managed by Secret Manager, and then a secret rotator will be responsible for rotating this secret. Another component, the secret sync controller, will be continuously syncing from Secret Manager to Kubernetes, so that within the same cluster, a pod can consume the latest version and the currently valid key.
E: Suppose, for example, that it's the currently valid service account key, if this is applied to service account keys. So the first part is the secret rotator, and it performs two actions. The first is refresh, and it happens when the last version is too old — that is, it was created too long ago. Then it tries to refresh this and create a new version of the secret, and after it has done so, it will try to invalidate all of the out-of-date versions of the secret.
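To make those two actions concrete, here is a minimal sketch of that refresh-and-invalidate cycle in Go. This is not the project's actual code; the type and function names, and the newest-first ordering of versions, are assumptions for illustration only.

```go
package rotator

import "time"

// Version is one version of a managed secret. Hypothetical type,
// standing in for a Secret Manager secret version plus its metadata.
type Version struct {
	ID        string
	CreatedAt time.Time
}

// Rotator refreshes a secret when its newest version is older than
// RefreshInterval, then invalidates superseded versions once the
// grace period has elapsed.
type Rotator struct {
	RefreshInterval time.Duration // the demo used 2m30s
	GracePeriod     time.Duration // the demo used 1m

	CreateVersion     func() (Version, error) // provision a new key
	InvalidateVersion func(id string) error   // delete/deactivate a key
}

// RotateOnce performs a single pass: refresh if needed, then sweep
// stale versions. versions must be sorted newest-first; the still
// active versions are returned.
func (r *Rotator) RotateOnce(versions []Version, now time.Time) ([]Version, error) {
	// Refresh: the newest version was created too long ago (or none exists).
	if len(versions) == 0 || now.Sub(versions[0].CreatedAt) > r.RefreshInterval {
		v, err := r.CreateVersion()
		if err != nil {
			return versions, err
		}
		versions = append([]Version{v}, versions...)
	}
	// Invalidate: older versions coexist with the newest one until the
	// grace period, measured from the newest version's creation, is over.
	latest := versions[0]
	kept := []Version{latest}
	for _, v := range versions[1:] {
		if now.Sub(latest.CreatedAt) > r.GracePeriod {
			if err := r.InvalidateVersion(v.ID); err != nil {
				return versions, err
			}
			continue
		}
		kept = append(kept, v)
	}
	return kept, nil
}
```

Run periodically, this reproduces the behavior seen later in the demo: key three and key four coexist for the one-minute grace period, then key three is deleted.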
E: One thing to note here is that we utilize the metadata of the Secret Manager secrets to help the secret rotator. This is the information the rotator needs to rotate a service account key: the first two fields are the names of the project and the service account for the service account key, and the last field contains the service account key IDs for every active version.
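As a sketch, that metadata might look like the following Go struct. The field names are illustrative guesses at the three fields described, not the project's actual schema.

```go
package rotator

// RotatedKeyMetadata is the bookkeeping stored alongside the Secret
// Manager secret itself, so a restarted rotator can resume where it
// left off. Field names are illustrative, not the actual schema.
type RotatedKeyMetadata struct {
	// Project is the GCP project that owns the service account.
	Project string `json:"project"`
	// ServiceAccount is the service account whose keys are rotated.
	ServiceAccount string `json:"serviceAccount"`
	// ActiveKeyIDs lists the service account key IDs of every
	// version that has not yet been invalidated.
	ActiveKeyIDs []string `json:"activeKeyIDs"`
}
```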
E: Okay. The second part is the sync controller, and what it does is pretty simple: it just continuously syncs from the source secret to the destination secret. The source secret is a Secret Manager secret, which is likely also being rotated, and the destination secret is the desired Kubernetes secret. We're syncing at the key-value-pair level of a Kubernetes secret; so, for example, we're syncing this Secret Manager secret to the value corresponding to this key of this Kubernetes secret.
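The loop described here reduces to "read the source, compare, write the destination key if different". A minimal sketch, again with hypothetical adapter functions rather than the project's real API:

```go
package secretsync

import "bytes"

// SyncOnce copies the latest value of a Secret Manager secret into one
// key of a Kubernetes Secret, if it differs. The three adapters are
// hypothetical stand-ins for the Secret Manager and Kubernetes clients.
func SyncOnce(
	readSource func() ([]byte, error), // latest Secret Manager version
	readDest func(key string) ([]byte, error),
	writeDest func(key string, value []byte) error,
	key string,
) error {
	src, err := readSource()
	if err != nil {
		return err
	}
	dst, err := readDest(key)
	if err != nil {
		return err
	}
	if bytes.Equal(src, dst) {
		return nil // already in sync, nothing to do
	}
	return writeDest(key, src)
}
```

Running this continuously is what lets a pod mounting the Kubernetes secret pick up each rotated version shortly after it is created.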
E: Okay, let's talk about the setup for this live demo. Today we use ConfigMaps in Kubernetes to specify the configurations for both the secret rotator and the secret sync controller.
E: In this example, the secret rotator is rotating this secret-one, which is a service account key for this service account, and it has a refresh interval of two minutes and 30 seconds with a grace period of one minute, and the sync controller is continuously syncing.
E: What the mock consumer does is continuously watch the mounted Kubernetes secret — the service account key, in this case — and whenever an update is observed, it makes a copy of the current service account key into another directory, so that it always has a full collection of the entire history of this secret. So it will have every version of the service account key in that other directory.
E: Then it continuously pings all versions of the service account key and checks whether each is valid at the current time. Okay, so let's see the results. First, we get the pod that is running the mock consumer — it's under namespace 8 — and we get the log for it.
E: Okay, let's start from here. At this point in time there is only one valid key — the third key, key three — and it is the only active service account key. Then, after a while, the rotator tries to refresh this. What it does is first create a new version of this key, which is key four.
E: Now the two keys are both active; they're coexisting, which corresponds to the grace period of key three. When the grace period for key three is over — which happens within a minute here — key three gets invalidated, or in this case deleted: that service account key is deleted, leaving only key four as the currently active service account key, and the rotation just keeps going and going.
E: One more thing we can demonstrate here: if we stop the secret rotator now — so if the secret rotator is terminated for whatever reason — we can see the current status of the rotated secret. The information needed by the rotator is stored in the metadata of the secret, so that whenever the rotator is restarted after being terminated, it can continue.
E: It can continue to rotate the secret seamlessly. So we can just restart the secret rotator, and because it's past the refresh interval, it will create a fresh service account key and come back to deactivate the old version four, picking up where it left off when it was terminated.
E: That is all for the demo. To sum up: we have Google Cloud Secret Manager securely managing the secrets that will be used in Kubernetes; we now support rotation for GCP service account keys, and with secret rotation we can mitigate the harm in case any secret is stolen at any point in time; and by synchronizing from Secret Manager to Kubernetes, the pods in Kubernetes can consume these rotated secrets and get the latest and currently active versions of each secret.
E: Okay. Special thanks to my host Aaron and the Kubernetes EngProd team. This is the repo for this project. Thank you all for your attention; I'm happy to take any questions and input.
A: Oh, Shane, I'll work with you offline to get a link posted in the meeting notes, and we'll post a link in the chat or on Slack here.
A: Yeah, cool. Okay, Shane, do you mind... all right, there we go. Next up, we have Michael Colbert here to talk about rewriting triage in Go. I think we do... yeah, there you are, masked up. I like it.
H: I hope you can all hear me. If you can't, just let me know and I'll take off the mask. Okay, good.
H: Okay, so I presented a few months ago regarding rewriting triage in Go.
H: Because this is meant to be an in-place port, like a drop-in replacement, the best outcome is that you can't tell the difference between them.
H: Ideally. So on the left-hand side you can see the Python triage; it's currently running at its regular link, which is actually go.k8s.io/triage. On the right-hand side it's the same URL, but instead of triage you just change it to triage-go. I actually should be porting this over — so that instead of having this under triage-go, it's just under triage — by end of day today.
H: Essentially, you can see that the numbers of test failures in each instance are coming out the same. There are a few that don't exactly agree — like here it says 4078 and the old one says 4082 — but overall that's just going to be because of some time windows that are slightly different between the Python and Go versions. If you scroll down a little bit, you should see that they start to match up pretty much exactly.
H: Yeah, there you go. And one thing that we did add in the new version is that filtering clusters by SIG now works; the old one doesn't really do anything.
H: If you select, for example, api-machinery here, you only get clusters that are mostly related to API Machinery, based on the name of the test — as you can see, most of these happen to have sig-api-machinery. In addition, we did update these clusters over here — these are SIGs over here — before, we had some outdated SIGs.
H: That's it in terms of the UI. In terms of the time taken for the jobs, you can see here that the old ones were taking around 50 minutes to an hour, and now they're taking much less, around half an hour. If we actually want to compare them, we'll see that the average over the last — whatever this is — 20 jobs or so is 51 minutes for the old one and 31 minutes for the new one, which is a great improvement.
H: In addition, it's hard to compare the graphs because they're scaled differently, but you can see the high points over here: the longest ones take 75 minutes for the old version, more or less. And then the longest new ones — these are failed runs, but for the normal ones the longest it takes is 47 minutes. So even the longest jobs are taking a significantly smaller amount of time. I think that basically concludes what I have to show. Any questions?
H: I believe we are ready to migrate. I'm working on a PR to migrate right now, and that should be done by end of day.
A: Great. And as I understand it, this is the optimization we got just from rewriting in Go; there are further optimizations we may be able to make to update the triage results even more frequently.
H: That is correct. There are some that are easy and should provide meaningful performance gains, and some that are a bit harder and more tenuous in terms of how much of a performance gain they will bring.
G: One other really important thing that didn't come up here is that the previous implementation was minimally commented, very dense, and undocumented — the README said TODO. There is now a wonderful, detailed README outlining everything, and fairly well-commented code. Thanks for that.
H: There are no speed changes in the UI. I didn't touch the UI, mainly because it was supposed to be a drop-in replacement. In general, I don't think that the UI is particularly slow — you know, it's not ideal, not particularly snappy, but I don't think it's particularly slow. If you have, like, any specific issues...
H: It already is pretty efficient; the person who wrote this before me went to a lot of pains to actually make it pretty efficient. You can see, like over here, there's not so much data, and then as you get to the bottom it loads more. So it does implement lazy loading, and it's in a compressed form so that the data gets transferred more quickly.
A: Well, cool! Thank you. I don't know about everybody else here, but personally, when I'm trying to figure out whether I fixed something or not, this is the number one tool that I use, and so it was frustrating to me when I had to wait, you know, one to three hours to see if the PR fixed it.
G: This is also the way I find flakes, meaningfully. Testgrid is great for telling me if a job is passing or something, but this is really useful for telling me that some specific failure mode is popping up in our jobs, especially across jobs. Having that latency meant it was difficult to know if changes that we suspected might introduce issues — like upgrading a container runtime or something — were actually introducing issues.
H: Yeah, sorry — there shouldn't be anything noticeable.
A: All right, thanks for your time, Michael. Okay, next up, let's talk about Kubernetes CI policy.
A: Laurie, I feel like you've been doing a lot of organizing of stuff. Did you want to walk people through what's been set up in terms of organization?
B: Sure. I don't think I can share the project board without hosting, but maybe you can pull it up. It's fine if you want to just do that — it'll be easier. Give me a moment... sure.
A: That sounds fine to me. I mean, I think for some of the work that we're going through right now, it has a pretty fixed workflow.
A: There may be other things where there are pull requests that we feel are relevant to this, that we want to toss on this board to get wider attention, and maybe they'll become relevant at that time. But if we're not using them now, we may as well get rid of them.
B: Right — not right now, but what Aaron's saying is in the future they might, so we can keep them for a while. All right, so let's just wait and see what happens there. And then the monitoring column is something Rob suggested — that's like our parking lot to see how the jobs are performing. So my question there is: what might be the timeline for taking action to move those to the next step? What do you need to do and find out? How much longer do you need to monitor?
A: I feel like we've probably had sufficient time. These are sort of questions I maybe have for the broader group: we still haven't actually defined metrics that are measurable, that say success, yes or no. I have some things linked in the talk that I could make as suggestions, but my general sense is: if we make the changes and the CI signal folks feel like everything's about as stable as it has been, maybe even more so...
I: Yeah, I definitely agree with that. Part of the point of introducing these resource limits and requests is that when we do experience flakes or failures, we're able to pinpoint why they're happening. So theoretically, as the tests that run as part of these jobs change over time, the limits and requests could need to change as well. As far as actually adding these, I think we are done, and monitoring, in the broad sense, is just going to continue to happen over time.
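For context, the limits and requests under discussion are the standard Kubernetes resource requirements on each job's pod. A minimal sketch using the real k8s.io/api types; the specific CPU and memory quantities here are made-up examples, not values set on any actual job.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Setting requests equal to limits is what gives a pod the
	// Guaranteed QoS class mentioned later in the meeting, which
	// makes scheduling and autoscaling behavior far more predictable.
	reqs := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2"),
			corev1.ResourceMemory: resource.MustParse("4Gi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2"),
			corev1.ResourceMemory: resource.MustParse("4Gi"),
		},
	}
	fmt.Println(reqs.Requests.Cpu(), reqs.Requests.Memory())
}
```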
A: I'm okay if the people who are currently assigned to these issues go through and say, yeah, I've been watching it for a couple of days and it looks okay. I had planned on sweeping through these at some point myself, but my schedule's been pretty booked. If people are waiting for some kind of signal to say whether we should call these done now, I think we've reached that point.
J: Yeah, I think we can work through the list and start closing them off, and if anything needed to be reopened off the back of this change, they could be reopened. The intention there was just to shorten the in-progress column — just to show that the work had been done and that we were just keeping an eye on it.
A: My impression is many of the "make critical jobs Guaranteed pod quality of service" issues can probably be moved over to the monitoring column. As far as I'm aware, almost every single one of these has had a pull request merged, so it may just be a matter of clicking through and seeing if the associated pull request has been merged. Yeah, seems like it, so I'll move this one over, but I think people who are assigned to these issues can totally move them over — or Laurie can.
A: If you want to sort of groom the board that way, that could also be helpful.
A: Sure. I would probably suggest other members of the CI signal team can keep track of this on his behalf.
G: Yeah, I think if they have a PR merged, we move them to monitoring, but there are some where someone has opened a PR and we haven't reviewed it yet. Most of them are moving; a few are stalled on a review or the author being out. As you mentioned, I know I have a few stalled just because I've mostly been load-shedding reviews and sorting that out, and didn't actually get many reviews done for a couple of days.
B: All right, cool. And then in terms of the metrics, Aaron: what's your plan? What would you like your next step to be for getting those?
A: So this is the umbrella issue that resulted from our discussions two weeks ago, which tries to represent what we feel are the three most critical pieces of action that need to be taken in order to mitigate this situation. Then we need to figure out how to define what success looks like and what our metrics are. I've had a difficult time creating issues to represent these, because I can't describe them as "this is the metric you should go measure and implement"; it's more like we need to go do a bunch of discovery.
A: There might be a lot of false starts — that's at least what I've been encountering personally. And then these were the three other, lower-priority things that we discussed doing afterwards: things like mandating that every single prow job has contact info, then removing perma-failing jobs, starting with the really egregious ones, and then figuring out, based on our experience there, how to create a policy to continue doing that going forward.
A: So let's see, back to the jobs. As far as where we're at with migrating existing jobs: this is a spreadsheet that should be accessible to Kubernetes staff and SIG Testing members. It's generated by a little report thing that I have — it's very manual, I kind of just threw it together and presented it to the group here. I'm filtering on every job that's in the sig-release blocking dashboard and also has pod QoS guaranteed set to true.
A: I'm picking up a couple of things that also happen to have the word "blocking" in them but aren't guaranteed, just because, like, sig-release's prototype master doesn't seem like an appropriate dashboard, nor release blocking. But this is just a sign that, yes — this would be my metric to measure whether we have done the things we said we would do.
A: Yes, we have. It'll take me a little bit more clicking around to do this for merge blocking; let me see if I can do that real quick.
A: But the intent is that this is more of a reporting mechanism we could use to figure out what our work is, and then there are tests that basically enforce this stuff — the tests start informative and then flip to blocking. Let's see here... and then kubernetes/kubernetes.
A: I don't even know why those are there. Ostensibly, these should be all of the pre-submits that run against the kubernetes/kubernetes repo and are also merge blocking. By merge blocking, I mean they either always run and are considered blocking, or they run if certain files change and they're blocking. So again, not sure why kubeflow is showing up here, but the gist is: I think there are still some jobs that we need to set resource limits for, and then hopefully we'll be done with merge blocking.
A: So if I filter down on the number of errors over the last seven days... actually, I'll back this up to the last 90 days. I went through and bumped the persistence on the Prometheus instance that this queries against — it was previously tuned to about four weeks — so that we'll be able to see things going forward.
A: We could take the number of jobs in an error state as some kind of signal that something bad is happening, and we could take the reduction of this number down to zero as a signal that we think we've stopped the badness. This corresponds roughly to when we were experiencing a lot of pain, and this corresponds roughly to when we have not been experiencing a lot of pain. You can also see it roughly corresponds to the volume of pre-submit jobs, which has also lessened over time.
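As a sketch of how that error-state count could be pulled programmatically: this assumes the Prometheus instance in question exposes Prow's prowjobs gauge with a state label; the metric name and server address here are assumptions, not details confirmed in the meeting.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address is a placeholder; point it at the actual Prometheus instance.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example.com"})
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Count prow jobs currently in the "error" state (assumed metric).
	result, warnings, err := promv1.NewAPI(client).Query(ctx,
		`sum(prowjobs{state="error"})`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("jobs in error state:", result)
}
```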
A: I wasn't sure if Daniel or Rob — either of you — sort of have a pulse check on the general stability of the jobs lately. How do you check that?
I: Yeah, so there's not a lot of stuff exposed in Spyglass about why something failed. In fact, until recently it would say a job failed even if it errored out — like with a pod scheduling timeout or something like that; that's been updated. One of the things that makes it hard to understand what's happening is that if a job has any sort of error — a pod scheduling timeout or anything of that nature — we don't get any of the podinfo.json, so that's not being published.
I: So it's pretty much impossible to get anything other than "there's a pod scheduling timeout"; you don't know the underlying reason for it. I'm working on adding some of that, so hopefully when folks see — especially on PRs — that something happened, they can go and it'll kind of hit them in the face whether it was a resource issue or something actually wrong with the change that they're submitting. Hopefully that will help to some extent.
B: I was just going to suggest that maybe some of you want to get together and hammer out the metrics that you think are worth taking a look at, or formalizing into "how we're doing". You'd also mentioned that there's some discovery involved — so maybe figure out what there is to discover, what the checklist of things to look into would be, and then maybe from that checklist we could break down the work a bit.
G: Just want to caution: I think a successful end state, where we've actually gotten everything happy, is relatively easy to measure and won't need much time to define, but trying to define how well individual things are working is going to be difficult. There are a lot of things that are hard to control for, like the load on CI as 1.20 development opens up, for example, or, you know, whether a job is consuming more than it needs, or...
G: Yeah, I think we can define pretty easily what it looks like when we've reached a healthy state and things are running well. I tossed out a metric around this previously that I think is good enough, just around the number of error-state jobs we have. I think we can lose a lot of time trying to define more granular metrics and probably not get super far with it. But it might be useful to look at trying to find the right trade-off in bin-packing tightly, and how we make that maintainable long-term, as in only requesting what we actually need.
I: Yeah, some of that can be kind of difficult without SIG involvement, because you have to understand exactly what a test is doing to understand whether it's consuming more than it should, or whether the resource limits and requests are where they should be. For instance, I know that a few of us have been looking at the verify test that was using wildly more memory than it needed, and previously the response to that was to bump the memory request higher, right?
I: "Oh, it needs more memory — let's do that." Actually investigating whether it should need that much memory is a different thing entirely. As we're shifting more responsibility, or motivation, to SIG owners, I don't want to also throw this in there, because that's a fairly significant ask, but in the long term I think it would be a reasonable and more sustainable path to have SIG owners be responsible for the amount of resources the tests that they run consume.
B: Okay, so there are two different approaches here — maybe you can combine them. There's the discovery side, and then there's also "we already know what we have to measure". So maybe the four of you would just want to hammer out what you want to do with metrics. Would that be possible, like in a Google doc or on Slack? What's your approach here?
J: Would you agree that it's tricky? On the face of it, when we stand in front of Testgrid, we know where we're at in terms of green and red, and what you're asking for is something that I think we might have to spend a few hours thinking about: can we create some definitions?
A: So my suggestion was going to be: I feel as though the forcing function for us will be moving on the next two tasks. One is migrating the remainder of the release blocking jobs to a publicly visible cluster in k8s-infra — this one here. I really wish I could share this out to community members so they could help with this for me, but unfortunately most of it involves google.com-specific stuff, so I've been working on chipping away at the things that I can.
A: Again, I kind of feel like that's something community members can't really actively participate in, but there we are with that. I'm also just trying to get prototype equivalents of these jobs running in community-visible infrastructure, so the community can see what kind of resources are applied.
A: The bigger list is going to be moving over all of the merge blocking jobs, or all of the pre-submits. There is at least an order of magnitude, if not two orders of magnitude, more traffic and resource consumption by these pre-submits than by the release blocking jobs, so we're really going to start stressing, or exercising, whatever capacity-related issues we might have with the k8s-infra build clusters.
A: The number of GCE projects has now been increased to 120-something in the last week, and this is to anticipate the load. If I look at the number of GCE projects in use by jobs on the google.com build cluster, it hovers somewhere under 100 and occasionally peaks around the 100 mark, so given what we already have as our load, we're trying to anticipate.
A: What's not as visible to the community is not quite as provable. So it's my intention to start teeing up individual tasks to migrate all of these over and tag them as help-wanted, and I could really use help from Rob and Daniel and whoever else has been involved so far in describing what kind of process we would go through to verify that the jobs look good. It may be a little herky-jerky.
A: For all I know, we're wildly underestimating capacity and we'll have to stop until we get more quota. But working on this will give us significantly more insight into whether the resource limits we have set are working appropriately. And then, because we'll start seeing the amount of money all of this costs, that can start to move us closer to the world where we talk about individual SIGs getting budgets and where they want to allocate their budgets, and the community may be able to find metrics that show how certain jobs may be over-allocating resources.
A: You know, like, it turns out they may be using 20% of the resources that they're asking for most of the time. So that's...
B: Okay, so I guess —
A: And I feel like the work that I have done to migrate things over for policy number two has given community members sufficient visibility to implement parts of policy number one. Like, Daniel's been able to go check out the verify job and see what the metrics look like for memory consumption and stuff. I feel like it's time that we give that same level of visibility to the merge blocking jobs.
A: In terms of me being a blocking factor, I feel like it's more important that I tee all this stuff up. And — we're over time, for what it's worth — it also just makes me uncomfortable, as a data-driven individual, that I don't have data showing that what we are doing is actually having an effect.
A: I can kind of see in some ways that it is having an effect — I can see autoscaling working now, based on the resource limits we've set for release blocking jobs — but I cannot verifiably point to a graph that is the smoking gun for the pain we were experiencing before, then point to a similar graph now and show how it is significantly better as a result of what we've done. That would be the holy grail, right? Okay.
A: So I'll try to maybe start an umbrella issue for that discussion, and we're at time. Anybody have anything else?
A: Okay, thank you everybody for your time. I appreciate you all showing up, and have a happy Tuesday.