From YouTube: Kubernetes SIG Testing - 2021-06-29

B
I'll just introduce myself, since there's nothing else on the agenda yet. My name is Trent; I'm from Salesforce, where I work on the CI/CD team. I'm just joining to listen in and see where I can jump in and help. I've got a lot of background in ops and Golang, and it looks like you all work on a lot of code things. So any suggestions on where I can jump in would be great, but I assume there's a lot on the board here.
A
Yeah, welcome, Trent. There's definitely a lot of need. One of the biggest things we've been dealing with is flakes — just tons of intermittent flaky tests — and a large part of that comes from the nature of a distributed testing environment, so it's a lot of timeouts and such. We're coming up on code freeze right now, but hopefully that will give way to drilling down into more reliability. Oh, there's Ben, hooray.

D
I guess I know Aaron was probably not going to make it, impacted by the Pacific Northwest heat wave this year; I would imagine the same is also likely for Steve. So welcome, everyone. Sorry I'm super late. This is the Kubernetes SIG Testing meeting for June 29th, 2021.
D
This meeting is under the CNCF code of conduct — essentially, be excellent to each other. This meeting is recorded and will be posted to YouTube at a future date.

D
So usually we have a hard cutoff for agenda items ahead of time, the day before — there's a calendar entry for that — and we would announce that the meeting was cancelled if there were no items. But we didn't have the template entry in the notes this time, so I thought we ought to just meet anyhow; given that there wasn't a good place to add items, people may have been confused.

D
So if you have any topics, please go ahead and add them to the doc now, so that we can at least go through them in some sort of order.
E
Thanks, Ben. Really, I just came in to see what was going on and what I should be involved in.
D
Great — yeah, like I said, we don't have an agenda today, so I can discuss a bit of that, but depending on what everyone else is here for or may be interested in, it might be better to follow up offline.

D
I'll just give a moment for any other topics.
D
Okay, so sure, let's start on that. If you're looking to get started in the SIG, definitely come to our Slack — I think it's by far the most active part of the SIG. We have the mailing list and we have the Zoom meetings, but you'll find most activity happening in Slack or GitHub. So join our Slack: it's #sig-testing in the Kubernetes Slack. And from there, I'd say we have the sig-testing repo on github.com.
D
This repo is where we're starting to document some of these things. We've been experimenting with GitHub Discussions — I'd say that experiment is not going well, but so you're aware, they're there — and we have some issue tracking for SIG-level issues that aren't really repo-specific. One of those is actually about how we need to surface the project boards better.

D
We have a project board in the kubernetes org and a project board in the kubernetes-sigs org. kubernetes-sigs is where SIG subprojects go; what's still in the kubernetes org is more or less legacy stuff, and pretty much all new endeavors are in kubernetes-sigs.
D
Unfortunately, GitHub projects can't track issues across orgs, so we have one board for each org. You'll find in our project boards that we have a "help wanted" column, and we attempt to make sure that any issues tagged help wanted that are relevant to the SIG are added there — I wouldn't say it's 100%. Another thing you can do is go to our repos; those are documented in—

D
If you go to the github.com/kubernetes/community repo — the community repo in the kubernetes org — there's a doc listing all of the SIGs, and there are subfolders for the SIGs with a doc there that lists all the subprojects. The subprojects that we own, like test-infra, will all use the "help wanted" and "good first issue" labels as appropriate, and then we are currently manually sweeping through those and making sure they're in the project board as well.
D
There's another one in the kubernetes org. Let's see — I think I linked back to the Slack discussion where we had it, so you can grab it from there, but it's also just called "SIG Testing". It's pretty much the same thing, but it's in the other org — oh.

D
Yeah, I did not.

D
Yeah, because those numbers are just an auto-increment on adding projects, so there's no good way to — it's 11 in the other org.
D
So here we go. Like I said, at the moment that's Aaron and I going through and more or less manually adding things to this board. There is a little bit of automation — for example, when an issue finishes and is closed, it moves through the board — but the help wanted items there should be pretty good, in that we've gone back over and identified them again as "yes, this looks like a help wanted issue." If you're looking for more, there are definitely issues that haven't made it into the board since our last pass-through, and that is also something else we're looking to do in the sig-testing repo.
D
I have another issue filed about improving issue triage and how we might do that. One thing that comes to mind: there's a tool, Triage Party, that basically hosts a custom web page for doing triage workflows — sort of multiplayer — and has a bot enact the actual changes on GitHub. Some parts of the project have experimented with it; it originated from Google's minikube team. So ideally we don't just want to explore this for SIG Testing — we want to explore, with WG K8s Infra, whether, if this tool works well, we can make it easy to stamp out for other SIGs, or just go ahead and host some more instances, and so on. What's that called? Triage Party — and the tracking issue is here; I'll link it in the doc.
D
Issue number seven in the sig-testing repo. That also came out of the discussion I've been having about a bot that automatically labels stale issues and closes them — we own the source code, but the policy is owned by the Contributor Experience SIG, which is also, generally, how the dynamic works in Kubernetes. I made the mistake of lightly complaining a bit about this and started a large discussion about it recently.
D
So one of the more productive things we can do is try to improve issue triage so that we can keep up with incoming issues better; rolling out tooling for that sort of thing is in the scope of the SIG. We pretty much do project infra tooling, particularly as it relates to CI and GitHub. Contributor Experience, though, is sort of responsible for the workflows at this point — how we should be behaving, as processes and things around GitHub. And then it's also worth knowing there's a working group, WG K8s Infra, that this spun out of. A lot of the infrastructure came out of a team at Google that I was on — though it started before me — that set up all sorts of things for the project early on, like CI infrastructure and image publishing. Because that was all just kind of done real quick, in a convenient way to get the project going, it was all in projects under Google's cloud organization, so they're not really accessible to the community — they're Google-internal. So maybe two or three years ago, one of the Googlers arranged a large credit donation to the CNCF.
D
So instead of billing it to Google through Google projects, we basically have credits with the CNCF: the CNCF hands them to Steering, Steering hands them to WG K8s Infra, and then we set up some GCP projects in a kubernetes.io organization. If you want access to any resources, that's all controlled with YAML in GitHub, and Terraform, and Bash, and we are migrating things there.

D
So whenever we're standing up new infra in particular, we are looking to collaborate with that working group and make sure we're doing it with their resources, so that it will be a sustainable thing that the community can collectively manage, right.
D
So yes, the bulk of it is in the kubernetes/k8s.io repo. Originally that just had some things like an nginx instance that handles the k8s.io shortcut URLs that go to different places, but now it has more. We use Google Group management — Kubernetes has a G Suite instance, so we have Google Groups that we use for things like "these people are the owners of this thing" — and then we're able to do things like synchronize that to RBAC rules within clusters that we run for the project, or to access controls in Google Cloud. Most of it is in Google Cloud because of that large credit donation, but there's also some management—
D
There
are
some
things
like
there's
some
aws
accounts
that
are
managed
for
some
end-to-end
testing,
in
particular
a
couple
of
the
sub-projects
use
like
k-ops,
and
so
there's
a
little
bit
of
the
code
for
that
there.
Some
of
the
other
code
for
like
the
tools
themselves
are
in
repos
like
test
infrared,
but
pretty
much.
All
of
the
like
definition
of
like
deploying
the
infrastructure
is
now
in
the
category,
along
with
things
like
like
images.
D
If you want to publish an image to our k8s.gcr.io, then you PR a manifest there. That's also the pattern we use for pretty much everything, and something SIG Testing has been pretty big on pushing: to the extent possible, everything is managed declaratively, just like you would in Kubernetes, where you send a pull request to a YAML file that says "this person should be a member of the GitHub org," "this person should be in this group."
D
"This image should be published." "This infrastructure should be deployed." "This domain name should exist." We have kind of gone all-in: pretty much everything is declarative YAML to the extent possible. Some things are still, you know, some Bash scripts that haven't been replaced yet, or some Terraform.
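To make the flavor concrete, a membership change in that style might look like the following hypothetical snippet — the file layout and field names here are illustrative, not the project's actual schema:

```yaml
# Hypothetical sketch of declarative membership: a contributor opens a PR
# against a file like this, and a bot reconciles GitHub/Google Groups to match.
groups:
  - name: sig-testing-leads
    members:
      - example-user-1
      - example-user-2   # adding a member is just another reviewed PR
```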
D
We have a number of custom things. For example, domain names started with OctoDNS and a Bash script, but now we have something that reads a simple YAML spec we came up with and maps it into that. So there are various tools that are sort of ad hoc reconciling these things, some that are doing it continuously, and some of it is just off-the-shelf stuff like Kubernetes or Terraform.
D
Yeah, it depends on the tooling. For example, for all the GitHub management we needed to build a tool, because most people do this manually — which is pretty easy and low-friction if you have some admins — but we've strived to have just a very few folks who actually have administrative permission over everything, with pretty much everything else delegated through robot accounts and publicly auditable. You can go see, like, okay—
D
Yeah, I mean, a number of the leaders in the community really want this — we want these tools doing what the project says. Okay, declarative management makes sense, so we should declare how we manage everything. I'd say the project's doing pretty well at that generally: if we stand up some infra tools, there is some declarative spec for what they're doing.
D
Anyway, the gist is: we had a Jenkins — another one of those things that Kubernetes just sort of set up — and over time they started containerizing workloads on it, for things like more easily setting up the correct environment, and they had some issues with the GitHub integration for triggering and so on.

D
So someone built a little webhook handler that would watch for GitHub webhooks and kick off Jenkins jobs, and then we had a YAML spec for that. Then somebody realized: okay, we're writing YAML to create containers — why don't we just run these on a Kubernetes cluster? Since there was a Google team doing most of this, we just set up a GKE cluster — no more managing clusters, just schedule pods — and now pretty much all of the CI looks like that.
D
Now you sort of write a pod spec, and then some additional containers are injected to do things like code checkout or uploading the results. We have some YAML that describes "this is a presubmit job, it runs on PRs" or "this is a periodic, it runs on a schedule," and then we have some web dashboards for viewing results. Pretty much all of that works by uploading things to a GCS bucket.
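A job definition in the scheme just described boils down to a trigger type plus a pod spec; this hypothetical entry shows the shape (the field names are illustrative, not the exact schema):

```yaml
# Hypothetical sketch: CI jobs as YAML — a trigger plus an (augmented) pod spec.
presubmits:
  - name: pull-example-unit-test     # runs on pull requests
    spec:
      containers:
        - image: golang:1.16
          command: ["go", "test", "./..."]
periodics:
  - name: ci-example-e2e
    interval: 4h                     # runs on a schedule
    spec:
      containers:
        - image: example.registry/e2e-runner:latest
```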
D
Where possible, we encourage people to produce JUnit XML, so we have structured results that various tools can read. For example, we have a tool that goes back through all of these results, does some BigQuery queries and some further processing, and produces a dashboard of the most common error messages we're actually seeing across all of our jobs, so that you can start to identify, "okay, we started having this failure mode spike."
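The core idea behind that dashboard — normalizing error messages so that instances of one failure mode group together across jobs — can be sketched in a few lines. This is a toy illustration; the normalization rules and messages are made up, not the project's actual ones:

```python
import re
from collections import Counter

def normalize(msg: str) -> str:
    """Collapse run-specific details so one failure mode maps to one key."""
    msg = re.sub(r"\b\d+(\.\d+)?(ms|s|m)?\b", "N", msg)  # numbers and durations
    msg = re.sub(r"pod-[a-z0-9-]+", "pod-X", msg)        # generated pod names
    return msg

failures = [
    "timed out after 300s waiting for pod-abc123 to be Ready",
    "timed out after 312s waiting for pod-def456 to be Ready",
    "connection refused dialing 10.0.0.7:6443",
]

counts = Counter(normalize(f) for f in failures)
for pattern, n in counts.most_common():
    print(n, pattern)  # the two timeouts collapse into one bucket
```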
D
At this point, we had a bad change go in — awesome. That's in the chat.
D
So almost everything revolves around this: we use containers and Kubernetes clusters to run things, and then we produce pretty much all of our results as files on GCS. Those are either JUnit — which was sort of a de facto standard — or some JSON files that contain metadata like when we started, which job it was, that sort of thing. There's something of a spec for that in test-infra, and then a bunch of different tools are able to consume those results and process them.
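JUnit XML is simple enough that consuming it takes only a few lines; this is a generic illustration of reading that format, not one of the project's actual tools:

```python
import xml.etree.ElementTree as ET

# A tiny JUnit-style report, inlined for the example.
JUNIT = """<testsuite tests="3" failures="1">
  <testcase name="pod starts" time="1.2"/>
  <testcase name="pod becomes ready" time="300.1">
    <failure message="timed out waiting for the condition"/>
  </testcase>
  <testcase name="pod is deleted" time="0.4"/>
</testsuite>"""

root = ET.fromstring(JUNIT)
failed = [tc.get("name") for tc in root.iter("testcase")
          if tc.find("failure") is not None]
print(f"{len(failed)} of {root.get('tests')} failed: {failed}")
```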
D
Each job has its results over time, with all the individual structured results as rows, and that's used pretty heavily by the project. Releases pretty much work as "there's a dashboard of these jobs that must be green to release," run periodically against the branch we're developing, and then there's a team that monitors those and says, "okay, this important unit test is not passing — oh no."

D
We can't release — and they file issues and follow up throughout that, so we also work quite a bit with SIG Release on that sort of thing. Yeah, so, like—
D
Similarly,
we
provide
most
of
the
tooling
for
this,
and
I
think
we
provide
quite
a
bit
of
direction
on
like
what
those
jobs
are,
but
at
the
end
of
the
day
like
if
you
want
something
to
be
released
blocking
we
some
testing
faults,
wrote
the
policy,
but
it's
owned
by
sig
release
and
you
need
to
go
talk
to
sick
release,
so,
like
kind
of
similar
to
contrabex
might
be
building
stuff.
But
like
we're,
you
know
we're
working
directly
with
other
cigs
for,
like
the
ownership
model.
All
these.
D
Given the investment in it, it would be harder at this point to sort of rename the SIG. A lot of people come to us and assume we just write all the tests, yeah.
D
I think that came from one of the early things, which was the end-to-end test framework, but at this point there are really not that many folks working on that directly. It's one of those things where you improve it a bit when you need to use it, but it's no one's day job or main focus; people are mostly working on the infrastructure rather than on that.
D
So we have a lot of overlap with WG K8s Infra. They kind of own the credits and sort of the broader infrastructure that isn't remotely testing-oriented, like the domain names and such. But again, if you look at the leadership between the groups, there's a lot of overlap — Aaron is a lead of both SIG Testing and WG K8s Infra and doing a lot of things there, or—
H
Actually, hi, Mariana here. I want to be a contributor; I'm just lurking at this moment, and I was late to the meeting. I saw the notes and I was wondering — what did you say were the tags to look at for new contributors or beginners? Was it just "help wanted," or — I remember there was a tag for new people.
D
Yeah, so throughout Kubernetes this is pretty standardized: there's a "help wanted" tag, and then there's an additional "good first issue" tag, yeah.
D
There are strong rules expected by the project around what makes a "help wanted" issue: if you're asking for help, there should be some pointers to what actually needs to be done, and it should be an issue that folks active in the project have agreed on — there will be support for actually taking this action; it's not still up for debate. "Good first issue" is a bit looser, but it's applied by folks where they think, "this is probably a good starter issue." Those are also labels that GitHub itself considers standard, so I think there's actually a page on GitHub — a little obscure — for repos that will point you to these issues.
H
Cool. But if there's some sort of offline forum for sharing more, I want to subscribe to that.
D
Oh yeah, sorry — I'll pull up some of those resources and pass them along. I'll do that in the SIG Slack, I guess, and try to poke the right folks there. I would share them here, but I probably need a little time to dig those up.
D
Pretty massive — the infrastructure itself, I'd say, is a little bit of madness. We run a lot of stuff.
H
Yeah, thanks. I also have Eddie's talk; I want to look at it at some point — didn't get to it yet.

H
I also lurked on the CLI channel, and you have a thing there where you talk about contributing as well.
D
Thank you all for contributing to the notes as well. Usually we have someone else leading the meeting while I'm taking notes, but today we're a little short-staffed, so I appreciate it.
D
Okay, well, like I said, I'll post some more links where you can learn more about the SIG and the infrastructure later today in the SIG channel, and we can also see if we can get more of these things centrally documented. We've only had the sig-testing repo since maybe the end of last year; hopefully we can flesh out some of the documentation there and have the README be what we point people towards, right.
D
So next up we have Eddie's point about simple pod flakes with network timeouts. Do you want to talk about that, Eddie?
A
Yeah — so I know you've been keeping tabs on this one a bit too. As we're hitting code freeze, I just want to know what work we need to scope out for these types of flakes. The original fix was to give it more of a grace period on the timeout, but we obviously don't want to just keep increasing timeouts for things. And what I've noticed is that when a lot of these flakes still happen, they happen in batches.
D
I don't think we do. The only thing I can think of is that we have some suspicion we're still having issues with disk throttling. Like I said, we run most of our infrastructure on GCP, because we have the big credit donation for that and that's where the infrastructure has been historically. When you run workloads on GCE, on Kubernetes, throttling is done at the VM level.
D
So
if
you
have
like
multiple
workloads
running
on
a
node
in
a
cluster-
and
they
are
really
really
I
o
heavy,
you
can
run
into
like
a
special
sort
of
contention
where,
like
the
infrastructure,
is
turning
around
and
saying,
okay
you're
using
too
much
like
iops,
your
like,
I
o
operations
are
going
to
get
throttled
and
we've
had
that
in
the
past,
where,
like
we
weren't
setting
resource
requests
on
things
in
particular
that
we
still
aren't
necessarily
everywhere,
it's
a
bit
hard
to
roll
that
out
to
all
the
ci.
D
We have some clusters where we do guarantee that, but there's still no way to request I/O. So if you make the mistake of scheduling multiple I/O-heavy workloads to one machine, they will thrash pretty hard — or even one workload that's just really I/O-heavy, like a cluster with multiple nodes, with etcd, running a bunch of tests, the way we run our local clusters. So one issue we do have tracked under k8s.io is that we'd like to make a new CI pool with local SSDs — where there's actually an SSD physically on the machine your VM runs on that is dedicated to you.
D
So there's no throttling — you can write to the SSD as fast as you can — whereas right now, at best, we're using pd-ssd, which is network-backed virtual disks. We're not certain that will solve it, so we want to make a small pool, taint it, let a couple of jobs opt into running there, and see if that improves things.
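The taint-and-opt-in mechanism described is standard Kubernetes scheduling; a sketch of how it might look (the taint key, labels, and image here are hypothetical):

```yaml
# Nodes in the experimental local-SSD pool would carry a taint, e.g.:
#   kubectl taint nodes <node> dedicated=local-ssd:NoSchedule
# Only jobs that explicitly tolerate it (and select the pool) land there.
apiVersion: v1
kind: Pod
metadata:
  name: io-heavy-e2e
spec:
  tolerations:
    - key: dedicated
      value: local-ssd
      effect: NoSchedule
  nodeSelector:
    disk: local-ssd
  containers:
    - name: e2e
      image: example.registry/e2e-runner:latest
```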
D
Otherwise, we have just tried to make sure that all of the critical CI is on the newer clusters, where we have a presubmit that enforces that your configuration must set requests and limits. That helps, but it still doesn't guarantee anything — we could still be making small requests on something that should probably be making larger ones. In particular, for the kind cluster testing we've had some debate over how many resources you really need, because if it weren't for disk, it really shouldn't need many cores to run some end-to-end tests.
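For reference, the requests-and-limits requirement that presubmit check enforces is just ordinary Kubernetes resource stanzas; the values below are illustrative, and note that, as said above, disk I/O is not among the resources you can request:

```yaml
# Sketch of a job container with explicit resources; values are made up.
containers:
  - name: e2e-test
    image: example.registry/kind-e2e:latest
    resources:
      requests:        # what the scheduler reserves on the node
        cpu: "2"
        memory: 4Gi
      limits:          # hard caps enforced at runtime
        cpu: "2"
        memory: 4Gi
      # note: there is no analogous field for disk IOPS or throughput
```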
D
But
I
have
some
suspicion
that
we're
running
back
into
the
queer
issue
and
it's
it's
not
that
easy
to
find,
because
you're
gonna
have
to
like
correlate
what
we
were
running
on
the
vm
and
then
look
at
the
like
vms,
like
disk,
throttling
monitoring
or
something
and
not
many
people
have
like
have
access
to
that.
You
can
get
access
by
a
pull
request
for
like
read-only
information
and
gcp
pretty
easily.
But
it's
one
of
those
things
that's
like
kind
of
obscure
and
not
that
many
people
have
done.
D
Yeah, so I'm also suspicious that what's happening is we're just hammering etcd. Well — in this case, though, that only applies if we're running kind clusters, or a local integration test, or builds, stuff like that. In this case, because it's a GCE end-to-end job, the only thing actually running on those pd-ssd nodes is: we grabbed the binaries, we stood up a cluster, and then we're running the e2e test binary.
D
The actual tests run against a cluster that was stood up for the duration of the job, using some Bash, and those VMs are not running anything else but the tests and should be large enough. Also, one thing that can happen — if you saw this spike more over time, we might have had a regression.
D
Yeah — and let's just say it's "timed out waiting for the condition," and you're like, well, what was the condition? It's one of those things where we have some generic matchers, and a lot of the tests don't really tell you, so it takes some time just to find out what might have timed out.
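That failure mode — a bare "timed out waiting for the condition" — is what a generic poller produces when nothing describes what it was waiting for. A small sketch of the fix (the helper and names are hypothetical, not the project's actual wait utilities):

```python
import time

def wait_for(condition, timeout, interval=0.01, describe="the condition"):
    """Poll `condition` until it returns True or `timeout` elapses.
    On timeout, name WHAT was being waited for, not just that a wait failed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError(f"timed out waiting for {describe}")

try:
    wait_for(lambda: False, timeout=0.05,
             describe="pod 'busybox' to be observed Ready")
except TimeoutError as e:
    print(e)  # names the actual condition instead of a generic message
```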
D
Yeah — I mean, it might be something in the container runtime, or — what are we running in this?

D
We're seeing that a deadline was exceeded waiting for the pod to be observed ready — the test was waiting.

D
I think our tests are expected to clean up after themselves, but they can't necessarily guarantee it, and we're running concurrently — we have n threads running through the test cases.
B
But
yeah,
if
you
leave
those
jobs
laying
around
even
after
they're
done,
it
can
really
it
can
bog
down
kubernetes,
at
least
that's
what
we've
seen
and
we're
running
an
earlier
version
that
we're
running
116.
yeah.
We
have
like
a
operator
that
we
have
running
that
just
removes
just
anything
out
lower
than
24
hours,
just
delete.
D
The pods — for the infrastructure itself, the persistent parts — not the test clusters we bring up, but what we tend to call build clusters, the workload pool we use to execute the tests and things like that on the CI — there's a component called sinker that does pretty much that. It has some awareness of the job abstractions, so it can choose to keep around maybe one instance—
D
The
most
recent
run
or
something
like
that,
but
it
pretty
much
goes
through
and
completely
pods,
because
with
our
jobs
we
were
running
into,
they
tend
to
write
to
like
just
ephemeral,
pod
storage
and
building
kubernetes
for
even
just
a
single
architecture.
If
you
build
everything
in
the
repo
could
be
like
30
gigs,
and
if
you
have
a
whole
bunch
of
jobs
doing
that
pretty
soon,
you
just
run
out
of
disk
on
the
ci
machines,
since
those
aren't
released
until
the
pods
actually
deleted.
D
We
had
pretty
pretty
excessive
issues
with
that
but
norm.
That
would
not
be
super
expected
during
an
end
to
end
test.
I
would
say
that,
generally,
these
close
end-to-end
test
clusters
are
like
a
little
over
provision
and
that's
one
of
the
reasons
we
were
doing
kindness
like
well.
We
can
probably
test
most
of
the
stuff
just
with
some
little
tiny
simulated
ones.
We
don't
really
need
like
a
full-blown
cluster
to
run
like
can
you
use
cube
cuddle
to
run
a
pod
like
that
he's
super
cheap.
D
Yeah, we have some stuff like that, but most places are still using whatever the config defaults for creating a test cluster were years ago, so for most tests they're probably a little more than we need — and that's actually helpful, because it gives clearer signals. So when we're having all these timeouts, the first thing I look for is: did the job itself time out — did we reach, just like—
D
So you can see them — if you scroll through the whole lot, you can see them timing out throughout, yeah. So it seems to me more like there's something systemically wrong with just starting pods in these clusters. Right, right.

D
Because in the past we might have blamed it on, "oh well, this is one of the jobs where we switched to containerd or something, and there's some bug in this version because it's an early version." But this one's running Docker; we don't upgrade it super often, but it should be pretty stable.
D
It looks like we're running 19.03, so it should be a well-patched build of a pretty stable, pretty commonly deployed Docker version, and it probably hasn't changed, other than patches, for some time.
H
And are the other things significant in any way — like "should not be passed when pod info mount equals null," or "flaky kubectl explain ... resource name as built-in object"?
D
So the flaky tag — that's one that's manually added when we've decided, "this is known to flake and we don't have a solution." The main purpose of that is, for example, that the tests we run on your pull requests to Kubernetes exclude any tests tagged flaky, so that we're not wasting everyone's time with something where, okay, we know this is flaky, we don't have a fix, and maybe we don't think it's the most critical test — so we just tag it with that.
D
We can let it keep running in a periodic CI job over here until someone fixes it. So the fact that a flaky test failed is not surprising, but we have all these other tests that aren't tagged flaky that are timing out, and like Eddie said, for some of these, if you dig through, you can see they're just waiting for a pod to start. I mean, you can imagine—
D
If, in your production cluster, you tried to schedule a busybox pod and you waited a couple of minutes and the busybox pod wasn't ready, you would be concerned, right? It's kind of the same thing here: even though this is a weird test cluster, on some test version of Kubernetes, running only test workloads, you would still expect that some super cheap pod would just come up. One of the things we can also do — there's a link at the top, "artifacts."
D
We populate an environment variable when we're running these jobs that says, "put things here if you want them preserved." The tests will then do things like dump all the cluster logs under that directory, and at the end of the run, the CI will upload all of it to GCS.
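The convention amounts to a test writing anything it wants kept into a directory named by an environment variable; a small sketch (the variable name ARTIFACTS and the file names are assumptions for illustration):

```python
import os
import pathlib

# The CI is assumed to export this variable; fall back to a temp path locally.
artifacts = pathlib.Path(os.environ.get("ARTIFACTS", "/tmp/artifacts"))
artifacts.mkdir(parents=True, exist_ok=True)

# Anything dropped here (logs, junit.xml, cluster dumps) survives the run,
# because the CI uploads the whole directory at the end.
(artifacts / "e2e.log").write_text("dumping cluster logs\n")

print(sorted(p.name for p in artifacts.iterdir()))
```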
D
So if you click through to "artifacts," the first thing you see is the actual metadata that the CI uploaded: the build log, the log of the test, a JSON dump of the pod we were actually running to run the test, that sort of thing. And then in the artifacts directory — viewed with another little tool the project built for browsing our results — you can see a couple of directories that look suspiciously like node names, because they are, and those have all of the system logs.
D
So we can look through this. That's how I would tell, say, "oh, this test cluster is running Docker; we're not doing anything exciting." Maybe it's one of the jobs where we run a super recent version of containerd, because we're trying to qualify the latest runc against Kubernetes, collaboratively, before that project's release — we're not doing anything like that here, and you can see that, because we have all the system logs.
D
Here you can see the logs for the control plane node. A small note there: this is using the super old Bash tooling that no one in the org really wants to own, called kube-up, that's still in the kubernetes repo, so it has some pretty archaic things, like naming nodes "master" and "minion" — both terms that are not used in the project anymore, except in this thing. Even though it runs most of CI, pretty much everyone just wants to pretend this stuff doesn't exist anymore and delete it. No.
D
So if you wanted to add something like that to the infrastructure — you know, all of this is just managed through GitOps anyhow. But in this case, this job should not really be using disk on the shared infrastructure: it's creating a cluster just for the purposes of testing, and there are no other workloads scheduled there, only the test workload.
D
So even if they were writing to etcd heavily, there's nothing else happening there — it's just running the cluster's etcd, so it should be able to keep up. Whereas we have other tests running kind clusters — that's one of our SIG subprojects, where Docker containers are the nodes of the cluster — and those run on the shared infrastructure.
D
So that's what I originally thought you were referring to, and something that is tracked is that we should look into what happens if we just throw more I/O at this problem. But in this case, we've been running this sort of thing for a long time; it runs on its own ephemeral, dedicated infrastructure, and there's no reason it should be having disk I/O trouble scheduling a pod or something like that.
D
This is a totally dedicated machine — nothing else is running on these VMs, just Kubernetes itself. But we have had things in the past that showed up more heavily on those kind clusters, where they run on a shared, possibly under-provisioned VM. We had things like, at one point—
D
There was a sidecar in the csi testing that monitored all nodes, all pods, and a number of the core resources; it was, like, watching all of them pretty aggressively, and that increased the load on the cluster, and then that caused tests to flake, and we couldn't relate them to anything else. And so that's where, like, the triage dashboard comes in, and we can say, okay, we can see that we're just having a bunch of random failures increase in these jobs.
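The grouping the triage dashboard does can be sketched roughly as: normalize away the volatile parts of failure messages so similar failures cluster together. A toy version (the failure messages below are made up, not from real jobs):

```python
import re
from collections import Counter

def normalize(msg):
    """Collapse volatile details (hex ids, numbers) so similar failures
    group together, roughly the clustering idea behind triage."""
    msg = re.sub(r"0x[0-9a-f]+", "UNIQ", msg)
    msg = re.sub(r"\d+", "N", msg)
    return msg

# Hypothetical failure messages pulled from job logs.
failures = [
    "timed out waiting for pod-42 after 300s",
    "timed out waiting for pod-17 after 300s",
    "connection refused to 10.0.0.3:6443",
]

clusters = Counter(normalize(m) for m in failures)
# The two timeout failures collapse into a single cluster of size 2.
print(clusters.most_common(1))
```

Seeing one big cluster spanning many unrelated jobs is the signal that the cause is environmental rather than in any single test.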
D
No, there shouldn't be any, because these are running on ephemeral clusters that are gcp vms, so they have their own disks that the cloud provider is guaranteeing, like, here's your disk. In reality those are abstracted disks that are over the network and stuff, and there's more things going on.
D
But we haven't observed any issue like that. Even if it turns out that part of how that is ultimately backed is shared, there are other systems between us and the disk that make sure we get the I/O we're supposed to, like the throttling of someone else's workload.
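If disk throttling were suspected anyway, a first sanity check is just measuring write throughput on the machine in question. A crude sketch (a real investigation would use a proper tool like fio, which controls block size, queue depth, and caching):

```python
import os
import tempfile
import time

def rough_write_throughput(size_mb=8):
    """Write size_mb of zeros with an fsync and return approximate MB/s.
    Crude on purpose: page cache and small sizes skew the number."""
    chunk = b"\0" * (1024 * 1024)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        start = time.monotonic()
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
        elapsed = time.monotonic() - start
    os.unlink(f.name)
    return size_mb / elapsed

print(f"{rough_write_throughput(4):.1f} MB/s")
```

Comparing this number on a quiet vm versus one running a full test job would show whether the workload itself is saturating the disk.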
D
Okay. So what we were seeing is: when we're running these kind clusters, or builds, or some workloads like that (or there's some integration tests that spin up just a couple of the components directly and then run tests against them), those are running in the ci cluster. And then in the ci cluster we may be, like, ourselves co-scheduling workloads to one of those virtual disks.
D
So before we share, I also want to find a doc and link it to you all, since we've kind of wound up on the general topic of flaky test debugging. We actually have a doc.
D
I think it's in the community repo; there's a doc that our members wrote that's sort of just about, like, how to do flaky test debugging in this project, and it's quite good. There's also, yeah, there's a video from jordan liggitt, who came to our meetings and gave sort of a little talk about this, since he does quite a bit of it, and you can also watch that talk. It's linked in the
D
doc. We have a lot of tools that can be used here, that this goes into more detail on. But for all that, I'll say that a problem like this is still kind of the hardest, where it doesn't obviously seem related to anything, it's across multiple different tests, and they don't appear to be doing anything crazy.
D
The infrastructure is, like, boring old stuff we've been running for a long time, other than the kubernetes version. So that makes me say that probably we want to look in the direction of: did this get worse over time, and can we pinpoint when that was, and look at what changes happened in that time? But you know, going through all of those motions will probably first require getting a little bit up to speed on, like, okay?
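The "did this get worse over time, and when" question can be framed as a crude changepoint search over a job's failure rate per day. A sketch with invented numbers:

```python
def changepoint(rates):
    """Return the index splitting the series where the mean failure rate
    before vs. after differs most; a crude stand-in for eyeballing the
    triage dashboard's time series."""
    best_i, best_delta = None, 0.0
    for i in range(1, len(rates)):
        before = sum(rates[:i]) / i
        after = sum(rates[i:]) / (len(rates) - i)
        delta = abs(after - before)
        if delta > best_delta:
            best_i, best_delta = i, delta
    return best_i

# Hypothetical daily failure rates for one job.
daily = [0.02, 0.01, 0.03, 0.02, 0.15, 0.18, 0.16]
print(changepoint(daily))  # prints 4: start digging through changes merged around then
```

Once a date range is pinned down, the merged PRs and infra changes in that window become the suspect list.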
D
Like, the triage dashboard itself is a very powerful tool, but it takes a little bit of getting used to, and, you know, it doesn't have much in the way of docs or explanation about things, or, like, tooltips, and it ties back
D
you know, pretty heavily to, like, our concepts around what is a job, and what is all the metadata we have around it, that sort of thing. Yeah, it's pretty much a power-user tool that, like, a handful of people depend on pretty heavily, but it hasn't been optimized for, like, bringing new folks on board to it. So I'd say this talk and this doc are, like, the best resources for getting up to speed on that.
D
The test doc, that's in the community repo. I think that, dave, it's not that, I think that was, like, his one. The current doc is here. marina, I think marina gave us a break. I
F
I threw a link to these notes in the agenda as well. I will probably clean them up afterwards and post them somewhere less ephemeral than this, but if you want to follow along, or if my screen is not working well, that link is there for you. So, didn't realize you're here, hi.
D
Oh, this is, that was.
B
D
Well, thank you all for coming. I will try to find some time soon, I need to follow up on a couple of things at work, but I'll try to find some time soon to, like, surface some more of these resources. And if you are looking for anything, feel free to reach out. For the most part,
D
at the moment I'm probably gonna point you to the project boards to start, but if you don't find anything, follow back up. And just in general, like, ask questions in our slack; everyone's super friendly and helpful, and you know, there's probably someone else that wants to ask the same question but hasn't yet. So.