From YouTube: 2017-08-17 Kubernetes SIG Scaling - Weekly Meeting
A
B
A
A
A
The first bullet I think is non-controversial, but I'm going to represent that the tone has been shifting from, sort of, building bigger and bigger clusters to stability on 5,000-node clusters, and I'm going to ask the question about whether there's an urgent need to stretch beyond 5,000.
C
A
B
Okay, so my problem is that, you know, we want to have a slightly larger footprint across the board for all these things. So, for example, our operational zones have close to 20K machines. So, you know, ideally we would like some feature so that all of these 20K machines can be managed in a unified way, but that might mean using Federation, so I'm not sure.
C
There's well-established documentation that we have that having a single large cluster, in the tens of thousands of nodes, sort of violates principles of failure domains. Federation is a potential escape hatch, but it comes with its own complexities; sometimes people just do a multi-cluster approach, for example for staged rollouts. There are many patterns that exist that give you an escape hatch versus having a single large environment.
B
A
C
A
C
I think, you know, my personal take on Federation is that you're grouping failure domains. When we wrote up our blog post on it, you know, we talked internally for a long time about it, and there's actually a three-part blog post that Joe and Craig wrote up regarding it, but it conflates failure domains.
C
A
C
A
A
A
Aaron actually typed up some very helpful comments here. It's a fairly short update; I really wasn't planning on covering all this in detail, but just to give the wider community some exposure to where the docs are and what they do. I do want to do a quick call-out for folks that are interested in provider coverage beyond the ones that we have. This seems like a good call to make sure they know that we'd love their input.
A
So I'll cover this briefly. I'm going to mention event refactoring, which Marek would cover were he here. The only piece of this I'm really concerned about whether I'm representing properly is the fourth bullet, which is that it's at least two releases of work, with kind of a baseline in 1.8 and not really helping substantially until 1.9; I'd just call it a work in progress.
A
A
A
A
Short of getting some additional feedback from those guys, this is the paging API update, just to let people know this is going on for these kinds of large scaling parameters. That's a pointer to that issue there. I got the 1.8, 1.9, 1.10 bits straight from the features repo. I pinged Clayton just a few minutes ago, just to make sure he knew I was going to say this.
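For context, the paging work referenced here is API list chunking, where clients request large collections in pages using limit and continue parameters rather than one huge list call. A minimal sketch of how a client might consume such pages, assuming an API server reachable through kubectl proxy and the requests library; the endpoint and page size are illustrative, not taken from the meeting:

```python
import requests

API = "http://127.0.0.1:8001/api/v1/pods"  # assumed access via `kubectl proxy`

def list_pods_in_pages(page_size=500):
    """Fetch all pods in chunks using the list API's limit/continue parameters."""
    params = {"limit": page_size}
    while True:
        resp = requests.get(API, params=params)
        resp.raise_for_status()
        body = resp.json()
        for item in body.get("items", []):
            yield item
        token = body.get("metadata", {}).get("continue")
        if not token:
            break  # no more pages to fetch
        params = {"limit": page_size, "continue": token}

if __name__ == "__main__":
    print(sum(1 for _ in list_pods_in_pages()))
```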
A
So I think that one's probably okay. Cluster results sharing: I looked at the last update I did, and this was there too, so I think this is fine. This is probably mostly for you, to make sure you're comfortable with this update in terms of our interest, in the SIG, in using Sonobuoy in some form to capture and share results. Looks good to me. All right, and then what I'm planning to do is...
D
Hey, so this isn't something I can take a ton of credit for; I just happen to be aware of it because I'm on sig-testing, and I think Shawn's been pushing this pretty hard. One of the big things we noticed during the 1.7 release was that the scalability testing wasn't really passing, and we didn't notice this until late in the release cycle. So we put together a proposal, or Shawn did, and I think Marek to some extent.
D
Friday runs are performance oriented: you know, run the density test and make sure it meets those two big, broad SLOs that we've defined thus far. We'd like to get to the point where it passes sort of more refined SLOs, but this is the state of today. The correctness runs are to make sure that all of the other tests still pass at the 5,000-node level, and then Saturday and Sunday we...
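For reference, the two broad SLOs alluded to here are commonly stated as API responsiveness (99th percentile latency of API calls under 1 second) and pod startup time (99th percentile under 5 seconds). A minimal sketch of the kind of threshold check a performance run applies; the metric names and sample values are assumptions for illustration, not taken from the meeting:

```python
# Hypothetical illustration: verify measured 99th-percentile latencies against
# the two broad scalability SLOs (thresholds in seconds, assumed values).
SLOS = {
    "api_call_latency_p99": 1.0,    # API responsiveness
    "pod_startup_latency_p99": 5.0, # pod startup time
}

def check_slos(measurements):
    """Return (metric, value, threshold) tuples for any SLO that is violated."""
    violations = []
    for metric, threshold in SLOS.items():
        value = measurements.get(metric)
        if value is None or value > threshold:
            violations.append((metric, value, threshold))
    return violations

if __name__ == "__main__":
    sample = {"api_call_latency_p99": 0.62, "pod_startup_latency_p99": 3.9}
    for metric, value, threshold in check_slos(sample):
        print(f"SLO violated: {metric}={value}s (threshold {threshold}s)")
```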
D
D
So, the way this is implemented today: you may have heard me talk about things like Prow before. Jenkins is still involved in running tests for some things today, and this includes the scalability tests, so we use a Jenkins cron trigger to make sure they're triggered according to the schedule I've just laid out, and then we're using a script called bootstrap from the test-infra repo. It's basically the script that, if you ran it locally, would do all of the exact same setup and configuration and layering of stuff
D
that would actually happen the same way we run tests for the project. So if you look at any of those env files there, you'll see the exact configuration, down to the size of disks, the size of nodes, and the QPS tunables that have had to be adjusted in order for clusters of those sizes to meet the SLOs. Next slide, and we can maybe click through some of these links, I don't know. Basically, I just want to call out that sig-scalability, being a good SIG, has our own dashboard today.
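A minimal sketch of the scheduled-job flow described above, where a cron-style trigger runs a bootstrap-style wrapper with the job's environment overrides and then uploads artifacts to GCS. The job name, environment variables, driver script, and bucket below are hypothetical placeholders rather than the actual test-infra configuration:

```python
import os
import subprocess

# Hypothetical placeholders: the real job names, env files, and buckets live in
# the kubernetes/test-infra repo and are not reproduced here.
JOB_NAME = "gce-scale-performance-example"
JOB_ENV = {
    "NUM_NODES": "5000",       # cluster size under test (illustrative)
    "NODE_DISK_SIZE": "50GB",  # illustrative tunable
    "KUBE_API_QPS": "100",     # illustrative tunable
}
RESULTS_BUCKET = "gs://example-scalability-results"  # assumed bucket

def run_scalability_job():
    """Bring up the cluster, run the e2e suite, and upload artifacts to GCS."""
    env = {**os.environ, **JOB_ENV}
    # Placeholder for the bootstrap/e2e driver invocation used by the real jobs.
    subprocess.run(["./run-e2e-scale-tests.sh"], env=env, check=True)
    # gsutil is the standard GCS CLI; -m parallelizes the copy.
    subprocess.run(
        ["gsutil", "-m", "cp", "-r", "_artifacts", f"{RESULTS_BUCKET}/{JOB_NAME}/"],
        check=True,
    )

if __name__ == "__main__":
    run_scalability_job()
```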
D
It's called sig-scalability. It's one of these fancy tab-group things where we actually have just the density tests out of the GCE tests, and then, yeah, you can just open it, cool, so you can see there are sort of all the tests that sig-scalability cares about on that second row there. And then the top row calls out google-gce-scale and google-gke-scale; those are the larger scale tests, with the two-thousand and five-thousand nodes that I was just talking about.
D
So you can see if there are problems with the tests that this SIG cares about, and we can see that. And the other thing that I wanted to call out: again, I mentioned that the GCE tests are blocking. If you go back to the slide and look at the release-master-blocking dashboard there, you can see I have two links, to the gce-scale-correctness and gce-scale-performance jobs. Those are the five-thousand-node jobs.
D
Those are part of the list of jobs that are considered release blocking, so the tool that's in charge of cutting builds for Kubernetes actually looks at all of the jobs in that release-master-blocking dashboard. So the same configuration that drives Testgrid is the same configuration that drives that aggregation.
B
D
I always forget how to pronounce this tool, for those of you who are familiar with it. So if this is red, which it is, we can't cut an alpha, beta, or release.
D
Okay, next slide. Something cool that we sort of discovered along the way is that collecting all the logs from all the nodes, when you're at 2,000 or 5,000 nodes, takes a long time at that scale. The way it used to be done was basically to SSH in through Jenkins to copy the logs off from each node, I think either in serial or in parallel, but basically Jenkins was the one going out and getting the logs. We've now shifted to using this tool called logexporter, and it's linked in the slides here.
D
You don't have to follow it now, but if there are people who want to look through the slides afterwards, it's a kubernetes project, and basically the nodes are now responsible for pushing their own logs up to GCS. And then all we do at the Jenkins level is watch to make sure that all those logs land, and then the logs are already there in GCS. So what this means at the 5,000-node level is that instead of spending over four hours to collect logs,
D
we now just spend less than 20 minutes, and at the 100-node level it's improved as well, taking us down from ten minutes to two minutes. This is now rolled out across all the scalability jobs, and I think it works so well that it might be worth pushing to all of the testing jobs, period. In the failure mode where a node fails to come up, or doesn't come up completely enough to push its logs to GCS, we can still fall back to SSH into the node to collect what is there for forensic purposes.
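A minimal sketch of the node-side upload approach being described, where each node pushes its own logs to GCS so the CI job only has to wait for the uploads to land. The log paths and bucket are assumptions for illustration, and this is not the actual logexporter implementation:

```python
import glob
import socket
import subprocess

# Assumptions for illustration: which logs to ship and where to put them.
LOG_GLOBS = ["/var/log/kubelet.log", "/var/log/kube-proxy.log", "/var/log/docker.log"]
BUCKET = "gs://example-test-logs/run-1234"  # hypothetical destination

def export_node_logs():
    """Push this node's logs to GCS instead of waiting for a central SSH copy."""
    node = socket.gethostname()
    files = [f for pattern in LOG_GLOBS for f in glob.glob(pattern)]
    if not files:
        return
    # gsutil -m parallelizes the copy; each node writes under its own prefix.
    subprocess.run(["gsutil", "-m", "cp", *files, f"{BUCKET}/{node}/"], check=True)

if __name__ == "__main__":
    export_node_logs()
```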
C
It seems weird, like, what's the biggest... I mean, the time frame seems weird. What's the biggest lag? Isn't everybody writing to the GCS buckets? Because, you know, if you're writing to a single node, who is actually writing to a spinning disk before you tar it up to push it off to some bucket?
C
D
D
I think I personally was more concerned about the correctness angle: okay, if we're leaning on GCS and we're leaning on the nodes to do it themselves, what if the nodes aren't functional, right? If a node didn't get far enough to actually be running the kubelet process and then schedule a container that's responsible for shipping its logs up to GCS, we're not gonna have its logs, and that's really impossible to debug from a forensics perspective. For tests that fail because the cluster failed to come up, I want logs.
A
D
A
C
I'm not actually... well, I will go there, but this isn't even yours. This is, like, a Google-ism, right, for Google testing infra that they developed. You know, I'm gonna go on a little rant here because I've got the time, okay, but there is now something that's very useful, but only useful in the context of the testing infrastructure, along with using Google stuff, right. This is a general-purpose problem that the community has, and we've developed something that lives outside of it.
C
That does something similar, but unifying these tooling and toolchain types of problems so that they are generally useful to other people is a thing that we should be solving, versus, like, "we need it for test-infra or the test toolchain, so we'll build it here," and, you know, now it's so entrenched in test-infra that disentangling it and getting this general composable piece out, to be generally useful to the broader ecosystem, is an untenable thing, right. All right, go on.
D
Well, so, you know, I bring it back to the funded-mandate example. So, Tim, you might be right that this is a common problem that everybody has, but I'm not sure that anybody actually dove in and fixed it in a way that works for everybody. I'd have to go look at the commit logs, right, but I think this was just a tool that was needed to reduce the amount of time it took to solve this problem at scale.
A
D
And the same thing from the AWS perspective: like, if we want AWS tests to be blocking, somebody who can stand up a 5,000-node cluster in AWS should probably be dedicating the resources to make that work on a continuous, reliable basis. Google has dedicated one person, Shawn, to making it work on a 2K and 5K node basis, probably with some assistance from others, right. So I don't know if that would be somebody from SIG AWS or if that would be the people behind kops.
D
C
A
I'm certainly inclined to take another round of, I'll say, evangelism with both the Azure folks and AWS, to see if we can't get some support from them on this. So I'll go talk to Gabe and Brendan and say, hey, what does it really take for you guys to help us do these 5K node tests on Azure, and I can do the same thing with the folks in SIG AWS.
D
Yeah, totally. So, like I said, just to hear your point: if any of this is coming across or being presented in any way where it looks like we would not welcome the effort, then we have to work on that, but I believe any and all are very welcome to contribute. And I don't just mean this as "pull requests welcome"; I mean, seriously, what about the on-ramp is preventing people from getting onto the ramp? How can we help? Yeah.
C
I think we need folks who would help own it for the different providers and validate these pieces, to abstract it away, right. Like, I don't know of an owner other than reaching out to the separate cloud SIGs, but there isn't a representative from the separate clouds here. I think this might be a place for a call-out, like, it'd be nice if those who are contributors for the different providers in general, I guess, you know, would also do similar types of validation or help support this effort. Definitely a call-to-arms type of thing. Yeah.
A
A
A
Well, we're out of time. I think this is mission accomplished for today. I know, Tim, you wanted to have some other discussion, but I don't think we have quorum for it. So I don't really have another discussion or anything else for you. Okay, anyone else? Anyone? Anyone? Okay, well, we'll see everyone on the next call.