From YouTube: 2016-08-18 Kubernetes SIG Scaling - Weekly Meeting
B: We could, if we collect agenda items. There's a bunch of stuff to talk about: there's the etcd3 stuff, there's the future, the assorted nest of issues that the garbage collector controller exposed, and other performance issues that we're kind of running into with both the watches and some other stuff. So the laundry list that has grown over these last couple of days has been pretty large.
B: I'll give a quick update, and Wojtek has notes too. I think we're all party to the current [inaudible]. So I have the fixes for the client stuff to enable for the testing; I was able to get that in this morning, but then I'm in some kind of testing purgatory now. I don't know if a lot of other people are seeing this, but I'm getting random test failures that have nothing to do with my PRs. It's happening with increasing frequency; I think it's like an endless cycle.
B: There are some issues here. So the PRs are released, obviously, and I saw a bunch of output in the TLS one, but I think it still needs a unit test. Is there a list? I know you created that list on the features repo.
A: Tim, I was looking to see if Aaron happened to join us this morning. I know from his last commentary to me, after the SIG Testing meeting, that there seemed to be some pretty big code changes getting put in that, I think, he certainly felt were a little bit late in the process. So it doesn't surprise me to hear you saying you're seeing things maybe get a bit flaky here at the end.
F: So I have a question, Wojtek, about rolling back etcd from v3 data to v2, as you requested. I'm just wondering: can we assume the v2 data is empty? Because that's much easier; we can just write a simpler v2 data file instead. If there is existing v2 data, we'd have to somehow fetch it and merge it.
E
E
Yeah,
so
there
was
at
least
one
rice
which
is
hopefully
fixed,
but
like
I'm,
not
one
hundred
percent,
it's
fake
took
sure
that
it's
fake,
but
hopefully
it
is
like.
So
that
was
one
issue.
It
was
like
yeah,
it
was
reflector
issue.
There
is,
and
there's
also
like
a
huge
increase
in
Murray
consumption
in
both
the
controller
manager,
which
is
kind
of
expected,
but
also
an
API
server,
and
the
reason
for
that
is
basically
that
garbage
collector
is
using
dynamic
client,
which
is
using
Jason's
and/or.
E
All
other
clients
are
using
proto
box
now
and
it's
like
a
known
issue,
but
like
basically,
we
need
to
make
a
decision
if
we
want
to
enable
it
without
product
with
Jason's
or
not.
It's
not
it's.
In
my
opinion,
it's
not
the
blocker,
but
it
needs
to
be
a
conscious
decision
that
our
components
will
be
using
like
API
server
will
be
using
two
types,
more
memory,
so
yeah.
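For context on the JSON-versus-protobuf point: typed clients can negotiate protobuf on the wire, while the dynamic client the garbage collector uses works on unstructured objects and therefore round-trips full JSON. A minimal sketch of the distinction using present-day client-go (package paths and names here are illustrative; the code under discussion in 2016 lived under different paths):

    package main

    import (
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
        if err != nil {
            panic(err)
        }

        // Typed clients can ask the API server for protobuf, which is far
        // cheaper to serialize and deserialize than JSON on both ends.
        protoCfg := rest.CopyConfig(cfg)
        protoCfg.ContentType = "application/vnd.kubernetes.protobuf"
        typed, err := kubernetes.NewForConfig(protoCfg)
        if err != nil {
            panic(err)
        }

        // The dynamic client deals in unstructured objects, so it speaks
        // JSON; every list and watch is a full JSON round trip, which is
        // the memory cost described above.
        dyn, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        _, _ = typed, dyn
    }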
G: The first is naive clients being able to get cleanup without having to implement a client-side reaper: things like web consoles. I think the dashboard was blocked on this, and a couple of other clients. People have reported that once garbage collection goes in, naive clients can delete controllers and not have pods left around. I kind of had assumed that this would be something that people would want to enable, or choose to enable despite the impacts. But we do want people to enable it.
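The cleanup being described hinges on owner references: the controller stamps each object it creates with a reference back to itself, and the garbage collector deletes dependents once the owner is gone, so a naive client only has to issue one delete. A rough sketch of the field in question (the names and UID below are invented for illustration):

    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
    )

    func main() {
        isController := true
        // A pod created by a ReplicaSet carries something like this in
        // metadata.ownerReferences. The garbage collector builds its graph
        // from these references; delete the ReplicaSet and the pod goes too.
        ref := metav1.OwnerReference{
            APIVersion: "apps/v1",
            Kind:       "ReplicaSet",
            Name:       "frontend-abc12",
            // Owners are tracked by UID, not name, so a recreated object
            // with the same name is not mistaken for the old owner.
            UID:        types.UID("d9607e19-f88f-11e6-a518-42010a800195"),
            Controller: &isController, // "I am the managing controller"
        }
        fmt.Printf("%+v\n", ref)
    }

The Controller field is the "yep, I don't touch you" marker that comes up a bit later in this discussion of dueling controllers.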
G: So we can fix the rest of the bugs that probably still exist in it. Even though we're in pretty good shape now, better than we were in 1.3, not enabling it would probably just punt this problem, and we would have the same problem in 1.5 that we would have to go fix. But it seems like people believe that the functional part of it is correct now.
G: I don't think it's going to be possible. So we got closure on how we would do it: there's a proposal that we're going to write up in API Machinery for how we're going to enable partial object retrieval for protobuf, so that a naive client can say "I want to get..." So the dynamic client: the reason it's doing that is just to get at the ObjectMeta, to get the controller references. And we have a rough high-level agreement in the API SIG that we will introduce a mechanism we can use to say to the generic client, "I just want to get the ObjectMeta out of this protobuf object and get it back in protobuf," which should fix the API server side. The rough timing for that is: there's a lot of motivation to do it for 1.5, but I don't think we can guarantee we'll get it in 1.5.
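For what it's worth, the mechanism sketched here did eventually land upstream, well after this meeting, as the metadata client: the server returns only ObjectMeta (as PartialObjectMetadata) in whatever encoding the client asks for. A sketch with modern client-go, purely to show the shape of the proposal; none of this existed at the time of the recording:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/metadata"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
        if err != nil {
            panic(err)
        }
        // The metadata client negotiates PartialObjectMetadata with the
        // server, so only ObjectMeta (including ownerReferences) crosses
        // the wire, which is all the garbage collector needs.
        mc, err := metadata.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        pods := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
        list, err := mc.Resource(pods).Namespace("default").List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for _, item := range list.Items {
            fmt.Println(item.GetName(), item.GetOwnerReferences())
        }
    }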
G: In practice, all of the controllers ended up needing to post a reference back to themselves, because of dueling controllers. While the label selector sounds great, in practice the downsides of dueling controllers were very bad, and the cost to fix dueling controllers is basically that you put something on the thing you created so that other people can be like, "Yep, I don't touch you." To your point, though, about the graph: most of that is because we don't have tombstones, any way to tombstone in etcd today, for the deletion characteristic.
G
So
given
it's
a
graph
and
that
we
have
things
coming
and
going
that
are
disconnected,
and
we
can't
build
a
single
transactional
tree.
We
have
to
have
either
tombstones
or
a
long
wait
period,
and
the
memory
is
basically
dealing
with
that.
I
think
there
was
a
proposal,
maybe
further
down
the
road
that
we
would
track
tombstones
at
some
point
by
you
it
and
once
we
can
track
tombstones
by
you
it
then
the
memory
implications
go
away,
but
it's
kind
of
that
short-term.
C
I,
honestly,
I
think
you
know
you
know
it
back.
References
are
totally
saying
the
semantics
are
on
the
back.
References
in
terms
of
spreading
and
stuff,
like
that,
I
think
should
be
more
explicit.
I
just
want
to
make
sure
that
we're
not
continue
like
we're,
not
making
another
sort,
of
instance
of
that
ugliness.
That's
in
the
scheduler
right
now.
So
it
sounds
like
that's
not
the
case,
so
I
believe
it
is
not
the
case,
but
we
should
double.
C
G
G
G: So, given that we think there may be a fix in 1.5 for some of the worst performance aspects of it, but there's no guarantee: do we want to hold it up, do we want to make it opt-in for everyone, make it opt-out, or have a recommendation? Like, is opting out of a new feature really that bad for end users? Do we have the control to turn it off, etc.?
C
Mean
ideally,
we'd
have
some
sort
of
global
setting
that
you
could
set
here
for,
like
an
experiment.
I
mean
I'm
thinking
about,
like
the
been
during
experiment
that
the
goal
line
guys
went
through
and
they
were
able
to
age
that
in
over
a
couple
of
versions
so
that
there
was
some
leeway
right.
So
initially
it
was
off
by
default,
but
you
could
turn
it
on
and
then
it
was
on
by
default.
But
you
could
turn
it
off
and
then
it's
a
done
deal.
G: The only thing is that the memory impact is solely caused by running the garbage collection process. If you do not run the garbage collection process, you still get controller refs, but the controller refs are done at creation time, so it's basically free. I think if the controller is off, there is zero impact to the system for garbage collection, and it is exactly the same as a 1.3 system. Yeah.
G: In the controller manager today there is a flag, enable-garbage-collector, and enable-garbage-collector defaults to false. So today, if you have a one-master cluster that you turn on, coming from master or 1.3 to 1.4, garbage collection is not on by default: it has no memory impact, and the references are being set. Setting the flag to true will increase the memory usage, and when we go delete a deployment or a daemon set, all the pods get cleaned up by default.
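For the record, the flags in question around the 1.4 timeframe looked roughly like the following; both later went away once garbage collection became unconditional, so treat the exact spelling as an approximation:

    # alpha in 1.4, off by default; the two settings had to agree
    kube-apiserver --enable-garbage-collector=true
    kube-controller-manager --enable-garbage-collector=true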
H: Okay. Just as a cluster operator, from a capacity planning perspective: if I'm being told memory usage could potentially double, I might need some additional time to give myself room for that. So maybe not enabling it by default would be the least surprising thing here, but I totally understand your position that the longer we don't enable it by default, the longer this continues to be a problem. I'm just thinking about the element of least surprise, and really loud release notes or something.
A: I'm sitting here trying to think of the degree to which that matters. Just as a general principle, I think a lot of folks still look to GKE as sort of the canonical running system, and defaults that kind of match what GKE does; you know, there's some principle of least surprise there again, to echo what Aaron is saying, I mean.
G
I
can
set
it
from
an
open
chef
perspective.
We
are
unlikely
to
enable
this
an
open
shift
34
for
our
customers,
because
we
still
have
some
lingering
security
issues
with
regards
to
multi-tenancy
to
sort
out
so
in
a
multi-tenant
namespace
context
where
you
might
have
users
who
are
only
editors,
an
editor
can
use,
can
abuse
references
like
deletion,
cleanup
references
to
trick
the
system
into
deleting
a
resource
that
you
don't
have
authority
to
delete
only
to
edit
and
from
a
cube
perspective.
G
This
doesn't
matter
it
only
really
matters
when
you
have
fine-grained
roles
and
namespaces,
and
so
that's
our
primary
concern
and
we
have
the
selfish
desire
that
we
would
like
to
see
all
the
bugs
sorted
out
as
well
before
we
turn
about
introduction
environments,
so
somebody's
got
to
jump
off
the
ledge
first
and
we
get
that
we
have
the
excuse
of
security,
but
I
do
think.
We
really
need
to
move
forward
on
this.
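To make the multi-tenancy concern concrete: the garbage collector acts with its own authority, not the authority of whoever set the reference, so a user with only edit rights on an object can get it deleted on their behalf. A hypothetical sketch of the abuse (all names invented):

    package main

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
    )

    func main() {
        // An edit-only user updates an object they cannot delete, claiming
        // it is owned by something that does not exist. When the garbage
        // collector later observes the dangling owner, it deletes the
        // object itself, using its elevated permissions, not the user's.
        dangling := metav1.OwnerReference{
            APIVersion: "apps/v1",
            Kind:       "ReplicaSet",
            Name:       "never-existed",
            UID:        types.UID("00000000-0000-0000-0000-000000000000"),
        }
        _ = dangling
    }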
E: So, with respect to the issue that I opened: there are some blocks from the API server. It seems like we are going out of the history of etcd, but I'm not sure why exactly; either the API server is not able to process fast enough, or etcd is not able to send fast enough. Or could this be an issue with compaction, possibly?
A: You know, with regard to reliability and failures: the failure rates that we get are pretty high, and it seems to be Docker issues, that sort of thing. So we're just trying to get to a point where we have a cluster setup that's sufficiently reliable. I mean, I think we see the community tests pass far more reliably than anything we can manage to assemble at this point, and I'm still pretty concerned.
B: Against upstream, we are still in the process of our vetting for our 1.3 rebase; I'm still finding issues, and I believe David Eads just picked some stuff this morning, Clayton. So no, we are not reliably running at that scale, because we have a bunch of other stuff on our side: OpenShift adds a bunch of extra resources and objects and controllers, to the point where we don't run the numbers that upstream runs; we run denser clusters with fewer nodes.
E: Currently we have some reliability issues, but we basically are running 2000-node clusters continuously. We are mostly running 2000-node Kubemark, which is slightly different, but we are also running real clusters from time to time. And currently we are running them from head, basically all the time.
D: Okay, so maybe we can take this offline; I'll contact each of you and see if I can find out what your stats look like, in particular yours, Wojtek. We're consistently running into problems with clusters far smaller than 1000 nodes; I mean, even hundred-node clusters we've seen problems with, so, anyway.
B: There are a ridiculous number of hard-coded timing parameters inside of the tests, and that can be a weird source of issues and anomalies; for a long time we've struggled with that. Now that we have Jay back, I've asked him to start tweaking, or digging into, that space again, because we logged the issue a long time ago. So if you're seeing errors, it would be nice to at least have what those errors are reported upstream, and I wouldn't be surprised if a large number of them are timing artifacts. Okay.
C: I just want to mention, and we can dig up the issue number if you're interested, one long-standing sort of open issue that we want: some way to point yourself at a cluster and say "tell me about yourself," down to the level of all the parameters. This isn't setting parameters, which is a whole other effort; this is just collecting meta-information about a cluster so it's easily shareable. That's kind of been a want for a while, and so, you know, help wanted.
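Part of this exists today as a kubectl subcommand, though it dumps runtime state rather than the flag-level configuration being asked for, and how much of it was available at the time of this meeting is unclear:

    # dumps nodes, events, and workload state for sharing or bug reports;
    # it does not capture component flags or tuning parameters
    kubectl cluster-info dump --output-directory=/tmp/cluster-state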