From YouTube: Mesos Performance Working Group - July 26 2017
Description
First meeting of the performance working group.
So there have been a number of folks who've been working on performance-related stuff. Recently, NH and I were working on some performance improvements to libprocess; Ian, Xu, and Dimitri, I think we're all interested in making master failover faster. So we wanted to just have a meeting, and maybe do these on a recurring basis, to discuss kind of all things performance. That includes things like: hey, how scalable can we make Mesos?
How do we do benchmarking, and can we automate benchmarks, and so on. So today, what I think we wanted to do was, you know, do intros, which you just kind of did; then talk about some of the work that's happening right now; then hear a little bit from people about what they think the performance-related pain points are today and where we might want to focus our efforts going forward; and then I put in something to fill out, like a planning table, just to figure out
you know, who's going to be able to work on what things, and who can maybe shepherd those things. And then I think James Peach, who I guess hasn't joined yet, added some items here, one of them being the metrics sampling, which there were some discussions about maybe a month or so ago, and then I guess he had benchmarking and build performance. I see.
So we've embarked on basically message-passing performance optimization, just as the low-hanging fruit that we wanted to do before we did much else, because a lot of the subsystem is built on actors and passing messages between them. We figured that anything we did there from an optimization perspective would just help all future optimization as well. So let me paste in the ticket.
If you want to follow along and see what that benchmark is doing: if you're looking at MESOS-7798, which is "improve libprocess message-passing performance", it's basically been broken up into three different phases. The first phase was a combination of things, mostly around removing locks on all the fast paths of the code.
Stuff like that; but some of the other things we did were introducing a lock-free run queue, and that's an example of something that needs to be explicitly triggered at compile time. So when you configure, you have to use --enable-lock-free-run-queue when you actually build Mesos. Okay, so that wraps up phase one; then phase two was
having not only a lock-free run queue but also a lock-free event queue, so that there are no locks being acquired when actors are adding events into each other's mailboxes. That's also one that you need to configure: it's --enable... I forget exactly; I think it's --enable-lock-free-event-queue.
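As a rough illustration of what the lock-free queues buy you, here is a minimal sketch of a Treiber-stack-style lock-free container. This is my own sketch, not the libprocess implementation: producers and consumers exchange items with atomic compare-and-swap instead of acquiring a mutex, which is the general technique behind both flags mentioned above.

```cpp
#include <atomic>
#include <optional>
#include <utility>

// Hypothetical sketch of a lock-free LIFO queue (a Treiber stack).
// NOT the actual libprocess run queue; it only illustrates the
// CAS-instead-of-mutex pattern discussed in the meeting.
template <typename T>
class LockFreeStack {
  struct Node {
    T value;
    Node* next;
  };
  std::atomic<Node*> head{nullptr};

 public:
  void push(T value) {
    Node* node = new Node{std::move(value), nullptr};
    node->next = head.load(std::memory_order_relaxed);
    // CAS loop: retry until we swing `head` to our new node.
    while (!head.compare_exchange_weak(
        node->next, node,
        std::memory_order_release, std::memory_order_relaxed)) {}
  }

  std::optional<T> pop() {
    Node* node = head.load(std::memory_order_acquire);
    // CAS loop: retry until we detach the current head (or run empty).
    while (node != nullptr &&
           !head.compare_exchange_weak(
               node, node->next,
               std::memory_order_acquire, std::memory_order_relaxed)) {}
    if (node == nullptr) {
      return std::nullopt;
    }
    T value = std::move(node->value);
    delete node;  // NOTE: real code needs hazard pointers or epochs here.
    return value;
  }
};
```

The point is that neither `push` nor `pop` ever blocks on a lock; contention only costs a CAS retry, which is what keeps the fast path fast.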
Finally, the sort of third phase was a collection of other optimizations that we especially noticed were important on Linux. Probably the biggest one there is the kernel semaphore:
with the bare semaphore on Linux we can't really get great performance, so we built a fixed-size, last-in-first-out semaphore, and that yields tremendously better performance. Related to that last phase, I attached in MESOS-7798 a bunch of flame graphs (not all of them) capturing the performance wins that you see when you're using some of these last stages of optimizations. There are graphs both not using the LIFO semaphore
and graphs using the plain semaphore from before, and then also all the same flame graphs are included where we have double the load: we basically just increased the number of workers, sorry, the number of actors, 2x. So anyway, that's a pretty detailed account of some of the first stuff we've been trying to do in libprocess, and there's a lot more that we want to do. This was just on message passing, so probably the second phase past message passing will be
[inaudible] and all that stuff in libprocess. As was said earlier, we also want to capture things outside libprocess, in Mesos, and that's, you know, where we want to brainstorm where people feel like we should be spending most of our time and energy. But that's the libprocess bit. So happy to answer questions personally. Looks like James had a good question: how hard would it be to make the lock-free run queue a runtime option?
So that's one path; the other path is to try to make it work as a runtime setting, where it's still optimized, it's just messier, it's not, you know, a clean compile-time flag. So if that's something that's interesting, I'd be happy to take a look with you. It actually just happened to be compile-time for performance reasons.
I think it takes like three minutes or more; I've never even waited for it to finish without the optimizations turned on. But when you start to turn on the optimizations one at a time, you still won't see the benchmark finish, because you really need all of these optimizations. If anywhere in the fast path there's blocking on a lock, it doesn't matter: the performance is just abysmal. So you have to enable both the lock-free event queue and the lock-free run queue to actually run the benchmark and
have it pass in like two seconds, three seconds, five seconds, depending on your computer. But you really have to enable all those things. So that's another thing to point out: if you're thinking about trying some of these out, you should try them all, because you won't see real wins from just one of them. Okay.
Good. I'm looking at the first 2x no-LIFO SVG. Can anybody figure out where in the flame graph you're seeing the bad performance of the semaphore? Is it in any of the stacks, or am I looking in the wrong spot?
So what I noticed in my tests is that the semaphore performance on Linux was okay if you had enough actors and enough work that you were doing, and it was not as good if you look at the graphs with just no LIFO, so not 2x, but just the plain no-LIFO one. You see, we spend a tremendous amount of time in sem_post and sem_wait.
Right, so that's through libprocess, that's libprocess code. So what I believe is happening, and I'm not a hundred percent certain about this, what I believe is happening, James, is that this is now in the kernel, and perf is capturing basically the kernel stack, and from within the kernel it doesn't actually know where it's going to be returning to in user space. So I'm not a hundred percent sure about this, but I think
we don't see this on Mac at all. When I read the sources, the biggest implementation difference is that the semaphore on Linux requires a thread that just got woken up to try to re-post. So when you do a post, if you can't post, you sleep, and then when you get woken up, you try to post again; whereas on Mac that doesn't happen: the thread that gets woken up expects that the post has already been done. In other words,
that's the difference, and I've got a couple of notes in the doc for how that could potentially cause a lot of threads to be woken up without having anything to do, shutting down, another thread coming in, and so on. So anyway, I think we can take that one offline, and if folks want to chat with me specifically about their thoughts around the semaphore performance from that perspective, I would love to have that conversation. But that's what we introduced this week.
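To make the post/wake-up semantics concrete, here is a toy counting semaphore. This is a plain mutex-and-condition-variable sketch, not the LIFO semaphore from the ticket: `post` publishes the count before waking a waiter, so the woken thread only consumes the count and never has to re-post the way the Linux behavior described above requires.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical illustration of semaphore semantics, not the libprocess
// LIFO semaphore: the poster increments the count first, so a woken
// waiter finds the count already published and simply consumes it.
class ToySemaphore {
  std::mutex m;
  std::condition_variable cv;
  int count;

 public:
  explicit ToySemaphore(int initial) : count(initial) {}

  void post() {
    {
      std::lock_guard<std::mutex> lock(m);
      ++count;  // publish the count before waking anyone
    }
    cv.notify_one();
  }

  void wait() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return count > 0; });
    --count;  // the woken waiter consumes the count; no re-post needed
  }

  int available() {
    std::lock_guard<std::mutex> lock(m);
    return count;
  }
};
```

This is only to pin down the vocabulary of the discussion; the real performance question is what the kernel does under contention, which a mutex-based toy cannot show.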
Basically, what we have with protobuf now is that it doesn't support move construction, but they are working on this now, and probably in the future it will support that. Another thing: in reregistration there is a vector of, I don't remember what exactly, and it doesn't support move construction. So every time we do a defer to pass the message further, this vector is copied, like seven or so times. So the idea was to just use move construction, and to help with that I posted a spreadsheet which captures the results.
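The copy-versus-move cost being described can be illustrated without protobuf. In this sketch `Task` is a hypothetical stand-in for a protobuf message (the real messages in question don't support move construction yet), and the two functions show what each hop through a defer costs in the copying versus the moving case.

```cpp
#include <string>
#include <utility>
#include <vector>

// `Task` is a hypothetical stand-in for a protobuf message with a
// heap-allocated payload; the names are illustrative only.
struct Task {
  std::string payload;
};

// What each defer hop does today: a deep copy of every element.
std::vector<Task> forwardByCopy(const std::vector<Task>& tasks) {
  return tasks;
}

// What move construction would allow: the buffer is stolen, no
// per-element copy at any hop.
std::vector<Task> forwardByMove(std::vector<Task>&& tasks) {
  return std::move(tasks);
}
```

If the vector is copied "seven or so times" per message, the copying version pays seven deep copies of every payload, while the moving version pays only pointer swaps.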
As for the benchmark: under regular conditions it completes in about a minute and a half, but after this patch is applied it runs for just about one minute. That cannot be solely attributed to this patch, because there is also a patch improving the performance of reregistration in this place: it avoids duplicate requests for reregistration. So part of it is because of faster reregistration, and part of it because of avoiding duplicate requests.
It's currently marked work-in-progress on Review Board just because we need to discuss how to make this work with protobuf arenas. It's currently unclear, because when we convert something to a vector, and the objects in this vector are coming from protobuf, things become complicated: if they are on an arena, then the vector should also be using the arena allocator, and so on. But in general, this improves the performance of converting a RepeatedPtrField to a vector. So yeah, that's it.
Just one thing related to metrics: when we first encountered this problem we were completely blind, because metrics were not working at all. It's currently an HTTP request, which itself has several dispatches inside, and the queue sizes were like hundreds of thousands of messages, so basically serving this metrics request required several minutes. What we saw was just our metrics disappearing for some time, and then the master is up and everything is okay.
It's a two-part thing. The first one is that it's an HTTP request to the master process, and it has several dispatches inside, and every time we dispatch to the master it lands at the end of the queue, and the queue is huge. The other is that just to acquire a metric, for some of them at least, we need to do a dispatch to the master; counting the number of messages, for instance, is a dispatch to the master, and that was the problem. Okay.
Specifically around this: I know that to be the case in many instances, because a lot of the defers that we end up doing on the metrics path could be implemented as counters. A patch was attached a while ago that looks to implement a gauge metric using a simple std::function instead of a defer. So another approach is we could implement a gauge that's backed by a counter, so it's not backed by a defer, and if we do enough of those, then it should improve how the metrics are done.
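A sketch of that idea, with illustrative names rather than the real Mesos metrics API: the gauge's value is maintained by an atomic counter at the points where state changes, so the metrics endpoint can read it immediately instead of queueing a dispatch behind everything else in the master's mailbox.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical counter-backed gauge; names are illustrative, not the
// actual Mesos metrics API. The owning actor updates the counter as
// state changes, and the metrics path reads it without any dispatch.
class CounterBackedGauge {
  std::atomic<int64_t> value_{0};

 public:
  void increment() { value_.fetch_add(1, std::memory_order_relaxed); }
  void decrement() { value_.fetch_sub(1, std::memory_order_relaxed); }

  // No dispatch to the master actor: this never waits in its queue.
  int64_t value() const { return value_.load(std::memory_order_relaxed); }
};
```

The trade-off is that every state change must remember to update the counter, whereas a deferred gauge computes the value on demand from the authoritative state.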
Okay, so we've got over 20 minutes left. I'm just trying to think about the ordering of these: we can definitely discuss performance-related pain points, but I'm curious if we should try to fill out a planning table, or if we should try to talk about, you know, metrics sampling, benchmarking, and build performance. If we do that, I don't think we'll get to fill out a planning table and kind of figure out what people can work on today.
You'll probably run into JNI problems. So originally we had a patch which did that, and what happens is you run into JNI problems because the JNI library is using jemalloc. I think we ended up with a problem because the Java runtime environment was using the system malloc, but the resource module was using jemalloc.
Adding to that, I think we're thinking about something additional, which is how to handle history: separating the runtime data in the master from the archived things that are just for the web UI, and moving those to another process, I mean actor, which I think could offload the master actor.
I guess some of these things I was hoping to maybe talk about when we're planning; we're doing kind of like a planning doc of what we might work on. I was thinking more of performance-related pain points, like: oh, our throughput in the API is bad, or the master takes a long time to fail over, or the agent takes a really long time to recover from the state on disk, things like that.
I was just adding to that point. In terms of pain points, the existing ones everybody has probably already mentioned: metrics, web UI slowness, and failover performance. And in terms of the HTTP API throughput, I think Anand probably has more data on it.
Yeah, I've definitely talked to Anand a lot, so I think I know all the ideas he has. Some of them are in libprocess itself, some of them are in how we write code in Mesos, but there's definitely a lot of stuff we can do there to make it faster. And I think, if I recall correctly, the throughput is lower than the old API, right?
So actually, I was just referring to our web UI, so this is probably going to affect it. The end result is it's affecting the speed and responsiveness of the web UI, right, not just the /state endpoint in itself, because we probably want to move away from that endpoint. But I was just looking at a performance trace of our web UI for one of our clusters.
So I was waiting for someone to mention the web UI, so I'm glad you did. That's one thing where I think, you know, the user experience is pretty bad for large clusters, or even clusters that aren't that large but have a lot of state. So I'm going to add that.
What do you think would be the best use of the time? I think we should maybe figure out where people can actually lend their time helping with this effort while we're together now. Agreed? Okay, so we'll kind of punt these few topics to next time. What I was hoping to do is, you know, fill out this kind of table. We can just do it very roughly right now, but
I'll clean it up later. I noticed that in the containerization working group they have a nice table that kind of shows this stuff, so I'll make it look like that. But for now I just kind of wanted to get a sense of what are all the things that people think we should do, and, you know, who can actually help with which things. So I'll get this started.
So that's kind of what I wanted it to look like: just, you know, what the thing is, who's going to write patches, and who's going to maybe help review. So feel free to just add stuff or call things out, and we can write them down.
We're optimizing the HTTP path, but we don't have a benchmark for it, and you know, especially for folks that have a pretty good understanding of what their benchmark, sorry, what their HTTP traffic looks like, it'd be great to help get a benchmark in place for that. And then, like: what's the typical size of the protobufs, the number of protobufs, the frequency they're showing up at? It would be great to sort of have that. Is that good?
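One possible shape for such a benchmark, assuming the message sizes and counts have been measured from real traffic: `handle` below is a placeholder for whatever serialization or request-handling path is under test, not a real Mesos function.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>

// Placeholder for the code path being measured (e.g. deserializing a
// protobuf of a given size); here it just scans the payload.
static std::size_t handle(const std::string& payload) {
  std::size_t checksum = 0;
  for (char c : payload) {
    checksum += static_cast<unsigned char>(c);
  }
  return checksum;
}

// Replay `messageCount` synthetic messages of `messageSize` bytes,
// mirroring the observed traffic shape, and return elapsed seconds.
double benchmark(std::size_t messageSize, std::size_t messageCount) {
  const std::string payload(messageSize, 'x');
  const auto start = std::chrono::steady_clock::now();
  std::size_t sink = 0;
  for (std::size_t i = 0; i < messageCount; ++i) {
    sink += handle(payload);
  }
  const auto end = std::chrono::steady_clock::now();
  if (sink == 0) {
    std::cerr << "";  // keep the compiler from discarding the work
  }
  return std::chrono::duration<double>(end - start).count();
}
```

Parameterizing on size, count, and arrival rate is exactly what makes knowing the "typical protobuf size and frequency" from production traffic valuable.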
So I wonder whether that's too specific. No, no, but yeah, I think this is one of the very important cases, so we might as well.
We can get more fine-grained with it as well, to see where people can help. For now, I think we've got enough things going on. Maybe, Ian, you're done with your registry patch that you had sent out, but we're still working through the libprocess optimizations, and Dimitri still has these patches that need to get through.