From YouTube: OMR Architecture Meeting 20210722
Description
Agenda:
* GC Parallelism and Adaptive Threading (#5829) [ @RSalman ]
A
Welcome everyone to the July 22nd OMR Architecture Meeting. Today we have one topic: it's a GC topic presented by Salman Rana, so I'll just turn it over to him. Take it away.
B
All right, hello everyone and good afternoon.
B
My name is Salman; as mentioned, I work at IBM on the runtimes team, specifically on the GC team, focusing mainly on optimization and performance. Today I'll be talking to you about parallelism and adaptive garbage collection threading, or adaptive GC threading for short. The goal of this talk is to give an under-the-hood look at the GC and to speak about performance and optimizations, specifically by taking a look at GC parallelism and adaptive threading. One thing I should point out before we start is about the work around this.
B
It has also been referred to as dynamic threading and thread throttling in the OMR GitHub repos, on the various issues and the different PRs that have been opened. So it's been referred to by different names, but I think adaptive GC threading is a good umbrella term which encompasses the different goals of this work.
B
Another thing I should point out is that for most of this discussion I'll be approaching it from an OpenJ9 VM implementation perspective. OpenJ9 is the most comprehensive consumer of OMR, so I'll be talking from that perspective, even though this is all common code and technically it could be used by any consumer of OMR. All right, so, an overview of the talk. I assume everyone knows what the GC is and what it does, but I'll
B
throw a slide or two up just to go over it. Then I'll talk specifically about OMR GC technology, and then we'll get into parallelism: what parallelism means to the GC, the implications of it, how the GC takes advantage of it, and some throughput and pause time discussion around that. Then we'll talk about the motivations for adaptive threading, resulting from the issues with parallelism that we observed in certain workloads, and we'll talk about
B
why we saw them and the assumptions and the investigation around that. Then we'll get into adaptive threading itself: the core idea of it, how it solves some of the issues with parallelism, and the models and heuristics that we use. From there we'll get into some internals; we'll talk about the actual implementation of the optimization,
B
so anyone could go into the code, take a look at it and make sense of what's happening under the hood. Then we'll wrap things up with performance results and perhaps some future work that can come out of this.
B
So I'll just quickly go over garbage collection. As many of you know, garbage collection is automatic memory management, with the end goal of reclaiming garbage, that is, memory that's allocated by a program but no longer referenced. But even though the end goal is reclaiming memory, memory management implies a lot more than that: it has to handle allocations,
B
it has to worry about the heap layout and memory layout, have access barriers in place, ensure object validity, things of that sort. All that to say, the GC touches a lot of different areas in the runtime other than just popping up when memory needs to be reclaimed. Nowadays the majority of languages, at least the top 10 if you take a look at them, provide some sort of GC component, and I don't think we need to make a case for it.
B
There are a lot of advantages: it eliminates a big category of bugs, speeds up development, lets developers focus on the problem at hand, etc. But without a doubt there are some associated costs and drawbacks. Although we can't get rid of all of them,
B
we can have optimizations in place to handle them in a smarter way, and also look for certain patterns, certain scenarios and situations where we can mitigate some of those drawbacks; the biggest drawbacks being the runtime costs associated with it and unpredictable application pauses.
B
I think, before speaking about optimizations, it's important to understand the user's perspective. From a high-level user perspective, it's usually a compromise between application throughput, pause time and sometimes footprint, and these can be considerable things for the user.
B
For example, if the user has a VM deployed in the cloud, there are thousands of instances running and they're being charged based on memory footprint, then footprint is something significant for the user to consider. What this means for us as GC developers is that the internal technology we develop to accommodate these different goals of the user is significantly different: we have different technologies to accommodate these different goals. So, for example, we could have different heap layouts, a flat heap versus fixed-size regions,
B
or, for example, generational versus just traditional mark-sweep collection. But I think it's most evident when you consider stop-the-world technology versus concurrent, and its implications on throughput and pause time: a user developing an application that's more sensitive to pause times might want to consider concurrent technology and give up some throughput, while someone who's more concerned with throughput might be
B
okay with the longer pauses and might go for stop-the-world. So, as developers, we have to accommodate these different needs and requirements with different technology. And I think, to understand and categorize the OMR GC technology specifically, we can think of it in terms of policies.
B
So, in order of increasing complexity from the top down, we can start with optthruput, which is the simplest of the three policies I have listed here. Optthruput is a traditional, standard stop-the-world collector, mark and sweep; it gives very good throughput, but at the expense of some pause time. And optavgpause builds on that with a concurrent component:
B
it adds a concurrent marking component to that stop-the-world collector, and it gives up some throughput for better pause times. Then we have gencon, which takes those concepts from optavgpause, uses that as a global collector, and then introduces a new local collector for a generational style of garbage collection.
B
For the purposes of this talk we'll be speaking about gencon specifically, as it is the default in OpenJ9; it provides the best results in most cases and is a good compromise between pause time and throughput. But even though we'll be speaking about gencon, the concepts of parallelism and adaptive threading apply universally to all the different styles of collection.
B
The underlying technology actually comes into perspective when you look at the internal hierarchy, the class structure of the different collectors. This is the high-level hierarchy of the collectors themselves. If you notice, optthruput is a parallel global GC collector, and optavgpause
B
inherits components from the parallel collector and adds a concurrent component to it, whereas gencon uses two different collectors: it uses scavenger as a local collector and it uses the concurrent GC as a global collector.
B
Again, for the purposes of this talk we'll be speaking about scavenger specifically, since the majority of collections happen in scavenger by default, as gencon is the default policy in OpenJ9. So first we'll take a look at things from an application perspective and see what's happening: if you were to observe GC activity and see what's happening relative to the application, you would see something like this.
B
We'd see application threads running: the application is running, doing useful work, allocating, doing activity, and at some point perhaps there's an allocation failure, where there's not enough memory to allocate an object. At this point the GC kicks in and there's a pause: application threads are halted, the GC does its work to reclaim memory, and then it releases control back to the application threads with freed memory, and they continue on.
B
This is how the default scavenger currently works, and there's also a concurrent variant, which I have also shown a trigger for. I think the only thing relevant to our talk here is that even with the concurrent variant we also have these pauses; even though it's a concurrent style of collection, the pauses are a lot shorter, but they're there.
B
So we're focused on observing these pauses and how they can be made shorter, or whether there's anything obvious that sticks out in certain scenarios or cases that might result in unpredictable or unjustified long pauses. We'll be taking a look at the effects of parallelism on these GC pauses and how we can end up with some long, unnecessary pauses as a result.
B
So, parallelism. I think parallelism is a given with any modern application that's CPU intensive, that's processing work, that's computation heavy, especially in the last 10 to 15 years with how computing has evolved. Throughout the years, computing hardware has increasingly scaled and there are more resources available, and consequently garbage collection tasks are parallelized to take advantage of this and overall be more performant.
B
So parallelism decreases pause times, as we have more resources available to us. All collectors in all major VMs have parallelism; it's a given, there's nothing new about it in general. It's a key optimization to reduce GC cycle times: for example, traversing an object graph with two threads will be a lot faster than doing it with one, or with any larger multiple in that case.
B
So currently it would make sense to use all the available resources. That's why, in OpenJ9 at least, the total number of GC threads used is equal to the number of hardware system threads available, so we maximize the utilization of available resources.
B
If we come back to this diagram, zoom into the GC pause time to see what's happening there, and look at parallelism and the threads running, we would expect to see one thread executing, known as the main GC thread, and at some point worker threads are spawned, so parallelism kicks in. These worker threads do work, helping the main GC thread, then they finish up and the main GC thread continues on and releases control back to the application.
B
If you analyze this further, we would see three distinct phases during collection: two phases where the main thread is running exclusively, without worker threads, and the phase where the bulk of the work is happening, where the actual collection takes place; that's the second phase in this bottom diagram. In the first phase the main thread is just setting up, doing some reporting and spawning the worker threads themselves, and at the end main is doing some cleanup work, reporting and bookkeeping.
B
So you're probably wondering: what's the issue here, what's the big deal? Everything looks okay, and if we have more resources we should be able to use them, right? But let's take a look at this example, which tells a different story. We have the same workload running on the same system, but in one run we're limiting the parallelism to four threads, whereas in the other run we're letting it use all the available resources, which was the default behavior before adaptive threading.
B
So there's something clearly wrong here: the pause times are consistently three to four times higher with the 48 threads, so something's wrong. Intuitively, a larger system should perform better than a small one: a 48-core system should outperform a four-core system; it should either be better or at least the same.
B
It shouldn't be worse; you shouldn't be penalized for using a larger system, that just doesn't make any sense. But that's what we're seeing. So we have to ask ourselves and look at whether there's any cost or overhead associated with parallelism.
B
We know that there are additional requirements with multi-threading. We know that we have to synchronize: there are critical sections, we have to access global resources, and there can be races between threads, so we have mutexes and semaphores and all those different synchronization mechanisms in place. And we also have to manage threads: they have to be dispatched, they have to be suspended, and they have to be notified to wake up when they're idle.
B
So let's look more into this issue, at what exactly is being synchronized and why we have this overhead in the code, in the technology.
B
For example, with mark maps, the mark bits for multiple objects share one word, and different threads marking those objects can race in updating the mark map, so we have atomic operations there, which we can't get around; they're compare-and-swaps, atomics. GC threads are also frequently pushing and popping from the work stack, so we need a mutex there to control access. And then we also have an issue with the distribution of work: threads can go idle when they don't have anything to do.
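As a rough illustration of the mark-map race, here is a minimal compare-and-swap sketch; the names are illustrative, not the actual OMR identifiers:

```cpp
#include <atomic>
#include <cstdint>

/* Illustrative sketch only (not the OMR code): mark bits for several objects
 * share one mark-map word, so concurrent GC threads must update the word
 * atomically. Returns true if this thread won the race to mark the object. */
bool atomicSetMarkBit(std::atomic<uintptr_t> &word, uintptr_t bitMask)
{
    uintptr_t oldValue = word.load(std::memory_order_relaxed);
    do {
        if (0 != (oldValue & bitMask)) {
            return false; /* another thread already marked this object */
        }
        /* compare-and-swap; on failure oldValue is refreshed and we retry */
    } while (!word.compare_exchange_weak(oldValue, oldValue | bitMask,
                                         std::memory_order_acq_rel));
    return true;
}
```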
B
These are all different things to consider when we talk about synchronization between the different threads.
B
So in the example before, with the 48 threads versus the four threads running the same workload, there was obviously noticeable overhead, which resulted in the higher collection times, and given that the only varying parameter between the two runs was the number of utilized threads, we can very confidently say that the overhead was the result of parallelism. And I also mentioned before that parallelism is key in reducing pause times, but we're seeing these long pauses, which aren't acceptable.
B
So how can we explain this, and how do we reconcile these two ideas? For that, we consider two cases where there might be a problem with parallelism, where we might actually end up with detrimental parallelism. The two main reasons: one of them is that there's little work to be distributed to the different threads, and the second one is CPU saturation, that is, high CPU utilization.
B
In the first case, there's a limited amount of work that can be divided between the threads, so threads end up being underutilized but still incur overhead. Also, depending on the workload and the object graph, collection may only be parallelizable up to a certain number of threads, so there's an imbalanced distribution of work when too many threads are utilized.
B
So with this type of parallelism we end up incurring overhead without gaining any benefits. And in the second case, CPU saturation, the effectiveness of parallelism is limited by the availability of the threads. This can be true, for example, for a system running multiple VMs: in that case, system threads are shared among the different VMs and the different GCs. For example, if you have two VMs running, the same set of threads will be scheduled to both.
B
There will be a lot of context switching, and when a GC takes place, it will need the threads scheduled to it to continue on and to clear synchronization points. So, overall, these kinds of shared threads limit GC progression and, as a result, provide ineffective parallelism; and the other big cost is the context switching, which has a big impact on GC performance.
B
So, overall, unless the benefits of parallelism are greater than the overhead, parallelism will be detrimental and cause GC times to increase unnecessarily, and this overhead can be significant: it increases proportionally with the number of threads that are utilized.
B
So, coming back to this example: if we were to run it again with 8, 16, 24 or 64 threads, we would see that as we increase the number of threads, the issue is compounded; if we were to plot 64 threads here, we would see that the pause times are even higher than with 48 threads.
B
This kind of reminds me of, we could draw a parallel with, the example of too many cooks in the kitchen: when there are too many people working together on something, it can result in the final product being negatively affected, and I think that's precisely what's happening with these threads.
B
Knowing this, we have to ask ourselves what the right number of threads is. Are we using too many threads and getting bad performance? If so, how much do we decrease by? Or is it possible that we're not using enough threads and missing opportunities to parallelize the task further and gain benefits?
B
So it's kind of like the Goldilocks situation, if you're familiar with it: we're looking for that sweet spot where it's not too high and neither is it too low; we're looking for the perfect number. That brings us to sub-optimal versus detrimental parallelism, where what we're asking is: is there a net loss, or are we losing potential gains?
B
So, in general, we could say parallelism is most beneficial when done with the correct number of threads, and we can call this number the optimal thread count. Anything more than this number would result in unnecessary overhead; on the other hand, a thread count less than the optimal thread count would be considered sub-optimal, since there would be more opportunities to parallelize the task further and gain benefits.
B
These two tables show a comparison of two different workloads. The table on the left shows a workload running with 48 threads, eight threads and four threads on the same system, and we can see that as we decrease the thread count there's decreased parallelism, obviously, but the cycle times also decrease and the performance results actually increase.
B
On the table to the right, we're running a different workload, again doing the same experiment, running with 48 threads, eight threads and four threads, but here we actually observe that peak performance is obtained when we run with eight threads: with 48 threads we're getting too much overhead, and with four threads we're missing opportunities to further parallelize the task with more threads, so we're missing out on benefits there.
B
So here we can say that with 48 threads we have detrimental parallelism, and with four threads we end up with sub-optimal parallelism, and the point at which we reach peak performance can be referred to as the equilibrium point; in the table to the right that would be around eight threads, since we get the best performance there.
B
So, with all that being said, we need something to solve these issues that we see with parallelism, and we know that it's not so straightforward.
B
Before we get deep into adaptive threading, I think it would be helpful to draw a parallel with an example that we all might be familiar with, and that's Brooks's law, which states something like: adding more resources to a late project makes it even later. I think the idea of adaptive threading can be related to this, so we can draw a parallel. If we look at the graph, we see the months until completion of a project against the number of people contributing to the project.
B
As we add more people and resources to a project, the time to completion decreases, until a certain point where any more people added on become a net loss, because there are increased coordination costs, onboarding and ramping up. So there's this inflection point, which we would say is the optimal number of people to have on a project. In terms of GC pauses and threads,
B
we could say we have the same relationship, where adding on helper threads, increasing parallelism, results in decreased pause times only up to a certain point, after which each additional thread contributes to a loss, because there's the additional overhead without getting any benefits. So this inflection point can be seen as the equilibrium point at which we reach optimal parallelism: anything beyond this number results in increased cost, and anything before this number leaves potential gains on the table.
B
So, coming back to the Goldilocks example, where we're looking for the sweet spot, or the Goldilocks number, which is just right: adaptive threading is looking for that sweet spot based on observations from a completed cycle, and we want this number to be just right, so that we're not losing out on benefits and neither are we incurring unnecessary overhead.
B
So, essentially, adaptive threading is an optimization, answering questions such as when to adjust and how much to adjust by, and it does so to tune parallelism and seek that equilibrium we're talking about, where parallelism results in peak performance.
B
So it's a systematic approach designed to identify these detrimental and sub-optimal scenarios, determine the degree to which parallelism is sub-optimal and, as a result, give a recommendation for how we can actually reach that optimal thread count from where we are currently. Adaptive threading also needs to ensure that the changes in parallelism are not invasive, given anomalies or one-off observations.
B
So it needs to take everything into consideration, and one thing to note is that this optimal thread count isn't static: it changes over the lifetime of the application. So adaptive threading is a continuous process of re-evaluating and re-adjusting the thread count. For example, workload and load distribution properties can change as the object graph changes: you could have higher allocation rates, or the live set increases; or, in terms of CPU utilization,
B
CPU utilization can be higher or lower later on. So it's a dynamic system.
B
So let's compare this to the traditional approach that we had in place. Before adaptive threading, the only things we had in place to mitigate this were looking at the heap size to roughly estimate the number of threads to be used, and then manual tuning, where the user could set
B
the thread count themselves. But manual tuning requires tedious experimentation and analysis of the application's performance, and overall, traditional tuning always assumes that this optimal thread count is static and just doesn't change, whereas we know that it can change over the lifetime of an application.
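As a point of reference, both knobs exist as OpenJ9 command-line options; the names below are from the OpenJ9 documentation (the -XX toggle shipped alongside this work, so availability depends on the release):

```
java -Xgcthreads8 MyApp               # manual tuning: pin the parallel GC thread count
java -XX:+AdaptiveGCThreading MyApp   # let the GC adapt the thread count on its own
```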
B
So adaptive threading is a superior solution to these.
B
It's a systematic approach that's used to evaluate a completed cycle by looking at two things: the number of threads that were utilized, and overhead data from using that number of threads. In terms of overhead data, we're specifically talking about busy times and stall times for each thread participating in garbage collection. Busy and stall times are key: they're what drive adaptive threading. These measures hint at CPU utilization,
B
and they tell us about the object and workload distribution. Busy time is any time a thread is performing useful GC work which contributes to completing the cycle, like scanning objects, processing roots, copying or marking objects. Stall times are more interesting.
B
These are times when threads are doing non-useful or trivial work, or times when a thread is idle, not doing anything at all; these don't directly contribute to completing the cycle. It includes things like pushing and popping something to a shared list, acquiring a synchronization monitor, idling while waiting for work, and notifying idle threads. All of these can be considered stall times, where the thread is doing non-useful things.
B
One thing to consider is that different types of stalls have different characteristics and varying dependency on the number of utilized threads. What that means is that these different types of stall times respond differently when we change the number of threads.
B
Specifically, we have three different types of stalls that we can categorize: synchronization stall, resume stall, and idle waiting for work. Resume stall is mostly dependent on the OS and the platform, whereas idle waiting for work is more dependent on the object graph, the live set and the availability of work for the thread.
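As a rough illustration of the bookkeeping this implies, here is a minimal sketch with hypothetical names (the real OMR accounting lives in its GC stats structures):

```cpp
#include <chrono>
#include <cstdint>

/* Hypothetical per-worker accounting, one counter per category from the talk. */
struct WorkerTimes {
    uint64_t busyNs = 0;        /* scanning, copying, marking: useful work */
    uint64_t syncStallNs = 0;   /* waiting at a synchronization point */
    uint64_t resumeStallNs = 0; /* OS/platform cost of being woken back up */
    uint64_t idleStallNs = 0;   /* idle, waiting for work to show up */
};

/* Scope timer: charges the elapsed wall-clock time to one counter. */
class ScopedTimer {
public:
    explicit ScopedTimer(uint64_t &counter)
        : m_counter(counter), m_start(std::chrono::steady_clock::now()) {}
    ~ScopedTimer()
    {
        m_counter += std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - m_start).count();
    }
private:
    uint64_t &m_counter;
    std::chrono::steady_clock::time_point m_start;
};

/* Usage at a synchronization point (waitForOtherWorkers is hypothetical):
 *   { ScopedTimer t(times.syncStallNs); waitForOtherWorkers(); }
 */
```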
B
For these different threads, it would look something like this. Over here we have some examples of the different time measures and where the stalls are coming from: we have synchronization stall in red, resume stall in yellow, and the time to notify in purple, and these are all times which we would consider non-useful or trivial, because they don't explicitly contribute to completing the GC work. And then we have busy times, in green.
B
So in this example there's a synchronization point, and there are four threads which have to clear it. Worker three is the first thread to reach the synchronization point, and it waits there for all the other threads to reach it; this time that it's idle we would consider a stall time. Then the last thread, worker one, reaches the synchronization point and notifies the rest that we're all synchronized.
B
So it notifies them to wake up, and there's actually an overhead for it to go ahead and wake up the different threads, so we would consider that a stall and measure it. Then, as the worker threads are waking up, they have to reacquire the mutex, the monitor; they're waking up one by one, and so there's an overhead to resuming these threads as well.
B
So let's take a look at the model, to understand how this information helps us estimate an optimal thread count. An implementation of the model can be derived from finding the minimum of a GC time function, and this GC time function proves to be pretty good. What this function basically does is project the total duration of GC for m threads, given the busy and stall times observed while performing GC with n threads.
B
So if we have n threads, then we know the busy and stall times that resulted from using n threads, and we can project the total duration of GC if we were to use m threads instead. If we solve that for the minimum, we end up with something like equation one, which gives us the optimal number of threads as a function of the previously observed thread count and the busy and stall times resulting from it.
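The slides with the actual equations aren't reproduced in the transcript, but the following sketch is consistent with the description here and with the worked example later in the talk. Suppose the completed cycle used n threads, with observed aggregate busy time B and stall time S; the projected GC time for m threads is then of the form:

$$T(m) \approx B \cdot \frac{n}{m} + S \cdot \left(\frac{m}{n}\right)^{X}$$

The busy work divides across the threads, while the stall grows with the thread count, with X (a model constant, discussed below) capturing the non-linear dependency of stall on thread count. Taking the derivative, setting it to zero and solving gives equation one, the optimal thread count:

$$\frac{dT}{dm} = -\frac{B\,n}{m^{2}} + \frac{S\,X\,m^{X-1}}{n^{X}} = 0 \quad\Longrightarrow\quad m_{\mathrm{opt}} = n \left(\frac{B}{X \cdot S}\right)^{\frac{1}{X+1}}$$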
B
On top of that, we're basically using heuristics and weighted averaging to massage the thread count and to make sure it's not too invasive, using experimental data that we know gives better results. Equation one can also be rewritten, for simplicity, in terms of percent stall: instead of being written in terms of three parameters, we can reduce it down to two parameters. The final model is shown at the bottom here, outlined. But just stating the percent stall here is somewhat of an oversimplification, because coming up with the percent stall is actually more involved and requires closer consideration.
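With percent stall defined as the stalled fraction of total thread time, s = S/(B+S), we have B/S = (1-s)/s, so the same formula needs only two inputs (n and s), which is presumably the two-parameter form mentioned here:

$$m_{\mathrm{opt}} = n \left(\frac{1-s}{X \cdot s}\right)^{\frac{1}{X+1}} \;\xrightarrow{\;X=1\;}\; n \sqrt{\frac{1-s}{s}}$$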
B
Another thing to note is this constant X: it's a model constant that helps model the non-linear dependency of stall times on the GC thread count. I won't go into too much detail about these different constants, like H, W and X; I'll leave a link that goes into these in detail.
B
So let's see this in action. This table gives an example of how the model and heuristic approach works: it gives a matrix of inputs and the resulting recommended thread count. For example, if at the end of a cycle we calculate a percent stall of ninety percent after using 12 threads, then the model would recommend throttling the thread count down to eight for the next cycle.
B
Similarly, if we were to use 12 threads and we determined the percent stall to be 35 percent, we would increase to 15 threads. Then, in the next cycle, we would re-evaluate again: did adjusting the thread count actually help, and what effect did it have on the percent stall? And if it's positive, it could either be stable, and hold that thread count, or it could continue adjusting it over the subsequent cycles.
B
Here the model constants are X equal to one, the thread booster equal to 0.85, and the weighted averaging at 50 percent.
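Putting the model and the quoted constants together, a sketch like the following reproduces both rows of the example table; note that the exact way the booster and the weighted averaging are applied here is an assumption chosen to match the worked example, not necessarily the actual OMR code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

/* Model constants as quoted in the talk. */
static const double X = 1.0;        /* non-linearity of stall vs. thread count */
static const double BOOSTER = 0.85; /* thread booster heuristic */
static const double WEIGHT = 0.5;   /* weighted averaging with previous count */

/* Recommend a thread count for the next cycle from the completed cycle's
 * thread count and percent stall (hypothetical sketch, illustrative names). */
unsigned recommendThreads(unsigned current, double percentStall, unsigned maxThreads)
{
    /* Equation one in its two-parameter (percent stall) form. */
    double optimal = current * std::pow((1.0 - percentStall) / (X * percentStall),
                                        1.0 / (X + 1.0));
    /* Booster leans the result toward more threads (assumed applied as a divisor). */
    double boosted = optimal / BOOSTER;
    /* Weighted averaging keeps the adjustment from being too invasive. */
    double next = WEIGHT * current + (1.0 - WEIGHT) * boosted;
    return (unsigned)std::max(1.0, std::min(next, (double)maxThreads));
}

int main()
{
    /* The two rows from the example: 12 threads at 90% stall -> 8,
     * 12 threads at 35% stall -> 15. */
    std::printf("%u\n", recommendThreads(12, 0.90, 64)); /* prints 8 */
    std::printf("%u\n", recommendThreads(12, 0.35, 64)); /* prints 15 */
    return 0;
}
```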
B
C
B
If we come back to the diagram, look at the three different phases and compare them to this flowchart on the right, we can see how adaptive threading works at the different phases. During pre-collection we actually want to adjust the thread count based on the previous cycle's observations, so before the main thread spawns the worker threads, we want to see if there's any recommended thread count based on previous observations. Then, during the collection itself,
B
we want to collect all of the overhead data, and during post-collection we want to use that collected data to project an optimal thread count which can be used for the next cycle. Like I mentioned, the next cycle's pre-collection looks at the thread recommendation from the post-collection of the cycle that completed just before it. So let's take a look at each phase in a bit more detail, starting with post-collection.
B
So at post-collection time, what adaptive threading does is this: once the main thread completes, it checks whether all the worker threads have suspended or not, and if they haven't, it waits for all the worker threads to complete. Once they complete, it goes and aggregates all of the overhead data, and from there,
B
at this point, we know the total cycle time, and it can compute the percent stall by massaging some of that overhead data, and then compute the optimal thread count by providing that percent stall to the model. From there we can determine and set the number of recommended threads for the next cycle. So during the next cycle, during pre-collection, when the main thread goes to dispatch the worker threads and it invokes the parallel dispatcher,
B
it checks if there's been a recommendation, and if there's a recommended thread count, it forces the parallel dispatcher to use that when spawning the worker threads.
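As a sketch, the pre-collection side of that hand-off could look like this (illustrative names, not the actual dispatcher API):

```cpp
#include <algorithm>
#include <cstdint>

/* Hypothetical pre-collection check: use the previous cycle's recommendation
 * if there is one, otherwise fall back to the default (maximum) count.
 * recommendedCount == 0 means no completed cycle has produced one yet. */
uintptr_t workerCountForNextCycle(uintptr_t defaultCount,
                                  uintptr_t recommendedCount,
                                  uintptr_t maxThreads)
{
    if (0 == recommendedCount) {
        return defaultCount; /* first cycle: nothing observed yet */
    }
    /* Never exceed the platform maximum: recommendations are capped,
     * as mentioned in the Q&A at the end of the talk. */
    return std::min(recommendedCount, maxThreads);
}
```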
B
In terms of the collection itself, this is where the actual GC work is happening, where we're processing roots, scanning, doing some copying and whatnot, and here we're actually measuring the different stalls and the different idleness, trying to pick up on the utilization of the threads. Specifically, when we're scanning, threads go idle all the time, so we want to put things in place to take measurements there and track idleness.
B
So if we put all that together, we end up with something like this, where the pre-collection routine has this adaptive threading component implemented, and then in post-collection we have the model actually going and projecting the thread count.
B
If you were to see adaptive threading in action, I think these two figures illustrate it. They show how adaptive threading works to reduce parallelism overhead by trying to reach that equilibrium point, or optimal thread count. The figure on top shows the percent stall and the number of threads at any given cycle, and for each cycle the number of threads is determined by the number of threads utilized at the previous cycle and the percent stall that resulted from using that many threads.
B
For example, if you take a look at cycle number two, we see that the thread count is 30, and that's determined by adaptive threading taking a look at the thread count of 48 with a percent stall of 94: it brings the thread count down because it thinks the stall is way too high and is leading to detrimental parallelism.
B
The bottom figure shows a breakdown of the total cycle times: it shows the total cycle time, then the proportion of time that's considered stall time and the proportion that's considered busy time. Looking at this cycle time breakdown, we see that initially the cycle times are very high and the majority of the time is coming from stall, or parallelism overhead, and as we decrease the thread count we observe a significant drop in cycle times, indicating that the parallelism overhead is reduced.
B
Overall, we can see that the initial thread count of 48 from cycle one ends up at around six to seven, so it reaches a stable point around there, and in this particular case that results in a seventy percent decrease in cycle times and about a twenty-five percent increase in performance.
B
I think another interesting example is the multi-VM, or high CPU utilization, scenario. These two tables show multiple VMs running on the same system: the set on the left shows it without adaptive threading and the set on the right shows it with adaptive threading enabled. There are six VMs running on the system
B
at the same time, where VMs one to four are running the same GC workload for the same application, while five and six are running two different applications, so we have two different workloads there. When we compare these two tables, we can see that the average collection times are actually lower with adaptive threading; this is most apparent when taking a look at the row for VM 5 over here.
B
Another interesting thing to look at here is the thread distribution: for any given number of threads, this figure shows how many collection cycles utilized that number of threads. So for VMs one to four,
B
we can see that the default of 64, the maximum thread count, was used for 26 cycles, but we do have variability: anywhere from 40 to about 57 threads used on average for the majority of the collections. But I think the effect of adaptive threading is most apparent on VM 5, where we actually get the most benefit: we have a 77 percent reduction in collection time.
B
Over here you would notice that only three collection cycles used 64 threads, which would have been the default had adaptive threading not been enabled, and we can see that on average we're using seven threads. So this is clearly reducing the detrimental parallelism, and we can observe something similar for VM 6, and similarly for VMs one to four.
B
So that sums up adaptive threading and parallelism. In terms of future work: I think the model can be made more aggressive. Right now it's pretty conservative: it throttles the thread count down readily, but it's more reluctant to use more threads. I think it could be more aggressive, but that needs better detection to find anomalies, so as not to be too invasive.
B
Another thing is that this could be expanded to other collectors as well; right now it's only for scavenger, because that's doing the majority of the collections by default.
B
That would require continuous calculations and adjustments; that work would be more involved, and I think it would be a bit more complicated to get working.
B
Another thing is that we could use AI, or some sort of machine learning, for smart recognition to tune the model constants themselves, so the weighted averaging constant or the thread booster heuristics; I think those could benefit from some AI concepts and machine learning. That pretty much sums it up. Thank you, and if anyone has questions, I'd be happy to answer them.
A
B
Yeah, that's a great question. It would recommend more, but if you have, say, a 64-core system, that's what the default would be, and we would actually cap it at that. So it would try to recommend more, but we would cap it at the maximum default.
D
B
E
That equation that you showed earlier, or that formula: where did we get that from? It was several slides back... yeah, I think maybe these slides here.
B
So, the number of optimal threads: that comes from taking the derivative of this equation, setting it to zero and solving, so essentially finding the minimum. This time function is something that Alex came up with, and through experimentation it proved to be pretty good. It's just one implementation of the model, but it was sufficient. So if we take the derivative, set it to zero and solve for m, we get that.
F
But basically, what that first formula says is, I guess, that the time is directly or inversely proportional: we have this ratio, n over m or m over n, where one is the current number of threads and the other one is the projected one, yeah.
F
Yeah, so we have two components here: one is saying, for example, that the time is directly proportional to the busy times, and inversely, you know, for the stall times; however, there is some non-linearity, where this factor X comes in. You know, roughly; it wasn't really that straightforward a formula, maybe there could be some other variant, but you know.
F
C
B
So 100 percent of the code related to adaptive threading is actually in OMR. The only thing to consider is hooking up a VM to use scavenger, and I think that requires a bit more work than using the traditional mark-sweep collector that we have. So it's just about hooking it up to scavenger and using the generational collection from OMR, and this would be available.
C
Okay, that's very cool, that's a great answer. I guess it suggests the follow-on question: how hard would it be for someone to use the scavenger from the OMR GC? Maybe that's also kind of a question for Alex; I'm not sure if you have a pat answer to that kind of question.
B
Yeah, I'm actually not aware of another VM using gencon, or using scavenger or a generational style of collection from OMR; from the other talks I've heard, at CASCON and the like, they usually tend to use the traditional stop-the-world collector. But Alex, are you aware of the amount of work that's involved in actually hooking up another?
F
Yeah, so scavenger is mostly ready to be used by other, I guess, platform languages other than Java. People and projects have successfully incorporated scavenger, but probably the biggest, I guess, glue code to be written is around the write barriers; there is a little bit of glue code to be written around the write barriers.
F
You know, basically the generational barriers: we have to ensure that whenever a mutator thread mutates a reference, it notifies the collector about this mutation. Some examples could be taken from J9, but I think there is a little bit of code on the J9 side that could probably be better shared with OMR, that we don't share right now. This is something that maybe in the future we can do better.
F
A
Okay, if not, I'll thank you again, Salman, for putting together this informative talk. And, oh, the...
B
Oh yeah, yes, it has, yeah. I was going to post links to the different issues and things, but I can share that as well in the Slack channel.
A
Okay, terrific, yep. That looks really good. So thanks again, and thanks everyone for attending, and we'll talk again in two weeks then. Thank you.