From YouTube: JProfiling 20180927
Description
Profile-driven optimization is very useful in improving runtime performance, especially in just-in-time compilation, where applications are compiled while running. In this talk, Rahil will first give an overview of the different profiling mechanisms, including those used in OpenJ9, and discuss their advantages and disadvantages. He will then explain the low-overhead patchable runtime profiling mechanism called JProfiling, how it works, and its advantages over the other mechanisms.
This profiling infrastructure can forward some vital information. For example, when a compiler encounters a particular virtual call or interface call, it needs to resolve the class to call the correct implementation of a method. If it is known at compilation time that, a significant number of times, that call would resolve to a particular class, the optimizer can inline it. That makes the call faster, which matters significantly in a statically compiled language, where this information is gathered through a training run that might not reflect the true runtime behavior of the application.
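The payoff described above can be sketched in Python. This is only an illustration of the idea, not OpenJ9 code, and all the names here (`Shape`, `Circle`, `total_area_devirtualized`) are invented: a cheap guard on the profiled receiver class selects an inlined fast path, with ordinary virtual dispatch kept as a fallback for the rare other types.

```python
# Hypothetical sketch of profile-guided devirtualization; all names invented.

class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):
        return 3.14159 * self.r * self.r

def total_area_devirtualized(shapes):
    """The profile said almost every receiver is a Circle, so the
    "compiler" emits a cheap type guard plus the inlined Circle.area
    body, falling back to virtual dispatch for everything else."""
    total = 0.0
    for s in shapes:
        if type(s) is Circle:                 # guard on the profiled class
            total += 3.14159 * s.r * s.r      # inlined Circle.area body
        else:
            total += s.area()                 # slow path: virtual dispatch
    return total
```

The guard is what makes the optimization safe: if the profile turns out to be wrong at runtime, the slow path still produces the correct answer.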
On the other hand, a dynamically compiled language can benefit the most from this infrastructure: because it compiles methods while the program is running, it can have the most accurate profiling information, using which the optimizer can generate code of better quality, provided that the past behavior reliably predicts the future behavior.
Before I jump into the OpenJ9 profiling story, let's take a look at the classes of information that any runtime profiling infrastructure tends to collect. Code frequency, or code execution frequency, aims to describe how frequently different paths of an application execute compared to one another. This helps us in block ordering and also provides us information about hotspots in an application where we can put more optimization effort. Value profiling information, on the other hand, aims to describe the different value sets at a specific program point, which lets us generate code tailored for the most frequently executed cases.
So how does OpenJ9 collect and utilize the profiling data? Well, OpenJ9's execution works in a mixed mode, where an application starts with interpreted methods and frequently executed methods are compiled to native code. So initial profiling information is gathered from the bytecode interpreter loop, where the interpreter is provided with a per-application-thread profiling data buffer, which it uses to collect raw data from the application. Once this data buffer gets full, the iProfiler, or interpreter profiler, consumes that raw data and parses it into a global data structure that can be consumed by the JIT compiler. Data collection in an interpreter is a very costly operation, so the iProfiler also controls data collection by turning it on or off as per requirements. Once the iProfiler parses the data into the global data structure, the JIT compiler uses that data in its compilations and also sends out queries to the iProfiler to request specific data.
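The buffer flow just described can be sketched in a few lines of Python. The names (`record`, `flush`, `GLOBAL_PROFILE`, `BUFFER_SIZE`) and the record format are invented for illustration; the real iProfiler works on raw per-thread byte buffers:

```python
# Minimal sketch of the iProfiler buffer flow; names and format invented.
from collections import Counter

GLOBAL_PROFILE = Counter()   # stands in for OpenJ9's global data structure
BUFFER_SIZE = 4              # real buffers are of course much larger
collection_enabled = True    # the iProfiler can toggle collection on/off

def record(buffer, event):
    """Called from the interpreter loop: a cheap append, flushing when full."""
    if not collection_enabled:
        return
    buffer.append(event)
    if len(buffer) >= BUFFER_SIZE:
        flush(buffer)

def flush(buffer):
    """Parse raw records into the global structure (the costly part)."""
    GLOBAL_PROFILE.update(buffer)
    buffer.clear()
```

Keeping the interpreter-side path to a bare append is the point of the design: the expensive parsing happens only when a buffer fills.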
So using this infrastructure we collect targets of virtual calls, which helps in deciding which method implementation to inline; branch directions and switch statements, which assist block ordering; and instanceof and checkcast runtime types, which help us generate better typecasts when we are compiling a method. Although most JIT compilations are driven by data collected from the iProfiler, there are some challenges associated with the iProfiler. First of all, the data collection is really costly.
So we only profile one in every N method invocations, where N is tuned as per the application's characteristics. Most entries in the data buffer are consumed by branch entries, so to reduce the time spent processing those buffers, we skip those branch entries in a sampling fashion and collect the branch bias information just before we are about to compile a method. The iProfiler also requires a separate thread pool to process the data, so that we can avoid introducing pauses in the application.
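The one-in-N invocation sampling mentioned above can be sketched as follows. The value of N and the helper names are invented for illustration:

```python
# Rough sketch of 1-in-N invocation sampling; N and names are invented.
N = 8                      # tuned per application characteristics
_invocations = {}

def should_profile(method_name):
    """Profile only every Nth invocation of a method to bound overhead."""
    count = _invocations.get(method_name, 0) + 1
    _invocations[method_name] = count
    return count % N == 0
```

A larger N lowers the profiling overhead at the cost of needing more invocations before the profile becomes representative, which is why it is tuned per workload.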
Third, and one of the main challenges with the iProfiler, is that the collection of records is shut off once a method is compiled to native code. But OpenJ9 can also add instrumentation in compiled code; that is called the JIT profiler. It makes an exception for very frequently executed methods: usually a method that consumes a significant amount of the application's running time can be chosen by the JIT compiler to undergo a special compilation phase that is called a profiling compilation.
In this profiling compilation, for a given method, we generate two method bodies: one without instrumentation code and one with instrumentation code. Code execution is interleaved between these two method bodies based on a sampling frequency. OpenJ9 uses compiler-controlled counter-based sampling, in which, at various points in the method body without instrumentation code, we have code that decrements a counter and, upon the counter reaching zero, transitions to the instrumentation code.
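The counter-based transition between the two bodies can be sketched like this. The interval, the buffer, and the function names are invented; in the real implementation the decrement-and-test is a couple of machine instructions inserted at sampling points:

```python
# Sketch of compiler-controlled counter-based sampling; names invented.
SAMPLE_INTERVAL = 100        # how many sampling points between transitions
_counter = SAMPLE_INTERVAL

def sampling_point(profile_buffer, value):
    """Inlined at various points of the cheap (uninstrumented) body."""
    global _counter
    _counter -= 1
    if _counter <= 0:
        # transition to the method body *with* instrumentation code
        profile_buffer.append(value)
        # reset on return, based in reality on the instrumentation's cost
        _counter = SAMPLE_INTERVAL
```

Almost all executions take only the decrement-and-compare, so the expensive instrumented body runs at a bounded, compiler-chosen rate.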
After returning from the instrumentation code, it resets the counter based on the size of the instrumentation code that it is running. After a certain number of transitions between these two method bodies, a recompilation is requested, so that the compiler can use the data generated by the instrumentation code to optimize the method body. Detailed information about this implementation can be found in Arnold and Ryder's paper, "A Framework for Reducing the Cost of Instrumented Code". Although it is really useful to have instrumentation code in a compiled body, there are some challenges associated with the JIT profiler which make it a viable option for only a handful of methods. First of all, in a method body with instrumentation code, to get the code frequency we put counters on all the blocks rather than placing them optimally, so as to reduce complexity during optimizations.
So far we have seen software-based profiling techniques. There are also some profiling mechanisms which use hardware-provided buffers for the profiling data. In OpenJ9 we use those hardware-provided buffers in a sampling mechanism, where we compile everything at a low optimization level, with, of course, a low invocation count, to improve the startup of a workload, and based on processing those hardware buffers we try to identify hotspots and compile those methods at a higher optimization level.
In general, there are many challenges associated with the use of hardware buffers in profiling, but the main challenge is the limited information in a virtualization domain, and that makes them unusable in cloud environments. The implementation is also hardware dependent: the hardware sampling we have in OpenJ9 has two different implementations, one for the Power platform and one for the Z platform. Collecting value profiling information and low-frequency information becomes very costly and complicated with this kind of infrastructure. So that is why we kept its use limited to the sampler only in OpenJ9.
So we have seen the software-based profiling techniques used in OpenJ9, and we have seen the hardware-based ones. Why do we need multiple profilers in the first place? To answer that, we need to look at the two main categories of workloads under which a JVM workload falls. In workloads with hotspots, a significant amount of the application's running time is spent in a handful of hot methods.
This handful of hot methods will be compiled to native code and might inline some other methods as well. So to get the real current context of the application, data collected from the iProfiler alone might lack the information about the inlined methods. The JIT profiler comes in here, because it can add instrumentation in compiled code, and it provides us the current information that we need to recompile these methods and help improve throughput of the workload.
In flat workloads, the running time is distributed pretty flatly across many methods. To improve performance in this type of workload we need to compile as many methods as quickly as possible, and startup time is really important here, because of the higher code cache consumption and the dependency on recompilation.
So we have seen the challenges that we faced with the different profilers. What do we expect from an ideal profiling infrastructure? Well, we want to have a single framework for both startup and throughput improvements. We want to have profiling code in a large number of compiled methods with minimal impact on compile time, so the cost of adding instrumentation code in the compiled body should be minimal.
Well, one of the main problems the JIT profiler had was that, in the method body with instrumentation code, it has counters on all the blocks to get the code frequency. JProfiling aims to minimize these counters. How does it do that? Let's take a look at an intuitive example. In this example, we could naively put counters on all the edges, but is there a better way we can optimize this?
Well, we know that if we put counters on blocks with a single successor or a single predecessor, we can effectively compute the frequency of the other edges and blocks, so that reduces the counters to half. Can we still do something better? Well, if somehow we know the method entry frequency or the exit frequency, we can leave out the counter from the frequently executed path and have a minimal impact of adding counters on the running time. So, with this intuition in mind, JProfiling applies a counter placement algorithm.
The algorithm is based on two papers: one written by Knuth and Stevenson, which describes a CFG reduction procedure for optimal profiling, and another written by Ball and Larus, which provides a comparison of different algorithms for profiling and tracing programs. Based on these two papers, we prepared a formulation based on CFG edge execution frequency profiling. Let's take a look at an example of how JProfiling minimizes the number of counters we need. This is a simple control flow graph with one back edge; it can represent a for loop or a while loop in any method.
Let's see how we can minimize the number of counters here. First of all, what JProfiling does is build a maximum spanning tree out of that control flow graph. To get the maximum spanning tree, it adds an artificial edge between the start block and the exit block, to express that the number of times you enter a method is equal to the number of times you exit the method. After that, it uses Prim's algorithm to obtain the spanning tree.
Once it obtains the spanning tree (here I have highlighted the edges which are not present in the spanning tree), it traverses all the edges, finds the ones that are not present in the maximal spanning tree, and puts counters to measure those edges. So in this case we just need to count the edges 4→5 and 5→1, so we put a counter to measure edge 4→5 and edge 5→1. Once it has put the required counters on all the paths,
it prepares a formulation, in terms of additions and subtractions of those counted blocks, to get the frequency of the uncounted edges.
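The placement scheme just described can be sketched in Python on a small loop CFG. The block numbering and edge weights below are invented, so they differ from the slide, and a Kruskal-style greedy is used in place of Prim's (either yields a maximum spanning tree): counters end up only on the non-tree edges, and every other edge frequency is then recoverable by flow conservation (in-flow equals out-flow at each block).

```python
# Sketch of maximum-spanning-tree counter placement; numbering invented.
# Edges are (u, v, weight), weight modeling expected execution frequency.
edges = [(0, 1, 10), (1, 2, 10), (2, 3, 9), (3, 1, 9), (2, 4, 1),
         (4, 0, 10)]   # (4, 0) is the artificial exit->entry edge

def max_spanning_tree(edges, nodes):
    """Kruskal-style greedy: keep the heaviest edges that join two
    components; any edge left out must carry a real counter."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    tree = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

nodes = {0, 1, 2, 3, 4}
tree = max_spanning_tree(edges, nodes)
counted = [e for e in edges if e not in tree]   # edges needing counters
```

For a CFG with V blocks and E edges, the spanning tree always covers V-1 edges, so only E-V+1 counters are needed, instead of one per block or per edge.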
A detailed implementation of these algorithms can be found in JProfilingBlock.cpp in OpenJ9. Now let's see how we can obtain the frequency of uncounted blocks. Let's take an example: we want to get the frequency of block 2, which is not counted. Well, we know that block 2 has a single predecessor.
So
if
we
get
the
frequency
of
edge
0
to
2,
we
can
get
a
frequency
of
block
2.
We
can
see
that
the
edge
we
got
to
2
is
a
single
successor
of
block
0,
so
to
get
a
frequency
of
edge
0
to
2.
We
just
need
to
get
the
frequency
of
block
0
log
0,
which
is
a
start
block.
That
means
the
frequency
of
log
0
is
equal
to
frequency
of
block
1
because
number
of
times
you
enter
the
method
is
equal
to
number
NGH
it
matter.
So it becomes intuitive that to get the frequency of block 2, we can just load the counter from block 7 and obtain the frequency for block 2. Following these steps, it can prepare a table to find the frequency of the uncounted blocks or edges, in terms of the block numbers containing the counters that we need to load and add or subtract together.
In this example there is no requirement to add or subtract any counters together, but we can assume that in a complicated control flow graph we would have some blocks whose counters we need to add or subtract together. So we have minimized the number of counters that we want to put in a control flow graph, and we have prepared the formulas required to obtain the frequency of the uncounted paths. How do we store this information? Well, JProfiling uses a persistent block frequency information data structure to store it.
This data structure contains an array of pointers to bit vectors that hold the information about which counted blocks need to be added or subtracted together to obtain the frequency of the uncounted blocks. In this array, if we have a null entry, that means there is no counter information available. If the entry is low-tagged, that means we can efficiently obtain the frequency from a single counter with a minimal number of operations.
From that entry we can find from which block number we need to load the counter to get the frequency. If the entry is neither null nor low-tagged, that means it contains a pointer to the bit vector, which holds the information about which counters need to be added or subtracted together to get the frequency. Detailed information about this data structure can be found in the J9Profiler .hpp and .cpp files.
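The three entry kinds just described can be sketched as follows. The encoding (a low-tag bit on an integer, a pair of lists standing in for the bit vectors) is invented for illustration, not the exact OpenJ9 layout:

```python
# Sketch of the tagged-entry lookup; encoding invented for illustration.

def make_single(block_num):
    """Low-tag bit marks 'frequency comes from this one counter block'."""
    return (block_num << 1) | 1

def lookup_frequency(entry, counters):
    """Recover a block's frequency from its table entry."""
    if entry is None:
        return None                       # no counter information available
    if isinstance(entry, int) and entry & 1:
        return counters[entry >> 1]       # cheap case: a single counter load
    add, sub = entry                      # general case: add/subtract sets
    return sum(counters[b] for b in add) - sum(counters[b] for b in sub)
```

The low-tag trick lets the common single-counter case avoid allocating a bit vector at all, which is where the high information density comes from.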
So the benefits of this algorithm: in a cyclic region it has at most one counter, and the implementation favours placing counters at lower-frequency edges and blocks, so it reduces contention while updating the counters. It can compute the frequency of any single block as a sequence of addition and subtraction operations, and the block frequencies are stored using a pair of bit vectors, which provides a high information density.
Now, one of the main parts of OpenJ9 that I have not covered in my example is exceptions. So how does JProfiling handle exceptions with this algorithm? Well, exceptions are special in OpenJ9: they represent non-local control flow changes, and if we decided to put a counter on all the exception edges, we would need that many extra counters to get the frequency of all the blocks in the program.
But there is a simple observation that we made about exception edges: they are there, but they usually do not occur. Using that observation, we can make a simplifying assumption and combine all the counters from the exception edges, putting them onto the catch block. Currently in OpenJ9 we know how to exploit the information on a catch block, and we do so, but we don't have any mechanism that exploits information about counters on exception edges. So we can take them as they are and put the counter on the catch block.
So how does JProfiling handle inlining? Well, inlining is special because it can change from compilation to compilation. So if we are compiling a method and find that a method was not inlined in the previously compiled profiled body, then JProfiling walks the call chain and tries to find information about that method.
This way, we are not limited when we want to apply certain optimizations, as we were when the method was compiled with the JIT profiler. So we have seen how efficiently JProfiling minimizes the number of counters that it requires. Let's take a look at how it collects value profiling information. Well, the main aim behind value profiling information is to identify common values and program properties at specific points during program execution.
This information can be the type of an object at a checkcast, instanceof, or virtual method dispatch; it can be a loop induction variable or the length of an array copy. Using this information, an optimizer can generate code tailored to the frequently executed cases. The main lookout in any value profiling is to detect consistency, because from the given value profiling information only a small set of values is of interest, so that the compiler can use that information to generate better code.
A lack of this consistency suggests that there are no commonly executed cases present in the method body that the optimizer can exploit. So, using this idea, JProfiling only profiles the most frequent N values, where N is a small integer, while keeping track of the uncounted cases, which provides information about the inconsistency. So how does JProfiling store the values?
Well, it uses a hash table to store the values. That hash table contains two arrays: one with the value information, which tells us which values we are profiling, and the other containing the frequencies of the corresponding values. It uses a bit extract as the hash function, which extracts bits out of a value into an integer, to efficiently generate an index into the array.
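A bit-extract hash can be sketched in one line of Python. The table size, the shift, and the function name are invented; the point is that the whole hash is a shift and a mask, which maps to very few machine instructions:

```python
# Sketch of a bit-extract hash; table size and shift are invented.
TABLE_SIZE = 8                        # power of two, so the mask is cheap

def bit_extract_index(value, shift=3):
    """Drop low alignment bits, then mask down to the table size."""
    return (value >> shift) & (TABLE_SIZE - 1)
```

Profiled values such as class pointers tend to be aligned, so shifting away the always-zero low bits before masking spreads them across the table.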
To update or add a new value to the table, it first generates an index into the array using the hash function and checks in the array whether that value exists or not. If the value exists, then it just increments the corresponding counter and continues running the main code. If the value does not exist, then it checks whether the table is full or not; that is, whether it can add new values to the table. If it cannot add new values to the table, then it attributes that value to the uncounted cases and continues running the main program.
If it can add new values to the table, it attempts to lock the table and checks whether the lock was acquired. If the table cannot be locked, then it again attributes that value to the uncounted cases and continues the execution of the main program. If it can lock the table, then it rehashes the table, to make sure that the most frequently profiled values are not kicked out in case of a collision, adds the new value to the table with an initial count, unlocks the table, and comes back to running the main program.
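The update path and the consistency check just described can be sketched as follows. The class and method names are invented, and a Python dict with a boolean flag stands in for the real raw arrays and try-lock:

```python
# Sketch of the value-profiling table update; names and layout invented.
class ValueProfileTable:
    def __init__(self, capacity=4):
        self.values = {}          # value -> frequency (stands in for the arrays)
        self.capacity = capacity
        self.uncounted = 0        # cases we chose not to (or could not) count
        self.locked = False       # models the try-lock on the slow path

    def record(self, value):
        if value in self.values:                  # fast path: bump the counter
            self.values[value] += 1
            return
        if len(self.values) >= self.capacity or self.locked:
            self.uncounted += 1                   # table full or lock busy
            return
        self.locked = True                        # "take the lock"
        self.values[value] = 1                    # insert with an initial count
        self.locked = False                       # release, resume the program

    def dominant(self):
        """Report the hottest value only when the profile is consistent:
        a large uncounted share means nothing worth specializing on."""
        if not self.values or self.uncounted > sum(self.values.values()):
            return None
        return max(self.values, key=self.values.get)
```

Dropping rare values into `uncounted` instead of blocking keeps the fast path cheap while still recording exactly the inconsistency signal the compiler needs.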
If the uncounted cases become significant, then it is really easy to clear out the whole table and start populating it from scratch, and the efficient hashing mechanism, the bit extract, can be done on most platforms with a single instruction. So this is a simplified version of how the trees look in our IL; I have just simplified the trees here.
If you have a method that spends much of its time in a loop, it can keep executing the profiled body without triggering recompilation, so we added a recompilation test in the loop to identify these cases. Because of a block splitter, the compilation time was increased in the global register allocator, so we reduced the number of blocks that we add for profiling a value.
It was also profiling the same value multiple times in an extended basic block, so we removed those duplicates as well. And it requires tuning, because we wanted to decide how much profiling is enough for us to get a properly optimized body. For performance results, I have put up the numbers from our benchmark runs on x86 and on Power.
Throughput improved by 5% when we switched from the JIT profiler to JProfiling, while on Z it improved by 2.3 to 2.5 percent. For future work, we want to enable this infrastructure for warm bodies; we basically want to have this profiling instrumentation code in a large number of compiled bodies.
There are some challenges that we have faced while enabling this: the profiling compilation has increased compilation time in the global register allocator, which we are trying to address right now, and we need a mechanism that allows us to patch in both directions, because currently in OpenJ9 we can only patch a branch to a nop; it's not possible to do the reverse.