From YouTube: Code Generators and Much More (Part I)
Description
In this talk, Julian will give an overview of the JIT code generator framework for the platforms we support, namely X, P, and Z. In particular, he will discuss which concepts and code are currently shared among the code generators and which are not. He will compare the architectures, reason about the philosophy of their respective designs, and explain how modern microprocessor differences result in differences in our JIT code generators.
So basically the whole talk is an introduction to the modern architectures of the three platforms we support, a high-level comparison of them and, following from that, the commonality and the differences among our three code generators, and then a lot of the code-generation details I'm going to talk about.
So the agenda is: the chip comparison and core comparison; the overall flow of code generation; the instruction-set implications on binary size, performance, execution time, and path length; addressing modes for code generation, mainly for the load/store instructions and memory operands; addressing modes for branches and calls; condition codes for loops, ifs, and whatever else; and registers and the register allocator.
Nowadays, when you talk about a chip, chip equals socket equals processor. In the past that was not the same, but nowadays it is basically the same. All three of these processors are fabricated on a 14-nanometer process: Skylake on Intel's 14-nanometer process, and POWER9 and z14 are fabricated on GlobalFoundries' 14-nanometer process. Actually, IBM's previous facility in Fishkill was sold to GlobalFoundries.
Comparing Skylake, POWER9, and z14 at the chip level: the core count on Skylake is up to 28 per chip, POWER9 is up to 12, and z14 up to 10. And look at the L3 cache — this slide contains all chip-level resources, not core-level private resources, and L3 is typically a chip-wide shared resource. Skylake contains 38.5 megabytes of L3 cache shared by all 28 cores. POWER9's is private: 10 megabytes per core, and although the L3 cache can be shared among the cores, its latency is higher — because it is private, when you need a cache line currently held in a neighboring core's L3, it takes on the order of 120 cycles; it's not that fast. z14 has 128 megabytes of L3 cache shared by the whole set of cores. Then the memory channels, the memory connection: Skylake has 6 channels, POWER9 has 8 channels, and z14 has 5 channels.
The 8 channels of memory connection on POWER9 can be connected in two ways. For the low-cost machine, memory is connected through DDR — double data rate — the industry-standard memory connection. For the enterprise machine, on the same chip, it can be connected through the DMI channel, the direct memory interface channel. The DMIs, the buffered memory connection, provide close to 2x the memory bandwidth, but they slow memory down: the memory latency is around 15 to 20 nanoseconds longer. You have the bandwidth, but the latency suffers.
For z14 the number is secret — maybe Jor-El knows — but you can probably deduce it from POWER9's buffered-memory numbers. Z is always buffered and X is always DDR, that's my understanding. To give you some perspective on how much this memory bandwidth is: it's okay, it's good for ordinary running code. But if you look at HPC, high-performance computing, that code demands
a lot of memory bandwidth. A Skylake core can actually pull around 40 gigabytes per second, so the roughly 90 gigabytes per second of the whole chip is probably only good for two cores. So you can imagine, running that kind of code, your other 26 cores sit idle — you cannot do anything; only the two cores can fill the bandwidth. In this perspective you can see that POWER9 and z14 are more scalable for that kind of workload. And the shared-versus-private L3 cache connection has performance implications here as well.
I have been going through these performance issues with many customers between X and P. If you're running single-threaded on the chip, then that single thread, in the X case, can use close to 40 megabytes of L3 cache; on Power you can use 10 megabytes. So sometimes the customer is saying Power is running slow, but my answer to them: you are not buying the machine to run a single thread, right?
The element size, plus a displacement immediate. Plus there is another addressing mode: IP-relative — instruction pointer relative — basically relative to the current instruction, with the offset being a certain amount. So X has IP-relative addressing. Those are the two modes of addressing, and there are a lot of sub-modes — really base plus index multiplied by a scale s — and you can combine the possibilities to get a lot of modes there. But if you look at Power, the instruction set, there are only two modes for the load and store instructions.
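To make the contrast concrete, here is a sketch in C of the effective-address arithmetic each family can encode in a single load or store. The function names and the D-form/X-form labels for the two Power modes are my annotations, not from the slide.

```c
#include <stdint.h>

/* x86: one load can encode base + index*scale + 32-bit displacement
   (scale is 1, 2, 4, or 8). */
uintptr_t x86_ea(uintptr_t base, uintptr_t index, int scale, int32_t disp) {
    return base + index * (uintptr_t)scale + (intptr_t)disp;
}

/* Power load/store mode 1: base + 16-bit signed displacement (D-form). */
uintptr_t ppc_disp_ea(uintptr_t base, int16_t disp) {
    return base + (intptr_t)disp;
}

/* Power load/store mode 2: base + index, no scaling, no displacement (X-form). */
uintptr_t ppc_index_ea(uintptr_t base, uintptr_t index) {
    return base + index;
}
```

Any scaled-and-displaced access on Power therefore needs extra arithmetic instructions before the load itself.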
The "c" there means conditional, for conditional branching. Power only provides 16 bits' worth of branching distance for conditional branches; for an unconditional branch it is 26 bits. For an indirect call you need to use special registers, not GPRs: on Power they are called LR and CTR, the link and the count register. On Z, the relative branch range is 32 bits, multiplied by 2 for the distance, because their instructions are always 2, 4, or 6 bytes — always a multiple of 2 — and the indirect branch is GPR-based, off a register.
So, the flow of code generation in our JIT compiler: code generation basically goes through instruction selection, register assignment, peephole optimization, and, finally, binary encoding. There used to be two more stages around register assignment: before it, a pre-RA instruction scheduler, and after register assignment, a post-RA instruction scheduler. Right now they are disabled, and disabled for a good reason: basically because all the cores nowadays are deep out-of-order machines, so the hardware itself provides the scheduling capability already.
A
Okay,
so
for
the
instruction
length
implications
of
the
X
and
Z
the
typical.
Finally,
operations
are
automatic
operation
instructions
destructive
attractive,
in
a
sense
that
the
example
there
is
a
eco,
a
plus
B.
Then
you,
the
AE
is
destructed
you
if
you,
if
we
want
to
keep
the
a
value
for
a
future
reference,
you
need
to
copy,
you
need
to
copy
the
register
or
you
need
to
later.
We
motorized
it,
and
so
you,
you,
probably
see
more
register
copy
instructions
in
our
X
and
Z,
but
I
know
skylake
microarchitecture,
actually
optimizing
this
copy
instruction.
This has a binary-size benefit, because if you look at the instruction to increment a memory operand by one: on Power you typically need to load it, add one to the loaded value, and store it — you need to use three instructions, and the encoding is twelve bytes. But on X you probably only need a two-byte instruction to do the inc. So the binary-size benefit is obviously there, and there is also the path-length question: basically how many instructions you need, in a dynamic sense.
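As a sketch, the same memory increment in C, with comments indicating the instruction shape each platform would typically use (the opcode spellings are illustrative):

```c
/* On x86/Z the whole read-modify-write can be a single memory-operand
   instruction; on Power it must be a load / add / store sequence. */
void inc_memory_operand(int *p) {
    *p += 1;            /* x86: add dword ptr [p], 1  (one short instruction) */
}

void inc_load_add_store(int *p) {
    int t = *p;         /* Power: lwz  t, 0(p)  */
    t = t + 1;          /* Power: addi t, t, 1  */
    *p = t;             /* Power: stw  t, 0(p)  */
}

/* Small driver: both versions compute the same result. */
int demo_inc(void) {
    int a = 40;
    inc_memory_operand(&a);
    inc_load_add_store(&a);
    return a;
}
```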
However, that doesn't mean Power takes more time to execute the increment, because the memory-operand instruction is internally cracked into 3 micro-ops anyway. So from a pipeline point of view, the 3 micro-ops ultimately issued versus Power's 3 instructions issued come out about the same. And x86 instructions are mostly of two-operand format, so they are mostly destructive.
We have a three-operand format for newer instructions like FMA — FMA means floating multiply-add, fused multiply-add. You need these resources anyway: register A multiplied by register B, plus register C, and the result needs to be put into a register. So even the three-operand format is still destructive there.
A
The
three
operator
probably
is
non-destructive,
but
I
didn't
go
through
the
percentage,
how
many
instructions
destructive?
How
many
are
not
and
the
power
is
mostly
of
the
operand,
so
it's
mostly
non
destructive,
but
for
the
VSS
we
have
said
the
vector.
Fma
is
three
operon,
so
its
destructive
any
questions.
Addressing-mode implications on instruction selection: here the limitation is mainly on Power, because the addressing mode is either base plus index or base plus displacement, and the displacement is limited to 16 bits — plus or minus 32K. And there is no architected IP register: you cannot refer to the IP register, so there is no IP-relative addressing. To give you an example:
A
Typically,
you
are
iterating
over
and
Java
array
on
power.
You
need
to
use
more
instructions
or
for
sure,
because
you
of
our
object
model,
you
have
a
object,
header,
each
byte
or
whatever,
and
then
you
have
a
index.
So
what
do
you
need
to
addressing
a
particular
element?
You
need
to
add
the
index
first,
then
you
need
to
multiply
something
shift
on
an
index
plus
the
base
and
a
part
of
the
header.
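The address arithmetic being described, sketched in C. The 8-byte header and 4-byte element size are illustrative numbers, not OpenJ9's actual layout:

```c
#include <stdint.h>

#define HEADER_SIZE 8u   /* illustrative object-header size        */
#define ELEM_SHIFT  2u   /* log2(element size), e.g. 4-byte ints   */

/* Element address when the object reference points at the header:
   base + header + (index << shift). On Power this costs several
   instructions, since no scaled-index-plus-displacement mode exists. */
uintptr_t array_elem_addr(uintptr_t obj_base, uintptr_t index) {
    return obj_base + HEADER_SIZE + (index << ELEM_SHIFT);
}

/* With an "internal pointer" from loop strength reduction, the base
   register already points into the data, so one base+index access does it. */
uintptr_t array_elem_addr_internal(uintptr_t data_ptr, uintptr_t scaled_index) {
    return data_ptr + scaled_index;
}
```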
So it's quite a few instructions to compute the address, unless loop strength reduction kicks in and you can have an internal pointer — basically, the base register can point anywhere into the object, into the middle, and then you don't need that long sequence to calculate the address. In OpenJ9 we certainly have a proposal for a new object model that will help Power for sure. What is that object model?
Basically, the Java object reference points at the start of the data, not at the start of the header, so the header is at a negative offset from your object reference and the data is at a positive offset from your object reference. But this is problematic for Z, because there the immediate offset is always positive — unsigned.
In order to address the negative offset you would need to use the index form: you need to move the negative value into a register and use the index form to address it. So that was the problem there. But nowadays the newer architecture allows both negative and positive offsets, so that problem with the new object model can be relieved.
There is probably some size implication here, though: I believe the negative-offset addressing is a six-byte instruction, versus four bytes right now, so it carries the implication of a bigger binary. Then there is accessing the stack frame: the stack frame can be more than 32 kilobytes — I have seen this, especially when escape analysis is allocating a lot of objects on the stack; then the stack frame becomes bigger than 32K.
There is a performance impact, because our stack-frame addressing binding is late — at the very last stage. In the instruction-selection phase you don't even know which accesses have offsets of more than 32 kilobytes, and only at binary-encoding time do you find that this thing has a more-than-32-kilobyte offset. Then what do you need to do? You already passed the register-assignment stage, so now you need to evacuate a register to calculate the offset, and that evacuation of a register certainly has performance implications.
You evacuate the register — typically to a negative offset of your stack — then you calculate the address and you address the stack, then you load that register back. There is a lot of performance penalty there. The same applies to accessing static variables and constant addresses — things like a J9Class and a J9Method.
For these, IP-relative addressing is really handy, but on Power there is no IP-relative addressing. We do have a lot of registers, though, so in the Power code generator, in 64-bit mode, we dedicate one register as the pseudo-TOC. So what is a TOC, and where is it? In the traditional ABIs — the ELF or XCOFF ABIs — they have something called the GOT or the TOC. GOT means global offset table, and TOC means table of contents. Either way, it is an area in memory.
This is mainly for position-independent code and shared libraries. The benefit for a shared library is that you don't want to modify the instruction sequence in a shared library to encode a particular global address — if you modify it, then your shared library cannot be shared by many processes.
So what they do is: the ABI has the GOT or TOC, you put the global address in the GOT or TOC, and you use IP-relative addressing to address the GOT or TOC to load the global address or constant address. But on Power
we don't have IP-relative addressing, so we dedicate a register — in C, or whatever environment, there is a dedicated register to point at the TOC — and then you can do the equivalent of IP-relative addressing. And indeed we have another one: because we have so many registers, we dedicate another register to point at the pseudo-TOC.
However, if you have the pseudo-TOC overflowing, then you have trouble: you need to revert back to a long sequence for materializing the address — in particular, in 64-bit mode you use five instructions to encode the long 64-bit address. And when you are using the pseudo-TOC, you also have slot invalidation: when class unloading is happening, you need to take different actions. So those are the peculiar things in the Power code generator — that's what's different here. Any questions on this? Yes?
Yes — the question is whether the pseudo-TOC is per method or global for the whole JVM. Yes, the pseudo-TOC is global for the whole JVM, and it is global in the sense that we have a hash table into the pseudo-TOC: we know which entry — for example, the java/lang/Object class pointer — is at, say, index 32. Then globally, whatever JIT code is running, if it needs the java/lang/Object class, it is just one access.
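The five-instruction 64-bit materialization mentioned above, modeled in C. Each step mirrors one Power instruction; the mnemonics in the comments are the usual lis/ori/rldicr/oris/ori pattern and are illustrative (lis actually sign-extends, which the real sequence accounts for):

```c
#include <stdint.h>

/* Compose a full 64-bit address from four 16-bit pieces, the way the
   Power code generator must when the pseudo-TOC cannot be used. */
uint64_t materialize64(uint16_t hh, uint16_t hl, uint16_t lh, uint16_t ll) {
    uint64_t r;
    r  = (uint64_t)hh << 16;   /* lis    r, hh        */
    r |= hl;                   /* ori    r, r, hl     */
    r <<= 32;                  /* rldicr r, r, 32, 31 */
    r |= (uint64_t)lh << 16;   /* oris   r, r, lh     */
    r |= ll;                   /* ori    r, r, ll     */
    return r;
}
```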
Your conditional branch is only sixteen bits, so that's up and down 32 kilobytes; the unconditional branch and call is 26 bits, which is up and down 32 megabytes; and Z's is even bigger — up and down 4 gigabytes. Now, the implication here: the limitation is mainly on Power, but it can happen on X and Z as well. Within a method you typically have conditional branches — so do you have enough branching distance for a conditional branch within a method? On Power it has happened:
32 kilobytes is not enough. So what happens here is: we have a pre-encoding phase to identify the potential long branches and then change them. For an identified candidate for a long branch, we reverse it: if you have a branch-if-greater, we turn it into a branch-if-less-than-or-equal — reverse the branch direction — and then the real branch uses the unconditional one, and the unconditional one can go up and down 32 megabytes. That's enough capacity, assuming there is no compiled method requiring more than 32 megabytes.
This could happen in OpenJ9 on all platforms, because in OpenJ9 the code cache used to be allocated separately, piece by piece, as you run. So you can imagine you have a piece of code cache that is more than 4 gigabytes away from your previous code cache; in that case, even though you can address up and down 4 gigabytes, it is still not enough.
Now the code cache is one big whole: the code cache area is reserved up front — 256 megabytes — and then every code cache is taken as one piece out of that reserved 256 megabytes, piece by piece. In that case, X and Z will not have the problem of not enough branching distance, but on P it can still happen, because your reserved area is 256 megabytes and P can only jump up and down 32 megabytes. So you still don't have enough. And on all platforms,
this can happen for helper calls, because the helpers are in a shared library, and how far away the shared library is from your code cache is platform-dependent — runtime-dependent, actually. So if the 2 gigabytes or whatever is not enough, then you still need bridging help from the trampoline. So what is a trampoline? Basically, you have a tiny piece of code to bridge the call when you don't have enough branching distance.
So, in your code cache, your branch jumps to the trampoline, and the trampoline is responsible for transferring your control to the faraway callee — but it creates the illusion that you are calling the callee directly. OK, on the next slide I can show you what the trampoline looks like. The trampoline actually saves encoding space: you can imagine that in your code cache you call the method foo a lot of times and you cannot reach foo.
A
If
you,
you
are
not
using
trembling,
you
need
to
encode
the
long
sequence
in
many
time
many
places,
but
you
have
a
trembling.
You
have
a
single
tramping
sitting
in
your
code,
cache
every
core
justic
job
to
the
trembling
be
trembling
is
the
thorough
gait
of
your
colleague
in
your
current
code.
Cache
also
is
required
trembling
for
atomic
touching
purpose.
If
you
have
method
recompilation
and
you
need
to
touch
your
core
to
the
new
target,
then
if
you
directly
encode
your
car
in
your
main
line
sequence,
then
you
cannot
patch
it
in
multiple
instruction.
Okay, the next slides are basically about binary encoding and the late creation of trampolines. In binary encoding you always need to keep in mind that dynamic instruction patching can happen — dynamic code patching can happen in a lot of places in a JIT environment. For the target method and class resolution: maybe the target method and class are unknown at the time of your compilation; you need to resolve them, and you resolve them by instruction patching.
So you need instruction patching for that. Recompilation requires patching as well; guard invalidation is patching; class loading and unloading and the pre-existence optimization can trigger patching; and compilation failure can also trigger patching. Instruction patching is a relatively complicated topic — it has i-cache coherency implications and the CMODX (concurrent modification of executing code) rules — and that is for part two; I will talk more about it there.
But for this talk: for instruction patching, the basic requirement is atomic behavior. For atomic behavior, you realize you need to align your instruction on the right boundary for the atomic patching to succeed. So this is basically a hardware-specific requirement on instruction alignment. Power is relatively simple, because every instruction is four bytes, so it is automatically aligned and you can patch it in a simple way.
However, because there are different instruction lengths on X and Z, there you probably need to take care of the alignment of what you are patching. Now for trampolines: our trampolines are created on demand, because most of the time, even on Power, not many applications generate more than 32 megabytes of binary.
So typically the 26 bits of branching distance is enough most of the time. For example, for a typical DayTrader run, the binary generated is probably around 14 to 15 megabytes; but for a big enterprise application it is typically 28 to 30-something megabytes, and in that case you can occasionally have a trampoline required. So the trampoline in that scenario is created on demand — only when, during binary encoding, you find:
"Oh, I cannot reach the target" — at that time you request the trampoline to be created. Now, for that to always succeed — because this is runtime behavior, you cannot allow the late creation of a trampoline to fail; if that fails, your whole runtime fails, right? It is not allowed to fail. So the space for each potential trampoline is reserved: you reserve the memory, and then later on you can always grab the space to create your trampoline.
Now, trampolines still need to be patched for recompilation, and the question of whether a trampoline can be atomically patched — modified in place — is platform-specific: on X and Z, yes, the trampoline can be modified in place; on P, no. So what does a trampoline look like? In the green portion, the trampoline basically composes the address and then jumps to the address. I need to emphasize: it jumps to the address, it does not call the address — because if you call the address, your return address is wrong:
the return address would end up after the jump, and that is not where you want to return. Your return address should be after the initial branch instruction that branched into the trampoline. So in this sequence you should not be trashing your return address — the return address was already established by the initial call into the trampoline — so the trampoline can only use a jump. It jumps to the target; it is not a call, so it does not create a return address again and thrash the real one.
On Power it takes several instructions to compose the address, and then you cannot patch them in one go. So what happens on Power is: we have a temporary trampoline. We create a temporary trampoline, redirect the initial call to the temporary trampoline, which jumps to the new target, and the temporary trampoline will be cleaned up on the next GC, when the world is stopped.
The question is where to put the temporary trampolines. In OpenJ9, right now, by default 5% of the code cache is carved out for the trampoline space, and within that 5 percent we have some number — I don't remember how many, 64 or something — reserved for temporary trampolines. In real life we have never seen more than 2 or 3 in use, so that is already enough. Any questions?
A Power trampoline is 24 bytes — I don't remember exactly, but it is up to 24 bytes — so that is sort of how the capacity of the trampoline area is calculated: from the size estimate of 24 bytes each. For a typical code cache, the 5% allows you more than enough — a typical method compilation on Power is two to three kilobytes or something, so you can calculate it; basically, you know.
Now, condition codes and condition registers. The difference between our X and Z versus P is that X and Z arithmetic instructions set the condition code by default, automatically, whether you want it or not. On P, arithmetic instructions don't set any condition register by default unless you ask for it: there is an instruction form called the record form — one bit in the instruction — with which you can ask for the condition status of a particular instruction. And on Power there are eight condition registers in total, architecturally.
For compare instructions and conditional branch instructions you can designate which CR to use, and for record-form instructions a specific CR is implied according to the instruction type: for example, for integer instructions CR0 is implied, and for floating point CR1 is implied.
Using multiple condition registers is actually a performance benefit, because if you don't have multiple condition registers, you need multiple conditional branches, and each one is potentially unpredictable — you have potential branch-predictor pollution and you can suffer multiple miss penalties. But on Power you can use the CR-logical instructions to AND, OR, XOR, or whatever, the condition bits; then you can land it all on a single conditional branch.
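A C-level view of the difference. With a single condition code you tend to branch per test; with multiple CR fields and CR-logical ops, the two compares can be merged and branched on once — roughly two compares, one crand, one branch:

```c
/* One branch per test: two conditional branches, each a potential
   mispredict. */
int in_range_branchy(int x, int lo, int hi) {
    if (x >= lo) {
        if (x <= hi)
            return 1;
    }
    return 0;
}

/* Both conditions evaluated into separate CR fields, combined with a
   CR-logical AND, then a single conditional branch on the result. */
int in_range_merged(int x, int lo, int hi) {
    return (x >= lo) & (x <= hi);
}
```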
So, registers, the global register allocator, and the local register allocator. In code generation we typically deal with virtual registers, and the real registers are called architected registers. The register assigner manages the lifecycle of the virtual registers and assigns real registers to the VRs. And then there is register renaming — rename buffers, the reorder buffer, or whatever — in the modern microarchitectures.
The GPR side has 180-something rename buffers, and the floating-point side more than 160, on POWER9. If you think about it, that is not surprising at all: you have 8-way SMT, and each thread's architected register count is 96 already — you have 32 GPRs plus 64 vector registers — and you multiply by 8; that is close to 800 architected registers, and maybe it is going to be more than 1000. And in the past, with transactional memory, you need to keep the old registers too, so it is even more. Now, about partial registers, starting on x86:
they already had partial registers — AH overlaps with EAX, and so on, right? That's the partial-register problem, and with the introduction of vector registers this happens even more frequently. x86 has the XMM registers and the YMM registers, and the registers are all overlapping; Power is similar: the floating-point
registers overlap with the vector registers — each is only a part of a vector register; the floating-point register partially overlaps the vector register. There are performance implications here, because of how the non-overlapping portion is defined: basically, undefined versus unchanged. There are performance — or pipeline-usage — implications in that choice. For example, on x86 you have the ZMM register, the 512-bit register, which partially overlaps with the 128-bit XMM register.
A
Now
you
need
to
be
abstraction.
Need
to
define
the
non
over
overlapping
part
is
undefined
or
unchanged
when
you
are
using
the
SMM
register,
the
lower
portion,
if
you,
if
you
have
the
inner
path
that
you
have
implementation
problem
here,
performance
problem
here,
you
if
you
didn't
carry
the
know
over
the
overlapping
portion
with
you
then
later
on.
You
need
to
combine
things
together,
but
for
the
for,
if
the
instruction
set
that
defined
it
as
undefined,
then
you
are
free.
A
You
are
free
to
go
because
it
is
undefined
anyway,
and
also
is
relevant
to
how
the
pipeline
is
used.
If
you,
if
he
is
unchanged,
you
need
to
carry
the
whole
512
bit
with
you
in
order
to
4/4
is
not
not
changing,
but
you
carry
that
over.
Basically
in
Prior,
you
need
to
use
a
wider
pipeline
to
push
the
instruction
through
and
I
can
tell
you.
Our
power
is
almost
always
undefined.
A
The
non-overlapping
portion
is
undefined,
so
the
hardware
implementation
is
free
to
go
basically
and
global
register
allocator.
This
is
different.
The
overall
framework
is
the
same
among
the
three
code
generator,
but
is
configurable
and
parameterize
able
to
each
code
generator.
Basically,
how
many
real
registers
you
are.
You
can
be
used
for
for
global
register
candidates
and
which
real
register
are
most
favorable
for
the
candidate,
and
this
is
configurable
in
each
code
generator
because
it
is
up
to
each
culture
and
platform
specific
to
decide
which
will
register
a
most
favorable.
All this information is carried into code generation through the GlRegDeps node. And for the local register allocator, the status of the real register set is controlled through register dependency conditions; those are used to honor the global register allocation requests, to guarantee no spills for certain code regions, and to enforce the linkage conventions.
Okay — my time... I need to go faster, I think. Linkage conventions and direct JNI. The private linkage convention: basically, the JVM has its own ABI. So naturally you might ask: why not just use the system ABI? There are historical reasons. I think the simplest answer is that in the past we were doing the compilation on the same
— we were doing the compilation on the same stack, the same thread, as the application thread. If you are using the same stack for the native code and the Java thread, then potentially, for every Java thread, you need to have a big stack. So in the past we separated them: a native stack is a native stack, and a Java stack is a Java stack, so the compilation cannot cause the Java stack to get that big. Now you have a different stack, so you have a different linkage already.
The linkage is basically: how the real registers are used — which are volatile (you can modify them), which are preserved, and which are reserved (reserved meaning you cannot touch them at all); how arguments are passed to your callee; how values are returned; and what the stack-frame shape looks like. That is the linkage. Also, on argument passing: if your arguments are passed on the stack, you have to store and then load them, with load-hit-store performance implications.
So what is load-hit-store? You have an older store and a younger load to the same memory, and if the load is issued later than the store, and you check the store queue and hit the store, then store forwarding needs to happen — and sometimes it is going to be rejected. Store forwarding on X and Z
is fast, but sometimes it can be rejected, and you suffer performance there too. On Power, typically, the load-hit-store penalty is higher. And store-hit-load is when you issue out of order: the load is issued first and the store is issued later; the store checks the load queue, and when it hits the load, you need to reject and redo — that is the store-hit-load penalty. This can also happen when you are short-circuiting out of complicated calls without shrink-wrapping.
Basically, you have saved a lot of registers, but you return right away — because of a condition, something is a null pointer, so you are going to return right away — and on that return you need to restore all the preserved registers, so there are a lot of load-hit-stores there. And then direct JNI: this is the bridge for the linkage-convention difference between the JIT private linkage and the system linkage.
You need to bridge that difference, even onto a different stack. You can dispatch JNI through the interpreter or a helper call as well, but the overhead is very significant. You can imagine how big it is: you need to copy the arguments from the Java stack to the native stack, and you even need metacode to set things up.
You need to decode the signature to know whether something is a floating-point argument or an integer argument or whatever; that decoding takes time, and the copying of arguments takes time. And you need the metacode sequence to set up the registers — for example, you need, say, R3 to contain the first argument on Power. That kind of overhead is high, and the JNI direct call itself has a lot of overhead.
You need to build the stack frame to anchor the call; you acquire and release VM access, which includes an atomic update; after your call, when you come back, you need to free the reference frame; before the call you need to prepare for a potential GC; and at the end you need to tear down the stack frame. This is pretty expensive, and right now I think we have a better, lighter mechanism in the works — possibly it is better. Any questions?
Instruction scheduling and reassociation. Instruction scheduling: basically, you have the constraint of keeping the same semantics of the instruction sequence, and you move instructions around to fit the pipeline better. There is not much benefit with the current cores' deep out-of-order execution engines — I have examples on the next page.
So I'm going to describe why instruction scheduling is not that beneficial on an out-of-order machine. In the generated-code column and the scheduled column I have a basic machine model: the load instruction has a five-cycle latency, and the arithmetic instruction is six cycles for floating point — which is typically true.
The arithmetic instruction is six cycles and the load is four to five cycles. The example is a currently pretty popular idiom for machine learning and whatnot — matrix multiplication or vector dot products — and this is the typical idiom: basically, you square each element and add them all together, accumulating into x.
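The idiom being discussed, written out — a sum of squares accumulated into x:

```c
/* Sum of squares: every add depends on the previous add through x,
   so the accumulation forms one long dependency chain. */
double sum_of_squares(const double *a, int n) {
    double x = 0.0;
    for (int i = 0; i < n; i++)
        x += a[i] * a[i];   /* load, multiply, add */
    return x;
}

static double demo_sum3(void) {
    const double v[3] = {1.0, 2.0, 3.0};
    return sum_of_squares(v, 3);   /* 1 + 4 + 9 */
}
```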
So if you unroll this loop twice, the generated code looks like: load a[i], square it putting the result in r2, add r2 to x; then load a[i+1], square it, and add it. Suppose you are not out of order, though your pipeline is pipelined — pipelined means that if the next instruction is independent of the previous instruction it can go, but if it is dependent, it cannot. So for this instruction sequence:
your load instruction is issued and the result comes back in 5 cycles; the next instruction — the multiply of r1 by r1 — is dependent on the load, so once the load is issued, the multiply cannot go; it has to wait there for five cycles. Although you are pipelined, you are not out of order, so you can roughly calculate that this sequence, under that model, will take 29 cycles.
After scheduling, the first multiply can issue, and on the next cycle, when the answer comes back, the next multiply can go; you can calculate that for that pipelined model it is around 23 cycles. So that is the 20-to-30-percent benefit from the scheduling. But on an out-of-order machine, you feed the generated-code instructions into the out-of-order engine and it will execute them as in the scheduled order anyway, because anything that can execute out of order will go out of order.
So that's why the scheduler is not providing much benefit. However, reassociation is going to help. Here I have the loop reassociated, unrolled eight times: you can imagine your execution stream looks like eight loads, eight multiplies, and a few adds. The reassociation is: when the eight loads and eight multiplies are coming back, you are not going to serialize on the x register; you are going to add r2 and r4 together first, going into a new register.
Basically, you reassociate the whole thing so you do not go sequentially through the one accumulator register: r2 and r4 are added together, r6 and r8 are added together — you have four such pairs, each independent of the others, so they can all go. After that you have two independent partial sums to add together, and those can go in parallel as well. Only at the last moment do you serialize on x: x plus t1, then x plus t2.
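That reduction tree, as a C sketch of the 8x-unrolled, reassociated loop. The accumulator names are illustrative; note that a JIT cannot do this freely for floating point, since reassociation changes rounding.

```c
double sum_of_squares_reassoc(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        /* four independent accumulators: the adds within one iteration
           do not serialize on a single register */
        s0 += a[i]   * a[i]   + a[i+1] * a[i+1];
        s1 += a[i+2] * a[i+2] + a[i+3] * a[i+3];
        s2 += a[i+4] * a[i+4] + a[i+5] * a[i+5];
        s3 += a[i+6] * a[i+6] + a[i+7] * a[i+7];
    }
    double t1 = s0 + s1, t2 = s2 + s3;   /* two independent partial sums */
    double x  = t1 + t2;                 /* serialize only at the end    */
    for (; i < n; i++)
        x += a[i] * a[i];                /* leftover elements            */
    return x;
}

static double demo_reassoc(void) {
    const double v[9] = {1, 1, 1, 1, 1, 1, 1, 1, 2};
    return sum_of_squares_reassoc(v, 9);   /* 8*1 + 4 */
}
```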
So, assuming two pipes for each instruction type and the same latency model: if you unroll 8 times but only schedule — 8 loads, 8 multiplies, and then 8 adds serialized — that will require 59 cycles. But if you reassociate, as the reassociated column shows, you can calculate that it is roughly 35 cycles. That is almost twice as good. Our JIT compiler right now cannot do this, but it is sitting on Andrew's radar somewhere.
Thank you. So let's give a round of applause for doing such a great job on a very complicated topic. We will end here today. If you have any more questions, let Julian or me know. Like we were talking about earlier, he will deliver the second part, so hold your interest for a month — if you really want to know earlier, we can also reschedule the talk, but right now it is scheduled as next month's talk. And if you have something that you would like to share, like I was talking about earlier...