From YouTube: Parallel Rustc Planning Meeting 2019-10-21
Description
Discussing the jobserver integration, how LLVM parallelism works today, and what better Rayon-jobserver integration might look like.
B
So I read up a lot, and I wrote up a document that's just kind of an overview of what the parallelism is and where the job server fits in specifically. I don't mind going over any of the various pieces, and I don't mind diving into one particular one, but we can also just start with the job server and see where it goes from there.
B
It's kind of been a while since I've looked at what exactly we have in tree right now, but I do think a lot of the performance issues are related to the job server; I just don't think they're related to the LLVM integration at all. I believe this is where, a little while ago, we were talking about how maybe rustc doesn't want to use Rayon.
B
Maybe it just wants to use parallel for loops in a few places. That's where, right now, I think the Rayon integration is very, very fine-grained about where it acquires and releases jobserver tokens, and that causes a huge amount of traffic to the kernel, and that's one of the main reasons I think we're seeing things slow down. So it's not actually explicitly related, as far as I know, to the current jobserver integration of the LLVM backend.
A
Okay, well, why don't we start in whatever order you think? I still feel like it'd be good to cover both those things if we can, so whatever order you think is best. But that makes sense, what you're saying, and it's actually encouraging, I guess, because it suggests we might be able to get better performance without as much restructuring. Yeah.
B
And so I do think it's worthwhile to talk a little bit about what we want to do with the codegen backend going forward, because I do think there are going to be some big simplifications we can make once we have a fully parallel compiler. So yeah, let's just go through this. Basically I'm fine doing whatever's easiest: I can just go down this doc and give the high-level points, and then we can drill into questions as they come up.
B
All right, so the number one thing to do here is actually, if you want to scroll down and click on the graph, that little picture-looking thing, that's going to be the easiest thing to take a look at. You can't really see my cursor anyway, so: the parallelism on the backend all has to do with this.
B
Right, so I'm going to bring up a blown-up portion of the picture. All right, this is perfect. Okay, so all of the parallelism that we get is not in the compiler itself, but rather through LLVM. So all codegen has to do is take the rustc data structures, like MIR, translate them to LLVM IR, and then, once we're at LLVM IR, we have to optimize it and then codegen it.
B
The translation phase has to be serial because it uses the type context, and that just inherently isn't Sync right now. So this big bar we see at the top right here, this is translation: this is the main thread performing translation. And then, as soon as we have an LLVM module ready to go... each LLVM module is completely independent from all the others in terms of LLVM data structures, and so it's safe to send across threads.
B
So you see threads picking up work from other places, except for kind of this one, thread 3, right here. But in any case, that's the main phase of parallelism: the main thread will translate stuff and send it over to other threads, and they will optimize it. The optimization even happens in debug mode; it handles things like inline(always), very, very fast things like that. And then what we're looking at here is actually an optimized build, and so there's this kind of wall at this portion, and this represents the ThinLTO passes.
B
So as soon as every single codegen unit from the left side has been optimized initially — this one from thread two is the very last one, so it finishes right about here — that translates to this: we do a tiny amount of work to do the ThinLTO analysis, and then, as you can see, we fork off a bunch of ThinLTO work to actually happen later on. And so that's where all the parallelism comes from. So this is sort of the high-level view.
B
It was a really crappy format for viewing these — crappy in the sense that it's not nearly as fancy as the current one. So this is very, very recent, from the profiling work, and to me it's so nice to view these kinds of graphs; I've barely been looking at these before. So as you look at them in real life, we can do a lot better. And that's the only section I wrote here about codegen units, which is that for parallelism this isn't too interesting.
B
So this bar at the top is the main thread actually translating to LLVM; that's mostly in rustc, because we're processing MIR. But yes, every single other one is predominantly us just calling into LLVM and saying: run your optimization passes — and that's basically the entire thing. There are little tiny pieces where it isn't, but it's almost all LLVM, and this is where we are actually executing LLVM on multiple threads.
B
As far as I know, yes. I think there were some efforts a long, long time ago to parallelize within a module, but I'm not really sure they ever got off the ground or landed in tree. Now, that could be wrong — it's been years since I took a look at this — but I'm relatively certain, because this would have been one of the number one things that Clang would want to enable.
B
All right, so here's kind of an overview of what can and cannot be parallelized. We have these five steps. The first one is that we have to actually split codegen units in rustc, just to figure out what goes where. That's kind of hard to parallelize: you can make the query parallel, but you can't parallelize it too much. Then there's the actual translation, which...
B
That's true, and I haven't done a huge amount of profiling, but this is a non-trivial amount of time in the compiler. It's not massive, just the splitting things up, but it's not tiny either; it's not inconsequential. But the next part, which is the most interesting one, is where we actually translate to LLVM: we're translating MIR to LLVM IR. Today that is not parallel, but it actually can become parallel once we get the whole parallel compiler.
B
That's the number one major simplification the codegen backend gets: we could just translate everything in parallel all at once, so we don't have to have this coordination between the main thread and the other threads and all that. That's a really big thing I wanted to call out about codegen: this is sequential today, but it can be parallelized. The ThinLTO analysis is by definition serial, so it's okay that we can't do anything about that, and then the actual work after ThinLTO is already completely parallelized.
B
At the beginning there's this weird detail we have: everything in the backend is controlled via a coordinator thread that is not the main thread, and the coordinator thread is the one that decides everything. It'll tell the main thread to codegen a module, it'll actually spin off a thread to go work on an LLVM module, it'll be the one doing the ThinLTO passes or sending those to a thread to do the passes. And so there is a very long comment...
B
You know, I'll click on it — a very, very, very long comment that goes on and on. I actually read that recently, and it's still pretty accurate and up to date in terms of what the coordinator thread does. But this is largely an implementation detail, and I expect it to almost entirely go away once we get a parallel compiler, because we just won't need this craziness.
A
Whoever did this work did it to, sort of, not do more than N at a time and reduce the peak memory usage, basically, as well as starting as soon as you can, which is good. And I'm a little bit nervous that if we just made it so that each query takes a codegen unit and runs it... I guess we kind of get some of that for free, now that I think about it.
B
Actually, there's a section I wrote specifically on this, and you're definitely right. It used to be like that: that stair-step didn't used to happen. When we first had parallel codegen, we translated everything serially and then did everything in parallel. That meant peak memory usage was literally every single LLVM module in memory at the biggest it's ever going to be, because it's unoptimized IR, plus the type context.
B
Everything stays live, so that was what killed us, and we needed to fix it. I actually tracked it down: it was some innocuous PR which doesn't sound like it's tackling exactly this, but it was when we introduced that stair-stepping effect, and the explicit purpose was to make sure we don't hold literally everything in memory all at once. So we still hold on to things, and I think one of the many things is that we want to try to...
B
We want to drop the type context as soon as possible after we translate everything. But on the other hand, we do actually have every single LLVM module in memory at once when we do ThinLTO, and so, if I remember right, we'd have to talk to mw because he might know for sure — he was the one investigating this. I'm pretty sure it came from Firefox, and I think it was incremental: with incremental you could have hundreds or thousands of objects, and having all of those...
B
The main thing is that we need to drop the type context as soon as possible. And this is an example specific to ThinLTO: if you didn't do ThinLTO, we would just cut out this step in general, and so the entire LLVM module would be dropped as soon as it could be, when we actually create the object file. So this is definitely something we need to keep in mind; we have to...
B
We must avoid that. It's a problem we must continue to deal with, but I'm pretty sure that we can largely let the job server limits and the inherent parallelism guide us, as opposed to having these weird heuristics for who's doing what.
B
Because that's all based on the fact that the main thread is the only thread that can translate things, whereas if we get to where every thread just picks up a module, takes it all the way to completion, and then starts the next one, that model is just inherently more naturally self-regulating, because you're working each module from front to back, so no one's producing everything all at once.
B
It's a little more difficult, because if you're not optimizing — so if you're not doing ThinLTO — then everything can be one parallel for loop, because you just want to translate and codegen. If you are doing ThinLTO, you have a synchronization point: we have to optimize first, we don't do codegen yet, then you do some stuff, and then you actually get into codegen after that. And so whether you have ThinLTO or just regular LTO enabled might add...
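For reference, a minimal sketch of the two pipeline shapes described above, using Rayon parallel iterators and made-up stand-in types and helpers (an illustration of the structure only, not rustc's actual backend code):

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for illustration only.
struct Mir;
struct LlvmIr;
struct Object;

fn translate(_m: Mir) -> LlvmIr { LlvmIr }
fn optimize(ir: LlvmIr) -> LlvmIr { ir }
fn thin_lto_analysis(modules: Vec<LlvmIr>) -> Vec<LlvmIr> { modules }
fn codegen(_ir: LlvmIr) -> Object { Object }

// No ThinLTO: every module can go start to finish independently,
// so the whole backend is one coarse parallel for loop.
fn build_without_thinlto(modules: Vec<Mir>) -> Vec<Object> {
    modules
        .into_par_iter()
        .map(|m| codegen(optimize(translate(m))))
        .collect()
}

// ThinLTO: optimize all modules, hit a synchronization point for the
// (inherently serial) ThinLTO analysis, then fan back out for codegen.
fn build_with_thinlto(modules: Vec<Mir>) -> Vec<Object> {
    let optimized: Vec<LlvmIr> = modules
        .into_par_iter()
        .map(|m| optimize(translate(m)))
        .collect(); // synchronization point
    let work = thin_lto_analysis(optimized); // serial by definition
    work.into_par_iter().map(codegen).collect()
}
```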
B
Parallelism is actually really difficult to do within build tools, specifically because everything wants to use parallelism. One of the best examples of this is that Cargo is going to spawn num-CPUs processes: it's going to spawn as many rustc instances as it can, up to the number of CPUs. But if there were no rate limiting, no sort of limit on how much can be done, then each of those rustcs would spawn another num-CPUs threads, and if this is recursive, everyone can keep doing that.
B
And so you can very quickly get an exponential blow-up in the number of threads and processes on your system. So the general idea is that in the build process lots of things want to do various amounts of parallelism, but there's still one build process as a whole, and we want to limit parallelism across it in general. We want to make sure that Cargo treats everything nicely and all the rustcs coordinate, and especially in Firefox's use case.
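(Concretely: on a 32-core machine, 32 rustc processes each spawning 32 threads is already 32 × 32 = 1,024 threads, before any further recursion.)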
B
Firefox has not just a bunch of C++ files getting built but also Rust code getting built, and even across that they want a limit on parallelism to make sure you're not spawning thousands of processes. And so the solution for this is this thing called the jobserver. I believe it was pioneered by GNU make, and it's been ported to many platforms since. "Jobserver" is a name for a glorified IPC semaphore, which is a synchronization primitive where you put in N tokens and you can take out N tokens.
B
But if there are none remaining, you just block, and you can add a token back in at any time. And so the idea here is that Cargo typically creates the jobserver and the child processes inherit it — it's passed down via file descriptors, or literally IPC semaphores on Windows. So Cargo will create a token pipe with 32 tokens...
B
...if you have a 32-core machine, and then it will remove tokens as it spawns processes, and then each process internally will attempt to acquire a token if it wants to run parallel work. There's some weird inheritance stuff about the protocol that we can largely not care about.
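For reference, a minimal sketch of how the jobserver crate (the one Cargo and rustc actually use) exposes this; the command and token count here are illustrative, not Cargo's actual code:

```rust
use jobserver::Client;
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Parent (Cargo) side: create a jobserver with one token per CPU.
    let client = Client::new(32)?;

    // Pass the pipe (or, on Windows, the semaphore) down to a child process
    // by configuring its environment and file descriptors.
    let mut cmd = Command::new("rustc");
    client.configure(&mut cmd);

    // Child (rustc) side: rediscover the inherited jobserver from the
    // environment. `unsafe` because the inherited fds must actually be valid.
    // let client = unsafe { Client::from_env() }.expect("no jobserver");

    // Acquiring blocks until a token byte can be read; dropping the returned
    // `Acquired` writes the byte back into the pipe.
    let _token = client.acquire()?;

    Ok(())
}
```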
B
It makes it sort of difficult for us in a few places, but every running process always has at least one implicit token, because it's the token for that process, and then optionally the process can acquire more after that. And so the way it works in the backend right now is that our units of parallelism are extremely coarse: it's just an...
B
...LLVM module, which is massive. And so whenever we get around to it — we have a module we would like to optimize — we call this request_token method. I'm not going to talk much about where, but that basically tells the jobserver crate: please acquire a token and then call some callback that you previously registered, and then it...
B
...says: I've got four tokens, I've got three modules, so let me just go and start executing those. So this will manage it dynamically: it will acquire tokens and it'll spawn work as long as it has tokens. And then — I linked it in here somewhere, I forget where, but — tokens are immediately released as soon as we realize we have too many tokens. So it's not a perfect system; it's not nearly as clean as GNU make, but it's good enough.
B
It's kind of fine. So there's some thrashing of the jobserver tokens here, but it ends up coming out in the wash and not mattering too much, because the parallelism is so coarse and so large. This also means that once we acquire a token, as long as there's work to be done, we will continually hold onto that token. We're not acquiring a token for a module and then releasing it once that module is done; it's more...
A
Let me make sure I have this right: we've got an internal queue of modules to translate somewhere, I guess — sorry, modules to process with LLVM, to optimize — and we put them in there as they get done. As we create the LLVM IR, we stick it in this queue. Meanwhile, this other thing — someone else, I don't know who — is saying: a worker thread would like to start.
B
For tokens, the jobserver crate has this into_helper_thread, which literally spawns a helper thread that does the blocking reads and writes. That helper thread is, like you said, sent a message to do the blocking read, it does the blocking read, and then it calls this callback, and this callback is just sitting on a channel and saying: here, we have a token. So that's where we're sourcing tokens from: you request a token, and then eventually they come back on the channel every time.
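A minimal sketch of the helper-thread arrangement being described, using the jobserver crate's into_helper_thread and request_token APIs; the channel plumbing here is simplified relative to rustc's real coordinator:

```rust
use jobserver::{Acquired, Client};
use std::sync::mpsc;

fn main() -> std::io::Result<()> {
    let client = Client::new(4)?;
    let (tx, rx) = mpsc::channel::<Acquired>();

    // Spawns the dedicated helper thread. It sits in a blocking read on the
    // jobserver pipe and invokes this callback whenever a token is acquired;
    // the callback just forwards the token over a channel.
    let helper = client.into_helper_thread(move |token| {
        let _ = tx.send(token.expect("failed to acquire jobserver token"));
    })?;

    // The coordinator asks for tokens as work shows up...
    helper.request_token();

    // ...and eventually they come back on the channel.
    let token = rx.recv().expect("helper thread hung up");
    drop(token); // releasing = writing the byte back into the pipe

    Ok(())
}
```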
B
The tokens are literally stored in just a Vec. As soon as we get one, as you can see, we just push the token onto the local vector and save it, and that's it, and then we'll pump another iteration of this loop. The beginning part of this loop does a whole bunch of stuff — errors and so on, don't worry about that — and this is every single block, and it's really complicated.
B
We say: we have work, some amount of work to do, and the amount of work we have running is less than the number of tokens that we have, so therefore we can spawn more work. Say we have ten tokens and five units — so five things in the queue, two units running, and ten tokens — so we're going to spawn all of that, and this will just pop it off and kick off the work.
B
We mark them as running and actually go spawn the work, and then once we've spawned as much work as we possibly can is where we actually truncate. A lot of times, if you give us two modules we'll request two tokens, but if we only get one token, we might finish both modules with that one token.
B
Then when we get the second token, we still have to relinquish it. That's part of the clunky interface, where it's not as nice or clean as it might be, but it means sometimes we will acquire a token and then immediately release it back to the system. In practice I don't think this actually matters too much. Otherwise, once again, once we get a token we will never truncate it unless we have fewer things running than tokens we currently hold.
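The scheduling rule being described amounts to something like this simplified sketch (made-up Work type and spawn function, not the actual coordinator loop in rustc):

```rust
use jobserver::Acquired;

// Hypothetical stand-ins for illustration.
struct Work;
fn spawn_llvm_worker(_w: Work) {}

// One round of the coordinator's bookkeeping: spawn as long as we have both
// queued work and capacity (held tokens plus the process's implicit token),
// then give back any tokens we clearly can't use.
fn schedule(queue: &mut Vec<Work>, running: &mut usize, tokens: &mut Vec<Acquired>) {
    while !queue.is_empty() && *running < tokens.len() + 1 {
        let work = queue.pop().unwrap();
        spawn_llvm_worker(work);
        *running += 1;
    }
    // If fewer things are running than we could fund, drop the extra tokens;
    // dropping an `Acquired` writes its byte back to the jobserver pipe.
    if *running < tokens.len() + 1 {
        tokens.truncate((*running).saturating_sub(1));
    }
}
```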
B
The main thread will sometimes help out with LLVM work too, even though it's mostly the one doing codegen, and that dance of managing it will get a lot easier once we have one thread that, from start to finish, takes a module from the beginning of translation all the way through to the end — codegen, or just optimization if it's ThinLTO.
B
Correct. So technically, if you start rustc, it actually immediately spawns a thread with a bigger stack, and then, when you get to codegen, we spawn two threads: the coordinator thread and the jobserver helper thread. So those are three threads that we're starting unconditionally, and yes, there's only one helper thread for the entire process. And I think the current parallel integration — I don't actually know exactly how it works, but if the current parallel integration is using a helper thread, it's a different helper thread than this one.
B
That's sort of inherent, unfortunately. Ideally this would be some non-blocking I/O, where we could wait for a token for five milliseconds or whatever, but the file descriptors on Unix cannot be made non-blocking — make is not ready for its file descriptors to be non-blocking, so we can't set them that way. So we have to do blocking I/O, which kind of forces us to make a separate thread.
B
That was actually the specific feature that I requested, because this is morally what your CPU is doing, but actually this is a special flag being passed: please collapse thread IDs if you can. If you don't pass that, this actually shows you a giant waterfall of just one unit per thread — that's what's literally happening, so the threads are not literally being reused; it's just collapsed in the visualization.
B
Someone was asking about the cost of acquiring and releasing a jobserver token. On Unix, at least, it's a read or write on a pipe: it's a read of one byte to get it, and it's a write of one byte to put it back in there. So this is a very, very expensive operation if you're doing it on every iteration of a very tight loop, but if you're doing it once per LLVM module it doesn't matter at all.
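Concretely, on Unix the acquire and release described here are just one-byte reads and writes on the inherited pipe; a sketch assuming you already hold the two pipe ends (the real implementation lives in the jobserver crate):

```rust
use std::fs::File;
use std::io::{self, Read, Write};

// Acquire: block until one token byte can be read off the pipe.
fn acquire(read_end: &mut File) -> io::Result<u8> {
    let mut buf = [0u8; 1];
    read_end.read_exact(&mut buf)?;
    Ok(buf[0])
}

// Release: write the byte back so some other process or thread can run.
fn release(write_end: &mut File, token: u8) -> io::Result<()> {
    write_end.write_all(&[token])
}
```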
A
And I'd have to look at what I would expect — this is what Zoxc did, as I sort of remember it — in terms of how Rayon works. It has a notion of threads going to sleep when there's no work and then waking back up, right, and they end up blocked on a big lock, and it's something we tried to avoid doing too lightly. It seems like that would be the time to relinquish a jobserver token or get one back.
B
I think that rustc should not hold on to threads or hold on to tokens opportunistically unless there's actually work to be done. Although that could be fungible: I could imagine we hold on to a token for, like, two milliseconds, and if no work comes in within those two milliseconds we release it. But that kind of threshold is going to be really hard to tune, so ideally, I do think...
A
One thing I will say is that the version of Rayon — well, all versions of Rayon — have a pretty not-very-good algorithm for putting threads to sleep and waking them. It's kind of a binary algorithm, actually, where it sort of says either they're all awake or they can start to go to sleep, the point being that as long as there's work happening it tries to keep all the threads awake. In the last month or two I was hacking on an alternative one.
A
That would not be that bad: we'd sort of scale up the number of threads more gracefully based on how much work there actually is. There are a few benchmarks that regress — I haven't landed it yet — basically benchmarks where it turns out to be useful to have all the threads around grabbing work. But that might be why you're seeing so much acquisition there.
B
Yeah. The profiling I did was a long time ago — I don't think it's substantially changed since then — but thread creation was a big one, where I think just creating a bunch of threads was high on the profile. So it sounds like this would definitely fix that. The other one I'm now thinking of is that if every thread is waiting for a jobserver token, then that's definitely an issue, where as soon as a token re-enters the system...
B
Like, if you put one unit of work into the Rayon thread pool, then 32 threads will try to acquire a jobserver token to run that work, but there's only one jobserver token available, so one thread gets it and does it. But then, as soon as it releases it, every other thread sequentially gets it — that might be the issue.
A
I could certainly see that happening. I'm skimming over it — I mean, it's not too hard to look in the Rayon source to see where we release it right now — but it looks like we do indeed release the token when we go to sleep, as you might expect, and then we acquire the token again when we wake back up.
A
So probably, if you did do something like one unit of work being dropped in and that's it — though I'm not sure when that would happen; it would sort of correspond to a slow trickle of jobs, which we probably don't do. We probably have some master queue. I don't know where to look. Maybe not.
B
At least, I definitely saw at the very beginning of a profile, when we have tons of tiny crates that take just a few milliseconds to compile, spawning — I have a twenty-core machine, so spawning 20 threads per rustc every single time, merely for them to go back to sleep and die — that actually was causing a lot of CPU contention, a lot of time spent in the kernel as opposed to in the compiler.
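(Rough arithmetic: at 20 threads per rustc, say a hundred tiny crates at the front of the build means on the order of 2,000 threads created and torn down in those first few seconds, which is where the kernel time goes.)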
B
I do think we need to look a little bit more into what exactly is going on with the jobserver stuff and Rayon right now. A lot of this is kind of guesswork on my end, and I'm trying to see how to improve it, but I don't know for certain what's going on.
A
This comes back to the question of whether we want rustc to have a dependency on Rayon in the first place. I have mixed feelings about that, though. It's not like this is obviously bad — right now, with Rayon's current scheduler, it is, but if you make Rayon have a smarter scheduler, it's not so bad.
B
I was actually curious there. In terms of getting there, there was this idea: what if we only parallelized, like, the top-level for loops? Given rustc's workload, that might actually get almost all of it — not all of it, but it might be a huge win just doing that, and it would be relatively easy to manage the jobserver for, because we have one function that just does the thing. And so I would be curious if, like...
B
...we ripped out all the jobserver stuff and ran a perf run per crate — so no jobserver overhead, just what we currently have — but then also ripped out what we currently have and made it parallel only at the top level. I would be curious to see the timing comparisons between those two. I don't actually know; we might be losing a lot of opportunities for parallelism in the current compiler if we only parallelized at the top level.
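A sketch of what "only parallelize the top-level for loops" might look like, with hypothetical stand-in types (rustc's real loops sit behind its own parallel-iteration helpers):

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for illustration.
struct Item;
fn check_item(_item: &Item) {}

// Instead of fine-grained Rayon work-splitting everywhere, run the big
// per-item passes as one coarse parallel loop each.
fn check_crate(items: &[Item]) {
    items.par_iter().for_each(check_item);
}
```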
A
Looking at the current source just with a quick grep — of course it's too bad Zoxc isn't here; I'm sure he has the most up-to-date information on this — I see that we are doing the collector in parallel. So that was the thing that you said can't be parallelized, and we are indeed parallelizing it. That is to say, when we basically enumerate...
B
Is there anything along those lines? Like, I think at one point, for the type-checking passes, like query typeck, you had one query, but the loop is all happening internally — or the parallelism comes from the tree-like structure — but that doesn't really show up anywhere. It's mostly parallelism through for loops, yeah.
B
Like, I know when I tested it — I tested with Cargo — during the first five seconds everything was grinding to a halt versus flying through crates. So profiling that with and without the jobserver would be a way just to see how bad the jobserver integration is, how much it costs right now, because the other thing that could be costly is spawning threads, where every single tiny rustc instance spawns 64 threads, or 28 threads locally for me.
A
Okay, so we're trying to figure out how much we're costing. That's actually pretty interesting, because when we were talking about the -j1 overhead, we kind of said something about percentages, or we looked for absolute noticeable values. But what you're saying is: if you have some tiny compilation, relatively small, hello-world-style small crates, and there's a large number of them, then even if they're individually not very expensive, it adds up over the course of the cargo run.
B
It's inherently a very difficult thing to measure an entire cargo build, because you don't care about instruction counts at that point, you care about time, and there are so many variables that getting any precise timing, or getting steady measurements that come out clean every time, is really difficult. So it's not the five or ten percent small wins here and there; it's more like: you should be able to download a parallel compiler, turn on parallelism, type cargo build in a big project, and it should feel faster.
C
One thing I kind of want to ask: do we think it's viable that, if we have sufficient parallelism in the compiler, the long-term future is maybe we say Cargo is no longer sort of -j16 and only spawns five rustcs at a time, with the idea being that it's more advantageous for us to parallelize internally? Or is that not really something we would expect to be interesting?
B
I hadn't thought about that. I would maybe say that Cargo will always be better at parallelism, because it's just so simple: it's just processes, and those are guaranteed to be parallel and guaranteed to saturate as much as they can. So it's a question of: if Cargo can spawn a process, should it not, because rustc might do a better job of keeping those cores busy? And I would say, I feel like, actually probably not, because...
A
Really? I mean, I think what you're saying, Mark, if I understand, is that in the beginning there's a lot of parallelism available in the form of crates, so those crates effectively get only one jobserver token each, you know, and therefore they will never get any benefit from their locks — that seems correct, I believe. Yeah, I mean, one question is how low we can get the overhead, and the other question is, I guess — dear God — in the most extreme version we might have two versions of rustc.
A
If I were to take that Rayon branch that is less eager — it'll still start the threads, but it's less eager about waking them from sleep — and produce a rustc branch from it, how hard would it be to get some measurements out of it? What kind of measurements would we want to get?
B
An easy one is perf: single-crate performance, making sure it doesn't regress and seeing if it actually improves. Another one is: given that branch, we would produce full compilers and just test the built compiler against the previous commit, but single-threaded, and take that to build a project and see what happens. I guess I want to see the load average for a compilation be almost at the number of cores for almost the entire time.
B
That should be what we're seeing — and also, it's not just the numbers, it's what users feel, whatever that is. It's more difficult to measure, it's more subjective; I mean, you can put numbers to it, but that's what I would expect: just take it to a couple of projects, build from scratch, and see how it fares.
A
Okay, I'm thinking about how I have to go revisit this. I mean, one of the challenges here is we don't really know how much help we'll get. Rayon is kind of what we want in the sense that work stealing is reasonably well designed for these cases where you don't really know how much help the other threads are going to be, which is exactly what we have going on here.
A
So I guess just having a central queue and pulling jobs from it would work too, as long as it's flat. But we do things like — you know, Rayon will do things like divide the work into chunks that keep getting smaller, such that if it looks like you're on your own for this loop — because when you've finished a chunk and nobody has picked up any of that work from you — then you go on and don't bother to do further subdivisions, and stuff like that.
B
These are the cases rustc is going to hit, so it might just require a lot of in-depth investigation. I think we're still very vague about what the cost of the jobserver is and where exactly the problem is — we're kind of hoping that this fancy new scheduler will fix everything — but once we have something concrete to work with, we can just go and investigate and do a really in-depth analysis to figure out what's going on there.
C
One sort of note I can make is that I've been thinking about and looking at adding the Cargo timing graphs to perf, because we always build the whole crate graph from scratch, and those graphs seem useful. They might not be entirely precise and they probably have high variance, but that could be helpful here as well, because it'll give us some insight into overall CPU usage with varying compilers.
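(The timing graphs referred to here are presumably the ones produced by Cargo's then-unstable -Z timings flag, which charts when each crate compiles and how many compile concurrently over the course of a build.)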