From YouTube: ExaFEL project port to GPU
Description
Johannes Blaschke of LBNL presents a talk on the ExaFEL project's port to GPU. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Muaaz Awan
Thank you very much. Yeah, so I will essentially be following on from what Nick Sauter presented just before lunch, and I want to start by thanking all of these collaborators. They are all members of Nick's group at LBL (except Hugo, but Hugo also worked extensively on this GPU port), and I think it's only fair to acknowledge all the people who've done the actual work. The technique that Nick presented just before lunch is crystallography, and let me put it in the most basic physical terms.
Imagine you have a crystal, in the sense that it is a lattice of scatterers. When an X-ray beam comes in, the coherent X-rays will scatter off these scatterers arranged in a regular lattice, some physics happens, and then on a detector you will see bright peaks with dark regions in between. Here's a sketch for a one-dimensional array of scatterers: you would see these localized peaks, with maybe a little bit of structure in between.
In fact, we can be specific about what kind of physics happens there. It's all wave optics, really: you can think of each of these scatterers as a source of coherent light, in this case X-ray light. Depending on which position I choose, on which pixel I choose on my detector, I will get either destructive interference or constructive interference at that position from all the waves emanating from all the scatterers. And this means that, because we understand wave physics very well, we can actually simulate the intensity that we would see at each one of these pixels.
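As a concrete illustration (standard wave optics, not an equation taken from the slides): treat each scatterer at position r_n as a coherent point source; the amplitude at a pixel is the phased sum over all scatterers, and the recorded intensity is its squared modulus.

```latex
% Coherent superposition from N scatterers (standard wave optics, my gloss).
% q is the scattering vector, f_n the scattering strength of scatterer n.
A(\mathbf{q}) = \sum_{n=1}^{N} f_n \, e^{\,i\,\mathbf{q}\cdot\mathbf{r}_n},
\qquad
I(\mathbf{q}) \propto \lvert A(\mathbf{q}) \rvert^{2}
```

For a regular one-dimensional lattice of identical scatterers this sum collapses to the familiar grating pattern $\left[\sin(Nqa/2)/\sin(qa/2)\right]^{2}$: sharp peaks where all terms add in phase, and near darkness in between, exactly as in the sketch.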
That intensity is given by the equation shown here, and the important thing about that equation is that it combines the structure of the individual scatterers with the arrangement of the scatterers themselves, and these together lead to the different intensities of this interference. Then we have some properties of the incoming wavelengths as well, and on top of that we'll see some random noise interacting with the individual pixels, so there's a kind of background to all of this.
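In kinematic diffraction theory the equation factorizes along exactly those lines; the following is a hedged reconstruction on my part, not read off the slide.

```latex
% Hedged reconstruction of the slide's equation: pixel intensity as the
% product of a unit-cell structure factor F_cell (the individual scatterer),
% a lattice factor F_latt (the arrangement), and beam properties J, plus a
% background term B for the random noise.
I(\mathbf{q}) \;\propto\; J \,
\lvert F_{\text{cell}}(\mathbf{q}) \rvert^{2}\,
\lvert F_{\text{latt}}(\mathbf{q}) \rvert^{2}
\;+\; B(\mathbf{q})
```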
That background needs to be filtered out. And why are we interested in this sort of thing? Well, the previous slide shows what is, so to speak, the forward direction of this kind of simulation: if we start off with a crystal and the parameters associated with a beam, we can produce interference patterns. But the much more exciting problem is to flip it around.
So, essentially, we could have a bunch of diffraction patterns (these are real diffraction patterns), and what we want to know is: what are the crystals responsible for producing that kind of pattern? In fact, we can't just disentangle the crystals from the properties of the beam themselves. And so the vision of all of this is:
We will take a massive number of these images, we will cram them all into Perlmutter, and out comes some interesting science: some unknown structures that we are interested in. The way we achieve this using CCTBX is essentially a hierarchy of different codes (you can take a look at CCTBX here), and the overall structure of these simulations is roughly the same. At the top, at the user-facing level, we have Python that acts as the glue code.
The Python layer orchestrates the data analysis, so you might have something like this: we loop over a bunch of parameters, and for each parameter we do some simulations and then some data I/O. But we don't really want to stay in Python altogether.
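A minimal sketch of what such a driver loop might look like (the function and file names are hypothetical placeholders, not CCTBX API):

```python
# Hypothetical sketch of the Python "glue" layer: loop over parameters,
# run a forward simulation for each, then handle the data I/O.
import json

def simulate_spots(params):
    # Placeholder for the compiled forward-simulation backend;
    # returns a dummy "image" so the sketch runs standalone.
    return [0.0] * 4

parameter_sets = [{"detector_distance_mm": d} for d in (100, 150, 200)]

for i, params in enumerate(parameter_sets):
    image = simulate_spots(params)              # expensive compute (C++/CUDA)
    with open(f"pattern_{i}.json", "w") as f:   # data I/O back in Python
        json.dump({"params": params, "image": image}, f)
```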
We use Boost.Python as an API to expose a C++ backend to Python, and this C++ backend orchestrates all the I/O, the data structures, and the logic that you might need for crystallography. As part of our port to GPUs, we have started to take the most expensive components out of the C++ backend, and we've started writing CUDA implementations for these selected functions.
These CUDA implementations are actually fairly straightforward for the forward simulations of bright spots, because our problem is fairly easily parallelized: the individual pixels don't interact with one another, at least not in the forward simulations.
So, where your original C++ loop might have looped over pixels, what we do is assign each CUDA thread a set of pixels and just iterate over them in a strided fashion, handling each pixel independently of what the other pixels are doing.
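The real kernels are CUDA C++ behind the Boost.Python layer; purely to illustrate the same grid-stride pattern, here is a sketch in Python with Numba's CUDA target (my example, not ExaFEL code):

```python
# Grid-stride loop: each CUDA thread handles a strided set of pixels,
# each pixel independent of all others. Illustrative, not ExaFEL code.
import numpy as np
from numba import cuda

@cuda.jit
def per_pixel_kernel(pixels, out):
    start = cuda.grid(1)       # this thread's global index
    stride = cuda.gridsize(1)  # total number of threads in the grid
    for i in range(start, pixels.size, stride):
        out[i] = pixels[i] * 2.0  # stand-in for the per-pixel physics

pixels = np.random.rand(1 << 20).astype(np.float32)
out = np.zeros_like(pixels)
per_pixel_kernel[128, 256](pixels, out)  # (blocks, threads per block)
```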
So, for instance, we might want to resolve some per-pixel details. The detectors themselves have some physics associated with how they absorb X-rays, so we can loop over the detector thickness: essentially, the thicker the sensor, the more easily it absorbs a photon. The photons can also come from different sources and at different angles, and the crystals don't have a perfectly regular structure either: there may be different mosaic domains, so we might need to iterate over those domains as well.
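The thickness dependence is ordinary Beer-Lambert absorption (my gloss, standard detector physics rather than an equation from the talk): a sensor of thickness t and attenuation length ℓ captures a photon with probability

```latex
% Beer-Lambert capture probability for a sensor of thickness t and
% attenuation length l: thicker sensors absorb photons more readily.
P_{\text{abs}}(t) = 1 - e^{-t/\ell}
```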
Finally, once we have computed this on the GPU: we haven't ported the background and the noise to the GPU yet, so on the CPU we add this random noise. Let's just see what the end result of such a nanoBragg simulation is. NanoBragg, by the way, refers to this forward simulation. What nanoBragg does is:
It will take some parameters, some details of the crystal that you're simulating and of the wavelengths and the beam, and it will produce Bragg spots. There are several things I want you to be aware of here. Bragg spots are fairly localized; Nick did show that they have structure. In this simulation I've actually zoomed into the center, and you can see they're sort of fuzzy blobs.
They are not points in the true mathematical sense, you can see that they all have varying intensities, and there is a background here. Now, to get a handle on the inverse problem, I thought it might be instructive to look at what happens when I simulate changing the crystal's position, or its distance from the detector.
These are videos here, for example. You might be able to see some noise in here, which is the simulated noise, and also some aliasing, because as the crystal moves, its interactions with the pixels will be different, so I might see some aliasing effects. And this case would be fairly simple to invert: you look at a diffraction pattern, and
if you know what kind of crystal you're looking at, then you can just look at the positions of the pixels themselves and say: well, it was at that distance from the detector. So let's look at a more complicated situation: let's rotate the crystal about the z-axis. We see something much more, you know, trippy happening here. We're really only rotating the crystal, yet it creates a completely different-looking pattern.
Determining the orientation of the crystal might already be a harder ask. Anyway, getting back to the GPU port of nanoBragg: our first objective (and this actually happened before I joined the project) was to accelerate these forward simulations, that is, to start with some known parameters, simulate the bright spots, and accelerate that code. Here you can see a comparison between the performance of CUDA on a V100 and a single thread on a Skylake CPU, and essentially what that work has done is this:
If you really cared about noise (and that might be the next target, but let's ignore it for now), you can see that we have accelerated the spot simulation by a factor of 22. We could, and probably will, accelerate it much further, because data movement currently takes up about 51% of the CUDA time, and there's also an API overhead that's pretty high, so we will be looking into that more. The reason why we care about this is something that Nick alluded to before: we are interested in the inverse problem.
We want to start with a lot of simulated images... well, what we want to do is to tell, based on forward simulations like these, what the crystal parameters of this measured image here are. You can see some bright spots over here; it doesn't look anything like the simulation, but the idea is: let's iterate over different forward simulations so as to minimize the mismatch between the forward simulation and the measured data.
In fact, we can do this intelligently with quasi-Newton optimization. Essentially, the idea is that we use a forward simulation and the measured pixels to guide the next set of parameters for the forward simulation, iteratively decreasing the mismatch; the lowest-mismatch parameters are then our best estimate for the crystal parameters.
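A minimal sketch of that refinement loop, using SciPy's L-BFGS-B as a stand-in quasi-Newton optimizer; the forward model and the data here are toy placeholders, not diffBragg:

```python
# Quasi-Newton refinement sketch: pick the parameters that minimize the
# mismatch between a forward simulation and the measured pixels.
# The forward model below is a toy placeholder, not diffBragg.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
true_params = np.array([1.5, -0.7])
x = np.linspace(0.0, 4.0, 200)

def forward_sim(params):
    # Stand-in for the nanoBragg forward simulation.
    return params[0] * np.sin(x) + params[1] * x

measured = forward_sim(true_params) + 0.05 * rng.standard_normal(x.size)

def mismatch(params):
    residual = forward_sim(params) - measured
    return float(np.sum(residual**2))

result = minimize(mismatch, x0=np.zeros(2), method="L-BFGS-B")
print(result.x)  # lowest-mismatch parameters: the best estimate
```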
I also want to take a minute or two to point out an important aspect of these kinds of codes that is sometimes overlooked when we just want to accelerate kernels: the full software stack can matter. This is just a small example of what happens when we process a large batch of images stored on disk. Here we have time on the x-axis and the MPI rank on the y-axis, and red is bad here.
Red means I/O and data movement over the network. Over here we had a problem with MPI: you can see things bunch up and there's a lot of red, so that's very bad. Then over here we've reduced this I/O time by optimizing the way we actually schedule file access. But none of this is happening at the CUDA level.
So I just want to plug Jonathan Madsen's timemory utility here, because it allows us to build profilers that are able to profile across the Python, C++, and CUDA boundaries. A very quick, basic example: in Python, we just import this wall-clock object and we can surround the Python constructor for our nanoBragg simulator with it; and in C++ we can use timemory to decorate the same constructor, but here I've changed the label, "cpp" versus "py", so I've sandwiched the C++ profiler inside the Python profiler.
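A hedged sketch of the Python side; the timemory API names are as I recall them from its documentation and should be treated as assumptions, and `make_simulator` is a placeholder for the Boost.Python nanoBragg constructor:

```python
# Wrap a constructor call in a timemory wall-clock marker so the Python
# layer is profiled alongside the C++ layer. API names are assumptions
# from memory of timemory's docs; make_simulator is a placeholder.
import timemory
from timemory.util import marker

@marker(["wall_clock"])
def make_simulator():
    return object()  # placeholder for the Boost.Python nanoBragg object

sim = make_simulator()
timemory.finalize()  # flush the collected timing report
```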
And now, if you profile this whole thing, you can see what's happening here: we're calling nanoBragg in Python, that dispatches a call to C++, and the only thing that sits in between is the Python/C++ API.
When we then look at the differences in the time taken, you can see that in this example we can pick up the time spent in the Python API call. So I will essentially be expanding the use of timemory, not only to profile CUDA but to get a snapshot of the full software stack.

(Session chair: Johannes, you have 30 seconds left. Can you wind up?)

That's perfect, I am already done. I just wanted to say that all of this is work in progress.
Our nanoBragg CUDA port has already resulted in a decent speedup, but if we improve our data movement and our API based on what we are observing, we can definitely get way more out of it. Next on our plate is optimizing the diffBragg iteration itself, and then finally we want to look at OpenMP offloading, but, for example, there are some little issues associated with the fact that Python and our software stack like GCC, which has some problems playing nice with OpenMP
at the moment. All right, with that I'd like to thank you all for your attention, and I'd love to answer your questions.

(Session chair: Thank you very much, Johannes. So, can we have a couple of questions? The audience can ask questions in the Q&A box.)
On the GCC and OpenMP issue: currently we have a workaround strategy that is based on using Clang and then just hoping that the rest of Python doesn't break when you cross-link. But so far we've only used Summit and Cori GPU, and we'll be using Perlmutter; I'm going to deploy some of this on Tulip and Iris next, once we get the OpenMP issues figured out.

(Session chair: So are you confident you'll be able to replicate on OpenMP the performance you got from CUDA?)
I'm confident that we'll get something to work; I don't know whether I'll be happy with the performance.