From YouTube: PASTIS @ ExaBiome
Oguz Selvitopi
PASTIS @ ExaBiome
Hi everyone, my name is Oguz Selvitopi and I'm with the Performance and Algorithms Research group at LBNL. Today I will be talking about our work on protein similarity search and how to do it on distributed-memory systems using GPUs. This is a joint effort of the ECP projects ExaBiome and ExaGraph, and our tool that does this search is called PASTIS.
Comparative genomic studies examine the biological relationships between different organisms by exploiting the similarities of their genome sequences. A major component of many biological workflows is to find the existing genes by aligning them, by comparing them against a reference database. For example, a common task is to find out the functional or taxonomic content of samples collected from an environment, often by comparing the collected sequences against an established, known reference database.
So it becomes more and more important to build fast computational infrastructure for this kind of comparative genomics, as more and more genomes are sequenced every day, with sequencing costs dropping and the technology becoming more widely available.
The bottlenecks in metagenomics research are shifting towards computation and storage. For those unfamiliar, metagenomics means that the genomes in a sample can come from virtually any type of organism, not just one: they can come from bacteria, from a plant, from an animal, from a human. These are called metagenomic datasets, and they are extremely large, because they can contain any genes, related or not; you do not know anything about them in advance.
There are existing tools in this area. Although these tools are fast and well optimized for shared-memory environments, they are not really well tuned for distributed-memory environments. They usually implement a tool for a shared-memory environment and then try to extend it to a distributed-memory setting, which often does not scale very well.
These are our two tools. The first is PASTIS, which stands for Protein Alignment via Sparse Matrices; it performs the similarity search on sequences in metagenomic datasets. For clustering, we have HipMCL, which is a distributed-memory, high-performance implementation of the Markov clustering algorithm.
This figure shows our pipeline to form the protein similarity graph, or to perform the protein similarity search. At the left we have the protein sequences; from the given set of sequences we select the pairs on which we want to run the alignment. The alignment produces a similarity measure, which we can then use to establish how similar the pairs are.
Usually, at the end, we filter out the pairs that do not meet certain criteria in similarity or coverage, and this finally leads to the similarity graph, or the similarity matrix, however you look at it. This graph can then be used to cluster the sequences, to get the protein families, to discover new protein sequences, and so on.
A major novelty in PASTIS is the treatment of graphs as sparse matrices. The basic information storage and manipulation medium in PASTIS is sparse matrices. They are used to represent the different kinds of information required throughout the search. For instance, here on the left you can see a matrix whose rows represent the sequences and whose columns represent the k-mers. A k-mer is a substring of a sequence, so if a k-mer exists in a sequence, there is a nonzero at that location.
As you can see, these matrices are quite different from the matrices in classical numerical linear algebra. They are actually graphs, and we attach a lot of custom information: we do not just store a numerical value at the nonzero locations, we store custom data types, and the operations on these matrices usually operate on those extended data types.
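To make this concrete, here is a minimal SciPy sketch of such a sequences-by-k-mers matrix. This is illustrative only, not PASTIS code (PASTIS uses distributed sparse matrices and stores custom data types at the nonzeros rather than plain numbers); the function name and the choice of value are my own.

```python
from itertools import product
from scipy.sparse import csr_matrix

# Hypothetical helper: rows are sequences, columns are all possible protein
# k-mers, and a nonzero at (i, j) marks that k-mer j occurs in sequence i.
def kmer_matrix(seqs, k=3, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    # Map every k-mer over the 20-letter protein alphabet to a column index.
    col = {"".join(p): j for j, p in enumerate(product(alphabet, repeat=k))}
    rows, cols = [], []
    for i, s in enumerate(seqs):
        for pos in range(len(s) - k + 1):
            rows.append(i)
            cols.append(col[s[pos:pos + k]])
    # SciPy sums duplicate entries, so each stored value becomes an occurrence
    # count; PASTIS instead stores richer custom types at each nonzero.
    return csr_matrix(([1] * len(rows), (rows, cols)),
                      shape=(len(seqs), len(col)))
```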
PASTIS relies on CombBLAS to perform the distributed sparse matrix operations; this library supports MPI plus OpenMP hybrid parallelism. For the alignment on CPUs and GPUs, we use two different libraries, SeqAn and ADEPT. Neither of these alignment libraries is a distributed-memory library: they are quite fast and optimized for shared-memory environments, and they are not aware that we use them in a distributed manner. So we orchestrate them: PASTIS performs the search and delegates the alignments to those libraries.
So what are the basic challenges of this search? There are two basic types of computations performed in our search: sparse matrix computations and edit distance computations. The former, the sparse computations, are memory bound: they have low computational intensity, a high memory footprint, and a lot of irregular memory accesses. The edit distance computations, in contrast, have high computational intensity and a uniform pattern in computing the edit distance matrices.
The edit distance computations are small and dense, so they perform a lot of computation per byte, while in the sparse computations that ratio is quite low, which is why those have a high memory footprint.
Another challenge is that this kind of search usually requires a huge amount of memory, and the existing libraries in this area have various techniques to deal with this issue, ranging from writing intermediate files to performing the search in stages. The challenge is that the number of candidate pairs that need to be stored and aligned grows on the order of n squared, since we are doing a many-against-many search.
If we increase the number of sequences n, the search space is going to grow with n squared; for example, doubling the number of sequences roughly quadruples the number of candidate pairs. So it is going to require an extensive amount of memory.
We have several different techniques to efficiently perform the search. These are some of the techniques and algorithms we have implemented; they include low-level optimization techniques as well as algorithmic innovations. The first technique we use is the blocked, or incremental, formation of the similarity graph: here we have a novel algorithm called blocked two-dimensional sparse matrix-matrix multiplication, which is used to find the candidate pairs that are going to be aligned. We also have different algorithms and heuristics for load balancing, and then specific optimizations for the GPUs.
When we use the GPUs, we have a technique that can utilize all the resources on the node by distributing the work we do while performing the search.
This figure shows the blocked two-dimensional sparse matrix multiplication. As you can see, the matrices on the left are multiplied to find the candidate pairs in the middle, which are the nonzeros in the overlap matrix. Each nonzero is a candidate pair to be aligned, and then, after the alignment, on the right we have the similarity matrix, which contains the information we need.
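Reusing the sketch above, the overlap matrix can be illustrated as a single sparse product. This is a toy, single-node stand-in for the distributed blocked SpGEMM, and the variable names are mine.

```python
# Candidate pairs from a sparse product: with A the sequences-by-k-mers
# matrix, A @ A.T is the sequences-by-sequences overlap matrix whose
# nonzero (i, j) counts the k-mers shared by sequences i and j.
A = kmer_matrix(["MKVLAHHH", "MKVLAWWW", "GGGGGGGG"])
overlap = (A @ A.T).tolil()
overlap.setdiag(0)                          # drop trivial self-overlaps
overlap = overlap.tocsr()
overlap.eliminate_zeros()
candidates = list(zip(*overlap.nonzero()))  # pairs to align, e.g. (0, 1)
```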
We can limit the memory required by the entire execution of this search, or the memory footprint of the similarity graph, by using a blocked, or incremental, formation of this similarity matrix, or overlap matrix.
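A minimal sketch of the idea, under the assumption that we block the output by rows (the real algorithm blocks the distributed 2D product; this generator is just illustrative):

```python
# Form the overlap matrix one row block at a time, so only a single block
# of candidate pairs is resident in memory; the block count is tunable.
def blocked_overlap(A, n_blocks):
    n = A.shape[0]
    step = -(-n // n_blocks)              # ceiling division
    for b in range(0, n, step):
        yield b, A[b:b + step] @ A.T      # one block of the overlap matrix

for start, block in blocked_overlap(A, n_blocks=2):
    pass  # align this block's candidate pairs, then let the block be freed
```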
This block-count parameter is tunable, so we have control over the amount of memory we use. In this way we aim at maximum utilization: we enable in-memory search for very big datasets, and it also opens up the path for several storage optimizations. The obvious disadvantage is that it increases the runtime compared to doing everything all at once, but that is better than not being able to do it at all; it also increases the communication.
We have different techniques for load balancing. As you might have noticed, the overlap matrix in the previous slide is symmetric, because if sequences i and j are similar, then sequences j and i are also similar. By exploiting this symmetry, we can avoid alignments and even parts of the sparse matrix computations.
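In the toy SciPy setting above, exploiting the symmetry amounts to keeping only one triangle of the overlap matrix (again just a sketch of the idea, not the PASTIS mechanism):

```python
from scipy.sparse import triu

# Keep only pairs (i, j) with i < j: each alignment is computed once and
# its result can be mirrored to (j, i), halving the alignment work.
upper = triu(overlap, k=1, format="csr")
```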
For this blocked, incremental formation of the matrix, we tested two approaches, based on triangularity and on indices. The basic difference is that the triangularity-based approach avoids a lot of the sparse matrix computations, but it can still lead to load imbalance, mainly due to the diagonal blocks of the matrix.
As you can see, the index-based method is able to attain a better load balance than the triangular-based method for all tested block counts. But at the right you can see that the triangular-based method is able to save a lot of computation, and when the number of blocks increases, the effect of this load imbalance shrinks or disappears in the triangular-based method, because the number of diagonal blocks grows linearly with the number of block rows, while the total number of blocks grows with its square.
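A quick illustration of that counting argument (a hypothetical helper, not from the talk):

```python
# With b block rows, a triangular traversal touches b*(b+1)/2 blocks, of
# which only b are diagonal, so the diagonal share (the source of load
# imbalance) shrinks roughly as 2/b when b grows.
def triangular_blocks(b):
    return [(i, j) for i in range(b) for j in range(i + 1)]

print(len(triangular_blocks(4)))    # 10 blocks, 4 diagonal (40%)
print(len(triangular_blocks(16)))   # 136 blocks, 16 diagonal (~12%)
```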
We perform the alignments on the GPUs, so we delegate the alignments to the GPUs, while the high-memory-footprint, irregular computations, the sparse matrix computations, we perform on the CPUs. In the blocked formation of the graph, the CPU has to prepare the candidate pairs to be aligned; it then sends them down to the GPU, which performs the alignments, copies the results back to the host side, and post-processes them to incrementally form the similarity graph.
If this were done naively, the CPUs would stay idle. But the blocked formation allows the CPUs to go ahead and prepare the next batches while the GPUs are busy with the current batch. In this way, we can hide the overhead of the sparse matrix computations.
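The overlap can be sketched as a one-deep pipeline. This is an illustrative, thread-based stand-in with placeholder functions, not the actual PASTIS implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Overlap CPU batch preparation with "GPU" alignment: while batch b is
# being aligned, batch b+1 is already being prepared on a CPU thread.
def pipeline(batches, prepare_on_cpu, align_on_gpu):
    results = []
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = cpu.submit(prepare_on_cpu, batches[0])
        for nxt in list(batches[1:]) + [None]:
            prepared = pending.result()                    # current batch ready
            if nxt is not None:
                pending = cpu.submit(prepare_on_cpu, nxt)  # prep next batch
            results.append(align_on_gpu(prepared))         # align meanwhile
    return results
```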
These are distributed sparse matrix computations, with all the collective communications and the irregular memory accesses, so avoiding those types of operations is really useful, because the sparse computations do not scale well, as opposed to the alignments, which usually scale really well. As you can see here in the table, compared to doing this naively, in the plain way, the blocking achieved around a 30 percent reduction in the overall execution time. On the left you see the strong scaling performance, and on the right the weak scaling performance. The triangular-based approach achieves a strong scaling efficiency of more than 75 percent scaling up to 400 nodes. As for the weak scaling efficiency, there are different components, as you can see here with those lines, but overall, at 800 nodes, it maintains an efficiency of more than 80 percent.
The I/O here, as you see, does not scale well, but you can disregard it, because it usually constitutes no more than three percent of the overall execution time. So it is not really a bottleneck in our tool.
So what did we encounter while porting our code to Perlmutter? We did not face any major issues while porting PASTIS to Perlmutter, as our tool was already able to run on NVIDIA GPUs. But the thing is that PASTIS needs both CPU and GPU resources, and on Summit the nodes' CPUs were relatively slow compared to the GPUs, so the sparse computations started to dominate our execution time.
The alignments were really fast when we put them onto the GPUs, so Perlmutter's faster CPUs benefited us. On the right here you see the Perlmutter comparison against Summit and Cori Haswell, a small-scale experiment on 64 nodes. As you can see, both the CPU and the GPU components run faster than the corresponding CPU and GPU components on Summit, not even to mention Cori Haswell, which is quite lost in this figure.
We also conducted large-scale experiments on Summit and Perlmutter. As for the data sizes we tried, the highest number of sequences on Summit was more than 400 million, and on Perlmutter we tried 200 million sequences. As for the scale, on Summit our largest run used around 3,300 nodes, and on Perlmutter we tested on around 1K nodes. On Summit, our smaller-scale run on 2K nodes with 300 million sequences took roughly four hours.
The larger similarity search, on 405 million sequences, took around two and a half hours. The important metrics here are alignments per second and cell updates per second; those are quite good, and we improved them a lot in our bigger run. This bigger run we submitted for the Gordon Bell Prize, and it was selected as a finalist, so it is going to appear at SC22.
If you are interested, you are more than welcome to attend our talk.
We had similar performance on Perlmutter compared to Summit, but at a smaller scale, tested on 1K nodes and 200 million sequences. On the right here you see alignments per second per node; when we exclude I/O, Perlmutter is quite competitive with Summit. If Perlmutter were bigger, as big as Summit, we would also have submitted those results for the competition. But I think the more important point is the table at the bottom here.
As you can see, these are the times spent in the sparse component and the alignment component of PASTIS. The closer they are, the better, because the overall execution time is going to be determined by the maximum of these two. The numbers in red are the ratio of the alignment component to the sparse component, and on Perlmutter they are quite close.
As for the state of the art, the biggest run reported to date was reported two years ago: a search between two datasets containing 280 million sequences and 40 million sequences. In that search, the alignment rate was computed to be 1.2 million alignments per second. In our tool, on Summit, with our biggest run, we achieved almost 700 million alignments per second.
That is roughly a 580x improvement in this metric, more than two orders of magnitude. And as for the search space, the actual size of the search, we increased that by an order of magnitude, about 15x.
Biological workflows can be both compute intensive and they usually have huge memory footprints; I think that is what makes this problem more challenging compared to, say, chemistry simulations. In our approach, we use accelerators for the alignments, we rely on algorithmic techniques for staying in memory, and we use no intermediate I/O.
We only use I/O at the beginning and at the end, so this makes our approach faster. We also implemented a lot of optimizations to take advantage of the heterogeneous node architectures and to be able to use all the resources on the nodes. At the end of the day, we cut the time to solution from days or weeks to hours.
Thank you for your attention. If you have any questions, I will be glad to answer them.