Description
Partial recording of Breakout 1 NUGEX Special Interest Group for Experimental Science Users
A
…them, so you can go back and watch the talks themselves and listen to any of the discussion that happened afterwards, if you are interested. So that's about all I have in the way of introduction, to say what we are. I will certainly say, and I'll probably ask again at the end: if you have a project that you're doing at NERSC that fits into this area, please, by all means, let me know. I would love to get another scheduled list of talks together.
A
So we could continue this series, maybe a little bit later in the fall; I think it was very useful. With that, I am going to go ahead and jump into my talk, which happens to be the first talk in this lightning-round series.
So what we've asked is that all the people who gave talks during the regular sessions over these couple of months, the external users, come and give a summary of those talks; they gave more extended versions during the meetings.
A
But here we wanted just to give some brief summaries, and we decided that we wouldn't have the NERSC staff repeat their talks, because a lot of that information may come out in the other talks they were giving in the plenary part.
So this is a talk that I gave initially at the CHEP conference in Adelaide last November. I recycled it for the special interest group, and I'm recycling it yet again here, but I've taken out some slides to try to make it shorter.
A
The experiment I'm going to talk about, where we are using NERSC, is called GlueX. It's being run in one of the four experimental halls at the accelerator.
The facility Jefferson Lab has is primarily centered on an electron accelerator, which is buried underground. You can see the access buildings here, giving you an idea of its shape. It's really two linear accelerators, coupled together with magnets, so the beam can go around a few times.
A
Three of the experimental halls are buried underground here, in these round mounds. The fourth one, where GlueX is housed, is up on the other side over here. I won't go into a lot of detail about the experiment itself, since we just don't have time, but I will say something about the scale.
A
Over on the far left, for our high-intensity running, we expect to produce around several petabytes of data a year from this experiment, and the data will be taken over several weeks; it might be taken over the 30 weeks of the year that the accelerator may be on. We'll acquire the data, we'll store it, and then, when we do processing of it, we may have to do a couple of passes on it, and we store some processed information. That itself will add up to a few petabytes of information.
A
CPU power is required to do the processing of this data, and this was an estimate of that at one point in time. I think it's actually gone up, because as time goes on people keep trying to improve the code, and most of that improvement makes it give better answers, not necessarily run faster.
A
So the way that we do this, we actually do off-site processing away from the lab in a few different places. We have our own scientific computing farm at Jefferson Lab, and it's got on the order of 10,000 cores in it. So it's not tiny, but it's not really enough to do everything we need to.
A
We do have a Docker container that we made; it's a very thin container. We use a one-line conversion to create a Singularity container out of it, or just import it into Shifter, so we don't have to do any modifications to the container itself. It's thin in that it only has a couple of system-installed packages in it. It doesn't contain our software at all, so we've been able to use the same one for, I think, a couple of years now without having to modify it.
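For context, those one-line conversions look roughly like the sketch below; the image name is hypothetical, and the exact commands depend on the Singularity/Apptainer and Shifter versions installed at the site.

```python
# Hedged sketch: converting a thin Docker image for use at HPC sites.
# "gluex/thin-base:latest" is a hypothetical image name used for illustration.
import subprocess

docker_image = "gluex/thin-base:latest"

# One-line build of a Singularity/Apptainer image from the Docker image.
subprocess.run(
    ["singularity", "build", "gluex_thin.sif", f"docker://{docker_image}"],
    check=True,
)

# At NERSC, the same Docker image can instead be imported into Shifter.
subprocess.run(["shifterimg", "pull", docker_image], check=True)
```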
A
The way we get our software is through CVMFS, the CERN Virtual Machine File System. It's basically a file system where you can publish your files, in this case binaries, and that can then be mounted and read as if it were local; it's a remote file system, kind of like NFS, except that it's read-only where you're operating on it from, but that's fine for what we want to do. We do all of our software builds using CentOS 7.
A
Our Docker container is based on CentOS 7. We put third-party software there, like ROOT, which is a product from CERN. All of our calibration constants go into an SQLite file that is also stored on CVMFS, and other resource files, like large magnetic field maps and material maps, also go there. So they're all published out that way, and they're all considered more or less static information.
A
The calibration constants database does get updated: every night at midnight we generate a new SQLite file from our MySQL database, which is the definitive source and is hosted at JLab. We don't want all of our jobs that are running off-site to reach back to the JLab database server, so we just distribute the calibrations this way. For data transport to both NERSC and PSC we use Globus.
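A minimal sketch of what that nightly export might look like on the CVMFS publisher node; the repository name and the mysql-to-sqlite helper are assumptions for illustration, not the actual GlueX tooling.

```python
# Hedged sketch of a nightly calibration export, run from cron at midnight.
# The repository name and "mysql2sqlite_dump.sh" helper are hypothetical;
# the real conversion is done by the experiment's own calibration tooling.
import subprocess

REPO = "gluex.example.org"                 # hypothetical CVMFS repository
SQLITE_OUT = "/tmp/ccdb_latest.sqlite"

# 1. Dump the definitive MySQL database into a single SQLite file.
subprocess.run(["./mysql2sqlite_dump.sh", SQLITE_OUT], check=True)

# 2. Publish it on CVMFS so off-site jobs read it locally instead of
#    reaching back to the JLab database server.
subprocess.run(["cvmfs_server", "transaction", REPO], check=True)
subprocess.run(["cp", SQLITE_OUT, f"/cvmfs/{REPO}/calib/ccdb_latest.sqlite"], check=True)
subprocess.run(["cvmfs_server", "publish", REPO], check=True)
```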
A
Down here at the bottom is a graph from when we first finally got high throughput on ESnet from JLab to NERSC. It took a little bit of effort from our IT and networking people, working with the folks over at NERSC.
A
But it all finally finished. For processing, all our data goes to tape; we don't have enough disk space to store it all. We have to have a workflow system that pulls it off of tape, through our data transfer node to the NERSC data transfer node, onto Cori, and then brings all the resulting files back so we can store them on tape. So I never submit a job to Slurm directly.
A
I only submit to our workflow system, which then submits to Slurm once the file is there, ready to go. So it's a little complicated. I guess I can skip over this slide; it just shows that we have multi-threaded processing that scales.
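As an illustration of that pattern, a stripped-down version of such a workflow step might look like the sketch below; the file path and sbatch script name are made up, and the real system has far more bookkeeping.

```python
# Hedged sketch: submit to Slurm only after the input file has been staged
# from tape to the local file system. Paths and script names are hypothetical.
import subprocess
import time
from pathlib import Path

def submit_when_staged(input_file: Path, batch_script: str, poll_seconds: int = 60) -> str:
    """Wait for the staged file to appear, then hand it to sbatch."""
    while not input_file.exists():
        time.sleep(poll_seconds)            # file still in flight from tape/DTN
    result = subprocess.run(
        ["sbatch", batch_script, str(input_file)],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()            # e.g. "Submitted batch job 1234567"

job = submit_when_staged(Path("/scratch/staged/run042_000.evio"),   # hypothetical path
                         "recon_job.sbatch")
print(job)
```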
A
This is from last year: we made most of our jobs run through backfilling. This is just a statement, and maybe it's a little controversial, but the scheduler at NERSC is extremely poorly matched to our job shape. Only two jobs at a time accrue priority, all others must go in through backfill, and it treats large jobs the same as small ones: if I want 64 nodes for 48 hours, it treats that as one job, just like one node for three hours.
A
That makes it hard for us to compete if we're doing single jobs like this. We were able to do a lot with backfilling, though: last year (asterisk) we were able to get about a thousand jobs per day through when we were running smoothly on Cori, which was plenty for us, and we were pretty successful using it.
A
So I guess I should jump to my summary now, since I'm at the end of my time. We are running at NERSC with large experimental nuclear physics data, and the backfilling saved us, but the asterisk there is that this is really no longer true in 2020. I think this has to do with what Sudip may have said this morning, that they got about 10 percent more out of KNL, and that's to our detriment, because now they don't have big holes for us to go in and fill anymore, and so it's hard for us to get much throughput this year.
A
So we're doing things to try to adjust for that. But okay, that's all I have, and I've kind of run out of my eight minutes. I suppose I should jump over now to the next person who's supposed to talk, and I think that's Stephen. Yes? So go ahead and take it over.
C
Okay, hi. I'm Stephen Bailey, the data management lead for the Dark Energy Spectroscopic Instrument (DESI). We're making a 3D map of the universe using NERSC as our primary computing center. I'm going to be focusing on the computing part, not the science part, briefly describing what we do at NERSC, some challenges we've had, and some successes we've had. So first of all, just the basics of what we do at NERSC on a nightly basis.
C
So we can analyze it during the day, and then that informs the following night's observing plan. We repeat this nightly for five years, and that builds up a 3D map of around 50 million objects. It's hundreds of gigabytes per night, and we expect that over the next five years to grow to a scale of around 10 petabytes, using around 100 million hours per year over the next five years.
C
Sorry, a spam call coming in on my phone; shutting that off. So then, on a monthly or yearly timescale, we have reprocessing runs that use the latest tagged code, starting from the raw data.
C
This uses the same code as the nightly processing, but it has very different scaling needs, and this is the primary reason why we're working at NERSC. If we were just trying to keep up with the data with 10 nodes, we would just buy 10 nodes and be done with it; it's the fact that we sometimes need to do a burst of processing, years' worth of data as rapidly as possible, that drives us to wanting to use an HPC center. But we also benefit from the one-stop shopping of having our daily processing there too.
C
So, where we sit among the big, large-scale user projects: horizontally is allocation in millions of hours, vertically is storage in terabytes. We're not the largest allocation and we're not the most data, but along the diagonal we're in the top five for big data plus big computing.
C
I wanted to give a shout-out to Debbie for emphasizing that for a lot of these projects it's about much more than just flops and I/O bandwidth. That's very true for us. We use all the different queues, all the different I/O systems; we use the workflow nodes, we use Jupyter; we have Spin services, multiple different Spin services, cron jobs. So we're everything that Debbie said, yay.
C
One of the key challenges that we face is queueing with complex dependencies. This is showing a cartoon version of the processing needs for one night of data, where each box represents a task that needs to be computed.
C
Vertically, the size of the box represents the time needed, and horizontally the number of nodes. So we have some calibration data that's kind of big, and then it gets collated together in a small job, and then some big jobs and a small job, and then a bunch of small jobs, a bunch of big jobs, and a bunch of medium jobs, and then it ends with kind of a big one. That's one night's worth of data, and in the naive version, each of these boxes represents a job.
C
So our first attempt was bundling each step over about a week's worth of data, where we take a bunch of these small tasks and put them together into one job, take a bunch of these larger tasks and put them into another job, and chain them together.
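As a rough illustration of that bundling-and-chaining pattern (not DESI's actual pipeline code), the sketch below packs many small tasks into a single Slurm allocation and chains the next bundle with a job dependency; the script names and task commands are made up.

```python
# Hedged sketch: bundle many small independent tasks into one Slurm job and
# chain a follow-up bundle with a dependency. All names are hypothetical.
import subprocess

def submit(script: str, dependency: str = "") -> str:
    cmd = ["sbatch", "--parsable"]
    if dependency:
        cmd.append(f"--dependency=afterok:{dependency}")
    cmd.append(script)
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

# bundle_small.sbatch might loop over tasks with "srun -n 1 ... &" and "wait",
# so hundreds of short tasks share one allocation instead of one job each.
small_bundle = submit("bundle_small.sbatch")
large_bundle = submit("bundle_large.sbatch", dependency=small_bundle)
print("chained jobs:", small_bundle, large_bundle)
```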
C
This gave us big, HPC-like jobs; it's the most efficient packing in theory, and when it works, it works great. But it still requires hundreds of jobs, and, as David mentioned, only two of them are priority scheduled, and the remainder don't backfill very well. Job B doesn't start aging until job A is finished, and it's coupling otherwise completely independent tasks, which resulted in a lot of fragility: one rank can take down all ranks and mess things up.
C
So our next attempt was to reshape it a bit and pack them all together into one job, accepting the inefficiency that for portions of the time we're not using all the nodes. This gets us a faster end-to-end turnaround for a subset of the data, and it decouples the independent data; it matches well what we do throughout the night, but it's not going to scale up to five years' worth of data processing, so we're still working on how to do this.
C
The special interest group talks were helpful for learning our options, but it also boils down to the fact that only two priority-scheduled jobs is a big limit on experimental data processing, especially when we're running these on behalf of hundreds of users; other projects might even be doing it on behalf of thousands of users. I'm wondering whether experimental facilities should be advocating for getting more than just two slots, and it is somewhat ironic that an HPC center has scaling problems with its scheduler.
C
So, speaking to the NERSC folks on the line: investment here with SchedMD could help improve the effective use of future systems, if Slurm itself could scale up better. But I want to end on some positive stuff, so, successes. One thing that's worked well for us is testing at NERSC. We have a simple but effective nightly cron job that just does a git pull of all of our repos.
C
It runs the unit tests to confirm not just that it works on some Travis CI configuration, but that it actually works at NERSC (a sketch of the idea follows).
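A minimal sketch of that kind of nightly check is below; the repository paths and the use of pytest are assumptions for illustration, not the actual DESI scripts.

```python
#!/usr/bin/env python3
# Hedged sketch of a nightly "does it still work here?" check, meant to be
# run from cron on a login or workflow node. Paths and test runner are hypothetical.
import subprocess
from pathlib import Path

REPOS = [Path("/global/common/software/myproj/desispec"),   # hypothetical paths
         Path("/global/common/software/myproj/desiutil")]

for repo in REPOS:
    # Update the checkout that production actually uses.
    subprocess.run(["git", "-C", str(repo), "pull", "--ff-only"], check=True)
    # Run the unit tests in place, on the machine where production runs.
    subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo, check=True)
```

It would be driven by an ordinary crontab entry along the lines of `0 2 * * * python3 nightly_check.py`; the time of day is arbitrary.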
This is especially important after an upgrade or something, and it also runs a basic integration test. Quarterly, we have software releases where we use Jupyter notebooks to orchestrate the end-to-end integration; some of that is running on Jupyter itself, and some of that is spinning off batch jobs, waiting for them to finish, and coming back.
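For illustration, that "spin off a batch job and wait for it" step, driven from a notebook cell, might look roughly like the following; the sbatch script name is hypothetical and real release runs track many such jobs.

```python
# Hedged sketch: submit a batch job from a Jupyter notebook cell and poll
# until Slurm reports a terminal state. The sbatch script name is hypothetical.
import subprocess, time

def run_and_wait(script: str, poll_seconds: int = 60) -> str:
    jobid = subprocess.run(["sbatch", "--parsable", script],
                           check=True, capture_output=True, text=True).stdout.strip()
    while True:
        state = subprocess.run(
            ["sacct", "-j", jobid, "--format=State", "--noheader", "-X"],
            check=True, capture_output=True, text=True).stdout.strip()
        if state and state.split()[0] not in ("PENDING", "RUNNING", "REQUEUED"):
            return state.split()[0]          # e.g. COMPLETED, FAILED, TIMEOUT
        time.sleep(poll_seconds)

print(run_and_wait("integration_step.sbatch"))
```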
So a question to NERSC is: how will they be supporting continuous-integration testing on GPUs?
C
We'll definitely want some sort of equivalent of that once the GPUs are deployed. Something that's a success, but more of a work in progress, and an idea I wanted to seed: we should be investing as much effort into easy recovery from problems, not just avoiding problems in the first place, as the various experimental facility groups were describing when they gave their talks in the previous series.
C
That's still a thousand failures, which is more than a human can easily handle if the recovery requires custom hand work. So something we've come to realize is that we want to make that easy to recover from, not just put all of our effort into avoiding the problem in the first place. I also wanted to give a shout-out to NESAP, which has been really great: a single full-time postdoc, plus some part-time senior consulting, has resulted in huge speed-ups for us, so thanks to the NESAP team.
C
So, with my last bit of time: we're making a 3D map of the universe using NERSC as our primary computing center. It's that yearly reprocessing that drives the need for HPC, but we're also benefiting from the one-stop-shopping aspect. We have various challenges; I've covered a few, and there are things I've not covered here, but I just wanted to say that queueing isn't our only challenge. We're also having successes working at NERSC, and that's going well. And I met my eight minutes.
E
Okay, hello everybody. My name is Michael Poat, and today I'm going to be giving a talk about physics data production on HPC and our experience with efficiently running at scale. I'm working for the STAR experiment at RHIC.
E
So we went a different route, with a minimal-size container containing just the operating system, the base OS and some of our RPMs, and with CVMFS serving our software. Additionally, in the past we used to have one node that would serve our database on Cori while all the other worker nodes would run STAR tasks.
E
We have thus combined the MySQL service to run alongside the STAR tasks as well, so we can have everything packed in one container, and one node can do one job without having to rely on a head node or another worker node. As for CVMFS on Cori: there's a FUSE restriction on Cori, meaning that you cannot mount CVMFS natively.
E
NERSC does provide these DVS servers that forward the I/O for CVMFS, but they don't support metadata lookups. So we wanted to test this out: we did a throughput test with 15,000 tasks on 240 nodes, and if you look at this little plot here, the flat curve is a good sign, showing the number of events completed per minute.
E
Our workflow looks something like this: we launch a master script to the batch system, and each node that runs in the job will run our container and immediately launch two scripts, one for launching the database service and one for launching the STAR software script. Both of these scripts have sleep delays that create a load-spreading effect. For the database payload, each node is copying about a 25 GB database, and each node is loading the STAR software through CVMFS; having the time delays allows each node to not copy the same exact file at the same time. Once everything is up and running, we can then launch our parallel root4star tasks, and one thing to mention here is this startup portion (sketched below).
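A toy sketch of that staggered per-node startup, assuming the script is launched once per node via srun so that the Slurm-provided node index can offset the sleeps; the delay value, paths, and helper scripts are made up.

```python
# Hedged sketch: stagger per-node startup so all nodes don't hit the same
# database payload and CVMFS files at the same instant. Values are made up.
import os, subprocess, time

node_id = int(os.environ.get("SLURM_NODEID", "0"))
time.sleep(node_id * 5)          # e.g. a 5-second spread between nodes

# Copy the ~25 GB database payload for this node, then start MySQL locally.
subprocess.run(["cp", "/staging/star_db.tar", "/tmp/star_db.tar"], check=True)  # hypothetical paths
subprocess.run(["./start_local_mysql.sh"], check=True)                           # hypothetical helper

# Load the STAR software environment from CVMFS and launch the payload tasks.
subprocess.run(["./launch_root4star_tasks.sh"], check=True)                      # hypothetical helper
```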
E
That startup portion is what we consider our job start efficiency, which I'm going to talk about on the next slide. So really we're focusing on our efficiency on Cori, to maximize the number of events per second per dollar.
E
So, to define a few things: we have our job start efficiency, which accounts for the real time to copy the database, load the environment, the sleep delays, etc.; then our event efficiency, which is the CPU over real time of the STAR event data reconstruction tasks; and then the total efficiency, which covers from the Slurm job start to the last task finish.
E
What we found is that, first off, with having our database being served, we initially had one head node that would basically only run the database, serving say 10 other worker nodes, whereas now we're doing the one-to-one model, where each node serves itself. This really makes a big impact: with the one-to-one model our total efficiency is 99.3 percent, versus 89.44 percent for the 1-to-11 model, which basically dedicates an entire node to the database.
E
So it's better to self-serve the database. The job start efficiency is only a 0.05 percent loss, and this is over a 48-hour job, so the bigger the job, the higher the value. Same thing with the event efficiency: the bigger the job, the higher the value, and it's 98 to 99 percent. And since our tasks require about one gigabyte of memory per task, we can't use all the CPUs on a Haswell node or a KNL node.
E
So we found that it's best to focus on packing the best number of tasks and on how efficiently we can use the machine with the software that we have to run. So, just to wrap it up: for our containerization model, we find it's best to keep the containers to a minimum and leverage CVMFS to serve our software. For the database side, since the Cori compute nodes are on a private network, we have to run the database locally; we're able to copy our database payload to NERSC on demand.
E
We find it's best to launch the database and environment scripts in parallel, to get everything set up as fast as you can and start doing the event processing, although we did find we need to have our time delays implemented for CVMFS. Overall, for the efficiency, the job start efficiency and the idle CPU we see when the tasks finish have a really small impact, especially if we run over the whole 48 hours; really, the head-node model introduced our biggest inefficiency, because we were paying for that node to just run a database. And looking forward:
E
Our next step is to ensure graceful termination, so the idea is to use signal handling in case the tasks would run past the 48-hour limit.
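A minimal sketch of what that signal handling might look like, assuming Slurm is asked to deliver a warning signal shortly before the wall-clock limit (for example with an sbatch `--signal` option such as `--signal=USR1@600`); the shutdown logic shown is only a placeholder.

```python
# Hedged sketch: catch a warning signal shortly before the 48-hour limit and
# shut the task down cleanly instead of being killed mid-event.
import signal, sys

def handle_warning(signum, frame):
    # Placeholder for the real behaviour: finish the current event, flush
    # output files, stop the local database, then exit cleanly.
    print("wall-clock warning received, terminating gracefully", flush=True)
    sys.exit(0)

# Paired with an sbatch --signal option so Slurm sends SIGUSR1 before the limit.
signal.signal(signal.SIGUSR1, handle_warning)

# ... the event-processing loop would run here ...
```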
There is the potential use of the burst buffer for our database content, and our event service is coming soon; that will allow us to start new events and new tasks when one finishes. And that's really the summary of the whole talk. Thank you.
A
Great, thanks a lot, Michael. So I guess we should move on to Jeff. You're ready? You can take over the screen.
F
Great. So I'm going to take a slightly different approach: we're looking at the challenges of integrating NERSC resources into an existing distributed and automated data processing model that's been around for many years. ALICE is a heavy-ion experiment at the LHC; we have a history of working with NERSC at PDSF, and I'm just going to go quickly here.
F
When NERSC introduced Cori, we put in some effort to try and make use of the system, and this is just a kind of hodgepodge of different things that we were working on to leverage the system.
F
We did some benchmarking of the resources and built a system that could handle serial jobs but combined them into something that fit better into the way NERSC processes work. But in reality, four years later, it's mainly used by local groups for one-off tasks and remains an outlier in the ALICE system, and that's because ALICE has this very specific computing model. So the point here was to try and figure out:
F
How do we tie directly into the NERSC system with the ALICE computing model as it exists? So, just briefly, what is the ALICE computing model? It's a distributed facility, a grid facility of about 80 sites that act together as one facility. It's 120,000 or more serial jobs; it runs 24 by 365, all the time. It has a 110-petabyte distributed file system, and there's software that ties all the pieces together. It really is a facility: you can log in, you can do ls.
F
You can edit files, you can move files around; it does act like a facility. The way you can achieve something like this is that no site is very distinctly different from any other site. That's how this thing is able to glue the pieces together: if every site was different, there'd be a lot of manpower going into maintaining this facility as a unique facility. So every site runs the different job subtypes, Monte Carlo simulations and organized data analysis.
F
So what we look at is: how do you link a facility into the ALICE grid? I'm skipping some slides that were in the other talk, but there are a couple of requirements. One is at the node level, and for the most part, particularly since CVMFS has been set up, and using Shifter for the per-node cache, this is working really well for loading at the node level. There are some issues with swap, with not having swap, but that's a small issue.
F
It's not much. The facility level works pretty well too: we have access to a workflow node, which is one of the critical pieces, a single point of contact between this facility and the rest of the ALICE grid, and the local resource management system, Slurm, works fine for us. What is not working well for us is that we need the facility to be optimally configured for serial jobs, and we need long-term storage that is grid-enabled, that doesn't go year to year but goes for the long term. So we can look at how to address those without disrupting the ALICE computing model.
And that's the point here. Now, this is just a simple cartoon of what happens with the ALICE computing model, and if you consider how this works out, these are just serial jobs; they could be other people's jobs in here.
F
The local resource manager schedules the job, an agent is launched, and the agents are all the same. They build a wrapper, and that wrapper goes and gets the payload, and the payload is defined in the central services, not at NERSC. So these are independent; they don't interact with each other, and these are the pieces that operate this in the facility from the node level.
F
Now, one thing we did, since we want to leverage whole-node and multi-node scheduling: we figured out something called a job runner, which is a very thin layer that combines the resources of the entire job, many cores and many resources, and then acts as a broker for those resources. So now it's the job runner that manages the resources and launches the job agents, but the rest of it is pretty much the same.
F
The job wrapper still goes out and gets the payload and runs the job. This was actually initially funded from an LDRD together with physics, and Zac Marshall helped us put this together. So the good news is that we did this initial deployment. Just to give you some scale reference: at the top left, the normal ALICE grid is running 130,000 jobs.
F
The bottom left plot shows the two production facilities we run in the US, including LBNL; they're running around 5,000 jobs. The NERSC allocation, if we ran 24 by 365, would be around 700 or 800 jobs. So we were able to deploy this system and retain the full automatic workflow of the grid. We're getting only about 100 jobs, but we're able to maintain the late binding, the auto clean-up, and the resubmit on failures; we don't have to do anything special for failures, it's automatic, and it's usable for serial, whole-node, and even partial-node scheduling. So this is the good news.
F
The low resource utilization rate is something we're looking at now, and there are several things there; I think we discussed this during the actual talk. I think the main thing is what other people have already said, about only two jobs accruing priority for scheduling while the rest just backfill, and we're using 48-hour jobs. So what we need to do is look at reducing the time to see if the backfill will work.
F
Making big, wide jobs is probably not the right way for our model, just because we like things that run really steadily; you can see from these plots of jobs running that that's typically what we prefer. But this gives us something to work with, and we're continuing on that. The other piece is the storage, and how we manage the storage.
F
Some work was done, also through the LDRD with physics, to utilize the fact that we do have a large grid storage element at LBNL, nearby NERSC but in another facility, and we can use something called a proxy cache to access the data directly from that storage, and we see some marked improvements with that.
F
That's something we're working on in the future, to really optimize that. Just as a summary: the effort was analysis and development, and Cori was a target use case, but it was also for ALICE's future, as we're getting into multi-core simulations and other HPC facilities really requiring whole-node and multi-node scheduling.
F
So this is what we're trying to connect in, without disrupting the ALICE workflow, and we've already seen some benefits. There's another computer at LBL, Lawrencium, that has whole-node scheduling requirements but allows opportunistic utilization; we didn't do anything special, we just turned it on, and it's running fine on Lawrencium. So this activity helps us use both NERSC and other sites as well.
D
All right, I'm speaking now, so if you can't hear me, let me know. We can hear you. Good.
D
So, what is our facility? If the slide will advance: we're LCLS at the SLAC National Accelerator Laboratory, a big, long linear accelerator where we create short, intense bursts of X-rays for doing photon science.
D
We operate 24 hours a day. Currently we send down these short bursts of X-rays 120 times a second, but next year we're supposed to go up to a million times a second, and that is what drives our increased interest in NERSC and other facilities in the US.
D
One of the big examples is that we do nanocrystallography, coming up with structures, and the experiment which we just turned on yesterday, for the first time in 18 months, was imaging COVID-related samples, trying to see which amino acids...
D
So, what about the real-time nature of what we're doing? This is a billion-dollar facility, and it runs 24 hours a day, seven days a week. Currently we generate about two gigabytes a second of data, and we're going to go up to 20 gigabytes a second next year, and that's a challenging data volume, 20 gigabytes per second. And that's just for starters; it's supposed to go up after that.
D
We get about 200 gigabytes per second coming off the detectors, but we reduce it by a factor of 10 in real time. And here's the key point, in green: things change all the time at LCLS; they're really kind of flying by wire.
D
So we need real-time feedback to steer the experiments, and the experiments change dramatically multiple times per week, so we have to be able to adapt very quickly to changing requirements. This real-time data analysis feedback is critical for running these experiments, so we have about one second of latency for our in-hutch analysis; this is done before the data even touches a disk.
D
We multicast the data currently, and we get it over InfiniBand before it hits the disk, so we can get one-second latency. That's not what I'm going to talk about here, because we're not expecting NERSC to provide one-second latency. What we're trying to get from NERSC is a few minutes of latency from disk. So this is what I'm going to talk about today: getting this few-minute latency.
D
So we've been looking into this, with help from Debbie and David Skinner and other people at NERSC, at the possibilities for getting a few minutes of latency. Reservations are a big one, but they're kind of inflexible.
D
The way that it's been described to me, this is like oversold first-class seats on airplanes, and if you're fortunate enough to get one of those first-class seats, you can take advantage of this pool.
D
Then there's this intriguing thing, the so-called flex queue, for jobs that can checkpoint. Density functional theory codes, I think, are the big example: codes like VASP and Quantum ESPRESSO will write out their wave functions every once in a while, so the jobs can be killed, and they get a discount.
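For reference, a checkpoint-friendly, flex-style submission might look roughly like the sketch below; the QOS name, minimum time, and script contents reflect my understanding of the documented flex queue and should be checked against current NERSC policy, and the application name is hypothetical.

```python
# Hedged sketch: submitting a checkpointable job to a flex-style QOS, where
# Slurm may grant anywhere between --time-min and --time of walltime.
import subprocess, textwrap

batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --qos=flex
    #SBATCH --constraint=knl
    #SBATCH --time=48:00:00
    #SBATCH --time-min=02:00:00
    # The application must checkpoint periodically and restart from the
    # latest checkpoint, since the allocation may end early.
    srun ./my_checkpointing_app --restart-from latest   # hypothetical app
""")

with open("flex_job.sbatch", "w") as f:
    f.write(batch_script)
subprocess.run(["sbatch", "flex_job.sbatch"], check=True)
```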
D
So this is sort of starting to feel a little bit like what we would want: we would want to be able to preempt these jobs. And then there's this effort, as I understand it, with DMTCP, with Zhengji Zhao, to make all jobs preemptable, but in the user domain, so you don't have to go do weird things inside the kernel; that's my naive understanding of this effort. But it does require the user jobs to do some work to become preemptable.
D
Okay, so the summary: of the options for us, the real-time quality of service is inefficient, so it's not going to be an option for us.
D
The preemption method that we use here at SLAC won't work at NERSC, so the flex queue is the closest to what we need, and it receives a discount. We've been talking with David Skinner, and my understanding is that NERSC has agreed to expand this flex queue idea and somehow provide us with a mechanism to preempt these preemptable jobs, so that we can get our few-minutes turnaround time and give real-time feedback to the experiments.
A
All right, so I'm not sure if Bryce is here; I'm trying to look for him in the list and I don't see his name showing up, so maybe he was unable to make it today. In which case, I guess what I should do is probably open this up.
A
There have been a few messages going through the chat window, but if anybody has any questions they wanted to bring up here and ask any of the speakers, or any of the NERSC folks who are on, this would be a great time.
H
Hey David, this is Katie Antypas. Thank you again for organizing this; this was really helpful, especially because I wasn't able to attend all the weekly sessions. I had a comment, and then I just wanted to encourage folks on one other item. So first: I guess I continue to hear about job throughput issues, and I thought we had some solutions that could work for people, that were helpful in bundling jobs.
H
At the same time, it's true that our scheduler, and I would say any HPC scheduler, will just get knocked over when there are millions of jobs individually going through, and so we have to find some way to meet in the middle, even if that means providing more assistance for people to change their workflows. I saw something in the comments that Shane is working on a Condor option.
H
The second comment I wanted to make is that the ERCAP season is coming up, and so I wanted to encourage all of you to make sure you know about the Community File System. The Community File System replaced /project; it's about 10 times bigger, about 75 petabytes, and the storage is actually allocated and approved by your program manager. So I would encourage you not to be shy in saying what you need: if you need 30 petabytes of storage, put in 30 petabytes. We don't want you to shrink your ask based on what you think we have, because we really need to know what your workflow needs.
A
Maybe just as a quick follow-up on that: for our particular workflow, we don't need the space for very much time, other than when we're trying to run it through. So I guess I haven't asked for really large disk space in the request before, because I thought, well, the file only needs to be there until the job runs and we get the results back, so we don't really need it as a year-round quota.
A
Is there anything in between, or any option for getting scratch space that exists for a temporary amount of time?
H
If you get a large quota but you don't use it, you're not hurting NERSC users; it's only if you're using that space and you never delete anything. So if you have a... I don't know if I dropped out, did that make sense? Yeah, that made sense. So if you have a 100-terabyte quota and you only use it once in a while, on average you're not taking space away from people, and we can apply an oversubscription factor.
G
But David, are there aspects of how scratch works that are the issue? Do you just need more space than you typically get on scratch, or what is it?
A
It's not that scratch doesn't work. We asked for a special allotment that got us up to 60 terabytes, and that number actually came from our bandwidth limit; it all gets tied together. How fast can we transfer data to NERSC, and then how many nodes would we be able to feed in a steady state? So how much disk space we need really depends on that bandwidth and on how many nodes we can expect to have at any point in time.
G
I'll just chime in; I mentioned this in the chat window, but yeah, I'm a NERSC staff member, but I also work on a couple of other projects, and one of those is NMDC.
G
I don't know what Bryce was going to... it was Bryce who was supposed to talk, or was it?
G
Yeah, and I know that they're using something called Cromwell. It's a way you can encapsulate your workflows using a sort of standard description language, and then there's a tool called Cromwell that can take those in and run them, and so he was probably going to talk about that.
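For readers unfamiliar with it, the pattern being described is roughly: write the workflow in WDL (Workflow Description Language) and hand it to the Cromwell engine to execute. A minimal, purely illustrative example, with a made-up workflow and an assumed local copy of cromwell.jar:

```python
# Hedged sketch: a trivial WDL workflow handed to Cromwell for execution.
# The WDL content and the path to cromwell.jar are illustrative only.
import pathlib, subprocess, textwrap

wdl = textwrap.dedent("""\
    version 1.0
    workflow hello {
      call say_hello
    }
    task say_hello {
      command { echo "hello from a Cromwell-managed task" }
      output { String out = read_string(stdout()) }
    }
""")
pathlib.Path("hello.wdl").write_text(wdl)

# Cromwell "run" mode executes a single workflow locally; in production it is
# usually run in server mode with a backend that submits tasks to Slurm.
subprocess.run(["java", "-jar", "cromwell.jar", "run", "hello.wdl"], check=True)
```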
G
We're using this for NMDC as well, which is the National Microbiome Data Collaborative, and I've hit some of the same issues that were brought up here by others. So even me, as a NERSC staff member, I see exactly the kinds of things that you're mentioning. For these particular workflows, what makes them challenging is that there's a kind of iterative aspect, where it'll do some work and then it will submit jobs.
G
Now, if we submitted to the real-time queue, that would probably mostly address some of these things. But the way that I've worked around this for NMDC, and JGI has a similar approach with a different piece of software, is that there's some intermediate scheduler: in my case I'm using Condor, and in their case JGI is using something they developed internally called JTM, which uses a RabbitMQ message bus. It's the same kind of thing, though: there's this intermediate queue, and then you submit jobs that basically pull work off of that, and that's not too different from how some of the HEP projects work as well.
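To make that pull model concrete, here is a toy illustration, not Condor, JTM, or any production system: worker processes launched inside Slurm jobs repeatedly claim task files from a shared directory, so the number of Slurm jobs stays small no matter how many tasks flow through. Directory layout and task format are made up.

```python
# Hedged toy sketch of the "intermediate queue" pull model: a worker started
# inside a Slurm allocation claims tasks from a shared directory until the
# queue is empty.
import subprocess, time
from pathlib import Path

QUEUE = Path("/global/cfs/myproj/taskqueue")        # hypothetical shared directory
CLAIMED = QUEUE / "claimed"
CLAIMED.mkdir(parents=True, exist_ok=True)

def claim_next_task():
    """Atomically claim one pending task file, or return None if none remain."""
    for task in sorted(QUEUE.glob("*.task")):
        target = CLAIMED / task.name
        try:
            task.rename(target)                      # atomic on the same file system
            return target
        except FileNotFoundError:
            continue                                 # another worker claimed it first
    return None

while True:
    task = claim_next_task()
    if task is None:
        break                                        # queue drained; let the Slurm job end
    # Each .task file is assumed to hold a single shell command to execute.
    subprocess.run(["bash", "-c", task.read_text().strip()], check=False)
```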
So I do think that, long term, we need to work with Slurm to figure out ways to have it deal with these things more effectively and directly, and I don't know exactly how that would be done.
G
Potentially some kind of more hierarchical method might work, so that you could have the idea of a subset of nodes that another scheduler can just focus on, and that might relieve some of the scaling issues that we have to wrestle with.
G
Another thing I think Slurm needs to deal with is really the idea of scheduling a workflow versus just scheduling tasks with a bunch of requirements: how do you treat a bundle of work as something that ages together, so that even if you don't know it all up front, it can still be scheduled effectively?
A
Great, very good points. So one thing I want to make sure I get in, full confession here: this time, when we started our most recent campaign, we found that we weren't able to backfill like we did last year, which had worked fairly well for us. I didn't hear exactly what the improvements were, but I guess Sudip was saying this morning that improvements on Cori KNL got utilization from the mid-80s up to the mid-90s.
A
I'm going to guess that maybe that's one of the reasons why I can't get anything in anymore. And what I've done has actually made the problem worse, because in trying to fit into smaller holes I've made my jobs smaller, so now I have 10,000 jobs at once that I'm submitting, which probably just puts an even larger burden on your scheduler.
H
Well, I think we should have someone follow up with you. I did notice Bryce just said he was here.
A
All right, sorry about that. We only have about three minutes left here, but I'll tell you what: why don't you go ahead and give your talk, and anybody who wants to stay on is welcome to; I'll certainly stay on and listen to it. If you feel like you have to duck out to the plenary session, then go ahead.
B
Is a slide showing up? Yes, I can see it. All right. So thank you, everybody, for inviting me to talk today about JGI and the pipelines that we have. For those of you who aren't familiar with JGI, we're a high-throughput sequencing facility that does DNA sequencing for researchers around the world, and we're also looking at metabolomics and other analyses alongside these DNA sequencers.
B
We have these sequencers at the lab that are producing tens of terabytes of data every couple of weeks, and then we're processing it through multiple pipelines, constantly, through NERSC. You can see the automation here, all these little outputs going into different boxes for our collaborators who are helping out.
B
So last year we had almost 2,000 users actively doing projects at JGI. We have about 16,000 different active projects, and we received about 24,000 different DNA samples over the year for 2019. We have almost 100,000 pipeline runs, and that's for RQC only; I'm the group lead for RQC, doing all these pipeline runs across 30 different pipelines. There are a few other groups at JGI who also run a number of pipelines, and we all use NERSC pretty heavily for that.
B
Here you can see the plots of our growth over time; this is just since 2013. It's fairly linear, but it's going up. Of course, 2020 is a little bit different for everybody, but these things go up because every few years there's a new sequencing technology that comes out that does things cheaper and faster and produces more data for us, and so we're able to accommodate more products over time.
B
Our compute usage over time, you can see, grew and then shrank, and really that's a sign that our product mix is changing a bit; we used to have products which required a lot heavier compute. One other thing that we did: there is something called BLAST, which is a way of aligning the sequences off the sequencer against a big database of samples to try to identify what they are.
B
That was a huge compute sink for us, and so we were able to replace it with something much faster and better that gave us essentially the same output. And of course, over time we're also looking at ways of replacing older tools with newer ones, like BLAST as I mentioned, to make things better for us. I'm seeing chats come up, but I'm not actually following them.
B
Okay, all right. For the pipelines that we run: for example, we run almost all the pipelines on the Cori Genepool partition, and we do this to meet our cycle time requirements.
B
The pipelines that we run are high-memory pipelines, because they're loading all the sequence data into memory; they can take anywhere from 16 gigabytes to three terabytes of memory, depending on the pipeline. It's really heavy I/O: all these files coming off of the sequencer servers are pretty heavy files, even when they're compressed, and we have a lot of variability in the runtime; it can be five minutes on some pipelines
B
So
more
than
seven
days
for
other
things,
you
can
see
here
on
the
box
plot
on
the
right,
even
for
some
of
the
same
pipelines,
there's
still
a
lot
of
variability
in
the
runtimes,
and
this
is
really
product
dependent,
not
even
product
pen.
It's
sample
depends.
Some
samples
are
a
lot
more
complex
than
others
and
take
a
lot
more
resources
to
run.
B
Also, we've purchased some nodes in Cori that are 1.5-terabyte memory nodes that we use to run some of our special pipelines, and those have slightly different characteristics, but yeah, we're using Cori pretty heavily for all of this.
B
So, working with NERSC: I think, like everybody, we've had occasional challenges. Like last year, the Cray upgrade caused some disruptions for our product cycle time at JGI, because all sorts of things started failing or running slower.
B
But we've worked with NERSC, and NERSC has agreed to help us by running some ReFrame tests whenever they're going to do one of these upgrades, so that we can potentially get ahead of some of these problems.
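For readers who haven't used it, ReFrame is a Python-based regression-testing framework for HPC systems; a minimal check along these lines might look like the sketch below, where the smoke-test script and its "PIPELINE OK" marker are hypothetical stand-ins for a real JGI pipeline check.

```python
# Hedged sketch of a ReFrame regression test that could run before and after
# a system upgrade to catch breakage early. Script and marker are hypothetical.
import reframe as rfm
import reframe.utility.sanity as sn

@rfm.simple_test
class PipelineSmokeTest(rfm.RunOnlyRegressionTest):
    def __init__(self):
        self.valid_systems = ['*']
        self.valid_prog_environs = ['builtin']
        self.executable = './run_pipeline_smoke.sh'
        self.time_limit = '30m'
        # Pass only if the smoke run prints its success marker.
        self.sanity_patterns = sn.assert_found(r'PIPELINE OK', self.stdout)
```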
Over time, it also seems like the Cori file system performance is unstable.
B
When DVS goes down, or DVS is slow, it does take us time to go into the Python and say: okay, what went wrong here? Was it actually a problem with the data, a problem with the Python, or was it something with the Cori file system? And a lot of times it seems to be more of a Cori file system issue.
B
As for Perlmutter GPUs: we did have a hackathon, I believe it was May 2019, where we worked at taking some bioinformatics code and trying to port it over to GPUs to see what kind of performance increase we would get. You can see here on the lower left a kind of timing slide, and the green here is essentially the NVIDIA GPU results; you can see they're not better than running on CPUs.
B
Just by turning on the optimization flag for C that isn't on by default, we already got a huge improvement without doing much to the code anyway. It was interesting that a lot of the bioinformatics software people tried to port over at the hackathon didn't see a huge amount of GPU acceleration.
B
That's a really nice feature that we have. And we have a program called MetaHipMer that several JGI staff are working on; it's an assembler that does a huge assembly, taking three terabytes of memory and multiple nodes to take a huge amount of data, assemble it, and get an assembly out of it. We've been able to get that running, somewhat, on the Cori cluster, and there's actually a paper published on that in nature.com.
B
So that's really the last of my slides; I was working through them quickly because I know I didn't have much time. But questions, or other things I can answer for people?
A
Well, you're very welcome. I guess I should also very much thank all the people at NERSC; Katie was the first person who initially suggested this, so I have to thank her for that, and for all the support we've gotten from the NERSC staff on this whole thing.
A
So I'll just take the last word here and say: if you know of any other projects, or any other person who may be working on something at NERSC that is relevant to this, please let me know; you can send me their name. I can prod them, anonymously or not, to come give a talk to us, and not bring your name into it if you want. But it would be good to get a few more talks together.