From YouTube: 08 Workflows at NERSC
Description
Part of the NERSC New User Training on June 16, 2020.
Please see https://www.nersc.gov/users/training/events/new-user-training-june-16-2020/ for the training day agenda and presentation slides.
A
So, by the end of the 20 minutes, I'm hoping that everyone here will understand what we mean when we use the words workflows and workflow management, and whether the problems that you work on count as workflows as far as these tools and resources are concerned. You'll also understand all of the resources that NERSC provides, both in infrastructure and in consulting and support, to help with workflows, and I'm going to show off a handful of examples using these kinds of tools.
A
So the big question, and there's been a lot of uncertainty, is: what's the definition of workflows? What do we mean when we say this? Because lots of people use the word workflows and they're talking about kind of different things. For us, a workflow is a problem that is best solved by inserting automation between user action and the interfaces that come into our systems, our computation and storage.
A
The resources that we're talking about are the actual compute and storage components. So these are the compute nodes that are scheduled by Slurm, our storage space, our network bandwidth and data transfer capabilities, and even our identity management systems. Workflow management tools is the specific term for software systems that do that kind of automation. So here are some general examples to try and clarify what counts as a workflow when we're talking about these tools.
A
If you need to run your application thousands of times: you may have hundreds, thousands, or tens of thousands of individual data sets, or you may be doing something like a very large Monte Carlo simulation, where everything is a little independent simulation and sampling, and in the end you want to combine them all together to do a statistical analysis.
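Editor's sketch of the "many independent runs, then combine" pattern described above. It is only an illustration: the function names `simulate` and `combine`, the choice of estimating pi, and the run counts are all assumptions and do not come from the talk.

```python
import math
import random
import statistics


def simulate(seed: int, samples: int = 1000) -> float:
    """One independent Monte Carlo run: estimate pi from random points."""
    rng = random.Random(seed)  # a private generator keeps the runs independent
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples


def combine(estimates: list[float]) -> tuple[float, float]:
    """Combine the independent runs into a mean and a standard error."""
    mean = statistics.fmean(estimates)
    stderr = statistics.stdev(estimates) / math.sqrt(len(estimates))
    return mean, stderr


if __name__ == "__main__":
    # In a real workflow each simulate() call would be its own task or job.
    results = [simulate(seed) for seed in range(20)]
    mean, stderr = combine(results)
    print(f"pi is about {mean:.3f} +/- {stderr:.3f}")
```

The point of the sketch is the shape of the work: the runs share nothing, so a workflow tool can schedule them in any order, and only the final combine step needs all of the results.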
A
Your data processing happens in several stages. Maybe you've compiled multiple applications that act in a chain, where the output from one is the input for the next, and it proceeds in a line all the way through your processing. It would be tedious to do each of those in its own job and type all of that out yourself, but many of these workflow management tools can take a description of that chain and run all of the applications for you from beginning to end.
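Editor's sketch of the chained-stages idea described above, assuming each stage is a command that takes an input path and an output path as its last two arguments. That calling convention, the function name `run_chain`, and the `stage_N.out` file names are hypothetical; real workflow tools each have their own way to describe a chain.

```python
import subprocess
import sys
from pathlib import Path


def run_chain(stages: list[list[str]], first_input: Path, workdir: Path) -> Path:
    """Run each stage in order; each stage's output file feeds the next one.

    Every stage is a command that receives two extra arguments:
    the input path and the output path (a convention assumed here).
    """
    current = first_input
    for index, command in enumerate(stages):
        output = workdir / f"stage_{index}.out"
        # check=True stops the chain as soon as one stage fails.
        subprocess.run([*command, str(current), str(output)], check=True)
        current = output
    return current


if __name__ == "__main__":
    import tempfile

    work = Path(tempfile.mkdtemp())
    source = work / "input.txt"
    source.write_text("abc")
    # Two toy stages written as one-line Python programs.
    upper = [sys.executable, "-c",
             "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"]
    double = [sys.executable, "-c",
              "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read() * 2)"]
    print(run_chain([upper, double], source, work).read_text())
```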
A
Maybe your application has a small chance of crashing or having some sort of failure, and it needs to be rerun. It wouldn't be great if a human needed to be sitting there watching it, looking for crashes and then manually restarting the ones that need to be restarted. So some of these tools have the ability to detect an exit code that's nonzero and automatically rerun the task.
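Editor's sketch of the "detect a nonzero exit code and rerun" behavior described above. The function name `run_with_retries` and the default of three attempts are assumptions for illustration; the workflow tools discussed in the talk implement this in their own ways.

```python
import subprocess
import sys


def run_with_retries(command: list[str], max_attempts: int = 3) -> int:
    """Run a command, rerunning it while it exits with a nonzero code.

    Returns the number of attempts used; raises if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with code {result.returncode}",
              file=sys.stderr)
    raise RuntimeError(f"{command!r} failed {max_attempts} times")
```

Capping the number of attempts matters: a task that fails for a deterministic reason would otherwise be rerun forever.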
Maybe you want to run your application every month, or every two months, or on some regular period.
A
So those are the problems that we're looking at. What kind of resources are we providing to help with this? As far as support, there is specialized infrastructure; some of it has been mentioned already. We provide some software tools that can do this stuff, and we've got some specialized support going towards these particular problems. There is a workflows working group at NERSC. It's been operating since September 2019, and there are three members: Laurie Stephey, Bjorn Enders, and myself.
A
In that time we have been thoroughly evaluating many workflow management tools. The total space is more than 300 tools, so this is a pretty involved task, but it's also very important for us to do, because when users come to us and say, this is my problem, what should I be doing to fix it, we want to be able to narrow down those three hundred choices to two or three choices and save a lot of that mental overload of having too many things and not being able to find the right one.
A
A big part of this is that we are refreshing and improving the documentation and advice on the NERSC documentation site regarding workflows. So if you navigate to Running Jobs, you'll see a section on workflow tools. That is where our work is being placed as we create it. So each time we finish evaluating a new tool, we'll put up a blurb about it: what it's good for, what it's not good for, and how to get started using it.
A
There's also a big outreach component here: me speaking with you today, providing information about using workflows at NERSC. We talk to users about what they're doing. We talk to experimental facilities about what their pipelines and researchers need. We talk to tool developers that make workflow tools, and we also talk to other major computational facilities to try and share best practices about how to support workflow tools. So, the documentation and guidance.
A
Here's the URL to that. Someone could maybe take that link and put it in the living document off to the side; that'd be great. Again, it's a work in progress. We are expanding and refining as we find new tools. Sometimes we'll evaluate a new tool that is much better than everything before it, and we'll have to adjust everything accordingly. It's got the details. There are very clear examples and suggestions, as in: if your work looks like this, then start by trying tool X or Y. We also want to get tickets about workflow management tools.
A
That's the main source of information about how we learn what users need, and that's a main way that we can share the experience that we've gained about workflow management tools with users that are, say, starting a brand new project. They haven't implemented anything yet. Of course, you want to start by working with a tool that you know works at NERSC, and you know works well.
A
Infrastructure resources: well, the Cori workflow nodes were mentioned earlier. There are two login nodes, not in the load balancer that people SSH in through, that are reserved specifically for things like workflow management tools. The environment is the same as on the login nodes, but access is limited to people that have been approved, and heavy compute is not allowed, only lightweight management processes.
A
On the workflow nodes, say that you want to run your workflow once a month. Then you'll SSH into the workflow node and put your crontab there. There's a caveat that the uptime for these nodes is the same as for the Cori login nodes, so they're also subject to Cori maintenances or unexpected downtimes. So you can't put a tool running on one of these and expect it to stay there indefinitely.
A
You may need to go in and check it and restart it after maintenances or after a downtime. And the way to gain access is to submit a request to help.nersc.gov asking for workflow node access. You'll need to describe what you expect to do, and you'll need to describe the computational resources it's going to need.
A
It is hundreds of jobs or thousands of jobs, and what ends up happening when you use the shared QoS is that each of those jobs is held back by the limit of two jobs gaining priority at a time, so the actual throughput of all of them is pretty low if you have a lot of them. But if you use GNU parallel to pack lots of shared jobs into a single job submission that uses maybe one entire node, or maybe multiple nodes (there are also recipes to do that), then you will only wait in the queue once.
A
So another thing that's a big benefit of packing lots of small jobs into single, wider jobs, or jobs with more nodes, is that it's much less burden on the Slurm controller. The Slurm controller controls resource allocation for everything on Cori, NERSC-wide. So if somebody submits too many sruns in a loop, like we saw an example of earlier, or a hundred thousand, or even five thousand, jobs at a time, all of those commands go to the Slurm controller and can cause it to pause for every user.
A
Another advantage of parallel is that it can take care of running combinations of smaller tasks on a node both in parallel and in sequence. You can tell it, I want to run two jobs at a time, with this -j flag. But if you give it ten jobs total, or five jobs total, what it's going to do is run two at a time, and then when those are done it'll run the next two, and then whatever is left.
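Editor's sketch of the "two at a time until the list is empty" behavior described above. This is a Python analogy using a worker pool, not GNU parallel itself; `run_task`, `run_packed`, and the task names are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(name: str) -> str:
    """Stand-in for one small task (a real one would launch an application)."""
    time.sleep(0.05)
    return f"{name} done"


def run_packed(tasks: list[str], slots: int = 2) -> list[str]:
    """Run all tasks, never more than `slots` of them at once."""
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # map() returns results in input order even though tasks overlap in time.
        return list(pool.map(run_task, tasks))


if __name__ == "__main__":
    # Five tasks with two slots: two run, then two more, then the last one.
    for line in run_packed(["t1", "t2", "t3", "t4", "t5"], slots=2):
        print(line)
```

In this sketch a waiting task starts as soon as any slot frees up, so with tasks of uneven length the pairs do not stay in lockstep.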
A
So this makes it very straightforward to pack lots of single-core jobs, or single-core tasks, into one full-node job. Input substitution is very easy. These two curly braces here, this is where it substitutes whatever input file went in. But that's just a very simple substitution. There are many, many more options for, say, doing combinations of multiple files and that sort of thing. If you need the power, it's available, but you don't need it out of the gate.
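Editor's sketch of the simple input substitution described above: every `{}` placeholder in a command template is replaced by one input name. This is a Python illustration of the idea, not GNU parallel's implementation; `build_commands` and the `./analyze` command are hypothetical.

```python
import shlex


def build_commands(template: str, inputs: list[str]) -> list[list[str]]:
    """Replace every `{}` in the template with each input in turn."""
    commands = []
    for item in inputs:
        # Quote the input so names with spaces stay a single argument.
        filled = template.replace("{}", shlex.quote(item))
        commands.append(shlex.split(filled))
    return commands


if __name__ == "__main__":
    for command in build_commands("./analyze --input {}", ["a.dat", "b.dat"]):
        print(command)
```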
A
But if you turn a very large task array into a single job on multiple nodes using GNU parallel, then you're only going to wait in the queue once. I don't have time to go through a lot of the detail here, but go to that documentation that I listed earlier, and there are examples and instructions for using GNU parallel to pack work.
A
The burst buffer is very, very good at I/O operations capacity, and that is one of the things that Lustre was not great at. Cori scratch is designed to deliver very, very large amounts of bandwidth to one very large MPI job, but a lot of computing, or throughput computing, is many, many individual tasks each doing their own thing, and that drives a lot of filesystem metadata, which the Lustre metadata controllers are not great at.
A
There are things that you need to worry about if you're going to use a tool like this at NERSC. You don't just pick it up blindly and start running it. A lot of these tools expect sort of a cloud level of availability of resources. They don't expect to wait in the queue for a number of days, and that can lead to much lower performance. They're not very good at navigating policy and QoS structures like we have; they expect to be able to get resources immediately.
A
Another problem is that some of these tools have naive Slurm integration that does a lot of squeue commands to see what the state of their jobs in the queue is, and if it does too many too fast, then that will also slow down the controller for everyone using Cori.
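Editor's sketch of a gentler way to poll job state than the naive integration described above: one squeue call covers all of a user's jobs, and the loop sleeps between calls. The `--noheader`, `--format`, and `--user` options are standard squeue options, but the function names and the five-minute interval are assumptions for illustration, not NERSC policy.

```python
import getpass
import subprocess
import time


def parse_states(squeue_text: str) -> dict[str, str]:
    """Turn lines of '<job id> <state>' into a dictionary."""
    states = {}
    for line in squeue_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            states[parts[0]] = parts[1]
    return states


def query_states(job_ids: list[str]) -> dict[str, str]:
    """Ask Slurm about all of the jobs with ONE squeue call, not one per job."""
    result = subprocess.run(
        ["squeue", "--noheader", "--format=%i %T", "--user", getpass.getuser()],
        capture_output=True, text=True, check=True,
    )
    wanted = set(job_ids)
    return {job: state for job, state in parse_states(result.stdout).items()
            if job in wanted}


def poll_until_done(job_ids, interval_seconds=300, query=query_states):
    """Poll at a fixed, slow interval until none of the jobs is in the queue."""
    while True:
        states = query(job_ids)
        if not states:  # finished jobs drop out of the squeue listing
            return
        time.sleep(interval_seconds)
```

The `query` parameter exists so the loop can be exercised without a Slurm installation; by default it calls squeue.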
There are also some risks as far as using these tools with network file systems. They expect certain file system features to be available that aren't available in all of our file systems, like some locks, and there are also some synchronization things that they may expect to be faster than actually happens. So that's something to look out for.
Right. Oh, good timing. All right! So, hopefully you have a good sense now of what we mean when we're talking about workflow management tools, and what we can do to help you if your problem looks like it needs a workflow management tool.