From YouTube: 05 - Running Jobs
Description
Part of the NERSC New User Training on September 28, 2022.
Please see https://www.nersc.gov/users/training/events/new-user-training-sept2022/ for the training day agenda and presentation slides.
Hello, everyone — can you see my slides changing? Yes? Okay, perfect. Hello, everyone, my name is Mars and I'll be talking about how to run jobs on Perlmutter.
Perlmutter, just like any other large-scale system, needs to use a job scheduler, and here we use Slurm, which is an open-source tool — the very same thing that we had on Cori. Slurm takes care of three key responsibilities: it allocates the requested resources in an efficient manner, it executes and monitors jobs, and it also manages the queue of submitted jobs.
Any jobs that are submitted by users are first sent to a queue, and that queue is managed based on a certain priority that is associated with each job. It's then the duty of the scheduler to schedule the jobs as it deems fit. Perlmutter, as Rebecca described in the morning, is a heterogeneous system and consists of two different types of partitions: the CPU nodes and the GPU nodes.
When you initially log into the system you're placed on a login node, which is basically something in between — it contains one GPU and one CPU, and it is a shared node. The use case for a login node would be just simple text editing, batch script writing, and submitting jobs. It's not advised that you run any job on the login nodes, because they are a shared resource, and if you're making extensive use of these nodes, other users will feel the system slowing down. So, as a good citizen of NERSC, it's recommended that if you want to run a job, you request a compute node for that.
So let's have a quick look at the node types. On the right I have an overview of the GPU nodes. Each GPU node consists of one AMD Milan CPU, which has 64 hardware cores for a total of 128 hardware threads, and it also has four Ampere A100 GPUs. The total memory available on a GPU node is 256 GB, and each Ampere GPU contains 40 GB of HBM. The CPU nodes, as shown on the left, contain two AMD Milan CPUs — the very same CPU that is present on the GPU node — so the total number of hardware cores available on a CPU node is twice that of the GPU node. The amount of memory available on these nodes is also two times: that's 512 GB of memory on the CPU nodes.
Before we get to how to run jobs, here is a note of advice: be mindful that Perlmutter is being used by 7000-plus NERSC users, and since it's a shared resource, it's important to be courteous, be mindful, and try to follow the protocols and good practices that are conveyed to you. For example, a good thing would be, before you schedule a job, to try to classify it — try to determine what type of job it's going to be.
If you're trying to debug something, then you won't really be needing a lot of resources for a long time, so for that we have a quality of service that we offer through a debug queue, and you can refer to that. If you realize that you want an interactive node, where you want to do something in real time — like building a complex code which requires a lot of threads to build, and then you want to do a quick run to check if it's running fine — then obviously you would want to go with the interactive node, and we also provide that quality of service through the interactive QOS. If you decide you want to run a large production job, then you go to the relevant queue, and if you have a very long-running job that has a built-in capability for checkpoint and restart, then you would go through the preempt queue. I will go through the details of how to use these queues later on.
But first it's important to understand the type of job that you're about to run, the type of resources that you need, and for how long. This will improve turnaround times for you, as well as help other users use the system in a better way.
Jobs can be submitted in two ways. One is using the sbatch command: you basically write a batch script and you submit it through sbatch. The other option is salloc, which is basically a command-line option.
It's recommended that you use a batch script; that way it's a kind of reusable resource, so if you have multiple jobs, you can just make a slight change in the batch script and then reuse it. salloc is usually recommended if you're trying to get an interactive node. At the bottom here is an example of using salloc where we are requesting one node of CPU type through the debug queue for five minutes. The -A flag is the project that you want this resource to be charged to.
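As a sketch of that kind of request (the project name `m1234` is a placeholder — substitute your own allocation):

```shell
# Request one CPU node through the debug queue for 5 minutes,
# charged to project m1234 (placeholder account name)
salloc --nodes=1 --constraint=cpu --qos=debug --time=5 --account=m1234
```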
Before we talk about how to write an sbatch script and how to submit it, let's see what happens when you request a resource. Initially you're placed on a login node, and that is where you will write and submit your sbatch script from, or where you will make the salloc request from. Once you make the request, the resources are allocated to you. The resource list basically consists of multiple nodes, and out of those nodes, one node will be assigned as the head node: that is where your batch script is actually going to be executed, or where the launcher command will be executed. Once the launcher is called, it basically initiates the parallel processes on all the nodes that are included in the resource list.
Now to the actual stuff: how to launch a job. This is how a batch script would look — the one on the right and the one on the left are equivalent. I'll talk about why they're the same, but they basically do the same thing; it's just a different way of writing them.
The very first thing that you do is write the type of shell that you want the script to be executed in — here we're using the bash shell — and after that we have some job configuration options. On the left I'm using the long way of referencing them. For example, `--account` basically means the account that you want this job charged to; on the right I'm referencing that using the short format, which is `-A`. So everything on the left is equivalent to what's on the right, and you could use either of these.
It's better if you use the more verbose option, the longer format — that's what I do; it keeps things simpler. So first we have the `--account` option. It's recommended that you enter these options — you may not be able to submit a job if you don't have them — but you may need to tweak them depending on the type of job you have.
The second option is `--qos`, which is the queue, or the quality of service, that you want this job to go to. In this case we are setting that to regular. After that we have the number of nodes that we are requesting — here I'm requesting two nodes. After that is the amount of time that you're requesting these two nodes for. If you just enter a number, it is assumed to be minutes — for example, this is 60 minutes — but you can also use the hours:minutes:seconds format.
Basically, the scripts on the right and on the left are both requesting the resource for 60 minutes. Then we have the constraint, which is the type of node that you want; here I'm requesting a GPU node, so I'm setting it to `gpu`. This one is optional: the job name. It's good if you set this, because it makes it easier for you to track the job once you have submitted it.
The license option basically tells us the type of storage system that you're using. It adds a tag to your job, so that if a certain storage system is down, we will be able to hold your jobs so that they do not fail. Just like it recently happened when Perlmutter scratch had to go down for maintenance: for users who had their jobs labeled with this license, we were able to hold them so that they didn't crash.
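Putting the options discussed so far together, a minimal sketch of such a batch script might look like this (the account name `m1234`, job name, and executable are placeholders):

```shell
#!/bin/bash
#SBATCH --account=m1234        # project to charge (placeholder)
#SBATCH --qos=regular          # quality of service / queue
#SBATCH --nodes=2              # number of nodes requested
#SBATCH --time=60              # minutes; 01:00:00 is equivalent
#SBATCH --constraint=gpu       # node type: gpu or cpu
#SBATCH --job-name=my_job      # optional, makes tracking easier
#SBATCH --licenses=scratch     # storage systems the job depends on

export OMP_NUM_THREADS=1
srun -n 8 ./my_executable      # launcher command
```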
And after that you add your job settings, like you would in a typical bash script — for example, here I'm setting the number of OpenMP threads to one — and then there's my launcher command. You can use either format, the one on the right or the left. There are a lot of other options that you could use here; to see the complete list, I would suggest that you refer to the manual pages. You can access them using `man sbatch` on your terminal.
You may need to tweak and optimize the script for the type of node that you have. For example, if you're requesting a GPU node, make sure that you're setting the constraint equal to `gpu`. Each GPU node, as we discussed, has 64 hardware cores, so it's optimal if you run 64 processes on it; so we can set the number of tasks per node equal to 64 here, and CPUs per task equal to two.
Now, a CPU in the context of Slurm is a hardware thread, not a complete core: one hardware core has two hardware threads, and one hardware thread equals one CPU for Slurm. So in this case we are going to set it equal to 2. And then you request a number of GPUs. Whatever you set it to, like one or two, that is the number of GPUs that will be visible to your job, so make sure that you select that appropriately: even though the nodes are exclusive, you still do not view all the GPUs if you do not request them.
The equation on the right is for the CPUs per task. It's recommended that you compute `c`, the CPUs per task, using this equation for the GPU nodes, because that will make sure that you are not under-utilizing your resource. The `k` is the number of tasks per node — for example, here it's 64, so the term inside the bracket becomes one; you multiply it by two, and that's two.
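As I understand the slide, the rule is c = 2 × floor(64 / k) on a GPU node (64 hardware cores) and c = 2 × floor(128 / k) on a CPU node (128 hardware cores). Worked out in shell arithmetic:

```shell
# cpus-per-task so every hardware thread is used:
# each core carries two hardware threads (Slurm "CPUs"),
# so divide the cores among the tasks and double the result.
tasks_per_node=64
hardware_cores=64                  # GPU node; use 128 on a CPU node
c=$(( 2 * (hardware_cores / tasks_per_node) ))
echo "$c"                          # prints 2
```

With 64 tasks on a CPU node (128 cores) the same arithmetic gives c = 4, matching the example later in the talk.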
So that's why I have two over here. And when you're executing your srun command, you enter the total number of ranks that you want to launch — with two nodes here, that's twice the tasks per node. Okay, so we were on this GPU slide; for CPU nodes we are going to just change the constraint from `gpu` to `cpu` and make sure that the tasks per node is suitable for a CPU node.
We have two times the cores here, so where I had 64 for the GPU node, for the CPU node I'll double that, and when calculating the CPUs per task I will use 128 here, and it will again come out to two, because the number of tasks that I'm setting is equal to the number of hardware cores. But some jobs require that the tasks per node be less, because they want to utilize the threads, so you can make the change accordingly.
Let's say you're using 64 tasks per node over here: you bring 64 into this equation, and that will give you c equal to four. So set this appropriately — otherwise you will be under-utilizing the node and you will be getting a performance hit. And here, then again, I'm launching two times 128 ranks — 256 — on the CPU nodes.
When we're talking about launching jobs, it's important to talk about affinity. Affinity is how your processes and threads are bound to the hardware threads and the hardware cores. It's recommended that one MPI process is bound to a hardware core, and to make sure of that you can set this option in your launcher command: that's `--cpu-bind=cores`. There are other options to do that.
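For instance, a sketch of a launch line with that binding (the executable name is a placeholder):

```shell
# Bind one MPI rank per hardware core; 64 ranks fill one GPU node
srun -n 64 --cpu-bind=cores ./my_app
```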
You can explore them by setting `--cpu-bind=help`, but if you are trying to optimize the node usage, this is what I would recommend. And while you're doing this, also make sure that you are setting your CPUs per task appropriately, using the equation on the right — for the GPU node it obviously will be 64 instead of 128 — so that you're making the best of the resource that you have.
When you're on the GPU nodes, you have an additional component, the GPU, so we need to make sure that the affinity here is also optimal. Each node has four GPUs, and we have 64 hardware cores. These cores are divided into different NUMA nodes, and you will get the optimal results if the rank in a certain NUMA node has access to the GPU that is closest to it.
By default, all the ranks will be able to see all the GPUs, and what programmers typically do programmatically is a round-robin assignment. That way, it is not guaranteed that each rank will get the GPU that is closest to it, or in the same NUMA region, and that may increase data transfer times; when you're using unified memory, you may see some performance downgrade.
To understand what I mean by this, I'm using a vector-add example from the January Perlmutter training, where we demonstrated how to build and run a GPU code. In this code I have an MPI code that is making kernel launches on GPUs. When I run it without any GPU binding, you can see that each rank is able to view all the GPUs.
But if I set `--gpu-bind=closest` and then do the run, you can see that each rank is able to see just one GPU, and that GPU is the closest to it — and you can check that using the NUMA region. Have a look at these highlighted lines: ranks one and five are on cores 16 and 17, and you can see that cores 16 and 17 are on the same NUMA node, so they have been assigned the same GPU, the one that was closest to that particular NUMA node.
So it's recommended that you set it equal to closest, to make sure that your ranks get the GPU that is closest to them. But if you have programmed your app differently, and it doesn't really care whether the GPU is close or far away, then you can set it according to how you want. There are other options that you can explore; you can go to the man page of the srun command to see how you can do your GPU bindings in a different way.
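Combining the two bindings on a four-GPU Perlmutter node, the launch line might look like this sketch (executable name is a placeholder):

```shell
# One rank per hardware core, each rank bound to its nearest GPU
srun -n 64 --cpu-bind=cores --gpu-bind=closest ./gpu_app
```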
Finally, we have thread affinity. If your code is using OpenMP threads, then it's recommended that you set the bindings and affinity using the OpenMP environment variables. In this case I'm using these three environment variables. The first one is the famous one, which is used for setting the number of threads that you need per rank. The second one is `OMP_PLACES`: this is the place that your threads will reside on.
Right now I've set it equal to `threads`, which means a hardware thread, so each CPU core will have two OpenMP threads mapped to it. But if you want just one, then you can replace `threads` with `cores`; that will make sure that one thread has one hardware core assigned to it. The third option, `OMP_PROC_BIND`, basically makes sure that your threads are not relocated to different cores or different threads. And after that you have your launcher command.
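Taken together, a hedged sketch of those settings ahead of the launch (the thread count, the `spread` binding policy, and the executable name are my own illustrative choices, not necessarily what was on the slide):

```shell
export OMP_NUM_THREADS=2      # OpenMP threads per rank
export OMP_PLACES=threads     # or 'cores' for one thread per hardware core
export OMP_PROC_BIND=spread   # keep threads pinned; 'spread' is one common policy
srun -n 64 --cpu-bind=cores ./omp_app
```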
Finally, let's go to the job queues that we have been talking about since the morning. We have different types of queues depending on the type of quality of service that you need. As I said before, it's important that you determine your job's needs — your resource requirements, the time that you need them for, the type of node that you need, and the type of work that you're going to do — and then choose the queue. Now, let's say that you want to do some simple file
writing, or some automated text analysis. Then you wouldn't want to waste your hours on a complete node, because if you request an exclusive node through the regular QOS, even if you're using just one core on it, you will still be charged for the complete node-hour. To avoid that wastage, you can use the shared QOS.
If your job is serial, it's recommended that you do not use srun, because that has overhead associated with it. But if you have multiple ranks that you want to launch, you can use srun even over here.
If you want to do debugging, or you want an interactive node, there are two different QOSes for that: we have a debug QOS and we have an interactive QOS. Debug has a maximum limit of eight nodes, and the maximum time is 30 minutes. Typical debug jobs are short — you just want to hit a certain bug, and that is it — so I think that's very appropriate.
But if you want to debug on an interactive node, and you want to do it for a longer time, you could use the interactive QOS. The difference between these is the time and the number of nodes that you can access, and how you submit them: the debug QOS can be accessed through an sbatch script, while the interactive one always goes through salloc, because it is interactive, and a batch submission wouldn't really make sense.
If you want more information on these, you can go to this hyperlink here. Then we have the preempt queue. Now, let's say that you have a very large job that takes up a lot of resources and runs for a very, very long time. It will almost become impossible to schedule, because with the type of resources that you're requesting, it will basically have to wait forever.
One way to get around that is to use the preempt QOS, which basically allows your job to be preempted after a certain time, so that a higher-priority job can be scheduled, and then your job can be requeued at a later time.
Let's say on a weekend, when people are not submitting jobs, the job will be requeued. But for this it is important that your code has checkpoint/restart capabilities built into it, so that when it's preempted it's able to save its state, and it is able to resume from the same state when it is requeued. To utilize this, you have to use the preempt QOS in your batch script, and you have these additional four options at the bottom that I have highlighted. The first is the maximum desired time limit.
This is 96 hours, so basically your job will run for 96 hours maximum — that is, the sum of all its sessions — and then there's the checkpoint overhead time. Then this requeue flag means that your job will be requeued; if you do not add it, it will not requeue. The last one is an important one: that's the open mode. You have to set it to append, because the files that you're writing your output to will then be appended to when the job is requeued, so that they are not overwritten. So preempt is something to take advantage of.
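My reading of those highlighted options, sketched as batch-script lines — in particular, using `--comment` to carry the desired total time is an assumption on my part, so check the NERSC preempt documentation for the exact flags:

```shell
#SBATCH --qos=preempt
#SBATCH --comment=96:00:00      # desired total time across all sessions (assumed flag usage)
#SBATCH --requeue               # without this, the job is not requeued after preemption
#SBATCH --open-mode=append      # append to output files on requeue instead of overwriting
```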
If you run very long jobs for a very long time, that's worth taking advantage of. The xfer queue is another way of saving your compute hours: if you need to stage data from HPSS — the long-term storage — then you can do that through xfer, because typically data transfer from HPSS is very slow.
So it's recommended that you do not waste your hours on a compute node; instead, you use the xfer queue, and that will save you a lot of hours. You can utilize this within your production job script by using the command `sbatch -q xfer` and then `hsi put` with whatever file you want to transfer, or you can also do a separate batch script for this after you have completed your production job.
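A minimal sketch of such a standalone xfer script (the time limit, job name, and filename are placeholders):

```shell
#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=archive_results
#SBATCH --licenses=scratch

# Archive a results file from scratch to HPSS long-term storage
hsi put results.tar
```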
You basically set the QOS equal to xfer, the maximum time limit that you need, the name of the job, the license, and then finally your transfer command. Notice over here that we are not requesting any node type, or the number of nodes, because xfer, I think, runs on shared nodes — there are some shared nodes that are basically doing this.
You can't really make a node request here. So, these are some advanced options. Whatever we have covered till now is enough for you to run your jobs in an efficient manner, but these are some advanced options that may help you get the best out of your experience. Let's say that you have an sbatch script and you want to launch multiple jobs using just one request, because you don't want to write separate batch scripts — you just want everything bundled together into one.
You can do that: you just need to enter a separate srun command for each of your jobs. For example, executables a, b, and c will each have to have a separate launcher command, but make sure that you request the resources appropriately — what is needed for all these jobs, and the time they're needed for. The time here would be the sum of the time needed to run all of these.
But if you want to run things concurrently — you want all these jobs to run in parallel through a single batch script — that can also be done. You'll have to make sure that you request the resources appropriately, which will basically be the sum of all the resources needed by all the jobs. For example, in this case each job is using two nodes, so the total number of nodes that we request is six. Similarly, the time here would be the maximum time needed by any of these jobs.
It's not the sum, as it was in the example before. There are some changes that you would need to make: add the ampersand sign at the end of each launcher command, and add the `wait` instruction at the end of all your jobs. This will make sure that your jobs run concurrently and that the job is not terminated until the last one has returned.
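A sketch of that concurrent pattern (node counts, rank counts, and executable names are illustrative):

```shell
#!/bin/bash
#SBATCH --nodes=6               # sum of the nodes needed by all three jobs
#SBATCH --time=30               # maximum time needed by any single job

# '&' sends each launcher to the background so the three run concurrently
srun -N 2 -n 128 ./job_a &
srun -N 2 -n 128 ./job_b &
srun -N 2 -n 128 ./job_c &
wait                            # don't exit until every background job returns
```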
If you have a sort of workflow where one job depends on another, and you want to make sure that a certain job is completed before a new job is scheduled, you can make use of the `--dependency` option of sbatch. How this works is, first you schedule a job. Let's say we have a first job and a second job, and we want the second job to run only after the first job has been completed.
What we do is submit the first job using sbatch with the `--parsable` option; that returns a job ID, and we can save it in a variable. Then we can schedule the second job using the `--dependency` option, where we set it to `afterok` followed by the job ID of the job that we want this job to depend on. `afterok` means after job one has completed successfully — that is, it did not fail.
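Sketched as shell commands (the script names are placeholders):

```shell
# --parsable makes sbatch print just the job ID so we can capture it
jobid=$(sbatch --parsable first_job.sh)

# second_job.sh starts only if first_job.sh finishes successfully
sbatch --dependency=afterok:${jobid} second_job.sh
```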
You can replace this with `afterany`, as over here; that would basically mean: run job two no matter how job one ended, whether it failed or completed successfully. In the last line you can see that we have a fourth job that we want to run only if jobs two and three have completed. So you can add the chain of the jobs that you want
A
You
know
in
the
tech
tax
dependency
option
with
a
comma
separated
list,
and
the
last
one
is
an
example
of
how
to
use
the
after
any
option.
It's
it's
basically
the
same
way
that
we
use
the
after
okay
option.
There
are,
there
are
some
other
options
as
well.
You
can
go
to
the
main
pages
of
the
S
patch
command
to
see
whatever
you
know
fits
your
needs.
You can also use job chaining inside an sbatch script, and for that you will need to use the `--dependency` option within your sbatch script.
Job arrays are another way of running multiple jobs — bundling the jobs together — except that in this case each job will be scheduled separately instead of within the same batch script. It is helpful if all of your jobs use the very same resources and you do not want to go through the pain of submitting each job separately. What you can do is add the `--array` option, where you set it to the number of jobs that you want
to exist in the array. For example, in this case I need to schedule 10 jobs, so I'm going to set it from one to ten, and once you have this set, the Slurm array job ID can be used as an index — it will have an index from 1 to 10 for all these jobs — and you can use it to index your job directories or name your output files.
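A sketch of such an array script — I'm using the `SLURM_ARRAY_TASK_ID` environment variable as the per-job index, which is how I understand the indexing described above (directory layout and executable are placeholders):

```shell
#!/bin/bash
#SBATCH --array=1-10            # schedule ten copies of this job
#SBATCH --output=run_%a.out     # %a expands to the array index

# Each array element works in its own directory, indexed by its task ID
cd run_${SLURM_ARRAY_TASK_ID}
srun ./my_executable
```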
However you like. It's another option that can be used if you want to launch a lot of jobs at the same time in a very quick manner. It's not recommended that you use a for loop over srun; this way is much better. It may not get you the best turnaround times, because you're launching a lot of small jobs, but
it works. The generated script includes a variable for thread affinity. It may not be in any way optimized, so you may need to tweak it for your use case, or to make the best use of the node that you're trying to run on, but it is a very good starting point: whatever is generated here may be able to run as-is, but it may not be optimal.
So if you have never written a batch script before, and you want a starting point, this is what I would recommend. This is the link that you can go to; it's freely available — try it out. Next: Multi-Process Service for the GPU nodes. Nvidia has this thing known as Multi-Process Service, also known as MPS, which allows for oversubscribing GPUs when they are being shared by multiple processes.
Typically, when you have a code which uses multiple ranks per GPU, you can launch kernels on GPUs from multiple ranks, but the GPU will be locked on one kernel at a time. What this service does is allow the GPU queue to be filled with the kernels from multiple ranks, so that it can schedule them as soon as it has resources available. It's basically a way of improving throughput; in some cases you will see significant performance improvement, so it's highly recommended that you use this.
You can enable the MPS service on GPU nodes using this command over here, and once your executable has completed, you can quit it using this option. To make things easier for you, we have a NERSC wrapper script which you can use, and it's available at this link. It will basically make things very simple for you — you won't have to go through the details of how to turn MPS on.
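For reference, the underlying NVIDIA control commands look roughly like this (a sketch of the standard MPS daemon interface, not necessarily the exact commands shown on the slide):

```shell
nvidia-cuda-mps-control -d          # start the MPS daemon on the node
srun -n 8 ./gpu_app                 # ranks now share GPUs through MPS
echo quit | nvidia-cuda-mps-control # shut the daemon down afterwards
```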
So, once you have your jobs up and running, you will want to monitor them — what state they are in, how far along they are in their time. There are multiple options that you can use: `sqs`, `squeue`, and `sacct`. We'll go through these separately. The first option is `squeue`; by default it will show you jobs from all the users, but you can filter them by user ID with the `-u` option.
`sqs` is a NERSC wrapper on squeue; by default it will show you the jobs that you have submitted. And then, probably the most important command any user will be using: `sacct`. It allows you to view the jobs that were submitted in the past and have executed or completed — squeue will only show you the ones that are currently in the queue.
`sacct` can be used to access the previous jobs. For example, in this case I'm querying the jobs that I submitted from August 25 to August 30, and it is going to list the jobs and the state they finished in. You can see that I had a lot of failed jobs, the account they were charged to, and multiple other options. The good thing about sacct is that you can configure it to your needs.
You can request the type of output, or the fields, that you need. For example, here I'm requesting fields like the number of nodes and the state, so I can see how many nodes I used for these jobs; in the previous one you could just see the number of CPUs that I was allocated, which is the default output. You can request whatever you need. If you want any more information about your jobs, you can go to the man pages of sacct.
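Sketched as commands (the date range mirrors the example above; the field names are standard sacct fields):

```shell
# Jobs submitted between two dates, with their final state
sacct --starttime=2022-08-25 --endtime=2022-08-30

# Pick the fields you care about, e.g. node count and state
sacct --starttime=2022-08-25 --endtime=2022-08-30 \
      --format=JobID,JobName,NNodes,State,Account
```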
There are tons of other fields that you can explore, but be mindful that the maximum query duration is one month: you won't be able to see jobs more than 30 days, or one month, in the past. `scontrol` is something that you can use to see the jobs that are currently queued or currently running. It can also be used to update the job specifications. For example, here I can use my job ID to see the job
that is running at that moment, and it gives me a lot of information about a single job: the QOS, the account it was charged to, the number of nodes used, the priority. The fun thing with `scontrol update` is that you can change the specification — the settings of the job — after it has been submitted. For example, in this case I'm using sqs to request the list of jobs that are currently pending, and I
see that this job is pending, and the QOS, or the queue, that it is in is GPU regular. I want to accelerate it, or want it to go through faster, and I've been told that I have this QOS, early science, that makes my job go through faster. So what I do is use `scontrol update`: I set the job ID equal to the job ID of this job over here, and then I set the QOS to the new QOS
that I want this to be updated to. Once I've done that, I can run sqs again and see that the job has indeed been updated from GPU regular to the early science QOS. Be mindful that not all the options can be updated with the update option, but some can be. If you want to cancel a job — you realize you don't want to go through with it — you can type `scancel` and the job ID, and it will be canceled.
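Sketched as commands (the job ID and QOS name are illustrative, following the example above):

```shell
# Move a pending job to a different QOS after submission
scontrol update JobId=123456 QOS=early_science

# Cancel a job you no longer want to run
scancel 123456
```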
You can see here that I canceled the job, and it's not visible anymore in the sqs output. So, there are some best practices — I see that I'm about to run out of time, so let's go through this quickly. It's always good to go to the documentation page and see the types of queues that we have and their limitations. This is the snapshot of the Perlmutter GPU queues, and this is of the Perlmutter CPU queues; these things are changing frequently.
It's good to know what's happening underneath. The jobs are scheduled based on a priority value, which is basically a complex combination of a lot of different things; a priority value is associated with each job. Then we have two Slurm schedulers, a main one and a backfill one. The main scheduler schedules jobs in the order of the priority list for a few days into the future, while the backfill scheduler schedules the small and short jobs that can be run in the gaps in between the large ones.
What that tells us is that if you have a small job requesting resources for a short time, it can take advantage of the backfill opportunities. So, to get a good turnaround time, make sure that you request the resources appropriately, so that you're getting quick turnaround times.
If you have a very long job, try to checkpoint it and break it up — try to use the preempt queue — so that your job can be scheduled quickly.
It's important for all users to request only the time that they need, because that will make your job go through faster, as well as help others get their jobs scheduled faster. For the large jobs, launching can become a little difficult, because your executable has to be available on every node. So it's recommended that you use `sbcast` to distribute your executable to all the nodes in temporary storage; that will make things faster for you.
It's recommended that you do a static build for the large jobs; otherwise, the dynamic libraries will need to be accessed at runtime from all the nodes, and that will make things very slow. We have a Shifter talk coming up in the afternoon that will discuss how to launch large jobs that have shared libraries associated with them, because then you have the shared libraries available on all the nodes through the Shifter image.
This is something I'm repeating from the morning, which Rebecca mentioned: I/O is not optimized on global home. It's recommended that you use the scratch file system, which is a high-performance parallel file system, for large jobs, and if you have shared software, consider putting it in the global common software file system. If it's very large software and you need a lot of space, you can open a ticket under customer storage. For further information, please refer to the NERSC documentation.