From YouTube: NUG Monthly Meeting, May 20, 2021
B
Hi, this is Koichi from PNNL. I just put in the chat that we got the paper accepted and published; the final form actually took some time. This paper used NERSC resources to, you know, develop an algorithm to track very strong thunderstorms that produce a lot of precipitation and damage a lot of infrastructure. We also used Cori to run the computer models, the climate models, for a bunch of simulations, and we used NERSC resources to develop and use the machine learning.
A
And thank you for that. That's actually a good point: when you submit papers where you're using NERSC resources, it's really helpful to us if you include an acknowledgement. Somewhere on our web page, under www.nersc.gov, there's a kind of format that you can use. This helps in our argument for, you know, funding from the DOE, basically.
B
Yeah, yeah, I think we copied that statement from the website. Yes.
A
Yeah, so thanks for doing that; that's really helpful. So machine learning for atmospheric work is kind of an interesting field. Were you using the machine learning to sort of speed up the solver, or to look at the results of the, you know...
B
For this one it's more traditional, actually. We used the self-organizing-map type of machine learning to really, you know, find out particular high-dimensional structures of the atmosphere: you just compress all the different variables at different height levels, which is just difficult for us to do, and these are very non-linear processes, so machine learning is really nice for teasing out those patterns that are really hidden in the atmosphere. And the DOE is really pushing to use machine learning in our field as well. So this one is the more typical, traditional way.
B
But, you know, a bunch of us are also working on using machine learning to develop the predictive part of the model, particularly for those processes that the global model cannot resolve, like turbulence and convection. And probably somewhere down in the future I'd like to post another work that some of us are working on: developing a cloud model using machine learning to study cloud and aerosol interactions, and using those machine learning methods, or trained algorithms, as part of the global models.
A
Yes, and I understand that thunderstorms in atmospheric models are quite difficult to detect, normally, because they tend to be quite high resolution.
B
Yeah, it's very difficult, and traditionally it could be very subjective; we do not really strictly agree on what constitutes, in our case, these particular strong storms, what we call mesoscale convective systems. But the first part of this paper really tries to make it as objective as possible and as realistic as possible. So it's in three stages: you know, that tracking algorithm; then the model simulations, using a new global model with a variable-resolution grid; and then finally applying the machine learning algorithm to both observations and models, to compare reality and the model more objectively.
A
Great work. Anybody else got a win or a success that they'd like to share?
A
So we can go to the flip side of the coin here, which is "Today I Learned". You know, what we do here is research, and part of the nature of research is that you get stuck on things, you hit dead ends, you try new things and they don't work, you climb over quite challenging learning curves.
A
It can be a little painful and frustrating, but it's not a bad thing, because this is kind of how we learn stuff, and how we get new knowledge and new discoveries out into the community. So this section is kind of an opportunity to talk about something that bit you, something that tripped you up, or even just something interesting that you stumbled across.
A
And that directory is very difficult to find; well, "difficult" is the wrong word, but it's not obvious, because if you ls -la in your home directory it's not visible there. It's almost like you need to know the secret code: you have to ls -a .snapshots, and the system will show you what's inside that directory. And if I remember rightly (I think Larry looked at this more recently than me and can probably correct me) the snapshot gets taken once a day.
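A minimal sketch of that, assuming the daily snapshots sit in date-stamped subdirectories (the names here are illustrative):

    ls -la $HOME                # .snapshots does not show up here
    ls -a $HOME/.snapshots      # ...but naming it directly works
    # restore yesterday's copy of a file from the snapshot
    cp $HOME/.snapshots/2021-05-19/myfile.txt $HOME/myfile.txt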
A
So, for me, most of my hard lessons over the last week or month have been around tips and tricks for using something called Spack, which you may have heard of, and which we actually have on our system: you can module load spack. The default one at the moment is a fairly old version, 0.14.2; I think it's over a year old. Spack is being rapidly and actively developed, so things change pretty quickly.
A
It's quite neat: you can describe, using a DSL, what you would like. So you can say, yeah, I would like to install the package slate at this version, with this compiler, and the setup that we've got will install it by default into a directory in your home directory called sw, for software. In the module file for 0.16.1 you can actually change that; there's a variable called something like spack preferred base.
A
Running it is, you know, just a single command line, really a couple: one to check what it's going to do, and then one to tell it to go ahead and do it, which is really nice. But of course software is complex and things don't always work, and when it breaks it can be a little challenging to find out why. So I've been learning.
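The check-then-install flow just described might look like this (the package and compiler are illustrative):

    module load spack
    spack spec zlib %gcc      # check what it's going to do (show the concretized build)
    spack install zlib %gcc   # then tell it to go ahead and do it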
A
There's a whole lot of information there. We have some stuff about it on our web pages as well.
A
Under... yeah, don't ask.
A
But yeah, give it a whirl. You might find that it works really well; you might find that things are complicated. But, you know, you can also drop us a line and send us a ticket to ask for assistance with it.
A
I see a couple of other things wrapped up in the chat. Oh, hey (am I pronouncing that correctly?): "the maximum time you could run on a single GPU node is four hours."
A
This is on Cori GPU. Do you want to tell us about what you learned?
D
Oh yeah, I was just trying to run some hyperparameter search on my neural network. I first tried to use JupyterHub, and I found that my job gets constantly killed after a certain period of time (I didn't really count how long that is), and then I switched to the Cori GPU nodes.
D
So after I request that GPU, I think I can get a stable running period of time, but that GPU only lasts for like four hours. So I was just wondering if there's a way to request the GPU for a longer time, because, you know, hyperparameter search quality kind of scales with the number of attempts you make: the longer you search, the better the model you will get for your network. Some people in the chat have already given me some help, so I think that's very helpful.
A
Yeah, so there are some good tips there, actually, about being able to tweak it, or find the constraints and get running overnight. So that's good to know.
A
So we are getting up to 25 past 11, but we've got a few more minutes. Does anybody else have a tip or trick, or something that they would like to learn because it's a sticking point?
A
All right, so for our next section we have a space for announcements and calls for participation. There's kind of a lot going on at the moment; if you scanned through the weekly email this week, you'll have seen the emails that went out.
A
We have a NERSC-plus-NVIDIA GPU hackathon coming up. The original deadline for submissions, I think, was yesterday, but it's been extended, so you have until the end of the week. So if you have a code that you're working on to get GPU-ready, and you would like some assistance from experts from both NERSC and NVIDIA, that's still available; the web address here is in the slides.
A
There are a few training events coming up. There's an Intro to NERSC coming up on June 3, so that's in a couple of weeks. Some of you might have seen Parallelware, which was a topic of the day a couple of months ago now; Appentra is holding office hours for assistance getting up and going with Parallelware on June 9,
A
and I think there's a link in the weekly email to where you can, what would you call it, make an appointment for those. Another training event coming up at NERSC is a crash course in supercomputing. And another one that I'll give you a heads-up on, one that hasn't actually been announced yet but will be very soon, is that we'll be doing some training on using Lmod. Some of you may already be familiar with Lmod.
A
It gets used at a number of other sites, but on Cori the modules environment that we use is TCL modules; it's kind of the original one. Lmod is a little more than a re-implementation of it, but it's a kind of follow-on from it: the same sorts of ideas, with, you know, a newer scripting language behind it, and a few sort of updates.
A
I think that previously you needed to request access to it; it's now available to all. I can see there's a bit of chat going on in the Zoom chat; let's run through that, so that everybody has seen it.
A
So there's some discussion around some of the training; there was talk about doing checkpoint/restart as a way of stopping and starting things. We just recently had a checkpoint/restart training session on MPI-agnostic, network-agnostic checkpoint/restart, and I think we have some notes on that in the documentation pages; it's probably under "Running jobs", here, "Checkpoint/restart". So yeah, particularly for your serial and some MPI jobs: if the code doesn't have its own built-in checkpointing, DMTCP can be quite handy.
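A minimal DMTCP sketch for a serial job, assuming a dmtcp module is available (the interval and executable name are illustrative):

    module load dmtcp
    dmtcp_launch -i 3600 ./my_app    # write a checkpoint every hour
    # in a follow-up job, resume from the checkpoint files:
    dmtcp_restart ckpt_*.dmtcp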
A
It's also helpful to know that Lmod supports TCL-format module files, yes. And on Perlmutter we intend to be using Lmod, so yeah, there will be a little bit of transition. It should be fairly smooth, in that most things work in exactly the same way; there are a couple of slight differences. But, very usefully, particularly if you have your own module files, Lmod understands TCL module files as well. There's a question here: compiling for Haswell, not the KNL?
E
I think that's a good question; that's a good point to, you know, bring up, since some codes for KNL do need to be compiled on a KNL node. I think we could discuss internally whether we want to add a KNL node to the compile QOS.
A
That's a good point, actually. And there's a question coming in about the debug queue time limit, because KNL compilation, especially for a large code, can take a while.
A
So, yes, it is possible to cross-compile for KNL from a login node. Something that I've discovered (and probably several others of us here have) is that the part that cross-compiling most often trips up on is if you're using CMake or ./configure: they build and run little executables to see, you know, if things are available or if things work, and unless the package has been very well developed, quite often it will try to, you know, build and run some executable to test something. And of course, because you're cross-compiling for the KNL, it doesn't work on the login node, because it's built using AVX instruction sets. So for that, sometimes just going to a KNL node (you know, getting an interactive node) just for the ./configure step can help to sort of get through that part, and then you can go back to the login node to do the actual compiling, which can be a lot faster than doing the full compile on the KNL node.
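Sketched out, that workflow is roughly the following (the queue name and times are illustrative):

    salloc -N 1 -C knl -q interactive -t 30:00   # grab a KNL node
    ./configure ...    # configure's little test executables can run here
    exit
    make -j 16         # back on the login node: the faster cross-compile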
A
So those are the announcements that I know about. Does anybody else have any announcements or CFPs that would be good for NERSC users to know about?
A
It seems like more work than is necessary; oh, just because it's going through ServiceNow. But there's a link on that page for requests, and one of the requests is access to Cori GPU nodes.
A
Clicking on these in the slides isn't helpful; it jumps too far forward. All right, we can post a link later in the chat, or, if you have the slides open in front of you, you'll be able to click on the link there.
A
So we have a few people with us who are part of the NESAP program, who have some Cori GPU scripts to share and give us a little bit of a walkthrough.
C
Okay, so this is a quick example; it's going to demonstrate how to run a Jupyter notebook. I'll show you: this is my job script for Cori GPU.
C
So I'm doing an sbatch to cgpu, I'm asking for four GPUs here, and I'm emailing myself to let me know how the job is going. And for anybody who uses Python and wants to source a custom conda environment, this is how you do it inside your script: I load the Python module and I source this environment called papermill, which I've already built. Papermill is a library that allows you to run Jupyter notebooks from the command line and also to insert overriding parameter cells. So papermill is really cool.
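For reference, the papermill command line looks roughly like this (the notebook names and parameter are hypothetical):

    papermill input.ipynb output.ipynb -p n_trials 20   # override a tagged parameter cell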
C
We have some docs about that. And here's where I'm launching it: I have an srun where I ask for my GPUs, and then I have this script called run papermill. So that's it here.
C
I import the papermill library and do some arg-parsing and stuff, but the powerful part is here. This is where I'm actually launching my Jupyter notebook from the batch script, so I don't actually have to log into Jupyter, and I don't have to do anything interactively, which is cool. So I'm launching this notebook (it's flexible), and it's saving the output on CFS, or pretty much wherever you want. And then the thing that my notebook is actually doing is spinning up a Dask cluster, and I'm not...
C
I can't explain everything here, but Dask is kind of a Python parallel, task-based library. So I check that I have a GPU, I spin up a Dask cluster, and then I'm doing some cuDF processing, so that's kind of pandas, but on a GPU. And then, when I'm done, I write some output files and shut my cluster down. So in this one job script I'm able to start and run a bunch of Jupyter notebooks. So I think it's pretty powerful.
A
That is really neat, yeah. I think I'll echo William's question.
C
Yeah, William, so I see a question about sharing the script. At the moment this is not quite ready to share, but I'm going to put up a public version of this stuff in the next week or two, because it's going to be part of a paper at SciPy; there are paths in here that we don't want users to see. But yes.
A
Sounds really good. So the workflow for developing that would then be, I guess, that you begin by requesting an interactive GPU node through jupyter.nersc.gov, do your kind of development with papermill there, you know, for the GPU side of it, and then go on to develop the shell script separately?
C
Yeah, actually it's kind of backwards. I would start by logging into Jupyter and, you know, developing some script interactively so it does what I need, and then, when I have what I want, I can wrap it in papermill, and papermill will override certain cells for me so I can do a parameter scan, and then I can put that in my batch script. But there are other solutions too; there's jupytext, and I think there are other options, but I like papermill, it's easy to use.
F
Yeah, okay, so this is essentially AMReX's sample run script, and I can throw this one in the chat for you guys to use as a reference. It is our reference.
F
It is a bit outdated in some places, but it's outdated in a way that we actually depend on. If you look at the example scripts in the docs, here's what you normally get, and you see that most use this gpus-per-task to define that, for a given rank, you're going to have so many GPUs. The only difference in what we do compared to this is that we loosen that: AMReX, inside of it, looks at all visible GPUs and parses them out inside the code at initialization.
F
So it doesn't do this up front and limit what it can see. Right now that's really for testing, and we don't really use it; most of the time we have one task per GPU, like most people, but we don't want to limit that right now, because we have various testing and, you know, experimental things that we're doing.
F
That's really the only difference from it overall. But this script actually has all the kinds of things that you would need to do to run on Cori GPU: the gpu constraint, time limit, job account (we used m1759; change it to yours), and then all the tricky bits, which we've actually got a note down here about.
F
What you probably want to do is change this to gpus-per-task equals one for most of your codes, but otherwise this works nicely. And then we have a couple of examples here: for one node of Cori GPU you would set it up like this; for two, you'd set it up like this. So notice: CPUs are per task, GPUs are per node, and tasks are per node, so none of these numbers change; you just change big N and little n. We then have some Slurm commands.
F
So this is how you get on interactively. If you do a single GPU, the line would look like this: just getting a single GPU, and 10 is the even distribution of CPU threads on one node. And then a full single node using the --exclusive flag, or multi-node using the --exclusive flag, if you wanted to. And this is where we set it up: you put in your executable and your input, and we run like this, an executable and then an input file.
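A rough sketch of the kind of header being described, for one Cori GPU node (the values are illustrative, and m1759 was the presenters' account):

    #!/bin/bash
    #SBATCH -C gpu
    #SBATCH -N 1                # big N: nodes
    #SBATCH -n 8                # little n: total tasks (8 per node here)
    #SBATCH -c 10               # CPUs are per task
    #SBATCH --gpus-per-node=8   # GPUs are per node
    #SBATCH -t 60
    #SBATCH -A m1759            # change the account to yours

    srun ./main3d.ex inputs     # an executable and then an input file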
F
So this is where we defined it in our code, so you don't have to tweak much. And then there are two launches here: one for if you're running it in sbatch, so you just launch like this; and if you're in an interactive node, you'd want to put your configuration in here, so then the srun would look like this. And we have this also here, to compare and look at.
F
So here is Nsight Systems profiling, and this is probably the one that's most useful. It just does a basic profile of the run. So there's your exe and your executable; we output it based on job ID numbers, to get a unique ID every time. That's all it is, output to this file, and this will give you a timeline.
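The profiling line is along these lines (the executable and output names are illustrative):

    # basic timeline profile, keyed to the job ID for a unique name
    srun -n 1 nsys profile -o profile_${SLURM_JOB_ID} ./main3d.ex inputs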
F
So this is if you're looking into profiling, and how to run these things with profiling. Otherwise, we have all the basics up here about how to configure your system. The only real trick that I would say to consider when doing this is: always grab the entire node. Don't try to piece out a node or get a part of it; if you do that, you start running into NUMA problems.
F
How the CPU and the GPUs are laid out might not be exactly what you're expecting, and so you'll get a different configuration or it will run a little differently. So one thing that we usually do, unless we're running in this exclusive mode or in this interactive mode getting one GPU, is we get the entire node: we ask for all the resources in the entire node and pick out the parts that we actually need for the run.
F
That way you're sure you get the same NUMA configuration, you get the same layout of your resources, and so you get consistent results. Otherwise, here it is; feel free to, you know, copy this one over and use it and borrow it and do whatever you need to. And watch out for that gpus-per-task flag, which is the one difference that you might want to account for if your code doesn't pre-configure your GPUs like AMReX does.
A
So yeah, that's a very interesting note about how the code works, and gres gpu versus gpus-per-task. So, in summary, then: for most codes, I guess, they offload to a GPU, and normally one GPU per task is the most common assumption.
A
Right, yep. And so for most people they would use gpus-per-task equals one, but in the case of AMReX, it does sort of careful management of the GPUs itself (yes), and so, basically, with gres you're giving AMReX control over how it distributes the GPUs; you're just telling Slurm, "give me all the GPUs and let me figure it out."
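In sbatch terms, the two styles contrast roughly like this:

    #SBATCH --gpus-per-task=1   # typical: Slurm hands each rank its own GPU
    #SBATCH --gres=gpu:8        # AMReX-style: all GPUs visible; the code divides them up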
A
That sounds good. And the little bonus there of how to run Nsight to get a profile of your code is a very nice addition as well.
F
So what we have built in is, what are they called, the TinyProfiler wrappers, which sort of define locations in it, and even with that in there, which adds a little bit of extra overhead, the Nsight Systems profiler doesn't add a ton of overhead; maybe 10, 20, 30 percent.
F
However, Nsight Compute can add a ton of overhead when you individually go after a single kernel; that can add a whole lot of overhead, especially if it's a big, thick, nasty kernel. So Nsight Systems we can run fairly regularly without too big of a problem. Now, when you profile, you always want to do a smaller subsample; you don't want to do everything. If you do, you know, a full production run and try to sample it, you will see overhead, there's no doubt, but for small little test cases it's worth testing.
F
Robert, most generically, that's exactly what I mean. Yes, we grab the device count, set them up, and then parse them out, and most of the time we do that, but there are a couple of cases where we want to do something more fancy, like give multiple GPUs to a single rank, or a weird subset, and we want to have the flexibility to be able to tweak that inside the codes. Yes, that's what we do differently, correct.
A
So a normal code, then, I guess, that just calls cudaGetDeviceCount and cudaSetDevice, they can still use gpus-per-task equals something; is that correct? Yeah.
A
And Andrew has a comment about Ascent's job step viewer. Ascent is the Summit test system; yes, it's the Summit test system.
F
That's actually strictly a Summit tool, that job step viewer, and we don't have one. We have a job script generator, and I'm not sure what the plan is for Perlmutter, whether someone has already put it on the docket to upgrade it for Perlmutter and all that kind of stuff. I'm not sure if it actually covers Cori GPU either; I don't believe it does.
A
Thanks, Kevin. And Jody, were you able to access your script?
G
Yeah, yeah, I'll share my screen in a second. All right, can you see my screen? Yes, that's working. Do you see the terminal?
A
Yeah, you're on dtn03.
G
Yeah, that's right. Okay, so let me just show you my job submission script for training the neural network on a few GPUs on a single node.
G
If you would like to look at a script that does multi-node training, then I can point you to a tutorial prepared by Steve and Mustafa, but I'll just go through this script, which is mostly just normal stuff. You know, it requests a node, and then this one is requesting four GPUs. I am using the NERSC PyTorch NGC image over here to train, so I'm just specifying which Shifter image should be used. This part here is essentially copying a bunch of data; as you can see, it's copying some data to the NVMe, the solid-state drive, on the GPU node.
G
And this part over here is telling the code which Python environment to use, and finally, line 29 is launching the PyTorch distributed training job. Yeah, let me see.
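A skeleton of that kind of Shifter-based submission (the image tag and script name are placeholders, not the exact ones on screen):

    #!/bin/bash
    #SBATCH -C gpu
    #SBATCH -N 1
    #SBATCH --gpus-per-node=4
    #SBATCH --image=nersc/pytorch:some-ngc-tag   # placeholder image tag

    srun shifter python train.py                 # the distributed training launch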
G
I think there are eight GPUs on a node, but, I mean, yeah, the only thing is: if you use, let's say, four GPUs and I submit two different jobs, and they somehow end up on the same node, then the GPU might not have enough memory to run both jobs. So you have to be careful when you're submitting but using only part of the node, or only a few GPUs on a single node.
A
Right, so that's a good tip: you need to remember the amount of memory that each GPU has when setting up the job.
G
Yeah, and I can also paste a link to the SC tutorial that Mustafa, Steve, and, I think, Josh have developed, which should allow you to create a script that uses multiple GPUs and either Horovod or DDP.
G
Yeah, I think you can look at what Shifter images are available in the NERSC repository using grep or something, and you can also build your own Shifter image and push it.
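For example (the image names are placeholders):

    shifterimg images | grep pytorch              # see what's already available
    shifterimg pull docker:myuser/myimage:latest  # pull in your own image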
A
Do you happen to know, when the Shifter image was being built, if anything special had to be done for PyTorch to use the NERSC GPUs?
A
That's all I have; thank you very much. So we've only actually got about five minutes left in the meeting, but the last couple of things don't usually take very long, so I think that's probably a good time for a bit of Q&A and story-swapping.
A
And if not (I see there's been quite a lot of questions and answers and discussion happening in the chat), if anybody has any questions about setting up GPU scripts that they'd like to ask either our panel or the community generally, please unmute and speak.
H
I'd like to add that the default for Cori GPU is non-exclusive access, that is, shared node access, so if you need exclusive access, you need to add, is it, dash-dash exclusive or something at the top of the script? Can somebody...
H
Is it double-dash exclusive, as an sbatch option? I forgot the actual syntax, since it's on Cori scratch.
A
Here we go, we have a bit of discussion in the chat: it said double-dash exclusive, and I think that is an sbatch option. So, yes, that's a good tip: the GPUs are not exclusive by default. So, if you do need exclusive use of a GPU (and you might find that for a lot of things you don't), or exclusive use of a node.
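The flag itself is just an sbatch option:

    #SBATCH --exclusive   # whole node, instead of the shared default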
A
Rather, I think you'll probably find with GPUs that there's a sufficient amount of power in each GPU that, you know, you only need a small portion of the resources on a node for single-GPU-type jobs, or smaller jobs.
A
You don't want to require all 8 GPUs when you're trying to...
A
Okay, so let me share my screen again. In the meantime (you've probably already seen this, but if not, or even if you have, it's a good reminder), the Cori GPU nodes have their own docs webpage, at docs-dev.nersc.gov, and amongst the various information here there's actually kind of a diagram of what the node layout looks like. So you can see there are, you know, two CPU sockets, with four GPUs attached to each CPU socket, and NVLink across them all.
A
Let us know: drop us a line, either something in the webinars channel, or you can direct-message me in Slack, or send us a ticket.
A
It would be great to hear from people. And a quick look over last month's numbers before we wind up. So, overall availability: we actually took a few hits in April; we had a few outages, unfortunately. There was, of course, the regular scheduled monthly maintenance, but there were a couple of issues that hit, some of them external.
A
So we had an electrical issue that took out sort of a couple of cabinets, with some knock-on effects, and that was electrical-related, but this one over here was actually a hardware failure in the cabinet. So we did take a few knocks during April.
A
That said, HPSS and CFS continued to have very good availability. Cori utilization was very high (we're up at 97 percent), and large jobs were comfortably above our target: we have a target of 25 percent of Cori's workload being things that need something of Cori-type scale, and in April we had a little over 30 percent.
A
We've been sitting at relatively high numbers for a little while there now. Tickets coming in and closed: at the beginning of May we had a backlog of about 400, and a little less than 500 new tickets. You might have noticed a trend here over the last few months; it's pretty normal to see in the range of five or six hundred new tickets a month coming in.
A
And that's all we have for today. Thank you again, everyone, for participating, and especially Kevin and Laurie and Jody for walking us through your scripts.
A
Thank you all again. I'll stop the recording now, and we'll look forward to seeing you at the Perlmutter dedication and, yeah, at our next meeting.
A
Yes, absolutely; we're chatting in the webinars channel, but also, for general questions, the general channel is good too.