From YouTube: NUG Monthly Meeting 16 Feb 2023
A
So, for those who have been to these meetings before: we follow a reasonably predictable and very interactive format and schedule. We have a reasonably large turnout today, so for speaking up, maybe use the raise-hand icon under Reactions.
A
It's not obvious, yeah. You should be able to hit the Reactions button and raise a hand, and that way we can manage how many people are speaking at once. So please do raise a hand and contribute. We also chat in the meeting chat and in the NERSC Users Slack at the same time, and keep the conversation going there after the meeting as well.
A
So we'll follow our usual agenda pattern, with one minor addition today. We start out with Win of the Month and Today I Learned; these are opportunities to talk about things that have gone well, and things that haven't gone well, or that you've stumbled across, that are interesting and beneficial to other NERSC users.
A
We have Lipi Gupta from NERSC, who's going to talk a little bit about the user community survey that is in flight at the moment; yeah, a number of people have filled it out.
A
We have a handful of announcements and calls for participation, and there's also an opportunity, if there's an event that you know of that other NERSC users might be interested in, to let people know about it. Then we'll go into our topic of the day, which is going to be Cori's retirement. So Rebecca from the NERSC UEG is here, and she'll give us a bit of an overview of the plans for Cori's retirement, coming up fairly soon.
A
Let's kick things off with Win of the Month. The aim of this segment is an opportunity to show off an achievement, or shout out somebody else's achievement that you know of, and this can be big or small: having a paper accepted somewhere, solving a bug. It's always interesting to hear how you solved it, and I think it makes good tips for other users as well.
A
You may have either made, or know of, a significant scientific achievement that might be a candidate for one of the science highlights that we present to DOE really frequently, or even a High Impact Scientific Achievement award or an Innovative Use of High Performance Computing award.
C
Kevin: I don't have anything as fancy as, you know, an award or anything, but last week I got the first test for stream-triggered communication working. So now I have something that's running and working and getting cool results, and I get to test it even more, and I have something to actually present at SIAM in... no, two weeks. Yes, so it was a pretty big win for me. (So this is interesting: stream-triggered communication.)
C
So the various vendors are potentially building their own versions. The one I tested, since the code is stable and available, is NVIDIA's ACX, which is a library that sits on top of an MPI and essentially maintains and manages a thread that puts a little trigger into the stream. When you reach that point in the stream, the thread then takes over and makes your MPI call at the appropriate point in time. So it's probably not going to be the final state.
A
So this is interesting. As a usage model, does the application poll the stream occasionally, or is it for applications that are, you know, reading a constant stream of input and block until they've got the next one?
C
The current implementation is primarily focused on GPUs, so that's literally a CUDA stream, and you let the CUDA scheduler and everything handle it. The general model is that it's anything that can be represented as a stream, so a CPU stream object (that, I'm not sure, is very well clarified at this point yet), but something like that could also be used to control and manage it, where the user essentially fires it and forgets it, or you have a little bit of control over it.

C
You know, where you're at, and the handling of it as you go. But that's all in progress and fun stuff to watch.
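Kevin's description of the library can be sketched in a toy form. This is a plain-Python analogue, not NVIDIA's actual API, and every name in it is made up for illustration: work items are enqueued on a "stream", and a managed progress thread blocks until it reaches a trigger entry, at which point it issues the communication call on the application's behalf, so the application can fire and forget.

```python
# Toy analogue of stream-triggered communication (all names hypothetical).
# A helper thread consumes "stream" entries; a ("trigger", ...) entry is
# where the real library would make the MPI call at the right point.
import queue
import threading

stream = queue.Queue()   # stands in for a CUDA stream
log = []                 # records what "executed", in order

def progress_thread():
    while True:
        item = stream.get()
        if item is None:            # sentinel: stream torn down
            break
        kind, payload = item
        if kind == "kernel":
            log.append(f"ran {payload}")
        elif kind == "trigger":
            # In the real library, the managed thread would issue the
            # matching MPI call (e.g. a send) here, not the application.
            log.append(f"comm {payload}")

t = threading.Thread(target=progress_thread)
t.start()
stream.put(("kernel", "compute_halo"))   # enqueue compute work
stream.put(("trigger", "send_halo"))     # fire-and-forget communication
stream.put(None)
t.join()
# log now holds ["ran compute_halo", "comm send_halo"]
```

The point of the pattern is ordering without polling: the communication happens exactly when the stream reaches the trigger, and the application thread never has to watch for it.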
A
Yeah, so it's almost like a channels model of communication to the streams? (Yep, something like that, yeah.) It would be good to see that get take-up.
A
So I think I've got something now for Today I Learned as well, which is that I at least began to learn about stream-triggered computing. Thanks, Kevin, that's really interesting. So, anybody else got something they'd like to shout out?
A
We actually have some extra things in the agenda today, so I might move on to the flip side of the coin of Today I Learned. I guess the charge question for this segment is: what surprised you that might benefit other users to hear about, and might help with our documentation, for instance, as well?
A
So, yeah: not everything works on the first shot; in fact, very few things do, and in the process of doing research and achieving something, you tend to learn a lot. The goal here is to actually talk about those things. They might not have worked, but that doesn't mean they're a failure; that means they're something we can learn from, and that others can potentially benefit from as well. But it doesn't even have to be something that you got stuck on.
A
It can be something that you stumbled across that was an interesting topic to read more about, that other users might be interested in: for instance, stream-triggered computing, which was more or less completely off my radar until Kevin talked about it just now.
D
I ran into this when I was trying to debug preemptible jobs, and I learned that there's a flag, --time-min, which I sort of naively assumed meant: this is the minimum amount of time I want the job to run, the minimum amount of time I can tolerate having the job run. But what it actually means to Slurm is: this is how short you are willing for your job to be. So if I asked for a time-min of 10 minutes and the job could fit in 10 minutes, it would only give me 10 minutes and nothing else. It wouldn't keep going and keep submitting it afterwards; it would just say: your time-min was 10 minutes, you got your 10 minutes, you're done. So I was pretty confused by what it was doing for a while, until I actually went and read the documentation. So today I learned that it means Slurm's time-min, not my minimum time.
A
So have you seen times when you got longer than the time-min in the schedule? Because I wonder if this is related to how busy the system is as well.
E
I have, in the preempt queue. In the preempt queue you say how much time you want (I didn't use the time-min), but then you can say: I need, you know, five hours, and it gives me two and a half hours, because it preempted me at two and a half, not because of the time-min.
A
Okay, good; a good reason to dig into checkpoint/restart and other options like that. Thanks, Lisa, that's a good tip.
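The --time-min behaviour described above shows up in a batch script like the following. This is a hypothetical sketch: the --time and --time-min flags are from Slurm's sbatch documentation, but the QOS name, paths, and application are made up, and preemptible-QOS details vary by site.

```shell
#!/bin/bash
# Hypothetical preemptible job illustrating Slurm's --time-min semantics.
#SBATCH --qos=preempt        # preemptible QOS (site-specific name)
#SBATCH --time=05:00:00      # maximum walltime we would like
#SBATCH --time-min=00:10:00  # Slurm may grant as little as 10 minutes
# If a 10-minute backfill window exists, Slurm can start this job with
# only 10 minutes of walltime, and it will NOT resubmit the job to make
# up the remaining time; --time-min is a floor on what Slurm may grant,
# not a guaranteed minimum that gets topped up later.
srun ./my_app --checkpoint-dir "$SCRATCH/ckpt"
```

Hence the checkpoint/restart advice: a job that may receive anywhere between its --time-min and --time, or be preempted partway through, needs to be able to save and resume its own state.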
A
If nothing's jumping out, then we might move along to talk a little bit about our user community of practice and the community survey. Lipi, would you like to tell us some more about this?
F
Yes, I would be happy to. I think you can actually skip this; I think something went wrong in the... yeah, there we go, okay, great. Okay, well, let me introduce myself: my name is Lipi.
F
I am actually a postdoc at NERSC, but I started out as a user: I was using NERSC resources to do my thesis work when I was in graduate school, and that's how I learned about NERSC and ended up applying for a postdoc. So if any of you are in that boat and are interested in learning about the postdoc opportunities at NERSC, I would be happy to talk about that. But also, because I was a user, I've been really interested in user engagement.
F
Now that I'm part of NERSC, we're currently in the process of wanting to create a really active community, in particular a community of practice. I think at last month's meeting Rebecca told you a little bit about what a community of practice is, and it requires a couple of things. First of all, it requires a shared domain of interest. So likely you are a NERSC user, but also, hopefully, interested in research computing and high performance computing for the purpose of doing science.
F
So many of us share that domain of interest. That's how I pivoted: I was doing my PhD in physics, and now I'm at NERSC learning about high performance computing and not doing quite as much physics, because I became interested in that. There's also an actively cultivated and maintained sense of community, and I think the key word here is really "active".

F
We have to be much more involved in how we're not only cultivating this community but also maintaining it: having different programs and events taking place within the community, so that people can practice that shared domain of interest and participate in it. And exactly that is the last thing: active practice of the shared domain of interest, so that the community is involved with creating or collaborating in some way, or sharing information, and this is happening often and in various forms. So we're in the process now of creating that user community of practice. Next slide.
F
Yeah, so one of the things that Rebecca had shared is that we are looking for a lot of community feedback. We have a lot of ideas; many of us were NERSC users. I was a NERSC user, and I have a lot of ideas about how, if these things had been available to me when I was just a user, my experience at NERSC, my ability to do science, my ability to use resources would have been better or different in some way. And so we want people who are currently in that position to give us feedback about a lot of different things. We're wanting to hear: what might a user community look like to you? What do you think might be missing from your current user experience?
F
What kinds of trainings and programs would help you feel like you're actively participating in this community of practice? One way we want to collect this information is through focus groups: the idea being that we want to gather people who are interested in talking to us over Zoom, in small groups, to discuss with us: what are your ideas? What are your challenges? What are the reasons this kind of involvement would not be helpful to you, or what are things that could be really helpful to you? So it'll be an opportunity to discuss things with us directly. But we also want to collect some feedback that doesn't require participation in one of these focus groups, so you could participate in our survey.
F
So, if you go to the next slide: we really need you to make this happen, and we're going to do kind of an in-class exercise. We don't want to ask people to spend time outside of this meeting to do this, because we know that sometimes that's a barrier to participation. So Steve has kindly allowed us to take a couple of minutes here during this meeting for everybody to go into this survey and complete it. Most of it is something that you can just provide a yes-or-no answer to.
F
There are spaces in there to fill in some information if you want to; there aren't any, like, paragraphs of information requested. And it's also a great place to let us know if you're interested in a focus group. So everybody go ahead: you can use the QR code, or you can use the link that Rebecca is putting in chat, and we're just going to let you fill that out for a couple of minutes, so that you don't have to worry about it later and we can hopefully get a ton of feedback right now.
F
I'm also happy to answer any questions that people have, either about the survey, or about user community engagement or anything like that, or about being a postdoc at NERSC, whatever. But I will let people fill that out, if there are no questions.
A
Thanks. Let me, I guess, first of all ask: are there any questions?
A
Would people like to either use the phone camera or click on the link, and we'll spend maybe about five minutes on the survey and then come back to see where people are at.
A
And
I
guess
yeah:
if
you
have
any
any
questions
along
the
way,
raise
a
hand.
F
And thank you again for taking the time to do this. This is really informative to us, because we want to make sure that the events, programs, trainings, whatever we're thinking about helping put together, are actually going to be useful to the users. So this is a really important part of the process for us. So thanks.
F
Yeah, so we're asking a lot of questions about current engagement, in the form of participating in, maybe, these meetings, trainings, Slack. You could check it out; you're absolutely welcome to. I think you should be able to click through it, because I don't think any of the questions are actually required. Otherwise I can even share it with you in another format, but the idea is, yeah, questions about current engagement.
F
We have a NERSC user Slack, so we want to find out: is the Slack useful or helpful to you? Do you use it? If you do use it, how do you use it? And then some questions just to find out who is filling out our survey, so we get a sense of who engaged with us, even in the survey. And then also some opportunities for people to think about: oh, if this program existed, would I participate? Would I be interested in it?
F
And I did put my email address in the chat. If anyone doesn't like surveys, which is valid, but wouldn't mind sharing your thoughts: I am in the NERSC user Slack, so you can message me, or you could email me if you have a thought. You could email anyone in any NERSC user space, or anyone at NERSC, and they can forward it on to me. If you have an idea, I'm open to hearing about it.
F
Thanks, Gregory, yeah, I agree. I think people get survey fatigue, and they also have a hard time scheduling it in on their own time. So we thought this would just help people do it, and then we'd have that great data. So thank you.
F
Steve, I think giving people maybe one more minute is a good idea, and then I think we should move on. Let's say another, just one minute.
A
Sounds good, we'll give another minute, and I guess if you haven't finished it by then, you can probably keep on going while we go through some announcements and calls for participation.
A
Okay, it's been probably six or seven minutes now, so hopefully people were able to get most of the way through it, and can continue either during the meeting or afterwards. Thank you all for participating and working through that.
A
So we have a handful of announcements and calls for participation. There are some that were announced in the weekly email, and you can easily go back and see those and click on the links. There are some that might be of particular interest to people: if you are a student, or have or know students, you or they might be interested to know that NERSC has a bunch of summer internships available.
A
So, yeah, we're looking for interns for the summer period. There's a list of projects and some more information at this link here, and these slides will go up on the web page afterwards as well. But if you go to the most recent weekly email, there are items for all of these. There's a couple of CFPs that we know about: the AY23 Research in Quantum Information Science on Perlmutter call for participation is now open.
A
A couple of webinars and seminars are coming up through ECP. The ECP IDEAS series has a talk on exascale particle accelerator and laser modeling on March 15th, and actually this link to the best-practices webinars also links to their previous webinars, and there's some really interesting content in there.
A
Another event that ECP is doing is an HPC Workforce seminar on strategies for inclusive mentorship. And on kind of a workforce note: NERSC has actually got quite a few positions open at the moment, and NERSC users often turn into really good NERSC staff. So we encourage you to take a look at the careers page (there's a link to it in the weekly email), and consider joining NERSC as a staff member.
A
Relevant to today's topic: we have some training and office hours around migrating from Cori to Perlmutter. There's a training session scheduled for March 10th (there's a page on that on the website), and we'll have office hours coming up, starting next week, for several sessions. Just before passing on to Rebecca to talk more about Cori's retirement and that migration: does anybody else have any announcements or calls for participation that other NERSC users might be interested to join or want to know about?
A
If not (and if you think of something along the way, feel free to drop a link in the chat), we might move on to our topic of the day, which is about Cori's retirement. Rebecca is the leader of user engagement at NERSC. Rebecca, do you just want to say "next" at the appropriate moments, and I'll move through the slides?
H
Oh, same thing, different day? Okay, all right. So everybody is here for this exciting topic about Cori's retirement. I'm going to try to give you all some background information so that you can understand everything that's going on and what our plans are for the retirement of Cori. So first we're going to talk about the life cycle of a supercomputer.
H
Then we're going to talk about why we're going to retire Cori, and the Cori retirement schedule, and we're also going to talk about Perlmutter. So that's sort of our outline. Next slide, please!
H
So this is basically an overview of the life cycle of a supercomputer. The first thing that happens is we design the machine (I'll go into a few more details about this in subsequent slides), but the idea is we've got to actually figure out what we're going to get and how we're going to do it. The next step is to actually build that machine, and that is primarily done by our vendor, but with sort of a collaborative approach as well.
H
So, as you may have noticed, we take monthly maintenances on our machines to make sure that they're still in tip-top shape for you all, and that they're still working. Eventually, at the end of the life cycle, we start thinking about retiring machines, and so we then retire them and decommission them, and then, after the machine is turned off, it gets recycled. Again, I'm going to talk more about all of these phases; this is the general overview. So, Steve, feel free to push it again.
H
That's my one animation in this presentation. I put the machines at the various different places where they are in this progression. So if you look down at the bottom right there, we've got Cori: we're in the operate-and-maintain stage, but we're getting to the retire stage on Cori. We are still in test-and-validate, and getting to the operation stage, for Perlmutter. And then I put N10; that stands for NERSC-10, which will be our tenth machine that we acquire. We are in the process of designing that one right now, while we're also trying to do all these other things with the existing machines. Next slide.
H
So the next step is to start building the machine, and the building of it begins in the vendor's factory. They actually will assemble the machine, then they'll test it a bit, make sure that it more or less functions, and then they will disassemble it and send it to NERSC. That's what they provide; on our side, we provide all the necessary power, water, cooling and so on for the machines.
H
Now, I used to work in Australia, and we got a machine there, and they actually shipped it on an airplane to Sydney, and then they shipped it on a road train (that's a truck with lots of long trailers behind it, like five trailers) to us in Western Australia. That was a pretty exciting journey, but in our case I think they mostly just come by truck. Okay, next slide, please.
H
Okay, so the testing part. I alluded to this before: testing actually begins in the factory. They do some factory tests, and often, under normal circumstances, we actually go and look at the machine while they test it, and we help make sure that it is at least providing the initial functionality that we would expect it to be able to provide in that environment, before they bring it to NERSC. So then, once they do bring it to NERSC, they reassemble the machine.
H
We test it a lot further: we do a lot of hardware, software and network testing, and we also let friendly users onto the machine. For example, we had an early science period with Perlmutter, where we let on everybody who was participating in our early science program, and they checked out the machine and kind of broke it in. They were what we call friendly users: they understood that maybe everything wasn't working quite properly, but they were there to help us too.
H
Now, the vendor: I wouldn't want to be a supercomputer vendor, because they have to put forth millions of dollars for all of the parts in this machine, and then, after we accept the machine, we pay them for it. So they're fronting a lot of the expense of these machines. There are some milestones where they'll get some partial payments, but the bulk of the payment comes at the end, when we accept the machine, and in order for us to accept it, it has to pass a lot of tests.
H
So we have functionality testing, performance testing, stability testing, reliability testing; we do very thorough testing, and this includes a 30-day stability test, where we have all of our users on there just pounding away at the machine, and it has to remain in service for a very high percentage of the time, with users on it, during this 30-day period, before we can pay them money. All right, next slide.
H
Now we've accepted our machine; it's all going well. When we're operating our machine, it's around-the-clock operation: we have staff on site 24/7, 365 days a year. There's somebody there on Christmas, somebody there at 2 a.m. on a Sunday; every day there's somebody there. And while we're doing that, we're also doing regular maintenance of our machine. We have actual on-site vendor staff at NERSC from HPE, and they perform physical maintenance of the machine.
H
So if there's a node that's gone bad or something, they'll pull it out and they'll replace it or fix it. They'll replace cables that have gone bad; you name it, anything physical, they will do it to repair the machine. And then, in addition, we do regular upgrades of the system software. For example, there may be security issues that we need to be sure to patch before they become a problem.
H
There may be bugs in the software that we will also upgrade or patch in order to fix those problems, and sometimes we actually get more functionality when we introduce new software onto the machine. All right, next slide. Okay, so let's stop for a second here and talk about reliability. When we get the machine, there's what we would call a shake-out period, where there's faulty new hardware that somehow wasn't detected, or things just fail.
H
You tend to find this in system reliability, which apparently is actually an area of study that people study, so that's kind of cool. They call it an "infant mortality" failure; I hate to use that term.
H
So I'm not going to say it again, but early on, parts will fail on a machine, and also late in its life parts will fail. So there are the early failures, then there are the wear-out failures, and then there are also just totally random failures during the lifetime of the machine. This is what's called the bathtub curve; if you look over here, there is a curve of the failure rates, and yeah.
H
Thank you. And so you can see it's kind of higher at the beginning, then it slopes down and it's pretty flat through the middle, and then it comes back up at the end. So in the middle period, the middle age of the computer, it has a pretty low, constant failure rate; but then, as time goes on, we start to get more of these wear-out failures, and so the failure rate goes up again. All right, next slide.
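The bathtub curve just described can be sketched numerically. This is an illustrative toy model, not data from any NERSC machine, and all parameters are made up: a decaying early-failure term, a small constant random-failure rate, and a growing wear-out term sum to a hazard rate that is high at the start, flat in mid-life, and rising at end of life.

```python
# Toy bathtub-curve hazard model (all parameters hypothetical).
import math

def hazard_rate(t, early=0.5, decay=2.0, random_rate=0.05,
                wearout=0.5, onset=8.0):
    """Failure rate at time t (arbitrary units)."""
    return (early * math.exp(-decay * t)      # early "shake-out" failures
            + random_rate                     # constant random failures
            + wearout * math.exp(t - onset))  # wear-out failures late in life

# Sample the curve at the start, middle and end of life: high, then
# low and flat, then rising again.
rates = [hazard_rate(t) for t in (0.0, 4.0, 10.0)]
```

The shape, not the numbers, is the point: mid-life is the cheap, reliable part of a machine's lifespan, and the rising right-hand side of the curve is what eventually drives retirement.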
H
Okay, so we retire our machines at the end of their useful life, because after a certain point, as you've seen with the bathtub curve, their failure rates really begin to rise, and machines become a lot harder to support and much less reliable. Another reason is that, as new technologies come out, they tend to be more energy efficient and provide more compute power for the same amount of energy. So that's another good reason that we like to get a new machine.
H
So once we retire the machine, we actually return it to the vendor; that's how we do it. And what they do with the machine is recycle it in some way. Sometimes they resell it: for example, my understanding is that there's a part of Edison that is in Texas now, having a second life. Other parts of it...

H
...they will use for spare parts for similar models that are still in operation. And then, if they can't do either of those things, they will just take out all of the valuable metals or other components, remove those, and recycle the valuable metals and things like that from the machine. All right, next slide.
H
So why do we need to retire Cori? Well, Cori has reached the end of its useful lifespan. You know how dog years work, like there are seven dog years to one human year? I think there may be something like 12 supercomputer years to one human year, so about one supercomputer year per month. So Cori is in its 80s or 90s, maybe even 100 years old, and so it's really reached the end of the expected lifespan of a supercomputer.
H
This model of supercomputer is no longer being produced. That means, of course, that the processors haven't been produced for years, and the memory also; but there are also other components, like the cabinet components (fans, electrical parts), that were all custom-made for this machine and are no longer manufactured. So if something fails, we have to rely on remanufactured replacement parts, assuming that we can even find them.
H
We also have observed more frequent failures in the individual components of the machine. It may not be as visible to you all, but occasionally lately we've been having whole cabinets going down because of the rectifier in them. It's an electrical part (I don't exactly know what it does), but the rectifier has been going down, and then we have to replace that, and that's very disruptive to users. That's another reason why the reliability is going down.

H
And failures from now on are becoming more and more difficult to recover from, because we don't have parts that we can actually put in the machine; we've run out. Of particular concern to us is the scratch system: there are no spare parts available for the scratch system, so if something fails in the scratch system, we could lose user data. We've already told you all that, so it shouldn't be a surprise to you. And in order to recover from a failure, we may end up internally repurposing parts of the machine.
H
So if something fails, we may end up having to shrink the machine, because we don't have that part anymore, and so we just have to shrink everything down; and we may shrink it even more, so that we can keep a spare part. It just kind of depends on what happens. Also, recoveries may take longer than they normally would, because, again, we don't have replacement parts.
H
So here is our current plan for retiring Cori. On March 31st we're going to remove all of the auxiliary components from Cori. The large memory nodes, those are nodes that we acquired in, I guess, 2020, that have a large amount of memory on them; those are still very useful, and they still exist.
H
It won't be a fast process, but that's our plan for those. And then there's also a GPU partition on Cori, just a very small partition of GPUs; we acquired those probably five years ago or more, and those nodes are just going to be retired, because they have obsolete GPUs that aren't as good as the GPUs that we have in Perlmutter. So we don't need to keep those at all; that part we're just going to retire.
H
Then, at the end of April, we're going to retire Cori as a whole; let's call that date T. We will let you have access to the scratch system for another week after T, but after that we'll power down the machine altogether, and then the next month we will start to remove Cori from the machine room. Next slide, please.
H
So let's talk about Perlmutter, because that's kind of the elephant in the room for everyone. We haven't yet completed the testing of our final configuration of Perlmutter. Perlmutter has 14 GPU cabinets, 12 CPU cabinets, and the network is Slingshot 11 (that's what SS11 stands for).
H
We finally reached that configuration earlier this month. I'm not sure when we're going to be able to start testing this final configuration, but it's going to happen soon. Now, we are not going to retire Cori until Perlmutter is thoroughly tested and working for users. I mentioned testing before, and we've already done some of it. Functionality: the system provides a lot of basic functionalities, so we can check that box. Performance: we were able to achieve certain performance levels on certain benchmarks, so we can check that box.
H
The two that are really remaining right now are stability and reliability. The system needs to remain up during our testing period (I told you there's this 30-day window where it needs to basically remain up), and then, for reliability, we need to have fewer hardware and software failures than what we're seeing right now. Next slide.
H
So we know that Perlmutter is not meeting our expectations, and it's not meeting yours either, yet. We understand how important it is that Perlmutter is a reliable machine, in order for you all to continue to make scientific progress. So we meet with HPE, that's our vendor; we meet with them every day to address bugs and issues. Right now we have some HPE experts at NERSC, today and this week, to focus on resolving these issues that we are all experiencing, and we're optimistic that this collaboration will improve Perlmutter's reliability. Next slide.

H
So we're working together with HPE to address the stability of the Slingshot network; we know that's not working right yet. The I/O performance on Perlmutter's scratch system and on our Community File System: those are also not working well. And the node hardware reliability: we have also seen some issues there. In this collaboration we have developed some new processes; we've developed some new methods, so as to have a methodical process.
H
We've made some configuration changes to the Community File System, and to the CFS client on Perlmutter, in order to stabilize the network communication and performance. Some of those things seem to be panning out for us.
So that's good too. We're also rolling out fixes for some Slingshot network bugs that we and HPE discovered and that HPE has been able to fix. We're rolling those out this week and next week, and we're confident they will make a big difference in the performance of the machine.
So, in summary: in the supercomputer life cycle, Cori has reached the end. No new parts are being manufactured for it, which makes the upkeep especially challenging, and we plan to retire Cori at the end of April. Perlmutter's reliability issues are being addressed at top priority by us and our vendor HPE. That's all I have at this time; I'm happy to answer any questions.
Hi, Alex Copeland here from JGI. I'm glad to hear you say that Perlmutter is not meeting your specs with respect to reliability; that has certainly been a question for all of us trying to use the system. So you have these aspirational goals for what you hope to address on Perlmutter to get it up to spec, and in contrast you have what seems to be a hard calendar deadline for the retirement of Cori. I'm wondering whether that deadline is actually also more aspirational, and whether we should understand it that way.
So what we're saying is that we really want to retire Cori on April 30th; that's our goal. But the prerequisite is that Perlmutter has to be in a state where we feel comfortable retiring Cori, so if we have to, we will extend that date. I guess the reason we provide the April 30th date is so that you can know Cori is guaranteed to be there through that date.
So if Perlmutter were suddenly perfect in every way tomorrow, we wouldn't retire Cori the next day; we would wait until April 30th. Does that make sense?

It does, and if I can ask a follow-up: the measurements or metrics you've discussed for reliability and stability on Perlmutter seem a little bit squishy, a little vague. For example, your last comment, about Perlmutter being perfect tomorrow: from the point of view of a user, that doesn't seem like something you could possibly determine in a day. We would like to see stability measured in months, not in days.

Absolutely, absolutely; that was purely hypothetical. No, so we have a really big, long document. How many pages is it, Tina? Hundreds of pages, I think.
We generally want to see some stability before we start that test, because we want the expectation to be that the system is going to pass the 30-day test. During that time there has to be at least a seven-day window where the system does not have any failures, other than maybe a node falling out or something along those lines, but nothing that isn't recoverable or that takes out the majority of the system in what we would call a system-wide outage. That is a term we use when some portion of the system, as defined in the SOW (the statement of work), fails; it's not like the whole machine has to fail for it to be considered a system-wide outage. So somewhere within that 30 days we have to have that seven,
H
Seven
days
of
reliable
running
as
well
as
there
is
a
Only
One,
system-wide
Outage
allowed
during
that
time
frame
the
understanding
you
know
as
we
bring
in
these
systems
that
are
new
we're
using
new
technologies
that
we
aren't
expecting
them
to
be,
maybe
as
stable
as
they
will
later
on
in
their
life.
So
we
that's
why
we
have
the
ability
for
them
to
have
one
system
white
outage
during
that
30-day
window.
That
number
decreases
as
the
age
of
the
system
increases
based
on
our
requirements.
Thanks, Tina. So does that help, Alex? Well, it's certainly moving in the right direction. I actually have some other questions, but I'd like to give other people a chance if they have questions. Sure, okay, thanks. Anybody else have questions?
Okay, I see Vivek has his hand raised; go ahead. Hey Rebecca, thanks for presenting this. I was just curious: are there any suggestions or maybe recommended best practices for users who rely on Perlmutter as a development resource during these kinds of periods of downtime? Maybe some kind of workflow that replicates the conditions on Perlmutter, so we can be productive during the downtime?
That's an excellent question. I might defer this to some of my other colleagues who are also here. A lot of things are quite similar from Cori to Perlmutter in, say, the user environment; of course the architecture is different, and if you are using the GPUs on Perlmutter, obviously that's different. I think, maybe, are you wondering in the realm of performance, how you can gauge the performance of your code? Is that what you're trying to think about, or what specifically?

I mean, there are certain codes that I would love to test.
For example, on a single node of Perlmutter, because the CPU on Perlmutter is a nice fast CPU with lots of memory and I can't really do that on my laptop. But during these periods of downtime I can't do that. So if there is some kind of option, it would be great to know about it, and if there isn't, a suggestion might be:
if there were some dedicated set of nodes set aside just for development purposes, not for large-scale runs, that would be really useful as a user.

Okay, so your question is mainly about mitigating the downtimes? Mitigating the downtimes, and whether there's an option, because it's not like I need more than one node; I just need the resources of a single node.
So if there were some kind of option to get that, and especially something that replicates Perlmutter's environment, so I don't have to spend time changing things when I do a production run and then switch everything back for Cori. You understand?

Yeah, I think that's a really good suggestion that we'll have to think about. Some of these maintenances are things that affect the entire
spine of the system; this one in particular, we're working on the thing that runs the whole network for the system. When those things happen, we end up having to take the whole machine away. But I think we can definitely look at options for keeping up the portions where we're not doing brain surgery, basically.
That would be really useful for the people who rely on it for development. And I wanted to say thank you for giving high priority to addressing some of the issues that have been going on with Perlmutter; appreciate it.

Great. So, just looking at the time, we are at the top of the hour; however, I suspect that people still have a few more questions.
So if people need to move on to your next thing, then please feel free, and thanks for joining us today, but we'll keep the meeting going for a little bit longer to extend the Q&A opportunity.
If you have a question... Yeah, I have a question, actually. So, for Cori, the CPU part is larger than the GPU part; the CPU portion is large and the GPU nodes are, I think, not that large in number. On the new machine, can I expect the GPUs to be the major portion and the CPUs to be the smaller portion, or are they kind of mixed? Because some of the codes will not run on GPU and some of the codes will not run on CPU, right?
So, yes, I believe there are more CPU nodes than GPU nodes on Perlmutter, but in terms of the number of cabinets, we have 14 cabinets of GPU nodes and 12 cabinets of CPU nodes. Because you can fit two CPU nodes in the same space as one GPU node, and there are four GPUs per GPU node, there are overall more GPUs. Correct, that's true, there are more GPUs than CPUs, but in terms of the number of nodes... yeah, that kind of depends on how you think about it, I suppose. Okay.
Okay, so the data transfer nodes are independent of Cori, but they do have the Cori scratch file system mounted, which may be why you're thinking that they could be related to Cori. They currently do not have the Perlmutter scratch file system mounted on them. Ultimately, I think we would like to have Perlmutter scratch on them, but I don't think we have the time at this point to be able to do that, so for now they won't have it. Tina?
Can you comment? I think there are some technicalities there for trying to mount the file system outside of the Perlmutter network, so I think we're looking at other options, like maybe bringing some DTNs inside the Perlmutter network or things like that. We haven't quite gotten to a firm solution on it yet, but we are looking for some ways to improve access to the file systems from external nodes.
This is a question about using Perlmutter's GPU nodes. Looking at our usage this year, I was wondering if there are any plans for a shared queue on the Perlmutter GPU nodes; there is a shared CPU queue right now, to my understanding. The motivation behind this is that we've noticed that even with GPU codes, sometimes there are calculations that won't scale to using all four A100s, and I think it's a shame to have the rest sitting around.
Yeah, I can probably talk a little bit more about that. We had an issue with the ability to actually partition the memory on the system, so a user could overrun their allocation of memory, which could impact other users on the node. That required some fixes from the vendors, and unfortunately it was one that crossed multiple vendors, so we have been working with them.
We have a fix, but it's not going to be fully implemented until a little later. The dates keep shifting on when the releases are coming out, but it should be here before June, somewhere in that time frame.
Yeah. All right, any other questions? If there's nobody else, then I'd like to go back one more time to the reliability question, if we have time for it. Sure. So, given the reliability of Cori, and I haven't looked at this data in a while, but I think the last time I plotted it, there was only one month in its entire lifetime that didn't have an unscheduled outage; that was February 2022, I believe.
I wonder about this 30-day test that you have planned for Perlmutter, especially given the fact that the system is similarly complex and similarly uses new and untested technology: is that 30-day test window really going to be enough to give you confidence that the system is going to be reliable well after the test is over and passed, and is there any flexibility to change it and make it a longer time window?
Given the newness of these systems, there is always going to be, I think, especially in the early phases... I'm actually a little surprised at the numbers; I'll have to go back and look, because I thought we had more time than that without unscheduled outages on Cori. But no, we do not have that flexibility, because we've already signed the agreement with the vendor as far as the time frames for those tests.
So we cannot extend that date, but we do work constantly on trying to improve the reliability of these systems.
We're hoping to get Perlmutter to be definitely more stable over time. I think the advantage that we will have over Cori in the longer run is that we do have the ability to do rolling upgrades on the system. So our hope and expectation is that, as we stabilize the system, we will be able to do even our upgrades without taking the system away from our users. Part of what we put into our SOW is to be able to do those kinds of rolling updates.
So, one last remark, and that is that on your slide 16, where you describe the process for selecting a new system, I was struck by the fact that, at least as far as I could see, there wasn't something that I recognized as basically getting input from the users. It seems to be mostly communication between you and the vendor.
I just wonder if that was different, and you did specifically get some sort of user input into this process. I bring it up here because this is, after all, the user group meeting. I wonder whether you wouldn't get a strong request for things like reliability as a primary objective, something that you spec out in the beginning more rigorously, with tougher requirements for the vendor, so that those constraints end up balancing
whatever else is pushing for throwing some new and spiffy tech into the system.

So, I think I probably started here at step one instead of step zero, which of course is getting user requirements. We have regularly done requirements gathering from the users, and then we kind of translate that into our requirements for the machine.
Okay, I don't recall having seen that. Maybe it's in some part of the user survey, or...
We have had, and we published, I guess... We have gone through three rounds with all the different program offices in the Office of Science and held what we call these requirements reviews. The last set also included Oak Ridge and Argonne and was called the exascale requirements reviews, and we do draw heavily from the findings in those reviews when putting together requirements for a next system.

I guess I was thinking of something a bit more grassroots: that you ask your user group, and people that the user group is in contact with.
Or ask the users of the system directly through one of your surveys. But this is really just a comment, in my opinion, a suggestion; take it or leave it, I suppose.

So, I understand what you're saying: we could put in a cluster that uses known technologies and is very stable. But part of what these supercomputer centers do is try to make sure that the US is keeping up in the technology realm.
It does result in some instability, especially in the early phases of these systems, and we do our best to try to stabilize those as the system ages. But that is part of what we do: we try to bring in newer technologies that are still in their relatively new phases, so that we can help the vendors get them to become more reliable and useful.
You know, I was just talking with the NERSC-10 team, which will be our next system team, and they are in the business of putting together the draft technical specs for that system. Those are just drafts, and we will schedule a time to present those draft specs to the user group, and then people can comment on them and provide input in that way.
All right, any other questions?

Sure, if the NERSC staff isn't running off, I'd like to voice one of my concerns regarding the retirement of Cori, and that's the loss of the workflow nodes.
I understand that scrontab is supposed to be the solution. I'm concerned that there are at least two known issues. Unless it's been fixed recently, there's the issue where one failing instance basically cancels that line, comments out that line in our scrontab. So any of the reliability issues that you've been poked on today can end up causing me to lose all my automated runs over vacation, because I wasn't around to edit my scrontab to remove that comment.
The other one is the lack of time zone support, and the fact that we have to do things in Coordinated Universal Time, which does not align with daylight saving time. If the US drops daylight saving time, great, my problem is solved; I subtract seven or eight hours, whichever it is. But the inability to have the time of scheduled jobs adjust seasonally to the work day of our team is annoying, and it's going to require either semi-annual edits, where
somebody has to go in and change the scrontab to deal with it, or writing some sort of weird kludge that starts the job an hour early and does a sleep 3600 for part of the year so that it can start at the same uniform local time. Just having regular cron on something like the Cori workflow nodes would fix that for me. I think it was mentioned earlier that having some dedicated subcluster for things like the data transfer nodes, Globus endpoints, things like that, is also something you're looking into, so I'm
just wondering: is it possible to bring back the workflow node concept that, for me, has worked very nicely on Cori? Sorry, very long question.

No, that's a great question. I'm hoping Lisa can maybe... Yeah, maybe I can respond. Yeah, Paul, thanks for the question, and I know that there have been some really painful behaviors with scrontab as it gets rolled out that I run into as well. The one you mentioned in particular, canceling the whole job, is one they're repairing.
We actually have a couple of bugs open with Slurm, with SchedMD, about this behavior, because it's not the right behavior. And actually next week our queue committee is meeting to talk about how we can implement some solutions for that particular problem, so I think that should get better.
Ultimately, we are going to look at some kind of longer-term solution, perhaps involving containerizing user instances, that would handle these sorts of long-running things. I think that is going to be actually kind of exciting, kind of forward-looking.
But in the meantime, we're going to work on making scrontab less painful to use. So I definitely hear you about that, and about the time zone.
Yeah, as I said, there's a "sleep 3600 depending on the current time zone" hack that I'm aware of, but I haven't had the heart to go in and implement something that disgusting quite yet.
Right, yeah, we're definitely looking for solutions in that particular area. The time zone one is actually pretty tough, because the insides of Perlmutter are all in that time zone and Slurm has to talk to it, and it somehow doesn't translate well from the outside.
So we're still working on a solution there. But regular cron works just fine: I have crontabs on machines at Oak Ridge that work just fine scheduling everything in Pacific time for me. It's not like the technology to convert doesn't exist, but I understand it may not be at the top of SchedMD's priorities, and it's not for NERSC to fix; it is definitely for that vendor to fix. I understand the roles there.
Yeah, and, let's see... I mean, the reason we're going with scrontab is that it offers a lot of advantages, in that your stuff will always work even if the favorite node where your crontab lived is offline. You can run into the same problem where, if you're off on vacation and cori21 goes offline for some reason and we have to take it out, all of your crons are going to be disabled and we won't really have any way of moving them. Whereas with this one,
the batch system should move them automatically to running nodes. So I think it offers a lot of advantages for users, but I do agree that we have a long way to go. Right now it's like in the uncanny valley of cron, where it's like cron, but not enough like it, so the differences really stand out. Yeah, and I personally am holding off as long as I can on migrating
my nightly regression testing from a crontab on cori21 to a scrontab on Perlmutter, because I'm hoping that these issues can get worked out. But yeah, come April, I'm going to have to do it. All right, well, thank you.
I see Leon Sean has a hand up. Yes, this is Benson. So, my question: is it possible to use container technology, so users can move their codes from Cori to Perlmutter?
Perlmutter's CPUs are compatible with Cori's Haswell nodes, at least in terms of instruction set. I'm less sure about the specifics of the operating system, because there are differences in things like the networks and so on. It could be that things just work because of dynamic linking.
But I haven't tested it. I noticed recently that there is no Intel compiler on Perlmutter, I think, for all users; I mean, I had to download it for myself and install it in my home folder.
So that might be useful, right? Okay, so access to the Intel compiler is a factor in the difficulty of migrating? Yeah.
Your staff told me that the Intel compiler is not bundled in your system, or something like that. Also, yeah, we don't have a programming environment for the Intel compiler. I believe the oneAPI compilers can be downloaded and installed in $HOME, but I haven't explored it too closely.
All right, Francois.

Hi, yeah, thanks so much for the presentation; it's actually very useful. I have a question regarding something you had on your slide, which says that the performance part of the acceptance of Perlmutter has been completed. So I assume there have therefore been benchmarks run that show the actual performance, and by actual I really mean what can be reached, as opposed to the nominal performance that we see in the specs. Are these benchmarks available? I'm thinking in particular about the interconnect bandwidth, because those are numbers that would really help guide us in what strategy we will take: what kind of specialization to make, whether we want to work on single node, single node multi-GPU, or multi-node, and such decisions.
Well, actually, technically we have not done all of the benchmarks, because we haven't started acceptance testing yet. I think we do have some numbers, but I'd have to go look to see what we have available. Okay, so, Francois, why don't you just submit a ticket and we'll try to figure it out, because I don't have the answer to your question right now, I'm afraid.

Yeah, actually, I do have a pending ticket.
It's been there for some time, where I specifically ask about the reachable bandwidth of internode communications between GPU nodes, which, according to the diagrams on the website, should be 100 megabytes, sorry, 100 gigabytes per second, but I don't know if that's actually reachable. Okay, okay, well, I'll ask people to follow up; we should have some numbers for that.
Okay, John.

Yeah, just following up on the question about the Intel compilers. We had an exchange in the chat, but for people who didn't read that: we had an issue with a code that had vectorization directives that weren't being respected by any of the existing compilers on Perlmutter, and we found that using the Intel compiler gave us a three times speedup compared to the supported compilers, and this is for a CPU code.
I realize Intel is not supported, but it seems like the existing compiler suites are sorely lacking in vectorization, because these are explicit directives; this wasn't asking anything especially hard of the compiler, it was just ignoring them. So we built with Intel, and of course you have to use the MPI that comes with it, but otherwise we got a factor of about three speedup for our CPU code on Perlmutter.
We were told that, because these are AMD chips, Intel is not supported. Right, we certainly do not have plans to support the Intel compiler at this point, but it's not that we're totally closed to making any changes. It's just that compiler support is a combinatorial issue; it explodes what you have to do. Maybe the existing compilers... So, the question from Paul in the chat: what does the Intel compiler vectorize?
We have vectorization directives that tell the compiler to use the vector registers, to loop over, say, four floating-point operations within one register. You look at the diagnostic output from the existing compilers and they just say "unsupported directive" or "ignoring this directive", and you run it with the Intel compiler and it actually uses them. And so there's a three to four times speedup, because you're getting more calculations done each cycle, as it's
packed together in the register. I believe the Perlmutter registers aren't as wide as the Cori ones, but there are still vector registers there that are ignored by the compilers. Okay.
Well, Perlmutter's CPUs have got AVX2, which is the same as what Cori's Haswell nodes have got; they don't have the AVX-512 that the KNL nodes have. Right, right, exactly. But you can still vectorize over them, and the compilers, of course, are not identical between Cori and Perlmutter. So, for whatever reason, the code does not compile with that support on Perlmutter.
Right, so none of them succeeded in supporting these directives, but Intel did. Do you know whether that was the Intel classic compiler in particular, or did the Intel LLVM one work as well? Pretty sure it's the LLVM one; I'd have to check, but it was basically just the latest Intel installation package, including MPI.
Yeah. Did you mention that you have a ticket open? It's probably closed now; we had a long chat with one of the consultants, and they weren't able to really help with the existing compilers, and they said they can't support the Intel compiler. We finally tried the Intel compiler because one of the code developers said that they had good success with Intel on these vectorization directives. And just on Paul Hargrove's comment, yeah.
Sorry, that should have said CPU, not GPU, because these are NVIDIA GPUs; it should have said AMD CPUs. On the SIMD question: the Intel classic compilers at least include a disclaimer in their documentation, which most users don't see and aren't aware of. They actually had to put it in because of a lawsuit, because their compiler does not do some SIMD-type optimizations on AMD, so I was actually kind of surprised to hear you state that it did do it.
But then you said it's the LLVM-based one, the newest one, in which case that warning does not apply; it is definitely related to their classic compilers. So I mixed up at least two things: I said GPU in that text where it should have said CPU, and also, if this is not the classic compilers, then that disclaimer is not present. Yeah, I can dig into that. Sorry, go ahead. Oh, I just remember from back in the day,
there was actually that lawsuit, because the compiler would check what type of CPU it was running on and then would not optimize if it was not an Intel CPU. Exactly, yeah. Even though the CPU was capable of accepting that optimization, it still wouldn't do it. And I think that's still the case with MKL, and there is a module on Perlmutter for enabling a workaround for it, basically tricking MKL into thinking that it's running on an Intel CPU.
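As a historical note (hedged: this applied to older MKL releases, roughly 2020.0 and earlier, and the variable was removed in later versions), the commonly cited workaround of this kind was an environment variable forcing MKL onto its AVX2 code path regardless of CPU vendor. A sketch of what such a wrapper module might set:

```
# Force older MKL releases to use the AVX2 dispatch path on non-Intel CPUs
export MKL_DEBUG_CPU_TYPE=5
```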
Yeah, you know, I can open a new ticket describing our experiences in more detail. I can't navigate quickly enough into what I had done several months ago to specifically answer all the questions about what the directive was, or be 100% sure on which version of the Intel compiler suite it was, without taking another look. But just to comment, our preference, of course, would be to see the supported compilers doing a better job on vectorization. Yeah, thanks for that feedback.
All right, do we have any more questions? So, I'm looking at the time and we've hit 12:30 now, so we've gone a fair bit over our normal time. Thanks, everybody, for a great set of questions, and thanks especially to Rebecca and Tina, who I think needed to drop out, and to Richard and Lisa for a lot of help with answering some of these questions. Maybe for further questions
we can continue the discussion asynchronously via Slack. Right, yeah, thanks, everyone. So, thanks all again; we'll wind up here and we'll see you at next month's meeting. Thank you, and thanks, Rebecca, again for your presentation describing the life cycle of a supercomputer.