From YouTube: NUG Monthly meeting 15 Oct 2020
A: Those of you who joined last month's meeting, or read through the entire message in the emails, will know we've gone to a slightly different format from the previous webinar format: we're going for a much more interactive format.
A: So please participate: raise a hand or just speak up. I think most people are able to speak; keep an eye on the participants list. Just a heads up, we're recording the Zoom, and we'll put the recording up with a link to it via the webinars channel and on the www.nersc.gov page about the monthly meeting. I'd also encourage people to chat away in the NERSC Users Slack.
A: We have a #webinars channel that's intended for this. The advantage of using the Slack is that we can continue the conversation beyond the Zoom meeting, and also sort of record things; after the meeting, in the next day or so, I'll add a summary to that channel of interesting things that came out of this meeting.
A: So our regular agenda for this meeting: we start out with a Win of the Month and then a Today I Learned section (we'll explain those in a minute), go through some announcements, and then the topic of the day for this month is going to be the cscratch1 crash, which I'm sure captured people's imagination and attention at the beginning of this month. Then we'll just go through some upcoming meetings and last month's numbers.
A
So
the
idea
of
the
win
of
the
month
section
is
for
our
users
to
show
off
an
achieve
achievement
or
shout
out
an
achievement
of
somebody
else
that
you
know
of
you
know
things
like
how
to
having
a
paper
accepted,
solving
a
bug
that
had
been
giving
you
some
grief
yeah.
It
can
be
something
quite
big
like
a
scientific
achievement.
Yeah.
A: This is a good source to nominate something as a candidate for a science highlight, or for the High Impact Scientific Achievement award or the Innovative Use of High Performance Computing award. What we're interested in hearing is what you did and what you achieved: tell us your success, and what was the key insight that came from it?
A: I think people can just unmute themselves and speak. Does anybody have a Win of the Month they'd like to show off?
A: So I have one, as a bit of a shout out to one of our users. The new docs.nersc.gov site (I guess it's not that new anymore) is hosted on GitLab, and users can contribute: if you've got something missing or a correction to make, you can make a merge request and submit it. And during our cscratch issues, our first user-contributed merge request came in through GitLab, so a shout out to Tauren Bechtel,
A: and I hope I'm pronouncing that correctly, who spotted some missing information on our current issues page and posted a merge request that we were then able to just merge in. That was a timely contribution to the docs, and we encourage everybody else to do the same when you see things that could be added to or corrected in our docs. What's the site? nersc.gitlab.io; I'll paste that in the chat.
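For anyone who hasn't contributed before, a minimal sketch of that workflow (the repository URL and branch name here are illustrative, not the exact NERSC setup):

    # Fork the docs repo in the GitLab UI, then:
    git clone https://gitlab.com/<your-username>/nersc.gitlab.io.git
    cd nersc.gitlab.io
    git checkout -b fix-current-issues      # topic branch for your change
    # ...edit the relevant Markdown page...
    git commit -am "Add missing info to the current issues page"
    git push origin fix-current-issues
    # Finally, open a merge request against the upstream repo in the GitLab UI.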
B: Hi, this is Koichi. Can you hear me? Hi Steve. I don't have any particular, you know, innovations, but this one's a small thing I just started doing more: it's actually using the NERSC Users Slack.
B: I've been using it often in the past several weeks, and I found it quite nice to see what's going on in the community, particularly when I have some issues, like logging in to the machine; then I just look at the Slack to see how other people are doing, and whether it's a problem unique to me or everyone's having it. So I wonder how many users are using this NERSC Slack, and do they usually use it inside a web browser, or is it more typical to download the app?
A: I'd be interested in what other people do, but personally I use the Slack app. I quite like it, and I have quite a lot of Slack organizations and channels lined up on it. It is good to see that the general channel and a few others seem to be reasonably active.
A: I heard that, apparently, a lot of the activity is actually in direct messages, you know, private messages between users. So users are finding it a good way to communicate amongst user teams, basically.
C: Yeah, I'm Pieter Maris, and we're using the Slack channel as a private channel among NESAP users a lot, for hackathon preparation for Perlmutter; I'm a PI on a NESAP project for Perlmutter.
A: It takes a bit of a run-up, but you can have quite a good effect with these intense activities. Yeah, that's true.
A: That's great, thanks. Okay, so then the other side of the coin is Today I Learned. This is an opportunity to talk about something that happened, or that you discovered, that surprised you and that might be of benefit to other users; incidentally, it might also give us some tips for improving our documentation, things that we could call out or make more obvious. This doesn't have to be a success.
A
This
can
be,
and
I'm
stuck
on
something
yeah
so
yeah,
something
that
you
got
stuck
on
a
dead
end,
something
that
you
really
thought
was
the
case
and
on
further
debugging
turned
out
not
to
be
true
and
that
that
leads
to
a
tip
new
tips,
you've
discovered
for
using
nurse
or
or
or
something
external
that
you've
discovered.
That
might
benefit
other
nurse
users.
You
know
a
good
presentation,
for
instance,
yourself.
D: So this... When we had to move to the GPFS file systems because of the cscratch issues and whatnot: we have one file system dedicated to JGI, but it's basically a read-only file system, and I found that reading from that was about 10 times faster than reading from projectb or from the community file system.
D: There was, you know, a clog, maybe, in projectb or the community file system that was causing the I/O to be much, much slower than what we were used to seeing on Lustre. But the read-only file system was about the same speed as Lustre.
A: Interesting; that's a good tip. And that's one of the JGI-specific file systems, right?
D: Right, right: /global/dna is our read-only file system. You can't write to it from the nodes, but you can read from it, and it was much, much faster than projectb.
A: Yep, that's a good tip. I guess the equivalent, in a way, across the wider NERSC community: if you are building software for your group to use, the global common file system (the /global/common/software/<your-project-name> directory) is similarly mounted read-only on the compute nodes, and it should give you a little bit of performance when starting your software, particularly on lots of nodes. I think part of the reason is to do with the better ability to do caching on the nodes.
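As a rough illustration of that pattern (the project name, Python version path, and package below are placeholders, not a NERSC-prescribed recipe):

    # Install a package your whole project can share into global common:
    module load python
    pip install --prefix=/global/common/software/myproject/env mypackage
    # Jobs then pick it up read-only on the compute nodes:
    export PYTHONPATH=/global/common/software/myproject/env/lib/python3.8/site-packages:$PYTHONPATH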
D: Somebody said something in the chat to that effect: if the GPFS file system is mounted read-only, then it can use all the servers to distribute the load, which makes a lot of sense.
A: Could we use more read-only file systems?
E: So, if you don't mind me jumping in, I can at least speak to it a little bit. The reason this works well for JGI is that JGI has one place that can write to that file system, and we've communicated very carefully that no one should ever change a file that might be accessed in a job that's running. That's because of this very aggressive caching, and the non-awareness that a given I/O thread will have relative to another one.
E: It is a lot faster; it's much, much faster. POSIX I/O is damaging in some ways to HPC performance, so it's worth it, but it takes a great deal of care and planning and preparedness. I think it's a great idea; I've just always been confused about how to best communicate it. It's been a little easier with JGI, but yeah.
F: So, you know, [Globus] is a web service out of Chicago or something for moving data between, you know, scratch and CFS, for example, that's far faster than rsync or just copying. But a limiting factor until very recently was that it didn't work with a collaboration account.
F
It
only
worked
with
individual
accounts
because
of
how
you
have
to
authenticate,
but
yesterday
lisa
got
us
set
up
so
that
we
can
now
use
our
desi
collaboration,
account
to
authenticate
with
globus
and
move
the
data
around
with
a
custom
endpoint,
and
so
that's
really
great
and
opens
up
more
possibilities
for
us
to
use
it
for
doing
productions
between
the
two.
So
thanks
lisa
and
if
you're
a
user
and
haven't
used
globus
to
move
data
internal
to
nurse,
consider
it.
A: Cool, that's a really good tip, actually, even for users who are not using a collaboration account. The mechanism of using Globus to move data around, even internally to NERSC, is very valuable. Globus is a really nice tool, actually; I quite like it. It's great for moving stuff around between sites as well, and being able to set up a kind of fire-and-forget for a large transfer is pretty handy too. Thanks, Steven.
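A minimal sketch of such an internal transfer with the Globus CLI (the endpoint search string, IDs, and paths below are placeholders, not the exact NERSC endpoints):

    globus login                          # one-time browser-based authentication
    globus endpoint search "NERSC"        # look up the endpoint ID by name
    # Fire-and-forget copy from scratch to CFS through the same endpoint:
    globus transfer --recursive --label "scratch-to-cfs" \
        <endpoint-id>:/global/cscratch1/sd/<user>/results \
        <endpoint-id>:/global/cfs/cdirs/<project>/results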
B: Can I ask a kind of question about Globus? [Sure.] This is Cherien. So I'm using Globus to transfer files externally, from other systems to NERSC, and during this transfer, if I look at the event log on the Globus website, it sometimes shows some errors, like a timeout error, or "file system does not allow append." But in the end, when I just wait long enough, all the files are copied and there are no problems in any of the files.
B
So
eventually
these
are
doing
nothing
from
my
side.
It
just
works,
and
I
am
just
curious.
Then
what
are
those
error
messages
and
it's
coming
looks
like
it's
coming
from
the
nas
endpoint,
and
this
is
the
case
when
I
use
globus
to
mask
hpss-
that's
hpss,
endpoint
yeah.
I
wonder
if
this
is
common.
I
did
transfer
quite
an
amount
of
data
in
the
last
few
weeks
and
many
times
I
see
these
errors,
but
eventually
it
just
works.
Fine.
I
wonder
if
that's
the
case
for
other
people
or
most
of
the
users.
G: ...But because of the way that HPSS works, the software, you can't do that: it's a single stream, and so you can't resume an interrupted transfer that gets blocked, for HPSS, for instance.
G
So
one
of
the
things
that
we
recommend,
if
you're
trying
to
transfer
lots
and
lots
and
lots
of
data
to
hpss,
is
that
you
do
a
two-step
transfer.
You
go
to
cfs
or
scratch
first
here
and
then
use
hsr
h-tar
to
put
them
in.
But
that's
you
know,
that's
if
you're
doing
like
lots
and
lots
like
terabytes
and
terabytes
hundreds
of
terabytes
of
data
transfer,
because
these
resumes
can
be
kind
of
bothersome
for
large
data.
G
But
otherwise,
if
it's
just
a
small
amount
as
as
you've
seen,
globus
will
just
try
again
and
it
just
takes
a
little
bit
longer
and
it
eventually
gets
in
there.
A: So I think I got two take-homes from that. One was that when you see these messages, don't panic: the transfer succeeds in the end; it's not broken. The other is that if you're seeing them a lot, which is more likely to happen with really large transfers, and particularly to HPSS, doing a two-step transfer, where you first transfer to cscratch or CFS, should avoid them.
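For reference, the second step of that two-step pattern might look like this (archive and directory names are made up for illustration):

    # After Globus lands the data on scratch or CFS, bundle it into HPSS:
    cd /global/cscratch1/sd/<user>/incoming
    htar -cvf run42.tar run42/        # create a tar archive directly in HPSS
    hsi ls -l run42.tar               # confirm the archive arrived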
A
So
I
guess
one
more
today
I
learned
from
the
nurse
side
of
things,
and
I
guess
this
was
well.
I
certainly
learned
it
and
I
guess
others
did
as
well
when
we
were
working
with
corey
in
read-only
mode
and
yeah.
I
think
that
our
users
helped
a
bit
in
discovering
this
yeah.
We
we
found
a
few
things
where
we
have
dependencies
on
scratch
and
in
at
least
some
of
the
cases
we're
able
to
mitigate
those
some
some
of
them
were
yeah.
A
We
can
flag
to
look
at
in
more
detail
so
yeah,
for
instance,
shifter
use
a
c
scratch
for
some
of
its
staging
of
images
and
that
one
that
one
will
take
a
a
little
bit
more
of
a
run.
Up
to
you
know
to
work
around.
I
guess
for
time,
says
c
stretches
not
available.
I
think
you
know
a
few
people
noticed
when
we
first
went
into
the
kind
of
debug
mode.
Where
see
scratch
was
unavailable.
There
were
some
issues
with
logins
hanging,
and
our
systems
group
was
able
to.
A: So we have a few of these from the NERSC side, and then we'll kind of open the floor for announcements from users. This is an opportunity: if you're hosting a conference, for instance, or a meeting or event that other users might be interested in participating in, this is a good opportunity to announce it. First of all, there are a few that were in the weekly emails; you should be able to dig these up in the email or on the announcements page at www.nersc.gov.
A: Those of us who were at last month's meeting would have seen Zhengji's presentation about checkpoint/restart on Cori, and during that she made the first announcement of a new conference, the First International Symposium on Checkpointing for Supercomputing, SuperCheck21. There's a CFP for that, and you'll find links to it through the weekly email. Also, if you're interested in being part of the HPC community in a wider sense:
A: The SC21 steering committee has a call for planning committee volunteers that's currently open. And we have a very new announcement (we'll send something around with a little more detail shortly): we'll have a pause of the job queue, and cscratch will be temporarily unavailable for a few hours on Monday morning, while we add a new ADU. We'll actually talk a little bit about what that means very shortly; it's to help with debugging, following the outcomes of the file system crash that we experienced at the end of last month.
A: Another big one: I think you've probably seen this a little bit already on the NERSC Users Slack, and it's also been in the weekly emails, and this will soon become the default, but we really encourage people to try it out. We have a new help portal for our ticket system. It gets you fairly quickly and easily to tickets, to documentation, and to common requests. It's got a fairly decent search bar, a much more user-friendly interface, and when you do open a ticket,
A: it goes through to a much more usable form for doing so. So we think this is a great improvement, and we're really keen to have users try it out and tell us any issues or glitches that you notice. We're hoping to make this the live destination for help.nersc.gov quite soon now. It's at nersc.servicenowservices.com/sp ("sp" for service portal); probably easiest is the link: there is already a link to it in the general channel.
A: That's all the ones that I know about at the moment. Does anybody else have any announcements or calls for participation they'd like to make?
A: Okay, if not, then we'll go on to our topic of the day. I think this is a very topical one that people will be interested in, and I'm going to pass over to Doug Jacobsen, who's the leader of our Computational Systems Group.
A: He heads up the group that does the systems administration, if you like, for Cori and associated systems, and he has quite a lot of knowledge about, you know, what happened and why, at least to the degree that we know. I'll pass over to you, Doug. Are you able to share, or do I need to enable something?
E: Oh, good thought. All right. So: cscratch1 crashed, as you know, and it caused the system to be down quite a lot this last time. Now, I am happy to say that we found a way to remain in some form of operations, and we'll talk about what that means through most of this. But just to work through the slide a little bit, to give you an idea of what actually happened:
E
So
the
structure
of
the
lustre
file
system
is
such
that
it's
comprised
of
about
two
is
comprised
of
258
servers
and
248
of
those
store.
Your
data
and
six
of
them
store
your
metadata
or
manage
the
file
system
in
some
way,
and
then
two
of
them
manage
the
whole
rest
of
that
cluster.
The
actual
file
system
is
a
cluster
all
of
its
own.
That's
the
name.
Luster
sounds
like
cluster
right
anyway.
E
So
when
we
look
at
the
structure
of
the
file
system,
we
have
sort
of
you
can
sort
of
the
easiest
way
to
think
about.
It
is
to
divide
it
into
two
groups
of
nodes.
One
is,
is
metadata
nodes
data
about
your
data
and
the
rest
is:
is
object,
storage,
nodes,
your
data
and
so
the
metadata
nodes.
These
are
the
things
that
provide
all
of
the
structure
of
the
file
system
that
you
see
when
you
change
directories
to
global
c
scratch.
One
sd
slash
your
username.
E: That's really important, because it is keeping track of what all the data on the file system is and where it is. One of the interesting features that we'll talk about today is that with Lustre you can stripe the data. This is to say you can say: you know, I don't want my entire file to be on the first OSS; I want, you know...
E: ...[to spread it over, say, four]. One benefit is that you get better performance, because now four servers are talking to you instead of just one. But also, if you're doing a sequential read of the file, you're actually being a good citizen in many ways, because if you put one single large unstriped file on the system, that OSS will talk only to you, and it can be completely taken over by your process trying to read that file. So striping also gives all of your neighboring users a chance to get their data as well.
E: So you get better performance, and everybody else does too, when we do striping. There's a reason I'm telling you about this, but okay.
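For reference, striping is controlled per file or directory with the lfs tool; a minimal sketch (the stripe count of 4 and the path are just examples):

    # New files created under this directory will be spread across 4 OSTs:
    lfs setstripe -c 4 /global/cscratch1/sd/<user>/mydata
    # Inspect how an existing file is striped:
    lfs getstripe /global/cscratch1/sd/<user>/mydata/bigfile.h5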
E: So what was happening in this crash is that our metadata server was crashing. Now, you'll see that we have a big metadata server and, apparently, a little metadata server. It's called an ADU, which stands for Additional DNE Unit, which, as you may have noticed, has yet another acronym wrapped inside of it; let's just not worry about that.
E: The point is that these are other metadata servers. So when we created your scratch directories here at NERSC, about 30 percent of you ended up with all of your data on the primary MDS, and the rest of folks ended up getting sort of shunted off to these other ADUs; we have four of them. Unfortunately, in order to get to your directory, in order to resolve /global/cscratch1/sd up to that point, you've got to go through this one [the primary MDS].
E: So no matter what, even if you're on one of these ADUs, you still have a dependency on the primary MDS, and it was the primary MDS that was crashing. So it's very, very disruptive. Okay, all right, apparently you don't get to click... okay. So, what actually happened. Actually, I happened to be on call for almost this entire incident, so that was interesting, at least for the two production crashes. So on Thursday, September 24th, it crashed. And, you know, we've seen this before; well, we've seen the metadata server crash.
E: It's not great, but you know, it's a very complex system comprised of very complex software, and software sometimes has bugs. So we filed the normal critical case that we would, and unfortunately, when we tried to bring it back up (we do have failover capabilities for all these servers), it couldn't fail over automatically. That's because one of the safeguards that makes sure all the system data is secure and safe and well written out was damaged in this crash.
E: That's the file system journal. And so that forced us to go through several different runs of what's called an fsck, a file system check and repair. Each fsck takes eight hours to run, and to do it correctly for our system, we have to run it three different ways, three times.
E: So it basically takes 24 hours of checking, and if any errors come up, you know, those have to be manually resolved, and so it ends up taking about 36 hours to recover from a metadata server crash where it can't fail over. It's quite disruptive. So we were very happy on Friday: we got to come back into production. And then on Sunday it repeated again. At the time, both of these were right around noon, and we were assigning a lot of significance to it being around noon.
E: That ends up not being, as far as we know, the issue. And so at this point we were nervous, because it was exactly the same crash: when we looked at the Linux kernel stack trace, it was identical. And so the concern was that some unknown portion of the workload was sort of reproducibly touching this, and would potentially put user data at risk, and so we didn't feel that we could bring the system back into production.
E: So we went through the repair again, but we didn't have any immediate direction forward. On Tuesday we worked out a plan with HPE (HPE is our vendor; they used to be called Cray) wherein the systems team, my team, would basically completely rewrite all aspects of how the system is accessed. So we put in a big day: we changed all the Slurm queues, we changed all the login policies.
E: We changed every aspect of how people get in, in order to create this debug-mode capability, which we'll have at the ready if we ever need it again, though it's our goal not to. At the same time, HPE put together a special debug kernel that would help them zero in on what the problem was. And then, finally, some of my team, and a lot of HPE, were dedicated to trying to look at the actual crash memory dumps, to try to understand what might be causing it.
E: [The plan was then to] take action in order to try to crash the machine. So it made it kind of weird: normally our goal is to keep Cori up as well and as efficiently as we can, and now our goal was to crash Cori as well and as efficiently as we could. And it took about another 36 hours, from Wednesday when we came back until Friday at 8 a.m., to crash the system again; so we reproduced it.
E: Unfortunately, the journal was damaged again, and so it took another 36 hours to repair. Almost immediately after they repaired the journal, we brought it back up, on Sunday.
E: What we did is we made Cori available without cscratch1 at all. That's not to say it was as useful as normal, but at the very least some useful work could be done with the machine, and I believe we may have made that time available as free time, so that's always a nice thing. And what we changed when we got in, after we repaired it again, is that we removed everyone from the machine. We did that so that we could be sure that, now that (with your help, with all of the NERSC user base's help) we had identified the correct workload, we could isolate it down to a particular synthetic workload.
E: At the same time, HPE gave us some additional capabilities with that debugging kernel, to try to ensure that the journal would not get corrupted. That did not happen, unfortunately, and so it once again took another 36 hours to repair the disk; actually a little longer that time, because of some complications that came up. But in any case, on Thursday the 8th, we decided to return to normal production. What we did is we took our reproducer workload and we ran it without the secret ingredient that we had figured out.
E: [We ran it at] the typical load that the system operates at, and so load alone is not the crashing, sort of, associated issue. That gave us the confidence we needed: basically, as long as we avoid the thing that hurts, we won't crash. And we've been working very carefully with a number of people to try to make sure that we avoid that behavior. Okay. So what did we get out of all this? There is no root cause yet; we don't know exactly what is causing this.
E: We're sure of a couple of things. One is that this is a bug, and it's a bug at the kernel layer, the Linux kernel layer. Now, that could be in a kernel module like Lustre, or it could be deep in the I/O subsystem itself. The Cori scratch system runs a modified version of Red Hat Enterprise Linux 7. It is modified, but it's not super modified, so it's a well-tested kernel, highly regarded as being reliable; that should be generally okay.
E: However, like I told you, we were doing some pretty deep investigation of these memory dumps, and what we found, very consistently, is that the specific invalid value was changing a little tiny bit, but it was always a very specific pointer in a very low-level I/O data structure: a data structure that was, you know, basically about to be written to disk.
E: That does sort of imply a couple of different methods for how it might be getting modified. The details don't actually matter here, but the point is that it's very deterministic: it's always the same crash; it's not random. And when we looked at the pages that were modified as being associated with those I/O structures,
E: they did reveal a particular application as being correlated, and that really sped us along in terms of identifying that reproducer stack. What the application was doing is performing a lot of unlink operations in parallel, and the particularly unique item here is that this was happening when the files were striped over many, many OSTs; so, basically, all of them. And this was not an intentional configuration.
E: However, and I just want to stress this: there's nothing wrong with what this application was doing. It just happens to be tickling this bug, so no problem there; we're going to solve it. We're going to fix the bug. We're still not going to recommend that workload, though. But the specific mechanism of how this workload is causing the pointer corruption is unknown.
E: We do not know; neither does HPE. But they're working on it, and they've got a lot of people dedicated to this project. Throughout this incident I was talking to them every single day, sometimes twice a day, as were a lot of people at NERSC, and at this time we are still meeting with them three times a week while we're sort of moving into the next phase.
E: We'll talk a little bit about that. But based on what we learned, it's somehow tied to using many, many OSTs, and if I'm forced to speculate, there's a couple of different paths that have been discussed. One is that when the stripe count is extremely high, extended-attribute inode blocks must be used, and it may be that this is related:
E: That in some way, when doing deletions, pulling in these additional extended-attribute blocks may generate the kind of race that we're seeing. Another possibility is that there's a big, dramatic increase in the complexity of messaging when deleting files, when you have to talk to a lot of OSTs and maintain locks.
E: So, you know, very naively: we think that it's completely safe to use one stripe or two stripes, and there's clearly some risk using 248 stripes. It's hard to know what the distribution is within there, but everything that we know tells us that using our largest recommended size, the so-called stripe_large that's in our documentation, which is a stripe count of 72, should be safe. However, for the time being, we're asking people to not stripe higher than 72.
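If you want to check your own stripe counts against that guidance, a quick sketch (path illustrative):

    # Default stripe settings on a directory:
    lfs getstripe -d /global/cscratch1/sd/<user>/mydata
    # Just the stripe count of every file under it:
    lfs getstripe -c -r /global/cscratch1/sd/<user>/mydata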
E: We are working on a file system scan to detect the high stripe counts, so that we can directly contact users, but also to identify damaged inodes that were triggered during this. I am happy to say that almost everything we're finding is in my directory, where we ran the reproducers, or in some of the other known places that were associated with this issue. There have been a couple of other user files, and we'll talk about what the recovery will be for them. But essentially the message we want to send is: contact us.
E: If you see something weird with the file system, HPE has options and can work with us on each and every thing. And it's very important to understand that there is no user workflow that's actually causing this; the error is deeper in the system, and, like I said, for now our goal is to avoid these conditions. So, there have been a couple of after-effects, and one thing I want to be clear about is that we actually don't know that this is related to the crash.
E: But people have noticed that, since we came back into production, sometimes login nodes will hang. And by "hang": perhaps you were trying to access a file on cscratch1, perhaps you were trying to ls, perhaps you were trying to submit a job (because submitting a job actually talks to Lustre to check your quota), and in some cases that just doesn't work anymore.
E: What we found is that the Lustre client can only handle a limited number of simultaneous change requests to the metadata server, and it's possible that accessing one of these damaged files may be, you know, basically deadlocking one of those potential RPCs. That's unclear; it could just be a different bug. So we don't know if it's related or not, but they happened one after the other, so it's hard not to assume that they're related to each other. As a mitigation:
E: What we're doing is we are monitoring the number of RPCs in flight, and then using a number of techniques to try to identify a login node that's impacted, and we'll try to reboot it as soon as we identify that it's failed. The idea being that, at that point, we know that login node is no longer useful for submitting jobs, no longer useful for accessing scratch, no longer useful for data transfer, and so it'd be better
E: if we just crash it as soon as possible and get that debugging information over to Cray. The correction, you know, will sort of depend on what the problem is. If it is related to the damaged files, then the check and repair of Lustre, which is at a different layer than the check and repair of the metadata system, will complete over the next week or so, and in that time we should know more.
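For the curious, the client-side RPC state being monitored is visible through lctl; a rough sketch (exact parameter names can vary with Lustre version):

    # Cap on concurrent metadata RPCs for each MDC device on this node:
    lctl get_param mdc.*.max_rpcs_in_flight
    # Histogram of how many RPCs have been in flight:
    lctl get_param mdc.*.rpc_stats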
E: [If you have one of these damaged files:] if you try ls -l, you might see all question marks for the mode of the file or the size, and if you try to delete it, it says it's not a file or directory, which is kind of weird. You can move its parent directory around, but you can't move the file; you can't rename it. We are working with HPE to repair and recover those, but the best thing to do for now is just to rename the directory they're in, or just ignore that they exist.
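Illustratively, such a damaged entry behaves something like this (a hypothetical listing, not output captured from Cori):

    $ ls -l
    ls: cannot access 'results.dat': No such file or directory
    total 0
    -????????? ? ? ? ?            ? results.dat
    $ rm results.dat
    rm: cannot remove 'results.dat': No such file or directory
    $ cd .. && mv rundir rundir.damaged   # workaround: rename the parent and move on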
E: If you have any questions, please go to ServiceNow, help.nersc.gov, to file a ticket. Okay. So, I want to point out that, like Lisa, many people have been working on many different aspects of this; I'm just sort of reporting it. Okay. So, debugging this has been extremely challenging, and we have to fix it, because, you know,
E: we've identified one way this bug can happen, but we don't know that this is all the ways that this bug can happen. And so, what are we doing right now? One thing that we've been doing the whole time is trying to reproduce at a smaller scale. As it's become increasingly clear that the number of OSTs seems to matter, it's becoming very unlikely that we can reproduce this on our small test system, which only has four OSTs.
E: So that's probably not going to work, so we're going to have to use Cori. The next problem is that it's a very long debug cycle: it takes about 36 hours for us to boot the system, crash the system, and then repair it, but it takes even longer for HPE to analyze all the results, at the level of verbosity and complexity of the debugging kernels that they're building. So crashing the whole MDT is not actually a useful activity.
E: At this point, it's disruptive to you, and it's not getting us the type of iteration time that we need, and with such a complex system, we need to basically be able to do a lot of very rapid, well-designed experiments, to try to start teasing out what the fundamental issue is.
E: So that's been really important, that level of engagement; I really want to say thank you for that. So: we are running instrumented kernels, in case something crashes now. But, more importantly, HPE is sending us a new metadata server... oops... a so-called ADU. This is one of the small ones, which we will use to isolate just this workload.
E: This should allow us to take about 200 nodes of Cori, we suspect, plus that ADU, plus a little tiny slice of all the OSSs, and then crash just that ADU. And since no other user data will be on it, that will have two important aspects. One is that it will be very quick to repair, because it won't have, like, two billion files on it. Second, it won't have any of your data on it, so you won't notice it, and it won't cause logins to hang or anything else.
E: So that's our plan. In order to get that done: we actually just today, right after the Great ShakeOut completed (you know, after the HPE engineers were able to get out from under the table), were able to install the new ADU. It just came today, and it's been installed into our data center and into cscratch1. But we can't add it to the system until we can quiesce the whole of cscratch1, and so on Monday morning, at seven a.m.,
E: we're not going to reboot Cori, but we're going to stop all the jobs, we're going to kill all of the login sessions, we're going to unmount cscratch1, and we're going to add this new ADU. And then, at that point, we can begin the debugging experiment alongside all of everyone else's work.
B: Can I just ask a very quick clarification question about just one acronym I saw in the slides a few back: RPC. What does RPC mean? I think you mentioned that when we... yeah, sorry.
E: [An RPC is a remote procedure call.]
B: And that happens whenever we submit a job or access the scratch system, actually?
E: Yes, absolutely. Any time you ask to open a file, that's going to send an RPC to the metadata server, and the metadata server will then send more to the OSSs. When you go to submit a job to the system, sbatch will submit an RPC to the Slurm controller software to add that job. So actually, I tend to think about file systems and Slurm in a very similar manner: they don't solve the same problem, but they use very similar techniques.
C: I understand there are good reasons that, when you log in to NERSC, it tends to send you to the same login node that you were on last time. But if that login node is hanging, and you haven't noticed it and rebooted it yet, is there a way that I can clear the history, so I get a random login node?
E: So I'm going to answer both of your questions: I'm going to answer the question that you asked, as well as the question that it implies to me. The short answer is that it's based on your IP address, and so, unless you can change your IP address, it's not going to be easy to change to a different login node.
E: However, what we can do, and what we have done: the load balancer, that's a particular piece of hardware (I'm nervous about touching that particular piece of hardware, because every time we do, it seems to generate new excitement for us), and what it does is it talks to each of our login servers every couple of seconds and says: are you up? Are you up? Are you up?
E: And we've never done much with that up-or-down check until now. So we identified two different ways that we can see if the mount point to scratch is hung, without blocking that process. One is looking for a well-known login process that can hang (that's the thing that you're noticing you're getting hung on) and whether we see five copies of those that are still in the process table.
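A toy sketch of that first check, framed as a load-balancer health probe (the process name is hypothetical and the threshold of five is taken from the description above; this is not NERSC's actual probe):

    #!/bin/bash
    # Count lingering copies of a login-time process known to hang on a bad mount.
    count=$(pgrep -c -f "login-quota-check" || true)   # hypothetical process name
    if [ "${count:-0}" -ge 5 ]; then
        exit 1    # nonzero exit: tell the load balancer to mark this node down
    fi
    exit 0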
E
So
that
way
you
even
if
you
were
assigned
to
it
once
we
once
it's
marked
down
in
the
load
balancer
no
new
sessions
will
go
there
like
I
said
I
think
it's
happening
every
two
seconds:
okay
and
then
the
second
thing
that
we
did
is
we
also
we
were
able
to
identify
a
signature
at
the
lustre
layer
itself
so
that
we're
not
waiting
for
users
to
necessarily
you
know,
walk
you
know,
walk
in
the
front
door
and
find
an
unwelcome
environment.
E: That's now responding to activity that was already on the login node and already, you know, failing, and we're marking the node down in that case. And then, finally, what we've done is we've engaged our operations staff. You know, we have site reliability engineers that work 24/7 at NERSC, and they are...
E: they are now monitoring very carefully for this new health check, and they are empowered to reboot these nodes as soon as they notice that it's happening. It still has to be a manual process, because we need to collect debugging information in order to solve the problem. But during daylight hours, from 7 a.m. to 9 p.m., this is an urgent issue for us, and overnight it's handled only a couple of times.
A: So we've officially got about one minute left. Doug indicated that he can stay on the meeting for a little bit longer; I can stay on a bit longer, and a few others might be able to as well. What we might do, though, is: I'll just re-share this screen, we'll quickly run through the last couple of items and then return to Q&A, and people are free to stay and go according to their schedules.
A: So, the last couple of items. Coming up, for the third Thursday [next month], we're interested in topic requests and suggestions, but I think we can take this offline; please make suggestions or requests on the NERSC Slack, particularly the webinars channel. And the other item that we finish up on is last month's numbers. For September, you can see that Cori's scheduled availability took a bit of a hit, for reasons that we've been talking about; HPSS and CFS were still all very good. In the timeline...
B: Okay, can I ask another, more general question? So, this Cori issue is quite special, and hasn't really taken place so frequently in the past, I believe, but something similar could still maybe happen in the future. So I was wondering if this incident can affect the future planning of computing systems in particular. Right now, my understanding is that, given the space at NERSC,
B: we can install two big systems, like Cori or Perlmutter, I guess. So when we change one of the two, like with Perlmutter, we have to decommission the old one and move it away to make space, and then move in the new system. So while we are doing that, we have only one system, and if something like this happens to that one system, then it affects users' productivity quite a bit.
B
But
if
I
don't
know,
if
you
have
enough
space
to
have
another
more
intermediate
systems,
that's
not
maybe
too
big,
but
many
of
our
application
doesn't
really
need
that
much
big
systems,
even
though
that
helps
queueing,
but
our
job
itself.
That
doesn't
really
need
too
big
systems,
so
maybe
during
the
daytime,
during
the
you
know,
standard
time,
that's
more
like
for
analysis
or
smaller
jobs,
and
then
we
separate
maybe
scotch
system,
just
like
eddie
anderson
corey
used
to
be
and
in
that
direction
that
might
give
more
resiliency
to
the
overall
system.
B: So that's sort of what I was feeling. But having said that, I really appreciate how you guys took Cori back to production, at least, even without access to the scratch space. Actually, those two weeks made me able to do some analysis and put up the presentation which I gave this week; we're having a PM meeting this week for one of the Office of Science programs, and it really made one nice slide in the last three days before the conference.
E: Right. What I can say is that, you know, clearly this was a major bug, and it's a bug at a very deep layer of the machine. It is challenging to predict exactly... We're working very carefully with our vendor, and internally, on sort of crafting our longer-term plan as a result. Resiliency for Perlmutter is a central goal for that system.
E: The way that I'd like to phrase it is that our goal for Perlmutter is continuous operations. So, short of, you know, a facility outage (like we have to shut down the power because of a PSPS event, or something, or one of these facility things), our goal will be to try to keep Perlmutter online to some level. Now, I also recognize, though, that this particular bug is...
E: I'm hoping that it's a once-in-a-lifetime-per-system type of bug. So basically we're trying to balance, sort of, you know, risk versus resiliency, because we can't be resilient to all possible things.
E: Exactly for the reasons that you talk about, Cori and Perlmutter are fundamentally different systems as well. Edison and Cori, while they had different processing elements (Ivy Bridge versus KNL) and, of course, very different scales, the way that we operated them was identical.
E: In fact, we used identical software for both machines at the system layer, and so it really gave us a nice A/B test strategy: bring it to Edison, let it hang out for a little while. That said, this isn't thought to be a result of an upgrade; this is thought to be more of a bug that's always been there and was expressed by a change in the workload.
E: So these kinds of things will be hard, at this scale, to avoid in the long term. So anyway, I hear you, and I just want to assure you that we're thinking about different ways of dealing with these types of things in the future.
I: Hi, this is Ramesh from Argonne, and I'm doing some computations on Cori. During the course of my conversations on chat, I was actually chatting with somebody at NERSC when you had this problem with your file system initially, about a month ago, and we were actually just comparing notes, because, being from Argonne, I'm familiar with a system which is fairly similar to Cori: the Theta system. Over here we have recently upgraded our operating system, and before we upgraded the operating system...
I: So the suggestion that I had made was that perhaps some of you might want to actually contact the ALCF folks, just to see if there was something that they did which could help with the situation that you're facing, which also primarily seems to stem from the OOM killer associated with Lustre, the file system. So it's just a suggestion; I thought I'd just mention it. I'm sure you're far more conversant with what you're doing than anything I might suggest, but I'll tell you anyway.
A: Thanks very much. So, you're talking about the login hangs, or the cscratch issues, or a different issue?
I: Well, there were login hangs; some of the symptoms I didn't notice at my end. Of course, let me also preface this by saying that I'm a recent user of Cori: I've been on Cori for about a month now, because we have a DOE BER stimulus-funded COVID-19 project, which has some fairly aggressive timelines, and I'm trying to get some of those computations done on Cori, because I just can't do it on Theta.
I: Another problem that I had initially noticed was that I was actually not able to submit my job at all, and that was actually a problem with, I think, the job scheduler, which in turn was affected because there was a problem with the file system.
E: Very close. However, the job scheduler doesn't actually talk at all to Lustre; what does is sbatch itself.
E: That is a rare occurrence, when it's unavailable, but because of the issues we've been having, it's not been so rare. A thing that I haven't talked about, but, you know, we are working on: removing, at the very least, the login hang if cscratch1 is not available.
I: Okay, okay. The other thing I thought I would mention was that initially, when I had gotten onto Cori, I was experimenting a bit with file striping, and of course I had actually used a striping script that you have. I didn't stripe as wide as 72; it was far more nominal than that, just to see how my write speeds would improve. And I was wondering if it's still okay to do that, or if you don't want us to use those scripts at all, with regard to striping.
A: The scripts are still good; they're still good. The scripts only go up to stripe_large, which is a striping of 72, so we kind of ask that you don't stripe more than that, because there's an increased risk with increasing the stripe further. But we're pretty comfortable with that.
A: So, NERSC actually only informally monitors the NERSC Slack; it's not actually an official support channel. We kind of encourage users to chat via it, because it's good for interaction and it enables our user community to help each other a bit, and we do sort of check in on it occasionally to see what is going on.