From YouTube: 7. Data Ecosystem Overview
Description
Learn about the data ecosystem at NERSC.
Slides for all sessions can be downloaded from here: https://www.nersc.gov/users/training/events/new-user-training-june-21-2019/
A
Gellin is going to be chatting about file system best practices: how do you store, move, and write data to a file system? Quincey Koziol will be joining us remotely, and he'll be chatting about I/O libraries. So there's a lot of emphasis on how you store and move data, because that's a fundamental operation. In the afternoon, in the part after the break, we're going to shift more towards analytics.
A
So now that you know how to store and manage your data, how do you actually analyze it? Increasingly, Python and Jupyter are key technologies in that space, and Rollin is going to walk you through those.
Initially we had Shane Canon in place for Shifter, a particular container technology, but Rollin is going to speak on Shane's behalf. And finally, at the end of the day, I'm going to be chatting about deep learning. So I think that's mostly the staff that's in the room at the moment. Now, you really should feel free to interrupt us at any point and ask us questions.
You can engage with NERSC staff by sending tickets to consult@nersc.gov, or you can chat with us in person. So now that we are all here in a room, please interrupt us, ask questions, catch us in the break.
A
Catch us after the day is over, because we're really all looking forward to interacting with you today. All right.
So I think I mentioned that data is extremely important for NERSC, and very often, if you look at an organizational structure, you can make out what the priorities are. So at NERSC we have the systems department, which makes sure that our systems are performant and running all the time. You heard from Rebecca, among others, in the morning on the HPC side of things. And then we now have a data department, whose charter it is to make sure that our systems are responsive to the current and emerging data needs of the user community. I lead the DAS group; we manage the user-facing data stack.
A
We have the Data Science Engagement group, led by Debbie Bard, which has specific strategic engagements with different science communities. Damian Hazen leads the Storage Systems group; they manage HPSS, the archival system, and the file systems. And then Cory Snavely leads the Infrastructure Services group. So even though today DAS will be presenting the user-facing data stack, there are several groups who have active roles in the data space.
A
All right. Hopefully this is clear to you by now, but we've tried our best to make sure that Cori, as a single unified system, can support both simulation and data workloads. I would say that maybe three or four years ago there was a genuine question mark at NERSC on whether we should have one system that does data and data analytics and then maybe a separate system that does simulation. But we made the strategic decision that a single system will do a good job in supporting both.
So I guess I did want to get a sense: who's a new user to NERSC, maybe just getting started? Maybe if you can raise your hands. All right, sounds good. And how many of you have already logged on to NERSC systems and are familiar with masks? Okay, all right! So you're predominantly new users.
So we do have the Intel Haswell partition. In many ways, if you really do not want to modify your code, then the Haswell partition is where you can continue to run your jobs. But going forward, of course, truly leveraging many-core computing is important, and the Knights Landing partition is what is recommended for those needs.
So I'm going to come, in a few slides, to some of the data-specific features that we've configured on Cori, but first I do want to walk you through the stack.
A
So again, if you are a data user and there is some software or a service that you want to leverage, this is the production stack that we support at NERSC. If you care about data transfer and access, let's just talk about data transfer for a moment. You have your data set in your lab, or maybe at a remote instrument, and you'd like to move that data set to NERSC. Then we recommend that you use Globus and GridFTP.
A
Those are the two tools that you can use. Once your data is in place here, chances are that you want to share the data set with the rest of your community, so web portals become very important, and there is a range of technologies that you can use. More and more, beyond just sharing data with other users, it is maybe also important to share code or your analysis scripts, and Jupyter is a key technology that you can choose to leverage for that.
A
Chances are that you need to move a lot of data, manage a lot of data, and analyze a lot of data, and you need to do this repeatedly. You want to make sure that the entire workflow is automated. So there are a few tools that you can use. FireWorks is a fairly sophisticated tool that understands all of the file systems and the queuing system that, hopefully, you heard about in the morning, and you can choose to use FireWorks to capture and automate your workflow tasks. TaskFarmer is another technology that we support here at NERSC.
A
So if you have collections of embarrassingly parallel jobs, then TaskFarmer can in many ways take care of that important use case.
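The embarrassingly parallel pattern mentioned here can be sketched in a few lines of Python. This is only an illustration of the idea using the standard library's thread pool; it does not use TaskFarmer or its interface, and the `analyze` function and the input IDs are hypothetical stand-ins for independent per-file or per-sample tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(sample_id: int) -> int:
    """Hypothetical stand-in for one independent task (e.g. one input file)."""
    return sum(i * i for i in range(sample_id))


def run_task_farm(sample_ids, max_workers=4):
    """Run every task independently and collect the results keyed by input."""
    sample_ids = list(sample_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs,
        # even though the tasks themselves may finish in any order.
        outputs = list(pool.map(analyze, sample_ids))
    return dict(zip(sample_ids, outputs))


# No task depends on any other, so they can be farmed out freely.
results = run_task_farm([10, 100, 1000])
print(results)
```

Because the tasks share no state, the same structure carries over when each task is a separate program invocation rather than a function call.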
Now, I'll note that many communities already have workflow tools pre-decided for them, and we try to work with those communities to make sure that the workflow tools will work, and will continue to work, at NERSC. Data management, I think, is a key bit. Again, it's one of those things which you only learn when maybe you're in grad school or as a postdoc.
A
Someone has maybe already decided a data management scheme for you: you're going to be storing your data maybe as CSV or txt files, or in some other scheme. But the moment you start talking about big data sets, terabytes of data, tens of terabytes of data, or even hundreds of gigs of data, it is really quite critical that you pay attention to how you're storing your data sets.
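To illustrate why storage format starts to matter as data grows, here is a minimal sketch, using only the Python standard library, that writes the same numbers as text and as raw binary and compares the file sizes. The file names, the value count, and the `compare_formats` helper are assumptions made for the example; self-describing I/O libraries add metadata and structure on top of compact binary storage, which this sketch does not attempt.

```python
import os
import random
import tempfile
from array import array


def compare_formats(n=10_000, seed=0):
    """Write n doubles as text and as raw binary; return both sizes in bytes."""
    rng = random.Random(seed)
    values = array("d", (rng.random() for _ in range(n)))

    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "values.csv")  # hypothetical file names
        bin_path = os.path.join(tmp, "values.bin")

        # Text: one full-precision decimal string per line.
        with open(csv_path, "w") as f:
            for v in values:
                f.write(f"{v!r}\n")

        # Binary: the in-memory representation of each double.
        with open(bin_path, "wb") as f:
            values.tofile(f)

        # Read the binary file back to confirm nothing was lost.
        restored = array("d")
        with open(bin_path, "rb") as f:
            restored.fromfile(f, n)

        return (
            os.path.getsize(csv_path),
            os.path.getsize(bin_path),
            restored == values,
        )


csv_size, bin_size, round_trip_ok = compare_formats()
print(f"text: {csv_size} bytes, binary: {bin_size} bytes, lossless: {round_trip_ok}")
```

The gap in size is modest at this scale, but it applies to every value, so it grows in proportion to the data set, and text additionally has to be parsed on every read.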
A
So modern I/O libraries like HDF5, NetCDF, and ROOT have all of the good characteristics, I would say, of a data management solution, so you're welcome to use those. We support those at NERSC. And then, if it makes sense for you to use a database, then MongoDB, MySQL, and PostgreSQL are what we use. So these are all tools that are well supported at this point in time.
A
So you're not going to see C, C++, and Fortran here on this slide. I think we all recognize that people care about higher-level languages, so more and more, Python is the recommended language if you care about generic analytics. If you're a statistician and you want to run sophisticated statistical analysis, then R is a tool that you can use. Julia is an emerging language that you may choose to explore. Spark is an interesting analytics framework that, again, you can also leverage.
A
Now, there are legacy tools like MATLAB and Mathematica that, of course, have been there for a while and will be around, so you're welcome to use those. And finally, there are a bunch of libraries in the deep learning space that I'm going to get to towards the end of the presentation. So this entire stack is in production. It's there, it's available to you, and you're welcome to use it. There is documentation. You can file trouble tickets with us.
A
Rollin is going to go into Jupyter notebooks and the fact that we have dedicated nodes for Jupyter; soon we'll have dedicated compute back-end nodes for Jupyter as well. Perhaps there are some jobs that will run in serial but require a lot of memory, so there are some big-memory nodes that you can use. There are some dedicated workflow nodes where, perhaps, you need to let your workflow manager run for a long time, so those nodes can be used.
A
So now there are real-time queues in place that can let you do that. Interactivity, again, is really quite key. As a data user, maybe the prospect of waiting in the queue for three days to run your analysis is not very appealing. So for interactivity you can use the interactive queue: submit your job, and hopefully you'll get a command shell on some compute nodes with a short turnaround.
I/O, again, is key, so making sure that you can read and write data fast is important. And I think what we are seeing increasingly is that GPFS and Lustre file systems are not keeping pace, so the burst buffer technology is something that you can choose to use. All right.
So all of these are features that we've tried to configure on Cori to make sure that you, as a data user, are productive. But if there are things that you're still struggling with, please let us know.
All right, so I think I'll just make a few asks of you for the remaining four hours. Please engage with us. The reason we've set aside four hours today is to be able to talk to you and maybe educate you, but then also learn from you on what is and what is not working well. And do tell us about your interesting science problems. I mean, fundamentally, the reason that we in the group, and other staff at NERSC, work at NERSC, as opposed to doing the same job in industry, is that we care about science.
A
So if you have any interesting science problems that you want to work on, that you want to have breakthroughs in over the coming years, then please tell us about it, and we can provide you with some pointers. All right, so I'm going to stop there, and while we do the switch, are there any questions or comments for me?