From YouTube: CI WG demo: NIEHS Office of Data Science Efforts
Description
Date: 4/6/18
Presenter: Mike Conway
Institution: National Institute of Environmental Health Sciences (NIEHS)
South Big Data Hub
A
Most people here know Mike, because he's been at many of these meetings, so he probably doesn't need much of an introduction. He's going to talk about the data science efforts at, and I'm not going to try to pronounce it, NIEHS, because I'm sure everybody pronounces that acronym differently. He'll talk about what's going on in the Office of Data Science there, what the office's priorities are, and where they're going. So I'll turn it over to Mike to tell us all about it.
B
Hey guys. For some of this, I'm not going to talk about a particular product or development effort, because it's a strange world: I've sort of jumped over the fence, and we're now in a position to eat our own dog food, if you will. I remember there was some discussion during the last meeting, I think with Renaye, about whether we are producers or consumers, and I mentioned that we're now sort of straddling that line. So I want to talk about what we're doing, why we think the hubs are important, and what we are looking to both get from and contribute to that effort in the context of the hubs.
B
So with that: I'm now working at NIEHS, the National Institute of Environmental Health Sciences, which is part of the larger NIH organization. We're based here in RTP, and we have a beautiful campus that I contend is actually the model for Hawkins Lab in Stranger Things. If you guys don't know, the Stranger Things people went to Jordan High School. So if you see the picture of the lab and compare it to Hawkins Lab, you tell me. Okay, but previously I was working with Dr. Moore, Dr.
B
So this is from our mission statement: accelerate scientific discovery and collaborative research, and improve public health, through the application of scientific data and knowledge. With the wonderful presentation we just saw, just remove the word "hydrology" and insert "public health" (and I know there are even public health overlaps with the hydrologic community), and you'll see that we are actually all working on the same problems when you step back and look at it from the multidisciplinary level. That, again, is where we were five years ago with the DataNets: it was multidisciplinary research. So, for example, the controversies with GenX and the water here in North Carolina: that's a hydrological problem as well as a public health problem. You can definitely see how these overlap.
B
So this is really, at the very highest level, what we're concerned about. With toxicology and the things we do at NIEHS, we're worried a lot about life science and genomics, how the environment impacts the genome, and the health effects of environmental exposures.
B
Next-generation sequencing, all these sorts of technologies: managing that data, managing it in terms of compute, creating datasets that are preserved, with authenticity, provenance, and all the Reagan Moore things that I like to channel for you guys. So it's really a nice place to be. As I'll say, we're starting internally. What we're dealing with a lot now are the internal datasets coming out of our core labs: cryo-EM, mass spectrometry.
B
We have studies where tissue samples or different things are being sent out by the PIs to different core labs, where different sorts of assays are done. All that data is returned to the PIs and integrated and analyzed to create studies on, for example, the cancer effects of a certain chemical. We're doing studies on all kinds of environmental exposures.
B
So it is not just FASTQ files that we are concerned with. It's this larger idea of an information commons and a knowledge network, taking in all these different aspects of health science and bringing them together, with the point being to increase the speed at which we can find new knowledge. So you have this larger picture that we are all driving towards in our own domains, but it's always "towards," right? We're perpetually "towards," but we're starting with some more prosaic concerns.
B
I borrowed this from Reagan, but this is the idea that data has a lifecycle, and a lot of these projects are more concerned with the right two boxes, if you will, of this data lifecycle. Even the NIH Commons effort is really oriented towards these right two boxes. We're an organization that, right now, is internally way over on the left-hand side, which is managing data coming off of our core labs, and managing researchers trying to assemble different kinds of assays and different sorts of data sets to create publications and to drive conclusions.
B
All of that is occurring on this left-hand side of the chart, and even just dealing with those first three or four boxes, our work is cut out for us. So again, where we are is more on this left-hand side.
B
We're very much concerned now with internal-facing data, with an eye towards what's going to happen. I would actually submit that a lot of the entities involved with projects like the BD Hubs are, in reality, living in that side of the space. What we're concerned about is how we manage this data over on the left-hand side so that it can transition over to that right-hand side as FAIR data, where we've kept provenance, where we know where the data came from, and where we can make assertions about its authenticity.
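As an illustration of the provenance and authenticity point above, here is a minimal sketch that was not shown in the talk. It assumes a directory of files delivered by a core lab and records a SHA-256 checksum for each one alongside a source label, so that a later check can show whether the data still matches what was delivered. The function names (`build_manifest`, `verify_manifest`) and the `source_lab` field are hypothetical, and this is not BDBag or any other tool named in the talk.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, source_lab: str) -> dict:
    """Record where the files came from and a checksum for each of them."""
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    return {
        "source_lab": source_lab,  # hypothetical provenance field
        "created": datetime.now(timezone.utc).isoformat(),
        "files": [
            {"path": p.relative_to(data_dir).as_posix(), "sha256": sha256_of(p)}
            for p in files
        ],
    }


def verify_manifest(data_dir: Path, manifest: dict) -> list[str]:
    """Return relative paths that are missing or no longer match their checksum."""
    changed = []
    for entry in manifest["files"]:
        target = data_dir / entry["path"]
        if not target.is_file() or sha256_of(target) != entry["sha256"]:
            changed.append(entry["path"])
    return changed
```

An empty list from `verify_manifest` is the kind of evidence that would back an assertion that the files are unchanged since the manifest was written; it says nothing about how the data was produced upstream.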
B
We can also have the repeatability to validate a study or, if questions come up, to answer them. And since this data is also very expensive to acquire, how can we reuse it? Those sorts of things. So in our architecture right now, internally, we're really dealing with repositories for this core data and managing the delivery of this core data to our researchers.
B
That's here, and then we are very much interested in CWL and getting to some common processes for describing workflows and processes. So, like most people, Docker and Singularity: the ability to run workflows on one platform and move over to a larger, more capable platform, as we move from things like MiSeq and NextSeq to the bigger, scarier things like NovaSeq, and need to avail ourselves of more computational power.
B
Similarly, we're very concerned with starting to utilize all of the different ontologies and taxonomies that are out there. So we're in the process of evaluating and adopting tools to manage our ontologies and taxonomies, and we're currently wrestling with the kinds of data we can extract from the workflows and how we integrate human curation into this, both to start describing the experimental methods that are being used, the questions that are being asked, and the outcomes that result from the data, and to link them to publications. None of this, again, should be surprising to anybody.
B
But again, within the BD Hubs context, what we're really interested in is future-proofing. We look at the BD Hubs, and we are also very interested in GA4GH and other groups, in terms of future-proofing and confirming the architectural choices we are making now internally, so that as we begin to open up to the world, we're in a place where we don't have to redo everything, if this makes sense. And I think a lot of people are in the same place we are.
B
You don't necessarily need to be able to read this chart; it's just to show where we are. We have all these basic elements either in place or in pilot, but we're choosing among technologies. For internal identifiers, are we looking at minids? How are we archiving data, with things like BDBag, and is that appropriate for us? We're also choosing ontologies and metadata, how we describe our data, and then the processing pipelines from all of these different data sources.
B
So in terms of the BD Hubs, I looked at the rings and the spokes, and I think our concerns, it's like Maslow's hierarchy: our concerns are the ring concerns. Where are the highway signs and where are the guardrails? So here are some of the things that we as an organization are looking at: very much the NIH Commons, what's going on with the...
C
B
And what we learned from the DataNets, since that's where I came from. We're looking at peer institutions: we're looking at what the hydrological communities do, and we're also looking at what NCI is doing with their cancer data commons, and also at what vendors are doing. Although, I think this group, and the concerns this group is expressing, are way ahead of where the vendors are.
B
If you talk to vendors, sometimes they won't know what FAIR is, even people who are dealing in things like storage technology. FAIR is not something that has really penetrated the vendor community. So I really do think that the people who know about this, and the real conversations, are happening in the context of what the hubs are doing. Guardrails: this is the other thing.
B
The other ring of what the hubs are doing is looking at things like security and compliance. We're very concerned with data usage agreements, and again, one of the reasons we're internal-facing right now is that when you start opening things up to external collaboration, that's like going from two kids to three kids or something: a non-linear progression of complexity.
B
And that's kind of where we are. Again, I didn't want to get too much into the nuts and bolts, but we have been really interested in the various presentations of the various technologies, again sitting on the other side of the fence. We, as part of NI...
C
B
We kind of focus on FAIR plus computable, because I think the tools we need mostly exist in the open-source community, in the kinds of things that NSF has produced and the kinds of things that NIH is producing in their Commons efforts, and on how the hub can be a resource for us in terms of navigating this. I will stop there. We could go into the mind-numbing detail, which I know we all love to do, but I was trying to keep it more towards, I guess...
C
B
They are not a separate topic at all. The reason I don't have them on there is that it's not really my area of expertise. But I do know, for example, if you look at the NIH Data Commons efforts, that there are whole sections, whole key concerns of the NIH Commons efforts, on things like ethics, privacy, and the science drivers and scientific use cases for this data. So it had more to do with the fact that that's not an area I can cover very well.
C
This is Matt from UC Davis. I have a question for you. I was interested to hear that you are thinking about ontology and vocabulary management tools. Can you talk a little bit about that and what that means, and how you interact with other ontology communities like OBO, or whether you're the right person to talk to about that, actually?
B
E
I'll briefly say that we are looking at several biomedical ontologies and identifying which ones can serve our needs. That's one aspect. The other aspect is how to manage these ontologies, so we are looking at several different tools and are in the process of evaluating them. Maybe Stephanie can add further on that.
D
Obviously we would want to be able to incorporate what exists and not try to reinvent the wheel, and it's about understanding where there might be gaps in vocabulary. So, for example, maybe gaps related to exposures that we would want to work on developing. And obviously our needs for exposure may differ in some areas from how others might need to implement that. But it is definitely a topic area.
A
All right, I just had one quick question. I'm sort of greedy to ask it, but I'm going to do it anyway. In a lot of what I keep seeing in NIH and other spaces for this work, they keep talking about cloud, and you talked about cloud and vendors and things. What do you feel is the NIH's view on what a cloud is? Because it's one of those questions where everybody's got an opinion, but nobody's seems to be the same.
B
Internally, it could be as prosaic as: we've got data on this NetApp and we've got data on the DDN; how do we treat that under a common view? That's our interest in iRODS. We could have a situation where data is on the DDN and then you have a replica, or you push stuff out to Amazon, or onto Glacier, or something like that. So is that cloud? I think cloud is a meaningless term. I think what's more important is data...
C
B
A lot of the effort is really more oriented towards minimizing data egress charges by moving computational units to where big data sets sit, right? That's the same problem we have if we want to process data locally, or process it on some high-capacity resources within NIEHS, or send stuff to Biowulf, or something like that. So I think cloud is misleading. I think it's the standards and open APIs.
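To make the "common view" idea in this answer concrete, here is a minimal sketch that was not part of the talk. It assumes a simple in-memory catalog that maps one logical path to several physical replicas on different storage resources, so a caller asks for the logical name and never needs to know which system holds the bytes. The class names, resource names, and paths are all hypothetical, and this is not the iRODS API; it only illustrates the logical-versus-physical separation being described.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    resource: str  # hypothetical storage resource name, e.g. "netapp" or "ddn"
    location: str  # physical path or object key on that resource


@dataclass
class LogicalCatalog:
    """Maps one logical path to every known physical copy of the data."""

    _entries: dict[str, list[Replica]] = field(default_factory=dict)

    def register(self, logical_path: str, replica: Replica) -> None:
        """Add a replica for a logical path, ignoring exact duplicates."""
        replicas = self._entries.setdefault(logical_path, [])
        if replica not in replicas:
            replicas.append(replica)

    def replicas(self, logical_path: str) -> list[Replica]:
        """Return a copy of the replica list for a logical path."""
        return list(self._entries.get(logical_path, []))

    def resolve(self, logical_path: str, preferred: list[str]) -> Replica:
        """Pick the replica on the most-preferred resource that holds the data."""
        available = self._entries.get(logical_path)
        if not available:
            raise KeyError(logical_path)
        for resource in preferred:
            for replica in available:
                if replica.resource == resource:
                    return replica
        # No preferred resource has a copy: fall back to the first registered one.
        return available[0]
```

A compute job running next to one storage system could pass that system first in `preferred`, which is one small way to express "move the computation to where the data sits" without changing the logical name that users and workflows refer to.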
A
That was sort of what I was going for, because I like to pick on that word. But I think it's one of those things that is good to talk about, especially in this hub environment, because at some level we are the cloud, and that's the important thing to remember. That's... oh.
B
No, I was just going to say, I think what's important is that we would like to build systems that are not walled gardens, that we and other people can consider to be a cloud. Even if it's sitting on just a Linux rack server in our computer room, we can still say it's cloud in terms of how you perceive it, right? Yeah, I...