From YouTube: CI WG demo: SciDAS
Description
Date: 03/31/17
Presenters: Claris Castillo (RENCI) & Alex Feltus (Clemson)
Institutions: Renaissance Computing Institute & Clemson University
South Big Data Hub
A: Alex Feltus and Claris Castillo will present SciDAS, a national cyberinfrastructure for scientific data analysis at scale; so we are now looking into the future. Claris is a senior networking systems researcher at the Renaissance Computing Institute (RENCI) here at the University of North Carolina at Chapel Hill, and Alex is an associate professor in Clemson University's Department of Genetics and Biochemistry and CEO of Allele Systems. Okay, take it away.
A: Thank you. So, SciDAS is designed to improve flexibility and accessibility to national resources, helping researchers more effectively use a broader array of these resources. SciDAS will be developed using large-scale systems biology and hydrology use cases, but it is applicable to many other domains. So with that I'll hand it over to Alex and Claris.
B: We haven't even gotten to the point, since we started in February, of deciding how to pronounce our own acronym; it's really, really new. But primarily, this is an end-user-driven project where, using experimental tools, we're trying to build production systems that can do what the NSF is asking and actually generate usable scientific results. And so, one of the big problems we're trying to solve...
B
Is
that
even
scientists
who
are
very
geeky
and
understand
how
to
use
high
performance
computing
and
things
like
that
they're
having
trouble
processing
their
own
DNA?
This
is
me
I'm,
one
of
these
people
I'm
an
end-user
and
we're
constantly
trying
to
figure
out
why
our
nodes
aren't
working
or
jobs
are
failing
and
we're
dealing
with
I
have
a
hundred
and
fifty
terabytes
of
data,
and
we
filled
up
instantly
by
trying
to
deal
with
some
of
these
practical
practical
issues.
B: So there are people having problems processing their data from a data perspective, like, again, myself and one of my colleagues; and then there are the people building these amazing systems, which I really didn't know much about until a few years ago, other than just using whatever my campus had to be able to process data.
B: These amazing systems... Claris will talk about some of the systems that will be used, but I found that a lot of times we don't know what these systems are; they aren't marketed to scientists very well. Sometimes they don't meet my needs, and sometimes it's just that I don't really know how to get access to them (what hoops do I have to go through?), and sometimes that keeps me from using them.
B: So, next slide. The solution, and the way we wrote this proposal, is to embed active end users, like some of the people who are PIs and co-PIs on this proposal, who are trying to process data at the tera- to petabyte scale, looking at it from an end-user perspective, with so much raw data having to go through the whole system; and then to embed those users on software design teams with cyberinfrastructure developers, like the ones we're working with at RENCI, in agile design teams.
B: So what we're doing is sharing our destiny on NSF reports, and really we're developing the system while we're building it. Next slide. What we're doing on the design team, for the distributed cyberinfrastructure component, is gluing systems together to allow us to discover domain data, move it fluidly across networks, and launch scientific workflows in the way a scientist is comfortable doing, even if it's not the best way, and so improve flexibility and access to national and global resources.
B
This
is
our
solutions,
problem
and
I
shouldn't
miss
this
I,
didn't
say
at
the
very
beginning
is
that
this
is
NSF
funded
project,
and
this
is
Clemson
rincey
at
Chapel,
Hill
and
Washington.
State
universities
are
the
primary
side,
but
we're
collaborating
with
people
already
around
the
country.
Next
slide.
B: This, again, is user- and engineer-focused: we're trying to have the embedded scientists stress-test the system. A big part of this is that I find that, using big data sets in genomics (which is what I do, genomics and genetics), the general-purpose systems don't always function the way you expect them to, and neither do the networks that move data around. Next one. I put these dumb animations in here, and the concept here is what we already have on the team.
B: We have some plant biologists, but also some people interested in looking at the Earth from space, in remote image analysis, and we're really marrying these people with the people who work at RENCI, the technical people at Internet2, and a lot of partners: the computer scientists, the network engineers, the storage engineers, the visualization people, the HCI people. We're really trying to bring all this together to create a village that can get some work done. Next slide; and so there's the concept.
B: So, you know, "if you build it, they will come": we're trying to change that concept here to "they will help build it while using it." Okay, so here's our little construction of a baseball field where we can all play together, the network engineer and the geneticist; they're going to sow it together, after mowing the lawn and constructing the field. We're really trying to do this instead, trying to alleviate the problem of building a massive system that is great, right, but doesn't quite get advertised correctly.
B
It
doesn't
get
utilized
to
its
full
potential
and
we're
trying
to
do
that
in
education,
as
the
user
really
trying
to
respect
the
inefficient
unoptimized
habit
space
of
the
domain
scientists
that
that's
that,
if
you
don't
do
it
that
way,
you
can
turn
fine
is
off
because
they
don't
want
to
change
their
their
bad
habits
next
slide
and
so
for
stress
testing.
You
know,
I'm
a
biologist
and
a
big
big
part
of
the
data
that
we're
running
through
the
system
as
we
develop.
It
is
biological,
and
this
is
just
a.
B: This was taken yesterday: a snapshot of the data in one repository, the NCBI Sequence Read Archive in Maryland. When we move data through Internet2 to our systems right now (and I've been doing this for years, actually), it just passed the magic 10 quadrillion A's, T's, G's, and C's: base pairs that have been generated with a not-new, eight-year-old sequencing technology that really produces huge data sets. You can see this is an exponential curve. This is a lot of data.
B
A
base
pair
is
a
byte,
and
so
you
can
look
at
this
as
petabytes
over
10
petabytes
of
data
that
have
been
generated
and
I
see
this,
as
you
know,
hitting
X
of
ice
if
it
hasn't
already
it's
just
not
in
this
repository,
if
it's
out
there,
if
I
think
I
saw
in
John's
slide
that
there
was
on
during
12
petabytes
moved
across
skynyrd
at
to
last
and
2016
times.
I
got
that
right.
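The back-of-the-envelope conversion the speaker is using (one base pair stored as roughly one byte, uncompressed) can be sketched in a few lines; the numbers below are the ones quoted in the talk, not independent measurements:

```python
# Scale estimate from the talk: 10 quadrillion base pairs at roughly
# one byte per base pair is on the order of 10 petabytes, uncompressed.
base_pairs = 10 * 10**15            # 10 quadrillion A/T/G/C characters
bytes_total = base_pairs * 1        # ~1 byte per base pair
petabytes = bytes_total / 10**15    # decimal petabytes
print(petabytes)  # → 10.0
```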
B: You could have a very small number of scientists moving this data around, mining it, and meeting that capacity just from this one repository; so there's a lot of data to crunch. Next slide. Okay, and so, as the biologist: you know, I've got stacks of hard drives in the office that we had received from colleagues to process their data. You can't do that anymore. And so one of the stress tests is that we're actually trying to develop biological results while we're using the system.
B
We
in
my
lab
in
a
state
of
Sigma
and
watch
the
State
University
of
Kali
and
some
other
people,
we
are
generating
G
Direction
patterns.
We
build
these
graphs
of
G
interactions.
This
picture
here
shows
the
little
dots
are
inter
genes
and
line
between
them
or
gene
interactions,
and
we
process
that
10
petabytes
of
data
at
at
NCBI.
That's
a
big
place
where
we
draw
from
in
other
places,
choose
to
the
polenta,
generate
these
kind
of
networks,
and
so
this
is
a
massive
computational
problem
to
do
so.
B
Really
stress
testing
it
from
an
algorithmic
and
just
raw
data
perspective.
Men
are
in
what
we're
doing
next
slide,
please,
if
so,
we're
really
trying
to
this
kind
of
get
ridiculous
here
and
see
how
much
we
can
stress
test
the
system
and
we're
trying
to
I
focus
a
lot
of
plants
and
sort
of
a
human
perspective
too,
and
really
trying
to
branched
out
into
the
Tree
of
Life
and
pull
big
datasets
from
different
nodes
on
the
Tree
of
Life
different
organisms.
B
There's
some
numbers
here
showing
you
like
this
is
actually
pretty
dated
that
there
are
38
terabytes
of
green
plant
law
data
that
we
can
process.
We
probably
right
now.
If
we
restored
process
and
raw
data
will
be
generating,
you
know
3,
petabytes
or
so
intermediate
intermediate
files,
and
we
don't
have
that
kind
of
storage
we're
trying
to
deal
with
a
data
processes
in
this.
You
know
how
far
we
can
push
the
envelope
when
we're
generating
this
data.
B
This
is
sort
of
a
model
with
the
grant
we're
going
to
be
publishing
the
genome
interaction
patterns
that
are
interest
to
different
groups
of
scientists
and
biologists,
medical
scientists
that
could
be
mine
for
information,
good
blood,
and
this
is
a
we're
going
to
detail
here.
But
this
is
a
lot
of
the
workflows
that
the
comfort
zone
for
from
my
group
is
where
we've
been
developing
Pegasus
workflows
and
run
on
the
open
science
trip.
B
The
open
science
kids
allowed
it
to
really
scale
up
from
a
pretty
robust
data
center,
a
Clemson,
and
that
we
can,
you
know,
branch
out
into
exceed
resources
as
well,
but
we
already
have
robust
workflows
like
this
one
that
generates
gene
expression,
matrices
Pegasus,
workflows
that
are
functional
now,
with
Qi
the
weekend,
we're
going
to
be
testing
into
the
NSF
cloud
instance.
It's
like
chameleon
and
cloud
lab
and
then
really
moving
to
OSD
for
production
purposes
and
using
Internet
is
a
big
data
mover
for
a
data
next
slide,
and
so
we're
not
just
about
biology.
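For context on the Pegasus workflows mentioned above: a workflow is a directed acyclic graph (DAG) of jobs with data dependencies, which a planner then maps onto execution sites such as OSG. The sketch below is a hypothetical illustration of such a pipeline's structure in plain Python; the stage names are invented for illustration, and this is not the actual Pegasus API:

```python
# Hypothetical sketch of the DAG structure behind a pipeline like the
# gene-expression-matrix workflow mentioned in the talk. Plain Python
# illustrating job dependencies, not the real Pegasus API.
from graphlib import TopologicalSorter  # Python 3.9+

# job -> set of jobs it depends on (all names are illustrative)
dag = {
    "download_sra": set(),              # pull reads from NCBI SRA
    "align_reads":  {"download_sra"},   # map reads to a reference genome
    "quantify":     {"align_reads"},    # per-sample expression levels
    "build_matrix": {"quantify"},       # combine into one expression matrix
}

# A workflow planner dispatches jobs in a dependency-respecting order:
order = list(TopologicalSorter(dag).static_order())
print(order)  # → ['download_sra', 'align_reads', 'quantify', 'build_matrix']
```

A real Pegasus workflow adds file-level data dependencies and site selection on top of this ordering, which is what lets the same DAG run on a campus cluster, OSG, or a cloud testbed.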
B
This
is
the
collaboration
for
many
many
different
University
and
universities
and
we're
seeing
if
we
can't
like
find
some
some
cross-pollination
between
the
disciplines
with
this,
but
also
be
able
to
engage
Hydra
share,
learn
from
what
they've
done
in
their
database,
but
also
be
able
to
use,
add
some
more
data,
crunching
power
to
Hydra
share,
and
so
there's
many
many
aspects
to
it
and
Hydra
shares
that's
a
really
cool
resource,
and
so
we're
we're
seeing
how
we
can
further
enable
this
next
leg.
I.
C: First among our modules is iRODS, which is open-source software that provides tools to manage data from acquisition to archival and reuse, throughout the whole data lifecycle, and which has been widely adopted by government, industry, and academia. It's also the back end of some other prominent ACI efforts, like CyVerse (it used to be iPlant). We then also have ORCA and ExoGENI. ORCA is control software that was developed in collaboration with Duke to orchestrate resources across federated resource providers.
C
A
musicians
provider
could
be
cloud
providers
or
neighbor
provided
link
into
the
tool,
and
currently
he
wants
a
rose
electricity
in
Genova
to
genie,
a
genie
testbed,
a
studying
more
than
25
and
more
than
20
network
providers
and
in
a
to
Jeannie
Orca.
Allow
scientists
to
create
virtual
infrastructure
of
slices
tailored
customized
to
their
needs,
bringing
together
in
a
computer
source,
attic
storage,
the
social
network
all
connected
to
layer,
2
networking
and
distributed
across
multiple
cloud
or
cloud
providers
around
the
nation
and
internationally
actually,
and
so
when
we
have
these
two
foundational
technologies.
C
It
is
behind
cider
each
atom,
and
this
is
a
project
in
where
we
integrate,
is
an
NSF
for
the
projects
that
completed
one
a
year
ago.
But
in
in
this,
we
integrated
pegasus
the
workflow
management
system,
develop
III
and
USD
with
Orca
to
demonstrate
that
typical
applications
could
dynamically
in
legal
changes
in
infrastructure
to
adapt
to
changes
in
the
workload
like,
for
example,
is
a
scientific
workflow
needed
to
the
three
for
transferred
data
from
an
external
data
depository
or
needed
to
scale
out.
C
We
could
dynamically
the
workflow
itself,
the
scientific
application
to
dynamically
instantiate
Network
links
the
to
enable
the
data
transfer
and,
as
you
remember,
Alex,
a
representative
workloads
that
we
are
going
to
be
using
to
drive.
These
projects
are
Pegasus
entity
worthless,
currently
running
in
osg,
the
type
of
program
which
is
in
the
same
pain
of
the
previous
one.
C
With
these
two
in
layers
being
separated
like,
for
example,
we
demonstrated
and
do
something
in
supercomputing
that
we
could
prioritize
the
traffic
of
a
or
the
transfer
of
a
certain
data
based
on
the
metadata
associated
with
the
file,
and
this
is
what
this
integration
of
idols
and
or
attack
enable,
among
others,
many
others
Sdn
basic
limitation
in
techniques
that
we
developed
on
the
radii,
and
here
we
are
facing
the
federal
scale,
challenge
that
Alex
and
Stephen
have
brought
to
us.
We
are
going
to
be
building
and
hardening
these
efforts.
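The metadata-driven prioritization described here can be pictured as a small policy function: given a file's metadata tags, choose a transfer priority class. This is an illustrative sketch only (the tag names and classes are invented), not the iRODS rule language or the actual demo code:

```python
# Illustrative sketch of metadata-driven transfer prioritization, in the
# spirit of the SC demo described above. Tag names and priority classes
# are hypothetical, not from iRODS or ORCA.
def transfer_priority(metadata):
    """Map a file's metadata tags to a network priority class."""
    if metadata.get("project") == "urgent-analysis":
        return "high"                      # time-critical project traffic
    if int(metadata.get("size_gb", 0)) > 100:
        return "bulk"                      # very large files take a bulk path
    return "normal"

print(transfer_priority({"project": "urgent-analysis"}))  # → high
```

In the real system, a decision like this would be taken by the data grid's policy engine and enacted by the SDN layer (e.g., by provisioning a dedicated link for "high" traffic).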
C
Previous
efforts
is
to
allow
scientists
like
Alice,
to
build
collaborative
infrastructure
across
a
much
more
richer
and
complete
ecosystem
that
will
include
cloud
lab
tag,
a
million
open
site
grid
and
its
genus
and
be
tightly
integrated
with
a
wide
area
network
in
data
Greece
idols
that
would
be
initially
deployed
in
Renzi,
Washington,
State
University
and
clamp
them
to
meet
the
needs
of
the
specific
applications
of
the
biology
community
and
the
hydration
community.
The
procedure
by
hydro
share.
C
We
will
also
be
extending
these
while
creating
new
approaches
so
that
we
can
provision
containerized
scientific
applications.
So
now,
in
the
past,
we
have
been
using
virtual
machines
as
the
unit
of
deployment
of
scientific
applications.
That's
the
case
for
at
Eugenie
and
indicate
for
cloud
like
a
million
we're
kind
of
webs.
C
What's
going
to
be
building
new
mechanisms
so
that
instead,
we
can
containerize
applications
and
throw
on
a
containers
which
is
obviously
going
to
make
the
system
more
efficient,
more
more
portable,
and
we
hope
is
going
to
take
us
to
to
the
next
level
of
the
system
in
the
1574
efforts
that
we,
the
water
meter
width
and
this
figure
shortage
is
a
ten
thousand
review
and
we
have
started
this
project
is
like
a
month
ago.
So
we
have
been
a
heavy
discussions
on
what
we
should
be
doing
and
how
we
would
do
it.
C
Is
that
software
infrastructure
effort
led
by
what
is
the
syllabus
in
intention
to
set
a
biology
community
and
that
is
currently
deployed
in
a
more
than
100
sites,
so
I'm
as
pyro
sure
have
needs
to
em
drawn
or
enable
the
it
accused
the
computation
against
the
data
that
they
host
and
we
plan
side
that
we
hope
to
try
this
to
be
the
platform
for
that
and
that,
if
sorry,
we
don't
have
a
thank
you
climb.
How
is
that
possible?
Thank.
D: So I'm curious, and this is for Alex or Claris: I guess it's interesting to me that what you've built to support the biology seems like it's going to be more generally applicable. I mean, do you see this... you've got the genomics community as well as the hydrology community; any thoughts beyond that, in terms of what you might be able to support with this?
C: That's exactly right. From an infrastructure point of view, we're really designing the system to be extremely generic. For example, if you look at the top of the architecture, we're looking at enabling the provisioning of containers, and whatever is within those containers is completely immaterial to the infrastructure; the same goes for hosting data. So the system is, by design, generic, to serve the broader community.
D: That's great, Claris. I'd be very interested in some further conversations; just as I was pointing out, we're looking at what our next investment is going to be. Alex and I have already started this conversation, and it would be great to bring you in on it.
E: So, you know, we have for a long time talked about trying to create generic infrastructure, and I think we're doing a pretty good job of getting there. But no matter how generic it is, I think the inherent level of complexity of what we're trying to get done, creating infrastructure for a community of science, almost requires those special people, those unicorns that Joan was talking about, who can help.
E
You
know
the
communities
of
science
take
advantage
of
generic,
but
not
personalized
software,
and
so
I'm
wondering
you
know,
I
wonder
if
we
could
start
I
don't
know
thinking
about.
Maybe
is
the
right
term
to
use
about
how
far
you
can
go
afield
from
let's
say
you
build
something
it
works
for
a
particular
community
like
cybers,
did
I
think
it's
been
quite
successful.
You
know
building
an
infrastructure
for
a
community
of
science
and
more
than
one
community
of
science.
E
So
how
you
know
the
question
is,
but
yet
in
talking
to
drop
the
level
of
alcohol
hand-holding
is
quite
extensive.
So
my
question
is:
is
there
a
way
of
crew
I'm,
not
sure
if
Howard's
still
there
is
there?
A
way
to
do
a
Facebook
similarity,
you
know
friends,
map
of
scientists
and
science
domains
so
that
we
think
about
how
far
a
domain
is
from
another
domain
to
kind
of
get
an
idea
of
how
hard
it's
going
to
be
to
migrate.
Cyber
infrastructure
do.
B: One model that we have, just to focus on the Tripal databases: Tripal is a technology for repositories storing genomic data, but it's built around lots of different communities within the life science community that are a lot more agriculturally focused. So there would be, like, a tree community, and different ones; actually, the insect i5k is there too. And I think that is the way to tether the users together: through these repositories that they're already using.
B
And
then
you
know
putting
these
these
tools,
these
air
coding
generic
tools
into
their
hands
that
meet
meet
their
needs.
So
I
think
that
there's
ways
to
do
that
without
trying
to
like
you
know,
just
kind
of
naturally
go
through
from
a
repository
perspective
that
human-computer
interaction
and
interface
is
out.
There
that's
a
way
to
build,
build
ease,
but
I
was
bringing
this
up,
but
I
think
that
it's
you
know
defining
the
word
community
is
a
very
difficult
thing,
especially
writing
about
biology.
A: This is Florence. That might be something that's an opportunity for the hubs over time, right? Because we're going to be seeing these different communities, and we're going to keep trying to enable cross-hub, or cross-whatever, collaboration. It could be that we're able to build that map over time; it's just going to take time. But that could be an interesting objective for the hubs working together. What do you all think?
D: I had a follow-up, adjacent to Stan's question. So I heard about mapping, and we're talking about mapping data sets, and then we're talking about unicorns and their ability to help enable this science; and to me that sounds more like mapping expertise than the datasets themselves. If you have one specific community of practice that is using the infrastructure and is quite successful, and you have perhaps a new community that is joining, that community does need that expertise.
E: So, if there is a metric that allows you to say these two data collections are close, even though one is for herpetologists and the other one is for left-leaning journalists (no, I'm just making this up), you know, maybe there's something to be said there. I mean, I admit we haven't done anything to characterize the expertise; I fully admit that. I'm just saying we've got to use what we've got.
C: I think, Alan, that tends toward this sort of cross-pollination that you talked about, right, which we're doing with the hydrologist community: we're getting a faculty member involved from UNC to also understand how these two communities have similar data set needs. So I think we could say that we are trying to understand how these two communities, their data needs and the types of data that they have, are alike, and how they could use the system. Is that correct, Alex?
B: Yeah, and I think, look, this is where this project is epic: we're in a good place, and we can all work together, and that's the most important thing. But I look at everything from a data perspective, right? I've just hacked through my years as a scientist, you know, scaling up my research. So I hear what you're saying about generalizing systems and software, but I look at ways to generalize the data, and in biology that's pretty easy.
B
We
have
the
evolutionary
tree
to
relate
data
together
and
I
would
look
at
different,
triple
database
repositories
and
the
ncbi
has
different
ways
of
aggregating
data
around
the
Tree
of
Life
and
so
from
a
domain.
Scientists
biologists
perspective.
That's
the
way
I
would
tether
together.
The
researchers
is
through
the
Tree
of
Life
evolutionary
perspectives
and
then
in
other
we're
doing.
I
have
visions
with
sie.
Das
is
mapping
a
lot
of
the
cyber
infrastructure
to
be
able
to
look
at
complex
gene
interactions
across
the
Tree
of
Life.
G: So this is Renee, just to chime in a little bit on the discussion. I can't say too much about it, because we're still writing the proposal for it; it's not secret, it's just not fully baked yet. But one of the things we're working on, rather than focusing on the underlying infrastructure or databases or repositories or a lot of heavy metadata, is trying to build a resource as the first step of our big data resource map.
G: The hubs are all thinking about putting together, in some coordinated manner, different versions of big data resource maps. So we're trying to do something that's as practical as possible, with as little lift as possible, and what we're focusing on is the notion of domain experts posting their challenges, the problems that they're trying to solve, along with the data, and trying to create interactions between them and data science teams. You know, frequently, with our machine learning efforts...
G: What happens is a dataset is published, and then you say, "okay, go out and figure out what to do with this data" to the data scientists and machine learning folks. So we're trying to do it the other way around: create an ongoing dialogue between the domain experts and the data scientists, where the data is a central part of that dialogue, and then have the community start to create solutions on the data science or machine learning side that actually solve those problems.
G: So this may be, as part of what everybody's been talking about, a way of starting to gather the data about what the different challenges are in different domains, and who the types of folks asking for specific solutions are; and then you can kind of map it to datasets and the underlying infrastructure.
E: That's helpful. So it seems to me like what you're doing is a very early chunk of what some people call the blackboard architectural model, where people post things up on a blackboard and just basically share ideas and partial solutions.
E: It's an old idea (it's fairly old; I think it was published some years ago), but it's actually getting some traction again, because now, with smart APIs, you can start thinking about being able to post a solution and reach out to data in kind of a lightweight fashion. So it's almost like a mash-up toward a solution.
G: Yeah, that's in fact the general direction the thinking is going in. This is actually the first time I've heard of the blackboard architectural model; I'm not sure whether the folks who are helping me put this together know of the model, but that's good to know too. Happy to fill you in once they've thought it through a little bit more, because it sounds like this is something where there are lots of moving pieces.
E: So, a lot of what I'm saying is informed by my fairly recent experience with this NIH effort that has taken kind of an odd turn, but an odd turn that I find particularly intriguing. There are four teams being funded across the country to try to pull together very disparate data sets and attack...
E
Basically
fundamental
questions
around
bio,
human
biology
and
the
idea
was
floated
that
would
suite
like
a
blackboard
system
where
we
post
stuff
and
we're
not
trying
to
provide
answers
or
provide
complete
sources
of
knowledge
or
anything
else.
We're
just
going
to
let
this
kind
of
emerge
over
time.
It's
real
simple
metaphor,
and
so
the
reason
I'm
saying
you
can
reach
out
to
the
data
is
that
we
have
a
hackathon
coming
up
fairly
soon
and
what
we've
been
charged
to
do
is
to
take
all
of
the
data
sources.
E
We
have
and
put
them
up
behind
an
API,
a
smart
API
and
expose
them
and
then
we're
going
to
do
queries
across,
and
so
it's
a
really
just
object,
intriguing
concept
because
it
doesn't
require
me
to
do
a
lot
of
I
mean
it
looks,
I
think
it'll
be
inefficient
and
it
isn't
what
you'd
want
if
you're
trying
to
get
answers
fast
at
highly
complex
problems,
but
it
does
permit
you
to
kind
of
start
to
assemble
a
lot
of
stuff
without
tying
everybody
down.
I
think
the
way
to
do
it
fully
specified.
G: That's the general idea that we're going to pursue. We started to look at a lot of different efforts; as an example, very metadata-intense efforts: when they're up and running, they're quite good solutions, but the challenge is getting them up and running across each new domain, particularly if the domain is very different. So we're trying to think of a way of doing this that's much lighter.
E: In a lot of ways, it's analogous to this data lake concept that people have been bandying about recently, which is: don't try to get all your data organized and into a central database; just pour it all into receptacles that you can control, and eventually somebody will start working on some part of it that will help inform something else, and in an emergent fashion you'll start to organize the data. So this is kind of even more generalized, which says: don't even put it in a receptacle, just set up an API.
D: But this is where linked data comes in handy, because you can get sort of emergent annotations across data sets, against the linked data specification, and with the appropriate tools that allows your users to annotate not only the datasets themselves, but also the relationships between them, which may be handy, yeah.
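The point about annotating both datasets and their relationships can be pictured as subject-predicate-object triples, the basic shape of linked data. A minimal sketch, with invented dataset names and predicates (plain Python rather than an RDF library):

```python
# Linked-data-style annotations as subject-predicate-object triples.
# Dataset names and predicates below are hypothetical illustrations.
triples = [
    ("datasetA", "hasFormat", "FASTQ"),       # annotation on a dataset
    ("datasetB", "hasFormat", "NetCDF"),
    ("datasetA", "derivedFrom", "datasetB"),  # annotation on a relationship
]

def annotations_for(subject, triples):
    """Return all (predicate, object) pairs describing one subject."""
    return [(p, o) for s, p, o in triples if s == subject]

print(annotations_for("datasetA", triples))
# → [('hasFormat', 'FASTQ'), ('derivedFrom', 'datasetB')]
```

Because relationships are just more triples, new links between datasets can accumulate emergently, as the speaker describes, without changing any schema.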
B: Okay, can I ask a question along these lines? From an Internet of Things perspective, and I'm thinking about DNA sequencers (right, I'm just focused on the biology): what happens when every hospital, everywhere there's a blood test, when every research lab is generating, what's the term, brontobytes of DNA sequence data that you're never going to be able to annotate, never going to be able to store? It's going to have to be ephemeral.
E: You know, I think that's an existential question, and I don't mean to be facetious about it. I think it's very hard at this point to come to grips with the fact that we're going to be producing data (I mean, I think we're already there) that we're never going to look at, and with how we hook that into something useful.
E: Well, maybe there will emerge some kind of quantum storage device and we can put front ends on it, or something; I just don't know. But you know what has been the most frustrating part of my career, which is getting long in the tooth, I guess: we always try to engineer things to work really, really well, and I'm inclined to try to figure out ways of getting things engineered to work in a half-assed fashion. I hope it gets better.
B: And it's not... this is not a science fiction problem for me, or an existential one. I'm a guy who, right now, has 170 terabytes or something like that of storage, and I need to process petabytes of data. And, you know, I can try to go out to national compute resources to be able to do that, but I'm just one dude. What happens when there are, like, 100,000 people who need to do this?
B: I think that's kind of what SciDAS, what I want it to be, is getting at: some of these concepts of not trying to do too much, and being able to decide what you want to keep and what you want to erase. That's what the other intent here is: I want to be able to have the data in a repository, then process it and delete it, and not worry about having to download it again; that kind of experimental, more ephemeral type of experimentation. Yeah.
A: I think collaborating with NIH and those guys will be interesting over time, as they get more into precision medicine, precision cancer too, when they're trying to leverage all these different pieces of data: clinical research from around the planet, in a hundred different languages, if they use cognitive computing and AI to interpret and translate it and then put it together with clinical research data, environmental data, and your plant and animal genomics data, all this stuff coming together to create context.
D: I think this leads right into the linked data, and then the metadata and the provenance. So you're going to have datasets that you can't keep because they're too big, and you're going to process them, and they're going to produce results, and you're still going to need to know how you got to those results; so you basically have a shadow of the data that was there. At some point, you may need to go back and recreate it all.
F: It may be that we just have to start to be comfortable with more uncertainty. I mean, this is something in the DataBridge, right? So we're trying to build... let's say we have a hundred thousand data sets and we build a network on them, okay, and we say to the users: okay, give us all your data sets and we'll give you the ones that we think are the most similar.
F
But
the
problem
is
there's
a
dozen
at
least
or
maybe
more
like
50
ways
to
define
some
learn
and
and
there's
no
actual
answer
right.
There's
no
ground
truth.
Nobody
knows,
nobody
can
look
at
a
hundred
thousand
data
sets
and
say:
oh
yeah,
these
are
the
ones
are
most
similar
if
they
could,
but
I
wouldn't
move
on
and
do
something
else.
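One concrete example of the many possible "similar" definitions the speaker alludes to is Jaccard similarity over keyword or metadata term sets. This is an illustrative sketch only, with invented keyword sets, and is not DataBridge's actual algorithm:

```python
# One of many possible dataset-similarity definitions: Jaccard similarity
# over metadata keyword sets. Illustrative sketch, not DataBridge's method.
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

genomics = {"dna", "sequencing", "expression"}
hydrology = {"streamflow", "rainfall", "expression"}
print(jaccard(genomics, hydrology))  # → 0.2 (1 shared term out of 5 total)
```

Swapping in cosine similarity over term vectors, or a learned embedding distance, gives a different ranking over the same datasets, which is exactly the no-ground-truth problem being described.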
C: ...because it's linked data. The fundamental algorithms that have enabled Google to do what it does work because the data are linked, because the web pages link to each other, and the ranking they can derive is based on popularity, which is a metric of success or usefulness; it can suggest, perhaps, which data set you should prefer. If the data sets are brought into a linked data model, then you can just apply the techniques being developed in computer science to build the next Google.
A: NIH, yeah, went off and built the whole system. But I'd actually suggest connecting with the Google person about what you're talking about, and then following up with you on that. It is 4:30, so we'll take one last comment or question, and then we're going to close for the day.