SBDH-HarnessingtheDataRevolution-Roundtable
A
So we're very excited. Welcome to the inaugural South Big Data Hub data science roundtable. For those who may be wishing to live-tweet this event and quote Chaitan, please use hashtag #SBDH16 (South Big Data Hub 16) or hashtag #BDHubs. I'm Dr. Lea Shanley, the co-executive director of the South Big Data Hub here at the Renaissance Computing Institute at beautiful North Carolina, Chapel Hill. For those of you who may not be familiar:
A
The South Big Data Hub builds R&D communities of practice and accelerates partnerships among governments, industry, and academia for those who apply data science and analytics to help solve regional and national challenges. The South Big Data Hub is part of a network of four hubs launched by the National Science Foundation in 2015 and co-sponsored by our host institutions and other partners.
A
We manage the South Hub jointly with Georgia Tech, the Georgia Institute of Technology in Atlanta, and we serve 16 states in the South region, everywhere from Texas to Delaware and everywhere in between, and have 500 members from universities, nonprofits, corporations, foundations, communities, and communities of practice. Before we get started, I'd like to introduce Dr. Stan Ahalt. He's the director of the Renaissance Computing Institute and one of the two principal investigators.
B
Thanks, Lea. Yes, I'm so pleased to see everybody here in the room, and then also hearing all the beeps as people come online. Would you estimate, Stephanie, how many people are online at this point? Yes, 35 or 40. More people in: 36. We're going in the right direction. So it's my great pleasure to introduce Chaitan Baru from NSF. He's currently on assignment as senior advisor for data science in the CISE Directorate at the National Science Foundation. I've known Chaitan for a long time. He's a wonderful colleague and does very interesting research, scientific research, himself.
B
He's a distinguished scientist and associate director of data initiatives at the San Diego Supercomputer Center, which is part of UC San Diego, and he has worked on applied and applications-oriented research problems, all of which are related to data management and data analytics. He's been part of a lot of national initiatives. He was CIA.
B
Project, he was part of Cyberinfrastructure for Comparative Effectiveness Research (CYCORE), NEON, GEON; the list goes on and on. Needless to say, he is an incredible asset for NSF. You know, NSF has steered with his leadership, and we are very pleased to have him come down to Chapel Hill. Tomorrow he's over at RTI giving another talk, and we will be in Durham for another South Big Data Hub initiative with a group. So I welcome everybody here. Thank you so much. Just a reminder
B
C
B
C
C
C
C
C
B
C
C
The process by which this happened: each directorate had discussions on what they thought would be these big ideas for the next five, 10, 15 years from a directorate point of view, and these were all bubbled up at an assistant director (AD) retreat that happened across all the directorates, and then these were filtered down to this set of
B
C
B
C
B
C
Ideas, and everything else was connected. But there are six: Navigating the New Arctic; Work at the Human-Technology Frontier; Understanding the Rules of Life, basically, you know, phenotype; the Quantum Leap; and then Windows on the Universe. We now have data science in the middle, and actually you can already see that many of these, if not all, actually have huge data requirements as well, right? So they
C
C
C
C
B
C
About this notion of mid-scale. So if you look at NSF today, the standard programs might get up to, say, 20 million dollars or so, if they're the big STCs and those kinds of things. And then there are the major research equipment projects, which start from around 200 million or so. So there's this gap between 20 million and 200 million, where there could be a lot of interesting work that could be done, but there is really no sort of vehicle to make that possible.
B
C
And they are all very interconnected. So this circle shows, on the left-hand side, all the big ideas; on the right-hand side are all the directorates; and the lines are just saying that they're all connected, as you can see with the jumble. But if you look at the Harnessing the Data idea, it connects to actually all the directorates. So that's important: everybody would be interested in that. And then, if you also look at CISE, it has links to all the ideas behind this stuff. That's very interesting and exciting.
C
So now, let's talk very quickly about existing programs at NSF in the data area. When we look at our data science and data programs in general, we like to use this quadrant diagram to show these four basic areas in which investments are being made. One is, of course, naturally, foundational research. The other is cyberinfrastructure. A third one is education and workforce development, and then collaboration and partnerships. And overlying all of that are policy issues such as open data and data
C
I won't go through all of these, but I just show you the names here. So the Big Data research program, which I actually have a hand in helping coordinate, is actually cross-foundation. We have many program officers; live in, is a big help on this, and we have others filling in from every directorate. We have folks involved in this. But that's something I'm more familiar with, and then there are these other programs that we also have, which are all on the foundational research side.
C
B
B
C
C
The other thing I wanted to mention is at the federal level. Again, this is my way of just giving you some quick context. There are multiple interagency coordination groups in different technical areas under the White House Office of Science and Technology Policy. One of the coordination groups is in big data. That's an interagency working group that I co-chair, along with my colleague Allen Dearry, who's
C
A
C
C
C
And training. As you're familiar, almost every university of any size is looking at starting programs in data science, at the undergrad level and masters. There's a huge demand for these kinds of skills right now in industry. So how do we do things in the short term, but also what's the long-term strategy? And finally,
C
C
C
C
And the third is called data-intensive research. This is research with data in all of the domains, maybe biology, lines and cell biology here, and so those would be the key areas in terms of the research. But then also education is a big aspect, and then, as I said before, there has to be some
C
C
C
B
C
C
C
C
C
C
Ideally, we get the most bang for the buck and also make it really interesting research in this area. All of this should be able to serve in a general sort of way. I think, with this vision, we are also trying to get away from some of the siloing that has happened. I mean, right now, the way MREFC projects are done is that each project does its own cyberinfrastructure, and so the question here is: could we do something more generic that could
B
C
C
C
C
B
B
C
Technological or pragmatic issues, and certainly, if you put all of that together, that's a new discipline. I also like to think about translational data science. Translational data science, then, is the notion of applying data science techniques to solve real-world problems. So
C
B
B
A
C
C
C
B
C
B
B
C
B
C
B
C
C
C
Scale. Companies, they do this every day, so we have challenges like that. So data themselves are, and then you might build under software stacks and use, and also distributed testbeds. That is, for something like smart and connected communities, there might be a testbed. There is this project for the array
C
Smart city projects out there, and so a testbed that you could use for smart cities could be part of this, and so on. Actually NEON, the National Ecological Observatory Network, when it was originally envisioned, the vision actually was an open system so that others can plug in there. If logic, not.
C
C
That we just awarded under the Big Data program, a couple of guys from Virginia Tech and the University of Miami. They are creating this testbed for smart cities, and this last sentence here says the testbed is intended to be open access, to be able to support both research and the whole scientific institution, as well as other users requiring non-proprietary money.
C
B
C
We facilitate. So let me now just highlight a few of the things that would be part of this vision going forward. I already mentioned theoretical foundations and systems, and so on. So just to dig a little bit more into that: NSF recently held a workshop on this topic of theoretical foundations of data science. There
C
You know, so I just pulled out a few statements from the report. Theoretical foundations are fundamental for industrial applications and scientific understanding. There is a demand for training in this area. By the way, this workshop invited sort of one third, one third, one third: people from theory, people from computer science (but this is really sort of machine learning), statistics folks, and math, equally divided among those communities, and it was sponsored by CISE as well as DMS.
C
C
C
C
C
Another idea was machine learning systems, and there's a meeting, again, I'll mention in the next slide. We had some folks from industry, including the vice president for cognitive systems from IBM, so this is Watson, and they were talking about how it's very important to think about how to build generalized machine learning systems, because really, what's going on right now in industry is building
C
B
C,
and they can clearly see the writing on the wall, that this is not scalable, and the cost of maintaining multiple vertical systems will surrender square in this. So they would really like academia to start talking about: how do you build a generic machine learning system? What are the things that are transferable from one domain to the other? That
B
C
C
C
B
C
B
C
C
There are certainly institutional-level repositories. So, they said NSF; the institutional repository would be non-NSF: maybe universities have their own repositories, or a regional network. And there may be large community there. This is for, you know, you must not only use the small in a basis; that's actually the entire Wyoming finding supercomputer building, where there are petabytes large.
C
And I think this is also part of that same vision. If you think back to that picture of the Harnessing the Data vision, you want to populate it at the bottom with all these different kinds of data, but then you want to have services that provide you integrated access. So one activity right now that's going on, that we are trying in this area, is, you know, could we create an open knowledge
C
C
C
C
C
A
C
C
C
So I think the key idea here is: how do we embed ethics? The only thing I'll say here is, I think you really need to make it integral. It's not a question of saying, let's take all the curriculum that we have and just tack on one more course on ethics, and the kids will learn something. It
C
And my last slide, I think, is just to talk about some upcoming events and activities. We have just funded the National Academy of Sciences to run this workshop on envisioning data science, and it's CSTB, the computer science board at NAS, with also the statistics board and the Board on Science Education. So what we want them to do is step back and do some blue-sky thinking. If data science was a new discipline, what would it look like? Not
B
C
C
C
We have another workshop coming up under the NITRD umbrella, called metrics for assessing the value of theory, eating it supporting him and that, and then next year we're going to have the Big Data PI meeting, which we had last year for the very first time. So this is the second Big Data PI meeting, but this time we are combining it with the annual meeting of the hubs.
A
A
Listening through the WebEx and people in the room. We'll start with some questions here, and Stephanie will signal us if there are questions from those participating on WebEx. If you're on WebEx, please type your questions in the box there. So, Stephanie, did you have a question? And we'd like to create a discussion, so don't just ask questions, but offer your comments and thoughts on this, and start responding to each other as well. So with that, first question.
B
It's not a stretch to call these socio-technical systems, and, you know, whenever we go to these workshops, we know the technical part's easy. The hard part is the social part. So, to what extent is studying these kinds of interactions and facilitating community engagement to develop and use cyberinfrastructure, is that
C
C
A
C
C
C
Like, we know this now. I mean, this is something we didn't know for many years, but I think we kind of know this now. So hopefully we're not going to make the same naive mistakes. And actually, in that context, I want to mention that one of the projects that we funded, David Ribes at the University of Washington, has been funded to do a socio-technical evaluation, an ethnographic study of the hubs, because let's not be glib and think this is all going to be successful.
C
A
C
C
B
So I'm really interested in a number of questions, but the one that just jumps out at me, where I have to ask the thinking behind it, is this idea of data science as a discipline, because right now, I think data science is viewed by many on many campuses as an integrated solution, and I can't think of
A
B
B
C
C
C
C
B
C
B
B
C
A
C
B
C
A
B
C
B
B
C
C
By that, followed by this; there are different algorithms for doing those. Where it comes from is actually what our colleague from IBM mentioned in our meeting. They said, if you look at Watson for Jeopardy, it was a Watson system that was built for that. Then they took that and they created Watson for Oncology. That's the system for doing oncology, but it's actually a different system.
C
Some of the same engineers, some of the same experiences, but what they found is, now they're doing Watson for the insurance industry, and their frustration is they're having to start from scratch for everything. And in fact it was industry people who said, and the Amazon person there also agreed, there's got to be a better way; there must be some more systematic way of doing this. So, to me, it's an open question how you would generalize.
B
Your last comment kind of struck me as reflecting something that I know already. I'm a domain expert. I pretend to be a data scientist once in a while, but really I'm a domain expert, and the hard part about doing the work that I'm interested in is this: NIH, which is often the source of my data, does mandate data sharing, but the problem with the data is the metadata hasn't been captured to allow a non-expert to use the data realistically. And, you know, as an example,
B
You know, I could just make it simple. I'm a physician, so I can say, you know, we could have a data point about people's weight. Okay, actually weight isn't uniform anyplace. Does it mean your lightest weight in the day? Does it mean dressed, as I am, with shoes? It means all these things, and that metadata is routinely lost. And in fact, I would go as far as to say, one of the major sources of data for genetics is dbGaP, and in fact, the loss
B
Of the metadata is intentional by the people that submit the data, because they don't really want other people to use it. I mean, that's a cynical, provocative statement, but it's not far from the truth. And so, you know, when you talk about needing to rebuild Watson, it's because that domain expertise wasn't there in the design. And so I'm curious if you have thoughts about how to, you know, bring in the expertise of the domain expert to data science.
B
C
C
Whole space of metadata, because you can easily come up with examples where, for just one piece of data, there is a lot, and there may be other cases where data doesn't need them. So there's a whole range of things, and it's not just understanding what that range is, but also understanding what kind of processing I can do with the data, given that I have this metadata.
B
C
C
B
B
C
C
C
B
C
C
You go back to some of the earlier things I said, you know, all these situations where sometimes this data is good enough to use in an analysis and sometimes it's not. As a data science or computer science technical person, you don't know that; that's very domain-specific. Now, maybe you could, say, learn that over time by observing what's done, but initially it has to be the domain person who says, look, right now that's all I've got, it's okay, let's use it. But then there's metadata there as well, right, when I do the analysis with that data.
A
So Jim Beach asks: The quality of commercial applications and user environments is increasing and accelerating in usability and user experience; for example, you mentioned Siri. Applications deliver the value of data science repositories and integration to science stakeholders. Yet science domains are highly constrained by grant funding levels to deliver nimble and usable applications to researchers and students. Where would the resources come from to take the value of the infrastructure envisioned to implement
C
C
C
C
B
C
Think it's not that academia has all the answers, so I think it would be good to get engaged. My own feeling, and I can say this is very preliminary: we are still in a lot of discussions with vendors and so on, but let's see. You know, it's possible that industry will help us in some ways. Right, question?
C
C
B
A
C
B
C
B
C
C
C
C
B
C
C
B
C
In that area. And the first time I actually mentioned this concept of translational data science was actually at the Big Data PI meeting last year, and in that slide I had a thing that said CI Reuse. I don't know how many of you got funding under that, but there used to be a program under OCI called CI Reuse. Actually, I got some money out of it. It was the concept that if you build some cyberinfrastructure in
C
C
B
C
B
B
C
C
C
C
B
C
C
C
A
C
A
B
C
The question is about software being a first-class object. I would have to say that's already there. I mean, actually, if you look at the data management plan, it actually has wording about what you need, which actually was taken from the software program. I think that Dan Katz, or somebody, actually created it originally, and it's been used by multiple programs. So
C
Is a different question. In fact, let me mention it, actually. This is something I'm very interested in. So everybody talks about reproducibility. When do we actually get to do it? So I've been thinking quite a bit about what would be the role of, say, NSF, a funding agency, and asked some folks. It's still very hard ideas, but, you know, it would be interesting if we could provide some kinds of incentives.
C
C
C
C
B
B
C
B
C
A
A
This remotely. We will be hosting these monthly. If you would like to serve as a presenter or panelist, or would like to host one of these data science roundtables for the South Hub at your institution, please let me know. For those on the WebEx, I did type in my email. Also, the South Big Data Hub infrastructure working
A
Friday at 3 o'clock. Carl has so kindly, in coordination with reading more, organized a series of demos, so we'll be doing one to two demos each week, or every other Friday, from now through the next couple of months. So with that, thank you all. Dr. Chaitan Baru will be staying on for the next hour.