From YouTube: CI WG demo: Dataverse
Description
Date: 03/02/17
Presenter: Gustavo Durand
Institution: Harvard University
Northeast Big Data Hub
A
Description: This is Dataverse, an open source platform that's built to build data repositories for sharing and publishing research data. Evolving the project's community focus from small social science to the big data needs of research has led to features focused on computation, big data storage, and replication.
A
We have joining us today Gustavo Durand from Harvard University. He's the technical lead and architect for Dataverse and has been working with Dataverse since 2006, leading the overall architecture and technical design of the application. He mentors and provides code reviews for the developers, assists the Dataverse community in contributing features that are important to them, and works closely with the development project managers in leading the team and project. So, without further ado, I'll turn it over to Gustavo to tell us more about Dataverse.
B
Thanks. So you guys can see the slides, right? Yes? OK. So basically I will be talking about the project. I'll give an overview and discuss some of its features and the technology we're built on, then shift over to talk a little bit about our process and some of the benefits of how we're doing that, and then about some of the collaborations we've had. I'll end by discussing a little bit more about the community overall.
B
Something I'll come back to a little bit later is the transparency we're able to have with the community, and also the fact that we're able to collaborate. It's not all developed by us; it's also developed by the actual users, and there's a sense of ownership of the product within the community itself. We've been working on this since 2006 here at Harvard. Our funding comes partially from within IQSS, and we also work on getting different grants and on these different collaborations that help in adding a lot of these features.
B
We have a core team here of about 15, which includes everybody from the UI/UX team and the designers to management and technical leads like me, all the developers, obviously, and even our curation and metadata specialists. And then, as we mentioned, we also have a lot of contributions from the community itself, so the team expands and grows over time, remotely, with the community.
B
Okay. So, as I said: data, users and workflows. Some of the features we have are there to support different kinds of data. One of the interesting parts of being open source and having this large community, but also one of the challenges, is that we have lots of different types of institutions and organizations who want to use it.
B
So they have a wide variety of data. One of the things that Dataverse does is provide persistent IDs and URLs for all your datasets, but some organizations want to use DataCite for the persistent IDs, some want to use Handles, and there's actually another group that wants to use something called da|ra. So we have it built so that the ID handling is modular and you can add on different providers as needed.
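As a rough illustration of what "modular" persistent ID handling could look like, here is a minimal sketch. It is not Dataverse's actual code (Dataverse is a Java application); the class names, the `mint_pid` function, the registry, and the example prefix values are all hypothetical.

```python
from abc import ABC, abstractmethod


class PidProvider(ABC):
    """Hypothetical provider interface; not Dataverse's real API."""

    @abstractmethod
    def mint(self, authority: str, local_id: str) -> str:
        """Return a persistent identifier string for a dataset."""


class DoiProvider(PidProvider):
    def mint(self, authority: str, local_id: str) -> str:
        return f"doi:{authority}/{local_id}"


class HandleProvider(PidProvider):
    def mint(self, authority: str, local_id: str) -> str:
        return f"hdl:{authority}/{local_id}"


# Registry keyed by a setting that each installation would choose (assumed).
PROVIDERS = {"doi": DoiProvider(), "handle": HandleProvider()}


def mint_pid(protocol: str, authority: str, local_id: str) -> str:
    """Look up the configured provider and ask it to mint an identifier."""
    try:
        provider = PROVIDERS[protocol]
    except KeyError:
        raise ValueError(f"no persistent ID provider registered for {protocol!r}") from None
    return provider.mint(authority, local_id)
```

Supporting another scheme in this sketch would mean adding one more `PidProvider` subclass and one registry entry, without touching the code that calls `mint_pid`.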
B
We also make sure we generate a citation for any dataset and provide proper attribution, and we're compliant with the FAIR principles and the general data citation principles in the way we do the citations of data, have landing pages, have machine-readable pages, and things like that. We do offer domain-specific metadata. What we mean by that is that every dataset has a core set of metadata that's needed for the citation, such as the title and the author.
B
You also want a description, keywords, and other things to describe it. Beyond that, the software itself allows installations to create and use different metadata blocks, as we call them, to support different domains. So if you're in astronomy, there's a block that has specific fields built upon standards that have to do with astronomy.
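To make the idea of a core citation block plus optional domain blocks concrete, here is a small sketch. The block names, field names, and the `validate` helper are assumptions chosen for illustration; they are not Dataverse's actual metadata schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetadataBlock:
    """A named group of metadata fields (hypothetical model)."""
    name: str
    required: frozenset
    optional: frozenset = field(default_factory=frozenset)


# Core block every dataset needs for its citation (field names assumed).
CITATION = MetadataBlock(
    "citation",
    frozenset({"title", "author"}),
    frozenset({"description", "keyword"}),
)
# A domain block an installation might enable (field names assumed).
ASTRONOMY = MetadataBlock(
    "astronomy",
    frozenset(),
    frozenset({"instrument", "wavelength_range"}),
)


def validate(metadata: dict, blocks: list) -> list:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    allowed = set()
    for block in blocks:
        allowed |= block.required | block.optional
        for name in sorted(block.required):
            if not metadata.get(name):
                problems.append(f"missing required field: {name}")
    for name in sorted(set(metadata) - allowed):
        problems.append(f"unknown field: {name}")
    return problems
```

In this sketch, an installation that has not enabled the astronomy block would reject the `instrument` field, while one that has enabled it would accept the same record.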
B
We also need to support different kinds of users. Different institutions want different ways of logging in. We provide a native login, you know, using a password that you can create through the Dataverse application, but we also use Shibboleth to connect to a lot of the different universities. So I can log in, for example, with my HarvardKey and not need a whole separate login. And we also have OAuth, which we primarily added for ORCID.
B
But with it we also got, with basically just a little bit of extra work, a Google login and a GitHub login. That's now configurable for each installation of Dataverse, so you can decide per installation which of the sign-in options you want to include for yours. At Harvard, we allow any of those. Texas Digital Library, for example, only allows people who are associated with a Texas university that's part of the consortium, so they only have the Shibboleth login, for example. The users can also be of different sizes.
B
You might have an individual researcher who has data that they want to put in a dataverse, or you might have an institute, like IQSS for us, or you might have a journal. There's this ability to embed dataverses within other dataverses, so you can basically manage the hierarchy of your organization with different dataverses and categorize the different data you have. With that, we also have branding, and we actually have branding at more than one level, starting with the installation level.
B
So Harvard Dataverse looks different than Texas Digital Library, which looks different than Odum, which looks different than Scholars Portal in Canada, and so on. But then each individual dataverse within that installation can also have a little bit of branding that says, yes, this is my data as researcher John Smith. We also have widgets, so that you can take a view of a dataverse and actually embed that in your website.
B
Also, developed for journals but available for anybody to use, we have the idea of private URLs. When a dataset is in draft version, you can provide a private URL that, with a token, allows someone to anonymously go and look at the dataset, and it therefore supports anonymous peer review. We also have a bunch of different upload and download workflows. You can upload files via the browser.
B
Also, if you have a Dropbox account, there's a button to add from Dropbox, and you go ahead and log into your Dropbox and add files that way. And one of the things we've worked on, and I'll talk a little more about it when I get to the collaboration slide, is this, for big data packages.
B
We are now able to rsync data over and not use the browser or Dropbox, because they have HTTP limitations on the size, so we use rsync to transfer these larger big data packages. In general, one of the big things about Dataverse is interoperability, so we provide a lot of APIs. We provide SWORD for deposit, which is a simple standard for depositing content, but we also have a very robust native API.
B
Our goal is to make anything that you can do through the UI doable through APIs as well. We're not quite there; we're maybe at 90% or something. But, for example, I can use the Search API to do a search and then download everything via the API. I can also upload and publish via APIs.
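As a sketch of the search step mentioned above: the Dataverse Search API is exposed at `/api/search` and takes a `q` query parameter, with optional parameters such as `type` and `per_page`. The helper names, the server address `dataverse.example.edu`, and the sample response below are made up for illustration; the sample only mimics the general shape of a response and is not real data. No network request is made here.

```python
import json
from urllib.parse import urlencode


def search_url(base_url: str, query: str, item_type: str = "dataset", per_page: int = 10) -> str:
    """Build a Search API URL for the given installation (base_url is a placeholder)."""
    params = urlencode({"q": query, "type": item_type, "per_page": per_page})
    return f"{base_url.rstrip('/')}/api/search?{params}"


def dataset_ids(response_text: str) -> list:
    """Pull the persistent IDs of datasets out of a Search API JSON response."""
    payload = json.loads(response_text)
    return [
        item["global_id"]
        for item in payload["data"]["items"]
        if item.get("type") == "dataset"
    ]


# Hand-written sample in the general shape of a search response (illustrative only).
SAMPLE_RESPONSE = json.dumps({
    "status": "OK",
    "data": {
        "total_count": 2,
        "items": [
            {"type": "dataset", "name": "Example dataset", "global_id": "doi:10.5072/FK2/AAA"},
            {"type": "file", "name": "example.csv"},
        ],
    },
})

print(search_url("https://dataverse.example.edu", "climate data"))
print(dataset_ids(SAMPLE_RESPONSE))
```

The identifiers returned by the second helper are what a script would then pass to the download or publish endpoints of the installation it is talking to.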
B
We set it up to harvest all the metadata from all the other Dataverse installations as they come to be, so that you can go to Harvard Dataverse and search for everything. We don't grab their data. So when you find something interesting and you click on the link, it'll send you straight over to their installation, and they're able to manage their permissions that way.
B
So, being the tech lead, I always have a quick slide about the technology. I'll go through it quickly, and if there are any questions we can cover them afterwards. We're a Java web app. We run on GlassFish, and we use Java Standard Edition 8 and Enterprise Edition 7. We use a lot of different modules for the presentation, business and storage layers. In the backend, we store things in a Postgres database.
B
We use Solr for indexing and search and, as I had mentioned before, you can store files either in the filesystem, Swift or S3. I mentioned our transparency, and that's one of the nice things about open source. Our development process is this: all our issues are in GitHub. We also use a tool called Waffle, which shows issues as they transition during a sprint. When something comes in, it comes into the Inbox. When we start looking at it, we move it to the backlog, and from there it goes into our sprints.
B
It then goes through a QA process and then, when it's through, it gets to Done. These two links, we don't have time to go through them, but the first shows our roadmap and recent and upcoming releases, and the second is a link to that Waffle board. This image shows it all closed up, but if you go to it and click on the double arrows, it'll open up and you'll see the specific issues in each of those different parts of the process.
B
We worked on adding this rsync support to be able to get large data in and out of Dataverse, and we're working with them to be able to replicate it across different places. So if you are working in France and we can replicate it to a server there, you can do your computation work there; it's closer, and you might have access to that environment but not to some other environment. We've also worked with the Massachusetts Open Cloud. That's how we got the Swift storage.
B
We worked with them to get the Swift storage enabled and, through Swift, to allow compute access to your data. The other collaborations are ones that we often talk about in these presentations. I don't think they're super relevant here, but in general they're interesting for different reasons. The first one is for Handle support, Handles being one of the persistent identifiers.
B
We actually worked with a couple of different organizations, and it was a matter of managing the work: DANS first started it, but then CIMMYT finished it up so that we could then merge it into the core. ResearchSpace was all about creating a Java API client, a client library for the APIs, to be able to use the APIs through Java. And we've been working with the provenance group to add provenance, and are working on that actually in this current sprint, where we're trying to finish up some things linked to that.
B
So, our community. I mentioned that we have many different installations. We work on the software here at IQSS, and we also run one of the installations, the Harvard University installation, in collaboration with the libraries. But the software is downloadable and can be installed anywhere, and we currently have 32 different installations around the world.
B
That's what these big dots are on the map. Those are ones that are in production and are willing to be on the map, but there are a lot of others that are also testing it out, and our community continues to grow that way. Because we have all these organizations and everyone's interested in different features, we've had over 50 different people contribute code, and we have hundreds of members in the community, which includes these developers but also researchers, librarians and data scientists. We meet with them regularly.
B
We have a community call every two weeks. If you're at all interested, the next one is going to be not next Tuesday but the week after. And then we have an annual community meeting that we host here at Harvard, where people come in and we have presentations about what people in the group are working on and what we're working on, we present the upcoming roadmap, and we have discussions and lunch and things like that.
B
Here's a picture of our community from the last community meeting, and I think that's what I've got. Our next community meeting is June 13, 14 and 15, so if you're interested and want to join us, let us know. If you have other questions, obviously feel free to ask now, but we're always available through support or these other channels.
B
We are definitely looking into that. I mean, with Dataverse itself you can upload any files, so you're definitely encouraged to upload not just your data but also the software and code that you used to get the results you've been getting. We have also talked with, I don't know if you've heard of them, a group called Code Ocean.
B
They're a, you know, for-profit company, but they work on this idea of bundling the software and data together and then providing, I don't think they call it a container, they call it a compute module, I think. But the basic idea is keeping them together for reproducibility, and so we're working with them.
B
We have proposals with them to get a grant so we can work with them. The one thing we want to make sure of is that we're not developing something specific to them but to that kind of product, so that if someone has an open source one, we can connect with that as well. But we definitely want to support that idea, since reproducibility is very important within the data world and within Dataverse itself. So we're working with those groups.
A
C
B
I'd say it's beyond the academic domain. I mean, it is definitely about research, and academic research in that sense, but it's not just for universities. A lot of the different installations are, for example, national organizations: the Netherlands has one for their national social science organization, called DANS. There are a few other organizations that deal with agricultural studies and things like that, across Mexico and France and the rest of the world, that aren't necessarily university based. But yeah, it's definitely academic research for the most part.
B
We have also talked with some other groups. Red Hat, for example, is interested in possibly using Dataverse for some of their internal processing of data, which is obviously very different than what we're used to in the academic world. So our goal is to produce it as a platform that could be used for anything.
B
C
I would love to. We're working on a number of things that span outside of academia. It's really at the intersection of academia, government agencies, for-profit companies, etc., and I'm a firm believer in not reinventing the wheel. So, particularly since it's easy with you guys being in the Northeast, I'd love to come up and maybe talk a little bit more about how this might be applied to some of the things that we're thinking about.
B
Yeah, definitely get in touch with us, either me directly or through any of those links, or, you know, the project manager or some of the other people. We can set up a meeting and figure something out.
sounds.