Description
Date: 5/3/2019
Presenter: Valentin Pentchev
Institution: Indiana University
Midwest Big Data Hub
A
At IUNI, he's been involved in several, I think, very interesting and kind of leading-edge projects over the years. A lot of us will know him from [unclear]. He's here today to talk to us about CADRE, and you'll have to tell us how you like that acronym pronounced: the Collaborative Archive and Data Research Environment. And so I will turn it over to you. Thank you.
C
[unclear] Thank you very much for inviting me here. I hope everybody can hear me okay and see my slides; if not, please blink twice. Okay, thank you. I am new to this group, but I am a veteran of the Midwest Big Data Hub. I was there at the first charrette, through the wild, wild west rules of engagement three years ago. We at the Network Science Institute were presiding over the network science spoke when spokes were still a thing, and we were even participating in a data science ring of the Midwest Big Data Hub.
C
What I am talking about today is a different project called CADRE. That is how it can be pronounced: the Collaborative Archive and Data Research Environment, which is the product of a project we received funding for from the IMLS, the Institute of Museum and Library Services, called Shared Big Data Gateway for Libraries.
C
The reason we are discussing this is that there is a dilemma facing most libraries: they cannot provide researchers with sustainable, standardized access to licensed data sets, as well as to freely available data sets out there. We have been experiencing this problem at Indiana University ourselves, by means of implementing the infrastructure and allowing our researchers, who come from various backgrounds, to access this data. And we realized that for anybody outside of the R1 institutions this is almost impossible, and even for the R1 folks it is possible but requires a very large investment in resources.
C
We are collaborating with the Big Ten Academic Alliance, and a lot of their members are funding the project. But the idea is to have an open tier as well, so that everyone out there, whether at a research library, at a community college, or just the interested public, is able to access those resources like the Big Ten and the rest of the R1 institutions. This provides a solution to the main problem of big data set implementation and of getting access to both licensed and open-source data in CADRE.
C
We also try to target, and I will speak more about this in a minute, various kinds of researchers with various levels and depth of understanding of computer technology in general. So there will be a graphical interface that will guide people through the creation of queries and exploration of the data. A little bit more on the second point: when I came here to Indiana University in 2015, we were one of the few institutions that had purchased the data, and [unclear].
C
This is the Web of Science data from then Thomson Reuters, now Clarivate. The original plan was for the data to be stored on a non-networked computer, chained to a basement wall, with a sign-in and sign-out sheet. It took us about three years to convince the administration of the university that we can solve the technical challenge of creating a secure enclave on one of our compute clusters, where the data can live and be accessed by researchers while still adhering to the vendor specifications. [unclear]
C
We're also proud that this will be cheaper for the big institutions, and that it will be possible for everyone else. As a rough estimate, it costs from half a million to a million dollars on average to start accessing the data. We also have the support of three out of the four Big Data Hubs. We are hoping that the fourth one will join us as well, and this is the beginning of a project that will hopefully span way beyond [unclear].
C
Quickly, a few words about IUNI. We are what I call a unique startup: a cross-campus, transdisciplinary institute inside of a big educational institution. They told me to start there when I came from California, where I was involved in multiple other startups, and I have tried for a fourth year to keep the startup culture alive at IUNI. Our mission is to strengthen the theories, methods, analytical tools, and practice of network science, but also to foster a collaborative, interdisciplinary approach to understanding the complex problems of our society. The startup culture history
C
is important because IUNI was formed intentionally outside of every school at IU. We don't have a dean, we don't fight for indirects, we don't issue degrees, and we don't have students. What we do have is the attempt, through projects like this, to bring collaboration to people who care about network science.
C
We decided to take the opposite approach: engaging the research community and trying to build specifically what researchers are telling us they care about. So the constituencies of the CADRE project are people from informatics and computer science who care about application programming interfaces, notebooks, and access to raw data. These are usually folks who have lots of PhD students and postdocs at their disposal and need access to the data itself.
C
What we have for all of them are three different approaches to the same data. For the first ones: access to the raw data, in the form in which it comes from Web of Science through Clarivate, or from Microsoft Academic Graph, from US patent data, and so on. This is XML, JSON, and comma- and tab-separated files. We provide access through dynamic-schema and cloud-native technologies, like what you see on the Microsoft Azure side, and Athena and Glue on AWS. Then we allow them to spin up their own Databricks or clusters using HDInsight or EMR.
C
If those terms don't mean anything to you, they are intended for people who do this on a daily basis. For the second type of researcher, we are currently engaging the research data commons center here at IU to research the best databases and data sets to answer specific queries. Since the creation and gathering of user stories showed us how different and heterogeneous the research is, we figured that different queries to those data sets would be best answered by different technologies.
C
We plan for, and currently have, a relational database in the enclave. We are bringing in a graph database and testing between a few of those listed here: Neo4j, [unclear], AgensGraph. We also utilize the same cloud and service technologies mentioned above. And for people who don't know how to write SQL and Cypher queries,
C
The main part of the project, or one of the main parts of the project, is a guided query-building interface. The query builder, which is a web interface currently a work in progress, will allow researchers to generate and create their own queries and also, one very seldom discussed part of it, will be able to suggest the best technology. The user will have control over it at any point, but our initial tests show that relational databases answer some queries [unclear].
C
Another big part of the enclave, or the shared big data access gateway, is a federated login, which allows us (and this is already working and available to use) to pass authentication to the educational institutions themselves. This is first to make our life easier, but also to make sure we are able to restrict access to the appropriate resources.
C
Most of the Big Ten academic partners now have cleared access to Web of Science data from Clarivate Analytics. A few of them, however, have slightly different requirements than the rest. Based on the federated login and the institution that answers the login request, we can figure out and further restrict who has access to what data sets. This is an enclave that includes proprietary data like the Web of Science and mixes it with open-source data like the Microsoft Academic Graph, again, US open data, a few newspaper publications, [unclear] records, and so on.
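The institution-based restriction described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CADRE's actual implementation: the `ENTITLEMENTS` table, the institution identifiers, the data set names, and the functions `datasets_for` and `can_access` are all hypothetical, and a real deployment would take the institution from the federated login response rather than from a plain string.

```python
# Hypothetical sketch: map the institution that answered the federated
# login to the data sets its users may query. All names are assumptions.

# Data sets assumed to be open to every user, including the open tier.
OPEN_DATASETS = {"microsoft_academic_graph", "us_patents"}

# Licensed data sets cleared per institution (illustrative values only).
ENTITLEMENTS = {
    "iu.edu": {"web_of_science"},
    "uiowa.edu": {"web_of_science"},
}


def datasets_for(institution=None):
    """Return the data sets visible to a user from `institution`.

    `institution` is None for users on the open tier (for example a
    social login), who only see the openly available data sets.
    """
    licensed = ENTITLEMENTS.get(institution, set())
    return OPEN_DATASETS | licensed


def can_access(institution, dataset):
    """True if a user authenticated by `institution` may query `dataset`."""
    return dataset in datasets_for(institution)
```

In this sketch an unknown institution is treated the same as the open tier, which is one possible design choice: access to licensed data has to be granted explicitly and is never assumed.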
C
This will ensure reproducibility, replicability, provenance, and transparency of the data, since every process and permutation is, first, very well documented and, second, DOIs are issued at every step where the data gets changed. This will allow for the creation of publication and, actually, education packages that will be easily traced by a single link to the enclave, or to the shared big data gateway, and that contain not only the data but all the tools and libraries required to work with it.
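The idea of documenting every change and attaching an identifier to each step can be sketched as a simple provenance chain. This is an illustrative sketch only, not how CADRE records provenance: real DOIs are issued by a registration agency, so the content-hash identifier below (prefixed `local:`) is merely a stand-in, and the names `Step`, `record_step`, and `lineage` are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One documented change to a data set."""
    identifier: str          # stand-in for the DOI issued at this step
    parent: Optional[str]    # identifier of the previous step, if any
    description: str         # what was done to the data


def record_step(description, data, parent=None):
    """Document a change and derive a stable identifier for the result."""
    parent_id = parent.identifier if parent else None
    payload = json.dumps(
        {"parent": parent_id, "description": description, "data": data},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return Step("local:" + digest, parent_id, description)


def lineage(step, index):
    """Walk back from `step` to the original data, newest first."""
    chain = []
    current = step
    while current is not None:
        chain.append(current.description)
        current = index.get(current.parent) if current.parent else None
    return chain


# Usage: two steps, then trace the derived data back to its origin.
raw = record_step("raw export", [3, 1, 2])
cleaned = record_step("sorted records", [1, 2, 3], parent=raw)
index = {s.identifier: s for s in (raw, cleaned)}
history = lineage(cleaned, index)
```

Because each identifier depends on the parent identifier, the description, and the data, repeating the same step on the same input yields the same identifier, which is the property that makes a single link traceable back through every change.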
C
You can trace every permutation and change to the data that was made since its inception, and you can either use the published tools on your own data set or, vice versa, use your own tools on data sets that we already have. Yes, this is basically it, in a very quick nutshell; a lightning talk, I guess. This is our main diagram, and a few of the systems. I already mentioned the federated authentication system. If you take a look, it includes a custom login.
C
We are trying to incorporate Google, Microsoft, and Facebook accounts as well, for those who are not part of educational institutions but still want to access the somewhat limited, but available, free tier of the enclave; the research asset commons, where people will be able to share and save their data; the web query interfaces that will allow access to various resources; and
C
also, in the future, allow access to the local institutional resources. This is the reason I have been following, since the beginning, and am very interested in the Open Storage Network, because technologies like the Open Storage Network will make it possible for institutions to use their own compute infrastructure and pull those huge data sets up and down. And I think this is it from me and the rest of the leadership. I want to thank you for the opportunity. There are a few ways to contact us.
C
One last thing that I forgot to mention, and this was just announced at the product owner council: we're starting a CADRE fellowship program for anyone who is interested in using our data sets. You can go to cadre.iu.edu and find the fellowship program link. Please distribute this to your students and researchers who are interested. We will be giving access to preliminary data to all our fellows, and we'll also be able to sponsor six of them, six teams, to come present with us at ISSI 2019 in Rome.
B
May I ask a question? This is [unclear].
C
Sure, please.
B
This is a great presentation. I work with Trusted CI at IU, and also with ResearchSOC, the Research Security Operations Center, and one of the things we're looking at, because I work on accelerating cybersecurity research into practice, is getting the data sets that cybersecurity researchers need and giving them access to those. So there are two different problems. One is actually finding people who can share cybersecurity-related data.
C
The project is a two-year project, which we honestly believe we will be able to keep working on in perpetuity through the membership model. But for the two years of the project, we promised those three data sets that I mentioned before, and we'll have to deliver on those. The idea, however, is that this will pave the way for expanding the resource to multiple other data sets. And again, this is the first question I get: everyone has data sets they want to share, or at least make available to their users, and each of them [unclear].
C
It costs a few hundred thousand dollars to have, and once you have it, again, you're delivered a hard drive packed with XML. So originally we were one of the few institutions that had purchased it, here at IU, and we knew about other labs at UIUC, at UC, and the rest that would purchase their own [unclear].
B
Could I ask a question that's sort of related? You had a box in the diagram that was labeled, I think, granular [unclear] permissions. How do you manage those permissions, or granular data set permissions? So if I want access and I have rights to certain things, how do I communicate those rights, or how do you find them out?
C
There's a box of rights, most of them being the data sets you have access to. And currently, again, this is a very young project: it was supposed to start in September, and it actually started in January. Currently we base this on the institution. We know that the University of Iowa has access to the Web of Science; Iowa State, however, does not. So, using the federated authentication system, we pass the authentication tokens to the appropriate institution. If you log on as part of IU or UI, you get access to the data.
C
If you log on as part of [unclear], you do not. A second form of permissions is when people are allowed to create their own teams, assign them to projects, and then share data with them: people in my lab, people in my university, or I want to share this with everyone. We allow users to do that. However, there is another check: is there proprietary data, which is not allowed to be shared? If anyone wants to make derivatives of their work based on Microsoft Academic Graph, which is an open, public resource, [unclear].
C
It is a [unclear] definition. And again, this is not there yet; this is planned to be there. But the idea is that I can invite people to join me, or I can add to my team people who are already part of CADRE. Thank you. So we would allow multiple people to be members of multiple teams that do not need to be limited to the same institution. For collaborating, they should be able to form teams that are cross-institutional.
B
Thanks.
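The two checks described in this answer, team-based sharing plus a separate check for proprietary data, can be sketched as follows. This is a hedged illustration, not the project's actual permission model, which the speaker describes as still being planned. The names `Team`, `Derivative`, `may_share`, and `PROPRIETARY_SOURCES` are hypothetical, and the rule that every team member must come from a licensed institution is an assumption made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical labels; which sources count as proprietary is an assumption.
PROPRIETARY_SOURCES = {"web_of_science"}


@dataclass
class Team:
    name: str
    # Members are (user, institution) pairs, so a team may span institutions.
    members: set = field(default_factory=set)


@dataclass
class Derivative:
    owner: str
    sources: frozenset  # data sets the derived work was built from


def may_share(item, audience, licensed_institutions):
    """Decide whether `item` may be shared with `audience`.

    `audience` is either the string "public" or a Team.
    `licensed_institutions` maps a proprietary source to the set of
    institutions cleared to see it.
    """
    restricted = item.sources & PROPRIETARY_SOURCES
    if not restricted:
        return True   # built only from open data: any audience is fine
    if audience == "public":
        return False  # proprietary data is never shared with everyone
    # Assumed rule: every team member must belong to an institution
    # licensed for every proprietary source the item depends on.
    return all(
        institution in licensed_institutions.get(source, set())
        for source in restricted
        for _, institution in audience.members
    )
```

Keeping the proprietary check separate from team membership means a cross-institutional team can still be formed freely; only the sharing of restricted derivatives is limited.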
C
Well, thank you very much. I will share the link to the fellowship in the chat. Please take a look and forward it as much as you can. We need all the information, because again, we have working proofs of concept that we demo. By the way, the link I will share will have another link to the recorded video demonstration we did a few days ago. But this is a two-year project in its infancy that we are shaping along the way, hopefully by the feedback of the researchers who use this data.
A
Super. Well, I know I'll be following up with you, and I'm sure you're connected to Melissa already on the OSN. So where we're able to help, we look forward to connecting with you on that. And I will definitely let people know about your fellowships; I think they will be very sought after. One final thank you again, Val, for your presentation and for taking time on a Friday afternoon. That was really interesting, and we appreciate your time. Thank you.