Description
Date: 11/11/16
Presenter: Jane Greenberg
Institution: Drexel University
Northeast Big Data Hub
A
A
You can think about indexing data, or finding semantic metadata to represent data, and rarely is one controlled vocabulary or semantic ontology enough. For instance, you may have a data set that needs geographic information, taxonomic information, and topical information, and you may also need multiple types of topical information. That's part of the motivation. The next is that your taxes and my taxes are paying for the creation of ontologies.
A
There are hundreds of ontologies out there. A lot of them are being created by federal agencies, and increasingly they're being published on the web in publicly released data formats and made available for machine use. So we're taking advantage of that, and of the fact that human indexing is costly when one person individually curates while working with multiple vocabularies. HIVE uses SKOS to integrate these existing ontologies, or ontologies that you may be creating in-house, in a linked data, Semantic Web type of format. So the acronym is HIVE; I love it.
A
You could think about the little bee going out to the different vocabularies and bringing back the right term or concept to semantically represent your data. This is just an example of the architecture, and I think I pretty much said this already, but HIVE allows you to search and browse terms from multiple vocabularies or semantic ontologies, and to index using machine learning. I'm going to say a little bit more about HIVE 1.0, which worked with KEA++, a machine learning algorithm that came out of the University of Waikato. Joan is going to talk more about the new work, where we're moving right along so that you could actually plug in different machine learning algorithms, depending on your types of data. HIVE 1.0 dates back a while, but this is just to give you a picture of it. On the homepage here you can see that you can register multiple vocabularies, and this is the HIVE demo you can take.
A
A
Next, here is an example of the concept browser, where you can select multiple vocabularies and search for your concepts in one fell swoop. Here's a search for the word "precipitation" in AGROVOC, which is a vocabulary of the UN Food and Agriculture Organization, and in LCSH, and you see the keyword in context for precipitation. Then there is the third approach, which is the automatic indexing approach, marked by those little green dots here in HIVE 1.0. First you select your terminologies, then you choose your file, or you can put in a URL.
A
A
Say a scientist publishes an article and they're putting their data into the Dryad repository. I'm just going to assume people have heard of Dryad, but it's a repository for data underlying published research, and I've been very involved in that. So the scientist goes to HIVE and selects the vocabularies, in this example just three vocabularies, and comes out with a tag cloud. Some of these terms are very good, and clicking on them and adding them to the metadata to represent the data is good.
A
Some of these terms are not so good, and that has to do with the machine training, which we could do a better job at. Just in the interest of time, here is an example of what the original architecture looks like, combining several open source technologies; it was Java based originally. What you've been looking at comes from the original demo site, and we'll give you an update there. The code was originally on Google; we've moved pretty much all the way over to GitHub, and I'm sorry that I don't have the correct URL there. Mike is going to say a little bit about how HIVE is used in iRODS in a minute. Just to show you, Joan has a demonstration at RENCI, and we just put several ontologies in here. When you think about the time it takes somebody to look across multiple vocabularies, it's ridiculous. We should be doing this stuff.
A
A
C
C
I actually sort of did this in the previous meeting, so this is essentially the same crayon-drawing setup, but I want to put it in the context of how HIVE fits and where we see the system. Next one there, Joan. So the whole idea here is that on the DFC side, at the core is the iRODS server. It's not really good for doing discovery, but what it's really good at is being the canonical data and metadata repository.
C
So you can have policies that control both of those things, so the data and metadata are always there, and then we can treat indexes as ephemeral. We can project different parts of the grid out to different indexes. So you might have this collection being indexed through Elasticsearch.
C
You might have another collection that is indexed and put into a triple store. This is where HIVE fits in, specifically in this area of semantic metadata and linked data, and also being able to have triple store representations of the data inside the catalog. So that's the point of that. Next slide. Where HIVE comes in is that you can have policies that describe metadata templates, so any data in this collection has to be curated with certain kinds of metadata, which can include curation.
C
For example, you must apply N number of terms from these vocabularies, or these vocabularies are suggested for human curation. The idea would be: okay, we want AGROVOC and we want MeSH as available vocabularies, and curators can be cued to search across those vocabularies to find applicable terms, and they're applied to the data.
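A metadata template policy of the kind described here could be sketched roughly as follows. This is only an illustration in plain Python, not iRODS or HIVE code: the `MetadataTemplate` class, the `check_template` function, and the sample terms are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataTemplate:
    """Hypothetical policy: minimum number of terms required per vocabulary."""
    required: dict = field(default_factory=dict)  # vocabulary name -> minimum term count
    suggested: tuple = ()                         # vocabularies offered to human curators


def check_template(template, applied_terms):
    """Return a list of violations; an empty list means the object is compliant.

    applied_terms is a list of (vocabulary, term) pairs attached to a data object.
    """
    counts = {}
    for vocabulary, _term in applied_terms:
        counts[vocabulary] = counts.get(vocabulary, 0) + 1
    problems = []
    for vocabulary, minimum in template.required.items():
        found = counts.get(vocabulary, 0)
        if found < minimum:
            problems.append(f"{vocabulary}: need {minimum} term(s), found {found}")
    return problems


# Illustrative policy: two AGROVOC terms required, MeSH suggested to curators.
template = MetadataTemplate(required={"AGROVOC": 2}, suggested=("MeSH",))
terms = [("AGROVOC", "precipitation")]
print(check_template(template, terms))
```

With only one AGROVOC term applied, the check reports a single violation; adding a second AGROVOC term makes the list empty.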
C
Then, on the application of those terms, on the editing of metadata, or on the ingest of new data, automatically extracted terms can be applied from HIVE, which can then trigger indexing into, for example, a triple store. Next slide. Okay, and the whole point of this, really the nugget here, is this idea of virtual collections. This is based on the idea that you can federate these grids, so we have a global namespace, and then here we're using HIVE and marking up collections with semantic terms from vocabularies that the HIVE service provides.
C
HIVE provides, again, the tools for human curation as well as tools for automatic extraction of terms using natural language processing, which seems to fit really nicely into what Brown Dog is contemplating. But the idea is that you can federate with another site that is also using vocabularies applied to their collections.
C
These two institutes can then federate together, and you can then index across the data in both of those collections and project that into an index. Then, for example, you can do a SPARQL search, and you can actually see data that is spread across distributed collections using a tool like SPARQL, and they appear and behave as one collection. And I think the really cool thing is for the folks that were at the Northeast hub.
C
What I'm most excited about is taking this exact concept that Jane is talking about and doing things like applying data sharing agreements as computer-actionable policies between these nodes in the federation. So you can have one collection that, to the outside user, looks like one coherent collection. It's actually distributed geographically between organizations that maybe have agreements in place, and then you can find those terms using tools that real people use, because real people are not going to use some query language in iRODS; they're going to use SPARQL or Elasticsearch or something like that.
A
C
B
B
These are just snapshots of what Jane showed you in the 1.0 version. This is a Java implementation that runs on a Tomcat server. There are probably, I don't know, at least four different types of data stores underneath this, and multiple APIs, and needless to say, it is rather complex. One of the obstacles that we ran up against was that it was becoming increasingly difficult to enhance, improve, and add capability to HIVE 1.0. One of the reasons for this was, in part, that this version of it was wrapped up, I think, around 2011 or 2012. A lot of those Java libraries that it's built on are outdated, and the source code for some of them isn't even available. So the effort to upgrade all of these libraries is significant, that and the fact that it is not a very extendable, modular design under the covers. The other issue that came about, and this is one that I think Jane encountered, was that we were running into difficulty finding Java-skilled programmers.
B
It seems these days that everyone is jumping on the Python bandwagon, and probably with good reason. It is the class I teach over at UNC. It is a relatively easy language relative to Java. It is an interpreted language, unlike the compiled Java language, and I have to say it is easy to pick up.
A
A
B
So a lot of it was just addressing the skills availability to keep going, to be able to extend it, and to have more of a plug-in architecture for new algorithms and vocabularies. One of the things that has always been an issue is being able to import vocabularies into HIVE 1.0. It's a difficult, time-consuming, somewhat painful process.
B
It is not something that you can easily hand off to someone who does not have development experience, so that is something that we've always been focusing effort on. It has been simplified a fair amount, because more of these vocabularies do seem to be supporting the RDF format, the Resource Description Framework format. If you can at least have a common format, that's a substantial head start toward simplifying this import process. The data model is also simpler: instead of four different data stores, we have one. That doesn't mean that we won't need to add others as we continue with this, but start small, start fast, make it work. The other thing that we need to do is improve the web services. There is a REST API for HIVE 1.0. That is what we have given to Mike, so that API is available, but the intent here is to do the same thing with HIVE 2.0 and improve the interface. The one for the Java version is a little bit cumbersome.
B
B
As for what's already out there: HIVE 1.0 runs on Tomcat, which is a very powerful web and application server. There is a very lightweight web framework available in Python called CherryPy. This was recommended to me by another faculty member at UNC who has used it before. So far, so good. The Drexel administrative support has been wonderful in getting this up on their server, so it's working pretty well so far.
B
B
B
... a keyword extractor for metadata generation, so we didn't use that. We did actually contact them to see if we could find out about a Python version, but that wasn't in the works. That would be a good thing to do, to try to port that algorithm. But the author of that actually had taken a look at this RAKE algorithm, and she said she thought it was pretty good but needed some enhancements. RAKE is fast, though.
B
It probably only works with maybe three features when it tries to evaluate the ranking of a term, but that's a good start, and the code is available. So you can add what you want to it. It's basically open source.
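To make the RAKE idea concrete, here is a minimal sketch of RAKE-style scoring in plain Python. It is not the code used in HIVE 2.0. It follows the published RAKE approach of scoring each word by its frequency, its degree (co-occurrence within candidate phrases), and the ratio of the two, which may or may not be the three features meant above. The stopword list and sample text are assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real RAKE implementations use a much longer one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at stopwords and punctuation."""
    phrases = []
    for fragment in re.split(r"[^a-zA-Z\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each candidate phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences in the phrase, counting the word itself
    scores = {}
    for phrase in phrases:
        scores[" ".join(phrase)] = sum(degree[w] / frequency[w] for w in phrase)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text = "Precipitation data for the river basin. Analysis of river basin precipitation."
for phrase, score in rake_scores(text):
    print(f"{score:4.1f}  {phrase}")
```

Longer phrases built from frequently co-occurring words rise to the top, which is why the method is fast: it needs only one pass to count and one pass to score.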
RDFLib has been invaluable because the vocabularies that we've imported into HIVE 2.0 so far were all in RDF format. RDFLib did most of the work; I didn't have to write anything. And then SQLite, which is a very lightweight relational database that you probably all have on your phones and on your laptops, it's pervasive, is what is being used here for this version of HIVE 2.0. This is very much modularized with an API, so if we had to scale up to a larger database, that can easily be done, but we're starting out with SQLite.
B
So these are intended to be just a few screenshots to show you how, at least from this demo page, it mirrors the same functionality as in HIVE 1.0. This is the home page. If you have the slides, when you click on the blue link it will take you to the demo. This is a list on the home page of the vocabularies that have been imported into 2.0. It's a little hard to read, but there's the Unified Astronomy Thesaurus, and I think the next one is the US Geological Survey.
B
A
B
Yeah, that was one of the ones you gave me. So all these vocabularies are in RDF format. You go to these websites and they give you a link to download the file. I wrote some Python code that uses the RDFLib library to process these. Basically, it goes through, parses all the content, and generates what is actually a very large graph database. It is a graph database of triples; the triple structure defines the concepts and their relationships.
B
It's also a pretty big database. My first test was actually with the Unified Astronomy Thesaurus, and that has, I think, something like eighteen hundred concepts, and I don't remember the number of relationships, but the graph database that got generated was about ninety meg. It also has thirty thousand plus triples. So I started looking through it and thinking, this is an opportunity for improvement here; ninety meg is kind of big. And I realized an awful lot of the triples had nothing to do with what I was doing.
B
It was really extraneous information for our purpose, so I did pull those triples out. But then I also took the graph database and just converted it to a relational database, a very simple database with three tables, I believe, and when I did that it became 1.1 meg, and it's fascinating.
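The conversion from triples to a small relational store might look something like the sketch below. The talk says only that there were "three tables, I believe", so the table names, columns, predicate names, and sample triples here are all assumptions for illustration. A real import would read triples from an RDF parser rather than from a hard-coded list.

```python
import sqlite3

# Hypothetical triples as (subject, predicate, object); toy data, not taken from a real vocabulary.
TRIPLES = [
    ("c1", "prefLabel", "Cosmology"),
    ("c2", "prefLabel", "Big Bang theory"),
    ("c2", "altLabel", "Big Bang"),
    ("c2", "broader", "c1"),
    ("c2", "modified", "2016-01-01"),  # extraneous for browsing and search, so it is dropped
]

# Assumed three-table layout: concepts, alternate labels, and relationships.
SCHEMA = """
CREATE TABLE concept  (id TEXT PRIMARY KEY, pref_label TEXT);
CREATE TABLE label    (concept_id TEXT, alt_label TEXT);
CREATE TABLE relation (concept_id TEXT, kind TEXT, target_id TEXT);
"""


def load(triples):
    """Build an in-memory relational store from only the triples that are needed."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    for subject, predicate, obj in triples:
        if predicate == "prefLabel":
            db.execute("INSERT INTO concept VALUES (?, ?)", (subject, obj))
        elif predicate == "altLabel":
            db.execute("INSERT INTO label VALUES (?, ?)", (subject, obj))
        elif predicate in ("broader", "narrower", "related"):
            db.execute("INSERT INTO relation VALUES (?, ?, ?)", (subject, predicate, obj))
        # every other predicate is ignored
    db.commit()
    return db


db = load(TRIPLES)
row = db.execute(
    "SELECT c.pref_label, r.kind, t.pref_label FROM relation r "
    "JOIN concept c ON c.id = r.concept_id JOIN concept t ON t.id = r.target_id"
).fetchone()
print(row)
```

Dropping the predicates that browsing and search never touch is what shrinks the store; the join at the end shows that the concept relationships survive the conversion.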
B
This is the browse vocabulary page, again similar to 1.0. You select a vocabulary, and on the left-hand side you're going to see a hierarchical view, a tree structure. All of these vocabularies that come from RDF are hierarchically organized. Most of them are organized around top concepts: the vocabulary definers, or authors, themselves define the top concepts of the vocabulary, and we use those as the starting points. There usually aren't very many of them, maybe 20 or 25.
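The tree view that starts from top concepts can be sketched with a short recursive walk. The concept names and the `NARROWER` mapping below are made-up toy data, and the sketch assumes the hierarchy has no cycles.

```python
# Hypothetical toy vocabulary: concept -> list of narrower concepts.
NARROWER = {
    "Cosmology": ["Big Bang theory", "Dark matter"],
    "Big Bang theory": ["Cosmic inflation"],
    "Stellar astronomy": [],
}
TOP_CONCEPTS = ["Cosmology", "Stellar astronomy"]  # declared by the vocabulary authors


def render_tree(concept, narrower, depth=0):
    """Return indented lines for a concept and everything narrower than it."""
    lines = ["  " * depth + concept]
    for child in narrower.get(concept, []):
        lines.extend(render_tree(child, narrower, depth + 1))
    return lines


tree = []
for top in TOP_CONCEPTS:
    tree.extend(render_tree(top, NARROWER))
print("\n".join(tree))
```

Because there are only a couple of dozen top concepts, a browser can show just the first level and expand narrower concepts on demand.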
B
In this example, you can see that I've used the USGS one. I highlighted a term in there; when you click on it, what you're going to get on the right is the detailed information. So from RDF you get your preferred label for a term or keyword. You get alternate labels, you get notes if they're provided, broader concepts, narrower concepts, and related concepts, and you can then select inside of those. I'm not sure if I'm going to lose one.
B
This is an example. This is the actual demo with the Unified Astronomy Thesaurus. If you look at "Big Bang theory", you can come over here and find a little extra information: these are the broader concepts, these are narrower concepts. You can go through, and if you're looking for particular terms and you want to confirm that the term being used for metadata is correct, you can get such detailed information about it.
C
B
Well, I have to say the sysadmin at Drexel has been wonderful, because we had some logging issues that were not related to the code. We've got five minutes left? Okay. Searching vocabularies, and I didn't talk about this before: this is just a case where you pick the vocabularies that you want to search, you enter some kind of concept or keyword, it does a wildcard search, and it's going to list over here the different vocabularies that have a concept that contains that.
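A wildcard search across selected vocabularies can be sketched in a few lines. The real service queries its database; this example instead uses a hypothetical in-memory dictionary with made-up concept labels, and the `search` function name is an assumption.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory vocabularies: name -> list of concept labels (toy data).
VOCABULARIES = {
    "UAT": ["Big Bang theory", "Interstellar dust"],
    "USGS": ["Precipitation (atmospheric)", "Dust storms"],
}


def search(keyword, selected):
    """Return {vocabulary: [matching labels]} for labels that contain the keyword."""
    pattern = f"*{keyword.lower()}*"  # wildcard on both sides, case-insensitive
    hits = {}
    for name in selected:
        matches = [label for label in VOCABULARIES.get(name, [])
                   if fnmatchcase(label.lower(), pattern)]
        if matches:
            hits[name] = matches
    return hits


print(search("dust", ["UAT", "USGS"]))
print(search("precip", ["USGS"]))
```

Only vocabularies with at least one matching concept appear in the result, which mirrors the list of matching vocabularies described above.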
B
Then here's indexing, and again this is very similar to what we were doing before. You select the vocabularies of interest, you enter the URL for the web resource or the document, and you click Index. This is the word cloud, and it just shows which keywords were selected. It's like a regular tag cloud, where the font size of the keyword reflects the ranking.
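One simple way to turn keyword rankings into tag cloud font sizes is a linear scale, sketched below. The point sizes, the function name, and the sample scores are assumptions; the demo may use a different scaling.

```python
def font_sizes(scores, smallest=10, largest=32):
    """Map keyword scores linearly onto font sizes in points (scores must be non-empty)."""
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    sizes = {}
    for keyword, score in scores.items():
        # When every score is equal, show all keywords at the largest size.
        fraction = (score - low) / span if span else 1.0
        sizes[keyword] = round(smallest + fraction * (largest - smallest))
    return sizes


ranked = {"precipitation": 7.5, "river basin": 5.0, "analysis": 1.0}
print(font_sizes(ranked))
```

The highest-ranked keyword gets the largest font and the lowest-ranked gets the smallest, with everything else placed proportionally in between.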
B
Lastly, and this is just my own working set of next steps, I do need to harden the web services API. It is a REST API. Right now the browser has a lot of JavaScript that issues Ajax calls to the REST API, which talks to Python on the back end; Python talks to the database and generates JSON that gets sent back and converted into the web page. That API does need to be hardened for use.
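The request flow described here, a path coming in and JSON going back, can be illustrated with a small framework-free sketch. The URL layout, the `handle_get` function, and the concept record are hypothetical; they are not the actual HIVE 2.0 routes, and a real service would sit behind a web framework and query the database.

```python
import json

# Hypothetical concept store; a real service would query the database instead.
CONCEPTS = {
    ("uat", "big-bang-theory"): {
        "prefLabel": "Big Bang theory",
        "altLabels": ["Big Bang"],
        "broader": ["Cosmology"],
        "narrower": [],
    }
}


def handle_get(path):
    """Resolve a path like /vocabularies/<vocab>/concepts/<id> to (status, JSON body)."""
    parts = [part for part in path.split("/") if part]
    if len(parts) == 4 and parts[0] == "vocabularies" and parts[2] == "concepts":
        concept = CONCEPTS.get((parts[1], parts[3]))
        if concept is not None:
            return 200, json.dumps(concept, sort_keys=True)
    # Unknown routes and unknown concepts both get a JSON error body.
    return 404, json.dumps({"error": "not found"})


status, body = handle_get("/vocabularies/uat/concepts/big-bang-theory")
print(status, json.loads(body)["prefLabel"])
print(handle_get("/vocabularies/uat/concepts/missing"))
```

Returning a well-formed JSON error for bad input, rather than letting an exception escape, is one small example of what hardening such an API involves.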
C
B
A
B
A
C
A
B
C
B
A
... the approach and the technology. As I said, the LTER folks use HIVE, and I don't know how often they use it. I could check again, but they have their own HIVE. It looks very different. Brian has an instance of HIVE, and it's a prototype. There's some legal group in Italy that's using it, and they have their own vocabularies. But the demo HIVE could be a service, yeah. You know, we had like lots of February.
A
A
That's my particular interest, among others. Any other questions? I know you and I are following up afterwards. I do want to say, we right now have on the books Mike Conway and Kenton doing additional demos on December 9th. I'm going to be out of town, with the IEEE Big Data summit and other conferences, and given the holidays, would people want to postpone those demos to the new year? Okay, Mike has nodded yes. Kenton?