From YouTube: CI WG demo: Biodiversity Open Data Landscape
Description
Date: 02/17/17
Presenter: Jennifer Hammock
Institution: Smithsonian Institution
South Big Data Hub
A
Today, we're really excited about the two presentations we're going to have. The first is Dr. Jennifer Hammock from the Smithsonian Institution; she is a project manager at the Encyclopedia of Life. A lapsed chemist and marine biologist, she has been learning biodiversity informatics on the job for four years. Jennifer and I met maybe four or five years ago, when you were starting on the road of citizen science and crowd-sourced data collection as well. Dr. Matt Spitzer will follow, on the Open Science Framework for connecting institutional research. Matt is a community manager at the Center for Open Science, a nonprofit tech organization building open-source public goods infrastructure to improve the transparency, openness and reproducibility of research. And I think I forgot to mention: I'm Lea Shanley, for those who may not know me, co-executive director of the South Big Data Hub. So welcome, all, on a beautiful Friday afternoon. We're very glad to have you here.
B
Thank you, Lea, and thank you, everyone, for coming. I haven't been to as many of these meetings as I'd like, but I try to keep up on the Slack, and I think this is a little different from most of the presentations you've had. I've seen a lot of solutions presented, and I am, I guess, going to have more of a use case, representing my research community and our data. I do have one or two things to crow about, but mostly, I think, it turns out I'm bringing in my problems.
B
So imagine research being done over a period of about 200 years, since sort of late in the career of Carl Linnaeus, who started naming organisms, and you can imagine the variety you might get in recorded data. And a lot of it is written down. The vast bulk of the breadth of information we need is written down, so it has recently been digitized, and is increasingly being reborn digitally.
B
So, a lot of old data, a lot of data that's inherently complex, but again, we're only talking, for the most part, in the millions of records. I'm not going to delve, for those who are wondering, too much into barcodes or other molecular data. We touch that sphere, but that's another sort of born-digital, much more organized subsphere of biodiversity data, so most of the problems are outside of that area.
B
This is the first department, at the Smithsonian here in DC. Same museum, different department: this is entomology, and again, these are their prettiest specimens, but this is representative of the depth and numbers. And finally, this is the invertebrate zoology department, which is everything in the sea that's not a fish or a mammal, pretty much, so this is perhaps the widest evolutionary variation that you see in the...
B
These guys are one big source of the raw material that we're working with. The other one is born digital and often online, and it goes by many names. Citizen science is probably the most familiar. This is a snippet of activity from a platform called iNaturalist. This is where ordinary humans, who may be professional biologists but usually are not, go and identify an organism and place it, with a photograph to voucher what it was, on a map with a timestamp.
B
This is a representation of iNat data at the global scale, but to give you a sense of density, this is our neighborhood here in DC. This is the typical density of pins you'd find in an urban area on a platform like iNaturalist, of which there are maybe a dozen sizeable ones over the globe, and a couple of hundred smaller boutique platforms.
B
This is a snapshot of GBIF, which is the largest aggregator of data of this kind, and it does include iNaturalist, for example, and a number of museum sources. The Smithsonian's in there; the big European natural history museums of London, Hamburg, Paris and so on are all in there. And the biggest single provider to this map is called eBird, a citizen science platform. Because birds have more charisma than any other group, they provide about a third of the data overall on this platform.
B
I mentioned some of the data was in text form, and so that is primarily, right now, in the process of being structured. For the most part, it has been scanned already. This is a snapshot from the Biodiversity Heritage Library. You will find an image of a page, and you'll see the OCR, which keeps evolving, as I'm sure you're all aware. The quality right now is not bad; most of the text is available in here, although it does depend to a certain extent on typeface, and that depends to a certain extent on vintage.
B
So altogether, this stuff is connected in something we describe as a biodiversity knowledge graph. This image is from one of the pillars of our community, who is the futurist among biodiversity people. His name is Rod Page, that's his blog down at the bottom, and he is trying to drag us into the 21st century.
B
Most of this... this is the one accomplishment I really wanted to crow about. Most of this data is mostly open. Sorry, most of these categories of data are mostly open, so you can get to it. There usually isn't a paywall, and usually there aren't restrictions on reuse of this data, so we don't have a lot of security concerns or ownership concerns in this area. Here, the exceptions are, for the most part, about half open: publications, meaning scientific journals, and photos, be they collection photos or amateur wildlife photos.
B
These allow reuse of information that is in copyright without seeking permission from the... They're legal licenses. They have legal code available, and these are different flavors, which have different restrictions upon reuse. The Creative Commons team has also made available other stamps which assert a lack of copyright.
B
So I'm going to talk to you now in a little bit more detail about this area of the graphic, because I'm more familiar with it. This is the traits-based area of the graph. A trait for a biological organism can be many different things: an innate attribute like body mass or body length, flower color for plants, preferred habitat. So they can be tangled up in behavior. And my project, the Encyclopedia of Life, as well as a number of smaller projects, have been aggregating traits at the species level, trying to make this information available for reuse and for better indexing of the rest of biodiversity knowledge.
B
Traits come from a number of different places, but an important one is modern publications, either data papers or published datasets. There are a number of publishers out there that are helping individual researchers as well as institutions make this data available. It is still in an interesting variety of structures and formats. You can get anything from, for example, Ecological Archives or Dryad. You can download datasets.
B
And those datasets could be anything from XML, to, most often, something tabulated like a TSV file or a spreadsheet in Excel format, to a PDF of a table, which, you know, usually will be selectable, and you can get the data out somehow. But there's still a wide variation in how individual researchers are expressing what they think of as structured data.
B
So this is an example of a page on the Encyclopedia of Life, which I support, and this is a summary page showing you a sort of array of structured data records that we have available for this shark. This is just a summary view, but to get a closer look, this is what an individual record might look like. These are two different records for different species. So, you know, onset of fertility is typical of a vertebrate animal; cell mass is more common for a microbe like sea sparkle.
B
Records come with metadata, but we aggregate from everywhere, and metadata vary extremely widely. We have an ever-expanding vocabulary of metadata terms, and we borrow them wherever possible from fairly mature online ontologies, many of which are produced by something called the OBO Foundry. But we have found that all existing ontologies in our domain are incomplete, so we frequently have to make up terms and place them in our ad hoc vocabulary.
B
This is probably too small to read, but this is a single relationship between three species: a host plant, a disease microorganism, and an insect vector that carries the disease. All the nodes are important pieces of metadata that are attached to this relationship, including things like provenance (how we know various things about these organisms) and lots and lots of context: which life stage of the insect, which tissue on the plant, what the symptoms are of the disease, and so forth.
B
Oh, here it is. Okay, sorry about that. So that's the kind of records that we get.
B
We aggregate here at EOL from a number of smaller providers who are specialists in a given type of data, or individual curators, some with particular subject areas, some with particular taxonomic areas, types of organisms that interest them. Some of our data is distilled out of what we would call occurrence data. For example, OBIS, the Ocean Biogeographic Information System, contains records of organisms at sea.
B
So this is a species, a location and a timestamp at sea. They often will also have depth information, and so you can distill out of that a depth range per species at which it's found. You can determine whether it's a shallow-water species or a deep-sea species. So the variety of sources is broad. The number of sources is quite large, several hundred already, and we've only been at it with a small crew for two years.
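The distillation described here, from occurrence records with depth to a depth range per species, can be sketched in a few lines. This is only an illustrative sketch, not EOL's or OBIS's actual pipeline: the record fields `species` and `depth_m`, the function names, and the 200 m cutoff between shallow and deep water are all assumptions made for the example.

```python
def depth_ranges(occurrences):
    """Return {species: (min_depth_m, max_depth_m)} from occurrence records.

    Each record is a dict with a hypothetical "species" key and an optional
    "depth_m" key; records without a depth are skipped.
    """
    ranges = {}
    for record in occurrences:
        depth = record.get("depth_m")
        if depth is None:
            continue  # many occurrence records carry no depth information
        species = record["species"]
        low, high = ranges.get(species, (depth, depth))
        ranges[species] = (min(low, depth), max(high, depth))
    return ranges


def habitat_zone(depth_range, shallow_limit_m=200.0):
    """Label a depth range; the 200 m cutoff is an assumed convention."""
    low, high = depth_range
    if high <= shallow_limit_m:
        return "shallow-water"
    if low > shallow_limit_m:
        return "deep-sea"
    return "mixed"


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    sample = [
        {"species": "Species A", "depth_m": 12.0},
        {"species": "Species A", "depth_m": 35.5},
        {"species": "Species B", "depth_m": 1800.0},
        {"species": "Species B", "depth_m": None},
    ]
    for name, rng in depth_ranges(sample).items():
        print(name, rng, habitat_zone(rng))
```

A real aggregation would also have to decide how to treat outliers and erroneous depths, which a simple minimum and maximum does not address.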
B
So this is the landscape that we operate in, and there is one more source of complexity that might be adding data to this bucket in the near future. Of course, everything these days is affected by AI, and there are two things it's doing. AI, of course, is learning to read, and since we have all these text sources of information, the community is beginning to annotate our biodiversity documents in anticipation of being able to reason over them with AI.
B
That's possibly more work than regular document processing, just because natural language in this context is a little bit unnatural, so the knowledge that the existing systems may have in interpreting human speech may not work very well with scientific jargon from a hundred years ago. So it's still in the experimental phase. I personally have been involved in three different research projects on biodiversity document annotation in the past three years, so it's very active, but it's proving complicated to work out.
B
And the one last thing is machine vision, which of course has also gone through some radical advancement in the last couple of years. Pl@ntNet is the first product that I have seen come to market that has really benefited from deep learning, but the potential is becoming clear for crowdsourcing the identification of individual organisms out in the world by anyone who has a phone.
B
It's not equally tractable with all organisms, but with flowering plants, with anything larger than a centimeter or two that can be captured well enough on a phone, it is potentially someday possible to get an identification from a photograph, if you take it well enough. Currently, photo quality, which is primarily a human problem, is probably the biggest bottleneck to getting this kind of thing to be more productive in terms of number of observations. And that is what I have for you.
C
B
So I think it is the French who operate out of Tasmania into the Antarctic, for example. The Americans and the Europeans, of course, are responsible for that cloud in the upper middle. Yeah. So, in the patterns that you can see, there's a lot of bias in where things have been sampled, but that's for a variety of reasons. The way cities are lit up on every continent, that's mostly citizen science.
C
A
C
B
C
[inaudible] over at UNC. I'm curious about trying to match this up with what the data hubs are interested in. So one question would be how much what's going on here overlaps with NSF efforts like DataONE or SEAD or TerraPop, or how it intersects with things like efforts to build sort of national-scale catalogs.
C
So, for example, how are we promoting findability of this kind of data? I'm looking for sort of intersection points now. If the data hub was a thing that brought together researchers and citizens with this data, to discover it, manipulate it and so forth, are there any sort of ideas or visions about what's lacking, what's already covered?
B
At a simple level, this has been somewhat addressed. For example, sources of exactly this kind of data are usually in contact with one another. So for occurrence data in the US, for example, there is a portal called BISON, Biodiversity Information... something. Their slogan is "Biodiversity Information Serving Our Nation," I think. And they are a hub of GBIF, which is this map, so they are already hooked together. And there's a similar one for American marine data, which is just called OBIS-USA, which is a node of the OBIS repository.
B
In terms of leveraging this data against sort of orthogonal datasets within the US, or other nationally organized data communities, I think that's in its infancy. But the barrier, I think, is more of a subject-matter one than a geographic one. I think a given data type is better organized across the world than different data types are organized within a nation, if that makes sense.
C
B
I can think of two areas where I know enough to respond. One is within this very kind of data, this particular biodiversity data of the species in the place at the time. We tend not to aggregate metadata so much as really try to get all the records represented in aggregator databases, and then the problem does become duplication, because there's more than one layer of aggregation.
B
But that's the state this particular data type is in currently. And, oh, do I remember what the... oh yes. The aggregation of metadata is more outside my ken, but my understanding was that that was one of the roles of DataONE, for example, in the US. But someone more savvy might be able to help answer that question.
C
I can make a comment on it. We've got a project here, [inaudible], basically about acquiring metadata and searching through it. Right now we work with people from the neuroscience community, [inaudible], to try to map a fair amount of metadata from different locations together. So, I mean, I think that's the only one I know exactly what I...
C
In the case of the neuroscientists at the moment, what they're trying to do is what they think of as harmonization, which some of us might think of as the semantic net. It's just going to try to take a bunch of their key-value data, right, and map the keys onto some sort of common key, and try to map the values onto a common set, so that you could [inaudible].
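The harmonization idea described in this last turn, mapping each source's keys onto a common key and its values onto a common set, can be illustrated with a small sketch. It is not the project's actual code: the `KEY_MAP` and `VALUE_MAP` tables, the field names, and the `harmonize` function are hypothetical and exist only to show the shape of the technique.

```python
# Hypothetical lookup from source-specific keys (lowercased) to common keys.
KEY_MAP = {
    "sex": "sex",
    "gender": "sex",
    "age": "age",
    "age_years": "age",
}

# Hypothetical lookup from source-specific values to a common value set,
# organized per common key.
VALUE_MAP = {
    "sex": {"m": "male", "male": "male", "f": "female", "female": "female"},
}


def harmonize(record, key_map, value_map):
    """Map one key-value record onto common keys and common values.

    Returns (harmonized, unmapped): keys with no entry in key_map are kept
    aside in `unmapped` rather than silently dropped.
    """
    harmonized = {}
    unmapped = {}
    for key, value in record.items():
        common_key = key_map.get(key.strip().lower())
        if common_key is None:
            unmapped[key] = value
            continue
        vocabulary = value_map.get(common_key)
        if vocabulary is not None and isinstance(value, str):
            # Fall back to the original value when it is not in the vocabulary.
            value = vocabulary.get(value.strip().lower(), value)
        harmonized[common_key] = value
    return harmonized, unmapped


if __name__ == "__main__":
    source_record = {"Gender": "F", "Age_Years": 34, "scanner": "site-2"}
    print(harmonize(source_record, KEY_MAP, VALUE_MAP))
```

Keeping the unmapped keys visible is a deliberate choice in this sketch: they show which terms the common vocabulary still needs to cover.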