Date: 9/7/2018
Presenter: Patrick McGarry
Institution: data.world
South Big Data Hub
A
I would like to welcome Patrick McGarry. He's currently building the thriving data community around data.world as their head of community, and prior to this he served as the director of community for the Ceph open-source project at Inktank and later for Red Hat, after a successful acquisition. [inaudible] Patrick enthusiastically helps companies to understand and adopt open source ideals through community engagement, conferences and events. So with that, and if he's had a chance to share his screen, I'll hand it over to you, Patrick.
B
All right, well, thank you very much for having me here. I know this is a short talk, so I'll try and get us out of here on time. I hadn't planned on talking a lot about data.world specifically. I have the enviable position of being on the community side of the house, which means I don't have to make the company any money, so I just get to play with the cool toys and all the cool people doing stuff.
B
So this was a little bit more aspirational, although I think some of this may be new, so we can speed through that and then have a bit more of a discussion if people would like to do that. So I'll jump right in. I always like to start with a few factoids at the beginning of some of my talks. Some of these may be new, some of these may not be new. They're just some of my favorites, especially this one.
B
That's more than 10x the number of websites that existed when Google launched. You compare the world of documents on the web to the world of data sets, and the problem isn't that we don't have the data. There's tons and tons of data that's being generated, and in many cases it's being shared. The problem is that we can't find and use that data.
B
It's either hiding on a dusty FTP server somewhere or, even worse, off on somebody's laptop, which is a major bummer. And so you start looking at the idea of the Semantic Web. Like I said, it's that idea that the web was creating links between documents and data, and the Semantic Web is creating links between data. This is not a new concept. It's been around for a couple of decades.
B
Tim Berners-Lee, I love to listen to some of the talks that he gives, talking about, "Hey, yeah, I invented the Internet, it's kind of cool. It's had some impact on the world, but really, if we can do the same thing with data, it will probably have more like 10x of an impact on the world." The nice part is, the history of usage of the Semantic Web was always okay.
B
People like to use that, but now what happens if, instead of having to try to do your own demographics research, you could just take all of the very feature-rich data that the US Census collects on a periodic basis, immediately merge that into your sales and marketing data, and start asking so much more relevant and interesting questions? That's the kind of thing that we do, and we're seeing a lot more of that just across the industry.
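The merge described here can be pictured as a join on a shared key. What follows is a minimal Python sketch rather than anything shown in the talk: the function name `left_join`, the `zip` key, and every field name and number in the sample rows are placeholders invented for illustration, not real Census or sales figures.

```python
def left_join(left_rows, right_rows, key):
    """Attach fields from right_rows to each row of left_rows that shares `key`."""
    # Index the right-hand table once so each lookup is a dictionary access.
    right_index = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        extra = right_index.get(row[key], {})
        # Left-hand values win if both tables define the same field.
        merged.append({**extra, **row})
    return merged


# Placeholder rows: field names and numbers are invented for illustration.
sales = [
    {"zip": "78701", "orders": 12},
    {"zip": "77002", "orders": 7},
]
census = [
    {"zip": "78701", "population": 1000},
]

for row in left_join(sales, census, "zip"):
    print(row)
```

Rows without a match on the right-hand side are kept unchanged, so no sales record is lost when the demographic table has gaps.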
B
We've also done some interesting work with the CDC, and obviously the ideal is: what if cancer researchers from London, LA and Tokyo could all share their work seamlessly, so that we could get a multiplication of effort? But obviously, for everyone here, this is not news to you folks. We're moving the needle on all of this. So the open data movement is crazy.
B
Unfortunately, we're at a very interesting nexus of the academic researchers, the open source world, the open data world, the individual data enthusiasts, as well as the corporate interests and the public sector: governments, municipalities, NGOs, foundations. And every single person that I talk to in any of those groups always says that the idea of these silos is incredibly difficult.
B
We're trying, as an industry I think, to create this idea of a data-driven culture, and that's really difficult to do. When you consider what the landscape looks like: are you a lone-wolf data scientist? Well, okay, you can consume some stuff, you can do some work and you can share your insights, but how much impact can you have on the...
B
And so you start looking at modern data teamwork, and what does that look like? Okay, well, maybe you're a small or medium business, or maybe you're a small research team. You've got five or ten people. And then you start looking at who the modern companies are that have been building their companies from the ground up to be data-driven. Google and Facebook have thousands of data workers, however you might define that. Amazon in particular is very good at having a lot of people digging into numbers all the time. But obviously there's the multiplicative ability of that long tail. This is the community approach. This is the "let's do with open data what open source has done with code for many years, successfully." And one of my favorite examples was the idea that when you bring people and data together, there are some really exciting things that can start happening.
B
Closer to home for data.world was the recent hurricane activity that hit Houston so hard. There was a lot of stuff going on, and there were a lot of people that were trying to jump in and help and they didn't really know what to do. And so there was this mobilization around Hurricane Harvey that did some really cool stuff. People started saying, all right, the phone networks are overloaded.
B
We can't get ahold of people. And so people were actually sitting on their rooftops and tweeting, "Hey, I'm here. The water levels, thankfully, have stopped rising, but we can't get anywhere. We can't get out." And so there were actually a couple of different groups that came together and started doing some natural language processing on Twitter, saying, okay, let's find the people that are in dire need, let's find the people that are screaming for help.
B
But then, as they started to work their way through that list, they started being able to have real impact on groups and say, all right, let's mobilize individuals and coordinate efforts to get people rescued or get people help, and then all the way down to, after things had calmed down a little bit, looking at well water reports and things like that. So there was all of this.
B
She did some interesting analysis, and I think that's what you see on the right-hand side there: the picture of the hardest-hit neighborhoods, based on a number of different factors. And then she obviously shared all of her data so that other people could double-check and say, "Oh, hey, did you think about this?" And we've actually seen multiple projects that were spawned off of her work that started saying, well, I want to do deeper analysis on public works impact, or I want to take a look at individual or commercial housing, or what have you. So it was really interesting to put the power of linking data and people together.
B
So what's next? I will share a little bit of data.world, and if people want to ask questions or whatnot, we're a free resource. We're a very good hub-model kind of thing.
B
We frequently get described as the GitHub for data. So if you want to do things in public, or for the good of mankind sort of thing, create a free account and go to town. That's all you've got to do. The way that we stay in business is people that want to have a lot of private datasets or really large datasets.
B
B
The Data Practices community was something that we started back in last November, where we basically tried to gather a bunch of visionary thinkers around the worlds of semantics and data journalism and data visualization and open source, all of these different people. We put them in a room and just kind of shook it to see what would happen, and we came out of the other side with a lot of thoughts around: hey, what can we do to start breaking down some of these silos? What can we do?
B
So we created this Manifesto for Data Practices. It's a set of values and principles that describe what modern, ethical data teamwork looks like, and since then we have, I think, over 1,500 signatories and some really notable authors that were on there, everybody from DJ Patil all the way to the folks working on Jupyter and our communities, people like Brian Granger, Fernando Pérez, et cetera. So it's really been interesting and impactful, but more than that, we wanted to move beyond words on a page.
C
C
Yeah, thanks, Patrick. That was really great to hear about what you're doing, and I think the problems that you highlighted, from at least my perspective, are right on. [inaudible] We hear a lot about big data, but I think capitalizing on the promises really requires breaking down silos. And just curious, to that end, what might be some examples? You mentioned NLP, which would be processing a stream of data and extracting information from it and organizing it.
C
B
Well, I mean, that's a broad question with a lot of moving parts, but there are definitely a number of different things that I could use as examples here. As far as archival goes, we've seen a lot within the governmental space in particular, and I'm thinking now especially of ecosystem-type, environmental data. We've had a number of environmental groups that were afraid that their data was suddenly going to disappear, and so they put it up on data.world so it would be in a third-party repository. And to further those ends, we've actually been working with Zenodo to make sure that we could allow our community to mint their own DOIs, so that they could have a better archival story.
B
We already version all of the data that comes into data.world and surface all of that stuff, but we're working to increase our capabilities there and allow people to have more interactivity with those versions, rather than just being able to download or revert, and start looking at how we can create the idea of a data pull request kind of thing. So those are some of the things that we're doing in terms of data archival.
B
Looking at streaming data and the things that you can do with that, we're also looking at things like hardware failure data. Some of you may be familiar with the Backblaze data set, which takes a look at a really wide number of hard drives and how long it takes them to fail. We're actually working with the Ceph community right now. They've built in a collection agent so that they can anonymously gather what types of hard drives are involved in large storage clusters, and we're working with people like Cisco and a couple of others to start getting some data out there to show what the failure rate looks like and what usage profiles lead to faster failure, and things like that. So there's always interesting stuff going on out there. I hope I answered your question.
B
C
B
D
This is Johnny. I had a question, and I'm trying to frame it. Early in the talk, you talked about the Semantic Web, which was a way of organizing links to data that's spread around the world. As you describe data.world, it feels like you simplify things by concentrating all the data that your overlay gives access to in one place, effectively. Do you think that is likely to change, and that over time you'll be mixing data across the world?
B
Oh, absolutely. Originally we wanted the data in a single place so we could start doing some interesting things in a more simplistic way as we continued to build and scale. But we're to the point now where we're starting to do VPC deployments and, like I said, that virtualized "let's get the metadata in, so that your data doesn't have to move." This is especially important for very large datasets, or for regulatory-type data sets with FinTech or insurtech types of ramifications.
B
So yes, I absolutely see a world in which we tend to index the data but not necessarily move it in-house as we go forward, because the Semantic Web stuff is really important to us: being able to be the Amazon recommender system where it says, "Hey, we see your data has this certain shape. Here's a couple of things that might be related or might help you to enhance what you're doing."
D
B
D
You surely have something interesting to say about moving data. If I want to analyze a chunk of data, I need to move it to somewhere where I can analyze it, or there are resources locally to analyze it. How do people generally get the cycles to apply to the data?
B
Well, yeah. That's still a hotbed of contention: do you bring the data to the analysis or the analysis to the data? It's interesting working with the AI and ML communities, because they tend to think in very different ways. I like working with some of my friends at Google, because they're the guys whose favorite saying is, "I forgot how to count that small." And so, when they're working with massive data, you can't move it around, so you've got to figure out ways to get the compute to the data.
B
But it's interesting, the kind of innovative tricks and things and tools that are coming out, especially in the cloud layers, with Amazon versus Google versus Microsoft versus Oracle. Everybody's got a story to tell about why their cloud is better, and I think that we're only going to see more and more sophisticated options when it comes to data and analysis and how the two shall meet. So I don't have any very strong opinions. Probably, if it's massive data, you want to bring your compute to it. But for me, looking at the data science landscape, it seems like the more you can do to take a slice of your data or a sample of your data, the better off you're going to be, rather than trying to do analysis on terabytes or petabytes of data.
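One common way to take a fixed-size sample from data too large to load at once is reservoir sampling. The talk does not name a technique, so the following is only a minimal Python sketch of that general idea; the function name `reservoir_sample` and its parameters are illustrative choices.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    The stream is read once and only k items are held in memory,
    so it works on inputs that do not fit in RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Usage: sample 5 rows from a large generator without materializing it.
rows = (f"row-{n}" for n in range(1_000_000))
print(reservoir_sample(rows, 5, seed=42))
```

Because the sample is uniform, summary statistics computed on it approximate those of the full data set, which is what makes exploring a slice first practical.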
D
B
Definitely. And I will answer this question to the best of my ability, but keep in mind, I am neither a data scientist nor an oncologist, so take what I say with a relative grain of salt. But that said, we have started to roll out some of our more semantically focused features. We have already in the system the idea of matching.
B
So a good example is if you upload a data set and it has a column that's called ZIP and it's all five-digit numbers, we'll ask you: is this a US ZIP code? And if you say yes, we start to infer a certain amount of semantic meaning from that. We'll be able to say, hey, do you want to bring in the city or the state or the census tract, so you can immediately enhance your data set automatically based on some of our in-house ontologies.
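The matching step described here can be sketched in a few lines of Python. This is an assumption-laden illustration, not data.world's implementation: the column-name hints, the 95% threshold, the function names, and the tiny `ZIP_LOOKUP` table are all hypothetical stand-ins for what a real ontology would provide.

```python
import re

# Hypothetical column names that hint at a ZIP code column.
ZIP_NAME_HINTS = {"zip", "zipcode", "zip_code"}
FIVE_DIGITS = re.compile(r"\d{5}")

# Tiny stand-in for a real ZIP-to-place reference table.
ZIP_LOOKUP = {
    "78701": ("Austin", "TX"),
    "77002": ("Houston", "TX"),
}


def looks_like_us_zip_column(name, values, threshold=0.95):
    """Guess whether a column is a US ZIP code candidate worth confirming."""
    if name.strip().lower() not in ZIP_NAME_HINTS:
        return False
    cells = [str(v).strip() for v in values if str(v).strip()]
    if not cells:
        return False
    matches = sum(1 for cell in cells if FIVE_DIGITS.fullmatch(cell))
    return matches / len(cells) >= threshold


def enrich_with_place(rows, zip_column):
    """Add city and state fields to each row, or None when the ZIP is unknown."""
    enriched = []
    for row in rows:
        city, state = ZIP_LOOKUP.get(str(row[zip_column]).strip(), (None, None))
        enriched.append({**row, "city": city, "state": state})
    return enriched


rows = [{"zip": "78701", "orders": 3}, {"zip": "77002", "orders": 5}]
if looks_like_us_zip_column("zip", [r["zip"] for r in rows]):
    # In the flow described in the talk, the user confirms before enrichment.
    print(enrich_with_place(rows, "zip"))
```

The detection function only proposes a candidate; keeping the user's confirmation as a separate step avoids silently misreading other five-digit identifiers as ZIP codes.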
B
We're also starting to get to the point now where we will start building the tools for people to bring in their own custom ontologies, whether this is UPC data from commercial real estate or retail stuff, or whether that's the fisheries data from the Pacific Northwest, or whatever it may be.
B
There's a lot of work to make sure that we can do custom ontologies, so we have definitely talked with and about DBpedia and some of those types of things, and we're still working out what that tool is going to look like and how many linkages there might be. That said, data.world has definitely taken the stance of: if something exists and it's doing a good job, let's not reinvent the wheel.
B
So when it comes to analysis, we tend to integrate with people like Microsoft Power BI or Tableau or Google Data Studio or whatever. We don't want to build our own analysis and be yet another tool on the landscape of already tool-fatigued people. So if there are ontological tools out there, and that was the subject of our CTO's doctoral thesis, his postgraduate work, I'm sure he's definitely aware of it, and we're trying to not reinvent the wheel again.
D
B
D
Right. Well, just a point of clarification, because you said DBpedia and I said Wikidata, and they're similar but quite different animals. I just want to make sure you know Wikidata is really doing an amazing job of lining up identifiers, and I think that it's probably something you guys could actually make good use of.
B
A
Okay, well, super. If we don't have any other questions, that's probably a good segue into the last part of our agenda. And thank you again, Patrick, for your presentation. We really appreciate that, and I can't wait to read through datapractices.org.