From YouTube: 2017-FEB-23 -- Ceph Tech Talks: Big Data Analytics
Description
Adit Madan talks about Big Data Analytics on Ceph using Alluxio.
http://ceph.com/ceph-tech-talks/
Good, thanks, Patrick. Thanks, everyone, for joining. As we go through the presentation today, feel free to stop me at any time if you have questions or would like to talk about something. Today, like Patrick said, I'm going to talk about how Alluxio can be used to speed up data analytics on top of a Ceph storage cluster.
Okay, before we start, a little bit about myself. I'm a software engineer at Alluxio, the company behind the open-source project. I graduated from CMU in 2013, where I worked on different distributed and storage system problems, and before that I was an undergrad at the Indian Institute of Technology in Delhi. Feel free to get in touch with me after the talk if you are interested; my email is right there on the screen.
So I'll start with a brief introduction of what Alluxio is and the ecosystem it's typically used in. Alluxio, the open-source project, is actually one of the fastest-growing open-source projects in the big data space. The graph that we're looking at shows the number of contributors for different projects in the early stages of each project.
So the essence of what Alluxio does is connect any application to any storage, at memory speed, at any scale. But to give you a little more context on where we are coming from: we started with a world in which we had one compute framework, which was Hadoop MapReduce, and there was one storage system typically used with MapReduce, which was the Hadoop Distributed File System.
Applications then had to deal with a variety of storage systems, which was not an easy task. Now, what you can do with Alluxio is configure your compute framework to work with Alluxio, and Alluxio itself handles all of the communication with the different kinds of storage systems. So, as an application developer, you only worry about connecting with Alluxio, and Alluxio can connect with the different storage systems underneath.
So Alluxio provides different kinds of interfaces, but the recommended one is the native file-system-like API that we have, which gives you access to any of these systems. To connect to the systems underneath, Alluxio speaks each system's own interface: if you're connecting to the Hadoop Distributed File System, it uses the HDFS interface; in our case, when we connect to Ceph, we use the Swift interface to connect to Ceph.
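As a rough sketch of what that looks like in configuration terms (the property names and the endpoint below are assumptions for illustration, not values from the talk; check the Alluxio and Ceph documentation for your versions), mounting Ceph through its Swift-compatible RADOS Gateway means pointing Alluxio's under storage at a swift:// URI:

```
# alluxio-site.properties -- hypothetical values for illustration only
# Point the Alluxio under storage at a Swift container served by the
# Ceph RADOS Gateway (the Swift-compatible endpoint in front of Ceph).
alluxio.underfs.address=swift://demo-container/
fs.swift.user=demo:demo
fs.swift.auth.url=http://storage-manager:8090/auth/v1.0
fs.swift.auth.method=swiftauth
```

With a mount like this in place, the compute framework never talks to Ceph directly; it only sees Alluxio paths.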
Alluxio, like I said, brings three main benefits. The first is that it unifies different storage systems. The second is high performance, by running jobs at memory speed, since Alluxio is co-located with the compute. And you also save money by paying for only the compute and storage you need, with the flexibility that Alluxio provides through the separation of compute and storage.
You can use cost-efficient object stores, you can scale the resources independently on an as-needed basis, and you can also use big-data-native frameworks without sharing any resources with the underlying storage. However, there is a disadvantage to the separation between compute and storage.
Whenever the compute framework has to access data, it takes longer, because the storage is farther away: the network latency is high and the throughput is low. This is exactly where Alluxio comes into the picture, as the compute-side data management layer.
So let's look at the use case without Alluxio. In the example that I'm going to present today, I'm using Spark as the compute framework, and I'll use Ceph as the storage system; the Ceph box at the end could be replaced with HDFS. Now, whenever you're accessing data from the storage in the compute framework, you observe high latency, and you're bounded by the network throughput that is available from the storage system.
Here's an example use case: people at Baidu were using Alluxio to accelerate data access from the Baidu file system cluster. Alluxio was managing over 2 petabytes of data, both in memory and on hard disk drives, and the size of the deployment was over 200 nodes. In this particular use case, Alluxio was able to bring a performance benefit of over 30x, through the benefits that I have already outlined.
I'll show you experiment results and a demo video of running Spark on top of Ceph. In the configuration that I have, I'm running everything on EC2, using four types of machines. The first type of machine is called the compute master, which runs the Spark and Alluxio master processes. The second type of machine is the compute workers, which run the Spark and Alluxio worker processes; I use three of these workers. The third type of node is the storage manager, which runs the Ceph RADOS Gateway daemon and also the monitor process. Lastly, the actual data lies on nodes named storage devices, which are essentially the Ceph OSDs. I use the r3.xlarge instance type, and all of the machines have been launched in the same availability zone. Also note that it is not a requirement that everything be in the same availability zone.
The versions that I've used are Ceph Hammer and the recently released Alluxio open-source version 1.4. I've used a custom JOSS library, in case anyone is interested in reproducing the numbers; JOSS is essentially the client library that allows users to communicate with a storage backend which supports the Swift API.
I'll show you a quick five-minute video of some of the things that I described. Before I start the video: Spark, Alluxio, and Ceph have been pre-deployed with the configuration that I showed you, and Ceph has been pre-populated with a 60 GB data set which is not present in Alluxio memory, so it's only in Ceph. When I start the video, I will show you a sample application running queries in Spark using the Spark shell.
What we'll do is run a simple Spark count job, which counts the number of lines in a file, in this case the 60 GB data set. Then I will run a second Spark count job, which will show you the caching effects in Spark itself, and also compare that with storing the data in Alluxio memory. I will then restart the shell and show you the performance of a third count.
The implication of restarting the shell is that whatever data Spark had cached is lost, but this limitation is not present when you use Alluxio, as you will see in the performance results that I share: for the third count, you will see significantly higher performance with Alluxio. In the video, to end the demo, I will show ad-hoc queries using Alluxio; what I mean by that is that I'll store some intermediate results and then issue a word count.
The first thing that I did is I already have Ceph configured as a storage system that Alluxio communicates with. "Not in memory", in the text that we see right now, means that the data is not being managed in Alluxio, but Alluxio is aware of these files existing in Ceph storage. So there is a folder named "data", which holds files of 3 gigabytes each, making up our total sample data set of 60 gigabytes.
So I'm going to pause there. The way Spark communicates with Alluxio is through the Alluxio file-system-like interface. Here, what we did was reference the file with an alluxio:// path, where demo-master is the host name running the Alluxio master process, and "data" is the directory that we are running our compute job on.
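To make that path concrete, the file the Spark job reads is addressed with an alluxio:// URI. This tiny sketch just assembles such a URI (the hostname comes from the demo; port 19998 is Alluxio's default master RPC port, which is an assumption here, not something stated in the talk); in the Spark shell, the resulting string would be handed to a call like sc.textFile(uri):

```python
# Assemble the kind of Alluxio URI the demo's Spark job reads from.
# "demo-master" is the hostname from the talk; 19998 is Alluxio's
# default master RPC port (an assumption, check your deployment).
def alluxio_uri(host: str, path: str, port: int = 19998) -> str:
    return f"alluxio://{host}:{port}/{path.strip('/')}"

print(alluxio_uri("demo-master", "/data"))  # alluxio://demo-master:19998/data
```

The point is that Spark only sees an Alluxio path; which under storage actually holds the bytes (Ceph here) is invisible to the application.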
I have fast-forwarded the actual running time of the compute job; this is a job which runs for 12 minutes. You can see that Spark is running with process-local locality, which also means that a lot of the data is not being fetched from Alluxio memory at the moment.
For the sake of comparison, both Alluxio and the direct access on Ceph have been configured with a 512-megabyte block size. What that means is that for the 60 GB data set, there are 120 tasks created by Spark. And we saw that the first count job that we did on Alluxio took seven hundred and fifty seconds, which is approximately 12 minutes.
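The task count follows directly from the block size: Spark creates roughly one input task per block, so a 60 GB data set at a 512 MB block size yields 120 tasks. A quick back-of-the-envelope check:

```python
import math

# One Spark input task per block (simplified model of input splits).
data_set_bytes = 60 * 1024**3   # the demo's 60 GiB data set
block_bytes = 512 * 1024**2     # the configured 512 MiB block size
tasks = math.ceil(data_set_bytes / block_bytes)
print(tasks)  # 120
```

This is why the block size matters for parallelism: halving it would double the number of tasks over the same data set.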
The next thing that we did was re-reference the file in Alluxio; this step is essentially fetching the block locations from Alluxio, that is, the locations of the Alluxio worker processes which store the data now that it has been fetched from the remote Ceph storage cluster into Alluxio.
You will see that when we perform the count job again, it finishes much faster, and you can also see that the tasks were run with node-local locality, which means that the tasks were launched on the nodes which hold the data.
Okay, so the next thing that I'm going to do is perform some word count operations on top of the same data set and store the intermediate count results in Alluxio. What I mean by the intermediate count results is that once we calculate the count for each word, we store this information in Alluxio, which I call intermediate data, and then I will perform subsequent queries on the intermediate data, which avoids accessing the entire 60 GB data set from remote Ceph storage.
We saw that the job took about 400 seconds to complete; as you can see on the screen over there, it took four hundred and twenty-two seconds. The next thing that we do in the demo is store the intermediate data in Alluxio, in a file named 60GB-counts, which stores key-value pairs: the key is the word, and the value is the count of that word. Now we will issue subsequent queries on the 60GB-counts file.
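Conceptually, the intermediate data is just the reduced (word, count) pairs, and subsequent ad-hoc queries hit that much smaller table instead of re-reading 60 GB from Ceph. A pure-Python sketch of the same idea (no Spark involved; this only illustrates the shape of the data):

```python
from collections import Counter

# Stand-in for the 60 GB text input: a few lines of sample data.
lines = ["big data on ceph", "data in alluxio", "ceph and alluxio"]

# The word count job reduces the raw text to (word, count) pairs...
counts = Counter(word for line in lines for word in line.split())

# ...and later ad-hoc queries consult the small intermediate result,
# not the original input (in the demo, these pairs are what the
# 60GB-counts file in Alluxio holds).
print(counts["alluxio"])  # 2
print(counts["data"])     # 2
```

In the demo, the payoff is that every query after the first one is served from this reduced data set in Alluxio memory.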
Note that I also exited and restarted the Spark shell, to demonstrate that once the data is in Alluxio, it can be shared across different applications. This is also relevant because typical big data workloads share data across different jobs, and the benefits that we see with Spark caching are only applicable within the same job itself.
We observed that the job took approximately the same time. Now, when we do the same operation in the same Spark shell for a second time, Alluxio took a little bit less time than it took to run Spark on top of the data which is cached in Spark. The difference in the second count that we have over there is that in the blue bar the data is stored in Alluxio, and in the red bar the data is cached in Spark.
What we did next was restart the Spark shell, which simulates another application, and we performed the count job again. You can see that when the data was being accessed from Alluxio memory, it took approximately 20 times less time than it took to access the same data from the remote Ceph storage cluster, and this is the performance benefit that we see when accessing data from Alluxio for repeated accesses.
We also have a white paper on the use case that I just described. If you're interested in looking at how you would set up Alluxio on top of Ceph with Spark, the white paper has detailed instructions on doing that, and the blog post gives you a brief introduction to what the white paper talks about.