Description
Speaker: Chris Fregly, Data Solutions Engineer
The audience will participate in a live, interactive demo that generates high-quality recommendations using the latest Spark-Cassandra integration for real time, approximate, and advanced analytics including machine learning, graph processing, and text processing.
My name is Chris Fregly. I started off at Playboy. I keep the Playboy thing up here because it actually comes into play a little bit later. It's been a few years; that's me at the mansion with my boss and his wife, back in my younger, thinner days, and that's me and my mom at Netflix. I moved out here from Chicago to join Netflix, joined Databricks shortly after, and now I'm at the IBM Spark Technology Center. Well, not shortly after: this is five or six years later.
Of course, IBM has been backing Spark, and it's pretty much a dream job, right? My job, my boss, my company fully support this, which I didn't believe. I asked everyone before I joined, like, is this for real? Can I not wear khaki pants, pleated khaki pants, can I just wear my shorts and flip-flops? And they said yeah, that's fine. So that's why I joined.
So, speaking of the meetup: I actually planned it for Monday, right before the Cassandra Summit, because I knew a lot of DataStax people would be in town from all over, and DataStax was kind enough to host it right down the street at their office. If you Google it, there are two slide presentations. Russell, who I think is actually talking right now, possibly about the Spark Cassandra connector, has about 90 slides that go into more depth than he's covering here at the summit.
So you might want to check that out, and then I have my slides, which are about half of those slides, so you won't be missing too much. And then the upcoming meetups: Project Tungsten, which you've probably heard about in Spark, then one focusing on the Elasticsearch connector, since there are a lot of people asking about the internals, and of course a Catalyst deep dive.
B
So
we'll
try
to
blaze
through
these
recommendations
live
demo
talk
about
data
frames
at
sort
of
a
high
level,
then
like
dig
into
what
the
catalyst
optimizer
does
and
other
various
query
plans.
We've
got
data
sources,
API,
that's
the
main
API.
So
when
I
like
Russell
and
like
pioter
and
Alex,
and
those
guys
went
to
write
the
spark
Cassandra
connector,
that's
what
they
used
right,
that
data
sources,
API
from
spark
and
data
bricks,
has
changed
quite
a
bit
over
the
last
year,
so
yeah
they
did
a
really
good
job.
Keeping
up
with
that
wow.
We'll talk about how to create your own custom data source, we'll talk about the native ones that are part of Spark, we'll talk about third-party ones that people have built, and then a few tips on Spark SQL performance tuning from my work at Databricks and, currently, at IBM.
So, just real quick, you guys probably know this, but there are a couple of types of recommendations: personalized and non-personalized. We saw this quite a bit at Netflix, and I'm sure there are a lot of Netflix people here. And there's the whole cold start problem, right? What do you do?
You can also do things like PageRank; you can just kind of look and see which movies people are liking. This is a dating data set that we'll be showing, so it's really user-to-item, but the item is actually another user, which is kind of funny. And then for personalized recommendations you can do things like collaborative filtering and matrix factorization. We'll have a demo of that in a bit.
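Since the transcript doesn't include the demo code itself, here is a minimal, self-contained sketch of the matrix factorization idea behind collaborative filtering: learn a small latent-factor vector per user and per item so that their dot product approximates the observed ratings. This is plain Python using stochastic gradient descent, not the MLlib ALS code used in the actual demo; all names and parameters are illustrative.

```python
import random

def factorize(ratings, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Factor a sparse ratings dict {(user, item): rating} into
    user/item latent-factor vectors via stochastic gradient descent."""
    rng = random.Random(seed)
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(k)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(k)] for i in items}
    for _ in range(steps):
        for (u, i), r in ratings.items():
            pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # gradient step on squared error with L2 regularization
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, user, item):
    """Predicted rating is the dot product of the latent vectors."""
    return sum(pu * qi for pu, qi in zip(P[user], Q[item]))
```

In the real pipeline this step is what MLlib's ALS does at scale; a user's recommendations are then just the items with the highest predicted score.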
So this is interesting. These are some terms I picked up over probably the past couple of years about different types of user feedback. Back in the early days of Netflix, they relied on explicit feedback: ratings, or people liking each other's pictures, like Tinder, where you're just swiping.
B
Those
are
just
likes:
they're,
not
1,
to
5
1
to
10
ratings,
it's
just
zeros
and
ones,
but
you
can't
really
count
on
that
right,
so
yeah,
there's
these
other
types
of
feedback
call
the
implicit
feedback
which
is
like
hover,
easing
searches
and
clicks,
and
right
like
how
long
you
spend
watching
the
movie.
How
long
you
write
like
view
the
person,
if
you
click
in,
to
see
multiple
pictures
of
the
person.
If
you
spend
time
reading
the
air
like
movie,
the
like
summary
and
all
that
kind
of
stuff.
All of that gets fed into these models, obviously, for future recommendations. So that's the big data, right? Just a quick note on similarity, because this is what it all comes down to: we're trying to find user-user similarity, user-item similarity, and item-item similarity, which was made pretty popular by the guys up at Amazon. And it basically all comes down to math.
There's this concept of log likelihood, sort of associated with Jaccard similarity, where you factor out popularity. At Netflix, for example, in your first week there you had to say what your favorite movie and TV show were, and I guess something like eighty-five percent of people said The Shawshank Redemption, which was kind of funny. But that's an example of something that's not really a high-value recommendation, so you want to factor it out, and those are techniques to do it.
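The log likelihood idea mentioned here is usually computed as Dunning's log-likelihood ratio over a 2x2 co-occurrence table, the scheme Mahout popularized for item-item similarity. A minimal sketch in plain Python (illustrative only, not code from the talk):

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 co-occurrence table:
    k11 = both events together, k12/k21 = one without the other,
    k22 = neither. Higher means more surprising co-occurrence."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))
```

A statistically independent table like (5, 5, 5, 5) scores zero, while strong co-occurrence like (10, 0, 0, 10) scores high. That is how an extremely popular title like Shawshank gets damped: it co-occurs with everything, so its table looks close to independent.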
And then comparing similarity is the whole challenge here: you basically have to compare everything to everything else, and of course that's the scariest thing in the world. That's a Cartesian product, and that's tons of shuffle, a huge cost, network cost. So there are clever ways to do it. You can approximate; that's the most obvious. You try to reduce m, which is the number of rows, by bucketing.
There's this algorithm called locality-sensitive hashing that probably comes up about once a week; people message me asking about it, so that's one to keep an eye on. And then to reduce n, you can use sparse matrices and pull out common values like zero. Something interesting: zero sometimes isn't actually the most frequent value, so keep an eye on that.
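As a rough sketch of what locality-sensitive hashing buys you: hash each vector into a short signature so that similar vectors usually land in the same bucket, then only compare candidates inside a bucket instead of doing the full Cartesian comparison. This is a toy random-hyperplane (cosine) variant in plain Python, not Spark code; names and parameters are illustrative.

```python
import random

def signature(vec, planes):
    """One bit per random hyperplane: which side of it the vector falls on."""
    return tuple(1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
                 for plane in planes)

def lsh_buckets(vectors, num_planes=8, seed=42):
    """Group named vectors by signature; only vectors sharing a bucket
    need a pairwise comparison, instead of every possible pair."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    planes = [[rng.gauss(0, 1) for _ in range(dim)]
              for _ in range(num_planes)]
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(signature(vec, planes), []).append(name)
    return buckets
```

With b signature bits, two vectors at angle theta share a bucket with probability (1 - theta/pi)^b, so you trade a little recall for dramatically fewer comparisons.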
All right, so this talk used to be called Spark After Dark, which is still kind of out there on the web. It was, in essence, from my Playboy days. When I first joined I was too young to know about Playboy After Dark; I think it was in the '60s or something. I got ahold of all the footage and started watching, and it was super progressive: Marvin Gaye, old-school Jerry Garcia and, of course, Hef in his heyday with his short haircut.
So that's kind of where all this came from. This is what I need you guys to do, and then I'll do it with you. There's this project called Flux Capacitor. If you guys have seen me talk before, I tend to use "flux capacitor" for various projects I've done over the years. It started off as a Netflix project, and now it's moved over to big data.
B
Over
to
Big
Data,
so
we've
got
Kafka.
We
got
spark
streaming.
Cassandra
we've
got
ml
lib,
generating
the
models
and
making
recommendations,
putting
recommendations
into
elasticsearch
right
like
really
for
no
good
reason.
I
just
wanted
to
like
demo
the
elastic
search
connector
and
see
how
it
worked,
and
then
the
user
yeah,
so
the
user
talks,
so
the
user
being.
If
you
go
to
spark
after
dark
and
select
three
like
actors
and
three
actresses,
that's
that's
now
coming
into
Kafka,
which
is
listening.
I have a Docker image sitting out on SoftLayer. I was using Amazon up until like two weeks ago, and IBM started kind of pushing me toward SoftLayer. But it's a bare metal piece of hardware, so it's actually pretty fast. So if you guys just want to do that for a little bit, for like a minute.
I'm going to do it too. This is totally anonymous; I don't know who you are, and the data is going to get blown away when I get rid of the Docker image anyway. So just pick a few of your favorite people, both men and women; it doesn't matter, we're all adults here, we're all in the Bay Area.
I'm going to make sure I zoom in here. I usually don't go about 10 seconds without someone yelling at me to make the screen bigger, so I've gotten better about that. Have you guys used notebooks? I've used IPython Notebook; raise your hands. Yeah, so this is very similar to that; it's going to look a lot like this. Databricks as a product, which used to be called Databricks Cloud and is now just rebranded Databricks, is also a notebook-type thing. They support R.
B
They
support
right,
like
a
lot
of
like
the
advanced
commercial
features,
but
yeah
Zeppelin
is
open
source.
So
here
I'm,
just
importing
libraries
I've
got
the
standard.
Connector
I've
got
Kafka
and
let's
see
some
CSV
stuff
here
so
yeah
so
like
when
you're
clicking
on
the
actual
page,
that's
calling
a
REST
API
into
Kafka.
That's
a
right
like
confluence
the
company
that
spun
out
of
like
LinkedIn
those
guys
been
building
a
rest
proxy,
which
is
super
valuable
I,
don't
know
if
anyone's
actually
gonna
use
it
in
production.
All right, so just to show you guys some of the code: I'm just setting up reference data right here. I basically just pulled it off IMDb: actors' and actresses' profiles, their bios, pictures, things like that, which I can join in just to make some of these charts a little bit more interesting, showing names instead of IDs.
This is just going to display. So the number that is on your guys' web page is the one on the left; that's the "from". And then the "to" is going to be one of the profiles; the 9000s, I think, are the females and the 1000s are the males. By the way, your recommendations are going to have both. In fact, I had one time where all five of my recommendations were guys, so I'm not quite sure what was going on there.
It got wedged earlier today, but here's some PageRank action, pretty typical. If you think about it, you have all these people on the left who are liking all these people on the right; picture Tinder, kind of swiping along. So you've got this big bipartite graph going on here, and you're contributing your likes to them and they're contributing back to the rest of the graph. This is like my buddy right here.
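The bipartite "likes" graph described here is exactly what PageRank-style scoring runs over: every like passes some of your score along to the profile you liked. A compact, self-contained power-iteration sketch in plain Python (illustrative, not the code from the demo):

```python
def pagerank(links, damping=0.85, iters=50):
    """links maps a node to the list of nodes it 'likes' (points to).
    Returns a rank per node; ranks sum to 1."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        new = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling node: spread its rank evenly over everyone
                for other in nodes:
                    new[other] += damping * rank[node] / n
        rank = new
    return rank
```

On a likes graph, a profile liked by many well-liked users ends up ranked higher than one liked only by a single indiscriminate user, since each like's contribution is split across everything that user liked.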
B
He
just
likes
everyone
just
because
he
thinks
that
that's
how
he's
going
to
you
know
you
just
play
the
numbers.
I
guess
all
right.
So
let's
try
to
do
so.
Let's
get
to
the
interesting
stuff
here
which
is
going
to
be
the
matrix
factorization
we
can
get
back
to
this
later,
but
basically
what
it
would
do
is
populate
elasticsearch
and
then,
when
you
click
those
three
links
at
the
bottom,
those
are
our
personalized
to
you.
It
knows
your
ID
and
it
would
grab
it
out
of
that.
It.
All right, so if you guys want to see some more of this in action: github.com/fluxcapacitor. There's a Docker image that has all of this built in, so it's got Kibana, Logstash, Ganglia, all the metrics and all that kind of stuff. It's pretty cool. All right, so DataFrames, inspired by R and pandas. I'm going to just kind of blaze through here. You should always be writing to DataFrames; there was a big shift to this as of Spark 1.3.
So think of a DataFrame as a logical plan container, like Pig or SQL, that kind of thing, where you're just building up transformations, building up a DAG, but then Catalyst is the one that actually comes in. Catalyst was a rewrite. Initially there was Shark, and Shark was basically Hive on Spark; it used the Hive optimizer beneath the covers, and it really didn't fit into the Spark execution model. It was more like MapReduce.
B
It
was
limited
by
that,
so
it
was
kind
of
a
right
like
sit
down
we're
like
a
big
meeting
where
they
decided
yeah.
Let's
just
rewrite
this
thing.
This
is
it's
going
to
be
a
long
road
ahead,
but
yeah
they
have
smart
guys,
guys
guys
have
PhDs
and
this
kind
of
stuff
so
yeah
they
set
out
to
do
it
and
keeping
it
clean,
keeping
it
it
open
too.
So
that's
the
data
sources
API
that
we'll
talk
about
in
a
second
you
can
plug
things
in
you
can
plug
in
custom
rules.
You can basically hook into any plan that you want; you can manipulate it at all the levels: logical, physical, optimized, that kind of thing. You can write custom UDFs. The one drawback is that Catalyst doesn't understand what's going on beneath that UDF, so you might actually have some problems if you start doing crazy things in there; Catalyst has to optimize around it and can't include it in the fun. And as of Spark 1.5 there's new UDAF support for custom aggregations.
B
You
d
is
right
now,
the
only
like
UD
AF,
so
you
can
use
our
the
hive
you
da
FS,
so
spark
sequel
is
essentially
a
closed
subset
to
hive
ql
right
like
closed
meaning
there's
a
few
obscure
things
that
sparks
equal
doesn't
implement.
Right,
like
one
of
the
things
I
BM
and
data
bricks
have
been
talking
about
together,
is
trying
to
build
that
out
to
get
right
like
to
become
more
like
an
c
sequel,
compliant
get
like
t,
pcds
and
all
that
stuff
running
smoothly.
The big thing here, which we'll talk about quite a bit in a second, is predicate pushdowns: basically the ability to push filters as deep down into the source itself as possible. You'll see this when we talk about the Data Sources API; there are ways to actually push filters down so that you aren't returning that data at all.
I'll show examples of that where you can actually check, and I think I actually found a bug in the Spark Cassandra connector this morning, so I have to track down Russell and those guys after this, because I can't seem to get it to push down. It's probably something I'm doing, but it's not clear. So, yeah, there are hooks for custom rules, so you can implement your own rule.
B
Alright,
so
he
at
the
top
is
a
yeah
there's
this
this
sort
of
concept
of
a
data
frame,
dsl
right.
So
it's
all
the
familiar
things
selects
and
filters
and
if
you
run
explain
true
on
it
or
from
a
sequel
standpoint,
just
really
put
explained
at
the
beginning
of
your
sequel,
it's
going
to
be
the
exact
same
thing:
it's
going
to
dump
out
the
parse
logical,
the
analyze,
logical,
yeah,
the
optimized,
logical
and
then
physical
yeah,
so
keep
an
eye
on
these.
So what you'll see here are the early phases. There are two filters at the top there. There are really three types of genders in this data set: female, male, and then unknown or undetermined or something, so here I'm specifying them separately, and they show up on two separate lines. This is kind of simple stuff, but the optimizer collapses them, because if we stuck with the parsed logical or the analyzed logical plan, that would require two passes through the data.
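To make the "two passes versus one pass" point concrete, here is a toy version of that optimizer rule in plain Python (illustrative only, not Catalyst's actual implementation): a Filter sitting on top of another Filter gets rewritten into a single Filter with a combined predicate, so the rows are traversed once.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Scan:
    rows: List[dict]

@dataclass
class Filter:
    predicate: Callable[[dict], bool]
    child: Any

def collapse_filters(plan):
    """Rewrite Filter(p, Filter(q, child)) into Filter(p AND q, child)."""
    if isinstance(plan, Filter):
        child = collapse_filters(plan.child)
        if isinstance(child, Filter):
            p, q = plan.predicate, child.predicate
            return Filter(lambda row: p(row) and q(row), child.child)
        return Filter(plan.predicate, child)
    return plan

def execute(plan):
    """Naive interpreter: every Filter node is one pass over its input."""
    if isinstance(plan, Scan):
        return plan.rows
    return [row for row in execute(plan.child) if plan.predicate(row)]
```

Catalyst applies rewrite rules in this spirit to the logical plan before a physical plan is ever chosen.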
B
So,
of
course
we
can
collapse
those
filters
and
just
do
one
pass,
and
then
you
see
that
in
the
physical
and
then
yeah,
so
up
top
is
kind
of
a
like
generic.
This
is
on
the
spark
right
like
documentation,
site
the
sparks
equal
programming
guide,
but
that
kind
of
shows
there's
there's
some
cost
analysis
going
on.
So
after
physical
plans
are
chosen,
there's
a
cost
step,
we're
basically
there's
there's
one
optimization:
that's
there
right
now
and
that's
the
like
broadcast
join
right
like
broadcast
hash
join,
which
takes
it's
basically
maps.
B
B
Yeah, so here's the data set. When I started on this Spark After Dark thing about a year ago, this was the only really public dating data set that I could find, and it suits my needs: it's just simple joins, and I could do partitions based on the rating, partitions based on gender, things like that. So let me actually show that.
Okay, hopefully Zeppelin has settled down here a little bit. I just kind of threw this together this morning, a performance comparison. With these notebooks you can put in markdown, and you can actually run shell commands. Please don't hop on the server; there's no protection, so please don't delete everything. I mean, it's a Docker image, so you're not going to get that far, but yeah.
B
So
we're
going
to
compare
we're
basically
going
to
join
ratings
and
genders
right,
so
ratings
have
user,
ID
have
both
user
IDs
and
then
rating
and
then
I
want
to
get
their
gender.
So,
like
pretty
simple,
but
the
thing
to
note
here
is
that
CSVs
JSON
you,
you
cannot
partition
by
those
right
now.
Yeah
they're,
not
supported
I,
think
there's
work
going
on
specifically
for
like
JSON,
but
still
actually
I
think
it
might
be
experimental
in
15,
but
now
with
parquet.
Parquet is similar to ORC, and by the way, there is ORC support as of Spark 1.4, I think, so if you guys are still using that, it's supported now. So we're just kind of comparing. The whole point here is: where are our pushdowns happening, and how effective are they? So CSV is just kind of pulling things in here. I want to point out one thing: if you see a Filter in your physical plan, that's bad. So this is CSV, where there are no pushdowns.
This means I'm pulling in all the data, and then Spark has to filter it. My query here is trying to find what I call the medium hotties, which would be ratings four through six. I did it that way because I just wanted to have two filters there. where and filter are just aliases, so they're the same thing. And then I'm printing out all the plans, and you see that the filter gets collapsed, which is kind of nice, but it's still there in the physical plan.
Parquet is good at column skipping, because it has all the columns together: if it knows that you're only selecting one out of hundreds of columns, it'll physically skip over the others right on disk. But it's not partitioned by rating, so let's look at the bottom here: Filter. That's not good. Now, if we do...
So now let's see how things evolve here. We have these two filters, they get collapsed, which is pretty cool, and then boom, right there, this changes to a UnionRDD. So this is taking the bare minimum and then sticking the pieces together. If you see MapPartitions, that means it has to do a full scan, and that's not good. So let's get to Cassandra here.
This is the smaller table that is benefiting from the actual partitioning; I did that on purpose. What is this? Oh yeah, I tried to switch over to 1.5. Basically, what I found with Cassandra, and I was trying the 1.5 connector, the new one, to see if it fixed the issue, is that you'll still see Filter on top and then MapPartitions.
The one to keep an eye on is PrunedFilteredScan; that's basically the Holy Grail: column pruning and predicate pushdown. I'm working on DynamoDB, just because, I don't know, I signed up for it a while ago and I'm still working on it. But if you want to create your own data source, this is what you do: just basically look at the existing ones.
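For a feel of what a Data Sources API implementation has to provide, here is a toy, in-memory analogue of the PrunedFilteredScan contract in plain Python. The real trait is Scala, and its buildScan receives the required columns plus the pushed-down filters; everything here is an illustrative stand-in.

```python
class ToyPrunedFilteredSource:
    """In-memory stand-in for a data source supporting column pruning
    and predicate pushdown: the engine hands down the columns it needs
    and simple filters, and the source returns only matching data."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per row

    def build_scan(self, required_columns, filters):
        # filters: list of (column, op, value) triples
        ops = {"=": lambda a, b: a == b,
               ">": lambda a, b: a > b,
               "<": lambda a, b: a < b}
        return [
            {col: row[col] for col in required_columns}   # column pruning
            for row in self.rows
            if all(ops[op](row[col], val)                 # predicate pushdown
                   for col, op, val in filters)
        ]
```

If a source doesn't implement this, Spark pulls every row and column back and applies its own Filter step, which is exactly the "Filter in the physical plan" smell from earlier.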
So far the most complicated ones that I've seen are the Cassandra and the Elasticsearch ones, and then Parquet, of course. Keep an eye out, if you guys don't know Spark Packages, and do contribute your own data source. Right now IBM is obviously working on DB2 and their Big SQL and some other integrations. As of about a year ago, maybe a year and a half ago, Databricks doesn't accept anything into the core that is vendor-specific, so that kind of slowed things down.
Back to pretty basic stuff: pushdowns. This is showing how Catalyst is pushing that filter deep into the source. I did this on purpose; it's kind of ridiculous. These are all the rules for Cassandra: obviously, anything that's going to get pushed down has to be part of the partition key, and there are some weird combinations of things and certain rules that still apply to the partition key, like whether the partition key column is the first or the last, whatever. So: native JDBC, JSON...
And there's CSV, which is actually built by Databricks; they update it with every big Spark release. Same thing with Redshift. Databricks Cloud, the product, is built on Amazon, and of course a lot of our customers have Redshift. Redshift is pretty interesting, because the Redshift master is a single bottleneck, so any time you're returning a super large number of results, that's obviously not going to be good.
B
So
what
we
do
is
we
actually
write
out
to
s3
and
then
we
can
parallel
pull
from
that.
So
yeah
within,
like
redshift,
there's
I
think
it's
called
unload
and
you
give
it
a
the
yeah
that
s3n
that,
like
temporary
bucket
to
store
it
in
and
then
we
can
parallel
eyes
from
there
and
then
pull
the
data
in
faster
there's.
An
upcoming
meet
up
I
think
it's
februari.
It's
right
before
the
elastic
search
conference
coming
up,
februari
2016.
We'll be doing the same thing I did on Monday with Russell and the guys from DataStax: we'll be tearing that thing open. Let's see, Cassandra, of course. This is kind of a cool one: there's a REST data source, by Michael Armbrust. If you guys have any questions about Spark SQL, he's not that hard to find.
B
Obviously,
when
you're
a
startup,
so
I
fill
in
the
blanks
there,
but
yeah
he's
super
active
on
the
spark
user
list
too.
So
yeah
posting
there
is
the
best
thing
to
do,
but
it's
kind
of
cool
like
you
can
give
it
a
rest
endpoint.
You
can
write
and
read
in
a
restful
way,
yeah,
I'm
working
on
dynamo,
yeah
me
and
ehrlich.
I think this is one of the last slides here: performance tuning. I'm not going to go over these specifically; I'll post this on SlideShare right after. But some of the various things: you definitely want to turn on Tungsten; bump up the shuffle partitions value, because people always forget about that one and it's a huge bottleneck; and turn on partition discovery. That's for when you're reading: you can point to a parent directory and it'll infer all the partitions.
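For reference, the Spark 1.5-era settings being alluded to would look something like this in spark-defaults.conf; the values are illustrative, and shuffle partitions should be sized to your own cluster:

```properties
# Tungsten execution (on by default in 1.5, but worth pinning)
spark.sql.tungsten.enabled        true
# Default is 200; a common bottleneck when too low for large shuffles
spark.sql.shuffle.partitions      400
# Push filters down into Parquet scans
spark.sql.parquet.filterPushdown  true
```

Partition discovery needs no flag here: pointing a read at a parent directory of key=value subdirectories (for example gender=F/) triggers it automatically.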
So if you have partitions by date or gender or whatever, it'll figure that out. Yeah, I'm heading to Spark Summit Amsterdam; anyone going to that Spark Summit, please bail me out of jail. That's what the little police guy on the slide is. And then the 13th is my birthday, in Scotland, so that's not going to be pretty.