Description
Big Fast SQL with Presto, with Kyle Bader of Red Hat and Kamil Bajda-Pawlikowski of Starburst Data.
Filmed October 28th, 2019 in San Francisco.
Kamil: Another great thing... so, first of all, I'd like to talk about Presto as a SQL-on-anything engine. It's an open source project that was first started about seven years ago at Facebook and then spread here in the valley and beyond very, very quickly, and my team and I have been involved in this project for almost five years by now.
What's unique about Presto is that it's a compute-only distributed SQL engine, which means you can deploy it almost anywhere, and you can allow Presto to access data from many, many different data sources. Some of those are object storage, like Ceph, or, you know, Amazon S3, or Google Cloud Storage, or Azure Blob Storage, and other technologies like this. You can also query HDFS and Hadoop, obviously known for storing big data, but you can also connect to a variety of different databases, like Oracle, Teradata, SQL Server, and Postgres, and so on, and also NoSQL engines like Cassandra and, most recently, Elasticsearch as well.
So it's a very, very powerful mechanism, where you separate compute and storage: you can provide scalable processing using multiple machines in your Presto cluster, and then, from the user's perspective, it all looks the same.
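This separation is what enables federated queries: a single statement can join data from different connectors. A minimal sketch, assuming a `hive` catalog backed by object storage and a `postgresql` catalog for an operational database (all catalog, schema, table, and column names here are hypothetical, not from the talk):

```sql
-- Join Parquet data on object storage with rows from Postgres.
-- All names below are illustrative.
SELECT c.name,
       sum(o.total) AS lifetime_value
FROM hive.web.orders AS o              -- e.g. Parquet files on S3/Ceph
JOIN postgresql.crm.customers AS c     -- e.g. an operational database
  ON o.customer_id = c.id
GROUP BY c.name
ORDER BY lifetime_value DESC
LIMIT 10;
```

The catalog prefix on each table name is what tells Presto which connector, and therefore which storage system, to pull from.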
Now, why do people like Presto, or why are so many companies deciding to leverage Presto rather than alternatives? I think there are several different reasons, and some of them are summarized on this slide. First of all, it's a community-driven open source project used by a number of big players who bet their SQL analytics needs on Presto: guys like Airbnb, Netflix, Lyft, LinkedIn, and many, many others, right, who are part of the community driving this forward.
That means, you know, making sure that the project survives despite any changes at a single individual company deciding to go further or not. It's a very powerful, high-performance SQL engine proven at scale. The largest deployments of Presto are, you know, approaching about a thousand machines in a single cluster, and many companies are actually running many, many clusters, because, since it's compute-only, it's very easy to spin them up and down and give access to certain data sources without sort of creating data silos.
As I mentioned, a fundamental piece of the architecture is the separation of compute and storage, which means Presto itself doesn't have any favorite storage mechanism; it doesn't come with its own mechanism to store the data. It relies on wherever your big data is, whether that's object storage or HDFS. You may keep some of your older data in Oracle, Teradata, and other data warehouses. You can keep some of your operational data in Cassandra or SQL Server, or anywhere it lives right now.
With that, we also like to say, you know, that it represents a big value in having no vendor lock-in. First of all, it's an open source project, so you can run it and use it without any vendor, if you like. You're also free from being tied to any Hadoop distribution; it works across any distribution. You can change the storage underneath Presto, and your applications and your end users will still be interacting with the same data without knowing you actually moved from, you know, HDFS to object storage, for example.
You can move from an on-premises deployment to the cloud, or the other way around, and things for your users do not change, because Presto is isolating them from that entirely. And again, you're not tied to any specific infrastructure, so you can move between clouds, for example, if that's your choice. So it provides great insulation and flexibility.
Okay, so Starburst: as I mentioned, we have been involved in the Presto community for many years already. We have large customers in production, both on-premises and in various cloud deployments. With Kubernetes, we are now enabling a very similar experience across any cloud and on-premises environments like OpenShift, for example, which is really great both for customers and for us as developers, since we don't have to handle a custom deployment mechanism for each cloud separately as an enterprise vendor.
So, as I mentioned, Presto is a very, very high-performance SQL engine, and it was built like that from the beginning. The objective for the team implementing it was to make interactive analytics at big scale a reality. Before Presto, there was Hive, you know, an obviously very highly respected engine that can handle petabytes of data.
Kyle: They basically wrote their own deployment tools for deploying these different Presto clusters, and one of the great things about having OpenShift is alleviating this burden from folks, right. Instead of having to write, you know, some scripting and some sort of configuration management tooling, they can use something like an operator. And having written, you know, Ansible playbooks for Presto, I can appreciate not having to do that anymore.
So all the things that Kubernetes is good at, you kind of get once you start using the operator framework to deploy clusters, and particularly Presto clusters. Instead of having to worry about, you know, provisioning new nodes or doing fault tolerance, Kubernetes kind of handles that for you. You can say how many Presto workers you want online, and it'll bring that many up. If one goes down, then it'll provision a new one, and it'll get bound to a different node. You can trivially scale it, right, so I can go in,
I can change the number of replicas for workers up, and then, you know, I have more. So you can potentially make it so that, you know, if you have a higher query volume, you scale out the cluster to be able to, you know, keep your query response times low, and then, if the volume of queries kind of subsides, you can, you know, scale it back in. And because it's compute-only, you don't have to worry about it, right.
So we kind of connected the dots, and they made it happen with a little bit of help, but mostly them; it was like 90% done by the time we started having the conversation with them. So what the operator does is it deploys the coordinator and the workers, which then work together, and you submit your queries to the coordinator.
At this point, this is a screenshot from one of my OpenShift 4.2 clusters. If you go into the catalog, under the Big Data section, there's the Presto operator, so it's under the OLM, and you can click and install, and, you know, then you can submit CRs and in effect create a Presto cluster for your environment and begin to experiment with it.
So where does Ceph come in? Well, I had a little lightning talk earlier about the scalability of Ceph, but Ceph and Presto actually work really great together, because Ceph is just an object store and Presto is just a compute engine. So there aren't really, you know, opinions around using a particular storage or using a particular query engine, because it's not a verticalized stack. And originally, you know, I learned about Presto by way of customers, right. So we had...
In an OpenShift environment, we have OpenShift Container Storage, which is the packaging of Ceph with an operator that can manage Ceph, called Rook-Ceph. And then, additionally, there's another component called NooBaa, which is kind of a multi-cloud gateway: you can have multiple different object stores, on-prem or in the public cloud, and it can kind of route, and have sophisticated policy around, where data should be placed.
And so you don't really know... like, if I'm just a data scientist and I'm interacting with the data, I don't necessarily know if the tables have already been created, and I'm just doing SQL queries. I don't really know where the data is coming from, and that's one of the nice things about Presto, right. You can have multiple different data sources: you can have some relational databases, you could have an object store.
You could have some older data that's in HDFS, and from the data scientist's perspective, they don't know that it's being sourced from one place or another, and so this is kind of nice. If you want to create tables that map to an object store, you know, it's as simple as running a few statements, and then you provide basically an external location, right.
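Those few statements, with Presto's Hive connector, look roughly like this; it's a sketch, and the catalog, schema, bucket, and column names are all hypothetical:

```sql
-- Declare a table over Parquet files that already exist in a bucket.
-- Names and columns are illustrative; the WITH properties
-- (format, partitioned_by, external_location) are Hive connector options.
CREATE TABLE hive.web.page_views (
    user_id bigint,
    url     varchar,
    ts      timestamp,
    dt      varchar               -- partition column, e.g. '2019-10-28'
)
WITH (
    format            = 'PARQUET',
    partitioned_by    = ARRAY['dt'],
    external_location = 's3a://analytics-bucket/page_views/'
);
```

No data moves when this runs; Presto just records where the existing objects live and how to interpret them.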
If the data is laid out by date, then it'll read all the files that are in... like, it'll filter on the path, right. So it'll query for the list of all the objects that are in the bucket with this particular prefix that match, you know, based on the time range, and then it'll read in all those files. And then, you know, Parquet has metadata and so on and so forth, so it'll bring all that in, and the person writing the SQL has no idea.
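That path filtering is usually called partition pruning. As a sketch against a hypothetical table partitioned by a `dt` date column, a predicate on the partition column limits which object prefixes Presto even has to list:

```sql
-- Only objects under the dt=2019-10-21 ... dt=2019-10-28 prefixes
-- are listed and read; other partitions are never touched.
SELECT url, count(*) AS views
FROM hive.web.page_views
WHERE dt BETWEEN '2019-10-21' AND '2019-10-28'
GROUP BY url
ORDER BY views DESC
LIMIT 20;
```

The query author just writes a WHERE clause; the pruning against the object store happens underneath, exactly as described here.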