►
From YouTube: Ceph Days NYC 2023: SQL on Ceph
Description
Presented by: Patrick Donnelly | IBM
Ceph was originally designed to fill a need for a distributed file system within scientific computing environments but has since grown to become a dominant *unified* software-defined distribute storage system. This talk will cover the new development of an SQLite Virtual File System (VFS) on top of Ceph's distributed object store (RADOS). I will show how SQL can now be run on Ceph for both its internal use and for new application storage requirements.
https://ceph.io/en/community/events/2023/ceph-days-nyc/
A
All
right,
hey
folks,
I'm
Patrick,
Donnelly
I
work
primarily
on
CFS
for
IBM,
although
that's
not
the
topic
of
conversation
today,
but
feel
free
to
stop
me
outside.
If
you
want
to
chat
about
7fs
at
all,
today,
I'm
going
to
be
talking
about
SQL
and
stuff,
which
is
a
provocative
title
to
get
people
in
the
room,
I'm
really
going
to
be
talking
about
SQL
light
on
Seth,
which
is
should
also
be
exciting,
but
maybe
it's
not
the
distributed
sequel.
You
may
be
thinking
of
we'll
talk
about
that
more
later.
A
So
here's
a
brief
outline
of
my
talk,
but
while
people
skim
that,
because
I
don't
need
to
read
it
to
you
who
here
is
used,
rados,
okay,
who's
actually
used
liberatos,
okay,
just
a
few
people
and
who
Harris
use
SQL,
Lite,
okay,
so
lots
of
people.
So
we
should
all
be
pretty
excited
about
that.
A
This
slide
is
is
fairly
famous,
because
it's
almost
in
every
step
developer
talk
very
exciting.
We
got
our
three
pillars
of
of
Seth.
They
all
talk
to
libratos,
that's
what
they
have
in
common,
but
did
you
know
the
set
manager
also
talks
to
liberatos
very
exciting
yeah,
so
the
set
manager
does
many
operations
on
libratos.
Anything
that
needs
to
be
persisted.
Some
things
are
also
stored
in
what
we
call
this.
A
The
the
monitor
config
store,
which
we
use
for
a
super
reliable
store
that
can't
that
has
to
operate
even
when
rados
is,
is
inoperable,
especially
in
like
early
bootstrap
stages
of
a
cluster
when
we
don't
even
have
an
OSD
yet,
but
for
for
some
things
we
can,
we
can
just
persist
directly
to
rattles.
A
So
let's
talk
about
the
set
manager,
design
Dan
talked
about
it
earlier.
You
know.
We
now
have
this
magical
way
to
run
Python
and
ceph.
As
part
of
that
manages
cluster
operations,
we
have
these
different
modules,
so
everything's
abstracted
into
a
ceft
manager
module
these
modules.
Take
care
of
things
like
orchestrating
demons.
Orchestrating
upgrades
fadm
is:
is
partially
a
CF
manager
module.
A
A
A
That
actually
does
talk
to
to
ffs,
but
generally
it's
a
standalone
demon.
So
what
does
it
provide
it?
It
provides
a
basically
a
glue
for
Python
and
ceph
lets.
The
manager
modules
talk
to
things
like
the
mon
client,
get
access
to
the
monitor,
Maps
OSD
maps
and
then
also
some
of
the
modules
even
need
to
use
labrados
for
persistent
state.
What's
an
example
of
such
a
module,
we
have
the
device
health
device,
health
module.
A
A
So
it's
a
fairly
simple
module,
but
one
of
the
things
it
was
doing
is
that
it
was
basically
using
liberatos
to
store
all
these
all
of
this
data.
And
if
anyone
here
who
has
used
labrados
knows
it's
not
really
the
most
programmatically
friendly
Library
I
mean
it
it's
a
good
library
for
for
doing
distributed
storage,
but
it
forces
you
to
think
about
things
that
are
rather
difficult
problems,
things
like
consistency
or
how
do
I
scale
out
right.
A
These
are
not
really
things
that
you
want
to
spend
time
doing
in
in
a
language
like
python,
so
this
is
where
sqlite
enters
the
picture
for
those
of
us
who
have
not
used
sqlite,
it's
an
application
library
that
allows
you
to
store
sqlite
database
as
a
regular
file,
so
a
single
blob
of
data,
it's
known
for
being,
especially
if
you
go
to
their
website,
it's
known
for
being
one
of
the
most
use
SQL
database
engine
in
the
world.
It's
on
everybody's
phone,
it's
on
almost
every
computer,
that's
deployed
in
the
world.
A
It
has
a
very
rich
C
API,
allowing
us
to
embed
it
and
also
extend
it
and
then
another
highlight
of
SQL
light,
which
was
one
of
the
reasons
I
was
looking
at
this
at
all
was
it's.
It
has
an
easy
binding
for
python.
Now,
of
course,
now
we
have
Python
and
Stefan
in
the
form
of
the
stuff
manager.
A
So,
let's
see
if
we
can
bring
these
two
worlds
together,
the
way
we
actually
do,
that
is
through
the
sqlite
VFS,
and
that
at
the
beginning
only
provided
access
to
what
was
What's
called
the
Unix
VFS
access
to
the
Unix
file
system
or
any
file
system
local
file
system.
A
A
A
What
does
it
actually
look
like?
We
have
two:
it
actually
ends
up
creating
two
files,
the
journal
and
the
database.
The
journal
is
usually
deleted
by
default,
so
you
don't
ever
see
it
if
you
do
an
LS
after
interacting
with
your
database
but
predominantly,
but
there
is
actually
two
different
files
that
we
care
about
so
enter
libsep
sqlite.
So
cephsqlite
is
a
VFS
that
allows
you
to
stripe.
He
sqlite
database
over
rattles
the
the
real
highlight
of
lipsteps
equal
light.
Is
it
doesn't
require
any
application
modification
in
order
to
use
it?
A
You
just
have
to
load
this
F
SQL,
light
Dynamic
library
and
and
specify
an
alternate
URI
for
your
for
your
database.
So
again,
the
journal
in
that
database
are
automatically
striped
over
the
osds.
You
don't
have
to
think
about
it
at
all.
You
just
use
your
sqlite
database
like
normal.
A
How
this
actually
works
is
on
a
new
library
called
Simple
rattle
striper.
This
is
just
a
simple
interface
that
allows
us
to
stripe
a
variable
size,
blob
onto
a
number
of
objects.
This
is
something
we're
already
fairly
good
at
doing
at
doing
with
SEF
it's
ffs
files,
RBD
images,
our
rgw
objects.
You
know
these
are
things
that
get
striped
over
multiple
objects.
A
Similarly,
we're
going
to
do
the
same
thing
for
this
fsqlite
database.
This
is
conceptually
based
off
of
a
CERN
Library
where's
where's
Dan,
oh
Dan,
liberato
striper,
which
was
developed
by
CERN.
Unfortunately,
I
couldn't
use
it
because
it
does
a
sync
on
every
or
it
does
write
reads
and
writes
all
synchronously
by
acquiring
a
lock.
I
was
very
sad
about
that.
A
So
again,
simple,
rattle
striper
provides
all
these
Primitives
that
we
use
open,
read,
write
scene
close
and
it
Stripes
all
the
data
over
these
objects.
You
can
actually
use
a
rattles
client,
the
CLI
client,
with
a
striper
option
to
read
and
write
these
databases
out
of
rattles,
so
it's
actually
compatible
with
the
librato
striper,
that's
already
in
stuff.
A
So
if
you
want
to
use
lipsep
sqlite,
what
do
you
have
to
do
again?
I
told
you,
you
don't
have
to
modify
your
your
your
application.
I
wasn't
lying.
You
do
have
to
tell
our
SQL
light
to
load
the
VFS.
That's
done
through
a
special
sqlite
command
dot
load.
A
You
just
specify
this
fsql
light
library
and
away
it
goes
it
loads
it
up,
and
then
you
open
the
URI
associated
with
f
sqlite,
which
is
involves
specifying
a
pool
ID
a
rattle
same
space
in
the
pool
which
is
optional
and
then
the
name
of
your
database
and
then
the
the
VFS
which
would
be
ceph.
That's
all
there
is
to
it.
Maybe
some,
depending
on
your
your
situation.
There
may
be
some
Enviro
environment
variables
you
may
need
to
set,
namely
like
where
the
cefconf
is
or
any
additional
arguments
you
want
to
specify.
A
So
it's
the
way
this
works
is
it.
The
SEF
manager
now
has
a
uniform
interface
for
interacting
with
a
asqlite
database,
that's
associated
with
every
manager
module,
so
every
manager
module
can
just
automatically
use
the
sqlite
database.
It
doesn't
need
to
be
tailored
at
all.
To
do
that,
we
have
this.
We
also
now
have
a
DOT
manager
pool
which
some
of
you
may
have
noticed.
It
was
actually
in
the
previous
talk.
You
can
thank
me
for
that.
A
It
used
to
be
called
the
device
health
pool,
but
we
just
generalized
it
to
be
the
dot
manager
pool
so
and
be
used
by
all
the
manager
modules.
Hopefully,
your
upgrades
went
well
when
that
rename
took
place
the
the
actual
access
within
the
stuff
manager
you
can
see
on
the
right
in
that
code.
Snippet
is
fairly
simple.
A
You
just
special
specify
your
SQL
statement,
as
you
normally
would,
when
using
SQL
Lite
within
Python,
and
then
you
have
to
acquire
a
database
lock
in
case
you
have
multiple
threads
within
a
manager
module
while
accessing
the
database.
A
You
have
to
synchronize
on
a
lock
which
is
provided
to
you
and
then
actually
create
a
transaction
with
it
within
the
database
by
specifying
the
database
in
the
with
statement
and
then
finally,
you
just
execute
the
SQL
and
then
I
can
do
things
like
read
out
the
data
with-
and
this
is
a
this
is
a
simple
generator
which
is
doing
a
select
on
B
from
who,
where
a
equals
question
mark
a
is
provided
to
the
function
f,
so
we're
reading
all
the
columns
B
in
in
in
that
table.
A
Let's
see
what
else
so
everything
else
is
abstracted
away.
Where
is
the
data?
So
in
all
of
these
manager
modules
the
database
that's
associated
with
each
module
is
just
called
main.db
there's
also
a
main.db
journal
file.
As
I
said,
the
sqlite
has
a
journal
file
associated
with
each
database
and
then
in
the
dot
manager
pool,
there's
a
namespace
named
for
each
manager,
module
and
then
a
number
of
files
or
sorry
objects
for
each
given
database,
blob,
main.dbe
or
main.dbjournal.
A
So
that's
what
you'll
find
there
if
you
did
a
rattles
LS
there,
the
device
health
module.
So
now
the
schema
is
fairly
simple,
but
I
didn't
talk
about
earlier.
Forgive
me,
the
device
health
object
module.
What
it
would
do
is
it
would
create
an
object
in
rados
named
after
the
device
that
it
was
collecting
health
for
and
then
it
would
use
the
omap
key
values
to
store
a
series
of
smart
data
dumps
in
in
the
in
the
omap.
A
So
now
this
is
all
just
transferred
into
a
very
simple
schema
for
keeping
track
of
the
devices
and
device
health
metrics.
So
we've
got
a
single
table
which
gives
us
the
device
ID
it's
very
simple
table,
and
then
the
device
health
metrics,
which
gives
us
the
the
time
stamp
of
when
the
sample
was
taken.
A
The
device
ID
associated
with
that
that
sample
and
then
the
the
raw
smart
text
and
then
the
primary
key
for
that
that
table
is
the
device
ID
and
the
timestamp,
and
so
I
can
do
complex
queries
to
look
up
a
series
of
smart
data.
Smart
data
dumps
and
then
also
I,
can
like
if
I'm
trimming,
the
journal
or
sorry
the
database
as
part
of
routine
garbage
collection.
A
I
can
do
very
simple
delete
statements
as
well,
and
then
this
is
an
actual
snippet
of
real
code
from
the
manager
module
I
just
took
out
a
few
keywords:
first,
horizontal
space,
but
otherwise
this
is
basically
exactly
the
code.
If
I'm
up
putting
a
device
metrics
into
the
device
health
metrics
table,
I
am
going
to
first
create
the
device
ID
if
it
doesn't
already
exist,
create
an
epic
associated
with
the
timestamp
and
then
ver
the
the
third
step
json.load.
That's
where
I'm
loading,
the
the
the
the
data
dump,
make
sure
it's
valid.
A
A
So
it
looks
like
normal
sqlite
code
in
Python,
so
I,
don't
know
why
the
GIF
already
started,
but
it
did
so.
This
is
lib
sets
equal
light
in
action
wisely
or
not.
I
tried
to
do
the
tutorial
and
then
record
it,
but
so
this
is
a
SEF
status
here,
I'm
just
listing
the
pools,
you
can
see
the
dot
manager
pool
and
then
I'm
purging
it
so
I'm
deleting
everything
in
it
and
I'm
just
verifying
that.
A
A
It's
just
setting
the
stage
for
the
next
part,
so
here
I'm,
going
to
actually
Run
sqlite
3
command
line
tool
to
put
a
database
in
SEF
I'm
doing
this
within
a
developer
environment,
so
I'm
setting
a
number
of
environment
variables,
namely
where
to
find
the
libsef
sqlite
library.
It's
in
my
build
directory.
A
I
have
to
tell
it
where
the
ceph
comp
file
is
the
which
key
ring
I'm
going
to
use
and
then
which
credential
I
want,
which
is
the
admin
credential,
which
obviously
is
a
developer,
will
tell
you,
don't
don't
use
that
in
routine
operation?
A
A
So
I'm
going
to
put
it
in
the
pool
a
in
namespace
B
and
the
the
database
name
will
be
a.db
I'm.
Creating
a
simple
table
with
integer
an
a
single
integer
column
and
I'm,
inserting
one
value
into
that
table,
one
so
I'm
just
going
to
dump
it
verify
that
the
dump
of
the
the
sqlite
is
exactly
what
I
would
expect,
leaving
SQL
light
now
I'm
going
to
do
a
rattles
command
to
LS
all
the
objects
on
that
pool.
You
can
see
in
names,
namespace
B,
the
single
database
object
a
DOT
db.000.
A
and
then
finally,
as
I
promised
you,
you
can
use
the
striper
command
to
actually
slurp
the
database
out
of
rados.
If
you
want
to
which
I
did-
and
you
can
just
have
a
look
at
it,
it's
eight
kilobytes,
which
is
why
it
only
took
a
single
object
in
rados
and
then
did
sqlite
on
that
data.
That
database,
which
is
now
local
and
verified
everything,
is
as
it
should
be,.
A
So
in
the
next
one,
I
We
have
basically
the
same
thing
here,
I'm
just
going
to
forgive
it's
copy
paste
because
it's
just
a
lot
of
text,
but
this
is
basically
a
fancy
way
to
create
an
infinite
Loop
limited
by
the
number
of
rows
to
insert
a
number
of
random
integers
into
this
table,
and
you
can
see
that
at
this
point
now
the
table
is
large
enough.
A
A
So,
unlike
my
faustin
version
of
this
talk,
I
have
a
little
bit
of
time
to
actually
talk
about
some
performance
notes.
So
a
lot
of
this
is
already
dealt
with
in
the
SEF
manager,
but
if
you're
playing
thinking
about
using
this
object
yourself,
you
want
to
play
with
it.
There
are
some
caveats:
is
performance
wise
to
actually
squeeze
out
the
most
performance
you
can
with
this
Library,
it's
all
documented
in
the
documentation
which
is
at
the
bottom.
Hopefully,
the
slides
will
be
available
at
the
end
of
this
half
day.
A
If
not,
you
can
always
email
me
to
get
to
get
a
copy
of
the
slides,
but
basically
you
need
to.
There
are
a
number
of
things
you
have
to
control
in
order
to
get
the
maximum
performance
once
one
is
the
page
size.
The
default
page
size
in
sqlite
is
very
small,
which
results
in
excessive
reads
and
writes
to
the
backing.
Osds
I
reckon
I
recommend
raising
that
as
high
as
possible.
A
I
believe
I
use
64k
in
the
manager,
keeping
in
mind
that
a
single
object
can
be
four
megabytes,
so
you
want
to
kind
of
maximize
the
amount
of
or
reduce
the
number
of
iops
involved
with
your
use
of
sqlite
sorry.
That
screen
is
too
small
for
me
to
read
I
you
want
to
use
a
larger
cache
size,
So
reading,
to
avoid
reading
from
rattles.
A
So
there's
no
cache
and
simple
rattle
striper
deliberately
instead
use
the
sqlite
cache
instead
and
the
way
you
can
get
more
cash
is
telling
sqlite
to
use
more
cash.
You
want
to
persist
the
sqlite
database,
so
you'd
want
to
avoid
deletion
of
the
objects
associated
with
like
the
journal
for
the
sqlite
database.
If
you,
if
you
can
so
that's
one
reason,
we
would
pers.
That
would
be
why
we
would
persist
it
to
avoid
those
unnecessary
operations.
A
You
want
to
use
exclusive,
locking,
if
possible,
of
the
sqlite
database
that
can
reduce
your
transaction
lead
at
C
from
five
to
three
operations
per
transaction.
So
normally,
when
I
do
it
enter
a
transaction
in
sqlite
I
have
to
lock
the
database
and
then
do
some
reads
and
writes
and
then
finally
unlock
the
database
if
I
enter
an
exclusive
mode
of
operation,
sqlite
locks
the
database
once
at
the
start
and
unlocks
it
at
the
end,
when
I
close
the
database
so.
A
Whenever
you're
doing
a
transaction
and
then
another
optimization
you
can
do
is
use
the
wall
journaling
in
sqlite,
which
normally
requires
shared
memory
communication
between
clients
of
the
database,
but
if
you're
using
the
exclusive
locking
mode,
then
you
can
you
don't
need
to
use
shared
memory
instead,
sqlite
will
allow
you
to
use
a
wall
journal
without
without
that,
obviously,
if
I'm
using
a
sqlite
database
in
rattles,
there's
not
going
to
be
any
shared
memory,
communication
between
the
clients
which
could
be
anywhere.
A
So
if
I
do
that,
then
you
get
the
most
optimal
behavior
of
one
operation
per
transaction,
because
I've
got
the
database
open
exclusively
and
I'm
just
writing
to
a
wall
journal
in
the
general
case.
So
that's
that's
going
to
get
you
two
to
five
millisecond
latency
depending
on,
of
course,
your
cluster.
If
it's
dance
cluster,
it
could
be
up
to
155
milliseconds.
A
Reads
are
synchronous,
that's
really
unfortunate.
That's
just
baked
in
the
into
the
design
of
SQL
lights.
Vfs
I've
complained
about
it.
It's
actually,
while
it's
not
on
this
site,
it's
a
link
on
the
next
slide.
Where
you
know,
there's
not
really
anything.
I
can
do
about
it.
If,
if
sqlite
does
a
read
through
the
VFS,
it's
a
synchronous
read
to
alteratos,
ideally,
we'd
be
able
to
do
asynchronous
reads
and
then
sqlite
would
do
a
gathering
at
the
end.
A
I
think
I
know
what
you're
talking
about,
but
it's
not
the
right
thing
not
for
reads
any,
but
we
do
do
asynchronous
rights
out
to
El
Dorados
yeah,
so
the
retrospective
on
sqlite
VFS.
The
end
of
this
slide
is
where
I
ranted
the
sqlite
developers
about
the
VFS.
A
A
It's
limited
use
so
far
in
the
other
modules.
Hopefully
gonna
make
improvements
to
that.
Maybe
by
publicizing
it
more
but
I,
for
example,
would
be
nice
to
use
in
Telemetry
to
keep
track
of
reports
that
a
cluster
is
used
in
the
past.
A
We
did
have
a
few
bugs
I'm.
You
know.
Most
of
them
were
related
to
like
some
packaging.
There
was
one
GCC
static,
compiler
error,
I'd,
add
a
rattles
Destructor
to
or
sorry
a
a
Destructor
to
the
library
for
to
avoid
a
bug
we
found
in
in
unit
tests,
and
then
there
is
one
outstanding
bug
which
is
tripped
up
a
number
of
clusters.
A
It's
not
really
debilitating,
but
if
we
lose
the
the
lock
on
the
database,
perhaps
because
the
the
there
was
a
short
Network
partition,
and
so
the
library
was
not
able
to
renew
the
lock
in
the
background,
then
it
automatically
block
lists
itself
to
protect
the
Integrity
of
the
database,
which
means
that
whatever
the
module
was
doing
in
this
case
device
health
won't.
It
can't
continue
doing
and
the
logic
wasn't
yet
built
into
it.
A
A
So,
as
far
as
future
work
right
now,
the
library
is,
does
everything
under
an
exclusive
lock
on
the
database
within
rattles,
so
that
you
know
I'm
all
there's
only
ever
one
reader,
one
writer,
that's
not
any
architectural
limitation.
It's
just
something.
I
need
to
address
by
adding
support
for
multiple
readers
and
then
read
ahead
performance,
which
is
related
to
the
limitation
in
the
the
sqlite
VFS.
A
Only
allowing
one
or
doing
synchronous
reads
so
the
only
way
to
approach
that
problem
from
our
end
is
to
add,
read
ahead
and
that's
going
to
require
exploring
how
read
ahead
Works
in
within
the
Unix
context,
to
see
if
I
can
reproduce
the
correct
read
ahead
performance
in
in
which
will
mirror
give
us
a
somewhat
similar
performance.
In
that
regard,
we
do
have
a
lips
Libs,
ffs,
SQL
Lite,
which
was
developed
a
year
ago
by
a
gsoc
student.
A
It
does
require
a
little
bit
of
cleanup.
It
wasn't,
it
wasn't
merged
yet,
but
that's
another
Library
binding.
We
would
like
to
add
the
main
reason.
You'd
want
to
use
that
over
like
say,
put
sqlite
on
CFS
mounted.
Normally
is,
then,
you
don't
have
to
actually
Mount
CFS.
It's
just
sqlite
on
Libs
ffs,
on
onset,
all
right,
I'm,
very
confused.
Why
the
slides
here.
A
Well,
someone
edited
my
slide
deck.
Sorry,
those
were
appendix
slides.
There's.
My
thank
you
and
my
contact
information.
There's
a
blog
post
concerning
lib
sets
equal
light
if
you'd.
Rather
we
see
this
talk
again
in
text
form
and
the
documentation
which
also
talks
about
all
the
performance
notes.
I
mentioned
earlier.
Any
questions.
A
A
B
A
And
there's
no
chance
of
having
objects
that
are
too
big
either,
because
the
the
objects
are
limited
in
size
by
the
striper.
B
A
So
I
believe
the
architectural
limit
in
sqlite
is
480
terabytes
I've
gotten
up
to
a
few
terabytes
before
I.
You
know
stopped
you
I
would
not
advise
getting
databases
that
size,
because
we'll
be
you'll
start
to
hit.
You
know
when
you
actually
want
to
read
out
of
the
database
you're
going
to
start
hitting
performance
issues,
namely
because
of
the
lack
of
asynchronous
reads
so
I
Would
by
default
recommend
at
least
at
this
time.
A
You
know,
databases
on
the
smaller
side
and
maybe
more
databases
rather
than
one
large
database.
Once
you
start
wanting
to
go
larger
I
would
I
would
start
looking
at
things
like
postgres,
which
may
someday
end
up
on
Seth.