Description
Let's peek under the hood of serialization formats and see how properties inherent to the data representations themselves will either help or hinder us for a given problem.
Learn what to consider when writing your own formats by looking inside some of the best.
More at https://rustfest.global/session/9-everything-is-serialization/
Because of these wires and the physical separation of the system's components, the data which drives each component must be in well-specified, agreed-upon formats at this level of abstraction of the system.
We usually think of the data in terms of serialization. Serialization at this level includes many well-known formats: MP3, JSON, and HTTP, among others.
There are also specified, predefined, purpose-built serialization formats. At this level we're thinking about smaller serialization formats, like the little-endian format for integers, instruction sets, addresses, floating point, opcodes, and many others. At each level of abstraction of the computer system, you will find components driven by data sent over wires in serialization formats.
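As a minimal sketch (my example, not from the talk's slides), here is what that little-endian format for integers looks like as a round trip in Rust:

```rust
// Little-endian: the least significant byte comes first on the wire.
fn main() {
    let n: u32 = 0x1234_5678;
    let bytes = n.to_le_bytes();
    assert_eq!(bytes, [0x78, 0x56, 0x34, 0x12]);

    // The receiving component parses the same four bytes back out.
    let roundtrip = u32::from_le_bytes(bytes);
    assert_eq!(roundtrip, n);
}
```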
Perhaps you think in high-level tasks, like serving an HTTP request. High-level tasks are all described in the same terms: first, parsing data (in this case a URL, which is a standard serialization format having a host, a path, and other data), followed by a transform, and lastly a serialization (in this case, an HTTP response).
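Here is a rough sketch of that parse, transform, serialize shape in Rust. The names and the crude URL handling are mine, for illustration only, not the talk's code:

```rust
// Serving a request is: parse a serialization format (the URL),
// transform, then serialize another format (the HTTP response).
fn serve(raw_url: &str) -> String {
    // Parse: a crude split of "host/path", for illustration only.
    let (host, path) = raw_url.split_once('/').unwrap_or((raw_url, ""));

    // Transform: compute the body from the parsed pieces.
    let body = format!("host={host} path=/{path}");

    // Serialize: write out the HTTP/1.1 response format.
    format!(
        "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n{}",
        body.len(),
        body
    )
}

fn main() {
    println!("{}", serve("example.com/users/42"));
}
```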
If the serialization format has low entropy, the throughput of data flowing through the system is limited by the wires connecting components. Put another way, bloat in the representation of the data throttles the throughput of information. Also, data dependencies in the serialization format pause the flow of data through the system, incurring the latency cost of the wires.
Wires also impose a size limit, even when that limit is bound primarily by throughput and time. I'd like to drive these points home with a series of case studies. We will look at some properties inherent to the data representations used by specific serialization formats and see how the formats themselves either help us solve a problem or get in the way. In each example, we will also get to see how Rust gives you best-in-class tools for manipulating data across any representation.
Serialization formats tend to reflect the architecture of the systems that use them. Our computer systems are constructed of many components, nesting into other subsystems comprised of more components. Serialization formats nest in a way that reflects that. For example, inside a TCP packet (a serialization format) you may find part of an HTTP request.
The goal is to push serialized data from each format into the same buffer, rather than serializing into separate buffers independently and then copying each buffer into the nesting format above. Moving control over memory to the caller and safely passing mutable or immutable data is the name of the game. These capabilities are all necessary when parsing and writing nesting serialization formats.
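A minimal sketch of that pattern in Rust (the formats here are invented for illustration): the outer format hands its buffer down, and the nested format appends in place, with no intermediate allocations or copies.

```rust
// Each format writes into the same caller-owned buffer,
// so nesting formats costs no extra allocations or copies.
fn write_inner_format(buf: &mut Vec<u8>, value: u32) {
    buf.extend_from_slice(&value.to_le_bytes());
}

fn write_outer_format(buf: &mut Vec<u8>, values: &[u32]) {
    // Outer format: a length prefix, then the nested inner format,
    // pushed directly into the same buffer.
    buf.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for &v in values {
        write_inner_format(buf, v);
    }
}

fn main() {
    let mut buf = Vec::new();
    write_outer_format(&mut buf, &[1, 2, 3]);
    assert_eq!(buf.len(), 4 + 3 * 4); // length prefix + three u32s
}
```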
What's great about this is that Value is generic over the kind of text to parse into. One type that implements the Text trait is String, so you can parse a GraphQL query with String as the text type, and because Value then will own its data, this allows you to manipulate the GraphQL and write it back out. That capability comes with a trade-off.
That's okay: Rust takes this up a notch, because there is a third type from the standard library that implements Text. This type is a Cow of str, a clone-on-write string. With this safe and convenient type, enabled by our friend and ally the borrow checker, we can parse the GraphQL in such a way that all of the text efficiently refers to the source, except just the parts that you manipulate, and it's all specified at the call site. This is the kind of pleasantry that I've come to expect from Rust dependencies.
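The real parser's Text trait has more to it; this is only a rough sketch of the call-site choice being described, with illustrative types and signatures of my own:

```rust
use std::borrow::Cow;

// Sketch of the idea: the parsed Value is generic over its text type.
// String owns, &str borrows from the source, and Cow<str> borrows
// until you actually mutate.
struct Value<T> {
    text: T,
}

fn parse<'a, T: From<&'a str>>(source: &'a str) -> Value<T> {
    Value { text: T::from(source) }
}

fn main() {
    let source = String::from("query { hero }");

    // Owned: Value can outlive the source and be mutated freely.
    let owned: Value<String> = parse(&source);

    // Borrowed: zero-copy, tied to the source's lifetime.
    let borrowed: Value<&str> = parse(&source);

    // Clone-on-write: borrows now, clones only the parts you touch.
    let mut cow: Value<Cow<str>> = parse(&source);
    cow.text.to_mut().push_str(" # edited");

    assert_eq!(owned.text, *borrowed.text);
}
```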
The issue is in the way that GraphQL nests its serialization formats. The GraphQL string value is Unicode, but the way that GraphQL embeds strings is by putting quotes around them. With this design choice, any quotes in the string must be escaped, which inserts new data interspersed with the original data. This comes with consequences.
One: when encoding a GraphQL string value, the length of the value is not known up front; the length may increase from the re-encoding process. That means that you can't rely on resizing the buffer you are encoding to up front before copying, but instead must continually check the buffer size when encoding this value, or over-allocate by twice as much. Two: when reading GraphQL, it is impossible to refer to the data in place, because it needs to go through a parse step to remove the escape characters.
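To make that second point concrete, here is a sketch (mine, not the talk's) of why unescaping forces an allocation rather than a borrow:

```rust
// Removing escape characters forces a parse step, so the result
// cannot simply borrow the raw input bytes.
fn unescape(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // The escape is interspersed with the original data,
            // so we must copy the payload out, not borrow it.
            if let Some(next) = chars.next() {
                out.push(next);
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    assert_eq!(unescape(r#"say \"hi\""#), r#"say "hi""#);
}
```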
This problem compounds if you want to nest a byte array containing another serialization format in GraphQL. There is no support for directly storing bytes in GraphQL, so bytes must be encoded into a string using base16 or base64 or similar. That means three encode steps are necessary to nest another format: there is encoding the data as bytes, encoding that as a string, and finally re-encoding the escaped string. That may compound even further.
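A small sketch of those three encode steps (my illustration, using a hand-rolled base16 rather than any particular crate):

```rust
// The three encode steps needed to nest bytes in a quoted-string
// format: 1. the inner format's bytes, 2. base16 into a string,
// 3. re-encoding the string with quotes and escapes.
fn to_base16(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

fn quote_and_escape(s: &str) -> String {
    // Escape backslashes first, then quotes.
    let escaped = s.replace('\\', "\\\\").replace('"', "\\\"");
    format!("\"{escaped}\"")
}

fn main() {
    let inner_format: &[u8] = &[0xde, 0xad, 0xbe, 0xef]; // step 1
    let as_string = to_base16(inner_format);             // step 2
    let embedded = quote_and_escape(&as_string);         // step 3
    assert_eq!(embedded, "\"deadbeef\"");
}
```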
JSON strings are also quoted strings, meaning the same data goes through another allocation and decode step. It is common to log the JSON: another layer, another encode step. So now, if we want to get that binary data from the logs, it's just allocating and decoding the same data over and over, up through each layer, for every field.
The difference between the two can be the difference between having decoding be a major bottleneck or instant. No amount of engineering effort spent on optimizing the pipeline that consumes the data can improve the situation, because the cost is intrinsic to the representation of the data. You have to design the representation differently to overcome this.
The bottom part depicts the data with three f32 slots, one for each position coordinate; three slots, one for each color channel; and a blank slot for padding, which just makes everything line up nicely.
If you want to break up data into batches for parallelism, the most straightforward way you can do that is to have fixed-size structs stored in contiguous arrays. With that choice of serialization format, you can know where any arbitrary slice of data lives, and therefore break the data up into batches of any desired size in constant time.
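For example (a sketch of mine, not from the slides), with 16-byte vertices the byte range of any batch is pure arithmetic:

```rust
// With fixed-size structs in a contiguous array, the byte range of
// any batch is a constant-time computation.
const VERTEX_SIZE: usize = 16; // e.g. 3 x f32 position + 3 x u8 + pad

fn batch_range(start: usize, len: usize) -> std::ops::Range<usize> {
    start * VERTEX_SIZE..(start + len) * VERTEX_SIZE
}

fn main() {
    let buffer = vec![0u8; 1000 * VERTEX_SIZE];
    // Hand vertices 250..500 to another thread: no scanning needed.
    let batch = &buffer[batch_range(250, 250)];
    assert_eq!(batch.len(), 250 * VERTEX_SIZE);
}
```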
The serialization format reflects the architecture of the system. Contrast that to sending the data to the GPU in, say, JSON. With JSON, the interpretation of every single byte in the data depends on every preceding byte. The current element's length is unknown until you search for and find a token indicating the end of that item, often a comma or a close bracket.
Arguably, it's the data dependencies that make writing a correct JSON parser a challenging engineering problem in the first place. Returning to the vertex buffer format: if we were to graph its data dependencies, the interpretation of each byte in the data is only dependent on the first few bytes in the description of the buffer.
The trade-off is inherent to the representation. JSON can utilize fewer bytes for smaller values: in JSON, for example, a smaller number will take fewer bytes to represent than a larger number. Integers between 0 and 9 take 1 byte, because they only need a single character; numbers between 10 and 99 take 2 bytes, and so on. Here's a depiction of that.
To recap: that the format used by vertex buffers has a different set of capabilities than JSON is not something that can be worked around with any amount of engineering effort when consuming the data. Those capabilities are inherent to the representations themselves, and if you want different capabilities, you need to change the representation.
Okay. Having established that writing the data is the problem we are trying to solve, and the characteristics the serialization format must have because of the GPU's architecture, let's write a program to serialize the data. We'll write this program in two languages, first in TypeScript and then in Rust. I don't do this to disparage TypeScript.
The function we will write is a very stripped-down version of what you might need to write a single vertex to a vertex buffer for a game. Our vertex will consist of only a position with three 32-bit float coordinates and a color having three u8 channels. There are likely significantly more fields you would want to pack into a vertex in a real game, but this is good for illustration.
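The slide's code isn't captured in this transcript; the following is a hedged reconstruction along the lines the talk describes (the names and details are mine, not a verbatim copy of the slide):

```rust
#[repr(C)] // we control the layout: field order and padding are fixed
#[derive(Clone, Copy)]
struct Color {
    r: u8,
    g: u8,
    b: u8,
}

#[repr(C)]
#[derive(Clone, Copy)]
struct Vertex {
    position: [f32; 3], // 12 bytes
    color: Color,       // 3 bytes
    _pad: u8,           // the blank padding slot, lining up to 16
}

struct VertexBuffer {
    bytes: Vec<u8>,
}

impl VertexBuffer {
    fn push_vertex(&mut self, vertex: &Vertex) {
        // Vertex already is the wire format, so writing it is a copy.
        let bytes = unsafe {
            std::slice::from_raw_parts(
                vertex as *const Vertex as *const u8,
                std::mem::size_of::<Vertex>(),
            )
        };
        self.bytes.extend_from_slice(bytes);
    }
}

fn main() {
    let mut buffer = VertexBuffer { bytes: Vec::new() };
    buffer.push_vertex(&Vertex {
        position: [0.0, 1.0, 2.0],
        color: Color { r: 255, g: 0, b: 0 },
        _pad: 0,
    });
    assert_eq!(buffer.bytes.len(), 16);
}
```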
As we did with the interfaces in the TypeScript program, we leave out the interface for the buffer holding the byte array and count; we aren't going to need that now. Let's look at the function to write the vertex: buffer.push_vertex. That's it. Rust isn't hiding the fact that our data is represented as bytes underneath the hood, and has given us control of the representation. We needed to annotate the structs on the previous slide with #[repr(C)], moving all error-prone work into the compiler. Between JavaScript and Rust, which do you think would have better performance?
Remember that, because the choice of serialization format is a deciding factor in how you can approach the problem, the advantage Rust gives us, of being able to choose how data is represented, carries forward into every problem, not just writing vertex buffers. For the final case study, I'd like to take some time to go into how a new experimental serialization format called Tree-Buf represents data in a way that is amenable to fast compression.
We have a data set: a game of Go. What we want is an algorithm to predict the next move in the game. To help us, we're going to visualize the raw data from the data set. This scatter plot is a visual representation of the actual bytes of a Go game. As you read from left to right, there is a dot for each byte in the file, with the dot's height corresponding to the value of that byte.
Our eyes can kind of pick up on some clustering of the dots; they don't appear random. That the data does not appear random is a good indication that some sort of compression is possible. Coming up with an algorithm to predict the value of a dot may not be apparent from just looking at a scatter plot, though.
Gzip's prediction works great for text, where words are often repeated. At least in the English language, words are constructed from syllables, so it's even possible to find repetition in a text in the absence of repeated words. The problem is that in our Go game, the same coordinate on the board is seldom repeated.
We first need to separate the data so that logically related data are stored locally. Instead of writing an x followed by a y, like most serialization formats would do, let's write out all the x's first and then all the y's. Here's a visual representation of that: it looks maybe tighter than before.
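That separation is a struct-of-arrays transform. A minimal sketch of it in Rust (names and the sample moves are mine):

```rust
// Store all x coordinates together, then all y coordinates, so that
// logically related values sit next to each other.
fn split_moves(moves: &[(u8, u8)]) -> (Vec<u8>, Vec<u8>) {
    let xs = moves.iter().map(|&(x, _)| x).collect();
    let ys = moves.iter().map(|&(_, y)| y).collect();
    (xs, ys)
}

fn main() {
    let moves = [(3, 3), (15, 15), (3, 15), (16, 4)];
    let (xs, ys) = split_moves(&moves);
    assert_eq!(xs, [3, 15, 3, 16]);
    assert_eq!(ys, [3, 15, 15, 4]);
}
```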
That's going to be our prediction. With our prediction algorithm in hand, next we need to come up with a representation. We're going to write a variable-length encoding. In this graphic we have three rows of boxes, where we will describe the variable-length encoding. Each box holds a single bit. There are three boxes on the top row; the first box contains zero, and the next two boxes are blank.
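The slide's exact bit scheme isn't captured in this transcript. As a stand-in, here is a sketch of the general idea of a prefix-bit variable-length encoding (LEB128-style), where a leading 0 bit means the value fits in the remaining bits of one byte:

```rust
// A prefix-bit variable-length encoding: a leading 0 bit means the
// value fits in the remaining 7 bits of this byte; a leading 1 bit
// means another byte follows.
fn encode_var(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(0x80 | (value & 0x7f) as u8); // continuation bit set
        value >>= 7;
    }
    out.push(value as u8); // leading 0 bit: final byte
}

fn main() {
    let mut out = Vec::new();
    encode_var(3, &mut out);   // small value: one byte
    encode_var(300, &mut out); // larger value: two bytes
    assert_eq!(out, [0x03, 0xac, 0x02]);
}
```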
It didn't have to work out that way, but we can do this because a Go board has only 19 points along each axis, which means that we're not using the full range of a byte. If we did use the full range, the encoding would have to have some values extend beyond 8 bits. But indeed, most data sets do not use the full range of the underlying types used in the representation.
It requires less work to subtract the previous value in a sequence than to search for redundancy by scanning many values in the sequence. Note that this is not the best prediction algorithm possible; if you wanted to get serious about compression and squeeze the file down further, you could make an even better prediction algorithm.
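A sketch of that delta prediction (my illustration): predict that each value equals the previous one and store only the difference, which the variable-length encoding above then stores in very few bits.

```rust
// Delta prediction: store the (signed) difference from the previous
// value instead of the value itself.
fn delta_encode(values: &[u8]) -> Vec<i16> {
    let mut prev = 0u8;
    values
        .iter()
        .map(|&v| {
            let delta = v as i16 - prev as i16;
            prev = v;
            delta
        })
        .collect()
}

fn main() {
    let xs = [3, 4, 4, 5, 16, 15];
    // Small deltas dominate in clustered data like Go coordinates.
    assert_eq!(delta_encode(&xs), [3, 1, 0, 1, 11, -1]);
}
```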
Sounds like it might be useful for more than just Go. Let's review by comparing these methods in a matrix. This chart shows each of the three methods we considered written across the top: gzip, delta compression, and AI compression. Written on the side, we have three categories. The compression ratio is how small the file is; performance is how fast we can read and write the file; and the difficulty is the engineering effort required to produce and maintain the code that implements the compression method.
The overall score hardly matters, though, because where gzip wins is in the difficulty category. It doesn't take a lot of engineering effort to grab an existing crate from crates.io and run gzip on your data. You get a lot with minimal effort using something like gzip, and effort is important for working professionals under tight deadlines.
That's especially true when those performance gains come with any engineering cost: you're not likely to be criticized by your peers for using gzip, whereas the delta compression method required a fair bit of custom code. But what if we could move that check mark for the lowest difficulty in engineering effort from gzip to the delta compression method?
The next thing we did with the delta compression was to apply a type-aware compression method after having arranged the data to maximize the locality of related data. Subtracting ints and writing the deltas was only possible because we knew that the bytes were u8s, and not, say, strings, where subtracting adjacent characters would produce nonsense.
Tree-Buf, again, generalizes this principle and uses different high-performance, type-aware compression methods for the different kinds of data in the tree. Since no compression method is one-size-fits-all, it even spends some performance trying a few different compression techniques on a sample of the data from each buffer.
The benchmark data has a cardinality that reflects a real-world distribution of values. What will be measured is relative CPU time to round-trip the data through serialize and deserialize, and the relative file size. The format we'll be comparing to is MessagePack, which, as described by messagepack.org, is like JSON but fast and small.
The improvements are significant, considering that the first thing Tree-Buf has to do is reorganize your data into a tree of buffers before starting to write, and then reverse that transformation when reading the data. It has no right to even match the speed of MessagePack, much less significantly outperform it. If you wonder how this can be real, the answers have everything to do with data dependencies and choices made in representing the data as bytes: everything we just covered.
GeoJSON is a relatively compact format as far as JSON goes, because GeoJSON doesn't describe each point with redundant tags like latitude and longitude repeated over and over, as most JSON formats would, but instead opts to store that data in giant nested arrays to minimize overhead. Here are the results; the green box is GeoJSON.
There are more capabilities that you can design into representations that we did not explore today. If you consider serialization and representation as first-class citizens next to algorithms and code structure, and if you use the proper tools to parse and manipulate data, you'll be surprised by the impact. Thank you.
Host: Yes, thank you, Zach. Actually, to be honest, your serialization talk is much, much deeper than I understand right now. Thank you for such a deep presentation. And we have...
Zach: Yeah, well, I'm sorry that I didn't make it as accessible as I planned it to be. I did find it a bit of a struggle to get the ideas down into a small package and really present them. It was a struggle, so I'm sorry that it wasn't as easy to follow as I had hoped when I planned the talk.
Host: No worries about it; it has a lot of, I'd say, case studies, so it should be fine. To be honest, I need some more time to digest your presentation right now, but I understand how important this actually is. Yeah, that should be our first lesson in programming.
Zach: Yeah, that really is the focus of the talk. I mean, if you want to bring your programming to the next level, I think that the best way to do that is to just go back to the basics of the problem. Every problem really is a problem about data: transforming data and then, in the end, serializing data. So just keeping that in mind, instead of adding a lot of layers of complexity on top of that, and really focusing on that problem, I think can help a lot. If you're looking for presentations to watch, I'd recommend, for example, watching Data-Oriented Programming by Mike Acton. He talks about a lot of things in the same terms, so that's interesting. Definitely start there and then just follow the line with data-oriented programming; there's a lot to learn in that field.
Host: Yeah, thanks for the suggestion. And we have one question, that is: is there anything Tree-Buf is bad for?
Zach: Sure. So Tree-Buf takes advantage of being able to find predictability in data with arrays.
So if you want to do, say, messages like server-to-server communication, for things which do not contain arrays, then maybe something like protobuf would be better for that; Tree-Buf tries hard not to be bad in that kind of case where there are no arrays.
But there are some fundamental trade-offs: wherever Tree-Buf can optimize for the case with arrays, it will do so, because the gains there can be significant, even at some cost to the case where there isn't that.
I'm going to stick around in the chat too for a little bit, so if anyone has questions there, I'll answer them.