►
Description
A talk about "perceptual hashing" which allows us to find visual similarities between images.
Rust and Tell is a monthly event (the last Tuesday of every month) to share ideas and learn about new things about Rust.
Check out https://berline.rs/ for more.
A
Good
evening,
ladies
and
gentlemen,
here
tonight
we're
about
to
talk
about
I,
have
with
pictures.
I
am
super
responsible
of
pictures.
I
take
I
scattered
them
all
over.
My
SD
cards
backup
this
cloud
backup
somewhere
there
somewhere
in
compressed
form
somewhere
I,
have
only
known
some
selected
pictures.
I
need
a
way
to
find
similar
directories.
The
places
among
my
backups,
which
contains
similar
pictures
and/or
in
the
general
case,
given
a
new
picture,
I
just
took
I,
want
a
list
of
directories
or
files
which
are
somewhat
similar
to
this
picture.
A
So
I
need
to
extract
certain
properties
of
the
picture
and
put
them
in
a
database
of
sorts.
So
we
can
start
with
some
sort
of
a
hashing
function
right
which
takes
the
data
which
takes
the
picture
and
transforms
it
into
this
short
compact
representation,
which
I
can
which
is
more
wieldy.
I
can
work
with.
We
know
hashing,
for
example,
from
the
world
of
cryptography
right
where
we
take
say
contents
of
our
get
commit
and
get
this
very
nice
short
representation,
this
sha-1
representation.
A
This
doesn't
work.
It
doesn't
work
because,
if
I
make
a
small
change
in
my
input
and
the
resulting
hash
is
completely
different
and
it's
very
important
if
I'm
working
with
cryptography
right,
I
want
to
see
if
somebody
tampered
with
date,
I'm
sending
over
or
messed
something
in
my
get
history.
So
want
you
to
look
for
something
else,
and
there
is
a
family
of
algorithms
which
answers
exactly.
This
demand.
They're
called
perceptual
hashes
and
they
correspond
to
how
we
humans,
percept
the
world.
A
A
A
A
Having
this
declaration
in
place,
we
can
now
and
write
a
simple
image
hash
function,
which
exposes
an
API
more
natural
to
rotations.
Given
us
a
reference
to
a
string
slice,
we
will
get
an
optional
hash.
We
will
create
a
C
string
out
two
out
of
that
reference
to
a
string
slice
and
then,
in
the
unsafe
block,
we
will
pass
the
underlying
memory
to
the
C
function.
If
the
function
returns
zero,
we
return
some
hash.
Otherwise
we
have
nothing
to
return.
A
Everybody
with
me
so
far
looks
good
excellent.
This
is
enough
to
wrap
it
with
a
bit
of
rapid
wave
main
function
and
expose
a
simple
tool
which,
given
all
the
arguments
on
the
command
line,
it
interprets
them
as
names
of
files
goes
one
after
another
and
hashes
all
the
files.
In
case
of
any
errors,
we
just
ignore
them
a
report,
an
error
and
move
on.
A
Now
I've
got
hashes
for
all
the
files
I
can
find
on
my
desk,
but
let's
return
to
the
main
question:
how
can
I,
given
all
the
hashes
I,
have?
How
can
I
find
similar
things
which
are
somewhere
on
my
backup
drives
in
my
database
of
hashes?
We
know
good
algorithms
for
exact
search,
right,
hidden
hash
table,
but
we're
not
talking
about
the
exact
search
right.
We
want
a
tolerance,
a
window
of
error.
A
We
could
use
a
bee
tree,
but
bee
tree
relies
on
the
assumption
that
we
have
firstly
most
significant
and
then
least
significant
bits,
most
significant
at
the
top
of
that
really
significant
at
the
bottom.
The
problem
is,
it
makes
no
difference
and
our
world
where
the
bits
differ.
They
are
equally
significant
need
something
else,
and
luckily
there
is
a
data
structure
from
the
future
defined
forty
years
ago,
which
solves
this
exact
problem.
A
Let's
start
with
the
distance
function,
our
distance
function
expressed
in
terms
of
those
extra
two
types
is
just
an
exclusive
or
of
two
hashes,
and
then
we
just
count
all
the
bits
which
are
set
with
the
distance
in
place.
We
for
the
sake
of
simplicity,
differentiate
between
empty
and
non-empty
trees.
If
the
tree
is
non
empty,
it
has
a
node
inside
and
the
node
looks
as
follows:
it
has
a
single
hash
and
a
collection
of
children,
and
now
the
invariant
is
that,
within
a
all
the
hashes
we
have
in
a
single
sub
tree
sub
nodes.
A
A
child
of
this
node
have
exactly
the
same
distance
to
the
current
hash.
It's
fairly
abstract
I've
got
a
picture
for
you.
We've
got
the
following
tree:
we've
got
the
main
node
and
two
sub
nodes,
which
can
be
arbitrarily
deep
and
the
rule
is
all
the
stuff
on
the
left
is
one
distant
from
the
top
hash.
All
the
stuff
on
the
right
is
six
distant
from
the
top
hash.
Our
hashes
are
64
bit
long,
so
we
can
have
at
most
64
children
of
a
single
node.
A
A
Otherwise,
I
call
insert
on
the
node
and
my
non
empty
tree,
and
the
insert
on
a
node
looks
as
follows:
if
the
new
hash
that
has
to
be
inserted
is
equal
to
the
current
hash,
undone,
it's
there.
Otherwise
I
have
to
calculate
the
distance
between
the
hash
to
be
inserted
in
the
current
hash
itself
and
then
I.
Look
at
look
at
the
collection
of
my
children
and
I.
A
Take
the
child
adds
offset
new
distance
if
it's
absent,
I
insert
a
singleton,
node
and
then
I
recur
call
insert
on
the
node
I've
found,
because
this
is
the
place
which,
according
to
the
principle
according
to
which
we
build
the
tree.
This
is
where
the
hash
must
be
long,
and
this
is
enough
to
build
the
tree
preserving
our
constraint.
A
Let's
try
to
find
something
inside
given
a
reference
to
a
BK
tree.
I
want
to
find
all
their
hashes
inside,
which
are
at
most
tolerance,
distant
from
the
needle
and
get
a
vector
of
hashes
for
the
sake
of
simplicity.
If
I'm
empty
I
return,
the
lantee
vector,
if
I'm,
not
empty,
I,
call
find
on
the
node
passing
the
tolerance
list,
I
think
most
code
I
have
to
show
for
tonight
no
worries
it
won't
get
worse
than
that.
A
We've
got
a
reference
to
self
and
a
needle
and
intolerance.
We
start
with
an
accumulator
and
we
calculate
the
distance
between
the
hash
to
be
inserted
as
between
the
current
hash
and
the
needle
I
am
looking
for.
If
the
current
hash
is
within
the
acceptable
tolerance,
then
I
add
the
current
hash
to
the
result
and
then
I
recur
into
all
the
children.
A
It
brings
me
nothing
right,
I'm,
still
looking
at
every
single
node
I
have
in
my
tree
and
in
order
to
optimize
this
loop
in
order
to
narrow
down
the
search
and
prune
nodes
I'm
not
interested
in,
we
have
to
look
at
certain
properties
of
the
distance
function,
specifically
the
fact
that
the
distance
function
is
a
metric
and
metric
and
in
the
mathematical
sense,
this
distance,
we
know
from
the
real
world
and
it
has
for
a
function
to
be
a
metric.
It
has
to
satisfy
four
properties.
First
of
all
and
metric
must
be
non-negative.
A
A
Metric
is
symmetric.
Distance
ay
B
is
same
as
distance
B,
a
and
finally
that
long
inequality
at
the
bottom
is
something
you
know
from
the
real
world
and
it's
called
the
triangle.
Inequality.
The
direct
way
cannot
be
longer
than
a
way
through
a
dieter
to
be,
and
this
is
the
most
relevant
property
for
our
bigotry.
A
If
we
apply
the
triangle
inequality
to
a
needle,
we're
looking
for
in
our
tree,
we
can
take
this.
This
inequality
in
term
in
terms
of
distances
between
those
hashes,
the
hash
in
the
current
node,
the
needle
and
all
the
nodes
in
a
certain
child
node,
because
they're
guaranteed
to
have
the
same
distance.
And
now,
if
we
apply
this
inequality,
I'm
not
going
to
bore
you
with
the
math.
A
But
if
we
apply
this
inequality
to
two
edges
of
this
triangle
and
play
around
with
the
math,
we
end
up
with
a
limit
with
a
constraint
for
the
distance.
The
distance
between
hashes
in
the
child
node
and
the
current
hash
must
belong
to
a
certain
interval
or
expressed
graphically.
We
have
a
needle
distance,
then
distance
between
the
current
hash
and
the
needle
and
give
or
take
tolerance.
A
A
I
think
so,
if
I
give
it
to
learnings
of
ten
I
get
those
two,
but
funnily
enough.
Those
two
which
I
took
three
years
and
three
thousand
kilometers
apart,
are
also
ten
distant
according
to
the
P
hash
metric.
But
the
cool
thing
is:
if
I'm,
not
happy
of
results,
BK
tree
seems
the
same.
I
just
need
a
different
hashing
algorithm
and
there
is
a
variety
of
them
to
choose
from.
A
Okay,
ladies
and
gentleman
three
things:
firstly,
perceptual
hashes
are
really
powerful
technique
to
summarize
media
content.
A
Secondly,
if
we
have
a
big
high
tree,
we
can
very
quickly
search
through
those
metric
spaces
of
our
perceptual
hashes
and,
thirdly,
a
call
to
action.
If
you
think
that
big
a
tree
makes
sense-
and
you
know
how
to
publish
crates,
let
me
know
because
I
have
no
idea
how
to
publish
this
is
Olaf
gods.
Thank
you
very
much
for
your
patience.