Rust Programming Language Rust and Tell Berlin, 15 Mar 2019

Previous Meeting

⏯

youtube image

►

From YouTube: Rust and Tell February 2019: Jan Stępień - That Looks Oddly Familiar

Description

A talk about "perceptual hashing" which allows us to find visual similarities between images.

Rust and Tell is a monthly event (the last Tuesday of every month) to share ideas and learn about new things about Rust.
Check out https://berline.rs/ for more.

A

Good evening, ladies and gentlemen, here tonight we're about to talk about I, have with pictures. I am super responsible of pictures. I take I scattered them all over. My SD cards backup this cloud backup somewhere there somewhere in compressed form somewhere I, have only known some selected pictures. I need a way to find similar directories. The places among my backups, which contains similar pictures and/or in the general case, given a new picture, I just took I, want a list of directories or files which are somewhat similar to this picture.

A

So I need to extract certain properties of the picture and put them in a database of sorts. So we can start with some sort of a hashing function right which takes the data which takes the picture and transforms it into this short compact representation, which I can which is more wieldy. I can work with. We know hashing, for example, from the world of cryptography right where we take say contents of our get commit and get this very nice short representation, this sha-1 representation.

A

This doesn't work. It doesn't work because, if I make a small change in my input and the resulting hash is completely different and it's very important if I'm working with cryptography right, I want to see if somebody tampered with date, I'm sending over or messed something in my get history. So want you to look for something else, and there is a family of algorithms which answers exactly. This demand. They're called perceptual hashes and they correspond to how we humans, percept the world.

A

Small changes and the input result in small changes and the output and by similar hashes I, mean hashes, which differ on the minimal number of positions when represented as bit vectors.

A

One example of this family is called P hash. You can just get it off the internet under some Oprah license an open-source license. I think it's GPL. You can then take image files, for example, JPEGs and P hash will translate them for you and those short 64-bit vectors.

A

P hash exposes a nice and friendly C API right, given a file name and a pointer to a piece of memory where I can store the hash I will get to 0. If hashing was successful, let's warm up and try to wrap this API with a bit of rust.

A

We declare an extern function, PhD CT image hash and we declare its API using data types. We import it from the lip secret. We need to also instruct the linker that, in order to link something against this external definition, it needs to reach to the P hash library.

A

Having this declaration in place, we can now and write a simple image hash function, which exposes an API more natural to rotations. Given us a reference to a string slice, we will get an optional hash. We will create a C string out two out of that reference to a string slice and then, in the unsafe block, we will pass the underlying memory to the C function. If the function returns zero, we return some hash. Otherwise we have nothing to return.

A

Everybody with me so far looks good excellent. This is enough to wrap it with a bit of rapid wave main function and expose a simple tool which, given all the arguments on the command line, it interprets them as names of files goes one after another and hashes all the files. In case of any errors, we just ignore them a report, an error and move on.

A

Now I've got hashes for all the files I can find on my desk, but let's return to the main question: how can I, given all the hashes I, have? How can I find similar things which are somewhere on my backup drives in my database of hashes? We know good algorithms for exact search, right, hidden hash table, but we're not talking about the exact search right. We want a tolerance, a window of error.

A

We could use a bee tree, but bee tree relies on the assumption that we have firstly most significant and then least significant bits, most significant at the top of that really significant at the bottom. The problem is, it makes no difference and our world where the bits differ. They are equally significant need something else, and luckily there is a data structure from the future defined forty years ago, which solves this exact problem.

A

It's called the pique tree and it's a tree which is guided by certain properties whose construction has been guided by certain invariants, and it can help us answer the question of how do I get those hashes quickly.

A

Let's start with the distance function, our distance function expressed in terms of those extra two types is just an exclusive or of two hashes, and then we just count all the bits which are set with the distance in place. We for the sake of simplicity, differentiate between empty and non-empty trees. If the tree is non empty, it has a node inside and the node looks as follows: it has a single hash and a collection of children, and now the invariant is that, within a all the hashes we have in a single sub tree sub nodes.

A

A child of this node have exactly the same distance to the current hash. It's fairly abstract I've got a picture for you. We've got the following tree: we've got the main node and two sub nodes, which can be arbitrarily deep and the rule is all the stuff on the left is one distant from the top hash. All the stuff on the right is six distant from the top hash. Our hashes are 64 bit long, so we can have at most 64 children of a single node.

A

With the description of the data structure in place, we can move on and try to insert something to such a data structure. Let's start with a function to create a single, a singleton. No, it's given a hash. I returned a note of that hash and no children.

A

Now given a BK tree, I want to insert something it's okay, mutable reference to BK tree and everything depends on the shape of the tree if it's empty I, just replace myself with a non empty tree with a single hash.

A

Otherwise, I call insert on the node and my non empty tree, and the insert on a node looks as follows: if the new hash that has to be inserted is equal to the current hash, undone, it's there. Otherwise I have to calculate the distance between the hash to be inserted in the current hash itself and then I. Look at look at the collection of my children and I.

A

Take the child adds offset new distance if it's absent, I insert a singleton, node and then I recur call insert on the node I've found, because this is the place which, according to the principle according to which we build the tree. This is where the hash must be long, and this is enough to build the tree preserving our constraint.

A

Let's try to find something inside given a reference to a BK tree. I want to find all their hashes inside, which are at most tolerance, distant from the needle and get a vector of hashes for the sake of simplicity. If I'm empty I return, the lantee vector, if I'm, not empty, I, call find on the node passing the tolerance list, I think most code I have to show for tonight no worries it won't get worse than that.

A

We've got a reference to self and a needle and intolerance. We start with an accumulator and we calculate the distance between the hash to be inserted as between the current hash and the needle I am looking for. If the current hash is within the acceptable tolerance, then I add the current hash to the result and then I recur into all the children.

A

I have doing exactly the same thing for every single child for every single hash we can find, in a child, add the child to the accumulator, and we can then test this implementation by starting with a mutable BK tree, inserting two hashes into it, and then making sure that only one of those hashes is too distant from the test.

A

Well, there is a problem. There is a problem, because this for loop is an exhaustive search.

A

It brings me nothing right, I'm, still looking at every single node I have in my tree and in order to optimize this loop in order to narrow down the search and prune nodes I'm not interested in, we have to look at certain properties of the distance function, specifically the fact that the distance function is a metric and metric and in the mathematical sense, this distance, we know from the real world and it has for a function to be a metric. It has to satisfy four properties. First of all and metric must be non-negative.

A

Secondly, if a metric is negative, then the distance between two things is I mean. Then things are the same.

A

Metric is symmetric. Distance ay B is same as distance B, a and finally that long inequality at the bottom is something you know from the real world and it's called the triangle. Inequality. The direct way cannot be longer than a way through a dieter to be, and this is the most relevant property for our bigotry.

A

If we apply the triangle inequality to a needle, we're looking for in our tree, we can take this. This inequality in term in terms of distances between those hashes, the hash in the current node, the needle and all the nodes in a certain child node, because they're guaranteed to have the same distance. And now, if we apply this inequality, I'm not going to bore you with the math.

A

But if we apply this inequality to two edges of this triangle and play around with the math, we end up with a limit with a constraint for the distance. The distance between hashes in the child node and the current hash must belong to a certain interval or expressed graphically. We have a needle distance, then distance between the current hash and the needle and give or take tolerance.

A

All of those all of the child nodes must fall into this.

A

Interval- and we can now apply this property to the for loop. We have here everything above the for loop will remain the same I'm going to modify the for loop itself. The nested for loop will stay the same, but I will wrap it with an. If statement. Take a look.

A

We call the inner for loop only if the distance to the to all the children in the sub node satisfies that criterion and, as we saw two times before, the narrower our margin of error.

A

Our tolerance argument, the more nodes we will prune during the search allowing us to save time, and that's it I'm more than happy to grab that grab a pen and the whiteboard in that room and go through the math with you after the talk but for now I'm the happiest person, because this works- and it allows me to quickly find pictures and in the database I've built using this tool, for example, those two pictures which clearly indicate that my hands just shake when I take pictures are ten bits apart.

A

I think so, if I give it to learnings of ten I get those two, but funnily enough. Those two which I took three years and three thousand kilometers apart, are also ten distant according to the P hash metric. But the cool thing is: if I'm, not happy of results, BK tree seems the same. I just need a different hashing algorithm and there is a variety of them to choose from.

A

Okay, um ladies and gentleman three things: firstly, perceptual hashes are really powerful technique to summarize media content.

A

Secondly, if we have a big high tree, we can very quickly search through those metric spaces of our perceptual hashes and, thirdly, a call to action. If you think that big a tree makes sense- and you know how to publish crates, let me know because I have no idea how to publish this is Olaf gods. Thank you very much for your patience.

A