Description
Here is a link to the slides: https://docs.google.com/presentation/d/1kOe7-t_Kn20CEAcgZM5FbddZYsag-hnEZoTrRCKEazw/edit#slide=id.p
All right, hello everyone, I'm Geordon Worley. I started the Rust CV computer vision organization on GitHub and the Discord server. My primary goal with that was to take the many different APIs that exist in OpenCV, OpenMVG, and so on, and the many different visual SLAM frameworks out there that mostly exist for research purposes, and create cohesive APIs that would let people swap their algorithms in and out and build computer vision pipelines really simply, so that people can get into computer vision quickly.
This talk is going to be just a brief introduction to photogrammetry; we're not going to go into too many of the really deep details. My hope is that some of the people here who may not be as familiar with photogrammetry specifically, or people watching the recording that I might point to it later, can get an idea of what this is. Maybe it can help inspire some people to be interested in this topic.
A
So
let's
go
ahead
and
dive
in.
So
first
thing
is
what
is
photogrammetry
so
photogrammetry?
Just
means
taking
measurements
from
images,
but
today,
typically
what
that
means
is
you'll,
hear
it
used
in
the
movie
industry
or
for
the
game
industry
to
mean
we're
taking
some
images,
creating
a
3d
model
of
a
person,
a
rock
or
something,
and
putting
that
asset
into
into
some
environment,
that
that
process
typically
called
structure
from
motion,
and
sometimes
we're
also
using
this
for
robot
navigation.
So you might have a video feed coming from a robot that is building a three-dimensional representation of the environment, and that is what allows it to navigate along. This general concept of taking measurements from images has changed a lot over the years. You can still just take measurements by creating these reconstructions and then measuring them, but of course the field is now very all-encompassing.
Today, what we're going to talk about is what's spelled out here: what a visual SLAM pipeline or structure-from-motion pipeline might look like. I'm going to talk specifically about some things that we do use in Rust CV and some things that we don't. AKAZE is one of those things, and rather than using bundle adjustment we may use constraints, but these principles are generally useful for understanding.
We get something like the image in the top right, where you have some features in one image and features in another image, and we've matched them and figured out which ones are correct; from that, we can now roughly determine how the camera has moved between frames.
That then allows us to figure out our 3D points, and it all feeds back in a loop as we optimize the positions of the cameras and add more data. There's a fantastic amount of data in this process.
The first thing I'll talk about is scale in computer vision. It appears that we're looking from the corner of a room, high up, perhaps off of a balcony, looking down; you see the chairs and the fireplace. But in actuality it was really just a dollhouse. So what happened there? What really happened was that everything was much smaller than we thought.
Not only is the furniture significantly smaller, but the distance that we are from the objects is also significantly smaller. Yet the relative scale we got from looking at the scene is approximately right: we knew what direction we were looking from. We knew we were up near the balcony; we really are still taking a photo from the same angle, and that was something we could recognize, but we couldn't tell how far away we were.
That's the important factor: from one image you actually cannot tell how far away something is, and in fact, even if you were to take multiple images, you would not be able to tell. So why is that, and what is some of the math behind it?
On the left here we have an example of what is typically called the pinhole camera model. The way a camera works is that light comes in from some point, traveling in roughly a straight line. It comes in from, let's say, the top of this tree, and if we did not have a pinhole or a focal point on the camera, the result would just be a huge blur: if you look at the light reflected on some white surface, you can't really see an image on it. But if we are able to get the light to focus through this point, then we get a concrete image showing up. As you can see, it's upside down: when the light passes through the focal point of your camera, or the pinhole in an actual pinhole camera, the image flips upside down, and that's corrected for digitally. This is basically what's happening.
Light is coming in on a straight path directly through this one single point: the pinhole, the focal point, or what is typically referred to in this field as the optical center of the camera. That point is very important, because knowing that the light passes through that specific point allows us to do a little bit of math.
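That little bit of math can be sketched roughly as follows: a 3D point in front of the camera is divided by its depth and scaled by the focal length to land on the image plane. This is a minimal pinhole-model sketch with assumed intrinsics and hypothetical type names, not Rust CV's actual API:

```rust
/// Minimal pinhole camera sketch (assumed intrinsics, hypothetical names).
struct Pinhole {
    fx: f64, // focal length in pixels along x
    fy: f64, // focal length in pixels along y
    cx: f64, // principal point x (where the optical axis meets the image)
    cy: f64, // principal point y
}

impl Pinhole {
    /// Project a 3D point given in camera coordinates to pixel coordinates.
    /// Every point along the same ray through the optical center projects to
    /// the same pixel, which is exactly why depth is lost in a single image.
    fn project(&self, [x, y, z]: [f64; 3]) -> Option<(f64, f64)> {
        if z <= 0.0 {
            return None; // behind the camera
        }
        Some((self.fx * x / z + self.cx, self.fy * y / z + self.cy))
    }
}

fn main() {
    let cam = Pinhole { fx: 800.0, fy: 800.0, cx: 320.0, cy: 240.0 };
    // Two points on the same ray (one twice as far) land on the same pixel.
    println!("{:?}", cam.project([0.5, 0.25, 2.0]));
    println!("{:?}", cam.project([1.0, 0.5, 4.0]));
}
```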
Now, on the right, you can see a few things. This is a diagram of some point X that is viewed from two different images: one image is taken on the left and one is taken on the right. If we were to observe point X from the perspective of the left view, we actually do not know whether the point is at X1, X2, X3, or X.
We only know what direction it came from, which is really important. Now, if we look at the right view: by having that right view, if we can observe that point and know where it's located, then, assuming there is no error, we can know exactly where that point is, because we have two intersecting rays that tell us where the point lies in space. This general concept is typically referred to as epipolar geometry.
The point eL on the left view is the epipole; that's where the focal point of the camera on the right would show up in the image of the left view, because, again, everything is just a direction in this case. The plane there is referred to as the epipolar plane, and the epipolar line, which is very important, is the red line on the right view.
If we observe point X in the left view, it could exist anywhere along a line, and if you project that line onto the right view, the red line shows all the possible positions in the right view where we could observe that point, based only on the information we got from the left view. But since we also know where it is in the right view, that's how we're able to constrain where on that line the point is located.
Now, there are also some other strange things that we have to deal with. For instance, sometimes points can exist at infinity, and in that case we can't necessarily find exactly where those rays intersect.
In our pipelines we actually handle this using homogeneous coordinates, which are very convenient for doing transformations on cameras and things of that nature, but which also allow us to represent points that exist very far away. In that case, the only information we really preserve is the direction that the feature came from.
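Roughly, that representation looks something like the sketch below, where the w component distinguishes finite points from pure directions at infinity. The type and method names are hypothetical, not Rust CV's actual types:

```rust
/// A point in homogeneous coordinates (x, y, z, w).
/// When w != 0 it represents the Euclidean point (x/w, y/w, z/w);
/// when w == 0 it is a point at infinity, i.e. only a direction.
#[derive(Debug, Clone, Copy)]
struct HomogeneousPoint {
    x: f64,
    y: f64,
    z: f64,
    w: f64,
}

impl HomogeneousPoint {
    /// Recover a Euclidean point if the point is finite.
    fn to_euclidean(self) -> Option<[f64; 3]> {
        if self.w.abs() < 1e-12 {
            None // at infinity: like a star, only its direction is meaningful
        } else {
            Some([self.x / self.w, self.y / self.w, self.z / self.w])
        }
    }

    /// The bearing (direction) of the point, which survives even at infinity.
    fn direction(self) -> [f64; 3] {
        let norm = (self.x * self.x + self.y * self.y + self.z * self.z).sqrt();
        [self.x / norm, self.y / norm, self.z / norm]
    }
}
```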
In the case of, say, the stars, those are so far away that no matter how much we move our camera along the Earth, we will never observe a parallax effect; we'll never see that star on the horizon move, because it's so incredibly far away.
So the only information we really get from this process is what direction it is in, and this can still be useful in various fields. It's not as useful for tracking the position of the camera, but it can be helpful for tracking the rotation of the camera, because you would know that, regardless of where you move around, you would expect this feature to stay in the same direction. So that's something that can be used.
Now, I'm going to talk about the actual process of how we get these features out of the images, and this is a very interesting one. I'm going to talk about AKAZE, but on the left here is an example of what is generally done to extract features from images at different scales. On the left is what's referred to as an image pyramid: you take an image, level zero here being the original image, and by blurring and subsampling that image repeatedly, we get a view of the image at different scales. We're looking at it as if we were, if you will, further away, but there's more detail when you're closer up than further away. Now, what AKAZE does is similar but slightly different: rather than blurring the whole image, it blurs the image selectively. This process basically generates a map of the scale space, asking whether there are very fine-grained details in certain parts of the image. On the right you can see what that looks like at two different scales.
On the left is a finer detail level, level zero, and on the right is a higher (coarser) level, where the features in the floor have been blurred out to the point where you can only see the floor tiles; you can no longer see the actual grain of the wood, and things like that. This is very useful for us to be able to detect features in images at various scales.
The next thing that we need to do is actually find the features and extract information about them. What is done in this case is that we're looking for areas of interest in the image, which AKAZE defines as curvature peaks in luminosity, or rather extrema of luminosity. To do that, we need to look at the changes in the image across the vertical and the horizontal.
On the left here is an example of a Scharr filter, which is very quick to run and which approximates the gradient computation along the horizontal and the vertical. So we can compute how the luminosity is changing along the vertical and the horizontal in the image, and by applying that in different directions, we can compute the extrema in the luminosity, those peaks and troughs, and we do this at different scales.
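For a sense of what that gradient filter does, here is a minimal sketch that applies the 3x3 Scharr kernels at a single interior pixel of a grayscale image. Real code would iterate over the whole image and handle the borders, and this is an illustrative stand-in rather than Rust CV's implementation:

```rust
/// Estimate the horizontal and vertical luminosity gradients at pixel (x, y)
/// using the 3x3 Scharr kernels. `img` is a grayscale image of width `w`,
/// and (x, y) must not lie on the image border.
fn scharr_at(img: &[f32], w: usize, x: usize, y: usize) -> (f32, f32) {
    let px = |dx: isize, dy: isize| {
        img[(y as isize + dy) as usize * w + (x as isize + dx) as usize]
    };
    // Horizontal gradient, kernel [-3 0 3; -10 0 10; -3 0 3]
    let gx = -3.0 * px(-1, -1) + 3.0 * px(1, -1)
        - 10.0 * px(-1, 0) + 10.0 * px(1, 0)
        - 3.0 * px(-1, 1) + 3.0 * px(1, 1);
    // Vertical gradient, kernel [-3 -10 -3; 0 0 0; 3 10 3]
    let gy = -3.0 * px(-1, -1) - 10.0 * px(0, -1) - 3.0 * px(1, -1)
        + 3.0 * px(-1, 1) + 10.0 * px(0, 1) + 3.0 * px(1, 1);
    (gx, gy)
}
```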
So where is the luminosity peaking at different scales? We actually get different kinds of information. On the right here is an example of what those peaks look like at different scales. The top is at a fine level of detail, so you can see lots of little features there: peaks around the plants, and you can even see some on the floorboards, the shoes, and other things in the environment. Then below that you get these really big blobs.
There's a big feature for this whole shoe, or for some very large object in the environment; it goes up to a larger scale, and you see significantly fewer points at those scales, because there's less information at them. This is important, though, because if we are close to an object we will see more detail, but we also want to extract those high-level details so that when we're further from the object, we can recognize it again.
What we end up with is on the right: the keypoints that were detected. This is a visualization of the keypoints and, roughly, their size, so it shows you at what scale we detected each keypoint. As you can see, on the curtains, for instance, it finds lots of really tiny features, while on the shoes and the objects on the table the features are much larger.
This is the kind of information we want to extract. Now, the last stage is: how do we actually get a description of a feature that the algorithm can use to tell two different features apart? On the bottom we have the information sources available; in this case we have the actual image itself, the luminosity, since we've removed the color.
Then we take that bitwise information and we need to compare it. So how do we compare it? We effectively use Hamming distance: what we want to know is how many of those comparisons differ between two features.
This is very useful in cases where the ISO settings on the camera might change, or where you go outside and the lighting is slightly different; it's not perfect, but it helps a little bit.
So when we take these comparisons, we're just looking at how many of the bits differ from each other. It turns out that this is incredibly easy and incredibly quick for a modern computer to do; effectively, with some caveats, it's basically an XOR and a popcount.
What that means is that we use the XOR operation of the computer, a very simple instruction, which gives us the bits that differ as another set of bits; then we count the number of ones in that result, and that tells us how many of the bits were different. This is incredibly fast to execute, which is why binary features are preferred for real-time operation. There are many other ways to match features and deal with features in other situations, but this is what you want when you're trying to go quickly.
This is, I believe, BRISK features being compared and matched between two images, and there is not very much verification being done on it: nothing about the geometry is used to filter outliers and things like that. We just have a huge bag of matches, and as you can see, some of them are definitely wrong; they just go to random places. But a lot of them are right.
You can see a bunch of straight lines moving from one image to the other, so there's some sense that a lot of these matches are correct. On the bottom we have an example where AKAZE is used, which starts out with fewer outliers to begin with, but then we also apply what's called geometric filtering to it: we take into account where the features could be, based on those epipolar lines we saw before; they can only lie in a certain area.
So does this match actually work geometrically? Is it even possible for this feature to match? By applying that filtering, we can get rid of almost all of the outliers.
There will be a few left, for sure, depending on your settings, of course. The way we actually perform that filtering is using a sample consensus process. The most common one is called RANSAC; there are many others, and there are alternatives to this. But what we're trying to do is find a hypothesis, sometimes referred to as a model, that best fits our data points.
In this case our model is how those two cameras relate, or where this camera is located with respect to a whole bunch of pre-existing 3D points. In the case of line fitting, the way this works is: from two points you can compute a line, effectively a y-intercept and a slope (there are different ways to represent a line). You find some line, and you define some threshold for what is an outlier and what is an inlier. On the bottom left, you can see a line generated from two points, with a bunch of yellow points which are considered inliers and a bunch of blue points which are considered outliers. That line is clearly wrong, but as you continuously run this process, you take points and generate models from them; eventually you'll generate a model from, say, two points that are strongly on the line, and that model will fit the points very well.
Then a lot of points will be considered inliers, and only the two points that lie outside that general line will be considered outliers in that case. On the right, with the blue line and the red line, is effectively what you get: there are lots of points which do not fall on the line, and those are all considered outliers.
That does occasionally happen, but as you get more information from more cameras and observe a feature from multiple images, you can be more sure about how correct that point is. This is how we initially take those matches and filter them out, and it gets us a good set of quality data, assuming that a good deal of our input data is correct. If the data has a lot of outliers, we may not be able to find a good model for it.
Now, the next thing that we want to do is create a 3D model. We have some existing model that we're assuming here, and we're adding a new frame to that model. The main metric for how good a given match in an image is, the thing that is actually used as the distance-to-the-line measure in the sample consensus process when we perform it with cameras, is what's called reprojection error.
This is not the only way to do it; you can also use sine or cosine distance, among other things. But what you can do is take the point, figure out where it should show up in the image, and then figure out where it's actually showing up on the camera: where is the feature detected in the image, and where should it show up in the image?
That difference is our error right there, the distance between those two points, and it can also be separated into x and y components.
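A minimal sketch of that measurement, reusing the assumed pinhole intrinsics from earlier (again hypothetical names, not Rust CV's API):

```rust
/// Reprojection error: project the estimated 3D point back into the image
/// with the current camera estimate and measure how far it lands from the
/// pixel where the feature was actually detected.
fn reprojection_error(
    point_cam: [f64; 3],                    // 3D point in camera coordinates
    observed: (f64, f64),                   // detected feature location, in pixels
    (fx, fy, cx, cy): (f64, f64, f64, f64), // assumed pinhole intrinsics
) -> f64 {
    let [x, y, z] = point_cam;
    let predicted = (fx * x / z + cx, fy * y / z + cy);
    let (dx, dy) = (predicted.0 - observed.0, predicted.1 - observed.1);
    (dx * dx + dy * dy).sqrt() // can also be kept as separate x and y residuals
}
```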
What we actually want to do is take all of these cameras and all of these reprojection errors and minimize them. So what you need to do is look at how you could tweak every single point, every single camera position, and every single camera rotation such that it reduces these reprojection errors, and this actually creates a very large Jacobian matrix.
That is sometimes solved as a least-squares problem; one example is using Levenberg-Marquardt, and there are some other things you can do as well. On the right is actually an example of the approximate Hessian, which is generated by multiplying the Jacobian by its transpose.
That's a lot of math mumbo jumbo, but basically the gist of this is that we want to move those little cameras around, and the points around in the reconstruction, such that everything lines up, so that the points are in the place predicted by the observations in the images. You move the cameras around until they're all in the best spot they can possibly be in to reduce that error, and then we triangulate those points. On the left is an example of a 3D reconstruction produced using Rust CV: what you can see is actually an apartment complex building, some trees and grass just below it, and then a parking lot in front of it, and you can maybe make out a few cars. There might be some compression going through Zoom, but that's what it is, and it's also from video I took myself. So basically it's a really cool process: you can actually recreate 3D information even though it's been collapsed down into 2D going into your camera. Really cool stuff. And there are my credits and my citations. All right, I think that's it; we can go into questions.