Computer Vision (CMU)
Computer Vision: Algorithms and Applications
Multiple View Geometry (best reference for geometry and vision)
Image processing
Basics of filtering
 Convolution ($f*g$)

1D/2D, continuous/discrete
$(f*g)(x)=\int\limits f(y)g(x-y)dy$
f = filter, g = signal

separable ($M\times M$ pixels with $N\times N$ filter)
nonseparable: $M^2\times N^2$ operations
separable: $2\times M^2\times N$ operations (convolve rows with a 1D filter, then columns)
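The cost claim can be checked numerically: convolving with the separable 2D kernel gives the same result as two 1D passes. A minimal numpy sketch (the 3-tap kernel and the naive `conv2d` helper are illustrative choices, not from the notes):

```python
import numpy as np

def conv2d(img, kernel):
    """Naive 'valid' sliding-window filtering (the kernels here are symmetric,
    so correlation and convolution coincide)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

g1 = np.array([1.0, 2.0, 1.0]) / 4.0   # 1D Gaussian-like filter
g2 = np.outer(g1, g1)                  # separable 3x3 kernel

img = np.random.rand(16, 16)

full = conv2d(img, g2)                                # ~M^2 * N^2 multiplies
sep = conv2d(conv2d(img, g1[None, :]), g1[:, None])   # ~2 * M^2 * N multiplies
assert np.allclose(full, sep)
```

Any kernel that is an outer product of two 1D filters (e.g. the Gaussian) separates this way.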


Gradients
Blur the image first then calculate the gradient
Because $\frac{\partial}{\partial x}(h*f) = (\frac{\partial}{\partial x} h) * f$, do the convolution with the derivative of filter (Gaussian)
 2D Gaussian filters

Gaussian
$h_\sigma(u,v)=\frac{1}{2\pi\sigma^2}\exp(-\frac{u^2+v^2}{2\sigma^2})$

Derivative of Gaussian
$\frac{\partial}{\partial x}h_\sigma(u,v)$, x can be u or v

Laplacian of Gaussian (LoG) filter
$\nabla^2h_\sigma(u,v)$

Gabor filters
$\exp(-\frac{u^2+v^2}{2\sigma^2})\cos(2\pi(k_uu+k_vv))$
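A quick numpy sketch of sampling the Gaussian $h_\sigma$ above on a discrete grid (renormalizing the sampled kernel so its weights sum to 1 is a common convention, not part of the continuous formula):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Sample h_sigma(u, v) = exp(-(u^2+v^2)/(2 sigma^2)) / (2 pi sigma^2) on a grid."""
    r = size // 2
    u, v = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    h = np.exp(-(u**2 + v**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return h / h.sum()   # renormalize the discrete kernel

h = gaussian_kernel(7, 1.5)

# The sampled Gaussian is separable: it equals the outer product of its 1D marginals.
row = h.sum(axis=0)
assert np.allclose(h, np.outer(row, row))
```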

Image pyramids
 Gaussian pyramid
 Algorithm
Repeat until min resolution reached
 Use Gaussian filter to smooth the image
 Subsample (e.g. remove even rows and cols)
 Can’t recover the image from the upper level (blurring is lossy)
 Laplacian pyramid
 Downsample Algorithm
Repeat until min resolution reached
 Use Gaussian filter to smooth the image
 Calculate the residual (retain the residuals instead of blurred images)
 Subsample
 Upsample Algorithm
Repeat until original resolution reached
 Upsample
 Sum with residual
 Why called Laplacian: $x - \text{Gaussian}(x) \approx \text{Laplacian}(x)$ (the residual approximates a LoG response)
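The down/upsample algorithms above can be sketched in numpy. One assumption here: the residual is taken against the upsampled coarser level (rather than the blurred image at the same resolution), which makes the reconstruction exact; the 3x3 box blur and nearest-neighbor upsampling are simplifications of the Gaussian filtering in the notes.

```python
import numpy as np

def downsample(img):
    """Blur (3x3 box, an assumption; the notes use a Gaussian) then drop odd rows/cols."""
    pad = np.pad(img, 1, mode="edge")
    blurred = sum(pad[i:i + img.shape[0], j:j + img.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    return blurred[::2, ::2]

def upsample(img, shape):
    """Nearest-neighbor upsampling back to `shape` (an assumption; interpolation also works)."""
    out = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return out[:shape[0], :shape[1]]

def build_laplacian_pyramid(img, levels):
    pyr = []
    for _ in range(levels):
        small = downsample(img)
        pyr.append(img - upsample(small, img.shape))  # retain the residual
        img = small
    pyr.append(img)  # coarsest level is kept as-is
    return pyr

def reconstruct(pyr):
    img = pyr[-1]
    for residual in reversed(pyr[:-1]):
        img = upsample(img, residual.shape) + residual
    return img

img = np.random.rand(32, 32)
pyr = build_laplacian_pyramid(img, 3)
assert np.allclose(reconstruct(pyr), img)  # residuals make the pyramid invertible
```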
 Other pyramids
 Steerable pyramid (each level keeps multiple directions)
 Wavelets

Fourier transform ($\mathcal{F}(g)$)
Discrete Fourier transform (DFT) is a matrix multiplication $F=Wf$; in practice it is implemented using the fast Fourier transform (FFT)
Convolution theorem
The Fourier transform of the convolution of two functions is the product of their Fourier transforms
$\mathcal{F}(g*h)=\mathcal{F}(g)\mathcal{F}(h)$
$\mathcal{F}^{-1}(gh)=\mathcal{F}^{-1}(g)*\mathcal{F}^{-1}(h)$
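The theorem is easy to verify numerically for the discrete (circular) case with numpy's FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(64)
h = rng.standard_normal(64)

# Direct circular convolution: (g * h)[k] = sum_n g[n] h[(k - n) mod N].
direct = np.array([np.sum(g * np.roll(h[::-1], k + 1)) for k in range(64)])

# Via the convolution theorem: F(g * h) = F(g) F(h).
via_fft = np.fft.ifft(np.fft.fft(g) * np.fft.fft(h)).real

assert np.allclose(direct, via_fft)
```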
Hough transforms
 Line parameterizations
 Slope intercept form $y=mx+b$
 Double intercept form $\frac{x}{a}+\frac{y}{b}=1$
 Normal form $x\cos\theta+y\sin\theta=\rho$
 Hough transform
 Image space: x, y are variables; theta, rho are parameters
 Parameter space: theta, rho are variables; x, y are parameters. One point in image space becomes a curve in parameter space (all lines that pass through the point). A local maximum in parameter space corresponds to a detected line in image space.
 Hough circles […]
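A minimal accumulator-voting sketch for the normal form above (grid sizes and the test line are arbitrary choices):

```python
import numpy as np

def hough_lines(points, n_theta=180, rho_max=100.0, n_rho=200):
    """Vote in (theta, rho) space using the normal form x cos(theta) + y sin(theta) = rho."""
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)   # one curve per point
        bins = np.round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        ok = (bins >= 0) & (bins < n_rho)
        acc[np.arange(n_theta)[ok], bins[ok]] += 1
    return acc, thetas

# Points on the horizontal line y = 10, i.e. theta = 90 degrees, rho = 10.
pts = [(float(x), 10.0) for x in range(-20, 21)]
acc, thetas = hough_lines(pts)
t, r = np.unravel_index(np.argmax(acc), acc.shape)
theta_hat = thetas[t]
rho_hat = r / 199 * 200.0 - 100.0   # bin center back to rho
```

All 41 points vote in the same accumulator cell, so the global maximum recovers (theta, rho).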
Feature detection and correspondences
Corner detection

Error function
$E(u,v)=\sum\limits_{x,y}w(x,y)(I(x+u,y+v)-I(x,y))^2$
E: error function, w: window function (e.g. Gaussian, or 1 in window 0 outside), I: intensity
Using Taylor expansion, approximate E as
$E(u,v)\approx\begin{bmatrix}u&v\end{bmatrix}M\begin{bmatrix}u\\v\end{bmatrix}$ with $M=\sum\limits_{x,y}w(x,y)\begin{bmatrix}I_x^2&I_xI_y\\I_xI_y&I_y^2\end{bmatrix}$
Calculate eigen values of $M$:
 Corners
 $\lambda_1\approx\lambda_2$ and both large: corner
 $\lambda_1\approx\lambda_2$ and both small: flat
 $\lambda_1\gg\lambda_2$: vertical edge
 $\lambda_2\gg\lambda_1$: horizontal edge
 Detectors
 Harris & Stephens: $\det M - \kappa\,\text{Trace}^2 M$
 invariant to intensity shift ($I+b$) or scale ($aI$)
 not invariant to scale of image
 Kanade & Tomasi: $\min(\lambda_1,\lambda_2)$
 Nobel: $\frac{\det M}{\text{Trace} M + \varepsilon}$
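The eigenvalue cases can be sanity-checked with the Harris & Stephens response on synthetic patches (a minimal numpy sketch; the whole patch is used as the window, i.e. $w\equiv 1$, an assumption):

```python
import numpy as np

def structure_tensor(I):
    """Sum of gradient outer products over the whole patch (window w = 1 everywhere)."""
    Iy, Ix = np.gradient(I)   # np.gradient returns [d/d(rows), d/d(cols)]
    return np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                     [np.sum(Ix * Iy), np.sum(Iy * Iy)]])

def harris(M, kappa=0.05):
    return np.linalg.det(M) - kappa * np.trace(M) ** 2

flat = np.ones((8, 8))                        # constant patch: both eigenvalues ~ 0
edge = np.tile(np.linspace(0, 1, 8), (8, 1))  # ramp: one dominant gradient direction
x, y = np.meshgrid(np.arange(8), np.arange(8))
corner = ((x >= 4) & (y >= 4)).astype(float)  # two edge directions meet

assert harris(structure_tensor(flat)) == 0.0   # flat region
assert harris(structure_tensor(edge)) < 0      # lambda1 >> lambda2: negative response
assert harris(structure_tensor(corner)) > 0    # both eigenvalues large: corner
```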

Multiscale blob detection
 For each level of Gaussian pyramid compute feature response
 For each level of the Gaussian pyramid, if the response is a local maximum across scales, save the scale and location of the feature (x, y, s)
Feature descriptors
 Basics
 Image patch (vector of intensity values, e.g. for a 3 * 3 neighborhood)
 Image gradient (difference of intensity values)
 Color histogram (count colors), invariant to scale and rotation
 Spatial histograms (compute histograms over spatial cells)
 Orientation normalization (use the dominant image gradient direction to normalize the patch)
 Multi-Scale Oriented Patches (MOPS)
 Given a feature (x, y, s, theta)
 Get a 40 * 40 image patch along the direction theta (possibly interpolating pixels), subsample to 8 * 8 (low-frequency filtering to absorb localization error)
 Subtract the mean and divide by standard deviation
 Haar wavelet transform
 Given a feature (x, y, s, theta)
 Histograms of Oriented Gradients (HOG)
https://www.learnopencv.com/histogram-of-oriented-gradients/ http://www.geocities.ws/talh_davidc/#cst_extract
 Given an image, calculate the gradient (magnitude & direction) using filters (e.g. Sobel)
 Divide images into cells of size 8 * 8
 Create a histogram of gradient directions, containing 9 bins 0, 20, 40, …, 160 (if the angle is 165, with magnitude M, put M/4 to bin 0 and 3M/4 to bin 160)
 Normalize the histograms: consider blocks of size 16 * 16 with stride 8; for each block, concatenate the histogram counts into a vector and store the normalized vector. Concatenate the normalized vectors of all blocks into one giant vector as the HOG feature vector.
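The soft binning in the histogram step can be sketched as follows (a minimal illustration; the linear split between the two nearest bins reproduces the 165-degree example from the notes):

```python
import numpy as np

def hog_cell_histogram(mag, ang):
    """9-bin histogram over 0..160 degrees with linear interpolation between bins."""
    hist = np.zeros(9)
    for m, a in zip(mag.ravel(), ang.ravel()):
        a = a % 180.0                 # unsigned gradients
        pos = a / 20.0                # fractional bin position
        lo = int(np.floor(pos)) % 9
        hi = (lo + 1) % 9             # the 160-degree bin wraps around to bin 0
        frac = pos - np.floor(pos)
        hist[lo] += m * (1 - frac)    # closer bin gets the larger share
        hist[hi] += m * frac
    return hist

# Angle 165 with magnitude M = 4: 3M/4 goes to the 160-deg bin, M/4 to the 0-deg bin.
h = hog_cell_histogram(np.array([4.0]), np.array([165.0]))
assert h[8] == 3.0 and h[0] == 1.0
```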
 Scale Invariant Feature Transform (SIFT)
http://aishack.in/tutorials/sift-scale-invariant-feature-transform-introduction/
 scale space (SIFT suggests 4 octaves and 5 blur levels, so 20 images in total)
 take the original image, blur it progressively (this series of images called an octave)
 resize the image to half, blur it progressively
 LoG approximations for each octave, calculate the difference between images (difference of Gaussian approximate laplacian)
 Find key points
 locate maxima/minima in DoG. A pixel is a key point if it is the largest/smallest of all 26 neighbors (8 in the same blur scale, 9 in the previous blur scale, and 9 in the next blur scale)

Find subpixel maxima/minima: use interpolation (Taylor expansion) on the marked key points to find the extrema between pixels
5 blur levels give 4 DoG, which gives two extrema images

Remove low contrast key points
 Remove low contrast features (if the intensity is below a threshold, reject the point)
 Remove edges (calculate x, y gradients at the key point (eigenvalues of M), reject the point if the ratio of two eigenvalues is large)

Key point orientations
Until now, for each key point, we have the location (x, y) and scale s (the Gaussian variance is sigma), need to determine the direction theta
 The magnitude and orientation is calculated for all pixels in a window (size is 1.5 sigma) around the key point
 The orientations are broken into 36 bins (10 degrees each), the histogram is blurred by 1.5 sigma.
 The key point is assigned to the histogram peak
 The key point is also assigned to any peak above 80% of the highest (creating a new key point with same location and scale but different orientation)

Generate a feature
Now, for each key point we have location (x, y), scale s, and orientation theta. For each key point,
 Consider a 16 * 16 window around it and break it into sixteen 4 * 4 small windows
 Within each small window, gradient magnitudes and orientations are calculated (the key point’s orientation theta is subtracted, so orientations are relative) and put into an 8-bin histogram (45 degrees each); the amount added also depends on the distance from the key point. So each small window has 8 features (one per bin)
 16 small windows give in total 128 features (can cap the features at 0.2, any number >= 0.2 is replaced by 0.2)
Transformations and geometry
Transformation $x’=f(x;p)$
Homographies

Projective space A projective space is roughly the set of the lines passing through the origin.

Homography A homography is an isomorphism of projective spaces.

Transformations
Affine transformation (combinations of linear transformations and translations)
Projective transformation
(8 degrees of freedom), (x’, y’, w’) convert back to (x’/w’, y’/w’)

Determine unknown transformations: linear system, minimize squared error, least squares $E=\|Ax-b\|_2^2$, $x=(A^TA)^{-1}A^Tb$
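The closed form can be checked against numpy's least-squares solver on a toy overdetermined system:

```python
import numpy as np

# Overdetermined linear system Ax = b; least-squares solution x = (A^T A)^{-1} A^T b.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # library solver, for comparison
assert np.allclose(x_normal, x_lstsq)
```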

Summary
Image alignment
 panoramas from image stitching
when can we use homographies
 the scene is planar
 the scene is very far or has small depth variation, so it is approximately planar
 the scene is captured under camera rotation only, no translation or pose change
 Direct linear transform (DLT)
Given $\{(p_i’,p_i)\}$, find the best $H$ such that
$P’=HP$

Linear system (can assume $h_9=1$, so 8 DoF)

Expand

Replace $\alpha$ and rearrange into $A_ih=0$ for the $i$th point

Stack the systems together and solve the least squares problem to get $H$
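A minimal numpy sketch of the DLT steps above. One deviation, stated up front: instead of fixing $h_9=1$, the full 9-vector $h$ is taken as the right singular vector of $A$ with the smallest singular value (an equivalent, more stable convention); the test homography is made up.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H from >= 4 point pairs by stacking A_i h = 0 and taking the
    smallest right singular vector of A."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        # From x' (h7 x + h8 y + h9) = h1 x + h2 y + h3, and similarly for y'.
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]   # fix the scale ambiguity

# Made-up ground-truth homography (a similarity, so the example is easy to check).
H_true = np.array([[2.0, 0.0, 1.0],
                   [0.0, 2.0, -3.0],
                   [0.0, 0.0, 1.0]])
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
dst = []
for x, y in src:
    p = H_true @ np.array([x, y, 1.0])
    dst.append((p[0] / p[2], p[1] / p[2]))

H = dlt_homography(src, dst)
assert np.allclose(H, H_true, atol=1e-6)
```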

 Image alignment pipeline
 Feature point detection (e.g. detect corners and extract features)
 Feature matching (match points from two images)

Exhaustive search
for each feature in one image, look at all the other features in the other image(s)

Hashing
compute a short descriptor from each feature vector, or hash longer descriptors (randomly)
 Nearest neighbor techniques

Feature-space outlier rejection
sum of squared differences (SSD)
 1NN: SSD of the closest match
 2NN: SSD of the second-closest match
 calculate 1NN / 2NN; it’s a match if the ratio is smaller than a threshold (e.g. threshold = 5)


Homography estimation
Random Sample Consensus (RANSAC)

Algorithm
Repeat until the best model is found with high confidence
 Sample (randomly) the number of points required to fit the model
 Solve for model parameters using samples

Score by the fraction of inliers within a preset threshold of the model
e.g. if the model is a line, inliers are the points which distance to the line is smaller than a threshold

For image alignment (RANSAC)
 Repeat
 Get four pair of points randomly
 Compute H using DLT
 Count inliers (pairs of points that have similar projections)
 Stop when H has a large number of inliers
 Recompute H using all inliers
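The RANSAC loop above, sketched for the simpler problem of fitting a line $y=mx+c$ (2-point minimal samples instead of 4 correspondences; thresholds and data are arbitrary choices):

```python
import numpy as np

def ransac_line(points, n_iters=200, threshold=0.1, seed=0):
    """Sample 2 points, fit a line, score by inlier count, keep the best model,
    then refit on all of its inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue  # vertical sample, skip
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        resid = np.abs(points[:, 1] - (m * points[:, 0] + c))
        inliers = resid < threshold
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Recompute the model using all inliers (least squares).
    m, c = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], 1)
    return m, c, best_inliers

# 30 points exactly on y = 2x + 1, plus 10 gross outliers.
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 30)
inlier_pts = np.stack([x, 2 * x + 1], axis=1)
outlier_pts = rng.uniform(-5, 5, size=(10, 2))
pts = np.concatenate([inlier_pts, outlier_pts])

m, c, inliers = ransac_line(pts)
assert inliers.sum() >= 30
```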

|                 | Structure (scene geometry) | Motion (camera geometry) | Measurements             |
| --------------- | -------------------------- | ------------------------ | ------------------------ |
| Pose Estimation | known                      | estimate                 | 3D to 2D correspondences |
| Triangulation   | estimate                   | known                    | 2D to 2D correspondences |
| Reconstruction  | estimate                   | estimate                 | 2D to 2D correspondences |
Camera calibration

Homogeneous coordinates and inhomogeneous (heterogeneous) coordinates
Multiple View Geometry Section 2.2.1

line
A line in the plane can be represented by $ax+by+c=0$, thus $(a,b,c)$ and for nonzero $k$, $k(a,b,c)$ represent the same line.
$(a,b,c)$ is the homogeneous representation of a line

point
A point $(x,y)$ lies on a line $l=(a,b,c)$ iff $ax+by+c=0$, i.e. $\langle(x,y,1),(a,b,c)\rangle=0$. For nonzero $k$, $k(x,y,1)$ also satisfies the equation.
$(x,y,1)$ is the homogeneous representation of a point.
$(x,y)$ is the inhomogeneous representation of a point.


Camera model
Multiple View Geometry Section 6.1
Assume
 The focal length of the camera is $f$
 A 3D world point is $(X,Y,Z)$, where the origin point is the center of camera $C$
 The image plane is at $Z=f$
Camera center is also known as optical center.
The point where the principal axis meets the image plane is called the principal point ($p$ in the image above).
The plane through the camera center parallel to the image plane is called the principal plane.
Camera model using homogeneous coordinates
Denote:
 World point $\mathbf{X}_\text{cam}=(X,Y,Z,1)$ in the camera coordinate frame.
 Image point $\mathbf{x}=(x,y,w)$ in the camera coordinate frame
 Camera projection matrix $P\in\mathbb{R}^{3\times4}$
Then $\mathbf{x}=P\mathbf{X}_\text{cam}$

If the origin of the image plane is at the principal point (intersection of the image plane and the principal axis), then
$K$ is the camera calibration matrix.

If the principal point’s coordinates in the image plane is $(p_x,p_y)$, then

In general, world point is in the world coordinate frame
Denote
 the inhomogeneous coordinates of a point in the world frame as $\tilde{\mathbf{X}}_\text{world}\in\mathbb{R}^3$
 the inhomogeneous coordinates of the camera center $\tilde{\mathbf{C}}_\text{world}\in\mathbb{R}^3$
 the inhomogeneous coordinates of a point in the camera frame as $\tilde{\mathbf{X}}_\text{cam}\in\mathbb{R}^3$
Then we have $\tilde{\mathbf{X}}_\text{cam}=R(\tilde{\mathbf{X}}_\text{world}-\tilde{\mathbf{C}}_\text{world})$ using inhomogeneous coordinates.
In homogeneous coordinates, it’s $\mathbf{X}_\text{cam}=\begin{bmatrix}R&-R\tilde{\mathbf{C}}_\text{world}\\\mathbf{0}^T&1\end{bmatrix}\mathbf{X}_\text{world}$
which gives: $\mathbf{x}=KR[I\mid-\tilde{\mathbf{C}}_\text{world}]\mathbf{X}_\text{world}$
here $\mathbf{x}\in\mathbb{R}^3$ and $\mathbf{X}_\text{world}\in\mathbb{R}^4$, and we can denote $P=KR[I\mid-\tilde{\mathbf{C}}_\text{world}]$
This system has 9 degrees of freedom
 3 for K ($f,p_x,p_y$)
 3 for $R$ (rotations about the x, y, z axes)
 3 for $\tilde{\mathbf{C}}\in\mathbb{R}^3$

Pixels might not be square: 10 DoF

Sensor might be skewed: finite projective camera, 11 DoF

weak perspective camera

Geometric camera calibration (pose estimation)
Given matched points ${\mathbf{X}_i,\mathbf{x}_i}$ and a camera model $\mathbf{x}=P\mathbf{X}$, need to estimate $P$

heterogeneous coordinates $(x’,y’)$ are $x’=\frac{p_1^T\mathbf{X}}{p_3^T\mathbf{X}},y’=\frac{p_2^T\mathbf{X}}{p_3^T\mathbf{X}}$, equivalent to
$p_1^T\mathbf{X}-p_3^T\mathbf{X}x’=0$
$p_2^T\mathbf{X}-p_3^T\mathbf{X}y’=0$

Combine to a single system

Concatenate systems into one $Ap=0$
Solve $\arg\min\limits_p\|Ap\|_2^2$ subject to $\|p\|_2^2=1$

After getting $P$ need to find the intrinsic (of $K$) and extrinsic parameters
We know that $P=[M\mid-M\tilde{\mathbf{C}}]$ with $M=KR$

So we can find $\tilde{\mathbf{C}}$ first, since the projection of the camera center is the origin:
$P\mathbf{C}=0_{3\times1}$ for the homogeneous center $\mathbf{C}=(\tilde{\mathbf{C}},1)$
Do SVD of $P$; $\mathbf{C}$ is the right singular vector with the smallest singular value
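A small numpy check of this step, with a made-up $K$, $R=I$, and camera center (so $P\mathbf{C}=0$ holds by construction):

```python
import numpy as np

# Build P = K [R | -R C~] from known parameters, then recover C~ via SVD.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
C_true = np.array([1.0, 2.0, 3.0])
P = K @ np.hstack([R, (-R @ C_true)[:, None]])

_, _, Vt = np.linalg.svd(P)
C = Vt[-1]              # right singular vector with the smallest singular value
C = C[:3] / C[3]        # de-homogenize
assert np.allclose(C, C_true)
```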

Find $M=KR$ for an upper triangular matrix $K$ and an orthogonal matrix $R$: an RQ decomposition!

 Pros and cons
 pros: simple, analytical
 cons: doesn’t minimize the correct error function

Minimizing reprojection error
Triangulation

Definition
Given a set of (noisy) matched points $\{(x_i,x_i’)\}$ and camera matrices $P,P’$, estimate the 3D point $X$ in the world frame.
The model is $\mathbf{x}=P\mathbf{X}$, also $\mathbf{x}=\alpha P\mathbf{X}$ as we are using homogeneous coordinates.

Cross product
$c=a\times b=\|a\|\|b\|\sin\theta\,n$, with $\theta$ the angle between $a$ and $b$, $n$ the unit vector perpendicular to $a$ and $b$; the direction is determined using the right-hand rule. Thus $a\times a=0$
Solution
$\mathbf{x}=\alpha P\mathbf{X}$ means $\mathbf{x}$ and $P\mathbf{X}$ have the same direction, thus $\mathbf{x}\times P\mathbf{X}=0$, which gives
Here, the last equation is a combination of the first two. By stacking the system for $P$ and $P’$, $X$ can be estimated using SVD.
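A minimal numpy sketch of this linear triangulation (the two toy cameras and the test point are made up):

```python
import numpy as np

def triangulate(x1, x2, P1, P2):
    """Stack the first two rows of x cross PX = 0 for each view and take the
    smallest right singular vector."""
    def two_rows(x, P):
        u, v = x
        return np.array([u * P[2] - P[0],    # u (p3 . X) - (p1 . X) = 0
                         v * P[2] - P[1]])   # v (p3 . X) - (p2 . X) = 0
    A = np.vstack([two_rows(x1, P1), two_rows(x2, P2)])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # de-homogenize

# Two simple cameras: identity, and one translated along x (baseline 1).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

X = triangulate(project(P1, X_true), project(P2, X_true), P1, P2)
assert np.allclose(X, X_true)
```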
Epipolar geometry

The epipole might not always be in the image.

Points on the epipolar line $l’$ marked as $x’$ are the potential matches for $x$.

Cross product

Essential matrix (maps a point to a line)
Given a point in one image, multiplying by the essential matrix will tell us the epipolar line in the second view, i.e. $l’=Ex$.
Derivation
 Denote $t=\overrightarrow{oo’}$, then $x’=R(x-t)$ where $R$ is the rotation between the views.
 The three vectors $x,t,x-t$ are coplanar, thus $x^T(t\times x)=0$ and also $(x-t)^T(t\times x)=0$
 Since $x’=R(x-t)$, we have $(x-t)^T={x’}^TR$, then ${x’}^TR(t\times x)={x’}^TR[t]_\times x=0$
 $E=R[t]_\times$ is a 3*3 matrix
Properties

Longuet-Higgins equation ${x’}^TEx=0$
 Epipolar lines
 $x^Tl=0$, $l=E^Tx’$
 ${x’}^Tl’=0$, $l’=Ex$
where an epipolar line $l$, i.e. $ax+by+c=0$, is denoted as $l=[a,b,c]^T$, and a point $x$ is on $l$ means $x^Tl=0$.
 Epipoles
 $Ee=0$
 ${e’}^TE=0$
Assumption We assume the points are expressed in camera coordinates (calibrated camera). The points $x,x’$ above are actually points in the camera frame and should be denoted $\hat{x},\hat{x}’$; denoting the image points as $x,x’$, we have $\hat{x}=K^{-1}x$.
Note that before we had $\mathbf{x}=K[I\mid0]\mathbf{X}_\text{cam}$, where $\mathbf{x}$ is the homogeneous image point and $\mathbf{X}_\text{cam}$ is the homogeneous camera-frame point.

Fundamental matrix
The fundamental matrix is a generalization of the essential matrix, where the assumption of calibrated cameras is removed.
The equation of the essential matrix is in fact ${\hat{x}’}^TE\hat{x}=0$; using $\hat{x}=K^{-1}x$, we have ${x’}^TFx=0$ with $F={K’}^{-T}EK^{-1}$, a 3*3 matrix.
Eight-point algorithm
Stereo

We have $\frac{X}{Z}=\frac{x}{f},\ \frac{X-b}{Z}=\frac{x’}{f}$, therefore disparity $d=x-x’=\frac{bf}{Z}$, which is inversely proportional to depth $Z$. In the figure, orange lines are image planes and $O,O’$ are camera centers.

Stereo rectification

Compute Depth
To compute depth, need to make epipolar lines horizontal and calculate the depth from disparity $Z=\frac{bf}{d}$. The epipolar lines are horizontal when $R=I,t=(T,0,0)$, then
${x’}^TEx=0$ gives $v=v’$ where $x=(u,v,1)^T$. So the image of a 3D point will always be on the same horizontal line.

Stereo rectification
Reproject image planes onto a common plane parallel to the line between camera centers. It requires two homographies (3*3 transform) for each input image.
Rotate the right camera by $R$; rotate both cameras so that the epipoles are at infinity; adjust the scale

Stereo matching
For each pixel, find the epipolar line, scan the line for the best match, and compute the depth from disparity
Structure from motion
Physics-based vision (not finished)
 Reflectance and image formation
 Radiometry
 Shape from shading
 Photometric stereo
 Color
Objects, faces, and learning
 Basics of probability
 K-means, KNN, PCA, SVM
 Bag of words
 Viola-Jones face detection
 Perceptron, backpropagation
 Convolutional neural networks
Dealing with motion
Optical flow
Assumptions:

Color constancy
Brightness constancy for intensity images: a point’s intensity stays the same across frames.

Small motion
Pixels only move a little. The displacement $(u\delta t, v\delta t)$ is small.
$I(x+u\delta t,y+v\delta t,t+\delta t)=I(x,y,t)$
$\dfrac{dI}{dt}=\dfrac{\partial I}{\partial x}\dfrac{dx}{dt}+\dfrac{\partial I}{\partial y}\dfrac{dy}{dt}+\dfrac{\partial I}{\partial t}=0$ ($\frac{dI}{dt}$ is a total derivative)
$I_xu+I_yv+I_t=0$
Constant flow
Assumptions:
 Flow is locally smooth
 Neighboring pixels have same displacement
Use a 5*5 image patch, giving 25 equations:
$I_x(p_i)u+I_y(p_i)v+I_t(p_i)=0$
Can be solved by least squares (LS).
The matrix $A^TA$ is the same matrix $M$ as in the Harris corner detector.
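The constant-flow least squares above can be sketched directly (the random synthetic gradients and the test displacement are made up; by brightness constancy $I_t=-(I_xu+I_yv)$, so the solver should recover the displacement exactly):

```python
import numpy as np

def lucas_kanade_patch(Ix, Iy, It):
    """Solve the stacked constraints Ix*u + Iy*v + It = 0 over a patch by least
    squares. A^T A is the same matrix M as in the Harris corner detector."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # 25 x 2 for a 5x5 patch
    b = -It.ravel()
    return np.linalg.solve(A.T @ A, A.T @ b)          # (u, v)

# Synthetic 5x5 patch translating by (u, v) = (0.3, -0.1).
rng = np.random.default_rng(0)
Ix = rng.standard_normal((5, 5))
Iy = rng.standard_normal((5, 5))
u_true, v_true = 0.3, -0.1
It = -(Ix * u_true + Iy * v_true)   # brightness constancy

u, v = lucas_kanade_patch(Ix, Iy, It)
assert np.allclose([u, v], [u_true, v_true])
```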
Aperture problem
Horn-Schunck optical flow

Enforce brightness constancy
$E_{d}(i, j)=\left[I_{x} u_{i j}+I_{y} v_{i j}+I_{t}\right]^{2}$

Enforce smooth flow field
$E_{s}(i, j)=\frac{1}{4}\left[\left(u_{ij}-u_{i+1,j}\right)^{2}+\left(u_{ij}-u_{i,j+1}\right)^{2}+\left(v_{ij}-v_{i+1,j}\right)^{2}+\left(v_{ij}-v_{i,j+1}\right)^{2}\right]$
Image Alignment
$\min\limits_p \sum\limits_x[I(W(x;p))-T(x)]^2$ where
 $T$ is template image
 $W$ is warped image
 $I(x’)=I(W(x;p))$, $I$ is nonparametric
Lucas-Kanade alignment
$I(W(x;p+\Delta p))\approx I(W(x;p))+\dfrac{\partial I(W(x;p))}{\partial x’}\dfrac{\partial W(x;p)}{\partial p}\Delta p$
Assume we have an initial guess of $p$, then solve for $\Delta p$ using LS
Left: Lucas-Kanade (additive alignment), Right: Shum-Szeliski (compositional alignment)
In conclusion, the update rules are:

Lucas-Kanade (additive alignment)
$W(x;p)\leftarrow W(x;p+\Delta p)$

Shum-Szeliski (compositional alignment)
$W(x;p)\leftarrow W(x;p)\circ W(x;\Delta p)$

Baker-Matthews (inverse compositional alignment)
$W(x;p)\leftarrow W(x;p)\circ W(x;\Delta p)^{-1}$
Kalman Filtering
Tracking (KLT, MeanShift)
Kanade-Lucas-Tomasi (KLT)
 Find corners satisfying $\min(\lambda_1,\lambda_2)>\lambda$
 For each corner, compute the displacement to the next frame using the Lucas-Kanade method
 Store displacement of each corner, update corner position
 (optional) Add more corner points every M frames using step 1
 Repeat steps 2 to 4
 Returns long trajectories for each corner point
Mean-shift
Like gradient ascent on the kernel density estimate $P(x)$
Mean-shift for images
Each pixel is a point with a weight
Non-rigid object tracking

Target descriptor $q=\{q_1,\cdots,q_M\}$ with $q_{m}=C \sum\limits_n k(\|x_{n}\|^{2})\,\delta[b(x_{n})-m]$

Candidate descriptor $p(y)=\{p_1(y),\cdots,p_M(y)\}$ with $p_{m}(y)=C_{h} \sum\limits_{n} k(\|\frac{y-x_{n}}{h}\|^{2})\,\delta[b(x_{n})-m]$
There’s only one target but multiple candidates.

Similarity between target and candidate
$d(y)=\sqrt{1-\rho[p(y),q]}$
$\rho[p(y),q]=\frac{p(y)^\intercal q}{\|p(y)\|\|q\|}$ is the cosine similarity; minimizing the distance $d$ is equivalent to maximizing $\rho$
Temporal inference and SLAM
Temporal state model. States $X_t$ and observations $E_t$. Assumptions:
 stationary process: $P(E_t|X_t)$ is the same for all $t$
 Markov: $P(X_t|X_{t-1},\cdots)=P(X_t|X_{t-1})$
 known prior: $X_0$ is known
 Joint probability $P(X_{0}) \prod_{t=1}^{T} P(X_{t}\mid X_{t-1})\,P(E_{t}\mid X_{t})$
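The joint probability above, evaluated for a tiny two-state example (all transition, observation, and prior numbers here are made up):

```python
import numpy as np

prior = np.array([0.6, 0.4])                 # P(X0)
trans = np.array([[0.7, 0.3], [0.2, 0.8]])   # trans[i, j] = P(Xt = j | Xt-1 = i)
obs = np.array([[0.9, 0.1], [0.3, 0.7]])     # obs[i, e] = P(Et = e | Xt = i)

states = [0, 0, 1]    # X0, X1, X2
evidence = [1, 0]     # E1, E2 (observed from t = 1 on)

# P(X0) * prod_t P(Xt | Xt-1) P(Et | Xt)
p = prior[states[0]]
for t in range(1, len(states)):
    p *= trans[states[t - 1], states[t]] * obs[states[t], evidence[t - 1]]

# p = 0.6 * 0.7 * 0.1 * 0.3 * 0.3
assert np.isclose(p, 0.00378)
```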