January 12, 2021

Rotoscoping with OpenCV/C++

One of the most basic problems in VFX is separating an object in a video from its background. The object then composited back into new environment or scene. This task has different names such as matting, keying or more popular term rotoscoping.

In this demo you can see how a church in the foreground is separated from the background blue sky, which was later replaced with a sky full of stars. This is the kind of problem matting can help with.

Full code and how to use it is in this repo https://github.com/smirnov-am/vfx/tree/main/matting

Matting problem

Every pixel \( I(x, y) \) in a video frame can be expressed as a composition of \( F(x, y) \) - foreground pixel - and \( B(x, y) \) - background pixel - as:

\(\alpha\) is a scalar in a range from 0 to 1 and \(I\), \(F\) and \(B\) are vectors in RGB space, where each component represents an intensity of a respective color channel. This formula is referred to as matting equation.

It’s impossible to solve this problem in general case as this system has only 3 equations - one for each color channel - but 7 unknowns. These unknowns are components of \(F\), \(B\) and a scalar value of \(\alpha\).

However, if \(\alpha\) is known, foreground and background can be derived:

But there is a way to supply additional information that can make this problem solvable and help with finding \(\alpha\). Manually mark some pixels as foreground with white and, background with black like on this picture creating a trimap:


I’ve taken the first frame from my footage and using image editor roughly painted the church with white and sky with black brush.

There are two method how to solve matting equation [1] given such a trimap covered in this post:

  • Bayes’ inference
  • artificial neural network

Apply Bayes theorem

Since the frame pixel \(I\) is something you already have as an input, what can be reasonable values for \(F\), \(B\) and \(\alpha\) to produce such a frame? This question can be represented as the probability of these unknowns given \(I\).

Now you need to find such unknowns - foreground \( F \), background \(B\) and \(\alpha\) - for which this probability is maximum.

Bayes theorem can revert that conditional probability:

The right part of this equation is a product of likelihood and prior probability, divided by evidence probability. Let’s tackle them separately.

The likelihood term in Bayes theorem reflects the knowledge of how likely is an image if foreground, background and \(\alpha\) are known. If they are known, \(I\) is directly computed using matting equation [1]. So you can model the likelihood PDF as a very narrow function with maximum where the matting equation is valid:

This PDF is a bell-shaped curve with the width controlled by \(\sigma\), which is an input parameter for this method.

Prior term is a bit more difficult. First, the assumption that foreground, background and and \(\alpha\) are independent of each other breaks this term into 3:

Another useful assumption is that the probability distribution of \(\alpha\) is a constant. In reality is a beta-distribution with a lot of values close to extremes 0 and 1 and very few values in between.

As for PDFs of \(F\) and \(B\), they can be modeled as Gaussian. Again in general case they are not, but in case of my footage the foreground is mainly brown and the background - sky - is blue. I’m going to put these assumptions to test later on.

Given that foreground and background PDFs are Gaussian, the probability of \(F\) and \(B\) in the prior term can be modeled as:

At last, the probability of evidence \(P(I)\) is a constant and is not relevant.

Given all these models and assumptions it’s possible to just take a derivative and find the argument for which the conditional probability is at its maximum. This gives a closed form solution like:

The exact elements of the matrices can be found in the code:

  • 6-by-6 \(A\) matrix for left hand side - code
  • 6-by-1 vector \(y\) for right hand side - code

Given that system of linear equations and a formula for \(\alpha\)

it’s possible to iteratively solve them until alpha is converged to some stable value.

Check the assumptions

Now let’s test the assumption that PDF for foreground and background are Gaussian. Go thought the frame and trimap pixel by pixel at the same time and plot the pixels for which trimap value is either 0 or 1 in 3D RGB space:


I’ve done so with Python and pandas. I’ve used plotly to visualize the distributions. Furthermore, I’m not plotting all pixels as there are thousands of them, but just a fraction. But it’s enough to see that foreground and background are nicely separated compact blobs:

Get alpha map for the whole video

The trimap is created for the first frame of the video. Now there is an approximate alpha for each pixel in this first frame. It can be used in equation [4] to calculate foreground and background and use them to get more precise alpha using equation [5] until convergence to some stable number. So the alpha values calculated this way constitute a full alpha map for the first frame.

The alpha map for the first frame then used as a trimap for the second frame. Since there very little difference between the frames it speeds up calculations as the algorithm converges to stable values more quickly

But still in my case of Full HD 60fps footage which is 12s long it took hours. However, it has good potential for parallelization, I didn’t implement that, because got a more interesting idea.

The trimap provides a way to mark some pixels as 0 and another as 1 which looks like a perfect fit for a supervised learning algorithm, that can be implemented using OpenCV build-in artificial neural network.

Use an artificial neural network

OpenCV has an implementation of multi-layer perceptron (MLP). To model foreground-background classification an input MLP layer is going to have 3 nodes - one for each color channel. The output layer has a single node for alpha. Since alpha can be fractional the output can be used as is.

All other hyper-parameters of the neural net should be tuned. For that you can train a network and produce an alpha map for the first frame. By comparing them visually you can select the best ones, just short-circuiting a standard approach of separating training set to train-test-validate batches.

The MLP topology looks like this:


Once the network is trained it is used to predict alpha for each pixel in each frame. The result again is a video alpha map.

This method works much faster compared to Bayes’ inference once the network is trained, but has a slow start since training and tuning the hyper-parameters takes some time.

Final composition

Finally, given a video alpha map - a video consisting of alpha maps of individual frames of the original video - you can compose a new video by replacing the background with the new one using equations [2] and [3]

Here is a 30sec reel that shows the whole process:

Support the author - Buy me a coffee!

Comments powered by Talkyard.

© Alexey Smirnov 2021