Image Processing
Now that we know how a digital image is represented, let's discuss how computers can use this information to find patterns that can be used to classify an image or localize objects. To extract any useful or actionable information from an image, a computer has to resolve it into recognizable, known patterns. As with any machine learning algorithm, computer vision needs features in order to learn those patterns.
Unlike structured data, where each feature is well defined in advance and stored in separate columns, images don't follow any specific pattern. It is impossible to say, for instance, that the third line will always contain the eye of an animal or that the bottom left corner will always represent a red, round-shaped object. Images can be of anything and don't follow any structure. This is why they are considered to be unstructured data.
However, images do contain features. They contain different shapes (lines, circles, rectangles, and so on), colors (red, blue, orange, yellow, and so on), and specific characteristics related to different types of objects (hair, wheel, leaves, and so on). Our eyes and brain can easily analyze and interpret all these features and identify objects in images. Therefore, we need to simulate the same analytical process for computers. This is where image filters (also called kernels) come into play.
Image filters are small matrices specialized in detecting a defined pattern. For instance, we can have one filter for detecting vertical lines only and another one for horizontal lines only. Computer vision systems run such filters over every part of an image and generate a new image in which the detected patterns are highlighted. These generated images are called feature maps. An example of a feature map produced by an edge-detection filter is shown in the following figure:
Such filters are widely used in image processing. If you've used Adobe Photoshop before (or any other image processing tool), you have most likely used filters such as Gaussian and Sharpen.
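To make this concrete, here is what such a filter looks like as a matrix. This is a minimal sketch in Python, using NumPy purely as an illustration; we will meet this exact vertical-line filter (called Sobel) again shortly:
import numpy as np
# Sobel filter for detecting vertical lines: it responds strongly
# where pixel values change from left to right
vertical_filter = np.array([[1, 0, -1],
                            [2, 0, -2],
                            [1, 0, -1]])
# Its transpose responds to top-to-bottom changes instead,
# that is, horizontal lines
horizontal_filter = vertical_filter.T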
Convolution Operations
Now that we know the basics of image processing, we can start our journey with CNNs. As we mentioned previously, computer vision relies on applying filters to an image to recognize different patterns or features and generate feature maps. But how are these filters applied to the pixels of an image? You could guess that there is some sort of mathematical operation behind it, and you would be absolutely right. This operation is called convolution.
A convolution operation is composed of two stages:
- An element-wise product of two matrices
- A sum of all the elements of the resulting matrix
Let's look at an example of how to convolve two matrices, A = [[5,10,15],[10,20,30],[100,150,200]] and B = [[1,0,-1],[2,0,-2],[1,0,-1]]:
First, we perform an element-wise multiplication of matrices A and B. The result is another matrix, C, with the following values:
- 1st row, 1st column: 5 × 1 = 5
- 1st row, 2nd column: 10 × 0 = 0
- 1st row, 3rd column: 15 × (-1) = -15
- 2nd row, 1st column: 10 × 2 = 20
- 2nd row, 2nd column: 20 × 0 = 0
- 2nd row, 3rd column: 30 × (-2) = -60
- 3rd row, 1st column: 100 × 1 = 100
- 3rd row, 2nd column: 150 × 0 = 0
- 3rd row, 3rd column: 200 × (-1) = -200
Note
An element-wise multiplication is different from a standard matrix multiplication, which operates at the row and column level rather than on each element.
Finally, we just have to perform a sum on all elements of matrix C, which will give us the following:
5+0-15+20+0-60+100+0-200 = -150
The final result of the entire convolution operation on matrices A and B is -150, as shown in the following diagram:
In this example, matrix B is actually a filter (or kernel) called Sobel, which is used for detecting vertical lines (there is also a variant for horizontal lines). Matrix A represents a portion of an image with the same dimensions as the filter (matching dimensions are mandatory for element-wise multiplication).
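You can verify this result with TensorFlow's built-in functions. The following is a minimal sketch; Exercise 3.01 walks through the same two stages with different matrices:
import tensorflow as tf
A = tf.constant([[5, 10, 15], [10, 20, 30], [100, 150, 200]])  # image portion
B = tf.constant([[1, 0, -1], [2, 0, -2], [1, 0, -1]])          # Sobel filter
C = tf.math.multiply(A, B)    # stage 1: element-wise product
print(tf.math.reduce_sum(C))  # stage 2: sum of all elements -> -150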
Note
A filter is, in general, a square matrix such as (3,3) or (5,5).
For a CNN, filters are actually parameters that are learned (that is, defined) during the training process, so the values of each filter are set by the CNN itself. This is an important concept to keep in mind before we learn how to train a CNN.
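To see that filters really are trainable parameters, you can inspect a convolutional layer from tf.keras (which ships with TensorFlow). This is a minimal sketch; the (28, 28, 1) input shape is just an arbitrary example:
import tensorflow as tf
# A convolutional layer holding a single (3, 3) filter
layer = tf.keras.layers.Conv2D(filters=1, kernel_size=3)
# Building the layer initializes its weights
layer.build(input_shape=(None, 28, 28, 1))
print(layer.kernel.shape)      # (3, 3, 1, 1): one 3x3 filter
print(layer.kernel.trainable)  # True: its values are set during training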
Exercise 3.01: Implementing a Convolution Operation
In this exercise, we will use TensorFlow to implement a convolution operation on two matrices: [[1,2,3],[4,5,6],[7,8,9]] and [[1,0,-1],[1,0,-1],[1,0,-1]]. Perform the following steps to complete this exercise:
- Open a new Jupyter Notebook file and name it Exercise 3.01.
- Import the tensorflow library:
import tensorflow as tf
- Create a tensor called A from the first matrix, [[1,2,3],[4,5,6],[7,8,9]]. Print its value:
A = tf.Variable([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
A
The output will be as follows:
<tf.Variable 'Variable:0' shape=(3, 3) dtype=int32,
numpy=array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])>
- Create a tensor called B from the second matrix, [[1,0,-1],[1,0,-1],[1,0,-1]]. Print its value:
B = tf.Variable([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
B
The output will be as follows:
<tf.Variable 'Variable:0' shape=(3, 3) dtype=int32,
numpy=array([[ 1, 0, -1],
[ 1, 0, -1],
[ 1, 0, -1]])>
- Perform an element-wise multiplication on A and B using tf.math.multiply(). Save the result in mult_out and print it:
mult_out = tf.math.multiply(A, B)
mult_out
The expected output will be as follows:
<tf.Tensor: id=19, shape=(3, 3), dtype=int32,
numpy=array([[ 1, 0, -3],
[ 4, 0, -6],
[ 7, 0, -9]])>
- Perform a sum of all the elements of mult_out using tf.math.reduce_sum(). Save the result in conv_out and print it:
conv_out = tf.math.reduce_sum(mult_out)
conv_out
The expected output will be as follows:
<tf.Tensor: id=21, shape=(), dtype=int32, numpy=-6>
The result of the convolution operation on the two matrices, [[1,2,3],[4,5,6],[7,8,9]] and [[1,0,-1],[1,0,-1],[1,0,-1]], is -6.
Note
To access the source code for this specific section, please refer to https://packt.live/320pEfC.
You can also run this example online at https://packt.live/2ZdeLFr. You must execute the entire Notebook in order to get the desired result.
In this exercise, we used the built-in functions of TensorFlow to perform a convolution operation on two matrices.
Stride
So far, we have learned how to perform a single convolution operation. We learned that a convolution operation uses a filter of a specific size, say (3, 3), that is, 3 × 3, and applies it to a portion of the image of the same size. If we have a large image, say of size (512, 512), then a single convolution operation only looks at a very tiny part of it.
To cover the entire space of a given image, we therefore need to repeat the same convolution operation on part after part. To do so, we apply a technique called sliding. As the name implies, sliding consists of applying the filter to the area adjacent to the previous convolution operation: we just slide the filter over and apply the convolution again.
If we start from the top-left corner of an image, we can slide the filter to the right by one pixel at a time. Once we reach the right edge, we slide the filter down by one pixel and start again from the left. We repeat this sliding operation until the convolution has been applied to the entire space of the image:
Rather than sliding by only 1 pixel, we can choose a bigger step, such as 2 or 3 pixels. The parameter defining the size of this step is called the stride. With a bigger stride value, there are fewer overlapping pixels, but the resulting feature map has smaller dimensions, so you will lose a bit of information.
In the preceding example, we applied a Sobel filter to an image that is split down the middle, with dark values on the left-hand side and white ones on the right-hand side. The resulting feature map has high values (800) in the middle, which indicates that the Sobel filter found a vertical line in that area. This is how sliding convolution helps to detect specific patterns in an image.
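TensorFlow can perform this sliding convolution for you with tf.nn.conv2d. The following is a minimal sketch; the 6 × 6 image with a dark left half and a bright right half is a made-up stand-in for the example above:
import tensorflow as tf
# A (6, 6) image: dark (10) on the left, bright (200) on the right,
# reshaped to (batch, height, width, channels) as conv2d expects
image = tf.constant([[10., 10., 10., 200., 200., 200.]] * 6)
image = tf.reshape(image, (1, 6, 6, 1))
# Sobel filter, reshaped to (height, width, in_channels, out_channels)
sobel = tf.constant([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])
sobel = tf.reshape(sobel, (3, 3, 1, 1))
# Stride of 1: the filter slides one pixel at a time
print(tf.nn.conv2d(image, sobel, strides=1, padding='VALID').shape)  # (1, 4, 4, 1)
# Stride of 2: fewer positions, smaller feature map
print(tf.nn.conv2d(image, sobel, strides=2, padding='VALID').shape)  # (1, 2, 2, 1)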
Padding
In the previous section, we learned how a filter can traverse all the pixels of an image by sliding across it. Combined with the convolution operation, this process helps to detect patterns in an image (that is, extract features).
Applying a convolution to an image results in a feature map with smaller dimensions than the input image. A technique called padding can be used to make the feature map's dimensions exactly match those of the input image. It consists of adding a layer of pixels with a value of 0 around the edges of the image:
In the preceding example, the input image has the dimensions (6,6). Once padded, its dimensions increased to (8,8). Now, we can apply convolution on it with a filter of size (3,3):
The feature map that results from convolving the padded image has dimensions of (6,6), exactly the same as the original input image. The feature map has high values in the middle of the image, just like in the previous example without padding, so the filter can still find the same pattern. But you may notice that we now have very low values (-800) on the left edge. This is actually fine, as lower values mean the filter hasn't found the pattern it is looking for in that area.
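In practice, you don't need to add the zero pixels yourself: tf.nn.conv2d accepts a padding argument that does it for you. A minimal sketch along the same lines as the previous one:
import tensorflow as tf
image = tf.reshape(tf.range(36, dtype=tf.float32), (1, 6, 6, 1))  # a (6, 6) image
sobel = tf.reshape(tf.constant([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]]), (3, 3, 1, 1))
# 'VALID' applies no padding: the feature map shrinks to (4, 4)
print(tf.nn.conv2d(image, sobel, strides=1, padding='VALID').shape)  # (1, 4, 4, 1)
# 'SAME' zero-pads the edges so the feature map keeps the (6, 6)
# dimensions of the input, as described above
print(tf.nn.conv2d(image, sobel, strides=1, padding='SAME').shape)   # (1, 6, 6, 1)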
The following formulas can be used to calculate the output dimensions of a feature map after a convolution:
output width = (w - f + 2p) / s + 1
output height = (h - f + 2p) / s + 1
Here, we have the following:
- w: Width of the input image
- h: Height of the input image
- p: Number of pixels used on each side for padding
- f: Filter size
- s: Number of pixels in the stride
Let's apply this formula to the preceding example:
- w = 6
- h = 6
- p = 1
- f = 3
- s = 1
Then, calculate the output dimensions as follows:
output width = (6 - 3 + 2 × 1) / 1 + 1 = 6
output height = (6 - 3 + 2 × 1) / 1 + 1 = 6
So, the dimensions of the resulting feature map are (6,6).
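As a quick check, the formula is easy to turn into a small helper function. This is just a sketch; the floor division handles strides that don't divide the image size evenly:
def feature_map_size(w, h, f, p=0, s=1):
    # Output (width, height) of a feature map after a convolution
    out_w = (w - f + 2 * p) // s + 1
    out_h = (h - f + 2 * p) // s + 1
    return out_w, out_h
print(feature_map_size(w=6, h=6, f=3, p=1, s=1))  # (6, 6)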