This is a note on the basics of the convolutional layer in a CNN.
Mostly from this reference: "Can someone give an intuitive explanation of how convolutional neural networks work?" - answer by Owl of Minerva on Zhihu: https://www.zhihu.com/question/39022858/answer/224446917
Partly from notes on the book Deep Learning.
1. Nature of the convolutional layer
A picture can be stored as a map of values with several channels (1 to 3: 1 when it is a grayscale picture, 3 when it is a color picture with RGB channels). This map of values can be seen as a matrix, so operations on a picture can be written down in matrix form.
We can smooth or blur the picture by moving a window over it and extracting a specific value from each window position. Mathematically, this is a weighted sum of the values of the map covered by the moving window. For example, a 3x3 smoothing works like the sketch below:
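A minimal NumPy sketch of this sliding-window weighted sum (a 3x3 box blur with uniform weights; the toy image values are made up purely for illustration):

```python
import numpy as np

# Toy 6x6 grayscale "image"; the values are arbitrary, for illustration only.
img = np.arange(36, dtype=float).reshape(6, 6)

# 3x3 smoothing kernel: every pixel in the window gets the same weight 1/9.
kernel = np.full((3, 3), 1.0 / 9.0)

# Slide the window over every position where it fully fits ("valid" convolution),
# take the weighted sum of the covered values, and store it in the output.
out = np.zeros((img.shape[0] - 2, img.shape[1] - 2))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        window = img[i:i + 3, j:j + 3]
        out[i, j] = np.sum(window * kernel)

print(out)  # each entry is the local 3x3 average around that position
```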
This is the same as element-wise multiplying each window of the original matrix by a fixed 3x3 matrix and summing the result, step by step as the window moves across the picture.
This small 3x3 matrix is called a kernel.
Therefore, by doing so over the picture, we apply low-pass filtering to it. The kernel is the so-called filter, and this operation is called convolution.
More generally, any linear, shift-invariant filtering operation on a 2-dimensional picture can be written in convolution form, for example Gaussian filtering and Laplacian filtering.
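For reference, two commonly used 3x3 kernels (these particular coefficients are standard discrete approximations, though conventions vary):

```python
import numpy as np

# 3x3 Gaussian smoothing kernel (a common discrete approximation).
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]], dtype=float) / 16.0

# 3x3 Laplacian kernel (highlights regions of rapid intensity change).
laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)
```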
So why do we need a convolutional layer in our network? Because of its capability to capture specific features such as a certain shape, curve, or corner. This capability comes from the convolution producing a large output value where the target feature appears and a small value elsewhere, which can then be picked up by the neuron activation.
Thus, after convolving over the whole matrix, we get a result matrix with large values around the target pattern (say, a certain curve) and small values in other areas. This is an activation map. The areas of relatively high value are what we want to detect.
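A small sketch of this idea: a hand-picked vertical-edge kernel (Sobel-like, chosen here just for illustration) applied to a toy image whose left half is dark and right half is bright; the resulting activation map is large only along the edge:

```python
import numpy as np
from scipy.signal import correlate2d

# Toy 8x8 image: left half 0 (dark), right half 1 (bright).
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Vertical-edge kernel (Sobel-like): responds to left-to-right intensity changes.
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)

# Cross-correlation (what deep-learning "convolution" layers actually compute).
activation_map = correlate2d(img, kernel, mode="valid")

print(np.abs(activation_map))
# Large values (4.0) appear only in the columns whose windows cover the edge;
# everything else is 0 -- the map "lights up" where the feature is.
```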
2. Multi-layer CNN
When we train a convolutional neural network, what we are really training is a sequence of filters. For example, when we work on the MNIST dataset and set up 10 output channels in the first convolutional layer, it means we are applying 10 filters to a 28x28x1 matrix in order to capture 10 different patterns. We can then do further operations on the output.
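A minimal PyTorch sketch of such a first layer (the kernel size of 5 and the batch size of 64 are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

# First convolutional layer for MNIST: 1 input channel (grayscale),
# 10 output channels, i.e. 10 learnable filters/kernels.
conv1 = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=5)

x = torch.randn(64, 1, 28, 28)   # a batch of 64 MNIST-sized images
out = conv1(x)

print(out.shape)  # torch.Size([64, 10, 24, 24]): one 24x24 activation map per filter
print(sum(p.numel() for p in conv1.parameters()))  # 10*1*5*5 weights + 10 biases = 260
```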
In order to capture the right patterns/features, i.e., the ones that are useful for the final output, we need to train the kernels. This is what training does: with backpropagation, we adjust the filter weights so that the filters pick out the features we want.
Moreover, when we talk about building a CNN with several layers, what we actually do is apply a sequence of filters, where each filter is fed with the output of the previous one. The idea is that an early convolutional layer captures lower-level or simpler features, such as lines and angles, and the succeeding filters then operate on these lower-level features to construct more complex features.
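A rough sketch of such a stack in PyTorch (the channel counts and kernel sizes are arbitrary; a real network would also add pooling and a classifier head):

```python
import torch
import torch.nn as nn

# Two stacked convolutional layers: the first sees raw pixels and can learn
# simple features (edges, corners); the second sees the first layer's
# activation maps and can combine them into more complex patterns.
features = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5),   # 1 -> 10 channels
    nn.ReLU(),
    nn.Conv2d(10, 20, kernel_size=5),  # 10 -> 20 channels, built on layer-1 features
    nn.ReLU(),
)

x = torch.randn(1, 1, 28, 28)
print(features(x).shape)  # torch.Size([1, 20, 20, 20])
```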
3. Why CNN?
The convolution operation improves the network through three ideas: sparse interactions, parameter sharing, and equivariant representations. What is more, using convolutional layers also offers a strategy for handling inputs of variable size.
A traditional neural network uses matrix operations, matrix multiplication in particular, to describe the relationship between neurons. This means that each neuron is connected to every neuron in the next layer (a fully connected layer). Sometimes this is a good representation, but sometimes it is not. Imagine we have a very large matrix for a picture: the relation between the top-left pixels and the bottom-right pixels is seldom of interest. People usually care about the relations within a relatively small area, like a corner, line, or turn in a picture, and use sets of these "small features" to construct an abstract understanding of the big picture. Therefore, we need a method that stores fewer relationships between neurons, keeping only the meaningful ones, and so reduces the time complexity. This is what the convolutional network gives us: it has the property of sparse interactions. What is more, in a deep neural network, neurons buried deep in the network still have broad indirect connections to most of the input, which allows us to use sparse interactions to describe complicated interactions.
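A quick count that illustrates the point, assuming a 28x28 input, a same-sized output layer, and a 3x3 kernel (arbitrary but typical numbers):

```python
# Fully connected: every one of the 784 output units is connected
# to all 784 input pixels of a 28x28 image.
n_in = 28 * 28
fc_weights = n_in * n_in             # 614,656 weights
fc_connections_per_output = n_in     # 784 connections per output unit

# Convolution with a 3x3 kernel: each output unit is connected to only
# 9 input pixels, and all output units share those same 9 weights.
conv_weights = 3 * 3                 # 9 weights (plus one bias)
conv_connections_per_output = 3 * 3  # 9 connections per output unit

print(fc_weights, conv_weights)      # 614656 vs 9
```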
Parameter sharing means using the same parameters in several functions of a model. In the case of a CNN, we share the weights of the convolutional kernels: every element of a kernel is applied to almost every point of the input (depending on the padding strategy). Parameter sharing therefore lets us train a single small parameter set for the whole input instead of one set per position as in a fully connected layer, which greatly reduces the storage the model needs. (What is more, if a 2-D convolutional kernel is separable, the 2-D convolution can be replaced by two 1-D convolutions, reducing the work per position from n x n multiplications to n + n.)
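A small check of the separable-kernel remark, assuming a 3x3 Gaussian-like kernel (which happens to factor as an outer product of two 1-D kernels):

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(8, 8)

# The 3x3 Gaussian-like kernel is separable: it is the outer product
# of the 1-D kernel [1, 2, 1] / 4 with itself.
k1d = np.array([1.0, 2.0, 1.0]) / 4.0
k2d = np.outer(k1d, k1d)           # 3x3 kernel, 9 multiplications per position

# One 2-D convolution ...
full_2d = convolve2d(img, k2d, mode="full")

# ... equals two 1-D convolutions (a column pass then a row pass),
# i.e. 3 + 3 = 6 multiplications per position instead of 9.
two_1d = convolve2d(convolve2d(img, k1d[:, None], mode="full"),
                    k1d[None, :], mode="full")

print(np.allclose(full_2d, two_1d))  # True
```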
Moreover, parameter sharing gives the CNN equivariance to translation. If, whenever the input changes, the output changes in the same way, we say the function is equivariant. Specifically, when f(x) and g(x) satisfy f(g(x)) = g(f(x)), we say that f is equivariant to g. Back to CNNs: convolution holds this equivariance when a specific pattern moves around the picture (translation). However, there are still cases where we do not want parameter sharing, and convolution does lose its equivariance under some transforms (like scaling or rotation). Fortunately, these special cases can be handled by other mechanisms in the network.
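A quick numerical check of f(g(x)) = g(f(x)) for translation; a circular shift and a circular ("wrap") boundary are used so that the equality holds exactly:

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)

def conv(x):
    # f: circular ("wrap") convolution, same output size as the input
    return convolve2d(x, kernel, mode="same", boundary="wrap")

def shift(x):
    # g: translate the image by 2 pixels to the right (circularly)
    return np.roll(x, 2, axis=1)

# f(g(x)) == g(f(x)): convolving a shifted image gives the shifted result.
print(np.allclose(conv(shift(img)), shift(conv(img))))  # True
```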
Reference:
Mostly from this answer: https://www.zhihu.com/question/39022858/answer/224446917. Great thanks.