"Wow, your shoes look amazing!"
"I love that black tie on you."
"Those square goggles do not suit your oval face."
In your day-to-day life, you pass opinions like these in the blink of an eye. They say think before you speak, and technically speaking, we do think a lot before we speak; it's just that evolution has made us so fast that these decisions take almost no time at all.
A Convolutional Neural Network (CNN) aims to do the same. CNNs can handle many classification tasks, but they are most widely used in image recognition because of their outstanding pattern-detection abilities.
But what are these patterns? And what makes a Convolutional NN so good at detecting them?
These patterns are the way we perceive an object. A rope in the dark is perceived as a snake, but why? Because a snake and a rope have a similar structure, and the absence of light makes them hard to distinguish. But that's not the point; the point is how we perceive objects through their structure.
Just as we consider a rope's structure, we look for structures in other things too. A round object is a ball; that ball could be a football, a volleyball, a baseball, etc.
This is how a normal picture is seen by a Convolutional NN.
A Convolutional NN that is deep enough will recognize these patterns. A pattern can be a shape such as a circle, triangle, or square, based on which the CNN detects objects. For example, a bird's legs have a structure that can be broken down into simpler shapes.
A CNN has 4 types of layers that help it recognize these patterns and make it such a fantastic development:
1. Convolution Layer
2. Pooling Layer
3. Flattening Layer
4. Dense Layer
If you look up the meaning of "convolute" on the internet, you will find it means "to make complex". So what, then, is a convolution layer?
In an Artificial NN, there is an input layer, then a set of hidden layers, and finally the output layer. In a Convolutional NN, convolution layers are added between the input and those hidden layers, making the process longer and more complex, hence the name Convolutional.
The first layer of a Convolutional NN has filters that we apply to our data through a sliding window.
The filter is usually a square matrix, say 2*2, initialized with values known as weights, which work upon the data. We move the filter over the image like a sliding window; at each position, the window covers a patch of the image with the same dimensions as the filter. We take the dot product of the filter and that patch (multiply them element-wise and sum the results) and store it in another matrix called the output matrix.
The output of convolving a 3D filter with a color image is a 2D matrix.
The depth of a color image is 3 (one channel each for R, G, and B), while that of a greyscale image is 1.
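For a single-channel (greyscale) image, the sliding-window dot product described above can be sketched in NumPy. The function name and the sample values are illustrative, with stride 1 and no padding assumed:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a filter over an image, taking a dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The patch of the image currently under the sliding window
            window = image[i:i + kh, j:j + kw]
            # Element-wise product, then sum: the "dot product" of the two matrices
            output[i, j] = np.sum(window * kernel)
    return output

image = np.array([[1, 2, 3, 0],
                  [4, 5, 6, 1],
                  [7, 8, 9, 2],
                  [1, 0, 1, 3]], dtype=float)
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)  # a 2*2 filter of weights
print(convolve2d(image, kernel))  # a 3*3 output matrix
```

Note how a 4*4 image and a 2*2 filter give a 3*3 output: the window can only sit at 3 positions along each side.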
A convolution layer is followed by an activation, which is always a non-linear activation function; a linear function does not support as much learning as we require. Usually this activation is ReLU, which passes positive values through unchanged and outputs 0 for all negative values.
The Pooling Layer
In an Artificial NN, every node is connected to every other node, which means a great many parameters to learn. The pooling layer helps us reduce these learnable parameters and boost learning time with the help of downsampling. To do so we need to pass two hyper-parameters:
Dimensions of a Spatial Extent:
The value of n for which we take an n*n feature representation and map it to a single value.
Stride:
The number of pixels the sliding window skips along the height and width is called the stride. If you have larger strides, you skip more pixels between every two pools, resulting in smaller output volumes.
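The relationship between input size, spatial extent, and stride can be written as a small helper (the function name is illustrative; no padding is assumed):

```python
def pooled_size(input_size, extent, stride):
    """Output size along one dimension after pooling with an n*n extent
    moved by the given stride (no padding)."""
    return (input_size - extent) // stride + 1

# A 28-pixel-wide feature map pooled with a 2*2 extent and stride 2:
print(pooled_size(28, 2, 2))  # 14
# The same map with stride 1 barely shrinks:
print(pooled_size(28, 2, 1))  # 27
```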
We have two functions for the output of a pooling layer:
a. Max Filter: Returns the maximum value among the features in the region.
b. Average Filter: Returns the average of the values in the region.
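Both filters can be sketched in NumPy with one function; the `mode` switch and the sample feature map below are illustrative:

```python
import numpy as np

def pool2d(feature_map, extent=2, stride=2, mode="max"):
    """Downsample a 2D feature map with a max or average filter (no padding)."""
    out_h = (feature_map.shape[0] - extent) // stride + 1
    out_w = (feature_map.shape[1] - extent) // stride + 1
    output = np.zeros((out_h, out_w))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            # The n*n region mapped to a single value
            region = feature_map[i * stride:i * stride + extent,
                                 j * stride:j * stride + extent]
            output[i, j] = reduce_fn(region)
    return output

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [4, 8, 3, 5]], dtype=float)
print(pool2d(fm, mode="max"))      # [[6. 4.] [8. 9.]]
print(pool2d(fm, mode="average"))  # [[3.75 2.25] [5.25 4.25]]
```

A 4*4 map with a 2*2 extent and stride 2 shrinks to 2*2, a quarter of the original values.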
The depth of the image remains unchanged after pooling. Pooling also reduces the chances of over-fitting, as there are fewer parameters.
Padding becomes an important concept here. Padding adds an extra layer of values (usually zeros) along the sides of the image. Without it, interior columns and rows are visited many times as the window slides, while those along the edges are visited far fewer times, so edge information is under-represented. With a layer of zero padding along the edges, the original edge columns and rows sit one step inside the image, so they are visited more often and their information is preserved; it no longer matters that the padded rows and columns themselves are visited less, since they carry no relevant information.
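Adding one layer of zero padding is a single NumPy call; the 2*2 image here is just for illustration:

```python
import numpy as np

image = np.array([[1, 2],
                  [3, 4]])
# Add one layer of zeros along every side of the image
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
```

The original edge values (1, 2, 3, 4) now sit one step inside the padded image.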
Convolution and pooling happen together, and before we reach the next layers, they have usually been repeated many times.
Now, the output of a pooling layer is a 3D feature map because, as noted above, the depth has remained unchanged all along. This output becomes the input to the next set of hidden layers. But the fully connected layer at the end needs its input to be 1D, and that is what the flattening layer does: it converts a 3D feature map into a 1D vector.
For example, an output of 32 filters, each 13 pixels high and 13 pixels wide, can be converted as:
13*13*32 → 5408*1
This is a single vector. From here, every neuron is connected to every neuron of the next layer, and this fully connected layer leads to the output layer.
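The 13*13*32 → 5408 conversion above is just a reshape (the random values stand in for real feature maps):

```python
import numpy as np

# 32 feature maps, each 13 pixels high and 13 pixels wide
feature_maps = np.random.rand(13, 13, 32)
flat = feature_maps.reshape(-1)  # collapse all three dimensions into one
print(flat.shape)  # (5408,)
```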
The output layer
The output layer should have as many neurons as there are unique outputs. For example, if we have 2 classes, a cat and a dog, we can have 2 output neurons. An image with both a cat and a dog in it will activate both, with their respective weights. If the answer has to be a simple yes or no, then even a single neuron works, as shown in the figure.
However, if we want several animals to be distinguished, then we need more output neurons. If a zoo has 50 unique animals, we need 50 neurons in the output layer, just like the figure down below, whose outputs range over sunsets and various animals.
This Convolutional NN has hidden layers in it, with a 'softmax' activation on the output layer.
A CNN looks at pixels in context; it can learn patterns and objects and recognize them even when they appear in different positions in the image.
So now your network is also capable of distinguishing between a black tie and red shoes. You can even take pictures of real objects and ask the CNN to name them.
In the end, a CNN may look like this one: a series of convolution and pooling layers results in a matrix that can be flattened to get the best predictions.
The accuracy and precision might be a little tricky, but it will give you results. Always remember to reshape your images to match your CNN's input shape to avoid errors.
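One possible way to put all four layer types together is a Keras sketch like the one below. The layer sizes, the 64*64 RGB input, and the 50-animal zoo output are illustrative assumptions, not a definitive architecture:

```python
# A minimal Keras sketch of the pipeline described above:
# convolution + ReLU, pooling, flattening, then dense layers with softmax.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                        # 3D feature maps -> 1D vector
    layers.Dense(128, activation="relu"),
    layers.Dense(50, activation="softmax"),  # one neuron per unique animal
])
model.summary()
```

Stacking convolution and pooling twice before flattening mirrors the "repeated many times" pattern described above; deeper networks simply repeat the pair more often.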