Twenty or thirty years ago, who could have imagined self-driving cars or face detection? Computer vision was still a niche research field. Now it is one of the industry's hottest topics, with seemingly endless real-world applications. The technology lets computers perceive the world in a way loosely comparable to how humans do, cutting down on long hours of tedious and dirty work.
Convolution, loosely speaking, means transforming an input into something different. Here, that means turning an image into a map of the features it contains.
A CNN has two parts:
Feature Extractor - This is where convolution occurs. It extracts features (we will explain how in a minute), reduces the dimensions, flattens the result, and passes it on to the classifier. The role of convolution is to reduce the image to a form that is easier to process without losing the features that are critical for a good prediction.
Classifier - This is basically an ANN (artificial neural network). It takes the flattened input and classifies it using its learned weights and biases.
Understanding Input Images:
An image is made up of one or more channels. The most common layout is RGB, with 3 channels. There are other color representations too, such as CMYK (4 channels), grayscale (1 channel), and HSV (3 channels). The image below is RGB, with three channels: red, green, and blue.
Now take a higher-resolution image, such as HD at 1280*720 pixels with 3 channels per pixel. The total input dimension becomes 1280*720*3 = 2,764,800. Feeding that many values straight into a fully connected network is madness; you can say goodbye to your deadline and your precious time.
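To make the arithmetic above concrete, here is a minimal sketch in Python with NumPy. The blank image and the hidden layer of 1000 units are assumptions made up for illustration; they do not come from any particular model.

```python
import numpy as np

# A blank HD RGB image: height x width x channels (values are placeholders)
image = np.zeros((720, 1280, 3), dtype=np.uint8)

# Total number of raw input values: 1280 * 720 * 3
input_size = image.size
print(input_size)  # 2764800

# Hypothetical: connect every input value to a dense layer of 1000 units
hidden_units = 1000  # assumed size, for illustration only
dense_weights = input_size * hidden_units
print(dense_weights)  # 2764800000
```

Even one modest fully connected layer on the raw pixels would need billions of weights, which is why shrinking the input first matters so much.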
So a CNN is useful because it not only reduces the dimensions of the image considerably but also retains its important features and emphasizes them.
Filters, Strides and Padding:
Filters are small matrices of weights, each tuned to detect one particular pattern in an image. Their size is called the kernel size. A filter slides across the input image looking for its pattern and stores the results in a matrix called an activation map. Each filter produces one activation map.
A filter (or kernel) slides across the whole image. Its movement is set by the stride, that is, how many pixels it shifts at each step. A small stride gives a larger activation map and more computation, while a large stride shrinks the output faster but may skip over fine details. A suitable stride depends on the kernel size and the input size of the image.
In the figure below, K is a kernel of size 3x3. The filter slides across image I and stores a match score for each position in the activation map (for a binary image and kernel, this is the number of matching pixels). Here the filter has a stride of 1, meaning it shifts 1 pixel at a time along the columns and, when it reaches the end of a row, moves down by 1 pixel.
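The sliding operation described above can be sketched in a few lines of NumPy. This is a simplified, unpadded version for a single-channel image; the 5x5 image `I` and the 3x3 kernel `K` are made-up values chosen for illustration, not necessarily the ones in the figure, and `convolve2d` is a hypothetical helper name.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the activation map."""
    k_rows, k_cols = kernel.shape
    out_rows = (image.shape[0] - k_rows) // stride + 1
    out_cols = (image.shape[1] - k_cols) // stride + 1
    activation_map = np.zeros((out_rows, out_cols))
    for r in range(out_rows):
        for c in range(out_cols):
            top, left = r * stride, c * stride
            patch = image[top:top + k_rows, left:left + k_cols]
            # Multiply element-wise, then sum: one entry of the activation map
            activation_map[r, c] = np.sum(patch * kernel)
    return activation_map

# Made-up binary image and kernel, for illustration only
I = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]])
K = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

result = convolve2d(I, K, stride=1)
print(result)
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]

# A stride of 2 skips every other position and gives a smaller map
result_stride2 = convolve2d(I, K, stride=2)
print(result_stride2.shape)  # (2, 2)
```

Deep learning libraries do the same thing far more efficiently and across many channels and filters at once; the explicit loops here are only meant to show the mechanics.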
Notice that as the kernel slides across the image, the edge pixels are covered fewer times than the pixels in the middle. As a result, features located near the edge of the image are under-represented, and the model may give ambiguous results for them. To avoid this, we pad the edges of the image with zero-valued pixels. This is called padding. The gif below should make it all clear.
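As a rough sketch of zero padding, the snippet below pads a made-up 5x5 image with one ring of zeros using `np.pad`, and uses the standard output-size formula floor((n + 2p - k) / s) + 1 to show how padding affects the activation map. The helper name `output_size` is an assumption for illustration.

```python
import numpy as np

def output_size(n, k, stride=1, padding=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

# Made-up 5x5 image with values 1..25, for illustration only
image = np.arange(1, 26).reshape(5, 5)

# Add one ring of zero-valued pixels around the image
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (7, 7)

# 3x3 kernel, stride 1: without padding the map shrinks, with padding it does not
size_no_padding = output_size(5, 3, stride=1, padding=0)
size_with_padding = output_size(5, 3, stride=1, padding=1)
print(size_no_padding)    # 3
print(size_with_padding)  # 5
```

With one pixel of padding, a 3x3 kernel at stride 1 keeps the output the same size as the input, and the original edge pixels now get covered as often as the interior ones.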
Like the convolution layer, pooling also reduces the dimensions of the image and extracts its dominant features. There are two common types of pooling: max pooling and average pooling. Of the two, max pooling is more prevalent, since it tends to suppress noise better than average pooling, though both have their uses in certain cases.
Next, the pooled image is flattened so it can be fed to the classification layer (ANN).
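Here is a minimal sketch of both pooling types on a single feature map, assuming a 2x2 window with a stride of 2. The feature map values and the helper name `pool2d` are made up for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce each size x size window to its max or its mean."""
    reducer = np.max if mode == "max" else np.mean
    out_rows = (feature_map.shape[0] - size) // stride + 1
    out_cols = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_rows, out_cols))
    for r in range(out_rows):
        for c in range(out_cols):
            top, left = r * stride, c * stride
            window = feature_map[top:top + size, left:left + size]
            pooled[r, c] = reducer(window)
    return pooled

# Made-up 4x4 feature map, for illustration only
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]])

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[6. 4.] [8. 9.]]
print(avg_pooled)  # [[3.75 2.25] [4.5  4.  ]]
```

Both versions shrink the 4x4 map to 2x2. Max pooling keeps only the strongest response in each window, while average pooling blends all four values together.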
Classification Layer / ANN:
The flattened image is then fed to the ANN. The number of nodes, the hidden layers, the loss function, and the callbacks are configured when the model is built. Backpropagation runs in every epoch, updating the weights and biases. Activation functions such as sigmoid and softmax are applied. After enough epochs, the model will be able to distinguish and classify images. Evaluate the quality of the model and TADAA!! You have your first CNN ready.
I am also sharing a link to the Jupyter notebook for this article. Fork it, download it, and get hands-on with CNNs by tweaking the various parameters. The notebook includes many more details that will cement what you learned about CNNs in this article.
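To show what the classifier does with the flattened input, here is a small NumPy sketch of a single forward pass: flatten, one dense layer, then softmax. The pooled values, the random untrained weights, and the choice of 3 classes are all assumptions for illustration. There is no training here; backpropagation would be what adjusts `weights` and `bias` over the epochs.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

rng = np.random.default_rng(0)

# Stand-in for a pooled feature map (random values, illustration only)
pooled = rng.random((2, 2))

# Flatten the 2x2 map into a vector of 4 values
flattened = pooled.flatten()

# One dense layer with random, untrained weights (assumed: 3 classes)
num_classes = 3
weights = rng.normal(size=(flattened.size, num_classes))
bias = np.zeros(num_classes)

logits = flattened @ weights + bias
probabilities = softmax(logits)
predicted_class = int(np.argmax(probabilities))

print(probabilities)
print(predicted_class)
```

In practice a framework such as Keras or PyTorch handles these steps, plus the loss function and the weight updates, so treat this only as a picture of the data flow.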