Do you see the ape in the image? The ape is blind and, to make matters worse, famished. He wants to eat the banana placed at the lowest point of the curve, but he is not even facing the direction of the banana. How will he reach it?
The smart solution to the blind ape’s problem is to start walking in any direction. If he finds the ground sloping upward, he is moving away from the banana. If the ground slopes downward, he is heading in the right direction.
This is what gradient descent is: the act of moving downhill. But here’s a question: what should the ape’s step size be?
Since he is blind, he won’t see the banana. To find it, he may have to bump right into it.
If the ape takes steps that are too big, he might jump right over the bottom of the valley and land on the opposite slope. But if his step size is adequate, he might have food today.
In gradient descent, the size of each step is governed by a value called the learning rate, written ‘η’ (the formula below uses ‘α’ for the same quantity).
In an optimization algorithm, the learning rate is the step size taken at each iteration while moving towards the minimum of a loss/cost function.
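To make the step-size question concrete, here is a minimal sketch. The valley f(x) = x**2, the starting point, and the two learning rates are all assumptions picked for illustration; they do not come from the image.

```python
def descend(start, learning_rate, steps):
    """Walk downhill on the hypothetical valley f(x) = x**2, recording each position."""
    x = start
    path = [x]
    for _ in range(steps):
        slope = 2 * x                  # derivative of x**2 at the current position
        x = x - learning_rate * slope  # step against the slope, scaled by the learning rate
        path.append(x)
    return path


careful = descend(start=4.0, learning_rate=0.1, steps=25)
reckless = descend(start=4.0, learning_rate=1.1, steps=25)

print(abs(careful[-1]))   # small: the ape ends up close to the banana at x = 0
print(abs(reckless[-1]))  # large: each step jumps over the bottom and lands further away
```

With the larger learning rate every step overshoots the bottom of the valley by more than the previous one, which is the "jumping to the other side" problem described above.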
The curve here plays the role of the cost function: we are looking for the input value that gives the lowest output, which is where the banana sits. In simpler terms, we want the input that minimizes the cost function.
Let’s dig deeper and understand its formula.
The formula is
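$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{for } j = 0 \text{ and } j = 1.$$
Expanding the derivative for a straight-line hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ and the usual squared-error cost $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2$, it becomes,
$$\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)$$
$$\theta_1 := \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_i$$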
The θ0 and θ1 are updated simultaneously.
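The two update rules can be sketched in a few lines of Python. This is only an illustration: it assumes a straight-line hypothesis h(x) = θ0 + θ1·x, and the function name `fit_line`, the toy data, and the values of `alpha` and `iterations` are all made up for the example.

```python
def fit_line(xs, ys, alpha=0.05, iterations=2000):
    """Gradient descent for the assumed hypothesis h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # h(x_i) - y_i for every training example, using the current parameters
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old
        # values before either parameter is changed.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the fit should approach theta0 = 1, theta1 = 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = fit_line(xs, ys)
print(round(theta0, 3), round(theta1, 3))
```

Updating θ0 first and then using the new θ0 to compute the gradient for θ1 would be a different algorithm, which is why both gradients are computed before the tuple assignment.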
Have a look at the following image. What do you notice?
There is not one but many minima in this curve. Most of them are local minima, and there is only one global minimum. If your learning rate is small, there is a good chance that gradient descent will settle into whichever local minimum is nearest to its starting point and stay stuck there.
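Here is a small sketch of getting stuck. The curve f(x) = x**4 - 3*x**2 + x is an assumption chosen because it has one global minimum (near x = -1.30) and one local minimum (near x = 1.13); it is not the curve from the image.

```python
def f(x):
    # Hypothetical bumpy curve with two valleys
    return x**4 - 3 * x**2 + x


def slope(x):
    # Derivative of f
    return 4 * x**3 - 6 * x + 1


def descend(x, learning_rate=0.01, steps=1000):
    """Take small downhill steps from x and return the final position."""
    for _ in range(steps):
        x = x - learning_rate * slope(x)
    return x


left = descend(-2.0)   # starts on the left and rolls into the deeper valley
right = descend(2.0)   # starts on the right and settles in the shallower valley

print(left, f(left))
print(right, f(right))
```

Both runs use the same small learning rate; only the starting point differs, and the run that starts on the right never discovers that a deeper valley exists on the other side of the hill.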
Let’s first understand how to move from the position you are standing at.
Well, figure out the slope at your current position and then do one of the following:
Shift left if the slope is positive.
Shift right if the slope is negative.
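The two rules above collapse into a single line of code: subtract the slope (times the learning rate) from your position. In this sketch the curve and the helper names are assumptions, and the slope is estimated numerically, much like the blind ape feeling the ground on either side of his feet.

```python
def numeric_slope(f, x, h=1e-6):
    """Estimate the slope of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def step(f, x, learning_rate=0.1):
    """Move one step downhill from x."""
    s = numeric_slope(f, x)
    # Positive slope: subtracting it decreases x (shift left).
    # Negative slope: subtracting it increases x (shift right).
    return x - learning_rate * s


def curve(x):
    return (x - 3) ** 2   # hypothetical curve whose lowest point is at x = 3


print(step(curve, 5.0))   # slope is positive at x = 5, so the new position is left of 5
print(step(curve, 1.0))   # slope is negative at x = 1, so the new position is right of 1
```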
But this was a single input and single output problem.
GRADIENT DESCENT IN MULTIPLE INPUTS
Let’s step out of that narrow view and look at something closer to reality.
Do you see the image of the rabbit among the small hills? The rabbit has to reach the minimum value, that is, the lowest point of the valley. Everything is the same as before, with one small change. This time you won’t ask for the slope; instead, you’ll ask which direction to move in to reduce the cost function.
In multivariate calculus, the gradient of a function gives you the direction of steepest ascent. It simply means that it gives you the direction to step in so that the cost function increases most rapidly.
But as I said above, we want the cost function to decrease, so what use is this steepest ascent to us? Well, by taking the negative of the gradient, we step in the direction that reduces the cost function fastest. The length of the gradient tells you how steep the slope is at that point.
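As a rough illustration with two inputs, the sketch below uses a made-up bowl-shaped cost, x**2 + 3*y**2. The function names, the starting point, and the learning rate are assumptions for the example.

```python
import math


def cost(x, y):
    return x**2 + 3 * y**2   # hypothetical bowl, lowest at (0, 0)


def gradient(x, y):
    # Partial derivatives of the cost with respect to x and y
    return (2 * x, 6 * y)


def descend_step(x, y, learning_rate=0.1):
    """Step in the direction of the negative gradient."""
    gx, gy = gradient(x, y)
    return x - learning_rate * gx, y - learning_rate * gy


x, y = 2.0, 1.0
gx, gy = gradient(x, y)
steepness = math.hypot(gx, gy)    # length of the gradient vector
new_x, new_y = descend_step(x, y)

print((gx, gy), steepness)
print(cost(x, y), cost(new_x, new_y))   # the cost is lower after the step
```

The gradient itself points uphill; the minus signs in `descend_step` are what turn it into a downhill direction.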
The rabbit, however, lives in a three-variable setting where x and y are the inputs and z is the output, the height of the hill at that point. The negative gradient tells it which way to move in x and y.
What if there are more than 1 or 2 inputs, say 500? In that scenario, we take all the weights and biases attached to those inputs and arrange them in a single vector. In this gigantic input space, there is still a direction leading towards the minimum of the cost function.
The negative gradient of the cost function is also a vector. The cost function is actually the average of the cost over all the training data, so minimizing it means doing better on all the samples in the training data. This cost function is a layer of complexity on top of the inputs, outputs, and parameters.
While training on the dataset, the weights and biases are changed repeatedly so as to reach the lowest point in the least time.
Backpropagation is the algorithm used to compute this gradient efficiently.
This process of repeatedly nudging the parameters by some multiple of the negative gradient is called gradient descent. The gradient gives the steepness; gradient descent gives the downward direction.
At this point, we should know that we feed the network a training dataset to minimize the loss function, and then check the result on a test dataset.
TYPES OF GRADIENT DESCENT
In batch gradient descent, you give the whole training set, all rows of information at once, to the ANN (artificial neural network) to make sense of.
In stochastic gradient descent, you feed it a single row of information at a time.
Now, stochastic GD brings more accuracy but takes more time. So we came up with a middle method, mini-batch GD, in which we provide a batch that is much smaller than the one used in batch GD. This gives us almost the same accuracy in less time.
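In code, the three types differ only in how many rows are used for each update. The sketch below is an illustration on a toy straight-line model rather than an ANN; the function `train`, the data, and the values of `alpha` and `epochs` are assumptions.

```python
import random


def train(xs, ys, batch_size, alpha=0.1, epochs=1000, seed=0):
    """Fit y = w * x + b, making one update per batch of `batch_size` rows."""
    rng = random.Random(seed)
    rows = list(zip(xs, ys))
    w, b = 0.0, 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(rows)                       # visit the rows in a new order each epoch
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            errors = [(w * x + b) - y for x, y in batch]
            grad_w = sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w, b = w - alpha * grad_w, b - alpha * grad_b
            updates += 1
    return w, b, updates


# Toy data generated from y = 2x + 1
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2 * x + 1 for x in xs]

results = {}
for name, size in [("batch", len(xs)), ("stochastic", 1), ("mini-batch", 4)]:
    results[name] = train(xs, ys, batch_size=size)
    w, b, updates = results[name]
    print(name, round(w, 3), round(b, 3), updates)
```

Setting `batch_size` to the number of rows gives batch GD (one update per epoch), setting it to 1 gives stochastic GD (one update per row), and anything in between is mini-batch GD. On this noise-free toy data all three reach the same line, so it shows the mechanics only and says nothing about their relative accuracy on real data.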
HOW DOES LEARNING RATE AFFECT GD
Well, look at the image below. It might seem a little complicated even though it is well labeled.
Look at the yellow line in the middle that appears parallel to the x-axis. This shows the accuracy with a learning rate as big as 100, across varied epochs and batch sizes. In simple terms, the model learns nothing. The noted accuracy here was 10.9%. Even a small child would have done better.
See the blue line; it shows a learning rate of 0.001. The accuracy stays below 70%, which is again of no use.
Look more closely and you’ll find that 0.01 and 0.1 give good accuracy of about 90%, while a learning rate of 1 is no real help either.
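You can get a feel for this behaviour without a neural network. The sketch below tries the same five learning rates on a toy problem, f(x) = x**2, with a fixed budget of steps. The toy function, the starting point, and the step budget are assumptions, so the numbers will not match the accuracies in the image; only the pattern (too big fails, too small is slow) carries over.

```python
import math


def final_loss(learning_rate, start=1.0, steps=200):
    """Run gradient descent on f(x) = x**2 and return the loss after `steps` steps."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x   # 2 * x is the slope of x**2
        if not math.isfinite(x):
            return math.inf             # the steps grew without bound
    return x * x


for lr in [100, 1, 0.1, 0.01, 0.001]:
    print(lr, final_loss(lr))
```

On this toy problem, 100 blows up, 1 bounces between the two sides of the valley without getting any lower, 0.1 and 0.01 get close to the minimum, and 0.001 is still far from it when the step budget runs out.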
DIFFERENT VARIANTS OF GRADIENT DESCENT ALGORITHM
All these variants are just intelligent amendments made to the gradient descent algorithm to get better results in fewer iterations and thus less time. We will discuss all of them in the upcoming articles. Stay tuned till then!