Gradient descent is an optimization method that helps us find the combination of weights for a network that minimizes the output error. It's how we turn the dials on our network and fine-tune it until we are satisfied with its output. An easy way to approach gradient descent is simply by unpacking its name.
Let’s start there. Gradient essentially means slope. Descent means to descend, or go down. Therefore, gradient descent has something to do with descending down a slope. This now likely raises some questions:
What are we descending?
What is the slope?
How fast will we descend?
With gradient descent, we descend down the slope of gradients to find the lowest point, which is where the error is smallest. Remember, the gradients are the errors of the individual weights, and the goal of a network is to reduce these errors. By descending down the gradients we are actively trying to reduce the cost function and reach the global minimum. Our steps are determined by the steepness of the slope (the gradient itself) and the learning rate.
The learning rate is a value that speeds up or slows down how quickly an algorithm learns. Technically, it's a hyperparameter set before training that determines the size of the step an algorithm takes when moving towards a minimum.
The procedure of gradient descent starts off with randomly picking the parameters, or coefficients, of the function. The next step is to calculate the cost from the difference between the actual and the predicted values. Then we calculate the derivative of the cost. The derivative gives the slope of the function, which also tells us the direction to move the coefficient values in order to get a lower cost in the next iteration.
Suppose we have a ball on the upper edge of a pond and if we push it a little, what would happen?
Yes, it would roll down to the bottom of the pond, obviously due to gravity. In other words, the ball tends to roll towards the "minimum" point of the pond. Now suppose we have a function, F(x) = x^2 + 1, and we are provided with some x values as well. All we need to do is put these values into the function to get the corresponding values for y, and the plot ends up looking something like the picture below.
Now what? Gravity won't show its power on our plot. How do we get it to the minimum point?
To do this, we need to know 2 things-
The direction of the slope (the gradient).
By how many units we need to roll (the learning rate).
One approach to maximizing a function is to pick a random starting point, compute the gradient, take a small step in the direction of the gradient (i.e., the direction that causes the function to increase the most), and repeat with the new starting point. Similarly, you can try to minimize a function by taking small steps in the opposite direction, as shown in the picture.
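The procedure above can be sketched in a few lines of code. This is a minimal illustration, assuming the function F(x) = x^2 + 1 from the pond example; the starting point and learning rate are arbitrary illustrative choices.

```python
# Gradient descent on F(x) = x^2 + 1, whose minimum is at x = 0.

def grad(x):
    return 2 * x          # derivative of x^2 + 1

x = 5.0                   # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * grad(x)   # small step OPPOSITE the gradient

# x is now very close to 0, where F attains its minimum value of 1
```

Each iteration shrinks x by a constant factor (1 − 2 · learning_rate), so the "ball" rolls toward the bottom quickly.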
Do you see a potential problem though? The above is a simple representation that does not take into consideration the complexity of real-life neural networks. In practice, neural networks have hundreds if not thousands of weights which completely change the landscape of the simple graph above.
In fact, what if the pond surface the ball is descending is not smooth, but has many local minima where it can become stuck, falsely believing it has found the global minimum? As shown in the picture below:
To help mitigate the problem an optimization algorithm is typically used in conjunction with backpropagation. A popular example is Momentum, which is used to help push the algorithm out of a local minimum. It essentially adds a boost to the direction the network is moving in.
Need for gradient descent
Essentially, training is a search for the set of parameters that will cause the model to have the lowest error on a training set. If we had an infinite amount of computational resources, we would try every possible combination of weights to determine the one that gave the lowest error during training. Sounds pretty easy, eh?
But since we do not have unlimited computing resources, what we need is some sort of shortcut to avoid examining every possible combination (of weights in the case of neural networks, or of slope and intercept when finding the best-fit line). Training methods use clever techniques to avoid performing this brute-force search.
In machine learning, gradient descent helps identify the parameters that give the best-fitting line for a dataset by minimizing the cost as we move towards the global minimum.
In deep learning, we adjust weights via backpropagation and use gradient descent to find the best combination of weights for our neural network, as it would be near impossible to manually try each and every possible combination of weights.
Problems that can arise while using GD:
Gradient Descent is a sound technique that works most of the time. There are a few cases where gradient descent can fail to work properly. This can happen due to the following reasons:
Gradient descent is only guaranteed to find the global minimum for well-defined convex optimization problems.
The saddle point problem. This is a point where the gradient is zero but which is not an optimal point. We don't have a specific way to avoid such points, and this is still an active area of research.
Some problems can arise if the execution is not done properly, such as the vanishing gradient or exploding gradient problems. These occur when the gradient is too small or too large, and because of this the algorithm does not converge.
Choosing the right learning rate:
The learning rate defines the step size you take to get to the bottom, or the minimum point. You might need to take a large step when you are far away from the minimum, but when you are very close to it, a large step could overshoot and miss it entirely. So choosing the right learning rate is essential.
Choosing the right learning rate is of paramount importance when coding the learning part of a neural network.
Choose too small a rate, and the algorithm may become so slow that you will not be able to find the minimum in a reasonable amount of time (or a number of iterations).
Choose too big a rate, and the method may just bounce around the minimum, without ever reaching it.
A typical sign of a learning rate that is too big is that the cost function may become nan (“not a number,” in Python slang). Printing the cost function at regular intervals during the training process is a good way of checking for such problems. This gives you a chance to stop the process and avoid wasting time (in case you see nan appearing).
In deep-learning problems, each iteration will cost time, and you will have to perform this process several times. Choosing the right learning rate is a key part of designing a good model because it will make training much faster (or make it impossible).
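This monitoring idea can be sketched as follows. This example is not from the article: the objective J(x) = x² and the two learning rates are illustrative choices, used to show a run that converges and a run that blows up.

```python
import math

def cost(x):
    return x * x   # use * rather than ** so overflow yields inf, not an exception

def grad(x):
    return 2 * x

def train(learning_rate, steps=300):
    x = 5.0
    for step in range(steps):
        x -= learning_rate * grad(x)
        if not math.isfinite(cost(x)):   # inf or nan: stop, don't waste time
            return step, cost(x)
    return steps, cost(x)

steps_ok, final_ok = train(0.1)   # small enough rate: cost shrinks steadily
steps_bad, final_bad = train(10)  # far too big: x bounces outward and blows up
```

With the small rate the loop runs to completion and the cost ends up tiny; with the large rate each step overshoots the minimum by more than the last, and the finiteness check catches the divergence long before the step budget is exhausted.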
Variants of Gradient Descent
There are three main variants of gradient descent. These three differ in how much data we use to compute the gradient of the objective function.
Batch Gradient Descent
Stochastic Gradient Descent
Mini-Batch Gradient Descent
Batch Gradient Descent:
Batch gradient descent is also known as vanilla gradient descent, which is just a fancy term for standard gradient descent without any bells or whistles, and it is one of the simplest forms of gradient descent. It computes the gradient of the cost function w.r.t. the parameters (θ) for the entire dataset. In simple terms, all of the training data is taken into consideration in order to take a single step.
θ = θ − η · ∇θJ(θ)
Batch gradient descent can be very slow, as it calculates the gradients for the whole dataset for just one update; hence it is intractable for datasets that are too large to fit into memory. Also, it does not allow updating the model online. However, batch gradient descent works exceptionally well for convex or relatively smooth error manifolds.
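A sketch of batch gradient descent for simple linear regression follows; every update uses the gradient of the mean squared error averaged over the entire dataset. The data and hyperparameters here are illustrative choices, not from the article.

```python
import random

random.seed(0)
xs = [i / 10 for i in range(100)]                    # inputs 0.0 .. 9.9
ys = [3 * x + 2 + random.gauss(0, 0.1) for x in xs]  # true slope 3, intercept 2

w, b = 0.0, 0.0   # the parameters θ
eta = 0.01        # learning rate η
n = len(xs)
for _ in range(5000):
    # gradient of J(w, b) = mean((w*x + b - y)^2) over the WHOLE dataset
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= eta * grad_w   # a single step costs a pass over all 100 examples
    b -= eta * grad_b
```

After training, w and b land close to the true slope and intercept; note that every one of the 5000 updates required a full pass over the data, which is exactly the cost the text warns about for large datasets.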
Stochastic Gradient Descent:
In batch gradient descent, we were taking the whole dataset into consideration for each step. But what if the dataset is too large? We all know deep learning models perform better when they have data in bulk: the more information provided, the better the learning. Suppose we have 3 million examples; then to take one step, our model will have to calculate 3 million gradients, which definitely does not sound efficient. To deal with this problem, we have stochastic gradient descent.
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example x^(i) and label y^(i):
θ = θ − η · ∇θJ(θ; x^(i); y^(i))
In stochastic gradient descent, we use one example at a time to take a step. As we are considering one example at a time, the loss function will experience some fluctuations over the training examples and won’t necessarily decrease at every step, but eventually the cost will decrease, with fluctuations, as shown in the picture below.
One cool thing about stochastic gradient descent is that when we get new data, we can easily use it to take another step for the parameter estimates without having to start from scratch. That means we can use it to learn online as well, i.e. with new examples on the fly.
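The same regression problem can be sketched with SGD: each update uses one randomly chosen example, which is also why a freshly arriving example can be folded in on the fly. The data and hyperparameters are illustrative choices.

```python
import random

random.seed(1)
xs = [random.uniform(0, 1) for _ in range(200)]
ys = [4 * x + 1 for x in xs]          # noiseless data: slope 4, intercept 1

w, b = 0.0, 0.0
eta = 0.1
for _ in range(5000):
    i = random.randrange(len(xs))     # pick a SINGLE example x(i), y(i)
    err = w * xs[i] + b - ys[i]
    w -= eta * 2 * err * xs[i]        # gradient of that one example's loss
    b -= eta * 2 * err
```

Each step here is 200 times cheaper than a full-batch step on this dataset, at the price of the noisy, fluctuating trajectory described above.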
Mini-Batch Gradient Descent:
Mini-batch gradient descent takes the best of both stochastic and batch gradient descent. Stochastic gradient descent converges faster for larger datasets, but we cannot use a vectorized implementation with it, as it uses only one example at a time, which can result in slow computations. To tackle this, we use the best of both worlds. Formally, mini-batch gradient descent performs an update for every mini-batch of n training examples:
θ = θ − η · ∇θJ(θ; x^(i:i+n); y^(i:i+n))
In simpler terms, we do not use the whole dataset at once or a single example at a time. Rather, we use a batch of a fixed number of training examples, smaller than the actual dataset, called a mini-batch. Common mini-batch sizes range between 50 and 256 but can vary for different applications. With mini-batch gradient descent we update our parameters frequently and can also use a vectorized implementation for faster computations. The cost still fluctuates somewhat, similar to SGD, because we are averaging a small number of examples at a time.
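The mini-batch scheme can be sketched as follows: shuffle each epoch, then update once per batch of `batch_size` examples. The batch size of 50 sits at the low end of the common 50-256 range; the data and other values are illustrative choices.

```python
import random

random.seed(2)
data = [(i / 500, 2 * (i / 500) + 0.5) for i in range(500)]  # y = 2x + 0.5

w, b = 0.0, 0.0
eta, batch_size = 0.1, 50
for epoch in range(200):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # gradient averaged over just this mini-batch of 50 examples
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= eta * grad_w
        b -= eta * grad_b
```

Each epoch performs ten updates (500 examples / 50 per batch) rather than one (batch) or five hundred (SGD), which is exactly the compromise described above.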
Evolution of GD
SGD has trouble navigating areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. These areas are known as ravines. SGD will tend to oscillate across the narrow ravine, since the negative gradient points down one of the steep sides rather than along the ravine towards the optimum. Momentum helps accelerate gradients in the right direction. This is shown in the following pictures:
SGD without momentum.
SGD with momentum: Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations as shown in the picture.
This is done by adding a fraction γ of the update vector of the past time step to the current update vector.
v(t) = γv(t-1) + η∇θJ(θ)
θ = θ − v(t)
where η is the learning rate, t is the time step, and γ is usually set to 0.9 or a similar value.
The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation. Put another way, we are essentially computing an exponential moving average of the gradients to reduce the noise as we move towards the global minimum.
In simple words, to deal with the local minimum problem, the idea is to move a bit faster, with momentum and determination, so that if you get stuck in a local minimum you can power through and get over the hump to look for a lower minimum. Normal gradient descent just gets us to a local minimum; we want to get over the hump, but by then the gradient is zero or too small, so it won't give us a good step. If we take the average of the last few steps, it can push us over the hump. Now, a plain average can be a bit drastic, as the step we made 10 steps ago is much less relevant than the step we just made. So we attach a weight to each step, so that the previous step matters a lot while earlier steps matter less and less. That's what momentum is. Momentum is a constant γ (gamma) that ranges between 0 and 1 and weights the steps as follows:
The previous step gets multiplied by 1, the one before that by gamma, the one before that by gamma squared, and the one before that by gamma cubed. In this way, steps that took place a long time ago matter much less than recent ones. And that gets us over the hump.
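The momentum rule quoted above, v(t) = γ·v(t-1) + η·∇θJ(θ) followed by θ = θ − v(t), can be sketched on a simple one-dimensional objective. J(θ) = θ² is my illustrative choice, not the article's.

```python
def grad(theta):
    return 2 * theta          # ∇J for the illustrative J(θ) = θ²

theta, v = 5.0, 0.0
eta, gamma = 0.05, 0.9        # γ = 0.9 as suggested in the text
for _ in range(200):
    v = gamma * v + eta * grad(theta)   # velocity: EMA-weighted past steps
    theta = theta - v                   # step by the velocity, not the raw gradient
```

Expanding the recursion for v shows exactly the weighting described above: the latest gradient times 1, the one before times γ, then γ², γ³, and so on.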
Nesterov accelerated gradient
The standard momentum method computes the gradient at the current location first, and then takes a big jump in the direction of the updated accumulated gradient.
Nesterov accelerated gradient is inspired by the Nesterov method for optimizing convex functions. It states:
First, make a big jump in the direction of the previously accumulated gradient.
Then measure the gradient where you end up and make a correction.
While Momentum first computes the current gradient (small blue vector) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector), NAG first makes a big jump in the direction of the previously accumulated gradient (brown vector), measures the gradient and then makes a correction (green vector). This anticipatory update prevents us from going too fast and results in increased responsiveness.
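The jump-then-correct ordering can be sketched by evaluating the gradient at a look-ahead point. The objective J(θ) = θ² is the same illustrative choice as before, not from the article.

```python
def grad(theta):
    return 2 * theta

theta, v = 5.0, 0.0
eta, gamma = 0.05, 0.9
for _ in range(200):
    lookahead = theta - gamma * v           # the "big jump" along the old velocity
    v = gamma * v + eta * grad(lookahead)   # gradient measured where we landed
    theta = theta - v                       # apply the corrected update
```

The only difference from plain momentum is which point the gradient is measured at; that anticipatory peek is what damps the overshoot.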
Adagrad stands for "Adaptive Gradient Algorithm". Up until now, we adapted our updates to the slope of our error function via the learning rate and sped up SGD in turn; we would also like to adapt our updates to each individual parameter, performing larger or smaller updates depending on their importance.
So, all Adagrad does is that it adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. It is well-suited for dealing with sparse data. One of Adagrad’s main benefits is that it eliminates the need to manually tune the learning rate.
It was proposed in 2011, so it is relatively new. It dynamically varies the learning rate at each update, and for every weight individually. In the adaptive gradient algorithm, the change in a weight is given by the same expression as before, but in addition it is divided by the square root of G at time t plus epsilon:
wi = wi − (η / √(Gi(t) + ε)) · gi(t)
So what is G here? G is the adaptation term: Gi(t) = Gi(t−1) + gi(t)², where gi(t) is the gradient of weight i at epoch t.
We begin with G(0) = 0. At each step G increases, because we are adding two non-negative numbers, so G is a monotonically increasing function. Since we are dividing the learning rate η by a monotonically increasing function, η divided by it is obviously monotonically decreasing. The epsilon is some small number we need to add because if G were zero we would not be able to perform the division. This is Adagrad, and it is basically a smart adaptive learning rate scheduler. "Adaptive" stands for the fact that the effective learning rate is based on the training itself; it is not a preset learning schedule like an exponential one, where all the values are calculated in advance regardless of the training process. Another very important point is that the adaptation is per weight: every individual weight in the whole network keeps track of its own G function to normalize its own steps. This is an important observation, as different weights do not reach their optimal values simultaneously.
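For a single weight, the Adagrad rule above can be sketched directly: G accumulates squared gradients, and the effective step is η / √(G + ε). The objective J(θ) = θ² and the constants are illustrative choices.

```python
import math

def grad(theta):
    return 2 * theta

theta = 5.0
eta, eps = 1.0, 1e-8
G = 0.0                                     # G(0) = 0; it only ever grows
for _ in range(2000):
    g = grad(theta)
    G += g * g                              # G(t) = G(t-1) + g(t)^2
    theta -= eta / math.sqrt(G + eps) * g   # effective learning rate shrinks over time
```

In a real network, every weight would carry its own G accumulator, which is the per-weight adaptation the paragraph above emphasizes.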
Adadelta was derived from Adagrad in order to improve upon the two main drawbacks of the method:
1) the continual decay of learning rates throughout training, and
2) the need for a manually selected global learning rate.
AdaDelta is a variant of AdaGrad that keeps only the most recent history rather than accumulating it as AdaGrad does.
ADAM (a more recently developed updating technique from the University of Toronto) derives learning rates from estimates of the first and second moments of the gradients.
ADAM is short for adaptive moment estimation. This optimization technique computes an adaptive learning rate for each parameter. It tracks the momentum and the variance of the gradient of the loss and leverages their combined effect to update the weight parameters. Together, the momentum and variance help smooth the learning curve and effectively improve the learning process. The mathematical representation may be simplified as: Weights = Weights − (Momentum and Variance combined).
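For a single weight, the standard Adam update can be sketched as below, including the usual bias-correction terms that the simplified formula above folds into "Momentum and Variance combined". The constants (0.9, 0.999, 1e-8) are the commonly used defaults; the objective J(θ) = θ² is an illustrative choice.

```python
import math

def grad(theta):
    return 2 * theta

theta = 5.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m, v = 0.0, 0.0
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g        # first moment: the momentum part
    v = beta2 * v + (1 - beta2) * g * g    # second moment: the variance part
    m_hat = m / (1 - beta1 ** t)           # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)
```

The m term plays the role of momentum, v normalizes the step per parameter as in Adagrad-style methods, and the bias correction compensates for both moving averages starting at zero.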