Neural networks are designed to act like a human brain. A neural network tries to simulate the functions of interconnected neurons: data enters through an input layer, which can be seen as the sensory organ that receives the information. That information is passed to the neurons in the hidden layer. Each neuron gives importance to certain input nodes based on their weights, and finally an output is produced from the information built up by the neurons. In an artificial neural network (ANN), various components are tuned to improve the model. Each component has its own influence, and a better result may come at the cost of a longer training time. So what are the components, and where do we need to tune them? Let’s find out.
Input layer
The input layer is composed of nodes that bring in the initial data after pre-processing. The data could be on any subject matter, depending on the problem, but the values must always be numerical. If they are not, we have to convert them to numerical form using pre-processing techniques.
Input nodes are nothing but the features of the data we have. For a 'Salary of employees' dataset, say, the features could be employee name, gender, salary, age, and experience. We keep the important features and drop the irrelevant ones. The number of input nodes is equal to the number of features we keep.
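As a rough illustration of how features become input nodes, the sketch below pre-processes a tiny made-up salary dataset. The records, the decision to drop the name column, and the use of one-hot encoding and standardization are all assumptions chosen for the example, not a prescribed recipe.

```python
import numpy as np

# Hypothetical records: (name, gender, salary, age, years of experience)
records = [
    ("Asha", "F", 52000, 29, 4),
    ("Ben", "M", 88000, 41, 15),
    ("Chen", "M", 67000, 35, 9),
    ("Dana", "F", 120000, 52, 25),
]

# Drop the name column: it identifies a person but is not a useful feature.
genders = [r[1] for r in records]
numeric = np.array([[r[2], r[3], r[4]] for r in records], dtype=float)

# One-hot encode the categorical feature so that every value is numerical.
categories = sorted(set(genders))  # ['F', 'M']
one_hot = np.array([[float(g == c) for c in categories] for g in genders])

# Standardize each numeric column to zero mean and unit variance.
scaled = (numeric - numeric.mean(axis=0)) / numeric.std(axis=0)

# Each column of X feeds one input node.
X = np.hstack([one_hot, scaled])
n_input_nodes = X.shape[1]
print(X.shape, n_input_nodes)
```

Here the four kept features turn into five input nodes, because the one categorical feature expands into two one-hot columns.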
The hidden layers are the ones that reside between the input layer and the output layer. A hidden layer takes the weighted inputs and produces an output with the help of an activation function. This is where the actual learning takes place. Each hidden node works loosely like a biological neuron.
The output layer is the last layer of a neural network. It can have a single node or multiple nodes, depending upon the problem.
The activation function defines the output of a node based on the input provided. Each neuron in a neural network performs two operations: summation and activation.
Summation is the weighted sum of the inputs, i.e. the dot product of the weight vector and the input vector, usually with a bias term added.
Activation is the transformation applied to the value produced by the summation. The result of the activation is the output of the neuron.
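The two steps can be sketched for a single neuron as follows. The input values, weights, and bias are arbitrary numbers picked for illustration, and sigmoid is just one possible choice of activation.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # three input values (hypothetical)
w = np.array([0.4, 0.3, -0.2])   # one weight per input (hypothetical)
b = 0.1                          # bias term (hypothetical)

z = np.dot(w, x) + b   # summation: weighted sum of the inputs plus bias
a = sigmoid(z)         # activation: transform the summed value

print(z, a)
```

For a whole layer the same idea applies with a weight matrix in place of the weight vector, so the summation becomes a matrix product.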
The activation function to use depends upon the problem. For binary classification we use the sigmoid activation function in the output layer, and for multi-class classification we use the softmax function. For regression the output node is usually left linear, while ReLU is a common choice for the hidden layers. Now let's discuss these functions.
Linear activation- It represents a linear change from input to output. It is rarely used in hidden layers. Its gradient is a constant that does not depend on the input, and a stack of linear layers collapses into a single linear transformation, so adding more layers does not let the network learn anything more complex.
Non-linear activation- This type of activation changes the input in a non-linear fashion. It is widely used in deep learning models. Different types of activation are used in different cases. Its types are:
Sigmoid function- It is also known as the logistic function. It maps the input value (x) into the range from 0 to 1, irrespective of how large or small the input value is. It is usually used for binary classification.
Tanh function- The hyperbolic tangent function is quite similar to the sigmoid function; the only difference is that it maps the input values into the range from -1 to 1 rather than 0 to 1, so its outputs are centered around zero.
ReLU function- It stands for 'Rectified Linear Unit'. It is calculated as max(0, x), where x is the input value. Any value below 0 becomes 0, and any value above 0 is passed through as it is. ReLU is very flexible when dealing with non-linear data.
Softmax activation- It is similar to the sigmoid function in that it is used for classification. We usually use it for multi-class classification. The difference between sigmoid and softmax is that in softmax the class probabilities sum up to 1.
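The four non-linear activations described above can be written in a few lines of NumPy. This is a minimal sketch for building intuition, and the sample `scores` vector is made up; deep learning libraries ship their own tuned versions.

```python
import numpy as np

def sigmoid(x):
    """Map any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Map any real value into (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Replace negative values with 0 and keep positive values as they are."""
    return np.maximum(0.0, x)

def softmax(x):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = x - np.max(x)   # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([-2.0, 0.0, 3.0])   # hypothetical raw outputs
print(sigmoid(scores))
print(tanh(scores))
print(relu(scores))
print(softmax(scores))
```

Note that sigmoid treats each score independently, whereas softmax normalizes across the whole vector, which is why only the softmax outputs are guaranteed to sum to 1.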
The loss function is a numerical measure of how wrong our predictions are. A bad prediction means a greater loss and vice versa. It is computed from the difference between the actual output and the predicted output. So let’s discuss some of the loss functions used by neural networks:
Mean Squared Error (MSE)
This function calculates the mean of the squares of all the error values, i.e. the differences between the true and predicted values.
Mean Absolute Error (MAE)
This function is quite similar to MSE; the only difference is that in mean absolute error we take the absolute value (modulus) of each error rather than its square. It is more robust to outliers. It is used less often than MSE because its gradient keeps the same magnitude even for small errors, which can make it harder to converge precisely.
Cross-Entropy Loss (log loss)
This function measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. A perfect model would have a log loss of 0.
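The three losses can be sketched directly from their definitions. The sample arrays are made up, only the binary form of cross-entropy is shown, and the `eps` clipping is an assumption added to avoid taking log(0).

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss for labels in {0, 1} and predicted probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)   # keep log() finite
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Regression example (hypothetical values): errors are 1, 0 and -2
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))

# Classification example (hypothetical values)
labels = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.8])   # close to the labels
bad = np.array([0.2, 0.9, 0.3])    # far from the labels
print(binary_cross_entropy(labels, good), binary_cross_entropy(labels, bad))
```

The error of 2 contributes 4 to the squared sum but only 2 to the absolute sum, which is why MSE reacts more strongly to outliers than MAE.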
Deep neural networks are usually trained with the stochastic gradient descent algorithm. It is an optimization algorithm that estimates the error gradient from a sample of the training data. The gradient is computed using backpropagation and is then used to update the weights. The amount by which the weights are updated during training is controlled by the step size, or learning rate. The learning rate hyperparameter controls the rate or speed at which the model learns.
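To make the update rule concrete, here is a minimal sketch of mini-batch stochastic gradient descent on a single linear neuron. The synthetic data (y = 2x + 1), the batch size, and the learning rate are assumptions for the demo. With only one neuron the gradient can be written by hand; in a deeper network, backpropagation computes the same kind of gradient layer by layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data from y = 2x + 1 (assumed for illustration)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.1      # step size
batch_size = 10

for epoch in range(50):
    order = rng.permutation(len(x))              # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]    # one mini-batch
        error = (w * x[idx] + b) - y[idx]        # prediction minus target
        grad_w = 2.0 * np.mean(error * x[idx])   # d(MSE)/dw on this batch
        grad_b = 2.0 * np.mean(error)            # d(MSE)/db on this batch
        w -= learning_rate * grad_w              # step against the gradient
        b -= learning_rate * grad_b

print(w, b)
```

Each update uses only a small random batch, so the gradient is an estimate of the full-dataset gradient rather than its exact value.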
Effects of the learning rate
A large learning rate allows the model to learn faster, at the risk of overshooting and settling on a suboptimal solution. A smaller learning rate can let the model reach a more optimal solution, but it takes longer for the model to train. So basically it’s a trade-off between accuracy and the time taken for the model to train.
Usually, the learning rate is set below 1 so that the weight updates don’t overshoot, which avoids the divergence problem. We can also use a learning rate schedule, in which we vary the learning rate during training rather than keeping a fixed value.
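The effect of the step size is easy to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w², a function chosen only for illustration; it has a single minimum at 0, so it demonstrates speed and divergence but not local minima. The three learning rates are arbitrary.

```python
def gradient_descent(learning_rate, steps=20, w=5.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, starting from w."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

small = gradient_descent(0.01)     # moves toward 0, but slowly
moderate = gradient_descent(0.3)   # gets very close to 0 in 20 steps
large = gradient_descent(1.1)      # every step overshoots further: divergence

print(small, moderate, large)
```

For this function each step multiplies w by (1 - 2 × learning rate), so any learning rate above 1 makes the magnitude of w grow instead of shrink.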
Problem with a fixed learning rate- With a fixed learning rate, training can stall at a saddle point when the loss depends on multiple dimensions. A saddle point is a point that is a maximum along one dimension but a minimum along another.
As a solution, there is the concept of the cyclical learning rate, proposed by Leslie N. Smith, who states: "Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations."
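One simple way to vary the learning rate between two bounds is a triangular schedule, sketched below. The function name `triangular_lr` and the default values for `base_lr`, `max_lr`, and `step_size` are illustrative assumptions; suitable bounds depend on the model and data.

```python
import math

def triangular_lr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Learning rate that rises linearly from base_lr to max_lr over
    step_size iterations, falls back over the next step_size, and repeats."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

for it in (0, 1000, 2000, 3000, 4000):
    print(it, triangular_lr(it))
```

With these settings one full cycle takes 4000 iterations: the rate starts at the lower bound, peaks at iteration 2000, and returns to the lower bound at iteration 4000.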
During the training process, we make changes in the parameters (weights) of our model to try and minimize that loss function in order to make our predictions as correct as possible. But how exactly do you do that? How do you change the parameters of your model, by how much, and when?
Optimizers shape the model into the most accurate form possible by tuning the weights. The loss function is the guide to the terrain, telling the optimizer whether it is moving in the right or wrong direction. The optimizer cannot know where to start, so it starts with random weight values. While the loss decreases it keeps moving in that direction, and when the loss starts increasing it adjusts the weights back accordingly. Some of the optimizers are:
Stochastic Gradient descent
Adam is one of the most widely used optimizers: its default hyperparameters usually work well with little tuning, and it tends to converge quickly.
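For intuition, the Adam update can be sketched in NumPy as below. The helper name `adam_minimize`, the toy quadratic objective, and the learning rate of 0.1 are assumptions for the demo; `beta1`, `beta2`, and `eps` are set to commonly used defaults. In practice you would rely on the optimizer provided by your deep learning library.

```python
import numpy as np

def adam_minimize(grad_fn, w, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run Adam updates on w using gradients returned by grad_fn."""
    m = np.zeros_like(w)   # running average of the gradients
    v = np.zeros_like(w)   # running average of the squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)   # correct the bias toward zero
        v_hat = v / (1.0 - beta2 ** t)
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective (assumed): f(w) = sum((w - target)**2), minimized at target
target = np.array([3.0, -2.0])

def grad_fn(w):
    return 2.0 * (w - target)

w_final = adam_minimize(grad_fn, np.zeros(2))
print(w_final)
```

Dividing by the square root of the squared-gradient average gives each weight its own effective step size, which is the main difference from plain stochastic gradient descent.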
These are the primary components of an artificial neural network. We can adjust them depending upon the data we are working with and the model requirements. There are many more elements that can be tweaked and tuned to improve the ANN model.