Neurons are the reason our brain works. They underlie our reflexes, our activities (besides our will, obviously), and our fully functioning body. The human brain consists of about 100 billion neurons, making on the order of 100 trillion connections. Neural networks, also known as artificial neural networks, are a type of machine learning algorithm inspired by the biological brain and designed to recognize patterns. The goal is to solve problems in a way that is similar to the human brain.
Conventional software approaches a problem differently than a neural network does. Conventional software uses an algorithmic approach (following a set of instructions) to solve problems, whereas neural networks try to mimic how neurons in the human brain work. In fact, they learn by example instead of being programmed to perform a specific task. Let’s see what a neural network consists of:
As the name suggests, an input node contains the input of the network. The input is always numerical; if it is not numerical by default, it must be converted. For instance:
Images are often converted to grayscale, and each pixel is measured on a scale of 0 to 1 for intensity (where 0 is black and 1 is white).
Text that might be representing categories such as male/female or cat/dog might be converted to 1 and 0, respectively.
Sound can also be converted to numbers that represent amplitude across time, with zero when silent and 1 for loud.
The input node is located within the input layer, which is the first layer of the network. Each input node represents a single dimension and is called a feature, and all the features are stored within a vector.
For example, if we have an image of 28x28 pixels as input, we have a total of 784 input nodes, because each input node represents a single image pixel and there are 784 pixels in total (28x28).
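To make the 28x28 example concrete, here is a minimal sketch of turning an image into a feature vector. The pixel values are made up for illustration, and the assumption that raw intensities run from 0 to 255 is mine, not something the example above specifies.

```python
# Hypothetical 28x28 grayscale image with raw intensities in 0..255 (made-up data).
height, width = 28, 28
image = [[(row * width + col) % 256 for col in range(width)] for row in range(height)]


def image_to_features(pixels, max_value=255):
    """Flatten a 2-D pixel grid row by row and scale each value into [0, 1]."""
    return [value / max_value for row in pixels for value in row]


features = image_to_features(image)
print(len(features))  # 784, one input node per pixel
```

Each entry of `features` would feed exactly one input node.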
The layer of nodes between the input and output layers is known as the hidden layer. A network can have a single hidden layer or several, and the more there are, the “deeper” the learning the network can perform. That is what the “deep” in deep learning means: a larger number of hidden layers.
A hidden node is a node within a hidden layer. There is no rule of thumb for deciding how many nodes a hidden layer must have; a hidden layer can have numerous hidden nodes. Studies have demonstrated that layers containing the same number of nodes generally perform the same as or better than a pyramid-shaped network with decreasing or increasing layer sizes.
An output node is a node within an output layer. There can be one or several output nodes depending on the objective of the network. For instance, if your model is distinguishing between a cat and a dog, i.e. binary classification, a single neuron with a 0 or 1 output can suffice; if the network is classifying handwritten digits from 0 to 9, there would be 10 output nodes. Similar to the input of a network, the output is a vector.
Weights on connections in a neural network are coefficients that scale (amplify or diminish) the input signal to a given neuron in the network. In common representations of neural networks, these are the lines/arrows going from one point to another, the edges of the mathematical graph. Often, weights are notated as w in mathematical representations of neural networks.
A bias node is an extra node added to each hidden and output layer, and it connects to every node within its layer. Biases are scalar values added to a node's input to ensure that at least a few nodes per layer are activated regardless of signal strength, so the network has some activity to learn from even when the signal is low. The bias node itself generally has a constant value of 1 or -1. In effect, a bias shifts the activation function left or right. Biases are notated as b, and, just like weights, they are modified throughout the learning process.
Two mathematical functions are used, and both are applied inside every hidden and output node.
In a neural network, the summation operator sums all the inputs a neuron receives to calculate the net input of the node. Each node in a hidden or output layer has multiple inputs from the previous layer, each scaled by its weight, and these need to be summed to calculate the node's net input; that is exactly what the summation operator does. The computation is usually expressed with matrices, and its output is known as a dot product or inner product.
The summation operator looks like this when it comes to calculating the net input of a node.
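In plain terms, the net input of a node is w1*x1 + w2*x2 + ... + wn*xn + b. The sketch below computes it for a single node; the input, weight, and bias values are invented for illustration, and the bias is treated as one scalar added to the sum.

```python
def net_input(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias: sum(w_i * x_i) + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


x = [0.5, 0.0, 1.0]   # outputs of the previous layer (made-up values)
w = [0.4, -0.2, 0.1]  # one weight per incoming connection
b = 0.05              # bias of this node

z = net_input(x, w, b)
print(z)  # roughly 0.35
```

The value `z` is what gets handed to the activation function.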
Activations are the values passed on to the next layer from each previous layer. These values are the output of the activation function of each artificial neuron. The functions that govern the artificial neuron’s behavior are called activation functions. An activation function takes the output of the summation operator and transforms it into the final output of the node.
An activation function basically squeezes the input and transforms it into an output value that represents how much a node should fire. When an artificial neuron passes on a nonzero value to another artificial neuron, it is said to be activated.
Types of activation functions:
There are plenty of activation functions to choose from. A network might make use of different activation functions at different layers because the functions differ in output and have respective strengths and weaknesses.
The picture shows the different activation functions.
With the linear function, the input is the output: no transformation occurs, and f(x) is simply x. The function just passes the input through unchanged and hence is used mostly in the input layer.
The sigmoid function can reduce the effect of outliers in the data without removing them. It squashes its input to create output between 0 and 1.
The equation for the logistic/sigmoid activation function is shown in the picture.
Tanh is the hyperbolic tangent function. It is very similar to the sigmoid function, except that it squeezes the output into the range -1 to 1, so it handles negative numbers naturally. The nonlinear shape of Tanh and sigmoid is what enables networks that use them to solve non-linear problems.
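As a quick illustration of the two output ranges, here is a small sketch of both functions using only the standard library. The sigmoid is written in a numerically careful form (my choice, to avoid overflow for very negative inputs); the sample inputs are arbitrary.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^-z), squashing z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow when z is a large negative number.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def tanh(z):
    """Hyperbolic tangent, squashing z into (-1, 1)."""
    return math.tanh(z)


for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), round(tanh(z), 3))
```

Note how a negative input gives a sigmoid output close to 0 but a Tanh output close to -1.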
Soft-max is the function you will often find at the output layer of a classifier. The soft-max activation function returns a probability distribution over mutually exclusive output classes. For instance, if we have a multi-class modelling problem yet care only about the best score across the classes, we’d use a soft-max output layer with an argmax() function to pick the class with the highest score.
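A minimal sketch of soft-max followed by argmax, in plain Python. The raw scores are invented, and subtracting the largest score before exponentiating is a common stability trick that I have added; it does not change the result.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs for three classes
probs = softmax(scores)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax: index of the top class

print(probs, best)
```

The class with the largest raw score always ends up with the largest probability, so argmax picks the same class either way; soft-max just makes the scores readable as probabilities.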
Rectified Linear Unit:
The rectified linear unit activates a node only if the input is above a certain threshold. In other words, all input that is <= 0 (zero) is set to 0 (zero) and all input that is > 0 (zero) is passed through unchanged.
Since around 2015, the ReLU function has gained popularity because of its phenomenal performance within deep neural networks, especially for image recognition with convolutional neural networks (CNNs). ReLU activation functions have been shown to train better in practice than sigmoid activation functions.
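The rule is short enough to write in one line. A small sketch, with arbitrary sample inputs:

```python
def relu(z):
    """Rectified linear unit: 0 for z <= 0, z itself for z > 0."""
    return max(0.0, z)


outputs = [relu(z) for z in (-2.0, 0.0, 3.5)]
print(outputs)  # [0.0, 0.0, 3.5]
```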
Logistic and Tanh functions are also widely used. There are mainly two reasons for that:
Neural networks are also often used to solve problems that cannot be solved by separating classes with a straight line, i.e. non-linear problems. An example would be image classification.
For example, with the image on the left, there are two groups that are being classified. These groups are easily separated with a straight line, and thus can be solved linearly.
Activation functions such as sigmoid and Tanh essentially “break” the linearity of a network and enable it to solve more complex problems.
You might be wondering why we need an activation function at all to compute the combined output (o) when we could just pass the value of (o) as the final output. Well, there can be a bit of a problem with that. First of all, the output can range from -Infinity to +Infinity, and we won't have any clear way of defining the threshold value where activation should happen.
Secondly, the network would be close to useless, as it won’t really learn anything interesting. This is where a bit of calculus and derivatives come into the picture. Training with the backpropagation algorithm gives the network feedback about its errors in classification and lets each neuron adjust its weights by using the derivative of the activation function. If the activation function is linear (no activation), that derivative is a constant that no longer depends on the input, so the feedback says nothing about where the neuron is operating. More importantly, a stack of linear layers is itself just one linear function, so there is no point in a deep neural network: the output of n layers could be produced by a single layer. Hence, we always need a nonlinear activation function (at least in all hidden layers) to get the network to learn properly.
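The claim that many linear layers are no better than one can be checked directly. The sketch below, with made-up 2x2 weight matrices and biases left out for brevity, pushes an input through two linear layers and then through a single layer whose weights are the product of the two.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


W1 = [[1.0, 2.0], [0.0, -1.0]]  # first linear layer (hypothetical weights)
W2 = [[0.5, 0.0], [1.0, 1.0]]   # second linear layer (hypothetical weights)
x = [3.0, 4.0]

two_layers = matvec(W2, matvec(W1, x))  # layer 1, then layer 2, no activation
one_layer = matvec(matmul(W2, W1), x)   # a single layer with combined weights

print(two_layers, one_layer)  # [5.5, 7.0] [5.5, 7.0]
```

Both routes give the same output, which is why a nonlinear activation between the layers is needed for depth to add anything.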
The Loss Function:
The loss function is a measure that quantifies how right our network is or, stated in terms of loss, how wrong our model is. The error is the difference between a network’s actual output and its target output, i.e. between the predicted and real values respectively.
To understand the loss function, let's assume you took several mock tests for physics. How would you measure your performance on each test? Obviously, by the test score. Assume you scored 36, 60, 88, 90, and 98 out of 100 in five consecutive tests. The improving test scores clearly indicate that you are performing well. Had the test scores been decreasing, it would indicate that your performance was dropping and that you would need to change your studying methods to improve.
Similarly, a neural network measures its improvement using a loss function. The loss is a mathematical function that is used once per forward propagation, when it is applied to the output of the neural network. There are several standard loss functions defined in ML and DL based on the type of the outcome of data. For regression (where the output is continuous rather than discrete) some popular loss functions are:
Mean Squared error:
The name does justice to this error function. It takes the squared output errors of a network and averages them, hence the name mean squared error.
The equation looks like this:
First of all, we calculate the local error of each output node i.e. the difference between a single training example’s actual output and target output. If there are multiple output layer nodes, this means there are multiple actual and target outputs. Hence, the local error is calculated for each node and the results are summed to create the final local error for a single training example.
Next, we square each local error. Squaring is beneficial because the difference is treated the same whether it is positive or negative. This is important because it keeps errors of different signs from canceling each other out, which in turn helps the search for a minimum.
Also, squaring helps the network converge faster. Large errors produce large derivatives, which lets the network take large steps toward the minimum; small errors produce small derivatives, so the network takes smaller steps as it gets close to the minimum.
In the next step, we sum the local errors of every training example. Finally, multiplying by 1/n normalizes the sum, which transforms the error into a common frame of reference that we can understand and work with.
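Written out, the steps above amount to MSE = (1/n) * sum of (target - prediction)^2. Here is a minimal sketch with one output per training example; the target and prediction values are invented.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    n = len(targets)
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n


targets = [3.0, -0.5, 2.0, 7.0]      # made-up target outputs
predictions = [2.5, 0.0, 2.0, 8.0]   # made-up network outputs

mse = mean_squared_error(targets, predictions)
print(mse)  # 0.375
```

Notice that the error of 1.0 on the last example contributes four times as much as each error of 0.5, which is the "large errors are emphasized" effect.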
Mean Absolute Error:
This is the average absolute error between actual and predicted outputs: it simply averages the absolute error over the entire dataset.
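For comparison with the squared version, a small sketch of the mean absolute error on invented values:

```python
def mean_absolute_error(targets, predictions):
    """Average of the absolute differences between targets and predictions."""
    n = len(targets)
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / n


targets = [3.0, -0.5, 2.0, 7.0]      # made-up target outputs
predictions = [2.5, 0.0, 2.0, 8.0]   # made-up network outputs

mae = mean_absolute_error(targets, predictions)
print(mae)  # 0.5
```

Here every unit of error counts the same, so large errors are not weighted more heavily than small ones.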
For categorical outputs, the predictions would be for a class. For example, will you graduate this year (1) or not (0)? Will you get a medal at your graduation ceremony (1) or not (0)? Some cases have multiple classes, for example classifying whether someone has Type 1, Type 2, or Type 3 diabetes, or whether a picture shows a dog, a cat, or some other animal.
In these cases, we cannot use the loss functions we used for regression, for obvious reasons. So, we need to quantify the outcome of the class as a probability and then define the loss based on those probability estimates as predictions. Some of the loss functions are:
Binary cross-entropy defines the loss when the categorical outcome is a binary variable like (Pass/Fail) or (Yes/No). With y the true label (0 or 1) and p the predicted probability that the label is 1, the mathematical form would be
Loss = − [ y * log(p) + (1−y) * log(1−p) ]
Categorical cross-entropy defines the loss when the categorical outcomes are non-binary, that is, >2 possible outcomes: (Yes/No/Maybe) or (Type 1/ Type 2/… Type n).
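Both cross-entropy losses can be sketched in a few lines. The binary version follows the formula given above; the multi-class version assumes the true class is given as a one-hot list and uses the common form -sum(t * log(p)), which is my assumption since no formula is given for it here. The small `eps` clip is also my addition, to avoid taking log(0).

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss = -[y * log(p) + (1 - y) * log(1 - p)] for one example."""
    p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss = -sum(t * log(p)) over the classes for one example."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# A confident correct prediction is penalised less than a confident wrong one.
print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.1))

# Three classes, true class is the second one (made-up probabilities).
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```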
The goal of our network is to learn and reach a minimum error, i.e. to reach the global minimum. To do so we need to take several steps, but how do we know at what rate we should move towards the minimum? If we move too fast, we may never get to the minimum and instead keep diverging away from it. If we move too slowly, we might never get there, or it might take impractically long. The learning rate is a value that speeds up or slows down how quickly an algorithm learns. Technically, it determines the size of the step an algorithm takes when moving towards a minimum. In practice, the global minimum is rarely reached, and the algorithm settles for a local minimum that is close to the global one.
A learning rate can be static and remain consistent throughout the learning process, or it can be programmed to scale down as the network’s error rate falls. There are multiple theories on how to select a proper learning rate, but in most cases selection will depend on trial and error.
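The effect of the step size is easy to see on a toy problem. This sketch minimises the made-up function f(w) = (w - 3)^2 with a static learning rate; the three rates are arbitrary picks meant to show "too slow", "reasonable", and "too fast".

```python
def gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)  # step against the gradient
    return w


slow = gradient_descent(0.01)      # creeps towards 3 but is still far away
good = gradient_descent(0.1)       # gets close to the minimum at w = 3
diverging = gradient_descent(1.1)  # overshoots further on every step

print(slow, good, diverging)
```

With the largest rate each step lands farther from the minimum than the previous one, which is the diverging behaviour described above.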
The optimizer is the most important part of model training. The starting point of a network is a defined structure with randomized weights. The predictions made by the model at this time would almost always be of no value. The network takes one training sample's values as inputs to the neurons in the first layer, which then produces an output with the mentioned activation function. The output now becomes an input for the next layer, and so on. The prediction for the training sample would be the output of the final layer.
This is where the loss function comes into the picture. The loss function helps the network in determining how well (or poorly) the chosen set of weights has performed on the training sample. Reducing the loss is the next step for the model. But how exactly does it know what steps or updates it should perform on the weights to reduce the loss?
The optimizer function helps it understand this step. The optimizer function is a mathematical algorithm that uses derivatives, partial derivatives, and the chain rule in calculus to understand how much change the network will see in the loss function by making a small change in the weight of the neurons. The change in the loss function, which would be an increase or decrease, helps in determining the direction of the change required in the weight of the connection.
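To see what "using derivatives to decide the direction of the change" means, here is a sketch for the simplest possible case: one weight, prediction = w * x, and a squared-error loss. The numbers and the step size of 0.1 are invented; the finite-difference check is only there to confirm the chain-rule result.

```python
def loss(w, x, y):
    """Squared error of a single linear neuron: (w * x - y)^2."""
    return (w * x - y) ** 2


def loss_gradient(w, x, y):
    """Chain rule: dL/dw = dL/dpred * dpred/dw = 2 * (w * x - y) * x."""
    return 2 * (w * x - y) * x


w, x, y = 0.5, 2.0, 3.0  # made-up weight, input, and target

grad = loss_gradient(w, x, y)

# Numerical check: nudge w slightly and see how much the loss changes.
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

new_w = w - 0.1 * grad  # move the weight against the gradient

print(grad, numeric, loss(w, x, y), loss(new_w, x, y))
```

The gradient is negative here, so the update increases the weight, and the loss after the update is smaller than before.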
There are many popular optimizers that can be used for different DL models. They include:
Stochastic Gradient descent
Adam, which stands for Adaptive Moment Estimation, is by far the most popular and widely used optimizer in DL. In most cases, you can blindly choose Adam optimizer and forget about the optimization alternatives.
While developing and training ML and DL models, you will often witness a scenario where the trained model seems to perform well on the training dataset but fails to perform similarly on the test dataset. In data science, this phenomenon is called “overfitting.” In the case of overfitting, your model fits the training data too literally. To understand this phenomenon let's look at an example from a great book for deep learning:
Say a five-year-old loves to eat cakes baked by his mother. He demands cakes to be baked every day at home. His mother politely denies these demands, but assures him that she will bake cakes on certain occasions. The little boy now looks forward to each new day, hoping that it will be one of those occasions when his mother will bake a cake. His mother, on the other hand, had no real intention to find occasions to bake cakes. She would simply bake a cake every Sunday when she had time off from work. The five-year-old continues to watch every day and slowly learns that his mom will bake a cake on every Sunday. So, he learns the following pattern: “If day == Sunday, then Mother will bake cakes.” One fine Sunday, his mother had to travel for an errand and was left with no time to bake a cake. The five-year-old couldn’t understand his pattern breaking down. So, to accommodate the new event, he modified his rules by formulating the new pattern as follows: “If day == Sunday, then Mother will bake a cake, but if the day is in the last week of the month, then no cake.” In reality, the Sunday his mother missed baking the cake was noise. He should have ideally ignored that and kept his previously learned pattern intact. But unfortunately, he failed to distinguish between signal and noise and thereby over-complicated his rules and over-fit the data.
This is a great example for understanding what overfitting is. Similarly, a deep learning model overfits the data when it learns from the noise and adjusts its weights to accommodate the noise. That is why you would observe a large gap between the training accuracy and the validation accuracy of a model.
Regularisation helps in reducing the overfitting problem. The weights of the neuron edges are updated after each iteration. When the model encounters a sample with outliers and assumes it is a valid one, it tries to fit that sample by readjusting the weights. As the name suggests, outliers are very different from the regular data points, and hence the weight updates would be huge too. Regularisation handles this problem by adding a penalty based on the weights of the edges to the loss function, so that large weights show up as a higher loss. The network, in turn, tunes itself to reduce the loss and updates the weights to move in the right direction.
Regularisation can be represented as:
Cost Function = Loss + Hyperparameter × [Weights]
The hyperparameter is λ / 2m and the value of λ is defined by the user.
The types of Regularisation are:
In L1 regularization, the absolute values of the weights are added to the loss function. To generalize the model, the values of some weights are reduced to 0. For this reason, this method is preferred when we are trying to compress the model for faster computation.
In L2 regularization, the squared weights are added to the loss function. For generalization, the values of the weights are reduced to near 0 (but not actually 0), and therefore this is also known as the “weight decay” method. In most cases, to reduce overfitting, L2 is highly recommended over L1.
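The two penalties differ only in what is summed. In this sketch the weights and the data loss are made up, and the single `strength` argument stands in for the whole hyperparameter in front of the weight term.

```python
def l1_penalty(weights, strength):
    """L1: strength times the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


weights = [0.5, -1.5, 2.0]  # hypothetical connection weights
data_loss = 0.4             # hypothetical loss before regularisation
strength = 0.01             # hypothetical regularisation hyperparameter

cost_l1 = data_loss + l1_penalty(weights, strength)
cost_l2 = data_loss + l2_penalty(weights, strength)

print(cost_l1, cost_l2)  # roughly 0.44 and 0.465
```

Because L2 squares each weight, the largest weight (2.0) dominates its penalty, so the pressure to shrink falls mostly on big weights.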
Besides L1 and L2, there is another effective technique to reduce overfitting, known as dropout regularization. Using this method, the model randomly deactivates a few neurons in a layer during each iteration. Hence, in each iteration, the model looks at a slightly different structure of itself to optimize. In a normal architecture, each neuron passes its output on to the neurons of the next layer. With dropout, each iteration introduces a certain degree of randomness by deactivating a few neurons. The process is efficient due to the reduced computation and also works intuitively in reducing overfitting.
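A rough sketch of dropout on one layer's activations. It uses the "inverted dropout" variant, where surviving activations are scaled up by 1 / keep probability; that scaling is a common convention I am assuming, not something described above. The activations and the seed are made up.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob; scale the survivors."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)    # fixed seed so the run is repeatable
activations = [1.0] * 10  # hypothetical outputs of a hidden layer

dropped = dropout(activations, drop_prob=0.5, rng=rng)
print(dropped)  # a mix of 0.0 (deactivated) and 2.0 (kept and rescaled)
```

Calling `dropout` again gives a different mask, which is the "slightly different structure in each iteration" idea. Dropout is normally switched off when making predictions.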
Let's dive into some basics now:
After pre-processing the data, we would need to first design the structure of the deep neural network. We define the number and types of layers, the number of neurons in each layer, the activation function and the optimizer to use.
In practice, the overall structure is developed using the model object in Keras.
    from keras.models import Sequential
    from keras.layers import Dense, Activation

Sequential is the easiest way of defining the model, as it allows easy creation of a linear stack of layers.

    model = Sequential()
    model.add(Dense(6, input_dim=15))
    model.add(Activation('relu'))

In the example above, the layer would have 6 neurons, would receive input with 15 features, and would be activated with the ReLU activation function.
Epoch: It represents one iteration over the entire dataset.
Batch: We cannot pass the entire dataset into a neural network at once, which is why the dataset is divided into a number of batches.
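The relationship between the two terms can be sketched as follows: one epoch is one full pass over all the batches. The ten-sample dataset and the batch size of 4 are invented for illustration.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive chunks of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))  # stand-in for ten training examples
batches = list(iterate_batches(samples, batch_size=4))

print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last batch is smaller because 10 is not a multiple of 4; going through all three batches once is one epoch.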
The data is split into the following chunks:
Training dataset - used to train the neural network.
Validation dataset - used during training to detect and minimize overfitting.
Testing dataset - used as a final test to gauge how accurately the network has been trained.
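One simple way to produce the three chunks is to shuffle and slice. The 70/15/15 proportions and the seed below are arbitrary choices of mine, not a recommendation from the text.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of the samples and cut it into train/validation/test."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left over
    return train, val, test


train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling first matters when the data is stored in some order (for example, sorted by class), so that each chunk sees a similar mix of examples.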
The life cycle of a model:
    def simple_nn():
        # initialize model
        model = Sequential()
        # add an input layer and a hidden layer
        model.add(Dense(units=100, activation='sigmoid', input_dim=num_input_nodes))
        # add output layer
        model.add(Dense(units=num_classes, activation='softmax'))
        return model

    # defining a model
    model = simple_nn()
    # Compiling the model; the string 'sgd' selects Keras's default stochastic gradient descent optimizer
    model.compile(optimizer='sgd', loss='mse', metrics=['accuracy'])
Train the model:
Here we give our model the training data, the validation split ratio, and the number of epochs.

    model_info = model.fit(train_features, train_labels, batch_size=64,
                           epochs=10, verbose=2, validation_split=0.2)

The next step is to evaluate the model by plotting its accuracy and loss or by building a confusion matrix, and finally to make predictions using the model.
These are the components of an artificial neural network, and learning about them is an interesting experience. Do not be afraid of the maths involved, because while learning neural networks you'll realise how alluring maths really is (your one chance to use algebra for real-life problems).