Exploring the Optimizers and Hyperparameters of Neural Networks
Training a neural network to get good predictions requires tweaking a lot of hyperparameters such as optimizers, activation functions, the number of hidden layers, the number of neurons in each layer, the number of epochs, and the learning rate. Let's go through each of them one by one and discuss them in detail.
Gradient Descent Optimizers
In an earlier section titled Perceptron Training Process in TensorFlow, we briefly touched upon the gradient descent optimizer without going into the details of how it works. This is a good time to explore the gradient descent optimizer in a little more detail. We will provide an intuitive explanation without going into the mathematical details.
The gradient descent optimizer's function is to minimize the loss or error. To understand how gradient descent works, you can think of this analogy: imagine a person at the top of a hill who wants to reach the bottom. At the beginning of the training, the loss is large, like the height of the hill's peak. The optimizer works like the person walking downhill toward the lowest point of the valley, without climbing up the hill on the other side.
Remember the learning rate parameter that we used while creating the optimizer? That can be compared to the size of the steps the person takes to climb down the hill. If these steps are large, it is fine at the beginning since the person can climb down faster, but once they near the bottom, if the steps are too large, the person crosses over to the other side of the valley. Then, in order to climb back down to the bottom of the valley, the person will try to move back but will move over to the other side again. This results in going back and forth without reaching the bottom of the valley.
On the other hand, if the person takes very small steps (a very small learning rate), they will take forever to reach the bottom of the valley; in other words, the model will take forever to converge. So, finding a learning rate that is neither too small nor too big is very important. Unfortunately, there is no rule that tells us in advance what the right value should be; we have to find it by trial and error.
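The effect of the step size can be seen in a minimal sketch. It is not taken from the earlier exercises: it assumes a toy loss f(x) = x**2 with its minimum at x = 0, and the function name descend, the starting point, and the four learning rates are chosen only for illustration.

```python
def descend(learning_rate, steps=20, start=10.0):
    """Run plain gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                    # derivative of x**2
        x = x - learning_rate * gradient    # one step downhill
    return x

too_small = descend(0.001)   # barely moves away from the starting point
reasonable = descend(0.1)    # ends close to the minimum at x = 0
too_large = descend(1.0)     # jumps from one side of the valley to the other
diverging = descend(1.1)     # every step lands farther from the minimum

for name, value in [("too small", too_small), ("reasonable", reasonable),
                    ("too large", too_large), ("diverging", diverging)]:
    print(f"{name:>10}: final x = {value:.4f}")
```

With a learning rate of 1.0, the person in the analogy crosses the valley on every step and never settles at the bottom, while with 0.001 they have hardly left the peak after 20 steps.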
There are two main types of gradient-based optimizers: batch and stochastic gradient descent. Before we jump into them, let's recall that one epoch means a training iteration where the neural network goes through all the training examples:
- When we calculate the loss across all the training examples in an epoch and only then take a single step, it is called batch gradient descent. This is also known as full batch gradient descent. To put it simply, after going through the full batch, we take one step to adjust the weights and biases of the network to reduce the loss and improve the predictions. There is a similar form called mini-batch gradient descent, where we take a step, that is, we adjust the weights and biases, after going through each subset (mini-batch) of the full dataset.
- In contrast to batch gradient descent, when we take a step after every single training example, we have stochastic gradient descent (SGD). The word stochastic tells us there is randomness involved here, which, in this case, is the training example that is randomly selected at each step.
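The three variants differ only in how many examples are seen before each step, which the following sketch tries to make concrete. It assumes a one-parameter model y = w * x and a small noise-free dataset generated from y = 3x; the function train, the dataset, and the learning rate are hypothetical and used only for illustration.

```python
import random

def train(data, batch_size, epochs=50, learning_rate=0.5, seed=0):
    """Fit y = w * x by gradient descent; return (w, number_of_updates)."""
    rng = random.Random(seed)
    data = list(data)
    w, updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)                      # random order: the stochastic part
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            gradient = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * gradient      # one step per batch
            updates += 1
    return w, updates

# Toy dataset generated from y = 3x (hypothetical, for illustration only).
dataset = [(x / 10, 3 * x / 10) for x in range(1, 11)]

w_batch, n_batch = train(dataset, batch_size=len(dataset))  # full batch
w_mini, n_mini = train(dataset, batch_size=5)               # mini-batch
w_sgd, n_sgd = train(dataset, batch_size=1)                 # stochastic

print(n_batch, n_mini, n_sgd)   # steps taken over the same number of epochs
```

Over the same 50 epochs, full batch gradient descent takes one step per epoch, mini-batch takes one step per subset, and SGD takes one step per example.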
Though SGD works relatively well, there are advanced optimizers that can speed up the training process. They include SGD with momentum, Adagrad, and Adam.
The Vanishing Gradient Problem
In the Training a Perceptron section, we learned about the forward and backward propagation of neural networks. After a neural network performs forward propagation, the error is calculated by comparing the prediction with the true label, and backpropagation is performed to determine which parameters (the weights and biases) contributed to the error and to what extent. The error gradient is propagated from the output layer back toward the input layer to calculate the gradient with respect to each parameter, and in the last step, the gradient descent step adjusts the weights and biases according to the calculated gradients. As the error gradient is propagated backward, it is multiplied at every layer by the derivative of that layer's activation function, so the gradients can become smaller and smaller as they reach the lower (initial) layers. This means that the updates to the weights and biases of those layers become smaller and smaller, the early layers learn very slowly, and the network struggles to reach a good minimum of the loss. This is called the vanishing gradient problem. It is most pronounced with saturating activation functions such as the sigmoid (logistic) function, whose derivative is never larger than 0.25. For this reason, we commonly use the ReLU activation function, whose derivative is 1 for positive inputs, to train deep neural network models; this mitigates the problem and improves the results.
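A rough numerical sketch shows why the choice of activation function matters here. It is a deliberate simplification rather than real backpropagation: it assumes all weights equal 1 and that every layer sees the same pre-activation value z, so that only the effect of the activation function's derivative is visible. The function names are hypothetical.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)            # reaches its maximum of 0.25 at z = 0

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0    # exactly 1 for an active unit

def gradient_scale(derivative, depth, z=0.5):
    """Multiply one activation derivative per layer, as the chain rule does."""
    scale = 1.0
    for _ in range(depth):
        scale *= derivative(z)
    return scale

for depth in (1, 5, 10, 20):
    print(depth,
          gradient_scale(sigmoid_derivative, depth),
          gradient_scale(relu_derivative, depth))
```

Under these assumptions, the sigmoid factor shrinks geometrically with depth, while the ReLU factor stays at 1 for units that are active.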
Hyperparameter Tuning
Like any other model training process in machine learning, it is possible to perform hyperparameter tuning to improve the performance of the neural network model. One of the hyperparameters is the learning rate. The others are as follows:
- Number of epochs: Increasing the number of epochs generally increases the training accuracy and lowers the training loss, although training for too many epochs can lead to overfitting
- Number of layers: Increasing the number of layers can increase the accuracy, as we saw in the exercises with MNIST
- Number of neurons per layer: Increasing this can also increase the accuracy
Once again, there is no way to know in advance what the right number of layers or the right number of neurons per layer is. This has to be figured out by trial and error. Note that the larger the number of layers and the larger the number of neurons per layer, the greater the computational power required. Therefore, we start with a small network and slowly increase the number of layers and neurons.
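One way to get a feel for the computational cost is to count the trainable parameters of a fully connected network. The sketch below does this for three hypothetical architectures; the layer sizes are assumptions chosen for illustration, with 784 inputs (a flattened 28x28 image, as in MNIST) and 10 output classes.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs + outputs   # weights plus one bias per unit
    return total

small = count_parameters([784, 50, 10])
medium = count_parameters([784, 300, 100, 10])
large = count_parameters([784, 300, 200, 100, 10])

print(small, medium, large)   # 39760 266610 316810
```

Every one of these parameters is updated at each gradient descent step, which is one reason to start small and grow the network only when the results call for it.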
Overfitting and Dropout
Neural networks with complex architectures and too many parameters tend to fit on all the data points, including noisy labels, leading to the problem of overfitting and neural networks that are not able to generalize well on unseen datasets. To tackle this issue, there is a technique called dropout:
In this technique, a fraction of the neurons in a layer is deactivated randomly during the training process. The fraction to deactivate is provided as a parameter, the dropout rate, with a value between 0 and 1. For example, Dropout = .2 means that each neuron in that layer has a 20% chance of being deactivated at each training step, so roughly 20% of the neurons are deactivated at a time. A different random set of neurons is deactivated at each training step (that is, for each batch), so the same neuron may be deactivated many times over the course of training. During testing, however, all the neurons are activated.
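The mechanism can be sketched in a few lines of plain Python. This is an illustrative sketch, not the Keras implementation: the function dropout and its arguments are hypothetical, and it assumes the common "inverted dropout" variant, in which the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at test time.

```python
import random

def dropout(activations, rate, training, rng):
    """Apply inverted dropout to a list of activations."""
    if not training or rate == 0.0:
        return list(activations)          # testing: every neuron stays active
    keep = 1.0 - rate
    output = []
    for value in activations:
        if rng.random() < rate:
            output.append(0.0)            # neuron deactivated for this step
        else:
            output.append(value / keep)   # rescale the surviving neurons
    return output

rng = random.Random(42)
layer_output = [1.0] * 10

step_1 = dropout(layer_output, rate=0.2, training=True, rng=rng)
step_2 = dropout(layer_output, rate=0.2, training=True, rng=rng)
at_test_time = dropout(layer_output, rate=0.2, training=False, rng=rng)

print(step_1)
print(step_2)
print(at_test_time)
```

Each training call draws a fresh random mask, so the set of deactivated neurons generally differs from step to step, while the test-time call returns the activations unchanged.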
Here is an example of how we can add Dropout to a neural network model using Keras:
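from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(units=300, activation='relu'))  # Hidden Layer1
model.add(Dense(units=200, activation='relu'))  # Hidden Layer2
model.add(Dropout(0.20))  # drops 20% of Hidden Layer2's outputs during training
model.add(Dense(units=100, activation='relu'))  # Hidden Layer3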
In this case, a dropout of 20% is applied to the outputs of Hidden Layer2. It is not necessary for the dropout to be added to all layers. As a data scientist, you can experiment and decide what the dropout value should be and how many layers need it.
Note
A more detailed explanation of dropout can be found in the paper by Nitish Srivastava et al. available here: http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf.
As we have come to the end of this chapter, let's test what we have learned so far with the following activity.