The Deep Learning Workshop

Training a Perceptron

To train a perceptron, we need the following components:

  • Data representation
  • Layers
  • Neural network representation
  • Loss function
  • Optimizer
  • Training loop

In the previous section, we covered most of the preceding components: the data representation of the input data and the true labels in TensorFlow. For layers, we have the linear layer and the activation functions, which we saw in the form of the net input function and the sigmoid function respectively. For the neural network representation, we made a function called perceptron(), which uses a linear layer and a sigmoid layer to perform predictions. What we did in the previous section using input data and initial weights and biases is called forward propagation. The actual neural network training involves two stages: forward propagation and backward propagation. We will explore them in detail in the next few steps. Let's look at the training process at a higher level:

  • A training iteration in which the neural network goes through all the training examples is called an epoch. The number of epochs is one of the hyperparameters to be tuned when training a neural network.
  • In each pass, a neural network does forward propagation, where data travels from the input to the output. As seen in Exercise 2.01, Perceptron Implementation, inputs are fed to the perceptron. Input data passes through the net input function and the activation function to produce the predicted output. The predicted output is compared with the labels or the ground truth, and the error or loss is calculated.
  • For a neural network to learn, that is, to adjust its weights and biases so that it makes correct predictions, it needs a loss function, which calculates the error between the actual label and the predicted label.
  • To minimize the error in the neural network, the training loop needs an optimizer, which adjusts the weights and biases to reduce the value of the loss function.
  • Once the error is calculated, the neural network then sees which nodes of the network contributed to the error and by how much. This is essential in order to make the predictions better in the next epoch. This way of propagating the error backward is called backward propagation (backpropagation). Backpropagation uses the chain rule from calculus to propagate the error (the error gradient) in reverse order until it reaches the input layer. As it propagates the error back through the network, it uses gradient descent to make fine adjustments to the weights and biases in the network by utilizing the error gradient calculated before.

This cycle continues until the loss stops decreasing or the chosen number of epochs is reached.
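The cycle described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the TensorFlow code used in this chapter: it assumes a single sigmoid neuron, the OR data, a cross-entropy loss, and hand-derived gradients. The names forward, mean_loss, and train, as well as the learning rate of 0.5, are chosen here for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    """Forward propagation: net input followed by the sigmoid activation."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)


def mean_loss(X, y, w, b):
    """Mean binary cross-entropy between the labels and the predictions."""
    eps = 1e-12  # keeps log() away from exactly 0
    total = 0.0
    for x, t in zip(X, y):
        p = min(max(forward(x, w, b), eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(X)


def train(X, y, epochs=1000, learning_rate=0.5):
    w = [0.0] * len(X[0])  # weights and bias start at zero
    b = 0.0
    n = len(X)
    history = []
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, t in zip(X, y):
            p = forward(x, w, b)          # forward propagation
            error = p - t                 # gradient of the loss w.r.t. the net input
            for j, xj in enumerate(x):    # chain rule: back to each weight
                grad_w[j] += error * xj / n
            grad_b += error / n
        # Gradient descent: step against the gradient on every iteration.
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
        history.append(mean_loss(X, y, w, b))
    return w, b, history


X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # OR inputs
y = [0, 1, 1, 1]                      # OR labels
w, b, history = train(X, y)
predictions = [round(forward(x, w, b)) for x in X]
print(history[0], history[-1])
print(predictions)
```

The recorded loss falls between the first and the last epoch, and the rounded predictions match the OR labels.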

Let's implement the theory we have discussed in TensorFlow. Revisit the code in Exercise 2.01, Perceptron Implementation, where the perceptron we created just did one forward pass. We got the following predictions, and we saw that our perceptron had not learned anything:

tf.Tensor(
[[0.5]
 [0.5]
 [0.5]
 [0.5]], shape=(4, 1), dtype=float32)

In order to make our perceptron learn, we need additional components, such as a training loop, a loss function, and an optimizer. Let's see how to implement these components in TensorFlow.

Perceptron Training Process in TensorFlow

In the next exercise, when we train our model, we will use a Stochastic Gradient Descent (SGD) optimizer to minimize the loss. TensorFlow also provides several more advanced optimizers out of the box. We will look at the pros and cons of each of them in later sections. The following code will instantiate a stochastic gradient descent optimizer using TensorFlow:

learning_rate = 0.01
optimizer = tf.optimizers.SGD(learning_rate)

The perceptron function takes care of the forward propagation. For the backpropagation of the error, we use an optimizer. tf.optimizers.SGD creates an instance of an optimizer. In its classic form, SGD updates the parameters of the network (the weights and biases) after each example, or small batch of examples, from the input data; in the code that follows, each update uses the loss computed over the whole dataset. We will discuss the functioning of the gradient descent optimizer in greater detail later in this chapter. We will also discuss the significance of the 0.01 parameter, which is known as the learning rate. The learning rate scales the size of each step that SGD takes toward a minimum of the loss function. The learning rate is another hyperparameter that needs to be tuned in order to train a neural network.
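To make the role of the learning rate concrete, here is a small plain-Python sketch. It assumes a toy one-parameter loss, (w - 3)**2, whose minimum is at w = 3. The function name gradient_descent and the three learning rates are hypothetical choices made for this illustration.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimize the toy loss (w - 3)**2, starting from w."""
    for _ in range(steps):
        gradient = 2 * (w - 3)            # derivative of (w - 3)**2
        w = w - learning_rate * gradient  # step against the gradient
    return w


small = gradient_descent(0.01)  # steps are too small: still far from 3 after 50 steps
good = gradient_descent(0.1)    # ends up very close to 3
large = gradient_descent(1.1)   # steps are too large: each update overshoots further
print(small, good, large)
```

With the same number of steps, the smallest rate has not yet reached the minimum, the middle one has, and the largest one moves away from it, which is why the learning rate has to be tuned.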

The following code can be used to define the epochs, training loop, and loss function:

no_of_epochs = 1000
for n in range(no_of_epochs):
    loss = lambda: abs(tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=y, logits=perceptron(X))))
    optimizer.minimize(loss, [W, B])

Inside the training loop, the loss is calculated using the loss function, which is defined as a lambda function.

The tf.nn.sigmoid_cross_entropy_with_logits function calculates the loss value of each observation. It takes two parameters: labels=y and logits=perceptron(X). Note that this function applies the sigmoid to its logits argument internally, so it is normally given the net input rather than an already activated output; the code here passes the output of perceptron(X) as is.

perceptron(X) returns the predicted value, which is the result of the forward propagation of the input, X. This is compared with the corresponding label value stored in y. The mean of the per-observation losses is calculated using tf.reduce_mean, and abs takes its magnitude (the cross-entropy loss is already non-negative, so abs does not change the value here). optimizer.minimize takes the loss function, computes its gradients with respect to the weights and bias, and adjusts them as part of the backward propagation of the error.

The forward propagation is executed again with the new values of weights and bias. And this forward and backward process continues for the number of iterations we define.

During backpropagation, the optimizer updates the weights and biases on every iteration by taking a step in the direction opposite to the gradient of the loss. SGD does not compare the loss with that of the previous cycle, and it does not keep only the best values seen so far. With a suitably small learning rate, the loss generally decreases from one iteration to the next, and the values of W and B after the final iteration are the ones the trained perceptron uses.

We have set the number of epochs for the training to 1,000 iterations. There is no rule of thumb for setting the number of epochs since the number of epochs is a hyperparameter. But how do we know when training has taken place successfully?

We can conclude that training has taken place when the values of the weights and biases have changed, although a change alone does not tell us whether the predictions are good. If we ran the training loop on the OR data we saw in Exercise 2.01, Perceptron Implementation, we would see weights roughly equal to the following:

[[0.412449151]
 [0.412449151]]

And the bias would be something like this:

0.236065879

When the network has learned, that is, the weights and biases have been updated, we can see whether it is making accurate predictions using accuracy_score from the scikit-learn package. We can use it to measure the accuracy of the predictions as follows:

from sklearn.metrics import accuracy_score
print(accuracy_score(y, ypred))

Here, accuracy_score takes two parameters—the label values (y) and the predicted values (ypred)—and measures the accuracy. Let's say the result is 1.0. This means the perceptron is 100% accurate.
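What accuracy_score computes for this kind of label data can be sketched in plain Python. The helper name accuracy and the sample values below are assumptions made for illustration; the actual scikit-learn function handles more input formats.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [0, 1, 1, 1]
all_correct = accuracy(y_true, [0, 1, 1, 1])  # every prediction matches
one_wrong = accuracy(y_true, [0, 1, 0, 1])    # one of four predictions is wrong
print(all_correct, one_wrong)
```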

In the next exercise, we will train our perceptron to perform a binary classification.

Exercise 2.02: Perceptron as a Binary Classifier

In the previous section, we learned how to train a perceptron. In this exercise, we will train our perceptron to approximate a slightly more complicated function. We will be using randomly generated external data with two classes: class 0 and class 1. Our trained perceptron should be able to classify the random numbers based on their class:

Note

The data is in a CSV file called data.csv. You can download the file from GitHub by visiting https://packt.live/2BVtxIf.

  1. Import the required libraries:

    import tensorflow as tf
    import pandas as pd
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    import matplotlib.pyplot as plt
    %matplotlib inline

    Apart from tensorflow, we will need pandas to read the data from the CSV file, confusion_matrix and accuracy_score to measure the accuracy of our perceptron after the training, and matplotlib to visualize the data.

  2. Read the data from the data.csv file. It should be in the same path as the Jupyter Notebook file in which you are running this exercise's code. Otherwise, you will have to change the path in the code before executing it:

    df = pd.read_csv('data.csv')

  3. Examine the data:

    df.head()

    The output will be as follows:

    Figure 2.10: Contents of the DataFrame

    As you can see, the data has three columns. x1 and x2 are the features, and the label column contains the labels 0 or 1 for each observation. The best way to see this kind of data is through a scatter plot.

  4. Visualize the data by plotting it using matplotlib:

    plt.scatter(df[df['label'] == 0]['x1'],
                df[df['label'] == 0]['x2'],
                marker='*')
    plt.scatter(df[df['label'] == 1]['x1'],
                df[df['label'] == 1]['x2'],
                marker='<')

    The output will be as follows:

    Figure 2.11: Scatter plot of external data

    This shows the two distinct classes of the data shown by the two different shapes. Data with the label 0 is represented by a star, while data with the label 1 is represented by a triangle.

  5. Prepare the data. This step is not unique to neural networks; you will have seen it in regular machine learning as well. Before submitting the data to a model for training, you split it into features and labels:

    X_input = df[['x1','x2']].values
    y_label = df[['label']].values

    X_input contains the features, x1 and x2. The .values attribute at the end converts the DataFrame columns into a NumPy array (matrix format), which is what is expected as input when the tensors are created. y_label contains the labels in the same matrix format.

  6. Create TensorFlow variables for features and labels and typecast them to float:

    x = tf.Variable(X_input, dtype=tf.float32)
    y = tf.Variable(y_label, dtype=tf.float32)

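  7. Define the parameters, the optimizer, and the perceptron() function. The rest of the code, the training of the perceptron, uses the training loop we saw in the previous section, with weight, bias, and x in place of W, B, and X: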

    Exercise2.02.ipynb

    Number_of_features = 2
    Number_of_units = 1
    learning_rate = 0.01
    # weights and bias
    weight = tf.Variable(tf.zeros([Number_of_features,
                                   Number_of_units]))
    bias = tf.Variable(tf.zeros([Number_of_units]))
    # optimizer
    optimizer = tf.optimizers.SGD(learning_rate)

    def perceptron(x):
        z = tf.add(tf.matmul(x, weight), bias)
        output = tf.sigmoid(z)
        return output

    Note

    The # symbol in the code snippet above denotes a code comment. Comments are added into code to help explain specific bits of logic.

  8. Display the values of weight and bias to show that the perceptron has been trained:

    tf.print(weight, bias)

    The output is as follows:

    [[-0.844034135]
     [0.673354745]] [0.0593947917]

  9. Pass the input data to check whether the perceptron classifies it correctly:

    ypred = perceptron(x)

  10. Round off the output to convert it into binary format:

    ypred = tf.round(ypred)

  11. Measure the accuracy using the accuracy_score method, as we did in the previous exercise:

    acc = accuracy_score(y.numpy(), ypred.numpy())
    print(acc)

    The output is as follows:

    1.0

    The perceptron gives 100% accuracy.

  12. The confusion matrix helps to measure the performance of a model. We will compute and print the confusion matrix using the scikit-learn package:

    cnf_matrix = confusion_matrix(y.numpy(),
                                  ypred.numpy())
    print(cnf_matrix)

    The output will be as follows:

    [[12 0]
     [ 0 9]]

    All the numbers are along the diagonal, that is, 12 values corresponding to class 0 and 9 values corresponding to class 1 are properly classified by our trained perceptron (which has achieved 100% accuracy).

    Note

    To access the source code for this specific section, please refer to https://packt.live/3gJ73bY.

    You can also run this example online at https://packt.live/2DhelFw. You must execute the entire Notebook in order to get the desired result.
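The confusion matrix printed in step 12 can be reproduced by hand. The following plain-Python sketch assumes binary labels 0 and 1 and uses made-up label lists with the same class counts as the output above (12 and 9). confusion is a hypothetical helper, not the scikit-learn function. Rows are true classes and columns are predicted classes, which is the layout that scikit-learn's confusion_matrix uses.

```python
def confusion(y_true, y_pred, n_classes=2):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


y_true = [0] * 12 + [1] * 9          # 12 observations of class 0, 9 of class 1
perfect = confusion(y_true, y_true)  # every prediction is correct

# Flip the last class-1 prediction to 0 to see an off-diagonal entry appear.
y_pred = y_true[:-1] + [0]
one_error = confusion(y_true, y_pred)
print(perfect)
print(one_error)
```

Any count that lands off the diagonal is a misclassification, so a perfect classifier produces a matrix with zeros everywhere except the diagonal.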

In this exercise, we trained our perceptron into a binary classifier, and it has done pretty well. In the next exercise, we will see how to create a multiclass classifier.

Multiclass Classifier

A classifier that can handle two classes is known as a binary classifier, like the one we saw in the preceding exercise. A classifier that can handle more than two classes is known as a multiclass classifier. We cannot build a multiclass classifier with a single neuron. Now we move from one neuron to one layer of multiple neurons, which is required for multiclass classifiers.

A single layer of multiple neurons can be trained to be a multiclass classifier. Some of the key points are detailed here. You need as many neurons as the number of classes; that is, for a 3-class classifier, you need 3 neurons; for a 10-class classifier you need 10 neurons, and so on.

As we saw in binary classification, we used sigmoid (logistic layer) to get predictions in the range of 0 to 1. In multiclass classification, we use a special type of activation function called the Softmax activation function to get probabilities across the classes that sum to 1. With the sigmoid function in a multiclass setting, the probabilities do not necessarily add up to 1, so Softmax is preferred.

Before we implement the multiclass classifier, let's explore the Softmax activation function.

The Softmax Activation Function

The Softmax function is also known as the normalized exponential function. As the word normalized suggests, the Softmax function normalizes the input into a probability distribution that sums to 1. Mathematically, it is represented as follows:

Figure 2.12: Mathematical form of the Softmax function
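The figure itself is not reproduced here. The standard definition of the Softmax function for an input vector z with K components is the following (the symbols z, K, i, and j are notation chosen for this sketch):

```latex
\[
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
\]
```

Each output is positive, and the K outputs sum to 1.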

To understand what Softmax does, let's use TensorFlow's built-in softmax function and see the output.

So, for the following code:

values = tf.Variable([3,1,7,2,4,5], dtype=tf.float32)
output = tf.nn.softmax(values)
tf.print(output)

The output will be:

[0.0151037546 0.00204407098 0.824637055
 0.00555636082 0.0410562605 0.111602485]

As you can see in the output, the values input is mapped to a probability distribution that sums to 1. Note that 7 (the highest value in the original input values) received the highest weight, 0.824637055. This is what the Softmax function is mainly used for: to focus on the largest values and suppress values that are below the maximum value. Also, if we sum the output, it adds up to approximately 1.
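The same computation can be written out in plain Python to see where these numbers come from. This is a sketch of the normalized exponential, not TensorFlow's implementation; subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result.

```python
import math


def softmax(values):
    """Normalized exponential: exponentiate each value, then divide by the sum."""
    shift = max(values)  # stability shift; cancels out in the division
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


output = softmax([3, 1, 7, 2, 4, 5])
largest = max(range(len(output)), key=lambda i: output[i])
print([round(p, 6) for p in output])
print(sum(output), largest)
```

The largest input, 7 at index 2, receives by far the largest share of the probability, and the outputs sum to 1 up to floating-point rounding.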

Illustrating the example in more detail, let's say we want to build a multiclass classifier with 3 classes. We will need 3 neurons connected to a Softmax activation function:

Figure 2.13: Softmax activation function used in a multiclass classification setting

As seen in Figure 2.13, x1, x2, and x3 are the input features, which go through the net input function of each of the three neurons, each of which has its own weights and bias (Wi,j and bi). The outputs of the neurons are then fed to a common Softmax activation function instead of individual sigmoid functions. The Softmax activation function outputs the probabilities of the 3 classes: P1, P2, and P3. These three probabilities sum to 1 because of the Softmax layer.

As we saw in the previous section, Softmax highlights the maximum value and suppresses the rest of the values. Suppose a neural network is trained to classify the input into three classes and, for a given input, the correct class is class 2. If the network predicts correctly, P2 will have the highest value among the Softmax outputs. As you can see in the following figure, P2 has the highest value, which means the prediction is correct:

Figure 2.14: Probability P2 is the highest

An associated concept is one-hot encoding. As we have three different classes, class1, class2, and class3, we need to encode the class labels into a format that we can work with more easily; so, after applying one-hot encoding, we would see the following output:

Figure 2.15: One-hot encoded data for three classes

This makes the results quick and easy to interpret. In this case, the output that has the highest value is set to 1, and all others are set to 0. The one-hot encoded output of the preceding example would be like this:

Figure 2.16: One-hot encoded output probabilities

The labels of the training data also need to be one-hot encoded. And if they have a different format, they need to be converted into one-hot-encoded format before training the model. Let's do an exercise on multiclass classification with one-hot encoding.
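Both uses of one-hot encoding described above can be sketched in plain Python. The helper names one_hot and one_hot_from_probabilities, and the example probabilities, are hypothetical. Classes are numbered from 0 here, so class 2 in the text corresponds to index 1.

```python
def one_hot(label, n_classes):
    """Encode an integer class label as a one-hot list."""
    return [1 if i == label else 0 for i in range(n_classes)]


def one_hot_from_probabilities(probabilities):
    """Set the highest probability to 1 and all others to 0."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return [1 if i == best else 0 for i in range(len(probabilities))]


labels = [0, 1, 2]
encoded = [one_hot(label, 3) for label in labels]
prediction = one_hot_from_probabilities([0.1, 0.7, 0.2])
print(encoded)
print(prediction)
```

In the exercise that follows, pandas' get_dummies plays the role of one_hot for the training labels.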

Exercise 2.03: Multiclass Classification Using a Perceptron

To perform multiclass classification, we will be using the Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris), which has 3 classes of 50 instances each, where each class refers to a type of Iris. We will have a single layer of three neurons using the Softmax activation function:

Note

You can download the dataset from GitHub using this link: https://packt.live/3ekiBBf.

  1. Import the required libraries:

    import tensorflow as tf
    import pandas as pd
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    import matplotlib.pyplot as plt
    %matplotlib inline
    from pandas import get_dummies

    You should be familiar with all of these imports, as they were used in the previous exercise, except for get_dummies. This function converts the given label data into the corresponding one-hot-encoded format.

  2. Load the iris.csv data:

    df = pd.read_csv('iris.csv')

  3. Let's examine the first five rows of the data:

    df.head()

    The output will be as follows:

    Figure 2.17: Contents of the DataFrame

  4. Visualize the data by using a scatter plot:

    plt.scatter(df[df['species'] == 0]['sepallength'],
                df[df['species'] == 0]['sepalwidth'], marker='*')
    plt.scatter(df[df['species'] == 1]['sepallength'],
                df[df['species'] == 1]['sepalwidth'], marker='<')
    plt.scatter(df[df['species'] == 2]['sepallength'],
                df[df['species'] == 2]['sepalwidth'], marker='o')

    The resulting plot will be as follows. The x axis denotes the sepal length and the y axis denotes the sepal width. The shapes in the plot represent the three species of Iris, setosa (star), versicolor (triangle), and virginica (circle):

    Figure 2.18: Iris data scatter plot

    There are three classes, as can be seen in the visualization, denoted by different shapes.

  5. Separate the features and the labels:

    x = df[['petallength', 'petalwidth',
            'sepallength', 'sepalwidth']].values
    y = df['species'].values

    .values converts the features into a NumPy array (matrix format).

  6. Prepare the data by doing one-hot encoding on the classes:

    y = get_dummies(y)
    y = y.values

    get_dummies(y) will convert the labels into one-hot-encoded format.

  7. Create a variable to load the features and typecast it to float32:

    x = tf.Variable(x, dtype=tf.float32)

  8. Implement the perceptron layer with three neurons:

    Number_of_features = 4
    Number_of_units = 3
    # weights and bias
    weight = tf.Variable(tf.zeros([Number_of_features,
                                   Number_of_units]))
    bias = tf.Variable(tf.zeros([Number_of_units]))

    def perceptron(x):
        z = tf.add(tf.matmul(x, weight), bias)
        output = tf.nn.softmax(z)
        return output

    The code looks very similar to the single perceptron implementation. Only the Number_of_units parameter is set to 3. Therefore, the weight matrix will be 4 x 3 and the bias matrix will be 1 x 3.

    The other change is in the activation function:

    output = tf.nn.softmax(z)

    We are using softmax instead of sigmoid.

  9. Create an instance of the optimizer. We will be using the Adam optimizer. At this point, you can think of Adam as an improved version of gradient descent that converges faster. We will cover it in detail later in the chapter:

    optimizer = tf.optimizers.Adam(.01)

  10. Define the training function:

    def train(i):
        for n in range(i):
            loss = lambda: abs(tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(
                    labels=y, logits=perceptron(x))))
            optimizer.minimize(loss, [weight, bias])

    Again, the code looks very similar to the single-neuron implementation except for the loss function. Instead of sigmoid_cross_entropy_with_logits, we use softmax_cross_entropy_with_logits.

  11. Run the training for 1000 iterations:

    train(1000)

  12. Print the values of the weights to see if they have changed. This is also an indication that our perceptron is learning:

    tf.print(weight)

    The output shows the learned weights of our perceptron:

    [[0.684310317 0.895633 -1.0132345]
     [2.6424644 -1.13437736 -3.20665336]
     [-2.96634197 -0.129377216 3.2572844]
     [-2.97383809 -3.13501668 3.2313652]]

  13. To test the accuracy, we feed the features to predict the output and then calculate the accuracy using accuracy_score, like in the previous exercise:

    ypred = perceptron(x)
    ypred = tf.round(ypred)
    accuracy_score(y, ypred)

    The output is:

    0.98

    It has given 98% accuracy, which is pretty good.

    Note

    To access the source code for this specific section, please refer to https://packt.live/2Dhes3U.

    You can also run this example online at https://packt.live/3iJJKkm. You must execute the entire Notebook in order to get the desired result.
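One detail of step 13 is worth a closer look. Rounding Softmax outputs yields a valid one-hot row only when one class has a probability above 0.5; otherwise, the row becomes all zeros and cannot match any one-hot label. A common alternative, sketched here in plain Python with made-up probabilities and hypothetical helper names, is to pick the index of the largest probability (argmax) instead.

```python
def round_rows(probabilities):
    """Mimic rounding each Softmax output to 0 or 1."""
    return [[1 if p > 0.5 else 0 for p in row] for row in probabilities]


def argmax_rows(probabilities):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in probabilities]


probabilities = [
    [0.90, 0.05, 0.05],  # confident prediction: class 0
    [0.40, 0.35, 0.25],  # no class above 0.5, but class 0 is still the largest
]
rounded = round_rows(probabilities)
classes = argmax_rows(probabilities)
print(rounded)
print(classes)
```

The second row rounds to all zeros even though class 0 is the most likely class, whereas argmax still returns a class for every row.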

In this exercise, we performed multiclass classification using our perceptron. Let's do a more complex and interesting case study of the handwritten digit recognition dataset in the next section.

MNIST Case Study

Now that we have seen how to train a single neuron and a single layer of neurons, let's take a look at more realistic data. MNIST is a famous case study. In the next exercise, we will create a 10-class classifier to classify the MNIST dataset. However, before that, you should get a good understanding of the MNIST dataset.

Modified National Institute of Standards and Technology (MNIST) refers to the modified version of the NIST handwriting data that the team led by Yann LeCun put together. The project was aimed at handwritten digit recognition using neural networks.

We need to understand the dataset before we get into writing the code. The MNIST dataset is integrated into the TensorFlow library. It consists of 70,000 images of the handwritten digits 0 to 9:

Figure 2.19: Handwritten digits

When we say images, you might think these are JPEG files, but they are not. They are stored in the form of pixel values. As far as the computer is concerned, an image is a bunch of numbers. The dimension of each of these images is 28 x 28, so each image is stored as a 28 x 28 matrix in which each cell contains an integer ranging from 0 to 255. These are grayscale images (commonly known as black and white). 0 indicates white and 255 indicates complete black, and values in between indicate a certain shade of gray. The MNIST dataset is split into 60,000 training images and 10,000 test images.

Each image has a label associated with it ranging from 0 to 9. In the next exercise, let's build a 10-class classifier to classify the handwritten MNIST images.
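As a preview of the preprocessing used in the next exercise, this plain-Python sketch shows how one 28 x 28 grid of pixel values becomes a flat list of 784 numbers scaled to the range 0 to 1. The image here is synthetic (a made-up repeating pattern), and flatten_and_scale is a hypothetical helper; the exercise itself uses tf.reshape and division by 255.0.

```python
def flatten_and_scale(image):
    """Read the rows one after another and scale each pixel from 0-255 to 0-1."""
    return [pixel / 255.0 for row in image for pixel in row]


# Synthetic 28 x 28 "image" with integer pixel values from 0 to 255.
image = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]
vector = flatten_and_scale(image)
print(len(vector), min(vector), max(vector))
```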

Exercise 2.04: Classifying Handwritten Digits

In this exercise, we will build a single-layer 10-class classifier consisting of 10 neurons with the Softmax activation function. It will have an input layer of 784 pixels:

  1. Import the required libraries and packages just like we did in the earlier exercise:

    import tensorflow as tf
    import pandas as pd
    from sklearn.metrics import accuracy_score
    import matplotlib.pyplot as plt
    %matplotlib inline
    from pandas import get_dummies

  2. Create an instance of the MNIST dataset:

    mnist = tf.keras.datasets.mnist

  3. Load the MNIST dataset's train and test data:

    (train_features, train_labels), (test_features, test_labels) = \
        mnist.load_data()

  4. Normalize the data:

    train_features, test_features = (train_features / 255.0,
                                     test_features / 255.0)

  5. Flatten the 2-dimensional images into row matrices. Each 28 × 28 pixel image gets flattened into a row of 784 values using the reshape function:

    x = tf.reshape(train_features,[60000, 784])

  6. Create a Variable with the features and typecast it to float32:

    x = tf.Variable(x)
    x = tf.cast(x, tf.float32)

  7. Create a one-hot encoding of the labels and transform it into a matrix:

    y_hot = get_dummies(train_labels)
    y = y_hot.values

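  8. Create the single-layer neural network with 10 neurons and train it for 1000 iterations. The excerpt shown here covers only the parameter definitions; the perceptron() function and the training function follow the same pattern as in Exercise 2.03 and are included in the full notebook linked at the end of this exercise: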

    Exercise2.04.ipynb

    # defining the parameters
    Number_of_features = 784
    Number_of_units = 10
    # weights and bias
    weight = tf.Variable(tf.zeros([Number_of_features,
                                   Number_of_units]))
    bias = tf.Variable(tf.zeros([Number_of_units]))

  9. Prepare the test data to measure the accuracy:

    # Prepare the test data to measure the accuracy.
    test = tf.reshape(test_features, [10000, 784])
    test = tf.Variable(test)
    test = tf.cast(test, tf.float32)
    test_hot = get_dummies(test_labels)
    test_matrix = test_hot.values

  10. Run the predictions by passing the test data through the network:

    ypred = perceptron(test)
    ypred = tf.round(ypred)

  11. Calculate the accuracy:

    accuracy_score(test_hot, ypred)

    The accuracy on the test data is:

    0.9304

    Note

    To access the source code for this specific section, please refer to https://packt.live/3efd7Yh.

    You can also run this example online at https://packt.live/2Oc83ZW. You must execute the entire Notebook in order to get the desired result.
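The layer built in this exercise computes a net input for each of the 10 neurons and passes the results through Softmax. The plain-Python sketch below assumes the same zero initialization as step 8 and a made-up input vector; dense_softmax is a hypothetical helper. It shows that, before training, every class receives the same probability of 1/10, so any accuracy above chance comes from the training step.

```python
import math


def dense_softmax(x, weight, bias):
    """Net input z = x . W + b for each unit, followed by Softmax."""
    n_units = len(bias)
    z = [sum(x[i] * weight[i][k] for i in range(len(x))) + bias[k]
         for k in range(n_units)]
    shift = max(z)  # stability shift; does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


number_of_features, number_of_units = 784, 10
weight = [[0.0] * number_of_units for _ in range(number_of_features)]
bias = [0.0] * number_of_units
x = [0.5] * number_of_features  # made-up flattened image

probabilities = dense_softmax(x, weight, bias)
# One weight per (feature, unit) pair plus one bias per unit.
n_parameters = number_of_features * number_of_units + number_of_units
print(probabilities[0], n_parameters)
```

The parameter count also shows how small this model is: a 784 x 10 weight matrix plus 10 biases.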

In this exercise, we saw how to create a single-layer multi-neuron neural network and train it as a multiclass classifier.

The next step is to build a multilayer neural network. However, before we do that, we must learn about the Keras API, since we use Keras to build dense neural networks.