How to Implement a Neural Network Using TensorFlow
In this section, we will look at the most important aspects to consider when implementing a deep neural network. Starting with the very basic concepts, we will go through all the steps that lead up to the creation of a state-of-the-art deep learning model. We will cover how to define the network architecture, training strategies, and performance improvement techniques, understand how they work, and prepare you to tackle the next section's exercises, where these concepts will be applied to solve real-world problems.
To successfully implement a deep neural network in TensorFlow, we have to complete a number of steps. These can be summarized and grouped as follows:
- Model creation: Network architecture definition, input features encoding, embeddings, output layers
- Model training: Loss function definition, optimizer choice, features normalization, backpropagation
- Model validation: Strategies and key elements
- Model improvement: Overfitting countermeasures
- Model test and inference: Performance evaluation and online predictions
Let's look at each of these steps in detail.
Model Creation
The very first step is to create a model. Choosing an architecture is hardly something that can be done a priori on paper; it typically requires experimentation, going back and forth between model design and field validation and testing. This is the phase where all network layers are created and properly linked to generate a complete processing operation set that goes from inputs to outputs.
The very first layer is the one that is interfaced with input data, specifically, the so-called "input features." In the case of images, for example, input features are image pixels. Depending on the nature of the layer, the input features' dimensionality needs to be taken into account. You will learn how to choose layer dimensions, depending on the layer's nature, in the upcoming sections.
The very last layer is called the output layer. It generates model predictions, so its dimensions depend on the nature of the problem. For example, in classification problems, where the model has to predict in which of the, say, 10 classes a given instance falls, the model will have 10 neurons in the output layer providing 10 scores (one per class). In the upcoming sections, we will illustrate how to create output layers with the correct dimensions.
Between the first and last layers, there are intermediate layers, called hidden layers. These layers constitute the network architecture, and they are responsible for the core processing capabilities of the model. At the time of writing, a rule that can be used to choose the best network architecture doesn't exist; this is a process that requires a lot of experimentation, under the guidance of some general principles.
A very powerful and common approach is to leverage proven models from academic papers, using them as a starting point, and then adjusting the architecture appropriately to fit and fine-tune it to the custom problem. When pretrained literature models are used and fine-tuned, the procedure is called "transfer learning," meaning we are leveraging an already trained model and transferring its knowledge to the new model, which then won't start from scratch.
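As a quick illustration, the following is a minimal transfer-learning sketch, assuming an image classification problem with 10 classes and MobileNetV2 (available in tf.keras.applications) as the pretrained backbone; both choices are purely illustrative:
base_model = tf.keras.applications.MobileNetV2(\
                 input_shape=(160, 160, 3), include_top=False, \
                 weights='imagenet')
base_model.trainable = False  # freeze the pretrained weights
transfer_model = tf.keras.Sequential([\
                     base_model, \
                     tf.keras.layers.GlobalAveragePooling2D(), \
                     tf.keras.layers.Dense(10, activation='softmax')])
Only the newly added pooling and dense layers are trained from scratch here; the frozen backbone supplies the "transferred" knowledge, and it can optionally be unfrozen later for fine-tuning.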
Once the model has been created, all its parameters (weights/biases) must be initialized (for all non-pretrained layers). You might be tempted to set them all equal to zero, but this is hardly a good choice. There are many different initialization schemes available, and again, which one to choose requires experience and experimentation. This aspect will become clearer in the following sections. Our implementations will rely on the default initialization performed by Keras/TensorFlow, which is usually a good and safe starting point.
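Should you ever need to override the defaults, Keras layers accept an explicit initializer; for example, the following (purely illustrative) dense layer uses He-normal initialization for its weights and zeros for its biases:
tf.keras.layers.Dense(64, activation='relu', \
                      kernel_initializer='he_normal', \
                      bias_initializer='zeros')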
A typical code example for model creation can be seen in the following snippet, which we studied in the previous section:
inputs = tf.keras.layers.Input(shape=(784,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
predictions = tf.keras.layers.Dense(10, activation='softmax')(x)
model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
Model Training
When a model is initialized and applied to input data without undergoing a training phase, it outputs random values. In order to improve its performance, we need to adjust its parameters (weights) to minimize its errors. This is the aim of the model training stage, which requires the following steps:
- First, we have to evaluate how "wrong" the model is with a given parameter configuration by computing a so-called "loss," which is a measure of model prediction error.
- Second, the gradient of the loss with respect to all model parameters is computed, which tells us in which direction the parameters need to change in order to improve current performance, thereby minimizing the loss function (it is indeed an optimization process).
- Finally, the model parameters are updated by taking a "step" in the negative gradient direction (following some precise rules) and the whole process restarts from the loss evaluation stage.
This procedure is repeated as many times as needed until the system converges and the model reaches its maximum performance (minimum loss).
A typical code example for model training is shown in the following snippet, which we studied in the previous sections:
model.compile(optimizer='rmsprop', \
loss='categorical_crossentropy', \
metrics=['accuracy'])
model.fit(data, labels) # starts training
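For reference, the three steps described previously (loss evaluation, gradient computation, and parameter update) can also be written explicitly. The following is a minimal sketch of a single training step using tf.GradientTape, assuming data and labels are tensors compatible with the model defined earlier:
loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.RMSprop()
with tf.GradientTape() as tape:
    predictions = model(data, training=True)   # forward pass
    loss_value = loss_fn(labels, predictions)  # step 1: compute the loss
gradients = tape.gradient(loss_value, model.trainable_variables)      # step 2
optimizer.apply_gradients(zip(gradients, model.trainable_variables))  # step 3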
Loss Function Definition
Model error can be measured by means of different loss functions. Choosing the best one requires experience. For complex applications, we often need to carefully adapt the loss function in order to drive training in the directions we are interested in. As an example, let's look at how to define a typical loss that's used for classification problems: the sparse categorical cross entropy. To create it in Keras, we can use the following instruction:
loss_CatCrossEntropy = tf.keras.losses\
.SparseCategoricalCrossentropy()
This function operates on two inputs: true labels and predicted labels. Based on their values, it computes the loss associated with the model:
loss_CatCrossEntropy(y_true=groundTruth, y_pred=predictions)
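For instance, with a hypothetical batch of two instances whose true classes are 1 and 2, and the corresponding predicted class probabilities, the loss can be evaluated as follows:
groundTruth = tf.constant([1, 2])                 # integer class labels
predictions = tf.constant([[0.05, 0.90, 0.05], \
                           [0.10, 0.20, 0.70]])   # per-class probabilities
loss_value = loss_CatCrossEntropy(y_true=groundTruth, y_pred=predictions)
print(float(loss_value))  # about 0.23, the mean negative log-likelihood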
Optimizer Choice
The second and third steps, estimating the gradient and updating the parameters, respectively, are addressed by optimizers. These objects calculate gradients and perform update steps along the descent direction to minimize the model loss. There are many optimizers available, from the simplest ones to the most advanced (refer to the following diagram). They provide different performances, and which one to select is, again, a matter of experience and a trial-and-error process. As an example, the first line of the following code selects the Adam optimizer, assigning it a specific learning rate of 0.01; the remaining lines show how the other available optimizers can be instantiated in the same way. This parameter regulates how "large" a step is taken along the descent direction:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
optimizer = tf.keras.optimizers.Adadelta(learning_rate=0.01)
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.01)
optimizer = tf.keras.optimizers.Adamax(learning_rate=0.01)
optimizer = tf.keras.optimizers.Ftrl(learning_rate=0.01)
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.01)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.01)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
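Any of these optimizer objects can then be passed to compile() in place of the string shortcut used earlier, as in the following sketch, which reuses the model defined previously:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), \
              loss='categorical_crossentropy', \
              metrics=['accuracy'])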
The following diagram is an instantaneous snapshot comparing different optimizers. It shows how quickly they move toward the minimum, starting all at the same time. We can see how some of them are faster than others:
Note
The preceding diagram was created by Alec Radford (https://twitter.com/alecrad).
Learning Rate Scheduling
In most cases, and for most deep learning models, the best results are achieved if the learning rate is gradually reduced during training. The reason for this can be seen in the following diagram:
When approaching the minimum of the loss function, we want to take smaller and smaller steps in order to settle precisely at the very bottom of the loss surface.
With Keras, it is possible to prescribe many different decreasing functions for the learning rate trend over epochs by means of a scheduler. One common choice is InverseTimeDecay. This can be implemented as follows:
lr_schedule = tf.keras.optimizers.schedules\
.InverseTimeDecay(0.001,\
decay_steps=STEPS_PER_EPOCH*1000,\
decay_rate=1, staircase=False)
The preceding code sets a decreasing function through InverseTimeDecay to hyperbolically decrease the learning rate to 1/2 of the base rate at 1,000 epochs, 1/3 at 2,000 epochs, and so on. This can be seen in the following graph:
Then, it is applied to an optimizer as an argument, as shown in the following snippet for the Adam optimizer:
tf.keras.optimizers.Adam(lr_schedule)
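The schedule object is callable, so its effect can be checked directly by evaluating it at a given training step (assuming STEPS_PER_EPOCH has been defined as in the snippet above):
print(float(lr_schedule(0)))                        # 0.001, the base rate
print(float(lr_schedule(STEPS_PER_EPOCH * 1000)))   # 0.0005, 1/2 of the base rate
print(float(lr_schedule(STEPS_PER_EPOCH * 2000)))   # ~0.00033, 1/3 of the base rate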
Each optimization step makes the loss drop, thereby improving the model. The process is then repeated over and over until convergence is reached and the loss stops decreasing. A complete pass over the training data is called an epoch, and the number of epochs determines how long training runs.
Feature Normalization
Deep neural networks are applied to very different types of inputs, from image pixels to credit card transaction histories, from social account profile habits to audio recordings. As a consequence, raw input features cover very different numerical scales. As mentioned previously, training these models requires solving an optimization problem based on loss gradient calculations, so numerical conditioning is of paramount importance: well-scaled features speed up the process and make it more robust. One of the most important practices in this context is feature normalization or standardization. The most common approach consists of performing the following steps for each feature:
- Calculating the mean and standard deviation using all the training set instances.
- Subtracting the mean and dividing by the standard deviation. The values calculated on the training set must be applied to the training, validation, and test sets.
This way, all the features will have zero mean and standard deviation equal to 1. Different, but similar, approaches scale feature values between a user-defined minimum-maximum range (for example, between –1 and 1) or apply similar transformations (for example, log scaling). As usual, in the field, which approach works better is hardly predictable and requires experience and a trial-and-error approach.
The following code snippet shows how data normalization is performed: the mean and standard deviation of the original values are calculated, the mean is then subtracted from the original values, and the result is divided by the standard deviation:
train_stats = train_dataset.describe()
train_stats = train_stats.transpose()
def norm(x):
    return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
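The same function, with the same training statistics, should then be applied to the validation and test data; for example, assuming a test_dataset DataFrame with the same columns as train_dataset:
normed_test_data = norm(test_dataset)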
Model Validation
As stated in the previous subsections, a large portion of choices require experimentation, meaning we have to select a given configuration and evaluate how the corresponding model performs. In order to compute this performance measure, the candidate model must be applied to a set of instances and its output compared against ground truth values. This step can be repeated many times, depending on how many alternative configurations we want to compare. In the long run, these configuration choices can suffer an excessive influence of the set of instances used to measure model performance. For this reason, in order to have a final accurate performance measure of the model of choice, it has to be tested on a new set of instances that have never been seen before. The first set of instances is called a "validation set," while the final one is called a "test set."
There are different choices we can adopt when defining training, validation, and test sets, such as the following:
- 70:20:10: The initial dataset is decomposed into three chunks, that is, the training, validation, and test sets, with the proportion 70:20:10, respectively.
- 80:20 + k-Folding: The initial dataset is decomposed into two chunks, 80% for training and 20% for testing. Validation is performed using k-folding on the training dataset: it is divided into k folds and, in turn, training is carried out on k-1 folds while validation is performed on the remaining fold. The held-out fold varies from 1 to k, and the metrics are averaged to obtain a global measure (a short sketch of this procedure follows the next code snippet).
Many variants of the preceding methods can be used. The choices are strictly related to the problem and the available dataset.
The following code snippet shows how to prescribe an 80:20 split for validation when fitting a model on a training dataset:
model.fit(normed_train_data, train_labels, epochs=epochs, \
validation_split = 0.2, verbose=2)
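The k-folding alternative described above can be sketched as follows, using scikit-learn's KFold and assuming a build_model() helper that returns a freshly compiled model (with an accuracy metric), plus X and y NumPy arrays of features and labels; all of these names are illustrative:
from sklearn.model_selection import KFold
import numpy as np

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kfold.split(X):
    fold_model = build_model()                 # fresh, untrained model per fold
    fold_model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
    _, fold_accuracy = fold_model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(fold_accuracy)
print(np.mean(scores))                         # averaged validation metric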
Performance Metrics
In order to measure performances, beside the loss functions, other metrics are usually adopted. There is a very wide set of metrics available, and the question as to which you should use depends on many factors, including the type of problem, dataset characteristics, and so on. The following is a list of the most common ones:
- Mean Squared Error (MSE): Used for regression problems.
- Mean Absolute Error (MAE): Used for regression problems.
- Accuracy: Number of correct predictions divided by the total number of tested instances. This is used for classification problems.
- Receiver Operating Characteristic Area Under Curve (ROC AUC): Used for binary classification, especially in the presence of highly unbalanced data.
- Others: Fβ score, precision, and recall.
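In Keras, metrics can be passed to compile() as strings (as shown earlier) or used as standalone objects; the following short sketch, with made-up values, computes a mean absolute error:
mae = tf.keras.metrics.MeanAbsoluteError()
mae.update_state(y_true=[1.0, 2.0, 3.0], y_pred=[1.1, 1.8, 3.4])
print(float(mae.result()))  # ~0.233, the mean of the errors 0.1, 0.2, and 0.4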
Model Improvement
In this section, we will look at a few techniques that can be used to improve the performance of a model.
Overfitting
A common problem we may typically encounter when training deep neural networks is a critical drop in model performance (measured, of course, on the validation or test set) when the number of training epochs passes a given threshold, even if, at the same time, the training loss continues to decrease. This phenomenon is called overfitting. It can be defined as follows: a highly representative model, one with a large number of degrees of freedom (for example, a neural network with many layers and neurons), if trained for too long, bends itself to adhere to the training data in order to minimize the training loss. This results in poor generalization performance, making validation and/or test errors higher. Deep learning models, thanks to their high-dimensional parameter space, are usually very good at fitting the training data, but the actual aim of building a machine learning model is being able to generalize what has been learned, not merely fit a dataset.
At this point, we might be tempted to significantly reduce the number of model parameters to avoid overfitting. But this would cause different problems. In fact, a model with an insufficient number of parameters would incur underfitting. Basically, it would not be able to properly fit the data, again resulting in poor performance, this time on both the training and validation/test sets.
The correct solution is the one that finds a proper balance between having so many parameters that the model perfectly fits the training data and having so few degrees of freedom that it cannot capture the important information in the data. It is currently not possible to identify a priori the right size for a model so that it won't face overfitting or underfitting problems. Experimentation is a key element in this regard, requiring the data engineer to build and test different architectures. A good rule is to start with a model with a relatively small number of parameters and then increase it until generalization performance stops improving.
The best solution against overfitting is to enrich the training dataset with new data. Aim for complete coverage of the full range of inputs that are supported and expected by the model. The new data should also contain additional information with respect to the starting dataset in order to effectively counteract overfitting and result in a better generalization error. When collecting additional data is not possible or too expensive, it is necessary to adopt specific, very powerful techniques. The most important ones will be described here.
Regularization
Regularization is one of the most powerful tools for counteracting overfitting. Given a network architecture and a set of training data, there is an entire space of possible weight combinations that produce similar results, and every combination of weights in this space defines a specific model. As we saw in the preceding section, we have to prefer, as a general principle, simple models over complex ones. A common way to reach this goal is to force the network weights to assume small values, thereby regularizing the distribution of weights. This can be achieved through "weight regularization", which consists of shaping the loss function so that it takes weight values into consideration by adding a new term that is directly proportional to their magnitude. Two approaches are usually encountered:
- L1 regularization: The term that's added to the loss function is proportional to the absolute value of the weight coefficients, commonly referred to as the "L1 norm" of the weights.
- L2 regularization: The term that's added to the loss function is proportional to the square of the value of the weight coefficients, commonly referred to as the "L2 norm" of the weights.
Both of these limit the magnitude of the weights, but they behave differently: L1 regularization tends to drive weights toward exactly zero, producing sparse models, while L2 regularization penalizes large weights more heavily, since its additional loss term grows quadratically, without usually forcing them all the way to zero. L2 is, in general, more common.
Keras contains pre-built L1 and L2 regularization objects. The user has to pass them as arguments to the network layers that they want to apply the technique to. The following code shows how to apply it to a common dense layer:
tf.keras.layers.Dense(512, activation='relu', \
kernel_regularizer=tf.keras\
.regularizers.l2(0.001))
The parameter that was passed to the L2 regularizer (0.001) shows that an additional loss term equal to 0.001 * weight_coefficient_value**2 will be added to the total loss of the network for every coefficient in the weight matrix.
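L1 and L2 penalties can also be combined on the same layer through the l1_l2 regularizer, as in the following sketch (the coefficients are illustrative):
tf.keras.layers.Dense(512, activation='relu', \
                      kernel_regularizer=tf.keras.regularizers\
                      .l1_l2(l1=0.001, l2=0.001))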
Early Stopping
Early stopping is a specific form of regularization. The idea is to keep track of both the training and validation errors during training and to continue training only while the validation loss keeps decreasing. This allows us to spot the epoch threshold after which a further decrease in training loss would come at the expense of an increased generalization error, so that we can stop training when validation/test performance has reached its maximum. One typical parameter the user has to choose when adopting this technique is the number of epochs the system should wait and monitor before stopping the iterations if no improvement in the validation error is shown. This parameter is commonly named "patience."
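In Keras, early stopping is implemented through the EarlyStopping callback, which monitors a chosen metric and interrupts training once it stops improving for "patience" epochs. A minimal sketch, reusing the fit call shown earlier:
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
model.fit(normed_train_data, train_labels, epochs=epochs, \
          validation_split=0.2, callbacks=[early_stop], verbose=2)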
Dropout
One of the most popular and effective regularization techniques for neural networks is dropout. It was developed at the University of Toronto by Prof. Hinton and his research group.
When Dropout is applied to a layer, a certain percentage of the layer output features during training are randomly set to zero (they drop out). For example, if the output of a given layer would normally have been [0.3, 0.4, 1.2, 0.1, 1.5] for a given set of input features during training, when dropout is applied, the same output vector will have some zero entries randomly distributed; for example, [0.3, 0, 1.2, 0.1, 0].
The idea behind dropout is to encourage each node to output values that are highly informative and meaningful on their own, without relying on its neighboring ones.
The parameter to be set when inserting a dropout layer is called the dropout rate: this represents the fraction of features that are set to zero and is usually chosen in a range between 0.2 and 0.5. When performing inference, dropout is deactivated, and an additional operation needs to be executed to take into account the fact that more units are active than at training time. To re-establish a balance between the two situations, in the classic formulation the layer's output values are scaled down by a factor equal to the retention probability (1 minus the dropout rate). In Keras, dropout can be introduced in a network using the Dropout layer, which is applied to the output of the layer immediately before it. Consider the following code snippet:
dropout_model = tf.keras.Sequential([
#[...]
tf.keras.layers.Dense(512, activation='relu'), \
tf.keras.layers.Dropout(0.5), \
tf.keras.layers.Dense(256, activation='relu'), \
#[...]
])
As you can see, dropout is applied to the layer with 512 neurons, randomly setting 50% of its output values to zero at training time; Keras automatically rescales the remaining values so that no manual adjustment is needed at inference time.
Data Augmentation
Data augmentation is particularly useful when the number of instances available for training is limited. It is super easy to understand how it is implemented and works in the context of image processing. Suppose we want to train a network to classify images of different breeds of a specific species and we only have a limited number of examples for each breed. How can we enlarge the dataset to help the model generalize better? Data augmentation plays a major role in this context: the idea is to create new training instances, starting from those we already have and tweaking them appropriately. In the case of images, we can act on them by doing the following:
- Random rotations with respect to a point in the vicinity of the center
- Random crops
- Random affine transformations (shear, resize, and so on)
- Random horizontal/vertical flips
- White noise superimposition
- Salt and pepper noise superimposition
These are a few examples of data augmentation techniques that can be used for images, and they have, of course, counterparts in other domains. This approach makes the model far more robust and improves its generalization performance, allowing it to abstract notions and knowledge about the specific problem it is facing in a more general way, by privileging the most informative input features.
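For images, a convenient option in Keras is ImageDataGenerator, which applies random transformations on the fly while feeding batches to the model. A minimal sketch, assuming train_images and train_labels NumPy arrays and an image classification model (all names are illustrative):
datagen = tf.keras.preprocessing.image.ImageDataGenerator(\
              rotation_range=20, width_shift_range=0.1, \
              height_shift_range=0.1, shear_range=0.1, \
              zoom_range=0.1, horizontal_flip=True)
model.fit(datagen.flow(train_images, train_labels, batch_size=32), \
          epochs=epochs)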
Batch Normalization
Batch normalization is a technique that consists of applying a normalization transform to every batch of data. For example, in the context of training a deep network with a batch size of 128, meaning the system will process 128 training samples at a time, the batch normalization layer works this way:
- It calculates the mean and variance for each feature using all the samples of the given batch.
- It subtracts the corresponding feature mean that was previously calculated from each feature of every batch sample.
- It divides each feature of every batch sample by the square root of the corresponding feature variance (its standard deviation).
Batch normalization has many benefits. It was initially proposed to solve internal covariate shift. While training deep networks, the layer's parameters continuously change, causing internal layers to constantly adapt and readjust to new distributions they see as inputs coming from the preceding layers. This is particularly critical for deep networks, where small changes in the first layers are amplified through the network. Normalizing the layer's output helps in bounding these shifts, speeding up training and generating more reliable models.
In addition, using batch normalization, we can do the following:
- We can adopt a higher learning rate without the risk of incurring the problem of vanishing or exploding gradients.
- We can favor network regularization by making its generalization better and mitigating overfitting.
- We can make the model become more robust to different initialization schemes and learning rates.
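In Keras, this technique is available as the BatchNormalization layer, which is typically inserted between successive layers of the network, as in the following sketch:
bn_model = tf.keras.Sequential([
    #[...]
    tf.keras.layers.Dense(512, activation='relu'), \
    tf.keras.layers.BatchNormalization(), \
    tf.keras.layers.Dense(256, activation='relu'), \
    tf.keras.layers.BatchNormalization(), \
    #[...]
    ])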
Model Testing and Inference
Once the model has been trained and its validation performances are satisfactory, we can move on to the final stage. As already stated, a final, accurate, model performance estimation requires that we test the model on a set of instances it has never seen before: the test set. After performance has been confirmed, the model can be moved to production for online inference, where it will serve as designed: new instances will be provided to the model and it will output predictions, leveraging the knowledge it has been designed and trained to have.
In the following subsections, three types of neural networks with specific elements/layers will be described. They will provide straightforward examples of different technologies that are widely encountered in the field.
Standard Fully Connected Neural Networks
The term fully connected neural network is commonly used to indicate deep neural networks that are only composed of fully connected layers. Fully connected layers are the layers whose neurons are connected to all the neurons of the previous layer, as well as all the neurons of the next one, as shown in the following diagram:
This chapter will mainly deal with fully connected networks. They map inputs to outputs through a series of intermediate hidden layers. These architectures are capable of handling a wide variety of problems, but they are limited in terms of the input dimensions they can handle, as well as the number of layers and number of neurons, due to the rapid growth of the number of parameters, which is strictly dependent on these variables.
An example of a fully connected neural network that will be encountered later on is presented as follows, built with the Keras API. It connects an input layer whose dimension is equal to len(train_dataset.keys()) to an output layer of dimension 1, by means of two hidden layers with 64 neurons each:
model = tf.keras.Sequential([tf.keras.layers.Dense\
(64, activation='relu',\
input_shape=[len(train_dataset.keys())]),\
tf.keras.layers.Dense(64, activation='relu'),\
tf.keras.layers.Dense(1)])
Now, let's quickly solve an exercise in order to aid our understanding of fully connected neural networks.
Exercise 3.02: Building a Fully Connected Neural Network Model with the Keras High-Level API
In this exercise, we will build a fully connected neural network with an input dimension of 100, 2 hidden layers, and an output layer of 10 neurons. The following are the steps to complete this exercise:
- Import the TensorFlow module and print its version:
from __future__ import absolute_import, division, \
print_function, unicode_literals
import tensorflow as tf
print("TensorFlow version: {}".format(tf.__version__))
This prints out the following line:
TensorFlow version: 2.1.0
- Create the network using the Keras sequential module. This allows us to build a model by stacking a series of layers, one after the other. In this specific case, we're using two hidden layers and an output layer:
INPUT_DIM = 100
OUTPUT_DIM = 10
model = tf.keras.Sequential([tf.keras.layers.Dense\
(128, activation='relu', \
input_shape=[INPUT_DIM]), \
tf.keras.layers.Dense(256, activation='relu'), \
tf.keras.layers.Dense(OUTPUT_DIM, activation='softmax')])
- Print the summary to look at the model description:
model.summary()
The output will be as follows:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 12928
_________________________________________________________________
dense_1 (Dense) (None, 256) 33024
_________________________________________________________________
dense_2 (Dense) (None, 10) 2570
=================================================================
Total params: 48,522
Trainable params: 48,522
Non-trainable params: 0
_________________________________________________________________
As you can see, the model has been created and the summary provides us with a clear understanding of the layers, their types and shapes, and the number of parameters of the network, which is very useful when building neural networks in real life.
Note
To access the source code for this specific section, please refer to https://packt.live/37s1M5w.
You can also run this example online at https://packt.live/3f9WzSq.
Now, let's move on and understand convolutional neural networks.
Convolutional Neural Networks
The term Convolutional Neural Network (CNN) usually identifies a deep neural network composed of a combination of the following:
- Convolutional layers
- Pooling layers
- Fully connected layers
One of the most successful applications of CNNs is in image and video processing tasks. In fact, they are far more capable than fully connected networks of handling high-dimensional inputs such as images. They are also widely used for anomaly detection tasks, being used in autoencoders, as well as encoders for reinforcement learning algorithms, specifically for policy and value networks.
Convolutional layers can be thought of as a series of filters applied (convolved) to layer inputs to generate layer outputs. The main parameters of these layers are the number of filters they have and the dimension of the convolution kernel.
Pooling layers reduce the dimensions of the data; they combine the outputs of neuron clusters at one layer into a single neuron in the next layer. Pooling layers may compute a max (MaxPooling), which uses the maximum value from each cluster of neurons at the prior layer, or an average (AveragePooling), which uses the average value from each cluster of neurons at the prior layer.
These convolution/pooling operations encode input information in a compressed representation, up to a point where these new deep features, also called embeddings, are typically provided as inputs to standard fully connected layers at the very end of the network. A classic convolutional neural network schematization is represented in the following figure:
The following exercise shows how to create a convolutional neural network using the Keras high-level API.
Exercise 3.03: Building a Convolutional Neural Network Model with the Keras High-Level API
This exercise will show you how to build a convolutional neural network with three convolutional layers (number of filters equal to 16, 32, and 64, respectively, and a kernel size of 3), alternated with three MaxPooling layers, and, at the end, two fully connected layers with 512 and 1 neurons, respectively. Here is the step-by-step procedure:
- Import the TensorFlow module and print its version:
from __future__ import absolute_import, division, \
print_function, unicode_literals
import tensorflow as tf
print("TensorFlow version: {}".format(tf.__version__))
This prints out the following line:
TensorFlow version: 2.1.0
- Create the network using the Keras sequential module:
IMG_HEIGHT = 480
IMG_WIDTH = 680
model = tf.keras.Sequential([tf.keras.layers.Conv2D\
(16, 3, padding='same',\
activation='relu',\
input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),\
tf.keras.layers.MaxPooling2D(),\
tf.keras.layers.Conv2D(32, 3, padding='same',\
activation='relu'),\
tf.keras.layers.MaxPooling2D(),\
tf.keras.layers.Conv2D(64, 3, padding='same',\
activation='relu'),\
tf.keras.layers.MaxPooling2D(),\
tf.keras.layers.Flatten(),\
tf.keras.layers.Dense(512, activation='relu'),\
tf.keras.layers.Dense(1)])
model.summary()
The preceding code allows us to build a model by stacking a series of layers, one after the other. In this specific case, three series of convolutional layers and max pooling layers are followed by a flattening layer and two dense layers.
This outputs the following model description:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 480, 680, 16) 448
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 240, 340, 16) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 240, 340, 32) 4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 120, 170, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 120, 170, 64) 18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 60, 85, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 326400) 0
_________________________________________________________________
dense (Dense) (None, 512) 167117312
_________________________________________________________________
dense_1 (Dense) (None, 1) 513
=================================================================
Total params: 167,141,409
Trainable params: 167,141,409
Non-trainable params: 0
Thus, we have successfully created a CNN using Keras. The preceding summary gives us significant information about the layers and the different parameters of the network.
Note
To access the source code for this specific section, please refer to https://packt.live/2AZJqwn.
You can also run this example online at https://packt.live/37p1OuX.
Now that we've dealt with convolutional neural networks, let's focus on another important architecture family: recurrent neural networks.
Recurrent Neural Networks
Recurrent neural networks are models composed of particular units that, in the same way as feedforward networks, are able to process data from input to output but, unlike them, are also able to feed information back through loops. They are basically designed so that the output of a layer is redirected and becomes an input to the same layer, using specific internal states capable of "remembering" previous states.
This specific feature makes them particularly suited to solving tasks characterized by a temporal/sequential development. It can be useful to compare CNNs and RNNs to understand which problems each is more suited to. CNNs are the best fit for problems where local coherence is strongly present, which is particularly the case for images/video. Local coherence is exploited to drastically reduce the number of weights needed to process high-dimensional inputs. RNNs, on the other hand, perform best on problems characterized by a temporal development, that is, tasks where data can be represented as time series. This is the case for natural language processing or speech recognition, where words and sounds are meaningful only if they're considered in a specific sequence.
Recurrent architectures can be thought of as sequences of operations, and they are perfectly designed to keep track of historical data:
The most important components they are based on are GRUs and LSTMs. These blocks have internal elements and states explicitly dedicated to keeping track of important information for the task they aim to solve. They both address the issue of learning long-term dependencies successfully when training machine learning algorithms on temporal data. They tackle this problem by storing "memory" from data seen in the past in order to help the network make predictions in the future.
The main differences between GRUs and LSTMs are the number of gates, the inputs the unit has, and the cell states, which are the internal elements that make up the unit's memory. GRUs have two gates (reset and update), while LSTMs have three, called the input, forget, and output gates. LSTMs are more flexible than GRUs since they have more parameters, which, on the other hand, makes them less efficient in terms of both memory and time.
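In Keras, both units are available as drop-in layers (tf.keras.layers.GRU and tf.keras.layers.LSTM), so switching between them only requires changing one line; a minimal sketch with illustrative dimensions:
gru_model = tf.keras.Sequential([\
                tf.keras.layers.Embedding(8000, 64),\
                tf.keras.layers.GRU(64),\
                tf.keras.layers.Dense(1)])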
These networks have been responsible for the great advancements in fields such as speech recognition, natural language processing, text-to-speech, machine translation, language modeling, and many other similar tasks.
The following is a block diagram of a typical GRU:
The following is a block diagram of a typical LSTM:
The following exercise shows how a recurrent network with LSTM units can be created using the Keras API.
Exercise 3.04: Building a Recurrent Neural Network Model with the Keras High-Level API
In this exercise, we will create a recurrent neural network using the Keras high-level API. It will have the following architecture: the very first layer encodes the input features, using certain rules, thereby producing a given set of embeddings. The second layer adds 64 LSTM units. They are wrapped inside a bidirectional wrapper, a specific layer that improves and speeds up learning by doubling the units it acts on, training the first set on the input as-is and the second set on the input reversed (for example, the words of a sentence read from right to left), and then concatenating the outputs. This technique has been proven to generate faster and better learning. Finally, two dense layers are added with 64 and 1 neurons, respectively. Perform the following steps to complete this exercise:
- Import the TensorFlow module and print its version:
from __future__ import absolute_import, division, \
print_function, unicode_literals
import tensorflow as tf
print("TensorFlow version: {}".format(tf.__version__))
This outputs the following line:
TensorFlow version: 2.1.0
- Build the model using the Keras sequential method and print the network summary:
EMBEDDING_SIZE = 8000
model = tf.keras.Sequential([\
tf.keras.layers.Embedding(EMBEDDING_SIZE, 64),\
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),\
tf.keras.layers.Dense(64, activation='relu'),\
tf.keras.layers.Dense(1)])
model.summary()
In the preceding code, the model is simply built by stacking up consecutive layers. First, there is the embedding layer, then the bidirectional one, which operates on the LSTM layer, and finally two dense layers at the end of the model.
The model summary will be as follows:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 64) 512000
_________________________________________________________________
bidirectional (Bidirectional (None, 128) 66048
_________________________________________________________________
dense (Dense) (None, 64) 8256
_________________________________________________________________
dense_1 (Dense) (None, 1) 65
=================================================================
Total params: 586,369
Trainable params: 586,369
Non-trainable params: 0
_________________________________________________________________
Note
To access the source code for this specific section, please refer to https://packt.live/3cX01OO.
You can also run this example online at https://packt.live/37nw1ud.
With this overview of how to implement a neural network using TensorFlow, the following sections will show you how to combine all these notions to tackle typical machine learning problems, including regression and classification problems.