Advanced Machine Learning with R

Introduction to neural networks

Neural network is a fairly broad term that covers a number of related methods, but in our case, we will focus on a feedforward network trained with backpropagation. I'm not going to waste our time discussing how the machine learning methodology is similar or dissimilar to how a biological brain works. We only need to start with a working definition of what a neural network is.

To know more about artificial neural networks, I think the Wikipedia entry is a good start: https://en.wikipedia.org/wiki/Artificial_neural_network.

To summarize, in machine learning and cognitive science, artificial neural networks (ANNs) are a family of statistical learning models inspired by biological neural networks (the central nervous systems of animals, the brain) and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown.

The motivation or benefit of ANNs is that they allow the modeling of highly complex relationships between inputs/features and response variable(s), especially if the relationships are highly nonlinear. No underlying assumptions are required to create and evaluate the model, and it can be used with qualitative and quantitative responses. If this is the yin, then the yang is the common criticism that the results are a black box, which means that there is no equation with the coefficients to examine and share with the business partners. In fact, the results are almost uninterpretable. The other criticisms revolve around how results can differ by just changing the initial random inputs and that training ANNs is computationally expensive and time-consuming.

The mathematics behind ANNs is not trivial by any measure. However, it is crucial to at least get a working understanding of what is happening. A good way to intuitively develop this understanding is to start with a diagram of a simplistic neural network.

In this simple network, the inputs or covariates consist of two nodes or neurons. The neuron labeled 1 represents a constant or, more appropriately, the intercept. X1 represents a quantitative variable. W1 and W2 represent the weights that are multiplied by the input node values. These weighted values become the inputs to the hidden node. You can have multiple hidden nodes, but the principle of what happens in just this one is the same. In the hidden node, H1, the weight * value computations are summed. As the intercept is notated as 1, its contribution is simply the weight, W1. Now the magic happens. The summed value is then transformed with the activation function, turning the input signal into an output signal. In this example, as it is the only hidden node, its output is multiplied by W3 and becomes the estimate of Y, our response. This is the feedforward portion of the algorithm:
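
To make that arithmetic concrete, here is a minimal sketch of the forward pass in R; the weight and input values are made up purely for illustration, and the built-in tanh() stands in for the activation function:

    > w1 <- 0.3   # weight on the intercept (bias) node
    > w2 <- -0.7  # weight on the input X1
    > w3 <- 1.1   # weight from the hidden node to the output
    > x1 <- 2.5   # a single observation of X1
    > h1 <- tanh(w1 * 1 + w2 * x1)  # hidden node: weighted sum, then activation
    > y_hat <- w3 * h1              # the estimate of Y, our response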

But wait, there's more! To complete the cycle or epoch, as it is known, backpropagation happens and trains the model based on what was learned. To initiate backpropagation, an error is determined based on a loss function such as the sum of squared error or cross-entropy, among others. As the weights, W1 and W2, were set to some initial random values between -1 and 1, the initial error may be high. Working backward, the weights are changed to minimize the error in the loss function. The following diagram portrays the backpropagation portion:

This completes one epoch. This process continues, using gradient descent (discussed in Chapter 5, K-Nearest Neighbors and Support Vector Machines), until the algorithm converges to the minimum error or a pre-specified number of epochs is reached. If we assume that our activation function is simply linear, then in this example we would end up with Y = W3(W1(1) + W2(X1)).
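
To see what one such cycle looks like in code, here is a minimal sketch of a single epoch for this tiny network, assuming a linear activation, a sum of squared error loss, and a made-up learning rate and training observation:

    > set.seed(123)
    > x1 <- 2.5                       # one training observation
    > y  <- 1.0                       # its observed response
    > w  <- runif(3, -1, 1)           # W1, W2, W3 initialized randomly between -1 and 1
    > lr <- 0.01                      # learning rate for gradient descent
    > # feedforward with a linear activation
    > h1    <- w[1] * 1 + w[2] * x1
    > y_hat <- w[3] * h1
    > error <- 0.5 * (y - y_hat)^2    # sum of squared error loss (for monitoring)
    > # backpropagation: gradient of the loss with respect to each weight
    > grad <- c(-(y - y_hat) * w[3],       # dE/dW1
                -(y - y_hat) * w[3] * x1,  # dE/dW2
                -(y - y_hat) * h1)         # dE/dW3
    > w <- w - lr * grad              # gradient descent update completes the epoch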

The networks can get complicated if you add numerous input neurons, multiple neurons in a hidden node, and even multiple hidden nodes. It is important to note that the output from a neuron is connected to all the subsequent neurons and has weights assigned to all of these connections. This greatly increases the model's complexity. Adding hidden nodes and increasing the number of neurons in the hidden nodes did not improve the performance of ANNs as we had hoped. Thus, deep learning was developed, which in part relaxes the requirement that every neuron be connected to all of the others.

There are a number of activation functions that you can use/try, including a simple linear function, or for a classification problem, the sigmoid function, which is a special case of the logistic function (Chapter 3, Logistic Regression). Other common activation functions are Rectifier, Maxout, and tanh (hyperbolic tangent).
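
As a quick point of reference, here is a minimal sketch of the Rectifier (ReLU) in R; tanh() is already built in, and Maxout (the maximum over several linear units) is omitted for brevity:

    > relu <- function(x) {
        pmax(0, x)   # rectifier: zero for negative inputs, the identity otherwise
      }
    > relu(c(-2, 0, 3))
    [1] 0 0 3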

We can plot the sigmoid function in R, first creating a function to calculate its values:

    > sigmoid <- function(x) {
        1 / (1 + exp(-x))
      }

Then, it is a simple matter of plotting the function over a range of values, say -5 to 5:

    > x <- seq(-5, 5, 0.1)
    > plot(sigmoid(x))

The output of the preceding command is as follows:

The tanh function (hyperbolic tangent) is a rescaling of the logistic sigmoid, with its output bounded between -1 and 1. The tanh function relates to the sigmoid as follows: tanh(x) = 2 * sigmoid(2x) - 1.
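
We can confirm this identity numerically over the same grid of x values, reusing the sigmoid() function we created previously:

    > all.equal(tanh(x), 2 * sigmoid(2 * x) - 1)
    [1] TRUE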

Let's plot the tanh and sigmoid functions for comparison purposes, this time using ggplot2:

    > install.packages("ggplot2")
    > s <- sigmoid(x)
    > t <- tanh(x)
    > z <- data.frame(cbind(x, s, t))
    > ggplot2::ggplot(z, ggplot2::aes(x)) +
        ggplot2::geom_line(ggplot2::aes(y = s, color = "sigmoid")) +
        ggplot2::geom_line(ggplot2::aes(y = t, color = "tanh"))


The output of the preceding command is as follows:

So, why use the tanh function rather than the sigmoid? It seems there are many opinions on the subject. In short, assuming you have scaled data with mean 0 and variance 1, the tanh function permits weights that are, on average, close to zero (zero-centered). This helps in avoiding bias and improves convergence. Think about the implications of a sigmoid activation, where the signal passed from one neuron to the next is always positive: during backpropagation, the weights between layers will all be updated in the same direction, either all positive or all negative, which can cause performance issues. Also, since the gradient at the tails of the sigmoid (near 0 and 1) is almost zero, during backpropagation it can happen that almost no signal flows between neurons of different layers. A full discussion of the issue is available in LeCun (1998). Keep in mind that it is not a foregone conclusion that tanh is always better.
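
One way to see the saturation issue for yourself is to compare the derivatives of the two functions at a few arbitrary points; the derivative formulas used here are the standard ones for the sigmoid and tanh:

    > v <- c(-3, 0, 3)
    > round(sigmoid(v) * (1 - sigmoid(v)), 4)   # sigmoid gradient: at most 0.25, shrinking toward zero in the tails
    [1] 0.0452 0.2500 0.0452
    > round(1 - tanh(v)^2, 4)                   # tanh gradient: up to 1 near zero, but it also vanishes in the tails
    [1] 0.0099 1.0000 0.0099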

This all sounds fascinating, but the ANN almost went the way of disco, as it just did not perform as well as advertised, especially when trying to use deep networks with many hidden layers and neurons. It seems that a slow but steady revival came about with the seminal paper by Hinton and Salakhutdinov (2006) and the reformulated and, dare I say, rebranded neural network: deep learning.