上QQ阅读APP看书，第一时间看更新

Activation functions

To allow a neural network to learn complex decision boundaries, we apply a non-linear activation function to some of its layers. Commonly used functions include Tanh, ReLU, softmax, and variants of these. More technically, each neuron receives as input signal the weighted sum of the synaptic weights and the activation values of the neurons connected. One of the most widely used functions for this purpose is the so-called sigmoid function. It is a special case of the logistic function, which is defined by the following formula:

The domain of this function includes all real numbers, and the co-domain is (0, 1). This means that any value obtained as an output from a neuron (as per the calculation of its activation state), will always be between zero and one. The sigmoid function, as represented in the following diagram, provides an interpretation of the saturation rate of a neuron, from not being active (= 0) to complete saturation, which occurs at a predetermined maximum value (= 1).

On the other hand, a hyperbolic tangent, or tanh, is another form of the activation function. Tanh squashes a real-valued number to the range [-1, 1]. In particular, mathematically, tanh activation function can be expressed as follows:

The preceding equation can be represented in the following figure:

Sigmoid versus tanh activation function

In general, in the last level of an feedforward neural network (FFNN), the softmax function is applied as the decision boundary. This is a common case, especially when solving a classification problem. In probability theory, the output of the softmax function is squashed as the probability distribution over K different possible outcomes. Nevertheless, the softmax function is used in various multiclass classification methods, such that the network's output is distributed across classes (that is, probability distribution over the classes) having a dynamic range between -1 and 1 or 0 and 1.

For a regression problem, we do not need to use any activation function since the network generates continuous values—probabilities. However, I've seen people using the IDENTITY activation function for regression problems nowadays. We'll see this in later chapters.

To conclude, choosing proper activation functions and network weights initialization are two problems that make a network perform at its best and help to obtain good training. We'll discuss more in upcoming chapters; we will see where to use which activation function.