Activation Functions for Deep Neural Networks
The Universal Approximation Theorem
Any predictive model is a mathematical function, y = f(x), that maps the features (x) to the target variable (y). The function f(x) can be linear or it can be a fairly complex nonlinear function, and how accurately a model can learn it depends on the distribution of the data. In the case of neural networks, it also depends on the type of network architecture that is employed. The Universal Approximation Theorem says, roughly, that for any continuous f(x) on a closed, bounded input range, a neural network with a nonlinear activation function and enough hidden units can approximate it as closely as desired. The theorem guarantees that such a network exists, not that training will find it. To build a suitable neural network architecture, let us take a look at activation functions.
What are Activation Functions?
Simply put, activation functions define the output of neurons given certain sets of inputs. They are mathematical functions added to neural network models to enable the models to learn complex patterns. An activation function takes the output of the previous layer and transforms it into a form that can serve as the input to the next computation layer. The choice of activation function affects both the accuracy a network can reach and the computational cost of training it.
Why do we need Activation Functions?
In a neural network, if each hidden layer simply computes the weighted sum of its inputs, the whole network reduces to a linear function that is equivalent to a linear regression model.
In the above diagram, we see the hidden layer is simply the weighted sum of the inputs from the input layer. For example, b1 = bw1 + a1w1 + a2w3 which is nothing but a linear function.
Multi-layer neural network models can classify classes that are not linearly separable. To do so, however, the network must represent a nonlinear function. To obtain this nonlinear transformation, we pass the weighted sum of the inputs through an activation function. These activation functions are nonlinear functions applied at the hidden layers. Different hidden layers can use different activation functions, though typically all neurons within a layer share the same one.
Types of Activation Functions
In this section we discuss the following:
- Linear Function
- Threshold Activation Function
- Bipolar Activation Function
- Logistic Sigmoid Function
- Bipolar Sigmoid Function
- Hyperbolic Tangent Function
- Rectified Linear Unit Function
- Swish Function (proposed by Google Brain – a deep learning artificial intelligence research team at Google)
Linear Function: A linear activation function is a straight line, y = mx. Irrespective of the number of hidden layers, if all the layers are linear, the final output is simply a linear function of the input values. Hence we look at the other activation functions, which are nonlinear and can help learn complex patterns.
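The claim that stacked linear layers stay linear can be checked numerically. The sketch below is only an illustration: the layer sizes, the random weights, and the names W1, b1, W2, b2 are assumptions, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "hidden layers" with no activation: each is a weighted sum plus a bias.
# Shapes and values are arbitrary, chosen only for illustration.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_linear_layers(x):
    return W2 @ (W1 @ x + b1) + b2

# The same mapping written as one linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2

def one_linear_layer(x):
    return W_combined @ x + b_combined

x = rng.normal(size=3)
same = np.allclose(two_linear_layers(x), one_linear_layer(x))
print(same)
```

Because the two functions agree, the extra layer adds no expressive power until a nonlinear activation is placed between the layers.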
Threshold Activation Function: Here the neuron is activated only if the input is above a certain value. Note that this function outputs either 1 or 0. In other words, if we need to classify inputs into more than two categories, a threshold activation function is not suitable. Because of its binary output, it is also known as the binary step activation function.
Bipolar Activation Function: This is similar to the threshold function we explained above. However, this activation function will return an output of either -1 or +1 based on a threshold.
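The two step functions can be sketched as follows. The default threshold of 0 and the function names are assumptions made for the example; the text only says that the neuron fires above "a certain value".

```python
import numpy as np

def threshold_step(x, threshold=0.0):
    """Binary step: 1 where x is above the threshold, otherwise 0."""
    return np.where(np.asarray(x) > threshold, 1, 0)

def bipolar_step(x, threshold=0.0):
    """Bipolar step: +1 where x is above the threshold, otherwise -1."""
    return np.where(np.asarray(x) > threshold, 1, -1)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(threshold_step(x).tolist())  # [0, 0, 0, 1, 1]
print(bipolar_step(x).tolist())    # [-1, -1, -1, 1, 1]
```

Both functions are flat everywhere except at the threshold, so their gradient carries no information, which is one reason they are rarely used with gradient-based training.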
Logistic Sigmoid Function: One of the most frequently used activation functions is the Logistic Sigmoid Function. Its output ranges between 0 and 1 and is plotted as an ‘S’ shaped graph.
This is a nonlinear function. Around x = 0 its curve is steep, so a small change in x leads to a comparatively large change in y, while for large positive or negative x the curve flattens out. This activation function is generally used for binary classification, where the expected output is 0 or 1: it produces a continuous output between 0 and 1, and a default threshold of 0.5 converts that output to 0 or 1 to classify the observations.
Another variation of the Logistic Sigmoid Function is the Bipolar Sigmoid Function. It is a rescaled version of the Logistic Sigmoid Function that provides an output in the range of -1 to +1.
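A minimal sketch of both sigmoid variants, plus the 0.5 cutoff described above. The bipolar form used here, 2·σ(x) − 1, is one common way to write the rescaling; the function names and the `classify` helper are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid(x):
    """Sigmoid rescaled to (-1, 1), written as 2 * sigmoid(x) - 1."""
    return 2.0 * sigmoid(x) - 1.0

def classify(x, cutoff=0.5):
    """Turn the continuous sigmoid output into a 0/1 label."""
    return (sigmoid(x) >= cutoff).astype(int)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid(x).round(3))
print(bipolar_sigmoid(x).round(3))
print(classify(x).tolist())  # [0, 0, 1, 1, 1]
```

Note that an input of exactly 0 gives a sigmoid output of 0.5, so whether it lands in class 0 or class 1 depends on whether the cutoff comparison is inclusive; the sketch assumes it is.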
Hyperbolic Tangent Function: This activation function, also written tanh, is quite similar to the sigmoid function, but its output ranges from -1 to +1 and is centered on zero.
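The similarity to the sigmoid is exact up to rescaling: tanh(x) = 2·σ(2x) − 1. The short check below illustrates this identity; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)

# tanh written as a shifted and scaled sigmoid.
tanh_via_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
matches = np.allclose(np.tanh(x), tanh_via_sigmoid)
print(matches)
```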
Rectified Linear Activation Function: This activation function, also known as ReLU, returns the input itself if the input is positive and returns zero otherwise. It is piecewise linear: it behaves like a linear function for positive inputs, which keeps it computationally simple.
This activation function has become quite popular, largely because it is cheaper to compute than the sigmoid and the hyperbolic tangent and often helps the model converge faster.
Another critical point to note is that, whereas the sigmoid only approaches zero asymptotically and the hyperbolic tangent is exactly zero only at x = 0, the Rectified Linear Activation Function returns a true zero for every non-positive input.
One disadvantage of ReLU is that its gradient is zero whenever the input is negative. A neuron whose inputs stay negative therefore receives no weight updates during back-propagation and can stop learning, which can keep the model from converging. This is commonly termed the “Dying” ReLU problem.
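A minimal sketch of ReLU and its gradient makes the problem visible. The derivative is undefined at exactly x = 0; using 0 there is a convention assumed for this example.

```python
import numpy as np

def relu(x):
    """Return x where x is positive, otherwise 0."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise.
    The value at x == 0 is a convention (assumed to be 0 here)."""
    return (np.asarray(x) > 0).astype(float)

x = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(relu(x).tolist())       # [0.0, 0.0, 0.0, 0.1, 3.0]
print(relu_grad(x).tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Every negative input produces a gradient of exactly zero, so no error signal passes back through that neuron for those inputs.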
There are a few variations of the ReLU activation function, such as Noisy ReLU, Leaky ReLU, Parametric ReLU, and Exponential Linear Units (ELU).
Leaky ReLU, a modified version of ReLU, helps solve the “Dying” ReLU problem by allowing back-propagation to work even when the inputs are negative. Unlike ReLU, Leaky ReLU applies a small linear slope to x when x is negative. With this change, the gradient is non-zero instead of zero for negative inputs, which avoids dead neurons. However, the handling of negative values can itself become a challenge for Leaky ReLU.
Exponential Linear Unit (ELU) is another variant of ReLU which, unlike ReLU and Leaky ReLU, uses an exponential curve instead of a straight line for negative values.
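The two variants can be sketched as below. The negative slope of 0.01 and alpha of 1.0 are commonly used defaults, assumed here for illustration rather than taken from the text.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """x for positive inputs, a small linear slope for negative inputs."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, negative_slope * x)

def elu(x, alpha=1.0):
    """x for positive inputs, alpha * (exp(x) - 1) for negative inputs."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs,
    # which np.where passes through unchanged anyway.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(leaky_relu(x).tolist())
print(elu(x).round(4).tolist())
```

Leaky ReLU keeps decreasing without bound as x becomes more negative, whereas ELU levels off towards -alpha.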
Swish Activation Function: Swish is a more recent activation function proposed by Google Brain. While ReLU returns zero for all negative inputs, Swish returns small negative values for them. Swish is self-gated: whereas ordinary gates require multiple scalar inputs, self-gating needs only a single input, which is multiplied by its own sigmoid. Unlike ReLU, Swish is a smooth and non-monotonic function, properties that its authors argue help it outperform ReLU. Swish is unbounded above and bounded below. Swish is represented as x · σ(βx), where σ(z) = (1 + exp(−z))^(−1) is the sigmoid function and β is a constant or a trainable parameter.
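The definition x · σ(βx) translates directly into code. The sketch below fixes β = 1 as an assumption and evaluates Swish on an arbitrary grid to show that it is bounded below and non-monotonic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(x, beta=1.0):
    """Swish: the input gated by its own sigmoid."""
    return x * sigmoid(beta * x)

x = np.linspace(-5.0, 5.0, 1001)
y = swish(x)

steps = np.diff(y)
print(y.min())                              # a small negative number
print(bool(np.any(steps < 0)), bool(np.any(steps > 0)))  # falls, then rises
```

The output dips slightly below zero for moderately negative inputs and then climbs back towards zero as the input becomes more negative, which is where the non-monotonic shape comes from.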
Activation functions in deep learning and the vanishing gradient problem
Many algorithms use gradient-based methods to train their models. Neural networks are typically trained with stochastic gradient descent. The weights of the layers are initialized randomly and, once the network predicts an output, the prediction error is calculated. This error is used to estimate a gradient, which is then used to update the weights in the network so as to reduce the prediction error. The error gradient is propagated backward from the output layer to the input layer.
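The training loop described above can be sketched for the smallest possible case, a single sigmoid neuron. Everything specific here is an assumption made for the example: the toy dataset, the squared-error loss, the learning rate, and the number of epochs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical toy data: the label is 1 when the two features sum to a positive number.
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

# Small random initial weights.
w = 0.01 * rng.normal(size=2)
b = 0.0
learning_rate = 0.5

def mean_squared_error(w, b):
    return np.mean((sigmoid(X @ w + b) - y) ** 2)

loss_before = mean_squared_error(w, b)

for epoch in range(20):
    for i in rng.permutation(len(X)):   # one example at a time: "stochastic"
        pred = sigmoid(X[i] @ w + b)    # forward pass
        error = pred - y[i]             # prediction error
        # Chain rule: d(error^2)/dz = 2 * error * sigmoid'(z),
        # where sigmoid'(z) = pred * (1 - pred).
        grad_z = 2.0 * error * pred * (1.0 - pred)
        w -= learning_rate * grad_z * X[i]   # update the weights
        b -= learning_rate * grad_z          # update the bias

loss_after = mean_squared_error(w, b)
print(loss_before, loss_after)
```

With only one layer there is nothing to propagate through; in a deeper network the same chain-rule factor is multiplied layer by layer on the way back to the input.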
Models with more hidden layers are often preferred, because additional hidden layers give a neural network the capacity to model more complex functions and, potentially, to predict more accurately.
One problem with too many layers is that the gradient diminishes quickly as it moves from the output layer to the input layer during backpropagation. By the time it reaches the layers closest to the input, the gradient may be too small to produce any meaningful weight update, so those layers barely learn. This makes it difficult to train a deep neural network model with gradient-based methods.
This is known as the vanishing gradient problem. Gradient-based methods are more likely to face it when saturating activation functions, such as sigmoid or hyperbolic tangent, are used in the network.
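A back-of-the-envelope sketch shows how quickly the signal can shrink. It makes a simplifying assumption: the weights are ignored and only the activation's derivative is multiplied along the chain, using the sigmoid as the example activation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative peaks at 0.25 (at z = 0), so a chain of n sigmoid
# layers scales the backpropagated signal by at most 0.25 ** n.
# Weights are ignored here; this is a simplification for illustration.
peak = sigmoid_grad(0.0)
for n_layers in (1, 5, 10, 20):
    print(n_layers, peak ** n_layers)

# For comparison, ReLU's derivative is 1 for positive inputs,
# so the same product stays at 1 along an active path.
relu_best_case = 1.0 ** 20
```

Even in this best case, twenty sigmoid layers shrink the signal by a factor of roughly a trillion.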
In deep neural networks, various activation functions are used. However, when deep neural network models are trained, vanishing gradients can make training slow or unstable.
Various workarounds have been proposed to mitigate this problem. The most commonly used is the ReLU activation function, which in deep networks has generally performed better than earlier activation functions such as sigmoid or hyperbolic tangent.
As mentioned earlier, Swish improves upon ReLU by being a smooth and non-monotonic function. However, although the vanishing gradient problem is much less severe with Swish, Swish does not avoid it completely. To tackle this problem, a new activation function has been proposed.
“The activation function in the neural network is one of the important aspects which facilitates the deep training by introducing the nonlinearity into the learning process. However, because of zero-hard rectification, some of the existing activation functions such as ReLU and Swish miss to utilize the large negative input values and may suffer from the dying gradient problem. Thus, it is important to look for a better activation function that is free from such problems…. The proposed LiSHT activation function is an attempt to scale the non-linear Hyperbolic Tangent (Tanh) function by a linear function and tackle the dying gradient problem… A very promising performance improvement is observed on three different types of neural networks including Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN), and Recurrent Neural Networks like Long-short term memory (LSTM).“
Swalpa Kumar Roy, Suvojit Manna, et al, Jan 2019
In their paper, Swalpa Kumar Roy, Suvojit Manna, et al. propose a new non-parametric activation function, the Linearly Scaled Hyperbolic Tangent (LiSHT), for neural networks that attempts to tackle the vanishing gradient problem.
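Reading the quoted description literally, scaling tanh by a linear function gives x · tanh(x). The sketch below implements that form and its derivative; the function names are hypothetical, and the paper itself should be consulted for the authors' exact formulation and experiments.

```python
import numpy as np

def lisht(x):
    """Linearly scaled hyperbolic tangent: x * tanh(x)."""
    return x * np.tanh(x)

def lisht_grad(x):
    """Derivative by the product rule: tanh(x) + x * (1 - tanh(x)^2)."""
    t = np.tanh(x)
    return t + x * (1.0 - t ** 2)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(lisht(x).round(4).tolist())
print(lisht_grad(x).round(4).tolist())
```

Unlike ReLU, the gradient is non-zero for large negative inputs, and the output is non-negative and symmetric about zero.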