What is an activation function?
In an Artificial Neural Network (ANN), the activation function is the component that decides whether a neuron should be activated or not. It defines the output of a node for a given input or set of inputs.
Activation functions are used to introduce non-linear properties to neural networks.
Neural network activation functions are generally considered among the most critical components of deep learning, since they determine the output of deep learning models. They also affect the accuracy and training efficiency of large-scale artificial neural networks. Activation functions are important because, without one, a neural network is in essence nothing more than a linear regression model. A linear equation is a polynomial of degree one: it is easy to solve, but such a model cannot solve complex problems or fit higher-degree relationships.
The activation function is what carries out the non-linear transformation of the input, enabling the network to learn and carry out tasks of increasing complexity, such as language translation and image classification.
What are the different types of activation functions?
There are various types of activation functions. Here are some of the widely used ones:
Binary Step Function
The binary step function is also known as the ‘Threshold Function’. It is essentially a threshold-based classifier.
When using the binary step function, if the input to the activation function is higher than the set threshold, then the neuron will be activated. If the input is lower than the threshold, then the neuron is deactivated.
This activation function can only be used for binary classification problems, although it can be tweaked to apply to multi-class problems.
Another limitation is that the gradient (derivative) of the binary step function is zero, which hinders backpropagation.
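As a minimal sketch (assuming a threshold of 0 and using NumPy for vectorised evaluation; the function name is illustrative), the binary step function can be written as:

import numpy as np

def binary_step(x, threshold=0.0):
    # Output 1 if the input exceeds the threshold, otherwise 0.
    return np.where(x > threshold, 1, 0)

print(binary_step(np.array([-2.0, 0.5, 3.0])))  # [0 1 1]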
Linear Activation Function
Linear functions are also known as straight-line functions. Here, the output is proportional to the weighted sum of the inputs. Its function can be represented with this equation:
f(x) = ax + c
A major problem with this function is that its derivative is a constant that has no relation to the input. During the backpropagation process, weights and biases get updated, but the gradient itself never changes.
Another issue is that irrespective of the number of layers in the neural network, the last layer will always be a linear function of the first layer.
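A rough sketch of the linear activation and its constant derivative (the parameters a and c here are illustrative constants, and NumPy is assumed):

import numpy as np

def linear(x, a=1.0, c=0.0):
    # f(x) = ax + c: the output is simply proportional to the input.
    return a * x + c

def linear_derivative(x, a=1.0):
    # The gradient is the constant a, independent of the input x.
    return np.full_like(x, a)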
Sigmoid Activation Function
This activation function takes a real value as input and generates a value between 0 and 1 as output. It translates inputs from the range (-∞, ∞) to the range (0, 1).
The sigmoid or logistic activation function is used widely in classification problems.
The derivative of a sigmoid function lies between 0 and 0.25. While the sigmoid itself is monotonic, its derivative is not, and the function suffers from the vanishing gradient problem.
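A minimal sketch of the sigmoid and its derivative (NumPy assumed, function names illustrative):

import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigmoid(x) * (1 - sigmoid(x)) peaks at 0.25 when x = 0,
    # which is why gradients can vanish in deep networks.
    s = sigmoid(x)
    return s * (1.0 - s)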
Tanh Function
This is another non-linear activation function. The derivative of a Tanh function can be expressed in terms of the function itself.
A Tanh function generates output values between -1 and 1.
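A short sketch (NumPy assumed) showing the function and its derivative expressed in terms of the function itself:

import numpy as np

def tanh(x):
    # Squashes inputs into the range (-1, 1).
    return np.tanh(x)

def tanh_derivative(x):
    # The derivative 1 - tanh(x)^2 is written in terms of the function itself.
    return 1.0 - np.tanh(x) ** 2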
ReLU Function
Even though ReLU stands for Rectified Linear Unit, these functions are not linear. They hold a significant advantage in the fact that they do not activate all the neurons at the same time. Neurons are deactivated only if the output of the linear transformation is less than 0.
The formula is: f(z) = max(0, z)
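As a one-line sketch (NumPy assumed):

import numpy as np

def relu(z):
    # f(z) = max(0, z): negative inputs are zeroed, positive inputs pass through.
    return np.maximum(0, z)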
Leaky ReLU
This is an enhanced version of the ReLU function. These functions try to solve the “dying ReLU” problem.
While a ReLU is 0 when z<0, Leaky ReLUs permit a tiny, non-zero, constant gradient α (usually, α=0.01).
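A minimal sketch with the commonly used default α = 0.01 (NumPy assumed):

import numpy as np

def leaky_relu(z, alpha=0.01):
    # Negative inputs are scaled by the small constant alpha instead of being zeroed.
    return np.where(z > 0, z, alpha * z)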
Parameterised ReLU (PReLU)
Parameterised ReLU adds a new parameter as the slope in the negative region of the function. This allows the neurons to learn which slope works best in the negative region.
Parameterised ReLUs can turn into ReLUs or Leaky ReLUs with certain values of α.
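A sketch of the forward pass (NumPy assumed); in practice α is a learnable parameter updated during training rather than a fixed constant:

import numpy as np

def prelu(z, alpha):
    # Same shape as Leaky ReLU, but alpha is learned during training.
    return np.where(z > 0, z, alpha * z)

# alpha = 0 recovers ReLU; alpha = 0.01 recovers the usual Leaky ReLU.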
Exponential Linear Unit (ELU)
Compared with ReLU, these functions tend to converge at a quicker pace and produce results with greater levels of accuracy.
They have an additional alpha constant which is a positive number.
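A rough sketch with α = 1 as an illustrative default (NumPy assumed):

import numpy as np

def elu(z, alpha=1.0):
    # Positive inputs pass through; negative inputs saturate smoothly towards -alpha.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))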
Swish
The Swish activation function was discovered by researchers at Google. It is computationally efficient like ReLU, but tends to perform better than ReLU on deeper models.
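A minimal sketch of the β = 1 form, x · sigmoid(x), sometimes also called SiLU (NumPy assumed, parameter name illustrative):

import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x), written here as x / (1 + exp(-beta * x)).
    return x / (1.0 + np.exp(-beta * x))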
Softmax Activation Function
The Softmax activation function is usually described as a combination of multiple sigmoid functions. It is used to solve multi-class classification problems, calculating a probability distribution over 'n' different events.
These probabilities will help in figuring out the target class for the inputs.
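A short sketch (NumPy assumed); subtracting the maximum is a standard trick for numerical stability and does not change the result:

import numpy as np

def softmax(logits):
    # The outputs are positive and sum to 1, forming a probability distribution.
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.66, 0.24, 0.10]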
Why is an activation function used?
The activation function determines whether or not a neuron should be activated. The neuron first calculates the weighted sum of its inputs and adds a bias to it; the activation function is then applied to this value, adding non-linearity to the neuron's output.
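As a minimal sketch of a single neuron (the weights, bias, and choice of sigmoid here are purely illustrative; NumPy assumed):

import numpy as np

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias, passed through an activation.
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

print(neuron_output(np.array([0.5, -1.0]), np.array([0.8, 0.2]), bias=0.1))  # approx. 0.57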
In neural networks, the weights and biases of the neurons are updated based on the error at the output. This process is called backpropagation. It was one of the earliest techniques to demonstrate that artificial neural networks could learn good internal representations. The backpropagation algorithm turned out to be so efficient that human experts were no longer needed to hand-design appropriate features. This made it possible for artificial neural networks to handle and solve complex problems that they could not deal with earlier. The whole point of backpropagation is to optimize the weights so that the network learns how to correctly map arbitrary inputs to outputs.
Activation functions are used because they make backpropagation possible: gradients are supplied along with the error to update the weights and biases. Because activation functions are differentiable, the gradient of the loss function can be computed and propagated back through the network during training.