Introduction

All of us have run into the problem of choosing an activation function when building neural network models. But what are activation functions, why should we use them, and which one should we choose for a network? Here are some notes I wrote after reading the blogs and discussions listed in the references.

0. Structure

  1. What?
  2. Why?
  3. Which?
  4. References

1. What is Activation Function

"In biologically inspired neural networks, the activation function is usually an abstraction representing the rate of action potential firing in the cell." (Wikipedia [1])

"Activation function" can be translated as "激活函数/激励函数" in Chinese. My favorite is "激活函数", because it suggests the firing, or coming alive, of a neuron and captures the biologically inspired character of the neural network. This translation makes me feel the network is really "alive (活的)"!

Well, back to business: what is the so-called activation function? See the following picture. The left part (up to the summation) is a simple perceptron, which takes the weighted sum of its inputs and then generates an output value. A step function is then applied to that weighted sum. The function applied to the weighted sum of a neuron's inputs is what we call the activation function.
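To make this concrete, here is a minimal NumPy sketch of such a perceptron with a step function as its activation. The weights and bias are hand-picked purely for illustration (they happen to make the unit compute logical AND); nothing here is prescribed by the original sources.

```python
import numpy as np

def step(z):
    """Step activation: fire (output 1) once the weighted sum crosses the threshold 0."""
    return 1 if z > 0 else 0

def perceptron(x, w, b):
    """A single neuron: weighted sum of the inputs plus a bias, then the activation."""
    z = np.dot(w, x) + b      # the "summation" part of the picture
    return step(z)            # the activation applied to the weighted sum

# Illustrative weights that make this perceptron compute logical AND
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x, dtype=float), w, b))
```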

2. Why We Use Activation Function

According to the discussion on Zhihu [2], the reason we apply activation functions in a neural network is that we need the network to handle non-linear problems. Looking back at the picture above, a single perceptron without an activation can be seen as a box that outputs a linear combination of its inputs, so it can only solve linearly separable problems. This model is good enough for a classification problem like the one shown in the left picture below:

However, we cannot count on a simple perceptron to deal with complex problems, especially non-linearly-separable ones like the right picture above.

This is because of the intrinsic structure of the perceptron: its output is a weighted linear combination of the input, so it cannot express non-linear decision boundaries even if we stack such units into a network, because a composition of linear maps is still linear. Look at the left picture shown below; the small sketch that follows makes the same point in code.
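A quick way to convince yourself: two stacked layers with no activation in between collapse into a single linear layer. A minimal NumPy check with random weights (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two stacked "layers" with no activation in between
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...collapse into a single linear layer with merged weights
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True: depth adds nothing without activations
```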

Now our main character comes on stage. See the picture on the right. When we apply an activation function with a threshold to the output of the linear combination, the output of the perceptron becomes a non-linear function of the input, thanks to the non-linearity of the activation. It transforms the linear combination into a non-linear one! And, admittedly, once such units are stacked into a network, no one can write down the exact transform between input and output in a simple closed form, at least so far.

BUT! The good thing is, we can now use the neural network to deal with problems far more complicated than just the linearly separable ones.
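As a small illustration of this gained power, the sketch below uses two ReLU units with hand-picked weights (purely illustrative, not learned, and not taken from the original sources) to reproduce XOR, a classic non-linearly-separable problem that no single linear unit can solve:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tiny_network(x):
    """Two hidden ReLU units followed by a linear output, with hand-picked weights."""
    h1 = relu(x[0] + x[1])          # counts how many inputs are on
    h2 = relu(x[0] + x[1] - 1.0)    # only fires when both inputs are on
    return h1 - 2.0 * h2            # 1 for exactly one active input, 0 otherwise

for x in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print(x, "->", tiny_network(x))   # reproduces XOR
```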

3. Which One to Choose

I have said too much above, so JUST SHOW THE PICS!

Sigmoid

$$f(z) = \frac{1}{1+e^{-z}}$$ Range of values: (0, 1).
Advantage: works well when the differences between features are very large or very small, since it squashes everything into (0, 1). Disadvantage: computing the derivative during backpropagation is relatively expensive, and it is prone to vanishing gradients.
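To see where the vanishing gradient comes from, here is a small NumPy sketch of the sigmoid and its derivative $f'(z) = f(z)(1 - f(z))$ (the sample points are arbitrary, just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)    # peaks at 0.25 when z = 0

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid = {sigmoid(z):.5f}  gradient = {sigmoid_grad(z):.5f}")
# The gradient shrinks towards 0 as |z| grows, which is where vanishing gradients come from.
```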

Tanh

$$f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$ $$\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$$ Range of values: (-1, 1).
Advantage: also handles large differences between features well, and its output is zero-centered, so it usually works better than sigmoid and converges faster. Disadvantage: the vanishing gradient problem still exists.
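A quick NumPy check of the identity above and of the zero-centered output (the sample range is arbitrary, purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))   # True: tanh(z) = 2*sigmoid(2z) - 1

# tanh outputs are centered around 0, while sigmoid outputs are all positive
print(np.tanh(z).mean(), sigmoid(z).mean())
```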

ReLU

$$\phi(x) = \max(0, x)$$ Range of values: [0, +∞).
Advantage: linear and non-saturating on the positive side; cheap to compute; less prone to vanishing gradients; works well even in unsupervised learning; yields sparse representations. Disadvantage: neurons can easily "die" (get stuck outputting 0) when the learning rate is too large.
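A minimal NumPy sketch of ReLU and its gradient, showing both the sparsity and the "dying" issue (the sample inputs are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # 1 on the positive side, exactly 0 on the negative side

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # negative inputs are zeroed out, which gives sparse activations
print(relu_grad(z))   # a neuron stuck on the negative side receives zero gradient
                      # and stops updating: the "dying ReLU" problem
```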

Leaky ReLU, PReLU, RReLU

Advanced/modified versions of ReLU that try to avoid vanishing gradients and the "death" of neurons. For more information, see the official documentation of TensorFlow and Keras.
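For reference, a minimal sketch of Leaky ReLU in NumPy; the slope alpha = 0.01 is just a commonly used default I picked here, not something prescribed by the original post:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: keep a small slope alpha on the negative side instead of 0,
    so the gradient never vanishes completely and the neuron cannot 'die'."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(z))
# PReLU learns alpha as a trainable parameter; RReLU samples alpha randomly during training.
```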

References

  1. Wikipedia, https://en.wikipedia.org/wiki/Activation_function
  2. Discussion in Zhihu, https://www.zhihu.com/question/22334626
  3. 神经网络-激活函数-面面观(Activation Function), http://blog.csdn.net/cyh_24/article/details/50593400
  4. 常用激活函数比较, http://www.jianshu.com/p/22d9720dbf1a
  5. 浅谈深度学习中的激活函数 - The Activation Function in Deep Learning, http://www.cnblogs.com/rgvb178/p/6055213.html