1. INTRODUCTION
When an input sample is presented to a feed-forward neural network (FNN), it is processed through a series of weighted sums and nonlinear activation functions. It has been proved that FNNs with enough hidden nodes can approximate any function [1]-[4]. Herein, the weighted sums to hidden nodes are projections from the input space onto a hidden feature space, followed by element-wise nonlinear activation functions. Several studies have sought to understand the role of hidden nodes. Oh and Lee proved that the nonlinear function of hidden nodes has the effect of decreasing correlations among hidden nodes, and argued that FNNs are a special type of nonlinear whitening filter [5]. Shah and Poon showed that hidden nodes with sigmoidal activation functions can produce linearly independent internal representations [6].
In the neural network field, information theory has provided many fruitful research results. Lee et al. reported that FNNs have an information extraction capability for pattern classification, hierarchically keeping inter-class information while reducing intra-class variations [7]. Learning rules stemming from information theory have been proposed [8]-[14]. An upper bound on the probability of error was derived based on Renyi's entropy [15]. Information theory can also provide a construction strategy for neural networks [16]. In addition, hidden information maximization and maximum mutual information methods were proposed for feature extraction [17], [18]. In this paper, we focus on the nonlinear activation functions of hidden nodes from the viewpoint of information theory. Under the assumption that the nonlinear activation function can be approximated piece-wise linearly, we derive that the entropy of hidden nodes decreases after the piece-wise linear transformation. Based on this derivation, we can interpret the role of hidden nodes in terms of entropy or uncertainty.
2. NONLINEAR EFFECT ON THE ENTROPY OF HIDDEN NODES
In FNNs, inputs are processed through a series of weighted sums and nonlinear activation functions. When inputs or weights are perturbed randomly, the weighted sums to hidden nodes are approximately jointly Gaussian according to the central limit theorem [19]-[21]. Therefore, we analyze the effect of the nonlinear function on jointly Gaussian random variables.
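As a quick numerical illustration of this Gaussian approximation, the following sketch (in Python, with a hypothetical layer size and uniform perturbations chosen only for illustration) checks how close a randomly weighted sum is to Gaussian through its skewness and excess kurtosis, both of which should be near zero:

import numpy as np

rng = np.random.default_rng(0)

n_inputs = 64        # hypothetical number of input nodes
n_samples = 100_000  # number of random perturbations

# One hidden node: a fixed random weight vector applied to randomly
# perturbed inputs. The weighted sum adds many independent terms, so by
# the central limit theorem it is approximately Gaussian.
w = rng.uniform(-1.0, 1.0, size=n_inputs)
x = rng.uniform(-1.0, 1.0, size=(n_samples, n_inputs))
weighted_sum = x @ w

# Standardize and compare the third and fourth moments with those of a
# standard Gaussian (skewness 0, excess kurtosis 0).
z = (weighted_sum - weighted_sum.mean()) / weighted_sum.std()
print("skewness       :", np.mean(z**3))
print("excess kurtosis:", np.mean(z**4) - 3.0)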
If u and v are jointly Gaussian random variables with zero means, then the joint entropy of u and v [22] is given by
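the standard closed form for zero-mean jointly Gaussian variables (in nats),

h(u,v) = \ln\left( 2\pi e\, \sigma_u \sigma_v \sqrt{1-r^2} \right). \qquad (1)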
Herein, σu and σv are the standard deviations of u and v, respectively, and r is the correlation coefficient between u and v.
Let us assume that u and v are transformed into y and z as follows:
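With one slope on each half-line,

y = au \;(u \ge 0), \qquad y = bu \;(u < 0),
z = cv \;(v \ge 0), \qquad z = dv \;(v < 0), \qquad (2)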
where a, b, c, and d are nonzero real values. Then, the entropy of y is given by
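the differential entropy

h(y) = -\int_{-\infty}^{\infty} f_y(y) \ln f_y(y)\, dy, \qquad (3)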
where fy(y) is the probability density function (p.d.f.) of the random variable y. By substituting [19]
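the density-transformation relation (written here for positive slopes a and b, as for the sigmoid slopes considered below),

f_y(y) = \frac{1}{a} f_u\!\left(\frac{y}{a}\right) \;(y \ge 0), \qquad f_y(y) = \frac{1}{b} f_u\!\left(\frac{y}{b}\right) \;(y < 0),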
Eq. (3) can be derived by
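changing variables on each half-line:

h(y) = -\int_{0}^{\infty} \frac{1}{a} f_u\!\left(\frac{y}{a}\right) \ln\!\left[\frac{1}{a} f_u\!\left(\frac{y}{a}\right)\right] dy - \int_{-\infty}^{0} \frac{1}{b} f_u\!\left(\frac{y}{b}\right) \ln\!\left[\frac{1}{b} f_u\!\left(\frac{y}{b}\right)\right] dy
     = -\int_{-\infty}^{\infty} f_u(u) \ln f_u(u)\, du + \ln a \int_{0}^{\infty} f_u(u)\, du + \ln b \int_{-\infty}^{0} f_u(u)\, du
     = h(u) + \tfrac{1}{2}\left( \ln a + \ln b \right).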
Here, fu(u) is the p.d.f. of the random variable u, and the integral of fu(u) over each half-line is 1/2 since u is Gaussian with zero mean. Accordingly, the entropy of z follows in the same way.
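Under the same convention for the slopes c and d, this gives

h(z) = h(v) + \tfrac{1}{2}\left( \ln c + \ln d \right).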
The joint entropy after the piece-wise-linear transformations is defined by
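the joint differential entropy

h(y,z) = -\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f_{yz}(y,z) \ln f_{yz}(y,z)\, dy\, dz. \qquad (8)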
The joint p.d.f. of y and z can be separated into four quadrants as follows:
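For positive slopes, each quadrant of the (u, v) plane maps onto the corresponding quadrant of the (y, z) plane, so that

f_{yz}(y,z) = \frac{1}{ac} f_{uv}\!\left(\frac{y}{a}, \frac{z}{c}\right), \; y \ge 0, z \ge 0; \qquad
f_{yz}(y,z) = \frac{1}{ad} f_{uv}\!\left(\frac{y}{a}, \frac{z}{d}\right), \; y \ge 0, z < 0;
f_{yz}(y,z) = \frac{1}{bc} f_{uv}\!\left(\frac{y}{b}, \frac{z}{c}\right), \; y < 0, z \ge 0; \qquad
f_{yz}(y,z) = \frac{1}{bd} f_{uv}\!\left(\frac{y}{b}, \frac{z}{d}\right), \; y < 0, z < 0. \qquad (9)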
Therefore, Eq. (8) is also separated into four parts as
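a sum of four quadrant integrals,

h(y,z) = -\sum_{i=1}^{4} \iint_{Q_i} f_{yz}(y,z) \ln f_{yz}(y,z)\, dy\, dz, \qquad (10)

where Q_1, \dots, Q_4 denote the four quadrants of the (y, z) plane.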
Since the joint p.d.f. of u and v is given by
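the zero-mean bivariate Gaussian density

f_{uv}(u,v) = \frac{1}{2\pi \sigma_u \sigma_v \sqrt{1-r^2}} \exp\!\left\{ -\frac{1}{2(1-r^2)} \left[ \frac{u^2}{\sigma_u^2} - \frac{2r\,uv}{\sigma_u \sigma_v} + \frac{v^2}{\sigma_v^2} \right] \right\}, \qquad (11)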
the first quadrant term in Eq. (10) is derived by
Here,
and
The second term in the right side of Eq. (12) becomes
Since y = au is symmetric and zero mean,
And according to the same procedure,
By Papoulis [19],
and
Also by Oh [23],
and
By substituting Eqs. (18) and (20) into Eq. (12),
Using the same procedure from Eq. (11) to (24), we can derive that
and
By substituting Eqs. (24), (25), (26), and (27) into Eq. (10), we attain
Finally, the result follows by substituting Eq. (1) into Eq. (28).
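For positive slopes, the same change of variables applied to each quadrant term shows that the net effect of the piece-wise linear transformation is a relation of the form

h(y,z) = h(u,v) + \tfrac{1}{2}\left( \ln a + \ln b + \ln c + \ln d \right) = \ln\left( 2\pi e\, \sigma_u \sigma_v \sqrt{1-r^2} \right) + \tfrac{1}{2} \ln(abcd),

so the joint entropy changes only through the logarithms of the slopes. For example, slopes of 0.5 throughout would lower the joint entropy by 2 ln 2 ≈ 1.39 nats, while saturated slopes near 0.1 would lower it by roughly 2 ln 10 ≈ 4.6 nats.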
In FNNs, we usually use the tanh(·) sigmoidal function (Fig. 1) as the nonlinear activation function of hidden nodes. Assuming that the nonlinear activation function can be approximated piece-wise linearly as in Eq. (2), the parameters a, b, c, and d correspond to the slopes of the nonlinear activation function. As shown in Fig. 2, the slope is less than or equal to 0.5. Thus, the logarithmic slope terms are negative and h(y,z) is smaller than h(u,v). Consequently, we can argue that the nonlinear activation function decreases the entropy or uncertainty of hidden nodes. The sigmoid activation function can be separated into a linear region with a steep slope and a saturated region with a gentle slope, as shown in Fig. 1. When hidden node values fall in the saturation region of the sigmoid activation function, the entropy decreases much more. This coincides with the argument that hidden nodes tend to be saturated after successful training.

Fig. 1. The sigmoid activation function of a hidden node.
Fig. 2. The slope of the sigmoid activation function.

Furthermore, the entropy of hidden nodes provides a hierarchical understanding of the information extraction capability acquired through learning. Lee et al. argued that input samples carry inter-class information as well as intra-class variation [7]. The inter-class information is the information content indicating that an input sample belongs to a specific class, and the intra-class variation is a measure of the average variation within classes, including noise contamination. After learning, input samples are projected onto hidden nodes through weighted sums and element-wise nonlinear transformations. In this paper, we proved that the entropy of hidden nodes decreases after the nonlinear transformations. This decrease of the hidden nodes' entropy corresponds to the decrease of intra-class variations pointed out by Lee et al. [7].

3. CONCLUSIONS

In this paper, we proved that the entropy of jointly Gaussian random variables decreases after piece-wise linear transformations. Also, the less steep the slopes are, the more the joint entropy decreases. Since the nonlinear activation function of hidden nodes can be approximated piece-wise linearly, we can argue that the nonlinear activation function decreases the uncertainty among hidden nodes. Furthermore, the entropy of hidden nodes decreases much more after successful training of FNNs.
References
- K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, 1989, pp. 359-366. https://doi.org/10.1016/0893-6080(89)90020-8
- K. Hornik, "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, vol. 4, 1991, pp. 251-257. https://doi.org/10.1016/0893-6080(91)90009-T
- S. Suzuki, "Constructive Function Approximation by Three-layer Artificial Neural Networks," Neural Networks, vol. 11, 1998, pp. 1049-1058. https://doi.org/10.1016/S0893-6080(98)00068-9
- Y. Liao, S. C. Fang, and H. L. W. Nuttle, "Relaxed Conditions for Radial-Basis Function Networks to be Universal Approximators," Neural Networks, vol. 16, 2003, pp. 1019-1028. https://doi.org/10.1016/S0893-6080(02)00227-7
- S. H. Oh and Y. Lee, "Effect of Nonlinear Transformations on Correlation Between Weighted Sums in Multilayer Perceptrons," IEEE Trans. Neural Networks, vol. 5, 1994, pp. 508-510. https://doi.org/10.1109/72.286927
- J. V. Shah and C. S. Poon, "Linear Independence of Internal Representations in Multilayer Perceptrons," IEEE Trans. Neural Networks, vol. 10, 1999, pp. 10-18. https://doi.org/10.1109/72.737489
- Y. Lee and H. K. Song, "Analysis on the Efficiency of Pattern Recognition Layers Using Information Measures," Proc. IJCNN'93 Nagoya, vol. 3, Oct. 1993, pp. 2129-2132.
- A. El-Jaroudi and J. Makhoul, "A New Error Criterion for Posterior Probability Estimation with Neural Nets," Proc. IJCNN'90, vol. 3, June 1990, pp. 185-192.
- M. Bichsel and P. Seitz, "Minimum Class Entropy: A Maximum Information Approach to Layered Networks," Neural Networks, vol. 2, 1989, pp. 133-141. https://doi.org/10.1016/0893-6080(89)90030-0
- S. Ridella, S. Rovetta, and R. Zunino, "Representation and Generalization Properties of Class-Entropy Networks," IEEE Trans. Neural Networks, vol. 10, 1999, pp. 31-47. https://doi.org/10.1109/72.737491
- D. Erdogmus and J. C. Principe, "Entropy Minimization Algorithm for Multilayer Perceptrons," Proc. IJCNN'01, vol. 4, 2001, pp. 3003-3008.
- K. E. Hild II, D. Erdogmus, K. Torkkola, and J. C. Principe, "Feature Extraction Using Information-Theoretic Learning," IEEE Trans. PAMI, vol. 28, no. 9, 2006, pp. 1385-1392. https://doi.org/10.1109/TPAMI.2006.186
- R. Li, W. Liu, and J. C. Principe, "A Unifying Criterion for Instantaneous Blind Source Separation Based on Correntropy," Signal Processing, vol. 87, no. 8, 2007, pp. 1872-1881. https://doi.org/10.1016/j.sigpro.2007.01.022
- S. Ekici, S. Yildirim, and M. Poyraz, "Energy and Entropy-Based Feature Extraction for Locating Fault on Transmission Lines by Using Neural Network and Wavelet Packet Decomposition," Expert Systems with Applications, vol. 34, 2008, pp. 2937-2944. https://doi.org/10.1016/j.eswa.2007.05.011
- D. Erdogmus and J. C. Principe, "Information Transfer Through Classifiers and Its Relation to Probability of Error," Proc. IJCNN'01, vol. 1, 2001, pp. 50-54.
- S. J. Lee, M. T. Jone, and H. L. Tsai, "Constructing Neural Networks for Multiclass-Discretization Based on Information Theory," IEEE Trans. Sys., Man, and Cyb.-Part B, vol. 29, 1999, pp. 445-453. https://doi.org/10.1109/3477.764881
- R. Kamimura and S. Nakanishi, "Hidden Information Maximization for Feature Detection and Rule Discovery," Network: Computation in Neural Systems, vol. 6, 1995, pp. 577-602. https://doi.org/10.1088/0954-898X/6/4/004
- K. Torkkola, "Nonlinear Feature Transforms Using Maximum Mutual Information," Proc. IJCNN'01, vol. 4, 2001, pp. 2756-2761.
- A. Papoulis, Probability, Random Variables, and Stochastic Processes, second ed., New York: McGraw-Hill, 1984.
- Y. Lee, S. H. Oh, and M. W. Kim, "An Analysis of Premature Saturation in Back-Propagation Learning," Neural Networks, vol. 6, 1993, pp. 719-728. https://doi.org/10.1016/S0893-6080(05)80116-9
- Y. Lee and S. H. Oh, "Input Noise Immunity of Multilayer Perceptrons," ETRI Journal, vol. 16, 1994, pp. 35-43. https://doi.org/10.4218/etrij.94.0194.0013
- T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley and Sons, Inc., 1991.
- S. H. Oh, "Decreasing of Correlations Among Hidden Neurons of Multilayer Perceptrons," Journal of the Korea Contents Association, vol. 3, no. 3, 2003, pp. 98-102.