Effect of Nonlinear Transformations on Entropy of Hidden Nodes

  • Oh, Sang-Hoon (Department of Information Communication Engineering Mokwon University)
  • Received : 2013.10.01
  • Accepted : 2014.01.12
  • Published : 2014.03.28

Abstract

Hidden nodes play a key role in the information processing of feed-forward neural networks, in which inputs are processed through a series of weighted sums and nonlinear activation functions. In order to understand the role of hidden nodes, we must analyze the effect of the nonlinear activation functions on the weighted sums to hidden nodes. In this paper, we focus on the effect of the nonlinear functions from the viewpoint of information theory. Under the assumption that the nonlinear activation function can be approximated piece-wise linearly, we prove that the entropy of the weighted sums to hidden nodes decreases after the piece-wise linear transformation. Therefore, we argue that the nonlinear activation function decreases the uncertainty among hidden nodes. Furthermore, the more the hidden nodes are saturated, the more their entropy decreases. Based on this result, we can say that, after successful training of feed-forward neural networks, hidden nodes tend to lie not in the linear region but in the saturated regions of the activation function, with the effect of reducing uncertainty.

1. INTRODUCTION

When an input sample is presented to a feed-forward neural network (FNN), it is processed through a series of weighted sums and nonlinear activation functions. It has been proved that FNNs with enough hidden nodes can approximate any function [1]-[4]. Herein, the weighted sums to hidden nodes are a sort of projection from the input space to a hidden feature space, followed by element-wise nonlinear activation functions. Several studies have sought to understand the role of hidden nodes. Oh and Lee proved that the nonlinear function of hidden nodes has the effect of decreasing correlations among hidden nodes [5]. They also argued that FNNs are a special type of nonlinear whitening filter [5]. In addition, Shah and Poon showed that hidden nodes with sigmoidal activation functions can produce linearly independent internal representations [6].

In the neural network field, information theory has provided many fruitful research results. Lee et al. reported that FNNs have a capability of information extraction for pattern classification by hierarchically keeping inter-class information while reducing intra-class variations [7]. Learning rules stemming from information theory have been proposed [8]-[14]. An upper bound on the probability of error was derived based on Renyi's entropy [15]. Information theory can also provide construction strategies for neural networks [16]. Furthermore, hidden information maximization and maximum mutual information methods were proposed for feature extraction [17], [18]. In this paper, we focus on the nonlinear activation functions of hidden nodes from the viewpoint of information theory. Under the assumption that the nonlinear activation function can be approximated piece-wise linearly, we derive that the entropy of hidden nodes decreases after the piece-wise linear transformation. Based on this derivation, we can interpret the role of hidden nodes in terms of entropy, or uncertainty.

 

2. NONLINEAR EFFECT ON THE ENTROPY OF HIDDEN NODES

In FNNs, inputs are processed through a series of weighted sums and nonlinear activation functions. When inputs or weights are perturbed randomly, the weighted sums to hidden nodes are approximately jointly Gaussian according to the central limit theorem [19]-[21]. Therefore, we analyze the effect of the nonlinear function on jointly Gaussian random variables.
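
To make the central limit theorem argument above concrete, the following is a minimal simulation sketch (not from the paper); the layer sizes, the uniform input perturbation, and the use of SciPy's normality test are assumptions chosen purely for illustration.

```python
# Illustrative sketch: weighted sums of randomly perturbed inputs tend to look
# approximately Gaussian, even when the inputs themselves are not Gaussian.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_samples = 100, 5, 10000

# Clearly non-Gaussian (uniform) input perturbations and fixed random weights.
x = rng.uniform(-1.0, 1.0, size=(n_samples, n_inputs))
W = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), size=(n_inputs, n_hidden))

u = x @ W  # weighted sums to the hidden nodes

# Each column of u should typically pass a normality test (p-value not small).
for j in range(n_hidden):
    _, p = stats.normaltest(u[:, j])
    print(f"hidden node {j}: normality-test p-value = {p:.3f}")
```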

If u and v are jointly Gaussian random variables with zero means, then the joint entropy of u and v [22] is given by

$$ h(u,v) = \ln\!\left(2\pi e\,\sigma_u \sigma_v \sqrt{1-r^2}\right). \tag{1} $$

Herein, σ_u and σ_v are the standard deviations of u and v, respectively, and r is the correlation coefficient between u and v.

Let’s assume that u and v are transformed into y and z as follows:

$$ y = \begin{cases} au, & u \ge 0 \\ bu, & u < 0 \end{cases} \qquad z = \begin{cases} cv, & v \ge 0 \\ dv, & v < 0 \end{cases} \tag{2} $$

where a, b, c, and d are nonzero real values. Then, the entropy of y is given by

$$ h(y) = -\int_{-\infty}^{\infty} f_y(y)\,\ln f_y(y)\,dy, \tag{3} $$

where f_y(y) is the probability density function (p.d.f.) of random variable y. By substituting [19]

$$ f_y(y) = \begin{cases} \dfrac{1}{a}\, f_u\!\left(\dfrac{y}{a}\right), & y \ge 0 \\[6pt] \dfrac{1}{b}\, f_u\!\left(\dfrac{y}{b}\right), & y < 0, \end{cases} \tag{4} $$

Eq. (3) becomes

$$ h(y) = -\int_{0}^{\infty} f_u(u)\,\ln\!\frac{f_u(u)}{a}\,du \;-\; \int_{-\infty}^{0} f_u(u)\,\ln\!\frac{f_u(u)}{b}\,du \;=\; h(u) + \frac{1}{2}\ln(ab). \tag{5} $$

Here, f_u(u) is the p.d.f. of random variable u, and

$$ \int_{0}^{\infty} f_u(u)\,du \;=\; \int_{-\infty}^{0} f_u(u)\,du \;=\; \frac{1}{2} \tag{6} $$

since u is Gaussian with zero mean. Accordingly, the entropy of z is given by

$$ h(z) = h(v) + \frac{1}{2}\ln(cd). \tag{7} $$

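As a quick illustration of Eq. (5) (the slope values here are arbitrary and not from the paper), if a hidden node has piece-wise linear slopes a = 0.5 and b = 0.1, its marginal entropy drops by a fixed amount:

$$ h(y) - h(u) = \tfrac{1}{2}\ln(ab) = \tfrac{1}{2}\ln(0.05) \approx -1.50 \text{ nats}. $$

Since ab < 1 whenever both slopes are below one, this change is always negative for a saturating activation function.
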
The joint entropy after the piece-wise linear transformations is defined by

$$ h(y,z) = -\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f_{yz}(y,z)\,\ln f_{yz}(y,z)\,dy\,dz. \tag{8} $$

The joint p.d.f. of y and z can be separated into four quadrants as follows:

$$
f_{yz}(y,z) = \begin{cases}
\dfrac{1}{ac}\,f_{uv}\!\left(\dfrac{y}{a},\dfrac{z}{c}\right), & y \ge 0,\ z \ge 0 \\[6pt]
\dfrac{1}{ad}\,f_{uv}\!\left(\dfrac{y}{a},\dfrac{z}{d}\right), & y \ge 0,\ z < 0 \\[6pt]
\dfrac{1}{bc}\,f_{uv}\!\left(\dfrac{y}{b},\dfrac{z}{c}\right), & y < 0,\ z \ge 0 \\[6pt]
\dfrac{1}{bd}\,f_{uv}\!\left(\dfrac{y}{b},\dfrac{z}{d}\right), & y < 0,\ z < 0
\end{cases} \tag{9}
$$

Therefore, Eq. (8) is also separated into four parts as

$$
h(y,z) = -\iint_{y\ge 0,\,z\ge 0} f_{yz}\ln f_{yz}\,dy\,dz
-\iint_{y\ge 0,\,z<0} f_{yz}\ln f_{yz}\,dy\,dz
-\iint_{y<0,\,z\ge 0} f_{yz}\ln f_{yz}\,dy\,dz
-\iint_{y<0,\,z<0} f_{yz}\ln f_{yz}\,dy\,dz. \tag{10}
$$

Since the joint p.d.f. of u and v is given by

$$
f_{uv}(u,v) = \frac{1}{2\pi\sigma_u\sigma_v\sqrt{1-r^2}}
\exp\!\left\{-\frac{1}{2(1-r^2)}\left(\frac{u^2}{\sigma_u^2}
-\frac{2r\,uv}{\sigma_u\sigma_v}+\frac{v^2}{\sigma_v^2}\right)\right\}, \tag{11}
$$

the first quadrant term in Eq. (10) is derived by

Here,

and

The second term on the right-hand side of Eq. (12) becomes

Since y = au is symmetric and zero mean,

And according to the same procedure,

By Papoulis [19],

and

Also by Oh [23],

and

By substituting Eqs. (18) and (20) into Eq. (12),

Using the same procedure from Eq. (11) to (24), we can derive that

and

By substituting Eqs. (24), (25), (26), and (27) into Eq. (10), we attain

$$ h(y,z) = h(u,v) + \tfrac{1}{2}\ln(abcd). \tag{28} $$

Finally, by substituting Eq. (1) into Eq. (28), we obtain

$$ h(y,z) = \ln\!\left(2\pi e\,\sigma_u\sigma_v\sqrt{1-r^2}\right) + \tfrac{1}{2}\ln(abcd). $$
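
As a compact restatement of where the $\tfrac{1}{2}\ln(abcd)$ term comes from (a sketch, assuming positive slopes a, b, c, and d): changing variables in each quadrant integral of Eq. (10) splits it into a quadrant integral of $-f_{uv}\ln f_{uv}$ plus a log-slope term weighted by the corresponding quadrant probability; the four entropy pieces sum to $h(u,v)$, so that

$$
h(y,z) = h(u,v) + \ln(ac)\,P(u\ge 0, v\ge 0) + \ln(ad)\,P(u\ge 0, v<0) + \ln(bc)\,P(u<0, v\ge 0) + \ln(bd)\,P(u<0, v<0).
$$

For zero-mean jointly Gaussian variables, $P(u\ge 0, v\ge 0) = P(u<0, v<0) = \tfrac{1}{4} + \tfrac{\arcsin r}{2\pi}$ and $P(u\ge 0, v<0) = P(u<0, v\ge 0) = \tfrac{1}{4} - \tfrac{\arcsin r}{2\pi}$ [19], so the arcsine terms cancel and the log-slope contributions reduce to $\tfrac{1}{2}\ln(abcd)$, which is Eq. (28).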

In FNNs, we usually use the tanh(·) sigmoidal function (Fig. 1) as the nonlinear activation function of hidden nodes. Assuming that the nonlinear activation function is approximated piece-wise linearly as in Eq. (2), the parameters a, b, c, and d correspond to slopes of the nonlinear activation function. As shown in Fig. 2, the slope is less than or equal to 0.5. Thus, $\tfrac{1}{2}\ln(abcd) \le 2\ln 0.5 < 0$, and from Eq. (28), $h(y,z) < h(u,v)$.

Consequently, we can argue that the nonlinear activation function decreases the entropy, or uncertainty, of hidden nodes. The sigmoid activation function can be separated into a linear region with a steep slope and saturated regions with gentle slopes, as shown in Fig. 1. When hidden node values lie in the saturated regions of the sigmoid activation function, the entropy decreases even more. This coincides with the argument that hidden nodes tend to be saturated after successful training.
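
As a sanity check on this result, the following is a minimal numerical sketch (not part of the paper); the standard deviations, correlation coefficient, and slope values below are arbitrary choices for illustration, and the entropies are evaluated directly from Eqs. (1) and (28).

```python
# Illustrative sketch: joint entropy of the weighted sums before and after the
# piece-wise linear transformation, using the closed-form expressions above.
import numpy as np

sigma_u, sigma_v, r = 1.0, 1.0, 0.3   # statistics of the weighted sums (assumed)
a, b, c, d = 0.5, 0.1, 0.5, 0.1       # piece-wise linear slopes, all <= 0.5

h_uv = np.log(2 * np.pi * np.e * sigma_u * sigma_v * np.sqrt(1 - r**2))  # Eq. (1)
h_yz = h_uv + 0.5 * np.log(a * b * c * d)                                # Eq. (28)

print(f"h(u,v)   = {h_uv:.4f} nats")
print(f"h(y,z)   = {h_yz:.4f} nats")
print(f"decrease = {h_uv - h_yz:.4f} nats")  # positive, i.e., entropy decreases
```

The smaller the slopes (the deeper the hidden nodes sit in the saturated regions), the larger the printed decrease, in line with the argument above.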

Fig. 1. The sigmoid activation function of a hidden node

Fig. 2. The slope of the sigmoid activation function

Furthermore, the entropy of hidden nodes provides a hierarchical understanding of the information extraction capabilities acquired through learning. Lee et al. argued that input samples carry inter-class information as well as intra-class variation [7]. The inter-class information is the information content that an input sample belongs to a specific class, and the intra-class variation is a measure of the average variation within the classes, including noise contamination. After learning, input samples are projected to hidden nodes through weighted sums and element-wise nonlinear transformations. In this paper, we proved that the entropy of hidden nodes decreases after the nonlinear transformations. This decrease of the hidden nodes' entropy corresponds to the reduction of intra-class variations pointed out by Lee et al. [7].

 

3. CONCLUSIONS

In this paper, we proved that the entropy of jointly Gaussian random variables decreases after piece-wise linear transformations. Moreover, the less steep the slopes are, the more the joint entropy decreases. Since the nonlinear activation function of hidden nodes can be approximated piece-wise linearly, we can argue that the nonlinear activation function decreases the uncertainty among hidden nodes. Furthermore, because hidden nodes tend to be saturated after successful training of FNNs, their entropy decreases even more.

References

  1. K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, 1989, pp. 359-366. https://doi.org/10.1016/0893-6080(89)90020-8
  2. K. Hornik, "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, vol. 4, 1991, pp. 251-257. https://doi.org/10.1016/0893-6080(91)90009-T
  3. S. Suzuki, "Constructive Function Approximation by Three-layer Artificial Neural Networks," Neural Networks, vol. 11, 1998, pp. 1049-1058. https://doi.org/10.1016/S0893-6080(98)00068-9
  4. Y. Liao, S. C. Fang, and H. L. W. Nuttle, "Relaxed Conditions for Radial-Basis Function Networks to be Universal Approximators," Neural Networks, vol. 16, 2003, pp. 1019-1028. https://doi.org/10.1016/S0893-6080(02)00227-7
  5. S. H. Oh and Y. Lee, "Effect of Nonlinear Transformations on Correlation Between Weighted Sums in Multilayer Perceptrons," IEEE Trans., Neural Networks, vol. 5, 1994, pp. 508-510. https://doi.org/10.1109/72.286927
  6. J. V. Shah and C. S. Poon, "Linear Independence of Internal Representations in Multilayer Perceptrons," IEEE Trans., Neural Networks, vol. 10, 1999, pp. 10-18. https://doi.org/10.1109/72.737489
  7. Y. Lee and H. K. Song, "Analysis on the Efficiency of Pattern Recognition Layers Using Information Measures," Proc. IJCNN'93 Nagoya, vol. 3, Oct. 1993, pp. 2129-2132.
  8. A. El-Jaroudi and J. Makhoul, "A New Error Criterion for Posterior Probability Estimation with Neural Nets," Proc. IJCNN'90, vol. 3, June 1990, pp. 185-192.
  9. M. Bichsel and P. Seitz, "Minimum Class Entropy: A Maximum Information Approach to Layered Networks," Neural Networks, vol. 2, 1989, pp. 133-141. https://doi.org/10.1016/0893-6080(89)90030-0
  10. S. Ridella, S. Rovetta, and R. Zunino, "Representation and Generalization Properties of Class-Entropy Networks," IEEE Trans. Neural Networks, vol. 10, 1999, pp. 31-47. https://doi.org/10.1109/72.737491
  11. D. Erdogmus and J. C. Principe, "Entropy Minimization Algorithm for Multilayer Perceptrons," Proc. IJCNN'01, vol. 4, 2001, pp. 3003-3008.
  12. K. E. Hild II, D. Erdogmus, K. Torkkola, and J. C. Principe, "Feature Extraction Using Information-Theoretic Learning," IEEE Trans. PAMI, vol. 28, no. 9, 2006, pp. 1385-1392. https://doi.org/10.1109/TPAMI.2006.186
  13. R. Li, W. Liu, and J. C. Principe, "A Unifying Criterion for Instantaneous Blind Source Separation Based on Correntropy," Signal Processing, vol. 87, no. 8, 2007, pp. 1872-1881. https://doi.org/10.1016/j.sigpro.2007.01.022
  14. S. Ekici, S. Yildirim, and M. Poyraz, "Energy and Entropy-Based Feature Extraction for Locating Fault on Transmission Lines by Using Neural Network and Wavelet Packet Decomposition," Expert Systems with Applications, vol. 34, 2008, pp. 2937-2944. https://doi.org/10.1016/j.eswa.2007.05.011
  15. D. Erdogmus and J. C. Principe, "Information Transfer Through Classifiers and Its Relation to Probability of Error," Proc. IJCNN'01, vol. 1, 2001, pp. 50-54.
  16. S. J. Lee, M. T. Jone, and H. L. Tsai, "Constructing Neural Networks for Multiclass-Discretization Based on Information Theory," IEEE Trans. Sys., Man, and Cyb.-Part B, vol. 29, 1999, pp. 445-453. https://doi.org/10.1109/3477.764881
  17. R. Kamimura and S. Nakanishi, "Hidden Information Maximization for Feature Detection and Rule Discovery," Network: Computation in Neural Systems, vol. 6, 1995, pp. 577-602. https://doi.org/10.1088/0954-898X/6/4/004
  18. K. Torkkola, "Nonlinear Feature Transforms Using Maximum Mutual Information," Proc. IJCNN'01, vol. 4, 2001, pp. 2756-2761.
  19. A. Papoulis, Probability, Random Variables, and Stochastic Processes, second ed., New York: McGraw-Hill, 1984.
  20. Y. Lee, S. H. Oh, and M. W. Kim, "An Analysis of Premature Saturation in Back-Propagation Learning," Neural Networks, vol. 6, 1993, pp. 719-728. https://doi.org/10.1016/S0893-6080(05)80116-9
  21. Y. Lee and S. H. Oh, "Input Noise Immunity of Multilayer Perceptrons," ETRI Journal, vol. 16, 1994, pp. 35-43. https://doi.org/10.4218/etrij.94.0194.0013
  22. T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc., 1991.
  23. S. H. Oh, "Decreasing of Correlations Among Hidden Neurons of Multilayer Perceptrons," Journal of the Korea Contents Association, vol. 3, no. 3, 2003, pp. 98-102.