1. INTRODUCTION
Multi-layer perceptron (MLP) neural networks can approximate any function given a sufficient number of hidden nodes [1]-[3], and this property has led to the application of MLPs in a wide range of fields such as pattern recognition, speech recognition, time series prediction, and bioinformatics. MLPs are usually trained with the error back-propagation (EBP) algorithm, which minimizes the mean-squared error (MSE) function between the MLP outputs and their desired values [4]. However, the EBP algorithm suffers from slow learning convergence and poor generalization performance [5], [6]. This is due to the incorrect saturation of output nodes and overspecialization to training samples [6].
Usually, sigmoidal functions are adopted as the activation functions of nodes in an MLP. The sigmoidal activation function can be divided into a central linear region and two outer saturated regions. When an output node of an MLP lies in the saturated region of the sigmoidal activation function opposite to its desired value, we say the output node is “incorrectly saturated.” Incorrect saturation makes the weight updates small, and consequently learning convergence becomes slow. Also, when an MLP is trained too much on the training samples, it becomes overspecialized to them and its generalization performance on untrained test samples becomes poor.
The cross-entropy (CE) error function accelerates the EBP algorithm by decreasing the incorrect saturation of output nodes [5]. Furthermore, the n-th order extension of cross-entropy (nCE) error function attains accelerated learning convergence and improved generalization capability by decreasing incorrect saturation as well as preventing overspecialization to training samples [6].
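To make the saturation argument concrete, the following minimal Python sketch (an illustration, not code from the paper) compares the output-node error signal δ = -∂E/∂net under the MSE and CE error functions when a unipolar sigmoid output is incorrectly saturated (y near 0 while the target t is 1). Under MSE the sigmoid derivative y(1-y) multiplies the error and drives δ toward zero, while under CE that derivative term cancels and δ stays close to t - y.

```python
import numpy as np

def sigmoid(net):
    """Unipolar sigmoid activation in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-net))

t = 1.0                                    # desired value of the output node
nets = np.array([-6.0, -2.0, 0.0, 2.0])    # pre-activations of the output node
y = sigmoid(nets)                          # actual outputs

# Error signal delta = -dE/dnet for each error function (standard results):
#   MSE: E = (t - y)^2 / 2              ->  delta = (t - y) * y * (1 - y)
#   CE : E = -[t ln y + (1-t) ln(1-y)]  ->  delta = t - y
delta_mse = (t - y) * y * (1.0 - y)
delta_ce = t - y

for net, out, d_mse, d_ce in zip(nets, y, delta_mse, delta_ce):
    print(f"net={net:5.1f}  y={out:.4f}  delta_MSE={d_mse:.4f}  delta_CE={d_ce:.4f}")
# At net = -6 the node is incorrectly saturated (y is about 0.0025 while t = 1):
# delta_MSE is about 0.0025 (almost no weight update), whereas delta_CE is about 0.9975.
```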
Information theory has played a great role in the neural network community. From an information-theoretic viewpoint, many learning rules for neural networks have been proposed for improved performance, such as minimum class-entropy, entropy minimization, and feature extraction using information-theoretic learning [7]-[11]. Information theory can also serve as a basis for constructing neural networks [12]. An upper bound on the probability of error was derived based on Renyi's entropy [13]. Maximizing the information content of hidden nodes has also been developed for better performance of MLPs [14], [15]. In this paper, we focus on the relationship between the relative entropy and the CE error function.
The relative entropy is a divergence measure between two probability density functions [16]. When a random variable takes only two symbols, the relative entropy reduces to the cross-entropy (CE) error function, which can accelerate the learning convergence of MLPs. Since the nCE error function is an extension of the CE error function, there should be a divergence measure corresponding to nCE, just as the relative entropy corresponds to CE. In this sense, this paper derives a new divergence measure from the nCE error function. In Section 2, the relationship between the relative entropy and CE is introduced. Section 3 derives a new divergence measure from the nCE error function and compares it with the relative entropy. Finally, Section 4 concludes this paper.
2. RELATIVE ENTROPY AND CROSS-ENTROPY
Consider a random variable x whose probability density function (p.d.f.) is p(x). When the p.d.f. of x is estimated with q(x), we need to measure how accurate the estimation is. For this purpose, the relative entropy is defined by

D(p||q) = Σ_x p(x) log( p(x) / q(x) )    (1)

as a divergence measure between p(x) and q(x) [16]. Let us assume that the random variable x has only the two symbols 0 and 1, with probabilities

p(x=1) = p,   p(x=0) = 1 - p.    (2)

Also,

q(x=1) = q,   q(x=0) = 1 - q.    (3)

Then,

D(p||q) = p log(p/q) + (1-p) log((1-p)/(1-q))
        = -H(p) + [ -p log q - (1-p) log(1-q) ].    (4)

Here,

H(p) = -p log p - (1-p) log(1-p)    (5)

is the entropy of a random variable x with two symbols and

-p log q - (1-p) log(1-q)    (6)

is the cross-entropy. If we assume that 'q' corresponds to the real output value 'y' of an MLP output node and 'p' corresponds to its desired value 't', we can define the cross-entropy error function as

E_CE = -[ t log y + (1-t) log(1-y) ].    (7)
Thus, the cross-entropy error function is a specific instance of the relative entropy under the assumption that a random variable has only two symbols [15].
We can use the unipolar [0, 1] mode or the bipolar [-1, +1] mode to describe node values of MLPs. Since 't' and 'y' correspond to 'p' and 'q', respectively, they lie in the range [0, 1]. Thus, the relationship between the relative entropy and the cross-entropy error function is based on the unipolar mode of node values.
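As a quick numerical check of Eqs. (4)-(7) (a small sketch of our own, not part of the original paper), the snippet below evaluates the binary relative entropy directly and via its decomposition into entropy plus cross-entropy, and shows that the CE error function is exactly the cross-entropy term with (p, q) renamed to (t, y).

```python
import numpy as np

def relative_entropy(p, q):
    """D(p||q) for a binary random variable, Eq. (4), first line."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def entropy(p):
    """H(p) of a binary random variable, Eq. (5)."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def cross_entropy(p, q):
    """Cross-entropy of Eq. (6); with p -> t, q -> y it is the CE error function, Eq. (7)."""
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

p, q = 0.8, 0.3
print(relative_entropy(p, q))               # direct evaluation of D(p||q)
print(-entropy(p) + cross_entropy(p, q))    # -H(p) + cross-entropy: same value, Eq. (4), last line
print(relative_entropy(p, p))               # D(p||p) = 0, the minimum

t, y = 0.8, 0.3                             # desired and actual output of an MLP output node
print(cross_entropy(t, y))                  # value of the CE error function, Eq. (7)
```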
3. NEW DIVERGENCE MEASURE FROM THE n-th ORDER EXTENSION OF CROSS-ENTROPY
The n-th order extension of cross-entropy (nCE) error function was proposed based on the bipolar mode of node values as [6]
where n is a natural number. In order to derive a new divergence measure from the nCE error function based on the relationship between the relative entropy and the CE error function, we need a unipolar-mode formulation of the nCE error function. This is derived as
We will derive new divergence measures from Eq. (9) with n=2 and 4.
When n=2, the nCE error function given by Eq. (9) becomes
where
and
By substituting Eqs. (11) and (12) into Eq. (10),
In order to derive a new divergence measure corresponding to nCE (n=2), t and y are replaced with p and q, respectively. This is the reverse of the procedure used to derive Eq. (7) from Eq. (6), in which 'p' and 'q' were replaced with 't' and 'y', respectively. Then, we obtain
Thus, following the form of the last line of Eq. (4), the new divergence measure is derived as
where
When n=4, the nCE error function given by Eq. (9) is
where
and
Substituting Eqs. (18), (19), (20), (21), and (22) into Eq. (17),
By replacing t and y with p and q, respectively, we obtain
Thus, following the form of the last line of Eq. (4), the new divergence measure is derived as
where
In order to compare the new divergence measures given by Eqs. (15) and (25) with the relative entropy given by Eq. (4), we plot them over the range in which p and q lie in [0, 1]. Fig. 1 shows the three-dimensional plot of the relative entropy D(p||q). The x and y axes correspond to p and q, respectively, and the z axis corresponds to D(p||q). D(p||q) attains its minimum of zero when p=q and increases as p moves away from q. Since D(p||q) is a divergence measure, it is not symmetric.
Fig. 1. Three-dimensional plot of the relative entropy D(p||q) with two symbols
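For readers who want to reproduce Fig. 1, the matplotlib sketch below (an illustration under our own plotting choices, not the authors' original script) evaluates D(p||q) of Eq. (4) on a grid over (0, 1) x (0, 1) and draws the surface; the surfaces of Figs. 2 and 3 follow the same pattern once F(p||q;n=2) of Eq. (15) and F(p||q;n=4) of Eq. (25) are substituted for the plotted function.

```python
import numpy as np
import matplotlib.pyplot as plt

def binary_relative_entropy(p, q):
    """D(p||q) with two symbols, Eq. (4)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Grid over the open interval (0, 1) to avoid log(0) at the boundaries.
eps = 1e-3
p, q = np.meshgrid(np.linspace(eps, 1 - eps, 200),
                   np.linspace(eps, 1 - eps, 200), indexing="ij")
D = binary_relative_entropy(p, q)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(p, q, D, cmap="viridis")
ax.set_xlabel("p")
ax.set_ylabel("q")
ax.set_zlabel("D(p||q)")
ax.set_title("Relative entropy with two symbols (cf. Fig. 1)")
plt.show()
# The surface is zero along p = q, grows as p moves away from q,
# and is clearly asymmetric in p and q.
```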
Fig. 2 shows the three-dimensional plot of the new divergence measure F(p||q;n=2) given by Eq. (15). F(p||q;n=2) attains its minimum of zero when p=q and increases as p moves away from q, as D(p||q) does. Furthermore, we can see that F(p||q;n=2) is flatter than D(p||q). Likewise, the three-dimensional plot of F(p||q;n=4) shown in Fig. 3 attains its minimum of zero when p=q and is flatter than F(p||q;n=2) shown in Fig. 2. Thus, increasing the order n makes the new divergence measure flatter.
Fig. 2. Three-dimensional plot of the new divergence measure with two symbols when n=2, F(p||q;n=2)
Fig. 3. Three-dimensional plot of the new divergence measure with two symbols when n=4, F(p||q;n=4)
When applying MLPs to pattern classification, the optimal outputs of an MLP trained with various error functions were derived in [6] and [18]. We plot them in Fig. 4. The optimal output of an MLP trained with the CE error function is a first-order function of the a posteriori probability that a given input sample belongs to a specific class. When the nCE error function with n=2 is used for training, the optimal output is flatter than in the CE case, as shown in Fig. 4. With n=4, the optimal output is flatter than in both the CE and the nCE (n=2) cases. The two-dimensional contour plots of the CE and nCE error functions show the same property [17]. Therefore, we can argue that this property of the divergence measures derived from CE and nCE coincides with the two-dimensional contour plots of the CE and nCE error functions in [17] and with the optimal outputs in [6] and [18].
Fig. 4. Optimal outputs of MLPs. Here, Q(x) denotes the a posteriori probability that a certain input x belongs to a specific class
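The claim that the CE-trained MLP's optimal output is a first-order function of the a posteriori probability can be checked numerically. In the sketch below (our own illustration, with the nCE cases omitted because they require the closed forms of Eqs. (10) and (17)), the desired value t is 1 with probability Q(x) and 0 otherwise, and the expected CE error E[-t ln y - (1-t) ln(1-y)] is minimized over a grid of output values y; the minimizer lands at y approximately equal to Q(x).

```python
import numpy as np

def expected_ce(y, Q):
    """Expected CE error when t = 1 with probability Q and t = 0 otherwise."""
    return -(Q * np.log(y) + (1 - Q) * np.log(1 - y))

y_grid = np.linspace(1e-4, 1 - 1e-4, 10001)
for Q in (0.1, 0.3, 0.5, 0.7, 0.9):
    y_opt = y_grid[np.argmin(expected_ce(y_grid, Q))]
    print(f"Q(x) = {Q:.1f}  ->  optimal CE output y* = {y_opt:.3f}")
# In every case y* matches Q(x), i.e., the optimal CE output is the posterior itself
# (a first-order function of Q(x)); the nCE optimal outputs in Fig. 4 are flatter
# functions of Q(x).
```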
4. CONCLUSIONS
In this paper, we introduced the relationship between the relative entropy and the CE error function. When a random variable has only two symbols, the relative entropy reduces to the cross-entropy. Based on this relationship, we derived a new divergence measure from the nCE error function. Comparing the three-dimensional plots of the relative entropy and the new divergence measure with n=2 and 4, we can argue that the order n of the new divergence measure has the effect of flattening the divergence measure. This property coincides with previous results comparing the optimal outputs and the contour plots of CE and nCE.
References
- K. Hornik, M. Stinchcombe, and H. White, “Multilayer Feed-forward Networks are Universal Approximators,” Neural Networks, vol. 2, 1989, pp. 359-366. https://doi.org/10.1016/0893-6080(89)90020-8
- K. Hornik, “Approximation Capabilities of Multilayer Feedforward Networks,” Neural Networks, vol. 4, 1991, pp. 251-257. https://doi.org/10.1016/0893-6080(91)90009-T
- S. Suzuki, “Constructive Function Approximation by Three-Layer Artificial Neural Networks,” Neural Networks, vol. 11, 1998, pp. 1049-1058. https://doi.org/10.1016/S0893-6080(98)00068-9
- D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing, Cambridge, MA: MIT Press, 1986.
- A. van Ooyen and B. Nienhuis, “Improving the Convergence of the Backpropagation Algorithm,” Neural Networks, vol. 5, 1992, pp. 465-471. https://doi.org/10.1016/0893-6080(92)90008-7
- S.-H. Oh, “Improving the Error Back-Propagation Algorithm with a Modified Error Function,” IEEE Trans. Neural Networks, vol. 8, 1997, pp. 799-803. https://doi.org/10.1109/72.572117
- A. El-Jaroudi and J. Makhoul, "A New Error Criterion for Posterior Probability Estimation with Neural Nets," Proc. IJCNN'90, vol. III, Jun. 1990, pp. 185-192.
- M. Bichsel and P. Seitz, “Minimum Class Entropy: A Maximum Information Approach to Layered Networks,” Neural Networks, vol. 2, 1989, pp. 133-141. https://doi.org/10.1016/0893-6080(89)90030-0
- S. Ridella, S. Rovetta, and R. Zunino, “Representation and Generalization Properties of Class-Entropy Networks,” IEEE Trans. Neural Networks, vol. 10, 1999, pp. 31-47. https://doi.org/10.1109/72.737491
- D. Erdogmus and J. C. Principe, "Entropy Minimization Algorithm for Multilayer Perceptrons," Proc. IJCNN'01, vol. 4, 2001, pp. 3003-3008.
- K. E. Hild II, D. Erdogmus, K. Torkkola, and J. C. Principe, “Feature Extraction Using Information-Theoretic Learning,” IEEE Trans. PAMI, vol. 28, no. 9, 2006, pp. 1385-1392. https://doi.org/10.1109/TPAMI.2006.186
- S.-J. Lee, M.-T. Jone, and H.-L. Tsai, “Constructing Neural Networks for Multiclass-Discretization Based on Information Theory,” IEEE Trans. Sys., Man, and Cyb.- Part B, vol. 29, 1999, pp. 445-453. https://doi.org/10.1109/3477.764881
- D. Erdogmus and J. C. Principe, "Information Transfer Through Classifiers and Its Relation to Probability of Error," Proc. IJCNN'01, vol. 1, 2001, pp. 50-54.
- R. Kamimura and S. Nakanishi, “Hidden Information Maximization for Feature Detection and Rule Discovery,” Network: Computation in Neural Systems, vol. 6, 1995, pp. 577-602. https://doi.org/10.1088/0954-898X_6_4_004
- K. Torkkola, "Nonlinear Feature Transforms Using Maximum Mutual Information," Proc. IJCNN'01, vol. 4, 2001, pp. 2756-2761.
- T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.
- S.-H. Oh, “Contour Plots of Objective Functions for FeedForward Neural Networks,” Int. Journal of Contents, vol. 8, no. 4, Dec. 2012, pp. 30-35. https://doi.org/10.5392/IJoC.2012.8.4.030
- S.-H. Oh, “Statistical Analyses of Various Error Functions For Pattern Classifiers,” CCIS, vol. 206, 2011, pp. 129-133.