1. Introduction
These days, a primary source of information for machine learning is RNA-sequencing Big-data (RNA-seq). In computational methods such as Support Vector Machine-Recursive Feature Elimination (SVM-RFE), which is based on the Support Vector Machine (SVM) criterion, data are usually represented as vectors of feature patterns, especially in RNA-seq. They may correspond to measurements performed on information gathered from the observation of a phenomenon. Usually, not all feature-pattern genes in RNA-seq are equally informative: some of them may be noisy, meaningless, correlated, or irrelevant, and therefore hinder the identification of differentially expressed genes [1]. Variable selection of feature patterns aims at selecting a subset of the featured genes that is distinguishably relevant for specific problems [1]. It is an important open issue: the huge amount of data to gather or process should be reduced. If training itself becomes easier, better estimates are obtained when using only the relevant featured genes of RNA-seq. Therefore, more sophisticated processing algorithms can be used on smaller-dimensional spaces than on the original measurement space. Moreover, computational performance may increase when irrelevant information does not interfere with the process [2, 3, 4, 5]. Variable selection of feature patterns has been the subject of intensive research in the application of identifying differentially expressed genes with maximum gene relevancy and minimum gene redundancy. It has recently begun to be investigated with machine-learning algorithms such as random forests, K-nearest neighbors, and SVM. Because of the curse of dimensionality, whatever the domain, variable feature selection remains an open and non-monotonous issue. Moreover, the number of expressed genes is far larger than the number of Big-data samples. This means that the best distinguishable subset of p variables for discrimination does not always contain the best discriminating subset of q variables (q < p) [6, 7]. Most algorithms for variable selection rely on human heuristics in machine learning, which perform only a limited exploration of the whole set of variable combinations [8, 9, 10]. Feature selection has been studied extensively in the field of machine learning. One of the most well-known machine-learning algorithms is the SVM, which is classical as well as original, and one member of the SVM-criterion family is Support Vector Machine-Recursive Feature Elimination (SVM-RFE), which we utilize in this work.
We propose a novel algorithm that exploits the efficiency of criteria derived from Support Vector Machine-Recursive Feature Elimination (SVM-RFE) [1] combined with off-policy Q-learning in reinforcement learning for variable feature selection, applied to differentially expressed genes in RNA-seq. In particular, we employ off-policy Q-learning, trained to control the optimization of the criterion toward better weight vectors in SVM-RFE. Thanks to reinforcement learning [2-5], self-teaching algorithms are being designed for open issues in the area of Big-data. By applying such an algorithm to RNA-seq Big-data, variable feature selection on the huge amount of data may help reduce the load of Big-data issues. Moreover, variable feature selection may also suit the resource demands of other research areas. We consider that our proposed algorithm based on reinforcement learning goes beyond plain feature selection in the area of Big-data, because the reinforced selection algorithm produces refined, meaningful features. Off-policy Q-learning is regarded as superior to a purely discount-based method, and its randomness may improve the relevance of the selected data, because exploration is balanced against exploitation in the Q-learning policy.
By comparing our proposed algorithm with the well-known SVM-RFE combined with Welch's t-test, our results show that the criterion based on the weight vector of SVM-RFE, enhanced by Q-learning, is improved by the greediness of the off-policy method following a more exploratory Q-learning scheme.
2. Methods
2.1 Motivation
The first purpose of the proposed algorithm is to enhance the weight vectors of the SVM by exploration and exploitation. There is a substantial difference between on-policy and off-policy methods: SARSA (or TD-learning) is on-policy, whereas Q-learning is off-policy. Before considering the two kinds of policy, let us give a simple example. A robot must decide whether to move to a nearby door or to a distant gate. Under Q-learning with a low \(\gamma\), which shrinks the value of delayed rewards, it moves to the nearby door (the immediate goal). Off-policy Q-learning may have an advantage over purely discount-driven methods, which can get captured in limit cycles; therefore, occasionally acting at random, known as exploration, is extremely important for long-term success. This trade-off is called the "Exploration-Exploitation Dilemma" [2-5].
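To make the discount effect in the door/gate example concrete, the following minimal sketch compares the discounted return of each choice; the rewards and step counts are assumed only for illustration and are not taken from the paper.

```r
# Hypothetical discounted returns for the door/gate example:
# reward 1 for the nearby door after 1 step, reward 10 for the distant gate after 5 steps.
discounted_return <- function(reward, steps, gamma) reward * gamma^(steps - 1)

discounted_return(1, 1, gamma = 0.3)    # nearby door:  1.000
discounted_return(10, 5, gamma = 0.3)   # distant gate: 0.081 -> a low gamma prefers the door
discounted_return(10, 5, gamma = 0.95)  # distant gate: 8.145 -> a high gamma prefers the gate
```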
A policy can be specified in the following forms:
Deterministic: function that maps states to actions.
\(\pi : S \to A, \quad a = \pi(s).\)
Example: off-policy Q-learning:
\(a_t = \pi(s_t) = \arg\max_{a} Q(s_t, a).\)
Stochastic: Probability of an action given a state s.
\(\pi : S \times A \to [0,1]\) with \(\sum_{a \in A_s} \pi(s,a) = 1\) for all \(s\), where \(P(a \mid s) = \pi(s,a).\)
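As a minimal sketch of the two policy forms, the code below implements a greedy (deterministic) and an ε-greedy (stochastic) action choice over an assumed tabular action-value matrix Q; the toy matrix is an assumption for illustration, not part of the paper.

```r
# Minimal sketch: deterministic (greedy) and stochastic (epsilon-greedy) policies
# over an assumed tabular action-value matrix Q (rows = states, columns = actions).
greedy_policy <- function(Q, s) which.max(Q[s, ])         # a = pi(s) = argmax_a Q(s, a)

epsilon_greedy_policy <- function(Q, s, epsilon = 0.1) {
  n_actions <- ncol(Q)
  if (runif(1) < epsilon) sample(n_actions, 1)            # explore with probability epsilon
  else which.max(Q[s, ])                                  # exploit otherwise
}

Q <- matrix(runif(12), nrow = 3, ncol = 4)                # toy 3-state, 4-action table
greedy_policy(Q, s = 1)
epsilon_greedy_policy(Q, s = 1, epsilon = 0.2)
```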
The on-policy methods (TD-learning, SARSA) start with a simple soft policy, whereas off-policy Q-learning collects information from occasionally random moves, evaluates states as if a greedy policy were followed, and reduces randomness very slowly. In on-policy learning, therefore, randomness is rarely exploited, and for high-dimension, low-sample-size data it is difficult to take advantage of exploration. The TD-learning (SARSA) and Q-learning update equations can be compared as follows [2, 3, 4, 5]:
Q-learning (off-policy): \(a_t = \arg\max_{a} Q(s_t,a)\) (plus exploration), \(Q_{t+1}(s_t, a_t) = (1-\eta)\,Q_t(s_t, a_t) + \eta\bigl(r_{t+1} + \gamma \max_{a} Q_t(s_{t+1},a)\bigr)\)
TD-learning or SARSA (on-policy): \(Q_{t+1}(s_t, a_t) = (1-\eta)\,Q_t(s_t, a_t) + \eta\bigl(r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1})\bigr)\)
Q-learning follows the rule \(V(s_{t+1}) = \max_{a} Q(s_{t+1}, a)\); however, \(a_{t+1}\) can be anything, and that is exploration. SARSA follows the rule \(a_t \sim \pi(s_t,\cdot)\) and updates toward the precise value of the policy \(\pi(s,a)\) it actually follows, which means it does not take advantage of exploration in the same way [2-5].
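The difference between the two targets can be seen in the following minimal sketch of the update rules for an assumed tabular Q; the function names are ours, introduced only for illustration.

```r
# Minimal sketch of the two update rules above for an assumed tabular Q (states x actions),
# with learning rate eta and discount gamma; s, a, r, s_next, a_next describe one transition.
q_learning_update <- function(Q, s, a, r, s_next, eta, gamma) {
  Q[s, a] <- (1 - eta) * Q[s, a] + eta * (r + gamma * max(Q[s_next, ]))    # off-policy target
  Q
}

sarsa_update <- function(Q, s, a, r, s_next, a_next, eta, gamma) {
  Q[s, a] <- (1 - eta) * Q[s, a] + eta * (r + gamma * Q[s_next, a_next])   # on-policy target
  Q
}
```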
(Figure 1) The off-policy Q-learning algorithm [2, 3, 4, 5]
2.2 SVM-RFE Algorithm
Guyon et al. proposed the feature-pattern selection method SVM-RFE [7]. Its purpose is to find a distinguishable subset among the feature-pattern variables that maximizes the performance of the prediction method. It is based on backward sequential selection: one starts with all features and removes one feature per loop. In some studies, because of the large number of feature genes, chunks of features are removed until only the distinguishable features remain. When facing high-dimensional data with a low number of samples, classification or prediction problems suffer from over-fitting and high-variance gradients [8]. However, machine-learning algorithms such as SVM-RFE can produce good results on small sample sizes with low-variance gradients. SVM-RFE removes from the gene set the irrelevant gene with the smallest ranking criterion [7]. The SVM criterion for the gene-ranking score is used as the measure for determining the featured genes. The weight vector \(w\) of the SVM defines the gene-ranking score, where \(w\) is calculated as
\(w=\sum_{i=1}^{n} a_{i} x_{i} y_{i}\) (1)
where \(x_i\) is the gene-expression array of sample \(i\) in the training set, \(y_i \in \{-1, +1\}\) is the class label of sample \(i\), and \(a_i\) is the Lagrange multiplier. The samples with non-zero \(a_i\) are the support vectors [7].
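The weight vector of Eq. (1) can be recovered directly from a fitted linear SVM. The sketch below assumes the e1071 R package and is an illustration only, not the code shown in Figure 2.

```r
# Minimal sketch of Eq. (1) using the e1071 package (an assumption; not the paper's Figure 2 code).
# For a linear SVM, model$coefs stores a_i * y_i for the support vectors,
# so t(model$coefs) %*% model$SV reproduces w = sum_i a_i y_i x_i.
library(e1071)

svm_weight_vector <- function(X, y) {
  # X: samples x genes matrix with gene names as column names; y: class labels in {-1, +1}
  model <- svm(X, as.factor(y), kernel = "linear", scale = FALSE)
  drop(t(model$coefs) %*% model$SV)   # one weight per gene (column of X)
}
```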
(Figure 2) The implementation of the SVM-RFE algorithm in R
The choice of which variable SVM-RFE removes is significant: in this method, the removed variable is the one whose removal minimizes the variation of \(\|w\|^2\). Hence, the ranking criterion \(R_c\) for a given variable \(i\) is:
\(R_c = \left|\, \|w\|^{2} - \|w^{(i)}\|^{2} \right| = \tfrac{1}{2}\left| \sum_{k,j} a_k a_j y_k y_j\, x_k^{T} x_j - \sum_{k,j} a_k^{(i)} a_j^{(i)} y_k y_j\, \bigl(x_k^{(i)}\bigr)^{T} x_j^{(i)} \right|\) (2)
where \(x_k\) are training examples and \(y_k\) are class labels. The algorithm first maps \(x\) into a high-dimensional space via a function \(F\) [11], and then maximizes the distance between the set of points \(F(x_k)\) and the hyperplane parameterized by \((w, b)\), where \(w\) is the weight vector and \(b\) is the bias, while remaining consistent with the training set. The solution is determined by Lagrangian theory, where \(a_k\) is the solution of the corresponding quadratic optimization problem and \(\langle F(x_k), F(x_l) \rangle\) is the Gram matrix of the training examples [11].
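Putting Eqs. (1) and (2) together, a minimal sketch of the backward-elimination loop is shown below, using the common linear-kernel simplification in which the ranking criterion reduces to \(w_i^2\). It reuses the svm_weight_vector() helper sketched above and is an illustration under those assumptions, not the paper's Figure 2 implementation.

```r
# Minimal sketch of the SVM-RFE backward-elimination loop (linear kernel),
# reusing svm_weight_vector() from the sketch above; e1071 is assumed.
svm_rfe_ranking <- function(X, y, n_keep = 10) {
  surviving  <- colnames(X)
  eliminated <- character(0)
  while (length(surviving) > n_keep) {
    w <- svm_weight_vector(X[, surviving, drop = FALSE], y)
    criterion <- w^2                                   # linear-kernel form of Eq. (2)
    worst <- surviving[which.min(criterion)]
    eliminated <- c(worst, eliminated)                 # removed last = least informative
    surviving  <- setdiff(surviving, worst)
  }
  list(selected = surviving, eliminated = eliminated)
}
```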
2.3 Off-policy Q-learning
Off-policy Q-learning collects information from occasionally random moves, evaluates states as if a greedy policy were followed, and reduces randomness very slowly. Q-learning is off-policy TD control: it learns how to exploit the action-value function \(Q\) and directly approximates the optimal action-value function, independently of the policy being followed [2-5].
\(Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\bigl(r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\bigr)\) (3)
Off-policy Q-learning evaluates one policy while obeying another: for example, it evaluates the greedy policy while following an exploratory ε-greedy behaviour policy, and thus takes advantage of a more exploratory scheme. The off-policy approach requires the behaviour policy to be soft and may be slower, but it remains more flexible if alternative ways appear [2-5].
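A minimal sketch of tabular off-policy Q-learning with an ε-greedy behaviour policy, implementing the update of Eq. (3), is given below. The environment interface env_step() and the episode/step limits are assumptions for illustration, not part of the paper.

```r
# Minimal sketch of tabular off-policy Q-learning with an epsilon-greedy behaviour policy,
# implementing the update of Eq. (3). env_step(s, a) is an assumed environment interface
# returning list(next_state, reward, done).
q_learning <- function(n_states, n_actions, env_step, episodes = 500, max_steps = 100,
                       alpha = 0.1, gamma = 0.9, epsilon = 0.1) {
  Q <- matrix(0, n_states, n_actions)
  for (ep in seq_len(episodes)) {
    s <- sample(n_states, 1)
    for (t in seq_len(max_steps)) {
      a <- if (runif(1) < epsilon) sample(n_actions, 1) else which.max(Q[s, ])  # behaviour policy
      out <- env_step(s, a)
      # Eq. (3): move Q(s, a) toward r + gamma * max_a' Q(s', a')
      Q[s, a] <- Q[s, a] + alpha * (out$reward + gamma * max(Q[out$next_state, ]) - Q[s, a])
      s <- out$next_state
      if (out$done) break
    }
  }
  Q
}
```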
3. The Proposed Algorithm
In the proposed algorithm (Fig. 3), we combine SVM-RFE with off-policy Q-learning in reinforcement learning for variable selection of feature patterns in several applications. There are some recent research works on variable selection of feature patterns based on SVM. For their criterion, Rakotomamonjy et al. [11] use gradient descent on the derivatives of \(\|w\|^2\) with respect to a scaling vector associated with the variables. In [12], SVM-RFE and the gradient of \(\|w\|^2\) are identified as fundamentally the same, since they share the same ranking criterion. In our proposed algorithm, however, the ranking criterion is slightly affected in an iterative way by the ε-greedy policy.
(Figure 3) The proposed algorithm combining SVM-RFE [1] with off-policy Q-learning [2-5]
In our proposed SVM-RFE with Q-learning, the SVM is trained in each iteration on a different set of genes, G, because of the randomness of the ε-greedy policy. Over these sets G, the action policy improves toward selecting more differentially expressed genes, and it can also be improved by back-propagating the gradient descent. In many state-of-the-art techniques, hyper-parameters such as the Lagrange multiplier or the learning rate are selected by the system designers. In our proposed algorithm, however, we try to eliminate this flaw and mitigate the over-fitting problem by learning only the action policy of off-policy Q-learning in reinforcement learning.
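The following highly simplified sketch illustrates one possible reading of this loop: the ε-greedy choice decides whether a gene is eliminated by the SVM criterion or at random, and the action value is updated with the Q-learning rule using an assumed reward (the training accuracy of the SVM on the surviving genes). The state/action encoding, the reward, and the helper names are our assumptions for illustration, not the implementation of Figure 3; it reuses svm_weight_vector() and e1071's svm() from the earlier sketches.

```r
# Highly simplified sketch (not the paper's Figure 3): each RFE iteration either exploits
# (removes the gene with the smallest w_i^2) or explores (removes a random candidate gene),
# and the action-value table Q is updated with the Q-learning rule using an assumed reward.
rfe_with_q_learning <- function(X, y, n_keep = 10, alpha = 0.1, gamma = 0.9, epsilon = 0.1) {
  yf <- as.factor(y)
  surviving <- colnames(X)
  Q <- c(exploit = 0, explore = 0)                    # toy encoding: one state, two actions
  while (length(surviving) > n_keep) {
    w <- svm_weight_vector(X[, surviving, drop = FALSE], y)
    action <- if (runif(1) < epsilon) "explore" else names(which.max(Q))
    worst  <- if (action == "exploit") surviving[which.min(w^2)] else sample(surviving, 1)
    candidate <- setdiff(surviving, worst)
    model  <- svm(X[, candidate, drop = FALSE], yf, kernel = "linear", scale = FALSE)
    reward <- mean(fitted(model) == yf)               # assumed reward: training accuracy
    Q[action] <- Q[action] + alpha * (reward + gamma * max(Q) - Q[action])
    surviving <- candidate
  }
  surviving
}
```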
Normally, for "high-dimension, low-sample-size" data [13] such as gene-expression arrays, the on-policy methods (TD-learning or SARSA) are regarded as the better solutions: they evaluate or improve the policy used to make decisions, often using soft action choice, i.e. \(\pi(s,a) > 0\ \forall a\), commit to always exploring, and try to find the best policy that still explores. However, such methods can become trapped in local minima [2-5]. Because of this trapping issue, we choose off-policy Q-learning, which evaluates one policy while following another: it evaluates the greedy policy while following a more exploratory ε-greedy scheme. The rules for such behaviour must be soft policies and may be slower [2-5]. However, the rule remains more adaptable if alternative ways appear, which may lead to a better result following the greediness [2-5]. Therefore, we decided to use off-policy Q-learning to further enhance the weight vectors of SVM-RFE.
4. Performance Evaluation
Recent studies claim that, for feature selection on gene expression data, it is extremely important to select as small a significant distinguishable subset as possible for better understanding and validation [7-10]: the smaller the selection, the better the claim. However, a smaller set of feature patterns might not yield highly correlated variable-feature target solutions that stand out against other methods, because of the "curse of dimensionality" [6]. Moreover, smaller variable feature patterns might not be discriminated with significant computational performance. Therefore, we try to discriminate the distinguishable variable selection of feature patterns that describes the complicated gene expressions, with regard to computational strength, on published gene data such as the colon-cancer data of Alon et al. [14].
Fig. 4 shows the original result of Alon et al. on the ribosomal protein cluster. Fig. 5 shows the comparison of the proposed algorithm, SVM-RFE with weight vectors enhanced by Q-learning, and the previous SVM-RFE with weight vectors enhanced by Welch's t-test. The comparison is based on the original result by Alon et al. [14]. We describe how many of the distinguishable selected feature-pattern variables are ranked in the output list. Not all feature patterns selected by Alon et al. [14] appear in our result, nor in the previous method's result. However, the result of our SVM-RFE with Q-learning is slightly better than that of the previous SVM-RFE combined with Welch's t-test. We obtain "gene U14971" (human ribosomal protein S9 mRNA), "gene X57691" (40S ribosomal protein S6), and "gene T58861" (60S ribosomal protein L30E), whereas only two genes, "gene R20593" (60S acidic ribosomal protein P1) and "gene T58861" (60S ribosomal protein L30E), appear in the result of the previous SVM-RFE combined with Welch's t-test.
(Figure 4) The result of Alon et al. [14] (Cell Biology)
(Figure 5) Comparison of the proposed algorithm, SVM-RFE with weight vectors enhanced by Q-learning, and the previous SVM-RFE with weight vectors based on Welch's t-test [7]
Moreover, we found that SVM-RFE itself is well known as one of the most accepted methods in many research areas, because "gene T58861" (60S ribosomal protein L30E) appears in both results, ours and the previous one. At the same time, this indicates that there are some issues that should be improved to obtain better results from SVM-RFE itself.
Our recent research, C. Kim [15], concerns SVM-RFE enhanced with minimum-redundancy maximum-relevance (MRMR). The results of that research [11] concern "how many distinguishable features appear in the same places in the rank list". The research in [15] relies only on machine learning, without enhancing the weight vectors by reinforcement learning. Therefore, the proposed algorithm can improve computational performance by enhancing the previous SVM-RFE with MRMR [7] with weight vectors enhanced by Q-learning, toward a better-qualified learning algorithm. Moreover, based on recent works in reinforcement learning [16, 17], we will improve our results on the ribosomal protein cluster [14].
5. Conclusion
We have suggested a novel algorithm that exploits the efficiency of criteria derived from Support Vector Machine-Recursive Feature Elimination (SVM-RFE) [1] together with off-policy Q-learning in reinforcement learning [2-5] for variable feature selection applied to differentially expressed genes in RNA-seq Big-data. We employ off-policy Q-learning to learn how to control the optimization of the criteria based on the weight vectors of SVM-RFE. We exploit gradient descent on the derivatives of \(\left|\|w\|^2 - \|w^{(i)}\|^2\right|\) together with the \(\max_{a} Q(s_t,a)\)-plus-exploration scheme. The ranking criterion based on the ε-greedy policy is slightly affected in an iterative way by off-policy Q-learning. The SVM of our proposed algorithm is trained on different sets G because of the randomness of the ε-greedy policy; over these sets, the action policy improves toward selecting more differentially expressed genes by back-propagating the gradient descent. Our proposed algorithm tries to mitigate the over-fitting problem by learning only the action policy of off-policy Q-learning in reinforcement learning. By comparing our proposed algorithm with the previous SVM-RFE combined with Welch's t-test [7], we show that the criterion based on the weight vector of SVM-RFE can be improved by the greedy policy following a more exploratory scheme of off-policy Q-learning.
References
[1] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, Vol. 46, pp. 389-422, 2002. https://doi.org/10.1023/A:1012487302797
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, Vol. 529, No. 7587, pp. 484-489, 2016. http://dx.doi.org/10.1038/nature16961
[3] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, Vol. 3, No. 1, pp. 9-44, 1988. https://doi.org/10.1007/BF00115009
[4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Vol. 1, MIT Press, Cambridge, 1998. https://doi.org/10.1016/S1364-6613(99)01331-5
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," NIPS, 2013. http://arxiv.org/abs/1312.5602
[6] T. Tieleman and G. Hinton, "Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, Vol. 4, No. 2, 2012. http://www.citeulike.org/user/ppolon/article/13997430
[7] S. Hansen, "Using Deep Q-Learning to Control Optimization Hyperparameters," arXiv preprint, 2016. https://arxiv.org/abs/1602.04062
[8] Z. Zhou, J. Wang, Y. Wang, Z. Zhu, J. Du, X. Liu, and J. Quan, "Visual Tracking Using Improved Multiple Instance Learning with Co-training Framework for Moving Robot," KSII Transactions on Internet and Information Systems, Vol. 12, No. 11, pp. 5496-5521, 2018. https://doi.org/10.3837/tiis.2018.11.018
[9] D. Zhao, B. Guo, and Y. Yan, "A Sparse Target Matrix Generation Based Unsupervised Feature Learning Algorithm for Image Classification," KSII Transactions on Internet and Information Systems, Vol. 12, No. 6, pp. 2806-2825, 2018. http://dx.doi.org/10.3837/tiis.2018.06.020
[10] M. Qiao, H. Zhao, S. Huang, L. Zhou, and S. Wang, "An Intelligent MAC Protocol Selection Method based on Machine Learning in Wireless Sensor Networks," KSII Transactions on Internet and Information Systems, Vol. 12, No. 11, pp. 5425-5448, 2018. https://doi.org/10.3837/tiis.2018.11.014
[11] A. Rakotomamonjy, "Variable selection using SVM-based criteria," Journal of Machine Learning Research, Vol. 3, pp. 1357-1370, 2003. https://doi.org/10.1162/153244303322753706
[12] P. Leray and P. Gallinari, "Feature selection with neural networks," Behaviormetrika, Vol. 26, Jan. 1999. https://doi.org/10.2333/bhmk.26.145
[13] B. Liu, Y. Wei, Y. Zhang, and Q. Yang, "Deep Neural Networks for High Dimension, Low Sample Size Data," IJCAI-17, pp. 2287-2293, Aug. 2017. https://doi.org/10.24963/ijcai.2017/318
[14] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proc. Natl. Acad. Sci. USA, Vol. 96, No. 12, pp. 6745-6750, June 1999. https://doi.org/10.1073/pnas.96.12.6745
[15] C. Kim, "A MA-plot-based Feature Selection by MRMR in SVM-RFE in RNA-Sequencing Data," Journal of KIIT, Vol. 16, No. 12, pp. 25-30, Dec. 2018. https://doi.org/10.14801/jkiit.2018.16.12.25
[16] A. Amiranashvili, A. Dosovitskiy, V. Koltun, and T. Brox, "TD or not TD: Analyzing the role of temporal differencing in deep reinforcement learning," ICLR 2018. https://dblp.org/rec/bib/journals/corr/abs-1806-01175
[17] B. Amos, I. Rodriguez, J. Sacks, B. Boots, and J. Z. Kolter, "Differentiable MPC for End-to-end Planning and Control," NeurIPS 2018. https://dblp.org/rec/bib/journals/corr/abs-1810-13400