Tri-training algorithm based on cross entropy and K-nearest neighbors for network intrusion detection

  • Zhao, Jia (Nanchang Institute of Technology, School of Information Engineering) ;
  • Li, Song (Nanchang Institute of Technology, School of Information Engineering) ;
  • Wu, Runxiu (Nanchang Institute of Technology, School of Information Engineering) ;
  • Zhang, Yiying (College of artificial intelligence, Tianjin University of Science & Technology) ;
  • Zhang, Bo (State grid smart grid research institute co., ltd) ;
  • Han, Longzhe (Nanchang Institute of Technology, School of Information Engineering)
  • Received : 2022.04.13
  • Accepted : 2022.12.05
  • Published : 2022.12.31

Abstract

To address the low detection accuracy caused by training noise from mislabeling when Tri-training is applied to network intrusion detection (NID), we propose a Tri-training algorithm based on cross entropy and K-nearest neighbors (TCK). The proposed algorithm replaces the classification error rate with cross entropy, which better captures the difference between the practical and predicted distributions of the model and reduces the prediction bias that mislabeled data introduce on unlabeled data; K-nearest neighbors are used to remove mislabeled data and so reduce their number. To verify the effectiveness of the proposed algorithm, experiments were conducted on 12 UCI datasets and the NSL-KDD network intrusion dataset, and four indexes (accuracy, recall, F-measure and precision) were used for comparison. The experimental results show that TCK outperforms the conventional Tri-training algorithm and the Tri-training algorithms that use only the cross-entropy or only the K-nearest neighbor strategy.

Keywords

1. Introduction

With the rapid development of network technology, network security faces serious threats, and network intrusion detection (NID) technology has come to play an important role in network security. Traditional NID techniques detect intrusions by matching traffic against the attacks recorded in a feature code database, but this approach has a high missed-detection rate and lags behind new attacks [1]. Recently, with the rapid development of artificial intelligence [2-5] and machine learning technologies [6-8], machine learning-based NID methods have gradually become a research hotspot.

The lack of labeled data is the main difficulty for machine learning in network intrusion detection, and how to detect intrusions with only a small amount of labeled data is the focus of research. Machine learning-based NID methods can be classified into supervised, unsupervised and semi-supervised detection methods. In supervised NID algorithms, all training data need to be labeled, and a good model cannot be trained when the data lack class labels or when their identifying features are not obvious [9]. Unsupervised NID algorithms can learn from unlabeled data, but cannot achieve high learning precision [1]. Semi-supervised algorithms address the shortcomings of both, and can obtain a model with high learning precision from a small amount of labeled data and a large amount of unlabeled data.

Semi-Supervised Learning (SSL) [10-11] methods mainly include Disagreement-Based Semi-supervised Learning [12], Generative Methods [13], Discriminative Methods [14] and Graph-Based methods [15]. Since the disagreement-based methods are less affected by model assumptions, loss function non-convexity and data size problems, they can meet most network intrusion detection requirements and have good classification properties. Therefore, disagreement-based methods were used for NID in this study.

The disagreement-based methods originated from the co-training algorithm (Co-training) proposed by Blum and Mitchell [16]. The co-training algorithm uses two different views to train two classifiers, which improve each other's performance by expanding each other's training set. Co-training requires two assumptions: (1) sufficient redundancy of views; (2) conditional independence of the views. In practice, few data sets satisfy both. To compensate for these shortcomings, Zhou et al. [17] proposed the Tri-training algorithm. By using three classifiers, it avoids the strict conditions of co-training and does not require sufficiently redundant views in the data set. However, the Tri-training algorithm can generate training noise due to mislabeling, and how to handle this noise effectively is the focus of researchers' attention.

To address the problems of Tri-training, Hu et al. [18] proposed a semi-supervised patent text classification method based on an improved Tri-training algorithm, which makes three changes to Tri-training. First, it uses three base classifiers with large differences trained on the same data set, instead of updating three training sets at the same time. Second, an unlabeled example is added to the labeled training set only when the three base classifiers agree on its label and each labeling probability exceeds that classifier's probability threshold. Finally, the updates of the training set are tracked dynamically and the probability thresholds for the same unlabeled data are updated in real time. Together, these three changes effectively reduce the noise in the training set. Li et al. [19] proposed a novel semi-supervised AdaBoost technique based on an improved Tri-training algorithm. The algorithm rejects noise in the labeled data by comparing the predicted probability of each unlabeled example with a set probability threshold. The calculated probability thresholds are then used as weights: the weighted sum of the current labeling errors is compared with that of the previous round, and if the current sum is smaller, the base classifier is updated using the currently labeled data set. Zhang et al. [20] proposed a safe Tri-training algorithm based on cross entropy. It replaces the classification error rate with cross entropy, which effectively reduces the prediction bias of label noise on unlabeled data in Tri-training. Mo et al. [21] proposed a semi-supervised classification model based on a ladder network and improved Tri-training. It improves the label confidence level on unlabeled data by calculating the classification difference of the classifiers.

The improved Tri-training algorithms mentioned above reduce the labeling noise to a certain extent, but each uses a single strategy, which limits the improvement in classification. Li et al. [22] pointed out in "Safe semi-supervised learning: a brief introduction" that in some cases learning with unlabeled data does not improve the performance of an algorithm, but degrades it by adding noise.

To address the training noise produced by mislabeling when Tri-training is used for NID, we propose the Tri-training algorithm based on cross entropy and K-nearest neighbors (TCK). The algorithm replaces the classification error rate with cross entropy. Cross entropy can better identify the difference between the predicted distribution and the practical distribution of the model, and reduces the prediction bias of mislabeled data on unlabeled data; K-nearest neighbors are used to delete mislabeled data and so reduce their number. Experimental results on the 12 UCI datasets and the NSL-KDD network intrusion data set show that the proposed algorithm improves all four indexes (accuracy, recall, F-measure and precision).

The second section introduces the basic ideas of the Tri-training algorithm, the concepts of relative entropy and cross entropy, and the K-nearest neighbor idea. The third section introduces the principle of the proposed algorithm and gives its pseudocode. In the fourth section, the algorithm is tested experimentally and the results are analyzed. The fifth section summarizes the paper.

2. Related work

This section introduces the basic idea of Tri-training algorithm, the concept of relative entropy and cross entropy, and the idea of K-nearest neighbor.

2.1 Tri-training algorithm

The Tri-training algorithm uses three base classifiers to add pseudo-labels to the unlabeled data. The classifiers label data for one another, and the classification performance improves as their training sets are updated. Assume a data set D that includes a small amount of labeled data L and a large amount of unlabeled data U. L is Bootstrap sampled to train three base classifiers h1, h2 and h3. Take the training process of h1 as an example: x is any point in U, and h2 and h3 both predict x. If h2 and h3 give the same prediction, i.e., h2(x) = h3(x), then x with the label h2(x) is put into a new training set L1', L1' = L ∪ {x | x ∈ U and h2(x) = h3(x)}, and L1' is the training set of h1 in the next round. The training processes of h2 and h3 are similar. The classification performance is improved by continuously updating the training sets Li' (i = 1, 2, 3) and the base classifiers hi (i = 1, 2, 3). This process is repeated until none of the classifiers h1, h2 and h3 changes, and the final classification result is decided by majority voting. The training flow chart is shown in Fig. 1.

E1KOBZ_2022_v16n12_3889_f0001.png (image)

Fig. 1. Training flow chart
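The agreement step described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `agreement_pseudo_labels` and `majority_vote` and the toy threshold classifiers are assumptions introduced here, and each classifier is modeled as a plain callable that maps a sample to a label.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Sample = Tuple[float, ...]
Classifier = Callable[[Sample], Hashable]


def agreement_pseudo_labels(h_j: Classifier, h_k: Classifier,
                            unlabeled: Sequence[Sample]) -> List[Tuple[Sample, Hashable]]:
    """Collect (x, label) pairs from U on which h_j and h_k predict the same label."""
    agreed = []
    for x in unlabeled:
        label = h_j(x)
        if label == h_k(x):
            agreed.append((x, label))
    return agreed


def majority_vote(classifiers: Sequence[Classifier], x: Sample) -> Hashable:
    """Final prediction: the label chosen by most of the classifiers."""
    votes = Counter(h(x) for h in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical one-feature threshold classifiers standing in for h1, h2, h3.
h1 = lambda x: int(x[0] > 0.7)
h2 = lambda x: int(x[0] > 0.5)
h3 = lambda x: int(x[0] > 0.3)

U = [(0.1,), (0.4,), (0.9,)]
# h2 and h3 disagree on (0.4,), so only the other two points are pseudo-labeled for h1.
extra_for_h1 = agreement_pseudo_labels(h2, h3, U)
```

In a full Tri-training round, `extra_for_h1` would be merged with L to retrain h1, and the same step would be repeated for h2 and h3.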

From the training process, it is clear that the algorithm is prone to introducing noise when it adds pseudo-labels to unlabeled data. Angluin and Laird [23] showed that a training set with classification noise is still learnable if the following condition is satisfied:

\(\begin{aligned}m \geq \frac{2}{\epsilon^{2}(1-2 \eta)^{2}} \ln \left(\frac{2 N}{\delta}\right)\end{aligned}\)        (1)

where m is the training sample size, ϵ is the worst-case classification error rate, η (< 0.5) is the upper limit of the classification noise rate, N is the number of hypotheses, and δ is the confidence level.
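As a quick numerical reading of Eq. (1), the sketch below evaluates the smallest integer m that satisfies the bound. The function name `min_sample_size` and the parameter values are illustrative assumptions, not values taken from the paper.

```python
import math


def min_sample_size(eps: float, eta: float, n_hypotheses: int, delta: float) -> int:
    """Smallest integer m satisfying Eq. (1) for the given eps, eta, N and delta."""
    if not 0.0 <= eta < 0.5:
        raise ValueError("the noise rate eta must lie in [0, 0.5)")
    bound = 2.0 / (eps ** 2 * (1.0 - 2.0 * eta) ** 2) * math.log(2.0 * n_hypotheses / delta)
    return math.ceil(bound)


# Illustrative values only: a noisier training set needs more samples.
m_clean = min_sample_size(eps=0.1, eta=0.0, n_hypotheses=1000, delta=0.05)
m_noisy = min_sample_size(eps=0.1, eta=0.2, n_hypotheses=1000, delta=0.05)
```

The factor (1 - 2η)² in the denominator is what makes the required sample size grow quickly as the noise rate approaches 0.5.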

Suppose \(\eta_L\) represents the classification noise rate on data set L; then the number of mislabeled examples in L is \(\eta_L|L|\). Let \(e_i^t\) denote the classification error rate of hj and hk (j, k ≠ i) in round t. Suppose that in round t the number of examples that hj and hk label jointly is z, and the number of those that both hj and hk label correctly is z'; then \(\begin{aligned}e_{i}^{t}=\frac{z-z^{\prime}}{z}\end{aligned}\). The number of mislabeled examples in \(L^t\) is therefore \(e_i^t|L^t|\), and the classification noise rate \(\eta^t\) of round t can be defined as Eq. (2).

\(\begin{aligned}\eta^{t}=\frac{\eta_{L}|L|+e_{i}^{t}\left|L^{t}\right|}{|L|+\left|L^{t}\right|}\end{aligned}\)       (2)

where \(L^t\) is the set of examples that hj and hk (j, k ≠ i) label for hi in round t.

Zhou et al. [17] proved that as long as the updated training set satisfies Eq (3), the performance of the classifier will be improved.

\(\begin{aligned}\left|L \bigcup L^{t}\right|\left(1-2 \frac{\eta_{L}|L|+e_{i}^{t}\left|L^{t}\right|}{\left|L \bigcup L^{t}\right|}\right)^{2}>\left|L \bigcup L^{t-1}\right|\left(1-2 \frac{\eta_{L}|L|+e_{i}^{t-1}\left|L^{t-1}\right|}{\left|L \bigcup L^{t-1}\right|}\right)^{2}\end{aligned}\)       (3)

where \(L^{t-1}\) denotes the set of examples that hj and hk (j, k ≠ i) labeled for hi in round t-1, and \(e_i^{t-1}\) denotes the classification error rate of hj and hk (j, k ≠ i) in round t-1.
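Eqs. (2) and (3) can be checked numerically with a short sketch. The helper names (`noise_rate`, `utility`, `improves`) and the numbers are assumptions for illustration; the code also assumes that L and the pseudo-labeled set are disjoint, so that |L ∪ Lᵗ| = |L| + |Lᵗ|.

```python
def noise_rate(eta_l: float, size_l: int, e_t: float, size_lt: int) -> float:
    """Eq. (2): noise rate of the combined set L plus the pseudo-labeled set."""
    return (eta_l * size_l + e_t * size_lt) / (size_l + size_lt)


def utility(eta_l: float, size_l: int, e_t: float, size_lt: int) -> float:
    """One side of Eq. (3): |L u L^t| * (1 - 2 * eta^t)^2."""
    total = size_l + size_lt
    return total * (1.0 - 2.0 * noise_rate(eta_l, size_l, e_t, size_lt)) ** 2


def improves(eta_l: float, size_l: int,
             e_prev: float, size_prev: int,
             e_cur: float, size_cur: int) -> bool:
    """True when the current round satisfies Eq. (3) relative to the previous round."""
    return utility(eta_l, size_l, e_cur, size_cur) > utility(eta_l, size_l, e_prev, size_prev)


# Illustrative numbers: |L| = 100 with 5% noise.
# Previous round: 50 pseudo-labels at 20% error. Current round: 100 at 10% error.
ok = improves(0.05, 100, e_prev=0.2, size_prev=50, e_cur=0.1, size_cur=100)
# A larger but much noisier pseudo-labeled set fails the condition.
bad = improves(0.05, 100, e_prev=0.2, size_prev=50, e_cur=0.4, size_cur=100)
```

With these numbers the two sides of Eq. (3) are 144.5 and 96 in the first case, and 60.5 and 96 in the second.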

2.2 Relative entropy and cross entropy

Relative entropy, also known as KL (Kullback-Leibler) divergence or information divergence, represents an asymmetric measure of the difference between two probability distributions [24]. In information theory, relative entropy is the difference between the Shannon entropy of two probability distributions [25]. Let P(X) and Q(X) be the practical probability distribution and predicted probability distribution of the random variable X. The relative entropy is defined as [26]:

\(\begin{aligned}D_{K L}(P \| Q)=\sum_{x \in X} P(x) \lg \frac{P(x)}{Q(x)}\end{aligned}\)       (4)

The smaller the relative entropy, the smaller the deviation of the practical probability distribution P(X) of the model from the predicted probability distribution Q(X). When the practical probability distribution is the same as the predicted probability distribution, DKL =0.

The concept of cross-entropy was introduced by Rubinstein [27] to measure the variability between two probability distributions. Rearranging Eq. (4) gives:

\(\begin{aligned} D_{K L}(P \| Q) & =\sum_{x \in X} P(x) \lg P(x)-\sum_{x \in X} P(x) \lg Q(x) \\ & =-H(P(x))+\left[-\sum_{x \in X} P(x) \lg Q(x)\right]\end{aligned}\)       (5)

The relative entropy thus splits into the negative entropy -H(P(x)) of the practical distribution P and the cross-entropy:

\(\begin{aligned}H(P, Q)=-\sum_{x \in X} P(x) \lg Q(x)\end{aligned}\)       (6)

The cross entropy has two important properties: (1) asymmetry, H(P, Q) and H(Q, P) are not equal; (2) non-negativity, the cross-entropy can only be greater than or equal to 0.

According to Eq. (5), the relative entropy DKL(P || Q) changes only through the cross-entropy H(P, Q), because the entropy of P remains unchanged. In earlier machine learning work, the relative entropy DKL was mainly used to determine the difference between the practical probability distribution P(X) and the predicted probability distribution Q(X). From Eq. (6), it is clear that this difference can also be determined by the cross entropy, which is more convenient to compute than the relative entropy.
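The decomposition in Eqs. (4)-(6) can be verified numerically. The sketch below uses base-10 logarithms to match the lg notation of the equations; the function names and the two example distributions are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy H(P) with base-10 logarithms."""
    return -sum(pi * math.log10(pi) for pi in p if pi > 0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """Eq. (6): H(P, Q) = -sum P(x) lg Q(x). Requires Q(x) > 0 wherever P(x) > 0."""
    return -sum(pi * math.log10(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Eq. (4): D_KL(P || Q) = sum P(x) lg (P(x) / Q(x))."""
    return sum(pi * math.log10(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.7, 0.2, 0.1]  # illustrative "practical" distribution
Q = [0.5, 0.3, 0.2]  # illustrative predicted distribution

# Eq. (5): the relative entropy equals the cross entropy minus the entropy of P.
gap = kl_divergence(P, Q) - (cross_entropy(P, Q) - entropy(P))
```

Because H(P) does not depend on Q, ranking candidate predictions Q by H(P, Q) gives the same order as ranking them by DKL(P || Q).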

Cross-entropy has been widely used in machine learning. Too et al. [28] proposed an incremental clustering algorithm based on cross entropy, which uses cross-entropy to map data points in a high-dimensional space to a low-dimensional space in order to partition dynamic data. The experimental results show that this method has lower time complexity in large-scale data environments or dynamic working environments. Liu et al. [29] applied cross-entropy to the class imbalance problem and proposed a new weighted cross-entropy as a loss function. The experimental results show that this method can effectively reduce the impact of noise on the classification results. Santosa [30] applied cross-entropy to a dual Lagrangian support vector machine (SVM), using it to solve the Lagrangian SVM optimization problem and find optimal or at least near-optimal Lagrangian multipliers. The experimental results show that the proposed algorithm has obvious advantages in terms of computation time and accuracy.

2.3 K-nearest neighbors classification

The K-nearest neighbors method was proposed by Cover and Hart [31] in 1967. It assumes that each data point can be represented by its K nearest neighbors. The core idea is that if most of the K nearest neighbors of a sample in the feature space belong to a class, then the sample also belongs to that class.

A distance-based measure is used to find the k nearest neighbors, such as Euclidean distance. Let two points or tuples be X1 = (x11, x12, ⋯, x1n) and X2 = (x21, x22, ⋯, x2n) respectively, then the Euclidean distance of the two points or tuples is:

\(\begin{aligned}\operatorname{dist}\left(X_{1}, X_{2}\right)=\sqrt{\sum_{i=1}^{n}\left(x_{1 i}-x_{2 i}\right)^{2}}\end{aligned}\)       (7)
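A minimal sketch of Eq. (7) and of the majority rule described above is given below. The function names `euclidean` and `knn_predict` and the toy training points are assumptions introduced for illustration.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def euclidean(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Eq. (7): Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def knn_predict(train: List[Tuple[Sequence[float], Hashable]],
                x: Sequence[float], k: int = 3) -> Hashable:
    """Label x with the most common class among its k nearest training points."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Toy two-class data set (hypothetical values).
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b")]
near_origin = knn_predict(train, (0.2, 0.2))
near_b = knn_predict(train, (5.5, 5.0))
```

For the second query only two "b" points exist, so the third neighbor is an "a" point, and the majority rule still returns "b".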

The K-nearest neighbor idea is widely used and has achieved good results in the field of data mining. Wang et al. [32] proposed a new multi-label classification algorithm based on K-nearest neighbors and random walks. The algorithm uses the K-nearest neighbor idea to construct the edge set of the correlation between the vertex set of the random walk graph and the label of the training sample with the K-nearest neighbor training samples of the specific test data, which greatly reduces the time and space overhead. Xiao et al. [33] proposed a fast incremental learning algorithm for SVM based on K-nearest neighbors. The algorithm uses the K-nearest neighbor idea to extract the boundary vector set, and trains the SVM on the boundary vector set instead of the full training set. Ren et al. [34] proposed an efficient density peak clustering algorithm based on hierarchical K-nearest neighbors and subcluster merging. The algorithm uses the K-nearest neighbor idea to divide the data set into multiple layers so that it can obtain better clustering results. Wu et al. [35] proposed a density peaks clustering algorithm based on relative density estimating and multi-cluster merging (DPC-RD-MCM). The algorithm redefines the local density by using the K-nearest neighbor idea. In experiments on data sets with uneven density distribution, complex morphological data sets and UCI data sets, DPC-RD-MCM obtains very good clustering results, and its clustering performance is higher than that of the comparison algorithms. Zhao et al. [36] proposed a density peak clustering algorithm based on mutual proximity. The algorithm introduces the idea of K-nearest neighbors to calculate the local density, so as to ensure the relativity of the local density.

3. Tri-training algorithm based on cross-entropy and K-nearest neighbor (TCK)

3.1 Algorithm Principles

In order to better utilize the unlabeled data for learning and reduce the noise generated during the pseudo labeling process of Tri-training algorithm, this study proposes the TCK.

In semi-supervised learning, the predicted distribution Q(X) obtained from the training data is expected to be as close to the practical distribution P(X) as possible, but the practical distribution P is unknowable. Assuming that the training data are independent and identically distributed samples of the real data, the predicted distribution Q is instead expected to differ as little as possible from the training data distribution P'. Minimizing this difference is equivalent to minimizing the cross-entropy H(P', Q): the smaller the cross entropy, the closer Q is to P'. Tri-training algorithms measure the difference between the practical and predicted distributions by the classification error rate, but the cross entropy is a better measure of this difference. In this study, the classification error rate in Tri-training is therefore replaced by cross entropy.

Tri-training algorithms generate noise when learning from unlabeled data, while semi-supervised learning expects no noise or as little noise as possible. K-nearest neighbors can identify noise effectively, so the K-nearest neighbors idea is used for noise processing. This is done as follows: suppose the pseudo-labeled training set generated by the algorithm in round t is \(L^t\), the initial training set of the algorithm is L, and x is any point in \(L^t\). Find the k nearest neighbors of x in L. If the number of these neighbors that have the same label as x is greater than or equal to k', keep x; otherwise delete it. According to the literature [37], the best experimental results were obtained when k and k' were taken as 3 and 2, and the subsequent experiments of this study use the same settings.
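The filtering rule above can be sketched as follows, using k = 3 and k' = 2 as in the text. This is an illustrative reading of the description rather than the authors' code: the function name `knn_filter`, the use of Euclidean distance via `math.dist` (Python 3.8+), and the toy data are assumptions.

```python
import math
from typing import Hashable, List, Sequence, Tuple

Labeled = Tuple[Sequence[float], Hashable]


def knn_filter(pseudo_labeled: List[Labeled], labeled: List[Labeled],
               k: int = 3, k_prime: int = 2) -> List[Labeled]:
    """Keep a pseudo-labeled point only if at least k_prime of its k nearest
    neighbors in the original labeled set L carry the same label."""
    kept = []
    for x, label in pseudo_labeled:
        neighbors = sorted(labeled, key=lambda pair: math.dist(pair[0], x))[:k]
        same = sum(1 for _, y in neighbors if y == label)
        if same >= k_prime:
            kept.append((x, label))
    return kept


# Hypothetical labeled set L with two well-separated classes.
L = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
     ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
# The second pseudo-label contradicts its neighborhood and should be removed.
L_t = [((0.5, 0.5), 0), ((5.5, 5.5), 0), ((5.2, 5.1), 1)]
filtered = knn_filter(L_t, L)
```

The neighbors are searched in L only, so a cluster of mutually consistent but wrong pseudo-labels cannot vote for itself.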

3.2 Procedures

Algorithm: TCK

Input: unlabeled data set U, labeled data set L, testing data set T.

Output: Classification result h(x)

1. for i = 1 to 3 do

2. Si ← BootstrapSample(L)

3. hi ← Learn(Si)

4. ei' ← 0.5; Li' ← 0

5. end for

6. repeat until none of hi (i = 1 to 3) changes

7. for i = 1 to 3 do

8. Li ← ∅; Si' ← ∅; updatei ← False

9. \(\begin{aligned}e_{i} \leftarrow Measure \; Cross \; Entropy\left(\frac{H_{j}+H_{k}}{2}\right)(j, k \neq i)\end{aligned}\)

10. if (ei < ei')

11. then for every x ∈ U do

12. if hj(x) = hk(x) (j, k ≠ i)

13. then Li ← Li ∪ {(x, hj(x))}

14. end for

15. if (Li' = 0)

16. then \(\begin{aligned}L_{i}^{\prime} \leftarrow\left[\frac{e_{i}}{e_{i}^{\prime}-e_{i}}+1\right]\end{aligned}\)

17. if (Li' < |Li|)

18. then if (ei |Li| < ei' Li')

19. then updatei ← True

20. else if \(\begin{aligned}L_{i}^{\prime} > \frac{e_{i}}{e_{i}^{\prime}-e_{i}}\end{aligned}\)

21. then \(\begin{aligned}L_{i} \leftarrow \operatorname{Subsample}\left(L_{i},\left[\frac{e_{i}^{\prime} L_{i}^{\prime}}{e_{i}}-1\right]\right)\end{aligned}\)

22. updatei ← True

23. end for

24. for i = 1 to 3 do

25. if updatei = True

26. then for every l ∈ Li do

27. if (the number of the k nearest neighbors of l in L with the same label as l ≥ k')

28. then Si' ← Si' ∪ {l}

29. end for

30. hi ← Learn(L ∪ Si'); ei' ← ei; Li' ← |Li|

31. end for

32. end repeat

33. Output: h(x) ← \(\begin{aligned}\arg \max _{y} \sum_{i:\, h_{i}(x)=y} 1\end{aligned}\)

Steps 1-5 of the pseudocode use Bootstrap sampling to obtain three different base classifiers and initialize the error parameter ei' and the pseudo-labeling scale Li'. Step 9 uses cross entropy to estimate the classification error ei of the classifier pair hj, hk (j, k ≠ i): the predicted distribution is first obtained by predicting the labeled data set L, and the cross entropies Hj and Hk are then calculated against the practical distribution of L by Eq. (6) to obtain the error ei = (Hj + Hk) / 2. Step 10 determines whether the cross-entropy error ei is smaller than the initial classification error ei'. In Steps 11-14, hj and hk (j, k ≠ i) expand the training set of hi by voting: when the two classifiers label an unlabeled sample consistently, the sample is added with that pseudo-label. Steps 15-23 ensure that the pseudo-labeled training set can improve the algorithm performance. Steps 24-29 use the K-nearest neighbors idea to identify and remove noise in the pseudo-labels, and Step 30 retrains hi on the filtered set. Finally, majority voting is used to predict the class labels of the test data set.
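Step 9 can be illustrated with a small sketch. The paper does not spell out how the predicted and practical distributions on L are represented, so the code below makes an explicit assumption: each classifier outputs a class-probability vector per labeled sample, the practical distribution is the one-hot true label, and the cross entropy of Eq. (6) is averaged over L. The function names and the probability clipping constant are also assumptions.

```python
import math
from typing import List, Sequence


def mean_cross_entropy(true_labels: Sequence[int],
                       predicted_probs: Sequence[Sequence[float]],
                       floor: float = 1e-12) -> float:
    """Average of Eq. (6) over L when P is one-hot: -lg Q(true class) per sample.
    Probabilities are clipped at `floor` to avoid lg(0) (an assumption)."""
    total = 0.0
    for y, probs in zip(true_labels, predicted_probs):
        total += -math.log10(max(probs[y], floor))
    return total / len(true_labels)


def pair_error(true_labels: Sequence[int],
               probs_j: Sequence[Sequence[float]],
               probs_k: Sequence[Sequence[float]]) -> float:
    """e_i = (H_j + H_k) / 2 for the two classifiers other than h_i."""
    h_j = mean_cross_entropy(true_labels, probs_j)
    h_k = mean_cross_entropy(true_labels, probs_k)
    return (h_j + h_k) / 2.0


# Hypothetical predictions on a two-sample labeled set with true classes 0 and 1.
labels = [0, 1]
probs_j: List[List[float]] = [[1.0, 0.0], [0.0, 1.0]]  # perfectly confident and correct
probs_k: List[List[float]] = [[0.1, 0.9], [0.1, 0.9]]  # wrong on the first sample
e_i = pair_error(labels, probs_j, probs_k)
```

Unlike the 0/1 error rate, this estimate also penalizes correct predictions that are made with low confidence.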

4. Experimental results and analysis

4.1 Experimental data set

To verify the effectiveness of the proposed algorithm, this study uses 12 data sets [38] from the UCI machine learning database (see Table 1) and the network intrusion data set NSL-KDD. NSL-KDD is a commonly used network intrusion data set that improves on KDD-CUP99 by removing its redundant data. Each record has 41 attributes, and the last column is the class label. There are four attack categories and one normal category (Normal). The attack categories are DoS, an attack that attempts to shut down traffic to and from a target system; Probing, an attack that attempts to extract information from a network; U2R, an attack that starts with a regular user account and attempts to access a system or network as the super user (root); and R2L, an attack that attempts to gain local access to a remote machine. To meet the experimental needs, each data set is divided into three parts: the test data set T, the labeled data set L, and the unlabeled data set U, which account for 20%, 20%, and 60% of the data, respectively.
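The 20% / 20% / 60% partition into T, L and U could be produced as in the sketch below. The paper does not say how the split was drawn, so the random shuffle, the fixed seed and the function name `split_dataset` are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Record = TypeVar("Record")


def split_dataset(records: Sequence[Record],
                  seed: int = 0) -> Tuple[List[Record], List[Record], List[Record]]:
    """Shuffle the records and split them into test T (20%), labeled L (20%)
    and unlabeled U (the remaining 60%)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    fifth = len(shuffled) // 5
    test = shuffled[:fifth]
    labeled = shuffled[fifth:2 * fifth]
    unlabeled = shuffled[2 * fifth:]
    return test, labeled, unlabeled


# Stand-in for a data set of 100 records.
test_set, labeled_set, unlabeled_set = split_dataset(range(100))
```

The labels of the records placed in U would be hidden from the learner and used only for evaluation, if at all.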

Table 1. UCI data set

E1KOBZ_2022_v16n12_3889_t0001.png (image)

4.2 Experimental setup

To verify the effectiveness of the proposed algorithm and to test the effect of cross entropy and K-nearest neighbors strategies on the performance of Tri-training algorithms, four sets of experiments were done, namely Tri-training algorithm, Tri-training algorithm based on cross entropy (TCE), Tri-training algorithm based on K-nearest neighbors (TKNN) and Tri-training algorithm based on cross entropy and K-nearest neighbor (TCK). Tri-training is the benchmark algorithm. In this study, the algorithm performance is evaluated using four metrics: accuracy, precision, recall and F-measure. Table 2 is the confusion matrix associated with the indexes.

Table 2. Confusion matrix

E1KOBZ_2022_v16n12_3889_t0002.png (image)

In the confusion matrix, TP and TN denote the numbers of correctly classified positive and negative tuples, and FP and FN denote the numbers of tuples incorrectly classified as positive and negative; P and N are the total numbers of actual positive and negative tuples. Among the indexes, accuracy is the percentage of tuples correctly identified by the classifier; precision is the percentage of tuples predicted as positive that are actually positive; recall is the percentage of actual positive tuples that are predicted as positive; and F-measure is the harmonic mean of precision and recall. The closer the value of these metrics is to 1, the better the performance of the algorithm. The performance metrics are calculated as follows:

\(\begin{aligned}accuracy=\frac{TP+TN}{P+N}\end{aligned}\)       (8)

\(\begin{aligned}precision=\frac{TP}{TP+FP}\end{aligned}\)       (9)

\(\begin{aligned}recall=\frac{TP}{TP+FN}\end{aligned}\)       (10)

\(\begin{aligned}F\text{-}measure=\frac{2 \times precision \times recall}{precision + recall}\end{aligned}\)       (11)
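Eqs. (8)-(11) translate directly into code. The sketch below is illustrative: the function name `binary_metrics`, the convention of returning 0 when a denominator is zero, and the example counts are assumptions, and P + N is taken as the total number of tuples TP + FP + FN + TN.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical confusion matrix: 40 TP, 10 FP, 20 FN, 30 TN.
scores = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

For these counts the F-measure can also be checked by hand as 2TP / (2TP + FP + FN) = 80 / 110.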

4.3 Experimental results analysis of UCI data set

The algorithms were evaluated on the UCI data sets in terms of accuracy, precision, recall and F-measure. For multi-class problems, precision and F-measure were calculated using the weighted mean, and recall was calculated using the macro-average. The experimental results are shown in Tables 3-6, and the best results are marked in bold.
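The difference between the weighted mean and the macro-average can be seen in a short sketch. It assumes the usual definitions (weighted: per-class scores averaged with class frequencies as weights; macro: unweighted mean over classes); the function names and the toy labels are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_precision(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class precision averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    score = 0.0
    for cls, count in support.items():
        predicted = sum(1 for p in y_pred if p == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        precision = hits / predicted if predicted else 0.0
        score += count / len(y_true) * precision
    return score


def macro_recall(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class recall."""
    support = Counter(y_true)
    recalls = []
    for cls, count in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / count)
    return sum(recalls) / len(recalls)


# Toy three-class example with an imbalanced class 0.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
wp = weighted_precision(y_true, y_pred)
mr = macro_recall(y_true, y_pred)
```

The macro-average treats the small classes as equally important as the large one, which matters for rare attack types.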

Table 3. Accuracy

E1KOBZ_2022_v16n12_3889_t0003.png (image)

Table 4. Precision

E1KOBZ_2022_v16n12_3889_t0004.png (image)

Table 5. Recall

E1KOBZ_2022_v16n12_3889_t0005.png (image)

Table 6. F-measure

E1KOBZ_2022_v16n12_3889_t0006.png (image)

According to Tables 3-6, TCK has a clear advantage on the 12 UCI datasets. TCK is superior in accuracy, recall, F-measure and precision on 8, 8, 8 and 7 data sets, respectively, indicating that it improves each metric to a different degree. Tri-training achieved good results in accuracy, recall, F-measure and precision on only 3, 3, 2 and 3 data sets, respectively. On accuracy, Tri-training and TCE tie for first place on only one data set, and TKNN and TCK also tie for first place on only one data set.

To further analyze the performance of the four algorithms, their combined performance is analyzed from a statistical point of view. In this study, the Friedman test is used to compute the rank means of the four evaluation indexes: accuracy, recall, F-measure and precision. The Friedman test is a test of significant difference, and its rank mean reflects the comprehensive performance of an algorithm: the larger the rank mean, the better the comprehensive performance. The rank means of the Friedman test are shown in Table 7.
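The rank means used here can be computed as in the sketch below. It assumes that, within each data set, the algorithm with the highest score receives the highest rank (so that a larger rank mean is better, as stated above) and that tied scores share the average of their ranks. The function names and the score table are made up for illustration and are not results from the paper.

```python
from typing import List, Sequence


def ranks_ascending(scores: Sequence[float]) -> List[float]:
    """Rank scores from 1 (lowest) to n (highest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = average
        pos = end + 1
    return ranks


def mean_ranks(table: Sequence[Sequence[float]]) -> List[float]:
    """Average rank of each algorithm (column) over all data sets (rows)."""
    totals = [0.0] * len(table[0])
    for row in table:
        for col, rank in enumerate(ranks_ascending(row)):
            totals[col] += rank
    return [total / len(table) for total in totals]


# Made-up accuracy table: rows are data sets, columns are four algorithms.
table = [[0.80, 0.85, 0.82, 0.90],
         [0.70, 0.75, 0.75, 0.80],
         [0.60, 0.58, 0.65, 0.70]]
rank_means = mean_ranks(table)
```

In this toy table the fourth column wins on every data set and therefore reaches the maximum possible rank mean of 4.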

Table 7. Rank mean values of indexes in four algorithms

E1KOBZ_2022_v16n12_3889_t0007.png (image)

As shown in Table 7, TCK ranked first in all four indexes. In the mean value of the four indexes, TCK ranked first, TCE second, Tri-training third, and TKNN fourth. TKNN ranks last because it is not combined with cross entropy: its initial classification error rate is high, so correct labels are eliminated during noise removal, which results in poor classification performance.

4.4 Experimental results analysis of NSL-KDD NID dataset

Because the NSL-KDD data set is large, 10% of it is selected for the experiment. Accuracy is used as the evaluation index, and the experimental results are shown in Table 8.

Table 8. Accuracy of the four algorithms in NSL-KDD

E1KOBZ_2022_v16n12_3889_t0008.png (image)

As shown in Table 8, TCK obtains the highest accuracy on NSL-KDD for all four attack classes and for the normal class. TCE ranked second in the categories Normal and R2L, and TKNN ranked second in the categories DoS, Probing, and U2R. The improvement is most obvious on U2R, which is due to the small U2R training set. The improvement on the remaining classes is somewhat weaker, because their larger training sets already allow a good model to be trained.

5. Conclusions

This paper presents a Tri-training network intrusion detection algorithm based on cross-entropy and K-nearest neighbors (TCK). Since the learning process of the Tri-training algorithm generates training noise due to mislabeling, the TCK algorithm replaces the classification error rate with cross entropy to reduce the difference between the practical distribution and the predicted distribution, and uses K-nearest neighbors to remove pseudo-labeling noise. Experiments on the UCI data sets and the network intrusion data set NSL-KDD show that TCK improves significantly on Tri-training, TCE and TKNN in four indexes (accuracy, recall, F-measure and precision) and achieves better detection in NID. The key to optimizing the performance of the Tri-training algorithm is the accurate identification and effective removal of noise, which will be further investigated in the future.

Acknowledgement

This research was supported by the National Natural Science Foundation of China under Grant (Nos. 52069014, 61962036), the Jiangxi Province Department of Education Science and Technology Project under Grant (No. GJJ180940).

References

  1. W. H. Luo, C. D. Xu, "Network Intrusion Detection Based on Improved MajorClust Clustering," Netinfo Security, vol. 20, no. 2, pp. 14-21, 2020.
  2. J. Zhao, D. D. Chen, R. B. Xiao, Z. H. Cui, H. Wang and I. Lee, "Multi-strategy ensemble firefly algorithm with equilibrium of convergence and diversity," Applied Soft Computing, vol. 123, no. 1, pp. 108938, Jul. 2022. https://doi.org/10.1016/j.asoc.2022.108938
  3. H. S. Wu and R. B. Xiao, "Flexible wolf pack algorithm for dynamic multidimensional knapsack problems," Research, vol. 2020, pp. 1762107, Feb. 2020.
  4. H. S. Wu, J. J. Xue, R. B. Xiao and J. Q. Hu, "Uncertain bilevel knapsack problem based on improved binary wolf pack algorithm," Frontiers of Information Technology & Electronic Engineering, vol. 21, no. 9, pp. 1356-1368, Jun. 2020. https://doi.org/10.1631/FITEE.1900437
  5. J. Zhao, L. Lv, H. Wang, H. Sun, R. X. Wu and Z. F. Xie, "Particle Swarm Optimization based on Vector Gaussian Learning," KSII Transactions on Internet and Information Systems, vol. 11, no. 4, pp. 2038-2057, Apr. 2017. https://doi.org/10.3837/tiis.2017.04.012
  6. L. Lv, X. D. Zhou, P. Kang, X. F. Fu, X. M. Tian, "Multi-Objective Firefly Algorithm with Hierarchical Learning," Journal of Network Intelligence, vol. 6, no. 3, pp. 411-427, Aug. 2021.
  7. J. Zhao, W. P. Chen, R. B. Xiao, J. Ye, "Firefly algorithm with division of roles for complex optimal scheduling," Frontiers of Information Technology & Electronic Engineering, vol. 22, no. 10, pp. 1311-1333, Oct. 2021. https://doi.org/10.1631/FITEE.2000691
  8. L. Lv, J. Y. Wang, R. X. Wu, H. Wang, I. Lee, "Density peaks clustering based on geodetic distance and dynamic neighborhood," International Journal of Bio-Inspired Computation, vol. 17, no. 1, pp. 24-33, Feb. 2021. https://doi.org/10.1504/IJBIC.2021.113363
  9. S. Y. Wu, J. Yu, X. P. Fan, "Intrusion Detection Algorithm Based on Tri-training," Computer Engineering, vol. 38, no. 6, pp. 158-160, 2012. https://doi.org/10.3969/j.issn.1000-3428.2012.06.052
  10. J. W. Liu, Y. Liu, X. L. Luo, "Semi-supervised learning methods," Chinese Journal of Computers, vol. 38, no. 8, pp. 1592-1617, 2015.
  11. O. Chapelle, B. Schölkopf and A. Zien, Eds., "Semi-Supervised Learning (Chapelle, O. et al., Eds.; 2006) [Book reviews]," IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542-542, Mar. 2009.
  12. Z. H. Zhou, "Disagreement-based Semi-supervised learning," Acta Automatica Sinica, vol. 39, no. 11, pp. 1871-1878, 2013. https://doi.org/10.3724/SP.J.1004.2013.01871
  13. R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of eugenics, vol. 7, no. 2, pp. 179-188, Sep. 1936. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
  14. D. J. Miller, H. S. Uyar, "A mixture of experts classifier with learning based on both labelled and unlabelled data," in Proc. of the 9th International Conference on Neural Information Processing Systems (NIPS 1996), Cambridge, MA, USA, pp. 571-577, 1996.
  15. A. Blum, S. Chawla, "Learning from labeled and unlabeled data using graph mincuts," in Proc. of the 18th International Conference on Machine Learning (ICML 2001), San Francisco, CA, USA, pp. 19-26, 2001.
  16. A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proc. of the Eleventh Annual Conference on Computational Learning Theory (COLT 1998), New York, NY, USA, pp. 92-100, Jul. 1998.
  17. Z. H. Zhou and M. Li, "Tri-training: exploiting unlabeled data using three classifiers," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1529-1541, Nov. 2005. https://doi.org/10.1109/TKDE.2005.186
  18. Y. Q. Hu, Q. Y. Qiu, X. Yu, "Semi-supervised patent text classification method based on improved Tri-training algorithm," Journal of Zhejiang University (Engineering Science), vol. 54, no. 2, pp. 331-339, 2020.
  19. D. M. Li, J. W. Mao, S. Fuke, "A Novel Semi-supervised Adaboost Technique Based on Improved Tri-training," in Proc. of Australasian Conference on Information Security and Privacy (ACISP 2019), Cham, Switzerland, pp. 669-678, 2019.
  20. Y. Zhang, R. R. Chen, J. Zhang, "Safe tri-training algorithm based on cross entropy," Journal of Computer Research and Development, vol. 58, no. 1, pp. 60-69, 2021.
  21. J. W. Mo, P. Jia, "Semi-supervised classification model based on ladder network and improved tri-training," Acta Automatica Sinica, vol. 48, no. 8, 2022.
  22. Y. F. Li, D. M. Liang, "Safe semi-supervised learning: a brief introduction," Frontiers of Computer Science, vol. 13, no. 4, pp. 669-676, Jun. 2019. https://doi.org/10.1007/s11704-019-8452-2
  23. D. Angluin, P. Laird, "Learning from noisy examples," Machine Learning, vol. 2, no. 4, pp. 343-370, Apr. 1988. https://doi.org/10.1007/BF00116829
  24. S. Kullback, R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, Mar. 1951. https://doi.org/10.1214/aoms/1177729694
  25. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, Massachusetts, USA: MIT Press, 2016.
  26. D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge, UK: Cambridge University Press, 2003.
  27. R. Y. Rubinstein, "Optimization of computer simulation models with rare events," European Journal of Operational Research, vol. 99, no. 1, pp. 89-112, May 1997. https://doi.org/10.1016/S0377-2217(96)00385-2
  28. G. Too, X. J. Cheng, F. B. Qin, "Incremental clustering algorithm via cross-entropy," Journal of Systems Engineering and Electronics, vol. 16, no. 4, pp. 781-786, Dec. 2005. https://doi.org/10.1109/JSEE.2005.6071247
  29. H. Liu, Z. Liu, W. Jia, D. Zhang and J. Tan, "A Novel Imbalanced Data Classification Method Based on Weakly Supervised Learning for Fault Diagnosis," IEEE Transactions on Industrial Informatics, vol. 18, no. 3, pp. 1583-1593, Mar. 2022. https://doi.org/10.1109/TII.2021.3084132
  30. B. Santosa, "Application of the Cross-Entropy Method to Dual Lagrange Support Vector Machine," in Proc. of the 5th International Conference on Advanced Data Mining and Applications (ADMA 2009), Beijing, China, pp. 595-602, 2009.
  31. T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, Jan. 1967. https://doi.org/10.1109/TIT.1967.1053964
  32. Z. W. Wang, S. K. Wang, B. T. Wan, "A novel multi-label classification algorithm based on K-nearest neighbor and random walk," International Journal of Distributed Sensor Networks, vol. 16, no. 3, Mar. 2020.
  33. H. Xiao, F. Sun, Y. Liang, "A Fast Incremental Learning Algorithm for SVM Based on K Nearest Neighbors," in Proc. of 2010 International Conference on Artificial Intelligence and Computational Intelligence (ICCAI 2010), Sanya, China, pp. 413-416, 2010.
  34. C. Ren, L. Sun, Y. Yu and Q. Wu, "Effective Density Peaks Clustering Algorithm Based on the Layered K-Nearest Neighbors and Subcluster Merging," IEEE Access, vol. 8, pp. 123449-123468, Jun. 2020. https://doi.org/10.1109/access.2020.3006069
  35. R. X. Wu, S. H. Yin, J. Zhao, P. W. Li, B. H. Liu, "Density Peaks Clustering based on Relative Density Estimating and Multi Cluster Merging," Control and Decision, 2022.
  36. J. Zhao, Z. F. Yao, L. Lv, T. H. Fan, "Density peaks clustering based on mutual neighbor degree," Control and Decision, vol. 36, no. 3, pp. 543-552, Mar. 2021.
  37. J. S. Sanchez, R. Barandela, A. I. Marques, et al., "Analysis of new techniques to obtain quality training sets," Pattern Recognition Letters, vol. 24, no. 7, pp. 1015-1022, Apr. 2003. https://doi.org/10.1016/S0167-8655(02)00225-8
  38. D. Dua, C. Graff, UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml