
FAFS: A Fuzzy Association Feature Selection Method for Network Malicious Traffic Detection

  • Feng, Yongxin (School of Information Science and Engineering, Shenyang Ligong University) ;
  • Kang, Yingyun (School of Information Science and Engineering, Shenyang Ligong University) ;
  • Zhang, Hao (School of Information Science and Engineering, Shenyang Ligong University) ;
  • Zhang, Wenbo (School of Information Science and Engineering, Shenyang Ligong University)
  • Received : 2019.04.28
  • Accepted : 2019.10.05
  • Published : 2020.01.31

Abstract

Analyzing network traffic is the basis of dealing with network security issues. Most network security systems depend on feature selection over network traffic data, and the ability to detect malicious traffic can be improved by a correct feature selection method. An FAFS method, short for Fuzzy Association Feature Selection method, is proposed in this paper for network malicious traffic detection. Association rules, which reflect the relationships among the different characteristic attributes of network traffic data, are mined by association analysis. The membership values of the association rules are obtained by fuzzy reasoning. The data features with the highest correlation intensity in a network data set are determined by comparing the membership values of the association rules. The dimension of the data features is reduced and the detection ability of malicious traffic detection algorithms is improved by the FAFS method. To verify the effect of malicious traffic feature selection by the FAFS method, it is used to select the data features of different data sets in this paper. Then, the K-Nearest Neighbor algorithm, the C4.5 Decision Tree algorithm and the Naïve Bayes algorithm are tested on these data sets. Moreover, the FAFS method is compared with classical feature selection methods. The analysis of the experimental results shows that the precision and recall rates of malicious traffic detection can be significantly improved by the FAFS method, which provides a valuable reference for the establishment of network security systems.


1. Introduction

Network traffic analysis technology is the cornerstone of network security systems. Malicious and benign programs behave quite differently in their network traffic [1]. To achieve network traffic analysis, current traffic is scanned and analyzed by an IDS (intrusion detection system). An IDS is a network traffic analysis technology used to identify malicious traffic within normal traffic [2]. Sandnet is another network behavior analysis environment, which focuses on network traffic analysis [3]. Fantasm is a system that supports safe and productive malware experimentation [4]. With the development of the Internet, network traffic analysis faces the challenges of large traffic volume and low accuracy in detecting malicious traffic. Meanwhile, the classification of network traffic usually depends on only a few of the most relevant features [5]. Therefore, it is necessary to perform some form of feature selection to avoid over-fitting and improve classification performance. Feature selection also helps online or real-time IDS cope with limited computing resources. In addition, because of the relatively high-dimensional feature space of network data sets, most machine learning methods, such as Bayes, Decision Tree and Neural Network, are prone to over-fitting when used in network anomaly traffic detection. Applying feature selection to the initial data sets can effectively avoid over-fitting and reduce the complexity in time and space. The accuracy of detection can be improved as well [6].

2. Related Work

In recent years, feature dimensionality reduction of network traffic data has been studied extensively and deeply. Feature extraction and feature selection are the two main methods of feature dimension reduction [7]. Feature extraction, including Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA), is a combining transformation of the original features to form a new feature space. PCA and KPCA have been widely applied in the analysis of network traffic data, and great progress has been made in research on feature selection of network traffic data. Feature selection refers to the selection of some features from the original feature set to represent the entire data set [8]. According to different evaluation criteria, feature selection can be divided into filter, embedding and wrapper models. Although the principles of feature selection and feature extraction are different, the ultimate goal of both is to reduce the dimension of data sets and improve the effect of data analysis. A systematic method is required to select high-quality features. A divide-conquer and voting strategy is proposed in [9]: first, the original training set is segmented; second, feature subsets are obtained from the segmented subsets; third, the final feature subset is obtained by voting. Aiming at the problem of severely missing data in massive data sets, the mForest algorithm is proposed based on the RF (random forest) algorithm [10]. The correlation among random features is further enhanced by the interpolation performance of the RF algorithm, but its classification effect on multi-class data sets is not very pronounced. The problems of multi-class imbalance and low recall rates for minority classes are also studied [11][12] and targeted feature selection methods are proposed. Although good results have been achieved in experimental environments, the number of features selected equals the number of network applications, which still poses considerable challenges in practical applications. Compared with feature extraction, the feature selection methods of these systems are generally more complex, but the malicious traffic detection models constructed by these systems also achieve much higher detection accuracy.

3. FAFS Method

In order to address system complexity in high-dimensional network traffic data and reduce the difficulty of feature selection, an FAFS method based on the present research is proposed in this paper. Through fuzzy inference calculation, important features in network traffic data can be selected automatically. Most traditional feature selection methods depend on experts or intelligent recognition systems, but the information used by humans and intelligent recognition systems is often uncertain. Human thinking, which is not as precise as classical mathematics, is uncertain, complex and fuzzy. Therefore, fuzzy inference is used to represent and process the uncertain information of feature selection in network traffic. On the premise of fuzzy judgment, an approximate fuzzy conclusion is derived by using fuzzy language rules. The data features with the strongest correlation among feature attributes in network traffic are obtained, and the data contained in these features are used to detect network malicious traffic.

3.1 Fuzzification

The FAFS method first requires fuzzification, which means transforming association rules into fuzzy association rules. The process of fuzzification is to establish the mapping relationship between the exact values of association rules and the fuzzy sets through the membership function, forming the fuzzy association rules. Obtaining fuzzy association rules is divided into four steps: association rule mining, feature word selection and construction of fuzzy sets, data standardization processing, and calculation of membership degree. A detailed description of each step follows.

3.1.1 Association Rules Mining

Features with a certain relevance can be computed from a large amount of network traffic data by association rule mining, and such features appear together in different data categories. Those features are of great significance to the classification of network traffic data. Hence, relevance analysis of data features is indispensable in feature selection. At present, the main methods include the Chi-square test, information gain, the Pearson correlation coefficient and CfsSubsetEval [13]. The limitation of the Chi-square test is the "low-frequency defect", which exaggerates the role of low-frequency features; it results in an excessive chi-square value and eventually leads to errors in feature selection. The limitation of information gain is that it can only examine the contribution of features to the whole system, not specifically to a certain category. The Pearson correlation coefficient mainly measures the degree of linear correlation between two variables, but there is no linear relationship among most data features in network traffic data, so it cannot be applied to most network traffic data for feature selection [14]. CfsSubsetEval evaluates subsets of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them; subsets of features that are highly correlated with the class while having low intercorrelation are preferred. This method performs well in feature selection, but in the classification of network traffic data the decisive features are usually contained in only some of the most closely related features. Under continuous iterative search, the relevance of network traffic data characteristics can be calculated through support and confidence. Although the mined rules contain a large amount of redundancy, some of them contain features that best represent the data set. Therefore, association rule mining is used in this paper to measure the correlation among network traffic characteristics [15].

Before mining association rules from network traffic data, the definitions used in association rule mining are described as follows.

Definition 1: Association Rule

An association rule reflects the relationship among items; it is denoted by r and can be regarded as an implication X → Y. X and Y are called itemsets, with X ⊆ C, Y ⊆ C, X ∩ Y = ∅. Itemset C is an itemset containing items i1, i2, ..., in (n = 1, 2, 3, ...). An item is the specific content of the network traffic data set D. An itemset that contains k items is called a k-itemset, denoted Ck (1 ≤ k ≤ n).

Definition 2: Frequent Itemset

The itemset satisfying the minimum support degree is called a frequent itemset. According to the number of items it contains, a frequent itemset is also called a frequent k-itemset, denoted by Lk (1 ≤ k ≤ n). The occurrence frequency of an itemset is called its support degree. Thus, support(C) expresses the support degree of itemset C: \(\begin{equation} \operatorname{support}(C)=\frac{\operatorname{count}(C)}{m} \end{equation}\), where count(C) is the number of occurrences of itemset C in all transactions and m is the total number of transactions. A selected support degree threshold is called the minimum support degree and is denoted by min_sup.

Definition 3: Strong Association Rule

Association rules that satisfy the minimum confidence degree are called strong association rules. Confidence indicates the frequency of itemset Y in transactions involving itemset X. Thus, confidence(X → Y) expresses the confidence degree of X → Y: \(\begin{equation} \operatorname{confidence}(X \rightarrow Y)=\frac{\operatorname{count}(X \bigcup Y)}{\operatorname{count}(X)} \end{equation}\), where count(X) is the number of occurrences of itemset X in all transactions. A selected confidence threshold is called the minimum confidence degree, denoted by min_conf.

The process of mining association rules is described as follows:

(1) Find_ L1 (Finding frequent 1-itemset L1)

Scanning the network traffic data set D and starting with the candidate itemset C1, L1 is found according to the given min_sup.

(2) Gen_Ck (Generating candidate k-itemsets) and Gen_ Lk (Generating frequent k-itemset)

According to the a priori principle, if an itemset is frequent, then all its subsets must also be frequent. Therefore, when generating the candidate itemset C2, L1 can be used directly to generate it. After C2 is generated, it is pruned according to the given min_sup, and the frequent itemset L2 is generated. By analogy, Ck is generated from Lk-1 and pruned to produce Lk, until the frequent itemset with the maximum number of items is generated.

(3) Gen_StrongAssociationRules (Generating strong association rules)

According to the given min_conf, the frequent k-itemsets Lk are pruned and strong association rules are generated.
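To make the three steps above concrete, the following is a minimal Python sketch of the mining loop. The toy transactions, thresholds and function names are illustrative assumptions, not the implementation used in this paper.

```python
from itertools import combinations

def apriori_rules(transactions, min_sup, min_conf):
    """Mine strong association rules from a list of transactions (sets of items)."""
    m = len(transactions)

    def support(items):
        # Fraction of transactions that contain the itemset (Definition 2)
        return sum(items <= t for t in transactions) / m

    # Find_L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    frequent = list(Lk)

    # Gen_Ck / Gen_Lk: grow candidates from the previous frequent itemsets
    # and prune them by min_sup until no larger frequent itemset exists
    k = 2
    while Lk:
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        Lk = [c for c in Ck if support(c) >= min_sup]
        frequent.extend(Lk)
        k += 1

    # Gen_StrongAssociationRules: keep rules X -> Y with confidence >= min_conf
    rules = []
    for itemset in (f for f in frequent if len(f) > 1):
        for r in range(1, len(itemset)):
            for X in map(frozenset, combinations(itemset, r)):
                conf = support(itemset) / support(X)  # count(X U Y) / count(X)
                if conf >= min_conf:
                    rules.append((set(X), set(itemset - X), conf))
    return rules

# Hypothetical transactions over feature-value items
D = [{"tcp", "http", "SF"}, {"tcp", "http", "SF"},
     {"udp", "dns", "SF"}, {"tcp", "http", "REJ"}]
print(apriori_rules(D, min_sup=0.5, min_conf=0.8))
```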

3.1.2 Selection of Feature Word and Construction of Fuzzy Set

Selecting feature words from association rules is an important step in fuzzifying association rules. The definitions of correlation are as follows:

Definition 4: Fuzzy Set

Let G be a domain; a mapping µF : G → [0, 1] defines a fuzzy set F in G. µF(w), which is called the membership function of F, is used to express the membership degree of w to F. The input domain U and the output domain V are included in the domain G.

Definition 5: Feature Word

Each feature attribute in association rule r is called a feature word w. Each association rule can then be expressed as r = (w1, w2, w3, ..., wl) (1 ≤ l ≤ k), where l is the number of feature attributes in the association rule.

The importance of each feature word in an association rule and the association strength of the feature words in an association rule are divided into three levels: high (H), medium (M) and low (L). In the input domain U, the association rule r = (w1, w2, w3, ..., wl) is used as input, and the fuzzy set A = {Hin, Min, Lin} is established to indicate the importance of a feature word. µA(w) is the membership function of w to the fuzzy set A. In the output domain, y ∈ V, which indicates the association strength of the features contained in the association rule after fuzzy inference, is the value of the output variable, and the fuzzy set B = {Hout, Mout, Lout} is established to indicate the association strength of the features contained in the association rule. µB(y) is the membership function of y to the fuzzy set B.

3.1.3 Data Standardization Processing

Before the membership degree of each feature word is calculated, the input data should be standardized.

(1) Constructing a frequency matrix of feature words

The frequency matrix of feature words is defined as Eq.(1).

\(\begin{equation} W=\left[\mathrm{w}_{i j}\right]_{n \times l}, \quad i=1,2,3, \ldots, n \quad j=1,2,3, \ldots, l \end{equation}\)       (1)

wij is the number of occurrences of feature word j in rule i, n is the number of rules, and l is the number of feature words in a rule.

(2) Normalization

In order to balance the distribution of each feature word in the set, each feature word in the frequency matrix is normalized as Eq. (2).

\(\begin{equation} b_{i j}=\frac{\mathrm{w}_{i j}}{\left(\sum_{k=1}^{l} \mathrm{w}_{i k}^{2}\right)^{1 / 2}}, \quad i=1,2,3, \ldots, n \quad j=1,2,3, \ldots, l \end{equation}\)       (2)
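As a sketch, Eq. (1) and Eq. (2) can be computed with NumPy; the rule set below is a hypothetical stand-in for the mined strong association rules.

```python
import numpy as np

# Hypothetical mined rules: each rule is a list of feature words
rules = [["duration", "service", "flag"],
         ["duration", "src_bytes"],
         ["service", "flag", "src_bytes"]]
vocab = sorted({w for r in rules for w in r})

# Eq. (1): W[i][j] = occurrences of feature word j in rule i
W = np.array([[r.count(w) for w in vocab] for r in rules], dtype=float)

# Eq. (2): divide each row by its Euclidean norm so that feature-word
# frequencies are comparable across rules
B = W / np.linalg.norm(W, axis=1, keepdims=True)
print(B)
```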

3.1.4 Membership Degree Calculation

In this paper, the Gauss function is chosen as the membership function. The expression of the Gauss membership function is defined as Eq. (3).

\(\begin{equation} f(x, \sigma, c)=e^{-\frac{(x-c)^{2}}{2 \sigma^{2}}} \end{equation}\)       (3)

The central position of the function is determined by c, and the shape of the function is determined by σ.

Commonly used methods to construct a membership function include the fuzzy statistics method, the reference function method (Gauss, Triangular and Trapezoidal functions) and so on [16]. Network traffic usually presents a large number of traffic bytes in a short time.

These traffic byte data, such as the number of bytes in a network stream, the duration of a network flow and the average number of bytes per packet, tend to concentrate near a certain value, showing a symmetrical distribution around a peak. Therefore, the Gauss distribution, which follows this distribution law, is chosen as the membership function. Although the Triangular and Trapezoidal functions have similar distribution laws, the Gauss distribution is smoother and more stable, and its resolution can be controlled by adjusting its smoothness.

The normalized value bij is substituted into the corresponding membership function to calculate the MS_Degree (membership degree) of each feature word belonging to each fuzzy set.

The fuzzy association rule r = (µ(w1), µ(w2), µ(w3), ..., µ(wl)) is formed from the membership degrees of the feature words of each association rule r = (w1, w2, w3, ..., wl).
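A sketch of this step follows. The centres c and widths σ of the three fuzzy sets are hypothetical values on the normalized [0, 1] scale; the paper does not fix these parameters.

```python
import math

def gauss(x, sigma, c):
    """Gauss membership function of Eq. (3)."""
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

# Assumed (sigma, c) for the low/medium/high fuzzy sets on [0, 1]
FUZZY_SETS = {"L": (0.15, 0.0), "M": (0.15, 0.5), "H": (0.15, 1.0)}

def ms_degree(b_ij):
    """MS_Degree of one normalized feature-word value in each fuzzy set."""
    return {name: gauss(b_ij, sigma, c) for name, (sigma, c) in FUZZY_SETS.items()}

print(ms_degree(0.82))  # a large normalized value sits mostly in H
```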

3.2 Establishment of Fuzzy Rule Base

Fuzzy rules are generated mainly by manually compiling rules that meet certain needs. The following three fuzzy rules, which cover the most important cases in the network traffic data set, are manually compiled to find features with high correlation strength:

FR1: IF w1 IS H, w2 IS H, w3 IS H, ..., wl IS H THEN r IS H (4)

FR2: IF w1 IS M, w2 IS M, w3 IS M, ..., wl IS M THEN r IS M (5)

FR3: IF w1 IS L, w2 IS L, w3 IS L, ..., wl IS L THEN r IS L (6)

In order to reduce the computational complexity, only the three cases above are considered in this paper. When the importance of the features in an association rule is high, the importance of the association rule is also high; when it is medium, the importance of the rule is medium; and when it is low, the importance of the rule is low. The importance of each association rule is calculated according to these three fuzzy rules.

3.3 Fuzzy Inference Calculation

The calculation results of fuzzy inference mainly depend on the fuzzy implication relation R(U, V) and the composition rule between the fuzzy relation and the fuzzy set. The commonly used methods of fuzzy inference are Mamdani, Larsen and Sugeno inference [17]. The purpose of this paper is to select, from the feature attributes of network traffic data, features that can improve malicious traffic detection, so the Mamdani-type fuzzy inference method is adopted. The fuzzy implication relation of the Mamdani-type fuzzy inference method is a compound proposition composed of U(w) and V(y), denoted by "U(w) → V(y)".

The language information carrying capacity of the Mamdani fuzzy inference system is prominent and it is suitable for expressing expert experience, which makes it well suited to selecting network traffic features. Larsen's inference method is very similar to Mamdani's inference process; the difference is that a product operation replaces the min operation in the calculation of excitation intensity and inference synthesis, which cannot accurately weigh the importance of each feature in feature selection. The membership function of the Sugeno inference output can only be linear or constant, which is not universal. Therefore, the Mamdani-type fuzzy inference method is adopted.

In Eq. (4), Eq. (5) and Eq. (6), the antecedent of each IF-THEN rule is built from atomic fuzzy propositions of the form "T: w IS A". The truth value of such a proposition is the membership degree µA(w) of the variable w to the fuzzy set A, as defined in Eq. (7).

P(T) = µA(w) (7)

According to the three fuzzy rules established in Section 3.2, three sub-fuzzy implication relations are derived from Mamdani's fuzzy inference system as Eq. (8), Eq. (9) and Eq. (10).

\(\begin{equation} \begin{aligned} R_{F R_{1}} &=\mu_{H_{i n}}\left(\mathrm{w}_{1}\right) \wedge \mu_{H_{i n}}\left(\mathrm{w}_{2}\right) \wedge \mu_{H_{i n}}\left(\mathrm{w}_{3}\right) \wedge \ldots \wedge \mu_{H_{i n}}\left(\mathrm{w}_{l}\right) \wedge \mu_{H_{o u t}}(y) \\ &=\min \left\{\mu_{H_{i n}}\left(\mathrm{w}_{1}\right), \mu_{H_{i n}}\left(\mathrm{w}_{2}\right), \mu_{H_{i n}}\left(\mathrm{w}_{3}\right), \ldots, \mu_{H_{i n}}\left(\mathrm{w}_{l}\right), \mu_{H_{o u t}}(y)\right\} \end{aligned} \end{equation}\)       (8)

\(\begin{equation} \begin{aligned} R_{F R_{2}} &=\mu_{M_{i n}}\left(\mathrm{w}_{1}\right) \wedge \mu_{M_{i n}}\left(\mathrm{w}_{2}\right) \wedge \mu_{M_{i n}}\left(\mathrm{w}_{3}\right) \wedge \ldots \wedge \mu_{M_{i n}}\left(\mathrm{w}_{l}\right) \wedge \mu_{M_{o u t}}(y) \\ &=\min \left\{\mu_{M_{i n}}\left(\mathrm{w}_{1}\right), \mu_{M_{i n}}\left(\mathrm{w}_{2}\right), \mu_{M_{i n}}\left(\mathrm{w}_{3}\right), \ldots, \mu_{M_{i n}}\left(\mathrm{w}_{l}\right), \mu_{M_{o u t}}(y)\right\} \end{aligned} \end{equation}\)       (9)

\(\begin{equation} \begin{aligned} R_{F R_{3}} &=\mu_{L_{i n}}\left(\mathrm{w}_{1}\right) \wedge \mu_{L_{i n}}\left(\mathrm{w}_{2}\right) \wedge \mu_{L_{i n}}\left(\mathrm{w}_{3}\right) \wedge \ldots \wedge \mu_{L_{i n}}\left(\mathrm{w}_{l}\right) \wedge \mu_{L_{o u t}}(y) \\ &=\min \left\{\mu_{L_{i n}}\left(\mathrm{w}_{1}\right), \mu_{L_{i n}}\left(\mathrm{w}_{2}\right), \mu_{L_{i n}}\left(\mathrm{w}_{3}\right), \ldots, \mu_{L_{\text {in }}}\left(\mathrm{w}_{l}\right), \mu_{L_{o u t}}(y)\right\} \end{aligned} \end{equation}\)       (10)

As shown in Eq. (11), the total fuzzy implication relation of the system is composed of the three sub-fuzzy implication relations RFR1, RFR2 and RFR3, and the relationship among them is 'or'.

\(\begin{equation} R=R_{F R_{1}} \cup R_{F R_{2}} \cup R_{F R_{3}} \end{equation}\)       (11)

The input feature words are expressed by the vector \(\begin{equation} \vec{W}=\left(\mu\left(\mathrm{w}_{1}\right), \mu\left(\mathrm{w}_{2}\right), \mu\left(\mathrm{w}_{3}\right), \cdots, \mu\left(\mathrm{w}_{l}\right)\right) \end{equation}\). Through the fuzzy inference system, the output fuzzy quantity \(\vec{Y}\) can be obtained by Eq. (12).

\(\begin{equation} \begin{aligned} \vec{Y}=& \vec{W} \circ\left(R_{F R_{1}} \cup R_{F R_{2}} \cup R_{F R_{3}}\right) \\ =& \vec{W} \circ R_{F R_{1}} \cup \vec{W} \circ R_{F R_{2}} \cup \vec{W} \circ R_{F R_{3}} \\ =&\left(\mu\left(\mathrm{w}_{1}\right), \mu\left(\mathrm{w}_{2}\right), \mu\left(\mathrm{w}_{3}\right), \ldots, \mu\left(\mathrm{w}_{l}\right)\right)^{\mathrm{T}} \circ R_{F R_{1}} \cup \\ &\left(\mu\left(\mathrm{w}_{1}\right), \mu\left(\mathrm{w}_{2}\right), \mu\left(\mathrm{w}_{3}\right), \ldots, \mu\left(\mathrm{w}_{l}\right)\right)^{\mathrm{T}} \circ R_{F R_{2}} \cup \\ &\left(\mu\left(\mathrm{w}_{1}\right), \mu\left(\mathrm{w}_{2}\right), \mu\left(\mathrm{w}_{3}\right), \ldots, \mu\left(\mathrm{w}_{l}\right)\right)^{\mathrm{T}} \circ R_{F R_{3}}\\ =& \min \left\{\mu_{H_{i n}}\left(\mathrm{w}_{1}\right), \mu_{H_{i n}}\left(\mathrm{w}_{2}\right), \mu_{H_{i n}}\left(\mathrm{w}_{3}\right), \ldots, \mu_{H_{i n}}\left(\mathrm{w}_{l}\right), \mu_{H_{o u t}}(y)\right\} \bigcup \\ & \min \left\{\mu_{M_{i n}}\left(\mathrm{w}_{1}\right), \mu_{M_{i n}}\left(\mathrm{w}_{2}\right), \mu_{M_{i n}}\left(\mathrm{w}_{3}\right), \ldots, \mu_{M_{i n}}\left(\mathrm{w}_{l}\right), \mu_{M_{o u t}}(y)\right\} \bigcup \\ & \min \left\{\mu_{L_{i n}}\left(\mathrm{w}_{1}\right), \mu_{L_{i n}}\left(\mathrm{w}_{2}\right), \mu_{L_{i n}}\left(\mathrm{w}_{3}\right), \ldots, \mu_{L_{i n}}\left(\mathrm{w}_{l}\right), \mu_{L_{o u}}(y)\right\} \end{aligned} \end{equation}\)       (12)

The symbol '∘' denotes the composition operator of the fuzzy relation. The operation proceeds like ordinary matrix multiplication, but with 'multiplication' replaced by 'minimum'. The symbol '⋃' indicates that the final output of the system is the result of the interaction of the three fuzzy implication relations. The output \(\vec{Y}\) at this point is therefore still a fuzzy subset and must be defuzzified.
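A compact numerical sketch of this min-max composition follows; the discretized output domain and all membership values are illustrative assumptions, consistent with the Gauss sets assumed earlier.

```python
import numpy as np

# Discretized output domain V and assumed output membership curves (sigma = 0.15)
y = np.linspace(0.0, 1.0, 101)
mu_out = {"H": np.exp(-((y - 1.0) ** 2) / 0.045),
          "M": np.exp(-((y - 0.5) ** 2) / 0.045),
          "L": np.exp(-((y - 0.0) ** 2) / 0.045)}

def infer(mu_in):
    """Mamdani inference. mu_in[level] lists the memberships of a rule's
    feature words in input fuzzy set `level` (H, M or L)."""
    Y_vec = np.zeros_like(y)
    for level in ("H", "M", "L"):
        strength = min(mu_in[level])                   # min over antecedents, Eqs. (8)-(10)
        clipped = np.minimum(strength, mu_out[level])  # clip the output set at that strength
        Y_vec = np.maximum(Y_vec, clipped)             # union of the three rules, Eqs. (11)-(12)
    return Y_vec

# Assumed memberships of three feature words in each input fuzzy set
Y_vec = infer({"H": [0.7, 0.6, 0.8], "M": [0.4, 0.5, 0.3], "L": [0.1, 0.2, 0.1]})
```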

3.4 Defuzzification

The task of defuzzification is to find a crisp value to represent the fuzzy subset. In this paper, the area center of gravity method is used as the defuzzification method; it is defined as Eq. (13).

\(\begin{equation} y=\frac{\int_{\vec{Y}} \mathrm{Y} \mu_{\vec{Y}}(\mathrm{Y}) d \mathrm{Y}}{\int_{\vec{Y}} \mu_{\vec{Y}}(\mathrm{Y}) d \mathrm{Y}} \end{equation}\)       (13)

Y stands for the fuzzy quantity in the fuzzy set \(\vec{Y}\), and \(\mu_{\vec{Y}}(\mathrm{Y})\) stands for the membership value of Y subordinated to \(\vec{Y}\).

The area center of gravity method and the maximum membership degree method are commonly used defuzzification methods [18]. The maximum membership degree method chooses the element with the largest membership degree in the inference-result fuzzy set as the output value; the shape of the output membership function is not considered, only the output value at the maximum membership degree. Much information is inevitably lost, and some important features may be lost in the process of feature selection. Therefore, in order to obtain an accurate control quantity, the area center of gravity method is used for defuzzification. It yields smoother output inference control than the maximum membership degree method: slight changes in the input value lead to changes in the output value, which both increases the applicability of the feature selection algorithm and reduces the possibility of missed judgments.

In Eq. (13), y is the crisp value after defuzzification. The association rules with the largest value of y are screened out and placed in the set of association rules MAXVALUE_r to determine the features. The features contained in MAXVALUE_r are used as the features for malicious traffic detection.
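Continuing the inference sketch above, Eq. (13) reduces to a weighted average over the discretized output domain:

```python
import numpy as np

def centroid(y, mu_Y):
    """Area center of gravity defuzzification: discrete form of Eq. (13)."""
    return float(np.sum(y * mu_Y) / np.sum(mu_Y))

y_value = centroid(y, Y_vec)  # crisp association strength of one rule
```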

3.5 Process

The theoretical knowledge and implementation details of the FAFS method are comprehensively introduced above, and the pseudo-code of the method is given as follows:

[Pseudo-code of the FAFS method, presented as two algorithm images in the original.]

The network traffic data set D, min_sup and min_conf are taken as input. The frequent k-itemsets Lk are obtained under min_sup and the strong association rules r are obtained under min_conf. After all strong association rules are fuzzified, the fuzzy value of each strong association rule is obtained by fuzzy inference calculation and is then defuzzified. The resulting value y, which represents the association strength of the features contained in each strong association rule, is sorted from large to small, and the strong association rules with the largest value of y are screened out. The feature attributes contained in these association rules are the features obtained by the fuzzy association feature selection method. At this point, the whole process of the FAFS method is complete.
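Gluing the earlier sketches together, the overall flow might read as follows. This is a sketch under the same assumptions as before (it reuses apriori_rules, ms_degree, infer, centroid and the output grid y defined above), not the paper's own code.

```python
import numpy as np

def fafs(D, min_sup, min_conf):
    """Sketch of the FAFS pipeline: mine, fuzzify, infer, defuzzify, select."""
    rules = apriori_rules(D, min_sup, min_conf)              # Section 3.1.1
    word_lists = [sorted(X | Y_set) for X, Y_set, _ in rules]
    vocab = sorted({w for ws in word_lists for w in ws})

    # Eqs. (1)-(2): feature-word frequency matrix, row-normalized
    W = np.array([[ws.count(w) for w in vocab] for ws in word_lists], float)
    B = W / np.linalg.norm(W, axis=1, keepdims=True)

    scored = []
    for ws, b in zip(word_lists, B):
        # Fuzzify the rule's feature words, then infer and defuzzify
        values = [b[vocab.index(w)] for w in ws]
        mu_in = {lvl: [ms_degree(v)[lvl] for v in values] for lvl in "HML"}
        scored.append((centroid(y, infer(mu_in)), ws))       # Eqs. (8)-(13)

    best = max(score for score, _ in scored)
    # MAXVALUE_r: feature words of the rules with the largest y value
    return {w for score, ws in scored if score == best for w in ws}
```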

4. Experimental Results and Analysis

4.1 Experimental Data and Evaluation Methods

In order to verify the effect and superiority of the FAFS method, it is used to select data features of the KDD CUP99 data set [19], the NSL-KDD data set [20] and the Modbus_traffic network traffic data set [21], which contain dozens of common network attacks. The NSL-KDD data set, which is more suitable for effective and accurate evaluation across different machine learning algorithms, eliminates the redundancy of the KDD CUP99 data set, the proportions of normal and abnormal data are chosen properly, and the volumes of test and training data are more reasonable. Like the KDD CUP99 data set, the NSL-KDD data set contains 41 feature attributes and a data category label. The Modbus_traffic data set is the data captured by Darryfei simulating network attacks. In order of decreasing data volume, the data sets are KDD CUP99, NSL-KDD and Modbus_traffic.

Then, the KNN algorithm (K-Nearest Neighbor algorithm), the C4.5 Decision Tree algorithm and the Naïve Bayes algorithm are used to test on the data sets above. In the KNN algorithm, the parameter k is the number of nearest data points used for judgment; the value of k is set to 11 in this paper. In the C4.5 Decision Tree algorithm, the best feature of the data set is selected by calculating the information gain rate of each feature; the threshold of the information gain rate is set to 0.1.
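A sketch of this test setup with scikit-learn follows; loading of the feature-selected training and test splits is omitted, and DecisionTreeClassifier (CART) stands in for C4.5, which scikit-learn does not implement.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score

# X_train, y_train, X_test, y_test: the data set restricted to the
# FAFS-selected features (loading omitted; these names are placeholders)
classifiers = {
    "KNN (k=11)": KNeighborsClassifier(n_neighbors=11),
    "Decision Tree": DecisionTreeClassifier(),  # CART, used here as a C4.5 stand-in
    "Naive Bayes": GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(name, precision_score(y_test, pred), recall_score(y_test, pred))
```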

At the same time, the precision rate (the proportion of examples labeled positive that are truly positive) and the recall rate (the proportion of positive examples that are correctly labeled positive) are taken as evaluation indexes.
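In the usual notation, with TP, FP and FN denoting true positives, false positives and false negatives:

\(\begin{equation} \text{precision}=\frac{TP}{TP+FP}, \qquad \text{recall}=\frac{TP}{TP+FN} \end{equation}\)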

Moreover, the FAFS method is also compared with classical feature selection methods, namely CFS, GainR, InfoG and Sym, which are short for CfsSubsetEval, GainRatioAttributeEval, InfoGainAttributeEval and SymmetricalUncertAttributeEval respectively. CfsSubsetEval evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy among them; subsets of features that are highly correlated with the class while having low intercorrelation are preferred.

GainRatioAttributeEval evaluates the worth of an attribute by measuring the gain ratio with respect to the class. InfoGainAttributeEval evaluates the worth of an attribute by measuring the information gain with respect to the class. SymmetricalUncertAttributeEval evaluates the worth of an attribute by measuring the symmetrical uncertainty with respect to the class.

4.2 Experimental Results and Analysis

The precision and recall rates are calculated on the original data sets and on the data sets whose features are selected by the CFS, GainR, InfoG and Sym feature selection methods and by the FAFS method.

(1) KDD CUP99 data set

Let fk (k = 1, 2, 3, ..., 41) be the 41 feature attributes in the original KDD CUP99 data set, with f1=duration, f2=protocol_type, f3=service, ..., f41=dst_host_srv_rerror_rate. The results of the KDD CUP99 data set using the FAFS method are partly shown in Table 1.

Table 1. Results of KDD CUP99 data set using FAFS method


According to Table 1, the association rules with the maximum y value of feature attribute association strength are obtained. The maximum value of y is 0.5607, and the number of association rules with the value 0.5607 is 17. The features contained in these association rules are: f1=duration, f2=protocol_type, f3=service, f4=flag, f5=src_bytes, f23=count, f29=same_srv_rate, f32=dst_host_count. These eight features are the features of the KDD CUP99 data set selected by the FAFS method.

Then, the K-Nearest Neighbor algorithm, the C4.5 Decision Tree algorithm and the Naïve Bayes algorithm are used to test on the data containing the eight features from the KDD CUP99 data set. Moreover, the FAFS method is also compared with classical feature selection methods, such as CFS, GainR, InfoG and Sym. The network malicious traffic detection results of the original KDD CUP99 data set, and of the KDD CUP99 data set whose features are selected by the CFS, GainR, InfoG and Sym feature selection methods and by the FAFS method, are shown in Table 2.

Table 2. Detection results of KDD CUP99 data set: (a) precision rate of KDD CUP99 data set


(2) NSL-KDD data set

Let vk (k = 1, 2, 3, ..., 41) be the 41 feature attributes in the original NSL-KDD data set, with v1=duration, v2=protocol_type, v3=service, ..., v41=dst_host_srv_rerror_rate. The results of the NSL-KDD data set using the FAFS method are partly shown in Table 3.

Table 3. Results of NSL-KDD data set using FAFS method


Table 3 shows that the association rules with the maximum y value of feature attribute association strength are obtained. The maximum value of y is 0.5869, and the number of association rules with the value 0.5869 is 198. The features contained in these association rules are: v1=duration, v29=same_srv_rate, v30=diff_srv_rate, v31=srv_diff_host_rate, v33=dst_host_srv_count, v34=dst_host_same_srv_rate, v35=dst_host_diff_srv_rate. These seven features are the features of the NSL-KDD data set selected by the FAFS method.

Then, the K-Nearest Neighbor algorithm, the C4.5 Decision Tree algorithm and the Naïve Bayes algorithm are used to test on the data containing the seven features from the NSL-KDD data set. Moreover, the FAFS method is also compared with classical feature selection methods, such as CFS, GainR, InfoG and Sym. The network malicious traffic detection results of the original NSL-KDD data set, and of the NSL-KDD data set whose features are selected by the CFS, GainR, InfoG and Sym feature selection methods and by the FAFS method, are shown in Table 4.

Table 4. Detection results of NSL-KDD data set: (a) precision rate of NSL-KDD data set


(3) Modbus_traffic data set

Let zk (k = 1, 2, 3, ..., 25) be the 25 feature attributes in the original Modbus_traffic data set, with z1=right_ar, z2=left_ar, z3=sip, ..., z25=content. The results of the Modbus_traffic data set using the FAFS method are partly shown in Table 5.

Table 5. Results of Modbus_traffic data set using FAFS method


As shown in Table 5, the association rules with the maximum y value of feature attribute association strength are obtained. The maximum value of y is 0.5737, and the number of association rules with the value 0.5737 is 1690. The features contained in these association rules are: z1=right_ar, z4=sport, z6=doprt, z9=ptc_lable, z11=uni_lable, z12=fun_code, z14=direction, z16=source_port, z18=destination_port, z22=length, z23=unit_lable, z25=content. These twelve features are the features of the Modbus_traffic data set selected by the FAFS method.

Then, the K-Nearest Neighbor algorithm, the C4.5 Decision Tree algorithm and the Naïve Bayes algorithm are used to test on the data containing the twelve features from the Modbus_traffic data set. Moreover, the FAFS method is also compared with classical feature selection methods, such as CFS, GainR, InfoG and Sym. The network malicious traffic detection results of the original Modbus_traffic data set, and of the Modbus_traffic data set whose features are selected by the CFS, GainR, InfoG and Sym feature selection methods and by the FAFS method, are shown in Table 6.

Table 6. Detection results of Modbus_traffic data set


The detection effect of the K-Nearest Neighbor algorithm on the original data sets, and on the data sets whose features are selected by the CFS, GainR, InfoG and Sym feature selection methods and by the FAFS method, is shown in Fig. 1.


Fig. 1. Detection effect of K-nearest neighbor algorithm

The experiments are run on a server with an Intel(R) Core(TM) i5-7500 at 3.4 GHz. The execution time required by the K-NN algorithm to detect malicious traffic in different data sets using different feature dimensionality reduction methods is shown in Table 7.

Table 7. Execution time required by K-NN algorithm


The detection effect of the C4.5 Decision Tree algorithm on the original data sets, and on the data sets whose features are selected by the CFS, GainR, InfoG and Sym feature selection methods and by the FAFS method, is shown in Fig. 2.


Fig. 2. Detection effect of C4.5 Decision Tree algorithm

The execution time required by the C4.5 Decision Tree algorithm to detect malicious traffic in different data sets using different feature dimensionality reduction methods is shown in Table 8.

Table 8. Execution time required by C4.5 Decision Tree algorithm


The detection effect of the Naïve Bayes algorithm on the original data sets, and on the data sets whose features are selected by the CFS, GainR, InfoG and Sym feature selection methods and by the FAFS method, is shown in Fig. 3.


Fig. 3. Detection effect of Naïve Bayes algorithm

The execution time required by the Naïve Bayes algorithm to detect malicious traffic in different data sets using different feature dimensionality reduction methods is shown in Table 9.

Table 9. Execution time required by Naïve Bayes algorithm


As shown in Table 2, the precision and recall rates of the K-NN algorithm on the KDD Cup99 data set are 88.473% and 98.563% based on the FAFS feature selection method. Based on the FAFS feature selection method, the precision and recall rates of the C4.5 Decision Tree algorithm on the KDD Cup99 data set are 82.819% and 88.113%, and those of the Naïve Bayes algorithm are 81.183% and 88.931%. The results show that the performance of the classification algorithms can be significantly improved by the FAFS method, compared with the original data set. Similarly, based on the FAFS feature selection method, the precision and recall rates of the different algorithms on the NSL-KDD and Modbus_traffic data sets are improved compared with the original data sets in Table 4 and Table 6. This shows that the classification performance of learning algorithms can be improved by the FAFS method.

As a classical detection algorithm, the K-NN algorithm classifies based on the distance between different eigenvalues. As shown in Fig. 1, based on the FAFS feature selection method, the precision and recall rates of the K-NN algorithm on the KDD Cup99 data set are better than those based on the GainR, InfoG and Sym feature selection methods. As depicted in Table 7, the execution times required by the K-NN algorithm on the KDD Cup99, NSL-KDD and Modbus_traffic data sets are 1891.07s, 366.11s and 0.26s based on the FAFS feature selection method, versus 9129.24s, 504.447s and 0.2404s based on the CFS feature selection method. Though the precision and recall rates of the K-NN algorithm on the KDD Cup99 data set based on the FAFS feature selection method are slightly worse than those based on the CFS selection method, the FAFS method can significantly reduce the complexity of the detection model, the occupation of system resources, and the modeling time. Similarly, the classification results of the K-NN algorithm on the NSL-KDD and Modbus_traffic data sets based on the FAFS feature selection method are stable, and the execution time required is much less than that of the other feature selection methods.

According to Fig. 2, the precision and recall rates of the C4.5 Decision Tree algorithm on the KDD Cup99 data set are 82.819% and 88.113% based on the FAFS feature selection method. Based on the CFS feature selection method, the precision and recall rates of the C4.5 Decision Tree algorithm on the KDD Cup99 data set are 9.138% and 10.011%; based on the GainR feature selection method, they are 81.209% and 88.973%, etc. This indicates that the precision and recall rates of the C4.5 Decision Tree algorithm on the KDD Cup99 data set based on the FAFS feature selection method are better than those based on the classical feature selection methods.

Furthermore, the C4.5 Decision Tree algorithm can also search for decisive features in the data set. A characteristic of the C4.5 Decision Tree is that one feature can play a better role in classification after certain other features have been classified. Hence, the classification results of C4.5 can be affected by the data characteristics of the input data. The data selected by different feature selection methods contain different data features, and the classification results are affected differently by taking these features as the input of the C4.5 Decision Tree algorithm. Therefore, the precision and recall rates on some data sets fluctuate considerably after selecting the features through feature selection methods such as CFS and Sym in Fig. 2. As shown in Fig. 2, the classification results of the C4.5 Decision Tree algorithm on the NSL-KDD and Modbus_traffic data sets based on the FAFS feature selection method are better than those based on the classical feature selection methods.

Meanwhile, as shown in Table 8, the execution times required by the C4.5 Decision Tree algorithm on the KDD Cup99, NSL-KDD and Modbus_traffic data sets are 9129.24s, 8.612s and 0.327s based on the CFS feature selection method; 2016.63s, 4.108s and 0.875s based on the GainR feature selection method; 2377.86s, 4.602s and 0.629s based on the InfoG feature selection method; 2248.107s, 3.947s and 0.629s based on the Sym feature selection method; and 2091.07s, 4.44s and 0.43s based on the FAFS feature selection method. It can be seen from Table 8 that the FAFS method can significantly reduce the system modeling time. The FAFS method has a good effect on reducing the complexity of the system, and it also has strong universality.

The core idea of the Naïve Bayes algorithm is to select the decision with the highest probability. As can be seen from Fig. 3, the accuracy of the FAFS method is relatively stable across different data sets. In addition, as shown in Table 9, the execution times required by the Naïve Bayes algorithm on the KDD Cup99, NSL-KDD and Modbus_traffic data sets are 0.733s, 0.508s and 0.240s based on the CFS method; 0.500s, 0.515s and 0.269s based on the GainR feature selection method; 0.449s, 0.5107s and 0.288s based on the InfoG method; 0.455s, 0.447s and 0.305s based on the Sym feature selection method; and 0.32s, 0.41s and 0.21s based on the FAFS feature selection method. It can be seen from Table 9 that the FAFS method can ensure the detection accuracy while reducing the modeling time.

On the whole, the FAFS method outperforms the traditional feature selection methods. Furthermore, although the overall effect of machine learning detection declines on the data set that is similar to real network attacks, the precision and recall rates are still significantly improved by using the FAFS method.

5. Conclusion

Aiming at the problems of too many characteristic attributes in network traffic and low accuracy of malicious traffic detection, an FAFS method is proposed in this paper. A fuzzy inference system is used to filter the rules generated by association mining and the more important features are selected, implementing feature dimensionality reduction for high-dimensional network traffic data. Furthermore, the detection effect for malicious traffic in the network is improved. According to the experimental results, the FAFS method achieves good results in the feature selection of malicious traffic in networks and has been applied to different data sets. Meanwhile, the detection effect of different detection algorithms is improved.

The system modeling time, data redundancy and prediction error are reduced, and the detection ability for malicious traffic in the network is significantly improved. At the same time, compared with traditional feature selection algorithms, the feature selection effect is also significantly improved by the FAFS method, which has good application value in malicious traffic detection in networks.

Acknowledgment

This work was supported by the China Postdoctoral Science Foundation (2016M590234), the Postdoctoral Fund of Shenyang Ligong University, the Project of Applied Basic Research of Shenyang (18-013-0-32), the Natural Science Foundation of Liaoning Province (20180551066), the Program for Liaoning Distinguished Professor, the Program for Liaoning Innovative Research Team in University, the Liaoning BaiQianWan Talents Program (2016) and the Natural Science Foundation of Liaoning Province Project (No. 20170540793). The authors declare that there is no conflict of interest regarding the publication of this article.

References

  1. Jose Andre Morales, Areej Al-bataineh, Shouhuai Xu, Ravi Sandhu, "Analyzing and exploiting network behaviors of Malware," in Proc. of 6th International Conference on Security and Privacy in Communication Systems, vol. 50, pp. 20-34, September 7-9, 2010.
  2. Wei Wang, Yiqiang Sheng, and Jinlin Wang, Xuewen Zeng, Xiaozhou Ye, Yongzhong Huang, Ming Zhu, "HAST-IDS: Learning Hierarchical Spatial-Temporal Features Using Deep Neural Networks to Improve Intrusion Detection," IEEE Access, vol. 6, no. 99, pp. 1792-1806, December, 2017.
  3. Christian Rossow, Christian J. Dietrich, Herbert Bos, Lorenzo Cavallaro, Maarten van Steen, Felix C. Freiling, Norbert Pohlmann, "Sandnet: Network Traffic Analysis of Malicious Software," in Proc. of Workshop on Building Analysis Datasets & Gathering Experience Returns for Security, pp. 77-78, April 10, 2011.
  4. Xiyue Deng, Hao Shi, Jelena Mirkovic, "Understanding Malware's Network Behaviors using Fantasm," in Proc. of LASER 2017 Learning from Authoritative Security Experiment Results, pp. 1-11, October 18-19, 2017.
  5. Razieh Sheikhpour, Mehdi Agha Sarram, Sajjad Gharaghani, Mohammad Ali Zare Chahooki, "A survey on semi-supervised feature selection methods," Pattern Recognit, vol. 64, pp. 141-158, April, 2017. https://doi.org/10.1016/j.patcog.2016.11.003
  6. Zhihong Zhang, Lu Bai, Yuanheng Liang, Edwin Hancock, "Joint hypergraph learning and sparse regression for feature selection," Pattern Recognit, vol. 63, pp. 291-309, June, 2017. https://doi.org/10.1016/j.patcog.2016.06.009
  7. Sergio Ramirez-Gallego, Hector Mourino-Talin, David Martinez-Rego, Veronica Bolon-Canedo, Jose Manuel Benitez, Amparo Alonso-Betanzos, Francisco Herrera, "An information theory-based feature selection framework for big data under apache spark," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, pp. 1441 - 1453, September, 2018. https://doi.org/10.1109/TSMC.2017.2670926
  8. Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, Huan Liu, "Feature selection: a data perspective," ACM Computing Surveys, vol. 50, pp. 94:1-94:45, 2017.
  9. Wen Gao, Yaguan Qian, Chunming Wu, Ye Guo, Kai Zhu, Shuangxi Chen, "The Divide-Conquer and Voting Strategy for Traffic Feature Selection," Chinese Journal of Electronic Science, vol. 43, no. 4, pp. 795-799, April, 2015.
  10. Fei Tang, and Hemant Ishwaran, "Random Forest Missing Data Algorithms," Statistical Analysis & Data Mining the Asa Data Science Journal, vol. 10, no. 6, pp. 221-246, June, 2017.
  11. Xingbin Sun, Yanzan Sun, and Xiaoying Zheng, "A feature selection method for multi-class network traffic," Computer Application Research, vol. 34, no. 2, pp. 568-571, February, 2017.
  12. Xingbin Sun, and Yun Rui, "A Statistical Frequency-Based Method for Network Traffic Feature Selection," Small Microcomputer System, vol. 37, no. 11, pp. 2483-2487, November, 2016.
  13. Mohd Mahmood Ali, Mohd S Qaseem, Lakshmi Rajamani, A Govardhan, "Extracting useful rules through improved decision tree induction using information entropy," International Journal of Information Sciences & Techniques, vol. 3, no. 1, pp. 27-41, January 2013. https://doi.org/10.5121/ijist.2013.3103
  14. Frederico Coelho, Antonio Padua Braga, Michel Verleysen, "Multi-Objective Semi-Supervised Feature Selection and Model Selection Based on Pearson's Correlation Coefficient," International Journal of Information Sciences & Techniques, vol. 6419, no. 1, pp. 509-516, November, 2010.
  15. Qilei Yin, and Pingping Wu, "Detection of Attack Time Series Association Rules Based on Apriori Algorithms," Computer Security, no. 9, pp. 2-7, September, 2014.
  16. A. Salama, R. Saatchi and D. Burke, "Adaptive Sampling Technique for Computer Network Traffic Parameters Using a Combination of Fuzzy System and Regression Model," in Proc. of 4th International Conference on Mathematics and Computers in Sciences and in Industry (MCSI), pp. 206-211, August 24-27, 2017.
  17. T. V. Avdeenko and E.S. Makarova, "Integration of Case-based and Rule-based Reasoning Through Fuzzy Inference in Decision Support Systems," Procedia Computer Science, vol. 103, pp. 447-453, January, 2017. https://doi.org/10.1016/j.procs.2017.01.016
  18. R. Khosravanian, M. Sabah, D. A. Wood, and A. Shahryari, "Weight on drill bit prediction models: Sugeno-type and mamdani-type fuzzy inference systems compared," Journal of Natural Gas Science and Engineering, vol. 36, pp. 280 - 297, November, 2016. https://doi.org/10.1016/j.jngse.2016.10.046
  19. KDDCup1999Data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup1999.html.
  20. DARPA Intrusion Detection Evaluation. http://www.11.mit.edu/IST/ideval/index.html.
  21. Modbus_traffic. http://download.csdn.net/download/a1187006940/9540421.html.