RESEARCH ARTICLE Evaluation of Machine Learning Algorithm Utilization for Lung Cancer Classification Based on Gene Expression Levels

Maxim D Podolsky 1 *, Anton A Barchuk 2 , Vladimir I Kuznetcov 3 , Natalia F Gusarova 1 , Vadim S Gaidukov 1 , Segrey A Tarakanov 1

Introduction
According to the International Agency for Research on Cancer, in 2012 there were about 1,825 thousand new lung cancer cases worldwide (13 % of all new cancer cases) and about 1,590 thousand deaths from lung cancer (19.4 % of all cancer deaths). Among cancers, lung cancer ranks first in men and third in women (Ferlay et al., 2015).
Key elements in reducing mortality among lung cancer patients are early detection, accurate determination of the histological type of the cancer and adequate treatment. Errors in determining the type of lung cancer or, more generally, of a malignant growth reduce treatment efficiency, because the anticancer strategy depends on tumor morphology (morphogenesis). For example, early malignant pleural mesothelioma is optimally treated by extrapleural pneumonectomy followed by radiochemotherapy, whereas metastatic lung cancer is treated with chemotherapy (Pass, 2001). Even so, five-year survival rates for lung cancer remain low; in South Korea, for instance, they reached 20.7 % in 2007-2011 (Jung et al., 2014).
At present, machine learning methods are well suited to supporting a definitive diagnosis. Their final aim is to obtain trained algorithms that infer the type and developmental character of a malignant growth from one or several classification attributes. Clinicians can use these algorithms as auxiliary tools to process large amounts of patient data when establishing a diagnosis (Sun et al., 2013; Yu et al., 2015).
In population screening, machine learning methods are used to differentiate between benign and malignant lung nodules based on low-dose computed tomography (Wang et al., 2013), which is considered a widespread standard in the detection and analysis of lung diseases. When a suspected tumor has been sampled, techniques based on the gene activity pattern in affected cells can be used (Han et al., 2013; Liu et al., 2013). Gene expression levels, which characterize the production rate of protein in lung tumor cells compared with healthy cells, can thus serve as classification attributes (Cheng et al., 2012).
Algorithms typically applied to cancer type classification include Support Vector Machines, Random Forests, Decision Tree, Boosting, K-Nearest Neighbor, LASSO and neural networks (Lei Win et al., 2014; Cai et al., 2015). The effectiveness of these algorithms differs depending on the data sets analyzed. To evaluate and compare them, it is accepted practice to use the Receiver Operating Characteristic curve (ROC curve) and the Matthews Correlation Coefficient (MCC) as measures of the quality of binary (two-class) as well as non-binary classifications (Baldi et al., 2000).
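As an illustration of the MCC mentioned above, the following minimal Python sketch computes it for the binary case from the four cells of a confusion matrix. The function name `mcc` and the counts passed to it are hypothetical and chosen only for illustration; they are not results from this study. Returning 0.0 when a marginal sum is zero is a common convention that is assumed here.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0.0 when any marginal sum is zero.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts, for illustration only.
perfect = mcc(tp=10, tn=10, fp=0, fn=0)    # every sample classified correctly
inverted = mcc(tp=0, tn=0, fp=10, fn=10)   # every sample classified wrongly
print(perfect, inverted)
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it informative for imbalanced data sets such as those considered here.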

Materials
To evaluate the effectiveness of several machine learning algorithms, we processed four publicly available gene expression data sets:
i) Dana-Farber Cancer Institute, Harvard Medical School (Bhattacharjee et al., 2001). Consists of 203 samples: 139 adenocarcinoma, 21 squamous cell lung carcinoma, 20 pulmonary carcinoid, 6 small-cell lung carcinoma and 17 healthy lung samples. Each sample is described by 12,600 gene expression levels. The research task for this data set was to classify cancer types.
ii) University of Michigan (Beer et al., 2002). Consists of 96 samples: 86 primary adenocarcinoma (67 stage I, 19 stage III) and 10 non-neoplastic tissue samples. Each sample is represented by the expression levels of 7,129 genes. The task was to detect adenocarcinoma.
Samples of primary tumor and adjacent non-neoplastic tissue were taken during surgical intervention from May 1994 to June 2000 at the University of Michigan Hospital. Peripheral portions of resected lung carcinomas were sectioned, evaluated by a study pathologist, compared with routine H&E sections of the same tumors, and used for mRNA isolation. Regions chosen for analysis had a tumor cellularity greater than 70% and showed no mixed histology, potential metastatic origin, extensive lymphocytic infiltration or fibrosis.
iii) University of Toronto, Ontario, Canada (Wigle et al., 2002). Consists of 39 samples of non-small cell lung cancer. Twenty-four samples correspond to patients with lung cancer recurrence (stage I: 8 patients, stage II: 13 patients, stage III: 3 patients). The remaining 15 patients are disease-free (stage I: 10 patients, stage II: 2 patients, stage III: 3 patients). The two groups were broadly similar in distribution of age and sex. Each sample is represented by the expression levels of 2,880 genes. The task was to detect recurrences.
The samples were taken during lobectomy or pneumonectomy of patients examined at the University of Toronto, then snap-frozen and stored in liquid nitrogen. Adenocarcinoma was confirmed in 19 patients and squamous cell carcinoma in 14; the remaining 6 patients had adenosquamous carcinoma, large cell undifferentiated carcinoma or carcinoid tumor.
Patients were under observation for more than a year: on average about 26 months for patients with recurrence and 24 months for the rest.
iv) Brigham and Women's Hospital, Harvard Medical School (Gordon et al., 2002). Consists of 181 samples of malignant tissue: 31 malignant pleural mesothelioma and 150 adenocarcinoma. Samples were divided into two sets: training (16 samples of each cancer type) and testing (the remaining 149 samples). Each sample is represented by the expression levels of 12,533 genes. The task was binary classification of malignant pleural mesothelioma versus adenocarcinoma.
All data sets can be downloaded from: http://datam.i2r.a-star.edu.sg/datasets/krbd/index.html

Methods
We used seven machine learning algorithms or algorithm variants to analyze the data sets: i) k-nearest neighbors algorithm (k-NN; k=1, k=5, k=10); ii) Naive Bayes classifier, with attributes either assumed to be normally distributed (NB_normal) or modeled by histograms (NB_histogram); iii) Support vector machine (SVM); iv) C4.5 decision tree.
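To make the simplest of these classifiers concrete, the sketch below implements k-NN with majority voting in plain Python. It is an illustration only: the Euclidean distance metric, the function name `knn_predict` and the toy two-gene expression profiles are assumptions, and the article does not state which implementation or distance measure was used.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training samples."""
    # Rank training samples by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy expression profiles with two hypothetical genes per sample.
train_x = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [5.0, 5.1], [5.2, 4.9], [4.8, 5.3]]
train_y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

print(knn_predict(train_x, train_y, [1.0, 1.0], k=1))
print(knn_predict(train_x, train_y, [5.0, 5.0], k=5))
```

With small sample sizes, larger values of k pull in neighbors from other classes, which is one reason the variants k=1, k=5 and k=10 can behave differently on the same data set.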
The algorithms were trained on the Dana-Farber Cancer Institute, University of Michigan and University of Toronto data sets using 10-fold cross-validation. For the Brigham and Women's Hospital data set we used the training and testing samples that had already been prepared. After the ROC curves were constructed, the area under the ROC curve (AUC) and MCC were calculated.
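The evaluation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the folds are formed by a plain shuffled split (the article does not say whether the folds were stratified or how they were drawn), and AUC is computed through its rank interpretation, the probability that a positive sample receives a higher score than a negative one. The helper names `kfold_indices` and `auc` and the scores are hypothetical.

```python
import random

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

folds = kfold_indices(96, n_folds=10)  # 96 = size of the University of Michigan set
for test_idx in folds:
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    # A classifier would be fitted on train_idx and scored on test_idx here.
    assert len(train_idx) + len(test_idx) == 96

# Hypothetical classifier scores, for illustration only.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))
print(auc([0.9, 0.4, 0.8, 0.3], [1, 1, 0, 0]))
```

Each sample appears in exactly one test fold, so every sample is scored once by a model that never saw it during training.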

Results
Dana-Farber Cancer Institute, Harvard Medical School data set: The data set contains samples of five classes. Because the degrees of freedom of a k-class problem grow as k*(k-1), we constructed a ROC curve and calculated MCC for each class in turn, combining the other classes and labeling them as "not considered class" (Table 1). Table 2 shows the related averaged results.
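The per-class procedure for the five-class data set can be illustrated with a short sketch: each class in turn is treated as positive and all remaining classes are merged into a single negative class before a binary MCC is computed. The class abbreviations, the helper names and the true and predicted labels below are hypothetical and serve only to show the relabeling; they are not data from this study.

```python
import math

def one_vs_rest(labels, target):
    """Relabel a multi-class vector: 1 for `target`, 0 for every other class."""
    return [1 if y == target else 0 for y in labels]

def binary_mcc(y_true, y_pred):
    """MCC from paired binary labels; 0.0 when a marginal sum is zero (assumed convention)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical labels: AD = adenocarcinoma, SQ = squamous, NL = normal lung.
true = ["AD", "AD", "AD", "SQ", "SQ", "NL"]
pred = ["AD", "AD", "SQ", "SQ", "SQ", "NL"]

per_class = {
    c: binary_mcc(one_vs_rest(true, c), one_vs_rest(pred, c))
    for c in sorted(set(true))
}
for c, value in per_class.items():
    print(c, round(value, 3))
```

Averaging the per-class values then gives a single summary figure per algorithm, in the spirit of the averaged results reported for this data set.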

Discussion
Because the processed data are noisy and scattered, false-positive and false-negative calls of differentially expressed genes are to be expected. Experimental results must therefore be analyzed carefully to obtain accurate qualitative and quantitative data.
The support vector machine algorithm showed the best results for the Dana-Farber Cancer Institute (MCC 0.93) and Brigham and Women's Hospital (MCC 0.97) data sets. For the latter data set, k-nearest neighbors with k = 5 reached an MCC of 0.96. These high values indicate that SVM based on the assessment of gene expression levels can be used to classify lung cancer by histological type, as well as to distinguish adenocarcinoma from mesothelioma. The results agree with the study by Li et al. (2014), where SVM showed high accuracy in classifying adenocarcinoma and squamous cell lung carcinoma. However, SVM ranked second after the Bayes tree algorithm in the identification and validation of methylation biomarkers of non-small cell lung cancer (Guo et al., 2015). SVM is also used effectively to distinguish small-cell from non-small cell lung cancer; for example, in the study by Hosseinzadeh et al. (2013) it showed the best accuracy in the analysis of protein attributes.
All algorithms except the C4.5 decision tree (one classification error) accurately distinguished between adenocarcinoma and healthy lung in the University of Michigan data set. However, the C4.5 decision tree showed the best result (MCC 0.67) on the University of Toronto data set. The lower effectiveness of the other algorithms there may be due to the small number of samples.
In conclusion, among the compared machine learning algorithms SVM tends to be the most appropriate auxiliary tool in lung cancer screening, while the others were effective enough to be used for assessing gene expression levels. This makes it possible to predict tumor growth and metastasis with improved performance, reducing the burden on clinicians who establish the diagnosis. Machine learning algorithms can be used to substantially (15-25%) improve the accuracy of predicting cancer susceptibility, recurrence and mortality (Cruz and Wishart, 2006).
Samples of the Brigham and Women's Hospital data set were taken and snap-frozen during surgical operations from 1993 to 2001 at Brigham and Women's Hospital, Boston, MA, USA.

Figure 1. ROC Curves for University of Toronto Data Set. Abbreviations: ROC, receiver operating characteristic

Table 2. Averaged AUC and MCC of Dana-Farber Cancer Institute Data Set
Abbreviations: AUC, area under receiver operating characteristic curve; MCC, Matthews correlation coefficient
Data sets from University of Michigan, University of Toronto and Brigham and Women's Hospital involve binary classification and are summarized in Table 3. ROC curves for the University of Toronto data set are depicted in Figure 1.
DOI: http://dx.doi.org/10.7314/APJCP.2016.17.2.835