Colorectal Cancer Staging Using Three Clustering Methods Based on Preoperative Clinical Findings

The stage of colorectal cancer can be determined definitively only after surgery, based on pathology results, and sometimes even this proves impossible. The aim of the present study was to determine colorectal cancer stage using three clustering methods based on preoperative clinical findings. All patients referred to the Colorectal Research Center of Shiraz University of Medical Sciences for colorectal cancer surgery from 2006 to 2014 were enrolled in the study. In total, 117 cases were included. Three clustering algorithms were used: k-means, hierarchical and fuzzy c-means clustering. External validity measures (sensitivity, specificity and accuracy) were used to evaluate the methods. Hierarchical clustering yielded the highest accuracy and sensitivity, and fuzzy c-means clustering the highest specificity. Furthermore, according to the internal validity measures for the present data set, the optimal number of clusters was two (silhouette coefficient), and the fuzzy c-means algorithm became more appropriate than the k-means approach as the number of clusters increased.


Introduction
Colorectal cancer is one of the most prevalent cancers and a leading cause of cancer death worldwide. It is the third most common cancer in men and the second in women worldwide (Jemal et al., 2010). In Iran, colorectal cancer is the fifth most common cancer in men and the third in women (Hoseini et al., 2014; Maeda et al., 2015). This malignancy tends to present at late stages and has a poor outcome. Stage at presentation is the most important prognostic factor in patients with colorectal cancer (Jee et al., 2015; Roder et al., 2015). Currently, the American Joint Committee on Cancer (AJCC) tumor node metastasis (TNM) staging system is commonly used for pathologic staging of colorectal cancer (Hari et al., 2013). After the pathologic diagnosis is established, the locoregional and distant extent of the disease should be determined to provide a baseline for defining the preferred therapy and the prognosis. Preoperative clinical staging is usually based on the findings of physical examination and imaging studies, particularly computed tomography (CT) of the abdomen and pelvis, and chest imaging (Kijima et al., 2014). Further imaging studies such as MRI, transrectal ultrasonography and PET scan may improve the accuracy of preoperative clinical staging; however, they impose an additional cost on patients (Petersen et al., 2014; Li et al., 2015).
Generally, data classification or clustering is one way to control and manage information. Clustering is a type of classification method in which data are separated into groups according to their similarity in some common characteristics. These similarities are usually calculated with distance formulas (Xu and Wunsch, 2005; Bataineh et al., 2011). Most clustering algorithms need no statistical assumptions about the distribution of the data. Hence, they are very useful classification methods when there is no prior knowledge about the data.
Clustering algorithms are widely used in medical research, for example for clustering disease risk factors (Lee, 2014), clustering the symptoms of a disease (Shahrbanian et al., 2015), gene expression data (Sarkar and Maulik, 2015), image processing (Nguyen et al.; Ryali et al., 2015) and pattern recognition (Yang et al., 2015). With the introduction of fuzzy logic into statistical analysis, clustering approaches have found further applications in clinical studies (Belhassen and Zaidi, 2010; Bataineh et al., 2011; Bunyak et al., 2011; Clifford et al., 2011; Ekong et al., 2011; Fallahi et al., 2011; Hirsch et al., 2011; Keller et al., 2011; Pang et al., 2012; Xu et al., 2013). In fuzzy clustering algorithms, each case can belong to more than one cluster simultaneously, with a different membership degree for each cluster. Therefore, a more comprehensive analysis is possible by taking these different membership degrees into account.
This study aimed to investigate a clustering system for prediction of preoperative clinical staging of colorectal cancer. Fuzzy c-means clustering and two classical algorithms including k-means and hierarchical clustering methods were applied.

Data set
This retrospective study was carried out at the Colorectal Research Center of Shiraz University of Medical Sciences. One hundred and seventeen patients with histologically proven colorectal adenocarcinoma were treated and followed up between January 2006 and January 2014 at our department. Patients with other epithelial pathologies such as squamous cell carcinoma, with nonepithelial tumors, or with recurrent disease were excluded. Moreover, we excluded patients with missing or incomplete medical records, those who lacked complete pathological reports, and those who had received neoadjuvant therapy. Tumors were restaged according to the 7th edition of the AJCC TNM staging system (2010). All patients with non-metastatic disease were initially treated with standard curative surgery. Patients with resectable metastatic disease were also initially treated with surgery. However, patients with disseminated disease were staged clinically and initially treated with systemic therapy. Preliminary evaluation included a comprehensive history and physical examination, colonoscopy, and chest, abdominal and pelvic computed tomography (CT) scans for all primary sites, plus pelvic MRI and/or transrectal ultrasonography for rectal tumors. PET/CT scan was performed in selected cases (Omidvari et al., 2015).
The pretreatment information was obtained from the patients' records. We collected 25 clinical and pathological variables including the patients' characteristics (age, sex, weight and BMI), presentations (symptom duration, anemia, abdominal pain, colicky abdominal pain, constipation, weight loss, rectal bleeding, jaundice, nausea and vomiting), tumor characteristics (differentiation, location, growth pattern based on colonoscopic findings, and obstruction), and the results of liver function tests (alkaline phosphatase and bilirubin levels). Imaging findings from CT scan, MRI, transrectal ultrasonography and PET/CT scan were not included among the variables in this study.
Accurate staging was defined according to postoperative pathological findings (in locoregional tumors) and imaging findings (in metastatic disease). Furthermore, metastatic disease was confirmed by biopsy in those with suspected or limited metastatic foci.

Statistical analysis
Based on how the data are separated, clustering techniques are divided into hierarchical and non-hierarchical approaches (Xu and Wunsch, 2005). The k-means algorithm is a well-known non-hierarchical method in which the number of clusters (k) is fixed at the beginning. The k cluster centers are selected randomly from among the data and updated during an iterative process. An objective function is built from the Euclidean distances between the data points and the k centers, and each data point is assigned to the cluster that minimizes this function, i.e. to its nearest center. The centers are then updated: the mean of the data in each cluster becomes the new center. These steps are repeated until the cluster memberships no longer change (Xu and Wunsch, 2005).
Because k-means requires the number of clusters as an input parameter, hierarchical clustering is recommended when there is no prior information about the data, not even the number of clusters. This method includes two techniques, agglomerative and divisive (Xu and Wunsch, 2005). In the agglomerative method, each case initially forms a separate cluster, and clusters are merged until a stopping criterion is met. In contrast, the divisive method starts with all cases in one cluster and splits clusters until the stopping criterion is met. In the present study, the agglomerative technique was performed with three different linkage criteria: nearest-neighbor or single linkage, based on the smallest distance between the members of two clusters; farthest-neighbor or complete linkage, based on the largest distance between the members of two clusters; and average linkage, based on the mean distance between all the members of two clusters.
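The two classical procedures described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the study: the function names (`euclidean`, `kmeans`, `agglomerative`), the fixed random seed, and the choice of a target number of clusters as the stopping criterion for the agglomerative method are assumptions made for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k initial centers drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # no membership changed: stop
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def agglomerative(points, n_clusters, linkage="single"):
    """Agglomerative clustering; merging stops at n_clusters (assumed criterion)."""
    clusters = [[i] for i in range(len(points))]  # every case starts alone

    def cluster_distance(a, b):
        dists = [euclidean(points[i], points[j]) for i in a for j in b]
        if linkage == "single":    # nearest neighbor
            return min(dists)
        if linkage == "complete":  # farthest neighbor
            return max(dists)
        if linkage == "average":   # mean over all pairs
            return sum(dists) / len(dists)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > n_clusters:
        pairs = (
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # Merge the two clusters that are closest under the chosen linkage.
        a, b = min(pairs, key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

Calling `kmeans(points, 4)` or `agglomerative(points, 4, linkage="single")` on a list of numeric tuples returns one cluster label per case. The exhaustive pair search is adequate for a data set of about a hundred cases but would be too slow for large ones.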
The other clustering approach used in the present study was fuzzy c-means clustering. It is an extension of the k-means technique in which an extra real-valued parameter in [0, 1] is added to the objective function to represent the membership degree of each datum in each of the k clusters. For each datum, these membership degrees must sum to one over the k clusters (Bataineh et al., 2011). Ten-fold cross-validation was used for system validation. To evaluate the methods, external validity measures (accuracy, sensitivity and specificity) and internal validity measures (Silhouette coefficient and Dunn's partition coefficient) were used (Belhassen and Zaidi, 2010). The external validity measures were calculated for each stage, and their mean values were reported.
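The fuzzy c-means procedure can be sketched as follows, using the standard alternating update of centers and memberships. This is an illustrative sketch rather than the MATLAB routine used in the study: the function names `fuzzy_c_means` and `hard_labels`, the random initialization of the memberships, and the use of the change in the objective function as the stopping test are assumptions. The default arguments mirror the fuzzification parameter, iteration limit and stop criterion reported in the Results.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy c-means; returns (memberships, centers).

    memberships[j][i] is the degree to which point j belongs to cluster i.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalized so that each row sums to one.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers, prev_obj = [], None
    exponent = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Centers: means of the data weighted by membership ** m.
        centers = []
        for i in range(c):
            w = [u[j][i] ** m for j in range(n)]
            total = sum(w)
            centers.append(
                tuple(sum(w[j] * points[j][d] for j in range(n)) / total
                      for d in range(dim))
            )
        # Distances to every center (floored to avoid division by zero).
        dist = [
            [max(math.sqrt(sum((p[d] - ctr[d]) ** 2 for d in range(dim))), 1e-12)
             for ctr in centers]
            for p in points
        ]
        # Membership update; each row sums to one by construction.
        u = [
            [1.0 / sum((dist[j][i] / dist[j][k]) ** exponent for k in range(c))
             for i in range(c)]
            for j in range(n)
        ]
        # Objective: membership-weighted sum of squared distances.
        obj = sum(u[j][i] ** m * dist[j][i] ** 2 for j in range(n) for i in range(c))
        if prev_obj is not None and abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return u, centers


def hard_labels(memberships):
    """Assign each case to the cluster with its maximum membership degree."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in memberships]
```

`hard_labels` reflects the assignment rule used in this study, in which each patient is placed in the cluster with the highest membership degree. The full membership rows remain available for inspecting the second-highest degree.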

Results
The preoperative information of 117 patients with colorectal cancer was used in the present study. The mean (±SD) age of the patients was 58.3±12.9 years. More than half of them were men (54.7%). According to the postoperative pathology results, 17.1% of the patients were in stage 1, 33.3% in stage 2, 44.4% in stage 3 and 5.2% in stage 4. Table 1 shows the patients' characteristics according to their hospital records.
DOI: http://dx.doi.org/10.7314/APJCP.2016.17.2.823
To start the clustering algorithms, the data set was divided into training and testing subsets (80% and 20%, respectively) and four clusters were considered (k=4). The clusters were built on the training set, and the methods were compared on the testing set. Table 2 displays a summary of the results. For the initial parameters of the analysis, the default values in MATLAB were used: 2 for the fuzzification parameter, 100 iterations for convergence and 1e-5 for the stop criterion. For the hierarchical clustering algorithm, the nearest-neighbor method was used since it was more accurate than the others (not shown here). For the fuzzy c-means algorithm, each patient was assigned to the cluster in which the patient had the maximum membership degree. Moreover, among 5-fold, 7-fold and 10-fold cross-validation, the 10-fold method was the most accurate (not shown here). For the training and testing subsets, the 80% vs. 20% division led to more accurate results than 70% vs. 30%.
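The per-stage external validity measures used for the comparison can be computed in a one-vs-rest fashion, as in the sketch below. It assumes that every test case already has a predicted stage; how clusters are mapped to stages is not covered here. The function name `external_validity` and the decision to leave stages with an undefined ratio out of the mean are assumptions for illustration.

```python
def external_validity(true_stages, predicted_stages, stages=(1, 2, 3, 4)):
    """Mean one-vs-rest sensitivity, specificity and accuracy over the stages."""
    sens, spec, acc = [], [], []
    pairs = list(zip(true_stages, predicted_stages))
    for s in stages:
        tp = sum(1 for t, p in pairs if t == s and p == s)
        fn = sum(1 for t, p in pairs if t == s and p != s)
        fp = sum(1 for t, p in pairs if t != s and p == s)
        tn = sum(1 for t, p in pairs if t != s and p != s)
        if tp + fn:  # sensitivity is defined only if the stage occurs
            sens.append(tp / (tp + fn))
        if tn + fp:  # specificity is defined only if other stages occur
            spec.append(tn / (tn + fp))
        if pairs:
            acc.append((tp + tn) / len(pairs))

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return {
        "sensitivity": mean(sens),
        "specificity": mean(spec),
        "accuracy": mean(acc),
    }
```

For example, `external_validity([1, 1, 2, 2, 3, 3, 4, 4], [1, 2, 2, 2, 3, 3, 4, 3])` gives a mean sensitivity of 0.75 and a mean accuracy of 0.875 over the four stages.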

Discussion
A system that predicts the cancer stage before surgery is valuable for treating and controlling the disease (Ludwig and Weinstein, 2005). Indeed, the decision to perform urgent surgery or to give chemotherapy or radiation before surgery, as well as the type of surgery, depends on the cancer stage (Cirocco, 2000; Omidvari et al., 2013). This topic has received little attention in previous research. Therefore, the present study is important in two respects: prediction of the cancer stage before surgery, and comparison of three different clustering algorithms (k-means, hierarchical and fuzzy clustering). Fuzzy clustering is of interest in cancer staging because the border between the stages of a cancer cannot be regarded as crisp. In other words, there is no exact clinical definition of when a patient passes from one stage to the next. The fuzzy clustering technique calculates, for each individual, the degree of belonging to every stage of the cancer.
Although the maximum membership degree is used for the final decision, the second-highest degree can be taken into account when planning further treatment (Karemore et al., 2010).
To the best of our knowledge, few studies have addressed cancer staging with clustering algorithms (Nguyen et al.), and none has applied fuzzy clustering to this problem. However, the fuzzy clustering approach has been used and compared with other techniques for disease diagnosis (Bunyak et al., 2011; Ekong et al., 2011; Keller et al., 2011), image classification (Belhassen and Zaidi, 2010; Fallahi et al., 2011; Pang et al., 2012; Xu et al., 2013), pattern recognition (Hirsch et al., 2011) and genome information (Clifford et al., 2011). Of these, some studies found fuzzy clustering to perform better than the classical clustering approaches (Bunyak et al., 2011; Fallahi et al., 2011), while others reported more accurate results for classical methods such as the hierarchical approach, similar to our results (for instance, Clifford et al., 2011).
In the present study, these three algorithms were applied to 117 patients with colorectal cancer who underwent surgery. Hierarchical clustering predicted the stage most accurately. It was also the most sensitive method, whereas fuzzy c-means was the most specific. Furthermore, Dunn's partition coefficient showed that fuzzy c-means was more suitable than k-means as the number of clusters increased (0.78, 0.63, 0.62, 0.55 and 0.49 for one to six clusters, respectively).
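As a point of reference for these values, the partition coefficient of a fuzzy partition is commonly computed as the mean of the squared membership degrees. The sketch below assumes that this usual definition is the one meant by Dunn's partition coefficient here; the function name `partition_coefficient` is hypothetical.

```python
def partition_coefficient(memberships):
    """Mean squared membership degree over all cases.

    memberships[j][i] is the degree of case j in cluster i, with each row
    summing to one. The value lies between 1/c (completely fuzzy) and
    1 (completely crisp), where c is the number of clusters.
    """
    n = len(memberships)
    return sum(u * u for row in memberships for u in row) / n
```

Because the lower bound 1/c falls as the number of clusters grows, values obtained for different numbers of clusters are not on the same scale.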
Consequently, hierarchical clustering was a suitable technique for colorectal cancer staging on this dataset. However, the data had some limitations. Because few patients were in the first and fourth stages (17.1%, 33.3%, 44.4% and 5.2% in the first to fourth stages, respectively), internal validity measures such as the Silhouette coefficient (0.43, 0.34 and 0.33 for two to four clusters, respectively) indicated two as the optimal number of clusters. In addition, some important laboratory factors such as carcinoembryonic antigen (CEA) were not available in the majority of the patients' hospital records. Such findings may improve preoperative cancer staging. Therefore, a complete dataset matched to the study's objectives should be used to build a prediction system or to evaluate the methods.
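The Silhouette coefficient used to choose the number of clusters can be sketched as below, following its usual definition with Euclidean distances. The function name `silhouette_coefficient` and the convention of scoring a single-member cluster as zero are assumptions for illustration. At least two clusters are required.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette width over all cases (requires two or more clusters)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def dist(i, j):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(points[i], points[j])))

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # single-member cluster: scored as zero by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(i, j) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(i, j) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Running the same clustering method for two, three and four clusters and comparing the resulting coefficients reproduces the kind of comparison reported above, where the largest value indicates the preferred number of clusters.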