Improving the Accuracy of Early Diagnosis of Thyroid Nodule Type Based on the SCAD Method


Introduction
Determining the type of a nodule before surgery is of great importance in the diagnosis and treatment of disease. In some conditions, such as thyroid disease, the type of nodule (benign or malignant) determines the type of surgery (Pourahmad et al., 2015). Thyroid nodules are a common problem in the population. Currently, fine needle aspiration (FNA) is the only effective minimally invasive method for the differential diagnosis of thyroid nodules. However, it is subject to sampling and analysis uncertainties (Zhang and Berardi, 1998) and depends on the expertise of the operator. Therefore, the sensitivity and specificity of this test may not be satisfactory, and a powerful modeling method is urgently needed. Some previous studies have approached this problem in different ways (Finley et al., 2004; Hong et al., 2009; Pourahmad et al., 2015).
In the diagnosis of diseases, various feature selection methods and modeling techniques are available (Ma and Huang, 2008). Conventional and penalized logistic regression, random forests, regression trees, artificial neural networks (ANN), decision trees, fuzzy models and discriminant methods based on gene expression data are among the methods recently applied in medical research (Ghosh and Chinnaiyan, 2005; Mendonça et al., 2007; Yang et al., 2009; Lin et al., 2011; Yan et al., 2011; Talhaa and Al-Elaiwi, 2013; Mansiaux and Carrat, 2014).
Certainly, to obtain a model with minimum error and to attenuate possible biases, it is reasonable to use all the potential factors that may be important in diagnosis. Hence, we encounter high-dimensional data. However, some of these factors are redundant: including them adds complexity to the model without any significant improvement in its performance. Therefore, it is desirable to determine a sparse linear combination of factors that is genuinely effective for detection. The smoothly clipped absolute deviation (SCAD) method is a well-known penalized regression method that can be applied with a large number of variables (Fan and Li, 2001).
Accordingly, the aim of the present study was to select a subset of factors such that an appropriate linear combination of them can distinguish benign from malignant thyroid nodules, using the smoothly clipped absolute deviation (SCAD) method introduced by Fan and Li (2001).

Materials and Methods
In a prospective study from 2008 to 2012, all patients admitted to Rajai and Namazi hospitals in Shiraz, Iran, for surgery of a thyroid nodule were recruited. The FNA test was performed for all patients, and all of them underwent surgery. Characteristics such as gender, age, type and growth of the thyroid gland, FNA test result, duration of the disease, family history of the disease and of cancer, size of the right and left thyroid lobes, and size and volume of the nodules in the left and right lobes were used as predictor variables in the analysis. More details about the study protocol and data collection can be found in Pourahmad et al. (2015).
The SCAD method was used to identify the factors that affect the type of thyroid nodule. By adding a penalty function to the log-likelihood of the model, SCAD shrinks some of the coefficients to exactly zero. Variable selection and coefficient estimation are thus performed simultaneously, which leads to increased precision. In addition, SCAD has the oracle property, i.e. it correctly identifies the zero and non-zero coefficients with a probability tending to one (Fan and Li, 2001).
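The way SCAD sets small coefficients to zero while leaving large ones almost untouched can be seen in the univariate thresholding rule given by Fan and Li (2001). The sketch below is only an illustration: that rule is exact for a least-squares problem with an orthonormal design, not for the logistic model fitted in this study, and the function name `scad_threshold` and the default `a = 3.7` (the value suggested by Fan and Li) are assumptions, since the text does not state which value was used.

```python
def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding of a single unpenalized estimate z (Fan and Li, 2001).

    Illustrative only: exact for orthonormal least squares, not for the
    logistic regression model of the study. a = 3.7 is an assumed default.
    """
    sign = 1.0 if z >= 0 else -1.0
    abs_z = abs(z)
    if abs_z <= 2 * lam:
        # Small estimates: soft-thresholding, values below lam become exactly 0.
        return sign * max(abs_z - lam, 0.0)
    if abs_z <= a * lam:
        # Intermediate estimates: shrinkage is gradually relaxed.
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    # Large estimates: left unchanged, which avoids the bias of LASSO.
    return z


# Hypothetical estimates, thresholded with lam = 1.
for z in (0.3, 0.8, 1.5, 3.0, 5.0):
    print(z, round(scad_threshold(z, lam=1.0), 3))
```

Estimates below `lam` in absolute value are removed, while estimates above `a * lam` are kept as they are; this is the mechanism behind the sparse model described above.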
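Suppose that we have n observations and that $\beta^T = (\beta_1, \beta_2, \ldots, \beta_p)$ is the vector of coefficients of the factors. For logistic regression, the SCAD estimator, in the form given by Fan and Li (2001), maximizes the penalized log-likelihood (Eqs. 1 and 2):

$$Q(\beta) = l_n(\beta) - n \sum_{j=1}^{p} p_\lambda(|\beta_j|) \qquad \text{Eq. (1)}$$

$$p'_\lambda(\theta) = \lambda \left\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a - 1)\lambda}\, I(\theta > \lambda) \right\}, \quad a > 2,\ \theta > 0 \qquad \text{Eq. (2)}$$

in which $l_n(\beta)$ is the usual log-likelihood function of the logistic model and λ is a positive constant called the tuning parameter (13). The amount of shrinkage depends on λ, which is estimated by cross-validation (Fan and Li, 2001; Shahraki et al., 2014).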
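As a numerical companion to this definition, the sketch below evaluates the closed-form SCAD penalty of Fan and Li (2001) and a penalized logistic log-likelihood of the form $l_n(\beta) - n \sum_j p_\lambda(|\beta_j|)$. The function names, the default `a = 3.7`, the absence of an intercept and the toy data are all assumptions made for illustration; none of them is taken from the study.

```python
import math


def scad_penalty(beta_j, lam, a=3.7):
    """Closed-form SCAD penalty p_lambda(|beta_j|) of Fan and Li (2001)."""
    theta = abs(beta_j)
    if theta <= lam:
        return lam * theta                                   # LASSO-like near zero
    if theta <= a * lam:
        return (2 * a * lam * theta - theta ** 2 - lam ** 2) / (2 * (a - 1))
    return lam ** 2 * (a + 1) / 2                            # constant for large effects


def penalized_loglik(beta, X, y, lam, a=3.7):
    """Logistic log-likelihood minus n times the summed SCAD penalty.

    Assumes no intercept and 0/1 outcomes; a sketch, not the ncvreg objective.
    """
    n = len(y)
    loglik = 0.0
    for row, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))          # linear predictor
        loglik += y_i * eta - math.log1p(math.exp(eta))
    penalty = sum(scad_penalty(b, lam, a) for b in beta)
    return loglik - n * penalty


# Toy data (hypothetical): four observations, two binary factors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]
print(penalized_loglik([1.0, -0.5], X, y, lam=0.1))
```

Because the penalty becomes constant beyond `a * lam`, large coefficients are not shrunk further, in contrast to the LASSO penalty, which keeps growing linearly.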
In this article, we randomly split the dataset into a training set and a testing set at a ratio of 7:3. The training set (242 observations) was used for estimation and variable selection with SCAD and for obtaining a linear combination of factors. In this step, 10-fold cross-validation was implemented to obtain the optimal λ with minimal error. Finally, the estimated model was evaluated in the testing set (103 observations) using the receiver operating characteristic (ROC) curve. All analyses were performed using SPSS version 15 and the ncvreg package in R version 3.1.2.
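The splitting and cross-validation scheme described above can be sketched as follows. This is not the authors' code (the analysis was run with ncvreg in R); the function names, the random seed and the `cv_error` callable are hypothetical, and the model-fitting step is deliberately left to the caller.

```python
import random


def split_indices(n, n_train, seed=0):
    """Randomly split range(n) into training and testing index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


def kfold(indices, k=10):
    """Partition the given indices into k folds of nearly equal size."""
    return [indices[i::k] for i in range(k)]


def select_lambda(lambdas, folds, cv_error):
    """Return (best_lambda, mean_error) over the folds.

    cv_error(lam, train_idx, valid_idx) is a hypothetical user-supplied
    function that fits the model on train_idx and returns the error on
    valid_idx.
    """
    best = None
    for lam in lambdas:
        errors = []
        for i, valid in enumerate(folds):
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            errors.append(cv_error(lam, train, valid))
        mean_error = sum(errors) / len(errors)
        if best is None or mean_error < best[1]:
            best = (lam, mean_error)
    return best


# 345 patients, 242 for training and 103 for testing, as reported in the text.
train_idx, test_idx = split_indices(345, 242, seed=1)
folds = kfold(train_idx, k=10)
print(len(train_idx), len(test_idx), [len(f) for f in folds])
```

With 242 training observations, two folds contain 25 observations and eight contain 24. The testing indices are never passed to `select_lambda`, so the tuning parameter is chosen without reference to the data used for the final ROC evaluation.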

Results
Of the 345 patients enrolled in the present study, 182 (52.8%) had two or more thyroid nodules. The mean age of the participating patients (66 male and 279 female) was 40.9 ± 13.4 years (range 15-90 years). Benign thyroid nodules were found in 54.8% of the patients (24.3% male and 75.7% female) and malignant thyroid nodules in 45.2% (14% male and 86% female), whereas the FNA test indicated benign nodules in 50.1% of the patients.
The performance indices of the estimated model for nodule type diagnosis in the testing dataset are shown in Table 1. This table also presents the factors selected by the SCAD logistic regression model and their coefficients (Table 1 and Eq. (3)). The 10-fold cross-validation errors are shown in Figure 1. The SCAD logistic regression model with minimum cross-validation error identified family history of cancer and the FNA test result as the most important factors in the early diagnosis of thyroid nodule type. Many less important features were removed from the model (zero coefficients in Table 1). After elimination of 8 variables, the optimal combination of factors (SCAD model) was obtained as follows (Eq. (3)):

$$\log\left(\frac{p_i}{1 - p_i}\right) = 1.94X_1 + 0.53X_2 - 0.30X_3 + 0.28X_4 + 0.07X_5 - 0.05X_6 - 0.02X_7 - 0.005X_8 \qquad \text{Eq. (3)}$$

Using Eq. (3), we calculated $p_i$ (the probability of a malignant thyroid nodule) for the testing set. The optimal cut-off point on the ROC curve for this equation was 0.44. The AUC of the estimated model was 77% (95% CI: 68%-85%), which was much greater than that of any single factor (Figure 2). The sensitivity, specificity, positive predictive value and negative predictive value of this model were 72%, 75%, 71% and 76%, respectively.
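As a rough illustration of how Eq. (3) and the 0.44 cut-off lead to a predicted nodule type, and of how the four performance indices are defined, consider the sketch below. The coefficients and the cut-off are those reported above; everything else is assumed. In particular, the equation is used as printed (without an intercept), the covariate vector is hypothetical because the coding of X1 to X8 is given only in Table 1, and the confusion-matrix counts are hypothetical: they are not reported in the text and were chosen only because they sum to the 103 testing observations and agree with the reported percentages after rounding.

```python
import math

# Coefficients of Eq. (3) and the ROC cut-off reported in the text.
COEFS = [1.94, 0.53, -0.30, 0.28, 0.07, -0.05, -0.02, -0.005]
CUTOFF = 0.44


def malignancy_probability(x):
    """Inverse-logit of the linear predictor in Eq. (3), as printed (no intercept)."""
    eta = sum(b * x_j for b, x_j in zip(COEFS, x))
    return 1.0 / (1.0 + math.exp(-eta))


def predict_malignant(x, cutoff=CUTOFF):
    """Classify as malignant when the predicted probability reaches the cut-off."""
    return malignancy_probability(x) >= cutoff


def diagnostic_metrics(tp, fp, fn, tn):
    """Standard indices computed from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical patient: only the first factor present, all others zero.
x = [1, 0, 0, 0, 0, 0, 0, 0]
print(round(malignancy_probability(x), 3), predict_malignant(x))

# Hypothetical counts (NOT from the study) compatible with the reported rates.
print({k: round(v, 2) for k, v in diagnostic_metrics(tp=34, fp=14, fn=13, tn=42).items()})
```

Lowering the cut-off below 0.5 trades some specificity for sensitivity, which is a common choice when missing a malignant nodule is considered more costly than an unnecessary referral.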

Discussion
Usually, the decision on thyroidectomy in patients with thyroid nodules is based on the result of the FNA test. If the test indicates a benign nodule, a right, left or subtotal lobectomy is performed; otherwise, a total lobectomy is performed (Pourahmad et al., 2015). Although the FNA test is the only effective method for the differential diagnosis of thyroid nodules, the clinical literature has reported errors in decisions based on this test (Pourahmad et al., 2015). The accuracy of the FNA test in the present data set was 63% (results not shown). Therefore, given the importance of early diagnosis of thyroid nodule type, the present study attempted to improve the accuracy of this diagnosis using the SCAD method. Some previous studies have pursued the same aim with different methods (Finley et al., 2004; Hong et al., 2009; Pourahmad et al., 2015). Simulated data, however, have shown a high classification accuracy for the SCAD method (Yan et al., 2011). Our estimated model yielded a sparse linear combination of factors that improves the accuracy of diagnosis. The ROC curve in the present study also showed high sensitivity and specificity for the SCAD method (Figure 2). Furthermore, the type of penalty function used in the present model is more powerful than some other penalty functions, such as LASSO (Lin et al., 2011).
In the previous study on these data using the artificial neural network (ANN) method, the maximum accuracy and AUC among thirteen batch learning algorithms were found for the gradient descent (GD) learning algorithm (accuracy = 0.71 and AUC = 0.837) (Pourahmad et al., 2015). Accordingly, the SCAD method in the present study showed higher accuracy and lower AUC than the ANN model. In addition, the order of the first ten variables in determining the type of nodule assigned by the ANN method was as follows: multiple nodules, sex, rapid enlargement of the thyroid gland, age, size of the thyroid nodule, family history of thyroid disease, family history of cancer and maximum size of the thyroid nodule. In comparison, the order of the variables with non-zero coefficients in the SCAD model (Table 1) was more strongly confirmed by the physicians. Indeed, family history of cancer and the FNA test result are the two most important factors for malignancy of thyroid nodules in the clinical literature. Furthermore, the important role of the maximum size of the nodules, compared with their volumes and the sizes of the lobes, in predicting malignancy was another interesting result derived from our model. The statistical basis of the SCAD method may be the reason for this superiority over the ANN method (Fan and Li, 2001). Moreover, a medium sample size (100 or more) provides an acceptable accuracy for the SCAD method, whereas a large sample size is required for the ANN method. However, the SCAD method has some limitations in practice. For example, adding or removing a variable leads to noticeable changes in the coefficients of the other variables. In addition, like other penalized methods, SCAD is sensitive to missing data and outliers.
In conclusion, an increase of 10 percent, and thus a greater accuracy rate, in the early diagnosis of thyroid nodule type by statistical methods (SCAD and ANN) compared with the results of FNA testing shows that the presently adopted statistical modeling methods are helpful in disease diagnosis. In addition, the factor ranking offered by these methods would be expected to be valuable in the clinical context.

Figure 1. Error Values in Predicting the Malignant Thyroid Nodule Based on the Number of Factors Included in the SCAD Model

Table 1. Characteristics of the SCAD Logistic Regression Model for Thyroid Nodule Type
DOI: http://dx.doi.org/10.7314/APJCP.2016.17.4.1861
amount of error is occurred when 10 factors were selected.