Bayesian Survival Analysis of High-Dimensional Microarray Data for Mantle Cell Lymphoma Patients

Among different types of cancers, the lymphoma are group of malignant diseases with the origin of lymphocytes (Fisher, 2005; Adamson et al., 2007). The lymphoma is divided into two types, the Hodgkin’s lymphoma and the non-Hodgkin’s lymphoma. Specific type of the non Hodgkin lymphoma is the Mantle Cell Lymphoma (MCL). The Mantle Cell Lymphoma contains 6% of all the nonHodgkin lymphoma and a larger fraction of deaths from lymphoma (Campo et al., 1999; Swerdlow and Williams, 2002). In this type of cancer, length of survival after diagnosis is varied and many attempts have been done for prediction of survival these patients (Rosenwald et al., 2003). Microarray technology and access to thousands of gene expression have been helped to predict survival of patients (Rust et al., 2005; Li et al., 2007). But the main challenge of this method is that the number of variables (p) is usually too much more than number of samples


Introduction
Among different types of cancers, the lymphoma are group of malignant diseases with the origin of lymphocytes (Fisher, 2005;Adamson et al., 2007). The lymphoma is divided into two types, the Hodgkin's lymphoma and the non-Hodgkin's lymphoma. Specific type of the non Hodgkin lymphoma is the Mantle Cell Lymphoma (MCL). The Mantle Cell Lymphoma contains 6% of all the non-Hodgkin lymphoma and a larger fraction of deaths from lymphoma (Campo et al., 1999;Swerdlow and Williams, 2002). In this type of cancer, length of survival after diagnosis is varied and many attempts have been done for prediction of survival these patients (Rosenwald et al., 2003). Microarray technology and access to thousands of gene expression have been helped to predict survival of patients (Rust et al., 2005;Li et al., 2007). But the main challenge of this method is that the number of variables (p) is usually too much more than number of samples Azam Moslemi 1 , Hossein Mahjub 2 *, Massoud Saidijam 3 , Jalal Poorolajal 2 , Ali Reza Soltanian 2 (n) (p>>n). In this case, identifying the strongest gene variables determinant of patients survival is very difficult (Li et al., 2007).
On the other hand, the model-building process for survival analysis based on the gene expression, involves the comparison many competing models. In this method, a single model is choosing each time and a statistical inference is accomplished based on that. But, uncertainty of single model selection causes overall uncertainty of the desired quantity estimation and considerable reduction of fitting model (Kass and Raftery, 1993). One of the solutions that is presented to this problem, is using the iterative Bayesian Model Averaging.
The iterative Bayesian Model Averaging method, estimates and reports the effect of multiple models, by computation the weighted average of their posterior distributions instead of selecting a single model and to perform the statistical inference based on that (Annest et al., 2009). In present study, iterative Bayesian Model Averaging method, using the survival of the Mantle Cell Lymphoma patients is estimated and based on that, the patients were divided into two groups: high-risk and low-risk.  (Rosenwald et al., 2003).

Materials and Methods
In order to, the fitting survival analysis model and its evaluation, overall dataset was considered as "train sample" and of their set 31 sampels randomly was selected as "test sample". The iterative Bayesian Model Averaging method performed on train sample. Obtained estimations were used for the assessment method in the test sample.
The strength of BMA method is to be able to estimate model uncertainty. Calculation model uncertainty is the part of Baysian model analysis that largely has been ignored by traditional stepwise selection methods. BMA solves this problem with selection a subset of all possible models and the statistical inference based on the weighted average of posterior distribution of models (Annest et al., 2009). In this method, Uniform distribution is considering as the prior distribution. The core of the BMA algorithm is described in Equation (1) (Raftery, 1995). Let Ψ and TD indicate desired quantity and train data, respectively. Let S={M 1 , M 2 , …, M n } represent the subset of selected models that enter in the analysis.
We have: Three objects must be considered before Equation (1) could be used: i). Obtaining the subset S of models that enter in the analysis, ii). Estimation the value of pr(ψ|TD, M i ), iii). Estimation the value of pr(M i |TD).
One of BMA's problems is number of models which potentially can be discovered by the algorithm, especially when working with microarray data. Two algorithms "leaps and bounds" and "Occam's window" are being used for solving this problem and discarding non-helpful models (Annest et al., 2009).
The leaps and bounds algorithm regards the best expected researcher number of the models (nbest) and based on that the top nbest models estimates with every number of variables (maximum 25 variables) (Furnival and Wilson, 1974). With using the Occam's window algorithm every model which its posterior distribution of the best model is less than the cut point determined by the researcher could be removed (Madigan and Raftery, 1994). It proposes to remove the models with a probability of less than 5% as likely as the strongest model (Annest et al., 2009). Thus the remaining models constitute the set S in Equation (1).
Estimation second object and accurate calculation of predictive distribution pr (ψ|TD, M i ) requires integration of the vector of regression parameters θ i , however integration is impossible for the most censored survival models, therefore maximum likelihood estimation (MLE) is used as an approximation: For estimation third object and calculation of the posterior probability of model M i given the training data, involves an Integral. But, for this reason that is impossible to assess exact value third object, the Bayesian information criterion (BIC) can be used as approximation estimation: (3), n indicates the number of samples recorded in the dataset, k i is number of the regression parameters in model M i and O (1) is the error term.
In addition to the computation posterior probability of the models that enter in BMA analysis, the posterior probability can be obtained for each of variables (genes) that enter in the analysis.
This information is useful in facilitating biological discussion and it helps to identify suitable predictor variables.
Let the (b i ≠0) shows that the regression parameter for gene X i exists in the vector of regression parametersθ i . In this case, posterior probability which gene X i is a suitable predictor variable, is written as bellow: Posterior probability of gene X i is sum Posterior probabilities of all models in subset S that contain gene X i (Annest et al., 2009).
Use of Iterative BMA method is unsuitable for microarray data. Because microarray dataset contains thousands of genes, while the implementation of leaps and bounds algorithm is slow when there are more than 45 variables (Annest et al., 2009). Performing a preprocessing step is proposed for dimension reduction and ranking genes (Annest et al., 2009). In present study, the univariate Cox proportional hazard model was used in preprocessing step. With parameters estimation of univariate Cox proportional hazard model, genes are sorted on the basis of their log likelihood in descending order. Then ranked genes from 1 to 1000 are selected that were more associated with survival. For selecting model, we fitted all possible models with at most m variables for the ranked genes from 1 to m (m≤25). We selected 50 (nbest=50) models with the best fit among them on the basis of leaps and bounds algorithm. For reducing set of selected models, we used the Occam's window algorithm and we removed the models with a posterior probability of less than the threshold (less than 5% posterior probability the strongest model). Then, we calculated the posterior DOI:http://dx.doi.org/10.7314/APJCP.2016.17.1.95 Bayesian Survival Analysis of High-dimensional Microarray Data for Mantle Cell Lymphoma Patients probability of m genes by using the selected models and we eliminated genes with the low posterior probability. We replaced deleted genes with next genes in the terms of rank. These steps continually repeated until through 1000 genes processed. Finally, the top models are obtained with maximum 25 genes in each model.
To evaluate the performance of iterative BMA method, risk scores of test group separated into two risk groups. The overall risk score for a patient is considered weighted average of the risk scores calculated for each model Mi in the set S. The equation is as bellow: Note that maximum likelihood estimation of the predictive coefficients and the posterior probability of the models is obtained from training dataset and the expression score of each gene into model for a patient (x j v ) achieved of test dataset. Finally, we determined cutpoint to identify high-risk group from low-risk group by using the risk score of patients in the train group. In present study, cutpoint acquired 60%. In other words, low-risk group contains 60% of the low boundary of the risk scores and high-risk group contains 40% of the high boundary of the risk scores. Also, predictive efficiency of two groups of patients has been assessed by Kaplan-Meier survival curve and log-rank test. For data analysis in confidence level 95%, the bioconductor package iterativeBM in R12.2.0 environment has been used. This package is available at http://www.bioconductor.org/.

Results
In this study, we performed iterative Bayesian model averaging method for dataset of the Mantle Cell Lymphoma patients. The 132 Competitive models were selected based on the posterior probabilities and the cutpoint 60%. Number of variables in each model was changing from 9 to 16 and the average number of variables was 12.68 genes per model. The posterior probabilities and univariate log likelihood ranking related to the 25 selected genes have been shown in Table 1. Among the 25 genes, HOXD9 gene with zero posterior probability has not been selected. Result of this table is showing that Among the 25 selected genes by BMA algorithm, 7 genes (28%) have rank above 502. Except 7 mention genes, the others selected genes have poor univariate ranks, so that, the highest ranking obtained from Cox proportional hazard model among them is 834 through 1000. In addition, the average ranking of the three genes with posterior probability 1 was 258.
The test group was consisted 31 patients. Iterative BMA method assigned 13 patients to high-risk category and 18 patients to low-risk category. Among 31 patients, 9 patients had been censored. One patient placed in high-risk group, from the 9 censored patients, While 12 patients of the uncensored group assigned to high-risk group. Table  2 shows the number of patients in each group.
To evaluate results, the Kaplan-Meier survival curves and the log-rank test were used. Figure 1 shows the Kaplan-Meier survival analysis curve. The based on this figure, there is consistently difference between the survival curves of two groups which survival length of people of high-risk group is very lower than people of low-risk group. Also, for the test sample, based on logrank test, value of chi-Square obtained 21.1 (P<0.001) that indicates the strength of iterative BMA method for   distinguishing the two risk groups from each other. For 25 genes simultaneously, the based on Cox proportional hazard model, 16 genes were significant (P<0.001). Table  3 shows estimation of the Cox proportional hazard model parameters and the level of significant for 16 genes.

Discussion
In present study, the iterative BMA method has been applied for using in survival analysis with high dimensional microarray data. Using this method numbers of 8810 genes have been reduced to 25 genes and 132 top survival models have been selected. Selected models were rather simple and including 9 to 16 genes. Also, Posterior probabilities of the selected genes were calculated by the iterative BMA. Values of the posterior probability of chosen genes was indicating overall contribution that gene into the patient risk score across all selected models. Finally, patients were separated into two risk groups: low-risk group and high-risk group, with very high significantly. The results of this study showed that iterative BMA method is able to separate risk groups with very high significantly. The Kaplan-Meier survival curves and the log-rank test implied the high power of iterative BMA method to predictive survival.
Algorithm BMA, in preprocessing step, Cox proportional hazard model fits for each genes and genes orders based on their log likelihood in descending order. Therefore, there are among selected variables comparison possibility of log likelihood values of top genes. The gene U19769 (not inclusion in selected genes) has first rank with log likelihood equal -214.57. The log likelihood for genes PSMA7 with ranking of 20 and HOXD9 with ranking of 1000, are respectively -218.41 and -288.29. So, genes with the poor ranking of log likelihood of univariate Cox often are selected by the iterative BMA algorithm and in the terms of goodness of fit are comparable with the top ranking genes. This is because that log likelihood scope of univariate Cox proportional hazard model is not extensive across the top 1000 genes. Thus, it is not surprising that selected genes by BMA that in the terms of log likelihood have poor ranking, achieve substantial predictive power when included in combinations. Previous studies have reported similar results. Hu et al, showed that gene PSMA7 is expressing in high level in the colorectal cancer sites and lymph node and liver metastatic sites while gene expression is not high in the normal colorectal tissue (Hu et al., 2008). Wan et al. (2004), Liu et al. (2007), Midorikawa et al. (2002), Li et al. (2004) respectively report that expression of genes M33374, ALDOB, KIAA0033 and V01555 are associated with Hepatocellular Carcinoma. Also, Muller et al, showed that GLIPR1 associates with prostate cancer (Muller et al., 2010). Chang et al. (2009) in their study, reported role of STAT4-deficient in the impaired development of human Th1 cells for posttransplantation patients (Chang et al., 2009). Park et al. (2008) detected existence the aberrant methylation for gene MAP2K3 in at least one lung adenocarcinoma cell lines (Park et al., 2008). Ma et al. (2009) report connection IGJ gene with breast cancer (Ma et al., 2009).
Although, the achieved results in this study were hopeful, the iterative BMA method has some limitations and can be extended further. Such as, determining the optimum number of risk groups and proposing statistical methods, for the assessment of different calculation methods. It is suggested that validation the chosen genes obtained of the iterativeBMA method collaborate with genetic and clinical studies.
Iterative BMA algorithm has the high accurate and strength for survival analysis. This method is able to identify a few numbers of predictive variables among many variables (genes) in a microarray dataset. So, it can be used as a low-cost diagnostic tool in clinical researches. This multivariate technique computes the model uncertainty through the averaging over the posterior probabilities of the strongest competitive models. Multivariate feature iterative BMA with the ability to calculate the model uncertainty makes this method as an interesting pattern for extraction predictive genes of high-dimensional biological data.