Analysis of Molecular Pathways in Pancreatic Ductal Adenocarcinomas with a Bioinformatics Approach

Pancreatic ductal adenocarcinoma (PDAC) is a leading cause of cancer death worldwide. Our study aimed to reveal molecular mechanisms. Microarray data of GSE15471 (including 39 matching pairs of pancreatic tumor tissues and patient-matched normal tissues) was downloaded from Gene Expression Omnibus (GEO) database. We identified differentially expressed genes (DEGs) in PDAC tissues compared with normal tissues by limma package in R language. Then GO and KEGG pathway enrichment analyses were conducted with online DAVID. In addition, principal component analysis was performed and a protein-protein interaction network was constructed to study relationships between the DEGs through database STRING. A total of 532 DEGs were identified in the 38 PDAC tissues compared with 33 normal tissues. The results of principal component analysis of the top 20 DEGs could differentiate the PDAC tissues from normal tissues directly. In the PPI network, 8 of the 20 DEGs were all key genes of the collagen family. Additionally, FN1 (fibronectin 1) was also a hub node in the network. The genes of the collagen family as well as FN1 were significantly enriched in complement and coagulation cascades, ECM-receptor interaction and focal adhesion pathways. Our results suggest that genes of collagen family and FN1 may play an important role in PDAC progression. Meanwhile, these DEGs and enriched pathways, such as complement and coagulation cascades, ECM-receptor interaction and focal adhesion may be important molecular mechanisms involved in the development and progression of PDAC.


Introduction
Pancreatic ductal adenocarcinoma (PDAC) is the most common pancreatic neoplasm that arising within the ductal of the pancreas, comprising over 85% of all pancreatic tumor cases (Li et al., 2004;Hezel et al., 2006). Given the few early indicators of illness, it is diagnosed late in the have been identified in PDAC, among which the most frequently affected are DPC4 (SMAD family member 4), K-ras (Kirsten rat sarcoma viral oncogene homolog) and p53 (Dempe et al., 2010;Morton et al., 2010;Navas et al., 2012), which all appear to play an important role for the development of PDAC. Additionally, it has been reported that germline mutations have been linked to familial PDAC, including the DNA mismatch repair gene MLH1 and the cationic trypsinogen gene PRSS1 (protease, serine, 1 (trypsin 1)), and as well those targeting the tumor suppressor genes INK4A (cyclin-dependent kinase 4), BRCA2 (breast cancer 2, early onset), and LKB1 (serine/ threonine kinase 11) (Whitcomb et al., 1996;Jaffee et al., 2002). Besides, accumulating evidence demonstrates that the aggressive nature of PDAC can be enhanced via interactions between the epithelial and stromal compartments, and extracellular matrix (ECM) proteins may play a key role in PDAC progression (Grippo et al., 2012). These evidences all suggest that the occurrence and development of PDAC are a complex process. Even though progresses have been made, the molecular mechanisms underlying its development are still unclear.
Microarray analysis is increasingly being used to identify genes or gene signatures that are strongly associated with pancreatic cancer (Campagna et al., 2008;Bournet et al., 2012). In previous study, Shi et al. screened differentially expressed genes (DEGs) and constructed a co-expression network to identify dysregulated pathways mainly involved in specific cellular functions, such as immune response, homeostasis and cell adhesion, in which microarray data GSE15471 was used (Shi et al., 2014). In addition, microarray data GSE15471 was also utilized to screen the Master Regulators (MRs) of transcription involved in PDAC disease and highlighted the potential value of Tubby-like protein 3 (TULP3) as a clinical prognostic biomarker (Sartor et al., 2014). In contrast to these findings, we downloaded microarray data GSE15471 in the present study to identify DEGs in PDAC tissues compared with patient-matched normal pancreas tissues. Then dimensionality of the top 20 DEGs was analyzed by principal component analysis. Besides, functional enrichment analysis was performed and protein-protein interaction (PPI) network was constructed. Our study aimed to improve the understanding to the underlying molecular mechanisms of PDAC and provide novel insights for the early diagnosis and medication control of PDAC.

Affymetrix microarray data
The transcriptional profile of GSE15471 deposited by Badea et al. (Badea et al., 2008) was downloaded from Gene Expression Omnibus database (GEO, http:// www.ncbi.nlm.nih.gov/geo/) in the National Center for Biotechnology Information (NCBI). Total 78 samples, including 39 PDAC tissues and patient-matched normal pancreas tissues from 39 PDAC patients, were available for microarray analysis based on the GPL6244 Platform (Affymetrix Human Gene 1.0 ST Array) (Affymetrix Inc., Santa Clara, California, USA).

Data preprocessing
The GSE14905 raw data in CEL files were converted into probe expression matrix by the ReadAffy of affy package in R language (Gautier et al., 2004). The expression values of all probes were normalized by the robust multiarray average algorithm (RMA) (Irizarry et al., 2003). Then the probe number was converted as gene symbol using the R/Bioconductor annotation package or based on the annotation information of Affymetrix Human Gene 1.0 ST Array. If multiple probes corresponded to the same gene, the mean expression value was calculated as the gene value.

Identification of DEGs and hierarchical clustering analysis
For the GSE15471 dataset, the paired t-test of samr package in R language (SAM method) (Tusher et al., 2001) was used to identify DEGs in PDAC tissues compared to patient-matched normal pancreas tissues. When the Del value was set at 4.0, only DEGs with the false discovery rate (FDR) < 0.01 and fold change (FC) value >2 or < 1/2 were selected. In addition, we performed hierarchical clustering analysis to better visualize the gene expression values with drawing dendrogram. In the hierarchical clustering analysis, the gene expression value of each sample was clustered by Pearson coefficient (Derrick et al., 1994) and Spearman coefficient of statistics respectively, and the unreasonable clustered samples were filtered. In order to reduce systematic errors of the sample grouping that could induce interference on the subsequent analysis, we filtered the gene expression values of unreasonable samples. Then DEGs in PDAC tissues compared to patient-matched normal pancreas tissues were screened again using the limma package (Smyth, 2004). The adjusted p value < 0.01 and |log2 FC| > 1.5 were used as the cut-off criteria.

Functional enrichment analysis of DEGs after sample filtration
The Database for Annotation, Visualization and Integrated Discovery (DAVID, http://david.abcc.ncifcrf. gov/) is a functional annotation tool for investigators to understand biological meaning behind large list of genes (Huang et al., 2008). Gene Ontology (GO) (Ashburner et al., 2000) is a widely used for functional studies of large-scale genomic data, which mainly contains biological process (BP), molecular function (MF), and cellular component (CC). Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisaand Goto, 2000) is a pathway-related database for providing the information how molecules or genes are networked. In this study, we performed GO-BP and KEGG pathway enrichment analysis for the screened up-and down-regulated DEGs by online DAVID tool. The overrepresent GO-BP terms were screened only with the adjusted p value less than 1e-5, and the significant enriched KEGG pathways were identified with FDR less than 0.05.

Principal component analysis of top 20 DEGs
To distinguish PDAC tissues from normal tissues, 20 DEGs with the minimum significant p value were identified to conduct principal component analysis. Principal component analysis (Raychaudhuri et al., 2000) is a statistical technique for determining the main variables in a multidimensional data set that stand for the differences in the observations, which can simplify the analysis and visualize the direction of multidimensional data sets. Therefore, we can just use several key variables rather than thousands of variables to classify the samples by means of principal component analysis.

PPI network construction of the top 20 DEGs
The protein-protein interactions (PPIs) was used to discover the interactions between proteins and to reveal the functions of proteins at the molecular level (Giot et al., 2003). Search Tool for the Retrieval of Interacting Genes (STRING) (Jensen et al., 2009) database collected the information of predicted and experimental PPIs in a given cell. Thus the top 20 DEGs were mapped STRING database to construct PPI network of them. The combined score of protein pairs >0.4 was considered as the cutoff value.

Data preprocessing
We obtained 19944 gene expression values of the paired PDAC tissues and patient-matched normal pancreas tissues from 39 patients after data preprocessing. As shown in Figure 1, all the median of gene expression values of the 78 samples were in a straight line after being normalized, indicating that the data had been normalized and could be used for further study.

Identification of DEGs and hierarchical clustering analysis of the paired PDAC tissues and normal tissues
A total of 727 DEGs, including 643 up-regulated genes and 84 down-regulated ones, were identified in the PDAC tissues compared to patient-matched normal tissues. Total 6 normal tissue samples (N40875, N40892, N40726, N40975, N51294, and N30308) and 1 PDAC tissue sample T30308 were found gathered in PDAC tissue sample clusters and normal tissue sample clusters respectively (Figure 2), which were unreasonable. Notably, part of the patient-matched normal pancreas tissues can be grouped into two clusters through hierarchical clustering, the distance from one of the clusters to the PDAC tissues closer than between the normal tissues. After sample filtration, total 532 DEGs in 38 PDAC tissue samples compared to 33 normal controls were identified again by using the limma package, including 450 up-regulated genes and 82 down-regulated ones. The number of up-regulated genes was significantly more than downregulated ones. The top20 DEGs was shown in Table 1.

Functional enrichment analysis of DEGs after sample filtration
A total of 19 overrepresent GO-BP terms (Table 2A) related to immune response and trauma dealing, and only 3 signaling pathways (Table 2B), including Complement and coagulation cascades, ECM-receptor interaction and Focal adhesion, were significantly enriched by the up-regulated genes using DAVID tool. However, no significant GO-BP terms and pathways were enriched by down-regulated genes.

Principal component analysis of the top 20 DEGs
The results of principal component analysis of the top 20 DEGs could differentiate the PDAC tissues from normal tissues directly (Figure 3). The first constituted principal component explained 94.72% of the variance of the 20 variables, the second principal component explained 1.84% of the variance, and the cumulative variance that explained is 96.56%. Moreover, variable pointing of the 20 DEGs had more great effect on PDAC tissues. Notably, the expression levels of these 20 genes in PDAC tissues were significantly higher than the normal tissues.

PPI network construction of the top 20 DEGs
Based on the information of STRING database, PPI network of the most significant 20 genes in PDAC tissues were constructed. As shown in the Figure 4, no interaction of SULF1 (sulfatase 1), PLXDC2 (plexin domain containing 2) and DACT1 (dishevelled-binding antagonist of beta-catenin 1) with other proteins was found presently, while the rest 17 constituted 72 PPI pairs. Strikingly, among the rest 17 DEGs, 8 DEGs, such as COL3A1 (collagen, type III, alpha 1), COL1A2 (collagen, type I, alpha 2), COL1A1 (collagen, type I, alpha 1), COL5A2 (collagen, type V, alpha 2), and COL5A1 (collagen, type   The horizontal axis represents the sample name (N indicates the patient-matched normal tissues; T indicates the pancreatic ductal adenocarcinoma tissues). The right ordinate axis represents the clustering condition of gene; the upper horizontal axis represents the clustering situation of the sample. Red indicates the down regulation of the gene, green indicates the up regulation. Pancreatic ductal adenocarcinoma tissues and normal tissues are mainly grouped into 3 clusters. Total 6 normal tissue samples (N40875, N40892, N40726, N40975, N51294, and N30308) and 1 PDAC tissue sample (T30308) were found gathered in PDAC tissue sample clusters and normal tissue sample clusters respectively, which are unreasonable V, alpha 1), were found to be members of collagen family. Additionally, FN1 (fibronectin 1) was also hub nodes in the network. Besides, these proteins of collagen family as well as FN1 are all key molecules involved in the above 3 enriched pathways.

Discussion
PDAC is the most lethal type of cancer and therapeutic interventions currently available for it are ineffective (Ishiguro et al., 2014). The identification of highly expressed genes in PDAC is critical to the development of novel strategies to detect and treat this highly lethal cancer (Iacobuzio-Donahue et al., 2003). In this study, we investigated the underlying molecular mechanism of PDAC development by using bioinformatics methods. The results of principal component analysis showed that the top 20 DEGs could differentiate the PDAC tissues from normal tissues directly. Strikingly, 8 of the top 20 highly expressed DEGs were key proteins of collagen family. Additionally, FN1 was also hub nodes in the network and could interact with these proteins of collagen family directly, suggesting these DEGs may be responsible for PDAC development.
The members of collagen family are the most abundant proteins in extracellular matrix and is a major component of epithelial basement membranes, which contribute to the structural integrity of epithelial organs via providing tensile strength to the interstitial matrix (Gelse et al., 2003). Previous study demonstrates that loss of basement membrane integrity is one key and prognostic event in spreading pancreatic cancer (Apte et al., 2004). In addition, collagens exert important functions in the storage and release of cellular mediators, including growth factors (GelsePöschl et al., 2003). Very recent studies confirm that the multifunctional growth factor midkine promotes cellular functions leading to increased proliferation, migration, and survival in PDAC (Rawnaq et al., 2014). The chronic stimulation of epidermal growth factor is found to cause Smad4-dependent signaling bypass in PDAC (Moz et al., 2014). Besides, previous study has revealed that the genes of collagen family are differentially expressed in PDAC tissue and significantly involved in pathways of focal adhesion (Lin et al., 2014b), which are in line with our results. Therefore, our results suggest that genes of collagen family may responsible for progression of PDAC.
In addition to collagen family proteins, FN1 was also hub nodes in the network and could interact with these proteins of collagen family proteins directly. Fibronetin (FN) is a major component of the extracellular matrix (ECM) and functions in cell adhesion, cell spreading and cell migration (Stenzel et al., 2011). Moreover, the inhibition of FN-dependent PDAC cell proliferation may be a crucial mechanism of significant antitumor effects of EMAP II against PDAC . Studies have discovered that FN stimulates invasion and adhesion and markedly inhibits cell death of pancreatic cancer cells (Vaquero et al., 2003;Edderkaoui et al., 2007). Besides, FN receptors, β1 and β3 integrins, have been shown to be up-regulated in PDAC cells (Linder, 2001). Therefore, our findings suggest that FN1 may be a key molecule involved in the development and progression of PDAC and could be used as a specific therapeutic molecular target in the treatment of this disease.
Strikingly, our results showed that these proteins of collagen family as well as FN1 were significantly enriched in complement and coagulation cascades, ECM-receptor interaction and focal adhesion. The work of Zhao et al. confirmed that the above 3 enriched pathways were significantly enriched and were regarded to be key mechanisms of pancreatic cancer (Zhao et al., 2014). ECM receptor interaction combined with complement and coagulation cascades are identified in multiple cancers suggesting an essential role of these pathways in cancer biology (Krupp et al., 2011). Focal adhesion is likely linked to pancreatic cancer, and the extracellular matrix regulates pancreatic cancer stem cell function via focal adhesion kinase signaling (Furuyama et al., 2006;Rasheed et al., 2012). In addition, genes of collagen family in PDAC tissue are involved in pathways of ECMreceptor interaction and focal adhesion (Lin et al., 2014a). Collagens, such as COL1A1, COL1A2 and COL3A1, and FN1 are found to be involved in ECM-receptor interaction (Cassar-Malek et al., 2007). Thus, we speculate that the above 3 enriched pathways are more likely to be the crucial mechanisms involved in PDAC. Further, the results suggest that these proteins of collagen family as well as FN1 may play important roles in PDAC progression via involving in these pathways.
In conclusion, we screened the DEGs and investigated the underlying molecular mechanism of PDAC by using bioinformatics methods. The genes of collagen family and FN1 may play an important role in PDAC progression. Meanwhile, complement and coagulation cascades, ECM-receptor interaction and focal adhesion are significantly enriched pathways involved in PDAC. These DEGs and enriched pathways may be important molecular mechanisms for promoting the development and progression of PDAC. Results from this study will provide the groundwork for the understanding of PDAC. However, further experiments are still needed to be explored and researched. lines represent the protein-protein interaction relationships that corresponding to the genes