OrCanome: a Comprehensive Resource for Oral Cancer

Introduction
The incidence of oral and oropharyngeal cancers is increasing globally at an alarming rate. The 2012 GLOBOCAN report recorded 198,975 incident cases of cancer of the lip and oral cavity worldwide, with a low survival rate of around 50% (Ferlay et al., 2015). It is the eleventh most common cancer worldwide and is more prevalent in men than in women. Oral cancer, a subset of head and neck cancer, is the most prevalent cancer among males in India (Coelho, 2012). The higher prevalence of oral cancer in the Indian subcontinent can be attributed to greater exposure to risk factors such as tobacco, especially in its smokeless form (Mehrotra and Yadav, 2006). Advanced screening and early detection approaches are being followed across the world, including in India, to detect pre-cancerous lesions (Mehrotra and Gupta, 2011; Rajaraman et al., 2015).
Since oral cancer poses a high disease burden globally, studies that unravel the underlying mechanisms have started to emerge. Over the last few years, studies have explored genic associations to understand the molecular mechanisms. More recently, high-throughput genome-scale studies have uncovered the mutational landscape of the cancer (India Project Team of the International Cancer Genome, 2013). In addition, gene expression analyses using microarrays have been reported (Reis et al., 2011; Saeed et al., 2015). Such high-throughput studies increase the volume of data, thus systematic collection
Efforts to accumulate the data associated with oral cancer have been insufficient. Gadewal and Zingde (2011) reported the Oral Cancer Gene Database with a total of 374 genes. Mitra et al. (2012) reported the Head and Neck and Oral Cancer Database (HNOCDB) with a total of 415 genes. These datasets were compiled from evidence in the available literature. Although these databases cover a number of genes involved in oral cancer, a large proportion of the genes studied in oral cancer are missing from them. Moreover, these databases focus mostly on the genomic aspects of these genes, and no dataset focuses on their transcriptomic and proteomic aspects. Resources that integrate genomic, transcriptomic and proteomic information on the genes involved in oral cancer are scarce.
Targeted gene approaches have identified fewer genes than genome-scale studies, and the available databases have therefore missed a considerable proportion of the genes. With the growing number of high-throughput studies, there has been an explosion in datasets (Reuter et al., 2015). Identifying drug targets and chemical inhibitors for these genes is highly laborious because the datasets are scattered across different repositories (Thariat et al., 2015). With a growing corpus of genome, proteome and chemical information on oral cancer, there is an urgent need to develop a consolidated database. It has become imperative to use an integrated genomic-transcriptomic-proteomic approach to assemble the datasets in a systematic and unified way in order to achieve better annotation of these genes. A unified repository will enhance the annotation and provide possible biomarkers and therapeutic targets, which will aid both the research and clinical communities (Nakagawa et al., 2015).
OrCanome has been designed to help researchers and clinicians study the genes involved in oral cancer. At present, it houses 922 genes found to be dysregulated in oral cancer as a starting resource for biomarker and therapeutic drug target discovery, and provides their annotation spanning genome, expression, proteome, pathway and immunological data as well as active compounds pertinent to oral cancer.

Genomic data
Retrieval of genes dysregulated in oral cancer: A systematic search was performed at Gene Expression Omnibus to fetch the expression profiling studies on oral cancer in humans (Barrett et al., 2013). The genes dysregulated in oral cancer were obtained from these studies. Since the gene names were reported in different formats across the studies, Ensembl Gene IDs were fetched for all genes using BioMart from the Ensembl Genome Browser (Flicek et al., 2014). The unique gene IDs were used for downstream analysis. In addition to these genes, literature mining was performed to include other genes.
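The identifier harmonization step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a symbol-to-Ensembl lookup has already been exported (for example from BioMart), and the table `SYMBOL_TO_ENSEMBL`, the function `to_unique_ensembl_ids` and the input names are hypothetical.

```python
# Hypothetical lookup (assumed to come from a BioMart export):
# gene symbol or alias -> Ensembl gene ID. Entries are illustrative only.
SYMBOL_TO_ENSEMBL = {
    "TP53": "ENSG00000141510",
    "P53": "ENSG00000141510",   # alias pointing at the same gene
    "EGFR": "ENSG00000146648",
}


def to_unique_ensembl_ids(reported_names, lookup=SYMBOL_TO_ENSEMBL):
    """Map reported gene names to Ensembl IDs.

    Returns (unique_ids, unmapped): IDs in first-seen order without
    duplicates, and the names that could not be mapped.
    """
    unique_ids = []
    unmapped = []
    seen = set()
    for name in reported_names:
        key = name.strip().upper()      # tolerate case and stray whitespace
        gene_id = lookup.get(key)
        if gene_id is None:
            unmapped.append(name)
        elif gene_id not in seen:
            seen.add(gene_id)
            unique_ids.append(gene_id)
    return unique_ids, unmapped


ids, missing = to_unique_ensembl_ids(["TP53", "p53 ", "EGFR", "XYZ1"])
```

Names that fail to map are returned separately rather than dropped silently, so that they can be resolved by hand.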

Genic information
The genic annotations were fetched from Gencode release 21 (Harrow et al., 2012). This included information regarding the genomic location, strand and gene biotypes as recommended by Gencode.

Transcriptomic Data
Transcripts: Information on the alternative transcripts of these genes was obtained from Gencode along with their corresponding genomic locations and transcript biotypes.
Expression datasets: The baseline and differential expression of these genes was obtained from Expression Atlas (EMBL-EBI) (Petryszak et al., 2014). Expression Atlas provides expression data from microarray and RNA sequencing studies. The baseline expression dataset contains the expression levels of each gene in various human tissues from six experiments. The differential expression dataset provides comparison statistics for the expression levels of the genes under different conditions in various studies.

Pathways and Gene Ontology
The pathways and the gene ontologies corresponding to the genes were fetched using the functional annotation tool of The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 (Huang da et al., 2009a; Huang da et al., 2009b).

Proteomic Data
Protein Information: The protein IDs for the genes were fetched using Gencode. The corresponding protein descriptions, sequences and PubMed entries were obtained from UniProt release 2015_08 (Consortium, 2015). Information regarding the 3-D structures and targets was extracted from the Protein Data Bank (PDB) and ChEMBL (Berman et al., 2002; Bento et al., 2014).
Inhibitors: Chemical compounds showing inhibition against the proteins are listed in OrCanome. Inhibitors have been incorporated only if they have been assayed for IC50 values against the target proteins. The one-dimensional structure in the form of SMILES (http://www.daylight.com/smiles/) was generated using the Open Babel software, and the bioassay data were obtained from chemical databases such as ChEMBL and BindingDB (Liu et al., 2007; O'Boyle et al., 2011). The HTML toolkit from ChemDoodle was integrated for the visualization of chemical compounds (Burger, 2015).
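The inclusion rule for inhibitors (only compounds with an assayed IC50 value) can be sketched as below. The record layout, the compound IDs and the function name `inhibitors_with_ic50` are assumptions made for illustration; they do not reproduce the actual ChEMBL or BindingDB export formats.

```python
from collections import defaultdict

# Hypothetical bioassay records; IDs and values are invented placeholders.
records = [
    {"compound_id": "CMPD-1", "target": "P00533", "ic50_nm": 33.0},
    {"compound_id": "CMPD-2", "target": "P00533", "ic50_nm": None},  # no IC50 assayed
    {"compound_id": "CMPD-3", "target": "P04637", "ic50_nm": 120.0},
    {"compound_id": "CMPD-4", "target": "P00533", "ic50_nm": 5.0},
]


def inhibitors_with_ic50(records):
    """Group compounds by target, keeping only those with an IC50 value.

    Within each target, compounds are sorted from lowest to highest IC50
    (lower IC50 means less compound is needed for half-maximal inhibition).
    """
    by_target = defaultdict(list)
    for rec in records:
        if rec.get("ic50_nm") is not None:
            by_target[rec["target"]].append((rec["compound_id"], rec["ic50_nm"]))
    for compounds in by_target.values():
        compounds.sort(key=lambda pair: pair[1])
    return dict(by_target)


grouped = inhibitors_with_ic50(records)
```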
Epitope: An important feature of this resource is the inclusion of linear B-cell epitopes predicted using the EPMLR web server (Lian et al., 2014). The epitopes were screened on the basis of their scores, and only those with values greater than 0.9 were retained.
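The score-based screening of predicted epitopes can be sketched as follows. Only the 0.9 cutoff comes from the text; the `Epitope` record, the function `retain_high_scoring` and the example predictions are hypothetical and do not reflect the EPMLR output format.

```python
from typing import NamedTuple


class Epitope(NamedTuple):
    protein_id: str
    start: int        # 1-based position on the protein sequence
    sequence: str
    score: float


SCORE_CUTOFF = 0.9    # threshold stated in the text


def retain_high_scoring(epitopes, cutoff=SCORE_CUTOFF):
    """Keep epitopes scoring strictly above the cutoff, best first."""
    kept = [e for e in epitopes if e.score > cutoff]
    return sorted(kept, key=lambda e: e.score, reverse=True)


# Invented predictions, used only to exercise the filter.
predictions = [
    Epitope("PROT_A", 12, "ACDEFGHIKLMN", 0.95),
    Epitope("PROT_A", 40, "PQRSTVWYACDE", 0.90),   # equal to cutoff: dropped
    Epitope("PROT_B", 7, "KLMNPQRSTVWY", 0.99),
]
ranked = retain_high_scoring(predictions)
```

Sorting by descending score also yields the per-epitope rank directly from the list position.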
Transmembrane Helices: The information about the experimentally validated and predicted transmembrane helices of the proteins was fetched from UniProt release 2015_08. The transmembrane helices were also predicted using TMHMM server 2.0 (Krogh et al., 2001).
Literature annotations: A comprehensive literature survey was performed for these genes and proteins, and the relevant studies have been provided.

Database architecture
OrCanome is designed to meet the needs of researchers and clinicians by providing a comprehensive, user-friendly and searchable resource. The database has been developed in MySQL, and the interface has been designed using core PHP and HTML. Every gene found to be associated with oral cancer is given a separate page with its biologically relevant information on genomic, transcriptomic and proteomic aspects (Figure 1). OrCanome currently houses 922 genes found to be associated with oral cancer. These genes fall into different Gencode biotypes such as protein-coding genes, antisense, microRNAs, long non-coding RNAs, small nucleolar RNAs and miscellaneous RNAs (misc RNAs).
The biological information at OrCanome is organized into different categories on every gene page. The statistics of the datasets in the different categories are provided in Table 1. The first category covers the basic genomic information, which includes the genomic location, strand, Gencode gene ID and gene biotype. The second category comprises the transcript information, including the different transcript isoforms along with their genomic coordinates and biotypes. This is followed by two categories containing the available datasets on the baseline and differential expression of these genes. Baseline expression data are present for 901 genes, while differential expression data are present for 474 genes.
DOI: http://dx.doi.org/10.7314/APJCP.2016.17.3.1333
The fifth category includes the gene ontologies and pathways for the corresponding genes and proteins. This covers the three gene ontologies: biological process, molecular function and cellular component. The respective InterPro entries, KEGG pathways and OMIM diseases have also been included (Kanehisa and Goto, 2000; Hamosh et al., 2005; Hunter et al., 2009).
The sixth category comprises the annotation of the protein products, which includes information on 890 proteins along with their names and sequences, followed by external links to their corresponding UniProt entries, PDB structures and detailed information from ChEMBL. Where a protein is secretory, this information is also included (183 proteins). External links to the respective databases have also been provided. The seventh category includes the details of the transmembrane helices. This covers the 682 experimentally validated and 672 predicted transmembrane helices in these proteins. The individual helices in each protein are listed with their locations on the protein sequence.
The eighth category provides details of the chemical compounds that inhibit the respective proteins. The inhibitor compound ID and compound name have been extracted from ChEMBL. SMILES strings, a line notation that encodes molecular structures, have also been generated and provided. The category also includes the half maximal inhibitory concentration (IC50), a quantitative measure of the amount of inhibitor required to reduce the activity of the protein or function by half. The corresponding literature report has also been provided. In total, 5534 compounds with IC50 values against 26 target proteins have been incorporated in the database. The chemical structure of each compound can also be visualized to identify hydrogen bond donors and acceptors to aid in drug design.
The ninth category consists of the predicted linear B-cell epitopes of the corresponding proteins. OrCanome houses epitopes for 207 proteins and provides the rank of each predicted epitope along with its score, sequence and location on the protein. The tenth category includes the literature annotation of the association with oral cancer.
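A gene-centric layout of this kind can be sketched as a small relational schema. The sketch below is an assumption-laden illustration, not the OrCanome schema: it uses SQLite from the Python standard library as a stand-in for MySQL, covers only three of the ten categories, and all table names, column names and the helper functions `build_demo_db` and `gene_page` are hypothetical.

```python
import sqlite3

# Assumed, simplified layout: one row per gene, child tables keyed by gene_id.
SCHEMA = """
CREATE TABLE gene (
    gene_id    TEXT PRIMARY KEY,
    name       TEXT NOT NULL,
    chromosome TEXT NOT NULL,
    strand     TEXT NOT NULL,
    biotype    TEXT NOT NULL
);
CREATE TABLE transcript (
    transcript_id TEXT PRIMARY KEY,
    gene_id       TEXT NOT NULL REFERENCES gene(gene_id),
    biotype       TEXT NOT NULL
);
CREATE TABLE epitope (
    gene_id      TEXT NOT NULL REFERENCES gene(gene_id),
    epitope_rank INTEGER NOT NULL,
    sequence     TEXT NOT NULL,
    score        REAL NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding one illustrative gene."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        ("ENSG00000141510", "TP53", "17", "-", "protein_coding"),
    )
    conn.execute(
        "INSERT INTO transcript VALUES (?, ?, ?)",
        ("ENST00000269305", "ENSG00000141510", "protein_coding"),
    )
    conn.commit()
    return conn


def gene_page(conn, gene_id):
    """Collect the per-gene categories that a gene page would display."""
    gene = conn.execute(
        "SELECT name, chromosome, strand, biotype FROM gene WHERE gene_id = ?",
        (gene_id,),
    ).fetchone()
    transcripts = [
        row[0]
        for row in conn.execute(
            "SELECT transcript_id FROM transcript WHERE gene_id = ? "
            "ORDER BY transcript_id",
            (gene_id,),
        )
    ]
    return {"gene": gene, "transcripts": transcripts}


conn = build_demo_db()
page = gene_page(conn, "ENSG00000141510")
```

Keying every category table on the gene ID is one simple way to assemble a single page per gene; the remaining categories would follow the same pattern.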

Database Access
OrCanome provides user-friendly, comprehensive search and browse options. The database is searchable by text or keyword, such as gene name, ID, strand, type, transcript ID, inhibitors, epitopes and pathways. The datasets can also be browsed by chromosome number, gene ID, gene name and gene type. Advanced search options have also been provided, which enable the user to compare different datasets. This facilitates the identification of available inhibitors or epitopes for a gene or protein of interest and provides expression statistics for the gene. This will aid in identifying potential biomarkers, vaccine candidates or therapeutic targets for oral cancer.
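The keyword search and chromosome-based browsing described above can be sketched with parameterized queries. As before, SQLite stands in for MySQL, and the table layout, the three example rows and the functions `keyword_search` and `browse_by_chromosome` are assumptions for illustration rather than the actual OrCanome implementation.

```python
import sqlite3


def make_demo_conn():
    """In-memory gene table with three illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (gene_id TEXT, name TEXT, chromosome TEXT, biotype TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?)",
        [
            ("ENSG00000141510", "TP53", "17", "protein_coding"),
            ("ENSG00000146648", "EGFR", "7", "protein_coding"),
            ("ENSG00000130600", "H19", "11", "lincRNA"),
        ],
    )
    return conn


def keyword_search(conn, keyword):
    """Substring match on gene ID, name or biotype.

    The keyword is bound as a parameter, never interpolated into the SQL.
    LIKE wildcards inside the keyword itself are not escaped in this sketch.
    """
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT gene_id, name FROM gene "
        "WHERE gene_id LIKE ? OR name LIKE ? OR biotype LIKE ? "
        "ORDER BY name",
        (pattern, pattern, pattern),
    )
    return rows.fetchall()


def browse_by_chromosome(conn, chromosome):
    """List gene names on one chromosome, alphabetically."""
    rows = conn.execute(
        "SELECT name FROM gene WHERE chromosome = ? ORDER BY name",
        (chromosome,),
    )
    return [row[0] for row in rows]


conn = make_demo_conn()
hits = keyword_search(conn, "egf")
coding = keyword_search(conn, "protein")
on_chr11 = browse_by_chromosome(conn, "11")
```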

Discussion
The emergence of high-throughput genome-scale studies has produced enormous amounts of data, which need systematic collection and annotation for the identification of biomarkers and therapeutic targets in diseases including cancer. OrCanome aims to serve this purpose by providing genes dysregulated in oral cancer along with their annotations. It is one of the most comprehensive resources for oral cancer genes and provides a one-stop solution for genomic, transcriptomic and proteomic information on these genes. OrCanome also provides links to external databases to enhance the interoperability of datasets. The data are organized in ten different categories, and the advanced search option enables the user to explore the resource using specific queries.
OrCanome provides a starting point for in-depth analysis of the genes associated with oral cancer. As newer studies on oral cancer emerge and provide better datasets, we aim to include them in our resource on a regular basis. We intend to serve the purposes of both the research and clinical communities and wish to collaborate with them for data generation and sharing.

Figure 1. Conceptual framework depicting integrated genomic-transcriptomic-proteomic datasets in OrCanome.