Oncogenomic screening in malignant neoplasias has led to the description of oncogenetic mechanisms and, recently, to the first successful targeted drug development approaches (1). Individual genomic abnormalities are used as diagnostic markers or for the individual prediction of clinical aggressiveness (2). However, most malignancies show nonrandom aberration patterns that may reflect the cooperation of multiple onco- and tumor suppressor genes, according to the multistep model of oncogenesis (3). The complexity of those changes warrants the application of advanced data mining methods for the development of oncogenomic models.
A number of cytogenetic and molecular genetic techniques describe chromosomal imbalances or changes in the regional DNA content of tumor cells. Historically, the microscopic inspection of stained metaphase spreads (4) has been the most widely applied technique, and it remains the reference method in many clinical applications. Comparative genomic hybridization (CGH) (5) permits the detection of genomic imbalances from tumor samples with more than 50% tumor cell content as well as from archival material (6). Recently, array or matrix CGH (7,8) has started to overcome the limited spatial resolution (9) of metaphase CGH.
An intriguing concept for oncogenomic data mining is the combination of the accumulated cytogenetic data with the molecular cytogenetic data from metaphase and array-based CGH experiments. However, complex annotation formats are used for the description of experimental results. The standards for cytogenetic banding and reverse in situ hybridization (ISH) (e.g., CGH) have been defined in the International System for Cytogenetic Nomenclature (ISCN) (10). The results of genomic microarray experiments are usually stored according to the minimum information about a microarray experiment (MIAME) guidelines (11).
The largest publicly accessible resource for molecular cytogenetic screening data in oncology is the Mitelman Database of Chromosome Aberrations in Cancer (cgap.nci.nih.gov/Chromosomes/Mitelman), which describes more than 46,000 samples analyzed by metaphase banding. Utilization of these data has been limited by the lack of a format amenable to data mining procedures, though valuable studies have been published by the database maintainers (12). Another resource is the National Center for Biotechnology Information (NCBI) spectral karyotyping (SKY)/CGH database (www.ncbi.nlm.nih.gov/sky/skyweb.cgi) (13). It provides well-structured clinical and experimental information for the included cases, but because the NCBI site relies on voluntary data submission, it is limited in size, currently comprising 1006 experiments. Recently (13), the Mitelman database and the SKY/CGH project have been integrated into NCBI's Entrez Cancer Chromosomes site (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cancerchromosomes) and now offer band-specific search capabilities. By far the largest collection of case-specific CGH data is presented through the Progenetix web site (www.progenetix.net) (14), on which this article is focused.
The Progenetix project was initiated in December 2000. The main inclusion criterion was the complete description of the genomic status of each tumor specimen in a peer-reviewed article. Data sampling methods included copying of ISCN annotations from publication files or online supplements and transcription of data from printed matter. For some array CGH data, pseudo-reverse ISH annotations were generated (e.g., based on the Bioconductor DNAcopy package; www.bioconductor.org). For 72 articles, experimental results were provided by the authors of the original publications.
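The generation of pseudo-reverse ISH annotations from segmented array CGH data can be illustrated with a minimal Python sketch. It assumes that segments have already been produced by a segmentation tool and mapped to cytogenetic bands. The function name `segments_to_iscn`, the tuple layout, and the thresholds of +0.2 and -0.2 for the mean log2 ratio are hypothetical choices for illustration and are not taken from the Progenetix software or from DNAcopy.

```python
def segments_to_iscn(segments, gain=0.2, loss=-0.2):
    """Build a pseudo-reverse ISH string from band-mapped segments.

    Each segment is (chromosome, first_band, last_band, mean_log2_ratio).
    The thresholds are illustrative assumptions, not published values.
    """
    enh, dim = [], []
    for chrom, first, last, mean in segments:
        # A segment covering a single band is written without an end band.
        interval = f"{chrom}{first}" if first == last else f"{chrom}{first}{last}"
        if mean >= gain:
            enh.append(interval)
        elif mean <= loss:
            dim.append(interval)
        # Segments between the two thresholds are treated as balanced.
    parts = []
    if enh:
        parts.append("enh(" + ",".join(enh) + ")")
    if dim:
        parts.append("dim(" + ",".join(dim) + ")")
    return "rev ish " + ",".join(parts) if parts else ""


example = [
    ("8", "q21", "q24", 0.45),   # gained segment
    ("8", "p23", "p11", -0.38),  # lost segment
    ("8", "q11", "q13", 0.02),   # balanced segment
]
annotation = segments_to_iscn(example)
```

With fixed thresholds, the result depends strongly on tumor cell content and array noise, so any real application would need thresholds chosen per platform or per experiment.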
For the conversion of cytogenetic annotations, software was implemented in the Perl scripting language (www.isc.org/sources/devel/lang/perl.txt). Cytogenetic data are converted to standard ISCN 1995 format (Figure 1A) and automatically checked for syntax errors. Each band of a cytogenetic reference table with 862-band resolution [currently University of California Santa Cruz (UCSC) May 2004 edition; hgdownload.cse.ucsc.edu/goldenPath/hg17/database/cytoBand.txt.gz] is evaluated for its inclusion in intervals derived from the text annotation, and the status (gain, loss, or high-level gain) is assigned accordingly (Figure 1B). The band status is annotated, and a two-dimensional band-specific status matrix file is generated (Figure 1C).
Figure 1.
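The conversion from a text annotation to a band-specific status row can be sketched as follows. This is a minimal Python illustration and not the Perl parser used for Progenetix. It assumes a small subset of the reverse ISH syntax (`enh`, `dim`, and `amp` groups with intervals such as `8q21q24`, `8p`, or `8`) and uses a simplified, illustrative band list for a single chromosome instead of the 862-band reference table. The names `BANDS`, `expand`, and `band_status` and the numeric status codes are hypothetical.

```python
import re

# Simplified, illustrative band order for one chromosome (pter -> qter).
# A real table would be read from a cytoBand file with hundreds of bands.
BANDS = {"8": ["p23", "p22", "p21", "p12", "p11",
               "q11", "q12", "q13", "q21", "q22", "q23", "q24"]}
COLUMNS = [(chrom, band) for chrom in BANDS for band in BANDS[chrom]]
# Assumed status codes: loss, gain, high-level gain (0 means balanced).
STATUS = {"dim": -1, "enh": 1, "amp": 2}


def _index(order, name):
    """Position of a band, or of a chromosome end, in the band order."""
    if name == "pter":
        return 0
    if name == "qter":
        return len(order) - 1
    return order.index(name)


def expand(chrom, spec):
    """Return the bands covered by an interval such as 'q21q24', 'q', or ''."""
    order = BANDS[chrom]
    parts = re.findall(r"[pq](?:ter|\d+)?", spec)
    if not parts:                      # whole chromosome, e.g. "8"
        return list(order)
    if len(parts) == 1:
        if parts[0] in ("p", "q"):     # whole arm, e.g. "8q"
            return [b for b in order if b.startswith(parts[0])]
        return [order[_index(order, parts[0])]]
    lo, hi = sorted(_index(order, p) for p in parts[:2])
    return order[lo:hi + 1]


def band_status(annotation):
    """Convert one annotation into a status row ordered like COLUMNS."""
    status = dict.fromkeys(COLUMNS, 0)
    for kind, body in re.findall(r"(enh|dim|amp)\(([^)]*)\)", annotation):
        for item in body.split(","):
            chrom, spec = re.fullmatch(r"(\d+|X|Y)(.*)", item.strip()).groups()
            for band in expand(chrom, spec):
                # A high-level gain is never overwritten by a lower status.
                if kind == "amp" or status[(chrom, band)] != STATUS["amp"]:
                    status[(chrom, band)] = STATUS[kind]
    return [status[col] for col in COLUMNS]


row = band_status("rev ish enh(8q21q24),dim(8p),amp(8q24)")
```

Applying `band_status` to each case and stacking the rows gives a two-dimensional matrix with one row per case and one column per band. The sketch does not handle sub-bands, mosaic annotations, or syntax checking.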
The minimal consistent amount of case-specific information is sampled from the literature. Diagnoses and topographies are recoded to the International Classification of Diseases in Oncology (ICD-O-3) format (15). Each case is referenced to the PubMed ID of its originating publication. For the web site generation, all different case entities (disease, locus, publication, custom group) are identified, and for each of them, specific overview pages are generated. These consist of a list of case-specific information, an ideogrammatic representation of genomic gains and losses, and a page showing the unsupervised clustering of cases according to their aberration pattern using XCluster (Gavin Sherlock; genetics.stanford.edu/~sherlock/cluster.html).
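The idea of unsupervised clustering of cases by aberration pattern can be illustrated with a small, self-contained Python sketch. It is not XCluster and does not reproduce its distance measures or options. The sketch assumes that each case is a row of band status codes, uses the fraction of differing bands as a hypothetical distance, and merges clusters by average linkage. The names `distance` and `cluster` are illustrative.

```python
from itertools import combinations


def distance(a, b):
    """Fraction of bands whose status differs between two profiles."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster(profiles):
    """Average-linkage agglomerative clustering.

    Returns the merges in order as (members_a, members_b, linkage_distance),
    where members are row indices into `profiles`.
    """
    clusters = {i: [i] for i in range(len(profiles))}
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            a, b = pair
            dists = [distance(profiles[i], profiles[j])
                     for i in clusters[a] for j in clusters[b]]
            return sum(dists) / len(dists)

        # Pick the closest pair of clusters and merge the second into the first.
        a, b = min(combinations(sorted(clusters), 2), key=linkage)
        merges.append((clusters[a], clusters[b], linkage((a, b))))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Four toy cases over four bands (-1 loss, 0 balanced, 1 gain).
profiles = [
    [1, 1, 0, 0],
    [1, 1, 0, -1],
    [-1, -1, 0, 0],
    [-1, -1, 1, 0],
]
merges = cluster(profiles)
```

In this toy input the two gain-dominated cases and the two loss-dominated cases are joined first, and the two resulting groups are merged last. The exhaustive pair search is only practical for small case numbers.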
At the time of writing, 13,240 unique experiments published in 535 peer-reviewed articles have been included in the Progenetix database (Figure 2), representing 273 distinct neoplastic entities. The majority of those cases (12,179 or 92%) came from chromosomal CGH experiments.
Figure 2.
Progenetix presents a unique case-specific structured overview of chromosomal imbalances for most neoplasias. After free registration, academic researchers are able to download the main database content, including the band-specific annotation data in an XML format. As an additional unique feature, the web site offers a query option for the relative aberration status of single bands in disease entities (Figure 3).
Figure 3.
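The relative aberration status of single bands in a disease entity can be derived from a band-specific status matrix by counting, for each band, the proportion of cases with a gain or a loss. The following Python sketch shows this calculation under the assumption that rows are cases, columns are bands, and the codes are -1 (loss), 0 (balanced), 1 (gain), and 2 (high-level gain). The function name `band_frequencies` and the coding are hypothetical and do not describe the web site's implementation.

```python
def band_frequencies(matrix):
    """Return (percent_gain, percent_loss) per band for a status matrix.

    High-level gains (code 2) are counted together with gains (code 1).
    """
    n_cases = len(matrix)
    result = []
    for column in zip(*matrix):  # iterate over bands
        gains = sum(1 for status in column if status > 0)
        losses = sum(1 for status in column if status < 0)
        result.append((100.0 * gains / n_cases, 100.0 * losses / n_cases))
    return result


# Four toy cases over three bands.
matrix = [
    [1, 0, -1],
    [2, 0, -1],
    [0, 0, -1],
    [1, -1, 0],
]
frequencies = band_frequencies(matrix)
```

Restricting the rows to the cases of one diagnosis before calling the function gives the band-specific gain and loss frequencies for that entity.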
To allow users to convert, mine, and visualize their own molecular cytogenetic data sets, a version of the ISCN2matrix parser was implemented as a Perl CGI script. Users can upload a file containing data from multiple cases and generate chromosomal ideograms, cluster graphics, and XML files as described above.
Recently, the interval-specific aberration information from the Progenetix data set and the parsing software for CGH, as well as metaphase banding-based annotations, have shown their usefulness for the delineation of genomic aberration patterns with prognostic relevance (16) and for producing tumor type-specific combined genomic imbalance maps (17,18).
Large-scale data mining approaches based on tens of thousands of genomic profiles should lead to the identification of genomic signatures for a variety of neoplasias and the development of new diagnostic tools (e.g., disease-specific genomic arrays with low complexity). The integration of genomic aberration patterns will be of great benefit for the interpretation of expression array data, allowing for selection of genes with high probability of tumor-specific involvement. Additionally, the delineation of recurring genomic aberration patterns may become the basis for the development of smart target gene detection methods, using sequence similarity searches over commonly involved loci. Through the powerful combination of advanced data mining tools with unique data content, the Progenetix project should be useful for a new generation of oncogenomic data mining projects.
The author is indebted to all individuals who contributed their otherwise not accessible original data. A list of contributors can be found on the Progenetix web site. Alejandra Ellison-Barnes is thanked for helping with data transcription from printed matter.
The author declares no competing interests.