The GeneChip® Human Exon 1.0 ST array contains approximately 5.5 million probes, forming 1.4 million probe sets that together interrogate 1 million known and predicted exons separately; the aim is to cover the entire human genome comprehensively at the exon scale. Exon arrays offer a more fine-grained view of gene expression than the current generation of chips and have the potential to support global inferences about gene expression at the level of individual isoforms and exons, rather than on the per-gene basis offered by existing approaches. Exon arrays are probably the most radically changed generation of GeneChip microarrays and promise to be a powerful technology, given that a significant proportion of human genes are predicted to be differentially spliced (1,2,3). In particular, 74% of multi-exon genes are estimated to be alternatively spliced (2). Such a high target density has been achieved through a variety of changes to the hardware platform, to the design of the array itself, and to the chemistry used to prepare the samples for hybridization. Thus, not only are there about six times as many features as on the previous generation of chips, but the probe set count has been further increased by no longer having a paired mismatch probe for each perfect match partner and by reducing the number of oligonucleotide probes per probe set from 11 to 4 (4). These changes in array design also make many of the existing data analysis methods obsolete. It is not possible, for example, to use either the MAS5 expression summary or detection calling (5,6) algorithms, since there are no paired mismatch probes; instead, Affymetrix provides a new algorithm, probe logarithmic intensity error (PLIER), and a new method, detection above background (DABG), for assessing the reliability of each probe set (7).
With so many changes to the underlying technology, it is important to develop an understanding of exon arrays similar to that which has been developed for current chips. Current chips have been comprehensively explored using controlled data sets (8,9), and the literature contains numerous studies in which candidate genes have been validated using alternative approaches, such as quantitative PCR or protein expression, and in which hypotheses generated from microarray data have been successfully pursued through to a biological conclusion. There is, therefore, a sizable body of data confirming the validity of Affymetrix microarray data (10,11,12,13). If a high degree of mutual consistency is found between exon and standard expression arrays, this would provide significant evidence in favor of the reliability of exon arrays. Thus, the purpose of this report is to consider, using replicated data from two cell lines, MCF7 and MCF10A, the levels of reproducibility between HG-U133 Plus2 human microarrays and Exon 1.0 ST arrays.
Fundamental to the comparison is the need to analyze the available mappings between the probe sets on the different chips. Since one of the prime motivations behind the development of exon arrays is that different parts of a gene can be expressed in different ways in different samples, any comparison must consider the exact location of individual probe sets relative to the target genes’ structure. Thus the success or failure of any such analysis is likely to be governed, at least in part, by the annotation used to define the mapping (Figure 1). This is not straightforward, because the array structure is sufficiently complex that no unique and universal one-to-one mapping between probe sets exists. This paper explores three possibilities: the consensus and target (or SIF) mappings, both supplied by Affymetrix, and an alternative approach (referred to here as the Chip Definition File, or CDF, mapping) based on the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) transcripts (14).
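The absence of a one-to-one correspondence between probe sets can be made concrete with a small sketch. The Python below is only an illustration, not the procedure used by Affymetrix or in this report: it assumes a mapping is available as a list of (exon array probe set, Plus2 probe set) pairs, and the identifiers `E1`, `P1`, and so on are hypothetical placeholders rather than real probe set names. The function name `classify_mapping` is likewise invented for the example.

```python
from collections import defaultdict


def classify_mapping(pairs):
    """Label each exon-array probe set by the cardinality of its mapping.

    `pairs` is an iterable of (exon_probe_set_id, plus2_probe_set_id)
    tuples. This input layout is an assumption made for illustration.
    """
    exon_to_plus2 = defaultdict(set)
    plus2_to_exon = defaultdict(set)
    for exon_id, plus2_id in pairs:
        exon_to_plus2[exon_id].add(plus2_id)
        plus2_to_exon[plus2_id].add(exon_id)

    classes = {}
    for exon_id, targets in exon_to_plus2.items():
        # Does this exon probe set map to more than one Plus2 probe set?
        many_targets = len(targets) > 1
        # Is any of its Plus2 partners also claimed by another exon probe set?
        shared_target = any(len(plus2_to_exon[t]) > 1 for t in targets)
        if many_targets and shared_target:
            classes[exon_id] = "many-to-many"
        elif many_targets:
            classes[exon_id] = "one-to-many"
        elif shared_target:
            classes[exon_id] = "many-to-one"
        else:
            classes[exon_id] = "one-to-one"
    return classes


# Hypothetical identifiers, chosen only to exercise each simple case.
example_pairs = [
    ("E1", "P1"),                # unique partner on both arrays
    ("E2", "P2"), ("E3", "P2"),  # two exon probe sets share one Plus2 probe set
    ("E4", "P3"), ("E4", "P4"),  # one exon probe set hits two Plus2 probe sets
]
example_classes = classify_mapping(example_pairs)
```

Under these assumptions, only pairs labelled "one-to-one" can be compared directly; every other class requires a decision about how to combine or select probe sets before expression values on the two arrays can be set side by side.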
A short description of the different array design strategies (15,16) is necessary in order to identify their key differences. Briefly, probe sets on the Plus2 array were designed against a variety of public resources including UniGene (17), GenBank® (18), the expressed sequence tags database (dbEST) (19), and RefSeq (20). Part of the annotation process involved generating clusters of sequences around each gene and computing alignments within these clusters. Choices have to be made where, for example, sequences within a cluster are of different lengths or quality, or there are discrepancies in the residue called at a particular point. Clusters are thus represented by consensus or exemplar sequences that reflect the end result of these decisions. These sequences are typically long (often full-length messenger RNA or mRNA) and require additional constraints, such as the proximity of a poly(A) site, to select appropriate regions against which to design each probe set. These shorter probe selection regions are known as SIF or target sequences.
