Although the majority (54.6%) of the reads that mapped to the genome were localized to exons, only 1% of the 3 million peaks were exonic. Following our previous observation of the correspondence between HeLa transcription peaks and potential novel transcription start sites, we considered the possibility that the large number of nonexonic genomic loci highlighted by HeLa peaks might represent novel transcribed elements. To assess this possibility, we again leveraged the Affymetrix tiling array generated from HeLa RNA spanning all the ENCODE regions. The ENCODE data includes 4377 transfrags, or contiguous regions of the genome believed to be transcriptionally active, based on these arrays (25), and only 2584 (59.0%) of these overlap known UCSC genes. A total of 14,906 HeLa peaks resided within the ENCODE regions, and comparison to the set of HeLa transfrags revealed that 3523 (23.8%) of these sites overlapped with 2311 (52.7%) of the transfrags. Of the 1736 highest confidence ENCODE-situated HeLa peaks that did not correspond to exons, 1285 (74.0%) overlapped with 1060 (24.2%) transfrags.
To further assess the relationship between transfrags and peaks, we compared the median signal strength of all probes of a transfrag that overlap with a peak to the median signal strength of all probes not overlapping the peak. This analysis revealed that, even for peaks comprising single reads, the median signal strength was increased on average from 0.26 to 2.22 (8.5×). Focusing on the highest confidence HeLa peaks (with height of at least 10) revealed a median increase of probe signal strength from 0.29 to 15.67 (54×). These results indicate that the sequences derive from transcribed regions of the genome and that even mapped singleton reads may be derived from bona fide transcripts.
Considering the number of reads mapping to introns, we sought evidence that some intronic peaks corresponded to novel cryptic exons. This was accomplished by searching for reads derived from the junction between an intronic peak and a known exon within the same gene. Using the 2846 genes with prominent intronic peaks (height 10 or greater), we created a database of sequences representing all possible junctions between known exons and intronic canonical splice sites (see Discerning Exon-Exon Splicing Events section). Searching all unmapped reads against this database identified 37 putative novel exons (in 32 genes) with redundant support. Many additional events were unannotated in Ensembl but were present in either RefSeq or UCSC gene annotations. An example of a particularly interesting event is shown in Figure 4, which demonstrates a novel exon in the MYC-interacting tumor suppressor gene Bin1 (26). This putative exon showed greater coverage than many of the other exons in this gene, suggesting it may be included in most of the transcripts deriving from this locus. The known cancer genes (27) with evidence for novel alternative isoforms are included in Table 4.Identification of Single-base Changes
A deep sampling of expressed sequences offers the attractive possibility of identifying single-base variation in coding elements. This variation could correspond to inherited polymorphisms as well as mutations. To determine the extent of this type of variation in HeLa, we used publicly available short-read alignment software (see Identification of Single-base Changes section in Materials and Methods) to identify and assign confidence scores to nucleotides discrepant from the reference genome. Using our empirically derived thresholds, a variant must be covered by at least six reads of sufficient quality and at least two of them must represent the nonreference allele. After this filtering, 2230 putative novel single-base mutations (1474 genes) and 3895 known variants remained (Table 3). The profile of MAQ consensus scores for these sites was significantly distinct from that of the known SNPs, suggesting that these candidate mutations likely comprise some degree of false positives (Kolmogorov-Smirnov test). Applying more stringent filtering criteria reduced the representation of novel mutations but did not remedy the disparity between confidence scores, which suggests that all false positives cannot be eliminated based solely on their confidence score. Considering the heterozygous mutations and known SNPs, the ratio reads representing each allele was generally different than 1:1, suggesting the potential to detect allelic expression differences. The mean ratio of less abundant to more abundant alleles was 0.68 for known SNPs and 0.63 for novel mutations (Supplementary Figure S3). Of the genes containing candidate mutations, approximately 3/4 (1123) contained a single variant, with a small number (34) containing more than 4 candidate mutations. As one might expect, many of the latter are annotated as pseudogenes, which are subject to neutral selection. One hundred genes had at least one nonsynonymous novel mutation, including nine genes with apparent truncating mutations. Using the GeneTrail software (genetrail. bioinf.uni-sb.de) (28), we found an enrichment of genes involved in the cell cycle (P = 6.47 × 10-5) as well as those involved in the E2F transcriptional module (P = 0.0286) in this list (29). Among these were 17 genes that reside in the COSMIC database (www.sanger.ac.uk/genetics/cgp/cosmic), hence are known to acquire somatic mutations in some types of cancer. These genes, their observed alterations, and genes containing novel exon skipping events are included in Table 4.Discussion
Here we describe a new approach for sequence analysis of transcriptomes, which involves generating multiple random-primed cDNA libraries, followed by sequencing using the Illumina 1G sequencer. Analysis of those <20 million randomly derived sequence reads aligned to the human genome revealed that the approach can be used to study gene expression and to identify novel gene structures as well as to genotype polymorphisms and discover novel single-base mutations in a transcriptome-wide fashion.
The results presented here derive mainly from analysis of data produced from a single flow cell. A strong advantage of this approach, and of digital gene expression profiling in general, is that the sensitivity of detection is limited in principle only by the methods used to prepare the sample for analysis and by the depth of sequencing used. If a single unambiguously mapped read is considered equivalent to a SAGE tag for measuring the transcriptional activity of a gene, we identified 28,613 genes in the largest of the four libraries, suggesting the potential for detecting changes in their expression across libraries of equivalent depth. Though the bulk of these (23, 460) were sequenced to less than 5× coverage, on average (Figure 1), additional “deeper” sequencing from the same library would allow a larger proportion of these to be fully sequenced for genotyping and improved detection of mutations and alternative splicing. Alternatively, sequencing libraries produced from normalized cDNA, in which the relative representation of abundant transcripts is reduced, may provide a more cost-effective option for expanded coverage of low-abundance genes. Production of paired end reads, deriving from both ends of fragments with defined lengths, is also currently in development and should assist in placing reads in low-mappability regions, providing improvements in mutation detection and quantitation.
In our description here of some of the first transcriptome shotgun libraries sequenced by this method, we identified a strong bias in representation of 5′ and 3′ ends relative to internal transcript regions. Though these biased regions account for approximately 55% of the reads aligned to a transcript, we have shown with subsequent libraries that this bias is a direct result of the library construction method (Supplementary Figure S2) and is remedied by a 5 min sonication step (see Construction of HeLa Whole Transciptome Libraries section above). Future development of the approach will focus on further reduction of this bias, in effect seeking to distribute reads randomly across transcripts. This should improve the proportional coverage of internal exons and improve the overall efficiency and reproducibility of this approach, improving quantitation of each exon within a given transcriptome. More powerful analyses will likely be possible once this level of uniformity has been achieved. For example, short informative regions of different transcript isoforms (either junctions or portions of exons unique to a given transcript) should enable quantitative discrimination of these isoforms. We have also shown that selection for polyadenylated transcripts significantly depletes noncoding RNA transcripts, which can be recovered from eluate (Supplementary Figure S4), and may be of interest for investigation into the noncoding RNA component of the transcriptome. We show that this fraction contains evidence for some novel noncoding RNA (ncRNA) genes as well as transcriptional activity induced by the integrated HPV18 genome (Supplementary Material).
We have demonstrated that the exons from a large number of genes can be sequenced to considerable depth (Figure 1, Table 1), allowing us to identify distinct exon junctions that can differentiate alternative transcript isoforms (Table 2, Figure 4). Not all splicing events identified here corresponded to annotated transcripts (Table 2), and may include exons and alternative events unique to HeLa S3 cells or certain cell types (Figure 4). That such events were detected here suggests the possibility of a digital approach for quantifying differences in alternative splicing between tissues and samples, previously only feasible using arrays (12,13). Differences in alternative splicing have been shown to be useful as prognostic markers in various cancers (30,31) and may relate to chemotherapy resistance (13).
Another potential application of this technique is the concurrent detection of mutation and aberrant gene expression events in cancers. Several studies (32,33) have described efforts in which directed exon sequencing has been used to identify cancer-associated mutations. Appropriate statistical techniques may allow differentiation of passenger mutations from driver mutations, or those responsible for transformation. Due to the use of the Illumina system, which involves sequencing molecules clonally derived from single DNA molecules, we suspect that transcriptome shotgun sequencing has the potential to facilitate detection of rare, but possibly critical, mutations in sub-populations of cells (34).
While primarily sequencing the most prevalent, or highly expressed, transcripts in this study, we have revealed 100 genes with apparent changes to their proteins caused by single-base changes as well as potentially many more, due to novel alternative splicing events (Table 4). In the context of cancer, it is noteworthy that we observed a significant enrichment of genes involved in the cell cycle and the E2F signal transduction pathway. The latter has been linked to many cancers (35), particularly in the formation of cervical cancer and the transformation of primary cells to immortal cell lines (36,37,38). From this analysis we suggest that further profiling of other cell lines and primary tumor samples will likely reveal single-base mutations as well as alternative splicing events involving both known and potentially undiscovered exons, which will aid us in better understanding the underlying mechanisms of cancer biology and could lead to improved methods to combat the disease.