All reads were aligned to the human genome using the MAQ multiple alignment software (http://maq.source-forge.net/maq-man.shtml), which considers the phred-like Illumina read quality scores (produced by Bustard) for building consensus sequences and robustly identifying high quality discrepancies. Candidate single-base changes identified by MAQ were filtered for those with consensus quality scores >20 and coverage depth of at least 6 reads. These parameters were chosen to minimize the number of false positive mutations discovered in an in silico simulation of WTSS sequences. To determine the effect of candidate mutations on gene products, we replaced the discrepant nucleotide in all corresponding Ensembl transcripts and compared the resulting amino acid sequence to that produced by the reference transcript.Results Transcript Identification and Quantitation
A total of 28.6 million 31 nt reads were generated using eight flow-cell lanes to sequence randomly primed cDNA libraries prepared from polyA+ and polyA- fractions of HeLa S3 RNA (see Construction of HeLa Whole Transciptome Libraries section above). Of these sequences, 18.9 million reads were unambiguously aligned to yield 3 million peaks representing more than 125 Mbp (4.1%) of the reference human genome sequence, and 15.6 million of these reads mapped to regions of the genome annotated as exons (20). Using only the reads mapping within these exons, we calculated the average coverage depth and total reads mapped to every exon. A total of 143,749 exons had an average of one read covering each nucleotide (1× coverage) while 36,445 of these showed an average coverage of at least 10× (Table 1). Many exons in the genome contained low-mappability bases (see Defining “Mappability” of Exonic Sequences section and Supplementary Material). These numbers were re-examined ignoring all exons with mappable fractions <0.5 to determine the effect of mappability on exon coverage—we identified a small number of exons (6527) showing <1× coverage that could be explained by such mapping issues, while the remainder were likely expressed at low levels. For those exons with imperfect mappability, we computed a corrected coverage depth to better gauge their abundance. A summary of exon coverage depth is shown, indicating coverage depth for all exons with their unadjusted and adjusted coverage values as well as the subset of exons with mappable fractions of at least 0.5 (Figure 1A).
To assess the extent to which entire genes were sequenced, we considered the coverage of all exons for a gene, choosing the gene model with the most mapped reads (Figure 1B). It is apparent that a large fraction of the reads produced in this study corresponded to a few extremely abundant genes (>2000× coverage), mainly those encoding ribosomal proteins. As indicated in Table 1, 5684 genes demonstrated coverage of at least 5× for every exon. This number increased to 5881when exons with low (<0.5) mappability were ignored. Our analysis revealed an overrepresentation of sequence reads aligning to transcript termini compared with internal regions. We also noted that the 5′ ends of transcripts were overrepresented 2:1 compared with 3′ ends (Figure 2). Comparison of our four HeLa-derived libraries reveals that the library preparation procedure affects the extent of this bias and that is likely a technical artifact that can be addressed with improvements to library preparation. In support of this notion, the two libraries produced using the SMART cDNA synthesis method (see Construction of HeLa Whole Transciptome Libraries section) have offered much more uniform coverage of transcripts (Supplementary Figure S1). We have since found that a 5 min sonication of random-primed cDNA fragments reduces the end-bias observed in the first two libraries; this procedure is now included in our standard protocol.
As the number of reads should correspond to the abundance of transcripts, these data should allow us to individually quantify the abundance of each exon in the transcriptome, providing a digital alternative to tiling arrays. As part of the ENCODE project, the HeLa transcriptome was profiled using custom tiling arrays produced by Affymetrix (Santa Clara, CA, USA), providing means for direct comparison between these methods (7). The 4633 Ensembl exons in the ENCODE regions showed a strong rank correlation between these platforms (R2 = 0.693) (Figure 3). To exclude the effect of the overrepresented reads deriving from transcript ends, we compared the abundance of the 3908 internal exons in these regions (Figure 3A). This resulted in an enhanced Spearman correlation (R2 = 0.713). This level of significance is similar to comparisons of different array platforms (21) and is striking when we consider that these two experiments were performed on HeLa RNA purified in separate laboratories. A plot of the array-based and sequence-based measures of exon levels is shown (Figure 3A) along with an example demonstrating the correspondence between all exons for a representative gene (Figure 3B). We demonstrated method reproducibility by comparing the two libraries produced with different sonication times (see Construction of HeLa Whole Transciptome Libraries section). This comparison revealed a strong correlation of exon coverage and quantitation between replicates (R2 = 0.976) (Supplementary Figure S2).