to BioTechniques free email alert service to receive content updates.
Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing
Ryan D. Morin, Matthew Bainbridge, Anthony Fejes, Martin Hirst, Martin Krzywinski, Trevor J. Pugh, Helen McDonald, Richard Varhol, Steven J.M. Jones, and Marco A. Marra
Full Text (PDF)
Supplementary Material
314 (.pdf)
312 (.pdf)
308 (.pdf)
307 (.pdf)

Discovery of Novel Expressed Elements

Although the majority (54.6%) of the reads that mapped to the genome were localized to exons, only 1% of the 3 million peaks were exonic. Following our previous observation of the correspondence between HeLa transcription peaks and potential novel transcription start sites, we considered the possibility that the large number of nonexonic genomic loci highlighted by HeLa peaks might represent novel transcribed elements. To assess this possibility, we again leveraged the Affymetrix tiling array generated from HeLa RNA spanning all the ENCODE regions. The ENCODE data includes 4377 transfrags, or contiguous regions of the genome believed to be transcriptionally active, based on these arrays (25), and only 2584 (59.0%) of these overlap known UCSC genes. A total of 14,906 HeLa peaks resided within the ENCODE regions, and comparison to the set of HeLa transfrags revealed that 3523 (23.8%) of these sites overlapped with 2311 (52.7%) of the transfrags. Of the 1736 highest confidence ENCODE-situated HeLa peaks that did not correspond to exons, 1285 (74.0%) overlapped with 1060 (24.2%) transfrags.

To further assess the relationship between transfrags and peaks, we compared the median signal strength of all probes of a transfrag that overlap with a peak to the median signal strength of all probes not overlapping the peak. This analysis revealed that, even for peaks comprising single reads, the median signal strength was increased on average from 0.26 to 2.22 (8.5×). Focusing on the highest confidence HeLa peaks (with height of at least 10) revealed a median increase of probe signal strength from 0.29 to 15.67 (54×). These results indicate that the sequences derive from transcribed regions of the genome and that even mapped singleton reads may be derived from bona fide transcripts.

Considering the number of reads mapping to introns, we sought evidence that some intronic peaks corresponded to novel cryptic exons. This was accomplished by searching for reads derived from the junction between an intronic peak and a known exon within the same gene. Using the 2846 genes with prominent intronic peaks (height 10 or greater), we created a database of sequences representing all possible junctions between known exons and intronic canonical splice sites (see Discerning Exon-Exon Splicing Events section). Searching all unmapped reads against this database identified 37 putative novel exons (in 32 genes) with redundant support. Many additional events were unannotated in Ensembl but were present in either RefSeq or UCSC gene annotations. An example of a particularly interesting event is shown in Figure 4, which demonstrates a novel exon in the MYC-interacting tumor suppressor gene Bin1 (26). This putative exon showed greater coverage than many of the other exons in this gene, suggesting it may be included in most of the transcripts deriving from this locus. The known cancer genes (27) with evidence for novel alternative isoforms are included in Table 4.

Identification of Single-base Changes

A deep sampling of expressed sequences offers the attractive possibility of identifying single-base variation in coding elements. This variation could correspond to inherited polymorphisms as well as mutations. To determine the extent of this type of variation in HeLa, we used publicly available short-read alignment software (see Identification of Single-base Changes section in Materials and Methods) to identify and assign confidence scores to nucleotides discrepant from the reference genome. Using our empirically derived thresholds, a variant must be covered by at least six reads of sufficient quality and at least two of them must represent the nonreference allele. After this filtering, 2230 putative novel single-base mutations (1474 genes) and 3895 known variants remained (Table 3). The profile of MAQ consensus scores for these sites was significantly distinct from that of the known SNPs, suggesting that these candidate mutations likely comprise some degree of false positives (Kolmogorov-Smirnov test). Applying more stringent filtering criteria reduced the representation of novel mutations but did not remedy the disparity between confidence scores, which suggests that all false positives cannot be eliminated based solely on their confidence score. Considering the heterozygous mutations and known SNPs, the ratio reads representing each allele was generally different than 1:1, suggesting the potential to detect allelic expression differences. The mean ratio of less abundant to more abundant alleles was 0.68 for known SNPs and 0.63 for novel mutations (Supplementary Figure S3). Of the genes containing candidate mutations, approximately 3/4 (1123) contained a single variant, with a small number (34) containing more than 4 candidate mutations. As one might expect, many of the latter are annotated as pseudogenes, which are subject to neutral selection. One hundred genes had at least one nonsynonymous novel mutation, including nine genes with apparent truncating mutations. Using the GeneTrail software (genetrail. (28), we found an enrichment of genes involved in the cell cycle (P = 6.47 × 10-5) as well as those involved in the E2F transcriptional module (P = 0.0286) in this list (29). Among these were 17 genes that reside in the COSMIC database (, hence are known to acquire somatic mutations in some types of cancer. These genes, their observed alterations, and genes containing novel exon skipping events are included in Table 4.


Here we describe a new approach for sequence analysis of transcriptomes, which involves generating multiple random-primed cDNA libraries, followed by sequencing using the Illumina 1G sequencer. Analysis of those <20 million randomly derived sequence reads aligned to the human genome revealed that the approach can be used to study gene expression and to identify novel gene structures as well as to genotype polymorphisms and discover novel single-base mutations in a transcriptome-wide fashion.

The results presented here derive mainly from analysis of data produced from a single flow cell. A strong advantage of this approach, and of digital gene expression profiling in general, is that the sensitivity of detection is limited in principle only by the methods used to prepare the sample for analysis and by the depth of sequencing used. If a single unambiguously mapped read is considered equivalent to a SAGE tag for measuring the transcriptional activity of a gene, we identified 28,613 genes in the largest of the four libraries, suggesting the potential for detecting changes in their expression across libraries of equivalent depth. Though the bulk of these (23, 460) were sequenced to less than 5× coverage, on average (Figure 1), additional “deeper” sequencing from the same library would allow a larger proportion of these to be fully sequenced for genotyping and improved detection of mutations and alternative splicing. Alternatively, sequencing libraries produced from normalized cDNA, in which the relative representation of abundant transcripts is reduced, may provide a more cost-effective option for expanded coverage of low-abundance genes. Production of paired end reads, deriving from both ends of fragments with defined lengths, is also currently in development and should assist in placing reads in low-mappability regions, providing improvements in mutation detection and quantitation.

In our description here of some of the first transcriptome shotgun libraries sequenced by this method, we identified a strong bias in representation of 5′ and 3′ ends relative to internal transcript regions. Though these biased regions account for approximately 55% of the reads aligned to a transcript, we have shown with subsequent libraries that this bias is a direct result of the library construction method (Supplementary Figure S2) and is remedied by a 5 min sonication step (see Construction of HeLa Whole Transciptome Libraries section above). Future development of the approach will focus on further reduction of this bias, in effect seeking to distribute reads randomly across transcripts. This should improve the proportional coverage of internal exons and improve the overall efficiency and reproducibility of this approach, improving quantitation of each exon within a given transcriptome. More powerful analyses will likely be possible once this level of uniformity has been achieved. For example, short informative regions of different transcript isoforms (either junctions or portions of exons unique to a given transcript) should enable quantitative discrimination of these isoforms. We have also shown that selection for polyadenylated transcripts significantly depletes noncoding RNA transcripts, which can be recovered from eluate (Supplementary Figure S4), and may be of interest for investigation into the noncoding RNA component of the transcriptome. We show that this fraction contains evidence for some novel noncoding RNA (ncRNA) genes as well as transcriptional activity induced by the integrated HPV18 genome (Supplementary Material).

We have demonstrated that the exons from a large number of genes can be sequenced to considerable depth (Figure 1, Table 1), allowing us to identify distinct exon junctions that can differentiate alternative transcript isoforms (Table 2, Figure 4). Not all splicing events identified here corresponded to annotated transcripts (Table 2), and may include exons and alternative events unique to HeLa S3 cells or certain cell types (Figure 4). That such events were detected here suggests the possibility of a digital approach for quantifying differences in alternative splicing between tissues and samples, previously only feasible using arrays (12,13). Differences in alternative splicing have been shown to be useful as prognostic markers in various cancers (30,31) and may relate to chemotherapy resistance (13).

Another potential application of this technique is the concurrent detection of mutation and aberrant gene expression events in cancers. Several studies (32,33) have described efforts in which directed exon sequencing has been used to identify cancer-associated mutations. Appropriate statistical techniques may allow differentiation of passenger mutations from driver mutations, or those responsible for transformation. Due to the use of the Illumina system, which involves sequencing molecules clonally derived from single DNA molecules, we suspect that transcriptome shotgun sequencing has the potential to facilitate detection of rare, but possibly critical, mutations in sub-populations of cells (34).

While primarily sequencing the most prevalent, or highly expressed, transcripts in this study, we have revealed 100 genes with apparent changes to their proteins caused by single-base changes as well as potentially many more, due to novel alternative splicing events (Table 4). In the context of cancer, it is noteworthy that we observed a significant enrichment of genes involved in the cell cycle and the E2F signal transduction pathway. The latter has been linked to many cancers (35), particularly in the formation of cervical cancer and the transformation of primary cells to immortal cell lines (36,37,38). From this analysis we suggest that further profiling of other cell lines and primary tumor samples will likely reveal single-base mutations as well as alternative splicing events involving both known and potentially undiscovered exons, which will aid us in better understanding the underlying mechanisms of cancer biology and could lead to improved methods to combat the disease.


R.M. and M.B. contributed equally to this work. The authors are grateful for the expert technical assistance provided by the library construction core of the BC Genome Sciences Centre at the BC Cancer Agency. Funding for this work was provided in part by the BC Cancer Foundation, Genome British Columbia and Western Diversification. M.M. is a senior scholar and S.J. is a scholar at the Michael Smith Foundation for Health Research.

Competing Interests Statement

The authors declare no competing interests.

1.) DeRisi, J., L. Penland, P.O. Brown, M.L. Bittner, P.S. Meltzer, M. Ray, Y. Chen, Y.A. Su. 1996. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat. Genet. 14:457-460.

2.) Shoemaker, D.D., E.E. Schadt, C.D. Armour, Y.D. He, P. Garrett-Engele, P.D. McDonagh, P.M. Loerch, A. Leonardson. 2001. Experimental annotation of the human genome using microarray technology. Nature 409:922-927.

3.) Saha, S., A.B. Sparks, C. Rago, V. Akmaev, C.J. Wang, B. Vogelstein, K.W. Kinzler, and V.E. Velculescu. 2002. Using the transcriptome to annotate the genome. Nat. Biotechnol. 20:508-512.

4.) Velculescu, V.E., S.L. Madden, L. Zhang, A.E. Lash, J. Yu, C. Rago, A. Lal, C.J. Wang. 1999. Analysis of human transcriptomes. Nat. Genet. 23:387-388.

5.) Larsson, M., S. Graslund, L. Yuan, E. Brundell, M. Uhlen, C. Hoog, and S. Stahl. 2000. High-throughput protein expression of cDNA products as a tool in functional genomics. J. Biotechnol. 80:143-157.

6.) Bertone, P., V. Stolc, T.E. Royce, J.S. Rozowsky, A.E. Urban, X. Zhu, J.L. Rinn, W. Tongprasit. 2004. Global identification of human transcribed sequences with genome tiling arrays. Science 306:2242-2246.

7.) Birney, E., J.A. Stamatoyannopoulos, A. Dutta, R. Guigo, T.R. Gingeras, E.H. Margulies, Z. Weng, M. Snyder. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447:799-816.

8.) Kapranov, P., J. Cheng, S. Dike, D.A. Nix, R. Duttagupta, A.T. Willingham, P.F. Stadler, J. Hertel. 2007. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316:1484-1488.

9.) Carninci, P. 2006. Tagging mammalian transcription complexity. Trends Genet. 22:501-510.

10.) Shiraki, T., S. Kondo, S. Katayama, K. Waki, T. Kasukawa, H. Kawaji, R. Kodzius, A. Watahiki. 2003. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl. Acad. Sci. USA 100:15776-15781.

11.) Wakaguri, H., R. Yamashita, Y. Suzuki, S. Sugano, and K. Nakai. 2008. DBTSS: Database of transcription start sites, progress report 2008. Nucleic Acids Res. 36:D97-D101.

12.) Johnson, J.M., J. Castle, P. Garrett-Engele, Z. Kan, P.M. Loerch, C.D. Armour, R. Santos, E.E. Schadt. 2003. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302:2141-2144.

13.) Griffith, M., M.J. Tang, O.L. Griffith, R.D. Morin, S.Y. Chan, J.K. Asano, T. Zeng, S. Flibotte. 2008. ALEXA: A microarray design platform for alternative expression analysis. Nat. Methods 5:118.

14.) Cargill, M., D. Altshuler, J. Ireland, P. Sklar, K. Ardlie, N. Patil, N. Shaw, C.R. Lane. 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22:231-238.

15.) Ota, T., Y. Suzuki, T. Nishikawa, T. Otsuki, T. Sugiyama, R. Irie, A. Wakamatsu, K. Hayashi. 2004. Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat. Genet. 36:40-45.

16.) Barski, A., S. Cuddapah, K. Cui, T.Y. Roh, D.E. Schones, Z. Wang, G. Wei, I. Chepelev. 2007. High-resolution profiling of histone methylations in the human genome. Cell 129:823-837.

17.) Robertson, G., M. Hirst, M. Bainbridge, M. Bilenky, Y. Zhao, T. Zeng, G. Euskirchen, B. Bernier. 2007. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods 4:651-657.

18.) Chen, W, V. Kalscheu, A. Tzschach, C. Menzel, R. Ullmann, M. Schulz, F. Erdogan, N. Li. 2008. Mapping translocation breakpoints by next-generation sequencing. Genome Res. [Epub 21 May 2008].

19.) Morin, R.D., M.D. O'Connor, M. Griffith, F. Kuchenbauer, A. Delaney, A. Prabhu, Y. Zhao, H. McDonald. 2008. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res. 18:610-621.

20.) Hubbard, T.J., B.L. Aken, K. Beal, B. Ballester, M. Caccamo, Y. Chen, L. Clarke, G. Coates. 2007. Ensembl 2007. Nucleic Acids Res. 35:D610-D617.

21.) Yauk, C.L., and M.L. Berndt. 2007. Review of the literature examining the correlation among DNA microarray technologies. Environ. Mol. Mutagen. 48:380-394.

22.) Akiva, P., A. Toporik, S. Edelheit, Y. Peretz, A. Diber, R. Shemesh, A. Novik, and R. Sorek. 2006. Transcription-mediated gene fusion in the human genome. Genome Res. 16:30-36.

23.) Sperisen, P., C. Iseli, M. Pagni, B.J. Stevenson, P. Bucher, and C.V. Jongeneel. 2004. Trome, trEST and trGEN: databases of predicted protein sequences. Nucleic Acids Res. 32:D509-D511.

24.) Thierry-Mieg, D., and J. Thierry-Mieg. 2006. AceView: A comprehensive cDNA-supported gene and transcripts annotation. Genome Bio. [Epub 7 August 2007] 7:1-14.

25.) Cheng, J., P. Kapranov, J. Drenkow, S. Dike, S. Brubaker, S. Patel, J. Long, D. Stern. 2005. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308:1149-1154.

26.) Pineda-Lucena, A., C.S. Ho, D.Y. Mao, Y. Sheng, R.C. Laister, R. Muhandiram, Y. Lu, B.T. Seet. 2005. A structure-based model of the c-Myc/Bin1 protein interaction shows alternative splicing of Bin1 and c-Myc phosphorylation are key binding determinants. J. Mol. Biol. 351:182-194.

27.) Futreal, P.A., L. Coin, M. Marshall, T. Down, T. Hubbard, R. Wooster, N. Rahman, and M.R. Stratton. 2004. A census of human cancer genes. Nat. Rev. Cancer 4:177-183.

28.) Backes, C., A. Keller, J. Kuentzer, B. Kneissl, N. Comtesse, Y.A. Elnakady, R. Muller, E. Meese. 2007. GeneTrail—advanced gene set enrichment analysis. Nucleic Acids Res. 35:W186-W192.

29.) Krull, M., S. Pistor, N. Voss, A. Kel, I. Reuter, D. Kronenberg, H. Michael, K. Schwarzer. 2006. TRANSPATH: An information resource for storing and visualizing signalling pathways and their pathological aberrations. Nucleic Acids Res. 34:D546-D551.

30.) Pampalakis, G., A. Scorilas, and G. Sotiropoulou. 2008. Novel splice variants of prostate-specific antigen and applications in diagnosis of prostate cancer. Clin. Biochem. [Epub 11 January 2008].

31.) Yousef, G.M., N.M. White, I.P. Michael, J.C. Cho, J.D. Robb, L. Kurlender, S. Khan, and E.P. Diamandis. 2005. Identification of new splice variants and differential expression of the human kallikrein 10 gene, a candidate cancer biomarker. Tumour Biol. 26:227-235.

32.) Greenman, C., P. Stephens, R. Smith, G.L. Dalgliesh, C. Hunter, G. Bignell, H. Davies, J. Teague. 2007. Patterns of somatic mutation in human cancer genomes. Nature 446:153-158.

33.) Sjoblom, T., S. Jones, L.D. Wood, D.W. Parsons, J. Lin, T.D. Barber, D. Mandelker, R.J. Leary. 2006. The consensus coding sequences of human breast and colorectal cancers. Science 314:268-274.

34.) Thomas, R.K., E. Nickerson, J.F. Simons, P.A. Janne, T. Tengs, Y. Yuza, L.A. Garraway, T. LaFramboise. 2006. Sensitive mutation detection in heterogeneous cancer specimens by massively parallel picoliter reactor sequencing. Nat. Med. 12:852-855.

35.) Gabellini, C., D. Del Bufalo, and G. Zupi. 2006. Involvement of RB gene family in tumor angiogenesis. Oncogene 25:5326-5332.

36.) Kim, H., J. Farris, S.A. Christman, B.W. Kong, L.K. Foster, S.M. O'Grady, and D.N. Foster. 2002. Events in the immortalizing process of primary human mammary epithelial cells by the catalytic subunit of human telomerase. Biochem. J. 365:765-772.

37.) Crowe, D.L., D.C. Nguyen, K.J. Tsang, and S. Kyo. 2001. E2F-1 represses transcription of the human telomerase reverse transcriptase gene. Nucleic Acids Res. 29:2789-2794.

38.) Giacinti, C., and A. Giordano. 2006. RB and cell cycle progression. Oncogene 25:5220-5227.

  1    2    3    4    5