Sequence-based methods for transcriptome characterization have typically relied on generation of either serial analysis of gene expression tags or expressed sequence tags. Although such approaches have the potential to enumerate transcripts by counting sequence tags derived from them, they typically do not robustly survey the majority of transcripts along their entire length. Here we show that massively parallel sequencing of randomly primed cDNAs, using a next-generation sequencing-by-synthesis technology, offers the potential to generate relative measures of mRNA and individual exon abundance while simultaneously profiling the prevalence of both annotated and novel exons and exon-splicing events. This technique identifies known single nucleotide polymorphisms (SNPs) as well as novel single-base variants. Analysis of these variants, and previously unannotated splicing events in the HeLa S3 cell line, reveals an overrepresentation of gene categories including those previously implicated in cancer.
Numerous methods exist to characterize transcriptomes and to measure gene expression at both the transcript and exon level (1,2,3,4,5), and recent work (6,7,8) indicates that much more of the genome is transcribed than previously thought. Techniques to map transcriptional start sites (9,10,11), quantitatively measure splicing events (12,13), and discover mutations in the transcriptome (14) have been described. Analysis of full-length cDNA data generated from capillary sequencing generally assists in gene discovery and refinement of gene annotations (15), but this approach is limited by the high costs and throughput challenges associated with capillary sequencers. Next-generation sequencing-by-synthesis approaches now provide an effective way to inexpensively and rapidly produce billions of bases of sequence data (16,17,18,19). As a result, these approaches have the potential to characterize the transcriptome in a genome-wide and therefore comprehensive manner.
Here we used short (31 nt) sequence reads from an Illumina Genome Analyzer 1 (Illumina, San Diego, CA, USA) to characterize the transcriptome of the human cervical carcinoma-derived cell line, HeLa S3 (National Cell Culture Centre, Minneapolis, MN, USA). Using these data, we assessed: (i) transcript abundance at the exon level and transcript level; (ii) transcriptional start and termination sites; (iii) exon-exon linkages in mature transcripts; (iv) the spectrum of base changes (both known and novel variants); and (v) the potential to identify novel exons belonging to known genes.Materials and Methods Construction of HeLa Whole Transcriptome Libraries
We constructed a total of four libraries for whole transcriptome shotgun sequencing (WTSS). For the first two libraries, a 75 µg aliquot of HeLa S3 total RNA (Ambion, Applied Biosystems, Foster City, CA, USA) was separated into polyA+ and polyA- fractions using the MACS mRNA Isolation kit (Miltenyi Biotec, Bergisch Gladbach, Germany). The nonribosomal polyA- RNA was enriched in the latter fraction using two iterations of the RiboMinus Human/Mouse Transcriptome Isolation kit (Invitrogen, Carlsbad, CA, USA). Double-stranded cDNAs were made from 200 ng of each of the polyA+ and ribosomal RNA-depleted polyA- fractions (4 µM; Stratagene cDNA Synthesis kit, Agilent, Santa Clara CA, USA) and random hexamer primers (8 µM; Invitrogen). For the second two libraries, 25 ng of HeLa S3 total RNA was used to synthesize full-length single-stranded cDNAs using the SMART PCR cDNA Synthesis kit (Clontech, Mountain View, CA, USA). Here, a modified oligo(dT) primer primes the first strand synthesis reaction, and a template-switching mechanism generates full-length single-stranded cDNAs containing the complete 5′ end of the mRNA and universal priming sequences for end-to-end PCR amplification. Four 50 µL PCR amplifications were performed each from 1/10 of a single first strand synthesis reaction using 18 cycles of 95°C for 5 s, 65°C for 5 s, and 68°C for 6 min. The products were pooled, purified on a QIAquick PCR Purification kit (Qiagen, Valencia, CA, USA), and eluted in 30 µL of Tris-EDTA buffer (10:0.1). The quality and quantity of the resulting double-stranded cDNAs were assessed using an Agilent DNA 7500 Series II assay (Agilent, Mississauga, Ontario, Canada) and NanoDrop ND-1000 spectrophotometer (NanoDrop, Wilmington, DE, USA). Two 14 µL aliquots containing ∼200 ng amplified cDNA were each diluted with water to 100 µL and sonicated, one aliquot for 5 min and the other for 20 min. Sonication was performed in an ice-water bath using a Sonic Dismembrator 550 (Fisher, Ottawa, Ontario, Canada) with a power setting of “7” in pulses of 30 s interspersed with 30 s of cooling. Total sonication times refer to active sonication time only and do not include the rest periods in between each pulse.
The resultant cDNAs from all four libraries were size-fractionated on 8% polyacrylamide gels, and the 100 to 300 bp fractions excised. Gel-purified cDNA products were modified for Illumina sequencing using the Illumina Genomic DNA Sample Prep kit as follows: size-selected cDNAs were end-repaired by T4 DNA polymerase and phosphorylated by T4 DNA polymerase and T4 polynucleotide kinase. The cDNA products were incubated with Klenow DNA Polymerase (Illumina) to generate 3′ adenine overhangs followed by ligation to Illumina adapters, which contain 5′ thymine overhangs. The adapter-ligated products were purified on QIAquick spin columns (Qiagen), then PCR-amplified with Phusion DNA Polymerase (Illumina) in 10 cycles using Illumina's genomic DNA primer set. PCR products were purified on QIAquick MinElute columns (Qiagen) and the DNA quality-assessed and quantified using an Agilent DNA 1000 Series II assay and NanoDrop ND-1000 spectrophotometer and diluted to 10 nM. Cluster generation and sequencing was performed using the Standard Cluster Generation kit and the 36 Cycle Solexa Sequencing kit on the Illumina Cluster Station and Illumina Genome Analyzer I following manufacturer's instructions. Six lanes of the flow cell were applied to the first polyA+ library and two lanes were used to sequence the polyA- library. A single lane from a subsequent flow cell was used to sequence each of the latter two full-length libraries. Sequences were extracted from the resulting image files using the Firecrest and Bustard applications (Illumina) run with default parameters. A per-read error rate for each position in our reads was computed and showed a median error of less than 1% and a maximal error (at 31, the final position) of ∼1.6%.