Targeting the Transcriptome | PCR Feature

11/07/2012
Janelle Weaver, Ph.D.

The complexity of the human transcriptome surpasses the capabilities of current RNA sequencing technologies. Janelle Weaver reports on efforts to develop targeted approaches capable of revealing new insights into the transcriptome.


Nine years ago, the National Human Genome Research Institute launched a massive undertaking known as the Encyclopedia of DNA Elements (ENCODE). The goal was to identify all of the protein-coding genes, non-protein coding genes, and other functional elements contained in the human genome. This encyclopedia could help scientists mine human sequence data more effectively, gain new biological insights, and even develop new treatment strategies for diseases.

This schematic represents the criteria used to experimentally validate GENCODE gene models. Source: Genome Research

In 2007, the ENCODE project consortium reported the results of the pilot phase of the project. They identified and analyzed functional elements in 1% of the human genome (1). In the wake of the successful completion of the pilot phase, the project consortium began efforts to scale up to the entire genome. One subproject, known as GENCODE, aims to accurately annotate all evidence-based gene features in the entire human genome through painstaking manual curation, computational analyses, and targeted experimental validation.

One crucial component of the GENCODE project was to verify the quality of the reference annotation. Spearheading this effort is Alexandre Reymond, a human geneticist at the University of Lausanne and a member of the ENCODE and GENCODE research teams. On the same day the updated version of GENCODE was released a few months ago, Reymond and his team reported in Genome Research a new experimental validation pipeline called RT-PCR-seq (2).

Focusing on lower-confidence GENCODE gene models, the researchers used their method—which combines polymerase chain reaction (PCR) amplification and new sequencing technologies—to evaluate predicted junctions between exons. They confirmed 79% of the junctions and demonstrated the high quality of the GENCODE gene set. "We can be really confident that the GENCODE annotation is really good, because even when there is really little biological support, we can confirm that these transcripts really exist," says Reymond.

Ideally suited for corroborating rare transcripts expressed at low levels, the targeted RT-PCR-seq approach is more sensitive than unbiased transcriptome profiling through RNA sequencing. For instance, the researchers found that exon–exon junctions unique to GENCODE annotated transcripts were five times more likely to be corroborated with RT-PCR-seq than with extensive, large-scale human transcriptome profiling. RT-PCR-seq also validated about 40% of rarely transcribed exons whose splice junctions were not represented in the Human Body Map RNA-seq data set.

Moreover, the targeted approach revealed that about 11% of introns have unannotated exons, and that at least 18% of known loci have yet-unannotated exons. "We really show how complex the human transcriptome is," says Reymond.

Building a Bedrock

Even though RT-PCR-seq seems promising, it is not without its limitations. The stringent criteria used to design primers can limit the number of testable junctions. Moreover, "you can only test one splice site at a time," says Reymond. "It also implies that you know the splice site. You need some data, even if it's low confidence, that says that the splice site exists, and sometimes you may not have this information."

In the end, Reymond envisions that RT-PCR-seq will complement rather than replace traditional, unbiased RNA sequencing approaches. "RNA-seq, which is an unbiased effort, would be good to make the first catalog, but then to recreate really rare transcripts, we would need a more targeted approach," says Reymond.

The main value of RT-PCR-seq may be to validate what's known, according to Mark Gerstein, an expert in protein bioinformatics at Yale University who is a member of the GENCODE team but was not involved in the study. "The future will have a tremendous amount of RNA sequencing data, and what would make a very big and useful and enduring impact on the field is a very accurate human gene set, because that's going to be the bedrock upon which we hang thousands if not tens of thousands of transcriptome experiments," he says. "And so the impact of this is that it's part of that bedrock that's really giving us confidence in our human gene set."

Validation rates (dark blue) of GENCODE non-unique splice junctions (i.e., common to more than one GENCODE transcript isoform; “common” junctions), GENCODE unique splice junctions (specific to a single GENCODE transcript isoform; “specific” junctions), and lower confidence GENCODE unique splice junctions (specific to a single novel or putative GENCODE transcript isoform; “specific and low expressed” junctions) by Illumina Human Body Map RNA-seq (HBM), ENCODE RNA-seq (ENCODE), and RT-PCR-seq are shown in bar plot format. Source: Genome Research

Moreover, targeted approaches used for validation could be useful for addressing long-standing debates, such as how much pervasive transcription really occurs and the biological relevance of transcripts that are expressed at low levels, such as non-coding RNAs. "Pervasive transcription has been very controversial," says Gerstein. "Targeted approaches, particularly orthogonal approaches like RT-PCR-seq, are really valuable in the framework of this important question."

Separating Signal from Noise

Reymond is not the only one developing targeted and sensitive approaches to address the inability of current RNA sequencing technology to handle the amazing breadth and depth of the human transcriptome. John Rinn, an expert in non-coding RNAs at Harvard University, and his collaborators reported last year in Nature Biotechnology a novel approach for identifying and characterizing unannotated transcripts whose rare or transient expression is below the detection limits of conventional sequencing approaches (3).

The targeted RNA capture and sequencing strategy, called RNA CaptureSeq, is similar to previous in-solution capture methods and analogous to exome sequencing, in which protein-coding regions of the genome are selectively sequenced. It combines tiling arrays with deep-sequencing technologies to obtain saturating coverage.

"This was a new challenge, where the array technology could be merged with next-generation sequencing technology to provide whole new insights. So, the challenge there was to get these two worlds to think together," says Rinn. Another hurdle was figuring out how to analyze the data. "We've seen normal statistical models built for microarrays and for sequencing, but we were blowing those models out of the water because we were getting so much coverage in only a few parts of the genome. So, the whole thing was kind of a new beast," he says.

The new method revealed that intermittent sequence reads observed in conventional RNA sequencing data sets—previously dismissed as noise—are in fact rare transcripts. For example, the long non-coding RNAs the researchers discovered were present at an average of about 0.0006 transcripts per cell, indicating that expression occurs in only a small subpopulation of the cells sampled. Less than one-third of the captured intergenic transcripts were even detected in precapture RNA-seq libraries, and according to the study authors, these intergenic transcripts represent some of the rarest transcriptional events characterized to date.
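To make the arithmetic behind that abundance figure concrete, the following Python sketch converts a population-average copy number into an implied fraction of expressing cells. It is an illustration only: the function name and the assumption that every expressing cell carries the same fixed number of copies (one by default) are hypothetical and do not come from the study.

```python
def cells_per_expressing_cell(mean_copies_per_cell, copies_per_expressing_cell=1.0):
    """Estimate how many sampled cells there are for each cell expressing the transcript.

    Assumption (not from the study): every expressing cell carries the same
    number of copies, and all other cells carry none.
    """
    if mean_copies_per_cell <= 0 or copies_per_expressing_cell <= 0:
        raise ValueError("abundances must be positive")
    # Fraction of cells that must express the transcript to give the observed mean.
    fraction_expressing = mean_copies_per_cell / copies_per_expressing_cell
    return 1.0 / fraction_expressing


# Average abundance quoted in the article for the newly found long non-coding RNAs.
mean_abundance = 0.0006

# With one copy per expressing cell, roughly 1 cell in 1,667 expresses the transcript.
print(round(cells_per_expressing_cell(mean_abundance)))
```

If each expressing cell instead held several copies, the expressing subpopulation would be proportionally smaller, which is consistent with the interpretation that expression occurs in only a small fraction of the cells sampled.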

Moreover, the researchers found complex patterns of non-coding transcription in intergenic regions and unannotated exons and splicing patterns in intensively studied protein-coding loci. "The most dramatic one to me was p53—a gene that has been studied for many, many years and cloned numerous times using classical approaches. We were able to find new variants of that gene," says Rinn. "The approach goes so far down that it can penetrate 50 years of study and still find new features."

Unraveling Complexity

Rinn and his collaborators reported that RNA CaptureSeq did not introduce PCR amplification bias and was capable of accurately reflecting the gene expression profiles of the original sample, enabling quantitative analysis of transcripts. But a potential limitation of RNA CaptureSeq is that it relies on tiling arrays, and the capture reagents don't add to the fidelity of the experiment, says Gerstein. "They don't perfectly pick out what you want to sequence. Often, lots of other bits of the genome come along for the ride," he says. "The problem is, for detailed quantification and assessment of how many of a particular RNA species you have, they're really very distorting."

Moreover, RNA CaptureSeq could be considered a stop-gap measure as sequencing technologies become less and less expensive. "If you can do RNA sequencing super cheap, and you can sequence a thousand times more than we can now at the same price, you wouldn't want to have that capture reagent because it just creates noise or mess, and you would just run your sequencer more and more," says Gerstein. "It's sort of the same argument that people make relative to exome sequencing versus whole genome sequencing. Exome sequencing is clearly a transient technology, as sequencing gets cheaper and cheaper."

The RT-PCR-seq validation rates of exon–exon junctions are shown as a function of the abundance of the targeted transcript isoforms. Source: Genome Research

Nonetheless, Rinn insists that RNA CaptureSeq will be valuable for a range of research and clinical applications. The technique can be used to target disease-related loci, and his team has recently used this approach to characterize transcription in the context of malaria infection. "Previously, you would have to sequence a ton to get down into the malaria transcripts, because they're such a small percentage of the cell. Any host-pathogen interaction can be monitored in parallel using this," says Rinn. "You can pull out the two transcriptomes and then compare what's being changed on both sides, whereas typically people have to isolate the human cells and then isolate the malaria in a separate reaction, and you're not getting the in-parallel mixture of the two."

In the end, targeted approaches are crucial for fully understanding the complexity of the human transcriptome, says Rinn. "Transcription regulation is much, much more complex than we ever thought, and each gene locus has actually got many different variants of the same gene. And this kind of approach can get at the depth in a targeted, hypothesis-driven way to understand those regions."

References

1. ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146):799-816.

2. Howald, C., A. Tanzer, J. Chrast, F. Kokocinski, T. Derrien, N. Walters, J. M. Gonzalez, A. Frankish, B. L. Aken, T. Hourlier, J. H. Vogel, S. White, S. Searle, J. Harrow, T. J. Hubbard, R. Guigó, and A. Reymond. 2012. Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome. Genome Res 22(9):1698-710.

3. Mercer, T. R., D. J. Gerhardt, M. E. Dinger, J. Crawford, C. Trapnell, J. A. Jeddeloh, J. S. Mattick, and J. L. Rinn. 2011. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol 30(1):99-104.