Nine years ago, the National Human Genome Research Institute launched a massive undertaking known as the Encyclopedia of DNA Elements (ENCODE). The goal was to identify all of the protein-coding genes, non-protein coding genes, and other functional elements contained in the human genome. This encyclopedia could help scientists mine human sequence data more effectively, gain new biological insights, and even develop new treatment strategies for diseases.
One crucial component of GENCODE, the ENCODE subproject that produces the reference gene annotation, is verifying the quality of that annotation. Spearheading this effort is Alexandre Reymond, a human geneticist at the University of Lausanne and a member of the ENCODE and GENCODE research teams. A few months ago, on the day the updated version of GENCODE was released, Reymond and his team described in Genome Research a new experimental validation pipeline called RT-PCR-seq (2).
Focusing on lower-confidence GENCODE gene models, the researchers used their method—which combines polymerase chain reaction (PCR) amplification and new sequencing technologies—to evaluate predicted junctions between exons. They confirmed 79% of the junctions and demonstrated the high quality of the GENCODE gene set. "We can be really confident that the GENCODE annotation is really good, because even when there is really little biological support, we can confirm that these transcripts really exist," says Reymond.
Ideally suited for corroborating rare transcripts expressed at low levels, the targeted RT-PCR-seq approach is more sensitive than unbiased transcriptome profiling through RNA sequencing. For instance, the researchers found that exon–exon junctions unique to GENCODE-annotated transcripts were five times more likely to be corroborated with RT-PCR-seq than with extensive, large-scale human transcriptome profiling. RT-PCR-seq also validated about 40% of rarely transcribed exons whose splice junctions were not represented in the Human Body Map RNA-seq data set.
Moreover, the targeted approach revealed that about 11% of introns have unannotated exons, and that at least 18% of known loci have yet-unannotated exons. "We really show how complex the human transcriptome is," says Reymond.
Building a Bedrock
Even though RT-PCR-seq seems promising, it is not without its limitations. The stringent criteria used to design primers can limit the number of testable junctions. Moreover, "you can only test one splice site at a time," says Reymond. "It also implies that you know the splice site. You need some data, even if it's low confidence, that says that the splice site exists, and sometimes you may not have this information."
In the end, Reymond envisions that RT-PCR-seq will complement rather than replace traditional, unbiased RNA sequencing approaches. "RNA-seq, which is an unbiased effort, would be good to make the first catalog, but then to recreate really rare transcripts, we would need a more targeted approach," says Reymond.
The main value of RT-PCR-seq may be to validate what's known, according to Mark Gerstein, an expert in protein bioinformatics at Yale University who is a member of the GENCODE team but was not involved in the study. "The future will have a tremendous amount of RNA sequencing data, and what would make a very big and useful and enduring impact on the field is a very accurate human gene set, because that's going to be the bedrock upon which we hang thousands if not tens of thousands of transcriptome experiments," he says. "And so the impact of this is that it's part of that bedrock that's really giving us confidence in our human gene set."
Separating Signal from Noise
Reymond is not the only one developing targeted and sensitive approaches to address the inability of current RNA sequencing technology to handle the amazing breadth and depth of the human transcriptome. John Rinn, an expert in non-coding RNAs at Harvard University, and his collaborators reported last year in Nature Biotechnology a novel approach for identifying and characterizing unannotated transcripts whose rare or transient expression is below the detection limits of conventional sequencing approaches (3).
The targeted RNA capture and sequencing strategy, called RNA CaptureSeq, is similar to previous in-solution capture methods and analogous to exome sequencing, in which protein-coding regions of the genome are selectively sequenced. It combines tiling arrays with deep-sequencing technologies to obtain saturating coverage.
"This was a new challenge, where the array technology could be merged with next-generation sequencing technology to provide whole new insights. So, the challenge there was to get these two worlds to think together," says Rinn. Another hurdle was figuring out how to analyze the data. "We've seen normal statistical models built for microarrays and for sequencing, but we were blowing those models out of the water because we were getting so much coverage in only a few parts of the genome. So, the whole thing was kind of a new beast," he says.
The new method revealed that intermittent sequence reads observed in conventional RNA sequencing data sets, previously dismissed as noise, are in fact rare transcripts. For example, the long non-coding RNAs the researchers discovered were present at an average of about 0.0006 transcripts per cell, indicating that expression occurs in only a small subpopulation of the cells sampled. Fewer than one-third of the captured intergenic transcripts were detected at all in precapture RNA-seq libraries, and according to the study authors, these intergenic transcripts represent some of the rarest transcriptional events characterized to date.
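The step from an average of 0.0006 transcripts per cell to "a small subpopulation" is simple arithmetic, and a short sketch makes it explicit. The code below is illustrative only and does not come from the study. It assumes that each expressing cell carries at least some minimum number of copies (the parameter `min_copies_per_expressing_cell`, a hypothetical name, set to one copy by default) and computes the largest fraction of cells that could be expressing the transcript given the population average.

```python
def max_expressing_fraction(mean_copies_per_cell, min_copies_per_expressing_cell=1.0):
    """Upper bound on the fraction of cells that express a transcript.

    Assumption (not from the study): a fraction f of cells each carry at least
    m copies and all other cells carry none. The population mean is then at
    least f * m, so f can be no larger than mean / m.
    """
    if mean_copies_per_cell < 0:
        raise ValueError("mean_copies_per_cell must be non-negative")
    if min_copies_per_expressing_cell <= 0:
        raise ValueError("min_copies_per_expressing_cell must be positive")
    # A fraction cannot exceed 1, so cap the bound for abundant transcripts.
    return min(1.0, mean_copies_per_cell / min_copies_per_expressing_cell)


# Average abundance quoted in the article for the newly found long non-coding RNAs.
mean_abundance = 0.0006

fraction = max_expressing_fraction(mean_abundance)
print(f"At most {fraction:.2%} of cells, roughly 1 in {1 / fraction:,.0f}")
```

With the reported average and one copy per expressing cell, the bound works out to 0.06% of cells, or roughly one cell in 1,667. Raising the assumed minimum copy number only shrinks the bound further.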
Moreover, the researchers found complex patterns of non-coding transcription in intergenic regions and unannotated exons and splicing patterns in intensively studied protein-coding loci. "The most dramatic one to me was p53—a gene that has been studied for many, many years and cloned numerous times using classical approaches. We were able to find new variants of that gene," says Rinn. "The approach goes so far down that it can penetrate 50 years of study and still find new features."
Rinn and his collaborators reported that RNA CaptureSeq did not introduce PCR amplification bias and was capable of accurately reflecting the gene expression profiles of the original sample, enabling quantitative analysis of transcripts. But a potential limitation of RNA CaptureSeq is that it relies on tiling arrays, and the capture reagents don't add to the fidelity of the experiment, says Gerstein. "They don't perfectly pick out what you want to sequence. Often, lots of other bits of the genome come along for the ride," he says. "The problem is, for detailed quantification and assessment of how many of a particular RNA species you have, they're really very distorting."
Moreover, RNA CaptureSeq could be considered a stop-gap measure as sequencing technologies become less and less expensive. "If you can do RNA sequencing super cheap, and you can sequence a thousand times more than we can now at the same price, you wouldn't want to have that capture reagent because it just creates noise or mess, and you would just run your sequencer more and more," says Gerstein. "It's sort of the same argument that people make relative to exome sequencing versus whole genome sequencing. Exome sequencing is clearly a transient technology, as sequencing gets cheaper and cheaper."
In the end, targeted approaches are crucial for fully understanding the complexity of the human transcriptome, says Rinn. "Transcription regulation is much, much more complex than we ever thought, and each gene locus has actually got many different variants of the same gene. And this kind of approach can get at the depth in a targeted, hypothesis-driven way to understand those regions."
1. ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146):799-816.
2. Howald, C., A. Tanzer, J. Chrast, F. Kokocinski, T. Derrien, N. Walters, J. M. Gonzalez, A. Frankish, B. L. Aken, T. Hourlier, J. H. Vogel, S. White, S. Searle, J. Harrow, T. J. Hubbard, R. Guigó, and A. Reymond. 2012. Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome. Genome Res 22(9):1698-710.
3. Mercer, T. R., D. J. Gerhardt, M. E. Dinger, J. Crawford, C. Trapnell, J. A. Jeddeloh, J. S. Mattick, and J. L. Rinn. 2011. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol 30(1):99-104.