RNA sequencing (RNA-seq) is a powerful tool for measuring levels of gene or allele expression and detecting alternative splicing, among other uses. But until now, it hasn’t been used to find single nucleotide polymorphisms (SNPs). That task has been left to whole genome or whole exome sequencing, which are much more expensive.
In principle, RNA-seq could be used to find SNPs if the results were compared to a reference genome, but this yields an unacceptably high false-positive rate. Jin Billy Li, assistant professor of genetics at Stanford University, and his colleagues hope to change that.
Li is interested in RNA editing, wherein an organism alters RNA bases after transcription. In two Nature Methods papers (1, 2), his team showed that they could use RNA-seq to identify individual RNA edits with a high level of accuracy.
“Based on that success, we thought if we can find editing variants, we should be able to identify some RNA variations derived from genomic SNPs,” said Li.
Identifying SNPs from RNA-seq data is challenging due to RNA complexity. Individual transcripts may vary in relative quantity by up to six orders of magnitude. RNA is also extensively spliced, and single genes often have multiple alternative splice sites.
At 100-300 base pairs in length, RNA-seq reads frequently span at least one splice junction. Such a read has no contiguous counterpart in the reference genome, but a comparison may find a close match at some other part of the genome, where the bases around the splice site will register as a SNP. To account for such mistakes, Li’s team maps their reads to both the transcriptome and the reference genome, and ignores SNPs found within four base pairs of splice sites.
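The splice-site filter described above can be illustrated with a minimal Python sketch. It is not the team's actual pipeline: the function names, the data layout (candidate SNPs as chromosome/position pairs, splice sites as sorted position lists per chromosome), and the choice to treat "within four base pairs" as an inclusive distance are all assumptions made for illustration.

```python
from bisect import bisect_left

def near_splice_site(pos, sorted_sites, window=4):
    """Return True if pos lies within `window` bases of any splice site.

    `sorted_sites` must be sorted in ascending order. The distance is
    treated as inclusive (an assumption; the article does not say).
    """
    # Find the first splice site at or beyond pos - window ...
    i = bisect_left(sorted_sites, pos - window)
    # ... and check whether it also falls at or before pos + window.
    return i < len(sorted_sites) and sorted_sites[i] <= pos + window

def filter_splice_proximal(candidates, splice_sites, window=4):
    """Drop candidate SNPs that sit close to a known splice site.

    candidates   : list of (chromosome, position) tuples (hypothetical layout)
    splice_sites : dict mapping chromosome -> sorted list of positions
    """
    kept = []
    for chrom, pos in candidates:
        sites = splice_sites.get(chrom, [])
        if not near_splice_site(pos, sites, window):
            kept.append((chrom, pos))
    return kept

if __name__ == "__main__":
    sites = {"chr1": [100, 500]}
    snps = [("chr1", 96), ("chr1", 95), ("chr1", 104), ("chr1", 105)]
    print(filter_splice_proximal(snps, sites))
```

In this toy input, the candidates at positions 96 and 104 sit four bases from the splice site at 100 and are discarded, while those at 95 and 105 are kept.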
RNA-seq libraries also contain a surprising number of errors attributable to the random hexamers used as primers when generating complementary DNA; about 90% of putative SNPs were actually due to hexamer mismatches, according to Li.
Researchers generally assume that the hexamer anneals perfectly, but in fact it sometimes doesn’t. The 3´ end is the most critical for annealing success, so mistakes rarely occur there, but the 5´ end can more readily tolerate mismatches, and these errors are copied into the resulting RNA-seq library. Li and his team corrected for this problem by computationally removing any SNPs found that matched the hexamer primers.
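One way such a correction could work is sketched below in Python. The article does not spell out the exact rule, so this sketch assumes that a mismatch is untrustworthy whenever it falls within the first six bases of a read (the stretch copied from the hexamer primer), and that a candidate SNP is kept only if enough reads support it outside that stretch. The names, the data layout, and the `min_support` threshold are hypothetical.

```python
HEXAMER_LEN = 6  # length of the random primers used for cDNA synthesis

def count_reliable_support(read_offsets, primer_len=HEXAMER_LEN):
    """Count supporting reads whose variant base lies outside the primer.

    `read_offsets` holds, for each supporting read, the 0-based offset of
    the variant base measured from that read's 5' end.
    """
    return sum(1 for offset in read_offsets if offset >= primer_len)

def filter_hexamer_artifacts(candidates, min_support=1, primer_len=HEXAMER_LEN):
    """Keep candidate SNPs that retain support outside the primer region.

    candidates  : dict mapping a variant key -> list of read offsets
                  (hypothetical layout chosen for this sketch)
    min_support : minimum number of reliable reads required (assumption)
    """
    return {
        variant: offsets
        for variant, offsets in candidates.items()
        if count_reliable_support(offsets, primer_len) >= min_support
    }

if __name__ == "__main__":
    calls = {
        ("chr1", 1000): [0, 3, 5],    # seen only inside the first six bases
        ("chr1", 2000): [2, 40, 71],  # also seen deeper into the reads
    }
    print(sorted(filter_hexamer_artifacts(calls)))
```

Here the candidate at position 1000 is dropped because every supporting read places it inside the primer-derived bases, while the candidate at position 2000 survives on the strength of two reads that cover it farther from the 5´ end.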
The researchers applied their technique, described in the American Journal of Human Genetics (3), to RNA-seq data from a human lymphoblastic cell line whose genome is also sequenced. Of the SNPs they found, 99.1% were also identified by whole genome sequencing.
The new technique can find SNPs only in expressed genes, and it might disappoint in tissue-specific studies since tissues are heterogeneous and material that is available for sampling may not contain disease-related variants.
But Li believes the approach should be useful for researchers who conduct RNA-seq analyses for other purposes. Identifying SNPs “is kind of a free add-on feature,” he said.
For now, the approach offers another route for confirming SNPs found using whole genome or exome sequencing because the sample preparation is independent and could therefore identify any false SNPs due to systematic errors.
1. Ramaswami G, Lin W, Piskol R, Tan MH, Davis C, Li JB. Accurate identification of human Alu and non-Alu RNA editing sites. Nat Methods. 2012 Jun;9(6):579-81.
2. Ramaswami G, Zhang R, Piskol R, Keegan LP, Deng P, O'Connell MA, Li JB. Identifying RNA editing sites using RNA sequencing data alone. Nat Methods. 2013 Feb;10(2):128-32.
3. Piskol R, Ramaswami G, Li JB. Reliable identification of genomic variants from RNA-seq data. Am J Hum Genet. 2013;93(4):641-51.