Transcriptome studies based on quantitative sequencing estimate gene expression levels by measuring target RNA abundance in sequencing libraries. Sequencing costs are proportional to the total number of sequenced reads, and covering rare RNAs requires sequencing considerable quantities of abundant, identical reads along with them. This major limitation can be addressed by depleting a proportion of the most abundant sequences from the library. However, such depletion strategies involve either extra handling of the input RNA sample or the use of a large number of reverse transcription primers, termed not-so-random (NSR) primers, which are costly to synthesize. Taking advantage of the high tolerance of reverse transcriptase for mispriming, we found that as few as 40 pseudo-random (PS) reverse transcription primers are sufficient to decrease the proportion of undesirable abundant sequences within a library without affecting the overall transcriptome diversity. PS primers are simple to design and can be used to deplete several undesirable RNAs simultaneously, thus creating a flexible tool for enriching transcriptome libraries for rare transcript sequences.
For transcriptome studies using quantitative sequencing, highly abundant sequences within a library limit coverage and make transcripts of interest harder to detect. For example, rRNA or hemoglobin sequences can represent the majority of a sequence library, meaning most of the money spent on sequencing in these cases would be for reads that are irrelevant to downstream analysis. For this reason, transcriptome analysis methods often include a step for removing these RNA sequences. Such depletion techniques include (i) capture with hybridization probes and magnetic beads (Ribo-Zero kit) (1) or using antibodies directed against DNA:RNA hybrids (GeneRead rRNA Depletion Kit) (2); (ii) capturing first-strand cDNAs synthesized from capped transcripts (CAP Trapper) (3); and (iii) selectively degrading the 5′-phosphate RNAs using Terminator 5′-phosphate-dependent exonuclease (Epicentre). For hemoglobin depletion, kits incorporating the same technologies are also used, such as the GLOBINclear Kit (Thermo Fisher) or the Globin-Zero Gold Kit (Illumina). These methods, while well-established, are not always advisable or possible, particularly when the amount of starting material is small, such as when using single cells, or when an extra step is hard to implement, such as when using microfluidic devices.
Precise selection and use of pseudo-random (PS) primers reduced detection of undesirable sequences within libraries, thus increasing the effective sequencing depth. Instead of the 4096 random primers currently used in targeted reduction, only 40 PS primers were needed to effectively deplete abundant transcripts.
Armour et al. described a new method using not-so-random (NSR) primers to deplete rRNA sequences without additional steps (4). cDNAs are primed with a mixture of the 749 out of 4096 random hexamers that do not have a direct match with human rRNAs, leading to a reduction of these sequences from 78% to 13%. The major drawback to this method is that primer pools need to be prepared by synthesizing each individual primer, making customization costly when adding a linker tail or changing the depletion target (5).
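The NSR idea can be illustrated with a minimal Python sketch: enumerate all 4096 hexamers and keep those whose annealing site has no perfect match in the target RNAs. This is not the code of Armour et al.; the function names and the toy target sequence are hypothetical, and the sketch assumes that a primer anneals to the reverse complement of its own sequence on the RNA.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def all_hexamers():
    """Return the 4**6 = 4096 possible DNA hexamers."""
    return ["".join(bases) for bases in product("ACGT", repeat=6)]


def nsr_primers(target_rnas):
    """Keep hexamers whose annealing site is absent from every target.

    Assumption: targets are given as sense-strand sequences in the DNA
    alphabet, so a primer anneals where its reverse complement occurs.
    """
    kept = []
    for primer in all_hexamers():
        site = reverse_complement(primer)
        if not any(site in rna for rna in target_rnas):
            kept.append(primer)
    return kept


# Hypothetical toy target, NOT a real rRNA sequence.
toy_target = ["ACGTACGTTTGACCA"]
primers = nsr_primers(toy_target)
print(len(all_hexamers()), len(primers))
```

With real rRNA reference sequences as input, the same filter would produce an NSR-style pool; the 749 primers reported above also depend on the exact reference sequences used.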
Here we present an extensive generalization of the NSR concept, which we term pseudo-random (PS) primers. Building on the observation of Mizuno et al. that reverse transcriptase tolerates up to two mismatches at the priming site (6), we reasoned that a large number of NSR primer sequences are functionally redundant; therefore, it should be possible to dramatically reduce their number, thus facilitating the development and testing of smaller custom primer sets.
Materials and methods
Selection of pseudo-random primers
We selected 40 PS primers that bind neither to human rRNA nor to the linker sequence of the template-switching oligonucleotide used in our experiments (Supplementary File S1).
The 40 primers were individually synthesized (Invitrogen, Tokyo, Japan), resuspended at a concentration of 100 μM in ultra-pure water, and mixed equimolarly.
Selection of PS_Hb primers
The 33 pseudo-random hemoglobin depletion (PS_Hb) primers were selected as described in Supplementary File S1 by discarding hexamer sequences targeting the human hemoglobin subunit alpha 1 (HBA1), hemoglobin subunit alpha 2 (HBA2), and hemoglobin subunit beta (HBB) RNAs.
Library preparation
NanoCAGE libraries were prepared according to Salimullah et al. using 50 ng of total RNA extracted from HeLa and THP-1 cell lines (7). Technical triplicates of each nanoCAGE library were prepared from each RNA sample. Four libraries were made to compare (i) random hexamers (RanN6) versus PS primers; (ii) RanN6, PS, and 40 randomly picked RanN6 (40N6) primers; (iii) RanN6, PS, 3 subsets of 20 PS, and 1 subset of 10 PS primers; and (iv) RanN6 versus PS_Hb primers. Thus, differences between RanN6 and PS primers, rRNA depletion, and artifacts were replicated in three independent experiments. Library preparation and the composition of each nanoCAGE library are described in Supplementary Figure S1, A and B, and Supplementary Table S1.
Data processing and analysis
The prepared libraries were individually paired-end sequenced on a MiSeq sequencer (Illumina) using the standard nanoCAGE sequencing primers (7). The sequencing data were analyzed using the workflow manager Moirai (8). Briefly, the reads were demultiplexed and trimmed to the first 15 bases with the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). The reads from rRNA or oligo-artifacts were removed with TagDust (version 1.13) (9), and the remaining reads were aligned to the human genome (hg19) with BWA (version 0.7) (10). Then, the reads that were not properly paired and the PCR duplicates were filtered out with samtools (version 0.1.19) (11). Finally, the properly paired reads were clustered and analyzed as in Harbers et al. (12) (the scripts used for the analysis are provided in Supplementary Files S2-S8).
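Two of the operations in this workflow, trimming reads to a fixed length and keeping only properly paired reads, can be sketched in a few lines of Python. This is an illustration of the logic only, not a reimplementation of the FASTX-Toolkit or samtools. The function names and the toy read are hypothetical, and the sketch assumes four-line FASTQ records and the standard SAM flag bits (0x1 for a paired read, 0x2 for a properly aligned pair).

```python
def trim_fastq_records(lines, keep=15):
    """Yield (header, sequence, separator, quality) tuples with the
    sequence and quality strings cut to their first `keep` characters.

    Assumption: every record spans exactly four lines.
    """
    it = iter(lines)
    for header in it:
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None:
            raise ValueError("truncated FASTQ record")
        yield (
            header.rstrip("\n"),
            seq.rstrip("\n")[:keep],
            plus.rstrip("\n"),
            qual.rstrip("\n")[:keep],
        )


def is_properly_paired(flag):
    """Return True when both the 'paired' (0x1) and 'proper pair' (0x2)
    bits are set in a SAM flag."""
    return bool(flag & 0x1) and bool(flag & 0x2)


# Hypothetical toy read of 20 bases.
toy = ["@read1\n", "ACGTACGTACGTACGTACGT\n", "+\n", "I" * 20 + "\n"]
trimmed = list(trim_fastq_records(toy))
print(trimmed[0][1], is_properly_paired(99), is_properly_paired(65))
```

Trimming the quality string together with the sequence keeps each record valid, since FASTQ requires both to have the same length.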
Larger data sets were deposited at Zenodo (FASTQ DOI: 10.5281/zenodo.48112 and BAM DOI: 10.5281/zenodo.48114), and intermediate result files can be downloaded from RIKEN (http://genome.gsc.riken.jp/plessy-20160322/plessy-20160322.tar.gz). Supplementary files were also deposited at GitHub (https://github.com/Population-Transcriptomics/pseudo-random-primers/tree/BioTechniques-2016).
Results and discussion
We tested our PS primer concept using the nanoCAGE method for transcriptome profiling (13). Here, 5′ adapters are introduced by template-switching oligonucleotides during reverse transcription, and random primers are used to cover the non-polyadenylated transcriptome. Undesirable sequences in nanoCAGE libraries come mainly from two sources: (i) rRNA and (ii) primer–primer artifacts. The frequency of these undesirable sequences becomes especially problematic when the quantity of starting material is <1 ng; therefore, we first designed PS primers to reduce rRNA and primer–primer artifacts at the same time. Using scripts written in the R language (see Supplementary File S1), we identified 40 hexamers that neither match perfectly with the human rRNA reference sequences nor match with 0, 1, or 2 mismatches with the nanoCAGE linker sequence. We prepared a mixture of 40 reverse transcription primers containing these hexamers (PS), to replace the standard reverse transcription random primers (RanN6).
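The two exclusion criteria described above (no perfect match to rRNA, and no match with 0, 1, or 2 mismatches to the linker) can be sketched in Python as follows. This is an independent illustration, not the authors' R scripts from Supplementary File S1. The function names and toy sequences are hypothetical, and the sketch assumes that a primer anneals to the reverse complement of its own sequence. It returns every hexamer that passes both filters and does not reproduce any further step that led to the final set of 40.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def min_mismatches(site, target):
    """Smallest Hamming distance between `site` and any window of the
    same length in `target` (len(site) if the target is too short)."""
    k = len(site)
    return min(
        (
            sum(a != b for a, b in zip(site, target[i:i + k]))
            for i in range(len(target) - k + 1)
        ),
        default=k,
    )


def select_candidates(rrna_seqs, linker, max_linker_mismatches=2):
    """Return hexamers that pass both exclusion criteria."""
    kept = []
    for bases in product("ACGT", repeat=6):
        primer = "".join(bases)
        site = reverse_complement(primer)
        # Criterion 1: discard perfect matches to any rRNA sequence.
        if any(site in rrna for rrna in rrna_seqs):
            continue
        # Criterion 2: discard near matches (0, 1, or 2 mismatches) to the linker.
        if min_mismatches(site, linker) <= max_linker_mismatches:
            continue
        kept.append(primer)
    return kept


# Hypothetical toy inputs, NOT the real rRNA or nanoCAGE linker sequences.
toy_rrna = ["TTGACCATTGGCA"]
toy_linker = "ACGTACGTAC"
candidates = select_candidates(toy_rrna, toy_linker)
print(len(candidates))
```

Allowing mismatches for the linker but not for rRNA mirrors the asymmetry in the text: a mismatch-tolerant filter against the long rRNA sequences would leave very few hexamers, whereas the short linker can be screened strictly.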
We tested the PS primers on three sets of triplicate libraries prepared from HeLa and THP-1 cell line total RNA. Using nanoCAGE libraries prepared with RanN6 primers as a control (Supplementary Figure S2), we observed a significant decrease in sequence reads matching rRNA (Figure 1A). Although 20.4% of the remaining reads still matched rRNA after depletion, this represents a reduction in the cost per mapped read of 37%. Primer artifacts were also reduced (Figure 1B) compared with controls, but the difference was only statistically significant for the THP-1 libraries; for one HeLa set of triplicates, there was no decrease, but the overall amount of artifacts was uniformly low, making it difficult to see any effect of the PS primers. To exclude the possibility that the observed effect of the PS primers comes only from reduction of the hexamer diversity, regardless of our selection, we included a control consisting of 40 randomly picked hexamers (40N6). These libraries did not have significantly depleted rRNA reads, but there was an impact on primer artifacts. We explain this effect by the fact that only a few hexamers matched the linker sequences of the nanoCAGE primers; therefore, the 40N6 set was depleted by chance. Indeed, only 13 primers (32%) matched the linker with no or 1 mismatch (Supplementary Figure S3), whereas 1014 primers (25%) of the RanN6 group matched the linker with ≤1 mismatch. This confirms the efficiency of our precise selection of the PS primers to decrease the detection of undesired sequences within nanoCAGE libraries.
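The link between the fraction of unusable reads and the cost per usable read can be made explicit with a short calculation. The Python sketch below assumes a simplified model in which every read costs the same and a fixed fraction of reads is discarded. The function names and input values are hypothetical and are not the measurements behind the 37% figure reported above.

```python
def cost_per_usable_read(unusable_fraction, cost_per_read=1.0):
    """Cost of one usable read when a fraction of all reads is discarded."""
    if not 0 <= unusable_fraction < 1:
        raise ValueError("unusable_fraction must be in [0, 1)")
    return cost_per_read / (1 - unusable_fraction)


def relative_cost_reduction(unusable_before, unusable_after):
    """Fractional decrease in cost per usable read after depletion."""
    before = cost_per_usable_read(unusable_before)
    after = cost_per_usable_read(unusable_after)
    return 1 - after / before


# Hypothetical example: unusable reads drop from 60% to 20% of the library.
print(relative_cost_reduction(0.6, 0.2))
```

Under this model the saving depends on the usable fractions before and after depletion, not on the absolute number of reads, which is why depleting an abundant RNA can noticeably lower the cost per mapped read even when part of it remains in the library.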