Since the introduction of next-generation sequencing, several techniques have been developed to selectively enrich and sequence specific parts of the genome at high coverage. These techniques include enzymatic methods employing molecular inversion probes, PCR-based approaches, hybrid capture, and in-solution capture. In-solution capture employs RNA probes, transcribed from a pool of DNA template oligos designed to match regions of interest, to specifically bind and enrich genomic DNA fragments. This method is highly efficient, especially if genomic target regions are large in size or number. Diverse in-solution capture kits are available, but they are costly when large numbers of samples need to be analyzed. Here we present a cost-effective strategy for the design of custom DNA libraries, their transcription into RNA libraries, and their application for in-solution capture. We show the efficacy by comparing the method to a commercial kit and further demonstrate that emulsion PCR can be used for bias-free amplification and virtual immortalization of DNA template libraries.
The introduction of next-generation sequencing (NGS) has revolutionized research in many areas (1,2), especially affecting our fundamental understanding of the genome and its subparts (3). A multitude of protocols exist for specialized nucleic acid preparations for NGS, including DNA sequencing, RNA sequencing (4), and ChIP-seq (5).
Despite advances in technology, whole genome sequencing of large genomes is still associated with tremendous cost and workload. If the research is focused on only a subset of the genome, genome partitioning methods may be used to selectively enrich for the region of interest (6). Targeted enrichment is employed in many areas of genetic research, such as whole exome sequencing (7), sequencing of causal disease genes (8), and extensive resequencing of large cohorts (9).
Various approaches are available for targeted enrichment. The most commonly used techniques are based on hybrid capture, PCR, and molecular inversion probes (10). For large target regions, hybrid capture has turned out to be the most efficient. A main advantage of this approach is that enrichment can be performed in solution (11) rather than on microarrays (12); this provides easier handling and requires less DNA. In-solution capture often applies biotinylated RNA bait molecules transcribed from DNA template oligo libraries, which are the key component and main cost.
In this study, our goal was to reduce enrichment costs by omitting repeated synthesis of DNA template libraries for recurrent generation of RNA baits. This will be most beneficial for projects that require large target regions and large sample numbers to be enriched for sequencing. We set up a simple strategy for the design of DNA template libraries with two main characteristics: unique target sequences that are relatively short (40 bp) and tiled along both strands of the target region in an alternating manner; and universal primers that flank each of the baits for library amplification.
To illustrate this method, we designed a DNA template library for 966 cancer associated genes. The library was ordered from MYcroarray, transcribed into RNA baits, and used for enrichment. The efficacy of our approach was demonstrated by comparison with the SureSelect Kit from Agilent. Both systems were used to enrich cancer associated genes of the cell line SW480 (13), an important model in colorectal cancer research.
Our next step was to amplify the DNA template library by water-in-oil emulsion PCR to prevent the introduction of amplification biases (14,15). Using this approach, we show evidence of bias-free amplification and virtual immortalization of a DNA template library. For recurrent analyses, such as in diagnostics, enrichment costs may be reduced significantly by using a short immortalized DNA template library (SIMLY), which is described here.
Materials and methods
Bait library design
For targeted enrichment, we compiled a list of cancer-associated genes from different sources. The list included 408 genes from the Cancer Gene Census catalog (CGC) and 383 genes from the COSMIC database (v47). The COSMIC database lists 120,000 published mutations within 4200 genes of 100,000 different human cancer samples. We selected all genes that were reported as mutant more than three times and were not already covered by the CGC data set. Another 175 genes were selected because they were known to be associated with cancer (e.g., BBC3, PSEN1), to be involved in cancer-relevant pathways (e.g., IRAK1), or to have been identified in cancer genome sequencing projects running at the Max Planck Institute for Molecular Genetics (e.g., RAPGEF1). In total, 966 genes were targeted.
In addition, 481 ultra-conserved regions were targeted, since highly conserved regions are likely to be functionally relevant in processes such as long-range transcriptional regulation (16). The corresponding probes made up about 3% of the total number of probes.
For gene identification, we used Entrez Gene IDs, while probe design was based on UCSC gene annotation. Entrez Gene IDs were mapped to UCSC gene annotations using the UCSC mapping table “knownToLocusLink” based on hg19. All UCSC exons and ultra-conserved regions were intersected to obtain a non-redundant and non-overlapping set of target regions. Probes of 40 bp were tiled along the full length of each target region with gaps of 10 bp in between. Probes were designed to target the plus and minus strands in an alternating fashion, enabling binding of both strands of the prepared fragment library. For short target regions, we designed more probes to improve efficiency, with 3 probes for every target region of ≤50 bp and 4 probes for every target region of 50–200 bp. To minimize enrichment of repetitive elements, repeat masking was performed and probes containing repetitive elements were excluded. Probe design for the target region resulted in 89,909 oligos, of which 87,119 targeted coding exons of proven and presumed cancer genes, while 2790 targeted ultra-conserved regions. The total size of the target region was 3.7 Mb with 18,713 targeted loci.
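The basic tiling rule (40 bp probes, 10 bp gaps, alternating strands) can be illustrated with a short Python sketch. This is not the design script used in the study: the function name `tile_probes`, the 0-based half-open coordinates, and the choice to drop a probe that would run past the end of the region are assumptions made for illustration. The additional probes for short target regions and the repeat masking step are omitted.

```python
from typing import List, Tuple

PROBE_LEN = 40  # probe length in bp (from the text)
GAP = 10        # gap between consecutive probes in bp (from the text)


def tile_probes(start: int, end: int) -> List[Tuple[int, int, str]]:
    """Tile probes over the region [start, end), alternating strands.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    A probe that would extend past the region end is not emitted.
    """
    probes = []
    pos = start
    index = 0
    while pos + PROBE_LEN <= end:
        # Even-numbered probes target the plus strand, odd-numbered the minus strand.
        strand = "+" if index % 2 == 0 else "-"
        probes.append((pos, pos + PROBE_LEN, strand))
        pos += PROBE_LEN + GAP
        index += 1
    return probes


# Hypothetical 200 bp target region.
probes = tile_probes(1000, 1200)
for probe_start, probe_end, strand in probes:
    print(probe_start, probe_end, strand)
```

With a step of 50 bp (probe plus gap), roughly 80% of each target region is directly covered by probe sequence in this simplified scheme.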
The individual oligos consisted of a 40 bp unique target sequence flanked by a universal 5' T7 promoter sequence (5'-TAATACGACTCACTATAGGG-3') and a universal 3' sequence (5'-GCACTGCAAAAAGCAGGCTC-3'), giving a total length of 80 bp. The universal sequences allow library amplification and transcription into biotin-labeled RNA baits. The DNA template library was ordered as a custom synthesis from MYcroarray (Ann Arbor, MI, USA) and resuspended to 50 ng/μL after delivery.
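The oligo layout can be written down as a simple concatenation. The following Python sketch uses the two universal sequences given above; the function name `build_template_oligo` and the example target sequence are hypothetical placeholders, and the orientation handling for minus-strand probes is not covered.

```python
T7_PROMOTER = "TAATACGACTCACTATAGGG"   # universal 5' sequence (from the text)
UNIVERSAL_3P = "GCACTGCAAAAAGCAGGCTC"  # universal 3' sequence (from the text)
TARGET_LEN = 40                        # unique target sequence length in bp


def build_template_oligo(target: str) -> str:
    """Return the 80 nt template oligo: T7 promoter + 40 bp target + universal 3' end."""
    target = target.upper()
    if len(target) != TARGET_LEN:
        raise ValueError(f"target must be {TARGET_LEN} bp, got {len(target)}")
    if set(target) - set("ACGT"):
        raise ValueError("target may only contain A, C, G, and T")
    return T7_PROMOTER + target + UNIVERSAL_3P


# Placeholder target sequence for illustration only; not a probe from the library.
example_target = "ACGT" * 10
oligo = build_template_oligo(example_target)
print(len(oligo), oligo)
```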
Amplification and tailing of bait library by emulsion PCR
Amplification and PCR tailing of the template library was performed by standard PCR and water-in-oil emulsion PCR with Phusion Taq (NEB/Finnzymes, Frankfurt, Germany) containing 10 ng template library in 50 μL PCR reactions (10 μL 5x HF buffer, 0.5 mM dNTPs, 1 mM each forward and reverse primer, 0.5 μL Phusion Taq; 95°C for 1 min, [98°C for 5 s, 55°C for 10 s, 72°C for 20 s] x 15 cycles, 72°C for 2 min). Library amplification was performed with the universal primers (T7-for: 5'-TAATACGACTCACTATAGGG-3' and uni-reverse: 5'-GCACTGCAAAAAGCAGGCTC-3'; all oligos were synthesized by Metabion, Munich, Germany) in emulsion as described (15). In brief, 1x PCR master mix was emulsified with 6x oil mix. After 15 PCR cycles, products were cleaned by emulsion breaking and column purification. Tailing of the library with barcoded P1 primers (P1-tag1: 5'-CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGATCTCTAATACGACTCACTATAGGG-3' and P1-tag2: 5'-CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGATGAGTAATACGACTCACTATAGGG-3') and P2 primer (5'-CTGCCCCGGGTTCCTCATTCTGCACTGCAAAAAGCAGGCTC-3') was performed for 5 cycles of PCR, enabling SOLiD sequencing.
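The barcoded P1 tailing primers can be read as the full-length SOLiD P1 sequence, followed by a 3 nt barcode (CTC or GAG), followed by the T7 sequence. The Python sketch below reconstructs both primers from these parts. The decomposition is inferred from the printed sequences and the barcodes named in the read mapping section, and the function name `tailed_p1` is hypothetical.

```python
# Full-length SOLiD P1 primer and T7 sequence as given in the text.
P1 = "CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT"
T7 = "TAATACGACTCACTATAGGG"

# Barcodes as named in the read mapping section (CTC/GAG).
BARCODES = {"P1-tag1": "CTC", "P1-tag2": "GAG"}


def tailed_p1(barcode: str) -> str:
    """Assemble a tailing primer: SOLiD P1 + barcode + T7 priming site."""
    return P1 + barcode + T7


primers = {name: tailed_p1(barcode) for name, barcode in BARCODES.items()}
for name, sequence in primers.items():
    print(name, len(sequence), sequence)
```

Read this way, each tailed library molecule carries its barcode directly upstream of the T7 sequence, which is what allows the two bait libraries to be pooled and separated again after sequencing.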
Target library preparation and enrichment
DNA was prepared from SW480 cells and fragment libraries were prepared according to the SOLiD fragment library protocol with truncated P1 (P1-A: 5'-TCTATGGGCAGTCGGTGAT-3' and P1-B: 5'-ATCACCGACTGCCCATAGATT-3') and P2 (P2-A: 5'-CCGGGTTCCTCATTCTCT-3' and P2-B: 5'-AGAGAATGAGGAACCCGGTT-3') adaptors. Prior to enrichment, library amplification was performed with the primer pair P1-A/P2-A in 100 μL (20 μL 5x HF buffer, 0.5 mM dNTPs, 1 mM each forward and reverse primer, 0.5 μL Phusion Taq; 95°C for 1 min, [98°C for 5 s, 52°C for 10 s, 72°C for 20 s] x 8 cycles, 72°C for 2 min). After PCR, fragments of 150–200 bp were size-selected by agarose gel purification. DNA template libraries were in vitro transcribed into biotin-labeled RNA bait library probes with the Ambion MEGAscript T7 Kit (Invitrogen, Darmstadt, Germany) according to the manufacturer's instructions by replacing 20% of dUTP with biotin labeled dUTP (Biotin-16-dUTP, Roche, Mannheim, Germany). In a single reaction, 500 ng were transcribed for 90 min at 37°C, followed by DNase (NEB, Frankfurt, Germany) digestion and RNeasy (Qiagen, Hilden, Germany) column cleanup. Hybrid capture was performed with equal amounts of fragment library and RNA bait library (250 ng each) and the corresponding blocking oligos (P1-A and P2-A) in 26 μL at 65°C overnight according to the protocol described (17). After capture of the enriched fragment library by streptavidin beads (Dynabeads M-280, Invitrogen, Darmstadt, Germany) and purification, the enriched fraction was amplified as described above with full-length SOLiD P1 (5'-CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT-3') and P2 (5'-CTGCCCCGGGTTCCTCATTCTCT-3') primers for 14 cycles, purified, and quantified by real-time PCR for later SOLiD sequencing.
Sequencing of enriched sample and bait library
Sequencing of the enriched sample and barcoded template libraries was performed according to the SOLiD V4 protocol (Applied Biosystems, Darmstadt, Germany). Briefly, 10 million beads for each of the two barcoded template libraries and 100 million beads for the enriched fractions of genomic fragment library were prepared. The beads were combined and sequenced on a single quad of a flowcell with a 50 bp SOLiD 4 fragment run.
Read mapping and SNP calling
Data analysis was performed with the Applied Biosystems Bioscope v1.3.1 package (Applied Biosystems, Darmstadt, Germany) and a custom barcode deconvolution. To map the bait library, all 50mer reads were aligned to the probe sequences, including the T7 sequence and barcode (CTC/GAG), using the Bioscope Alignment module in classic mode and allowing for 5 mismatches. To map the enrichment results, all 50mer reads were aligned to hg19. The Bioscope Alignment module was used in seed-and-extend mode, using the first 25 bp of the reads as seeds in the first round and 25 bp starting at the 15th base in the second round, allowing 2 mismatches in both rounds and a mismatch penalty score of -2 for extension. The T7 tag included in the two bait libraries prevented probe reads from mapping to hg19. After mapping, the maToBam plugin was used to filter out all reads that were not uniquely positioned in the genome.
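The custom barcode deconvolution is not described in further detail. As a rough illustration of what such a step could look like, the Python sketch below assigns a bait-library read to one of the two tailed libraries from its first three bases. It assumes reads that have already been converted to base space and that begin with the barcode followed by the T7 sequence; the function names, the library labels, and the mismatch tolerance for the T7 check are hypothetical and not taken from the study.

```python
from typing import Optional

T7 = "TAATACGACTCACTATAGGG"                      # T7 sequence (from the text)
BARCODES = {"CTC": "P1-tag1", "GAG": "P1-tag2"}  # barcodes of the tailing primers
MAX_T7_MISMATCHES = 2                            # hypothetical tolerance, not from the text


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_library(read: str) -> Optional[str]:
    """Return the library label for a read, or None if it cannot be assigned."""
    read = read.upper()
    barcode = read[:3]
    t7_part = read[3:3 + len(T7)]
    if barcode not in BARCODES:
        return None
    # Require the T7 sequence right after the barcode to reject genomic reads.
    if len(t7_part) < len(T7) or hamming(t7_part, T7) > MAX_T7_MISMATCHES:
        return None
    return BARCODES[barcode]


# Synthetic 50 nt reads for illustration only.
read_tag1 = "CTC" + T7 + "A" * 27
read_tag2 = "GAG" + T7 + "C" * 27
read_genomic = "ACG" + "T" * 47

print(assign_library(read_tag1), assign_library(read_tag2), assign_library(read_genomic))
```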
Single nucleotide variants (SNVs) were called with the Bioscope DiBayes SNP module. Stringency parameters were set to medium and het.skip.high.coverage was set to 0, allowing the algorithm to call heterozygous SNVs for targeted resequencing approaches.
Results and discussion
While developing a method for targeted enrichment based on a short immortalized library, we selected 966 common cancer genes. We performed targeted enrichment and sequencing of genomic DNA derived from the human colon cancer cell line SW480. Furthermore, we used water-in-oil emulsion PCR to amplify the bait library and demonstrated bias-free amplification by sequencing the unamplified and amplified bait libraries.
A schematic of the experimental design is shown in Figure 1. Herein, the sequencing of both bait libraries with (w/) and without (w/o) emulsion amplification is depicted, as well as the targeted enrichment by SIMLY and SureSelect bait libraries.