2The Sainsbury Laboratory, Norwich Research Park, Norwich, UK
3Plant Biology Laboratory, Salk Institute for Biological Studies, La Jolla, CA
4The James Hutton Institute, Dundee, UK
5School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, UK
Targeted capture provides an efficient and sensitive means for sequencing specific genomic regions in a high-throughput manner. To date, this method has mostly been used to capture exons from the genome (the exome) using short insert libraries and short-read sequencing technology, enabling the identification of genetic variants or new members of large gene families. Sequencing larger molecules results in the capture of whole genes, including intronic and intergenic sequences that are typically more polymorphic and allow the resolution of the gene structure of homologous genes, which are often clustered together on the chromosome. Here, we describe an improved method for the capture and single-molecule sequencing of DNA molecules as large as 7 kb by means of size selection and optimized PCR conditions. Our approach can be used to capture, sequence, and distinguish between similar members of the NB-LRR gene family—key genes in plant immune systems.
The plant innate immune system uses complex networks to recognize the presence of potential intruders and protect the plant against disease development. Pathogen-derived molecules, the so-called pathogen-associated molecular patterns (PAMPs), are recognized by plant pattern recognition receptors (PRRs) on a plant cell’s surface and initiate an immune response that is often sufficient to stop the pathogen from further growth. However, effector proteins released by adapted pathogens can weaken this resistance; thus, plants have a second layer of defense in the form of nucleotide binding-site leucine-rich repeat (NB-LRR) genes that directly or indirectly interfere with the effector proteins, causing local cell death to prevent further disease development (1). As NB-LRR genes are often the genetic basis of disease resistance in plants, they are also known as resistance genes or R-genes. RenSeq (R-gene enrichment sequencing) has been recently described using short-read sequencing technology (2,3,4). As R-genes often appear in clusters of nearly identical sequences, it is problematic to distinguish highly similar sequences (e.g., recently duplicated genes), including inter- and intragenic regions in such clusters. Here, we demonstrate that the use of RenSeq on DNA fragments up to 7-kb long in combination with PacBio sequencing results in full contig assemblies covering entire R-gene clusters with inter- and intragenic regions (Figure 1).
To improve PacBio sequencing of large DNA insert molecules, we developed a strategy to increase the overall size of the targeted DNA molecules by efficiently removing small DNA molecules that are preferentially sequenced on the PacBio platform. The key improvements offered by our method include the use of (i) the Diagenode Megaruptor DNA shearing device to generate a tight distribution of large DNA fragments; (ii) the Sage Scientific Electrophoretic Lateral Fractionator for DNA fragment size selection; and (iii) Kapa HiFi HotStart polymerase to significantly improve amplification yields. The latest PacBio P6-C4 chemistry with an average read length of 10–15 kb allowed us to generate consensus reads with high accuracy.
Material and methods Concentration assessment and quality control
All DNA samples, if not stated otherwise, were quantified using a Qubit 2.0 Fluorometer (Thermo Fisher, Cambridge, UK). DNA fragment sizes were measured using an Agilent Bioanalyzer DNA 12000 chip (Agilent, Stockport, UK). Genomic DNA extraction
Genomic DNA was extracted using cetyltrimethylammonium bromide (CTAB lysis), phenol-chloroform extraction, and Qiagen MagAttract HMW DNA Kit (QIAGEN, Manchester, UK) purification. Young frozen Solanum verrucosum (an inbred genotype from accession VER54) leaves (5 g) were ground to a fine powder using a liquid nitrogen–cooled mortar and pestle, distributed between two 50-mL tubes, mixed with 20 mL CTAB buffer [100 mM Tris-HCl pH 8.0, 2% (w/v) CTAB, 1.4 M NaCl, 20 mM EDTA] containing 20 μg/mL proteinase K, and incubated at 55°C for 20 min. Next, 0.5 volumes (8 mL) chloroform was added and carefully mixed by inversion 15×, followed by centrifugation at 2990 × g in an Eppendorf Centrifuge 5810 R (Eppendorf, Stevenage, UK) for 30 min. The aqueous phase was carefully transferred to a new tube to which 1 volume phenol/chloroform/isoamyl alcohol was added, followed by centrifugation for 30 min at 2990 × g. The aqueous phase was ethanol precipitated by addition of 3 M sodium acetate (0.1 × DNA volume) pH 5.2 and 2.5 volumes ethanol, mixed, and precipitated at 2990 × g at 4°C. DNA pellets were washed with ice-cold 70% ethanol, air-dried, and resuspended in 350 μL 1× TE buffer (10 mM Tris-HCl, 1 mM EDTA, pH 8.0) containing 10 U RNase One (Promega, Southampton, UK). The DNA was dissolved overnight at 4°C with occasional mixing by inversion. Finally, we purified the DNA with the QIAGEN MagAttract HMW DNA Kit with minor changes to the manufacturer’s instructions: we added 150 μL AL Buffer to 200 μL DNA (10 μg in TE) and 15 μL magnetic bead suspension. From this point, we followed the MagAttract HMW DNA Kit Quick-Start protocol (www.qiagen.com/us/resources/resourcedetail?id=50e33eab-4f51-490a- 9dc4-0c3a3469e1d2&lang=en), and we eluted the DNA in 200 μL 1× TE buffer. DNA fragmentation
We used the Diagenode Megaruptor (Diagenode, Seraing, Belgium) to fragment 10 μg of DNA to 8-kb fragments with the following settings: 200-μL input volume, 8-kb shearing size, short hydropore, pre-load hydropore, wash syringe before shearing. PreCR repair
The sheared DNA samples were repaired using the PreCR Repair Mix (New England Biolabs, Hitchin, UK) in a 100 μL reaction volume containing 1 μg DNA, 1× ThermoPol Buffer, 1× NAD+, 0.1 mM dNTPs, and 2 μL enzyme mix. The reaction was incubated for 30 min at 37°C followed by a 0.4× AMPure XP bead (Beckman Coulter, Brea, CA) purification and elution in 60 μL 1× TE buffer. Illumina TruSeq DNA library preparation
Large insert libraries were constructed using the Illumina TruSeq Low Sample protocol (TruSeq DNA Sample Prep Guide 15026486 C): sheared DNA was end repaired, A-tailed, and Illumina adapters were ligated to both ends of the DNA strands to serve as primer binding sites for subsequent amplification. For all purification steps, a 0.4× AMPure XP bead clean-up was used to remove smaller DNA fragments. After the ligation termination step, two 0.4× AMPure XP purifications were performed. TruSeq library amplification
PCR reactions were conducted in a G-Storm GS1 (G-Storm, Somerton, UK) thermal cycler. The reaction set-up was as follows: 100 ng Illumina TruSeq library, 0.3 μM Illumina Primer Mix, and 25 μL 2× Kapa HiFi HotStart ReadyMix (Kapa Biosystems, London, UK) in a total volume of 50 μL. We used the following PCR program: 94°C for 3 min, 10 × [94°C 30 s, 60°C 30 s, 68°C 4 min], 68°C 10 min. The ramp rate was set to 3.0°C/s. We obtained 6.1 μg of amplified library from the PreCR-treated and 9.0 μg from the non-PreCR-repaired sample. Size selection of the Illumina TruSeq library
Before size selection with the Sage Scientific Electrophoretic Lateral Fractionator (SageELF; Sage Science, Beverly, MA), we concentrated the amplified TruSeq libraries to 30 μL using precipitation with 0.4× AMPure XP beads. The amplified TruSeq library (5.0 μg) was loaded on a 0.75% SageELF agarose gel cassette using the following settings: timed separation mode, separation time: 2 h 15 min. For each library, we pooled and quantified the fractions with the correct size distribution. Target enrichment using the MYbaits target enrichment kit
We vacuum dried (Eppendorf SpeedVac Concentrator 5301; Eppendorf, Arlington, UK) 250 ng of each size-selected 4-kb, 5-kb, and 7-kb library at 48°C for 5 min and dissolved the DNA in 1.7 μL nuclease-free water. The hybridization of DNA with RNA baits was performed according to the MYbaits protocol (MYbaits User Manual version 1.3.8) and kit (MYcroarray, Ann Arbor, MI) but with minor changes and in half of the recommended reaction volume. The baits were designed as described by Jupe et al. (3). In brief, the entire bait library was composed of 20,000 unique 120-mer RNA probes designed for the coding regions of potato R-gene sequences between exon–exon boundaries starting from the first nucleotide following the predicted coding region of the R-gene. Sequences were at least 2 times tiled with a 60 nucleotide overlap (3,5). A full list of bait sequences can be found in the SMRT RenSeq protocol (www.nature.com/protocolexchange/ protocols/4663). The Library Master Mix was prepared as follows: 1.7 μL resuspended TruSeq library was mixed with 5 μL SeqCap plant capture enhancer (Roche, Burgess Hill, UK) and 0.6 Block #3. The Hybridization Master Mix was prepared by mixing 10 μL Hyb #1, 0.4 μL Hyb #2, 4 μL Hyb #3, and 4 μL Hyb #4. The Capture Baits Master Mix was prepared by mixing 2.5 μL capture probe with 0.5 μL RNase Block. The Library Master Mix was incubated at 95°C for 5 min with a subsequent incubation at 65°C. The Hybridization Master Mix was incubated at 65°C for 2 min. The Capture Baits Master Mix was incubated at 65°C for 5 min. On a 65°C thermal cycler, 3.5 μL Library Master Mix and 6.5 μL Hybridization Master Mix were transferred to the Capture Baits Master Mix. The hybridization was performed for 22 h at 65°C in a thermal cycler. DNA–RNA hybrids were selected with Dynabeads MyOne Streptavidin C1 (Thermo Fisher, Cambridge, UK) according to the MYbaits manufacturer’s instructions and were finally resuspended in 50 ml 1× TE buffer. Targeted capture amplification
The captured DNA fragments were amplified with the following reaction set-up: 1 μL, 5 μL, or 10 μL MyOne Streptavidin C1 beads; 0.3 μM Illumina Primer Mix; and 25 μL 2× Kapa HiFi HotStart Ready Mix in a total volume of 50 μL. We used the following PCR program: 94°C 3 min, 25 × [94°C 30 s, 60°C 30 s, 68°C 4 min], 68°C 10 min. The ramp rate was set to 3.0°C/s. We performed three individual amplification reactions for each library to minimize PCR biases and PCR duplications and to study the effect of varying template amounts on the PCR yield of the different insert-size libraries. The obtained amounts were (for 1 μL, 5 μL, or 10 μL MyOne Streptavidin C1 beads respectively): 4.8 μg, 3.1 μg, 2.3 μg for the 4-kb insert library; 3.8 μg, 4.5 μg, 3.6 μg for the 5-kb insert library, and 2.5 μg, 4.3 μg, 4.7 μg for the 7-kb insert library. Size selection of the enriched R-gene library
Sage ELF size selection was performed as previously described, but the separation time was adjusted to the fragment size of interest (4-kb amplicons: 1 h 6 min; 5-kb and 7-kb amplicons: 2 h 18 min), followed by a 0.4 × AMPure XP bead purification. PacBio library assembly
PacBio libraries were constructed according to the Pacific Biosciences 5-kb Template Preparation and Sequencing protocol, version 6 (www.pacificbiosciences.com/support/pubmap/documentation.html). Samples were prepared using the DNA Template Prep Kit 3.0 (Pacific Biosciences, Menlo Park, CA). To avoid loss of material, only the first bead purification step prior to the DNA damage repair step was conducted using a 0.45× ratio of AMPure PB magnetic beads (Pacific Biosciences). All other remaining clean-up steps were performed using a 0.6× ratio of AMPure PB beads. Finally, the libraries were purified with 2 consecutive 0.6× AMPure PB bead clean-up reactions and eluted in 10 μL PacBio EB buffer. PacBio loading and sequencing
The SMRTcell loading protocol was generated with the PacBio Binding Calculator v.126.96.36.199. To maximize SMRTcell yield, the concentration on the plate was set to 0.1 nM instead of using the default value of 0.01 nM. The sequencing primer was annealed by heating the sample to 80°C followed by a slow cool-down of 0.1°C/s to 25°C. For each library, 2 SMRTcells were sequenced with the PacBio P6-C4 chemistry and 240-min movies. Evaluation of reads and subreads
Reads of insert (RoI) were generated using PacBio’s SMRTpipe ReadsOfInsert algorithm. Filtered subreads (SR) were retrieved from PacBio’s SMRT-portal web interface. Illumina adapter sequences 64 bp in length were removed from the 5′ and 3′ ends of the RoI and SR using cutadapt-1.8.1 (6) (Supplementary Material). SR and RoI statistics were calculated with abyss-1.9.0 (7) (Supplementary Material), and modal values were calculated with a custom Python script (see github repository: https://github.com/mgiolai/Ren-seq). The RoI were aligned to the S. verrucosum VER54 draft genome with the “bwa-mem –x pacbio” setting (https://arxiv.org/abs/1303.3997). To assess the sequencing coverage of the assembled contigs, we aligned the RoI to each corresponding assembly with the “bwa-mem –x pacbio” setting. To evaluate how many RoI and SR contain at least one bait sequence, the SR and RoI sequences of each library were searched separately against the bait sequences using BLAST-2.2.28+ with the following arguments: “-num_threads 8 -outfmt 6” (8) (Supplementary Material). The BLASTN results were analyzed with a custom Python script that only scores 80% sequence identity between read and bait sequences over at least 96 bp as a hit. R-gene detection was performed with the NLR-parser-v1.0 (9) using default values. NB-LRR gene assembly
The raw PacBio sequence data sets are stored in .bax.h5 files. A custom pipeline was built in SMRTanalysis 2.3.0 to assemble the data. The main changes to the standard SMRTanalysis assembly pipeline were filtering the raw reads from Illumina adapters and whitelisting (filtering out reads where adapter sequences were not removed).
The Illumina adapters, which are ligated to the DNA during library preparation, were removed before assembling the sequences. We used a custom script (available at https://github.com/paajanen/RenSeq) to trim the adapters off with the h5-raw files as input. This custom script trims each read by 64 bp (the length of the adapter).
We used BLASR (10) to align the adapter sequences to the edited files with a whitelisting script to filter out reads where the adapter was not removed. Those reads were 1%–2% of the raw reads, and they represent chimeric reads (either from physical ligation of two amplicons or virtual chimeras with PacBio software not recognizing the SMRTbell adapters).
The edited bax.h5 files were fed into the SMRTanalysis HGAP3 protocol (11) along with a parameter file that contained the whitelisting. All of the scripts are available at github (https://github.com/paajanen/RenSeq). Results and discussion
High molecular weight S. verrucosum VER54 genomic DNA was extracted (Supplementary Figure S1) and sheared to 8 kb using the Megaruptor (Supplementary Figure S2). Due to the tight size distribution, no additional size-selection step (with its associated loss of material) was necessary before long-insert Illumina TruSeq library construction.
To assess the possible benefits of PreCR DNA repair on library construction, we compared the yield and size distribution of two 8-kb libraries with and without DNA repair. Equal amounts of both TruSeq libraries were amplified using the Kapa HiFi polymerase. In both cases, the PCR yield was sufficient for size selection on the SageELF (5 μg maximum input per gel cassette). We observed a substantial shift in fragment size distribution in the repaired and non-repaired sample after PCR. The PreCR-repaired library showed a decrease from 7.4 kb to 5.3 kb, whereas the non-DNA repaired sample decreased from 7.0 kb to 5.0 kb (Supplementary Figure S2). From this, we concluded that the amplification profile of the Illumina library was little affected by PreCR repair. Lower quality DNA or DNA sheared by other methods (e.g., sonication) might benefit from this step. However, as PreCR repair could be beneficial to the quality of longer DNA molecules, we decided to proceed with the PreCR-treated library.
To minimize the preferential amplification of smaller DNA molecules by PCR, we tested different PCR conditions and commercially available long-range PCR mixes such as the KAPA Long Range PCR Kit and the QIAGEN Long Range PCR Kit (Supplementary Material). Both are advertised to amplify DNA fragments longer than 10 kb. From our results, we concluded that amplification of the TruSeq library with the long-range PCR mixes or various PCR conditions did not lead to better results (e.g., longer PCR products). We preferred using Kapa HiFi for subsequent reactions as this enzyme has minimal GC bias and high proof-reading activity (12).
The SageELF is designed to fractionate DNA into 12 separate elution wells with defined fragment sizes. SageELF size selection of a 5-μg TruSeq library yielded ~4.3 μg of DNA (87% recovery). This shows that a single size-selection run using 5 μg of input DNA on the Sage ELF produces enough DNA for a capture approach. This amount can be achieved with one or two Kapa HiFi PCR reactions.
We selected 3 SageELF size fractions (4-kb, 5-kb, and 7-kb) for bait hybridization (Table 1, Figure 2). The DNA/ RNA bait hybridization to capture resistance gene sequences was performed individually on the 4-kb, 5-kb, and 7-kb fractions according to the MYbaits target enrichment protocol but with minor modifications (see “Materials and methods” section).