Long mate pair (LMP) or “jump” libraries are invaluable for producing contiguous genome assemblies and assessing structural variation. However, the consistent production of high-quality (low duplication rate, accurately sized) LMP libraries has proven problematic in many genome projects. Input DNA length and quantity are key issues that can affect success. Here we demonstrate how 12 libraries covering a wide range of jump sizes can be constructed from <10 µg of DNA, thus ensuring production of the best LMP libraries from a given DNA sample. Finally, we demonstrate the accuracy of the insert sizes by mapping reads from each library back to an existing assembly.
Standard paired-end next-generation sequencing projects can produce long continuous sections of sequence (contigs), but these alone lack the long-range information required to produce single contig assemblies of even bacterial chromosomes (1). Assemblies based on paired-end data alone are unable to resolve repeated sequences that are bigger than the insert size of the library (typically ~500 bp). The genomes of some higher eukaryotes can consist of >80% repeated sequences (2), and this can result in highly fragmented genome assemblies containing many thousands or even millions of small contigs.
To increase assembly contiguity, many projects use long mate pair (LMP) libraries to jump over repeated sequences and connect contigs, a process known as scaffolding (3). Depending on the quantity and quality of the available input DNA, it is possible to generate LMP libraries with insert sizes ranging from 1.5 kb to 40 kb. High-quality assemblies typically use multiple LMP libraries of different insert sizes, which is costly in terms of input DNA quantity, time, and money. LMP libraries are also notoriously difficult to make, especially for the larger insert sizes.
We present a method to simultaneously size select and construct up to 12 LMP libraries at a time and then map the generated reads back to the available assembled sequences to accurately calculate insert sizes. These calculations can then be used to determine which libraries to sequence to greater depth, and the accurate insert size information can be supplied to de novo genome assemblers to improve their output.
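The insert size calculation can be illustrated with a minimal sketch. It assumes that read pairs have already been aligned to the assembly and that the alignment coordinates have been extracted into tuples; the function name `estimate_insert_size`, the tuple layout, and the `max_span` cutoff are hypothetical choices made for illustration and do not represent the pipeline used in this study. The insert size is approximated as the median outer distance between the two reads of pairs that map to the same contig.

```python
from statistics import median


def estimate_insert_size(pairs, max_span=100_000):
    """Return the median outer span of read pairs mapped to the same contig.

    pairs: iterable of (contig1, start1, end1, contig2, start2, end2).
    max_span: hypothetical cutoff used to discard implausibly long spans.
    Returns None when no usable pair is found.
    """
    spans = []
    for contig1, start1, end1, contig2, start2, end2 in pairs:
        if contig1 != contig2:
            # Pairs that bridge two contigs carry no distance information.
            continue
        span = max(end1, end2) - min(start1, start2)
        if 0 < span <= max_span:
            spans.append(span)
    return median(spans) if spans else None


# Made-up coordinates for illustration only.
example_pairs = [
    ("contig1", 100, 200, "contig1", 5000, 5100),
    ("contig1", 300, 400, "contig1", 5400, 5500),
    ("contig1", 50, 150, "contig2", 10, 110),
    ("contig2", 1000, 1100, "contig2", 5900, 6000),
]
print(estimate_insert_size(example_pairs))
```

The median is used here rather than the mean because a small number of mis-mapped or chimeric pairs would otherwise skew the estimate.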
Using the Illumina Nextera Mate Pair Sample Preparation Kit (Illumina, San Diego, CA), libraries can be constructed from as little as 1 µg of genomic DNA (gDNA) using the Nextera transposase to fragment DNA and tag the molecules with known sequences (a process known as tagmentation). However, these libraries tend to have a broad insert size distribution, which can range from 1 kb to 12 kb (Supplementary Figure S2). As a result, many labs employ gel-based size selection to generate specific insert sizes that can be supplied to the scaffolding algorithm, thereby simplifying the scaffolding step. Semi-automated gel approaches such as BluePippin (Sage Science, Beverly, MA) improve this process but limit throughput to four libraries at a time and use more input DNA. Constructing 4 LMP libraries could require >18 µg of DNA, and if insert sizes >10 kb are targeted, each size selection run would last longer than 6 h, meaning that library construction could take up to 3 days to complete (Figure 1). Furthermore, in our experience, it is hard to predict how a specific DNA sample will perform in a tagmentation reaction, so more than one reaction is often needed to obtain a specific size. Finally, there can be 10%–20% variance between the targeted and recovered DNA size on a BluePippin.
We optimized the Nextera-based LMP Library Construction kit to maximize fragmentation across the largest possible size range using the minimum amount of input material. Using gDNA isolated from the bread wheat (Triticum aestivum) variety Chinese Spring 42, we performed just 2 Gel Plus tagmentation reactions and subsequent strand displacements to construct 12 LMP libraries. This allows us to construct 60 LMP libraries from 5 samples using a 10-reaction kit. As fragment size in a Nextera reaction is controlled by the ratio of DNA to Nextera enzyme, one reaction was performed with 3 µg of input DNA and another with 6 µg. The two Nextera reactions were then pooled after strand displacement, and the range of fragment sizes was confirmed by analyzing the profiles on an Agilent BioAnalyser 12000 chip (Agilent, Stockport, UK) (Supplementary Figure S1). By using 2 independent tagmentation reactions, we ensured that the material entering size selection ranged from 1.5 kb to >17 kb with a good distribution, allowing us to construct LMP libraries across a wide range of insert sizes.
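The throughput and input DNA figures quoted above follow from simple arithmetic, shown here as a short sketch. The constant and variable names are hypothetical; the values are those stated in the text (2 tagmentation reactions per sample with 3 µg and 6 µg of input, 12 size fractions per sample, and a 10-reaction kit).

```python
# Values taken from the text; the names are illustrative only.
REACTIONS_PER_SAMPLE = 2      # tagmentation reactions pooled per sample
FRACTIONS_PER_SAMPLE = 12     # size fractions, each giving one LMP library
KIT_REACTIONS = 10            # reactions available in one kit
INPUT_DNA_UG = (3, 6)         # input DNA for the two reactions, in micrograms

samples_per_kit = KIT_REACTIONS // REACTIONS_PER_SAMPLE
libraries_per_kit = samples_per_kit * FRACTIONS_PER_SAMPLE
dna_per_sample_ug = sum(INPUT_DNA_UG)
dna_per_library_ug = dna_per_sample_ug / FRACTIONS_PER_SAMPLE

print(f"Samples per kit: {samples_per_kit}")
print(f"Libraries per kit: {libraries_per_kit}")
print(f"Input DNA per sample: {dna_per_sample_ug} ug")
print(f"Input DNA per library: {dna_per_library_ug} ug")
```

Under these figures, each sample consumes 9 µg of DNA, which is consistent with the <10 µg requirement, and each kit supports 5 samples and 60 libraries.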
Size selection was performed on a Sage Science Electrophoretic Lateral Fractionator (SageELF), which is unique in its ability to simultaneously isolate 12 discrete size fractions from a single sample loading. The pooled, strand-displaced reactions were loaded onto a 0.75% cassette, which was configured to separate the sample for 3 h 30 min and then elute 12 fractions over 35 min. After size selection, the size of each of the 12 isolated fractions was measured on an Agilent BioAnalyser 12000 chip (Figure 2A and Table 1), and the yield was determined using a High Sensitivity Qubit Assay (Thermo Fisher, Cambridge, UK) (Supplementary Table S1).