Advances in sequencing technologies have dramatically reduced the cost of producing high-quality draft genomes. However, these draft genomes still contain many contigs and possibly misassembled regions. Improving their quality requires an efficient and economical means to close gaps and resequence some regions. Sequencing pooled gap-region PCR products with Pacific Biosciences (PacBio) technology provides a significantly less expensive means of meeting this need. We have developed a genome improvement pipeline based on this strategy after decreasing a loading bias against larger PCR products in the PacBio process. Compared with Sanger technology, this approach is not only cost-effective but can also close gaps greater than 2.5 kb in a single round of reactions and sequence through high-GC regions as well as difficult secondary structures such as small hairpin loops.
Second-generation sequencing technologies produce more and more draft genomes at an ever faster speed and lower cost. However, finished, high-quality genomes are still preferred by researchers (1). Closing gaps in a draft genome is necessary to improve the quality of the final genome sequence. Picking primers at gap regions for PCR and assembling the resulting PCR sequences into the genome can reduce the numbers of both contigs and scaffolds. With the advent of less expensive sequencing technologies, Sanger sequencing (2, 3) of individual PCR products spanning targeted regions has become expensive relative to the cost of the draft itself. Pooling dozens of PCR products of various sizes and sequencing them as one library with single-molecule sequencing technology from PacBio is a more economical option (4, 5). However, there is a loading bias against large DNA fragments in the PacBio sequencing process. The PacBio technique performs single-molecule sequencing in wells on a chip called a single-molecule, real-time (SMRT) cell. Smaller PCR products load into the PacBio wells with greater efficiency than larger PCR products. When PCR products ranging from 500 bp to 5 kb are pooled and sequenced together using PacBio, the smaller products receive substantially higher coverage than the larger products, resulting in poor-quality or incomplete sequences for the larger PCR products. To address this problem, we adjusted the molar ratio of the PCR products when pooling them, based on PCR product size and concentration. This resulted in a much more even distribution of coverage across the different sizes of PCR products. We used a finished genome to normalize our process, and we chose 18 PCR primer pairs with amplification sizes ranging from 500 bp to 5 kb.
Amplifications were performed using two commercial kits: FailSafe PCR System (Epicentre, Madison, WI, USA) for genomes with mid-range GC content (40%–60%) and GC-Rich PCR System (Roche, Indianapolis, IN, USA) for genomes with GC content higher than 60%. PCR products were cleaned individually (ZR 96 DNA Clean and Concentrator, Zymo Research, Irvine, CA, USA) and pooled PCRs were purified again (Agencourt AMPure XP, Beckman Coulter, Brea, CA, USA). The PCR products were pooled into three groups for three different approaches: group one was our control, where we loaded equal DNA mass for each PCR product; for group two, we pooled the PCR products in equimolar amounts; for group three, we adjusted the molar ratio based on the size of the PCR product, increasing the molar amount with size. The results can be seen in Figure 1. The control group resulted in much higher coverage for the smaller PCR products, while the longer PCR products were barely covered (shown in the blue bars in Figure 1). The second group showed improved coverage for the larger products, but it was still lower than the coverage for the smaller products (shown in the red bars in Figure 1). The third group showed a dramatic improvement in coverage for the larger products (shown in the green bars in Figure 1). For group three, the formula below was used to make the molar adjustment and to calculate the volume needed for each PCR product and for robotic pooling:
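As an illustration only, and not the authors' formula, the three pooling schemes can be expressed as one size-weighted calculation. The sketch below assumes an average mass of about 660 g/mol per base pair of double-stranded DNA, and it introduces hypothetical parameters (`base_fmol`, `ref_bp`, `exponent`) that do not come from the paper. Under these assumptions, an exponent of -1 corresponds to equal mass (group one), 0 to equimolar pooling (group two), and a positive value to a molar amount that increases with size (group three).

```python
# Illustrative sketch only: the weighting scheme and all default values are
# assumptions, not the formula used in the paper.
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of double-stranded DNA


def fmol_per_ul(conc_ng_per_ul, length_bp):
    """Convert a dsDNA concentration from ng/uL to fmol/uL."""
    return conc_ng_per_ul * 1e6 / (length_bp * AVG_BP_MASS)


def pooling_volume_ul(conc_ng_per_ul, length_bp,
                      base_fmol=10.0, ref_bp=500, exponent=1.0):
    """Volume to pipette so the molar amount scales as (length/ref_bp)**exponent.

    exponent = -1 -> equal mass for every product
    exponent =  0 -> equimolar pooling
    exponent >  0 -> more molecules for longer products (hypothetical weighting)
    """
    target_fmol = base_fmol * (length_bp / ref_bp) ** exponent
    return target_fmol / fmol_per_ul(conc_ng_per_ul, length_bp)


if __name__ == "__main__":
    # Hypothetical products: (name, length in bp, concentration in ng/uL)
    products = [("gap_01", 500, 20.0), ("gap_02", 2500, 15.0), ("gap_03", 5000, 12.0)]
    for name, length_bp, conc in products:
        vol = pooling_volume_ul(conc, length_bp)
        print(f"{name}: {vol:.2f} uL")
```

In practice the exponent and base amount would have to be tuned against observed coverage, as was done here with a finished genome.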
We have combined over 200 PCR products into one pool and the process produced good coverage for the products. Since one SMRT cell can produce 0.5 gigabases of data after filtering to remove adapters, the process described in this paper provides us with an efficient method for potentially pooling 500–1000 PCR products into one sequencing library, depending on the sizes of the PCR products. By decreasing the loading bias against larger PCR products in the PacBio technology, we have developed a much more efficient and economical method to close gaps in draft genomes since larger gaps can be closed with the PacBio technology.
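The pooling capacity quoted above can be checked with simple arithmetic. The sketch below is a rough estimate that assumes the filtered yield is spread evenly over all products and that the mean product size is 2750 bp (the midpoint of the 500 bp to 5 kb range, an assumption rather than a reported value).

```python
def mean_coverage(yield_bases, n_products, mean_product_bp):
    """Average fold coverage per product if the yield is spread evenly."""
    return yield_bases / (n_products * mean_product_bp)


SMRT_YIELD = 0.5e9       # bases per SMRT cell after adapter filtering (from the text)
MEAN_PRODUCT_BP = 2750   # assumed mean PCR product size

if __name__ == "__main__":
    for n in (200, 500, 1000):
        cov = mean_coverage(SMRT_YIELD, n, MEAN_PRODUCT_BP)
        print(f"{n} products: ~{cov:.0f}x mean coverage")
```

Under these assumptions, even 1000 pooled products would average roughly 180-fold coverage, although the real coverage per product depends on how well the loading bias is corrected.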
We applied this gap closure method to 16 diverse bacterial genome projects in our genome improvement pipeline. We selected primers for 362 regions in these 16 projects and sequenced the resulting products with both Sanger and PacBio technologies. The gap sizes ranged from 500 bp to 5 kb. While the majority of gaps smaller than 2.5 kb were closed with both Sanger (64%) and PacBio (73%) technologies, none of the gaps larger than 2.5 kb were closed with a single round of Sanger sequencing (Table 1). PacBio sequencing of the PCR products closed almost 90% of these larger gaps. This method also allows the closure of gaps caused by small hairpin structures (typically with higher GC content), called hard stops, since PacBio can successfully sequence through these regions whereas other sequencing technologies usually fail. Because one of our goals is to reduce costs, we pooled more than 200 PCR products in a single PacBio SMRT cell for sequencing. To assemble the PacBio sub-reads (a sub-read is the portion of a read that remains after screening for and removing sequencing adapters from the middle of the read) into an accurate consensus for a single PCR product, we pulled out from the pool of sequenced sub-reads only those that belonged to that PCR product. We developed computational scripts that query our local database for the primer sequences and an additional 150-nucleotide unique sequence next to each primer from the draft assembly, and then use BLAST (6) to fish out the sub-reads that belong to a particular PCR product and, therefore, a particular gap. This is especially necessary for repeat gaps, so that slight differences between the repeats can be resolved correctly. Because the error rate of PacBio sequencing is typically reported to be about 15%, we also needed to increase the accuracy of the consensus we obtained.
By choosing the 200 sub-reads with the highest sequence matches to the primer-plus-150-nt unique sequences, we were able to dramatically improve the quality of the PCR product consensus sequences after assembling the selected sub-reads for an individual PCR product using Allora, the PacBio long-read assembler for de novo assembly. For the smaller gaps where the missing sequences were resolved by both Sanger and PacBio technologies, 91% of the PacBio consensus sequences matched the Sanger sequences with 98% identity or better. To maintain this accuracy rate for the larger PCR products, we found that we needed to increase the number of sub-reads to 300. We did not see a significant difference in the results based on the GC content of the genomes. For genomes with mid-range GC content (40%–60%), 78% of 51 PCRs closed the gap. For genomes with high GC content (>60%), 86% of 311 PCRs closed the gap. We continue to investigate ways to get the highest accuracy possible from the PacBio PCR consensus sequences for the larger gaps.
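The sub-read selection step can be sketched as follows. This is a minimal illustration, not the authors' scripts: it assumes BLAST was run with the bait sequences (primer plus 150 nt of unique flanking sequence) as queries against the pooled sub-reads, with the standard 12-column tabular output (`-outfmt 6`), and it ranks sub-reads by bit score. The function name and the bait identifiers are hypothetical.

```python
import csv
from collections import defaultdict


def top_subreads(blast_tsv_lines, n_top=200):
    """Return {bait_id: [subread_id, ...]} with the n_top best hits per bait.

    Expects BLAST tabular rows (outfmt 6): qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = defaultdict(dict)  # bait -> {subread: best bit score seen}
    for row in csv.reader(blast_tsv_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        bait, subread, bitscore = row[0], row[1], float(row[11])
        if bitscore > best[bait].get(subread, 0.0):
            best[bait][subread] = bitscore
    # Rank each bait's sub-reads by bit score, highest first.
    return {
        bait: sorted(hits, key=hits.get, reverse=True)[:n_top]
        for bait, hits in best.items()
    }


if __name__ == "__main__":
    # Hypothetical file name; one list of sub-read IDs per bait is printed.
    with open("baits_vs_subreads.tsv") as handle:
        for bait, reads in top_subreads(handle, n_top=200).items():
            print(bait, len(reads))
```

The selected sub-read IDs would then be used to extract sequences for assembly; for larger products, `n_top` could be raised to 300 as described above.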
We thank Beverly Parson Quintana and Yuliya A. Kunde for supporting this work with the PCR and PacBio sequencing processes. The work is funded by the DOE Joint Genome Institute through contract W-7405-ENG-36.
The authors have a patent pending for the technology described in this paper.
Address correspondence to either Xiaojing Zhang or Cliff S. Han. The Genome Science Group, Bioscience Division, Mail stop 888, Los Alamos National Laboratory, Los Alamos, NM 87545, USA. E-mail: