Department of Biology, Institute of Pharmacy and Molecular Biotechnology, Heidelberg University, Heidelberg, Germany
Microsatellite sequences are important markers for population genetics studies. In the past, the development of adequate microsatellite primers has been cumbersome. However, with the advent of next-generation sequencing technologies, marker identification in genomes of non-model species has been greatly simplified. Here we describe microsatellite discovery on a Pacific Biosciences single-molecule real-time sequencer. For the Greater White-fronted Goose (Anser albifrons), we identified 316 microsatellite loci in a single genome shotgun sequencing experiment. We found that the capability of handling large insert sizes and high-quality circular consensus sequences provides an advantage over short-read technologies for primer design. Combined with a straightforward amplification-free library preparation, PacBio sequencing is an economically viable alternative for microsatellite discovery and subsequent PCR primer design.
Microsatellites are important and highly informative markers for population genetics, evolutionary biology, and ecology. However, developing sufficient microsatellite markers for population genetics studies of non-model organisms has been laborious. The first high-throughput attempts to identify microsatellite markers were performed using 454 pyrosequencing of genomic shotgun libraries or microsatellite-enriched libraries (1-3). Approaches using ultra-short read sequencing for microsatellite identification have also been explored (4, 5). Typically, these cloning-free approaches are performed at low genome coverage, employing relatively short sequence reads and often recovering small fractions of the target genome. Shotgun genome sequencing for microsatellite discovery is generally less biased, but potential markers are typically sequenced only once, inherently linking the reliability of the sequence to the single-pass accuracy of the sequencing chemistry. In contrast, microsatellite-enriched libraries are biased by the choice of capture probe, restriction enzyme, or PCR amplification steps. Pacific Biosciences (Menlo Park, CA) has developed a single-molecule, real-time (SMRT) DNA sequencing system, the PacBio RS (6). Although SMRT sequencing has a single-pass accuracy of ~85%, the sequencing library format (SMRTbell) allows multi-pass sequencing of the same circular template, thereby generating highly accurate circular consensus sequencing (CCS) data reaching >99.999% accuracy (7).
This study reports the use of single molecule consensus sequencing using the Pacific Biosciences RS for microsatellite discovery. The advantage over other next-generation sequencing systems is the random error model and the capability for sequencing the same library molecule several times, thereby generating high quality consensus sequences. The relatively long library molecules in combination with high consensus accuracy are excellent templates for primer design as the reads cover larger flanking regions of the identified microsatellites compared to Illumina, 454 or Ion Torrent sequencing reads.
Figure 1 shows the genomic shotgun approach for microsatellite discovery using the PacBio real-time sequencing system, which is based on the CCS approach. CCS thus combines unbiased random shotgun sequencing with high consensus coverage of the template molecule. We therefore explored the usability of CCS reads for microsatellite discovery and subsequent primer design.
Material and methods
Genomic DNA from a single Anser albifrons individual was sheared to approximately 3 kb fragments for SMRTbell template library generation. Sequencing was performed on a PacBio RS using C2/C2 chemistry (movie time 90 min) by GATC Biotech (Konstanz, Germany).
The CCS reads were subjected to microsatellite analysis and primer design using msatcommander (v1.08) (8) with a threshold of at least five repetitions for di-, tri-, tetra-, penta-, and hexa-nucleotide repeats, excluding mononucleotide repeats. Primer design parameters were: product size 90–500; primer: max size 25 / min size 18, min Tm 47°C, max Tm 63°C, min GC 40%, max GC 60%, and msatcommander option “combine loci”. PCR was performed in a 25 µL reaction mix containing: 1 × complete buffer (2.0 mM MgCl2; Bioron Diagnostics, Ludwigshafen, Germany), 10 pM of each primer, 0.5 mM of each dNTP, except dATP (0.25 mM), and 0.25 mM of radiolabeled (33P-α)-dATP, 1 unit of Taq polymerase, and 50–100 ng of template DNA. The following PCR protocol was employed: (1) initial denaturation, 5 min at 94°C; (2) 35 cycles of 45 s of denaturation at 94°C, 60 s of annealing at 52°C, and 2 min of extension at 72°C; (3) final 10 min extension at 72°C. Samples were genotyped by autoradiography in 6% acrylamide: N,N’-methylenebisacrylamide 24:1 denaturing gels using X-ray Hyperfilm (Kodak, Taufkirchen, Germany). The autoradiograms were analyzed by eye and scored. Filtered subreads and CCS reads were generated using the PacBio SMRT analysis software (v1.3.1). The filtered subreads were mapped to the complete mitochondrial genome of A. albifrons (GenBank: NC_004539.1) using the BWA Smith-Waterman Aligner (BWA-SW; v0.6.2) with mapping parameters “-b5 -q2 -r1 -z20” (9).
Results and discussion
After quality filtering of zero-mode waveguides (ZMW), the run yielded 16,180 reads with 43 Mb of sequence data. The reads were further split into 31,200 subreads, with subreads ranging from 1 to 41 per insert (mean: 10, median: 7.5). For 1,457 of the inserts with at least 3 full subreads, a CCS read could be generated (length mean: 1,867, median: 1,861), translating into 2.72 Mb of sequence data. The vast majority of the reads had an average predicted error rate of <1%, corresponding to a Phred score above 20 (Figure 2) (10).
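The relationship between a predicted error rate and a Phred score (10) can be made explicit with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and averaging per-base error probabilities is an assumption about how a read-level "average predicted error rate" might be derived, not a description of what the SMRT analysis software does.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Phred quality Q = -10 * log10(P) for an error probability P."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Inverse mapping: error probability P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def mean_error_rate(qualities):
    """Average predicted error rate of a read from per-base Phred scores.

    Assumption: probabilities are averaged, not the Q values themselves,
    because Q is a logarithmic scale.
    """
    probs = [error_from_phred(q) for q in qualities]
    return sum(probs) / len(probs)


if __name__ == "__main__":
    # A 1% error rate corresponds to Phred 20; a ~15% single-pass error
    # rate corresponds to a much lower score.
    print(phred_from_error(0.01))
    print(phred_from_error(0.15))
```

Under this mapping, the quoted single-pass accuracy of ~85% sits near Phred 8, which shows how much the consensus over several passes improves on a single pass.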
In 281 CCS reads, 316 microsatellites were identified, and 251 flanking PCR primer pairs could be designed. The distribution of putative target loci consisted of 213 di-, 50 tri-, 28 tetra-, and 25 penta-nucleotide motifs. Of the microsatellite-containing reads, 255 contained a single microsatellite and 26 reads contained 2 to 5 motifs. This is equivalent to a (combined) locus-to-primer conversion of ~90%. This yield is higher than the values obtained from 454 Titanium or Illumina shotgun sequencing, which usually show conversion of 40%–60% of the loci due to read length constraints and depending on stringency of analysis (2, 4, 5, 11).
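To illustrate the detection criterion behind these counts (at least five tandem repetitions of a di- to hexa-nucleotide unit, mononucleotide runs excluded), here is a minimal regular-expression sketch. It is not msatcommander's implementation; the function names are hypothetical, and the test sequence in the usage lines is invented for illustration.

```python
import re

MIN_REPEATS = 5          # threshold stated in the methods
UNIT_SIZES = range(2, 7)  # di- to hexa-nucleotide units


def is_reducible(unit: str) -> bool:
    """True if the unit is itself a tandem repeat of a shorter motif
    (e.g. 'AA' or 'ACAC'), so it should not be reported at this size."""
    k = len(unit)
    return any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k))


def find_microsatellites(seq: str, min_repeats: int = MIN_REPEATS):
    """Return sorted (start, end, unit, n_repeats) tuples, 0-based half-open."""
    seq = seq.upper()
    hits = []
    for k in UNIT_SIZES:
        # One captured unit followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_reducible(unit):
                continue  # drops mononucleotide runs and nested shorter motifs
            hits.append((m.start(), m.end(), unit, (m.end() - m.start()) // k))
    return sorted(hits)


def flank_lengths(hit, read_length: int):
    """Bases available left and right of a repeat for primer placement."""
    start, end, _unit, _n = hit
    return start, read_length - end


if __name__ == "__main__":
    read = "GGTT" + "AC" * 6 + "GGTT"  # invented example sequence
    for hit in find_microsatellites(read):
        print(hit, flank_lengths(hit, len(read)))
```

The flank calculation reflects why long reads matter here: a repeat near the end of a short read leaves too little sequence on one side for a primer, whereas a repeat in a ~1.8 kb CCS read usually has ample sequence on both sides.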
For CCS reads containing more than one motif and generating several primer pairs, we chose the most promising target for PCR amplification, based on its microsatellite motif characteristics. Although birds are known to have a lower microsatellite density than other vertebrates (12, 13), a reasonable number of loci could be extracted from our sequence data. These higher primer yields from relatively few reads are likely the result of the long CCS reads, which have a higher chance of containing microsatellites and also leave enough flanking region for subsequent primer design. The random error model of PacBio sequencing (14) gives the CCS reads a high accuracy at three or more circular sequencing passes, which makes them favorable in this context compared to reads from 454 and Illumina sequencing with their sequence-specific error models (15, 16).
We tested the primer pairs that msatcommander generated from the CCS reads on 10 individuals from 3 goose species: A. albifrons, A. anser, and A. erythropus. Our first primer evaluation showed successful PCR products for A. anser from 48 of 50 pairs, and 46 of 50 pairs amplified the desired locus for both A. albifrons and A. erythropus. The autoradiograms were analyzed by eye and scored. To evaluate if the frequency of the observed genotypes is higher than expected under genetic equilibrium, genotypic linkage disequilibrium per pair of loci was tested using the software Genepop (v4.1) (17); the results indicated no linked loci (P < 0.05). A thorough analysis of the microsatellite markers will be presented in a separate publication (Frias Soler et al., manuscript in preparation).
Apart from microsatellite discovery, we also analyzed how many reads matched mitochondrial sequences and found that 59 subreads could be mapped to the mitochondrial genome of A. albifrons. The reads covered 12,957 bp out of 16,737 bp from the published reference sequence (77.42%). In principle, one could co-retrieve the complete mitochondrial genome from a single SMRT-cell, at least in a draft stage, to be improved later.
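The coverage figure above is a breadth-of-coverage value: the number of reference positions touched by at least one alignment, divided by the reference length. A minimal sketch of that calculation follows, assuming the alignments are available as half-open (start, end) intervals on the reference; the function names and the intervals in the usage lines are hypothetical, not data from this study.

```python
def covered_bases(intervals):
    """Length of the union of half-open [start, end) alignment intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def breadth_of_coverage(intervals, reference_length):
    """Fraction of the reference covered by at least one alignment."""
    return covered_bases(intervals) / reference_length


if __name__ == "__main__":
    # Hypothetical alignment intervals on a 100 bp reference.
    print(breadth_of_coverage([(0, 10), (5, 20), (30, 40)], 100))
    # The reported values: 12,957 covered bases of 16,737 bp.
    print(round(100 * 12957 / 16737, 2))
```

Merging intervals before summing matters because overlapping subreads would otherwise be counted more than once and inflate the covered fraction.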
In conclusion, we show that, even with a small fraction of an avian target genome, one can generate enough primer pairs for microsatellite loci to perform population genetics studies. The unique combination of randomly sub-sampling a small fraction of the genome and long high-quality CCS reads is advantageous for primer design over short-read technologies that allow only single-pass reads at low genome coverages. For technical reasons, our library had an average insert size of 1.8 kb and was slightly above the optimal range for the C2 chemistry read length, yielding only a few CCS reads from the longest fraction of the reads. As CCS accuracy is correlated with the number of sequencing passes of the template molecule, shorter library molecules would result in a higher number and better quality of the CCS reads (7).
In light of current sequencing chemistry and system upgrades (RS II; P5 polymerase / C3 chemistry) with average read lengths of ~8,500 bases, our library would have been covered by at least 3–4 sequencing passes. A single sequencing run costing ~$600 (including library preparation) should therefore generate a minimum of 30,000 CCS reads, a 20-fold increase compared with our study, yielding ~6,500 potential loci even in microsatellite-poor bird genomes. Several price calculations for different PacBio sequencing chemistry combinations can be found in the recent study by Koren et al. (18). This puts the cost of microsatellite discovery using PacBio sequencing between that of 454 and Illumina sequencing (4). Our approach provides a useful alternative when cost reduction using multiplexing is not practicable, and microsatellite array length is to be determined directly from the data for prioritized locus testing.
Author contributions
Markus A. Grohme designed the study, performed data analysis, and wrote the manuscript. Roberto Frias Soler performed PCR experiments. Michael Wink provided the sample and was involved in writing the manuscript. Marcus Frohme provided funding, aided in writing the manuscript, and supervised the work.
This work was funded by the Ministry of Science, Research and Culture (MWFK) of the federal state of Brandenburg (Germany) in the program “Knowledge and Technology Transfer for Innovation” (FKZ 80143246 / GenoSeq) based on the European Fund of Regional Development (EFRE).
The authors declare no competing interests.
Address correspondence to Marcus Frohme, Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany.
1. Abdelkrim, J., B. Robertson, J.-A. Stanton, and N. Gemmell. 2009. Fast, cost-effective development of species-specific microsatellite markers by genomic sequencing. Biotechniques 46:185-192.
2. Malausa, T., A. Gilles, E. Meglécz, H. Blanquart, S. Duthoy, C. Costedoat, V. Dubut, N. Pech. 2011. High-throughput microsatellite isolation through 454 GS-FLX Titanium pyrosequencing of enriched DNA libraries. Mol. Ecol. Resour. 11:638-644.
3. Santana, Q., M. Coetzee, E. Steenkamp, O. Mlonyeni, G. Hammond, M. Wingfield, and B. Wingfield. 2009. Microsatellite discovery by deep sequencing of enriched genomic libraries. Biotechniques 46:217-223.
4. Jennings, T.N., B.J. Knaus, T.D. Mullins, S.M. Haig, and R.C. Cronn. 2011. Multiplexed microsatellite recovery using massively parallel sequencing. Mol. Ecol. Resour. 11:1060-1067.
5. Castoe, T.A., A.W. Poole, A.P.J. de Koning, K.L. Jones, D.F. Tomback, S.J. Oyler-McCance, J.A. Fike, S.L. Lance. 2012. Rapid microsatellite identification from Illumina paired-end genomic sequencing in two birds and a snake. PLoS ONE 7:e30953.
6. Eid, J., A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank. 2009. Real-time DNA sequencing from single polymerase molecules. Science 323:133-138.
7. Travers, K.J., C.-S. Chin, D.R. Rank, J.S. Eid, and S.W. Turner. 2010. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 38:e159.
8. Faircloth, B.C. 2008. msatcommander: detection of microsatellite repeat arrays and automated, locus-specific primer design. Mol. Ecol. Resour. 8:92-94.
9. Li, H., and R. Durbin. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589-595.
10. Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186-194.
11. Lepais, O., and C.F.E. Bacles. 2011. Comparison of random and SSR-enriched shotgun pyrosequencing for microsatellite discovery and single multiplex PCR optimization in Acacia harpophylla F. Muell. Ex Benth. Mol. Ecol. Resour. 11:711-724.
12. Primmer, C.R., T. Raudsepp, B.P. Chowdhary, A.P. Møller, and H. Ellegren. 1997. Low frequency of microsatellites in the avian genome. Genome Res. 7:471-482.
13. Meglécz, E., G. Nève, E. Biffin, and M.G. Gardner. 2012. Breakdown of phylogenetic signal: a survey of microsatellite densities in 454 shotgun sequences from 154 non model eukaryote species. PLoS ONE 7:e40861.
14. Carneiro, M.O., C. Russ, M.G. Ross, S.B. Gabriel, C. Nusbaum, and M.A. DePristo. 2012. Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics 13:375.
15. Gilles, A., E. Meglécz, N. Pech, S. Ferreira, T. Malausa, and J.-F. Martin. 2011. Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics 12:245.
16. Nakamura, K., T. Oshima, T. Morimoto, S. Ikeda, H. Yoshikawa, Y. Shiwa, S. Ishikawa, M.C. Linak. 2011. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 39:e90.
17. Rousset, F. 2008. genepop'007: a complete re-implementation of the genepop software for Windows and Linux. Mol. Ecol. Resour. 8:103-106.
18. Koren, S., G.P. Harhay, T.P. Smith, J.L. Bono, D.M. Harhay, S.D. McVey, D. Radune, N.H. Bergman, and A.M. Phillippy. 2013. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 14:R101.