Method for improving sequence coverage uniformity of targeted genomic intervals amplified by LR-PCR using Illumina GA sequencing-by-synthesis technology
 
Olivier Harismendy and Kelly A. Frazer
Scripps Genomic Medicine, Scripps Translational Science Institute, Scripps Research Institute, La Jolla, CA, USA
BioTechniques, Vol. 46, No. 3, March 2009, pp. 229–231
Abstract

One approach for high-throughput population-based sequencing of targeted intervals in the human genome is to amplify the regions using long-range PCR (LR-PCR) followed by sequencing with next-generation sequencing (NGS) technologies. Utilizing this method, we have observed that the 50 bp located at the amplicon ends account for more than 50% of the sequenced bases and that the sequence coverage depth of base pairs within an amplicon is highly variable. Here we propose an explanation for the overrepresentation of the amplicon ends and show that the use of 5′-blocked primers for the LR-PCR reaction reduces their overrepresentation. Furthermore, we demonstrate that using a 600-bp library insert size rather than the standard 200-bp insert size results in more uniform sequence coverage depth. The capability to increase sequence coverage uniformity greatly improves the effective throughput of NGS platforms.

The use of next-generation sequencing (NGS) platforms for population-based sequencing of targeted genomic intervals will enable the examination of genetic variants across the allele frequency spectrum for association with diseases (1,2). NGS technologies achieve their best base-calling accuracy and variant-finding sensitivity when sequence coverage is high and uniform. Using long-range PCR (LR-PCR) to amplify targeted genomic intervals followed by sequencing on the Illumina Genome Analyzer (GA) (San Diego, CA, USA), we and others (3,4) have noted an overrepresentation of the amplicon ends, and in particular, that the 50 bp located at the amplicon ends can account for more than 50% of the sequenced bases. We also noted that the sequence coverage depth of base pairs within the amplicon (and thus present in equimolar amounts in the starting sample material) is highly variable. This per-base sequencing coverage variability is known to be an important issue in next-generation sequencing and is observed regardless of the organism or the type of input material (5,6,7). These artifacts not only waste sequencing yield but also decrease the expected average coverage depth across the targeted interval, and thereby impact data quality.

Prior to sequencing on the Illumina GA, the LR-PCR amplicons were fragmented to ∼200 bp, ligated to linkers, and amplified through linker-mediated PCR. We reasoned that the overrepresentation of the ends was a result of the sample preparation method, as nucleotides located at amplicon ends are present at the extremities of the 200-bp fragments more frequently than a random internal nucleotide. To avoid overrepresentation of the amplicon ends, we tested the use of 5′-blocked primers in the LR-PCR to prevent their ligation to the linkers.
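The reasoning above can be illustrated with a toy simulation. The following Python sketch is not the analysis used in this study; it assumes random fragmentation with exponentially distributed fragment lengths, size selection to 170–250 bp, one 36-base read from a randomly chosen end of each selected fragment, and that a 5′-blocked primer removes terminal fragments from the library entirely. The function names and all parameter values are hypothetical.

```python
import random


def simulate_coverage(amplicon_len=5000, n_molecules=3000, mean_frag=200,
                      size_range=(170, 250), read_len=36, blocked=False, seed=0):
    """Toy model: per-base read coverage after random fragmentation.

    Assumptions (not from the paper): exponential fragment lengths, one read
    per size-selected fragment, and blocked primers dropping terminal fragments.
    """
    rng = random.Random(seed)
    coverage = [0] * amplicon_len
    for _ in range(n_molecules):
        start = 0
        while start < amplicon_len:
            # Draw the distance to the next cut site along this molecule.
            gap = max(1, int(rng.expovariate(1.0 / mean_frag)))
            end = min(start + gap, amplicon_len)
            terminal = start == 0 or end == amplicon_len
            size_ok = size_range[0] <= end - start <= size_range[1]
            if size_ok and not (blocked and terminal):
                # Sequence read_len bases from one randomly chosen fragment end.
                if rng.random() < 0.5:
                    lo, hi = start, min(start + read_len, end)
                else:
                    lo, hi = max(end - read_len, start), end
                for i in range(lo, hi):
                    coverage[i] += 1
            start = end
    return coverage


def end_enrichment(coverage, end_bp=50):
    """Mean coverage of the terminal end_bp bases relative to the interior."""
    ends = coverage[:end_bp] + coverage[-end_bp:]
    interior = coverage[end_bp:-end_bp]
    return (sum(ends) / len(ends)) / (sum(interior) / len(interior))


if __name__ == "__main__":
    for blocked in (False, True):
        ratio = end_enrichment(simulate_coverage(blocked=blocked))
        print(f"blocked={blocked}: end/interior coverage ratio = {ratio:.2f}")
```

In this model every molecule contributes a fragment boundary at the amplicon termini, whereas an internal position is a boundary only when a cut happens to fall there, so reads starting at the termini are enriched. The model reproduces the direction of the effect only; it is not calibrated to the magnitude observed here.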

Six genomic intervals (size range 3129–10,989 bp) were amplified from DNA sample NA17460 obtained from the Coriell Institute for Medical Research (Camden, NJ, USA). We performed the LR-PCR reactions using 30 ng of genomic DNA, 0.5 µM forward LR-PCR primers, and 0.5 µM reverse LR-PCR primers in a total reaction volume of 12 µL, as described (8). The primers were ordered from IDT Technologies (Coralville, IA, USA) without modification (unblocked) or with a 5′ modification of either Amino Modifier C6 (NH2-blocked) or C3 spacer (C3-blocked). Following LR-PCR, the six amplicons generated with one type of primer were quantified and combined in equimolar amounts prior to fragmentation and purification. The Illumina GA libraries were prepared according to the manufacturer's instructions except for the following steps. The fragmentation was performed enzymatically using 1 µg of the equimolar pooled amplicons incubated for 25 min at 37°C with 0.05 U of DNase I, resulting in digestion to the 170- to 250-bp fragment size range. Modified adaptors were used in order to add a 4-nucleotide barcode at the 5′ end of the library fragments as described by Craig et al. (9) with the following modifications: a 4-bp barcode with two constant bases and two variable bases (CNNT) was used; and both oligonucleotides were mixed at 100 µM in TE pH 8.0, heated for 5 min at 95°C, and annealed by slowly cooling to 4°C over 12 h. Each of the three libraries [corresponding to the different LR-PCR primer types (unblocked, NH2-blocked, or C3-blocked) used to generate the amplicons] received a different index. Following adaptor ligation, we selected a fragment size of ∼200 bp by gel extraction and enriched for adaptor-ligated fragments using manufacturer primer sequences and the following PCR conditions: 18 cycles of 30 s at 98°C, 20 s at 65°C, 15 s at 72°C, 15 s at 72°C; and 5 min at 72°C.
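Because the mass of a double-stranded DNA molecule scales with its length, an equimolar pool requires a mass of each amplicon proportional to its size. The arithmetic can be sketched in Python as follows. Only the smallest and largest amplicon sizes (3129 and 10,989 bp) are taken from the text; the four intermediate sizes are hypothetical placeholders, and the value of roughly 650 g/mol per base pair is a commonly used approximation rather than a measured quantity.

```python
# Approximate average molar mass of one double-stranded base pair (assumption).
BP_MASS_G_PER_MOL = 650.0


def equimolar_masses_ng(lengths_bp, total_ng):
    """Split total_ng so that every amplicon contributes the same number of moles."""
    total_len = sum(lengths_bp)
    return [total_ng * length / total_len for length in lengths_bp]


def picomoles(mass_ng, length_bp):
    """Convert a dsDNA mass in ng to pmol for a molecule of length_bp."""
    return mass_ng * 1000.0 / (length_bp * BP_MASS_G_PER_MOL)


if __name__ == "__main__":
    # First and last sizes are from the text; the middle four are hypothetical.
    lengths = [3129, 4500, 6200, 7800, 9400, 10989]
    masses = equimolar_masses_ng(lengths, total_ng=1000.0)  # 1 µg pool
    for length, mass in zip(lengths, masses):
        print(f"{length:>6} bp: {mass:7.1f} ng = {picomoles(mass, length):.4f} pmol")
```

Each amplicon ends up at the same molar amount, while the longest amplicon contributes the most mass to the 1 µg used for fragmentation.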
We pooled the three indexed libraries generated from the different primer types and sequenced the library pools on two different lanes of the flow cell, following the manufacturer's instructions for cluster generation and sequencing-by-synthesis of single ends for 40 cycles. We used the Illumina Genome Analyzer Pipeline Version 0.2 software with default signal quality filters (chastity of base signals >0.6 within the first 12 bases of the read) to identify reads that passed filters (PF; read files are available upon request). The PF reads were split according to their corresponding indexes, and the 4-bp index was then removed. The remainder of each read was aligned to the reference sequence (six LR-PCR amplicon sequences from NCBI36) using the MAQ (mapping and assembling with qualities) algorithm (10). Poor-quality bases (
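The index-splitting step described above can be sketched as follows. This is a minimal Python illustration, not the script used in this study. It assumes that the 4-bp index occupies the first four bases of each 40-base read, leaving 36 bases for alignment. The three index sequences shown are hypothetical examples that fit the CNNT pattern; the actual indexes are not listed here.

```python
import re
from collections import defaultdict

INDEX_LEN = 4
# Barcode design from the text: constant C, two variable bases, constant T.
BARCODE_PATTERN = re.compile(r"^C[ACGT]{2}T$")


def demultiplex(reads, index_to_library):
    """Bin reads by their leading 4-bp index and strip the index.

    Reads whose index is not in index_to_library go to 'unassigned'.
    """
    for index in index_to_library:
        if not BARCODE_PATTERN.match(index):
            raise ValueError(f"index {index!r} does not fit the CNNT design")
    bins = defaultdict(list)
    for read in reads:
        index, insert = read[:INDEX_LEN], read[INDEX_LEN:]
        bins[index_to_library.get(index, "unassigned")].append(insert)
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical index assignment, for illustration only.
    indexes = {"CAAT": "unblocked", "CCGT": "NH2-blocked", "CTCT": "C3-blocked"}
    reads = ["CAAT" + "A" * 36, "CCGT" + "G" * 36, "GGGG" + "T" * 36]
    for library, inserts in demultiplex(reads, indexes).items():
        print(library, len(inserts), "read(s) of", len(inserts[0]), "bases")
```

This sketch requires an exact match to a known index; tolerating sequencing errors in the index would need an explicit mismatch rule, which is not specified here.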

The coverage of the amplicon ends using blocked primers was reduced 6.8-fold on average (range 2.8–9.5) when compared with unblocked primers (Figure 1, A and B), with both types of blocking groups working equally well. We note that the coverage of the amplicon ends is still twice the expected level and that the sequence coverage depth of base pairs within the amplicon shows great variability. We hypothesized that this bias is introduced during the PCR amplification of the library and that using a larger library size would reduce coverage variability and residual overrepresentation of the ends. To test this, we fragmented the amplicon pools down to 600 bp, generated libraries, and sequenced on the Illumina GA as described for the 200-bp fragments. The coverage of the ends of the 600-bp library generated with unblocked primers was reduced 3.6-fold when compared with a 200-bp library (Figure 1, A and B), thus supporting the hypothesis that the overrepresentation of the ends is due to the fragmentation rather than to the sample preparation by LR-PCR. The combined effect of an increased library size and the 5′-blocked primers lowers the coverage of the amplicon ends to near the expected level. In addition to the reduction in sequence coverage of the amplicon ends, the 600-bp library had a 28% reduction in coverage variability across the amplicon compared with the 200-bp library (t-test, P < 0.01, Figure 2, A and B), therefore improving overall coverage uniformity.
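The two quantities compared above can be computed from a per-base coverage vector as in the following Python sketch. The statistic used here to measure coverage variability is not specified in this text, so the coefficient of variation is an assumption, as are the function names and the toy coverage values.

```python
import statistics


def end_overrepresentation(coverage, end_bp=50):
    """Observed share of sequenced bases in the amplicon ends over the expected share.

    A value of 1.0 means the ends are covered exactly as uniform coverage predicts.
    """
    observed = (sum(coverage[:end_bp]) + sum(coverage[-end_bp:])) / sum(coverage)
    expected = 2 * end_bp / len(coverage)
    return observed / expected


def coverage_cv(coverage, end_bp=50):
    """Coefficient of variation of coverage across the amplicon interior."""
    interior = coverage[end_bp:-end_bp]
    return statistics.pstdev(interior) / statistics.mean(interior)


if __name__ == "__main__":
    # Toy vectors (hypothetical): uniform coverage, then ends covered 10-fold deeper.
    uniform = [10] * 1000
    end_heavy = [100] * 50 + [10] * 900 + [100] * 50
    for name, cov in (("uniform", uniform), ("end-heavy", end_heavy)):
        print(name, round(end_overrepresentation(cov), 2), round(coverage_cv(cov), 2))
```

Excluding the terminal 50 bp from the variability measure keeps the two effects separate, so that a change in end coverage does not by itself alter the interior uniformity estimate.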





Our results demonstrate that the efficiency of DNA re-sequencing using LR-PCR amplicons to amplify targeted intervals can be greatly improved by utilizing 5′-blocked primers and a 600-bp library size. Implementation of these simple steps will maximize coverage yield and limit coverage variability when designing targeted re-sequencing experiments using next-generation short read technologies.

Acknowledgments

We would like to thank Karrie Trevarthen for technical assistance, and Kari Ohlsen and Ryan Lister for helpful discussions. We are grateful to John Havens (IDT Technologies) for providing C3-blocked primers. This work is supported by a National Institutes of Health Clinical and Translational Science Award (NIH CTSA; grant no. NIH 1U54RR025204-01). This paper is subject to the NIH Public Access Policy.

The authors declare no competing interests.

References
1.) Bentley, D.R. 2006. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 16:545-552.

2.) Mardis, E.R. 2008. The impact of next-generation sequencing technology on genetics. Trends Genet. 24:133-141.

3.) Parla, J.S., and W.R. McCombie. 2008..

4.) Yeager, M., N. Xiao, R.B. Hayes, P. Bouffard, B. Desany, L. Burdett, N. Orr, C. Matthews. 2008. Comprehensive resequence analysis of a 136 kb region of human chromosome 8q24 associated with prostate and colon cancers. Hum. Genet. 124:161-170.

5.) Ossowski, S., K. Schneeberger, R.M. Clark, C. Lanz, N. Warthmann, and D. Weigel. 2008. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res. 18:2024-2033.

6.) Cronn, R., A. Liston, M. Parks, D.S. Gernandt, R. Shen, and T. Mockler. 2008. Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res. 36:e122.

7.) Hillier, L.W., G.T. Marth, A.R. Quinlan, D. Dooling, G. Fewell, D. Barnett, P. Fox, J.I. Glasscock. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat. Methods 5:183-188.

8.) Frazer, K.A., E. Eskin, H.M. Kang, M.A. Bogue, D.A. Hinds, E.J. Beilharz, R.V. Gupta, J. Montgomery. 2007. A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature 448:1050-1053.

9.) Craig, D.W., J.V. Pearson, S. Szelinger, A. Sekar, M. Redman, J.J. Corneveaux, T.L. Pawlowski, T. Laub. 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods 5:887-893.

10.) Li, H., J. Ruan, and R. Durbin. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851-1858.



