Modern Human Library Amplification
Ten different commercially available polymerases and buffers were used in separate reactions of 15, 25 and 40 cycles per polymerase, for a total of 30 distinct amplifications. All reactions were done using 1 µL of library as template. As the modern human library already contained an indexed P7 adaptor, all amplifications were performed with a uniquely indexed P5 primer, making use of a recently described double indexing scheme (21). The amplification scheme is depicted in Supplementary Figure 1A. Cycling was performed on a DNA Engine Dyad (Bio-Rad, Munich, Germany). Reactions were spin-column purified (MinElute PCR purification kit, Qiagen). The following polymerases and manufacturer's conditions were used:
Herculase II Fusion (Agilent, Waldbronn, Germany) – 1x Herculase Buffer, 0.25 mM of each dNTP (Fermentas, St. Leon-Rot, Germany), 0.25 µM P5 index primer,0.25 µM P7 primer, 0.5 µL Herculase II Fusion polymerase, water to 50 µL. The thermocycling profile was 2 min at 95°C, n cycles of 20s at 95°C, 20s at 60°C and 40s at 72°C followed by a final extension step of 3 min at 72°C.
Phusion Hot Start I and II with HF and GC buffers (Finnzymes, Vantaa, Finland) – 1x HF or GC buffer, 0.2 mM of each dNTP (Fermentas), 0.5 µM P5 index primer, 0.5 µM P7 primer, 2.5 units Phusion or Phusion II polymerase, water to 50 µL. The thermocycling profile was 30s at 98°C, n cycles of 10s at 98°C, 20s at 60°C and 40s at 72°C with a final extension step of 5min at 72°C.
Phusion High Fidelity Master Mix (Finnzymes) – 1x Phusion Master Mix with HF buffer, 0.5 µM P5 index primer, 0.5 µM P7 primer, water to 50 µL. The thermocycling profile was 30s at 98°C, n cycles of 10s at 98°C, 20s at 60°C and 40s at 72°C with a final extension step of 5min at 72°C.
AmpliTaq Gold (Applied Biosystems, Darmstadt, Germany) – 1x Gold Buffer, 2.5 mM MgCl2, 0.2 mM of each dNTP (Fermentas), 0.2 µM P5 index primer, 0.2 µM P7 index primer, 2.5 units AmpliTaq Gold Polymerase, water to 50 µL. The thermocycling profile was 12min at 95°C, n cycles of 20s at 95°C, 30s at 60°C and 40s at 72°C, and a final extension step of 5min at 72°C.
Platinum Taq High Fidelity (Invitrogen, Darmstadt, Germany) – 1x High Fidelity Buffer, 0.2 mM each dNTP (Fermentas), 2 mM MgSO4, 0.2 µM P5 index primer, 0.2 µM P7 primer, 2.5 units Platinum Taq High Fidelity Polymerase, water to 50 µL. The thermocycling profile was 1 min at 94°C, n cycles of 20s at 94°C, 30s at 60°C and 1 min, 10s at 68°C.
Pfu Turbo Cx Hotstart (Agilent) – 1x Pfu Turbo Cx buffer, 0.2 mM each dNTP (Fermentas), 0.2 µM P5 index primer, 0.2 µM P7 primer and 2.5 units Pfu Turbo Cx polymerase. The thermocycling profile was 2 min at 95°C, n cycles of 30s at 95°C, 30s at 60°C and 1 min, 10s at 72°C followed by a final extension step of 10 min at 72°C.
AccuPrime Pfx Polymerase (Invitrogen) – 1x AccuPrime reaction mix, 0.3 µM P5 index primer, 0.3 µM P7 primer and 1.25 units AccuPrime Pfx polymerase. The thermocycling profile was 2 min at 95°C, n cycles of 15s at 95°C, 30s at 60°C and 1 min, 10s at 68°C.
Neandertal Library amplification
Polymerase and PCR conditions remained the same as in the modern human library amplifications, however Phusion II, Phusion Master Mix and Phusion I with GC buffer were excluded from this assay. Reactions were done in replicates of 6 for 26 cycles per reaction. Each reaction contained uniquely indexed P5 and P7 primers (Supplementary Figure 1B). All reactions were purified with the MinElute PCR purification kit (Qiagen).
Quantification and pooling of PCR products
Products amplified from the modern human library for 25 cycles were quantified with a DNA 1000 chip on a BioAnalyzer (Agilent). Amplicons from 15 and 45 cycle reactions had to be quantified differently because of insufficient concentrations or the presence of heteroduplexes. Therefore, one 25-cycle amplicon was taken as representative and a dilution series was created to use as a qPCR standard for quantification of all other amplified libraries from modern human samples. qPCR counts were used to determine dilution factors for each PCR product so that they could be pooled in equimolar ratios for sequencing.
All products amplified from the Neandertal library were quantified on a DNA 1000 chip (Agilent) and pooled in equimolar ratios for sequencing.
Illumina sequencing and sequence analysis
The resulting libraries of pooled PCR product as well as the original, unamplified library were sequenced on separate lanes of 75 cycle double-index paired-end runs on the Illumina GAII platform as described in Kircher et. al. (21) and bases were called with the machine learning algorithm Ibis (22).
For the modern human data, raw sequences were sorted by index and split into forward and reverse sequences and trimmed to 40 bp. These sequences were mapped to the human reference genome (hg19_1000 g) using BWA (23). Mate pairs were filtered for mapping quality ≥ 30, and the start and end coordinates were used to extract the putative insert sequence from the reference genome. The resulting sequences were then subjected to a length filter excluding sequences less than 40 bp and more than 500 bp, and these sequences were used in the subsequent analysis.
For the Neandertal data, sequences were sorted by index and paired-end reads were merged into single sequences to eliminate adapter sequences from short insert reads (24). The sequences were mapped to the human reference genome (hg19–1000 g) using BWA (23). Quality and size filters were then applied to exclude sequences with a mapping quality of < 30, and lengths shorter than 35 bp. An upper length cutoff of 139 bp was applied by using only merged reads for mapping. Of the remaining sequences, a random subsample of 370,000 sequences was taken from each replicate and used in subsequent analysis.
Results and discussion
Length and GC biases in Modern Human Sequencing Libraries
In order to analyze the extent of size and base composition biases introduced into sequencing libraries by different polymerases, we set up a simple amplification-sequencing assay. We first prepared a sequencing library from sheared human genomic DNA. By definition this library is unbiased by amplification. We then amplified ∼2.0x106 molecules of this library using ten commercially available DNA polymerase/buffer systems for 15, 25 and 40 cycles. The polymerase/buffer systems include AccuPrime Pfx (Invitrogen), Herculase II Fusion (Agilent), Pfu Turbo Cx Hotstart (Agilent), Platinum Taq High Fidelity (Invitrogen), AmpliTaq Gold (Applied Biosystems) and Phusion DNA polymerase (Finnzymes). Since Phusion DNA polymerase is recommended for library amplification in many Illumina library preparation protocols, we tested different versions of this enzyme with alternative buffers supplied by the manufacturer. Post-sequencing analysis involved qualityfiltering and lower and upper size cutoffs of 40 bp and 500 bp respectively.
The resulting length and GC distributions are shown in Figure 1. After sequencing, the original input library had a median length of 185 bp and a median GC-content of 39%. After 15 cycles of PCR, the median length of all libraries had decreased, the most extreme being both Phusion and Phusion II in HF buffer, with 41 bp and 38 bp decreases, respectively. The smallest shifts were observed in the Herculase II Fusion library (10 bp decrease) and the Pfu Turbo Cx library (14 bp decrease). Changes in the median GC-content for all libraries were within one percentage point of the original unamplified library.
Trends were consistent after 25 cycles. Herculase II Fusion and Pfu Turbo Cx libraries again had the smallest shifts in median lengths (14 bp and 16 bp decrease, respectively) and Phusion and Phusion II HF libraries had the highest (70 bp decrease for both). GC-content for all libraries was again within 1–2 percentage points of the original.
The inter-quartile range (IQR), which represents the middle 50% of the data, decreased concomitantly with the median in all libraries. Accordingly, Herculase II Fusion libraries posted the smallest reductions after 15 and 25 cycles (14 bp and 21 bp), followed by Pfu Turbo Cx (20 bp and 24 bp). Phusion and Phusion II HF libraries showed the largest IQR reductions across both cycle numbers (80 bp and 74 bp respectively after 25 cycles), indicating not only a reduction, but also a severe narrowing of the length distributions.
Shifts in length and GC-content medians, as well as IQRs after 40 cycles were slightly incongruous with previous trends. Decreases in median lengths and IQRs were exaggerated in some cases. For example, after remaining stable through 25 cycles, the median length of the Pfu Turbo Cx library decreased by 47 bp. In addition, the median length of the Phusion II HF library dropped 116 bp and the IQR decreased enough to fall completely outside that of the original library. 40 cycles was assumed to be well into the plateau phase of amplification, so in order to determine if these inconsistencies were the result of PCR plateau effects or experimental noise, we performed a higher resolution assay using only Herculase II Fusion and Phusion HF, the best and worst performing polymerases at 25 cycles. This time amplifications were performed at 2 cycle intervals from 20 to 40 cycles (Figure 2). It was determined by qPCR that the libraries entered plateau, i.e., no further increase in copy number, after 32 and 34 cycles (Herculase II Fusion and Phusion HF, respectively) (data not shown). Our data indicates that biases did not increase upon entering the plateau phase of PCR. Thus the extremes seen after 40 cycles (Figure 1) are mostly due to stochastic amplification in individual reactions. This finding is particularly relevant for the amplification of libraries prior to target enrichment. Applications like hybridization capture on micro-arrays (8) can require up to 20 µg of template DNA. In such cases it is difficult to avoid amplifying into plateau. Knowing that cycling into plateau does not magnify these biases will ease future amplifications.