Despite some experimental noise, the data from these amplifications show clear trends. Herculase II Fusion and Phusion HF represent the best and worst performers, respectively, of the 10 polymerase-buffer systems tested. In libraries amplified with Phusion HF there is a strong negative correlation between length and cycle number (r = -0.96, P = 1.62 × 10⁻⁶, Pearson) and a clear positive correlation between GC percentage and cycle number (r = 0.97, P = 9.15 × 10⁻⁷, Pearson). Libraries amplified with Herculase II Fusion were comparatively robust to increasing cycle number: there was no correlation between GC percentage and cycle number and only a weak correlation between length and cycle number (r = -0.66, P = 0.02, Pearson). These results show that a large degree of length and GC bias in sequencing libraries can be avoided simply by switching the polymerase-buffer system.
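The correlations reported above are Pearson coefficients relating a library property (fragment length or GC percentage) to cycle number. As a minimal sketch of that calculation, the following Python computes r from two equal-length series. The function name `pearson_r` and the input values are hypothetical and chosen only for illustration; they are not data from this study, and the P values are not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustration values (NOT measurements from this study):
# number of PCR cycles and mean fragment length of the resulting library.
cycles = [0, 5, 10, 15, 20, 25]
mean_length = [160.0, 158.0, 155.0, 151.0, 148.0, 144.0]

r = pearson_r(cycles, mean_length)
print(f"r = {r:.2f}")  # strongly negative for this toy series
```

A coefficient near -1 for length against cycle number, as in this toy series, corresponds to the pattern described for Phusion HF, in which longer fragments are progressively lost as cycling continues.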
Length and GC biases in Ancient DNA Libraries
Part of the motivation for this study was to eliminate the strong length and GC biases generated when amplifying ancient DNA (aDNA) libraries. These biases inflate the sequencing depth, and thus the cost, required to thoroughly sequence such libraries. In fact, AmpliTaq Gold and Phusion polymerases, which were among the worst performers in the previous experiments, are commonly used for amplification of aDNA libraries (25)(26). The detrimental effects of this choice become apparent when analyzing data from one such study (25; Supp. Figure 2).
We applied the amplification-sequencing assay to a Neandertal library to further investigate these biases and to find a suitable polymerase specifically for the amplification of ancient DNA libraries. We pursued a different strategy from the one used for the modern human library: heavy oversequencing of a very dilute library, followed by direct counting of the number of duplicate sequences seen from molecules in each length and GC bin. All polymerases from the previous panel were used, except that of the different Phusion systems only Phusion HF was included. The input library was treated with uracil-DNA glycosylase and endonuclease VIII prior to amplification to remove deoxyuracils and abasic sites from the template (18). This treatment both increases downstream sequence accuracy and removes DNA modifications that are known to block most of the polymerases tested here (only AmpliTaq Gold and Pfu Turbo Cx can copy across uracils) (27). Amplifications were performed in replicates of 6, using a unique indexing primer pair for each reaction. Amplification products were pooled and sequenced on one Illumina GAII lane.
A subset of 370,000 sequences was taken for analysis, corresponding to the minimal number of sequences obtained from each replicate. We observed a striking difference among polymerases in the percentage of sequences that could be mapped to the human genome. Since unmapped sequences putatively come from microbial contamination, this number is usually thought to reflect the fraction of endogenous DNA present in the sample. On average, between 9,600 and 10,300 sequences (2.6%–3.0%) could be mapped to the human reference genome for libraries amplified with AccuPrime Pfx, Herculase II Fusion, Pfu Turbo Cx and Platinum Taq HiFi (Suppl. Table 1). In libraries amplified with AmpliTaq Gold and Phusion HF, only an average of 8,373 (2.3%) and 6,259 (1.7%) sequences could be mapped, respectively. Thus, amplification with AmpliTaq Gold and Phusion HF results in a drastically lower fraction of endogenous sequences. Interestingly, the difference in percentage of endogenous sequences determined by polymerase choice is even larger than the one recently found between Illumina and Helicos sequencing technologies (13). In that study, Phusion was used for Illumina amplification whereas Helicos sequencing was performed without prior amplification of sample DNA. Our data suggest that PCR bias may fully explain the observed difference.
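The two steps described above, subsampling every replicate to the same number of sequences and expressing mapped sequences as a percentage of that subset, can be sketched as follows. This is an illustrative sketch only: the function names, the fixed random seed, and the in-memory list representation of reads are assumptions, not the pipeline actually used. The mapped counts and the subset size are the values quoted in the text.

```python
import random


def downsample_to_smallest(reads_by_replicate, seed=0):
    """Randomly subsample every replicate to the size of the smallest one.

    reads_by_replicate: dict mapping a replicate name to a list of reads.
    The seed is fixed only so that the sketch is reproducible.
    """
    rng = random.Random(seed)
    target = min(len(reads) for reads in reads_by_replicate.values())
    return {
        name: rng.sample(reads, target)
        for name, reads in reads_by_replicate.items()
    }


def percent_mapped(n_mapped, n_total):
    """Percentage of sequences in a subset that mapped to the reference."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_mapped / n_total


SUBSET_SIZE = 370_000  # sequences analysed per replicate (from the text)

# Average mapped counts quoted in the text.
pct_amplitaq = percent_mapped(8373, SUBSET_SIZE)  # about 2.3%
pct_phusion = percent_mapped(6259, SUBSET_SIZE)   # about 1.7%
print(f"AmpliTaq Gold: {pct_amplitaq:.1f}%  Phusion HF: {pct_phusion:.1f}%")
```

Equalizing the subset size before counting matters because it makes raw mapped counts directly comparable across replicates that were sequenced to different depths.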
To investigate the nature of polymerase length bias in ancient DNA samples, we took advantage of the fact that the libraries consist of unique molecules and copies of those molecules produced during amplification. We collapsed the mapped sequences into unique molecules and calculated the number of duplicates per unique molecule in the library (Figure 3). Since unique molecules can only be determined from mapped sequences, this analysis only considers endogenous molecules. Of the 6 polymerases, length and average duplicate number were strongly negatively correlated in AmpliTaq Gold libraries (r = 0.94, P < 2 × 10⁻⁶, Pearson) and Phusion HF libraries (r = 0.95, P < 2 × 10⁻⁶, Pearson). In both libraries there was an approximately 3-fold reduction in average duplicate number across the range of fragment lengths. Length and duplicate number were only marginally correlated in Pfu Turbo Cx libraries (r = 0.32, P = 0.006, Pearson) and there was no significant correlation in Platinum Taq HiFi, Herculase II Fusion or AccuPrime Pfx libraries.
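The duplicate-counting step can be illustrated with a short Python sketch. It assumes, hypothetically, that each mapped sequence is represented as a `(chrom, start, end, strand)` tuple with an exclusive end coordinate, and that sequences with identical coordinates and strand are copies of the same original molecule. The criterion actually used to define unique molecules is not specified here, and the sketch reports total copies per molecule rather than copies minus one.

```python
from collections import Counter, defaultdict


def mean_copies_by_length(mapped_reads):
    """Average number of reads per unique molecule, keyed by fragment length.

    mapped_reads: iterable of (chrom, start, end, strand) tuples.
    Assumption: identical tuples are PCR copies of one original molecule.
    """
    # Collapse reads into unique molecules and count copies of each.
    copies = Counter(mapped_reads)

    # Group the per-molecule copy numbers by fragment length.
    copies_by_length = defaultdict(list)
    for (chrom, start, end, strand), n in copies.items():
        copies_by_length[end - start].append(n)

    # Average copy number for each fragment length, in ascending length order.
    return {
        length: sum(ns) / len(ns)
        for length, ns in sorted(copies_by_length.items())
    }


# Toy input (NOT data from this study): two 40-bp molecules seen 3 and 1
# times, and one 80-bp molecule seen once.
toy_reads = (
    [("chr1", 100, 140, "+")] * 3
    + [("chr1", 500, 540, "+")]
    + [("chr2", 200, 280, "-")]
)
profile = mean_copies_by_length(toy_reads)
print(profile)  # {40: 2.0, 80: 1.0}
```

Correlating the keys of such a profile (fragment length) with its values (average copy number) gives the kind of length-versus-duplicate relationship summarized in Figure 3. In practice lengths would likely be grouped into bins rather than treated individually.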