As a point of reference, 0.01 ng is the amount of DNA present in one and one-third human diploid cells. Thus, a few contaminating human cells could provide the majority of DNA present in a 0.01 ng bacterial gDNA sample. When preparing and processing subnanogram quantities of DNA, one can go to extraordinary lengths to avoid exogenous DNA contamination. No such measures were taken during our sample preparation and our experience with DNA contamination underscores the importance of employing these procedures when minute quantities of DNA are processed.
We next compared Illumina sequencing libraries created by HTML-PCR to those created by Nextera using a strain that had been previously sequenced, namely the TIGR4 isolate of Streptococcus pneumoniae (8). In each case, 50 ng of starting gDNA template was used. The Nextera library was prepared with the kit according to the manufacturer's instructions, while the HTML-PCR library was prepared as above. Following library preparation, the samples were sequenced and then subjected to bioinformatic analysis. HTML-PCR uses mechanical shearing to create the ends from which molecules are sequenced, while Nextera uses transposition. Although mechanical shearing is known to be an unbiased process, transposases can select target DNA in a biased and nonrandom manner. We therefore anticipated that Nextera might show a greater bias toward and/or against particular gDNA target sequences. In an effort to compensate for this and obtain as many Nextera-generated ends as possible, 5-fold more of that sample relative to the HTML-PCR sample was loaded within a single lane of the Illumina flow-cell.
After the reads were trimmed for quality, they were mapped in each case to the reference genome (8). The HTML-PCR library yielded 17,995,348 filtered reads, of which 99.4% mapped to the reference genome, while the Nextera sample yielded 91,242,087 filtered reads, of which 96.5% mapped. We next asked whether particular regions of the genome represented hotspots for transposition or shearing by each method. Similarly, if any sequences were strongly favored in the earliest rounds of PCR, these jackpot events would also appear as hotspots. In each case, the position in the genome with the highest sequence coverage (the strongest hotspot) was identified and its coverage was divided by the average coverage of the entire genome. By this measure, the higher the value obtained, the greater the hotspot preference. To our surprise, Nextera did not show a significant preference for any sequence, with the highest value being 3.652, which was essentially the same as the highest value for HTML-PCR, which was 3.646. We conclude that neither method is prone to hotspot insertion or jackpot PCR biases.
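The hotspot measure described above (peak coverage divided by genome-wide average coverage) can be sketched in a few lines. This is a minimal illustration only: the function name `hotspot_ratio` and the toy coverage values are hypothetical, and it assumes per-position coverage has already been computed from the mapped reads.

```python
from statistics import mean


def hotspot_ratio(coverage):
    """Return the highest per-position coverage divided by the mean coverage.

    `coverage` is a sequence of read depths, one value per genome position
    (hypothetical input; how it is produced is outside this sketch).
    """
    if not coverage:
        raise ValueError("coverage must not be empty")
    average = mean(coverage)
    if average == 0:
        raise ValueError("mean coverage is zero; ratio is undefined")
    return max(coverage) / average


# Toy example: mean coverage is 16, the strongest position has depth 40.
toy_coverage = [10, 12, 8, 40, 10]
print(hotspot_ratio(toy_coverage))  # 2.5
```

A value near 1 would indicate almost perfectly even coverage, while a strong transposition hotspot or PCR jackpot event would push the ratio far above the values of about 3.65 reported here.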
As a measure of the extent of genome coverage, we analyzed the number of unique 5′ ends generated by each method. Since each strand of the genome is independently sequenced, the maximum theoretical number of unique 5′ ends is twice the genome length. With Nextera, the 91,242,087 filtered reads represented 3,192,276 unique 5′ ends (73.9% of the maximum), while the 17,995,348 filtered HTML-PCR reads represented 2,185,530 unique ends (50.6% of the maximum). For both libraries, there were no unsequenced positions in the reference genome and both yielded the same consensus genome sequence.
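The unique 5′ end calculation can be illustrated as follows. This is a sketch under stated assumptions: each mapped read is reduced to a hypothetical `(strand, position)` pair for its 5′ end, and the function name `unique_end_fraction` is illustrative rather than taken from the analysis pipeline actually used.

```python
def unique_end_fraction(ends, genome_length):
    """Count unique 5' ends and express them as a fraction of the maximum.

    `ends` is an iterable of (strand, position) tuples, one per mapped read.
    Because both strands are sequenced independently, the theoretical
    maximum number of unique 5' ends is 2 * genome_length.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    unique = set(ends)
    return len(unique), len(unique) / (2 * genome_length)


# Toy example: six reads on a 10 bp genome, two of them duplicates.
toy_ends = [("+", 1), ("+", 1), ("-", 5), ("+", 3), ("-", 5), ("-", 2)]
n_unique, fraction = unique_end_fraction(toy_ends, genome_length=10)
print(n_unique, fraction)  # 4 0.2
```

As a consistency check on the reported figures, dividing each unique-end count by its percentage (3,192,276 / 0.739 and 2,185,530 / 0.506) gives a maximum of roughly 4.32 million in both cases, which corresponds to a genome of about 2.16 Mb.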
We also compared the coverage distribution of sites throughout the genome in the libraries generated by the two methods. As shown in Supplementary Figure S3, the resulting plots were quite similar. We also examined coverage as a function of the GC content of specific regions throughout the genome. We observed coverage bias with both methods, although it was more severe with the Nextera-generated library. For instance, with the Nextera library, regions of the genome with 20% GC content had 5.52-fold lower coverage than those with 50% GC content, whereas the bias was only 2.36-fold with the HTML-PCR library (Supplementary Figure S4). We conclude that the quality of sequencing data obtained with the two library construction methods was similar, although HTML-PCR is better suited for sequencing regions or genomes with low GC content.
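One way to quantify GC-dependent coverage bias of this kind is to average coverage over fixed windows and group the windows by GC percentage. The sketch below is an assumption about how such an analysis could be done, not the procedure used for Supplementary Figure S4: the window size, the bin width, and the function name `coverage_by_gc` are all hypothetical.

```python
from collections import defaultdict


def coverage_by_gc(genome, coverage, window=10, bin_pct=10):
    """Mean coverage of non-overlapping windows, grouped by GC-percent bin.

    Returns a dict mapping the lower edge of each GC bin (in percent)
    to the mean coverage of the windows that fall in that bin.
    """
    if len(genome) != len(coverage):
        raise ValueError("genome and coverage must have the same length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for start in range(0, len(genome) - window + 1, window):
        seq = genome[start:start + window].upper()
        gc_pct = 100 * (seq.count("G") + seq.count("C")) // window
        bin_start = (gc_pct // bin_pct) * bin_pct
        sums[bin_start] += sum(coverage[start:start + window]) / window
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sums}


# Toy example: one 20% GC window at depth 10, one 50% GC window at depth 30.
toy_genome = "AATTAATTGC" + "GCATGCATGA"
toy_coverage = [10] * 10 + [30] * 10
by_gc = coverage_by_gc(toy_genome, toy_coverage)
print(by_gc[50] / by_gc[20])  # 3.0
```

The final ratio plays the role of the fold bias quoted in the text (5.52 for Nextera and 2.36 for HTML-PCR); in a real analysis the windows would be far larger than the 10 bp used in this toy example.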
With HTML-PCR, because of the 16 dG nucleotides present at the 3′ end of one of the PCR primers used, genomic DNA can only be amplified if it contains a stretch of complementary dC nucleotides of a similar or greater length. In most molecules, the exogenously added oligo(dC) tail fulfills that requirement. However, if long oligo(dC) stretches exist naturally in the genome, these sites could be amplified in a tail-independent manner. Furthermore, since amplification of endogenous sites does not depend upon the efficiency of tailing, this amplification might be very efficient, resulting in the over-representation of endogenous homopolymers in the final library. For the experiments above, this theoretical objection does not apply, as nowhere within the V. cholerae or S. pneumoniae genomes are there oligo(dC) stretches that exceed 11 nucleotides in length. However, in larger, more complex genomes such as the human genome, numerous endogenous dC stretches of at least 16 nucleotides do exist (9, 10).
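Whether a given genome contains endogenous oligo(dC) stretches long enough to prime tail-independent amplification can be checked with a simple scan. The following is a minimal sketch, not the method used in this study: the function name `find_dc_runs` is hypothetical, and it assumes that a run of G on the supplied strand should be reported as a dC run on the complementary strand.

```python
import re


def find_dc_runs(sequence, min_len=16):
    """Find homopolymer dC runs of at least `min_len` on either strand.

    Returns a list of (start, length, strand) tuples with 0-based starts
    on the supplied strand. A run of G on the supplied strand is reported
    with strand "-" because it is a dC run on the complementary strand.
    """
    pattern = re.compile(r"C{%d,}|G{%d,}" % (min_len, min_len))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        strand = "+" if match.group()[0] == "C" else "-"
        hits.append((match.start(), len(match.group()), strand))
    return hits


# Toy sequence: a 17 nt C run, a 16 nt G run, and an 11 nt C run
# (the last is below the threshold and is not reported).
toy = "AT" + "C" * 17 + "TTA" + "G" * 16 + "A" + "C" * 11
print(find_dc_runs(toy))  # [(2, 17, '+'), (22, 16, '-')]
```

The default threshold of 16 mirrors the 16 dG nucleotides at the 3′ end of the primer; lowering `min_len` would show how close a genome comes to that limit.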