The sequencing metrics for a 2-fold dilution series of M13mp18 input genomic DNA, ranging from 100 ng to 0.8 ng, are shown in Table 1. At the lowest input amount (0.8 ng), the run generated only 74 reads, but those reads provided enough data to call 91.4% of the genome with 93.4% consensus accuracy. With only 3.1 ng of starting material, mapped reads covered 100% of the bases at 100% consensus accuracy. The manufacturer's standard methods require far more DNA, regardless of the target genome size; e.g., the standard protocol recommends a minimum of 500 ng for a 1-kb sequence. For a 7.2-kb sequence, the recommendation would be to use several µg.
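As a quick check on the input amounts quoted above, the dilution series can be sketched as follows. This is only an illustration: the number of steps (eight) is an assumption inferred from the stated endpoints of 100 ng and 0.8 ng, and the function name is hypothetical.

```python
def dilution_series(start_ng, fold, steps):
    """Return the input masses (ng) of a serial dilution, highest first."""
    return [start_ng / fold**i for i in range(steps)]

# Assumption: eight 2-fold steps, which takes 100 ng down to about 0.8 ng.
series = dilution_series(100.0, 2, 8)
for mass in series:
    print(f"{mass:.2f} ng")
```

Under this assumption, the sixth and eighth points of the series are 3.125 ng and 0.78125 ng, which correspond to the 3.1 ng and 0.8 ng inputs reported in the text.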
To enhance this technique, we used random hexamer primers rather than primers specifically designed for the sample. In this case, no prior knowledge of the DNA sequence is necessary, and the method can, in principle, be applied directly to a wide range of unknown samples, e.g., in a clinical setting. Figure 1 (right panel) shows the coverage of ds M13mp18 sequenced with Roche's random hexamers. Coverage is more uniform than that obtained using the specific primers. The coverage distribution is still not ideal, but starting with 50 ng of DNA, we generated data similar to those shown in Table 1 for 50 ng of input DNA (2392 mapped reads, 100% bases called, 100% consensus accuracy, 403× mean coverage).
To test the application of our direct sequencing method for the PacBio detection of modified bases, 6-kb vectors with known positions of methylation were sequenced. A pET28a vector encoding EcoDam methyltransferase, which generates N6-methyladenine (m6A) methylation in GATC motifs, was directly sequenced, with subsequent kinetic analysis using PacBio's software to identify base modifications. Random hexamer primers were used and the experimental procedure was as described previously for M13mp18, starting with 25 ng of DNA. Figure 3 shows the SMRT View genome browser depiction of a portion of the sequence data for one of the vectors sequenced using this direct sequencing technique; four instances of GATC methylated sites are evident as peaks in the purple trace (+ strand) and orange trace (- strand). As the GATC sequence is a palindrome, there is an m6A base on both strands, and these base modifications are readily visible in the interpulse duration (IPD) ratio reported by the PacBio software. Other studies have used the PacBio RS for base modification analysis on similar plasmid/methyltransferase models (10, 11) and entire bacterial genomes (12), but to our knowledge these studies followed standard library preparation protocols and required far greater amounts of DNA than the technique described here. The data we generated were analyzed with Motif Finder, an application provided by PacBio for mining polymerization kinetics for motifs associated with base modifications. In this vector, 50 instances of GATC methylated at the A position were identified; there are 25 GATC sites in the sequence, each carrying an m6A on both strands, and wild-type EcoDam was expected to methylate each one of them.
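The relationship between the 25 GATC sites and the 50 methylated positions follows from counting the motif on both strands of a circular molecule. The sketch below illustrates that counting on a short toy sequence; it is not the Motif Finder algorithm, the helper names are hypothetical, and the actual vector sequence is not reproduced here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_motif_circular(seq, motif):
    """Count motif occurrences on one strand of a circular sequence."""
    seq = seq.upper()
    # Append the first len(motif) - 1 bases so matches spanning the origin count.
    extended = seq + seq[:len(motif) - 1]
    return sum(1 for i in range(len(seq)) if extended.startswith(motif, i))

def expected_m6a_positions(seq, motif="GATC"):
    """Sum motif occurrences on both strands, one modified A per occurrence."""
    forward = count_motif_circular(seq, motif)
    reverse = count_motif_circular(reverse_complement(seq), motif)
    return forward + reverse

# Toy circular sequence (an assumption for illustration, not the vector).
toy = "GATCAAGATCTT"
print(expected_m6a_positions(toy))
```

Because GATC is its own reverse complement, every site on the forward strand is also a site on the reverse strand, so a fully methylated vector with 25 sites is expected to yield 50 modified positions.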
Direct sequencing was then tested using a DNA extract of Staphylococcus aureus TW20, an MRSA strain and a well-known cause of nosocomial infection (13). The plasmids of this bacterial sample were of interest as an example of the application of the PacBio RS to infectious disease identification through sequencing. Antibiotic resistance genes are often carried on plasmids (14, 15) and can spread very quickly in heterogeneous bacterial communities (16-19). DNA was extracted from a solution culture of TW20 and digested with Plasmid Safe DNase to reduce the amount of linear fragments and effectively increase the concentration of plasmids in the sample. An electropherogram of the final sample showed that the DNA preparation also contained a smear of linear double-stranded fragments ranging from 100 bp to >25 kb, with a peak at approximately 20 kb (Supplementary Figure 1). The two plasmids in TW20 are double-stranded and circular, with lengths of 3 kb and 30 kb. We generated sequence data using random hexamer primers in the annealing reaction. Four reactions, each containing 50 ng of the S. aureus DNA preparation and a 10-fold to 600-fold excess by mass of hexamer primers (500 ng, 1 µg, 10 µg, and 30 µg per annealing reaction), were performed in 9 µL reaction volumes. A single SMRT cell was sequenced for each reaction, and the number of mapped reads decreased as the amount of random hexamer primers increased. This is perhaps because annealed primers lie closer together on the DNA strand at higher concentrations, leading to polymerases colliding with one another, or simply because of a reduced signal-to-noise ratio, as two fluorescent signals could be observed concurrently. The annealing reaction with 10-fold primers generated 3240 mapped reads, 20-fold generated 3085 mapped reads, 200-fold generated 2911 mapped reads, and 600-fold generated 2011 mapped reads, all with a mean mapped read length of approximately 500 bp.
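The fold values and the trend in mapped reads can be tabulated with a few lines of arithmetic. The sketch below uses only the figures reported above; the variable names are hypothetical, and the fold excess is taken to be a simple mass ratio of primer to DNA, which is an assumption consistent with the quoted amounts.

```python
DNA_INPUT_NG = 50.0  # S. aureus DNA per annealing reaction

# Primer mass per reaction (ng) and mapped reads, as reported in the text.
primer_ng = [500.0, 1_000.0, 10_000.0, 30_000.0]
mapped_reads = [3240, 3085, 2911, 2011]

# Assumption: "fold" means primer mass divided by DNA mass.
fold_excess = [mass / DNA_INPUT_NG for mass in primer_ng]

for fold, reads in zip(fold_excess, mapped_reads):
    relative = 100.0 * reads / mapped_reads[0]
    print(f"{fold:.0f}-fold primers: {reads} mapped reads "
          f"({relative:.0f}% of the 10-fold reaction)")
```

On these numbers, the 600-fold reaction yields roughly 62% of the mapped reads obtained with 10-fold primers.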
There was also a difference in coverage depth between the two plasmids; the mean coverage for the 3-kb plasmid was 35×, but only 5× coverage was obtained for the 30-kb plasmid, a difference due mostly to plasmid length. Larger molecules load inefficiently because of their lower diffusion coefficient and the disparity between the molecule's hydrodynamic radius and the very small zero-mode waveguide (ZMW). Future upgrades to the loading mechanism on the PacBio instrument (MagBead loading), which should eliminate this problem, are very close to release. The combined sequence data from these four SMRT cells produced 13,724 reads; 479 reads mapped to the plasmids and 11,247 to the genome (5.3 Mb mapped, providing a mean 1.6× coverage), an overall mapping rate of 85%. This is similar to the standard mapping rates of SMRTbell libraries we have made (from a recent single SMRT cell of an S. aureus TW20 1-kb SMRTbell library, 39,478 reads were mapped from 47,465 filtered reads, a mapping rate of 83%).
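The two mapping rates quoted above can be reproduced directly from the read counts. The following is a minimal sketch of that arithmetic, with a hypothetical helper name; it uses only the numbers given in the text.

```python
def mapping_rate(mapped, total):
    """Return the percentage of reads that mapped to a reference."""
    return 100.0 * mapped / total

# Direct sequencing: plasmid-mapped plus genome-mapped reads over all reads.
direct_rate = mapping_rate(479 + 11_247, 13_724)

# Standard 1-kb SMRTbell library: mapped reads over filtered reads.
library_rate = mapping_rate(39_478, 47_465)

print(f"direct sequencing: {direct_rate:.1f}%")
print(f"SMRTbell library:  {library_rate:.1f}%")
```

Rounded to whole percentages, these give the 85% and 83% reported for the direct and library-based runs, respectively.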