Materials and methods
Sample collection and isolation
Human DNA (hDNA) was obtained from anonymous adult donors using a protocol approved by the Battelle Memorial Institute Internal Review Board. hDNA was collected and purified from saliva samples using the Oragene-DNA isolation kit (DNA Genotek, Kanata, ON, Canada), according to the manufacturer's recommended protocol. A total of five individual human saliva samples were evaluated. For mixture experiments, two of the five samples were mixed post-PCR amplification at a ratio of 1:1 prior to sequencing.
Illumina GAIIx sequencing
The 13 CODIS core loci were individually amplified with Phusion High-Fidelity DNA Polymerase (New England Biolabs, Ipswich, MA, USA) using custom-designed primers that produced amplicons of 1000–2500 bp per locus (Supplementary Table S1). These amplicon lengths were chosen to better simulate the use of genomic DNA that would be fragmented prior to sequencing, rather than short amplicons formatted for a specific high-throughput sequencing (HTS) technology. Individual PCR amplifications were pooled and purified with the MinElute PCR Purification kit (Qiagen, Valencia, CA, USA). Sequencing libraries were constructed with the TruSeq DNA Library kit (Illumina) using index tags, according to the manufacturer's recommended protocol. Six multiplexed samples (five individual and one mixture) were pooled and sequenced in one lane of an Illumina GAIIx at the Nucleic Acid Shared Resource Laboratory, The Ohio State University Medical Center (Columbus, OH, USA), for 150-bp single-end reads. Over 5 million pass-filtered reads were generated for each sample.
CE STR profiling
STR loci were amplified from human DNA samples using the PowerPlex 16 System (Promega, Madison, WI, USA), and the amplified products were detected using an Applied Biosystems 3130 Genetic Analyzer with Data Collection Software Version 3.0. Output data were analyzed using GeneMapper Software Version 4.0 (Applied Biosystems). All process quality values (PQV) were evaluated against default values to determine the accuracy of genotyping.
STR profiling by NGS
A modified reference alignment method was developed to genotype STR loci. First, each CODIS STR locus sequence was designated by its repeat pattern and a segment of upstream and downstream nonrepetitive sequence. An in silico reference for each STR locus was constructed in FASTA format by generating a linear sequence of concatenated STR alleles as compiled by the National Institute of Standards and Technology (NIST). The Bowtie short read aligner was used for reference alignment based on several evaluated criteria: open-source accessibility, flexibility in specifying search and output parameters, and an ungapped alignment approach that was most applicable to STR alignment requirements (19). The optimized alignment parameters were a seed length of 100 nucleotides, three allowed mismatches, and suppression of all alignments for a particular read if more than one reportable alignment existed. Output from the aligner was generated in SAM format and converted to BAM format using Samtools (20). Picard (http://picard.sourceforge.net/) was used to remove potential PCR duplicates, and BedTools (21) was used to convert BAM files to BED format. A custom script in the R statistical programming environment (22) was used to perform read filtering, retaining only reads that spanned the entire repeat region of any STR allele in the in silico genome plus 5 bp of the 5′- and 3′-flanking regions. Finally, to make allele calls, a heuristic decision model based on Fisher's Exact Test was applied to evaluate the magnitude of reads mapping to each allele. A probability score was generated for each allele in the in silico genome based on: (i) the number of reads mapping to the allele, (ii) the total number of reads mapping to the locus, (iii) the number of reads aligning to the in silico genome, and (iv) the total number of reads generated for a particular sample. This metric provided objective criteria on which to base STR genotyping calls from the NGS data.
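The reference construction and read-spanning filter described above can be illustrated with a minimal Python sketch. It is not the custom R script used in this study. The locus name, flanking sequences, allele labels, and function names are hypothetical, and the sketch assumes that allele entries are concatenated directly with no spacer between them, which the text does not specify. Coordinates follow the BED convention (0-based start, exclusive end).

```python
from dataclasses import dataclass

FLANK_REQUIRED = 5  # bp of flank a read must cover on each side (value from the text)


@dataclass(frozen=True)
class AlleleRegion:
    name: str          # hypothetical allele label, e.g. "LOCUS_X_6"
    repeat_start: int  # 0-based start of the repeat region (inclusive)
    repeat_end: int    # 0-based end of the repeat region (exclusive, as in BED)


def build_reference(alleles, upstream, downstream):
    """Concatenate upstream + repeat + downstream for every allele.

    Returns the linear reference sequence and the repeat coordinates of
    each allele within it. Assumes no spacer between allele entries.
    """
    pieces, regions, offset = [], [], 0
    for name, repeat_seq in alleles:
        start = offset + len(upstream)
        regions.append(AlleleRegion(name, start, start + len(repeat_seq)))
        entry = upstream + repeat_seq + downstream
        pieces.append(entry)
        offset += len(entry)
    return "".join(pieces), regions


def spans_allele(read_start, read_end, region, flank=FLANK_REQUIRED):
    """True if the aligned read covers the whole repeat plus `flank` bp on both sides."""
    return (read_start <= region.repeat_start - flank
            and read_end >= region.repeat_end + flank)


def spanning_alleles(read_start, read_end, regions):
    """Names of all alleles whose repeat region (plus flanks) the read spans."""
    return [r.name for r in regions if spans_allele(read_start, read_end, r)]


# Toy example: one made-up locus with a 6-repeat and a 7-repeat allele.
upstream, downstream = "ACGTACGTAC", "TTGACCTGAA"
alleles = [("LOCUS_X_6", "AATG" * 6), ("LOCUS_X_7", "AATG" * 7)]
reference, regions = build_reference(alleles, upstream, downstream)
```

In this toy reference, a read aligned to positions 5–39 is retained for the 6-repeat allele, whereas a read starting at position 6 covers only 4 bp of the 5′ flank and is discarded.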
A simulated analysis was performed to estimate the confidence of STR genotyping calls as a function of the number of reads. A Monte Carlo probabilistic model was applied to read counts that mapped to the in silico reference alignment. For each model, 10,000 draws were randomly obtained from a grid of: (i) total reads, (ii) reads aligning to the in silico genome, and (iii) the number of reads mapping to the entire locus. Each draw resulted in a simulated table of alignment counts and was designated as either pass or fail. A passing draw had true allele counts with locus proportions significantly larger than 0.10, and all other alleles with proportions not significantly larger than 0.10. To test the Monte Carlo estimations, a separate analysis was also performed on subsets (10% and 1%) of the raw sequence data using the reference alignment method described previously.
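The pass/fail logic of a single draw can be sketched in Python as follows. This is an illustration under several assumptions, not the model used in this study. It simulates only the reads mapping to one locus rather than the full grid of total, aligned, and locus reads. It uses an exact one-sided binomial test for "significantly larger than 0.10", with an assumed significance level of 0.05. The allele names and the allele proportions supplied by the caller are hypothetical.

```python
import math
import random
from collections import Counter

THRESHOLD = 0.10  # locus proportion named in the text
ALPHA = 0.05      # assumed significance level (not stated in the text)


def binom_upper_tail(k, n, p0):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p0)."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p0), math.log1p(-p0)
    total = 0.0
    for i in range(k, n + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def draw_passes(counts, alleles, true_alleles, locus_reads):
    """A draw passes if exactly the true alleles are significantly above THRESHOLD."""
    for allele in alleles:
        p_value = binom_upper_tail(counts[allele], locus_reads, THRESHOLD)
        if (p_value < ALPHA) != (allele in true_alleles):
            return False
    return True


def pass_rate(allele_props, true_alleles, locus_reads, n_draws=10_000, seed=0):
    """Fraction of simulated draws that pass at a given locus read depth."""
    rng = random.Random(seed)
    alleles = list(allele_props)
    weights = [allele_props[a] for a in alleles]
    passed = 0
    for _ in range(n_draws):
        # One multinomial draw of reads across the alleles of the locus.
        counts = Counter(rng.choices(alleles, weights=weights, k=locus_reads))
        passed += draw_passes(counts, alleles, true_alleles, locus_reads)
    return passed / n_draws


# Hypothetical heterozygote (alleles "9" and "10") with two low-level artifact alleles.
props = {"8": 0.05, "9": 0.45, "10": 0.45, "11": 0.05}
```

Calling `pass_rate(props, {"9", "10"}, locus_reads)` over a range of `locus_reads` values gives an estimate of how genotyping confidence changes with read depth under these assumed proportions.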
Results and discussion
Data generated on the Illumina GAIIx system and analyzed with an optimized reference alignment method consistently identified the 13 CODIS core STR loci and the sex (AMEL) locus from single individuals and a mixed sample when compared with results obtained by CE using the PowerPlex 16 assay (Table 1). Because the Illumina GAIIx platform is capable of producing an extremely large number of reads per run, >39 million high-quality sequence reads were produced in one lane (one-eighth of a flow cell) for the six multiplexed samples in this study (Supplementary Table S2). This high number of reads allowed for a high level of stringency in accurately calling STR alleles with Fisher's Exact Test, thereby discriminating against stutter or sequencing errors (Supplementary Table S3).