Overall, the NGS method was highly accurate in determining STR allele types from simple, compound, and complex repeats when benchmarked against the CE-based method (Table 1). The STR genotypes for TPOX, D3S1358, FGA, CSF1PO, D5S818, D7S820, D8S1179, HUMTH01, VWA, D13S317, and D16S539 were identical between the CE- and NGS-based methods. However, discrepancies were observed for D18S51 and D21S11. At the D18S51 locus, allele 14 could not be identified by reference alignment for index samples 5 and 7. This observation was unexpected given that this locus has a simple repeat pattern ([AAGA]n) and that the STR alleles at this locus are completely spanned by 150-bp sequence reads. Although the exact cause of the allele drop-out by NGS was not determined, it is possible that the custom PCR primers used to make the libraries did not sufficiently amplify this allele, so that it was not represented in the libraries and therefore could not be observed in the reference alignment analysis.
One limitation of the current NGS method is that the sequence read length restricts the detection of longer STR alleles. The data used in this study were 150-bp single-end reads, which is currently the maximum high-quality read length for this DNA sequencing platform. Although an analysis of read quality showed that quality diminished toward the 3′-end of the reads, a large number of reads maintained average scores of Q20 and Q30 throughout the entire 150-bp read length, as expected (Supplementary Table S2). Because of this read-length constraint, the short-read NGS method was unsuccessful in detecting the longer alleles (34.2 in index 2 and 32.2 in index 7) at the D21S11 CODIS STR locus. Corroborating evidence was seen in the mixture sample, index 12, in which alleles 32.2 and 34.2 were also not detected. The longest allele typed in this study was 31.2 at the D21S11 locus. Approximately 17% of D21S11 alleles and <1% of FGA alleles are longer than this allele (23). Importantly, at about 6% in most populations, the D21S11 32.2 allele is the most common of the alleles at the CODIS loci that are longer than 32 repeats. Thus, a marginal increase in read length on the Illumina GAIIx platform would capture this allele. A substantial improvement in read length will be required to capture all of the rare but extremely long alleles reported at the FGA locus (24).
Because forensic samples often contain the genetic material of more than one individual, it is critical to genotype STR alleles accurately from complex mixture samples (25-27). Therefore, the NGS method was tested using a sample (index 12) prepared as a mixture of indexes 2 and 6. Equal amounts of DNA from these individuals were mixed and sequenced on the Illumina GAIIx. Table 2 shows that quantification of mapped reads from the NGS analysis yielded the expected ratios for the distribution of alleles from the two individuals when mixed at a 1:1 ratio. For example, index 2 was homozygous for TPOX 8, while index 6 was heterozygous for alleles 8 and 10, so one of the four allele copies in the mixture is a 10 and the expected 10:8 ratio is 0.25:0.75. The observed TPOX 10:8 ratio was 0.24:0.76, in line with this expectation (Figure 1 and Table 2).
Another benefit of the NGS-based STR typing method was the sequence determination of STR alleles and the detection of variant alleles that were not observed by the CE-based method. For example, NGS detected variants of allele 16 at the D3S1358 locus in indexes 2 and 6. Alignment to the in silico STR genome, which contains a representation of each D3S1358 allele variant, revealed that index 2 presented alleles 16a/16b and index 6 presented alleles 16b/16c (Table 2). Analysis of index 12 successfully identified all three allele variants at the expected relative abundances, based on the number of mapped reads for each allele within the mixture (0.25:0.48:0.28, or approximately 1:2:1). These results demonstrate the advantage of NGS over CE: alleles that remain indistinguishable by CE (size-based) can be discriminated by the NGS method (SNP- or sequence-based).
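The expected proportions quoted for the 1:1 mixture (0.25:0.75 for TPOX 10:8 and 1:2:1 for the D3S1358 variants) follow from simple allele counting. The Python sketch below illustrates that arithmetic; it is not the analysis code used in the study. The function name `expected_allele_fractions` and its `weights` parameter are hypothetical, and the sketch assumes diploid contributors with no amplification or sequencing bias.

```python
from collections import Counter
from fractions import Fraction


def expected_allele_fractions(genotypes, weights=None):
    """Expected allele fractions in a mixture of diploid contributors.

    genotypes: list of (allele, allele) tuples, one per contributor.
    weights:   relative DNA input per contributor as integers
               (defaults to equal amounts, i.e. a 1:1 mixture for two people).
    Assumes every allele copy is amplified and sequenced equally well.
    """
    if weights is None:
        weights = [1] * len(genotypes)
    total = sum(weights)
    fractions = Counter()
    for (a1, a2), w in zip(genotypes, weights):
        # Each of a contributor's two allele copies carries half of their share.
        share = Fraction(w, total) / 2
        fractions[a1] += share
        fractions[a2] += share
    return dict(fractions)


# Genotypes as reported in the text for indexes 2 and 6.
tpox = expected_allele_fractions([("8", "8"), ("8", "10")])
d3 = expected_allele_fractions([("16a", "16b"), ("16b", "16c")])

print({k: float(v) for k, v in tpox.items()})  # {'8': 0.75, '10': 0.25}
print({k: float(v) for k, v in d3.items()})    # {'16a': 0.25, '16b': 0.5, '16c': 0.25}
```

Comparing these idealized values with the observed read fractions (0.24:0.76 and 0.25:0.48:0.28) shows how closely the mapped-read counts track the input ratio.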
To estimate the number of short reads required to accurately assess each STR allele present in each sample, a Monte Carlo probabilistic model was fit to the counts of reads mapped to the in silico reference. Figure 2 shows the proportion of passing subsample draws at varying numbers of assigned reads for each of the models. The STR locus detected at greater than 95% probability with the fewest reads was D13S317. D13S317 also yielded the highest number of mapped (aligned) reads relative to the other loci (Table 2 and Supplementary Tables S3 and S4) and is one of the shortest loci in the panel (28–60 bp within the repeat). However, it was difficult to ascertain from the data (Figure 2) whether this method of STR identification was more accurate for shorter alleles (e.g., D13S317) or whether PCR amplification bias and/or sequencing bias had a greater influence on the model. Additional samples would be required to fully evaluate this observation.
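The subsampling idea can be illustrated with a short Python sketch. The text does not specify the model's details, so everything here is an assumption made for illustration: the pass criterion (every allele receives at least `min_reads` reads), sampling with replacement in proportion to the observed counts, the function names, and the example read counts.

```python
import random
from collections import Counter


def passing_fraction(allele_counts, n_reads, min_reads=5, n_draws=1000, seed=0):
    """Fraction of random subsamples in which every allele is still detected.

    allele_counts: observed mapped-read counts per allele at one locus.
    n_reads:       size of each simulated subsample.
    min_reads:     hypothetical detection threshold per allele (an assumption).
    Reads are drawn with replacement in proportion to the observed counts.
    """
    rng = random.Random(seed)
    alleles = list(allele_counts)
    weights = [allele_counts[a] for a in alleles]
    passed = 0
    for _ in range(n_draws):
        draw = Counter(rng.choices(alleles, weights=weights, k=n_reads))
        if all(draw[a] >= min_reads for a in alleles):
            passed += 1
    return passed / n_draws


def min_reads_for_detection(allele_counts, target=0.95, grid=range(10, 501, 10), **kwargs):
    """Smallest subsample size on the grid whose passing fraction reaches the target."""
    for n in grid:
        if passing_fraction(allele_counts, n, **kwargs) >= target:
            return n
    return None  # target not reached within the grid


# Made-up counts for a heterozygous locus, for illustration only.
example_counts = {"11": 600, "12": 400}
n_needed = min_reads_for_detection(example_counts)
print(n_needed, passing_fraction(example_counts, n_needed))
```

Under these assumptions, a locus with balanced, abundant alleles reaches the 95% threshold at a smaller subsample size than one with a weakly amplified allele. This is why allele length and amplification bias are hard to separate from such curves alone.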