1Battelle Memorial Institute, Columbus, OH, USA
2Battelle Memorial Institute, Charlottesville, VA, USA
3Human Cancer Genetics Program, The Ohio State University Comprehensive Cancer Center, Columbus, OH, USA
4Department of Physics and Biochemistry, Center for RNA Biology, The Ohio State University, Columbus, OH, USA
BioTechniques, Vol. , No. , April 2012, pp. 1–6
Here, we present a systematic method for high-throughput genotyping of the Combined DNA Index System (CODIS) short tandem repeat (STR) loci for human forensics using Illumina GAIIx short-read technology. Our novel contribution is to show that short-read–based next-generation sequencing technology can accurately genotype the CODIS STR loci from multiple samples and, more importantly, from mixed samples using quantitative measurements (reads). We also demonstrate method sensitivity, showing that as few as 18,500 reads, aligned to our in silico reference genome, were required to genotype an individual (>99% confidence) for the entire CODIS panel of loci.
DNA-based methods for human identification principally rely upon genotyping of short tandem repeat (STR) loci. Electrophoresis-based techniques for variable-length classification of STRs are universally utilized, but they have relatively low throughput and do not yield nucleotide sequence information. High-throughput sequencing technology may provide a more powerful instrument for human identification, but is not currently validated for forensic casework. Here, we present a systematic method to perform high-throughput genotyping analysis of the Combined DNA Index System (CODIS) STR loci using short-read (150 bp) massively parallel sequencing technology. Open source reference alignment tools were optimized to evaluate PCR-amplified STR loci using a custom-designed STR genome reference. Evaluation of this approach demonstrated that the 13 CODIS STR loci and the amelogenin (AMEL) locus could be accurately called from individual and mixture samples. Sensitivity analysis showed that as few as 18,500 reads, aligned to an in silico reference genome, were required to genotype an individual (>99% confidence) for the CODIS loci. The power of this technology was further demonstrated by identification of variant alleles containing single nucleotide polymorphisms (SNPs) and the development of quantitative measurements (reads) for resolving mixed samples.
Analysis of short tandem repeats (STRs) has become a well-established technology for human forensic casework (1-3). STRs, also known as microsatellites or simple sequence repeats (SSRs), are repetitive regions of DNA that contain unique core repeat units of 2–6 nucleotides in length (4). Several STR database systems have been established, including the Combined DNA Index System (CODIS) utilized in the United States (5). This system currently uses a standard set of 13 STR loci, which are highly polymorphic, genetically unlinked, and reside in noncoding regions (2). These STR alleles are routinely analyzed by multiplexed PCR followed by capillary electrophoresis (CE)-based separation (3,6-9). Although the CE-based technique for STR typing is both time- and cost-effective, it does not allow for full sequence determination of STR loci and is only semiquantitative. Information on nucleotide variation within STR alleles would be informative for discriminating alleles in partial profile situations, resolving mixed samples, and kinship analysis (11).
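To make the length-based definition of an STR allele concrete, the following minimal Python sketch counts the longest uninterrupted run of a core repeat unit in a sequence. It is an illustration only, not the genotyping method used in this report: the function name `longest_repeat_run`, the example motif, and the flanking sequences are hypothetical and are not taken from any CODIS locus.

```python
import re


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of consecutive copies of `motif` in `sequence`."""
    # Match one or more back-to-back copies of the core repeat unit.
    pattern = re.compile(f"(?:{re.escape(motif.upper())})+")
    runs = [
        len(match.group(0)) // len(motif)
        for match in pattern.finditer(sequence.upper())
    ]
    # A sequence with no copy of the motif has a run length of 0.
    return max(runs, default=0)


# Hypothetical read: invented flanking sequence around 7 copies of an invented motif.
read = "ACGTTC" + "GATA" * 7 + "CCTGAA"
print(longest_repeat_run(read, "GATA"))
```

A length-only count like this is the information a CE assay reports. Two alleles with the same repeat count but a nucleotide difference inside the repeat would be indistinguishable by this measure, which is why full sequence data can be more discriminating.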
A current limitation of STR typing with CE is its limited bandwidth, which restricts the use of additional investigative genetic markers that are potentially informative for ancestry, phenotype, and other attributes. Indeed, additional genetic markers have recently been added to the forensic DNA analysis repertoire. For example, STR genotypes on the Y chromosome are often used in mixture analysis, particularly in sexual assault cases (10), and various single nucleotide polymorphism (SNP) markers are now being used as predictors of externally visible characteristics (EVCs). The recent proliferation of genetic markers usable in forensic DNA analysis has resulted in a profusion of analytical platforms required to perform these assays (12). High-throughput sequencing (HTS) offers a single analytical platform for multiple forensic DNA analyses.
Previous reports have evaluated the potential of HTS for analyzing STR loci (13-15). In one of them, a pyrosequencing approach was used to examine 10 common markers for forensic analysis in Swedish individuals (13). Although this approach offers several advantages over CE-based STR analysis, including identifying sequence variants within or proximal to STR loci and sequencing shorter DNA fragments, it was not successful in distinguishing between STRs with compound repeat motifs. Furthermore, this technology cannot provide the high level of sequence coverage offered by HTS platforms. In a subsequent report, the Genome Sequencer FLX System (GS-FLX) (Roche Diagnostics, Branford, CT, USA) was evaluated for its ability to examine five STR loci from 10 human samples (14). This study demonstrated the benefits of STR genotyping using HTS technology (i.e., deeper resolution of STRs) and of using bioinformatic tools for sorting and evaluating sequence read lengths and frequencies, which are critical for showing reliable and consistent results. Although this report showed the advantages of utilizing the Roche GS-FLX sequencer in STR-typing analyses, it remained unknown whether a short-read next-generation sequencing (NGS) platform (e.g., Illumina, San Diego, CA, USA or SOLiD, Applied Biosystems, Foster City, CA, USA) could provide the level of sensitivity required to detect an allelic mixture within a sample or accurately and reliably genotype all 13 CODIS STR loci (16).
In this report, we investigated whether an NGS platform based on short-read technology could be applied to STR-typing analysis. Although a clear advantage of this technology is its ability to generate a large sequence data set exceeding 320 GB per run (17), it is unclear whether this type of platform, due to its short read output, could successfully sequence complete STR loci. To perform genotyping analysis, the Illumina GAIIx was used to provide deep sequencing of all 13 CODIS STR loci in addition to the amelogenin (AMEL) locus from five individual human samples and one mixture sample. This large data set was examined with a streamlined bioinformatics approach: reads were aligned to an in silico reference composed of all loci analyzed, followed by read-specific filtering and allele association statistics. The Illumina GAIIx data, analyzed by reference alignment, provided analogous allele calls for the 13 CODIS loci compared with the standard CE assay protocol based upon the PowerPlex 16 system (18), with the added benefit of detecting variant alleles of the same repeat length. Furthermore, this sequencing platform was able to discriminate STR alleles in a mixed sample preparation, providing additional value for its use in forensic analyses.
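The idea of using read counts as a quantitative measurement can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the allele association statistics used in this report: it assumes each read has already been aligned and assigned to one allele of a locus, and the function name `call_alleles`, the thresholds `min_fraction` and `min_reads`, and all read counts are hypothetical.

```python
from collections import Counter


def call_alleles(read_alleles, min_fraction=0.1, min_reads=10):
    """Tally reads per allele and keep alleles above a minimum share of locus reads.

    Returns a list of (allele, fraction) pairs sorted by descending read count.
    The thresholds are illustrative assumptions, not validated values.
    """
    counts = Counter(read_alleles)
    total = sum(counts.values())
    # Too few reads at this locus to make any call.
    if total < min_reads:
        return []
    return [
        (allele, n / total)
        for allele, n in counts.most_common()
        if n / total >= min_fraction
    ]


# Hypothetical single-source locus: two true alleles plus low-level noise reads.
single = ["12"] * 480 + ["14"] * 450 + ["11"] * 30 + ["13"] * 40
print(call_alleles(single))

# Hypothetical mixture: more than two alleles pass the threshold.
mixture = ["8"] * 400 + ["9"] * 370 + ["11"] * 120 + ["12"] * 110
print(call_alleles(mixture))
```

In this toy setting, more than two alleles passing the threshold at a locus hints at a mixed sample, and the read fractions give a rough indication of the relative contribution of each allele. A real analysis would also need to account for PCR stutter, amplification bias, and sequencing error.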