2Volcani Center, Bet-Dagan, Israel
Sequencing of a diploid PCR product that is heterozygous for an indel mutation and a downstream single nucleotide polymorphism (SNP) allows determination of haplotype phase. Manual determination of phase from trace files of this type is tedious; therefore, we introduce the Perl script TraceHaplotyper, which works from the expected sequence to analyze a sequencing trace file for the sites that are informative for the genotyping and outputs the two underlying haplotypes.
Numerous reports have shown that haplotypes are better than single markers for genetic association studies (1). Alleles of single markers may not always demonstrate an association clearly, while haplotypes can significantly improve the power and robustness of association tests because of stronger linkage disequilibrium (1). However, conventional haplotyping requires data from several generations to reconstruct haplotypes within a pedigree, while computational strategies that rely upon statistical inference are likely to require large sample sizes for acceptable accuracy. Various methodologies for direct molecular haplotyping have also been described (2), and direct haplotyping is a valuable strategy for gene mapping studies (3).
Recently, Flot et al. (4) described phase determination from direct sequencing of length-variable DNA regions. Their method was based on the observation that different bases are superimposed in the forward and reverse chromatograms obtained by sequencing a mixture of PCR-amplified products from such regions. They noted that their method was tedious when performed manually and suggested that a program that could perform this analysis automatically would be of significant benefit. This prompted us to develop TraceHaplotyper, a Perl script that scans a trace file for single nucleotide polymorphisms (SNPs) in the proximity of a polymorphic insertion-deletion (indel) site and outputs the reconstructed phases.
Although conventional tools for sequence analysis can be used to deduce SNPs (a chromatogram with an interpretable heterozygous SNP is shown in Figure 1A), they cannot interpret trace files of PCR products that are heterozygous with respect to indels. Following the site of the indel, base calling is hampered by ambiguity that arises from the superimposed sequences of two different alleles that are shifted in their positions (e.g., Figure 1, C and D). However, as demonstrated in Figure 1, the phase of SNP alleles that follow the indel can be determined manually by analysis of their corresponding sites on the chromatogram (4). The underlying principle is exemplified in Figure 1E: if a SNP and an indel mutation are linked (i.e., on the same DNA strand), the SNP will be shifted by the presence of the indel. If they are not linked (i.e., on different strands or chromosomes), the position of the SNP will not be moved by the presence of the indel in the sequenced DNA mixture.
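The shift principle can be illustrated with a small Python sketch. It is not part of TraceHaplotyper (which is a Perl script); the sequences, the function names `superimpose` and `build`, and the SNP offset are all hypothetical values chosen only to show how the two possible phases produce different mixed reads downstream of a heterozygous indel.

```python
# Illustrative sketch only: hypothetical sequences, not data from the paper.
# IUPAC codes for one base or for two different superimposed bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def superimpose(hap1, hap2):
    """Return the IUPAC string read from an equal mixture of two templates."""
    n = min(len(hap1), len(hap2))
    return "".join(IUPAC[frozenset((hap1[i], hap2[i]))] for i in range(n))

def build(prefix, insertion, suffix, snp_offset, snp_base, has_insertion):
    """Build one haplotype: optional insertion, then the suffix carrying
    the given SNP allele at snp_offset (counted from the suffix start)."""
    tail = suffix[:snp_offset] + snp_base + suffix[snp_offset + 1:]
    return prefix + (insertion if has_insertion else "") + tail

# Hypothetical locus: 6-base shared prefix, 3-base insertion, SNP (A/G)
# at offset 4 of the downstream sequence.
PREFIX, INSERTION, SUFFIX, SNP_OFFSET = "ACGTAC", "TTG", "CACCNCTGAAGC", 4

ins_G = build(PREFIX, INSERTION, SUFFIX, SNP_OFFSET, "G", True)
del_A = build(PREFIX, INSERTION, SUFFIX, SNP_OFFSET, "A", False)
ins_A = build(PREFIX, INSERTION, SUFFIX, SNP_OFFSET, "A", True)
del_G = build(PREFIX, INSERTION, SUFFIX, SNP_OFFSET, "G", False)

phase_1 = superimpose(ins_G, del_A)  # haplotypes Ins + G and Del + A
phase_2 = superimpose(ins_A, del_G)  # haplotypes Ins + A and Del + G

# Position of the SNP in the read depends on whether the insertion is present.
snp_pos_without_insertion = len(PREFIX) + SNP_OFFSET
snp_pos_with_insertion = snp_pos_without_insertion + len(INSERTION)
```

Both individuals are heterozygous for the indel and for the SNP, yet the two mixed reads differ at `snp_pos_without_insertion` and `snp_pos_with_insertion`, which is the information used to assign phase.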
To perform haplotype phasing, TraceHaplotyper uses two input files: (i) the trace file in .ab1, .abi, or .scf format and (ii) a file detailing the expected locations and sequences of the indel and of the SNP(s). Thus, this information about the indel and the sites of the SNPs must be known beforehand or deduced from the forward and reverse sequencing traces by manual inspection.
The algorithm uses the Phred -d option for base calling (5). Each call returns the identity of the two major peaks and their quality. If only one peak is detected, it is assumed that this peak results from two identical major peaks. TraceHaplotyper uses the sequence of the 20 bases immediately upstream of the indel site to identify the corresponding site in the trace file. The program then attempts to determine the genotype of the indel and the SNP by the following procedure. First, it moves to the position where the SNP would be if the deletion were present. To be sure that the right site is being examined, the program searches for the expected base combination immediately upstream of the SNP site. By checking sequence context, the program can correct for a one-nucleotide deviation from the expected location, which might occur due to inaccuracies introduced by the automated sequencing software trying to compensate for sequence-specific fluctuations in electrophoretic mobility. Any failure to observe the expected base combinations at these sites produces an output indicating that TraceHaplotyper could not align the expected sequence with the trace file. Assuming that the correct SNP position is located and confirmed, the base call is analyzed. In the example shown in Figure 1, C and D, this call is A or G (R). The expected base of the chromosome with the insertion at this position is G (see Figure 1A; G is the third nucleotide of the insertion). Hence, the other base, A, is the SNP call for the chromosome lacking the insertion; in other words, the haplotype is Del + SNP A. The entire process is then repeated for the location of the SNP site when the insertion is present. In the example shown in Figure 1C, the base call is G or T (K); in the case shown in Figure 1D, the base call is A or T (W). As shown in Figure 1B, the expected base of the chromosome lacking the insertion at this position is T. Hence, the program would determine bases G and A as the SNP calls in Figure 1, C and D, respectively, corresponding to the haplotypes Ins + SNP G and Ins + SNP A.
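The site-confirmation and allele-resolution steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TraceHaplotyper source: the dual base calls are assumed to be already available as a string of IUPAC codes (one code per trace position), and the function names `find_indel_site`, `locate_site`, and `resolve_allele` are hypothetical.

```python
# Sketch only: assumes dual base calls are given as an IUPAC-coded string.
# Each code maps to the two major peaks it represents; a single detected
# peak is treated as two identical peaks.
IUPAC_PAIRS = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

def find_indel_site(calls, upstream_anchor):
    """Return the index just past the anchor (e.g. the 20 bases upstream
    of the indel), or None if the anchor is not found in the calls."""
    start = calls.find(upstream_anchor)
    return None if start < 0 else start + len(upstream_anchor)

def locate_site(calls, expected_pos, context, slack=1):
    """Confirm a SNP site by requiring the expected base combination
    (`context`, as IUPAC codes) immediately upstream of it. Positions
    within +/- slack of expected_pos are tried, nearest first."""
    for shift in sorted(range(-slack, slack + 1), key=abs):
        pos = expected_pos + shift
        start = pos - len(context)
        if start >= 0 and pos < len(calls) and calls[start:pos] == context:
            return pos
    return None  # expected and observed sequences could not be aligned

def resolve_allele(call, other_expected):
    """Given the dual call at a SNP site and the base expected there from
    the other chromosome, return the remaining base as the SNP allele."""
    pair = IUPAC_PAIRS[call]
    if other_expected not in pair:
        return None  # call is inconsistent with the expected sequence
    # Remove one copy of the expected base; what remains is the SNP allele.
    return pair.replace(other_expected, "", 1)
```

Applied to the calls quoted in the text, `resolve_allele("R", "G")` gives A (Del + SNP A), `resolve_allele("K", "T")` gives G (Ins + SNP G), and `resolve_allele("W", "T")` gives A (Ins + SNP A). Returning `None` instead of guessing mirrors the described behavior of reporting an alignment failure when the expected base combinations are not observed.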
