Microarray genotyping resource to determine population stratification in genetic association studies of complex disease
 
Scott J. Tebbutt, Jian-Qing He, Kelly M. Burkett, Jian Ruan, Igor V. Opushnyev, Ben W. Tripp, Jeffrey A. Zeznik, Chiaka O. Abara, Colleen C. Nelson, Keith R. Walley
University of British Columbia, Vancouver, BC, Canada
BioTechniques, Vol. 37, No. 6, December 2004, pp. 977–985
Supplementary Material
TableS5 (.xls)
TableS4 (.xls)
TableS3 (.xls)
TableS2 (.xls)
TableS1 (.xls)
Abstract

We have developed a robust microarray genotyping chip that will help advance studies in genetic epidemiology. In population-based genetic association studies of complex disease, there could be hidden genetic substructure in the study populations, resulting in false-positive associations. Such population stratification may confound efforts to identify true associations between genotype/haplotype and phenotype. Methods relying on genotyping additional null single nucleotide polymorphism (SNP) markers have been proposed, such as genomic control (GC) and structured association (SA), to correct association tests for population stratification. If there is an association of a disease with null SNPs, this suggests that there is a population subset with different genetic background plus different disease susceptibility. Genotyping over 100 null SNPs in the large numbers of patient and control DNA samples that are required in genetic association studies can be prohibitively expensive. We have therefore developed and tested a resequencing chip based on arrayed primer extension (APEX) from over 2000 DNA probe features that facilitate multiple interrogations of each SNP, providing a powerful, accurate, and economical means to simultaneously determine the genotypes at 110 null SNP loci in any individual. Based on 1141 known genotypes from other research groups, our GC SNP chip has an accuracy of 98.5%, including non-calls.

Introduction

Gene Association Studies

In candidate gene association studies, one first identifies candidate genes that are hypothesized or known to be important in the pathogenesis of a condition. The next step is to identify polymorphisms within or close to the gene that could affect its regulation or function. Finally, one examines whether the polymorphisms occur more frequently in individuals who have a disease than in an appropriate control population. One of the major advantages of candidate gene association studies is that one uses knowledge of biologically plausible pathogenetic mechanisms to focus the search for genes on relatively few candidates. Another advantage is that the study subjects can be unrelated individuals so that genotypic and phenotypic data from multiple generations are not required. This is especially important in complex diseases such as chronic obstructive pulmonary disease (COPD) and atherosclerosis, in which the late age of onset makes it very difficult to ascertain DNA and phenotypic data from parents of affected individuals. However, there are several issues that negatively influence population-based genetic association studies that need to be addressed (1), including increased false positives due to hidden population stratification or admixture (2,3).

False-positive associations (type I errors) can occur if the frequencies of genetic markers and of the disease of interest vary across different population groups. Freedman et al. (4) and Marchini et al. (5) have recently reviewed and generated new data on the impact of population stratification on genetic association studies.
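The effect described above can be illustrated with a small worked example. The sketch below uses hypothetical allele counts (not data from this study) for two subpopulations that differ in both allele frequency and disease prevalence. Within each subpopulation the marker is unrelated to disease, yet pooling them produces a large chi-square statistic. The function name `allele_chi_square` and all counts are assumptions made for illustration.

```python
def allele_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = (
        (case_a + case_b)
        * (control_a + control_b)
        * (case_a + control_a)
        * (case_b + control_b)
    )
    return numerator / denominator


# Hypothetical counts: (case allele A, case allele B, control allele A, control allele B)
# Subpopulation 1: allele A frequency 0.8 in cases and controls, disease common.
subpop_1 = (320, 80, 80, 20)
# Subpopulation 2: allele A frequency 0.2 in cases and controls, disease rare.
subpop_2 = (20, 80, 80, 320)
# Pooling ignores the substructure.
pooled = tuple(x + y for x, y in zip(subpop_1, subpop_2))

chi_1 = allele_chi_square(*subpop_1)
chi_2 = allele_chi_square(*subpop_2)
chi_pooled = allele_chi_square(*pooled)

print(f"subpopulation 1: {chi_1:.1f}")
print(f"subpopulation 2: {chi_2:.1f}")
print(f"pooled sample:   {chi_pooled:.1f}")
```

In this constructed case the statistic is zero within each subpopulation and about 129.6 in the pooled sample, so the apparent association arises entirely from the hidden substructure.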

To avoid problems due to population stratification in association studies, both the cases and controls should be selected from the same population/ethnic group and geographic area. Because population genetic background is difficult to measure, however, it is impossible to guarantee genetic homogeneity. Therefore, alternative strategies based on the use of families have been employed. One such method of analysis is the transmission/disequilibrium test (TDT) (6), which simultaneously tests linkage and association. The TDT evaluates the frequency of transmission of specific alleles at a single locus from heterozygous parents to their affected children. In the absence of association, each allele is expected to be transmitted with the Mendelian frequency of 50%. If a marker allele is transmitted significantly more often than 50% of the time, this implies that the allele is linked to the disease-causing allele. The main advantage of the TDT is that it does not compare groups of cases and controls and therefore is not generally susceptible to population stratification. However, a major drawback of the TDT method is that it requires parental DNA, which is often unobtainable in studies of late-onset diseases such as COPD. An example of this approach is the finding of an association between the IL4-590 polymorphism and asthma (7). The T allele of this polymorphism was transmitted from a heterozygous parent to an affected child in 64% of the informative meioses.
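As a rough illustration of how the TDT compares transmissions with the 50% Mendelian expectation, the sketch below computes the usual McNemar-type statistic, (b − c)² / (b + c), where b and c are the numbers of times the allele was and was not transmitted by heterozygous parents. The function name `tdt` is an assumption, and so is the count of 100 informative meioses: the text above gives only the 64% transmission rate, not the number of meioses.

```python
import math


def tdt(transmitted, untransmitted):
    """TDT statistic and p-value (chi-square, 1 df) from transmission counts."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative meioses")
    statistic = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical: 100 informative meioses, 64 of which transmitted the allele.
statistic, p_value = tdt(transmitted=64, untransmitted=36)
print(f"TDT chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

With these assumed counts the statistic is 7.84; the same 64% rate observed in fewer meioses would give weaker evidence, which is why the count matters.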

Confounding due to population stratification in population-based association studies can produce biased and/or overdispersed test statistics, leading to false positives. While family-based association studies are not subject to the same problem, they have their own limitations, such as the difficulty of recruiting families. Therefore, methods that estimate population substructure and correct the test statistics for its effects have been developed. Genomic control (GC) and structured association (SA) are two such methods.
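One common formulation of GC, sketched here under stated assumptions rather than as the procedure used in this study, estimates an inflation factor λ as the median chi-square statistic at the null SNPs divided by the median of a chi-square distribution with 1 degree of freedom (approximately 0.4549). The candidate statistic is then divided by λ. The function names and the example statistics below are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549


def inflation_factor(null_statistics):
    """Estimate lambda from chi-square statistics at null SNPs (floored at 1)."""
    lam = statistics.median(null_statistics) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)


def gc_correct(candidate_statistic, null_statistics):
    """Deflate a candidate SNP's chi-square statistic by lambda."""
    return candidate_statistic / inflation_factor(null_statistics)


# Hypothetical 1-df chi-square statistics at five null SNPs.
null_stats = [0.2, 0.5, 0.91, 1.3, 2.0]
lam = inflation_factor(null_stats)
corrected = gc_correct(8.0, null_stats)
print(f"lambda = {lam:.2f}, corrected statistic = {corrected:.2f}")
```

Flooring λ at 1 reflects the usual convention that the correction should not make a test more significant. A real analysis would use many more null SNPs than the five shown.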

SA is a statistical method to test for association in the presence of hidden population substructure (8,9). It is a “latent-class” method, which assumes that the sample is composed of individuals from K latent subpopulations, each having a characteristic set of allele frequencies at marker loci. Unlinked genetic marker loci are used to estimate subpopulation parameters. In the first step, a Markov chain Monte Carlo (MCMC) method is used to estimate K, the allele frequencies in each subpopulation, and a set of vectors q_i = (q_1, …, q_K) representing the proportion of each individual's genome from each of the K subpopulations. In the second step, a test statistic that conditions on inferred subpopulation is calculated to account for the population substructure. Because SA actually attempts to infer ancestry, it can require up to several hundred unlinked “null” marker loci (4,10).
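To make the latent-class idea more concrete, the sketch below evaluates the likelihood of one individual's genotypes given a vector of ancestry proportions and subpopulation allele frequencies. It is not the MCMC procedure of the cited methods, only the kind of likelihood term such a procedure might evaluate. It assumes biallelic, unlinked loci and Hardy-Weinberg proportions within subpopulations. All names and values are hypothetical.

```python
import math


def genotype_log_likelihood(genotypes, q, allele_freqs):
    """Log-likelihood of genotypes under a simple admixture model.

    genotypes[l]       -- copies (0, 1, or 2) of the reference allele at locus l
    q[k]               -- proportion of the genome from subpopulation k
    allele_freqs[k][l] -- reference allele frequency at locus l in subpopulation k
    """
    log_lik = 0.0
    for locus, g in enumerate(genotypes):
        # Each allele copy comes from subpopulation k with probability q[k].
        p = sum(q_k * freqs[locus] for q_k, freqs in zip(q, allele_freqs))
        log_lik += math.log(math.comb(2, g) * p**g * (1 - p) ** (2 - g))
    return log_lik


# Hypothetical frequencies at three loci in K = 2 subpopulations.
allele_freqs = [
    [0.9, 0.8, 0.7],  # subpopulation 1
    [0.1, 0.2, 0.3],  # subpopulation 2
]
genotypes = [2, 2, 1]  # an individual carrying mostly reference alleles

ll_pop1 = genotype_log_likelihood(genotypes, (1.0, 0.0), allele_freqs)
ll_pop2 = genotype_log_likelihood(genotypes, (0.0, 1.0), allele_freqs)
print(f"log-likelihood, all ancestry from subpopulation 1: {ll_pop1:.2f}")
print(f"log-likelihood, all ancestry from subpopulation 2: {ll_pop2:.2f}")
```

The genotypes are far more likely under ancestry from subpopulation 1. With only three loci the contrast is crude, which is consistent with the need for many unlinked markers noted above.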



