2, University of Hong Kong, Hong Kong
3, University of Southampton, Southampton, UK
High-throughput genotyping technologies such as DNA pooling and DNA microarrays mean that whole-genome screens are now practical for complex disease gene discovery using association studies. Because it is currently impractical to use all available markers, a subset is typically selected on the basis of required saturation density. Restricting markers to those within annotated genomic features of interest (e.g., genes or exons) or within feature-rich regions reduces workload and cost while retaining much information. We have designed a program (MaGIC) that exploits genome assembly data to create lists of markers correlated with other genomic features. Marker lists are generated at a user-defined spacing and can target features with a user-defined density. Maps are in base pairs or linkage disequilibrium units (LDUs) as derived from the International HapMap data, which is useful for association studies and fine-mapping. Markers may be selected on the basis of heterozygosity and source database, and single nucleotide polymorphism (SNP) markers may additionally be selected on the basis of validation status. The import function means the method can be used for any genomic features such as housekeeping genes, long interspersed elements (LINEs), or Alu repeats in humans, and is also functional for other species with equivalent data. The program and source code are freely available at
High-throughput genotyping technologies such as DNA pooling (1,2) and DNA microarrays (3) mean that whole genome screens are now practical for complex disease gene discovery using association studies. Databases such as the Marshfield genetic map (4), the Advanced Biomedical Computing Center (ABCC) database (5), or the single nucleotide polymorphism (SNP) database dbSNP (6) contain large numbers of polymorphic markers. It is currently impractical to use all available markers, and a subset is typically selected on the basis of required saturation density. Evenly spacing markers is not efficient, as genes are not evenly spaced along chromosomes (7), and linkage disequilibrium (LD) is variable between and within chromosomes (8). Thus, for disease locus discovery, a restricted subset of markers could have the same information as the full set. Restricting markers to those within annotated genomic features of interest, or within feature-rich regions, reduces workload and cost while retaining much information. For instance, markers could be restricted to those within exons, which occupy only a small proportion of the genome. Moreover, the identity of functional elements is continually being determined, and this information is being added to the genome annotation (e.g., as part of the ENCODE project; http://www.genome.gov/10005107). Another endeavour, the International HapMap project (http://www.hapmap.org/), is designed to delineate the LD structure of the genome and generate a highly informative but restricted set of markers that tag most common genetic variation. Combining this LD information with the targeting of genomic features may ultimately yield the most efficient choice of markers, but the HapMap data are available in preliminary form, and the conversion of the genotype data to a genome-wide representation of LD is only now underway. 
We have started to use these data by developing maps with distances in LD units (LDUs) for a subset of the HapMap data used in the ENCODE project. Our approach uses genome assembly data to identify areas rich in particular annotated features of interest, such as genes or exons (9), and matches these with markers at a user-defined spacing. Except for the ENCODE region, which we have already converted, marker spacing is currently defined in base pairs; the remainder will be converted to LDUs when whole-genome LD information becomes available from the HapMap Project.
Materials and Methods
Computer Program
A database was constructed from the July 2003 construction of the human genome [National Center for Biotechnology Information (NCBI) build 34; http://genome.ucsc.edu/] containing both known and predicted genes, their exons, National Institutes of Health (NIH) reference SNPs from dbSNP, and polymorphic microsatellites with unambiguous genetic and physical map data from the Marshfield (4) and ABCC databases (5). A program was designed (Marker Gene Interspacing and Correlation; MaGIC) to use the database to create feature-targeted marker lists. MaGIC is written in Visual FoxPro and compiled for Microsoft® Windows® 95, 98, ME, NT, 2000, and XP platforms.
Algorithm for Targeting Genes
Each gene with start position s base pairs is placed into one of many sequential bins, each w base pairs wide, the bin number nb calculated by:
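A plausible form of this rule, assuming integer truncation and bins numbered from 1 (both conventions are assumptions rather than details confirmed by the text), is:

$$n_b = \left\lfloor \frac{s}{w} \right\rfloor + 1$$

Under this reading, a gene starting at s = 2,450,000 bp with w = 1,000,000 bp falls in bin 3.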
Gene-rich regions are defined by counting the number of genes in each bin and then ranking the bins by gene count in descending order. This allows gene-dense regions to be specifically targeted at the expense of gene-poor regions. A user-defined cutoff of genes per bin is used to define valid bins, and one marker is selected per valid bin. In this algorithm, marker spacing corresponds to bin width, so marker density is a function of bin size. A nested algorithm allows multiple markers to be selected for each bin if required.
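The binning, ranking, and cutoff steps can be sketched in a few lines of Python. This is a minimal illustration and not the MaGIC implementation (which is written in Visual FoxPro): the function names, the 1-based bin numbering, the rule of choosing the marker nearest the bin centre, and all positions are hypothetical.

```python
from collections import Counter


def bin_number(position, width):
    """1-based bin index for a base-pair position (numbering is an assumption)."""
    return position // width + 1


def rank_bins(gene_starts, width):
    """Return (bin, gene_count) pairs, most gene-dense bins first."""
    counts = Counter(bin_number(s, width) for s in gene_starts)
    # Sort by gene count descending, then by bin number for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def select_markers(gene_starts, marker_positions, width, min_genes):
    """Pick one marker for each bin holding at least min_genes genes.

    Choosing the marker nearest the bin centre is an assumed rule;
    the text only states that one marker is selected per valid bin.
    """
    chosen = {}
    for b, count in rank_bins(gene_starts, width):
        if count < min_genes:
            continue  # bin falls below the user-defined cutoff
        centre = (b - 0.5) * width
        in_bin = [m for m in marker_positions if bin_number(m, width) == b]
        if in_bin:
            chosen[b] = min(in_bin, key=lambda m: abs(m - centre))
    return chosen


# Illustrative positions in base pairs (not real data).
genes = [120_000, 450_000, 900_000, 1_300_000, 2_050_000, 2_400_000, 2_900_000]
markers = [100_000, 520_000, 1_700_000, 2_480_000, 2_950_000]

ranked = rank_bins(genes, 1_000_000)
selected = select_markers(genes, markers, 1_000_000, min_genes=2)
print(ranked)
print(selected)
```

With a 1 Mb bin width and a cutoff of two genes per bin, the sparse middle bin receives no marker, which is the intended trade-off between gene-dense and gene-poor regions.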
Determination of LDUs
Both gene position and bin width can be measured in LDUs instead of base pair units. LD maps (10) are constructed from the Malecot equation,
$$\rho = (1 - L) M e^{-\varepsilon d} + L,$$
modeling the decline in association ρ as a function of distance d. The parameter M is the maximum association at zero distance, reflecting association at the last major bottleneck; L is the residual association at large distance; and ε is the rate of exponential decline of ρ with distance in kilobases. The Malecot parameters ε and M are estimated by fitting multiple pairwise association probabilities, ρ, and their corresponding information, Kρ, using composite likelihood. Construction of LD maps requires estimating ε within each map interval, and the length of an interval in LDUs is defined as εd, where d is the interval width in kilobases.
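To make the LDU definition concrete, the following sketch evaluates the Malecot model in its usual form, ρ = (1 − L)Me^(−εd) + L, and accumulates εd over consecutive intervals to give each marker an LDU location. It assumes the per-interval ε values have already been estimated; the composite likelihood fitting is not shown, and the function names and numbers are hypothetical.

```python
import math


def malecot_rho(d_kb, M, L, epsilon):
    """Expected association at distance d_kb (kilobases) under the Malecot model."""
    return (1 - L) * M * math.exp(-epsilon * d_kb) + L


def ldu_map(positions_kb, epsilons):
    """Cumulative LDU location of each marker.

    positions_kb: marker positions in kilobases, in ascending order.
    epsilons: one previously estimated epsilon per interval between adjacent markers.
    """
    if len(epsilons) != len(positions_kb) - 1:
        raise ValueError("need exactly one epsilon per interval")
    locations = [0.0]
    for i, eps in enumerate(epsilons):
        d = positions_kb[i + 1] - positions_kb[i]  # interval width in kb
        locations.append(locations[-1] + eps * d)  # interval length in LDUs is eps * d
    return locations


# Illustrative values only (not estimates from HapMap data).
positions = [0.0, 10.0, 25.0, 30.0]
eps = [0.02, 0.0, 0.1]

ldu = ldu_map(positions, eps)
print(ldu)
print(malecot_rho(0.0, M=0.9, L=0.1, epsilon=0.05))
```

An interval with ε = 0 contributes no LDU distance however wide it is in kilobases, so a bin width measured in LDUs spans more physical distance where LD is extensive and less where it breaks down quickly.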