2Department of Anthropology, University of Florida, Gainesville, FL, USA
3Department of Biochemistry, Faculty of Medicine, Sana'a University, Sana'a, Yemen
Two-dimensional and three-dimensional PCA plots of the core and test datasets.
Whole-genome studies of genetic variation are now performed routinely and have accelerated the identification of disease-associated allelic variants, positive selection, recombination, and structural variation. However, these studies are sensitive to the presence of outlier data from individuals of different ancestry than the rest of the sample. Currently, the most common method of excluding outlier individuals is to collect a population sample and exclude outliers after genome-wide data have been collected. Here we show that a small collection of 20–27 polymorphic Alu insertions, selected using a principal component–based method with genetic ancestry estimates, may be used to easily assign Africans, East Asians, and Europeans to their population of origin. In addition, we show that samples from a geographically and genetically intermediate population (in our study, samples from India) can be identified within the original sample of Africans, East Asians, and Europeans. Finally, we show that outlier individuals from neighboring geographic regions (in our study, Yemen and sub-Saharan Africa) can be identified. These results will be of value in preselection of samples for more in-depth analysis as well as customized identification of maximally informative polymorphic markers for regional studies.
Data sets comprising thousands of individuals genotyped at hundreds of thousands of markers are now collected on a near-routine basis (1,2,3). These data have accelerated the identification of disease-associated allelic variants (1,2,3) and have also contributed significantly to characterizations of the pattern of natural selection (4,5,6,7,8), recombination (9,10), and structural variation (11,12) in human populations. However, data from admixed or outlier individuals, as well as undetected population structure, in these new world-wide sample sets may bias some results by obscuring information and increasing the occurrence of false positives (13,14). As such, it is increasingly important to be able to efficiently identify outlier individuals and population structure in samples chosen for analysis. Once identified, outlier samples may be removed or the data set may be partitioned into appropriate subsets. Post-hoc population assignment and control for population structure can be conducted from the full marker set (15,16), but it may be more efficient and economical to conduct a population structure study prior to large-scale genotyping. Furthermore, analysis of a subset of markers chosen to be maximally informative based on the specific study populations can form the basis of a population history study.
Multi-state markers such as microsatellites have many desirable properties, but biallelic markers [e.g., single nucleotide polymorphisms (SNPs), transposable element polymorphisms, and restriction fragment length polymorphisms (RFLPs)] are generally preferred for the simplicity of data collection and variant detection. The method we describe herein is applied to a set of polymorphic Alu insertions, but the method should be generally applicable to many biallelic markers. Polymorphic Alu insertions have several advantages as population structure investigation markers. They are inexpensive and easy to type because they are unambiguous to call and may be easily discriminated on a simple agarose gel. Most importantly, no homoplasy of these markers has yet been shown—that is, there is no known mechanism of back-mutation—meaning they are identical by descent markers (17,18). This property suggests that polymorphic Alu markers may be preferable for population assignment relative to SNPs and simple tandem repeats (STRs), which may be identical by state but not identical by descent (19).
A previous analysis of 100 polymorphic Alu insertions in samples of East Asian, European, and African origin demonstrated that these markers could be used to assign individuals to groups that correspond to their continent of origin (19,20). In addition, the analysis showed that group assignments were accurate with 60 randomly selected Alu markers (mean probability of correct assignment = 0.933 ± 0.13) but less accurate with 20 randomly selected Alu markers (mean probability of correct assignment = 0.866 ± 0.20). A follow-up study also found that at least 50 randomly selected markers were required for effective structure identification (21). However, random selection is not an optimal strategy for marker selection; indeed, there are now several published methods that improve on random selection (22,23). Recently, several studies have used a principal component–based approach to identify informative SNPs for determination of human population structure based on rankings of the principal components (24,25,26).
Our goal in this study was to show that principal component analysis (PCA) rankings can be used to identify the most informative binary markers for accurate population assignment. As an example of binary markers, we used the previously published Alu data set (19,20) [plus additional unpublished data; L. Jorde (Department of Human Genetics, University of Utah, Salt Lake City, UT, USA), personal communication] (Table 1). We tested whether sets of fewer than 100 Alu markers could be used to correctly assign sub-Saharan Africans, East Asians, and Europeans to their respective groups. We also compared our method to a previously published PCA-based method (26). Although the two methods share a similar theoretical basis and yield equivalent results, our method will always provide a ranked set of markers, whereas the method of Paschou et al. (26) will only rank markers when statistically significant principal components have been identified. We further demonstrated the utility of our method to identify outlier individuals (i.e., individuals of different ancestry than the rest of the study population) in two test situations. First, we determined the number of markers and critical value (i.e., minimum ancestry) needed to ensure that individuals from genetically and geographically intermediate populations (in our study, those from India) were not mistakenly assigned to any of the African, East Asian, or European groups. Second, we showed that individuals exhibiting admixture from an adjacent geographic region can be identified: for example, we identified an individual with primarily African ancestry in a population sample of ethnic Yemenis.
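The general idea of ranking biallelic markers by their contribution to the leading principal components can be illustrated with a minimal Python sketch. It is not the exact ranking statistic used in this study, which is described in the methods. The sketch assumes genotypes coded as insertion-allele counts (0, 1, or 2) and scores each marker by its squared loadings on the top components, weighted by the variance each component explains. The weighting is an illustrative choice, and the function names, simulated allele frequencies, and sample sizes are hypothetical. Two components are used because three groups can be separated along at most two axes.

```python
import numpy as np


def rank_markers_by_pca(genotypes, n_components=2):
    """Rank markers (columns) by variance-weighted squared PCA loadings.

    genotypes: (n_individuals, n_markers) array of allele counts (0, 1, 2).
    Returns (order, scores): marker indices sorted from most to least
    informative, and the per-marker scores in the original column order.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # center each marker
    # SVD of the centered matrix; rows of vt are the principal axes (loadings).
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, vt.shape[0])
    # Assumed scoring rule: squared loading times squared singular value,
    # summed over the top k components.
    scores = ((s[:k, None] ** 2) * vt[:k] ** 2).sum(axis=0)
    order = np.argsort(scores)[::-1]
    return order, scores


def simulate_genotypes(n_per_pop=100, n_informative=21, n_neutral=79, seed=0):
    """Hypothetical toy data: three populations, a few differentiated markers."""
    rng = np.random.default_rng(seed)
    n_markers = n_informative + n_neutral
    freqs = np.full((3, n_markers), 0.5)  # neutral markers: same frequency everywhere
    for j in range(n_informative):
        freqs[:, j] = 0.05          # rare in two populations...
        freqs[j % 3, j] = 0.95      # ...and common in the third
    labels = np.repeat(np.arange(3), n_per_pop)
    genotypes = rng.binomial(2, freqs[labels])  # diploid allele counts
    return genotypes, labels


if __name__ == "__main__":
    genotypes, labels = simulate_genotypes()
    order, scores = rank_markers_by_pca(genotypes, n_components=2)
    print("Top-ranked marker indices:", sorted(order[:21].tolist()))
```

In this toy setting, the differentiated markers (columns 0 to 20) are expected to dominate the top of the ranking, and a reduced panel can be formed by keeping only the highest-ranked columns. Real data would additionally require handling of missing genotypes, which the sketch omits.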