Genotype data from thousands of individuals are deposited in, and accessed from, online databases associated with the HapMap Project and the 1000 Genomes Project. The data are anonymous, but just how private are they?
“This is the first study showing that you can do this end-to-end,” said principal investigator Yaniv Erlich of the Whitehead Institute. “You can sit next to your computer, and if you have the right knowledge, you can go all the way from sequence files in 1000 Genomes, just using public tools, and in some cases go back to people.”
Before his career in science, Erlich worked for a security company testing bank security systems, so it is no surprise that he began to wonder about the security of public genetic data. In 2009, a study from Jane Gitschier at the University of California, San Francisco, caught his eye. In that study, Gitschier demonstrated that it was possible to identify potential surnames for individuals who contributed genomic data to the HapMap Project (2). Yet Gitschier did not positively identify single individuals, only surnames that matched numerous individual sequences.
To see if they could identify the full names of genomic research participants, Erlich and his team used lobSTR, a freely available informatics tool they developed, to analyze the genetic information made public by the 1000 Genomes Project. They focused on the Y chromosome because it and the family surname are both typically passed from father to son, so the two become tightly linked, providing ripe targets for identification.
Then Erlich’s team entered the lobSTR results into two recreational databases, Ysearch and SMGF, both free-of-charge websites that allow individuals to search for genealogy matches based on Y chromosome genetic data. On average, the databases returned a last name associated with the genetic information 12% of the time. But a last name alone is not an identity: thousands of individuals in the United States share the same last name.
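To make the surname-matching step concrete, here is a minimal sketch of how a Y chromosome marker profile might be compared against surname-tagged records. It is not the team’s method or the search logic of Ysearch or SMGF: the records, surnames, thresholds, and function names are all invented for illustration, and only the marker names (DYS19, DYS390, DYS391, DYS393) are real Y chromosome markers.

```python
from collections import Counter

# Hypothetical toy records: (surname, {marker: repeat count}).
# Surnames and repeat counts are invented; this is not real data.
TOY_DATABASE = [
    ("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Birch", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 12}),
]


def compare_profiles(query, record):
    """Return (markers typed in both profiles, markers that disagree)."""
    shared = query.keys() & record.keys()
    mismatches = sum(1 for marker in shared if query[marker] != record[marker])
    return len(shared), mismatches


def candidate_surnames(query, database, min_shared=3, max_mismatches=1):
    """Tally surnames whose records closely match the query profile.

    The thresholds are arbitrary assumptions chosen for this sketch.
    """
    hits = Counter()
    for surname, profile in database:
        shared, mismatches = compare_profiles(query, profile)
        if shared >= min_shared and mismatches <= max_mismatches:
            hits[surname] += 1
    return hits.most_common()


query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
print(candidate_surnames(query, TOY_DATABASE))
```

In this toy example both "Alder" records fall within one mismatch of the query and the "Birch" record does not, so the only candidate surname returned is "Alder".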
So for the final step, Erlich’s team used age and state of residency, demographic information that is not protected by the US Health Insurance Portability and Accountability Act (HIPAA) and is typically associated with genetic data online, to narrow their results to specific individuals. In one case, “we got from 300 million people in the US very quickly to two males, just based on public searches,” said Erlich. “At that point, we could just call each of them and ask if they participated in a genetic study.”
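The narrowing step is, in essence, a filter: each extra attribute cuts the pool of candidates. The sketch below illustrates that idea under stated assumptions. The `Person` record, the `narrow` function, the one-year tolerance on birth year, and every entry in the toy list are hypothetical; the article does not describe which public sources or matching rules the team actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Hypothetical public-record entry; all values below are invented."""
    surname: str
    birth_year: int
    state: str


def narrow(records, surname, age, state, as_of_year, tolerance=1):
    """Keep records matching the surname, state, and approximate birth year.

    The birth year is estimated as as_of_year - age; the tolerance allows
    for not knowing whether the birthday had passed when age was recorded.
    """
    target = as_of_year - age
    return [
        p for p in records
        if p.surname == surname
        and p.state == state
        and abs(p.birth_year - target) <= tolerance
    ]


records = [
    Person("Alder", 1950, "UT"),
    Person("Alder", 1951, "UT"),
    Person("Alder", 1950, "NY"),
    Person("Alder", 1980, "UT"),
    Person("Birch", 1950, "UT"),
]

matches = narrow(records, surname="Alder", age=60, state="UT", as_of_year=2010)
print(len(matches), "candidates remain")
```

Here five toy records shrink to two once surname, state, and approximate age are applied together, which mirrors on a tiny scale how a large population can collapse to a handful of candidates.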
After identifying just five genomes from the 1000 Genomes Project, the team was able to deduce the identities of 50 people within three families.
Erlich emphasized that he will not reveal the names of those identified and does not wish to curtail the public sharing of genetic information. “Just in my lab alone, we have used public databases in two studies to identify the genes involved in devastating pediatric disorders, giving hope to families,” said Erlich. “We’re not saying, ‘Oh my god! This is a privacy issue and let’s shut down all the databases.’ Completely the opposite. The focus of this study is to illuminate the current status of genetic privacy, to engage public discussion about it, and to maybe get some better legislation and policies to protect data misuse.”
Prior to publication, Erlich shared his findings with officials at the NIH, who subsequently removed age information from certain online databases. In the pages of Science, the directors of the National Human Genome Research Institute and the National Institute of General Medical Sciences have also called for the research community to begin a “rigorous and open discussion” about how to balance the benefits of data sharing with the privacy of research participants.
1. Gymrek, M., A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. 2013. Identifying personal genomes by surname inference. Science, 339:321-24.
2. Gitschier, J. 2009. Inferential genotyping of Y chromosomes in Latter-day Saints founders and comparison to Utah samples in the HapMap Project. Am J Hum Genet., 84:251-8.