Genome databases contaminated

Julie Manoharan

Large portions of international genome databases are contaminated with human DNA.

University of Connecticut (UConn) researchers have discovered human DNA contamination in international genome databases, raising concerns among the scientific community about current standards for validation of sequencing data.

The team, led by UConn associate professor of molecular and cell biology Rachel O’Neill, searched for AluY, a primate-specific sequence, in 2749 non-primate sequencing data sets and genome assemblies from the National Center for Biotechnology Information (NCBI), Ensembl, the Joint Genome Institute (JGI), and the University of California, Santa Cruz (UCSC) databases. They found the primate-specific sequence in 492 of the analyzed non-primate data sets.

Rachel O’Neill and graduate student Mark Longo found a primate-specific sequence in 492 non-primate data sets. Source: University of Connecticut

“We found that 25% of NCBI’s trace archives were contaminated,” said O’Neill. “Human DNA was found in, for example, a zebrafish, a platypus, and a cow. These aren’t species that human genomes interact with.”

Because the study screened only for AluY, O’Neill believes that more contamination could be discovered if these databases were screened for other primate-specific sequences.

In response to O’Neill’s paper, the NCBI issued a statement saying that the paper overestimates the severity of the contamination problem. The response explains that O’Neill’s team used data from unscreened, preliminary databases and that the amount of contamination in most screened databases is so small that those sequences would most likely not contribute to a resulting gene model. But the NCBI still cautions researchers to consider sequence contamination when interpreting results to avoid confusion in evolutionary and comparative studies.

Since publishing her findings, O’Neill has been contacted by dozens of researchers who have spent time and money tracking sequences that were later found to be contaminated. “People must be aware that this is happening,” said O’Neill. Because most researchers validate their findings, O’Neill believes that existing research based on these contaminated databases may be largely unaffected.

The contamination, however small, could have come from the people handling DNA sequencing libraries or from tissue contamination. “[Contamination] could happen in many different ways,” said O’Neill. “So many different labs and so much effort is put into these genome sequencing initiatives that it’s virtually impossible to figure out how it actually happened.”

While computational filters designed to identify contaminants are often effective, it is almost impossible to identify or quantify the amount of human DNA contamination in human sequencing data sets. As clinical sequencing becomes more common, scientists cannot rely on these limited filters to screen contaminants.

Higher standards and protocols for handling genome samples are necessary to eliminate contamination at its source, says O’Neill. And such protocols are possible. For example, O’Neill’s group found 172 influenza genomes that were wholly uncontaminated. She believes that because the researchers were working with an infectious agent, they took extreme care in sample handling, reducing the risk of contamination.

“The care that goes into handling those samples is so extreme and that’s the kind of care that needs to be applied [to all samples],” says O’Neill.

The paper, “Abundant human DNA contamination identified in non-primate genome databases,” was published 16 Feb. 2011 in PLoS ONE.