When reference genomes go bad | Top Sequencing Feature of 2010

Andrew S. Wiecek

Recently, the quality of several reference genomes has been called into question. Andrew Wiecek investigates the issue of reference genome quality and talks with several organizations developing tools to help researchers improve genome annotation.


While studying the evolution of sheep and goats, French researcher Alexandre Hassanin and his colleagues grew increasingly concerned that the mitochondrial reference sequence for the domestic goat, Capra hircus, was contaminated. The reference genome, published in 2003 (1), is archived in RefSeq, a curated sequence database maintained by the National Center for Biotechnology Information (NCBI).

The Vietnamese goat sequenced by Hassanin et al. in their recent paper. The animal was collected in 2007 in the Pae village (Thua Thien Hue province, Vietnam). Source: Mitochondrial DNA

To test their theory, Hassanin and his team sequenced the mitochondrial genome of a goat and compared it with the sequences available in the databases, both complete genomes and fragments. They discovered that the researchers who submitted the complete goat reference had actually sequenced only three fragments of the genome; the remaining six fragments used in the assembly were obtained from public databases and had been sequenced by six different labs. Hassanin and his colleagues found an unusually high number of sequencing errors in those three fragments, leading them to conclude that this reference genome should not be used for future studies (2).

Reference genomes, such as the goat mitochondrial genome, are important since they provide researchers with a reference point and a representation of a particular species’ genetic code and structure. As such, these genomes are often assembled from multiple examples of a species to provide a reference for variation.

“The reference genome for any kind of sequence genome is an artificial decision that one makes,” says Pascale Gaudet, research assistant professor at Northwestern University (Evanston, IL) and president of the International Society for Biocuration. For example, the human reference genome is neither the sequence of one individual nor an average of the polymorphisms in the human population. Instead, it is a composite of sequences from the first group of people who were sequenced. “It’s helpful to think about a reference genome so we all have the same coordinates to refer to when we compare different strains or different species,” says Gaudet.

Reference genomes are designed to provide an approximation of the DNA of any single individual, so the best possible representation of any genome will help future researchers.

“The fact is that the quality [of reference genomes] varies enormously across the scope, according to the genome and to the community,” says Paul Kersey, team leader for the EBI’s Ensembl Genomes annotation system for genome analysis and visualization. “At one end you have the human genome which has been very highly sequenced extensively finished, and there’s a massive annotation program going on. But for some of the lesser known genomes, things are still improving.”

Although scientists have gotten very good at calling individual bases, issues remain with the genome assembly and annotation. The community interested in the particular organism is usually charged with the continued maintenance of the reference genome, keeping it relevant for research purposes. And the issue becomes the incredible amount of time, effort, and funding needed to maintain these reference sequences.

A graphical representation of the latest human assembly, GRCh37. The genome is colored with respect to the genomic component used to build the genome assembly at that location. The red triangles mark regions where alternate loci have been provided. Source: The Genome Reference Consortium

First appearances can be deceiving
Initial assemblies often have problems such as gaps, duplications, insertions, and deletions. Complex, duplicated regions in mammalian genomes are a challenge to sequence using the shorter-read, high-throughput next-generation sequencing methods. So researchers often turn to the tried-and-true, but significantly more expensive Sanger methodology to obtain longer reads that enable scaffold assembly of complex genomes.

“Even with that technology, there were still regions of the genome that were inaccessible due to their complexity,” says Deanna Church. Church is a staff researcher at the National Center for Biotechnology Information (NCBI) and manages the Genome Reference Consortium (GRC), a group of about 20 scientists from a number of genome research institutes, including the European Bioinformatics Institute (EBI), the NCBI, The Sanger Institute, and Washington University in St. Louis, MO. The GRC was established to improve the quality of the human reference assembly.

Ten years ago, when an international coalition of researchers announced the completion of the Human Genome Project, the first assembly contained over 150,000 gaps: regions that scientists thought were just too difficult to resolve at the time.

But the human reference assembly is still being updated by the GRC group, which fixes misrepresented regions in the reference, closes as many remaining gaps as possible, and produces alternative assemblies of structurally variant loci. And the scientific community can report regions that they feel are in need of further review at the GRC web site.

In March 2009, the GRC released the 19th version of the human genome, referred to as GRCh37. In this release, the GRC closed 25 gaps, resolved over 150 reported issues, and added alternate loci for three regions. One of the closed gaps, a region on chromosome 4, was the result of a duplication; two alleles were mapped in that region instead of one. The GRC closed the gap and made an alternate representation for that region to represent the diversity. “We have been adding sequences in places and take away certain sequences that shouldn’t have been in the assembly in the first place,” says Church.

The GRC does not take changes to the reference lightly; there are researchers in the community who rely on the assembly, so a significant change could alter their experimental data. “I will say I have experienced the wrath of people who were not happy that we updated the assembly,” says Church. “Our assertion is that when we update the assembly, we improve it. We are continuously coming up with ways that we can continue to do that work without causing too much angst and trouble for people who are trying to do whole-genome analysis.”

Even with constant annotation updates, issues remain. A recent study led by Evan Eichler from the University of Washington found 2363 sequences from nine donors couldn’t be mapped to the human reference genome (3). “When the human genome was put together, at any given position, it’s essentially one haplotype that’s represented there,” Eichler told BioTechniques at the time. “So [what] follows from that [is] that there must be pieces of DNA that we know nothing about that exist in maybe the majority of humans.”

The problem of how to represent variation is something very much at the top of the GRC’s to-do list. “You really want a genome that can represent all of those alleles so you can really represent the diversity of the population. Those are some of the challenges that the GRC has been trying to address in collaboration with groups like David Schwartz’s lab in Wisconsin and Evan Eichler’s at the University of Washington,” says Church.

In addition to the human reference, the GRC also maintains the mouse reference genome and will begin maintenance of the zebrafish genome in the future. Although the GRC has software tools that could be scaled to maintain other genomes, they have no intentions of maintaining any more genomes, since they don’t have the resources for the additional curation or experimentation.

“If we took another genome, we would really need the commitment from that community to provide the experimental resources,” says Church. “They will have the experimental resources to investigate the problems that we find. Some of these regions are sufficiently complicated that Washington University or the Sanger Institute has to do some experimental work to sort out the structure.”

A question of manpower
Appropriate resources are a key factor in the quality and relevance of any reference genome. “And here of course is a problem,” says Kersey, “because that’s never going to happen for 10,000 genomes. If you take a look at the bioinformatics landscape at the moment, there’s a lot of attention on human.” The other staple model organisms of biomedical research, such as Drosophila and Caenorhabditis elegans, also have well-funded communities that maintain and annotate their particular reference assembly because they have the experimental infrastructure established.

To support these communities in curation and annotation, the International Society for Biocuration, which is led by Gaudet, is helping to define the profession of biocuration by providing a forum for biocurators, developers, researchers, and students to exchange experience and ideas. The group also lobbies for increased funding for resources and organizes workshops to train new biocurators or interested students in the use of common tools such as the Gene Ontology, Genotype, and Phenotype curation.

Gaudet is also involved with the Gene Ontology Consortium’s Reference Genome Group, which supports the independent communities that maintain reference genomes to increase the depth of annotations for genes in twelve major model organisms used by researchers. The group promotes annotation standards across these reference genomes, so researchers can better understand the evolutionary relationship between genes in these different organisms.

Kersey believes that the funding agencies will have to begin providing platforms to organize this data for the long term. “While we have traditional archives such as the NCBI and the EBI, these are a very good way of storing the record and showing what’s published in the biological literature, but what they don’t have is genome scale organization. They are not designed primarily as a platform for experimental data and interpretations.”

Although the NCBI’s RefSeq is a curated collection, the data is pulled from the agency’s GenBank database, which accepts sequence data with minimum quality control. RefSeq curators select the best GenBank entries and often copy them to RefSeq. The quality of the reference genomes in RefSeq again comes back to the community’s interest in the representation of their organism.

Unlike RefSeq, the EBI’s Ensembl databases include only genomes that have a strong, funded community that is actively involved in improving the reference. “What we try to do is not record genes, but develop a high-quality tool suite that allows people to see integrated data from a wide range of experiments,” says Kersey. “To keep this up to date is important because [it] ensures that when a genome is sequenced, the annotation continues to get updated.”

It takes a community
In the goat genome paper, Hassanin and his colleagues propose adding a new link to each accession number in the nucleotide databases: an additional annotation field named “external expertise” that would be updated to validate good-quality data and to indicate problems with the sequence. The idea appears to be similar to the commenting features on a blog.

But Gaudet is not so sure that these comments on the quality of data will be prominent enough. “It’s interesting, but you have to kind of happenchance on it,” she says. “It’s not indexed anywhere, so it’s kind of a dead-end way to handle this. It’d be a little post-it note on the web.” The challenge, she says, is to make the evaluation more powerful than just a comment.

There’s always going to be imperfect, unannotated data, so Gaudet believes that biocurators will need some way to represent the quality of the sequence, assembly, and annotation in the future. She suggests that, after enough data is compiled, it may be possible to create a confidence index to assess the quality of a new submission. “I haven’t seen anyone do this yet,” says Gaudet. “There’s a lot of poorly sequenced, poorly annotated data out there, so we’re going to need to have a way to prove this a lot better than we do right now.”

In the case of the goat genome, the community may not be large enough to support the genomic sequences published in RefSeq. But there is some hope on the horizon. In 2006, the goat and sheep DNA database was established in an effort to bring together researchers looking into the sequences of these related organisms. And as sequencing costs continue to slide, coverage of these now-outlier organisms will expand, hopefully providing even better quality references in the future.

1. Parma, P., M. Feligini, G. Greppi, and G. Enne. 2003. The complete nucleotide sequence of goat (Capra hircus) mitochondrial genome. Goat mitochondrial genome. DNA Seq 14:199–203.

2. Hassanin, A., C. Bonillo, B.X. Nguyen, and C. Cruaud. 2010. Comparisons between mitochondrial genomes of domestic goat (Capra hircus) reveal the presence of numts and multiple sequencing errors. Mitochondrial DNA. Early Online: 1–9.

3. Kidd, J.M., N. Sampas, F. Antonacci, T. Graves, R. Fulton, H.S. Hayden, C. Alkan, M. Malig, et al. 2010. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nature Methods 7:365–371.

