Just as maps and GPS have become critical for reaching our everyday destinations, reference genomes have become essential to geneticists. These genetic roadmaps give researchers common reference points for a particular species’ genetic code. Since variations in that code exist even between organisms of the same species (there can be millions of single nucleotide polymorphisms between two people, for example), these genomes are often assembled from multiple individuals of a species in an attempt to provide a reference for this variation.
“The reference genome for any kind of sequence genome is an artificial decision that one makes,” notes Pascale Gaudet, a research assistant professor at Northwestern University in Chicago, Illinois, and the current president of the International Society for Biocuration. The human reference genome, for example, is neither the sequence of one individual nor an average of the polymorphisms found in the human population. Instead, it combines sequence from the first four people who were sequenced.
Although reference genomes are designed to provide an approximation of the DNA of any single individual organism, capturing all that information in a single sequence is a difficult task.
“The fact is that the quality [of reference genomes] varies enormously, according to the genome and to the community,” says Paul Kersey, a team leader for the Ensembl Genomes genome analysis and visualization annotation system at the European Bioinformatics Institute (EBI). “At one end you have the human genome, which has been very highly sequenced and extensively finished with a massive annotation program going on. Some of the lesser known genomes, well, things are still improving.”
At First Glance
Initial genome assemblies often have problems with gaps, duplications, insertions, and deletions that can require months, or even years, to sort through. The speed of sequencing has grown exponentially in recent years thanks to the emergence of next-generation methods, but complex, duplicated regions of mammalian genomes are a challenge to map using these short-read, high-throughput approaches. So researchers often return to the tried-and-true, but significantly more expensive and time-consuming, Sanger methodology to obtain the longer reads necessary for scaffold assembly and mapping of these complex regions.
“Even with that technology, there were still regions of the genome that were inaccessible due to their complexity,” says DeAnna Church. Church is a staff researcher at the National Center for Biotechnology Information (NCBI). She also manages NCBI's Genome Reference Consortium (GRC), a group of about 20 scientists from a number of genomic research institutes around the world working to improve the quality of the human reference assembly.
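The difficulty that short reads have with duplicated regions can be illustrated with a toy example. The sketch below is only an illustration under simplifying assumptions: the reference string, the reads, and the helper `find_placements` are all hypothetical, and exact string matching stands in for the far more sophisticated alignment that real mapping tools perform. It shows that a read lying entirely inside a duplicated segment has more than one possible placement, while a longer read that reaches into unique flanking sequence has only one.

```python
def find_placements(reference: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue searching one base further along, so overlapping hits are found too.
        start = reference.find(read, start + 1)
    return positions


# Toy reference (hypothetical): the same 12-base segment occurs twice,
# separated and flanked by unique sequence.
REPEAT = "ACGTTGCAAGCT"
reference = "GGATCC" + REPEAT + "TTAGGCAT" + REPEAT + "CCGTAA"

# An 8-base read that falls entirely inside the duplicated segment.
short_read = REPEAT[2:10]

# A 20-base read that spans the first copy plus unique sequence on both sides.
long_read = "ATCC" + REPEAT + "TTAG"

short_hits = find_placements(reference, short_read)
long_hits = find_placements(reference, long_read)

print("short read placements:", short_hits)  # ambiguous: more than one position
print("long read placements:", long_hits)    # unambiguous: a single position
```

In this toy case the short read fits both copies of the duplicated segment equally well, so its true origin cannot be determined from the read alone. The longer read is anchored by the unique bases on either side.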
When researchers announced the completion of the Human Genome Project, the first assembly they generated contained over 150,000 gaps; these were regions scientists thought were just too difficult to resolve at the time using the available technology. Today, the human reference assembly continues to be updated. The GRC released the 19th version of the human genome—referred to as GRCh37—in March 2009. In this release, 25 more gaps were closed, over 150 reported issues were resolved, and alternate loci for three regions were added.
Changes to the reference are not made lightly; researchers rely on the assembly, so any significant change could alter their experimental data. “Well, I will say I have experienced the wrath of people who were not happy that we updated the assembly,” says Church. “Our assertion is that when we update the assembly, we improve it. We are continuously coming up with ways that we can continue to do that work without causing too much angst and trouble for people who are trying to do whole-genome analysis.”
Still, issues remain. A recent study led by Evan Eichler from the University of Washington found that 2,363 sequences from nine donors could not be mapped to the human reference genome. “When the human genome was put together, at any given position, it's essentially one haplotype that's represented there,” explains Eichler. “So [what] follows from that [is] that there must be pieces of DNA that we know nothing about that exist in maybe the majority of humans.”