The problem of how to represent such population variation is very much at the top of the GRC's to-do list. “You really want a genome that can represent all of those alleles, so you can represent the diversity of the population,” says Church.
At the moment, the GRC maintains the human and mouse reference genomes and will begin maintenance of the zebrafish genome in the future. Although they have the software tools to scale up and maintain other genomes, they have no intention of adding others at the moment.
“If we took another genome, we would really need the commitment from that community to provide the experimental resources,” says Church. “They will have [to have] the experimental resources to investigate the problems that we find.”
A Question of Manpower
Resources are the limiting factor when it comes to obtaining a greater number of quality reference genomes. “And here of course is a problem,” says EBI's Kersey, “because that's never going to happen for 10,000 genomes. If you take a look at the bioinformatics landscape at the moment, there's a lot of attention on human.” Other staple model organisms in biomedical research, such as Drosophila and Caenorhabditis elegans, also have well-funded communities capable of maintaining and annotating their particular reference assemblies.
To support smaller communities in their curation and annotation efforts, the International Society for Biocuration, which is led by Gaudet, provides a forum for biocurators, developers, researchers, and students to exchange experience and ideas. In addition, the group lobbies for increased funding and organizes workshops to educate on the use of common curation tools.
Gaudet is also involved with the Gene Ontology Consortium's Reference Genome Group, whose goal is to increase the depth of annotations for genes in 12 major model organisms used by researchers. The group promotes annotation standards across these reference genomes so researchers can better understand the evolutionary relationship between various genes in different organisms.
Kersey believes funding agencies will have to begin providing platforms to organize this data for the long term. “NCBI and the EBI are a very good way of storing the record and showing what's published in the biological literature, but what they don't have is genome scale organization. They are not designed primarily as a platform for experimental data and interpretations.”
NCBI's RefSeq is a curated collection with only one reference genome per organism. But the data is pulled from the agency's GenBank database, which openly accepts submissions of sequence data with a minimum of quality control. Curators at RefSeq will select the best GenBank entries and often copy them to RefSeq, but again, the quality of the reference genomes in RefSeq comes back to the community's interest in the representation of their organism.
Unlike RefSeq, the EBI's Ensembl and Ensembl Genomes databases select only genomes that have a strong, well-funded community actively involved in supporting and improving the reference. “What we try to do is not record genes, but develop a high quality tool suite that allows people to see integrated data from a wide range of experiments,” says Kersey. “To keep this up to date is important because [it] ensures that when a genome is sequenced, the annotation continues to get updated.”
Building blocks
When a community does not have the resources and funding to maintain and improve a reference genome, future researchers are bound to get lost or find a dead end. French researcher Alexandre Hassanin can attest. He and his colleagues noticed a problem during their studies on mitochondrial evolution in goats and sheep: the goat mitochondrial reference genome appeared contaminated. Hassanin and his team suspected that those original sequences, published in 2003, might contain errors. So they sequenced the complete mitochondrial genome from a Vietnamese domestic goat and then compared their data with the sequences archived in RefSeq.
They found that the original 2003 assembly had been generated by seven laboratories and augmented with sequences from nucleotide databases. Two fragments contained an unusually high number of errors, while another segment of the genome was a nuclear sequence of mitochondrial origin. Judging from its chimeric nature and poor quality, the team concluded that this reference genome could not be used for future studies.