There’s nothing like a little competition to get computational biologists crunching numbers. A team of scientists at the University of California, Santa Cruz, has begun accepting submissions for its Alignathon contest, which pits genome alignment methods and labs against each other. The goal: a better understanding of the strengths and weaknesses of each alignment program, what makes a program successful, and which program, if any, comes out ahead. The results will help researchers design the best way to deal with the flood of data expected to come from the Genome 10K Project, the international effort that’s under way to sequence the genomes of 10,000 vertebrates.
But there’s no single computational method that’s accepted as the best way to find similarities between genomes of different species. That’s where the Alignathon comes in. Each lab that enters the competition will use its preferred algorithm to align three sets of genomes provided by the competition organizers. Two sets will be simulated data created for the competition: four primate-like genomes in one set and five mammal-like genomes in the other. A third set will consist of real data from 12 fly genomes.
In December 2010, David Haussler’s lab at UC Santa Cruz launched the Assemblathon competition, designed to compare methods of assembling full genomes from the short segments of genetic information produced by sequencing technologies. The results of the initial competition were published in the journal Genome Research in September (1). Seventeen teams from seven countries participated, submitting a total of 62 different assemblies.
“What we found in the Assemblathon is that there was a huge amount of variety between different assembly programs and different groups,” said Benedict Paten, a member of Haussler’s lab. “Two groups could essentially run the same program and get different results.”
No clear winner emerged from the Assemblathon. Some programs ranked better at reconstructing the large-scale structure of a genome but made mistakes in single base pairs; others had few base-pair errors but more errors in the larger organization of a genome. But a winner is not the goal of either the Assemblathon or the Alignathon, said Paten. The goal is to have a benchmark against which current and future methods can be compared. Paten said he expects the same variety to come from the Alignathon.
“Just as the assembly problems are, in a mathematical sense, hard, the alignment problems are also very hard,” he said. “Genomes are subject to changes at all kinds of levels, from single nucleotide changes to small insertions and deletions to copy number changes or large rearrangements. An alignment program has to be able to take into account all of these different possible mechanisms for change.”
Because no single lab possesses the resources to compare every alignment method on its own, Haussler’s lab hopes the competition will speed the comparison process while encouraging a spirit of collaboration. The organizers expect around 10 labs to participate in the initial Alignathon, and future competitions could focus on other questions and include more labs. In addition, a second Assemblathon is in the works. While the first Assemblathon relied on simulated data, the second-generation competition will test methods on real data.
“And there are other future competitions you can imagine,” said Paten. “Competitions to assess not just alignment techniques but ways of reconstructing an evolutionary history from those alignments.” For those participating in the Genome 10K Project, the competitions not only provide new ways to analyze data but also help keep methods fresh, spirits high, and collaboration alive.
References
1. Earl, D., K. Bradnam, J. St John, A. Darling, D. Lin, J. Fass, H.O. Yu, V. Buffalo, D.R. Zerbino, et al. 2011. Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Res. 21:2224-41.
