Genomics ‘jigsaw puzzle’ competitions

Priya Sabu

While next-generation sequencing has made obtaining sequence data easier than ever before, the technology has also created a challenge for whole-genome assembly. Priya Sabu takes a look at several contests squarely aimed at improving how we piece together short sequences.

Bookmark and Share

DNA sequencing technologies are making it possible to access DNA sequences from a larger number of species at greater depth. And although today’s next-generation sequencing platforms are able to generate billions of short sequence reads (~50–100 bp) at a very low cost, these reads are actually making it harder for assemblers to piece together whole genomes accurately.
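
To see why short reads trip up assemblers, it helps to look at the data structure most short-read assemblers use: the de Bruijn graph. The toy sketch below (an illustration only, not any production assembler; all names are invented) builds such a graph from overlapping k-mers and counts branch points, which is where repeated sequence creates ambiguity.

```python
from collections import defaultdict

def debruijn_branches(reads, k):
    """Build a de Bruijn graph from k-mers and count branching nodes.

    Each node is a (k-1)-mer; an edge links a k-mer's prefix to its
    suffix. A node with more than one outgoing edge is an ambiguity
    the assembler must resolve: a repeat collapses onto itself and
    forks the path through the graph."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return sum(1 for succs in graph.values() if len(succs) > 1)

# "ACGTTT" occurs twice in this toy genome, so with k = 4 the node
# "TTT" gains two successors and the true path becomes ambiguous.
genome = "ACGTTTACGTTTGGA"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
print(debruijn_branches(reads, 4))  # → 1 ambiguous branch point
```

Repeats longer than the read length cannot be spanned at all, no matter how deep the coverage, which is one reason segmental duplications go missing from short-read assemblies.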

Indeed, writing in Nature Methods earlier this year, Evan Eichler—a professor of genomics at the University of Washington—criticized recent de novo assemblies produced by the Beijing Genomics Institute (BGI), arguably one of the top three genomics institutes in the world.

Evan Eichler, professor of Genome Sciences at the University of Washington, criticized BGI's de novo assemblies in Nature Methods earlier this year.

"Start up your computer, free up some space on your hard drive and get ready to write some serious code. Let's get ready to assemble!"

In the article, Eichler and his colleagues, who study copy number variation in the genome, compared two recent de novo human genome assemblies produced by the BGI and found that both were 16.2% shorter than the reference genome. In addition, the authors reported that 420.2 megabases of common repeats and 99.1% of validated duplicated sequences were missing from the assembled genome. “Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention paid to what is lost by sole application of short sequence reads,” concluded Eichler.

“Evan Eichler is correct in saying that the short read technology has not proven to be up to the task to disambiguate regions of large and small segmental duplication,” says David Haussler, investigator at the Howard Hughes Medical Institute and professor of Biomolecular Engineering at University of California (UC) Santa Cruz. “He’s rightfully frustrated that he has not been able to use the assemblies that have been produced to study his regions of interest.”

Eichler describes this as a “watershed moment in genomics”: although the ability to produce data has improved significantly, creating accurate, correctly annotated genomes remains challenging.

Contests to the rescue?

Last November, seventeen teams of researchers from seven countries prepared to test their genomics puzzle-solving skills in one of the most competitive assembly “bake-offs,” known as the Assemblathon. Several competitions along these lines have sprung up in recent years in an effort to test the skills and assembly algorithms of the world’s best genome assemblers. The Assemblathon was the brainchild of David Haussler (UC Santa Cruz), Joe DeRisi (UC San Francisco), Oliver Ryder (UC San Diego), and Stephen J. O’Brien (National Institutes of Health), who saw the need for better annotation and assembly algorithms in light of planned large-scale sequencing and assembly efforts such as the Genome 10K project.

Although many algorithms exist for sequence assembly from next-generation sequencing data, a better understanding of the value and limitations of these programs is needed. In one promising study, Eric Lander—professor of biology at the Massachusetts Institute of Technology (MIT), a member of the Whitehead Institute, and director of the Broad Institute of MIT and Harvard—and his colleagues generated high-quality de novo mammalian genome assemblies using an algorithm called ALLPATHS-LG. The article, published in PNAS in January of this year, reported that ALLPATHS-LG was able to generate assemblies from next-generation sequencing data that rivaled those generated with traditional, longer-read, capillary-based sequencing in terms of completeness, contiguity, connectivity, and accuracy.

The teams that participated in this year’s Assemblathon competition employed 22 different algorithms for sequence assembly; in fact, a team from the Broad Institute led by David Jaffe used its ALLPATHS-LG software in the competition to assemble the synthetic genome of ‘species A.’

Haussler believes that there is still no “cut-and-dry assessment” for what makes one assembly or algorithm better than another. “The field is too young. We have a nuanced assessment of assemblies that can be separated under different dimensions,” he says. These assembly competitions serve as a platform for bringing together and evaluating the various assembly algorithms to ask questions such as: How large a scaffold, gaps included, can a team assemble correctly? What is the error rate in difficult regions of the genome? How well does the algorithm perform in repetitive regions versus non-repetitive regions?

With the success of the first Assemblathon, organizers are in the midst of putting together a second, and possibly a third.

Steven Salzberg, director of the Center for Bioinformatics and Computational Biology and professor in the Department of Computer Science at the University of Maryland, decided to put a different spin on his assembly competition. He and several colleagues have organized the Genome Assembly Gold Standard Evaluations (GAGE) competition, in which entrants will use only real data for their assembly efforts.

Steven Salzberg, professor of Bioinformatics and Computational Biology at the University of Maryland, organized the GAGE assembly competition, which will use real data in the hope of providing realistic guidelines for the scientific community.

Teams in other competitions, including the Assemblathon and a European competition called the de Novo Genome Assessment Project (dnGASP), are running their algorithms on simulated, or synthetic, data that, from a biological and computational perspective, looks like a vertebrate genome. This year’s Assemblathon generated a synthetic genome for an unknown ‘species A.’

“It doesn’t always reflect the actual performance of an assembler on real data. At least that’s my view,” explains Salzberg. “So I thought that could be, in a way, a setback for the community to take away a message from a ‘bake-off’ that is based on only simulated data.”

There are now several programs available that are reported to perform reasonably well in genome assembly. Read-length improvements are also aiding assemblers: reads produced by the Roche 454 sequencer have reached nearly 500 base pairs, almost as long as early Sanger reads, and both the Illumina and SOLiD systems are now capable of 100-bp reads.

When evaluating these assemblies, Salzberg likes to consider the usefulness of the assembly for downstream analyses, in addition to its correctness. “The ideal assembly will have a single contig for every chromosome in the original organism. That almost never happens, even with long reads,” he says. Thus, he looks to answer the question: how long do contigs have to be for an assembly to be useful?

“This is a difficult question to answer because it depends on the project,” Salzberg explains. “My collaborators and I have a general rule of thumb that we would like to see a majority of the organism’s genes on a single contig.”
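
Contiguity questions like these are commonly quantified with statistics such as N50: the contig length at which contigs that long or longer contain at least half of the assembled bases. The sketch below (hypothetical helper names; not GAGE's actual evaluation code) computes N50 along with a naive version of the genes-per-contig rule of thumb.

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L together
    hold at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def genes_on_single_contig(gene_spans, contig_spans):
    """Fraction of genes wholly contained within one contig.

    Spans are (start, end) intervals on shared reference coordinates;
    a toy stand-in for the rule of thumb, not a real aligner."""
    covered = sum(
        any(cs <= gs and ge <= ce for cs, ce in contig_spans)
        for gs, ge in gene_spans
    )
    return covered / len(gene_spans)

contigs = [5000, 3000, 1000, 500]  # hypothetical contig lengths
print(n50(contigs))                # → 5000
```

Note that N50 rewards contiguity but says nothing about correctness, which is precisely why Salzberg weighs usefulness and accuracy together rather than relying on a single statistic.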

The upside for researchers who rely on whole-genome assemblies is that, as new assembly competitions spring up, the technology’s progress is being tracked ever more carefully. “It’s the dawn of a new age in the field of life sciences,” says Haussler.


Try your hand at assembling a genome with this puzzle.


And now, try it with short reads.
