Competition Leads to Better Genetic Data Compression

05/04/2012
Sarah C.P. Williams

$15,000 has been awarded to a scientist for the best compression of next-generation sequencing data.


The data that comes from next-generation gene sequencing machines is enough to overwhelm most computers. But a new algorithm to compress such data has made the genetic data storage problem a little less challenging, and won its developer, James Bonfield of the Wellcome Trust Sanger Institute, a $15,000 prize from the Pistoia Alliance.

The Sequence Squeeze competition website kept a live leaderboard that allowed participants to see where they ranked against other scientists, helping encourage the spirit of competitiveness and motivating many to improve their entries. Source: sequencesqueeze.org





Last April, the Pistoia Alliance (a group of life science companies, technology vendors, publishers, and academic groups) announced Sequence Squeeze, a competition to develop the best new algorithm for compressing the data from next-generation sequencing studies. For many researchers, the cost of storing data generated by genetic studies is now more of a roadblock than the cost of the initial sequencing.

“In general, people had been compressing data using programs like WinZip, fairly standard compression programs that can compress word processing documents or anything,” said Bonfield. “But if you can understand what data means, you can compress more effectively.”
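As a small illustration of Bonfield's point (not a description of his entry), a program that knows a DNA sequence uses only four letters can store each base in 2 bits rather than the 8 bits of a text character. The sketch below assumes sequences containing only A, C, G, and T; real reads can also contain other symbols such as N, and the function names are hypothetical.

```python
# Illustrative sketch only: pack A/C/G/T into 2 bits per base.
# Assumes the sequence contains no other symbols (real reads may contain N).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack four bases into each byte, most significant bits first."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking reads it in order.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Recover the original sequence; `length` is the number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "ACGTACGTTGCAAC"          # 14 bases
packed = pack_bases(seq)        # stored in 4 bytes instead of 14
restored = unpack_bases(packed, len(seq))
```

A general-purpose compressor has no way of knowing in advance that only four symbols occur, which is the kind of domain knowledge the quote refers to.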

So, the Pistoia Alliance challenged bioinformaticists, those who know genetic data well, to develop better ways to compress their data. The submitted methods were judged on four criteria: how much they compressed the genetic data, how fast they compressed it, how fast they decompressed it, and how much computer memory they required.

Throughout the competition, the Sequence Squeeze website contained an up-to-date scoreboard showing the rankings of various algorithms that had been submitted, a move that inspired many participants to continue improving and resubmitting their programs.

“The open leaderboard was really instrumental to, I expect, most people’s entries,” said Bonfield. “There were a number of times that people were continuously leapfrogging each other.”

By the close of the competition on March 15, more than 100 entries had been submitted. Bonfield’s ranked highest, producing compressed files a little under a third smaller than older methods had managed. The crux of his method, and of many others, lies in dividing the data into groups of similar data. Each sequenced fragment typically comes with a DNA sequence, a name, and a string of information related to quality. To compress the data more effectively, Bonfield split up the three pieces of information and grouped them with like data.
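The grouping idea can be sketched in a few lines. The example below is only an illustration under stated assumptions, not Bonfield's algorithm: it assumes the common four-line FASTQ record layout (name, sequence, a bare "+" separator, quality string), uses Python's general-purpose zlib as a stand-in for a specialized compressor, and the function names are hypothetical.

```python
import zlib


def split_streams(fastq_text: str):
    """Separate four-line FASTQ records into names, sequences, and qualities."""
    lines = fastq_text.strip().split("\n")
    names, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        names.append(lines[i][1:])   # drop the leading '@'
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])   # lines[i + 2] is the '+' separator
    return names, seqs, quals


def compress_streams(names, seqs, quals):
    """Compress each stream on its own, so like data sits next to like data."""
    return tuple(
        zlib.compress("\n".join(stream).encode("ascii"), 9)
        for stream in (names, seqs, quals)
    )


def decompress_streams(blobs):
    """Rebuild the FASTQ text (assumes bare '+' separator lines)."""
    names, seqs, quals = (
        zlib.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    return "".join(
        f"@{n}\n{s}\n+\n{q}\n" for n, s, q in zip(names, seqs, quals)
    )


sample = (
    "@read1\nACGTACGT\n+\nIIIIHHHH\n"
    "@read2\nACGTTGCA\n+\nIIIIGGGG\n"
)
blobs = compress_streams(*split_streams(sample))
roundtrip = decompress_streams(blobs)
```

Keeping the three streams apart also lets each one be handled by a method suited to it, for example a compact base encoding for the sequences and a different model for the quality strings.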

The compression of sequencing data is useful for two purposes, said Bonfield. “There’s archival use for long-term storage, or there are files that need to be compressed to be transmitted through the network.” The new methods are immediately useful for compressing files for transmission. For long-term storage, however, a finalized version of the algorithm will be needed to ensure that it remains effective and that files compressed today can still be read decades from now.

To finalize his algorithm, Bonfield is collaborating with others working on similar problems. “What we don’t need is many different file types,” he said. “We need to come together and have one strong program.”

For now, while the cost of data storage has yet to come down, at least the file size of a genome may be on the decline. Bonfield received his award from the Pistoia Alliance on April 24. He donated a portion of the prize to the Wellcome Trust Sanger Institute and the rest to the British Heart Foundation.

Keywords:  sequencing genomics