The data that comes from next-generation gene sequencing machines is enough to overwhelm most computers. But a new algorithm to compress such data has made the genetic data storage problem a little less challenging, and won its developer, James Bonfield of the Wellcome Trust Sanger Institute, a $15,000 prize from the Pistoia Alliance.
“In general, people had been compressing data using programs like WinZip, fairly standard compression programs that can compress word processing documents or anything,” said Bonfield. “But if you can understand what data means, you can compress more effectively.”
So the Pistoia Alliance challenged bioinformaticists, those who know genetic data well, to develop better ways to compress their data. The submitted methods were judged on how much they compressed genetic data, how fast they compressed and decompressed it, and how much computer memory they required.
Throughout the competition, the Sequence Squeeze website contained an up-to-date scoreboard showing the rankings of various algorithms that had been submitted, a move that inspired many participants to continue improving and resubmitting their programs.
“The open leaderboard was really instrumental to, I expect, most people’s entries,” said Bonfield. “There were a number of times that people were continuously leapfrogging each other.”
By the close of the competition, March 15, more than 100 entries had been submitted. Bonfield’s ranked highest, compressing data to a size just under a third smaller than older methods had managed. The crux of his method, like many of the others, lies in dividing the data into groups of similar data. Each piece of gene sequence typically comes with a DNA sequence, a name, and a string of information related to quality. To compress the data more effectively, Bonfield split up the three pieces of information and grouped each with like data.
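The idea of splitting the three kinds of information apart can be illustrated with a short Python sketch. This is not Bonfield’s algorithm: it assumes the reads arrive in the common four-line FASTQ layout (name, sequence, a bare "+" separator, quality string) and substitutes the general-purpose zlib compressor for the specialized models a competition entry would use. The function names are hypothetical and chosen only for illustration.

```python
import zlib


def split_streams(fastq_text):
    """Split four-line FASTQ records into names, sequences and qualities.

    Assumption: every record is exactly four lines and the third line
    is a bare "+", so it carries no information and can be dropped.
    """
    lines = fastq_text.strip().split("\n")
    names, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        names.append(lines[i][1:])   # drop the leading "@"
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])
    return names, seqs, quals


def compress_streams(fastq_text):
    """Compress each kind of data separately, so like sits next to like."""
    streams = split_streams(fastq_text)
    return tuple(
        zlib.compress("\n".join(stream).encode("ascii"), 9)
        for stream in streams
    )


def decompress_streams(blobs):
    """Rebuild the original FASTQ text from the three compressed streams."""
    names, seqs, quals = (
        zlib.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    records = [f"@{n}\n{s}\n+\n{q}" for n, s, q in zip(names, seqs, quals)]
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    sample = "@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nACGTACGA\n+\nIIIIHHHG\n"
    blobs = compress_streams(sample)
    assert decompress_streams(blobs) == sample
    print([len(b) for b in blobs])  # compressed size of each stream, in bytes
```

On a toy input like this the separated streams will not necessarily come out smaller than the whole file compressed at once; any benefit depends on large files and on giving each stream a compressor suited to it, which is where the competition entries differed.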
The compression of sequencing data is useful for two purposes, said Bonfield. “There’s archival use for long-term storage, or there are files that need to be compressed to be transmitted through the network.” The new methods are immediately useful for the second purpose. For long-term storage, however, a finalized version of the algorithm will be necessary to ensure that it is effective and will still be in use decades from now.
To finalize his algorithm, Bonfield is collaborating with others working on similar problems. “What we don’t need is many different file types,” he said. “We need to come together and have one strong program.”
For now, while the cost of data storage has yet to come down, at least the file size of a genome may be on the decline. Bonfield received his award from the Pistoia Alliance on April 24. He donated a portion of the prize to the Wellcome Trust Sanger Institute and the rest to the British Heart Foundation.