As data sets grow in size, new computational tools are needed. Jeffrey Perkel looks at the ins and outs of data analysis.
As a postdoctoral fellow, Anne Carpenter found herself facing a daunting task: screening Drosophila cell cultures for genes affecting cell growth using a library of over 22,000 inhibitory RNAs. With the RNAs pre-printed on microscope slides, she plated cultured fly cells atop the library to induce RNA interference. Three days later, she stained the cells to visualize their size and cell cycle status.
It was then she encountered a problem: With replicates, the experiment included nearly 90,000 conditions. “I needed to process tens, probably hundreds of thousands of images, but couldn't find software that was up to the task,” recalls Carpenter, who is now Director of the Imaging Platform at the Broad Institute of Harvard and MIT. That's not to say there were no tools available—some could handle the volume of images but not the subtlety of the work; others could handle the science but not the number of pictures.
Carpenter's story is not unique. Today's laboratory hardware, whether microscopes, mass spectrometers, or DNA sequencers, produces data by the gigabyte (GB). Yet it typically lacks the software necessary to handle, process, and extract meaning from those data, at least for the most interesting bleeding-edge applications. But researchers have stepped in to bridge that gap, producing freely available computational tools for experts and neophytes alike, empowering them to squeeze biological insights from raw data.
How big is “Big Data”?
Adina Howe was a postdoctoral fellow at Michigan State University with C. Titus Brown (now Associate Professor of Genetics at the University of California, Davis) when she found herself trying to assemble bacterial genomes from metagenomics data sets so large (nearly 400 billion bases' worth) that the assembler software couldn't keep up. “They would require hundreds of gigs of memory that we didn't have,” she explains.
Typically, a raw next-generation sequencing data set is ~30 GB per sample, an order of magnitude smaller than Howe's. Yet even at that size, a computing neophyte would encounter significant practical difficulties: moving the data from place to place, or even figuring out how to open the files, takes savvy. “They can't open it [in] any Microsoft product,” Howe notes, as the software would likely crash. And of course, processing the data (filtering out low-quality sequences, for instance) would be even more difficult.
Fortunately, Brown had been developing tools to handle such problems for years. His lab's flagship software is “khmer,” a tool that reduces sequences into a series of arbitrary-length “words” (that is, k-mers), simplifying tasks such as genome assembly.
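To make the idea of sequence “words” concrete, here is a minimal Python sketch of k-mer decomposition. It is an illustration only and does not use khmer's own code or API; the function name `kmers` and the example read are assumptions made for this sketch.

```python
def kmers(sequence, k):
    """Yield every overlapping word of length k found in a sequence.

    Hypothetical helper for illustration; not part of khmer.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Slide a window of width k along the sequence, one base at a time.
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


# A made-up 7-base read broken into overlapping 4-mers.
read = "ATGGCGT"
print(list(kmers(read, 4)))  # ['ATGG', 'TGGC', 'GGCG', 'GCGT']
```

Neighboring words overlap by k - 1 bases, and it is this overlap that lets an assembler chain words from different reads into longer stretches of sequence.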
To tackle Howe's metagenomics problem, Brown's team implemented a probabilistic data structure called a “Bloom filter,” which reduces the amount of memory required for sequences some 40-fold. As Howe explains it, this structure is like a faculty mailroom in which each box is shared by several instructors. By creating multiple “rooms” in which those faculty pairings are shuffled, it becomes statistically possible to determine how likely it is that any given individual has mail (in this case, an overlapping sequence to fit into a growing assembly) by checking to see if those different boxes are full.
“I go to my mailbox and I ask, ‘Hey, are any potential connecting sequences stored in my data structure?’ If not, I know that they don't exist, and I don't have to pursue that path any more; otherwise, I'm going to check my other ‘mailbox rooms’.”
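The mailroom analogy can be sketched in a few lines of Python. This is a toy Bloom filter written for illustration, not the implementation used by Brown's team; the class name `MailroomBloomFilter`, the room sizes, and the use of SHA-256 for hashing are all assumptions of this sketch.

```python
import hashlib


class MailroomBloomFilter:
    """Toy Bloom filter: several 'rooms', each a row of shared boxes.

    Illustrative only. Each box here takes a whole byte for simplicity;
    a memory-efficient version would use a single bit per box.
    """

    def __init__(self, room_sizes=(1009, 1013, 1019)):
        # One array of empty boxes per room (sizes are arbitrary primes).
        self.rooms = [bytearray(size) for size in room_sizes]

    def _boxes(self, kmer):
        # Salting the hash with the room number shuffles which
        # sequences end up sharing a box in each room.
        for room_index, room in enumerate(self.rooms):
            digest = hashlib.sha256(f"{room_index}:{kmer}".encode()).digest()
            yield room, int.from_bytes(digest[:8], "big") % len(room)

    def add(self, kmer):
        # Put "mail" in this sequence's box in every room.
        for room, box in self._boxes(kmer):
            room[box] = 1

    def __contains__(self, kmer):
        # An empty box in any room proves the sequence was never added.
        return all(room[box] for room, box in self._boxes(kmer))


seen = MailroomBloomFilter()
for kmer in ("ATGG", "TGGC", "GGCG"):
    seen.add(kmer)

print("TGGC" in seen)  # True: a stored sequence is never reported missing
print("CCCC" in seen)  # very likely False, though false positives can occur
```

The design trades certainty for memory: the filter stores only which boxes are full, never the sequences themselves, so a "no" is definitive while a "yes" is only probable. Adding rooms lowers the false-positive rate at the cost of more memory and more lookups.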
In the end, Howe used this approach to produce an assembly of nearly 5.5 million protein-coding genes from her metagenomics data set.
“It took us a good frustrating year,” Howe, who now is an Assistant Professor of Agricultural and Biosystems Engineering at Iowa State University, recalls.
Imaging's big challenge
Pavel Tomancak, a group leader at the Max Planck Institute of Molecular Cell Biology and Genetics, faces a similarly daunting problem in his research. He studies gene expression during Drosophila embryogenesis, generating multi-terabyte recordings of fruit fly embryos via light-sheet microscopy that he mines to trace cellular motion during development. With 24 hours' worth of images recorded from multiple angles and taken several times a second, existing software simply cannot keep up. “They are not adequate; they are not precise enough; they are not able to follow hundreds of thousands of cells without making any mistakes.”