A single run on the Applied Biosystems SOLiD 3 sequencer can generate more than 20 billion base pairs of sequence. Mass spectrometry systems can now generate thousands of spectra from a single sample. High-throughput chemical genomics platforms can screen cell lines against thousands (or in some cases, millions) of different compounds in a few days. Data generation on this scale, which can be measured in terabytes, is no longer the realm of large, multi-lab initiatives. Indeed, even smaller lab groups are now generating reams of information on a daily basis. Like it or not, biology has moved into the fast-paced, high-throughput, modern world. And though this move has created unique possibilities for researchers in all fields, such large-scale data generation brings issues that will need to be solved, and quickly.
The ability to generate these expansive datasets quickly and cheaply has really given biology a new outlook on life. The recent expansion of DNA sequencing technology and capability is just one example. The celebrated completion of the 13-year Human Genome Project in 2003 came at a price of $1 billion; today, only a short time later, an individual genome can be sequenced for less than $50,000 in a time frame of several weeks. Some companies even anticipate that the cost could be as low as $5,000 as early as next year, and that the sequencing time frame will eventually reach a single day. This technology boom has transformed the world of personal genomics and medicine, and has made possible the eventual integration of a patient's genomic information with their clinical profile and medical history to gain new insights into the genetic basis of human disease. But personal genomics could soon begin to generate data at a pace and volume that, five years ago, no biologist would have dreamed possible. While this is good news for scientists hoping to make new discoveries, the difficulty lies in both protecting and sharing this information among scientists and clinicians.
Several databases have been established for scientists to share the results of their sequencing efforts, along with many other databases catering to microarray datasets (such as Gene Expression Omnibus), mass spectrometry studies (such as Open Proteomics Database at the University of Texas), and clinical imaging efforts [including the National Cancer Institute's cancer Biomedical Informatics Grid (caBIG)]. These databases, and those that are sure to emerge in the coming years, fulfill a critical need, since the datasets they contain can yield different information depending on how they are analyzed. But coordinating data sharing and access among researchers can be a challenge and will take time and effort to perfect.
Recently, a group of researchers led by a biostatistician at Yale School of Public Health used data from the National Institutes of Health database of genotypes and phenotypes (dbGaP) for a study that was accepted by the Proceedings of the National Academy of Sciences. Although researchers are often asked to sign an agreement (called a Data Use Certificate) stating that, for one year, dbGaP data will not be used by anyone other than the depositing scientist, their article was accepted and appeared prior to the end of that one-year embargo period. To their credit, all involved took quick action to retract the paper, and an investigation is now under way to determine how the breach occurred. Sharing data via large databases is an important mechanism to move science forward, but so too is protecting the rights of those who did the work to produce these datasets in the first place. Going forward, both journals and funding agencies will have to be more scrupulous when reviewing manuscripts or grants that are contingent on dataset analysis; they must consider how these datasets were obtained, as well as how they will be made available to other researchers to validate and expand upon the results.
There is little doubt that biologists will continue to push the proverbial envelope, and in doing so generate larger and larger datasets. Storing these data and sharing them among peers is going to be a key discussion point for years to come, but the value of all researchers having access to such treasure troves of information far outweighs the occasional setbacks that will take place as we move forward.