In a 2010 Genome Biology article (1), Lincoln Stein, Professor of Molecular Genetics at the University of Toronto, made a case for migrating genome informatics to the “cloud.” Stein leads the Data Coordinating Centre for the International Cancer Genome Consortium (ICGC), whose mission is to ensure that all researchers can access the ICGC data sets. These data sets, according to Stein, are now 1 petabyte in size and could approach 10 petabytes by 2018. “Even in 2010, it was clear to me that the traditional way to distribute genomic data, in which data was stored in internet archives for downloads, was not going to scale,” recalled Stein.
Stein’s solution was straightforward: take advantage of the growing number of cloud-based computational services available to consumers—services where the customers (in this case biologists) can rent hardware and storage space for as long as needed. But these are not physical systems; rather, these cloud providers offer the ability to create “virtual machines” that take advantage of extremely large clusters of servers and nodes with petabytes of storage space to perform genomic analysis tasks.
Stein wrote his article at a time when the average sequencing system could generate a few billion base pairs per run. Since then, sequence output per system has grown tremendously, with systems generating hundreds of billions of bases per run. Costs have also come down, as predicted, so access to sequencing technology has widened. But has the approach to sequence analysis and storage changed along with the increase in data? And just how much data storage and analysis has actually found its way into the cloud?
Machine in the Cloud
In some ways, the initial entry into cloud computing is similar to selecting a desktop computer. You are able to configure the specifications of your “virtual computer” or “virtual machine” as it is often called, to whatever your specific needs might be. How much memory do you want? What speed? How much storage capacity? This is what makes cloud computing particularly attractive to labs where customizing computational infrastructures might prove too expensive and time consuming. Still, work in the cloud can take some adjustment too.
“Most computational biologists and bioinformaticians are new to the cloud, which has operating characteristics different from local computer clusters, and are taking some time to wrap their heads around the concept,” noted Stein, but this situation could change quickly as more researchers test cloud computing.
Amazon Web Services, one of the early leaders in cloud computing solutions, offers dedicated genomics tools and large companion data sets for interested researchers, all based around its Elastic Compute Cloud (EC2). In fact, one of the most valuable elements of computing in the cloud could be the ability to access and use, quickly and on demand, large volumes of data from what have traditionally been considered extremely large databases (think GenBank or the 1000 Genomes database).
When the first 1000 Genomes dataset was released, it was over 200 terabytes in size, far too large for most local computer environments. But the research teams involved in the project decided to host the data on Amazon EC2, allowing researchers in remote locations to simultaneously analyze data from over 1000 human genomes using sophisticated tools, accessing the computing power required by simply setting up their “virtual machines” on EC2. This is a major advantage of life in the cloud; once a data set is uploaded to EC2, it is there for all to use.
Sequencing instrument manufacturers have also gotten into the cloud, offering cloud-based solutions for everything from sample set-up and tracking to sequence analysis. Illumina, the maker of the MiSeq, HiSeq, and NextSeq next-generation sequencing systems, has set up a cloud platform called BaseSpace, which is targeted to researchers using Illumina systems who want to simplify their local computational demands while extending their sequence analysis options. BaseSpace offers users data storage along with a host of analysis tools ranging from sequence alignment and assembly to pathway analysis, all integrated within the standard workflow of an Illumina sequencing system.
Other companies have started to offer cloud-based sequencing solutions as well. DNAnexus, a company based in Mountain View, California, provides data management, storage, and analysis solutions for a variety of throughput levels.
When it comes to open-source tools to enable cloud-based analysis, several academic groups are also pitching in to ease the transition for biologists. In 2011, Florian Fricke and colleagues described a virtual machine called CloVR for analysis of next-generation sequence data sets (2). CloVR is capable of analyzing local sequences on a home computer or by virtual computing across multiple cloud platforms, if desired. The advantage of such dual software is the ability to tailor sequence analysis to a user’s particular needs; larger data sets can be shipped to the cloud, while smaller analyses can be performed locally. Other virtual environments for cloud computing include Galaxy and Cloud BioLinux, which was developed by the J. Craig Venter Institute and is also stored on EC2.
Analyzing Data in the Cloud
In addition to storage options, cloud computing offers users the potential for more power in their genome analyses through parallelization. In the world of cloud computing, parallel analysis problems fall into a number of different categories depending on the degree of parallelization possible.
For example, sequence BLAST analysis falls into a category called “embarrassingly parallel.” Here, a large set of sequences needs to be analyzed in order to find closely related sequences that potentially reveal the identity of an unknown sequence. If the sequences are all independent, then they can be analyzed in parallel on different machines, with no effort needed to separate the sequences beforehand. When it comes to aligning all those sequences, the extra power of parallel analysis using a virtual machine helps, but sometimes there are other issues that come into play.
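The independence that makes such a search “embarrassingly parallel” can be sketched in a few lines of Python. This is a toy illustration, not BLAST: the scoring function (a count of shared 3-letter subsequences), the names `shared_kmers`, `best_hit`, and `search_all`, and the example sequences are all assumptions made for the sketch. A thread pool stands in for what, in a cloud setting, would more likely be separate processes or virtual machines.

```python
from concurrent.futures import ThreadPoolExecutor


def shared_kmers(a, b, k=3):
    """Count distinct k-letter substrings shared by two sequences (toy score)."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def best_hit(query, database):
    """Return (name, score) of the database entry most similar to the query."""
    name = max(database, key=lambda n: shared_kmers(query, database[n]))
    return name, shared_kmers(query, database[name])


def search_all(queries, database, workers=4):
    """Search each query independently; no query needs another's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input queries.
        return list(pool.map(lambda q: best_hit(q, database), queries))


# Hypothetical reference sequences and unknown queries.
database = {
    "seqA": "ATGGCGTACGTTAGC",
    "seqB": "TTTTCCCCGGGGAAAA",
}
queries = ["GCGTACGTT", "CCCCGGGG"]
hits = search_all(queries, database)
print(hits)
```

Because each call to `best_hit` touches only its own query and a read-only database, the number of workers can change without affecting the results, which is the property that lets this kind of job spread across many machines.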
In 2013, Fabian Sievers and colleagues reported in the journal Bioinformatics that current alignment programs used for protein sequences do not scale well (3). The accuracy of an alignment tends to decrease as the number of sequences increases. Such results present an issue for developers.
Unlike BLAST, sequence alignment and assembly are not “embarrassingly parallel” problems but “tightly coupled” ones, and depending on the number of sequences and the size of the assembled genome, they can require quite a bit of computer memory. But here, too, cloud computing can make a difference.
Bioinformaticians have tackled the problem of designing new sequence assembly tools for cloud-based applications in recent years. In 2012, Jan-Ming Ho and colleagues described a tool called CloudBrush for genome assembly based on string graphs in BMC Genomics (4). As with other sequence assembly programs, CloudBrush proved scalable, depending on the number of clusters used in the virtual machine.
Getting Data to the Cloud
It turns out that with internet bandwidth limitations, getting your data into the cloud might be the biggest obstacle of all, especially when the data sets are particularly large. In Stein’s 2010 article, he noted that with a typical connection transferring at a rate of 5 to 10 megabytes/second, it would require a week to transfer a 100 gigabyte next-generation sequence data set. While speeds have improved since 2010, they have not caught up to the increase in sequencer output, and data transfer remains a concern for cloud users.
“Data transfer rates are certainly still the bottleneck,” said Stein. “You want to minimize transfers by putting the big data sets up in the cloud and leaving them there.”
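To see why leaving large data sets in place matters, it helps to work out how transfer time scales with data size. The short Python sketch below is only a back-of-the-envelope calculation: the function name `transfer_time_days` is hypothetical, decimal units are assumed (1 gigabyte = 1,000 megabytes), and the rate is treated as fully sustained. Protocol overhead, shared links, and interruptions mean real transfers take longer than this ideal figure.

```python
SECONDS_PER_DAY = 86_400


def transfer_time_days(size_gb, rate_mb_per_s):
    """Ideal transfer time in days, assuming a fully sustained rate.

    Assumes decimal units: 1 gigabyte = 1,000 megabytes.
    """
    size_mb = size_gb * 1_000
    return size_mb / rate_mb_per_s / SECONDS_PER_DAY


# A 1 petabyte collection (1,000,000 gigabytes) at 5 and 10 megabytes/second.
for rate in (5, 10):
    days = transfer_time_days(1_000_000, rate)
    print(f"{rate} MB/s: about {days:,.0f} days")
```

Even under these ideal assumptions, moving a petabyte-scale collection at such rates takes years rather than days, which is why the data set is uploaded once and the computation is brought to it.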
Illumina’s BaseSpace provides another clever solution to the bandwidth challenge. When using the NextSeq system, users can elect to push data to BaseSpace during the instrument run for data storage and analysis, rather than transferring all the data after the run is completed. In this way, analysis of the data set can be performed faster and at locations other than where the sequencing system is physically located.
Interest in using cloud-based solutions in genomics is on the rise as more tools are developed for interested researchers. But questions of benefit remain, especially when it comes to overall cost. Still, as sequencing capacity and speed increase well above Moore’s Law for storage and computation, Stein suggests now is the moment for researchers to take a long, hard look at cloud computing.
“Cloud computing allows us to use resources more effectively and fend off the day of reckoning, but that day is coming.”
1. Stein, L.D. 2010. The case for cloud computing in genome informatics. Genome Biology 11(5):207.
2. Angiuoli, S.V. et al. 2011. CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 12:356.
3. Sievers, F., D. Dineen, A. Wilm, and D.G. Higgins. 2013. Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29(8):989–995.
4. Chang, Y.-J. et al. 2012. A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC Genomics 13(Suppl 7):S28.