All things considered, the race to the $1000 genome may very well end in a tie. At this year’s Consumer Electronics Show (CES) in Las Vegas, NV, just 10 days into the new year, two companies announced instruments that can sequence a human genome in less than a day for less than $1000. Life Technologies’ Ion Proton sequencer and Illumina’s HiSeq 2500 system are both scheduled to be released later this year, bringing an end to one of the most competitive races in biotechnology since the Human Genome Project.
“The big transition that the industry is going through right now is getting out of the research lab and into the clinic,” says Clifford Reid of Complete Genomics. “There are about a million research genomes to be sequenced every year and there are maybe a billion clinical genomes to be sequenced every year.”
But this transition requires more than just a cheap price tag and quick turnaround. The industry is now preparing for the move from research labs to clinical settings by improving genomic sequence quality, sorting out data processing issues, and developing new software for clinical interpretation.
Quality Vs. Cost
Reid isn’t getting caught up in this $1000 genome business. He’s quick to point out that the price tag only includes the consumables, leaving out the costs of the instrument, labor, and data storage and analysis.
“Over the past few years, the major cost reductions have been the cost of the consumables, and those cost reductions are almost done,” says Reid. “Now, it’s the cost of all the other things like the computing and the labor involved that are the rate-limiting steps.”
Reid is chairman, president, and CEO of Complete Genomics, the human genome sequencing and analysis service company in Mountain View, CA. His company offers complete genome sequencing as a service for $4000 per genome to its large research customers. That’s about the same price as current clinical molecular diagnostic tests, such as genetic risk assessment and BRCA mutation testing for breast and ovarian cancer.
Because that price is on a level with current molecular testing, Reid believes that clinical sequencing will value higher quality over dirt-cheap pricing. When a researcher studies hundreds or thousands of genomes in a genome-wide association study for a particular disease, the sequencing errors fade into the background. But when a physician is working with a single patient’s genome, any sequencing error could affect a life-or-death decision.
“Your doctor says ‘Hey, I can give you the cut-rate research genome that has a bunch of errors in it, but boy is it cheap. Or I can give you a clinical-grade genome where we’re sure we get the right answer.’ There’s no question which way to go,” says Reid.
Because of its service business model, Complete Genomics has already worked out the data storage and analysis aspects of genome sequencing for its customers, who never need to handle the raw sequencing data. The company produces a finished genome by processing all the raw data at its four-petabyte, 7500-core data center and then delivers it to clients through Amazon Web Services’ cloud-computing platform.
Researchers can then analyze the data using a variety of genome analysis tools, which have become increasingly versatile thanks to the implementation of a new file standard called Variant Call Format (VCF). Developed by the 1000 Genomes Project, VCF is a text-file format that represents the biologically interesting parts of the genome: the positions where a sequenced genome differs from a reference. Regardless of the technology used to sequence a genome, the finished product can be represented in this file format, which can then be analyzed using any academic or commercial software that supports this emerging standard.
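To make the idea of a shared text format concrete, here is a minimal sketch of how a tool might read the fixed columns of a VCF file. It is an illustration only, not any vendor's pipeline: the `Variant` class, the `parse_vcf` function, and the two data lines (including the identifier `example_id`) are made up for this example, and the sketch ignores the INFO and per-sample genotype columns that real files carry.

```python
from dataclasses import dataclass
from typing import List, Optional

# A tiny, made-up VCF fragment for illustration only (not real data).
# Lines starting with "#" are headers; data columns are tab-separated.
SAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t12345\t.\tA\tG\t50\tPASS\tDP=30
2\t67890\texample_id\tC\tT,G\t12\tLowQual\tDP=8
"""


@dataclass
class Variant:
    """Hypothetical container for the first seven VCF columns."""
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alts: List[str]          # ALT may list several alleles, comma-separated
    qual: Optional[float]    # "." in the file means the value is missing
    filter_status: str


def parse_vcf(text: str) -> List[Variant]:
    """Parse the data lines of a VCF-formatted string into Variant records."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information, and the column header
        chrom, pos, vid, ref, alt, qual, flt = line.split("\t")[:7]
        variants.append(
            Variant(
                chrom=chrom,
                pos=int(pos),
                variant_id=vid,
                ref=ref,
                alts=alt.split(","),
                qual=None if qual == "." else float(qual),
                filter_status=flt,
            )
        )
    return variants


variants = parse_vcf(SAMPLE_VCF)
passing = [v for v in variants if v.filter_status == "PASS"]
for v in passing:
    print(f"chr{v.chrom}:{v.pos} {v.ref}>{','.join(v.alts)}")
```

Because every sequencing platform can emit the same columns, downstream software only needs a reader like this one, whichever instrument produced the data.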
Genome in the Cloud
That’s the idea behind DNAnexus, Inc., in Mountain View, CA: to reduce the cost of genomics data analysis by using cloud computing and Internet technology. The company’s platform is designed to be run through a web browser, like other applications such as e-mail, calendars, and social media websites.
“All things accounted for, we’re already at the point where I think that the cost of data management and analysis is as much as or more than the sequencing itself,” says Andreas Sundquist, CEO and co-founder of DNAnexus.
In addition to reducing the cost of genomic analysis, the company hopes to make it simpler. Biologists and clinicians might be interested in using genome sequencing, but they may not have any experience with the bioinformatics required to process and analyze the data. So, the company is designing a user-friendly, point-and-click interface for data storage, organization, visualization, and analysis, including expression analysis and variant calling.
“They are not going to set up an in-house parallel queuing architecture and figure out how to compile and partition the data. They just want to start using it and get to the results quickly,” says Sundquist.
Another company that is using cloud computing for genomics is Santa Clara, CA-based NextBio, Inc. Through its platform, the company takes raw data from a client’s sequencers and processes it into a VCF or other useful format within hours, a feat that still takes several weeks at university data centers. The goal is to reduce the data bottleneck and speed up analysis.
“The Illumina and Life Technologies instruments in core sequencing facilities are focused on generating a lot of data, and they’ve done a great job of bringing efficiency to that process. We’re trying to bring that same level of efficiency in terms of making all this big data accessible and making it useful,” says Saeid Akhtari, co-founder, president, and CEO of NextBio.
NextBio is also translating the massive amount of data available in public and private databases into actionable information for physicians. The system integrates different types of molecular profiling from patients and links them with other integrated data from human and animal studies.
“Every time a new patient comes into the system, it gets smarter. It connects the dots,” says Akhtari. “Our system is constantly learning and finding new connections between all these billions and trillions of data points.”
Eventually, with data from enough patients uploaded into the system, physicians will be able to use it to find which treatments worked for patients who have a disease with a particular mutation. By correlating trillions of data points, the system could produce a one- or two-page report that suggests treatment options and disease risks. By the end of this year, there may be well over 100,000 molecular profiles of patients in NextBio’s database.
In addition to introducing the Ion Proton, Life Technologies announced a project to advance computational tools that help doctors diagnose and treat their patients through genomics. The company has tapped Robert F. Murphy, director of the Lane Center for Computational Biology at Carnegie Mellon University, to lead the “Doctor-in-a-Box” project.
In contrast to the software packages that companies are developing for translational research, Doctor-in-a-Box will be an open-source system without any licensing fees. It will be freely available for research, clinical, or other purposes. Furthermore, the project seeks to draw conclusions about complex diseases that have multiple genetic contributors and multiple possible manifestations.
“The main focus is using advanced techniques that are being developed to tackle diseases where traditional genome-wide association studies (GWAS) have failed to pick up these weak signals,” says Murphy. “When you have multiple possible contributors, traditional GWAS, whether they analyze single base-pair changes or SNPs, don’t find anything.”
To analyze these complicated genetic diseases, Murphy’s team will develop advanced machine-learning methods. In essence, the system will become a never-ending learner, building a model of what to look for in newly sequenced genomes with regard to disease and treatment effectiveness, and continuously refining that model as it processes more genomic data. In the end, the software may help with the deluge of sequencing data by highlighting the data that are useful in light of current genomic knowledge.
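The phrase “never-ending learner” describes a model that is updated each time a new example arrives rather than being retrained from scratch. The sketch below illustrates that general idea with a simple online logistic regression. It is not the Doctor-in-a-Box method, which is not specified here. The class name `OnlineLogisticModel`, the three binary “variant present” features, and the toy training stream are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticModel:
    """Hypothetical incremental learner: one gradient step per new example."""

    def __init__(self, n_features: int, learning_rate: float = 0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Estimated probability that the label is 1 for feature vector x."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def update(self, x, y):
        """Refine the model with a single (features, label) example."""
        error = self.predict_proba(x) - y  # gradient of log-loss w.r.t. the score
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Toy stream (made up): each feature marks presence (1) or absence (0) of a
# hypothetical variant; in this synthetic data only the first one matters.
stream = [
    ([1, 0, 0], 1),
    ([0, 1, 0], 0),
    ([1, 1, 0], 1),
    ([0, 0, 1], 0),
    ([0, 0, 0], 0),
    ([1, 0, 1], 1),
] * 200

model = OnlineLogisticModel(n_features=3)
for features, label in stream:
    model.update(features, label)  # the model improves as each example arrives

with_variant = model.predict_proba([1, 0, 0])
without_variant = model.predict_proba([0, 0, 0])
```

A linear model like this one captures only additive effects. The multi-contributor diseases Murphy describes would call for richer models, but the update-as-data-arrives loop is the same.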
“Like in all biological and biomedical research, we’re data-limited,” says Murphy. “One of the great things about the Proton is that it makes it possible to get a whole genome sequence on a much, much larger number of individuals in order to find these statistical linkages. And that’s probably the limiting factor that we can look at right now that the technology has the potential to reduce.”
In the end, the software will only be as good as the amount and quality of the data that is pumped into it. Such programs would not have been very useful before the volume of genomes that the $1000 price tag is expected to produce. While standards such as the VCF file format are helping to integrate this massive amount of data, there is still a long way to go in connecting all the dots.
“We have so much data, terabytes of data from studies, and so many tools and databases. But if we can’t get these pieces working together in a unified system, there’s no way that this is going to work,” says Sundquist.