Finding the true $1000 genome
Jeffrey Perkel, Ph.D.

Still, he says, even that cost excludes the upfront R&D expense required to develop the sequencers and sequencing technologies, costs that are typically included in the HGP accounting. But according to Nusbaum, the genome he can provide today is a pale shadow of that first HGP genome sequence and lacks most long-range structural information. The HGP data were assembled de novo (rather than merely aligned to a reference), and each base was then manually “finished”: subjected to extensive quality control, manual cross-checking, and error correction, none of which occurs routinely today. “It's not the same genome,” he says. In contrast, today's genomes are generated by the far less complicated and expensive process of “resequencing” and alignment to a reference.

Still, Nusbaum figures he and others in the field could probably squeeze costs by a “couple factors of two” from current processes—improving automation of DNA extraction and library preparation steps, for instance. This, plus anticipated instrument technology improvements in read lengths and read densities, could get his per-genome cost down to around $2000 over the next two years or so.

$1000 genome; $100,000 ‘interpretome’

When it comes to genomics, sequencing is the easy part; all that data must then be analyzed and interpreted. Quality metrics on the raw sequence need to be evaluated, the data assembled and compared to a reference, and genetic variants identified. Many mature computational pipelines are now available, including the GATK package developed by the Broad Institute and the SOAP package from BGI. Both are largely automated and cost less than $500 per genome, Xu says (though a full accounting of cost would also have to include the processing time required by computers to run these jobs).

If the goal is simply to collect a bunch of genomes to assess human genetic variability, then the process is done at this stage. But if the effort is discovery-driven in order to identify variants underlying phenotype or human disease—say, for a clinical application—those automated steps are not sufficient.

“It cannot be done by automation pipelines on a computer; you have to have a lot of range of knowledge and certain kinds of data-interpretation technical skills,” says Xu. “Those kind of people could be very expensive, and it's not very efficient to do that, because you have to manually interpret each data point you care about.” $1000 genomes may be in sight, but what are they worth alongside so-called million-dollar “interpretomes”? Indeed, in 2010, Washington University geneticist and Genome Institute codirector Elaine Mardis penned an article in Genome Medicine entitled “The $1000 genome, the $100,000 analysis,” a reference to a comment she once made to then-NHGRI head Francis Collins, when NHGRI was actively funding technology development toward the $1000 genome.

“My admonishment to Francis at the time was that he really needed to start thinking about funding bioinformatics,” recalls Mardis, “or he was going to have his $1000 genome but he would still have a million-dollar analysis.”

That isn't hyperbole. When the institute sequenced its first tumor-normal genome pair in 2008 using next-generation technology, it cost about a million dollars to collect all those As, Cs, Gs, and Ts, and another $600,000 to analyze the data once the reads were done. “The analyses had never been done before,” says Mardis. “We had to figure out how to do it as we went.”

Today, the Genome Institute at Washington University has more than 30 Illumina HiSeq 2000s, each costing about $750,000. Each sample, Mardis reckons, costs about $8000 “fully-loaded” to generate the sequencing data. But the downstream analysis costs still add up, especially for clinical applications.

On average, each tumor genome contains about 3.5 million single-base variants relative to a matched normal. Each variant must be logged and analyzed. Does this variant change an amino acid? Is that residue conserved? Is the variant seen in the matched normal sample, and what is its frequency in the human population? Much of that work can be automated in informatics pipelines, but the analytical interpretation of the final list is not yet easily automated.
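To make the split between automated triage and manual interpretation concrete, here is a minimal Python sketch of the kind of per-variant filter those questions imply. It is an illustration only, not the Genome Institute's pipeline: the `Variant` fields, the gene labels, the 1% population-frequency cutoff, and the rule that only conserved, protein-changing, tumor-specific variants are kept are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    """Hypothetical annotation record; field names are illustrative only."""
    gene: str
    changes_amino_acid: bool
    residue_conserved: bool
    seen_in_matched_normal: bool
    # Fraction of the population carrying the variant; None if never observed.
    population_frequency: Optional[float]


def needs_manual_review(v: Variant, max_population_frequency: float = 0.01) -> bool:
    """Return True if the variant survives the automated questions.

    The 1% cutoff is an assumed threshold, not a value from the article.
    """
    if v.seen_in_matched_normal:
        return False  # also present in normal tissue, so not tumor-specific
    if not v.changes_amino_acid:
        return False  # no change to the encoded protein
    if not v.residue_conserved:
        return False  # altered residue is not conserved
    if (v.population_frequency is not None
            and v.population_frequency > max_population_frequency):
        return False  # common in the general population
    return True


def shortlist(variants: List[Variant]) -> List[Variant]:
    """Reduce the full variant list to the candidates a person must interpret."""
    return [v for v in variants if needs_manual_review(v)]


# Made-up records for demonstration; these are not real data.
example_variants = [
    Variant("GENE_A", True, True, False, None),
    Variant("GENE_B", True, True, True, None),
    Variant("GENE_C", False, True, False, None),
    Variant("GENE_D", True, True, False, 0.2),
]

for v in shortlist(example_variants):
    print(v.gene)
```

A filter like this can cut millions of variants down to a short list cheaply, but deciding what the surviving candidates actually mean is the step the article describes as still requiring expert judgment.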
