Sequencing
 
Lynne Lederman, Ph.D.
BioTechniques, Vol. 46, No. 3, March 2009, pp. 159–161
Erratum
This article has been changed from its original form.

The Next Generation

Every year sees new advances in all areas of biotechnology, and sequencing is no exception. As technologies improve and allow sequences to be obtained faster and less expensively, data will accumulate, and new challenges will present themselves. These include making it easier and even more cost-effective to sequence whole genomes, and improving the methods used to analyze sequence data. High-throughput, next-generation sequencing will allow the sequencing of metagenomes (i.e., the genomes of microbial communities, obtained without prior cultivation of the organisms that they comprise). These communities include those found in environmental samples from different ecological habitats and in the intestinal flora of animals. Although still in its infancy, metagenomics may allow comparison of the types and numbers of organisms among communities, as well as what those organisms may be doing, by determining the presence of various functional genes.

Filling the Gaps

Matthew Huentelman, associate investigator of the Neurobehavioral Research Unit at the Translational Genomics Research Institute (TGen) in Phoenix, Arizona, is interested in the genetics of human neurologic diseases, including neurodegenerative and neurobehavioral disorders. He says that emerging sequencing technologies are allowing researchers to tackle problems in human genetics that couldn't be addressed before. Although sequencing has become dramatically more cost-effective in recent years, Huentelman feels that cost, throughput, and bioinformatics must develop further before researchers can truly unravel human diseases by comparing whole-genome sequence data. His focus is on methods that let him apply sequencing technologies to the most tractable problems first. “Some might call this picking the ‘low-hanging fruit,’” he acknowledges. Still, his work goes beyond investigating one gene at a time and relating it to disease. “Now it's possible to envision sequencing every pathway member related to a disease state, so the obvious first approach can still be largely hypothesis-driven.”



Next-generation sequencing (NGS) platforms initially allowed only a small number of samples, limited by the number of lanes, to be run at one time. The addition of short nucleotide tags (barcoding) has allowed a mixture of individually tagged samples to be run together in one lane. Instead of 8 individual samples, each running in its own lane, a mixture of 12 samples per lane (together, the contents of a 96-well plate representing a focused set of genes) could be run at one time. “I think that's a powerful ability for the sequencing field because we can think about pathways for a given disease. Every coding region in the genome is also a logical target…[we could] assign function to mutation in coding regions for a disease,” Huentelman observes.
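The barcoding idea can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each read begins with a fixed-length tag and that tags must match exactly; the barcode sequences, sample names, and function names below are hypothetical and do not come from any particular platform or kit.

```python
from collections import defaultdict

# Hypothetical 4-nucleotide barcodes; real kits define their own tags.
BARCODES = {
    "ACGT": "sample_01",
    "TGCA": "sample_02",
    "GATC": "sample_03",
}
BARCODE_LEN = 4


def demultiplex(reads, barcodes=BARCODES, barcode_len=BARCODE_LEN):
    """Group pooled reads by their leading barcode and strip the tag.

    Reads whose tag is not in the barcode table go to "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        sample = barcodes.get(tag, "unassigned")
        bins[sample].append(insert)
    return dict(bins)


def plate_capacity(lanes=8, samples_per_lane=12):
    """Samples per run when each lane carries a barcoded mixture."""
    return lanes * samples_per_lane


pooled = ["ACGTAAAA", "TGCACCCC", "NNNNGGGG"]
by_sample = demultiplex(pooled)
total_samples = plate_capacity()
```

With 8 lanes and 12 tagged samples per lane, `plate_capacity()` gives the 96 samples of a full plate, compared with 8 samples when each lane holds a single sample.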

To selectively sequence regions of the genome, one must first enrich for (i.e., specifically capture) these regions. For this purpose, Huentelman's group is investigating HybSelect technology from febit (Heidelberg, Germany), which allows sequence capture to be performed on the company's fully customizable arrays. Individually tagged DNA segments are applied to an array, then enriched, eluted, collected, and sequenced. “The array hybridization approach has the ability to eliminate multiple polymerase chain reaction (PCR) steps,” explains Huentelman. “The field is trying to decide whether sequence enrichment on arrays or on other solid supports or multiplex PCR is the best.” That may be decided, at least in part, by how many of each technology's reads are “on target” (that is, map to regions of the genome where sequencing was desired) versus “off target” (map to undesired regions).
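One way to make the on-target versus off-target comparison concrete is to compute the fraction of mapped reads that fall inside the intended capture regions. The sketch below is an illustration under simplifying assumptions: each read is reduced to a (chromosome, start position) pair, target regions are half-open intervals, and all names and coordinates are hypothetical.

```python
def is_on_target(chrom, pos, targets):
    """True if a mapped read start lies inside any target interval.

    `targets` maps chromosome name -> list of (start, end) half-open intervals.
    """
    return any(start <= pos < end for start, end in targets.get(chrom, []))


def on_target_fraction(alignments, targets):
    """Fraction of mapped reads that land in the desired regions."""
    if not alignments:
        return 0.0
    hits = sum(is_on_target(chrom, pos, targets) for chrom, pos in alignments)
    return hits / len(alignments)


# Hypothetical capture design and mapped reads.
targets = {"chr1": [(100, 200)]}
alignments = [("chr1", 150), ("chr1", 250), ("chr2", 150), ("chr1", 100)]
fraction = on_target_fraction(alignments, targets)
```

Here two of the four reads fall inside the single target interval, so the fraction is 0.5. A real comparison would likely also account for read length, partial overlaps, and mapping quality.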

Still, Huentelman believes that a combined barcoding and sequence-capture approach will open up the ability to sequence not only specific regions, but the entire known collection of coding regions throughout the genome. He notes that common variants in many diseases are increasingly being identified by analyzing single nucleotide polymorphisms (SNPs). Sequencing is filling in the gaps. His laboratory is optimizing approaches to ask specific questions about particular diseases of the brain, primarily autism and Alzheimer's disease.

All NGS platforms do a good job, he says. The difficulty is on the back end, namely in how bioinformatics will handle all the data. “NGS companies have provided scientists with a great tool. Without a doubt, the data coming off the machines is so large in scale we are struggling to find space to store it while waiting to analyze it.” Questions that must be addressed include what to archive and whether raw data in the form of images should be stored. “Too much data is a great problem to have,” Huentelman says, “but if we are going to tackle next-generation sequencing, a lot of resources will be required for bioinformatics. You need an in-house support staff. Thankfully, the SNP field and SNP chips helped initiate migration of computer scientists to the genetics field,” he notes, adding that there is a lot of room for the field to grow in the future. “My lab and others will only generate more data.” This data-explosion problem is one that will provide an opportunity for those with backgrounds in computing and informatics to make an impact in biology.

At the Core

Robert Lyons is the Director of the DNA Sequencing Core at the University of Michigan in Ann Arbor, Michigan. The Core provides university investigators with access to automated DNA sequencing technology for DNA samples provided as pure plasmids, mini-prep plasmids, M13 clones, cosmids, lambda clones, PCR products, gel-isolated fragments, or bacterial genomes. At least 900 nucleotides of sequence data can be obtained from one sequencing run (in the form of a raw sequence data file and a chromatogram data file), as long as the template is of good quality and the primer is well-designed. “Our clients know what they're doing,” observes Lyons. “If we get to the $1000 genome, there are people who know what they'll do with it.” For example, in cancer therapy, the effect of a given chemotherapeutic agent depends on a number of genes that could determine whether a tumor will respond to it, or even be hyper-responsive. The $1000 genome could offer an informed way to choose which agent would be appropriate for a given patient, Lyons suggests. The 900-nucleotide sequence is relatively cheap at $3–$4 per run. For a large, scaled-up genome center running hundreds of thousands to a million sequences in parallel, he explains, the cost for that same sequence could be as low as $0.30–$1.

“These 900-nucleotide sequence runs are incredibly valuable to our clients. We also provide NGS services, which generate vast swaths of data from vast swaths of DNA. Our clients usually know what they're after, but they may not realize how much data they will get.” Lyons agrees with Huentelman that data generation on this scale can pose problems. “New sequencers put out mind-boggling amounts of data; you can't look at them on a desktop computer,” he explains. “You need substantial computer power and expertise to interpret NGS data, as well as massive amounts of storage to save those results.” Although the Core's clients may abstractly understand the volume of data they will be getting, it doesn't hit home until they receive it, and many are unprepared to handle the data. For its own data handling, the Core relies on the University of Michigan Center for Computational Medicine and Biology (CCMB), which was created to encourage collaborative, interdisciplinary research in bioinformatics. The CCMB's Collaborative Computing and Data Unit supports multiple high-performance computing clusters, provides access to clusters, and hosts and manages more than 50 terabytes (TB) of storage and databases.

“Bioinformatics is an underpopulated field,” observes Lyons. The University of Michigan has a graduate program offering both master's and doctoral degrees in which, Lyons says, students will be introduced to NGS data sets and be ready to enter the field. He notes, however, that archiving information is still a problem. His laboratory has 20 TB of storage, which he characterizes as not nearly enough. There are 70 TB available elsewhere on campus that will be expanded to 120 TB. He anticipates there will be a need to expand to petabyte (PB, or 1024 TB) capacity within a year.

“One saving factor is when running a big NGS experiment, the 3–4 TB of data are in the form of camera images,” Lyons says. “When the images of interest to the client are processed, the information is reduced to gigabytes of data.” The cost of saving the images may exceed the cost of running the experiment: as long as the sample is available, it makes more sense to discard the images and rerun the sequencing experiment if necessary. But this doesn't always work: some researchers at the university are not determining sequences, but rather investigating various computational mechanisms for extracting data from images. These individuals need to save the images, and they need the space to do so. “Another ongoing issue is when data are stored, whatever media they are stored on may not be available 10 years later,” Lyons observes. “We are not facing that issue yet. The time scale is probably five to ten years for longevity of media and the technology used to read and write it.” Those who still have piles of floppy disks sitting around can attest to how brief this time scale really is.
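The storage figures quoted by Lyons lend themselves to a quick back-of-the-envelope calculation. The Python sketch below uses only numbers from the article (3–4 TB of images per large NGS experiment, 20 TB of laboratory storage, and 1 PB taken as 1024 TB); the function name is hypothetical, and the calculation assumes the storage holds nothing but raw images.

```python
TB_PER_PB = 1024  # the article defines a petabyte as 1024 TB


def runs_that_fit(capacity_tb, image_tb_per_run):
    """Whole experiments whose raw images fit in the given capacity."""
    return int(capacity_tb // image_tb_per_run)


# 20 TB of lab storage, at 4 TB and 3 TB of images per experiment.
lab_low = runs_that_fit(20, 4)
lab_high = runs_that_fit(20, 3)

# One petabyte, at the same per-experiment image sizes.
pb_low = runs_that_fit(1 * TB_PER_PB, 4)
pb_high = runs_that_fit(1 * TB_PER_PB, 3)
```

Under these assumptions, 20 TB holds the raw images from only five or six large experiments, which is consistent with Lyons's description of that capacity as not nearly enough.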



