Genome sequence data: management, storage, and visualization
 
Jacqueline Batley1 and David Edwards2
1Australian Centre for Plant Functional Genomics, School of Land, Crop and Food Sciences and ARC Centre of Excellence for Integrative Legume Research, University of Queensland, Brisbane, Australia
2Australian Centre for Plant Functional Genomics and School of Land, Crop and Food Sciences, University of Queensland, Brisbane, Australia
BioTechniques Special Issue, Vol. 46, No. 5, April 2009, pp. 333–336
Abstract

Over the last few years there has been a revolution in DNA sequencing technology that has brought down the cost of DNA sequencing and made the sequencing of an increasing number of genomes both feasible and cost effective. There has also been a dramatic shift in the type of sequence data being generated, with vast numbers of short reads or pairs of short reads replacing the traditional relatively long reads produced by Sanger sequencing. These changes in data quantity and format have led to a rethinking of sequence data management, storage, and visualization, and provide a challenge for bioinformatics. The vast amount of sequence data that will be generated over the next few years will require a change in what data are stored and how users query the information.

The new data

Genome sequencing was revolutionized by the introduction of commercial pyrosequencing and the release of the Roche 454 GS20 in 2006, which could produce 20 Mbp per run. This was replaced by the GS FLX in 2007, with a 5-fold improvement in data output to 100 Mbp, which increased in 2008 to 400 Mbp (1). Since the introduction of second-generation sequencing by Roche 454, both Illumina and Applied Biosystems (AB) have joined the market with the GAII and SOLiD systems, respectively. Roche has continued extending average read length (currently ~400 bp with the release of the Titanium methods in 2008), while Illumina and AB have focused on producing vast numbers of reads, with read lengths currently at 75 bp (www.illumina.com) and 35 bp (2), respectively. Manufacturers are constantly increasing the number and length of reads per run, as well as working to improve read quality, such that at the time of publication it is possible to generate 20 Gbp of sequence in a single run of the Illumina GAII (www.genengnews.com/news/bnitem.aspx?name=48514221). The rapid advances in sequence data generation suggest that we are still in the early stages of technology development in this field and that data production will continue to increase dramatically over the next few years.

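To put these throughput figures in context for data storage, the short Python sketch below estimates the sequence yield and the approximate on-disk size of the plain-text (FASTQ-style) output of a run, given a number of reads and a read length. The read count and per-read overhead used here are illustrative assumptions rather than manufacturer specifications.

    def estimated_run_output(n_reads, read_length):
        """Rough estimate of sequence yield and on-disk text size.

        Assumes FASTQ-style storage: one quality character per base,
        plus an assumed fixed per-read overhead for headers and
        separator lines. All figures are illustrative.
        """
        bases = n_reads * read_length              # total yield in base pairs
        per_read_overhead = 50                     # assumed header/separator bytes per read
        bytes_on_disk = n_reads * (2 * read_length + per_read_overhead)
        return bases, bytes_on_disk

    # Example: an assumed run of ~270 million 75-bp reads, roughly the
    # ~20 Gbp per run cited above.
    bases, size = estimated_run_output(n_reads=270_000_000, read_length=75)
    print(f"Yield: {bases / 1e9:.1f} Gbp, ~{size / 1e9:.0f} GB of FASTQ text")
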
It is clear that we are only at the start of next- or second-generation sequencing. Commercial companies are continuing to increase their sequencing capability and are being joined by additional commercial sequencing technologies that will drive the market and sequencing capacity much further over the coming decade. Helicos (www.helicosbio.com) has recently come onto the market with the first single-molecule sequencing system, offering sequencing without the DNA amplification required by all previous systems (3). Both Pacific Biosciences (www.pacificbiosciences.com) and Oxford Nanopore Technologies (www.nanoporetech.com) have commercial products in the pipeline, and DNA sequencing capacity is expected to continue to grow at an exponential rate for several years to come. Pacific Biosciences uses single-molecule real-time (SMRT) sequencing, in which a DNA polymerase molecule is attached to a chip and visualized as it synthesizes DNA from a single-stranded template (4). The nanopore sequencing technology uses a thin membrane containing nanopores; the target DNA is applied to the membrane and a current is passed across each nanopore. The time taken for a polynucleotide to translocate through a nanopore channel depends on its sequence rather than its length, allowing the sequence to be determined (4).

The thousand-dollar genome is within sight (5,6), and soon this cost will apply not only to resequencing but also to de novo genome sequencing. The new technologies can be applied to sequence the large and complex genomes of agronomically important plant species. It was previously considered unfeasible to sequence crops such as wheat, whose genome is six times larger than the human genome and consists predominantly of repetitive elements. The size and hexaploid nature of genomes such as wheat create significant bioinformatics challenges, and although sequence data generation is now relatively inexpensive, it may be several years before bioinformatics methods are capable of assembling such large and complex genomes. Genomics technologies have moved from gene to genome sequencing and are now capable of sequencing whole environments of microorganisms. The production of this metagenomic data, however, creates another challenge for data management, because sequences cannot always be associated with specific species, as has been traditional for gene and genome sequence data.

Data management and storage

The International Nucleotide Sequence Databases, consisting of GenBank (7), the DNA Data Bank of Japan (DDBJ) (8), and the European Molecular Biology Laboratory (EMBL) (9), provide the principal repositories for DNA sequence data. In addition to hosting the text sequence data, they host basic annotation and, in many cases, the raw data from which the text sequences were derived. Although submitting the raw trace file data for Sanger sequences is often a requirement for publication, the storage of raw data for the new technologies is problematic due to the vast size of the images. The cost of storing the gigabytes of raw data produced by each run of the Illumina GAII or AB SOLiD has been estimated to be greater than the cost of generating the data in the first place. It is now common practice to delete the raw image files once they have been processed to produce the relatively small text sequence and quality data files. While the long-term storage of the text sequence files is feasible using current tape and disk technology, maintaining the data in a form where it can readily be interrogated by users is more of a challenge. The GenBank sequence repository continues to increase in size exponentially, and searching this data using standard sequence comparison algorithms takes a considerable amount of time. Additionally, a large amount of computing power is needed to run standard tools such as BLAST. New sequence comparison tools such as ZOOM (10) have been developed specifically for second-generation short-read sequences; however, it may be some time before a standard comparison tool equivalent to BLAST becomes prevalent for short reads.

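The gap between BLAST-style database searching and the newer short-read tools is largely one of indexing strategy: rather than scanning the database for every query, short-read aligners typically pre-build an exact-match seed index of the reference and look each read up against it, verifying only the candidate positions the seeds suggest. The Python sketch below illustrates this general seed-and-lookup idea with a simple hashed k-mer index; it is a minimal illustration, not the algorithm actually used by ZOOM or any other specific tool.

    from collections import defaultdict

    def build_kmer_index(reference, k=11):
        """Index every k-mer of the reference by its start positions."""
        index = defaultdict(list)
        for i in range(len(reference) - k + 1):
            index[reference[i:i + k]].append(i)
        return index

    def seed_read(read, index, k=11):
        """Return candidate reference positions where the read may align.

        Each k-mer of the read that occurs in the index proposes an
        implied read start position; a real aligner would then verify
        these candidates with a full, mismatch-tolerant comparison.
        """
        candidates = set()
        for offset in range(len(read) - k + 1):
            for pos in index.get(read[offset:offset + k], []):
                candidates.add(pos - offset)
        return sorted(candidates)

    # Toy example with an assumed reference sequence and a simulated read.
    reference = "ACGTACGTTAGCCGATCGATCGGGCTAGCTAGCTAACGTTAGC"
    index = build_kmer_index(reference, k=11)
    read = reference[8:40]               # a 32-bp read taken from position 8
    print(seed_read(read, index, k=11))  # prints [8], the read's true origin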