Going to Great Read Lengths

10/27/2014
Janelle Weaver, Ph.D.

In a revival of the era of finished genomes, scientists are using the long reads offered by third-generation sequencing technologies to close gaps in genome assemblies. Can next-generation sequencing catch up? Janelle Weaver reports.


Next-generation sequencing has made it possible for scientists to sequence genomes faster and at a much lower cost than with Sanger sequencing, paving the way for the $1000 genome. But this approach sacrifices read length for speed, cutting average read lengths to about 100 base pairs, compared with 800–900 base pairs for Sanger sequencing (1). Short read lengths make genome assembly more difficult because additional coverage (i.e., more overlapping sequence reads) is required to produce a comparable assembly (2).

Image caption: The current draft genome of the rhesus macaque, an important biomedical animal model, contains sequence gaps in up to 20% of its gene models. Source: Einar Fredriksen, Wikimedia Commons

Image caption: The PacBio RS system produces average read lengths that span several thousand bases and maximum read lengths of up to 30,000 bases in some cases. Source: PacBio

Image caption: Worley and her colleagues developed an automated software tool called PBJelly, which aligns long PacBio reads to draft assemblies to close or improve gaps while preserving annotations. Source: PLoS One

But deeper coverage does not compensate for certain problems. For de novo assembly, repetitive sequences longer than the read length produce gaps, resulting in more fragmented assemblies in recent years than in the past. As a result, it’s more difficult to detect variation in repetitive regions, which may be important for understanding certain diseases.
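The point about repeats and read length can be made concrete with a minimal sketch. It assumes a simplified rule that is not taken from the article: a read resolves a repeat only if it covers the whole repeat plus a short unique "anchor" on each side. The function name, the 20 bp anchor, and the 300 bp repeat are hypothetical values chosen for illustration.

```python
def repeat_is_spanned(read_length, repeat_length, min_anchor=20):
    """Return True if a single read can cover the whole repeat plus a
    unique anchor of `min_anchor` bases on each side (assumed rule)."""
    return read_length >= repeat_length + 2 * min_anchor


if __name__ == "__main__":
    # Hypothetical 300 bp repeat, compared against typical short-read,
    # Sanger-scale, and long-read lengths mentioned in the article.
    for read_length in (100, 900, 5000):
        status = "spanned" if repeat_is_spanned(read_length, 300) else "gap likely"
        print(f"{read_length:>5} bp read vs 300 bp repeat: {status}")
```

Under this rule, adding more 100 bp reads never helps with a 300 bp repeat, because no individual read reaches unique sequence on both sides. That is why deeper coverage alone cannot close such gaps.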

“The frustrating thing about short-read data is that there’s not a lot of information content in a 100 base pair read,” says Kim Worley, a geneticist at the Human Genome Sequencing Center at Baylor College of Medicine. She points out that the current draft genome of the rhesus macaque, an important biomedical animal model, contains sequence gaps in up to 20% of its gene models.

“We have finished the human genome and the mouse genome,” she says. “But even those finished genomes have regions that are not completely contiguous and correct, and users of those data are always dissatisfied with those regions.”

To address this issue, Worley and her colleagues turned to the Pacific Biosciences (PacBio) RS platform, a third-generation sequencing technology that can perform single-molecule sequencing reactions in real time. The system produces average read lengths that span several thousand bases and maximum read lengths of up to 30,000 bases in some cases.

Those long sequence reads simplify genome assembly because they can span repeat regions. And because no amplification of source DNA is required, certain sequencing artifacts and genome coverage biases are also reduced. Because the PacBio RS platform produces long reads without GC bias or systematic errors, it is uniquely suited for upgrading genome assemblies.

As reported previously in PLoS ONE (3), Worley and her colleagues developed an automated software tool called PBJelly, which aligns long PacBio reads to draft assemblies to close or improve gaps while preserving annotations. Applying this approach to four genomes (a simulated Drosophila melanogaster genome, the version 2 draft for Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the sooty mangabey genome), the researchers addressed 63%–99% of gaps, closing 32%–69% of them and improving 12%–63%.

“We’re experiencing a renaissance and a revival of the era of finished genomes,” says Jonas Korlach, chief scientific officer at PacBio. “That was really the norm back in the days of Sanger sequencing, but when next-generation technologies came around, it was really almost abandoned because it was not possible or it was so cumbersome to close those genomes with Sanger sequencing.”

Playing Catch Up

In principle, PBJelly can be applied to long sequence reads produced by any platform. This feature may be important in the future when next-generation sequencing companies catch up with PacBio’s read lengths.

One move in this direction is the acquisition of the San Francisco-based startup Moleculo by Illumina. Technology developed by Moleculo allows large DNA fragments to be sequenced on standard Illumina next-generation sequencing systems for subsequent assembly into synthetic long reads. The short sequence reads originating from each molecule are assembled separately, and the end result is a full sequence of all the fragments. Essentially, short-read data are reconstructed into long reads.

At the International Plant and Animal Genome Conference, a team of scientists reported that Moleculo technology could produce long, accurate DNA sequencing reads spanning 1.5–15 kilobases using the Illumina HiSeq2000 platform.

Another example of long-read technology is the 454 GS FLX+ system, which can deliver reads of up to 1000 base pairs. Right now, a research consortium is using this sequencing technology to analyze and assemble the RP11 human reference genome as part of an effort to close gaps and uncover novel genes in the genome sequence.

“One of the things that 454 has been known for is the highest-quality, longest-read sequencing on the market today,” says Todd Arnold, vice president for research and development at 454 Life Sciences, a Roche company. And the read length and throughput are only going to get better, he says. “What we strive for is to preserve our quality score as we increase the read length, because it’s very important to our customers.”

But according to Korlach, other existing technologies will never be able to catch up with PacBio. “There are fundamental technological differences and limitations that prevent other commercially available technologies from providing contiguous single reads of the lengths that we can provide,” he says.

Even so, one downside of the PacBio long-read technology is its high error rate. Although highly accurate sequencing results can be achieved through building consensus sequences, the PacBio RS instrument generates single-pass reads that average only 87–89% nucleotide accuracy.
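To see why consensus building can rescue a low single-pass accuracy, here is a rough worked calculation. It assumes that errors in different passes are independent and random, and that the consensus base is decided by a simple majority vote. Both are simplifications made for this sketch; they do not describe PacBio's actual consensus algorithm. The function name is hypothetical.

```python
from math import comb


def consensus_accuracy(per_read_accuracy, passes):
    """Probability that a strict majority of `passes` independent reads
    is correct at one position (binomial model; ties count as wrong)."""
    needed = passes // 2 + 1  # smallest number of correct reads that wins
    p = per_read_accuracy
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    # 0.87 is the lower end of the single-pass accuracy quoted above.
    for passes in (1, 5, 15, 31):
        print(f"{passes:>2} passes: {consensus_accuracy(0.87, passes):.6f}")
```

The model is pessimistic in one respect: it treats every wrong call as agreeing with every other wrong call, whereas real errors are spread across substitutions, insertions, and deletions. Even so, it shows per-base accuracy climbing quickly as independent passes accumulate.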

“We’re working on improving that, but the accuracy will probably be lower than other existing technologies for a significant amount of time because our technology is fundamentally based on single-molecule, real-time detection,” says Edwin Hauw, the company's senior director of product management.

Putting Long Reads to the Test

At the University of Tokyo, computational biologist Michiaki Hamada isn’t too concerned about those error rates. “In my opinion, these high error rates do not raise serious issues, because most of the errors can be corrected by using short reads with low error rates, such as those produced by Illumina sequencers,” he says.

In a study, Hamada and his team developed a read simulator, called PBSIM, which captures the key features of PacBio reads. “Our long-term research goal is to develop a de novo assembler for long reads produced by, for example, PacBio sequencers,” says Hamada. “But there was no available simulator that targeted the specific generation of PacBio libraries.”

As reported last year in Bioinformatics (4), Hamada and his team used PBSIM to analyze 13 PacBio datasets. After conducting hybrid error correction and assembly tests for PacBio reads, they found that extensive assembly results can be obtained with a continuous long-read coverage depth of at least 15, in combination with a circular consensus sequencing coverage depth of at least 30. “PBSIM can be used not only in evaluating assemblers for PacBio sequencers, but also in experimental design for sequencing,” says Hamada.
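For the experimental-design use Hamada mentions, the coverage thresholds translate into read counts with simple arithmetic: average depth is total sequenced bases divided by genome size. The sketch below applies that relationship. The genome size and mean read lengths are hypothetical placeholders, not figures from the study, and the function is not part of PBSIM.

```python
from math import ceil


def reads_needed(genome_size, target_depth, mean_read_length):
    """Number of reads required so that
    reads * mean_read_length / genome_size >= target_depth."""
    return ceil(genome_size * target_depth / mean_read_length)


if __name__ == "__main__":
    genome_size = 5_000_000  # hypothetical 5 Mb genome
    # Depth thresholds (15 and 30) come from the text; the mean read
    # lengths of 3000 and 500 bases are assumed for illustration only.
    clr = reads_needed(genome_size, target_depth=15, mean_read_length=3000)
    ccs = reads_needed(genome_size, target_depth=30, mean_read_length=500)
    print(f"continuous long reads needed: {clr}")
    print(f"circular consensus reads needed: {ccs}")
```

This gives only an average. Actual depth varies along the genome, so a real design would add a safety margin.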

In the end, because these gaps in reference genomes could contain genes involved in disease, capitalizing on long-read technology could have a big impact in the clinical realm. For example, in their study of the RP11 reference genome, Arnold and colleagues identified a region that might be involved in cancer development. “There was evidence for that gene that came out of earlier RNA sequence data, but this didn’t appear in the reference genome, so anyone who was doing resequencing studies wouldn’t see it,” says Arnold. “The more complete the reference library is, the better your ability to use this data in a positive fashion.”

References

  1. Treangen, T.J., and S.L. Salzberg. 2011. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13(1):36-46. doi: 10.1038/nrg3117.
  2. Schatz, M.C., A.L. Delcher, and S.L. Salzberg. 2010. Assembly of large genomes using second-generation sequencing. Genome Res 20(9):1165-73. doi: 10.1101/gr.101360.109.
  3. English, A.C., S. Richards, Y. Han, M. Wang, V. Vee, J. Qu, X. Qin, D.M. Muzny, J.G. Reid, K.C. Worley, and R.A. Gibbs. 2012. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7(11):e47768. doi: 10.1371/journal.pone.0047768.
  4. Ono, Y., K. Asai, and M. Hamada. 2013. PBSIM: PacBio reads simulator--toward accurate genome assembly. Bioinformatics 29(1):119-21. doi: 10.1093/bioinformatics/bts649.