Sequencing Risk Reduction

Jesse Jenkins

Want to know how much sequence data is required to answer your questions? Now there’s an algorithm for that.

Although DNA sequencing is getting cheaper as each day passes, large-scale deep sequencing experiments still require a significant investment of time and money without any guarantee of useful data. Wouldn’t it be nice if you could predict the usefulness of additional sequencing by analyzing some preliminary data first? Well, a new algorithm promises to do just that.

Difficulties in predicting library complexity from initial shallow sequencing. Source: Nature Methods

In a paper published this week in Nature Methods (1), researchers from the University of Southern California (USC) describe Preseq, an algorithm that predicts the molecular complexity of a DNA sample or library.

“Based on what we’ve seen by looking through databases like the Sequence Read Archive at the National Center for Biotechnology Information, a lot of people have this problem,” said study author Andrew Smith, a USC associate professor of biological sciences. “There are a lot of data sets in there and when you go through them, it’s pretty clear that people are sequencing the same molecules over and over again.”

Smith believes the method can help researchers determine the appropriate depth of sequencing in order to study mutations and other rare molecules, making such sequencing projects more efficient overall.

“If you are sequencing a cancer genome and looking for a particular type of mutation, you can use a method like this to tell that you’ve come pretty close to saturation and you’re not going to see a whole lot more,” explained Smith. “Or the method can tell you that there’s a huge amount of distinct molecules left, so keep looking.”
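The saturation Smith describes can be illustrated with a deliberately simplified calculation. The sketch below assumes every molecule in a library is equally likely to be sequenced, which real libraries are not, and uses a hypothetical library size. It is not the Preseq method; it only shows why additional reads return fewer and fewer new molecules as a library approaches saturation.

```python
def expected_distinct(library_size: int, reads: int) -> float:
    """Expected number of distinct molecules observed after `reads` draws.

    Assumes uniform sampling with replacement from `library_size` molecules,
    an idealization chosen only to make the saturation effect visible.
    """
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    # Each molecule is missed by a single read with probability 1 - 1/L,
    # so it is missed by all reads with probability (1 - 1/L) ** reads.
    return library_size * (1.0 - (1.0 - 1.0 / library_size) ** reads)


if __name__ == "__main__":
    library_size = 1_000_000  # hypothetical number of distinct molecules
    for reads in (100_000, 1_000_000, 10_000_000):
        seen = expected_distinct(library_size, reads)
        print(f"{reads:>12,} reads: about {seen:,.0f} distinct molecules")
```

Under this toy model, sequencing far beyond the library size mostly re-reads molecules that have already been seen. In practice the library size is unknown and molecules are not sampled uniformly, which is what makes the prediction problem hard.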

Smith and USC graduate student Timothy Daley developed the algorithm while working on bisulfite sequencing, which tends to damage the DNA, leaving libraries of low complexity. To solve the issue, the two researchers sought a way to predict the properties of a larger sequencing experiment based on a smaller initial experiment.

After testing a number of methods, Smith and Daley developed the algorithm based on a statistical framework called ‘capture-recapture,’ which has been used to measure species abundance in ecological studies. In that sampling model, individuals are captured, tagged, and recaptured to estimate how diverse a population is and how many distinct members remain unseen.
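For readers unfamiliar with capture-recapture, the simplest textbook version is the two-sample Lincoln-Petersen estimator: if a first sample tags n1 individuals and a second sample of n2 individuals contains m tagged ones, the population is estimated as n1 × n2 / m. The sketch below applies that idea to two hypothetical batches of reads, treating each distinct molecule identifier as an individual. This is an illustration of the general framework only, not the Preseq algorithm, and the function name and example data are assumptions made for the example.

```python
def lincoln_petersen(first_batch: set, second_batch: set) -> float:
    """Two-sample capture-recapture estimate of the total number of
    distinct molecules, given the distinct molecules seen in each batch.

    Textbook estimator shown for illustration; not the Preseq method.
    """
    recaptured = len(first_batch & second_batch)  # seen in both batches
    if recaptured == 0:
        raise ValueError("no molecules recaptured; estimate is undefined")
    return len(first_batch) * len(second_batch) / recaptured


if __name__ == "__main__":
    # Hypothetical molecule identifiers (for example, mapped read positions).
    batch_one = set(range(0, 60))     # 60 distinct molecules
    batch_two = set(range(40, 100))   # 60 distinct molecules, 20 shared
    print(lincoln_petersen(batch_one, batch_two))  # 60 * 60 / 20
```

Few recaptures imply a large unseen population, while many recaptures suggest the sample is close to saturation. The estimator assumes every individual is equally likely to be caught, an assumption sequencing libraries generally violate.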

“We can currently predict the yield of a sequencing experiment 100 times larger than the initial experiment used to estimate the yield to within 10%,” explained Daley. “The accuracy of the method increases with more data, but there are no sequencing experiments large enough to test how far out we can predict using larger amounts of data to fit.”

Now, Smith’s group is further improving the algorithm’s precision to understand what happens when one sequences infinitely. “A lot of things become unstable as you take everything to infinity,” said Smith. “Infinity is just a qualitatively different thing, but we’re working toward it because we think we’ve got some good ideas.”


1. Daley T, Smith AD. Predicting the molecular complexity of sequencing libraries. Nat Methods. 2013 Feb 24.

Keywords: DNA sequencing