At its core, genome sequencing is a data processing challenge. Two labs can take the same raw sequence data and piece it together in dramatically different ways. Chunks of genes can end up in the wrong order, the two reads of a mate-pair can be placed in the wrong locations or orientations, and gaps can remain in the data. Understanding which errors each assembly method is prone to (there are now more than 20 methods), and which technique is the most accurate, is crucial to genetic research.
When evaluating assembly methods, researchers analyze 12 to 15 different features, including the number of mismatched bases at a position in an assembly layout and the number of repeat regions with high data compression. These features are usually weighted equally, on the assumption that they are independent.
But Bud Mishra and his colleagues at the Courant Institute at New York University wanted to know which feature best indicated assembly accuracy. So, they compared 21 sequences generated using five different assembly methods, publishing their results in PLoS ONE (1).
“The first thing we did was to see if all the features are non-redundant. Many errors are correlated with each other or dependent on each other,” said Mishra. “We found that there are about 3-4 informative, independent features.”
The standard way of comparing assembly methods has largely been a metric called N50, a length-weighted median of the pieced-together sequences: at least half of the assembled bases lie in sequences of that length or longer. But Mishra’s team found that it didn’t rank as the best.
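For readers unfamiliar with N50, it is conventionally computed as a length-weighted median rather than a plain mean: sort the assembled sequences from longest to shortest, add up their lengths, and report the length at which the running total first reaches half of the assembly. The short Python sketch below illustrates that convention; the function name and the sample lengths are hypothetical and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of assembled sequence lengths.

    Conventional definition (assumed here): the length L such that
    sequences of length >= L together hold at least half of all
    assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one sequence length")
    lengths = sorted(contig_lengths, reverse=True)  # longest first
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical lengths, for illustration only.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))                 # 70
print(sum(lengths) / len(lengths))  # plain mean, about 48.3
```

Because long sequences dominate the running total, N50 can be much larger than the mean length, and it says nothing about whether those long sequences were assembled correctly.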
Instead, the team suggests an algorithm that weights features by their predictive value rather than treating them equally. In their analysis, the most predictive feature was a metric that represents how much of an assembly is put together inconsistently: regions where the matching ends of mate-pairs are present only at low coverage, and so are not necessarily confirmed by the overlapping sections alone. The findings can also be used to enhance the current method of rating assembly methods, the Feature Response Curve, which relates the number of error-indicating features to the overall genome quality.
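To make the idea of a Feature Response Curve more concrete, here is a minimal Python sketch of one common formulation, which is an assumption here and not necessarily the exact procedure used in the study: for each allowed number of error-indicating features, take the assembled sequences from longest to shortest until the next one would exceed that allowance, and report the fraction of the genome they cover. The function name, the input layout, and the sample numbers are all hypothetical.

```python
def feature_response_curve(contigs, genome_size, thresholds):
    """Sketch of a Feature Response Curve under assumed conventions.

    contigs:     list of (length, feature_count) pairs (hypothetical layout)
    genome_size: estimated genome length in bases
    thresholds:  feature allowances at which to sample the curve
    Returns a list of (threshold, genome_fraction_covered) points.
    """
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)  # longest first
    curve = []
    for phi in thresholds:
        features = 0
        covered = 0
        for length, count in ordered:
            if features + count > phi:
                break  # adding this sequence would exceed the allowance
            features += count
            covered += length
        curve.append((phi, covered / genome_size))
    return curve


# Hypothetical assembly: (length, number of error-indicating features).
contigs = [(400, 2), (300, 0), (200, 5), (100, 1)]
print(feature_response_curve(contigs, genome_size=1000, thresholds=[0, 2, 5, 8]))
# [(0, 0.0), (2, 0.7), (5, 0.7), (8, 1.0)]
```

Under this formulation, a curve that rises steeply means that much of the genome is covered while few suspicious features are tolerated, so steeper curves indicate better assemblies.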
The findings also suggest that there’s room for improvement in sequence assembly methods. “I think more work in this field needs to go into algorithm design,” said Mishra. “The amount of research in data analysis should be matching up with the research in bio-chemistry, bio-physics, and bio-technology.”
In the end, assembly methods are continuously changing, so the predictive value of each feature needs to be re-evaluated. “We plan to continue to keep this alive as new assemblers come out. We hope to see a Moore’s Law of genome assembly: the Feature Response Curve becoming twice as steep every 18 months or so,” said Mishra.
1. Vezzi, F., G. Narzisi, and B. Mishra. 2012. Feature-by-Feature – evaluating de novo sequence assembly. PLoS ONE 7(2):e31002.