While genome sequencing has become more accessible for smaller labs, genome annotation remains lengthy and labor-intensive. It often requires many experts to manually analyze different regions of raw DNA sequence to identify genes before they can be added to genome databases.
“We have combined two of the most powerful algorithms that exist in the field of gene finding at this time, and this created a very versatile and accurate tool for groups around the world that work with eukaryotic genomes,” said study author Mark Borodovsky from Georgia Tech and the Moscow Institute of Physics and Technology.
“Genomic scientists can now find genes for hundreds or thousands of genomes at a time without having to spend manual time on each genome,” added co-author Mario Stanke from the University of Greifswald in Germany. “We just feed the raw sequence and alignment data into the program and it does everything automatically, and better than currently used competing pipelines.”
The team tested BRAKER1’s performance by comparing its prediction accuracy against that of MAKER2, the most commonly used gene prediction pipeline. For the comparison, the team collected nuclear genomes, reference annotations, and RNA-Seq libraries from databases for four model organisms that already have highly accurate annotations: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, and Schizosaccharomyces pombe. On average, the team found BRAKER1’s gene prediction sensitivity and specificity to be more than 10% higher than MAKER2’s.
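For readers unfamiliar with these two metrics, the following is a minimal Python sketch of how gene-level sensitivity and specificity are commonly computed in the gene-finding literature. It assumes a predicted gene counts as correct only when its structure matches a reference gene exactly, and it follows the field's convention in which "specificity" means the fraction of predictions that are correct (what other fields call precision). The function name, the tuple representation of a gene, and the toy coordinates are all hypothetical; this is not the study's evaluation code.

```python
def gene_level_accuracy(predicted, reference):
    """Return (sensitivity, specificity) for exact-match gene predictions.

    Each gene is a hashable description of its structure, here a tuple of
    (chromosome, strand, ((exon_start, exon_end), ...)). This format is a
    hypothetical one chosen for illustration.
    """
    predicted = set(predicted)
    reference = set(reference)

    # A prediction is a true positive only if it matches a reference gene exactly.
    true_positives = len(predicted & reference)

    # Sensitivity: share of reference genes that were predicted correctly.
    sensitivity = true_positives / len(reference) if reference else 0.0
    # Specificity (gene-finding convention): share of predictions that are correct.
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data, invented for illustration only.
reference_genes = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((1000, 1500),)),
    ("chr2", "+", ((50, 120), (180, 260))),
    ("chr2", "+", ((900, 990),)),
]
predicted_genes = [
    ("chr1", "+", ((100, 200), (300, 400))),  # exact match
    ("chr1", "-", ((1000, 1480),)),           # exon end differs, so not a match
    ("chr2", "+", ((50, 120), (180, 260))),   # exact match
]

sensitivity, specificity = gene_level_accuracy(predicted_genes, reference_genes)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

In this toy case two of the four reference genes are recovered and two of the three predictions are correct. A real evaluation would typically also report accuracy at the exon and transcript levels.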
According to Stanke, the BRAKER1 software has been downloaded 1,200 times since its launch in January 2015, an average of 100 downloads per month from labs around the world. He and his collaborators are now working on modifications that will further improve the software’s accuracy.
“If we look at the fruit fly tests we did, for 65% of the genes we find 1 correct version of the gene, leaving 35% of the genes where we make a mistake…so there are definite improvements to be made,” said Stanke. “With modifications and other ideas we have in mind, I think we could get this to something like 80-90% accuracy.”
Katharina J. Hoff, Simone Lange, Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke. BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics, first published online November 11, 2015. doi:10.1093/bioinformatics/btv661.