2Department of Medicine, University of California, San Diego, CA, USA
*D.C.N.'s current address is Rosetta Inpharmatics LLC, Seattle, WA 98109, USA.
**G.H.L.'s current address is University of Alabama at Birmingham, Birmingham, AL 35294, USA.
Fast and accurate estimation of phylogenies and determination of genetic and phylogenetic divergence and diversity of molecular sequences are essential components of biological research. For a set of sequences, a typical phylogenetic analysis involves several steps, including multiple sequence alignment, phylogenetic reconstruction, visualization of the inferred tree, and calculation of evolutionary measures. A large number of phylogenetic analysis resources have been developed, as cataloged by Joseph Felsenstein (http://evolution.genetics.washington.edu/phylip/software.html), including web servers that provide an easy route to address specific evolutionary questions. For example, PhyML Online (1) performs maximum likelihood (ML) phylogenetic estimation under a wide range of evolutionary models. Phylemon (2) provides experts with a suite of online programs and a Java interface to build a phylogeny pipeline. Dereeper et al. recently made available Phylogeny.fr (3), which boasts an easy-to-use interface designed for the non-specialist combined with up-to-date programs that are frequently reserved for experts.
These tools provide excellent interfaces to phylogenetic reconstruction; however, researchers increasingly need a tool that not only performs typical phylogenetic reconstructions, which most existing web servers do capably, but also enables downstream processing and interpretation. For example, calculating divergence and diversity measurements and genetic distance distributions from the phylogenetic output is usually very time-consuming and, if conducted manually, requires care to ensure that the calculations are carried out correctly and that the data have not been altered in transfer among the several necessary software packages. Furthermore, reducing an alignment to only its phylogenetically informative sites (positions at which there are at least two different character states, each of which occurs in at least two of the sequences) has proven to be a useful approach in recombination analysis (4,5,6) and in visualizing extended alignments. Calculation of central sequences and comparison of a set of sequences to a consensus (CON), most recent common ancestor (MRCA), or center of tree (COT; an ancestral state that minimizes the phylogenetic distance from the specified sequences) (7,8,9) have been used in a variety of studies of sequence evolution, structure, function, and rational vaccine design.
The need for a unified web interface to integrate useful tools and perform automated phylogenetic and other genetic analyses (including summaries and visualization of the resulting data) led us to develop DIVEIN, which has four major components: (i) a pipeline to automatically guide a set of aligned sequences through phylogenetic tree estimation under a variety of evolutionary models, and visualization of the inferred tree; (ii) an interface to reconstruct MRCA/COT/CON sequences and reconstruct and visualize trees re-rooted by MRCA and COT sequences; (iii) calculation of genetic distance distributions, pairwise diversity and divergence from the MRCA/COT/CON; and (iv) an interface to detect, visualize, and numerically summarize phylogenetically informative sites as well as private mutations (found only in a single sequence) in an alignment.
DIVEIN runs on an Apache web server. The web interfaces are implemented via Perl CGI and JavaScript. Data manipulation and presentation employ standard Perl and BioPerl (10) modules. Maximum likelihood phylogenetic reconstructions use PhyML v3.0 (11), which applies a hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. The inferred tree can be viewed and edited through the included Archaeopteryx v0.955 β Java applet (www.phylosoft.org/archaeopteryx). The MRCA and COT sequences are reconstructed using a joint maximum likelihood procedure (12) via HyPhy v2.0 (13), a scriptable software package for performing a wealth of evolutionary sequence analyses. Distance distribution histograms and divergence and diversity plots are generated using the open source Gnuplot graphing package (www.gnuplot.info). DIVEIN is hosted on a Linux computer with two quad-core Intel Xeon 2.5 GHz processors (8 cores) and 8 GB RAM. It is configured to run up to eight user-submitted projects simultaneously, with additional projects queued for later execution. Bootstrap replicates are limited to 100 because of computational resource limitations.
Given a collection of sequences, the divergence is derived by calculating the mean distance of all sequences from a reference or founder sequence and the diversity is given as the mean distance between all sequences (14). Using d(i,j) to denote either the path length between nodes i and j in the reconstructed phylogenetic tree or a genetic distance between sequences i and j, we measure divergence and diversity for a collection of N sequences as follows:
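Written out from the definitions in the preceding sentence, the two measures take the following form. The symbol r for the reference or founder sequence (for example the MRCA, COT, or CON) and the exact layout are assumptions made here for illustration; only d(i,j) and N come from the text.

```latex
\[
\text{divergence} \;=\; \frac{1}{N}\sum_{i=1}^{N} d(i, r),
\qquad
\text{diversity} \;=\; \frac{2}{N(N-1)}\sum_{i=1}^{N-1}\,\sum_{j=i+1}^{N} d(i, j)
\]
```

The factor 2/(N(N-1)) is the reciprocal of the number of unordered sequence pairs, so the diversity is a plain average over all pairs.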


DIVEIN accepts aligned nucleotide or amino acid sequences in NEXUS, PHYLIP, or FASTA format. For phylogenetic analyses, users can perform ML estimation alone or include divergence/diversity analyses. They can calculate divergence from the MRCA, COT, and/or CON, or from any sequence in the alignment [MRCA calculations require a file listing the name(s) of the sequences that belong to the outgroup]. Users can optionally provide a file that assigns input sequences to multiple groups and calculate divergence and diversity for each of those groups. If a group file is not provided, DIVEIN will assign all sequences to a single group, excluding the defined outgroup sequences. For COT analysis, users may upload a tree from which the COT is reconstructed. If no tree is provided, DIVEIN will estimate one using either the general time reversible (GTR) (15) substitution model (for nucleotides) or LG, an improved general amino acid replacement matrix (16) (for amino acids).
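The per-group calculation described above can be illustrated with a minimal Python sketch. It is not DIVEIN's code (DIVEIN is implemented in Perl and uses tree path lengths or model-based genetic distances). Here a simple p-distance stands in for d(i,j), and the function names, the toy alignment, and the group name "all" are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites; columns with a gap in either sequence are skipped.
    Assumption: a stand-in for the tree-based or model-based distances DIVEIN reports."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def divergence(seqs, reference, dist=p_distance):
    """Mean distance of every sequence in the group from the reference."""
    return sum(dist(s, reference) for s in seqs) / len(seqs)

def diversity(seqs, dist=p_distance):
    """Mean distance over all unordered pairs of sequences in the group."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def per_group(alignment, reference, groups=None, outgroup=()):
    """Return {group: (divergence, diversity)}.
    Without a group mapping, all non-outgroup sequences form one group."""
    if groups is None:
        groups = {"all": [name for name in alignment if name not in outgroup]}
    result = {}
    for group, names in groups.items():
        seqs = [alignment[name] for name in names]
        result[group] = (divergence(seqs, reference), diversity(seqs))
    return result

# Toy alignment (hypothetical data): three ingroup sequences and one outgroup.
alignment = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACGAACGT",
    "out": "TTGTACCT",
}
reference = "ACGTACGT"  # e.g. a consensus or founder sequence
result = per_group(alignment, reference, outgroup={"out"})
print(result)
```

Passing a mapping such as `groups={"early": ["s1", "s2"], "late": ["s2", "s3"]}` would give one divergence/diversity pair per group, mirroring the optional group file.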
DIVEIN also includes an informative sites module that condenses sequence data so that users can quickly identify the sites that vary within an alignment and more easily obtain an overview of large, complex data sets. To detect phylogenetically informative sites (those at which a variant is shared by more than one sequence, and which thus contribute to branch ordering), users can include a reference sequence at the top of the alignment; otherwise, DIVEIN will calculate the consensus of the alignment and use it as the reference. Example data sets are provided to familiarize users with the correct input formats and expected output results. DIVEIN can also retrieve finished results via a previously assigned project ID.
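As an illustration of the site classification, the following Python sketch scans alignment columns for informative sites and private mutations. It is an assumption-laden stand-in rather than DIVEIN's implementation: a site is counted as informative when at least two states each occur in at least two sequences, a private mutation is a state seen in exactly one sequence, gaps are treated like any other state, and the function name and toy data are hypothetical.

```python
from collections import Counter

def classify_sites(seqs):
    """Return (informative, private) lists of 1-based column positions.

    informative: at least two states each occur in two or more sequences.
    private: the column varies and some state occurs in exactly one sequence.
    Assumption: gap characters are counted as ordinary states.
    """
    informative, private = [], []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        shared_states = [s for s, c in counts.items() if c >= 2]
        single_states = [s for s, c in counts.items() if c == 1]
        if len(shared_states) >= 2:
            informative.append(pos)
        if single_states and len(counts) > 1:
            private.append(pos)
    return informative, private

# Toy alignment (hypothetical data).
seqs = [
    "ACGTAC",
    "ACGTTC",
    "ATGTTC",
    "ATGAAC",
]
informative, private = classify_sites(seqs)

# Condense the alignment to its informative columns only.
condensed = ["".join(s[p - 1] for p in informative) for s in seqs]
print(informative, private, condensed)
```

In this toy case columns 2 and 5 are informative, while column 4 carries a private mutation, and the condensed alignment keeps only the two informative columns.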
When an analysis is finished, a randomly generated URL, known only to the user who initiated the analysis, is emailed to that user for viewing and downloading the results, which remain accessible on the server for 2 days. Users can locally view and edit phylogenetic trees and dynamically generate and download distance distribution histograms and, if applicable, divergence and diversity plots. Sample screenshots of DIVEIN output (phylogeny/divergence/diversity) are shown in Figure 1. For an example alignment of 28 DNA sequences with 624 sites (available on the DIVEIN web site), the entire analysis takes <30 s. For the analysis of phylogenetically informative sites, the states at each informative site are displayed both as an alignment and in a table.
In conclusion, DIVEIN performs fast, accurate, and automated phylogenetic analyses, including (i) informative site detection, (ii) ML tree estimation under a variety of evolutionary models, (iii) MRCA, COT, and CON reconstruction, (iv) distance distribution calculation, and (v) distance- and phylogeny-based divergence and diversity measurements, along with summarization and visualization of the resulting data. Future versions will add the option to select the best-fit evolutionary model via ModelTest (17) and ProtTest (18) before reconstructing the phylogeny. Furthermore, we will incorporate other widely used phylogenetic analysis programs [e.g., MrBayes (19)] into DIVEIN to give users easy access to other state-of-the-art molecular evolution analysis programs.
We thank John E. Mittler for discussions. This work was supported by grants from the US Public Health Services (grant nos. AI047734 and AI057005), including support to the Computational Biology Core of the University of Washington Center for AIDS Research (grant no. AI27757).
The authors declare no competing interests.
Address correspondence to James I. Mullins, Department of Microbiology, University of Washington School of Medicine, Seattle, WA 98195, USA. e-mail: jmullins@u.washington.edu
