2Institute for Information Transmission Problems, Moscow, Russian Federation
3Vavilov Institute of General Genetics, Moscow, Russian Federation
Pyrosequencing of 16S ribosomal RNA (rRNA) genes has become the gold standard in human microbiome studies. The routine task of taxonomic classification using 16S rRNA reads is commonly performed by the Ribosomal Database Project (RDP) II Classifier, a robust tool that relies on a set of well-characterized reference sequences. However, the RDP II Classifier may be unable to classify a significant part of the data set due to the absence of proper reference sequences. The taxonomic classification for some unclassified sequences might still be performed using BLAST searches against large and frequently updated nucleotide databases. Here we introduce TUIT (Taxonomic Unit Identification Tool)—an efficient open source and platform-independent application that can perform taxonomic classification on its own or can be used in combination with the RDP II Classifier to maximize the taxonomic identification rate. Using a set of simulated DNA sequences, we demonstrate that the algorithm performs taxonomic classification with high specificity for sequences as short as 125 base pairs. TUIT is applicable for 16S rRNA gene sequence classification; however, it is not restricted to 16S rRNA sequences. In addition, TUIT may be used as a complementary tool for effective taxonomic classification of nucleotide sequences generated by many current platforms, such as Roche 454 and Illumina. Stand-alone TUIT is available online at http://sourceforge.net/projects/tuit/.
Recent advances in sequencing technologies have dramatically changed our understanding of microbial diversity, including the diversity of the human microbiome (1). The arrival of multiple next-generation sequencing platforms (2) followed by a rapid decrease in sequencing cost per nucleotide has encouraged sizable increases in the amount and scale of metagenomic research projects. Next-generation sequencing, a powerful tool for basic and clinical research, has been facilitated by the development of multiplexing techniques such as barcoding (3), which allows for parallel sequencing of dozens of unique samples in a single run.
We developed TUIT—a simple open source and platform-independent command line tool for DNA-based microbial taxonomy analysis. TUIT utilizes standard BLAST reports and a robust taxonomic database search engine for efficient taxonomic classification of nucleotide sequences. We have tested TUIT using sets of real and simulated sequences and successfully applied it to improve the taxonomic classification success rate. TUIT is not limited to any specific type of sequences and maintains high specificity levels for sequences as short as 125 base pairs; it also has the ability to classify sequences down to the species level.
The Human Microbiome Project (4) has led to a massive increase in the volume of available microbial DNA sequence data. This growth has required improvements in existing software as well as the development of new computational tools to analyze and extract the maximum amount of information from microbial sequences. For an extended period of time, the analysis of microbiome composition was widely performed using the 16S ribosomal RNA (rRNA) gene-based taxonomic classification of DNA sequence reads generated by the Roche 454 platform, rightfully placing this technology in central focus for both bacterial microbiome diversity and composition studies (5). Modern studies, however, are steadily shifting toward the more cost- and yield-efficient Illumina technology (6), which currently can produce up to 300 bp paired-end reads (7). Rapid and correct taxonomic classification of 16S rRNA gene reads is currently accomplished by a number of specialized bioinformatics software products. The Ribosomal Database Project (RDP) II Classifier has been used in the vast majority of 16S rRNA gene-based studies (8) and has become the gold standard for bacterial microbiome analysis due to its accuracy, speed, and ease of use.
The RDP II Classifier is a naïve Bayesian classifier that achieves its level of efficiency by using a library of known bacterial 16S rRNA gene sequences and a short subword-based algorithm that does not require sequence alignment calculation (9). It provides the taxonomic assignment for a given sequence with a bootstrap confidence score for each taxonomic rank in the taxonomic hierarchy. Although the RDP II Classifier is generally considered superior, other computational approaches, such as BLAST-based algorithms, have shown a similar level of effective taxonomic recovery (10). Moreover, the RDP training set comprises only a finite number of well-characterized sequences, where each sequence has been taxonomically assigned down to genus level. BLAST databases from the NCBI (National Center for Biotechnology Information), on the other hand, are continuously updated and contain the majority of annotated genomic sequences, including those of well-established taxonomy. In some cases, a search against BLAST databases can provide additional information that is sufficient to classify sequences that RDP failed to classify, thereby making these two approaches complementary.
NCBI BLAST is by far the most popular tool for sequence homology search (11) and is immediately applicable for searching against a reference database; however, BLAST does not automatically provide a definitive taxonomic classification for the query sequence. Instead, BLAST searches produce an extensive list of hits and a robust algorithm is then required to analyze the report and extract the appropriate information that defines the taxonomy of the query at the deepest taxonomic rank possible. Challenges for creating such an algorithm include the following:
- The algorithm should deal effectively with useless hits that give no taxonomic information (e.g., environmental samples, unclassified, enrichment culture, etc.). These sequences are abundant in sequence databases such as the non-redundant nucleotide database (NR) and may overcrowd BLAST reports, slow down the taxonomic identification process, and decrease the overall performance and accuracy.
- The algorithm should attempt to identify the deepest taxonomic rank possible and, in this respect, challenge the commonly used, though probably excessively conservative, lowest common ancestor (LCA) algorithm, successfully implemented by some currently available applications, such as the MEtaGenome ANalyzer (MEGAN) (12).
- It should allow flexible settings for similarity criteria for certain taxonomic ranks (species, genus, etc.), which can be established by the end user depending on the type of analysis.
- The software implementation of the method should address the need for a simple easy-to-use solution for non-expert users (as metagenome projects become more and more mainstream). This includes but is not limited to free use, distribution and extension (open source), and an ability to process universally any DNA sequences, regardless of the read length (within reasonable bounds). Also, the implementation should be platform independent.
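For contrast with the approach proposed here, the lowest common ancestor idea mentioned in the list above can be sketched as follows. This is a minimal illustration, not MEGAN's implementation: it assumes each hit has already been resolved to a root-to-leaf lineage, and the function name and example lineages are hypothetical.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the root-to-node path shared by every lineage (hypothetical helper)."""
    if not lineages:
        return []
    shared = list(lineages[0])
    for lineage in lineages[1:]:
        common = 0
        for a, b in zip(shared, lineage):
            if a != b:
                break
            common += 1
        # Keep only the prefix on which all lineages seen so far agree.
        shared = shared[:common]
    return shared


# Illustrative lineages for three hits; two agree at genus level, one does not.
hits = [
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Staphylococcaceae", "Staphylococcus"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Staphylococcaceae", "Staphylococcus"],
    ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales", "Streptococcaceae", "Streptococcus"],
]
print(lowest_common_ancestor(hits))  # ['Bacteria', 'Firmicutes', 'Bacilli']
```

A single discordant hit pulls the assignment up to the class level here, which illustrates why the LCA approach can be conservative when many high-scoring hits are present.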
The proposed algorithm, further referred to as the TUI (taxonomic unit identification) algorithm, was implemented by our group in a lightweight command line tool, TUIT (TUI tool). TUIT was written in Java and can run on Windows/*nix platforms. The TUIT precompiled application and source code are available for download at http://sourceforge.net/projects/tuit/. TUIT is a free project with a documented open source API that can be obtained and extended alongside the actual pre-compiled application. Being a front end wrapper to BLAST, TUIT calls a standard NCBI BLAST executable (assuming that the BLAST+ package has already been installed). It was designed to direct BLAST to perform in two modalities: (i) as a local search, for those systems that maintain in-house BLAST databases for searches performed locally, or (ii) as a remote search via the computational facilities of the NCBI BLAST server for compact systems where deploying local BLAST databases is unreasonable.
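The two search modes described above map onto how the BLAST+ `blastn` executable is invoked. The sketch below only assembles the command line and is not taken from the TUIT source: the helper name and file paths are hypothetical, and the XML report format (`-outfmt 5`) is an assumption, since the text does not state which report format is used.

```python
from typing import List


def build_blastn_command(query_fasta: str, database: str,
                         report_path: str, remote: bool) -> List[str]:
    """Assemble a blastn call for a local or a remote search (hypothetical helper)."""
    cmd = [
        "blastn",
        "-query", query_fasta,   # FASTA file with the query sequences
        "-db", database,         # e.g. "nt"; a local path or an NCBI database name
        "-outfmt", "5",          # XML report (assumed format)
        "-out", report_path,
    ]
    if remote:
        # Ask BLAST+ to run the search on the NCBI servers instead of locally.
        cmd.append("-remote")
    return cmd


local_cmd = build_blastn_command("reads.fasta", "nt", "report.xml", remote=False)
remote_cmd = build_blastn_command("reads.fasta", "nt", "report.xml", remote=True)
print(" ".join(remote_cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where BLAST+ is installed.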
The goal of the TUIT module is to assign taxonomy to a query at the deepest possible taxonomic rank. This task, although simple at first glance, introduces several challenges. In some rather common cases, the classification process is challenged by abundant well-aligned unclassified sequences or those sequences that belong to the nodes currently lacking proper taxonomic placement. To effectively handle such cases and increase the overall performance, TUIT allows the user to restrict the output to only those hits that can actually provide essential taxonomic information. This strategy facilitates the removal of unclassified, environmental, and other less characterized sequences from the search scope. A cleaner BLAST report is then processed with the help of an enhanced performance NCBI taxonomic database. Another challenge also effectively handled by TUIT includes the issue of contradicting hits (cases when several hits from different taxonomic groups have very similar scores).
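The removal of uninformative hits described above can be illustrated with a simple post-hoc filter. This is a sketch under stated assumptions: the text does not say how TUIT applies the restriction, so the name-based matching, the marker list, and the hit structure below are all hypothetical.

```python
from typing import Dict, List

# Marker strings taken from the categories named in the text; matching on
# taxon names is an assumption made only for this illustration.
UNINFORMATIVE_MARKERS = ("environmental samples", "unclassified", "enrichment culture")


def is_informative(taxon_name: str) -> bool:
    """True if the taxon name carries usable taxonomic information."""
    name = taxon_name.lower()
    return not any(marker in name for marker in UNINFORMATIVE_MARKERS)


def drop_uninformative(hits: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only hits whose taxon name passes the filter."""
    return [hit for hit in hits if is_informative(hit["taxon"])]


example_hits = [
    {"id": "hit1", "taxon": "Staphylococcus epidermidis"},
    {"id": "hit2", "taxon": "uncultured bacterium (environmental samples)"},
    {"id": "hit3", "taxon": "unclassified Bacteria"},
]
print([hit["id"] for hit in drop_uninformative(example_hits)])  # ['hit1']
```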
TUIT creates and maintains a specific taxonomic schema (see Supplementary Taxonomic database) within a database based on the popular and free MySQL engine, which can be stored on the local system or a remote server. During the installation process, TUIT deploys the taxonomic database using a set of corresponding files, which are downloaded from the NCBI FTP server and processed automatically. Updates for the taxonomic database can be performed frequently in an automated mode.
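To illustrate the kind of lookup such a taxonomic database supports, the sketch below reads records in the layout of the NCBI `nodes.dmp` dump file (fields separated by `|`, beginning with taxon ID, parent taxon ID, and rank) and walks from a node up to the root. It uses an in-memory dictionary rather than MySQL, does not reproduce the TUIT schema, and the taxon IDs are made up for the example.

```python
from typing import Dict, Iterable, List, Tuple


def parse_nodes(lines: Iterable[str]) -> Dict[int, Tuple[int, str]]:
    """Map tax_id -> (parent_tax_id, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def lineage(nodes: Dict[int, Tuple[int, str]], tax_id: int) -> List[Tuple[int, str]]:
    """Return (tax_id, rank) pairs from the given node up to the root."""
    path = []
    while tax_id in nodes:
        parent, rank = nodes[tax_id]
        path.append((tax_id, rank))
        if parent == tax_id:  # the root node is its own parent
            break
        tax_id = parent
    return path


# Toy records with invented IDs, written in the nodes.dmp field layout.
toy_dump = [
    "1\t|\t1\t|\tno rank\t|",
    "10\t|\t1\t|\tsuperkingdom\t|",
    "20\t|\t10\t|\tfamily\t|",
    "30\t|\t20\t|\tgenus\t|",
    "40\t|\t30\t|\tspecies\t|",
]
toy_nodes = parse_nodes(toy_dump)
print(lineage(toy_nodes, 40))
```

With parent links available, checking whether one node is parental to another reduces to testing membership in the lineage of the deeper node.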
The TUI algorithm (see Figure 1) begins by making an initial taxonomic assignment for every hit from the BLAST report list in accordance with its GI number (see Supplementary Taxonomic database). With the use of the assigned taxonomic information, the algorithm further attempts to define the lowest (deepest) taxonomic rank among the hits as current for the round's classification attempt (Step 1). It then collects a sublist of hits with the current taxonomic rank (Step 2) and filters the sublist against a rank-specific set of cutoffs. Two alignment-specific cutoffs are taken into consideration at this point: percent of identity (PI) and query coverage (QC). Among those hits that satisfy the cutoff set, TUI selects a single pivotal hit, which has the lowest (best) E-value (Step 3). Among the hits of higher taxonomic ranks and lower E-values (if any), the algorithm attempts to find at least one that would point to a taxonomic node that is not parental to the pivotal hit-assigned taxa. In the case when at least one such hit is found, (i) the algorithm rejects the pivotal hit, (ii) the current rank of classification lifts one step higher on the taxonomic tree, and (iii) the algorithm restarts from Step 1 with the lifted rank. Otherwise, TUI browses those hits with higher (worse) E-values in an attempt to find at least one hit of the same taxonomic rank that belongs to a different taxonomic subgroup (Step 4). If such a hit can be found, a statistical evaluation is performed. The algorithm tests whether we can reject the null hypothesis that the pivotal hit and the hit in question have the same degree of similarity to the query sequence at the given significance level (P ≤ 0.05 by default). First, TUIT calculates the number of nucleotide matches and mismatches of the query against the pivotal hit and the hit in question. Then, assuming the independence of substitutions, a chi-square test is used to determine the P value.
Gaps are treated as single substitutions regardless of their length. If the null hypothesis cannot be rejected, the algorithm restarts from Step 1 with the lifted rank. This approach provides better sensitivity and specificity than another approach that we have tested: using an E-value ratio cutoff to determine whether the pivotal hit has better similarity to the query than a competing hit from another taxonomic group. That alternative was based on the notion that E-values of hits from the same database can be approximated as P values for the purpose of a statistical comparison (13).
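The statistical comparison described above can be sketched as a chi-square test on a 2 × 2 table of match and mismatch counts. This is a minimal illustration, not the TUIT code: it assumes a Pearson chi-square statistic without continuity correction (the text does not specify the variant), and the function name and counts are hypothetical. For one degree of freedom, the upper-tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
import math


def chi_square_p_value(pivot_match: int, pivot_mismatch: int,
                       other_match: int, other_mismatch: int) -> float:
    """P value for H0: both hits are equally similar to the query.

    Mismatch counts are assumed to already include each gap as a single
    substitution, as stated in the text.
    """
    a, b, c, d = pivot_match, pivot_mismatch, other_match, other_mismatch
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / margins
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


ALPHA = 0.05  # default significance level given in the text

# Pivotal hit: 398 matches, 2 mismatches; competitor: 380 matches, 20 mismatches.
p_distinct = chi_square_p_value(398, 2, 380, 20)
# Competitor almost as good as the pivotal hit: 396 matches, 4 mismatches.
p_similar = chi_square_p_value(398, 2, 396, 4)

print(p_distinct <= ALPHA)  # True: the pivotal hit is significantly better
print(p_similar <= ALPHA)   # False: the rank is lifted and the round restarts
```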
TUIT provides taxonomic assignment to a query sequence at the lowest taxonomic rank possible, while complying with a strict level-specific set of cutoffs. Each cutoff set has certain default values that can be overridden within the properties file to give the algorithm extra flexibility and lessen its dependence on a particular data type and/or sequencing technique. Some studies may involve a trial-and-error approach to applying variable rank-specific cutoffs. TUIT allows the user to modify a case-specific XML-formatted property file that is applied along with a specific input.
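As an illustration of how rank-specific cutoffs and the choice of a pivotal hit fit together, the sketch below filters hits by percent identity and query coverage and then picks the one with the lowest E-value. The cutoff values are the defaults quoted elsewhere in this paper; the data structures and function name are hypothetical and do not mirror the TUIT property file.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cutoff:
    identity: float  # minimum percent identity (PI)
    coverage: float  # minimum query coverage (QC), in percent


# Default values as reported in this paper for each rank.
DEFAULT_CUTOFFS: Dict[str, Cutoff] = {
    "species": Cutoff(identity=97.5, coverage=95.0),
    "genus": Cutoff(identity=95.0, coverage=90.0),
    "family": Cutoff(identity=80.0, coverage=90.0),
}


def select_pivotal(hits: List[dict], rank: str,
                   cutoffs: Dict[str, Cutoff] = DEFAULT_CUTOFFS) -> Optional[dict]:
    """Return the lowest E-value hit of the given rank that passes the cutoffs."""
    cutoff = cutoffs[rank]
    passing = [
        hit for hit in hits
        if hit["rank"] == rank
        and hit["identity"] >= cutoff.identity
        and hit["coverage"] >= cutoff.coverage
    ]
    return min(passing, key=lambda hit: hit["evalue"], default=None)


example = [
    {"name": "a", "rank": "genus", "identity": 99.0, "coverage": 98.0, "evalue": 1e-120},
    {"name": "b", "rank": "genus", "identity": 96.0, "coverage": 85.0, "evalue": 1e-130},
    {"name": "c", "rank": "genus", "identity": 97.0, "coverage": 95.0, "evalue": 1e-110},
]
pivotal = select_pivotal(example, "genus")
print(pivotal["name"])  # a (hit b has a better E-value but fails the coverage cutoff)
```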
In exceptional conditions, such as when the BLAST report fails to provide enough information for the algorithm to unambiguously determine a proper taxonomic assignment to the query sequence at any distinct level, the level of superkingdom is used. Similarly, if a BLAST search returns no hits, TUIT reports that the sequence was not identified in the output.
Because it is implemented as a command line tool, TUIT is amenable to building custom pipelines and scripts but nevertheless remains clear and simple with a minimum of just two flagged parameters per call. Thus, it is more useful for researchers with limited experience in computational biology and for performing a small number of short tasks without needing to go through verbose manuals.
The overall computation time for a single TUIT run is limited by the BLAST search speed. Remote searches are likely to be slower than searches performed locally. The tool itself uses an updatable taxonomic database derived from the dump files of the NCBI taxonomic database. These files are downloaded from the NCBI FTP server, processed, and normalized (that is, the relational database structure is refined) in an automatic mode to include only scientific names for taxonomy and to link and index tables efficiently.
Taxonomic classification of a 16S rRNA gene data set derived from human cornea epithelium samples
For the purpose of testing TUIT in actual research settings, we used data obtained from an ongoing human cornea metagenome project. Healthy cornea samples were collected by a cornea epithelium sheaths removal procedure, performed with a delaminating instrument (Gebauer Medizintechnik GmbH, Neuhausen, Germany) during PRK and epi-LASIK surgical procedures. Total genomic DNA was extracted using a QIAGEN DNeasy (QIAGEN Inc., Valencia, CA) tissue extraction kit and the Gram-positive bacteria protocol, according to the manufacturer's manual. Mock specimens (molecular grade water) were processed in parallel with patient samples to monitor reagent purity. Following the assessment of genomic DNA quality and concentration, two replicates of multiple displacement amplification (MDA) were performed with each biological sample using the Illustra GenomiPhi V2 DNA Amplification Kit (GE Healthcare, Pittsburgh, PA). Following MDA, PCR-generated amplicon libraries were constructed using sequencing primers specific to the V3-V4 region of the 16S rRNA gene (E. coli positions 338–802) (14). Primers contained 454-specific adapter sequences as well as barcode key sequences for multiplexing, as described earlier (15). Each PCR reaction contained 0.25 μl (30 μM) of primer mix, 3 μl of template DNA, and 22.5 μl of Platinum PCR SuperMix (Invitrogen Life Technologies, Grand Island, NY). Forward and reverse primers were used in the primer mix in equal proportions. Samples were denatured at 94°C for 3 min and amplified for 35 cycles of 94°C for 45 s, 50°C for 30 s, and 72°C for 90 s. A final extension at 72°C for 10 min was performed. Negative controls, including no-template and template from unused swabs, were included at all steps to control for potential primer or sample DNA contamination. All tagged samples were pooled and sequenced in a single run on the GS FLX 454 instrument (Roche Life Sciences, Branford, CT) to avoid variation between experiments.
Sequences were assigned to the corresponding sample based on the 8 bp sample identifier tag, trimmed of primers, and classified using bioinformatic tools [MOTHUR (16), custom scripts] via the RDP II Classifier. Only sequences that were longer than 200 bp, had no ambiguous characters, and had average quality scores of more than 25 (according to 454 Roche quality control) were included in further analyses.
Taxonomic classification recovery for a training 16S rRNA gene sample set
In order to evaluate the overall accuracy of the newly developed tool, we performed a repetitive random test on an RDP II Classifier training set downloaded from the RDP II Classifier repository (http://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetAEM.tgz/download). The set contained 4622 bacterial 16S rRNA gene sequences, ranging from 1200 to 1833 bp in length, with a mean length of 1460 bp, which had a reference classification down to the genus level. Only sequences that had taxonomic naming consistent with the NCBI Taxonomy were included (17). As the minimal sequence length of 1200 bp within the set is several times longer than an average next-generation sequencer read [~450 bp for Roche 454 Titanium (18) and ~300 bp (2 × 300 bp paired reads) for the Illumina platform (7)], every sequence was further processed to produce 5 non-redundant subsequences by random fragment excision to mimic a read obtained with a sequencing platform. This procedure was done for subsequences with lengths of 125 bp, 250 bp, 400 bp, and 600 bp, yielding 4 subsets containing 23,110 sequences each. Additionally, we analyzed the full-length 16S rRNA gene sequences. TUIT performed a homology search for sequences from each subset via NCBI BLAST against bacterial sequences from the nucleotide (NT) database (as of 08/01/2013) with unclassified and environmental sequences restricted. The BLAST reports were used by TUIT for a classification analysis with the default set of rank-specific cutoffs: genus (identity: 95%, query coverage: 90%, α: 0.05) (19) and family (identity: 80%, query coverage: 90%, α: 0.05) (19).
Calculation and comparison of class-normalized sensitivity and specificity
We calculated the class-normalized sensitivity and specificity (see Results and discussion) to analyze the algorithm efficiency, as proposed in Reference 20, with a slightly modified formula for sensitivity (see Supplementary Material). A confidence cutoff of 0.8 was applied both to RDP and MEGAN reports; other parameters were at default settings. We used the stand-alone RDP II Classifier version 2.6 (9) and MEGAN version 4 (12) to compare their class-normalized sensitivity and specificity with those of TUIT.
Results and discussion
We developed TUIT as a simple solution that combines the ability to run BLAST (with parameters that maximize the taxonomic information load of the output), parse a BLAST [BLAST+ version 2.2.28 with the blastn algorithm (11)] report, and combine the processed information with taxonomic data from NCBI.
TUIT class-normalized sensitivity and specificity
We compared the class-normalized sensitivity and specificity (see Materials and methods) of TUIT, MEGAN, and the RDP II Classifier (Figure 2). A combination of these two parameters allows one to estimate the depth of taxonomic assignment (sensitivity) and how precise the choice of a taxonomic node at a given rank (specificity) is. Both parameters are weighted by classes (groups of sequences at a given taxonomic rank) so that the parameters reflect the overall ability of an algorithm to classify sequences coming from different classes. TUIT has similar class-normalized sensitivity and specificity values compared to RDP II for full-length rRNA gene sequences but has slightly lower sensitivity values for short sequences. Note that we used the RDP II training set of rRNA gene sequences, so RDP II is expected to have an advantage in this test. TUIT outperforms MEGAN in sensitivity and specificity at the genus and family levels of taxonomic classification for all tested fragment lengths. Notably, according to our results, both sensitivity and specificity of TUIT increase with query sequence length, while this was not the case for MEGAN. This might be due to the difficulty of choosing an appropriate classification when many different high-scoring hits are found and the lowest common ancestor approach is used.
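The idea of weighting both parameters by classes can be illustrated with a per-class average. The sketch below is an assumption-laden illustration: it uses one common formulation in which, for each class, sensitivity is the fraction of its true members that were correctly assigned and specificity is the fraction of assignments to that class that were correct, and both are then averaged over classes. It does not reproduce the modified sensitivity formula from the Supplementary Material, and skipping classes that received no assignments when averaging specificity is a choice made here.

```python
from collections import Counter
from typing import List, Optional, Tuple


def class_normalized_scores(truth: List[str],
                            predicted: List[Optional[str]]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) averaged over classes.

    `predicted` holds None for sequences left unclassified at this rank.
    """
    true_sizes = Counter(truth)                                        # Z_i
    assigned = Counter(p for p in predicted if p is not None)          # F_i
    correct = Counter(t for t, p in zip(truth, predicted) if t == p)   # TP_i

    sensitivities = [correct[c] / true_sizes[c] for c in true_sizes]
    # Classes with no assignments have undefined specificity and are skipped.
    specificities = [correct[c] / assigned[c] for c in true_sizes if assigned[c] > 0]

    sensitivity = sum(sensitivities) / len(sensitivities)
    specificity = sum(specificities) / len(specificities) if specificities else 0.0
    return sensitivity, specificity


# Toy example: three genera A, B, C; one read is misassigned, one is unclassified.
truth = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", None, "C"]
sens, spec = class_normalized_scores(truth, predicted)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")  # sensitivity=0.722 specificity=0.833
```

Averaging per class rather than per sequence keeps a few large, easy classes from dominating the score.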
The subsequences we used for the evaluation of classification efficiency were excised with a frame of a fixed size and a random position within reference rRNA gene sequences. This allowed us to estimate the relative impact of the frame position on TUIT's ability to correctly classify the subsequence. The results of this comparison are shown in Figure 3. We observed that the classification success rate (fraction of sequences correctly classified at the genus level) for 125 bp simulated reads was higher for reads containing V1 16S rRNA gene hypervariable regions and smaller for reads containing V2 16S rRNA gene hypervariable regions. However, starting with a 250 bp read length, this difference became less pronounced.
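The random excision of fixed-size frames can be sketched as follows. This is an illustration rather than the script used in the study: "non-redundant" is interpreted here as distinct start positions, the seed is fixed only to make the example reproducible, and the function name and toy sequence are hypothetical.

```python
import random
from typing import List, Tuple


def excise_fragments(sequence: str, fragment_length: int,
                     count: int = 5, seed: int = 0) -> List[Tuple[int, str]]:
    """Return `count` (start, fragment) pairs with distinct random start positions."""
    positions = len(sequence) - fragment_length + 1
    if positions < count:
        raise ValueError("sequence too short for the requested number of fragments")
    rng = random.Random(seed)
    starts = rng.sample(range(positions), count)  # sampling without replacement
    return [(start, sequence[start:start + fragment_length]) for start in sorted(starts)]


# Toy 1200 bp reference, the minimum length reported for the training set.
toy_rng = random.Random(42)
reference = "".join(toy_rng.choice("ACGT") for _ in range(1200))

for length in (125, 250, 400, 600):
    fragments = excise_fragments(reference, length)
    print(length, [start for start, _ in fragments])
```

Keeping the start position with each fragment makes it possible to relate classification success to the region of the gene that the fragment covers.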
TUIT displayed high specificity even when short 125 bp fragments were used. This implies that TUIT is reliable for sequence data generated by next-generation platforms such as Illumina, where reads can be relatively short. However, TUIT does perform better with longer reads.
Taxonomic classification recovery for a human 16S rRNA gene data set
To perform the analysis, we took a set of 200,476 high quality reads obtained during 16S rRNA gene sequencing of human corneal surfaces and reduced it to a set of 4700 non-redundant sequences. We were able to classify 3454 (73.5% total) non-redundant sequences at least at the family level using the RDP II Classifier (stand-alone version 2.6, allowed bootstrap value cutoff = 0.8), leaving 26.5% either unclassified or classified at a higher taxonomic level. Of those sequences that were classified at the family level, 2920 (62.1%) were also classified at the genus level. A subset of 1780 sequences, which RDP II Classifier failed to classify at the genus level, was further analyzed with TUIT (using the default parameter set). This allowed us to classify 371 additional sequences at the family level and 212 at the genus level. As a result, the fraction of classified sequences has increased to 81.4% and 66.6% at the family and genus levels, respectively.
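The percentages in the paragraph above follow directly from the reported counts, as this short check shows. Variable names are introduced here only for illustration; all counts are taken from the text.

```python
total = 4700              # non-redundant sequences
family_rdp = 3454         # classified at least at the family level by RDP II
genus_rdp = 2920          # of those, also classified at the genus level
family_added_tuit = 371   # additional family-level classifications from TUIT
genus_added_tuit = 212    # additional genus-level classifications from TUIT


def percent(count: int) -> float:
    """Share of the non-redundant set, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


unresolved_at_genus = total - genus_rdp
print(unresolved_at_genus)                       # 1780
print(percent(family_rdp))                       # 73.5
print(percent(genus_rdp))                        # 62.1
print(percent(family_rdp + family_added_tuit))   # 81.4
print(percent(genus_rdp + genus_added_tuit))     # 66.6
```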
Next, we explored whether TUIT may be applicable for the species level of classification, provided that sequence alignments satisfy the species set of cutoffs (identity: 97.5%, query coverage: 95%). TUIT allowed us to classify 341 reads from the complete non-redundant set down to the species level.
We have demonstrated that the application of TUIT may be beneficial to microbiome studies if applied in combination with other standard taxonomic classification tools such as the RDP II Classifier in several ways. First, TUIT allowed us to classify rRNA gene sequences at the species level, which may be especially relevant in clinical research. Second, TUIT was capable of classifying a substantial number of sequences that the RDP II Classifier failed to classify at genus and family levels. Since TUIT's specificity was shown to be similar to that of the RDP II Classifier in computational tests, this improvement is probably not due to higher levels of misclassifications.
Using simulated and real microbial sequence data from a human conjunctiva study (21), we showed that TUIT was successful and reliable in its ability to assign appropriate taxonomic classification to sequence reads as short as 125 bp. TUIT was also able to classify sequence reads of ~380 bp obtained from our study of the human ocular surface microbiome (unpublished data). Additionally, TUIT is not limited to 16S rRNA gene sequence analysis for bacteria classification, but may also be employed to classify sequences of different origin from many taxonomic groups. However, a certain limitation for the method is introduced by incompleteness of nucleotide sequence databases. With further updates of the databases, the method will become even more useful. Since TUIT has lower sensitivity values than the RDP II Classifier for short fragments of 16S rRNA genes, TUIT should be used in combination with the RDP II Classifier when dealing with this kind of data.
In conclusion, TUIT is a reliable tool that allows researchers to increase the efficiency of taxonomic identification of microbial DNA sequences and requires little user expertise or computational power. Due to its versatility, this algorithm can provide a unified approach to enhance metagenomic studies where analysis of sequence data generated by multiple sequencing techniques is required.
Author contributions
All authors contributed extensively to the work presented in this paper. A.T. assembled input data, performed application and database implementation, contributed to algorithm design, and edited the manuscript. A.P. designed, tested, and optimized the algorithm and edited the manuscript. V.S. designed the study, supervised the analysis, and participated in the manuscript preparation.
We thank Sergei Spirin for a helpful discussion of our approach and Abigail Hackam for helping us improve the manuscript. This work was supported by NIH grant EY02238, Russian Federal Special Program Grant 2012-1.5-12-000-1002-018 to the Russian Ministry of Science and Education, state contract 8494 of the Federal Special Program “Scientific and educational human resources of innovative Russia” for 2009–2013, and the Russian Foundation for Basic Research grant 12-04-31071. This paper is subject to the NIH Public Access Policy.
The authors declare no competing interests.
Address correspondence to Alexander Tuzhikov, Department of Ophthalmology, Bascom Palmer Eye Institute, University of Miami, School of Medicine, Miami, FL, E-mail: firstname.lastname@example.org; or Valery I. Shestopalov, Department of Ophthalmology, Bascom Palmer Eye Institute, University of Miami, School of Medicine, email@example.com.
1.) Huse, S.M., Y. Ye, Y. Zhou, and A.A. Fodor. 2012. A core human microbiome as viewed through 16S rRNA sequence clusters. PLoS ONE 7:e34242.
2.) Shendure, J., and H. Ji. 2008. Next-generation DNA sequencing. Nat. Biotechnol. 26:1135-1145.
3.) Binladen, J., M.T. Gilbert, J.P. Bollback, F. Panitz, C. Bendixen, R. Nielsen, and E. Willerslev. 2007. The use of coded PCR primers enables high-throughput sequencing of multiple homolog amplification products by 454 parallel sequencing. PLoS ONE 2:e197.
4.) Turnbaugh, P.J., R.E. Ley, M. Hamady, C.M. Fraser-Liggett, R. Knight, and J.I. Gordon. 2007. The human microbiome project. Nature 449:804-810.
5.) Tamaki, H., C.L. Wright, X. Li, Q. Lin, C. Hwang, S. Wang, J. Thimmapuram, Y. Kamagata, and W.T. Liu. 2011. Analysis of 16S rRNA amplicon sequencing options on the Roche/454 next-generation titanium sequencing platform. PLoS ONE 6:e25263.
6.) Degnan, P.H., and H. Ochman. 2012. Illumina-based analysis of microbial community diversity. ISME J. 6:183-194.
7.) Zhang, J., K. Kobert, T. Flouri, and A. Stamatakis. 2013. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics. (In press.).
8.) Wen, L., R.E. Ley, P.Y. Volchkov, P.B. Stranges, L. Avanesyan, A.C. Stonebraker, C. Hu, F.S. Wong. 2008. Innate immunity and intestinal microbiota in the development of Type 1 diabetes. Nature 455:1109-1113.
9.) Wang, Q., G.M. Garrity, J.M. Tiedje, and J.R. Cole. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73:5261-5267.
10.) Liu, Z., T.Z. DeSantis, G.L. Andersen, and R. Knight. 2008. Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Res. 36:e120.
11.) Camacho, C., G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T.L. Madden. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421.
12.) Huson, D.H., S. Mitra, H.J. Ruscheweyh, N. Weber, and S.C. Schuster. 2011. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 21:1552-1560.
13.) Karlin, S., and S.F. Altschul. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-2268.
14.) Chakravorty, S., D. Helb, M. Burday, N. Connell, and D. Alland. 2007. A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J. Microbiol. Methods 69:330-339.
15.) Hamady, M., J.J. Walker, J.K. Harris, N.J. Gold, and R. Knight. 2008. Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat. Methods 5:235-237.
16.) Schloss, P.D., S.L. Westcott, T. Ryabin, J.R. Hall, M. Hartmann, E.B. Hollister, R.A. Lesniewski, B.B. Oakley. 2009. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75:7537-7541.
17.) Sayers, E.W., T. Barrett, D.A. Benson, S.H. Bryant, K. Canese, V. Chetvernin, D.M. Church, M. DiCuccio. 2009. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 37:D5-D15.
18.) Luo, C., D. Tsementzi, N. Kyrpides, T. Read, and K.T. Konstantinidis. 2012. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLoS ONE 7:e30087.
19.) Everett, K.D., R.M. Bush, and A.A. Andersen. 1999. Emended description of the order Chlamydiales, proposal of Parachlamydiaceae fam. nov. and Simkaniaceae fam. nov., each containing one monotypic genus, revised taxonomy of the family Chlamydiaceae, including a new genus and five new species, and standards for the identification of organisms. Int. J. Syst. Bacteriol. 49:415-440.
20.) McHardy, A.C., H.G. Martin, A. Tsirigos, P. Hugenholtz, and I. Rigoutsos. 2007. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 4:63-72.
21.) Dong, Q., J.M. Brulc, A. Iovieno, B. Bates, A. Garoutte, D. Miller, K.V. Revanna, X. Gao. 2011. Diversity of bacteria at healthy human conjunctiva. Invest. Ophthalmol. Vis. Sci. 52:5408-5413.