LASAGNA-Search and MAPPER2 have large libraries of TF models, while users need to collect PWMs before using matrix-scan. Users may input a PWM or unaligned TFBSs to LASAGNA-Search for model building. Although matrix-scan also accepts PWMs, neither matrix-scan nor MAPPER2 accepts unaligned TFBSs. All three tools accept promoter sequences in FASTA format, while matrix-scan handles sequences in five additional formats. Automatic sequence retrieval for matrix-scan is accomplished by interfacing with two tools, “retrieve sequence” and “retrieve EnsEMBL sequence,” on the same web site. These two tools can retrieve sequences for a wide range of species and can be used with any TFBS search tool. LASAGNA-Search and MAPPER2 offer integrated promoter retrieval tools supporting seven and three organisms, respectively.
Visualization of predicted binding sites is usually tightly connected with promoter sequence retrieval. This is because creating a custom track in the UCSC Genome Browser requires both the genome build (release version) and the genome coordinates of the promoter sequence. For LASAGNA-Search and MAPPER2, hits found on any promoter sequence retrieved by the provided tool can be visualized with ease in the UCSC Genome Browser. Visualizing hits found by matrix-scan in the UCSC Genome Browser is possible only when the genome build and coordinates are specified in the FASTA header of the promoter sequence. Headers of sequences retrieved by the aforementioned two tools, however, do not contain the information required to visualize hits in the UCSC Genome Browser.
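To illustrate why both the genome build and the promoter coordinates are needed, the following minimal Python sketch converts promoter-relative hit positions into a UCSC custom track in BED format. The function name, the layout of the hit tuples, and the restriction to plus-strand promoters are assumptions made for illustration; this is not the code used by any of the tools discussed.

```python
def hits_to_custom_track(build, chrom, promoter_start, hits, track_name="TFBS_hits"):
    """Return UCSC custom track text for hits on one plus-strand promoter.

    build          -- genome build, e.g. "hg19" (assumed known from retrieval)
    chrom          -- chromosome name, e.g. "chr1"
    promoter_start -- 0-based genome coordinate of the promoter's first base
    hits           -- iterable of (offset, length, name, strand), where offset
                      is the 0-based position of the hit within the promoter
    """
    # The track line names the track and states the assembly it belongs to.
    lines = [f'track name="{track_name}" db={build}']
    for offset, length, name, strand in hits:
        start = promoter_start + offset   # BED start is 0-based
        end = start + length              # BED end is exclusive
        lines.append(f"{chrom}\t{start}\t{end}\t{name}\t0\t{strand}")
    return "\n".join(lines)
```

Without `build` and `promoter_start`, the offsets within a promoter cannot be mapped to positions in the browser, which is the limitation described above for sequences whose FASTA headers lack this information.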
Of the three integrative web tools, GRN inference from search results is only available using LASAGNA-Search. The PAINT tool (28) offers similar functionality by integrating Match (16) in the TRANSFAC Public or Professional databases and promoter sequence retrieval for human, mouse and rat. Compared to PAINT, LASAGNA-Search contains 1726 TF models from four source databases and retrieves promoters for seven species. The major difference between LASAGNA-Search and PAINT, however, is that LASAGNA-Search keeps track of the coding genes of TF models. This is an important feature because it allows visualization of self-regulation by self-loops and merging nodes for TF models coded by the same genes.
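The role of coding-gene tracking can be sketched in a few lines of Python. The sketch assumes that search results are available as (model, target gene) pairs and that a mapping from each TF model to its coding gene exists; the function names and data layout are hypothetical and do not reflect the internal implementation of LASAGNA-Search.

```python
from collections import defaultdict

def build_grn(hits, model_to_gene):
    """Collapse model-level hits into a gene-level regulatory network.

    hits          -- iterable of (model_id, target_gene) pairs from a search
    model_to_gene -- dict mapping each TF model to the gene coding for its TF
    Returns a dict: regulator gene -> set of target genes.
    """
    network = defaultdict(set)
    for model_id, target_gene in hits:
        regulator = model_to_gene[model_id]  # models of one gene share a node
        network[regulator].add(target_gene)
    return dict(network)

def self_loops(network):
    """Genes predicted to regulate themselves (drawn as self-loops)."""
    return {gene for gene, targets in network.items() if gene in targets}

# Hypothetical example: two models (m1, m2) are coded by the same gene.
model_to_gene = {"m1": "GENE_A", "m2": "GENE_A", "m3": "GENE_B"}
hits = [("m1", "GENE_B"), ("m2", "GENE_A"), ("m3", "GENE_A")]
network = build_grn(hits, model_to_gene)
```

Here the two models of `GENE_A` are merged into a single regulator node, and the hit of `m2` on the promoter of `GENE_A` appears as a self-loop. Neither is possible when only model identifiers are known.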
Finally, it is useful to compare LASAGNA-Search to other relevant web tools. The MEME Suite (25) offers web interfaces to four TFBS search tools with access to whole-genome promoter sequences. However, these tools have no access to the PWM database in the suite, nor do they scan promoters of specific genes or offer visualization of hits. Two tools motivated by evolutionary conservation are COTRASIF (26) and ReXSpecies2 (27). COTRASIF collects 138 JASPAR and 398 TRANSFAC PWMs and offers whole-genome Ensembl promoter sequences. However, it does not allow selection of gene-specific promoter sequences nor does it offer visualization. ReXSpecies2, on the other hand, sources PWMs from JASPAR, scans promoters of specific genes, and allows visualization in the UCSC Genome Browser, but it focuses only on human and mouse sequences, and selecting individual PWMs requires use of regular expression-like syntax.
Evaluation of precomputed TF models
Since MAPPER2 is the web tool most similar to LASAGNA-Search, we compare the TF model collections offered by these two tools on a whole-genome basis. The MAPPER2 database stores hits from the 10 kbp upstream region of each transcript for each TF model, so we scanned the same sequences using TF models offered by LASAGNA-Search. We have no access to the profile hidden Markov models (41) used by MAPPER2, and the dynamic scanning interface offered by MAPPER2 was not functioning at the time of writing. Fortunately, MAPPER2 allows users to download the top 1000 hits for each model. We therefore limited our comparison to the top 1000 hits produced by each TF model.
To evaluate model performance, human and mouse ChIP-seq data from the ENCODE project (42) were used as the gold standard. We compared the results for all validated TFs on a per-TF basis. Supplementary Tables S1 and S2 list the ChIP-seq tracks (experiments) by TF for human and mouse, respectively. We associated each TF with models that can be used to predict its binding sites. Each of the 1000 hits produced by a model was checked against the ChIP-seq peaks for the TF. Because ChIP-seq peaks are much longer than TFBSs, a hit was marked as a true positive only if it was completely covered by a peak in at least one experiment. Otherwise, the hit was marked as a false positive.
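The labeling rule can be expressed as a short Python sketch. It assumes that hits and peaks are given as (chromosome, start, end) tuples with half-open coordinates and that peaks are grouped by experiment in a dictionary; these representations and the function names are illustrative assumptions rather than the pipeline used in this study.

```python
def is_true_positive(hit, peaks):
    """True if the hit lies entirely within at least one peak.

    hit and each peak are (chrom, start, end) with half-open coordinates.
    """
    chrom, start, end = hit
    return any(
        p_chrom == chrom and p_start <= start and end <= p_end
        for p_chrom, p_start, p_end in peaks
    )

def label_hits(hits, peaks_by_experiment):
    """Return 0/1 labels (1 = true positive), one per hit, in ranked order."""
    # A hit counts if any single peak from any experiment covers it,
    # so peaks from all experiments can be pooled.
    all_peaks = [p for peaks in peaks_by_experiment.values() for p in peaks]
    return [1 if is_true_positive(hit, all_peaks) else 0 for hit in hits]
```

Note that a hit that only partially overlaps a peak is labeled a false positive under this rule. A linear scan over all peaks is used for clarity; an interval index would be preferable for genome-scale data.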
Evaluating a model based on the top 1000 hits is analogous to evaluating a search engine based on the top 1000 documents. Therefore, we used average precision (43) to score models. This performance measure is widely used in the information retrieval community and is defined as:
where P(k) gives the precision based on the fraction of the top k hits that are true positives. Indicator tp(k) is 1 if hit k is a true positive. Otherwise, tp(k) is 0. The denominator c is the portion of bases in upstream regions that are covered by peaks and was computed based on all ChIP-seq experiments used to validate the model. We also scored each model by accuracy, which is equivalent to P(1000).
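A minimal Python sketch of both scores follows. It assumes that average precision takes the usual information-retrieval form, that is, the sum of P(k)·tp(k) over the ranked hits divided by c; this form is an assumption based on the definitions above, and the function names are hypothetical.

```python
def average_precision(labels, c):
    """Average precision over ranked hits.

    labels -- 0/1 indicators tp(k) for the ranked hits (best hit first)
    c      -- the denominator described in the text (must be positive)
    Assumed form: sum over k of P(k) * tp(k), divided by c.
    """
    if c <= 0:
        raise ValueError("c must be positive")
    true_positives = 0
    total = 0.0
    for k, tp in enumerate(labels, start=1):
        true_positives += tp
        if tp:
            # P(k) is the fraction of the top k hits that are true positives;
            # it contributes only at ranks where tp(k) = 1.
            total += true_positives / k
    return total / c

def accuracy(labels):
    """Fraction of all ranked hits that are true positives, i.e. P(n)."""
    return sum(labels) / len(labels)
```

Because P(k) is only counted at ranks holding a true positive, a model that places its true positives near the top of the list scores higher than one with the same accuracy but a worse ranking.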
The performance of LASAGNA-Search and MAPPER2 for a TF was measured by the average score of the associated models. Results for LASAGNA-Search are listed in Supplementary Tables S3 and S4, while results for MAPPER2 are listed in Supplementary Tables S5 and S6. Average precision and accuracy are given in individual columns. Each row presents the performance of model predictions of the binding sites of a TF. Figure 6 shows the comparison between LASAGNA-Search and MAPPER2 in terms of average precision. A similar comparison in terms of accuracy is shown in Supplementary Figure S4.
An outlier corresponding to Mafk is seen in Figures 6 and S4. Four models in LASAGNA-Search and one MAPPER2 model were used to predict Mafk binding sites (see Tables S4 and S6). Interestingly, the best model of each tool is based on the same TRANSFAC matrix M00037. The LASAGNA-Search model is a PWM model that has no position dependence information. The MAPPER2 model, however, uses a hidden Markov model that considers position dependence. The use of position dependence gave the MAPPER2 model an edge over the LASAGNA-Search model. The other three LASAGNA-Search models performed much worse than the one based on the M00037 matrix, resulting in poor average performance on Mafk. While it is difficult to draw conclusions based on only 13 mouse TFs, the results from human TFs indicate that LASAGNA-Search models are significantly better. Overall, we observe that LASAGNA-Search significantly outperforms MAPPER2, indicating that the models used in LASAGNA-Search more accurately predict TFBSs.
We plan to improve LASAGNA-Search by expanding the content and incorporating useful features. Additional organisms will be supported in automatic promoter retrieval and visualization in the UCSC Genome Browser. To expand our TF model collections, more sources of TFBSs and PWMs such as the PAZAR database (23) and ChIP-seq data will be considered. The general binding preference (GBP) score (39) is based on multiple evidence sources including evolutionary conservation and has been shown to improve prediction of binding sites. Integrating the GBP scores with the search module will be investigated.
A recent report showed that scanning a sequence for binding sites with a cluster of TF models outperformed the best single model in the cluster (32). This strategy will benefit from our large collections of TF models and improve the TFBS search performance of LASAGNA-Search. Finally, we will enable the search for two-block motifs (44, 45), which are binding sites composed of two half sites separated by variable-length gaps. While plenty of work has been devoted to de novo two-block motif discovery (44, 46, 49), searching for two-block motif instances is more straightforward. Using two TF models with or without a gap penalty (44) will be investigated.
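One way such a two-block search could work is sketched below in Python. The sketch assumes that each half site is represented by a PWM stored as a list of per-position score dictionaries and that the gap penalty is linear in the gap length; both choices, and all names, are illustrative assumptions and not a description of the planned implementation.

```python
def pwm_score(pwm, site):
    """Sum of independent per-position scores; pwm is a list of dicts by base."""
    return sum(column[base] for column, base in zip(pwm, site))

def best_two_block_hit(seq, pwm_a, pwm_b, min_gap, max_gap, gap_penalty=0.0):
    """Return (score, start, gap) of the best two-block hit, or None.

    The first half site is scored at `start`, the second one `gap` bases
    after the first ends; `gap_penalty * gap` is subtracted from the sum.
    """
    len_a, len_b = len(pwm_a), len(pwm_b)
    best = None
    for start in range(len(seq) - len_a - len_b - min_gap + 1):
        score_a = pwm_score(pwm_a, seq[start:start + len_a])
        for gap in range(min_gap, max_gap + 1):
            b_start = start + len_a + gap
            if b_start + len_b > len(seq):
                break  # second half site would run past the sequence end
            score_b = pwm_score(pwm_b, seq[b_start:b_start + len_b])
            score = score_a + score_b - gap_penalty * gap
            if best is None or score > best[0]:
                best = (score, start, gap)
    return best

# Toy half-site models preferring "AC" and "GT" (hypothetical scores).
pwm_a = [{"A": 1, "C": 0, "G": 0, "T": 0}, {"A": 0, "C": 1, "G": 0, "T": 0}]
pwm_b = [{"A": 0, "C": 0, "G": 1, "T": 0}, {"A": 0, "C": 0, "G": 0, "T": 1}]
hit = best_two_block_hit("TTACTTTGTAA", pwm_a, pwm_b, min_gap=0, max_gap=4)
```

Setting `gap_penalty` to zero treats all gap lengths in the allowed range equally, while a positive value favors compact arrangements of the two half sites.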
We are indebted to Prof. Daniel Schwartz and two anonymous reviewers for their comments, which greatly improved LASAGNA-Search and this paper. This work was supported in part by the National Science Foundation grant numbers CCF-0755373 and OCI-1156837.
The authors declare no competing interests.
Address correspondence to Chun-Hsi Huang, Department of Computer Science and Engineering, 371 Fairfield Way, Unit 4155, University of Connecticut, Storrs, CT, USA. Email: [email protected]
1.) Lee, J.M., E.V. Ivanova, I.S. Seong, T. Cashorali, I. Kohane, J.F. Gusella, and M.E. MacDonald. 2007. Unbiased Gene Expression Analysis Implicates the huntingtin Polyglutamine Tract in Extra-mitochondrial Energy Metabolism. PLoS Genet. 3:e135.
2.) Bourdeau, V., J. Deschênes, D. Laperrière, M. Aid, J.H. White, and S. Mader. 2008. Mechanisms of primary and secondary estrogen target gene regulation in breast cancer cells. Nucleic Acids Res. 36:76-93.
3.) Fiore, R., S. Khudayberdiev, M. Christensen, G. Siegel, S.W. Flavell, T.K. Kim, M.E. Greenberg, and G. Schratt. 2009. Mef2-mediated transcription of the miR379-410 cluster regulates activity-dependent dendritogenesis by fine-tuning Pumilio2 protein levels. EMBO J. 28:697-710.
4.) Johnson, K.J., A.K. Robbins, Y. Wang, S.M. McCahan, J.K. Chacko, and J.S. Barthold. 2010. Insulin-Like 3 Exposure of the Fetal Rat Gubernaculum Modulates Expression of Genes Involved in Neural Pathways. Biol. Reprod. 83:774-782.
5.) Yamauchi, K. 1991. The sequence flanking translation initiation site in protozoa. Nucleic Acids Res. 19:2715-2720.
6.) Staden, R. 1984. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 12:505-519.
7.) Osada, R., E. Zaslavsky, and M. Singh. 2004. Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics 20:3516-3525.
8.) Tomovic, A., and E.J. Oakeley. 2007. Position dependencies in transcription factor binding sites. Bioinformatics 23:933-941.
9.) Salama, R.A., and D.J. Stekel. 2010. Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction. Nucleic Acids Res. 38:e135.
10.) Riva, A. 2012. The MAPPER2 Database: a multi-genome catalog of putative transcription factor binding sites. Nucleic Acids Res. 40:D155-D161.
11.) Pairó, E., J. Maynou, S. Marco, and A. Perera. 2012. A subspace method for the detection of transcription factor binding sites. Bioinformatics 28:1328-1335.
12.) Fazius, E., V. Shelest, and E. Shelest. 2011. SiTaR: a novel tool for transcription factor binding site prediction. Bioinformatics 27:2806-2811.
13.) Bryne, J.C., E. Valen, M.H.E. Tang, T. Marstrand, O. Winther, I. da Piedade, A. Krogh, B. Lenhard, and A. Sandelin. 2008. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 36:D102-D106.
14.) Matys, V., O.V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev. 2006. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34:D108-D110.
15.) Newburger, D.E., and M.L. Bulyk. 2009. UniPROBE: an online database of protein binding microarray data on protein–DNA interactions. Nucleic Acids Res. 37:D77-D82.
16.) Kel, A.E., E. Gößling, I. Reuter, E. Cheremushkin, O. Kel-Margoulis, and E. Wingender. 2003. MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31:3576-3579.
17.) Chekmenev, D.S., C. Haid, and A.E. Kel. 2005. P-Match: transcription factor binding site search by combining patterns and weight matrices. Nucleic Acids Res. 33:W432-W437.
18.) Turatsinze, J.V.V., M. Thomas-Chollier, M. Defrance, and J. van Helden. 2008. Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat. Protoc. 3:1578-1588.
19.) Zambelli, F., G. Pesole, and G. Pavesi. 2009. Pscan: finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes. Nucleic Acids Res. 37:W247-W252.
20.) Grant, C.E., T.L. Bailey, and W.S. Noble. 2011. FIMO: Scanning for occurrences of a given motif. Bioinformatics 27:1017-1018.
21.) Fordyce, P.M., D. Pincus, P. Kimmig, C.S. Nelson, H. El-Samad, P. Walter, and J.L. DeRisi. 2012. Basic leucine zipper transcription factor Hac1 binds DNA in two distinct modes as revealed by microfluidic analyses. Proc. Natl. Acad. Sci. USA 109:E3084-E3093.
22.) Griffith, O.L., S.B. Montgomery, B. Bernier, B. Chu, K. Kasaian, S. Aerts, S. Mahony, M.C. Sleumer. 2008. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res. 36:D107-D113.
23.) Portales-Casamar, E., D. Arenillas, J. Lim, M.I. Swanson, S. Jiang, A. McCallum, S. Kirov, and W.W. Wasserman. 2009. The PAZAR database of gene regulatory information coupled to the ORCA toolkit for the study of regulatory sequences. Nucleic Acids Res. 37:D54-D60.
24.) Bailey, T.L., and C. Elkan. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers:28-36.
25.) Bailey, T.L., M. Bodén, F.A. Buske, M. Frith, C.E. Grant, L. Clementi, J. Ren, W.W. Li, and W.S. Noble. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37:W202-W208.
26.) Tokovenko, B., R. Golda, O. Protas, M. Obolenskaya, and A. El'skaya. 2009. COTRASIF: conservation-aided transcription-factor-binding site finder. Nucleic Acids Res. 37:e49.
27.) Struckmann, S., D. Esch, H. Schöler, and G. Fuellen. 2011. Visualization and exploration of conserved regulatory modules using ReXSpecies 2. BMC Evol. Biol. 11:267.
28.) Gonye, G.E., P. Chakravarthula, J.S. Schwaber, and R. Vadigepalli. 2007. From Promoter Analysis to Transcriptional Regulatory Network Prediction Using PAINT. Methods Mol. Biol. 408:49-68.
29.) Dreszer, T.R., D. Karolchik, A.S. Zweig, A.S. Hinrichs, B.J. Raney, R.M. Kuhn, L.R. Meyer, M. Wong. 2012. The UCSC Genome Browser database: extensions and updates 2011. Nucleic Acids Res. 40:D918-D923.
30.) Larkin, M.A., G. Blackshields, N. Brown, R. Chenna, P. McGettigan, H. McWilliam, F. Valentin, I. Wallace. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23:2947-2948.
31.) Marinescu, V.D., I.S. Kohane, and A. Riva. 2005. The MAPPER database: a multi-genome catalog of putative transcription factor binding sites. Nucleic Acids Res. 33:D91-D97.
32.) Oh, Y.M., J.K. Kim, S. Choi, and J.Y. Yoo. 2012. Identification of co-occurring transcription factor binding sites from DNA sequence using clustered position weight matrices. Nucleic Acids Res. 40:e38.
33.) Lopes, C.T., M. Franz, F. Kazi, S.L. Donaldson, Q. Morris, and G.D. Bader. 2010. Cytoscape Web: an interactive web-based network browser. Bioinformatics 26:2347-2348.
34.) Zhao, Y., and G.D. Stormo. 2011. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29:480-483.
35.) Berger, M.F., and M.L. Bulyk. 2009. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat. Protoc. 4:393-411.
36.) Crooks, G.E., G. Hon, J.M. Chandonia, and S.E. Brenner. 2004. WebLogo: A Sequence Logo Generator. Genome Res. 14:1188-1190.
37.) Ram, O., A. Goren, I. Amit, N. Shoresh, N. Yosef, J. Ernst, M. Kellis, M. Gymrek. 2011. Combinatorial patterning of chromatin regulators uncovered by genome-wide location analysis in human cells. Cell 147:1628-1639.
38.) Davydov, E.V., D.L. Goode, M. Sirota, G.M. Cooper, A. Sidow, and S. Batzoglou. 2010. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLOS Comput. Biol. 6:e1001025.
39.) Ernst, J., H.L. Plasterer, I. Simon, and Z. Bar-Joseph. 2010. Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res. 20:526-536.
40.) The ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489:57-74.
41.) Eddy, S.R. 2008. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLOS Comput. Biol. 4:e1000069.
42.) Rosenbloom, K.R., T.R. Dreszer, M. Pheasant, G.P. Barber, L.R. Meyer, A. Pohl, B.J. Raney, T. Wang. 2010. ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Res. 38:D620-D625.
43.) Turpin, A., and F. Scholer. 2006. User performance versus precision measures for simple search tasks:11-18.
44.) Bi, C., J.S. Leeder, and C.A. Vyhlidal. 2008. A Comparative Study on Computational Two-Block Motif Detection: Algorithms and Applications. Mol. Pharm. 5:3-16.
45.) Johnson, D.S., A. Mortazavi, R.M. Myers, and B. Wold. 2007. Genome-wide mapping of in vivo protein-DNA interactions. Science 316:1497-1502.
46.) Liu, X., D.L. Brutlag, and J.S. Liu. 2001. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes:127-138.
47.) van Helden, J., F.R. Alma, and J. Collado-Vides. 2000. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 28:1808-1818.
48.) Jensen, S.T., and J.S. Liu. 2004. BioOptimizer: a Bayesian scoring function approach to motif discovery. Bioinformatics 20:1557-1564.
49.) Li, L. 2009. GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery. J. Comput. Biol. 16:317-329.