The release of ChIP-seq data from the ENCyclopedia Of DNA Elements (ENCODE) and Model Organism ENCyclopedia Of DNA Elements (modENCODE) projects has significantly increased the amount of transcription factor (TF) binding affinity information available to researchers. However, scientists still routinely use TF binding site (TFBS) search tools to scan unannotated sequences for TFBSs, particularly when searching for lesser-known TFs or TFs in organisms for which ChIP-seq data are unavailable. The sequence analysis often involves multiple steps such as TF model collection, promoter sequence retrieval, and visualization; thus, several different tools are required. We have developed a novel integrated web tool named LASAGNA-Search that allows users to perform TFBS searches without leaving the web site. LASAGNA-Search uses the LASAGNA (Length-Aware Site Alignment Guided by Nucleotide Association) algorithm for TFBS alignment. Important features of LASAGNA-Search include (i) acceptance of unaligned variable-length TFBSs, (ii) a collection of 1726 TF models, (iii) automatic promoter sequence retrieval, (iv) visualization in the UCSC Genome Browser, and (v) gene regulatory network inference and visualization based on binding specificities. LASAGNA-Search is freely available at http://biogrid.engr.uconn.edu/lasagna_search/.
Transcription factors (TFs) regulate their target genes by physically binding to the gene regulatory regions. TF binding to DNA is sequence specific; thus, binding sites for a specific TF share common sequence patterns called motifs. Given a set of known binding sites for a particular TF, these motifs can be used to search unannotated promoters to identify putative transcription factor binding sites (TFBSs), providing an inexpensive alternative to experimental determination of TFBSs in a wet lab (1-4).
Various methods have been proposed for motif modeling. A consensus model summarizes binding sites by the consensus sequence (5), while a position-specific weight matrix (PWM) model summarizes TFBSs by a scoring matrix (6). Some extensions of these simple methods do not rely on the assumption of position independence and instead score nucleotide pairs (7, 8). Other methods model position dependence by first-order Markov chains (9), profile hidden Markov models (10), or principal components (11). SiTaR (12), on the other hand, does not summarize TFBSs but instead uses the input motifs to identify TFBSs in the query data set.
While transcription factor (TF) binding affinity data generated by high-throughput techniques such as ChIP-seq have become increasingly available, a need remains for computational tools, especially for less-studied species and transcription factors. To facilitate and accelerate analyses, we developed LASAGNA-Search, an integrated user-friendly web tool for TF binding site (TFBS) search and visualization that combines existing and novel features. Important features include acceptance of unaligned variable-length binding sites, a collection of 1726 models, automatic promoter sequence retrieval, visualization in the UCSC Genome Browser, and gene regulatory network inference and visualization based on binding specificities.
Despite the more sophisticated models proposed over the past decades, the PWM method remains a simple and widely used TFBS search approach. It represents a motif by a 4 × l matrix, where l is the length of the binding sites and each row corresponds to one of the four nucleotide bases. Column i of the matrix holds the scores for matching the ith letter of a length-l sequence (an l-mer) to nucleotides A, C, G, and T, respectively. To score an l-mer, the PWM method sums the scores of the l individual letters. Databases such as JASPAR (13), TRANSFAC (14), and UniPROBE (15) store matrices of TFs that can be easily used by web tools implementing the PWM method or its variants (16-20). A stored matrix usually consists of counts or probabilities and can be easily converted to be compatible with different scoring schemes.
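The PWM scoring described above can be illustrated with a minimal sketch. The count matrix, the pseudocount of 1, and the uniform background of 0.25 are assumptions chosen for illustration only; they are not taken from any of the cited databases, and log-odds is just one of several possible scoring schemes.

```python
import math

BASES = "ACGT"  # row order of the 4 x l matrix


def counts_to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert a 4 x l count matrix (rows A, C, G, T) to log2-odds scores."""
    length = len(counts[0])
    scores = [[0.0] * length for _ in BASES]
    for i in range(length):
        # Column total after adding a pseudocount to each of the four bases
        column_total = sum(counts[b][i] for b in range(4)) + 4 * pseudocount
        for b in range(4):
            prob = (counts[b][i] + pseudocount) / column_total
            scores[b][i] = math.log2(prob / background)
    return scores


def score_lmer(scores, lmer):
    """Score an l-mer by summing the scores of its l individual letters."""
    if len(lmer) != len(scores[0]):
        raise ValueError("l-mer length must equal the matrix width")
    return sum(scores[BASES.index(base)][i] for i, base in enumerate(lmer))


# Hypothetical count matrix for a length-4 motif built from 8 sites
counts = [
    [8, 0, 0, 1],  # A
    [0, 0, 8, 1],  # C
    [0, 8, 0, 1],  # G
    [0, 0, 0, 5],  # T
]
pwm = counts_to_log_odds(counts)
print(score_lmer(pwm, "AGCT"))  # matches the most frequent base at every position
print(score_lmer(pwm, "TTTT"))  # matches only at the last position
```

Because the score is a plain sum over columns, each position contributes independently of its neighbors, which is the position-independence assumption mentioned earlier.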
A PWM is usually built from experimentally validated binding sites, that is, DNA segments that can be physically bound by a TF but may or may not be functional. These binding sites often vary in length and are not aligned. In cases where a transcription factor does not have available PWMs, researchers must resort to studying its binding sites. About 38.1% of the TFs we found in the TRANSFAC Public database do not have PWMs available, so their binding sites have to be aligned. Since TRANSFAC may build more than one PWM for a TF, a lack of matrices for TFs that recognize more than one motif is unlikely to occur (21). Open-annotation databases such as ORegAnno (22) and PAZAR (23) contain valuable user-curated TFBSs. To utilize binding sites in the ORegAnno database, one has to align them before building PWMs. The PAZAR database, on the other hand, dynamically creates PWMs for users with MEME (24), which makes it another important resource. Although PWMs represent motifs in a compact form, information about position dependence is lost when converting TFBS alignments to PWMs. Many studies (7-9, 11) have shown that exploiting position dependence significantly improves the search performance of a method. For this reason, TFBS alignments may be preferred over PWMs.
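The loss of position dependence can be made concrete with a small sketch. The four aligned sites below are hypothetical and constructed so that the first two positions co-vary; the pair-frequency table is only a simple stand-in for the pair-scoring methods cited above, not the LASAGNA algorithm itself.

```python
from collections import Counter

# Hypothetical aligned sites: positions 0 and 1 co-vary (only AT or CG occur)
sites = ["ATG", "ATG", "CGG", "CGG"]


def column_frequencies(aligned_sites):
    """Per-position nucleotide frequencies, as a probability PWM would store them."""
    n = len(aligned_sites)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*aligned_sites)
    ]


def pair_frequencies(aligned_sites, i, j):
    """Joint frequencies of the nucleotides observed at positions i and j."""
    n = len(aligned_sites)
    pairs = Counter((site[i], site[j]) for site in aligned_sites)
    return {pair: count / n for pair, count in pairs.items()}


def independent_probability(freqs, seq):
    """Probability of seq under the position-independence assumption."""
    prob = 1.0
    for position, base in enumerate(seq):
        prob *= freqs[position].get(base, 0.0)
    return prob


freqs = column_frequencies(sites)
pairs = pair_frequencies(sites, 0, 1)

# The PWM view cannot tell an observed site from a never-observed one
print(independent_probability(freqs, "ATG"))  # observed site
print(independent_probability(freqs, "AGG"))  # never observed, same value

# The pair view keeps the dependence that the alignment contains
print(pairs.get(("A", "T"), 0.0), pairs.get(("A", "G"), 0.0))
```

In this toy alignment the per-column frequencies assign the same probability to ATG and AGG, whereas the joint frequencies of positions 0 and 1 separate them. This is the kind of information that is available in a TFBS alignment but not in the PWM derived from it.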
A typical TFBS search web tool takes a PWM and a promoter sequence as inputs and returns putative binding sites. Many web tools include useful features in addition to the basic search function. Some accept variable-length binding sites (9, 12, 25), offer precomputed models built from PWMs or TFBSs (10, 13, 17, 26-28), adopt a TFBS search method that exploits position dependence (9, 10), offer promoter sequence retrieval, or integrate a sequence retrieval tool (10, 18, 25-27). The MAPPER2 database (10) supports visualization of hits in the UCSC Genome Browser (29) for three organisms. Another useful function is visual representation of predicted binding specificities as a gene regulatory network (GRN) (28). Until now, no single web tool has incorporated all the aforementioned features.
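The basic search function of such a tool can be sketched as a sliding-window scan of both strands. The score matrix, the promoter string, and the cutoff below are hypothetical values chosen for illustration, and the function names are not taken from any of the cited tools.

```python
BASES = "ACGT"  # row order of the score matrix
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def score_window(matrix, window):
    """Sum the per-position scores of one window (PWM scoring)."""
    return sum(matrix[BASES.index(base)][i] for i, base in enumerate(window))


def scan(matrix, promoter, threshold):
    """Return (start, strand, site, score) for every window scoring >= threshold.

    start is the 0-based offset of the window on the forward strand; for
    minus-strand hits, site is the reverse complement of that window.
    """
    width = len(matrix[0])
    hits = []
    for start in range(len(promoter) - width + 1):
        window = promoter[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous letters such as N
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = score_window(matrix, site)
            if score >= threshold:
                hits.append((start, strand, site, score))
    return hits


# Hypothetical score matrix for a length-3 motif whose best match is AGC
matrix = [
    [2, -1, -1],   # A
    [-1, -1, 2],   # C
    [-1, 2, -1],   # G
    [-1, -1, -1],  # T
]
promoter = "TTAGCTTGCTAA"  # hypothetical promoter fragment
for hit in scan(matrix, promoter, threshold=6):
    print(hit)
```

The cutoff is the main design choice in such a scan: a lower threshold returns more putative sites at the cost of more false positives, so practical tools usually derive it from a score distribution or a p-value rather than fixing it by hand.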
We created a web tool for TFBS search and visualization that we call LASAGNA-Search. LASAGNA-Search accepts variable-length TFBSs in addition to PWMs. It offers 1726 precomputed models based on TFBSs and PWMs collected from the TRANSFAC Public, JASPAR, ORegAnno and UniPROBE databases. Its search module exploits position dependence for a TFBS-based model whenever performance gain is indicated by cross-validation. Automatic promoter sequence retrieval is supported for seven organisms at LASAGNA-Search, which enables visualization of search results in the UCSC Genome Browser. Search results can also be visualized along promoter sequences locally at LASAGNA-Search for any organism. Finally, a GRN can be constructed from search results and visualized locally with various options.
Materials and methods
Figure 1 shows the architecture of LASAGNA-Search. We introduce the major components in the following sections.