Modules
Alignment module
The alignment module aligns variable-length TFBSs to build a TF model. This module has been extensively compared to ClustalW2 (30) and MEME (24) with favorable outcomes (see Supplementary Figure S2). These two methods were chosen because they are widely used representatives of two different types of TFBS identification methods. ClustalW2, which is based on pairwise sequence similarity, was used for TFBS alignment in several studies (9, 11, 31). MEME is a commonly used de novo motif discovery tool (24). We describe the key ideas behind the algorithm and include the technical details in the Supplementary Methods.
A binding site includes a core region – a short stretch of DNA to which the TF actually binds – flanked by a few bases on each side. The core region may not be determined accurately if the resolution of the binding site calling technique is not sufficient. When progressively aligning binding sites, the order in which the sites are aligned is important. We observed that aligning binding sites from the shortest to the longest generally yields better alignments. Shorter binding sites tend to contain fewer non-informative bases flanking the core region. Therefore, we use TFBS length to guide the alignment process. Extensive experiments have validated the efficacy of this algorithm.
Search module
The search module takes a TF model and a promoter sequence as inputs. The TF model is a PWM variant, which scores sequences of length l. Depending on the TF, scores of nucleotide pairs may contribute to the score of a sequence, controlled by the parameter K ≥ 0, the maximal distance between the two nucleotides of a pair. The value of K is TF-dependent and is determined by cross-validation. Hence, K is greater than zero only if nucleotide pairs improve the search performance for a TF. We refer readers to the Supplementary Methods for additional technical details.
It is commonly assumed that the first letter of an l-mer sequence is aligned with the first position of a TF model binding site and the l-mer is scored accordingly. Unlike more conventional approaches, we align an l-mer with a TF model by sliding the l-mer and its reverse complement through the model such that the overlap between the two is at least one nucleotide. Using the framework described in the section Evaluation of Precomputed TF Models, we found that this is significantly better than the conventional approach for locating TFBSs (see Supplementary Figure S3). Moreover, this approach allows easy scoring of an l-mer by a cluster of TF models of different widths. Scoring with a cluster of TF models has been shown to outperform using only the best model in the cluster (32) and hence is a feature to be added to LASAGNA-Search in the near future.
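The sliding alignment described above can be illustrated with a minimal sketch. It is not the LASAGNA-Search implementation: the model here is a plain position-specific score matrix (the nucleotide-pair terms governed by K are omitted), positions of the l-mer that fall outside the model are assumed to contribute zero, and taking the maximum over all placements is also an assumption. The function names `overlap_score` and `best_sliding_score` are hypothetical.

```python
from typing import Dict, List, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def overlap_score(model: List[Dict[str, float]], lmer: str, offset: int) -> float:
    """Score an l-mer placed so that lmer[i] sits on model column offset + i.

    Assumption: bases that fall outside the model contribute nothing.
    """
    total = 0.0
    for i, base in enumerate(lmer):
        col = offset + i
        if 0 <= col < len(model):
            total += model[col][base]
    return total


def best_sliding_score(
    model: List[Dict[str, float]], lmer: str
) -> Tuple[float, int, str]:
    """Slide the l-mer and its reverse complement through the model.

    Offsets run from -(l - 1) to w - 1, so every placement overlaps the
    model by at least one nucleotide (l + w - 1 placements per strand).
    Returns (score, offset, strand) of the highest-scoring placement.
    """
    w, l = len(model), len(lmer)
    best = None
    for strand, seq in (("+", lmer), ("-", reverse_complement(lmer))):
        for offset in range(-(l - 1), w):
            score = overlap_score(model, seq, offset)
            if best is None or score > best[0]:
                best = (score, offset, strand)
    return best


# Toy model of width 3 whose columns favor A, C and G, respectively.
toy_model = [
    {b: (1.0 if b == favored else -1.0) for b in "ACGT"} for favored in "ACG"
]
print(best_sliding_score(toy_model, "ACG"))
print(best_sliding_score(toy_model, "CGT"))
```

Because the offset is allowed to be negative or to run past the end of the model, the same routine can score an l-mer against models of different widths, which is the property the cluster-based scoring mentioned above relies on.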
For each putative binding site hit, the search module computes the score and a p-value, the probability of observing an equal or higher score by chance. We describe the p-value computation in the Supplementary Methods. While p-values are not corrected for multiple testing, they are useful for ordering hits found by different TF models. To take into account the length of the promoter sequence in which a hit is found, an E-value is computed for the hit. The E-value gives the expected number of times a hit of the same or higher score is found in the promoter sequence by chance. If L is the length of the promoter sequence and l is the length of the putative binding site, then E-value = p-value × (L - l + 1), which is approximately p-value × L when L >> l.
Promoter retrieval module
Currently, LASAGNA-Search supports retrieving promoter sequences for seven species: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Saccharomyces cerevisiae, Caenorhabditis elegans, and Caenorhabditis briggsae. Users may enter the NCBI Gene identifier (ID), the official gene symbol or an mRNA accession number of a gene to retrieve its upstream promoter region. The upstream region of a gene is specified by positions relative to the transcription start site (TSS) obtained from the UCSC Genome Browser (29). Information in the NCBI Gene database is used for conversion between Gene IDs and symbols.
GRN inference
LASAGNA-Search automatically constructs a network based on search results. A directed edge from a TF model to a gene is established if the TF model finds at least one significant hit in the promoter region of the gene. The lowest p-value among these hits is used to compute the weight of this edge. That is, the thickness of the edge is proportional to -log(p-value). In cases where the coding genes of a TF model are known, these genes may be added to the network with dotted arrows from the genes to the TF model. To simplify the network, the node for a TF model may be removed, leaving only its coding genes in the network. Figure 2 shows an example network of human genes TP53 and MYB. Visualization of GRNs at LASAGNA-Search is enabled by Cytoscape Web (33). We describe how the networks in Figure 2 were generated in the section titled User Interface.
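The edge construction rule can be sketched as follows. This is an illustration under stated assumptions rather than the tool's code: the significance cutoff `alpha`, the use of base-10 logarithms, the function name `build_edges` and the identifiers in the example are all hypothetical, since the text specifies only that the weight is derived from the lowest p-value and that thickness is proportional to -log(p-value).

```python
import math
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (TF model, target gene, p-value)


def build_edges(
    hits: Iterable[Hit], alpha: float = 0.001
) -> Dict[Tuple[str, str], float]:
    """Build weighted directed edges (TF model -> gene) from search hits.

    An edge exists if at least one hit has p-value <= alpha (hypothetical
    cutoff). Its weight is -log10 of the lowest p-value among those hits
    (log base assumed). p-values must be strictly positive.
    """
    lowest: Dict[Tuple[str, str], float] = {}
    for tf_model, gene, p_value in hits:
        if p_value <= alpha:
            key = (tf_model, gene)
            lowest[key] = min(p_value, lowest.get(key, 1.0))
    return {key: -math.log10(p) for key, p in lowest.items()}


# Hypothetical hits: two significant hits for one TF model/gene pair and
# one non-significant hit that produces no edge.
example_hits = [
    ("TF_model_1", "gene_A", 1e-4),
    ("TF_model_1", "gene_A", 1e-6),
    ("TF_model_2", "gene_A", 0.5),
]
edges = build_edges(example_hits)
for (tf_model, gene), weight in edges.items():
    print(f"{tf_model} -> {gene}: weight {weight:.1f}")
```

Keeping only the lowest p-value per TF model and gene pair means that several weak hits in the same promoter do not add up to a thick edge; only the single strongest hit determines the thickness.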