TF model collections
LASAGNA-Search currently offers five precomputed TF model collections. The collections are categorized by the type of data used to build a model. Table 1 lists the type and number of models for each collection. To facilitate GRN visualization, we mapped TF models to genes coding for the TFs. The number of models that can be mapped for each collection is also listed in Table 1. Models in the TFBS-based collections were built from unaligned TFBSs, while models in the PWM-based collections were built from PWMs. We describe these two categories in the following sections.
TFBS-based collections
We collected experimentally validated transcription factor binding sites (TFBSs) from the TRANSFAC Public database and the ORegAnno database. In these two collections, binding sites of a TF were not pooled across organisms. TF models are non-redundant in the sense that a TF of a given species has only one model, based on all the binding sites available for it in a database. The binding sites of a TF were aligned to build a model. We built one model for each TF because, for most TFs, the binding affinity can be explained by a single model (34). When a TF recognizes more than one motif (21), we rely on database curators to distinguish binding sites belonging to distinct motifs. Moreover, the TFBS-based collections are complemented by our PWM-based collections, which offer more than one model for some TFs.
Binding sites for five organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster and Saccharomyces cerevisiae) were collected from the TRANSFAC Public database (release 7.0) (14). For each organism, a TF was included in our collection if it had at least 10 binding sites. In total, binding sites for 189 TFs across the five species were collected. Although TRANSFAC builds PWMs, 72 (38.1%) of these TFs did not have PWMs in TRANSFAC.
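The inclusion rule above can be sketched as a simple grouping step. This is only an illustration, not the pipeline actually used: the record layout (organism, TF name, site sequence) and the function names are assumptions made for the example.

```python
from collections import defaultdict

MIN_SITES = 10  # inclusion threshold stated in the text


def group_sites(records):
    """Group site sequences by (organism, TF).

    `records` is a hypothetical iterable of (organism, tf_name, site) tuples.
    Keying on the organism keeps sites from being pooled across species.
    """
    groups = defaultdict(list)
    for organism, tf_name, site in records:
        groups[(organism, tf_name)].append(site)
    return groups


def eligible_tfs(records, min_sites=MIN_SITES):
    """Keep only the (organism, TF) pairs with at least `min_sites` sites."""
    return {
        key: sites
        for key, sites in group_sites(records).items()
        if len(sites) >= min_sites
    }
```

Because the key is the (organism, TF) pair, a TF with 10 sites in one species and 9 in another is kept only for the first.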
In addition to the five organisms present in the TRANSFAC collection, binding sites of Caenorhabditis elegans and Caenorhabditis briggsae were collected from the ORegAnno database (08Nov10) (22). As an open-annotation database, ORegAnno allows users to adopt the role of curators and contribute binding sites and other types of annotations to the database, including the NCBI or Ensembl ID for each gene or transcription factor. This feature allows easy mapping of distinct mentions of the same TF to a unique database ID, so that binding sites of the same TF contributed by different users can be merged. Nevertheless, many TF mentions in ORegAnno do not have a database ID. In such cases, we automatically assigned an NCBI Gene ID to the TF mention by consulting the NCBI Gene database. This was not always possible, because a TF mention may be the symbol of one gene and a synonym of another, which prevents unique mapping. In these cases, the ambiguity was resolved manually. In addition, some TF mentions are protein complexes that cannot be identified by a single gene ID. These mentions were semi-automatically collapsed. In total, binding sites of 133 TFs across seven organisms were collected, with each TF having at least 10 TFBSs.
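The symbol-versus-synonym ambiguity described above can be illustrated with a minimal sketch. It assumes a small in-memory gene table for a single organism; the field names (gene_id, symbol, synonyms), the status labels and the function name are hypothetical, and no real NCBI interface is used.

```python
def resolve_mention(mention, genes):
    """Try to map a TF mention to a single gene ID.

    `genes` is a hypothetical list of dicts with the keys 'gene_id',
    'symbol' and 'synonyms', all for one organism.
    Returns (gene_id, status), where status is 'mapped', 'unmapped'
    or 'ambiguous' (to be flagged for manual resolution).
    """
    name = mention.lower()
    symbol_hits = {g["gene_id"] for g in genes if g["symbol"].lower() == name}
    synonym_hits = {
        g["gene_id"]
        for g in genes
        if name in (s.lower() for s in g["synonyms"])
    }
    candidates = symbol_hits | synonym_hits
    if len(candidates) == 1:
        return next(iter(candidates)), "mapped"
    if not candidates:
        return None, "unmapped"
    # e.g. the mention is the symbol of one gene and a synonym of another
    return None, "ambiguous"
```

Treating symbol and synonym matches equally is a deliberate choice in this sketch: a mention that is the symbol of one gene and a synonym of another is reported as ambiguous rather than silently assigned to the symbol match.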
As seen in Table 1, nearly all of the TF models in the two collections were mapped to TF coding genes. Only one model in each collection remained unmapped due to lack of information in the source databases: ETF (T00270) in TRANSFAC and MYF in ORegAnno.
PWM-based collections
In addition to binding sites, we also collected position-specific weight matrices (PWMs) from the TRANSFAC Public database, the JASPAR CORE database (13) and the UniPROBE database (15). A PWM is a 4 × l matrix, where l is the length of the binding sites. Each element in column i of a PWM usually contains the count or probability of a nucleotide at position i. PWMs are valuable resources for a number of reasons. Most PWMs in TRANSFAC and JASPAR were built by domain experts. For instance, some PWMs in TRANSFAC and JASPAR CORE were based on binding sites in multiple organisms due to cross-species conservation (e.g., TRANSFAC matrix M00152). Moreover, a PWM in TRANSFAC may be based on binding sites of two or more TFs having similar binding specificities (e.g., TRANSFAC matrix M00158). Another reason that PWMs are particularly valuable is that some techniques produce only matrix data. The UniPROBE database, for example, stores data from protein binding microarray (PBM) experiments (35). The PBM technique assigns a binding specificity score to each 10-mer sequence variant. Berger and Bulyk (35), however, do not suggest setting a specificity cut-off threshold to report binding sites. Instead, PWMs are produced by the Seed-and-Wobble algorithm.
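To make the 4 × l definition concrete, the following sketch builds a count matrix and a probability matrix from a handful of already aligned sites of equal length. It is only an illustration of the data structure: the function names and the A, C, G, T row order are assumptions, and it is neither the Seed-and-Wobble algorithm nor the model-building procedure used by LASAGNA-Search.

```python
NUCLEOTIDES = "ACGT"  # assumed row order of the matrix


def count_matrix(aligned_sites):
    """Return a 4 x l matrix of nucleotide counts (rows follow NUCLEOTIDES)."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    counts = [[0] * length for _ in NUCLEOTIDES]
    for site in aligned_sites:
        for i, base in enumerate(site.upper()):
            # str.index raises ValueError for characters outside A, C, G, T
            counts[NUCLEOTIDES.index(base)][i] += 1
    return counts


def probability_matrix(counts):
    """Convert counts to per-position probabilities; each column sums to 1."""
    length = len(counts[0])
    totals = [sum(row[i] for row in counts) for i in range(length)]
    return [[row[i] / totals[i] for i in range(length)] for row in counts]
```

For the toy sites ACGT, ACGA, TCGT and ACCT, column 1 of the count matrix holds 3 for A and 1 for T, so the corresponding probabilities are 0.75 and 0.25.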
From the UniPROBE database, we collected 530 PWMs from six species: Homo sapiens, Mus musculus, Saccharomyces cerevisiae, Caenorhabditis elegans, Plasmodium falciparum and Cryptosporidium parvum. These 530 PWMs correspond to 414 non-redundant TFs (proteins or protein complexes). We collected 476 PWMs from the JASPAR CORE database, where the PWMs were categorized into six species groups: vertebrates, insects, plants, fungi, nematodes and urochordates. Finally, 398 PWMs were collected from the TRANSFAC Public database and grouped into the following categories: vertebrates, insects, plants, fungi, nematodes and bacteria.
According to Table 1, the PWM-based collections contain more unmapped TF models than the TFBS-based collections because some source databases lack information. Matrices such as MA0102.1 and MA0061.1 in the JASPAR CORE database were built from TFBSs of more than one organism, but accession numbers for the homologous proteins are not available. Some matrices in the TRANSFAC and JASPAR CORE databases have protein accession numbers, but records of the corresponding coding genes cannot be found in the NCBI Gene database. These proteins often belong to species such as Pisum sativum and Triticum aestivum, which are not as well-studied as model organisms.
Results and discussion
Input page
The LASAGNA-Search input page is divided into three parts: TF model input, promoter sequence input, and result filtering. Figure 3A shows a screenshot of the input page. Two options are available for result filtering. One is to set a p-value threshold so that only hits with equal or lower p-values will be reported. The other is to set k so that only the k hits with the highest scores will be reported.
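The two filtering options can be sketched as follows. The hit representation (a dict with score and p_value keys) and the function names are hypothetical and serve only to illustrate the two criteria; they do not reflect the tool's internal data format.

```python
import heapq


def filter_by_pvalue(hits, threshold):
    """Keep hits whose p-value is equal to or lower than the threshold."""
    return [hit for hit in hits if hit["p_value"] <= threshold]


def top_k_by_score(hits, k):
    """Keep the k hits with the highest scores, best first."""
    return heapq.nlargest(k, hits, key=lambda hit: hit["score"])
```

The first option bounds the statistical significance of every reported hit but not the number of hits, whereas the second bounds the number of hits regardless of their p-values.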