While genes can have multiple promoters in principle, their usage will likely vary depending on cellular conditions. Time-course CAGE experiments can be used to study the dynamic usage of promoters by comparing the distribution of expression levels over transcription start sites in the upstream region of a gene to discover which promoters are switched on and off between time points during the experiment. Similarly, we may find tissue-specific promoter usage, or promoters that are only used in specific cell lines.
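As an illustration of the comparison described above, the following minimal sketch converts per-promoter CAGE tag counts into usage fractions at each time point and reports how much each promoter's share changes between two time points. The function names, the promoter labels, and the tag counts are hypothetical and chosen only for illustration; a real analysis would also need normalization for sequencing depth and a statistical test.

```python
def promoter_usage(tag_counts):
    """Convert per-promoter tag counts into per-time-point usage fractions.

    tag_counts maps a promoter name to a list of tag counts, one per time point.
    """
    n_points = len(next(iter(tag_counts.values())))
    totals = [sum(c[t] for c in tag_counts.values()) for t in range(n_points)]
    return {
        name: [c[t] / totals[t] if totals[t] else 0.0 for t in range(n_points)]
        for name, c in tag_counts.items()
    }


def usage_shift(usage, t_from, t_to):
    """Change in usage fraction of each promoter between two time points."""
    return {name: u[t_to] - u[t_from] for name, u in usage.items()}


# Hypothetical gene with two promoters measured at three time points.
counts = {"P1": [90, 50, 10], "P2": [10, 50, 90]}
usage = promoter_usage(counts)
shift = usage_shift(usage, 0, 2)  # P1 loses most of its share, P2 gains it
```

A large negative or positive shift flags a promoter that is being switched off or on over the course of the experiment.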
A major goal of the analysis of deep CAGE sequence data is to infer the regulatory network that orchestrates transcription in a cell (26). With the target of each regulatory interaction being a promoter instead of a gene, such a network is qualitatively different from current gene-based networks inferred from microarray or SAGE expression profiling. Each of the promoters may be associated with one or more coding or noncoding transcripts.
A promoter-based network can be constructed by first obtaining matrix models to describe the binding affinities of transcription factors, and then using these matrices to search for potential transcription factor binding sites in promoter regions. Both of these steps are facilitated by the availability of the promoter locations and expression levels measured by deep CAGE.
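The second step, searching a promoter region with a matrix model, can be sketched as a log-odds scan. This is a minimal illustration under stated assumptions: a uniform background of 0.25 per base, a pseudocount of 1, forward-strand scanning only, and a made-up count matrix for a TATA-like motif. It is not the scoring scheme of any particular tool.

```python
import math


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds weight matrix.

    counts is a list of dicts, one per motif position, mapping bases to counts.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, site, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(matrix[i][b] for i, b in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical count matrix for a 4-bp TATA-like motif.
tata_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
matrix = log_odds_matrix(tata_counts)
hits = scan("GGTATAGG", matrix, threshold=5.0)
```

Restricting `sequence` to the promoter region defined by CAGE, rather than a long upstream stretch, is what limits the number of windows that can exceed the threshold by chance.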
Matrix models of transcription factors can be derived either from the literature or by aligning the upstream regions of coregulated genes. Such coregulated genes can be found by clustering the genome-wide expression profiles measured in microarray experiments. However, the expression profile that a microarray measures for a gene is an aggregate of the expression profiles of the individual promoters from which its transcripts originate, which hampers clean clustering. In addition, the individual promoters of a gene are typically regulated by different sets of transcription factors. Deep CAGE expression profiling enables us to separate the expression of a gene into the contributions of its individual promoters, which can then be clustered based on their expression profiles to generate tighter clusters of co-expression. Each of these clusters of promoters is likely regulated by a smaller set of transcription factors than conventional clusters of genes. In addition, the CAGE-defined promoters identify the genome regions to be aligned to find overrepresented sequence motifs.
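The effect of aggregation can be made concrete with a small sketch. Below, two promoters of a hypothetical gene have opposite expression profiles, so their sum is nearly uninformative, whereas clustering at the promoter level groups each promoter with similarly behaving promoters of other genes. The greedy, seed-based clustering and the correlation cutoff of 0.9 are simplifying assumptions made for brevity, and all profiles are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_by_correlation(profiles, min_r=0.9):
    """Greedy clustering: join the first cluster whose seed profile correlates
    with the new profile at min_r or higher, otherwise start a new cluster."""
    clusters = []  # list of (seed_profile, member_names)
    for name, profile in profiles.items():
        for seed, members in clusters:
            if pearson(seed, profile) >= min_r:
                members.append(name)
                break
        else:
            clusters.append((profile, [name]))
    return [members for _, members in clusters]


# Hypothetical promoter-level profiles over four conditions.
profiles = {
    "geneA_P1": [10, 20, 40, 80],
    "geneA_P2": [80, 40, 20, 10],
    "geneB_P1": [5, 11, 19, 41],
    "geneC_P1": [70, 45, 15, 12],
}

# What a gene-level measurement of geneA would see: the two trends cancel.
gene_a_total = [a + b for a, b in zip(profiles["geneA_P1"], profiles["geneA_P2"])]

clusters = cluster_by_correlation(profiles)
```

In this toy example the rising promoters (`geneA_P1`, `geneB_P1`) and the falling promoters (`geneA_P2`, `geneC_P1`) form separate clusters, a distinction that is lost in `gene_a_total`.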
Similarly, we can restrict the genome region to be searched for potential transcription factor binding sites to the promoter region identified by CAGE, significantly reducing the possibility of false positives. While comparative genomics may also be used to identify the promoter region, it does not pinpoint the exact transcription start site. In addition, many biologically functional sequences do not seem to be evolutionarily constrained across all mammals (16) and therefore cannot be identified by comparative approaches. Finally, deep CAGE profiling identifies which promoters are active in a particular biological context and therefore suggests which transcription factor binding sites may be biologically relevant.
Software Needs for High-throughput Transcriptome Sequencing
The sheer size of the datasets generated by high-throughput transcriptome sequencing places new requirements on the software tools used to analyze these data. During extraction of CAGE tags from the raw reads, care must be taken to correctly distinguish the tag from linker sequences. Mapping CAGE tags to the genome is complicated by the possibility of sequence mismatches as well as by tags that map to multiple locations on the genome. Although BLAST (27) can be used to place CAGE tags on the genome, it is likely too slow for datasets of the size produced by next-generation sequencers. In addition, BLAST is based on the assumption that the sequences to be compared are evolutionarily related, which is clearly not appropriate for CAGE tag mapping. For this purpose, software specializing in transcriptome tag sequencing such as SSAHA (Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK) (28) and Nexalign (RIKEN, Yokohama, Japan) (29) may be more appropriate. The latter is exceedingly fast, for exact matches in particular, and is guaranteed to find any tag present in the genome.
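The two steps described above, separating the tag from the linker and placing it on the genome by exact matching, can be sketched as follows. This is a toy illustration and not the algorithm used by SSAHA or Nexalign: the read layout (a 5' linker immediately followed by the tag), the linker sequence, and the in-memory dictionary index are all assumptions, and mismatches are not handled.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def extract_tag(read, linker_5p, tag_length):
    """Return the tag_length bases that follow linker_5p in the read,
    or None if the linker is absent or the tag is truncated."""
    pos = read.find(linker_5p)
    if pos < 0:
        return None
    start = pos + len(linker_5p)
    tag = read[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def build_index(genome, tag_length):
    """Index every forward-strand substring of length tag_length by position."""
    index = defaultdict(list)
    for chrom, seq in genome.items():
        for i in range(len(seq) - tag_length + 1):
            index[seq[i:i + tag_length]].append((chrom, i))
    return index


def map_tag(tag, index):
    """Return all exact matches as (chrom, forward-strand start, strand)."""
    hits = [(c, p, "+") for c, p in index.get(tag, [])]
    hits += [(c, p, "-") for c, p in index.get(reverse_complement(tag), [])]
    return hits


# Hypothetical miniature genome, linker, and read.
genome = {"chr1": "ACGTTGCCAGGCTA"}
index = build_index(genome, tag_length=5)
tag = extract_tag("GGATCC" + "TTGCC" + "AAAA", linker_5p="GGATCC", tag_length=5)
plus_hits = map_tag(tag, index)
minus_hits = map_tag("GCCTG", index)  # reverse complement of CAGGC at position 7
```

A tag for which `map_tag` returns more than one hit is ambiguous and needs the kind of special treatment discussed next.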
Special care must be taken for ambiguous tags that map to multiple genome locations with equal scores. In such cases, it may be possible to decide between the mapping locations by considering the number of singly-mapped CAGE tags in each genome neighborhood as a prior probability (30). Many of the ambiguous CAGE tags map to a large number of genome locations, suggesting that they originate from repeat regions of the genome. This is consistent with a considerable fraction of the human transcriptome originating from repetitive sequences, which may play a role in the transcriptional regulation of gene expression (31).
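The idea of using nearby singly-mapped tags as a prior can be sketched as a simple weighting scheme. The code below is a minimal illustration and not the published method (30): the neighborhood size, the pseudocount, the data layout, and all positions and counts are assumptions made for this example.

```python
def rescue_weights(candidates, unique_counts, window=100, pseudocount=1.0):
    """Distribute one ambiguous tag over its candidate locations.

    candidates    -- list of (chrom, pos) locations with equal mapping scores
    unique_counts -- dict mapping (chrom, pos) to the number of singly-mapped tags
    Returns one weight per candidate; the weights sum to 1.
    """
    support = []
    for chrom, pos in candidates:
        nearby = sum(
            count
            for (c, p), count in unique_counts.items()
            if c == chrom and abs(p - pos) <= window
        )
        # The pseudocount keeps locations without nearby tags from getting zero weight.
        support.append(nearby + pseudocount)
    total = sum(support)
    return [s / total for s in support]


# Hypothetical ambiguous tag with two candidate locations.
candidates = [("chr1", 1000), ("chr5", 5000)]
unique_counts = {("chr1", 980): 30, ("chr1", 1050): 8, ("chr5", 9000): 12}
weights = rescue_weights(candidates, unique_counts)
```

Here the first location is supported by 38 singly-mapped tags within the window and the second by none, so the ambiguous tag is assigned almost entirely to the first location. For tags that map to a very large number of repeat copies, the weights become small everywhere, which reflects the remaining uncertainty.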
The size of the datasets produced by next-generation sequencers also poses a challenge to data management systems. As terabytes of image data may be generated in a single run, saving all raw data produced in one experiment may no longer be an option. With billions of bases being generated in a single run, even saving only the sequence data will be a considerable enterprise.
In addition to requiring new software to analyze deep CAGE data, the paradigm shift from gene-based to promoter-based networks of transcriptional regulation requires new ways to visualize such networks. Visualization software packages such as Cytoscape (32) have previously been developed to represent biomolecular interaction networks and can be used to draw gene regulatory networks. In a promoter-based network, the targets of regulatory interactions are the individual promoters of a gene, which may be too numerous for graphical representation except in the most detailed drawings. The situation is further compounded by the multitude of transcripts that have been identified for each gene. A multiscale visualization approach in which users can choose the level of detail at which each gene is represented may be suitable for visualizing promoter-based regulatory networks.
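One possible data model behind such a multiscale view is sketched below: regulatory edges are stored at promoter resolution and collapsed to gene resolution on demand, except for genes the user has chosen to expand. The function name, the edge representation, and the example network are hypothetical, and the sketch does not use the API of Cytoscape or any other visualization package.

```python
def collapse_edges(promoter_edges, promoter_to_gene, expanded=()):
    """Collapse factor-to-promoter edges into factor-to-gene edges.

    Promoters of genes listed in `expanded` are kept as individual targets;
    all other promoters are replaced by their gene. Duplicate edges are merged.
    """
    collapsed = set()
    for factor, promoter in promoter_edges:
        gene = promoter_to_gene[promoter]
        target = promoter if gene in expanded else gene
        collapsed.add((factor, target))
    return sorted(collapsed)


# Hypothetical promoter-level network.
promoter_to_gene = {"geneA_P1": "geneA", "geneA_P2": "geneA", "geneB_P1": "geneB"}
promoter_edges = [
    ("TF1", "geneA_P1"),
    ("TF1", "geneA_P2"),
    ("TF2", "geneA_P2"),
    ("TF2", "geneB_P1"),
]

overview = collapse_edges(promoter_edges, promoter_to_gene)
detail_for_a = collapse_edges(promoter_edges, promoter_to_gene, expanded={"geneA"})
```

The overview contains three gene-level edges, while expanding `geneA` restores its promoter-level targets and leaves `geneB` collapsed.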
This work was supported by a research grant from the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government to Y.H., a grant from the Genome Network Project, also from the Ministry of Education, Culture, Sports, Science and Technology, and a grant from the RIKEN Frontier Research System, Functional RNA Research Program.
The authors declare no competing interests.