Deep cap analysis gene expression (CAGE): genome-wide identification of promoters, quantification of their expression, and network inference
 
Michiel de Hoon and Yoshihide Hayashizaki

A gene can have multiple promoters, and the usage of each is likely to vary with cellular conditions. Time-course CAGE experiments can be used to study this dynamic promoter usage: by comparing, between time points, the distribution of expression levels over the transcription start sites in the upstream region of a gene, we can discover which promoters are switched on and off during the experiment. Similarly, we may find tissue-specific promoter usage, or promoters that are used only in specific cell lines.
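The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a published method: the function names `promoter_usage` and `usage_shift`, the promoter labels, and the tag counts are all hypothetical. Usage is expressed as the fraction of a gene's CAGE tags that start at each promoter, so that a change in overall expression of the gene is not mistaken for a promoter switch.

```python
def promoter_usage(tag_counts):
    """Convert raw CAGE tag counts per promoter into fractions of the gene total."""
    total = sum(tag_counts.values())
    if total == 0:
        return {promoter: 0.0 for promoter in tag_counts}
    return {promoter: count / total for promoter, count in tag_counts.items()}


def usage_shift(counts_t0, counts_t1):
    """Per-promoter change in usage fraction between two time points."""
    usage_t0 = promoter_usage(counts_t0)
    usage_t1 = promoter_usage(counts_t1)
    promoters = sorted(set(usage_t0) | set(usage_t1))
    return {p: usage_t1.get(p, 0.0) - usage_t0.get(p, 0.0) for p in promoters}


# Hypothetical tag counts for two promoters of one gene at 0 h and 4 h.
counts_0h = {"P1": 90, "P2": 10}
counts_4h = {"P1": 20, "P2": 80}

shift = usage_shift(counts_0h, counts_4h)
for promoter, delta in shift.items():
    print(f"{promoter}: change in usage fraction {delta:+.2f}")
```

In this toy example the gene total is the same at both time points, so a gene-level measurement would show no change even though the dominant promoter has switched from P1 to P2.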

A major goal of the analysis of deep CAGE sequence data is to infer the regulatory network that orchestrates transcription in a cell (26). With the target of each regulatory interaction being a promoter instead of a gene, such a network is qualitatively different from current gene-based networks inferred from microarray or SAGE expression profiling. Each of the promoters may be associated with one or more coding or noncoding transcripts.

A promoter-based network can be constructed by first obtaining matrix models to describe the binding affinities of transcription factors, and then using these matrices to search for potential transcription factor binding sites in promoter regions. Both of these steps are facilitated by the availability of the promoter locations and expression levels measured by deep CAGE.
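As an illustration of the two steps, the following Python sketch builds a log-odds position weight matrix from a handful of aligned binding sites and then scans a promoter sequence with it. Everything specific here is an assumption made for the example: the sites, the promoter sequence, the uniform background of 0.25, the pseudocount, and the score threshold. Sequences are assumed to contain only A, C, G, and T.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix (one dict per column) from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring at least `threshold`."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical aligned binding sites and a hypothetical promoter sequence.
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAT"]
pwm = build_pwm(sites)
promoter = "GGCGCTATAAAGGCGC"
hits = scan(promoter, pwm, threshold=5.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

Only the forward strand is scanned here; a fuller version would also score the reverse complement of each window.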

Matrix models of transcription factors can be derived either from the literature or by aligning the upstream regions of coregulated genes. Such coregulated genes can be found by clustering the genome-wide expression profiles measured in microarray experiments. However, the expression profile that a microarray measures for a gene is an aggregate of the expression profiles of the individual promoters from which its transcripts originate, which hampers clean clustering. In addition, these promoters are typically regulated by different sets of transcription factors. Deep CAGE expression profiling enables us to separate the expression of a gene into the contributions of its individual promoters, which can then be clustered on the basis of their expression profiles to generate tighter clusters of co-expression. Each of these clusters of promoters is likely regulated by a smaller set of transcription factors than a conventional cluster of genes. In addition, the CAGE-defined promoters identify the genome regions to be aligned to find overrepresented sequence motifs.
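The effect of aggregation on clustering can be shown with a small Python sketch. It is not the clustering method of any particular study: it uses a deliberately simple single-linkage grouping on Pearson correlation, and the promoter names, expression values, and the correlation cutoff of 0.9 are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def cluster_profiles(profiles, min_corr=0.9):
    """Single-linkage grouping: join promoters whose correlation is at least min_corr."""
    names = sorted(profiles)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= min_corr:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical time-course profiles (four time points) for promoters of three genes.
profiles = {
    "geneA_P1": [1, 2, 4, 8],
    "geneA_P2": [8, 4, 2, 1],
    "geneB_P1": [2, 4, 8, 16],
    "geneC_P1": [16, 8, 4, 2],
}
clusters = cluster_profiles(profiles)

# The gene-level aggregate of geneA hides both promoter-level trends.
gene_a_total = [a + b for a, b in zip(profiles["geneA_P1"], profiles["geneA_P2"])]
print(clusters)
print(gene_a_total)
```

The two promoters of geneA fall into different clusters, one rising and one falling, whereas their summed profile resembles neither.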

Similarly, we can restrict the genome region to be searched for potential transcription factor binding sites to the promoter region identified by CAGE, significantly reducing the possibility of false positives. While comparative genomics may also be used to identify the promoter region, it does not pinpoint the exact transcription start site. In addition, many biologically functional sequences do not seem to be evolutionarily constrained across all mammals (16) and therefore cannot be identified by comparative approaches. Finally, deep CAGE profiling identifies which promoters are active in a particular biological context and therefore suggests which transcription factor binding sites may be biologically relevant.
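Restricting the search region amounts to computing a strand-aware window around each CAGE-defined transcription start site. The Python sketch below is one way to do this, under stated assumptions: coordinates are 0-based, the returned interval is half-open, and the window of 300 bases upstream and 100 bases downstream is an arbitrary illustrative choice rather than a recommended value. The function name `promoter_window` is hypothetical.

```python
def promoter_window(tss, strand, upstream=300, downstream=100, chrom_length=None):
    """Return a 0-based, half-open (start, end) interval around a TSS.

    The window contains `upstream` bases before the TSS and `downstream`
    bases starting at the TSS, both counted in the direction of transcription.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, upstream sequence lies at higher coordinates.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 0)
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


print(promoter_window(1000, "+"))
print(promoter_window(1000, "-"))
```

Only the sequence inside the returned interval would then be scanned for candidate binding sites.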

Software Needs for High-throughput Transcriptome Sequencing

The sheer size of the datasets generated by high-throughput transcriptome sequencing places new requirements on the software tools used to analyze these data. During extraction of CAGE tags from the raw reads, care must be taken to correctly distinguish the tag from linker sequences. Mapping CAGE tags to the genome is complicated by the possibility of sequence mismatches and by tags that map to multiple locations on the genome. Although BLAST (27) can be used to place CAGE tags on the genome, it is likely too slow for the datasets produced by next-generation sequencers. In addition, BLAST is based on the assumption that the sequences to be compared are evolutionarily related, which is not appropriate for CAGE tag mapping, where each tag is expected to derive from the genome itself. For this purpose, software designed for fast mapping of short sequences, such as SSAHA (Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK) (28) and Nexalign (RIKEN, Yokohama, Japan) (29), may be more appropriate. The latter is exceedingly fast for exact matches in particular and is guaranteed to find any tag present in the genome.
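To make the idea of exact tag mapping concrete, the following Python sketch indexes every substring of tag length in a toy genome and looks tags up on both strands. It is a dictionary-based illustration only and does not reproduce the algorithms of SSAHA or Nexalign; the genome sequence, the chromosome name, and the tag length of 8 are hypothetical, and an index of this kind would not scale to a mammalian genome without a more compact data structure.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def build_index(genome, tag_length):
    """Map every substring of length tag_length to its (chromosome, position) list."""
    index = defaultdict(list)
    for chrom, seq in genome.items():
        for pos in range(len(seq) - tag_length + 1):
            index[seq[pos:pos + tag_length]].append((chrom, pos))
    return index


def map_tag(tag, index):
    """Return all exact matches of a tag as (chromosome, leftmost position, strand)."""
    hits = [(chrom, pos, "+") for chrom, pos in index.get(tag, [])]
    hits += [(chrom, pos, "-") for chrom, pos in index.get(reverse_complement(tag), [])]
    return hits


# Hypothetical toy genome in which one 8-mer occurs twice.
genome = {"chrToy": "ACGTTGCAAGGCTTACGTTGCA"}
index = build_index(genome, tag_length=8)

unique_hits = map_tag("AAGGCTTA", index)  # maps to a single location
multi_hits = map_tag("ACGTTGCA", index)   # maps to two locations
print(unique_hits)
print(multi_hits)
```

Because every position of the genome is indexed, any tag that occurs exactly in the genome is found, and tags with more than one hit are reported as ambiguous rather than silently assigned.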

Special care must be taken for ambiguous tags that map to multiple genome locations with equal scores. In such cases, it may be possible to decide between the mapping locations by considering the number of singly-mapped CAGE tags in each genome neighborhood as a prior probability (30). Many of the ambiguous CAGE tags map to a large number of genome locations, suggesting that they originate from repeat regions of the genome. This is consistent with a considerable fraction of the human transcriptome originating from repetitive sequences, which may play a role in the transcriptional regulation of gene expression (31).
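The use of singly-mapped tags as a prior can be sketched as follows. This is a simplified illustration of the general idea and not the exact procedure of the cited rescue strategy (30): each candidate location of an ambiguous tag receives a weight proportional to the number of uniquely mapped tags within a fixed window around it. The function name `rescue_weights`, the window of 100 bases, the fallback to equal weights, and all coordinates are assumptions made for this example.

```python
def rescue_weights(candidates, unique_tag_positions, window=100):
    """Weight each candidate location by the count of nearby uniquely mapped tags.

    candidates and unique_tag_positions are lists of (chromosome, position).
    If no candidate has any unique tag nearby, the weights are split equally.
    """
    counts = []
    for chrom, pos in candidates:
        nearby = sum(
            1 for c, p in unique_tag_positions
            if c == chrom and abs(p - pos) <= window
        )
        counts.append(nearby)
    total = sum(counts)
    if total == 0:
        return [1.0 / len(candidates)] * len(candidates)
    return [n / total for n in counts]


# Hypothetical positions of uniquely mapped tags and of one ambiguous tag.
unique_tags = [("chr1", 1000), ("chr1", 1010), ("chr1", 1040), ("chr2", 5000)]
candidates = [("chr1", 1020), ("chr2", 5050), ("chr3", 200)]

weights = rescue_weights(candidates, unique_tags)
for location, weight in zip(candidates, weights):
    print(location, weight)
```

The weights can be used either to assign the tag to its most probable location or to distribute its count fractionally over all candidates.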

The size of the datasets produced by next-generation sequencers also poses a challenge to data management systems. As terabytes of image data may be generated in a single run, saving all raw data produced in one experiment may no longer be an option. With billions of bases being generated in a single run, even saving only the sequence data will be a considerable enterprise.

In addition to new software to analyze deep CAGE data, the paradigm shift from gene-based networks to promoter-based networks of transcriptional regulation requires new ways to visualize such networks. Visualization software packages such as Cytoscape (32) have previously been developed to represent biomolecular interaction networks and can be used to draw gene regulatory networks. In a promoter-based network, the targets of regulatory interactions are the individual promoters of a gene, which may be too numerous for graphical representation except in the most detailed drawings. The situation is further compounded by the multitude of transcripts that have been identified for each gene. A multiscale visualization approach, in which users can choose the level of detail at which each gene is represented, may be suitable for visualizing promoter-based regulatory networks.
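One possible data model behind such a multiscale view is sketched below in Python. It is an assumption about how the idea could be implemented and is not a feature of Cytoscape or of any existing tool: regulatory edges are stored at promoter resolution, and each gene is drawn either as a single node or expanded into its promoters at the user's choice. The function name `collapse_to_genes` and all transcription factor, gene, and promoter names are hypothetical.

```python
def collapse_to_genes(promoter_edges, promoter_to_gene, expanded=()):
    """Return the edge list to draw, collapsing promoters into genes.

    promoter_edges: iterable of (transcription_factor, promoter) pairs.
    promoter_to_gene: dict mapping each promoter to its gene.
    expanded: genes the user wants shown at promoter resolution.
    """
    expanded = set(expanded)
    edges = set()
    for factor, promoter in promoter_edges:
        gene = promoter_to_gene[promoter]
        target = promoter if gene in expanded else gene
        edges.add((factor, target))
    return sorted(edges)


# Hypothetical promoter-level network with two genes of two promoters each.
promoter_edges = [
    ("TF1", "geneA_P1"),
    ("TF2", "geneA_P2"),
    ("TF1", "geneB_P1"),
    ("TF1", "geneB_P2"),
]
promoter_to_gene = {
    "geneA_P1": "geneA", "geneA_P2": "geneA",
    "geneB_P1": "geneB", "geneB_P2": "geneB",
}

overview = collapse_to_genes(promoter_edges, promoter_to_gene)
detail_a = collapse_to_genes(promoter_edges, promoter_to_gene, expanded=["geneA"])
print(overview)
print(detail_a)
```

Duplicate edges that arise when several promoters of a collapsed gene share a regulator are merged, which keeps the overview compact while the promoter-level data remain available on demand.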

Acknowledgements

This work was supported by a research grant from the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government to Y.H., a grant from the Genome Network Project, also from the Ministry of Education, Culture, Sports, Science and Technology, and a grant from the RIKEN Frontier Research System, Functional RNA Research Program.

Competing Interests Statement

The authors declare no competing interests.

References
1.) Shiraki T. Kondo S. Katayama S. Waki K. Kasukawa T. Kawaji H. Kodzius R. Watahiki A., Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage, Proc. Natl. Acad. Sci. USA, P15776 - P15781

2.) Cheng J. Kapranov P. Drenkow J. Dike S. Brubaker S. Patel S. Long J. Stern D., Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution, Science, P1149 - P1154

3.) Kodzius R. Kojima M. Nishiyori H. Nakamura M. Fukuda S. Tagami M. Sasaki D. Imamura K., CAGE: cap analysis of gene expression, Nat. Methods, P211 - P222

4.) Schena M. Shalon D. Davis W. R. Brown O. P., Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science, P467 - P470

5.) Brenner S. Johnson M. Bridgham J. Golda G. Lloyd H. D. Johnson D. Luo S. McCurdy S., Gene expression by massively parallel signature sequencing (MPSS) on microbead arrays, Nat. Biotechnol., P630 - P634

6.) Velculescu E. V. Zhang L. Vogelstein B. Kinzler W. K., Serial analysis of gene expression, Science, P484 - P487

7.) Higuchi R. Dollinger G. Walsh S. P. Griffith R., Simultaneous amplification and detection of specific DNA sequences, Biotechnology (N. Y.), P413 - P417

8.) Higuchi R. Fockler C. Dollinger G. Watson R., Kinetic PCR: Real time monitoring of DNA amplification reactions, Biotechnology (N. Y.), P1026 - P1030

9.) Heid A. C. Stevens J. Livak J. K. Williams M. P., Real time quantitative PCR, Genome Res., P986 - P994

10.) Wittwer T. C. Herrmann G. M. Moss A. A. Rasmussen P. R., Continuous fluorescence monitoring of rapid cycle DNA amplification, BioTechniques, P130 - P138

11.) Hashimoto S. Suzuki Y. Kasai Y. Morohoshi K. Yamada T. Sese J. Morishita S. Sugano S. Matsushima K., 5′-end SAGE for the analysis of transcriptional start sites, Nat. Biotechnol., P1146 - P1149

12.) Carninci P. Kasukawa T. Katayama S. Gough J. Frith C. M. Maeda N. Oyama R. Ravasi T., The transcriptional landscape of the mammalian genome, Science, P1559 - P1563

13.) Zavolan M. Kondo S. Schönbach C. Adachi J. Hume A. D. Hayashizaki Y. Gaasterland T. RIKEN GER Group, Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome, Genome Res., P1290 - P1300

14.) Carninci P. Sandelin A. Lenhard B. Katayama S. Shimokawa K. Ponjavic J. Semple M. C.A. Taylor S. M., Genome-wide analysis of mammalian promoter architecture and evolution, Nat. Genet., P626 - P635

15.) ENCODE Project Consortium, The ENCODE (ENCyclopedia Of DNA Elements) Project, Science, P636 - P640

16.) ENCODE Project Consortium, Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature, P799 - P816

17.) Ravasi T. Suzuki H. Pang C. K. Katayama S. Furuno M. Okunishi R. Fukuda S. Ru K., Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome, Genome Res., P11 - P19

18.) Trinklein D. N. Aldred F. S. Hartman J. S. Schroeder I. D. Otillar P. R. Myers M. R., An abundance of bidirectional promoters in the human genome, Genome Res., P62 - P66

19.) Engström G. P. Suzuki H. Ninomiya N. Akalin A. Sessa L. Lavorgna G. Brozzi A. Luzi L., Complex loci in human and mouse genomes, PLoS Genet., Pe47

20.) Ambros V., The functions of animal microRNAs, Nature, P350 - P355

21.) Fukagawa T. Nogami M. Yoshikawa M. Ikeno M. Okazaki T. Takami Y. Nakayama T. Oshimura M., Dicer is essential for formation of the heterochromatin structure in vertebrate cells, Nat. Cell Biol., P784 - P791

22.) Imamura T. Yamamoto S. Ohgane J. Hattori N. Tanaka S. Shiota K., Non-coding RNA directed DNA demethylation of Sphk1 CpG island, Biochem. Biophys. Res. Commun., P593 - P600

23.) Murrell A. Heeson S. Reik W., Interaction between differentially methylated regions partitions the imprinted genes Igf2 and H19 into parent-specific chromatin loops, Nat. Genet., P889 - P893

24.) Andersen A. A. Panning B., Epigenetic gene regulation by noncoding RNAs, Curr. Opin. Cell Biol., P281 - P289

25.) Katayama S. Tomaru Y. Kasukawa T. Waki K. Nakanishi M. Nakamura M. Nishida H. Yap C. C., Antisense transcription in the mammalian transcriptome, Science, P1564 - P1566

26.) Nilsson R. Bajic B. V. Suzuki H. di Bernardo D. Björkegren J. Katayama S. Reid F. J. Sweet J. M., Transcriptional network dynamics in macrophage activation, Genomics, P133 - P142

27.) Altschul F. S. Gish W. Miller W. Myers W. E. Lipman J. D., Basic local alignment search tool, J. Mol. Biol., P403 - P410

28.) Ning Z. Cox J. A. Mullikin C. J., SSAHA: a fast search method for large DNA databases, Genome Res., P1725 - P1729

29.) Lassmann T. Arner E. Daub O. C.

30.) Faulkner J. G. Forrest R. A.R. Chalk M. A. Schroder K. Hayashizaki Y. Carninci P. Hume A. D. Grimmond M. S., A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE, Genomics, P281 - P288

31.) Imanishi T. Itoh T. Suzuki Y. O'Donovan C. Fukuchi S. Koyanagi O. K. Barrero A. R. Tamura T., Integrative annotation of 21,037 human genes validated by full-length cDNA clones, PLoS Biol., Pe162

32.) Shannon P. Markiel A. Ozier O. Baliga S. N. Wang T. J. Ramage D. Amin N. Schwikowski B. Ideker T., Cytoscape: a software environment for integrated models of biomolecular interaction networks, Genome Res., P2498 - P2504
