In cap analysis gene expression (CAGE), short (∼20 nucleotide) sequence tags originating from the 5′ end of full-length mRNAs are sequenced to identify transcription events on a genome-wide scale. The rapid increase in the throughput of present-day sequencers enables much deeper CAGE tag sequencing, in which the tag for each mRNA can be found multiple times in a given experiment. CAGE tag counts can then be used to reliably estimate the cellular concentration of the corresponding mRNA. In contrast to microarray and SAGE expression profiling, CAGE identifies the location of each transcription start site in addition to its expression level. This makes it possible to infer a genome-wide network of transcriptional regulation by searching the promoter region surrounding each CAGE-defined transcription start site for potential transcription factor binding sites. Hence, deep CAGE is a unique tool for the construction of a promoter-based network of transcriptional regulation. CAGE-based expression profiling also allows us to identify dynamic promoter usage in time-course experiments and the specific promoter regulated by a given transcription factor in disruption experiments. The sheer size of the short-tag datasets produced by next-generation sequencers calls for new software to handle them. In addition, new visualization methods will be needed to represent a promoter-based transcriptional network.
Cap analysis gene expression (CAGE) was introduced in 2003 as a method to determine transcription start sites on a genome-wide scale by isolating and sequencing short sequence tags originating from the 5′ end of RNA transcripts (1). Mapping these tags back to the reference genome identifies the transcription start sites from which the transcripts originated.
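As a minimal sketch of how mapped tags translate into transcription start sites, the example below takes the 5′ end of each aligned tag as the start site and counts how many tags support each position. The input layout (chromosome, 0-based half-open start and end, strand) and the function names `tss_position` and `count_tss` are assumptions made for illustration, not part of any published CAGE pipeline.

```python
from collections import Counter


def tss_position(start, end, strand):
    """Return the 0-based coordinate of the 5' end of a mapped tag.

    Assumes BED-style half-open intervals [start, end). On the plus
    strand the 5' end is the leftmost base; on the minus strand it is
    the rightmost base of the alignment.
    """
    if strand == "+":
        return start
    if strand == "-":
        return end - 1
    raise ValueError(f"unknown strand: {strand!r}")


def count_tss(alignments):
    """Count tags per (chromosome, TSS position, strand)."""
    counts = Counter()
    for chrom, start, end, strand in alignments:
        counts[(chrom, tss_position(start, end, strand), strand)] += 1
    return counts


# Hypothetical alignments, invented for this example.
alignments = [
    ("chr1", 1000, 1020, "+"),
    ("chr1", 1000, 1021, "+"),
    ("chr1", 1003, 1023, "+"),
    ("chr2", 5000, 5020, "-"),
]
tss_counts = count_tss(alignments)
```

Keeping the strand in the key matters because two tags can share a coordinate while originating from transcripts on opposite strands.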
CAGE relies on a cap-trapper system to capture full-length RNAs while avoiding rRNA and tRNA transcripts. First, an oligo-dT primer is used to reverse-transcribe poly-A-terminated RNAs. Alternatively, a random primer can be used for RNAs without a poly-A tail, which may constitute almost half of the transcriptome (2). RNA/DNA double-stranded hybrids that contain a mature mRNA are selected by biotinylating their 5′ cap structure, allowing capture by streptavidin-coated magnetic beads. A linker sequence containing an MmeI recognition site is then ligated to the 5′ end of the full-length cDNA. MmeI cleaves about 20 nucleotides downstream of its recognition site, producing a short CAGE tag starting at the 5′ end of eukaryotic mRNAs (3). CAGE tags that map just upstream of known genes may be derived from the corresponding full-length mRNAs, whereas others may reflect the existence of currently unknown transcription start sites or genes. Because of their short size, sequencing CAGE tags is more efficient at detecting transcription start sites than sequencing full-length cDNAs.
In early CAGE experiments, the throughput of sequencers limited the achievable sequencing depth such that many CAGE tags were found only once in a given experiment. More recently, a new generation of sequencers excelling at high-throughput sequencing of short tags has enabled deep CAGE tag sequencing, generating upward of a million tags from a single experimental condition. The tag counts found in such experiments are typically much larger than one, allowing an accurate estimate of the cellular concentration of the RNA molecule corresponding to each CAGE tag. Deep CAGE thus detects both the transcription start site and its expression level, making it a unique tool in the analysis of transcriptional regulatory networks.
Characteristic Features of CAGE
High-throughput gene expression experiments based on microarrays (4), massively parallel signature sequencing (MPSS) (5), serial analysis of gene expression (SAGE) (6), and CAGE give us a snapshot of the RNA concentrations in the cell at a particular time in a specific experimental condition. Quantitative real-time PCR (qRT-PCR) (7,8,9,10), while not a high-throughput method, can provide a valuable standard for validation because of its accuracy and wide dynamic range. Although these methods are complementary to each other, the characteristic features of CAGE expression profiling make it particularly suitable for investigating the transcriptional regulatory network that drives the expression of genes and noncoding RNAs.
First, CAGE tag counts allow us to calculate the cellular amount of the corresponding RNA molecule in a digital form. As one mRNA is not preferentially detected over another, expression profiling based on tag sequencing is unbiased, allowing a direct comparison of the expression values of different genes measured in a single deep CAGE experiment. In contrast, microarray fluorescence levels are affected both by the mRNA concentration and by the probe-dependent mRNA affinity, precluding a direct comparison between genes. In addition, tag counts as a measure of mRNA concentrations have a dynamic range that is orders of magnitude larger than microarray expression levels. The accuracy and the dynamic range of CAGE- and SAGE-derived expression levels as well as the sensitivity of detecting lowly expressed transcripts can be improved further by deeper sequencing. Importantly, microarray and qRT-PCR expression profiling are restricted to those transcripts for which a probe or primer pair is available, whereas methodologies based on tag sequencing can also measure the expression of currently unknown transcripts.
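To illustrate how digital tag counts can be compared directly between genes, the sketch below scales raw counts to tags per million. This normalization is one common convention and is an assumption here; the text does not prescribe a specific formula. The function name `tags_per_million` and the counts are invented for the example.

```python
def tags_per_million(tag_counts):
    """Scale raw tag counts so that they sum to one million.

    tag_counts maps a transcription start site (or gene) identifier
    to the number of tags observed for it in one experiment.
    """
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("no tags were counted")
    return {key: count * 1_000_000 / total for key, count in tag_counts.items()}


# Hypothetical counts from a single (very shallow) experiment.
counts = {"tss_A": 150, "tss_B": 30, "tss_C": 20}
tpm = tags_per_million(counts)

# Within one experiment, the ratio of two normalized values equals the
# ratio of the raw counts, so expression can be compared across genes.
ratio_a_to_b = tpm["tss_A"] / tpm["tss_B"]
```

Because every tag is counted the same way, scaling by the library size is enough to compare experiments of different sequencing depth under this assumption; no probe-specific correction is involved.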
Deep CAGE expression profiling is unique among high-throughput expression profiling methods because the 5′ end of the CAGE tag identifies the corresponding transcription start site. Hence, deep CAGE allows us to determine the promoter driving the transcription of each transcript in addition to its expression level. In contrast, tags generated by SAGE or MPSS are located at the 3′ end of the transcript and do not identify the promoter, which may lie tens of kilobases or more upstream in the genome sequence. A variant of SAGE has been developed that uses oligo-capping to create tags at the 5′ end of the transcript (5′ SAGE) (11), but it is currently not in common use. Whereas expressed sequence tags and full-length mRNA sequencing do identify the promoter, the throughput of these techniques is insufficient to allow genome-wide expression profiling, and they may fail to detect lowly expressed transcripts.
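Once a transcription start site is known, the promoter region around it can be delimited for further analysis. The sketch below computes a strand-aware window around a start site. The window sizes (500 bases upstream, 100 bases downstream), the 0-based half-open coordinate convention, and the function name `promoter_window` are illustrative assumptions, not values taken from the text.

```python
def promoter_window(tss, strand, upstream=500, downstream=100):
    """Return a 0-based half-open interval [start, end) around a TSS.

    "Upstream" is relative to the direction of transcription, so on
    the minus strand it lies at higher genomic coordinates. The default
    window sizes are arbitrary illustrative choices.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream + 1
    elif strand == "-":
        start, end = tss - downstream, tss + upstream + 1
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Clip at the chromosome start; clipping at the chromosome end
    # would require the chromosome length and is omitted here.
    return max(start, 0), end


plus_window = promoter_window(1000, "+")
minus_window = promoter_window(5019, "-")
```

The sequence within such a window is what one would scan for potential transcription factor binding sites.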