Discerning Exon-Exon Splicing Events
We assessed the ability of this method to identify exon-exon junctions and thereby infer the presence of alternative transcript isoforms. As discussed, we first searched all reads that could not be mapped directly to the genome against the theoretical junctions of all possible combinations of exons within and between neighboring genes (see Discerning Exon-Exon Splicing Events section in Materials and Methods). Using this approach, 31,618 of the 284,873 known exon-exon junctions and 379 unannotated exon-skipping events were supported by multiple unambiguously mapped reads. Another 26,232 annotated and 1220 novel within-gene junctions were supported either by a single read or by ambiguously mapped reads. The novel exon-skipping events (e.g., Figure 4A) occurred within 1355 different genes (Supplementary Table S3), with a slight enrichment among highly abundant genes (mean read depth = 29). Finally, we found a single example of distinct reads spanning a pair of exons from neighboring genes, which suggests that some transcripts result from co-transcription (22) of FAM24B along with its downstream neighbor CUZD1.
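The junction search described above can be illustrated with a minimal Python sketch. It assumes that each gene is supplied as an ordered list of exon sequences and that a fixed number of bases is taken from each side of a junction. The function name `candidate_junctions`, the `flank` parameter, and the toy sequences are hypothetical and are not taken from the pipeline used in this study.

```python
from itertools import combinations


def candidate_junctions(exons, flank):
    """Build theoretical junction sequences for every ordered exon pair.

    exons: list of (exon_id, sequence) tuples in transcription order.
    flank: number of bases kept on each side of the junction (assumed value).
    """
    if flank <= 0:
        raise ValueError("flank must be positive")
    junctions = {}
    # combinations() preserves input order, so the donor exon always
    # precedes the acceptor exon in the transcript.
    for (id_a, seq_a), (id_b, seq_b) in combinations(exons, 2):
        junctions[(id_a, id_b)] = seq_a[-flank:] + seq_b[:flank]
    return junctions


# Toy gene with three exons; only adjacent junctions are annotated.
exons = [("e1", "AAAACCCC"), ("e2", "GGGGTTTT"), ("e3", "ACGTACGT")]
annotated = {("e1", "e2"), ("e2", "e3")}

junctions = candidate_junctions(exons, flank=4)
# Pairs absent from the annotation correspond to candidate exon-skipping events.
novel = [pair for pair in junctions if pair not in annotated]
```

In this toy example, a read matching the `("e1", "e3")` junction sequence would be evidence for an unannotated event in which `e2` is skipped.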
Following this fairly conservative approach for identifying reads spanning splice junctions, many reads (∼9.7 million) still remained unmapped. We expected that some of the unmapped reads likely corresponded to the exonic or junction sequence belonging to exons not currently annotated. We extended our attempt to use these sequences by including candidate splicing events represented in the trome transcript database, a much larger resource where automated annotations are driven by cDNA sequence (23). Another 156,563 reads could be mapped to 34,895 junctions described in trome, providing additional empirical evidence that these should be added to gene models (see Figure 4A for an example). A further 32,829 unique reads could be mapped to 7195 transcripts within the trome transcript database, indicating the presence of novel exons. These included both known structures, which are present in other databases such as RefSeq but missing from Ensembl, as well as novel events, some of which are supported by EST evidence only. This highlights the still underappreciated diversity of the transcriptome as reflected in current annotations and promises a means for rapid and robust advancements of annotations in the near future.
Identification of Transcriptional Start and Termination Sites
Beyond our currently limited understanding of alternative splicing, correctly defining the boundaries of a gene or transcription unit remains inherently difficult. We assessed the utility of our approach for refining the transcriptional start sites (TSS) and transcriptional termination sites (TTS) of annotated genes. A TSS or TTS was considered confirmed if one of our WTSS peaks started (or ended) within 30 bp of the annotated TSS/TTS. Using these criteria, we confirmed the TSS of 12,177 transcripts and the TTS of 11,906 transcripts. We further investigated our ability to refine or identify a TSS by comparing our results to CAGE data. As CAGE tags are derived from the first 20 nt from the 5′ end of an mRNA transcript (9,10), we would only expect the HeLa peaks corresponding to initial exons to overlap CAGE peaks. However, we would expect most CAGE peaks to overlap HeLa peaks as long as they derive from a gene expressed in HeLa. CAGE tags were realigned to the latest human genome build (see Alignment, Peak Discovery, and Exon and Transcript Abundance Calculation section), yielding 102,823 distinct CAGE alignment peaks. Due to the reported presence of alternate promoters for many genes (11), we checked for the overlap of a CAGE peak at any position within HeLa exons. A total of 37,054 (36.1%) of these overlapped with HeLa peaks. Of the weakest CAGE alignment sites (supported by only two CAGE tags), 13,081 (27.8%) overlapped with one of our HeLa peaks (regardless of peak strength), while 9140 (59.6%) of alignment sites with ≥10 CAGE tags corresponded to a HeLa peak (Figure 2B and Supplemental Table 4).
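The 30 bp confirmation criterion can be sketched as a simple window test. The example below assumes that peak edges are available as a sorted list of coordinates on the same chromosome and strand as the annotated site, and that the 30 bp window is inclusive. The function name `is_confirmed` and the coordinates are hypothetical.

```python
from bisect import bisect_left


def is_confirmed(site, sorted_edges, window=30):
    """Return True if any peak edge lies within `window` bp of `site`.

    site: annotated TSS or TTS coordinate.
    sorted_edges: peak start (or end) coordinates in ascending order.
    window: maximum allowed distance in bp (inclusive, an assumption).
    """
    # Locate the first edge at or beyond the lower bound of the window,
    # then check that it does not exceed the upper bound.
    i = bisect_left(sorted_edges, site - window)
    return i < len(sorted_edges) and sorted_edges[i] <= site + window


# Toy peak edges and annotated sites.
peak_edges = [100, 500, 1000]
confirmed_near = is_confirmed(120, peak_edges)   # edge at 100 is 20 bp away
confirmed_edge = is_confirmed(130, peak_edges)   # edge at 100 is exactly 30 bp away
confirmed_far = is_confirmed(131, peak_edges)    # nearest edge is 31 bp away
```

Sorting the peak edges once and using a binary search keeps each lookup logarithmic in the number of peaks, which matters when tens of thousands of annotated sites are tested.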
A strong relationship between HeLa peak height and CAGE overlap was also observed (Figure 2C). Nearly all (27,495) of the sites matched by CAGE tags and HeLa peaks corresponded to known exons, supporting the notion that these regions are likely to represent real transcriptional start sites. The remaining CAGE sites, which lacked supporting reads in this library, may correspond to genes that are inactive or expressed at low levels in HeLa. Conversely, the 9559 sites occupied by both HeLa peaks and CAGE peaks but not overlapping annotated exons likely represent unannotated exons. As expected, some of these sites (1927, 20%) corresponded to exons in the UCSC “known genes” while a larger proportion (6722, 70%) also overlapped with an exon from an AceView EST-based gene model (24).
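The counts reported above can be checked for internal consistency with a few lines of arithmetic. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Counts taken directly from the text.
overlapping_sites = 37_054   # CAGE peaks overlapping HeLa peaks
known_exon_sites = 27_495    # overlapping sites within known exons
unannotated_sites = 9_559    # overlapping sites outside annotated exons
ucsc_sites = 1_927           # unannotated sites matching UCSC "known genes" exons
aceview_sites = 6_722        # unannotated sites matching AceView exons

# Known and unannotated sites should partition the overlapping set.
partition_ok = known_exon_sites + unannotated_sites == overlapping_sites


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


ucsc_pct = percent(ucsc_sites, unannotated_sites)
aceview_pct = percent(aceview_sites, unannotated_sites)
```

The known-exon and unannotated counts sum to the total number of overlapping sites, and the UCSC and AceView fractions round to the 20% and 70% quoted in the text.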