Due to the presence of repetitive sequences in the human genome, in many cases, the short reads produced by Illumina sequencing cannot be unambiguously mapped to a single locus. We determined the expected “mappability” of all annotated exons to aid us in interpreting the significance of regions with lower (or no) representation in our libraries. This computation, described in detail in the Supplementary Material (available online at www.BioTechniques.com), first assigns a mappability score to each nucleotide in the genome. This value represents the fraction of all possible Illumina reads derived from that position that could theoretically be mapped unambiguously back to that position. For latter calculations, exons with average mappability (over all their nucleotides) of <0.5 were considered low-mappability exons and were ignored in some calculations (see next section).Alignment, Peak Discovery, and Exon and Transcript Abundance Calculation
We aligned the Illumina WTSS sequences to the human reference genome (NCBI Build 36.1) using the Eland aligner (Illumina), which produced a summary and alignment position for each unambiguously mapped sequence (allowing up to two mismatches). This process flagged both the sequences where the best alignments mapped to multiple locations and sequences that failed to align to the reference genome. Only sequences with unambiguous alignments to a single genomic site were retained for this analysis. Reads that aligned fully within the boundaries of an annotated exon (Ensembl, version 48; www.ensembl.org) were considered for coverage calculations. The average coverage of an exon was computed as the sum of the read depth aligned across each individual base divided by its length (Supplementary Table S1). Corrected exon coverage was calculated by dividing average coverage by the average mappability value for that exon. A summary value for each gene was calculated using the mean coverage for all exons comprising a representative gene model (the one with the most unambiguously mapped reads). Notably, short reads do not provide enough information for us to discern individual transcripts; hence these numbers represent a summary of the net transcriptional activity of each gene locus (Supplementary Table S2). All genomic regions in which more than one aligned read overlapped was considered a peak, with peak height defined by the maximum number of overlapping reads at any position in that peak. For known exons, height (or depth) was considered a metric for exon abundance, while for nonexonic peaks, height was considered stronger evidence for its potential to represent a novel exon. Cap analysis of gene expression (CAGE) data was treated in the same way, producing CAGE peaks. The CAGE dataset used in this work consisted of 24 libraries from 14 tissue types (http://fantom31p.gsc.riken.jp/cage/hg17/). A UCSC wig-formatted file for visualization of the HeLa peaks is available for download (ftp://ftp03.bcgsc.ca/public/HeLa/).Discerning Exon-Exon Splicing Events
A database of exon-exon junction sequences was constructed by combining, for each gene, the last 30 nt of an exon with the first 30 nt of each of its downstream exons. Those exon-exon junctions present in at least one annotated Ensembl transcript were considered “known” events whereas the remainder were considered “putative” events. These ∼3.5 M junction sequences were then queried with all reads that failed to align to the human genome, enforcing perfect alignments with no mismatches. We retained the reads aligning with at least 6 nt on either side of the junction and excluded those mapping to multiple junctions. A junction was considered present if at least two distinct reads mapped uniquely across it.
Novel cryptic exons were identified by first creating a database representing the most likely possible exon-intron splicing events. These were identified by searching the downstream introns for each exon (to a maximum distance of 50 kb) for all canonical splice acceptor sequences (AG), and creating a putative exon junction with the latter 30 bases of this exon and the first 30 bases following the acceptor sequence. The same procedure was done using the 5′ end of each exon and its upstream introns, but involved searching for the canonical splice acceptor sequence (GT). These junctions were created using the preceding 30 nucleotides from that intron and the first 30 nucleotides of that exon. Reads were mapped to a database of these junctions (as described in the previous paragraph) to identify evidence for splicing of cryptic exons to known exons.