For more than a decade, the complete human and mouse genomes have been sequenced and catalogued in databases for anyone to search through. But the list of proteins encoded by those genomes still isn’t complete: researchers have just discovered more than 2000 new mammalian proteins created by splicing known genes in new ways (1). The strings of nucleotides that encode these novel proteins were already listed in databases, but it was assumed that they weren’t translated into proteins until a team of Australian scientists took a closer look.
“It all started when we found this bizarre splice form of a protein we were studying,” said Aude Fahrer of the Australian National University, senior author of the study. “We wondered how many other similar proteins there were.”
The protein was Ncaph2, which is involved in chromosome assembly. The alternate version of the gene was spliced to lack 17 base pairs of one exon. Because 17 is not a multiple of three, the deletion removes part of a codon, which should shift the reading frame and render the entire remainder of the protein unreadable. Indeed, gene databases listed the alternate splice form but annotated it as “nonsense-mediated decay,” indicating that it wouldn’t be turned into a protein.
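The frameshift arithmetic can be illustrated with a toy example. The sketch below is only an illustration: the sequence is invented (it is not Ncaph2), and the helper names `codons`, `delete_bases`, and `is_frameshift` are hypothetical. It shows that removing 17 bases changes every downstream codon and pushes the stop codon out of frame, whereas removing 18 bases would not.

```python
def codons(seq):
    """Split a sequence into consecutive triplets, dropping any incomplete tail."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def delete_bases(seq, start, length):
    """Return seq with `length` bases removed, beginning at index `start`."""
    return seq[:start] + seq[start + length:]


def is_frameshift(deletion_length):
    """A deletion shifts the reading frame unless its length is a multiple of three."""
    return deletion_length % 3 != 0


# Toy coding sequence (invented): start codon, eight GCT codons, stop codon.
toy = "ATG" + "GCT" * 8 + "TAA"

print(is_frameshift(17))                 # True, since 17 % 3 == 2
print(codons(delete_bases(toy, 3, 17)))  # ['ATG', 'TGC', 'TGC', 'TTA']
print(codons(delete_bases(toy, 3, 18)))  # ['ATG', 'GCT', 'GCT', 'TAA']
```

In the 17-base case the original TAA stop is no longer read as a codon, which is why such transcripts are normally assumed not to yield a functional protein.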
Fahrer’s team, however, found that an alternate start codon—in line with the alternate splice reading frame—rescued the protein, allowing the cell to create an alternate form after all. They searched the literature for similar cases of protein isoforms with alternate start codons that could rescue a frameshift and found just three other published examples.
Thus began the search through the NCBI and ENSEMBL databases to find more, similar cases in the mouse and human genomes. “We looked for places where transcripts were misaligned by something not divisible by three,” Fahrer said. “Then we asked how many of those have a rescue start and stop codon.”
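The two questions Fahrer describes can be sketched roughly in code. This is not the authors' pipeline, which worked on database annotations and applied further criteria. It is a minimal sketch under stated assumptions: transcripts are plain DNA strings, the splice junction position is known, and a "rescue" is any open reading frame with its own start and stop codon that spans that junction. The function names and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def shifts_frame(ref_len, alt_len):
    """True when the two transcript lengths differ by a non-multiple of three."""
    return (ref_len - alt_len) % 3 != 0


def orfs(seq):
    """Yield (start, end) for each ATG up to its first in-frame stop codon."""
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        for i in range(start + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                yield (start, i + 3)
                break


def rescue_orfs(ref_len, alt_seq, junction):
    """ORFs in alt_seq that span the junction, reported only for frameshifting splices."""
    if not shifts_frame(ref_len, len(alt_seq)):
        return []
    return [(s, e) for s, e in orfs(alt_seq) if s < junction < e]


# Toy alternate transcript (invented), 17 bases shorter than a 31-base reference.
alt = "CATGAAACCCTAGG"
print(rescue_orfs(31, alt, junction=6))  # [(1, 13)]
print(rescue_orfs(32, alt, junction=6))  # [] because an 18-base difference keeps the frame
```

A real screen would also need to handle insertions, multiple junctions per transcript, and a minimum protein length, none of which this sketch attempts.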
After applying additional criteria to the search, for example to ensure that the genes came from well-sequenced areas of the genome, Fahrer and her colleagues generated a list of 1849 human and 733 mouse transcripts that could encode alternate protein isoforms. Eighty percent of the transcripts were incorrectly annotated as non-protein coding in the existing databases.
“To find two thousand new proteins is pretty cool,” Fahrer said. “And chances are that some of these will be quite important biologically.” In one of the known cases, for example, the alternate isoform affects a pathway in the opposite way to the primary protein.
Because the bioinformatics analysis generated only predictions, Fahrer’s group sought proof that these proteins are actually translated. They added the predicted protein information to a mass spectrometry database and reanalyzed some published mass spectrometry experiments. Such an experiment is unlikely to turn up all possible proteins, but the team detected the presence of 26 novel isoforms. An additional 38 proteins were validated by comparing them to a recently published list of experimentally verified translation initiation sites.
Fahrer’s team has contacted the ENSEMBL database administrators to ask that the transcript annotations be updated for the newly discovered proteins. “What we hope now is that other researchers take a look to see if we have found a new isoform of their favorite protein; these are now available for anyone to work on,” said Fahrer.
1. Wilson, L.O.W., Spriggs, A., Taylor, J.M., Fahrer, A.M. (2014). A novel splicing outcome reveals more than 2000 new mammalian protein isoforms. Bioinformatics 30 (2): 151-156.