Combining proteomics and machine learning
Irina Armean from the European Bioinformatics Institute discusses the Gene Ontology (GO) and how machine learning can play a role in evaluating protein-protein interactions (PPIs).
Irina Armean obtained her MSc degree in Bioinformatics from the University of Applied Sciences (Hagenberg, Austria) and her PhD in Proteomics and Bioinformatics from the University of Cambridge (UK). During her PhD, Irina studied protein complexes in Drosophila melanogaster by combining affinity purification, mass spectrometry and machine learning to improve protein interaction evaluation.
Irina’s passion for understanding genomes and proteomes led her to work on the genome annotation of newly sequenced species as part of Ensembl Metazoa, before analyzing the functional and phenotypic impact of protein-truncating variants in the human genome during her post-doc at the Broad Institute (MA, USA). Irina is currently interested in improving the phenotype annotation and interpretation of DNA variants in the Ensembl Variation team at the European Bioinformatics Institute (EMBL-EBI, Cambridge, UK).
Please can you provide us with an overview of your research?
My research sits at the intersection of computer science and biology, namely bioinformatics.
The recently published work is a collaboration with Professor Kathryn Lilley, Director of the Cambridge Centre for Proteomics, and Dr Sean Holden, Senior Lecturer in Computer Science at the University of Cambridge. It presents the application of a machine learning model, GIS-MaxEnt, to a protein complex dataset obtained by affinity purification coupled with mass spectrometry (AP-MS), with the aim of quantifying how far existing annotation supports high-confidence interactions. We used the GO and InterPro annotations of individual proteins to build protein interaction attributes, which the newly defined MaxEnt-PPI tool uses to quantify that support.
Previously published methods that used GO to evaluate PPIs have tested multiple ways of sub-setting the GO annotation space, using either protein similarity or PPI similarity to infer confidence that other protein interactions are true.
In our research, we developed the MaxEnt-PPI tool, which works with the entire GO annotation set and builds the exhaustive set of PPI attributes based on the individual protein annotations. This approach has the advantage of maintaining all the information encoded in the term relationships. Additionally, the implementation of MaxEnt-PPI allows the export of the individual weights assigned to each GO term pair. This makes it stand out from other machine learning tools that are considered black boxes because they obscure the underlying numerical analysis.
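To make the idea of building PPI attributes from individual protein annotations concrete, here is a minimal sketch in Python. It is not the MaxEnt-PPI code: the term names are placeholders rather than real GO identifiers, the tiny parent map stands in for the ontology, and the choice to expand each annotation with its ancestors is an assumption about how term relationships could be carried into the attributes.

```python
from itertools import product

# Toy ontology: child term -> parent terms. Placeholder names, not real GO IDs.
TOY_PARENTS = {
    "nucleus": ["organelle"],
    "mitochondrion": ["organelle"],
    "organelle": ["cellular_component"],
}

def with_ancestors(terms, parents):
    """Expand a set of terms with every ancestor reachable in `parents`."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen

def pair_attributes(terms_a, terms_b, parents):
    """Return all unordered term pairs (one term per protein) as attributes."""
    expanded_a = with_ancestors(terms_a, parents)
    expanded_b = with_ancestors(terms_b, parents)
    return {tuple(sorted(pair)) for pair in product(expanded_a, expanded_b)}

# One hypothetical protein pair: a nuclear protein and a mitochondrial protein.
attributes = pair_attributes({"nucleus"}, {"mitochondrion"}, TOY_PARENTS)
for attribute in sorted(attributes):
    print(attribute)
```

Each protein pair is thereby described by a set of term pairs, which a model can then weight individually.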
What is the GO and how was it utilized in your research?
The GO is a controlled vocabulary used to describe gene products/proteins in terms of cellular component, molecular function and biological process.
This ontology is one of the longest-running and most successful collaborative efforts to standardize the way knowledge gets documented and annotated. Its power lies in the collaborative work of several research communities. For example, the Drosophila (via FlyBase), Saccharomyces (via SGD, the Saccharomyces Genome Database) and mouse (via MGD, the Mouse Genome Database) communities have been involved since the beginning in building, using and disseminating this set of controlled terms with agreed definitions and relationships to the other controlled terms.
Once annotated with GO, different datasets become easily comparable at the computational level. In our work, we use the GO protein annotation and the relationships between the different annotation terms to build PPI attributes/descriptors, which we use to train the MaxEnt machine learning model included in our MaxEnt-PPI tool.
To train our model, we used a previously published high-confidence yeast protein complex dataset as a positive set. We generated a corresponding negative set by randomly selecting pairs of proteins from the positive set that were not observed to interact.
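The negative set construction described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the fixed random seed and the toy protein labels are hypothetical, and the published procedure may differ in detail.

```python
import random
from itertools import combinations

def sample_negative_pairs(positive_pairs, n, seed=0):
    """Draw up to n random protein pairs that are absent from the positive set.

    Only proteins that occur in the positive set are used, so both sets
    cover the same proteins and therefore the same annotations.
    """
    positives = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted({protein for pair in positive_pairs for protein in pair})
    candidates = [
        pair for pair in combinations(proteins, 2)
        if frozenset(pair) not in positives
    ]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.sample(candidates, min(n, len(candidates)))

# Hypothetical positive set of observed interactions.
positive_pairs = [("A", "B"), ("C", "D"), ("A", "C")]
negative_pairs = sample_negative_pairs(positive_pairs, n=3)
print(negative_pairs)
```

Enumerating every candidate pair is fine for a toy example; for a proteome-scale set, rejection sampling of random pairs would avoid building the full candidate list.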
By combining the novel training set design and our new GO-based metric, we obtained a machine-learning model that is able to associate certain GO term pairs with protein pairs that are likely or unlikely to interact. We used these trained associations to evaluate other non-curated protein complex data sets.
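As a rough illustration of how a model can learn to associate term pairs with interacting or non-interacting protein pairs, the sketch below trains a binary maximum entropy model, which is equivalent to logistic regression, over term-pair attributes. Note the swap: it uses plain stochastic gradient updates rather than the generalized iterative scaling (GIS) named above, and the attribute names, training examples, learning rate and epoch count are all invented for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(attributes, weights, bias):
    """Probability that a protein pair with these attributes interacts."""
    return sigmoid(bias + sum(weights.get(a, 0.0) for a in attributes))

def train_maxent(examples, epochs=200, learning_rate=0.5):
    """Fit one weight per attribute with stochastic gradient ascent.

    examples: list of (set_of_attributes, label) with label 1 or 0.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for attributes, label in examples:
            error = label - score(attributes, weights, bias)
            bias += learning_rate * error
            for a in attributes:
                weights[a] = weights.get(a, 0.0) + learning_rate * error
    return weights, bias

# Hypothetical training data: attributes are (term, term) pairs.
examples = [
    ({("nucleus", "nucleus")}, 1),
    ({("nucleus", "nucleus"), ("dna_binding", "nucleus")}, 1),
    ({("mitochondrion", "nucleus")}, 0),
    ({("membrane", "nucleus"), ("mitochondrion", "nucleus")}, 0),
]
weights, bias = train_maxent(examples)

# Export the learned weight of every term pair, most positive first.
for attribute, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(attribute, round(weight, 2))
```

Because every attribute has its own weight, the contribution of each term pair to a score can be inspected directly.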
How are PPIs studied and how is your research assisting that?
One of the more popular methods for high-throughput protein interaction studies is AP–MS.
AP–MS involves purifying the protein of interest together with its interacting partners. The resulting purification can then be analyzed via mass spectrometry and each protein identified. This method is not without false-positive results. For example, the experimental methods usually involve breaking apart cellular compartments, so proteins that would never come into contact in a living cell now have the chance to come together and be jointly purified and identified, leading to a false-positive result.
Our research uses existing knowledge and annotation stored in ontologies about the proteins, such as their location, function and domains, to evaluate the likelihood of jointly identified proteins interacting in a functional, biologically meaningful context. Bringing this information into the evaluation allows the user to filter and rank the proteins in an AP–MS result based on the likelihood of a true interaction, not only based on the experimental data at hand but also in the context of the existing annotation.
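Filtering and ranking an AP–MS result by an annotation-based score could look like the short sketch below. The protein names, the scores and the 0.5 cut-off are made up for illustration and do not come from the study.

```python
def rank_preys(prey_scores, threshold=0.5):
    """Keep preys scoring at or above the threshold, highest score first."""
    kept = [(prey, s) for prey, s in prey_scores.items() if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical annotation-based scores for proteins co-purified with one bait.
prey_scores = {"preyA": 0.9, "preyB": 0.2, "preyC": 0.7}
ranked = rank_preys(prey_scores)
print(ranked)
```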
What conclusions could be drawn from your research?
Biological data sets are increasing in complexity and size quarter by quarter, making it indispensable to have an easy-to-use computational method that can analyze them in seconds or minutes. The use of ontologies and controlled vocabularies offers a good solution to this problem. Our MaxEnt-PPI metric faithfully maintains the relationships and terms used in the ontology, making it robust to changes in the underlying ontology; the more ontologies used, the better the performance.
Once the knowledge of a domain is encoded into an ontology, it can easily be transferred to an automated evaluation system allowing more time for a deeper look into the results.
Our MaxEnt-PPI moves away from being a black box and exports the individual weights of each GO term pair that contributed to the PPI score.
For the best results, we recommend the use of a species-specific training set, which can also be obtained using homology. Using a set specific to the problem at hand ensures the learned annotation space is the one that is most descriptive of the species of interest.
As our method is based on annotation, we also recommend using the same proteins from the positive set in the negative set, as this will ensure the same annotation coverage.
The MaxEnt-PPI score improves upon previous research through better training set design, improved GO-based attribute construction and the transparency of the MaxEnt model. MaxEnt-PPI was compared to SVM, MKL and go2ppi on multiple training set combinations, showing an accuracy and precision >0.90.
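For readers less familiar with the two metrics, accuracy is the fraction of all protein pairs classified correctly, and precision is the fraction of predicted interactions that are true interactions. A minimal sketch with made-up labels (not the study's data):

```python
def accuracy_and_precision(y_true, y_pred):
    """Compute accuracy and precision for binary labels (1 = interacting)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

# Hypothetical true labels and model predictions for five protein pairs.
accuracy, precision = accuracy_and_precision([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(accuracy, precision)
```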
What is the importance of your research in real-world applications?
Our approach has already been applied to FlAnnotator, a large resource of Drosophila gene expression and protein interaction data obtained by iPAC (interactomes using parallel affinity capture), allowing users to rank and evaluate the observed protein interaction lists in the context of other published annotation support.
In addition, due to its transparency in terms of internal attribute weights, MaxEnt-PPI allows for a deeper understanding of the controlled vocabulary structures and their impact on the final scores. I believe this is of crucial importance, especially when specific evaluations are of interest.
By making the code, training sets and an executable Java library with usage documentation available, we hope to support as many researchers as possible who would benefit from our tool or from the training data sets.
Where is this research headed in the future?
The study of gene and protein networks captures a large part of biological complexity, but it still presents the challenge of correctly identifying relevant protein interactions within a complex system.
Our MaxEnt-PPI plays a part in assigning confidence to individual PPI relationships; however, it does not analyze the impact of a specific PPI on a protein interaction network or pathway.
The future holds a continuing refinement of experimental methods to capture and identify true interaction partners at different time points and cellular locations. New experimental data will continue to drive the refinement of bioinformatics tools, both tools such as MaxEnt-PPI that assess individual PPIs and tools that work at the protein interaction network or pathway level, combining multiple individual scores into one.
One such example that comes to mind is the STRING database, which was recently highlighted as the most recommended in a systematic study of molecular networks for discovery of disease genes, a study published earlier this month.
Sources: Armean IM, Lilley KS, Trotter MWB, Pilkington NCV, Holden SB. Co-complex protein membership evaluation using Maximum Entropy on GO ontology and InterPro annotation. Bioinformatics doi:10.1093/bioinformatics/btx803 (2018); Huang JK, Carlin DE, Ku Yu M et al. Systematic evaluation of molecular networks for discovery of disease genes. Cell Syst. doi:10.1016/j.cels.2018.03.001 (2018).