False-positive protein identifications lurk in proteomics. These misidentifications can spell trouble for the unwary scientist who wastes hours or days tracking down false leads. In the end, inaccurate identifications delay the translation of proteomics into clinical diagnostics and therapeutics.
“It reduces the credibility of the field as a whole,” says Morgan Giddings, an investigator and software engineer at Boise State University who develops genomic and proteomic analysis programs. “People have decided proteomics is not a very worthwhile thing to be doing, and that’s unfortunate.”
Now, to improve the reputation of proteomics data, scientific journals and proteomics organizations are raising the bar for data submission standards, forcing researchers to improve the algorithms used to identify proteins.
Raising data standards
In 2002, the newly founded Human Proteome Organization (HUPO) included proteomics standards in its initial set of initiatives. The HUPO Proteomics Standards Initiative (PSI) was founded to define community standards for proteomics data in response to data inconsistencies in published high-throughput proteomics datasets. The idea was to make it easier to compare, exchange, and validate data.
But three years later, the HUPO PSI working group still did not have a final draft of standards. During that time, high-throughput proteomics data were being published in ever-greater volumes. Without quality standards, researchers could not assess the false-positive identification rates of these studies.
“If you did just pick randomly out of a list of 500 gene products and didn’t validate before going onto some big study, then it could be costly,” says Gerard Cagney, a high-throughput protein researcher at University College Dublin who performs proteomic screens.
In 2005, the journal Molecular & Cellular Proteomics (MCP) and the American Society for Biochemistry and Molecular Biology gathered thirty leading academic researchers, instrument manufacturers, bioinformaticians, and journal representatives at the Maison de la Chimie in Paris, France for a two-day conference to discuss the quality of proteomics identification data. Within nine months, MCP implemented the Paris Report, becoming the first journal to develop submission guidelines for peptide and protein identifications. The guidelines were designed to raise data quality and to streamline the presentation of proteomic experiments performed at different institutions.
Within two years, other proteomic journals followed suit. Proteomics published guidelines for submitting proteomic data in 2006, the Journal of Proteome Research implemented its guidelines in 2007, and the Journal of Biological Chemistry added submission requirements for proteomics data, based on MCP’s standards, to its Instructions to Authors. In 2006, the HUPO PSI also released its guidelines.
“I remember reading the guidelines and thinking, ‘this is quite difficult to implement all the time,’” says Cagney. “They’ve set the bar quite high now.”
The submission guidelines require researchers to provide a detailed methods description, the names and versions of the programs used for database searching, the names and versions of the sequence databases used, mass spectrometry (MS) data interpretation methods, and a statistical analysis to validate the results. This statistical analysis provides an upfront false-positive rate. According to Cagney, journals like to see a false-positive rate of 1% or less in submitted data.
“If I publish a list of a thousand proteins, and I make a statement saying it’s 1% false-positive, then it’s quite possible that there would be 5 or 10 mistakes in it,” says Cagney. Although most proteomics journals require such validation, reported false-positive rates often range from below 1% up to 3%.
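The arithmetic behind Cagney's example can be sketched in a few lines. This is only an illustration: the function name `expected_false_positives` is hypothetical, and the stated rate is treated as an average, so the actual number of mistakes in any one list may be higher or lower.

```python
def expected_false_positives(n_identifications: int, false_positive_rate: float) -> float:
    """Expected number of wrong entries in a list with a stated false-positive rate."""
    if n_identifications < 0:
        raise ValueError("n_identifications must be non-negative")
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return n_identifications * false_positive_rate


# The range of rates mentioned in the article, applied to a list of 1,000 proteins.
for rate in (0.005, 0.01, 0.03):
    expected = expected_false_positives(1000, rate)
    print(f"{rate:.1%} of 1000 proteins: about {expected:.0f} expected false positives")
```

At the upper end of the reported range, the same thousand-protein list would be expected to carry roughly thirty misidentifications.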
The journals also usually insist that researchers use two or three unique peptides to identify each protein. “If you don’t have that, they want to see the actual mass spectrum,” says Cagney. Single peptide-based identifications require additional materials, including the mass spectrum, peptide sequence, modifications, and additional statistical analysis.
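A rule like this is easy to apply as a filter on search results. The sketch below is a hypothetical illustration, not any journal's actual procedure: the function name, the input format, and the example identifiers are all assumptions, and "unique" is simplified to mean distinct peptide sequences per protein.

```python
from collections import defaultdict


def split_by_peptide_support(peptide_hits, min_unique_peptides=2):
    """Separate proteins with enough distinct peptides from those needing extra evidence.

    peptide_hits: iterable of (protein_id, peptide_sequence) pairs (assumed format).
    Returns (accepted, needs_evidence) as sorted lists of protein ids.
    """
    peptides_by_protein = defaultdict(set)
    for protein_id, peptide in peptide_hits:
        peptides_by_protein[protein_id].add(peptide)  # sets ignore repeat observations

    accepted, needs_evidence = [], []
    for protein_id, peptides in sorted(peptides_by_protein.items()):
        if len(peptides) >= min_unique_peptides:
            accepted.append(protein_id)
        else:
            needs_evidence.append(protein_id)
    return accepted, needs_evidence


# Made-up example data: PROT_B is seen twice, but only via one distinct peptide.
hits = [
    ("PROT_A", "LVNELTEFAK"),
    ("PROT_A", "YLYEIAR"),
    ("PROT_B", "AEFVEVTK"),
    ("PROT_B", "AEFVEVTK"),
]
accepted, needs_evidence = split_by_peptide_support(hits)
print("accepted:", accepted)
print("needs spectrum and extra statistics:", needs_evidence)
```

Proteins in the second list are the single-peptide identifications for which journals would ask to see the supporting spectrum.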
To meet these quality standards and reduce uncertainty in large proteomics data sets, researchers are looking at the two possible ways to improve MS-based protein identifications: instruments and informatics.
Reaching resolution limits
Identifying every protein in a sample is no small task. While the human genome contains about 20,300 protein-coding genes, the human proteome could easily contain over one million protein variants. Proteins exist as splice variants and can undergo post-translational modifications, and sequences can also share similarities and be duplicated within and across species, all of which complicates protein identification.
Mass spectrometry is the primary method used to identify proteins in biological samples. A tandem mass spectrometer breaks a protein into ionized peptide fragments, each of which produces a characteristic mass spectrum. Researchers then use these data to identify the protein, using a search engine to match each observed spectrum to theoretical spectra derived from sequence information in genomic databases.
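The matching step can be illustrated with a toy example. The sketch below is not how any particular search engine scores matches; it simply counts how many theoretical fragment masses of each candidate peptide appear in an observed peak list. The function names, the 0.02 tolerance, and the candidate peptides are assumptions chosen for illustration; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the amino acids used in this example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "P": 97.05276, "T": 101.04768,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "E": 129.04259,
}
WATER = 18.01056
PROTON = 1.00728


def theoretical_fragments(peptide):
    """Singly charged b- and y-ion m/z values predicted for a peptide sequence."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    fragments = []
    for i in range(1, len(masses)):
        fragments.append(sum(masses[:i]) + PROTON)          # b ion (N-terminal piece)
        fragments.append(sum(masses[i:]) + WATER + PROTON)  # y ion (C-terminal piece)
    return sorted(fragments)


def shared_peak_score(observed_mz, peptide, tolerance=0.02):
    """Fraction of a peptide's theoretical fragments found among the observed peaks."""
    theoretical = theoretical_fragments(peptide)
    matched = sum(
        any(abs(obs - theo) <= tolerance for obs in observed_mz)
        for theo in theoretical
    )
    return matched / len(theoretical)


def best_match(observed_mz, candidates, tolerance=0.02):
    """Return the candidate peptide whose theoretical spectrum fits best."""
    return max(candidates, key=lambda pep: shared_peak_score(observed_mz, pep, tolerance))


# Simulate an imperfect observed spectrum: PEPTIDE with every third peak missing.
observed = [mz for i, mz in enumerate(theoretical_fragments("PEPTIDE")) if i % 3 != 0]
candidates = ["PEPTIDA", "GELATIN", "PEPTIDE"]
for pep in candidates:
    print(pep, round(shared_peak_score(observed, pep), 2))
print("best match:", best_match(observed, candidates))
```

Note that a close relative such as PEPTIDA still matches some peaks, which hints at why a top-ranked hit is not automatically a correct one.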
During the 1980s, researchers could identify proteins in relatively simple mixtures, samples containing only a few unknown proteins. The low resolution of early MS instruments and techniques provided myriad possible protein candidates, easily leading to false-positive identifications.
But over the last two decades, instrument manufacturers have rapidly advanced MS technology. New mass spectrometers provide increased resolution and accuracy, allowing them to differentiate between peptides with minute differences in mass. Now, researchers can capture information about proteins in complex mixtures, using mass spectrometers that can resolve the millions of molecules that make up a cell’s proteome.
Higher resolution and increased accuracy can provide better protein identifications. Although other MS techniques exist that provide extremely high resolution—such as Fourier transform MS with a mass accuracy down to 1 ppm—these instruments are more expensive and technically challenging.
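To give a sense of what a mass accuracy of 1 ppm means, the short sketch below converts between parts per million and absolute mass error. The function names and the example m/z value are illustrative assumptions.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Mass measurement error expressed in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def mass_window(theoretical_mz: float, tolerance_ppm: float) -> tuple:
    """Lower and upper m/z bounds that fall within a given ppm tolerance."""
    delta = theoretical_mz * tolerance_ppm / 1e6
    return theoretical_mz - delta, theoretical_mz + delta


# At m/z 1000, a 1 ppm tolerance is a window of only +/- 0.001.
low, high = mass_window(1000.0, 1.0)
print(f"1 ppm window at m/z 1000: {low:.4f} to {high:.4f}")
print(f"error of a peak observed at 1000.0005: {ppm_error(1000.0005, 1000.0):.2f} ppm")
```

The narrower this window, the fewer candidate peptides fall inside it, which is why higher accuracy tends to reduce ambiguous matches.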
And the outlook for increased resolution and accuracy in traditional tandem MS instruments is not so bright. “In the future, the improvements will not be as fast,” says Cagney. Mass spectrometers may be about as advanced as they will get for the foreseeable future. That is a credit to the engineers who have spent the past two decades advancing the techniques, but also a sign that proteomics needs a new tool.
Improving spectrum match-making
In 1994, John Yates and his associates published a paper in the Journal of the American Society for Mass Spectrometry describing SEQUEST, an MS data analysis program that automated protein identification. With this program, researchers could plug in the raw data from their MS experiment and get a list of proteins in return. The program analyzes the observed mass spectrum, determines the amino acid sequence, and compares it with sequence information in genomics databases.
But this automated data analysis program has brought new challenges to the proteomics field. Proteomic researchers—interested in the biological questions—do not necessarily write the code; instead, they rely upon computer scientists, computational biologists, and software engineers. Because of this, most researchers are not aware of the software’s potential—and conversely, its limitations.
One limitation of the software is that it relies upon data in genomic databases. If the unknown protein in a sample has no corresponding sequence information in genomic databases, no result will be returned. While global sequencing efforts are flooding databases with genome sequences, researchers are nowhere near identifying, let alone sequencing, the million-plus potential proteins in the human proteome. Poor-quality sequence information in these databases could also increase false-positive rates for the unwary researcher.
To keep up with the increasing complexity of sample mixtures and improve the automated protein identification programs, researchers like Giddings and Cagney are refining the artificial intelligence of these search engines.
At Giddings’ lab, researchers have designed improved search engine and validation techniques including genome-based peptide fingerprint scanning (GFS) and a hidden Markov model (HMM). “[HMM] improved upon what many other search engines were doing by better recognizing how peptides fragment in a mass spectrometer and using the recognition of those patterns to better match to a particular peptide sequence,” says Giddings.
Not taking matches at face value
By relying on the automated features of the search engines, researchers can prematurely conclude that the first match is a positive one, says Giddings. In reality, hits can arise for many reasons. Programs that validate spectral interpretations are designed to help researchers identify these potential false positives in the results.
Developed by Alexey Nesvizhskii and Ruedi Aebersold at the Institute for Systems Biology, ProteinProphet is one system that provides automatic validation of protein identities made by database search programs. ProteinProphet scrutinizes two aspects of the search program to calculate a potential false-positive rate for the data set. First, it looks at proteins identified only by one peptide fragment—rather than multiple peptide fragments—in the sample. Second, it looks at peptides that are not unique to a specific database protein entry. Using this analysis, it calculates a probability of false-positives. This probability can help guide others using the data set for their own research.
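The intuition behind this kind of validation can be sketched with a simplified calculation. The code below is not ProteinProphet's actual model: it assumes each peptide identification comes with an independent probability of being correct, ignores peptides shared between proteins, and uses hypothetical function names, example values, and a 0.9 acceptance threshold.

```python
from math import prod


def protein_probability(peptide_probs):
    """Probability that at least one supporting peptide identification is correct."""
    return 1.0 - prod(1.0 - p for p in peptide_probs)


def estimated_false_positive_rate(protein_probs, threshold=0.9):
    """Average chance of being wrong among proteins accepted at the threshold."""
    accepted = [p for p in protein_probs if p >= threshold]
    if not accepted:
        return 0.0
    return sum(1.0 - p for p in accepted) / len(accepted)


# Made-up peptide probabilities: one protein per list entry.
evidence = {
    "single_peptide_protein": [0.90],
    "two_peptide_protein": [0.90, 0.90],
    "weak_protein": [0.40, 0.30],
}
protein_probs = {name: protein_probability(probs) for name, probs in evidence.items()}
for name, prob in protein_probs.items():
    print(f"{name}: {prob:.2f}")
print(f"estimated false-positive rate: "
      f"{estimated_false_positive_rate(protein_probs.values()):.3f}")
```

Under these assumptions, a protein backed by two good peptides ends up with more confidence than one backed by a single peptide of the same quality, which mirrors the extra scrutiny that single-peptide identifications receive.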
At Cagney’s lab, his team is working to improve validation programs for high-throughput proteomics data sets. As the genomic databases have grown, so has the computing power needed for these high-throughput search programs. Limiting these searches has decreased the quality of the resulting data, so Cagney and his team developed an application called msmsEval to statistically model the spectra interpretations.
Despite their efforts, Cagney believes that the end-users of search programs should become more familiar with the underlying algorithms used by these programs. No algorithm is perfect, so these researchers will have an advantage with their data interpretation if they understand how their results were derived.
Proteomic researchers should be able to answer key questions, according to Giddings. “What are you actually getting out of a search engine?” she says. “When can you believe it? When do you need to validate?”
Amidst all the advances in instrumentation resolution, accuracy, and search and validation programs, there may only be one true way to restore confidence in high-throughput proteomics. “What you really need are the raw files,” says Cagney. With these raw spectra, other researchers can make their own assessment of another group’s data. Researchers could even make new discoveries using a mass spectrum that was published over a decade ago. But the size of these raw data presents a new challenge.
“The raw data are very information dense,” says Giddings. “Somebody’s going to have to pay for this storage.”