
05/03/2011
Julie Manoharan

As the proteomics community warms up to sharing large and complex datasets, databases are struggling to keep up with the vast amounts of information. Julie Manoharan investigates one of these systems, now on the brink of failure but with a new plan for survival.


Ten years ago, proteomics researchers waited weeks for datasets from collaborators. When these data finally arrived, it was not by email or server sharing. It was by snail mail, in the form of hard drives filled with gigabytes of proteomics information that couldn’t be squeezed into an email attachment or pushed to an FTP server. So proteomics researchers played the waiting game, relying on the postman to deliver the data. The sooner, the better.

Phil Andrews heads the Proteome Commons team. Credit: Martin Vloet, University of Michigan Photo Services

Phil Andrews, professor of biological chemistry at the University of Michigan, remembers how frustrating those days were. “We were recognizing that these datasets could be valuable beyond our own individual experiments and could be reused for different purposes.” But these datasets were hard to come by: researchers in the field did not want to share data, arguing that the complexity and sheer size of proteomics datasets made them incompatible with online databases.

So, along with Jayson Falkner, his doctoral student specializing in bioinformatics, Andrews decided to build an open and accessible database for proteomics researchers. This eventually became Proteome Commons. But the database’s success has caused it to become its own worst enemy, as it is now inundated with information and sitting on the verge of failure.

Met with resistance

Ten years ago, although researchers were publishing new and exciting proteomics findings, other labs were struggling to replicate those early experiments. Scientists were publishing neither their methods nor the data supporting their published studies. There simply wasn’t anywhere to publish this information.

“Most of the journals bought into this, and the more the journals bought into this, the more early researchers in this could get away with whatever they wanted,” says Falkner. “They could publish a huge dataset or only the protocol, so no one could reproduce it. There was no way to say if it was accurate.”

This irritated Falkner and Andrews. Allowing members of the research community to hide their methods and underlying data was an affront to the scientific method. But if a system existed that let researchers easily deposit their datasets, scientists would no longer have an easy out.

But proteomics data are incredibly complicated. Proteomes—the entire catalog of proteins expressed by a cell—are structures that change constantly in time and space. The methods used to map them vary widely from one researcher to the next, even when they are searching for similar functional elements. Building a functional aggregator would be difficult.

For example, proteomics methods based on mass spectrometry (MS)—which measures the mass of ionized protein fragments for protein identification and quantification—produce raw files that contain tandem mass spectra. These represent a “fingerprint of a fragmented peptide,” says Dave Tabb, an MS bioinformaticist at the Vanderbilt-Ingram Cancer Center in Nashville, TN. These raw data files must then be converted into multiple formats dictated by instrument vendors, database search engines, or the journal that will publish the study. Each one of these steps, from ion conversion to format conversion, can vary according to the specific needs of the study. Replication by outside labs is virtually impossible without access to specific methods data for each step.
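As a rough illustration of the pipeline Tabb describes, the sketch below converts a vendor raw file to the open mzML format with ProteoWizard’s msconvert tool and then reads the resulting tandem spectra with the pyteomics library. The file names, the output directory, and the choice of these particular tools are assumptions made for the example, not steps taken from the article.

```python
# Illustrative sketch: convert a vendor raw file to mzML and inspect its spectra.
# Assumes ProteoWizard's msconvert is on the PATH and pyteomics is installed
# (pip install pyteomics); "sample.raw" is a hypothetical input file.
import subprocess
from pyteomics import mzml

# Step 1: vendor raw file -> open mzML format, one of several conversions
# a dataset may go through before deposition or publication.
subprocess.run(["msconvert", "sample.raw", "--mzML", "-o", "converted"], check=True)

# Step 2: read the tandem mass spectra, the "fingerprints" of fragmented peptides.
with mzml.read("converted/sample.mzML") as spectra:
    for spectrum in spectra:
        if spectrum.get("ms level") == 2:  # keep only MS/MS scans
            mz = spectrum["m/z array"]
            intensity = spectrum["intensity array"]
            print(spectrum["id"], len(mz), "peaks")
```

Every choice in such a chain, from the conversion settings to the downstream search engine, is something an outside lab would need to know in order to reproduce the result.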

Despite what many researchers were calling insurmountable challenges, Andrews believed proteomics data sharing was essential to the maturation of the field. “The big step we took was to solve this technical problem,” says Falkner, who envisioned a peer-to-peer system that would let researchers deposit data in any format and that would emphasize the importance of data sharing. “It’s not actually a hard problem. Then we move the discussion to say that people don’t have the excuse of not being able to share data.”

The proteome network

Proteome Commons' success has caused it to become its own worst enemy, as it is now inundated with information and sitting on the verge of failure. Source: Proteome Commons

In 2005, as part of his doctoral work, Falkner created Tranche, a secure, open-source file storage and dissemination system that could handle large files and required only Java 1.5 or higher. By design, the system would allow proteomics researchers to share large datasets with as few constraints on the data as possible.

The key to the system was that it accepted data in any format, sparing researchers yet another conversion step. Furthermore, only minimal information was required to deposit data: who the submitters were, where they were located, their research focus, and the time of submission. Beyond that, scientists were encouraged to share as much as possible, but could share whatever, and with whomever, they wanted. Any Tranche member could add a server as well as add or remove their own data. Once data was added to a server, members could choose with whom they wanted to share it.
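For a sense of how little a submission had to include, here is a hypothetical sketch of a minimal deposit record; the DatasetSubmission class and its field names are invented for illustration and are not part of the actual Tranche interface.

```python
# Hypothetical illustration of a minimal submission record in the spirit of
# Tranche's requirements; the class and field names are invented for this
# sketch, not taken from the real system.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class DatasetSubmission:
    submitter: str                       # who is depositing the data
    institution: str                     # where they are located
    research_focus: str                  # what the dataset is about
    submitted_at: datetime               # when it was deposited
    files: List[str] = field(default_factory=list)        # any format is accepted
    shared_with: List[str] = field(default_factory=list)  # empty = private to submitter

submission = DatasetSubmission(
    submitter="J. Researcher",
    institution="Example University",
    research_focus="Membrane proteome, LC-MS/MS",
    submitted_at=datetime.now(timezone.utc),
    files=["run01.raw", "run01.mzML", "search_results.xml"],
    shared_with=["collaborator@example.edu"],
)
```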

In 2006, the first functional Tranche file storage and dissemination system, called Proteome Commons, was launched. The site is a project management tool that functions much like a social networking website. Users register, sign into the system, create a new project, and then invite colleagues and collaborators to join it. Invitees are given permissions and responsibilities by the project’s creator.

When researchers saw that Proteome Commons was working, they set up servers of their own and gave Proteome Commons access to them as a user. Those servers could then be linked to Tranche, allowing massive data sharing with relative ease.

Long-time users sign in and see a list of the projects they are associated with. They can click on each and, depending on their permissions, download partial or entire datasets from Tranche, where the data are permanently stored. Collaborators’ names are associated with their responsibilities, so all members of a project can track each other’s progress.
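The permission-gated workflow described above might look roughly like the following sketch; the ProjectClient class, its methods, and the permission levels are invented for illustration and do not correspond to the real Proteome Commons or Tranche code.

```python
# Invented sketch of a permission-gated project listing and download flow.
class ProjectClient:
    def __init__(self, user, projects):
        self.user = user
        self.projects = projects  # {name: {"permission": ..., "files": [...]}}

    def list_projects(self):
        return sorted(self.projects)

    def download(self, project_name, files=None):
        project = self.projects[project_name]
        if project["permission"] == "none":
            raise PermissionError(f"{self.user} may not download from {project_name}")
        available = project["files"]
        if project["permission"] == "partial":
            # in this sketch, partial access excludes raw instrument files
            available = [f for f in available if not f.endswith(".raw")]
        requested = files if files is not None else available
        return [f for f in requested if f in available]

client = ProjectClient("collaborator@example.edu", {
    "membrane-proteome": {"permission": "full", "files": ["run01.raw", "results.csv"]},
    "plasma-study": {"permission": "partial", "files": ["run02.raw", "summary.txt"]},
})
print(client.list_projects())
print(client.download("plasma-study"))  # partial permission filters out raw files
```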

“We take advantage of the shame factor,” says Andrews, who currently leads the project. “Everybody knows if you’re keeping up with the work.”

When a project is finished and ready for publication, the principal investigator can choose to make the dataset publicly available simply by checking a box. A new Proteome Commons website is generated for the completed project that includes the associated annotations, the manuscript, the underlying dataset, contact information for the original principal investigator, and any other information the group wants to include.

“It’s very usable,” says Tabb. “Tranche and Proteome Commons are willing to store the raw data, and that’s the currency of the field.”

While a minority of proteomics researchers continued to withhold data, most were thrilled with the promise of Proteome Commons. “The complexity of the datasets makes data sharing challenging, but it’s also part of the argument for why we need to share,” says Andrews. “None of us makes use of all the information that’s in one of these datasets, but that dataset could be extremely useful to other researchers who could reuse it.”

But even with Proteome Commons and similar proteomics databases, Falkner believes that about half of the researchers in the field still withhold data. In February 2003, the National Institutes of Health (NIH) published its Final Statement on Sharing Research Data, requiring all projects that receive more than $500,000 from the agency to share the resulting data. But the NIH requirements can be manipulated, says Falkner, allowing researchers to share only the bare minimum with the community.

Growing pains

Without enough funding for its own hardware and facilities, Proteome Commons relies on educational institutions to provide the server space its datasets require. This was envisioned as a temporary arrangement until a dedicated storage system could be built. But the funding never materialized, and that new system hasn’t been built.

Now, the system is experiencing what Andrews calls “growing pains.” In the last six months, Proteome Commons has seen dramatic increases in users and data input. The network has over 1000 members, more than 10,500 datasets, and 41 affiliated user groups. The volume of data is proving difficult for the network of institutional servers to manage.

“There’s some real concern right now for Tranche and Proteome Commons’ long-term stability,” says Andrews. Support for the system has already broken down: there are not enough technicians, or enough server space, to manage the more than 15 terabytes of existing data.

“This is a bit of a crisis for people who want public datasets,” says Tabb. “We thought we had a consensus that everybody should make their data files publicly available and now the major mechanism for that appears to be going the way of the dodo.”

As more researchers become open to sharing data and more proteomics projects get under way, more technicians and server space will soon be needed to house the incoming data. And if Tranche is not to be that system, no single organization has yet taken on the job of building a new one that can accept proteomics data in such a wide variety of formats. “A year out from now, we have to have a long-term community solution for Tranche and Proteome Commons,” says Andrews.

But where that solution will come from remains unclear. Last month, the National Center for Biotechnology Information (NCBI) shut down Peptidome, the only proteomics repository funded by the US government. Although the European Bioinformatics Institute (EBI) has stepped up its support of Proteome Commons and similar struggling data-sharing efforts, that support might not be enough to save the proteomics databases.

These systems, rather than competing with each other, function as a complementary network. While Tranche stores data in any format, the EBI and the Human Proteome Organization (HUPO) focus on a standardized proteome data format. The Proteomics Identifications Database (PRIDE) emphasizes annotation within a dataset, addressing the semantic issues that Tranche intentionally set aside, and PeptideAtlas stores analyzed data for a smaller group of researchers. So PRIDE could store summaries that point to the raw data in Tranche, PeptideAtlas could lend its data to the other repositories, and HUPO could continue working on a practical way to standardize proteomics data.
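To make that division of labor concrete, here is a hypothetical sketch of how a single study might be cross-referenced across these complementary archives; the record layout, field names, and placeholder identifiers are invented for illustration, not real repository entries.

```python
# Invented illustration of a summary record in one archive pointing at raw
# data held in another; accessions and fields are placeholders only.
study_record = {
    "summary_repository": "PRIDE",
    "summary_accession": "PRIDE-ACCESSION-PLACEHOLDER",
    "annotations": {"species": "Homo sapiens", "instrument": "LTQ Orbitrap"},
    "raw_data_repository": "Tranche",
    "raw_data_reference": "TRANCHE-REFERENCE-PLACEHOLDER",
    "analyzed_data_repository": "PeptideAtlas",
}

def resolve_raw_data(record):
    """Follow the summary record's pointer to wherever the raw files live."""
    return record["raw_data_repository"], record["raw_data_reference"]

print(resolve_raw_data(study_record))
```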

“If there’s good news in the proteome-archiving world, it’s that these archives are working together because they each have something to offer that the others don’t,” says Tabb.

Joining forces

ProteomeXchange will work to integrate existing proteome repositories and analyzers into one system. Source: ProteomeXchange

To further these complementary approaches, representatives from PeptideAtlas, PRIDE, the Global Proteome Machine Database (GPMD), and Tranche began discussing a way to exchange information between their systems three years ago at a proteomics conference in Barbados. Together they started ProteomeXchange, which is largely funded by a European Union grant. The project’s goal is to create a formal, open method for data submission that combines the efforts of the existing data repositories. The group has already created an initial website that includes the online project infrastructure.

Two weeks ago, the ProteomeXchange community, of which Andrews is a part, met in Heidelberg, Germany, for the project’s kickoff meeting. There, the group outlined a clear set of deliverables for the next three years. Service is intended to commence within the next 6 to 12 months. Andrews hopes the collaboration will ease some of the current burdens on Tranche.

“The proteomics community will see benefits of this effort quite soon in the form of improved or extended data services from several of the participants that are directed toward ultimate integration of services,” he says.

But there are still some roadblocks. The ProteomeXchange organizers must create a system that is extremely easy to use; otherwise, researchers will be reluctant to take the time to share. The system must also make the deposited data easy for others to use.

“The good news is that the major concern of a few years ago—that researchers would not want to share data—has been demonstrated to be a nonissue,” says Andrews.

---

Lisa Grauer wrote about why the NCBI shut down the Peptidome proteomics database in April 2011.
