Big data in big trouble?

Written by Tristan Free

A recent study highlights the potential for big data in healthcare to be exploited by deanonymizing individuals in legal documents, indicating the need for big data security to be improved.

As studies into new therapeutics, diagnostics and prognostics begin to embrace the potential of big data and artificial intelligence (AI) to provide researchers with identifiable trends and predictions concerning individual patients, the emphasis on data collection, storage and analysis is increasing.

Projects such as All of Us, the 100,000 Genomes Project and FinnGen are collecting huge amounts of data explicitly for storage and analysis. The potential benefits of these efforts for a vast range of conditions are countless and critical, but these new data stores bring with them a new set of complications.

A recent study from the University of Zurich (Switzerland) has shown that a combination of big data and AI can be utilized to identify individuals who are, or have been, involved in confidential legal cases. This highlights the requirement for big data security to be improved.

The study used AI to mine over 120,000 public legal records, applying an algorithm to identify connections between them. This process, described as ‘linkage’, was capable of matching identifying information from publicly available data sets with individuals mentioned in anonymized court documents.
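The study does not publish its exact algorithm, but the core idea of linkage can be sketched in a few lines: quasi-identifiers (attributes such as dates, locations and case details) that survive anonymization are matched against the same attributes in a public dataset. The field names and records below are entirely hypothetical, a minimal illustration rather than the authors' method.

```python
# Hypothetical sketch of record linkage. The datasets, field names and
# matching keys are invented for illustration; the study's real algorithm
# and data are not public in this form.

anonymized_rulings = [
    {"case_id": "A-102", "year": 2016, "canton": "Zurich", "drug": "DrugX"},
    {"case_id": "B-318", "year": 2017, "canton": "Bern", "drug": "DrugY"},
]

public_registry = [
    {"name": "Alice Example", "year": 2016, "canton": "Zurich", "drug": "DrugX"},
    {"name": "Bob Example", "year": 2018, "canton": "Basel", "drug": "DrugZ"},
]

def link(rulings, registry, keys=("year", "canton", "drug")):
    """Return (case_id, name) pairs whose quasi-identifiers all match."""
    matches = []
    for ruling in rulings:
        for person in registry:
            if all(ruling[k] == person[k] for k in keys):
                matches.append((ruling["case_id"], person["name"]))
    return matches

print(link(anonymized_rulings, public_registry))
# Case A-102 links to "Alice Example": the combination of shared
# quasi-identifiers re-identifies the supposedly anonymous record.
```

Even this toy version shows why removing names alone is not enough: as soon as the remaining attributes are rare in combination, they act as a fingerprint.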

This process successfully identified 84% of the individuals mentioned in the mined documents — all in less than one hour. Commenting on the frightening efficiency of the process, Kerstin Noëlle Vokinger, co-author on the study, stated that, “with today’s technological possibilities, anonymization is no longer guaranteed in certain areas.”

Urs Jakob Mühlematter, study co-author, noted that, “this procedure can in principle be applied to any publicly available database.”

The implications of this study are huge. For example, what now constitutes personal, identifying data? Considered alongside the large amounts of intimately personal data now being collected for medical purposes, including genome sequences, lifestyle information and health statuses, this finding exposes the concerning potential for the abuse of such information.

In the specific case of this study, the results clearly highlight the need to reassess how legal documents and data are processed, stored and protected. More strikingly, the research exposes the lack of protection available for large data banks, and indicates that governments and institutions must be proactive in creating systems and methods that improve big data security and ensure that publicly collected and available data can only be used for the purposes for which it was provided.