
Universal Data Access: The First Step Towards Remote Data Science

Pratyush Patodia
In 2018, a joint research project between the University of Central Florida and the City of Orlando used real-time traffic data to find strategies to help reduce car crashes and eventually improve road safety in the city.

Despite its small scale, the pilot project gave researchers a deeper understanding of multiple traffic variables in real time and helped them uncover strategies to reduce crashes or at least minimize fatalities.

What if that dataset could be expanded?

With 280 million vehicles and 227.5 million drivers, the United States is one of the busiest nations in terms of traffic. Unsurprisingly, a worrying number of accidents are registered every year across the country’s roads. In 2018 alone, the number of vehicles involved in crashes stood at a disturbing 12 million.

What if nationwide traffic data could be collected and used for analysis to discover strategies that would improve road safety in the United States?

You could study the information about every traffic accident and uncover reasons – from road quality to treacherous terrains to driver skills – that are most likely to lead to an accident. Then, imagine the number of lives saved by taking small preventive measures!

However, the availability of and access to such large datasets are enormous challenges. Even when such data exists, it sits in silos across government agencies, NGOs, private corporations, and other institutions. Then there is the matter of data privacy, a valid and sensitive issue.

Now, what if the universe of such datasets itself could be expanded?

To all these seemingly insurmountable challenges, blockchain technology offers a solution. It can even address the twin issues of privacy and interoperability.

Blockchain: Decentralizing Big Data

Centralized machine learning systems create a serious privacy problem. When such a system fails or comes under attack, the privacy of the data it handles is threatened. Blockchain technology offers a countermeasure: because the ledger is distributed, there is no single centralized system that can be hacked or physically damaged.

In fact, for the first time in history, individuals have access to a transparent and decentralized system that lets them decide what information is shared and what is not.

Blockchains are nothing but databases – immutable and shared. Modern technology developments also present an exciting new chapter in blockchain technology.
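To make "immutable and shared" a little more concrete, here is a minimal sketch of the hash-chaining idea behind a blockchain's immutability. It is an illustration only: the function names (block_hash, append_block, is_valid) and the sample records are hypothetical, and a real blockchain also involves consensus and replication across many nodes, which this sketch leaves out.

```python
import hashlib
import json


def block_hash(index, data, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, data):
    """Append a record; each block commits to the hash of the one before it."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    chain.append(
        {
            "index": index,
            "data": data,
            "prev_hash": prev_hash,
            "hash": block_hash(index, data, prev_hash),
        }
    )


def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i > 0 else "0" * 64
        if block["prev_hash"] != expected_prev:
            return False
        recomputed = block_hash(block["index"], block["data"], block["prev_hash"])
        if block["hash"] != recomputed:
            return False
    return True


# Hypothetical sample records, for illustration only.
ledger = []
append_block(ledger, {"crash_id": 1, "city": "Orlando"})
append_block(ledger, {"crash_id": 2, "city": "Tampa"})
print(is_valid(ledger))  # True

ledger[0]["data"]["city"] = "Miami"  # tamper with an earlier record
print(is_valid(ledger))  # False
```

Because every block stores the hash of its predecessor, quietly editing an old record breaks the chain, and anyone holding a copy can detect it.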

Blockchains are now scalable, which makes them useful in big data environments. This opens the door for shared data control, easier-to-track trails, and, most importantly, universal data exchange.

Successful universal data exchange needs the data to be trustworthy, auditable, secure, and usable. By its very nature, blockchain technology meets these demands. It introduces interoperability, visibility, privacy, and protection to the data exchange process. Blockchain applications are already used for data exchange in healthcare, science, and other disciplines.

The issues of privacy and security of data also exist within the blockchain. But they can be addressed effectively with differential privacy.

The Elegance of Differential Privacy

Latanya Sweeney of Carnegie Mellon University published a paper titled “Simple Demographics Often Identify People Uniquely.” It showed that 87% of Americans could be uniquely identified merely based on their 5-digit ZIP code, gender, and date of birth. Essentially, it means that if a dataset contains these three data points, it cannot be considered anonymous.
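A rough way to see this effect is to count how many records in a table carry a combination of ZIP code, gender, and date of birth that nobody else shares. The sketch below does that on a handful of made-up rows. The data and the function name unique_fraction are hypothetical, and this is not the methodology of the paper, just the underlying idea.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, gender, date_of_birth). No names at all.
records = [
    ("32801", "F", "1985-03-14"),
    ("32801", "F", "1985-03-14"),
    ("32801", "M", "1990-07-02"),
    ("32803", "F", "1972-11-30"),
    ("32803", "M", "1990-07-02"),
]


def unique_fraction(rows):
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)


print(unique_fraction(records))  # 0.6
```

Even without names, three of these five rows point to exactly one person, so anyone who knows those three attributes about that person can pick out their record.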

Data and computer scientists have repeatedly demonstrated this too. For example, the 2006 Netflix Prize competition asked competitors to create a predictive algorithm to determine how someone would rate a movie on a star-based system. Netflix released a supposedly anonymized dataset covering about 480,000 users and 100 million ratings for roughly 17,000 movies. To support anonymity, the company removed usernames and altered some ratings. However, in 2008, scientists from the University of Texas published a paper titled “Robust De-anonymization of Large Sparse Datasets.” These researchers identified some of the people in the Netflix Prize dataset by cross-referencing it with publicly available data on IMDb. There are many examples of such deanonymization of seemingly anonymous data.
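The core of such a linkage attack can be sketched in a few lines: take a public profile and find the "anonymous" record that overlaps with it the most. Everything below is hypothetical toy data, and best_match is a deliberately simplified stand-in. The published attack uses a more careful similarity score that tolerates small errors in ratings and dates.

```python
# Hypothetical toy data: "anonymized" ratings keyed by an opaque ID,
# and public reviews keyed by a username.
anonymized = {
    "user_17": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4)},
    "user_42": {("Movie A", 3), ("Movie D", 1), ("Movie E", 5)},
}
public_reviews = {
    "alice_on_imdb": {("Movie A", 5), ("Movie C", 4), ("Movie F", 3)},
    "bob_on_imdb": {("Movie D", 1), ("Movie E", 5)},
}


def best_match(target_ratings, candidates):
    """Return the candidate ID sharing the most (movie, rating) pairs with the target."""
    return max(candidates, key=lambda cid: len(target_ratings & candidates[cid]))


links = {
    name: best_match(ratings, anonymized)
    for name, ratings in public_reviews.items()
}
print(links)  # {'alice_on_imdb': 'user_17', 'bob_on_imdb': 'user_42'}
```

Once a public username is linked to an opaque ID, every other rating stored under that ID is exposed as well, including ones the person never posted publicly.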

Differential privacy can counter such deanonymization, enabling scientists to leap towards safer access to universal datasets.

Understanding Differential Privacy

At its heart, differential privacy adds noise to a dataset in a way that does not compromise the data’s usability but makes it harder to recover any original data point. This situation is analogous to radio signals. If you do not tune your radio to a station's precise frequency, the reception will be imperfect, but you can still receive the message (music, news, or something else), albeit in a slightly distorted form.

Differential privacy involves adding carefully measured noise and altering unique identifiers within entire datasets. Since the way the noise is added is known, you can compensate for it in your analyses to uncover the genuine insights captured within the original dataset. The noise is typically drawn from a Laplace distribution, which spreads the data over a broader range and improves anonymity.
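As a minimal sketch of the idea, the snippet below releases a count with Laplace noise added, which is the textbook Laplace mechanism. A few assumptions to flag: epsilon (the privacy budget) and sensitivity are standard differential privacy terms that this article does not define, the vehicle speeds are made up, and the helper names laplace_noise and private_count are hypothetical. It uses only the Python standard library and is not production-grade.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity of a counting query is 1.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
speeds = [rng.uniform(20, 80) for _ in range(10_000)]  # hypothetical speeds in mph

true_count = sum(1 for s in speeds if s > 65)
noisy_count = private_count(speeds, lambda s: s > 65, epsilon=0.5, rng=rng)
print(true_count, round(noisy_count, 1))
```

The noise has zero mean, so it barely moves an aggregate over thousands of records, yet it masks whether any single vehicle is in the data. A smaller epsilon means more noise and stronger privacy, and a larger epsilon means the opposite.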

Companies like Google and Apple have already started offering differential privacy with their new operating systems.

A Step Towards the Future

This is the era of knowledge, and blockchain technology offers a unique and unprecedented opportunity to share that knowledge with all stakeholders and amplify its benefits. However, in the pursuit of knowledge, it is crucial to consider privacy issues that can have a long-term impact on individuals, countries, and the future of blockchain technology itself. Applying the principles of blockchain and differential privacy to machine learning supplies a potent solution and shows a way forward. Combined with techniques such as federated machine learning (we wrote about that here), it would allow the benefits to cascade while addressing emergent concerns.

If you are interested in the details and techniques of remote data science via differential privacy-infused datasets, you can also check out OpenMined, a leading organization championing and developing this effort.
