Confidentiality and integrity are two of the most crucial data privacy goals today. Even as cryptographic mechanisms keep improving, security attacks, both active and passive, continue to grow in step. For a statistical database, such as a big data store, the challenge is to publish useful aggregate information, whether or not it is individually specific, while keeping the underlying personally identifiable information (PII) concealed. This is where differential privacy comes to the rescue.
The Need for Differential Privacy
In 2006, Netflix released a data set of movie ratings by its users to support a competition, after anonymizing the PII. Even so, analysts cleverly linked the anonymized Netflix training database with auxiliary data from the public IMDb database and partly de-anonymized it.
The world is currently battling the COVID-19 pandemic. Governments everywhere are trying to trace the spread of the virus by locating affected individuals, and they are also releasing statistical data about COVID-19 patients to the public. At the same time, they must make sure that patients' PII is protected. The traditional way to do so is to anonymize the PII. But as the Netflix example shows, anonymization alone is not enough: auxiliary data and other sources of information available in the public domain can be combined with the released statistics to reverse engineer the original PII, leading to a privacy breach.
What is Differential Privacy (DP)?
Differential Privacy is a mathematical framework that provides privacy guarantees for statistical databases. A statistical database, in this sense, is any database that provides large-scale information about a population without revealing individual-specific information. Sensitive data in a differentially private release is protected in a way that resists potential third-party privacy attacks; in other words, it is very difficult to reverse engineer differentially private data. Differential privacy is already used by several organizations, including Apple, Uber, the US Census Bureau, and Microsoft.
Goals of Differential Privacy
1. To maximize the accuracy of the data while making sure that it is not compromised.
2. To eliminate potential methods that may distinguish an individual from a large dataset.
3. To ensure the protection of an individual’s PII under any circumstance.
The Mechanism Behind Differential Privacy
The conventional way of preserving data privacy is to anonymize data sets. The principal mechanism behind differential privacy, by contrast, is to shield the data set by introducing carefully tuned random noise whenever it is used for analysis. The amount of noise added is controlled by a parameter called the privacy loss parameter, represented by the Greek letter ‘ɛ’. ‘ɛ’ bounds how much any single record can affect the analysis and its output, and it therefore determines the overall privacy rendered by a differentially private study. A smaller value of ‘ɛ’ means more noise and stronger protection, hence lower privacy risk; conversely, a higher value means weaker protection and higher privacy risk. A value of ɛ equal to 0 gives complete data privacy, but the output becomes unusable. The privacy loss is independent of the size of the database, so for a fixed ɛ, the larger the database, the greater the accuracy a differentially private algorithm can achieve.
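A common concrete instance of this idea is the Laplace mechanism, which is not named in the article but is the standard way to add noise calibrated to ɛ. The following minimal Python sketch is illustrative only; the function name and example values are assumptions, not part of the original text:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Return a noisy, differentially private version of true_value.

    The Laplace mechanism draws noise with scale sensitivity / epsilon:
    smaller epsilon -> larger noise -> stronger privacy.
    """
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# A counting query ("how many patients tested positive?") has
# sensitivity 1: adding or removing one person changes it by at most 1.
true_count = 1234
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=eps)
    print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

Running this shows the trade-off described above: at ɛ = 0.1 the reported count can be off by tens, while at ɛ = 10 it is nearly exact.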
The privacy loss parameter feeds into a “privacy budget”. Each analysis performed on the data consumes part of this budget, so one can decide in advance how much budget each query is allowed to spend. This means one can precisely cap the cumulative privacy loss and stop answering queries before the data is effectively no longer anonymous, as the sketch below illustrates.
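Here is a small illustrative Python sketch of that accounting (a hypothetical helper, not part of iEDPS), using sequential composition, under which the ɛ values of successive queries on the same data add up:

```python
class PrivacyBudget:
    """Track cumulative privacy loss under sequential composition:
    the epsilons of successive queries on the same data set add up."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Refuse any query that would push total loss past the budget.
        if self.spent + epsilon > self.total:
            raise RuntimeError("Privacy budget exhausted; query refused.")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
for query_eps in (0.4, 0.4, 0.4):  # the third query would exceed 1.0
    try:
        budget.charge(query_eps)
        print(f"Answered query at eps={query_eps}; spent {budget.spent:.1f}")
    except RuntimeError as err:
        print(err)
```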
[Figure: how differential privacy works]
If a data expert runs the same differentially private query on two databases that differ in a single entry, the result is barely affected by that entry: the probability of any particular output changes by at most a small multiplicative factor. The expert therefore cannot tell, from the output alone, which of the two databases was used.
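This multiplicative factor has a precise form. The standard textbook definition of ɛ-differential privacy, added here for reference rather than taken from the original article, says that a randomized mechanism M is ɛ-differentially private if, for every pair of databases D and D′ differing in a single entry and every set S of possible outputs:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

For small ɛ, e^ɛ ≈ 1 + ɛ, so the output distributions on the two databases are nearly indistinguishable, which is exactly why no single entry can be singled out.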
Conclusion
Differential privacy is one of the most actively researched topics in data privacy today, and its adoption is still in its early stages. There is no doubt that it provides strong privacy guarantees for one’s data. Its limitation is that, for high dimensional data, providing stronger privacy requires adding more noise, which can render the data of little analytic value. Even so, compared with traditional data privacy techniques, it is a far better approach to protecting high dimensional personal data against privacy breaches. We, at iEDPS, have been analyzing the market, where our customers require data sets that are reconstruction resistant. This is one area where we feel the iEDPS Differential Privacy Module will add a lot of value in solving these business use cases.
Author: Jessmol Paul