Differential privacy is a mathematical framework for analyzing aggregated data while protecting the privacy of the individuals in it. It guarantees that no single individual's data can be inferred from the released results.
With that basic definition in place, let's look at why differential privacy matters.
Machine learning algorithms need large amounts of data to train models for practical applications. These datasets often contain sensitive, private information about individuals, and if not handled carefully they can leak it. The Netflix Prize is a well-known example of such a leak. In 2006, Netflix released a training dataset of around 100 million individual movie ratings from roughly 500k users. The data were heavily anonymized, yet researchers found a way to isolate individual users by cross-referencing the ratings with publicly available data found online.
Suppose we curate a database with sensitive information such as healthcare data, and we want to release some statistics about it to the public. Assume we maintain a live dataset of people's personal health records and expose a simple public API over it. Researchers can retrieve aggregates, say the number of users and their average age, for their research, and publish what they get. One article reports a user count of Nt = 100 and an average age of At = 30. A later article, using the same dataset after one more user has joined, reports Nt+1 = 101 and At+1 = 30.03. From these two reports alone, we can derive details about that individual user.
Time t: number of users Nt = 100, average age At = 30
Time t + 1: number of users Nt+1 = 101, average age At+1 = 30.03
Now we can reverse-engineer the age of the new individual:
For t: total age = 100 × 30 = 3000
For t + 1: total age = 101 × 30.03 = 3033.03
So the age of the one new person = 3033.03 − 3000 ≈ 33 years.
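This differencing attack takes only a few lines of Python to carry out; the numbers below are the hypothetical ones from the example above:

```python
# Reconstructing one person's age from two aggregate releases.
n_t, avg_t = 100, 30.0      # report at time t
n_t1, avg_t1 = 101, 30.03   # report at time t + 1, one new user

total_t = n_t * avg_t       # total age at t: 3000.0
total_t1 = n_t1 * avg_t1    # total age at t + 1: 3033.03

# The difference of the two totals is exactly the new user's age.
new_user_age = total_t1 - total_t
print(round(new_user_age, 2))  # ~33.03
```

Neither report reveals anything on its own; it is the combination of two honest aggregate answers that exposes the individual.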
In the same way, we could extract other private attributes of individual users, such as salaries, bank balances, or even sensitive fields in enterprise data.
Differential privacy algorithms protect against this kind of attack on anonymized data by adding random noise, either locally to individual data points or globally to the query results. Doing so inevitably skews the results, and too much noise renders them nonsensical. Differential privacy is the study of striking a balance between the privacy of individuals and the utility of the results derived from the dataset.
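As a minimal sketch of the global approach, the Laplace mechanism adds noise drawn from a Laplace distribution to a query answer. The function name, the counting query, and the epsilon value here are illustrative, not from a specific library:

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism for a counting query.

    A counting query has sensitivity 1 (adding or removing one
    person changes the count by at most 1). The noise scale is
    sensitivity / epsilon: smaller epsilon means more noise and
    stronger privacy.
    """
    sensitivity = 1.0
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Each call returns a slightly different, privacy-preserving answer.
print(noisy_count(100, epsilon=0.5))
```

Because every released answer is perturbed, the differencing attack above no longer yields the new user's exact age, only a noisy estimate.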
Adding random noise alone does not fully secure privacy, however: when a large number of queries are answered, fresh noise averages out and privacy can eventually be breached.
There are a few terms in DP that should be understood as well, such as function sensitivity, privacy loss, epsilon-DP, and the privacy budget.
We will go through these concepts, try to define them mathematically with examples, and implement them for our applications. We will also look at DP in machine learning.