All posts by sharonsto

Big Data – Midterm Review

Midterm review
Last week, we presented our goals and questions for the semester at the DPSI Midterm review event. It was great to hear about the focus of other groups – some of them grappling with very different questions from ours, and others working on surprisingly similar things. We exchanged thoughts and ideas with other members of the community.
A puzzle to solve
To help envision the problems we are dealing with, we shared a few examples of de-identification problems and their implications:
(1) EdX and Completion rate
When researchers began analyzing the completion rate of EdX courses, they noticed that the anonymized dataset produced very different statistics from the original dataset: the completion rate dropped significantly after anonymization. Digging into this, it became evident that many of the people who actually completed courses had been dropped from the anonymized dataset. The characteristics of a person who signed up for a course once and never went back to the page are drastically different from those of a person who signed up, watched every lecture, and did every problem set. Because active learners carried so much identifying information, their records were frequently dropped, even though these individuals were much more likely to finish a course. Analysis on the anonymized dataset was therefore useless.
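The effect can be illustrated with a toy simulation. This is entirely synthetic data and a crude suppression rule of our own invention, not the actual EdX pipeline: if records that are unique on their quasi-identifiers are suppressed, the completers, who are exactly the richly identifying records, vanish, and the measured completion rate collapses.

```python
# Hypothetical sketch (synthetic data, not the actual EdX pipeline):
# suppressing records that are unique on quasi-identifiers biases
# the completion rate downward.
from collections import Counter

# Records: (quasi-identifier profile, completed_course). Casual
# visitors share an identical profile; active completers accumulate
# distinguishing attributes, so each profile is unique.
records = (
    [(("signed_up_only",), False)] * 50
    + [((f"active_profile_{i}",), True) for i in range(10)]
)

def suppress_unique(rows, k=2):
    """Drop rows whose quasi-identifier appears fewer than k times
    (a crude k-anonymity-style suppression rule)."""
    counts = Counter(qi for qi, _ in rows)
    return [(qi, done) for qi, done in rows if counts[qi] >= k]

def completion_rate(rows):
    return sum(done for _, done in rows) / len(rows)

original = completion_rate(records)                      # 10/60, about 0.17
anonymized = completion_rate(suppress_unique(records))   # 0/50 = 0.0
```

Every unique (completer) profile falls below the threshold and is dropped, so the anonymized completion rate is zero even though one in six users finished the course.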
(2) Google Ads and user behavior
While interning at Google this summer, Olivia, a member of our Big Data group, ran into a peculiar problem. In her role as a data scientist, she was trying to understand whether people who saw Google Ads were more likely to run searches on ad-related queries. Since Olivia was an intern and was not allowed to see users' individual search information, she received a dataset in aggregated form, which summed up interactions by user. When she ran the analysis she saw some strange results – it seemed that people who saw ads were somehow less likely to perform ad-related queries. Finding the results suspicious, she raised her concerns, and her supervisor ran the analysis on the original dataset. The results were radically different and, as expected, showed that users who saw ads were much more likely to run ad-related queries. Why did this happen? Users who saw ads for a few seconds behaved very differently from users who watched ads for a few minutes, but that richness of the data disappeared in aggregated form: you could no longer distinguish between a user who saw many ads for a second each and a user who watched one ad for a minute. This drastically changed the results and rendered the aggregated dataset useless for the purpose.
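A minimal sketch of the aggregation problem, on made-up numbers rather than any real Google data: once watch time is summed per user, a skimmer and an engaged viewer become identical rows.

```python
# Hypothetical sketch (synthetic events, not Google's data): summing
# ad exposure per user erases the difference between many brief
# impressions and one long, engaged view.
from collections import defaultdict

# Per-event log: (user, seconds_watched). User "a" skims 60 ads for
# 1 second each; user "b" watches a single ad for 60 seconds.
events = [("a", 1)] * 60 + [("b", 60)]

def aggregate(log):
    """Sum watch time per user, as an aggregated hand-off might."""
    totals = defaultdict(int)
    for user, seconds in log:
        totals[user] += seconds
    return dict(totals)

summary = aggregate(events)  # {'a': 60, 'b': 60} -- indistinguishable
```

Both users aggregate to 60 seconds, so any downstream analysis on the summary treats a bored skimmer and an engaged viewer as the same person.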
 
What’s up next
We now officially have a de-identified dataset to work with, along with some of the documentation around how it was de-identified. The coders in the group will begin examining it and playing with the code.
Our policy team continues to work on de-identification laws beyond the education space (where FERPA applies). We are taking a look at HIPAA, which specifies de-identification requirements for medical information, and at international laws (especially in privacy-protecting Europe).

Developing big data analysis tools 2.0

What we worked on

The Big Data team spent the past few weeks introducing the group’s work so far and setting concrete goals for this semester. We are extremely excited to welcome new highly qualified and interested team members!

During our first meeting, the team joined the Privacy Tools group of the Center for Research in Computation and Society to talk about the results from the group’s work last year and the CACM paper that followed.

Next, we met to learn about every member's interests and goals, and discussed how we might work together to define and explore the key questions that interest the group.

Our Work Plan for the Semester

Throughout the semester we will be asking three key questions:

(1) Are current anonymization techniques used on large datasets able to keep the data and its properties reliable and complete?

Can we use anonymized datasets in research?

Can robust insights be generated from such anonymized datasets?

To answer these questions, we will analyze samples of such datasets and try to understand whether analyses of the original datasets and the anonymized ones generate the same results.

(2) If current methods do not maintain the data’s key properties, is there an anonymization method that can do so?

We will experiment with different ways of anonymizing data and try to understand which approach, if any, generates robust and satisfactory results that maintain the qualities of the original data without compromising users' privacy.
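As one example of the kind of technique we might test (a hypothetical sketch, not a method the group has settled on), generalization replaces an exact quasi-identifier with a coarser band, and we can check how well a simple statistic survives the transformation.

```python
# Hypothetical sketch of one anonymization approach we might evaluate:
# generalizing exact ages into 10-year bands, then checking how closely
# a statistic computed from the bands tracks the original.
ages = [19, 23, 24, 31, 35, 44, 52, 67]

def generalize(age, width=10):
    """Replace an exact age with its band, e.g. 23 -> '20-29'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

bands = [generalize(a) for a in ages]

# The mean age is recoverable only approximately from band midpoints.
midpoints = [(int(b.split("-")[0]) + int(b.split("-")[1])) / 2 for b in bands]
exact_mean = sum(ages) / len(ages)            # 36.875
approx_mean = sum(midpoints) / len(midpoints)  # 37.0
```

Here the error is small, but wider bands (or skewed data) make it grow; quantifying that trade-off between privacy and fidelity is exactly the kind of experiment we have in mind.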

(3) Finally, if anonymizing data in a way that maintains its original properties is not possible, we will research and brainstorm new concepts of privacy.

Can privacy exist without anonymity?

This is a huge undertaking, and one that many have thought about in the past. We will spend the semester researching different notions of privacy, trying to understand what lies at their core, and asking whether we can generate a kind of privacy without anonymity. While we may not succeed, we think that spending time on this issue is important.

What went well so far

Everyone seems genuinely fascinated by the problem and excited to get our hands dirty trying to do some meaningful work on the subject. We are all thrilled to be working on a subject that not many others have explored before us.

What was challenging

Since we are stepping onto a path not many have walked before us, we will have to figure things out as we go. This may be challenging at times, but we will work to create a supportive community that facilitates a productive process.

What’s up next

We have been building a reading list to get everyone up to speed on concepts in data anonymization, de-identification methods, legal requirements, and related notions of privacy. In the next week or so we will discuss the readings and the larger themes around them. We are also working on securing access to some large data sets so that we can start conducting preliminary analysis and visualization.