Anonymity, Privacy, and Big Data


It’s been a while since my last post. Some interesting things have been going on, especially in the area of privacy, big data, and the work I’ve been doing with the researchers looking at the information coming out of the Massive Open On-line Classes (MOOCs) being offered by HarvardX on the edX platform. The latest result, just published on-line, shows that if you equate privacy with anonymity (which is commonly done in laws), you can either preserve privacy or do good science, but not both. Which is a pretty interesting result, I think. Let me lead you through the thinking, with apologies for the long and rambling nature of the post.

When I first started offering the course Privacy and Technology, I did so thinking that it would be a topic that would allow discussion of the interaction of technology and policy without being overly trendy or taken over by events. I remember my co-instructor at the time, Mike Smith, saying in the shopping lecture that he thought privacy was going to be the issue of the next decade; it was a shopping lecture, so I figured being a bit over-the-top was fine. Turns out he was a much better predictor of the future than I; privacy has become a huge issue that has now become one of my major research areas.

One reason for the worry is the advances in the ability to use Big Data to analyze all kinds of things. People worry about all the information the NSA appears to be gathering about everyone, are creeped out by how much Google and other web properties know about them, and are generally worried that nothing can be kept secret (and it will only get worse as more sensors are connected to the internet). There are very few general regulations or laws in the U.S. about what data can be gathered and what that data can be used for. Google and Facebook can pretty much do what they want with the data they gather as long as they don’t lie to you in their privacy statements (although how you know what those statements say is an exercise left to the reader).

There are some areas that are regulated. Everyone seems to know about the privacy requirements imposed by the Health Insurance Portability and Accountability Act (better known as HIPAA), although as my colleague Latanya Sweeney often points out, this regulation is more about what data can be shared than it is about what is private. The other industry that is highly regulated with respect to privacy is education, which is covered by the Family Educational Rights and Privacy Act (FERPA). Most of these regulations center on the notion of de-identified data, that is, data that has been changed so that you can’t identify individuals from it. HIPAA specifically defines what it means to de-identify data: you have to take out names, addresses, social security numbers, and the like. The advantage of this definition is that you can tell when your data has been HIPAA de-identified; the disadvantage is that it isn’t all that hard to re-identify individuals given the de-identified data and other, pretty easily acquired information.

FERPA is much less well defined; it talks about data that could not be re-identified by a reasonable person (is Latanya reasonable? Probably not, but how about all of the students she has trained?). Where it does get something like specific, it seems to rest on the notion of k-anonymity (invented by none other than Professor Sweeney), which in its simplest formulation says that for the set of properties of any individual in your data set, there must be at least k-1 other individuals in that set with the very same properties. So if your data set is, say, k-anonymous for k=5, then for any information about some particular individual in the data set, there will be at least 4 other individuals in the set that can’t be distinguished from the original.
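The definition is simple enough to check mechanically. Here is a minimal sketch in Python (the field names and records are made up for illustration, not taken from the actual MOOC data): count how often each combination of identifying properties (the "quasi-identifiers") occurs, and require every combination to occur at least k times.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears at least k times in the data set."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical student records.
students = [
    {"country": "US", "degree": "BA", "grade": 0.9},
    {"country": "US", "degree": "BA", "grade": 0.7},
    {"country": "FR", "degree": "MA", "grade": 0.8},
]

# Treating country and degree as the quasi-identifiers, this data set
# is 1-anonymous but not 2-anonymous: the FR/MA combination is unique.
print(is_k_anonymous(students, ["country", "degree"], 2))  # False
```

Note that the grade column doesn't matter here; only the properties an outside party might plausibly know about someone count as quasi-identifiers.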

Now, the MOOCs that Harvard has been running have been generating a lot of data about how students taking the MOOCs interact with the course material. What videos get viewed, which get sped up or slowed down, what questions are answered correctly and after how many tries– all of this information is available on anywhere from thousands to hundreds of thousands of students. Over the past year we (a research group from Harvard and MIT) have been able to do some pretty interesting data analysis on this, and are starting to learn some things about how students learn on-line.

But if you are really committed to science, you want to allow others to test your work by running their own analyses on the data. We also think others will ask questions we have failed to ask, which is a good thing. So we would like to share the data. But this being data about students, we (or, more accurately, the lawyers) worry about FERPA. So before we could share it, we needed to make it k-anonymous (where, in this case, k=5). Using a variety of techniques, we were able to produce a basic data set that was 5-anonymous and gave data on over 800,000 students who had taken MOOCs. Which we then released to other researchers.
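To give a feel for what making a data set k-anonymous involves, here is a deliberately naive sketch of one such technique, suppression: any record whose quasi-identifier combination is too rare simply gets dropped. (The field names and records are invented for illustration; the actual release used a combination of techniques, not just this one.)

```python
from collections import Counter

def k_anonymize_by_suppression(rows, quasi_identifiers, k):
    """Naive de-identification sketch: drop every row whose
    quasi-identifier combination is shared by fewer than k rows,
    leaving a k-anonymous data set."""
    key = lambda row: tuple(row[q] for q in quasi_identifiers)
    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] >= k]

# Hypothetical records: four identical US rows and one unique FR row.
records = [{"country": "US", "age_band": "20-29"}] * 4 + \
          [{"country": "FR", "age_band": "30-39"}]
released = k_anonymize_by_suppression(records, ["country", "age_band"], 4)
print(len(records), len(released))  # 5 4
```

Even this toy version shows the catch: the rows that get suppressed (or coarsened, under generalization-based techniques) are precisely the unusual ones, so the released data set is no longer a faithful sample of the original.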

But then we thought, just to be sure, that we should run our original analyses over the de-identified data set. When we did, the results we got were very different. Significantly so. For details, see the paper, but the difference was enough that it wasn’t clear that the de-identified set would lead to good science. Which is a huge problem.

We hope that the paper will cause some discussion, both among researchers and policy makers. There are others that have worried that there is a conflict between big data science and the preservation of privacy through anonymity (most notably, the President’s Council of Advisors on Science and Technology in a recent report), but as far as I know our paper is the first to show that the problem is real.

So now I’m thinking maybe Mike Smith was wrong. This is a problem that is going to be around for a lot longer than a decade.
