Archive for the 'privacy' Category

Good service or privacy invasion…


You wouldn’t know it now, but there was a time when I was a pretty good dresser. Fashionable, even.

I will admit that this was through no fault (or effort) of my own. But when I was a teenager, I bought my clothing at a single store, where there was a clerk who essentially picked out all of my clothing. I would go in a couple of times a year, and he would tell me what to buy, and what went with what, and how I could combine what I was buying with what I already owned. He didn’t quite stitch the Garanimals on the clothing for me, but it was close. He knew everything that I owned, and all I had to do was follow his instructions.

When I went off to college, I was on my own and things degraded quickly. But for a while there I had great service, and I was willing to pay the price, knowing that someone knew more about what was in my closet than I did. He also cared more than I did. But I liked the result.

I can now foresee a time when this sort of service could be offered again, pretty much to everyone. It could be done via the web, using an AI that was able to filter my preferences (taken from what I had bought in the past) and add some sense of fashion taste, and offer me the right clothing to buy. It would know more about what was in my closet than I did, and could customize itself to my taste and preferences.

But we tend to worry about all of the data that on-line corporations like Amazon and Google know about us, given all of the data that they gather. We worry about the micro-targeting of ads (which can also be seen as only showing us ads about things in which we are interested) and the honing of our news feeds that put us in a bubble. Because of this, there is talk of regulating the data that these companies gather about us, limiting their ability to know what we have done and predict what we will do.

While I share a lot of these concerns, I also wonder if we are letting the unfamiliarity and “creepiness” of the situations dictate an emotional response that may not be justified. When I hear people talking about how much, say, Google knows about me, I wonder who it is that actually knows. Corporations don’t know things; they are abstract entities. The computers at Google don’t know things, either; they process data and run algorithms, but they no more know anything than submarines swim. Is there any person at Google who knows all of this information about me? (I have friends at Google who know a lot about me, but that isn’t from tracking the data I enter on-line.) There might be someone at the NSA who knows about me (although I doubt I’m that interesting), but I don’t think there is anyone at Google.

One big difference between the technologies that know about me and the clothing store clerk of my youth is that I trusted the clerk. He was another human, and I talked with him and interacted in a way that led me to respect his opinions (and do what I was told). There is no such trust relationship with the technologies with which I interact. But that could change, if the sophistication of the algorithms improves. Rather than showing me things I have just purchased to see if I want to purchase them again, maybe the algorithms will start showing me things related to those purchases. The more they could be like the trusted clerk, the less I would be creeped out.

I don’t think they will ever get to the point that I will be a fashionable dresser again. But it might mitigate some of my worries about privacy. Maybe…

Thoughts on time between…


It has been a full semester since the last post– it seems like just a moment. Fall is a favorite time of mine for lots of reasons, but Fall Semester is fun because I get to teach my privacy course. It is small, it is intense, and it is a lot of work, but it is more fun than I can imagine having in any other context. We had our usual count of the surveillance cameras on and around campus (a lot), a lab on unprotected web cameras that ended up with us all watching a wedding in a Las Vegas chapel (the bride went as Marilyn Monroe, the groom as early Elvis, and the official doing the ceremony as the later Elvis), and a slew of projects that were amazing, creepy, and often both.

This year was especially interesting, given all of the privacy-related things that were going on in the rest of the world. We had more Snowden revelations, new and interesting hacks, Facebook and Google doing things that were questionable, and goings on in our own community. All of which had privacy implications.

One event led to some interesting discussions, both in and out of class: the revelation that some educational research had taken place on the effect of attendance on the outcomes in a course. The research entailed placing cameras in some courses to determine how many students were attending on any given day. This was done without the knowledge of either the students in the classes or the professors teaching them, and once revealed it led to considerable discussion in the general Harvard community. I was a part of some of these discussions in my role as CTO. But the most interesting conversations were the ones I had with my class.

The first thing that the students wanted to know was more details about the study itself– did the study get reviewed by the Institutional Review Board, and if so what did the application look like and what was the determination? How were the cameras used to determine attendance, and were they counting raw numbers, or could the cameras tell whether a particular student was attending or not? How were the courses that were part of the study selected? All of these were great questions, and at the time of maximum controversy, none were answered publicly. This didn’t stop a lot of the discussion, but it did cause the students to stop and wonder. We never got answers to all of the questions (in particular, we never saw the IRB application), but once answers to some came out (the experiment did get submitted to the IRB; the cameras could only tell whether or not someone was in a seat, not track individuals; classes were picked by being in the right Science Center rooms), things seemed a lot less sinister than they had at first. The students tended to think that the privacy of the students was not compromised (after all, what is the reasonable expectation of privacy in a lecture hall?), but perhaps the instructors’ privacy was (although there was continuing discussion about that).

What did become clear is that the issue really revolved around trust, and the more information that was available, the easier it was to trust that the researchers were not compromising privacy. When there were few facts, it was easy to speculate and ascribe questionable motives to those involved, which raised all of the privacy hackles. But as we all learned more, it was easier to see what was being done in a clearer way, and to become less concerned about the motives of the actors. It still seemed sort of creepy, but it was hard to see the actions as immoral (and we never got to the discussion of illegal).

Anonymity, Privacy, and Big Data


It’s been a while since my last post. Some interesting things have been going on, especially in the area of privacy, big data, and the work I’ve been doing with the researchers looking at the information coming out of the Massive Open On-line Courses (MOOCs) being offered by HarvardX on the edX platform. The latest result, just published on-line, shows that if you equate privacy with anonymity (which is commonly done in laws), you can either preserve privacy or do good science, but not both. Which is a pretty interesting result, I think. Let me lead you through the thinking, with apologies for the long and rambling nature of the post.

When I first started offering the course Privacy and Technology, I did so thinking that it would be a topic that would allow discussion of the interaction of technology and policy without being overly trendy or taken over by events. I remember my co-instructor at the time, Mike Smith, saying in the shopping lecture that he thought privacy was going to be the issue of the next decade; it was a shopping lecture, so I figured being a bit over-the-top was fine. Turns out he was a much better predictor of the future than I; privacy has become a huge issue that has now become one of my major research areas.

One reason for the worry is the advances in the ability to use Big Data to analyze all kinds of things. People worry about all the information the NSA appears to be gathering about everyone, are creeped out by how much Google and other web properties know about them, and generally fear that nothing can be kept secret (and it will just get worse as more sensors are connected to the internet). There are very few general regulations or laws in the U.S. about what data can be gathered and what that data can be used for. Google and Facebook can pretty much do what they want with the data they gather as long as they don’t lie to you in their privacy statements (although how you know what those statements say is an exercise left to the reader).

There are some areas that are regulated. Everyone seems to know about the privacy requirements imposed by the Health Insurance Portability and Accountability Act (better known as HIPAA), although as my colleague Latanya Sweeney often points out, this regulation is more about what data can be shared than it is about what is private. The other industry that is highly regulated with respect to privacy is education, which is covered by the Family Educational Rights and Privacy Act (FERPA). Most of these regulations center on the notion of de-identified data– that is, data that has been changed so that you can’t identify individuals from the data. HIPAA specifically defines what it means to de-identify the data– you have to take out names, addresses, social security numbers, and the like. The advantage of the definition is that you can tell when your data has been HIPAA de-identified; the disadvantage is that it isn’t all that hard to re-identify individuals given the de-identified data and other, pretty easily acquired information.
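To see how easy such re-identification can be, here is a toy sketch of the classic linking attack: joining a “de-identified” table with a public record, such as a voter roll, on the fields they share. All names, records, and field choices here are invented for illustration; this is not any actual data set.

```python
# De-identified medical records: names removed, but ZIP code, birth date,
# and sex (the "quasi-identifiers") were left in.
medical = [
    {"zip": "02138", "dob": "1970-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1982-01-15", "sex": "M", "diagnosis": "flu"},
]

# A public voter list with names and the very same quasi-identifiers.
voters = [
    {"name": "Alice Smith", "zip": "02138", "dob": "1970-07-31", "sex": "F"},
    {"name": "Bob Jones", "zip": "02139", "dob": "1982-01-15", "sex": "M"},
]

def reidentify(medical, voters):
    """Join the two tables on (zip, dob, sex) to attach names to diagnoses."""
    matches = []
    for m in medical:
        key = (m["zip"], m["dob"], m["sex"])
        for v in voters:
            if (v["zip"], v["dob"], v["sex"]) == key:
                matches.append((v["name"], m["diagnosis"]))
    return matches

# When a quasi-identifier combination is unique in both tables, the join
# links a name to a supposedly private diagnosis.
print(reidentify(medical, voters))
```

No names were ever shared, yet an ordinary equality join is enough when the quasi-identifier combination is rare.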

FERPA is much less well defined; it talks about data that could not be re-identified by a reasonable person (is Latanya reasonable? Probably not, but how about all of the students she has trained?). Where it does get something like specific, it seems to rest on a notion of k-anonymity (invented by none other than Professor Sweeney), which in its simplest formulation says that for any set of properties of any individual in your data set, there must be at least k-1 other individuals within that set with the very same set of properties. So if your data set is, say, k-anonymous for k=5, it means that for any information about some particular individual in the data set, there will be at least 4 other individuals in the set who can’t be distinguished from the original.
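That simplest formulation is easy to state in code. Here is a minimal sketch (the records and field names are invented for illustration) of a check that a data set is k-anonymous with respect to a chosen set of quasi-identifier fields:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at
    least k records, so each individual hides among at least k-1 others."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())

records = [
    {"age": "20-29", "zip": "021**", "grade": 0.91},
    {"age": "20-29", "zip": "021**", "grade": 0.55},
    {"age": "20-29", "zip": "021**", "grade": 0.73},
    {"age": "30-39", "zip": "021**", "grade": 0.88},
]

# The lone 30-39 record makes its group of size 1, so this prints False.
print(is_k_anonymous(records, ["age", "zip"], 3))
```

Note that the check only looks at the quasi-identifier fields; the "payload" fields (here, the grade) can differ freely within a group.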

Now, the MOOCs that Harvard has been running have been generating a lot of data about how students taking the MOOCs interact with the course material. What videos get viewed, which get sped up or slowed down, what questions are answered correctly and after how many tries– all of this information is available on anywhere from thousands to hundreds of thousands of students. Over the past year we (a research group from Harvard and MIT) have been able to do some pretty interesting data analysis on this, and are starting to learn some things about how students learn on-line.

But if you are really committed to science, you want to allow others to test your work by running their own analyses on the data. We also think others will ask questions we have failed to ask, which is a good thing. So we would like to share the data. But this being data about students, we (or, more accurately, the lawyers) worry about FERPA. So before we could share it, we needed to make it k-anonymous (where, in this case, k=5). Using a variety of techniques, we were able to produce a basic data set that was 5-anonymous and gave data on over 800,000 students who had taken MOOCs. Which we then released to other researchers.
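The techniques for reaching k-anonymity typically include generalization (coarsening values, such as exact ages into ten-year bands) and suppression (dropping records whose quasi-identifier combination is still too rare). The sketch below shows that pipeline on invented data; it is an illustration of the general approach, not the actual procedure used on the MOOC data set.

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: exact age -> 10-year band,
    5-digit ZIP -> 3-digit prefix."""
    lo = (record["age"] // 10) * 10
    return {**record, "age": f"{lo}-{lo + 9}", "zip": record["zip"][:3] + "**"}

def k_anonymize(records, quasi_identifiers, k):
    """Generalize every record, then suppress any record whose combination
    of quasi-identifier values is shared by fewer than k records."""
    generalized = [generalize(r) for r in records]
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in generalized)
    return [r for r in generalized
            if counts[tuple(r[q] for q in quasi_identifiers)] >= k]

records = [
    {"age": 21, "zip": "02138", "grade": 0.9},
    {"age": 24, "zip": "02139", "grade": 0.6},
    {"age": 37, "zip": "02139", "grade": 0.8},
]

# The two twenty-somethings become indistinguishable; the lone
# thirty-something is still unique after generalization, so it is suppressed.
released = k_anonymize(records, ["age", "zip"], 2)
print(released)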

But then we thought, just to be sure, that we should run our original analyses over the de-identified data set. When we did, the results we got were very different. Significantly so. For details, see the paper, but the difference was enough that it wasn’t clear that the de-identified set would lead to good science. Which is a huge problem.
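One intuition for why the results shift (this toy example is mine, not the paper’s analysis): the records that k-anonymization suppresses are precisely the unusual ones, and if unusual students behave differently — say, the few students from a rare country are unusually likely to earn a certificate — then dropping them biases any statistic computed from what remains.

```python
from collections import Counter

# Invented toy "MOOC" records: country is the quasi-identifier,
# certification is the outcome of interest. (Records are never mutated,
# so repeating dicts with * is safe here.)
records = (
    [{"country": "US", "certified": 0}] * 8
    + [{"country": "US", "certified": 1}] * 2
    + [{"country": "IS", "certified": 1}] * 2  # rare group, highly engaged
)

def certification_rate(rs):
    return sum(r["certified"] for r in rs) / len(rs)

# 5-anonymize by suppressing quasi-identifier groups with fewer than 5 members.
counts = Counter(r["country"] for r in records)
released = [r for r in records if counts[r["country"]] >= 5]

print(round(certification_rate(records), 3))   # 4/12 = 0.333 on the full data
print(round(certification_rate(released), 3))  # 2/10 = 0.2 after suppression
```

The released data set is still perfectly accurate record by record; the bias comes entirely from which records survive.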

We hope that the paper will cause some discussion, both among researchers and policy makers. There are others who have worried that there is a conflict between big data science and the preservation of privacy through anonymity (most notably, the President’s Council of Advisors on Science and Technology in a recent report), but as far as I know our paper is the first to show that the problem is real.

So now I’m thinking maybe Mike Smith was wrong. This is a problem that is going to be around for a lot longer than a decade.

Immoral, Illegal, or Creepy…


About 5 years ago, I designed and started teaching CS 105, a computer science course entitled Privacy and Technology. The course grew out of a National Academies study that I had co-chaired, where it became apparent that people who knew about policy often didn’t know about technology, and those who knew about technology had no idea how policy was made. I thought it would be good to bring the two groups together, and the subject of privacy seemed to be a reasonable place to start. After all, technology (at the time, such things as wireless sensors, RFID tags, and surveillance cameras) was often seen as impinging on privacy, and so the combination of privacy and technology seemed a good place to get people thinking about some of these issues.

I was pretty sure that privacy was going to be an important subject when we started the class (it was always a group effort, started by Michael Smith and me and joined by Latanya Sweeney and a crew of undergraduate and graduate TAs that have all had a hand in shaping the course). We have always let the privacy controversies that were current at the time shape the class (it is one of the only classes I know at Harvard that starts every day with a section on “current events”). But I had no idea that it was going to become as central to the public debate, either at Harvard or in the country, as it has become in the past few months. It seems sort of quaint that we were worried about RFID tags when the class first started, given the worries that we have now.

I’m not sure what I think about the current controversies, either local or national; I’d like to find out a lot more about the facts than I know right now before I come to a conclusion. But I will make a couple of observations, even this early in the game.

The first is that the reason for establishing the course is just as valid now as it was then. Watching the discussions and reading the debates, both local and national, about electronic privacy shows me once again the distance between those who make the policies and laws and those who understand the technology. The policy makers’ discussions of meta-data, the internet, and database technology are pretty uninformed. But often equally uninformed are the technologists’ discussions about the law and what guarantees we have or ought to have. The original reason for the course, which is to bring these worlds together, has not disappeared.

My second observation is that all who are in this discussion (and I’m glad to see how many are taking part) might want to think about one of the first distinctions we introduce in the course. This is a distinction between what is immoral, what is illegal, and what is creepy. Something is immoral if it violates an ethical code; this is the purview of philosophers (a group I was part of long ago) and can shed lots of light (or at least lots of distinctions) on the subject. And while we try to base our laws on ethical theories, laws are not able to fully reflect those theories. So the fact that something is immoral doesn’t mean that it is illegal; that is a matter left up to the lawyers and, more particularly, to the judges who interpret the laws. And both of these are different from something being creepy, which is simply a description of an emotional reaction we have to a situation. The fact that this reaction is emotional doesn’t make it any less real, and it may be indicative of the situation breaking some moral code. But it isn’t the same.

When looking at a supposed privacy violation, it is often reasonable to ask which of these three categories the violation falls into. The government building a database of all of the phone meta-data on all of its citizens is certainly creepy. I don’t know if it is illegal; that would need to be decided by the courts (and I hope it will be at some time in the very near future). And whether or not it is immoral is a much deeper question, having to do with the kinds of tradeoffs that are needed and the underpinnings of your ethics. It is the right question to ask, but it is a difficult one to answer.

The discussions of privacy aren’t about to go away or get any simpler. They are very important, and I’m glad that we are having them. And on a selfish note, they are going to make the next edition of CS 105 (coming up this fall) really interesting.