Furthering Human Knowledge


A recent graduate with whom I did considerable work recently wrote and asked me the following question:

What are your thoughts about how academia versus industry contribute to the expansion of the world’s knowledge? My thoughts from speaking to people from both sides are that it seems that in academia, knowledge is much more open to sharing, but progress is much slower (because of multiple things: the need to apply for grant money, the multiple roles that professors must juggle, the difficulty of the problems being tackled, the lack of readily accessible consumer data, etc.), and that in industry, there are probably tens or hundreds of companies trying to do the same thing (like spam detection, for example) but who do not want to share their knowledge, but that the competition and that the money incentive makes development faster. Do you think that working in one versus the other is more effective or efficient if the furthering of knowledge is the ultimate goal? This is one of the big questions I’m thinking about as I’m trying to figure out where I best see myself in 10 years.

I’ve been thinking about this problem, and some variants, for a considerable period of time, so the response was fairly lengthy. On the off chance that others might be interested, I post it here:

And this is why I love working with students— they always ask the simple questions :-).

There may have been a time when the answer to this was actually simple. The idea was that academics worked on long-term, highly speculative research topics, generally funded by the government, and published papers for all the world to see. Then industry would pick up some of these ideas, figure out the commercial application, apply the pure research, and take the product that resulted from applying that research to market. Academics worked on the long-term (5-10 year) problems, and industry worked on the short term (.5-2 years). A simple and rational pipeline, with each group knowing what it was doing and how it interacted with the other. If you wanted to work on the high-risk, high-impact (but likely to fail) stuff, you became an academic. If you wanted to have immediate impact on the real world, with some better guarantee of success, you worked in industry.

This is a nice picture, but if it has ever actually been true in the computer/software sector, it was before my time. As long as I’ve been watching and taking part, real research has been done across a combination of academia, industrial research labs (Bell Labs, BBN, Xerox PARC, Digital’s SRC, Sun Labs, MSR) and people in product groups at various companies, both large and small. I remember a time when it seemed that everyone in the industry worked in Research and Development; the difference between being an academic and working for a company was whether you wrote the paper before or after releasing the product.

But there are some changes that have occurred over the past 10 or more years that have made the picture even more muddled than it was, and thus make the question harder to answer.

One of these changes is that the amount of research funding going to academics in the CS field has been going down, and the risk tolerance of the funding agencies has been going down with it. There was a time when NSF and DARPA would give out lots of money, and be willing to take chances on things (like the internet) that didn’t have any obvious payback and might not have worked at all (there were plenty of research projects that didn’t). As the amount of money has decreased, the willingness to take risks has decreased as well; while there are some projects that seem pretty risky, for the most part the perception is that for a project to get funded, you need to show that it will succeed. This leads to people getting startup funding to do a year or so of research, applying to funding agencies for that work (or a small follow-on to the work), and then using the funding to do something new that they can then ask for the next round of funding to support. Again, it isn’t always like this, but it is like this often enough that it can be problematic. By the way, it may not be that the actual amount of money has gone down (I believe that it has), but the amount hasn’t gone up in proportion to the number of academics who are trying to get it. So that makes things strange, as well.

At the same time, the number of industrial research labs seems to be decreasing, along with the funding available for such labs. Big places are willing to do some really adventurous stuff (look at Google X, or at least the rumors from Google X), but the work is not done in the open, may not be shared, and when it is shared, it is often covered by patents. Which is natural; these companies want a return on their investment. But it does limit the scope of the spread of the innovation or knowledge. In a lot of cases, companies are now willing to take more of a chance on research than academics, because they realize that the payoff to being the first to figure something out is so huge. So some of the really speculative, long-range work is being done in industry, but you hardly hear about it (think of the self-driving car).

And then there is a third trend. What seems to be happening more and more is that innovation and real research are being outsourced to startup companies. If you have an innovative idea, you start a company and work on the idea. Then if the idea works out, you either take over the world or (more likely) get bought by a larger company. This is a really attractive model for those who fund innovation; they have essentially outsourced the problem to the VC community. The government doesn’t have to spend any money at all. The existing companies only have to buy when they know that the idea works. And the VCs are willing to put up the initial money, because the payback for the companies that get bought or get to go public is large enough to make the initial investment profitable. This certainly seems to be the way Oracle does research these days (they don’t even do much hiring; most of the people they add to the company come in through company acquisition). Ryan Adams recently had his little company bought by Yahoo, so sometimes the line between academic, startup, and established company can be even more blurred.

Of course, such outsourcing also means that the time depth of start-up company research is dictated by the patience of the VC community, which seems to be limited to a couple of years (at best). And the research better have a clear commercial application. 

All of this has to do only with how the initial research gets funded. The real question centers on how you could most effectively add to human knowledge. Which is a lot harder than just getting funding, because once you get funding, you then need some way to get people to recognize what you have done.

Academics do this by writing papers and giving talks, which sometimes works. But there are a lot of papers that get written and never really get read, a lot of talks that are heard and immediately forgotten. By the same token, there are lots of products that are really great pieces of technology that, for one reason or another, never get adopted. Before inventing Java, James Gosling invented NeWS, an elegant and fully functional window system. But NeWS went nowhere; instead the industry of the time adopted X-windows, which a lot of us thought was not technically as nice. Dick Gabriel and I have been arguing over why LISP lost out to C or Multics lost out to Unix for a long time, but whatever the reason was it was not purely technical. I remember being told by Ivan Sutherland, who has done more great technology than just about anyone I know, that all you can do as a technologist is make sure that the work is good; adoption is a question outside of your control. A hard lesson, but true.

After all of this evasion, let me try to answer the real question, which is what should you do if you want to push forward the boundaries of knowledge? And the answer will depend on how you want to live your life, not on which is more likely to push those boundaries successfully. As an academic, you have much more freedom, deciding on your own research agenda and working with students who you will get to advise and direct. In industry, you probably won’t have that sort of freedom for 5 to 10 years (and that’s if you enter with a Ph.D.); you will be part of someone else’s group for that time, working on their problems. But while in industry you will not have the worries over funding (my funding cycle when at Sun Labs was a couple of weeks), and the people you will be working with will have a higher level of skill and discipline than is generally found in students. But the problems you will work on will be constrained, to some extent, by the market and you may not be able to share all you know. The environment of a startup gives you some of the freedoms of an academic, but also brings the pressures of the market. 

And, of course, a final consideration is just what is meant by “furthering human knowledge.” One way of doing this is to come up with something that no one has ever thought of before, and to get everyone to understand it. This might be a new theorem, a new system, or a better way of doing things. Java, for all its flaws, certainly contributed to human knowledge in some way; the IP protocols did the same. But these sorts of contributions are few and far between. When they happen, it is a combination of insight, perspiration, and luck; no one knows how it really happens, but when it does it is pretty amazing.

But the other way to further human knowledge is to train the next generation in how to further knowledge. This can also be done in all of the contexts spoken about above— I mentored a lot of people when I was in industry, start-ups teach in their own kind of way, and being a professor is (supposedly) explicitly about that. As my career has moved along, I’ve grown to appreciate this way of furthering knowledge more and more; it may not be as romantic or immediately satisfying, but you end up playing the long game, knowing that your contributions will outlast you and not be subject to quite so many whims. Which is the other reason that I love working with students— it is part of my attempted contribution to the expansion of human knowledge.

It was 20 years ago today…


A lot of Beatles songs have been running through my head this last little while, but Sgt. Pepper is getting most of the mental traffic. In part this is because of a recent personal anniversary (will you still need me?) but mostly because it was 20 years ago that Java(tm) was first made public. My, how time flies.

There was actually a bit of a debate on the JavaSoft (that was the part of Sun that was responsible for Java) mailing lists; the first public release of the alpha technology preview of Java went out March 27, 1995, while the first announcement of the language and environment was during the Sun Developers Conference held May 20-something in that year. To give a bit of context, that was the same Sun Developers Conference at which the Memo of Understanding between Sun and Netscape that placed Java in the browser was announced. For those of you who don’t know what Netscape was, go look it up. At the time, Netscape was a much bigger deal than Java, but time has a way of changing those things.

Java had actually been wandering around Sun internally for a couple of years before that; during the 10 year anniversary I found that I had the electronic copy of the oldest known (then) spec for the language, which you can see here. First note that the language was not then called Java. It was originally named Oak, after the tree outside of James Gosling’s office. But the trademark folks said that name was already taken, so the commercial name was changed to Java. The rest, as they say, is history.

The Oak spec is recognizably Java, but is also different. The exception mechanism at the time did not require that exceptions be declared or caught; this was changed before the first official release. The language was a lot smaller (full description in under 30 pages), and the mass of libraries, frameworks, and imported features from other languages wasn’t there. This is a snapshot of Java in its purer form, before it became important enough to be fought over by different companies, before the community process, and before all those who wanted it to be something else had their way with it.

I still like to program in Java (the Standard Edition; don’t get me started on the Enterprise Edition), since at its core it is still the simple, object-oriented, strongly typed language that it started out being. I do use generics, even though I hate the way they were implemented and argued against them (and still think they aren’t needed). I’m glad to see lambdas introduced, although I could have lived without them. And I do sometimes wish that we had had a few more years to develop the language before it became the global phenomenon it became. But it ruled much of my life in technology, and it was fun.

So happy birthday, Java. And thanks for all the good times. The bad ones weren’t your fault…

Architecture at HUIT


It’s been quite a while since my last post; lots has happened and it is past time to start talking about some of it.

Information technology at Harvard is changing a lot, as all of the planning that started when Anne Margulies came in as CIO is beginning to be implemented. We are well into the implementation of a new Student Information System, and the move of our classroom support technology from iSites to Canvas is happening (sometimes faster than we thought). All of these are good, and visible. But this is hardly all of it, or even the most (technically) interesting.

In some important ways, the biggest change is the move to the cloud. If done right, this will be pretty much transparent to the users of our services, but it is a huge change for the organization itself. It makes sense in lots of ways, but it takes new sets of skills and new ways of looking at the problems we are trying to solve. But we have committed to moving 75% of our services to the cloud in the next three years, which is a lot more than just testing the waters.

As we do this, it is also an opportunity to start putting some architectural principles in place. HUIT has traditionally treated most applications as “one-of-a-kind” entities, with machinery and underlying software stacks selected, optimized, and maintained for each system. When we were trying to squeeze all we could from each application, this made sense. But as computing power grows, the complexity of such an approach is overwhelming any advantage in performance we might be able to gain.

To help bring some regularity to this, I’ve convened a new group, the Architecture Decision Group. As the name implies, this is a group that comes together to make decisions about what the architecture is and will be going forward, at least for HUIT. If the wider Enterprise Architecture work were complete, this group would spend most of its time making sure that architecture was being followed. But since that wider effort is just starting, we needed something to decide the issues that need answers now, so we can avoid the lack of regularity that is our current state.

The group is intentionally designed to be small and technical. Permanent members are the CTO (natch), the deputy CIO, the Managing Directors of Engineering and Architecture and Strategy and Planning, the Chief Information Security Officer, and the Director of Networking. Depending on the subject being discussed, we will ask other (technical) people to attend.

An important part of the work that we are doing is writing it down. We have a backlog list, and then a set of decisions and rationales for those decisions. All of this is kept on a publicly viewable wiki.

While the deliberations of the group are invitation-only, we are looking for ways that the more general engineering community can contribute. For any of the topics in the backlog, we invite opinions to be written up (on the wiki) and submitted. The group will read these, and those that seem particularly relevant may cause us to invite the writer to join for a session or two. We also invite comments on our decisions. The assumption is that nothing we decide is set in stone, but unless there is good reason to follow some other design, everything that HUIT does should follow the decisions made by the group.

We have already made a number of decisions around the cloud and the network architecture that impact our move to the cloud; take a look and file a comment if you think we have not understood something important. We will next be looking at some of the patterns for deployment in the cloud; opinions on those topics are being sought. So take a look and get involved…this is the technical future of HUIT that is being worked out here, so we would love to hear from you.

Thoughts on time between…


It has been a full semester since the last post; it seems like just a moment. Fall is a favorite time of mine for lots of reasons, but Fall Semester is fun because I get to teach my privacy course. It is small, it is intense, and it is a lot of work, but it is more fun than I can imagine having in any other context. We had our usual count of the surveillance cameras on and around campus (a lot), a lab on unprotected web cameras that ended up with us all watching a wedding in a Las Vegas chapel (the bride went as Marilyn Monroe, the groom as early Elvis, and the official doing the ceremony as the later Elvis), and a slew of projects that were amazing, creepy, and often both.

This year was especially interesting, given all of the privacy-related things that were going on in the rest of the world. We had more Snowden revelations, new and interesting hacks, Facebook and Google doing things that were questionable, and goings on in our own community. All of which had privacy implications.

One event led to some interesting discussions, both in and out of class. This was the revelation that some educational research had taken place on the effect of attendance on the outcomes in a course. The research entailed placing cameras in some courses to determine how many students were attending on any given day. This was done without the knowledge of either the students in the class or the professors teaching those classes, and once revealed led to considerable discussion in the general Harvard community. I was a part of some of these discussions in my role as CTO. But the most interesting conversations were the ones I had with my class.

The first thing that the students wanted to know was more details about the study itself: did the study get reviewed by the Institutional Review Board, and if so what did the application look like and what was the determination? How were the cameras used to determine attendance, and were they counting raw numbers or could the cameras tell if a particular student was attending or not? How were the courses that were part of the study selected? All of these were great questions, and at the time of maximum controversy, none were answered publicly. This didn’t stop a lot of the discussion, but it did cause the students to stop and wonder. We never got answers to all of the questions (in particular, we never saw the IRB application), but given the answers that did come out (the experiment did get submitted to the IRB; the cameras could only tell whether or not someone was in a seat, not track individuals; classes were picked by being in the right Science Center rooms), things seemed a lot less sinister than they had at first. The students tended to think that the privacy of the students was not compromised (after all, what is the reasonable expectation of privacy in a lecture hall?) but perhaps the instructor’s privacy was (although there was continuing discussion about that).

What did become clear is that the issue really revolved around trust, and the more information that was available, the easier it was to trust that the researchers were not compromising privacy. When there were few facts, it was easy to speculate and ascribe questionable motives to those involved, which raised all of the privacy hackles. But as we all learned more, it was easier to see what was being done in a clearer way, and to become less concerned about the motives of the actors. It still seemed sort of creepy, but it was hard to see the actions as immoral (and we never got to the discussion of illegal).

Anonymity, Privacy, and Big Data


It’s been a while since my last post. Some interesting things have been going on, especially in the area of privacy, big data, and the work I’ve been doing with the researchers looking at the information coming out of the Massive Open Online Courses (MOOCs) being offered by HarvardX on the edX platform. The latest result, just published on-line, shows that if you equate privacy with anonymity (which is commonly done in laws), you can either preserve privacy or do good science, but not both. Which is a pretty interesting result, I think. Let me lead you through the thinking, with apologies for the long and rambling nature of the post.

When I first started offering the course Privacy and Technology, I did so thinking that it would be a topic that would allow discussion of the interaction of technology and policy without being overly trendy or taken over by events. I remember my co-instructor at the time, Mike Smith, saying in the shopping lecture that he thought privacy was going to be the issue of the next decade; it was a shopping lecture, so I figured being a bit over-the-top was fine. Turns out he was a much better predictor of the future than I; privacy has become a huge issue that has now become one of my major research areas.

One reason for the worry is the advances in the ability to use Big Data to analyze all kinds of things. People worry about all the information the NSA appears to be gathering about everyone, are creeped out by how much Google or other web properties know about them, and generally worry that nothing can be kept secret (and it will just get worse as more sensors are connected to the internet). There are very few general regulations or laws in the U.S. about what data can be gathered and what that data can be used for. Google and Facebook can pretty much do what they want with the data they gather as long as they don’t lie to you in their privacy statements (although how you know what those statements say is an exercise left to the reader).

There are some areas that are regulated. Everyone seems to know about the privacy requirements imposed by the Health Insurance Portability and Accountability Act (better known as HIPAA), although as my colleague Latanya Sweeney often points out, this regulation is more about what data can be shared than it is about what is private. The other industry that is highly regulated with respect to privacy is education, which is covered by the Family Educational Rights and Privacy Act (FERPA). Most of these regulations center on the notion of de-identified data: that is, data that has been changed so that you can’t identify individuals from the data. HIPAA specifically defines what it means to de-identify the data: you have to take out names, addresses, social security numbers, and the like. The advantage of the definition is that you can tell when your data has been HIPAA de-identified; the disadvantage is that it isn’t all that hard to re-identify individuals given the de-identified data and other, pretty easily acquired information.

FERPA is much less well defined; it talks about data that could not be re-identified by a reasonable person (is Latanya reasonable? Probably not, but how about all of the students she has trained?). Where it does get something like specific it seems to rest on a notion of k-anonymity (invented by none other than Professor Sweeney), which in its simplest formulation says that for any set of properties on any individual in your data set, there must be at least k-1 other individuals within that set with the very same set of properties. So if your data set is, say, k-anonymous for k=5, it means that for any information about some particular individual in the data set, there will be at least 4 other individuals in the set that can’t be distinguished from the original.
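
To make the definition concrete, here is a minimal sketch in Python of checking whether a data set is k-anonymous. Everything specific in it is an assumption made for illustration: the column names (country, birth_year, grade), the sample records, and the choice of which columns count as identifying properties are all hypothetical, not taken from any real data set. The second function shows one simple way to reach k-anonymity, dropping every record whose group is too small; it is only one of several possible techniques, and it throws data away.

```python
from collections import Counter

# Hypothetical choice of the properties that could help identify someone.
QUASI_IDENTIFIERS = ("country", "birth_year")

# Hypothetical sample records, invented purely for illustration.
SAMPLE = [
    {"country": "US", "birth_year": 1990, "grade": 0.91},
    {"country": "US", "birth_year": 1990, "grade": 0.45},
    {"country": "US", "birth_year": 1990, "grade": 0.78},
    {"country": "IN", "birth_year": 1985, "grade": 0.66},
    {"country": "IN", "birth_year": 1985, "grade": 0.83},
    {"country": "FR", "birth_year": 1972, "grade": 0.59},
]


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of identifying values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every record has at least k-1 others with the same values."""
    sizes = group_sizes(records, quasi_identifiers)
    return all(count >= k for count in sizes.values())


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop every record whose group has fewer than k members."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] >= k
    ]
```

In this sample the lone record from France makes the set fail the test for k=2; suppressing it leaves five records that pass for k=2 but still fail for k=3. Note that the suppressed data set is no longer the same data set, since the records that were removed are exactly the unusual ones.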

Now, the MOOCs that Harvard has been running have been generating a lot of data about how students taking the MOOCs interact with the course material. What videos get viewed, which get sped up or slowed down, what questions are answered correctly and after how many tries– all of this information is available on anywhere from thousands to hundreds of thousands of students. Over the past year we (a research group from Harvard and MIT) have been able to do some pretty interesting data analysis on this, and are starting to learn some things about how students learn on-line.

But if you are really committed to science, you want to allow others to test your work by running their own analysis on the data. We also think others will ask questions we have failed to ask, which is a good thing. So we would like to share the data. But this being data about students, we (or, more accurately, the lawyers) worry about FERPA. So before we could share it, we needed to make it k-anonymous (where, in this case, k=5). Using a variety of techniques, we were able to produce a basic data set that was 5-anonymous and gave data on over 800,000 students who had taken MOOCs. Which we then released to other researchers.

But then we thought, just to be sure, that we should run our original analyses over the de-identified data set. When we did, the results we got were very different. Significantly so. For details, see the paper, but the difference was enough that it wasn’t clear that the de-identified set would lead to good science. Which is a huge problem.

We hope that the paper will cause some discussion, both among researchers and policy makers. There are others that have worried that there is a conflict between big data science and the preservation of privacy through anonymity (most notably, the President’s Council of Advisors on Science and Technology in a recent report), but as far as I know our paper is the first to show that the problem is real.

So now I’m thinking maybe Mike Smith was wrong. This is a problem that is going to be around for a lot longer than a decade.

WeCode and Visceral Education


Last weekend I had the great pleasure of attending the WeCode conference run by the Harvard Undergraduate Women in Computing group. It was a great event; well-organized, well-attended, and far more interesting than the “OMG, Goldman Sachs was giving out nail files with their logo, how insensitive” meme that seems to have gone rampant on news sites that should know better. I was there to moderate a set of panels, but decided to attend most of the Saturday event to see what it was like.

The first keynote was in one of the large lecture theaters in the Science Center. When I walked in, there were probably 200 conference goers in their seats, and more were streaming in. I took three or four steps into the hall, and it suddenly hit me. I was one of maybe two or three men in the hall. I’ve never been accused of being shy, but I felt completely out of place. Completely other. All of the voices in my head were saying “get out of here… go to your office and get some work done…”. All the flight responses were active.

And at the same time, I was realizing that this is the feeling everyone else in the room must have at every other tech conference in the world, or in most computer science classes, or tech gatherings in general. It was a Zen experience. I suddenly felt that I had a better understanding of what women in computer science (and the STEM fields more generally) are up against.

I’ve tried to be a supporter of women in software positions all my life. My groups at Sun always had women software engineers, and my closest collaborator over most of my career was a woman. I’ve tried to encourage women in my classes. The last edition of my privacy course was 2/3rds female (a fact that one of the male students complained about; his complaint was an opportunity for a discussion of these issues which I hope had some impact). But I’ve never felt the problem the way I did last Saturday.

I’ll admit I’m not sure what to do about this. But it is a problem, not just of fairness and justice, but for the field. We need good people in software engineering, computer science, and related fields. The supply of any kind of people can’t keep up with the demand, and the supply of good people isn’t even close. Artificially limiting the supply of talent to half the population is insane, destructive, and wrong. Changing this will be hard, because not everyone understands. I thought I understood, but I didn’t really. I don’t fully understand now, but I’ve had a good lesson. It’s amazing how much more effective a lesson is when it arrives through the emotions instead of the brain.

I’m still thinking about the experience. But I know I won’t think about women’s issues in the STEM field in the same way. For that reason alone, the WeCode conference may have been the most educational I’ve ever attended.

The four best books for software engineers


This semester I’ve had the good fortune to be involved in the pilot of the Digital Problem Solving Initiative, a program being started out of the law school and the Berkman Center to allow Harvard students (from all schools) to get some mentored experience in dealing with various sorts of digital solutions. I’ve been leading a group of law students, FAS graduate students, and college students looking at some of the issues around the research data coming in from the HarvardX courses that are running on the edX platform. It’s a great group, and we have a nice combination of law questions (just what does FERPA require, anyway?), policy matters (how do we keep privacy in these sorts of data sets) and programming problems (they really think we can use *this* data?).

The group is a lot of fun, and we’ve had a stream of visitors as well to liven things up. A couple of weeks ago, Doc Searls and his wife joined us. Doc is one of the authors of the Cluetrain Manifesto, and I mentioned that it was one of the three or four books that all programmers should read. After the session, a couple of the participants asked what the others were.

Which got me to thinking. Most of the people who talk about the three or four books that everyone should read only tell you one of the books at any one time, and if you take the union of all those times the total is in the 20s or 30s. So trying to come up with all of them at once is difficult, and takes some thinking. It’s much easier to come up with one (the “desert island book”), or 20. But four?

But here we go…

First on the list would be Tracy Kidder’s The Soul of a New Machine. This is a non-fiction report of the development of a 32-bit minicomputer by Data General at the end of the 1970s. But the reason to read a book about a long-forgotten machine with what is now a laughable computing capability by a company that no longer exists is to see how little the technology industry changes. The personalities, problems, and solutions described in this book are as relevant and real now as they were then. There are the same personalities (the architect, the team lead, marketing, management), the same situations (despair, joy, post-shipping depression), and the same unvarying truths (the way to get the impossible done is to give the task to an intern, but make sure you don’t tell the intern it is impossible). I re-read this book every couple of years, and it still rings true.

Second on the list is Fred Brooks’s The Mythical Man-Month. Talking about a machine that predates the one described in Kidder’s book, this book contains most of what is true about software engineering. This is the book that first enunciated the principle that adding people to a late project makes it later, but there is much more here. This is where I learned how to estimate how long a project will take, how to size a team, and so much else. The problem with this book is that it is such a classic that everyone reads it early in their career (or education) and then forgets it. It should be re-read early and often. These truths don’t change, and the combination of this book and Kidder’s will remind you that high technology is still mostly about people (or, as I’m sometimes known to say, the technical problems are easy compared to the problems of primate behavior).

Third on the list is the aforementioned Cluetrain Manifesto. It packs a lot of truth about the new network world in a small number of pages, and is necessary reading for those who deal with the new internet world. A lot of what is said in this work is now common knowledge, so I sometimes worry that those reading it now won’t understand how radical it was when it was first published (and on the Web, no less). But other parts of the book are still not clearly understood, and are good to read.

My fourth book would be Kernighan and Ritchie, The C Programming Language, more generally known by its initials, K and R. This is the only overtly technical book on the list, and given how few people live their lives in the C language anymore, may seem an odd choice. Programmers shouldn’t read this book to learn the C language (although you can’t read it and not learn the C language). Programmers should read this book to understand how computers work, and to see an example of great writing about a technical subject.

Any computer language is a model for computation. COBOL models computation as a filing cabinet, allowing you to take something out of the cabinet, do some simple things with it, and put it back. Java models computation as a set of interacting objects (unless you use J2EE, in which case the model is, well, more like a filing cabinet). LISP models computation as a set of composable functions.

C models computation as a computer; more precisely C models computation as a PDP-11. But this isn’t a bad model to learn, as most of our current computers are, at heart, PDP-11s. Learning C lets you understand what the computer is doing; it is the most useful assembly language you can learn.

But the main reason for reading K and R is to be exposed to the most elegant exposition of a programming language (and programming) I know. The elegance of the writing is so pervasive that you don’t even notice it (which is true elegance), but everything just makes sense. As a model for how to explain the complex, there is no better example.

And those are my four. Unlike most books in the software section of your bookstore (or Amazon), I’m reasonably confident that they will be the four I would pick in ten years. They have all not only aged well, but become better with age. Something we can all aspire to do, both in our work and our selves.


Privacy and Anonymity


It has been an interesting summer on the privacy front. Following the Spring revelations at Harvard about email searches, we have watched Edward Snowden subject the intelligence agencies of the U.S. to a version of the classic Chinese water torture (except he has replaced the drops of water with bowling balls) by releasing information about all of the information being gathered by them. I’ve been a bit distressed by how little discussion there has been about all of this in public, although an interesting alliance of the far left and the far right in the House of Representatives (the self-proclaimed “Wing Nuts”) seems to be paying some attention.

There are also a host of interesting questions that aren’t being addressed, but which the different sides seem to assume have already been answered (often in different ways). One of these questions is whether gathering data is a privacy issue, or if the issue only arises if and when the data is accessed. Those defending the gathering of all of the data seem to think that it is access that needs to be monitored and watched, telling us that we shouldn’t be worried because while they have all that data, actual access to the data is far more controlled. Those who are worried about the gathering seem to believe that the act of gathering the data is the problem, often pointing out that once the data is collected, someone will do something with it. Another question has to do with whether or not privacy is violated when data is viewed algorithmically, rather than when a human being looks at it. Again, those defending the various data gathering programs seem to hold that computers looking at the data has no privacy implications, while those objecting to the programs think that even algorithms can violate privacy.

I think these are both interesting questions, and I’m not sure I know the right answer to either of them. I have been able to construct some cases that make me lean one way, while others make me lean the other.

Another issue I don’t see being raised has to do with the difference between privacy and anonymity, and how the two relate. In fact, what I see in a lot of the technical discussions around the questions of data aggregation, is an implicit equation of privacy and anonymity. This is an equivalence that I think does both sides a disservice, but especially the side wanting to argue for privacy.

Anonymity, roughly put, is the inability to identify the actor of a particular action or the individual with whom some set of properties is associated. The inability to identify may be because you can’t see the individual (as is done for symphony auditions, where the players are placed behind a screen, a practice that has increased the number of female members of major orchestras), or because there is no identifier associated with some document, or when a database has been scrubbed so that only some data is associated with each record (although this can be more difficult than most think).

Privacy is more difficult to characterize (take my course in the fall if you want lots more discussion of that), but is more involved in not knowing something about someone. My medical records are private not because you don’t know who I am, but because you don’t have access to (or have the good taste not to access) those facts about me. What happens in Vegas stays in Vegas not because everyone there is anonymous (that would make hotel registration interesting), but because those who are there don’t tell.

I often think that voting is the best example to illustrate this distinction. You don’t want voting to be anonymous; it is a good thing to need to identify yourself at the polls and make sure that you are on the voter lists (how you do this, and how much trouble it should be, is a very different issue). But voting is a very private thing; you want to make sure that the vote I cast is private both to protect me from any blowback (I grew up blue in a very red state) and to protect the integrity of the voting process itself (as long as voting is private, it is hard for someone trying to buy votes to determine if the money spent led to the right result in any individual case).

One problem with this slushy notion of how to define privacy is that it is hard to build a technology that will ensure it if you don’t know what it is. So a lot of work in the technology space that appears to preserve privacy actually centers around preserving anonymity. Tor is one of my favorite examples; it is often seen as privacy preserving, but in fact is designed to ensure anonymity.

The argument over the collection of metadata rather than data is all about this distinction. If (and it is a big if) the metadata on phone calls and internet communications only reveals the identity of those communicating, it violates the anonymity of those who are communicating. The analogy here is following someone and noting all of the people that person talks to, without actually hearing what the conversations are about. Such a thing would be creepy, but it isn’t clear (especially if you are following the person in public areas) that it violates anyone’s privacy.

Confusing privacy and anonymity also allows those who may be violating privacy to point out that ensuring anonymity helps bad people to cover their bad actions (the standard “terrorists and child pornographers” argument, which reduces to some variation of “if we ensure anonymity, we help the terrorists and child pornographers”). No one wants to enable the bad actors to act in those ways, so it appears that we have to give something up (although, if you really believe in privacy as a right, perhaps you are willing to give some of this up, just as free speech has to include those who say things that you really don’t like).

I’d really like to see some deeper thinking here, although I expect that it won’t happen, at least in public. These are important issues, and they should be thought about calmly and not in the heat of some shocking revelation (like the current discussion) or in reaction to some horrific event (like the 9/11 terrorist attacks, which gave us the current legal frameworks). One of the problems with privacy law in the U.S. is that it tends to be reactive rather than contemplative.

Maybe we can do better at Harvard. I hope so.


Immoral, Illegal, or Creepy…


About 5 years ago, I designed and started teaching CS 105, a computer science course entitled Privacy and Technology. The course grew out of a National Academies study that I had co-chaired, where it became apparent that people who knew about policy often didn’t know about technology, and those who knew about technology had no idea about how policy was made. I thought it would be good to bring the two groups together, and the subject of privacy seemed to be a reasonable place to start. After all, technology (at the time, such things as wireless sensors, RFID tags, and surveillance cameras) was often seen as impinging on privacy, and so the combination of privacy and technology seemed a good place to get people thinking about some of these issues.

I was pretty sure that privacy was going to be an important subject when we started the class (it was always a group effort, started by Michael Smith and me and joined by Latanya Sweeney and a crew of undergraduate and graduate TAs that have all had a hand in shaping the course). We have always let the privacy controversies that were current at the time shape the class (it is one of the only classes I know at Harvard that starts every day with a section on “current events”). But I had no idea that it was going to become as central to the public debate, either at Harvard or in the country, as it has become in the past few months. It seems sort of quaint that we were worried about RFID tags when the class first started, given the worries that we have now.

I’m not sure what I think about the current controversies, either local or national; I’d like to find out a lot more about the facts than I know right now before I come to a conclusion. But I will make a couple of observations, even this early in the game.

The first is that the reason for establishing the course is just as valid now as it was then. Watching the discussions and reading the debates, both local and national, about electronic privacy shows me once again the distance between those who make the policies and laws and those who understand the technology. The policy makers’ discussions of meta-data, the internet, and database technology are pretty uninformed. But often equally uninformed are the technologists’ discussions about the law and what guarantees we have or ought to have. The original reason for the course, which is to bring these worlds together, has not disappeared.

My second observation is that all who are in this discussion (and I’m glad to see how many are taking part) might want to think about one of the first distinctions we introduce in the course. This is a distinction between what is immoral, what is illegal, and what is creepy. Something is immoral if it violates an ethical code; this is the purview of philosophers (a group I was part of long ago) and can shed lots of light (or at least lots of distinctions) on the subject. And while we try to base our laws on ethical theories, laws are not able to fully reflect those theories. So the fact that something is immoral doesn’t mean that it is illegal; that is a matter left up to the lawyers and, more particularly, to the judges who interpret the laws. And both of these are different from something being creepy, which is simply a description of an emotional reaction we have to a situation. The fact that this reaction is emotional doesn’t make it any less real, and it may be indicative of the situation breaking some moral code. But it isn’t the same.

When looking at a supposed privacy violation, it is often reasonable to ask which of these three categories the violation falls into. The government building a database of all of the phone meta-data on all of its citizens is certainly creepy. I don’t know if it is illegal; that would need to be decided by the courts (and I hope it will be at some time in the very near future). And whether or not it is immoral is a much deeper question, having to do with the kinds of tradeoffs that are needed and the underpinnings of your ethics. It is the right question to ask, but it is a difficult one to answer.

The discussions of privacy aren’t about to go away or get any simpler. They are very important, and I’m glad that we are having them. And on a selfish note, they are going to make the next edition of CS 105 (coming up this fall) really interesting.


End of the year ramblings…


It is always hard to believe that yet another year has passed. It has been a rich one, with the emergence of HarvardX and edX, major changes in the Harvard Library, and a lot of work being done at Harvard IT. I taught another edition of the course that first brought me to Harvard (Distributed Computing) and the course that I helped introduce into Harvard (Privacy and Technology). A pretty full year.

What I find most interesting in looking back isn’t any of these, though. What I find most interesting is how IT is changing at Harvard, and in higher education more generally. The advent of on-line education is part of this change, but only a part. The full picture is far more complex, and far more radical, than just the advent of MOOCs (which, given all the attention to MOOCs, would seem difficult).

The baseline of IT in higher education really isn’t much different than IT in any other large organization. The goal has been to help run the business (and higher education does have major components that are like a business) and provide basic networking and computing infrastructure for the rest of the business. So Harvard IT runs a large set of business applications that are like any other business’ applications, having to do with payroll, and human resources, and budgeting. We also provide networking, email, and calendar functions for pretty much everyone associated with central administration and the Faculty of Arts and Sciences, much like every other IT organization supplies such services to the rest of the business.

Of course, higher education IT has always needed to provide some extras to help with the mission of the university. The most obvious example of this is the iSites application, which is used to run a large number of Harvard courses. Such academic technology is an important part of what higher education IT does, but has historically been a minor piece of the work, especially when measured in terms of the amount of money invested. Historically, between 2/3 and 3/4 of the IT budget has been spent on support of administrative computing, with the rest going to infrastructure and academic support.

This is going to have to change, because the use of computing (and storage) within higher education (in general) and Harvard (in particular) is changing. Computing is becoming increasingly central to both the teaching and research mission of Harvard. As digital mechanisms become more central to the core missions of the University, the role of IT is going to have to change. Rather than being part of the administrative background, IT is going to be part of everything that is going on.

Courses have, for some time, used computing to allow posting of readings and assignments, but we are moving to a time when a course will include streamed lectures, on-line discussions, and the construction of digital artifacts by the students for evaluation of their learning. Research in the sciences has long required access to large amounts of computing and storage, but that need is now moving to the social sciences and the humanities. Just take a look at what Jeffrey Schnapp is doing at the metaLab, or the way that Peter Der Manuelian approaches Egyptology. The need for large amounts of computing and storage is rapidly increasing, everywhere in the University.

This may become the newly expanded job of IT, or it may just require coordination with IT. In either case, the job of HUIT is going to be very different in the next couple of years. Our investment portfolio will, in all likelihood, invert. What we do now will, I predict, take up between 1/4 and 1/3 of our budget, and the rest will be taken up in support of research and teaching. Partly this will be done by savings that can be extracted in the administrative work that we are doing, as the cost of machinery goes down. Partly this will be done by adding to the investment in IT, but this will be hard in the current budget climate.

Most of this will occur as we decide to do less of one thing and more of another. There are tasks that we have been doing locally that may be outsourced or otherwise moved elsewhere. I suspect that there is a lot that can be gained from the commercial cloud providers, and other software-as-a-service providers. Some of this will be done by making more of what we do self-service; this can both decrease the cost of the IT group and empower the users, but has to be done carefully to ensure that service is not degraded. The way we work now is going to have to change.

I find such a prospect invigorating. Doing the same thing has never appealed to me, so the prospect of major change in the way things are done makes me anticipate the new year. It will be interesting, it will be challenging, but it won’t be the same and it won’t be boring. And who can ask for more than that?