Thoughts on time between…


It has been a full semester since the last post– it seems like just a moment. Fall is a favorite time of mine for lots of reasons, but Fall Semester is fun because I get to teach my privacy course. It is small, it is intense, and it is a lot of work, but it is more fun than I can imagine having in any other context. We had our usual count of the surveillance cameras on and around campus (a lot), a lab on unprotected web cameras that ended up with us all watching a wedding in a Las Vegas chapel (the bride went as Marilyn Monroe, the groom as early Elvis, and the official performing the ceremony as the later Elvis), and a slew of projects that were amazing, creepy, and often both.

This year was especially interesting, given all of the privacy-related things that were going on in the rest of the world. We had more Snowden revelations, new and interesting hacks, Facebook and Google doing things that were questionable, and goings on in our own community. All of which had privacy implications.

One event led to some interesting discussions, both in and out of class: the revelation that some educational research had taken place on the effect of attendance on the outcomes in a course. The research entailed placing cameras in some courses to determine how many students were attending on any given day. This was done without the knowledge of either the students in the class or the professors teaching those classes, and once revealed it led to considerable discussion in the general Harvard community. I was a part of some of these discussions in my role as CTO. But the most interesting conversations were the ones I had with my class.

The first thing that the students wanted to know was more details about the study itself– did the study get reviewed by the Institutional Review Board, and if so what did the application look like and what was the determination? How were the cameras used to determine attendance, and were they counting raw numbers or could the cameras tell if a particular student was attending or not? How were the courses that were part of the study selected? All of these were great questions, and at the time of maximum controversy, none were answered publicly. This didn’t stop a lot of the discussion, but it did cause the students to stop and wonder. We never got answers to all of the questions (in particular, we never saw the IRB application), but given the answers that did come out (the experiment was submitted to the IRB; the cameras could only tell whether or not someone was in a seat, not track individuals; classes were picked by being in the right Science Center rooms), things seemed a lot less sinister than they had at first. The students tended to think that the privacy of the students was not compromised (after all, what is the reasonable expectation of privacy in a lecture hall?) but perhaps the instructors’ privacy was (although there was continuing discussion about that).

What did become clear is that the issue really revolved around trust, and the more information that was available, the easier it was to trust that the researchers were not compromising privacy. When there were few facts, it was easy to speculate and ascribe questionable motives to those involved, which raised all of the privacy hackles. But as we all learned more, it became easier to see clearly what was being done, and to become less concerned about the motives of the actors. It still seemed sort of creepy, but it was hard to see the actions as immoral (and we never got to the discussion of illegal).

Anonymity, Privacy, and Big Data


It’s been a while since my last post. Some interesting things have been going on, especially in the area of privacy, big data, and the work I’ve been doing with the researchers looking at the information coming out of the Massive Open On-line Classes (MOOCs) being offered by HarvardX on the edX platform. The latest result, just published on-line, shows that if you equate privacy with anonymity (which is commonly done in laws), you can either preserve privacy or do good science, but not both. Which is a pretty interesting result, I think. Let me lead you through the thinking, with apologies for the long and rambling nature of the post.

When I first started offering the course Privacy and Technology, I did so thinking that it would be a topic that would allow discussion of the interaction of technology and policy without being overly trendy or taken over by events. I remember my co-instructor at the time, Mike Smith, saying in the shopping lecture that he thought privacy was going to be the issue of the next decade; it was a shopping lecture, so I figured being a bit over-the-top was fine. Turns out he was a much better predictor of the future than I; privacy has become a huge issue that has now become one of my major research areas.

One reason for the worry is the advances in the ability to use Big Data to analyze all kinds of things. People fret about all the information the NSA appears to be gathering about everyone, are creeped out by how much Google or other web properties know about them, and generally worry that nothing can be kept secret (and it will just get worse as more sensors are connected to the internet). There are very few general regulations or laws in the U.S. about what data can be gathered and what that data can be used for. Google and Facebook can pretty much do what they want with the data they gather as long as they don’t lie to you in their privacy statements (although how you know what those statements say is an exercise left to the reader).

There are some areas that are regulated. Everyone seems to know about the privacy requirements imposed by the Health Insurance Portability and Accountability Act (better known as HIPAA), although as my colleague Latanya Sweeney often points out, this regulation is more about what data can be shared than it is about what is private. The other industry that is highly regulated with respect to privacy is education, which is covered by the Family Educational Rights and Privacy Act (FERPA). Most of these regulations center on the notion of de-identified data– that is, data that has been changed so that you can’t identify individuals from the data. HIPAA specifically defines what it means to de-identify the data– you have to take out names, addresses, social security numbers, and the like. The advantage of the definition is that you can tell when your data has been HIPAA de-identified; the disadvantage is that it isn’t all that hard to re-identify individuals given the de-identified data and other, pretty easily acquired information.

FERPA is much less well defined; it talks about data that could not be re-identified by a reasonable person (is Latanya reasonable? Probably not, but how about all of the students she has trained?). Where it does get something like specific, it seems to rest on the notion of k-anonymity (invented by none other than Professor Sweeney), which in its simplest formulation says that for any set of properties of any individual in your data set, there must be at least k-1 other individuals within that set with the very same set of properties. So if your data set is, say, k-anonymous for k=5, it means that for any information about some particular individual in the data set, there will be at least 4 other individuals in the set that can’t be distinguished from the original.
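The property is simple enough to check mechanically. Here is a minimal sketch in Python (the records, field names, and choice of quasi-identifiers are all hypothetical, just for illustration):

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Return True if every combination of quasi-identifier values
    appears at least k times in the data set."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in counts.values())

# Hypothetical records: two people share one combination of
# quasi-identifiers, three people share another.
rows = (
    [{"zip": "02138", "age": "20-29"}] * 2
    + [{"zip": "02139", "age": "30-39"}] * 3
)

print(is_k_anonymous(rows, ["zip", "age"], 2))  # True: every group has at least 2
print(is_k_anonymous(rows, ["zip", "age"], 5))  # False: no group has 5
```

The hard part in practice is not this check but choosing what counts as a quasi-identifier, and then generalizing or suppressing values until the check passes.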

Now, the MOOCs that Harvard has been running have been generating a lot of data about how students taking the MOOCs interact with the course material. What videos get viewed, which get sped up or slowed down, what questions are answered correctly and after how many tries– all of this information is available on anywhere from thousands to hundreds of thousands of students. Over the past year we (a research group from Harvard and MIT) have been able to do some pretty interesting data analysis on this, and are starting to learn some things about how students learn on-line.

But if you are really committed to science, you want to allow others to test your work by running their own analysis on the data. We also think others will ask questions we have failed to ask, which is a good thing. So we would like to share the data. But this being data about students, we (or, more accurately, the lawyers) worry about FERPA. So before we could share it, we needed to make it k-anonymous (where, in this case, k=5). Using a variety of techniques, we were able to produce a basic data set that was 5-anonymous and gave data on over 800,000 students who had taken MOOCs. Which we then released to other researchers.

But then we thought, just to be sure, that we should run our original analyses over the de-identified data set. When we did, the results we got were very different. Significantly so. For details, see the paper, but the difference was enough that it wasn’t clear that the de-identified set would lead to good science. Which is a huge problem.
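The paper has the real numbers, but the mechanism is easy to illustrate with a toy example. One common route to k-anonymity is to suppress records whose quasi-identifier combination appears fewer than k times; the sketch below (entirely made-up data, not our data set) shows how that suppression alone can shift a simple statistic like the completion rate:

```python
from collections import Counter

# Made-up enrollment records: (country, completed-the-course flag).
records = (
    [("US", 1)] * 3 + [("US", 0)] * 7    # 10 US learners, 3 complete
    + [("IN", 1)] * 5 + [("IN", 0)] * 5  # 10 Indian learners, 5 complete
    + [("IS", 1)] * 3                    # 3 Icelandic learners, all complete
)

def completion_rate(rows):
    return sum(done for _, done in rows) / len(rows)

def suppress_for_k(rows, k):
    """Drop every record whose country appears fewer than k times --
    the simplest way to make this one quasi-identifier k-anonymous."""
    counts = Counter(country for country, _ in rows)
    return [r for r in rows if counts[r[0]] >= k]

print(round(completion_rate(records), 3))                     # 0.478
print(round(completion_rate(suppress_for_k(records, 5)), 3))  # 0.4
```

The records that get suppressed are exactly the unusual ones, so the statistics computed from the anonymized set are biased away from the originals; this is the heart of the tension between anonymity and good science.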

We hope that the paper will cause some discussion, both among researchers and policy makers. There are others who have worried that there is a conflict between big data science and the preservation of privacy through anonymity (most notably, the President’s Council of Advisors on Science and Technology in a recent report), but as far as I know our paper is the first to show that the problem is real.

So now I’m thinking maybe Mike Smith was wrong. This is a problem that is going to be around for a lot longer than a decade.

WeCode and Visceral Education


Last weekend I had the great pleasure of attending the WeCode conference run by the Harvard Undergraduate Women in Computing group. It was a great event; well-organized, well-attended, and far more interesting than the “OMG, Goldman Sachs was giving out nail files with their logo, how insensitive” meme that seems to have run rampant on news sites that should know better. I was there to moderate a set of panels, but decided to attend most of the Saturday event to see what it was like.

The first keynote was in one of the large lecture theaters in the Science Center. When I walked in, there were probably 200 conference goers in their seats, and more were streaming in. I took three or four steps into the hall, and it suddenly hit me. I was one of maybe two or three men in the hall. I’ve never been accused of being shy, but I felt completely out of place. Completely other. All of the voices in my head were saying “get out of here… go to your office and get some work done…”. All the flight responses were active.

And at the same time, I was realizing that this is the feeling everyone else in the room must have at every other tech conference in the world, or in most computer science classes, or tech gatherings in general. It was a Zen experience. I suddenly felt that I had a better understanding of what women in computer science (and the STEM fields more generally) are up against.

I’ve tried to be a supporter of women in software positions all my life. My groups at Sun always had women software engineers, and my closest collaborator over most of my career was a woman. I’ve tried to encourage women in my classes. The last edition of my privacy course was two-thirds female (a fact that one of the male students complained about; his complaint was an opportunity for a discussion of these issues, which I hope had some impact). But I’ve never felt the problem the way I did last Saturday.

I’ll admit I’m not sure what to do about this. But it is a problem, not just of fairness and justice, but for the field. We need good people in software engineering, computer science, and related fields. The supply of any kind of people can’t keep up with the demand, and the supply of good people isn’t even close. Artificially limiting the supply of talent to half the population is insane, destructive, and wrong. Changing this will be hard, because not everyone understands. I thought I understood, but I didn’t really. I don’t fully understand now, but I’ve had a good lesson. It’s amazing how much more effective a lesson is when it arrives through the emotions instead of the brain.

I’m still thinking about the experience. But I know I won’t think about women’s issues in the STEM field in the same way. For that reason alone, the WeCode conference may have been the most educational I’ve ever attended.

The four best books for software engineers


This semester I’ve had the good fortune to be involved in the pilot of the Digital Problem Solving Initiative, a program being started out of the law school and the Berkman Center to allow Harvard students (from all schools) to get some mentored experience in dealing with various sorts of digital problems. I’ve been leading a group of law students, FAS graduate students, and college students looking at some of the issues around the research data coming in from the HarvardX courses that are running on the edX platform. It’s a great group, and we have a nice combination of law questions (just what does FERPA require, anyway?), policy matters (how do we preserve privacy in these sorts of data sets?), and programming problems (they really think we can use *this* data?).

The group is a lot of fun, and we’ve had a stream of visitors as well to liven things up. A couple of weeks ago, Doc Searls and his wife joined us. Doc is one of the authors of the Cluetrain Manifesto, and I mentioned that it was one of the three or four books that all programmers should read. After the session, a couple of the participants asked what the others were.

Which got me to thinking. Most of the people who talk about the three or four books that everyone should read only tell you one of the books at any one time, and if you take the union over all those occasions, the total is in the 20s or 30s. So trying to come up with all of them at once is difficult, and takes some thinking. It’s much easier to come up with one (the “desert island book”), or 20. But four?

But here we go…

First on the list would be Tracy Kidder’s The Soul of a New Machine. This is a non-fiction report of the development of a 32-bit minicomputer by Data General in the late 1970s. But the reason to read a book about a long-forgotten machine with what is now a laughable computing capability by a company that no longer exists is to see how little the technology industry changes. The personalities, problems, and solutions described in this book are as relevant and real now as they were then. There are the same personalities (the architect, the team lead, marketing, management), the same situations (despair, joy, post-shipping depression), and the same unvarying truths (the way to get the impossible done is to give the task to an intern, but make sure you don’t tell the intern it is impossible). I re-read this book every couple of years, and it still rings true.

Second on the list is Fred Brooks’s The Mythical Man Month. Though it discusses a machine that predates the one described in Kidder’s book, this book contains most of what is true about software engineering. This is the book that first enunciated the principle that adding people to a late project makes it later, but there is much more here. This is where I learned how to estimate how long a project will take, how to size a team, and so much else. The problem with this book is that it is such a classic that everyone reads it early in their career (or education) and then forgets it. It should be re-read early and often. These truths don’t change, and the combination of this book and Kidder’s will remind you that high technology is still mostly about people (or, as I’m sometimes known to say, the technical problems are easy compared to the problems of primate behavior).

Third on the list is the aforementioned Cluetrain Manifesto. It packs a lot of truth about the new networked world into a small number of pages, and is necessary reading for anyone who deals with that world. A lot of what is said in this work is now common knowledge, so I sometimes worry that those reading it now won’t understand how radical it was when it was first published (and on the Web, no less). But other parts of the book are still not clearly understood, and remain well worth reading.

My fourth book would be Kernighan and Ritchie, The C Programming Language, more generally known by its initials, K and R. This is the only overtly technical book on the list, and given how few people live their lives in the C language anymore, may seem an odd choice. Programmers shouldn’t read this book to learn the C language (although you can’t read it and not learn the C language). Programmers should read this book to understand how computers work, and to see an example of great writing about a technical subject.

Any computer language is a model for computation. COBOL models computation as a filing cabinet, allowing you to take something out of the cabinet, do some simple things with it, and put it back. Java models computation as a set of interacting objects (unless you use J2EE, in which case the model is, well, more like a filing cabinet). LISP models computation as a set of composable functions.

C models computation as a computer; more precisely C models computation as a PDP-11. But this isn’t a bad model to learn, as most of our current computers are, at heart, PDP-11s. Learning C lets you understand what the computer is doing; it is the most useful assembly language you can learn.

But the main reason for reading K and R is to be exposed to the most elegant exposition of a programming language (and programming) I know. The elegance of the writing is so pervasive that you don’t even notice it (which is true elegance), but everything just makes sense. As a model for how to explain the complex, there is no better example.

And those are my four. Unlike most books in the software section of your bookstore (or Amazon), I’m reasonably confident that they will still be the four I would pick in ten years. They have all not only aged well, but become better with age. Something we can all aspire to do, both in our work and in ourselves.


Privacy and Anonymity


It has been an interesting summer on the privacy front. Following the Spring revelations at Harvard about email searches, we have watched Edward Snowden subject the intelligence agencies of the U.S. to a version of the classic Chinese water torture (except he has replaced the drops of water with bowling balls) by releasing details of all the information those agencies have been gathering. I’ve been a bit distressed by how little public discussion there has been about all of this, although an interesting alliance of the far left and the far right in the House of Representatives (the self-proclaimed “Wing Nuts”) seems to be paying some attention.

There are also a host of interesting questions that aren’t being addressed, but which the different sides seem to assume have already been answered (often in different ways). One of these questions is whether gathering data is a privacy issue, or if the issue only arises if and when the data is accessed. Those defending the gathering of all of the data seem to think that it is access that needs to be monitored and watched, telling us that we shouldn’t be worried because while they have all that data, actual access to the data is far more controlled. Those who are worried about the gathering seem to believe that the act of gathering the data is the problem, often pointing out that once the data is collected, someone will do something with it. Another question has to do with whether or not privacy is violated when data is viewed algorithmically, rather than when a human being looks at it. Again, those defending the various data gathering programs seem to hold that computers looking at the data has no privacy implications, while those objecting to the programs think that even algorithms can violate privacy.

I think these are both interesting questions, and I’m not sure I know the right answer to either of them. I have been able to construct some cases that make me lean one way, while others make me lean the other.

Another issue I don’t see being raised has to do with the difference between privacy and anonymity, and how the two relate. In fact, what I see in a lot of the technical discussions around the questions of data aggregation, is an implicit equation of privacy and anonymity. This is an equivalence that I think does both sides a disservice, but especially the side wanting to argue for privacy.

Anonymity, roughly put, is the inability to identify the actor of a particular action or the individual with whom some set of properties is associated. The inability to identify may be because you can’t see the individual (as is done for symphony auditions, where the players are placed behind a screen, a practice that has increased the number of female members of major orchestras), because there is no identifier associated with some document, or because a database has been scrubbed so that only some data is associated with each record (although this can be more difficult than most think).

Privacy is more difficult to characterize (take my course in the fall if you want lots more discussion of that), but is more involved in not knowing something about someone. My medical records are private not because you don’t know who I am, but because you don’t have access (or have the good taste not to access) those facts about me. What happens in Vegas stays in Vegas not because everyone there is anonymous (that would make hotel registration interesting), but because those who are there don’t tell.

I often think that voting is the best example that can illustrate this distinction. You don’t want voting to be anonymous; it is a good thing to need to identify yourself at the polls and make sure that you are on the voter lists (how you do this, and how much trouble it should be, is a very different issue). But voting is a very private thing; you want to make sure that the vote I cast is private both to protect me from any blowback (I grew up blue in a very red state) but also to protect the integrity of the voting process itself (as long as voting is private, it is hard for someone trying to buy votes to determine if the money spent led to the right result in any individual case).

One problem with this slushy notion of how to define privacy is that it is hard to build a technology that will ensure it if you don’t know what it is. So a lot of work in the technology space that appears to preserve privacy actually centers on preserving anonymity. Tor is one of my favorite examples; it is often seen as privacy preserving, but in fact is designed to ensure anonymity.

The argument over the collection of meta-data rather than data is all about this distinction. If (and it is a big if) the metadata on phone calls and internet communications only reveals the identity of those communicating, then it violates the anonymity, not the privacy, of those who are communicating. The analogy here is following someone and noting all of the people that person talks to, without actually hearing what any of the conversations are about. Such a thing would be creepy, but it isn’t clear (especially if you are following the person in public areas) that it violates anyone’s privacy.

Confusing privacy and anonymity also allows those who may be violating privacy to point out that ensuring anonymity helps bad people cover their bad actions (the standard “terrorists and child pornographers” argument, which reduces to some variation of “if we ensure anonymity, we help the terrorists and child pornographers”). No one wants to enable the bad actors to act in those ways, so it appears that we have to give something up (although, if you really believe in privacy as a right, perhaps you are willing to give some of this up– just as free speech has to include those who say things that you really don’t like).

I’d really like to see some deeper thinking here, although I expect that it won’t happen, at least in public. These are important issues, and they should be thought about calmly and not in the heat of some shocking revelation (like the current discussion) or in reaction to some horrific event (like the 9/11 terrorist attacks, that gave us the current legal frameworks). One of the problems with privacy law in the U.S. is that it tends to be reactive rather than contemplative.

Maybe we can do better at Harvard. I hope so.


Immoral, Illegal, or Creepy…


About five years ago, I designed and started teaching CS 105, a computer science course entitled Privacy and Technology. The course grew out of a National Academies study that I had co-chaired, where it became apparent that people who knew about policy often didn’t know about technology, and those who knew about technology had no idea how policy was made. I thought it would be good to bring the two groups together, and the subject of privacy seemed a reasonable place to start. After all, technology (at the time, such things as wireless sensors, RFID tags, and surveillance cameras) was often seen as impinging on privacy, and so the combination of privacy and technology seemed a good place to get people thinking about some of these issues.

I was pretty sure that privacy was going to be an important subject when we started the class (it was always a group effort, started by Michael Smith and me and joined by Latanya Sweeney and a crew of undergraduate and graduate TAs who have all had a hand in shaping the course). We have always let the privacy controversies current at the time shape the class (it is one of the few classes I know of at Harvard that starts every day with a section on “current events”). But I had no idea that it was going to become as central to the public debate, either at Harvard or in the country, as it has become in the past few months. It seems sort of quaint that we were worried about RFID tags when the class first started, given the worries that we have now.

I’m not sure what I think about the current controversies, either local or national; I’d like to find out a lot more about the facts than I know right now before I come to a conclusion. But I will make a couple of observations, even this early in the game.

The first is that the reason for establishing the course is just as valid now as it was then. Watching the discussions and reading the debates, both local and national, about electronic privacy shows me once again the distance between those who make the policies and laws and those who understand the technology. The policy makers’ discussions of meta-data, the internet, and database technology are pretty uninformed. But often equally uninformed are the technologists’ discussions about the law and what guarantees we have or ought to have. The original reason for the course, which is to bring these worlds together, has not disappeared.

My second observation is that all who are in this discussion (and I’m glad to see how many are taking part) might want to think about one of the first distinctions we introduce in the course. This is a distinction between what is immoral, what is illegal, and what is creepy. Something is immoral if it violates an ethical code; this is the purview of philosophers (a group I was part of long ago) and can shed lots of light (or at least lots of distinctions) on the subject. And while we try to base our laws on ethical theories, laws are not able to fully reflect those theories. So the fact that something is immoral doesn’t mean that it is illegal; that is a matter left up to the lawyers and, more particularly, to the judges who interpret the laws. And both of these are different from something being creepy, which is simply a description of an emotional reaction we have to a situation. The fact that this reaction is emotional doesn’t make it any less real, and it may be indicative of the situation breaking some moral code. But it isn’t the same.

When looking at a supposed privacy violation, it is often reasonable to ask which of these three categories the violation falls into. The government building a database of all of the phone meta-data on all of its citizens is certainly creepy. I don’t know if it is illegal; that would need to be decided by the courts (and I hope it will be at some time in the very near future). And whether or not it is immoral is a much deeper question, having to do with the kinds of tradeoffs that are needed and the underpinnings of your ethics. It is the right question to ask, but it is a difficult one to answer.

The discussions of privacy aren’t about to go away or get any simpler. They are very important, and I’m glad that we are having them. And on a selfish note, they are going to make the next edition of CS 105 (coming up this fall) really interesting.

End of the year ramblings…


It is always hard to believe that yet another year has passed. It has been a rich one, with the emergence of HarvardX and edX, major changes in the Harvard Library, and a lot of work being done at Harvard IT. I taught another edition of the course that first brought me to Harvard (Distributed Computing) and the course that I helped introduce into Harvard (Privacy and Technology). A pretty full year.

What I find most interesting in looking back isn’t any of these, though. What I find most interesting is how IT is changing at Harvard, and in higher education more generally. The advent of on-line education is part of this change, but only a part. The full picture is far more complex, and far more radical, than just the advent of MOOCs (which, given all the attention to MOOCs, would seem difficult).

The baseline of IT in higher education really isn’t much different from IT in any other large organization. The goal has been to help run the business (and higher education does have major components that are like a business) and to provide basic networking and computing infrastructure for the rest of the organization. So Harvard IT runs a large set of business applications that are like any other business’ applications, having to do with payroll, and human resources, and budgeting. We also provide networking, email, and calendar functions for pretty much everyone associated with central administration and the Faculty of Arts and Sciences, much like every other IT organization supplies such services to the rest of the business.

Of course, higher education IT has always needed to provide some extras to help with the mission of the university. The most obvious example of this is the iSites application, which is used to run a large number of Harvard courses. Such academic technology is an important part of what higher education IT does, but has historically been a minor piece of the work, especially when measured in terms of the amount of money invested. Historically, between 2/3 and 3/4 of the IT budget has been spent on support of administrative computing, with the rest going to infrastructure and academic support.

This is going to have to change, because the use of computing (and storage) within higher education (in general) and Harvard (in particular) is changing. Computing is becoming increasingly central to both the teaching and research mission of Harvard. As digital mechanisms become more central to the core missions of the University, the role of IT is going to have to change. Rather than being part of the administrative background, IT is going to be part of everything that is going on.

Courses have, for some time, used computing to allow posting of readings and assignments, but we are moving to a time when a course will include streamed lectures, on-line discussions, and the construction of digital artifacts by the students for evaluation of their learning. Research in the sciences has long required access to large amounts of computing and storage, but that need is now moving to the social sciences and the humanities. Just take a look at what Jeffrey Schnapp is doing at the metaLab, or the way that Peter Der Manuelian approaches Egyptology. The need for large amounts of computing and storage is rapidly increasing, everywhere in the University.

This may become the newly expanded job of IT, or it may just require coordination with IT. In either case, the job of HUIT is going to be very different in the next couple of years. Our investment portfolio will, in all likelihood, invert. What we do now will, I predict, take up between 1/4 and 1/3 of our budget, and the rest will be taken up in support of research and teaching. Partly this will be done through savings extracted from the administrative work we are doing, as the cost of machinery goes down. Partly this will be done by adding to the investment in IT, but this will be hard in the current budget climate.

Most of this will occur as we decide to do less of one thing and more of another. There are tasks that we have been doing locally that may be outsourced or otherwise moved elsewhere. I suspect that there is a lot that can be gained from the commercial cloud providers, and other software-as-a-service providers. Some of this will be done by making more of what we do self-service; this can both decrease the cost of the IT group and empower the users, but has to be done carefully to ensure that service is not degraded. The way we work now is going to have to change.

I find such a prospect invigorating. Doing the same thing has never appealed to me, so the prospect of major change in the way things are done makes me anticipate the new year. It will be interesting, it will be challenging, but it won’t be the same and it won’t be boring. And who can ask for more than that?

The Bozo Event Horizon


I’m on a Harvard mailing list for some folks interested in startups and innovation. A recent thread of discussion was around hiring, and in a posting to the group I talked about making sure that you did your hiring so that you avoided the bozo effect. I was asked by a number of people what I meant by that, which led to a long post that generated some interest. I thought it might be of interest to a wider audience as well, so I’m posting it here…

On hiring, bozos, and some (admittedly biased) history

Some time ago on this list I sent out a message concerning hiring, and mentioned that you need to avoid bozos if you want your company to survive. I said in that post

It is a truism that good people want to work with other good people; a corollary to this is that bozos attract other bozos. Once the bozo count reaches a certain percentage, the company is doomed (I saw this happen from the outside to Digital Equipment Corporation and from the inside to Sun; I’m worried that Google may have hit the bozo event horizon).

A number of you asked, either privately or publicly, if I would expand on this, and perhaps talk about what happened at Sun and DEC, and what I’m seeing happening at Google (and what I mean by a bozo). These are difficult topics, some intellectually so and others emotionally so. But I’ve been thinking about this for a bit, and I’ll give it a try.

Let’s start with the notion of a bozo. All of the great companies I have worked for (Apollo and Sun in various incarnations) or heard about (DEC, PARC, Bell Labs and the like) started around a core of incredible people. These were people who are or were legends in the field. They were the ones who were 10 or 100 times as productive as the average engineer. Some, like Bill Joy, are idea gerbils who can spout out hundreds of original ideas a week. Only some of them are actually workable, but if there is someone around to catch the good ones and edit the losers, these people change the world. Others, like James Gosling, quietly change the world by building something (the core Java language and libraries) that makes so much sense and is so elegant that you just smile when you use it.

Good tech companies find a way to reward these people without making them go into management or otherwise change what they are doing. DEC had the title of consulting engineer and senior consulting engineer; at Sun there were the distinguished engineers and fellows. These were levels above the rank and file engineers; no one could expect to be promoted to that level, but you always hoped to become one of the elect. I remember being told that the requirement for becoming a Sun Fellow was that you had invented one or more major branches of computer science; the original fellows (Bob Sproull, Ivan Sutherland, and Peter Deutsch) all qualified on that metric.

One aspect of these positions is that they generally required peer review. You couldn’t become a Sun DE or a DEC consulting engineer just because the managers said you should. You became one because the other DEs or CEs had looked at your technical chops, and said that you were one of the elect. It was often compared to getting tenure, except that it was often more difficult; professors with tenure who shifted to these companies often weren’t passed into this level. And these people were the keepers of the corporate technical flame, making sure that the company stayed on the right (technical) footing.

The core of this decision procedure was the ability of the top-level technical talent to make technical judgements about other technical contributors. But at some point in the history of these companies, there arose worries that the selection criteria weren’t, in some sense, fair. People who, from the manager’s point of view, did great work weren’t being selected by the technical leaders to join the top group. People who did other kinds of important work were seen as being de-valued because they weren’t being allowed into the upper ranks. And at some point, in the name of “fairness” or “diversity of skills” or the like, contributors who would not have otherwise been let in are added to the group.

And these are the bozos. Not necessarily bad people, or even unintelligent, but those who have been promoted to a level where they are given technical weight that they don’t deserve. The “A” team now has some “B” members, but those outside of the team (and maybe some inside of the team) can’t tell the difference. The upper levels of the technical parts of the company now have some people who are driven more by politics, or quick business wins, or self-promotion (all of which may have been the skills that got them support from the non-technical to be promoted to the technical elite). Without a clear technical voice, management does the best it can. But the ship is somewhat rudderless.

Worse still, the bozos will push to promote others like themselves. Which dilutes the technical thinking even more. At some point, what used to be technical discussions devolve into discussions about politics, or business models, or market share. All of which may be important, but they aren’t the technical discussions that had made the company a leader. This is when you have reached the bozo event horizon. I’ve never seen a company recover.

All of this is about the technical bozos, because that is what I’ve experienced. But it wouldn’t surprise me to find that the same sort of phenomenon goes on in marketing, or management, or any other field. The indicator is when process and fairness become more important than judgement, and when it isn’t ok to say that some people have reached their limit. Or maybe this is something that happens more in the technical parts of an organization than in the others. I wouldn’t know.

I don’t know that Google has hit the bozo event horizon, but I’m worried that they might have. Part of the worry is just because of their size; it is really hard to grow the way Google has without letting some lightweights rise to the top. The other is their hiring process (full disclosure: I’ve looked at Google a couple of times and it never worked) which has gotten pretty process-bound and odd. The last time I went through it, the site manager admitted that I was plenty smart, but they didn’t know what they would do with me. Given what they were obviously looking for, I wasn’t sure what I would do with them, either. But the whole process seems to indicate that they are looking for people to fit a pre-defined mold, which the top performers generally don’t do all that well. In fact, the Google process reminded me of the time, more than 20 years ago, when I interviewed at Microsoft. And we saw how well that worked…


Why now?


One question I’ve been asked repeatedly about Harvardx and edX is, essentially, “why now?” On-line and distance education have been around for a long time, and while both have had some exciting times they haven’t changed the face of education as we know it. Now everyone is thinking about how to educate billions of people over the internet. What has changed to make it more interesting this time around? Why now?

I think we may well be witnessing the beginning of a paradigm shift in higher education, which makes the question all the more interesting. And when I talk about this being a paradigm shift, I mean it in the sense originally outlined by Thomas Kuhn in The Structure of Scientific Revolutions, not in the sense that it is generally meant today. Kuhn’s notion was that such shifts force a new way of looking at the world, a change that is so radical that the practitioners of the old science or paradigm can’t even understand the practitioners of the new science or paradigm. Such shifts, argues Kuhn, happen not because of a single experiment, result, or difficult-to-explain phenomenon, but rather when the set of counter-examples, unexplained phenomena, or changes to the current paradigm in science becomes so cumbersome that there is intellectual room for a new approach that deals with the problems in the received theories in a radically different (and more elegant) way.

I think what we are seeing in the on-line education arena is the beginnings of a paradigm shift/revolution in education (rather than the more restricted “on-line teaching”). And like a scientific revolution, it isn’t being caused by one particular thing, but rather by an accretion of a number of factors that are encouraging the more adventurous in the field to think in different ways.

Some of these changes are technological. Access to computers, and more importantly to networked computers, is now pervasive over much of the human population. Many of these computers are called cell phones, but the fact is that they have the power of a high-end engineering workstation of not that long ago, and are connected on a network with reasonable speed. The pervasiveness of these devices can’t be overstated– think of all of the interesting micro-scale power generation ideas that you have heard about that are designed to allow people in the more remote parts of Africa or Asia to charge their cell phones.

Beyond the technology, there are the problems with the way education is done today. The obvious problem has to do with the cost of education, which gets a lot of press and is a genuine worry. But just as worrying is the uneven geographic distribution of education across the globe. Not everyone who wishes to learn is able to get to a place where they can learn (or learn well); partly this is cost but mostly this is physics. High-quality education is currently available in a breathtakingly small number of locations, often in areas that are difficult and/or expensive to get to.

And then there is the problem that current education doesn’t scale much beyond the point where we are now. The last major scaling revolution in higher education was the introduction of the lecture. This allowed a single professor to educate a couple hundred students at a time, rather than the seminar-style of teaching that limited a single professor to educating one to ten students at a time. The quality of education through a lecture was not the same as the quality of the seminar, but it was good enough that the scale made up for it. In the same way, approaches to on-line education allow educating hundreds of thousands rather than hundreds at a time. The education may not be quite as good, but the current belief seems to be that the scale makes up for this.

None of these by themselves are enough to explain why we are seeing the surge in interest in on-line and technology-enabled education. But taken together they put pressure on the status quo. Add a visionary like Sal Khan who shows an alternative without asking if it is a good idea, and we get the revolutionary ball rolling.

As I’ve said before, I don’t see this revolution as being about on-line education. I see it as being much broader than that, looking at how we can use the technologies currently being pioneered in on-line education to make all education, whether on-line, commuter, or residential, better. I’d love to use on-line mechanisms to allow me to never have to give another large lecture. I’d much rather use class time for discussions, or team problem solving, or in some other way that would engage the student and teach the techniques of thinking rather than the content of a field.

I don’t know if the approaches that are being tried in edX or Harvardx will be a better way of teaching. But trying them lets us ask a different set of questions. And that, in itself, is one of the main characteristics of a paradigm shift: when practitioners of the new paradigm are asking different questions than the practitioners of the old paradigm.

Here we go…


Residential Education


A fairly consistent reaction to the advancement of on-line educational materials (like edX or its west-coast counterparts) is that this is the beginning of the end for residential higher education. If you can take a course over the internet, the reasoning goes, why spend the time and the money to actually go to some place for college? It is far more efficient to do your travel virtually. If the end result is the same (or even close to it) there is no need for the overhead of the residential education.

Back when I was in the commercial world of hi-tech, I used to refer to thinking like this as being an example of the Highlander Fallacy. This is the fallacy based on the assumption that there can be only one: one programming language, one database, one web-server, one network transport. The new will always win over the old, and we will unify around a single standard that everyone will use. The real world doesn’t work that way; while there may be a best programming language, database, or transport for any (particular) problem, there isn’t a best of any of these for all problems.

Saying that on-line education will replace residential education is another example of the Highlander Fallacy. But it also misses the point of residential education in so many ways that it is hard to know just where to begin. A residential education is a way to spend four years in a community that is all about learning, allowing students to experiment in ways that they won’t be able to when they are out of school. At a place like Harvard, the interaction with other students is probably more educational than any courses that you could take. And heading off to college is the first chance many get to re-invent themselves; going to a new community frees us of the history that has been built up around us in our old community.

But the real reason residential education (or at least co-located education) will never go away has to do with the different kinds of things that we learn when mastering a subject. There are multiple things that need to be learned to attain mastery in a particular subject. One set of things is the content of that subject matter. But the other, more subtle and I think more important, is a set of techniques around problem solving that are used in that subject. Back in my days as a philosopher, there was an important distinction between knowing that and knowing how. Knowing that has to do with the content of a field. Knowing how is a set of skills that allow you to think like a practitioner in that field.

Consider the example of computer science. The content of computer science includes, among other things, learning about a lot of algorithms, different programming languages, the principles of operating systems and databases, and the math needed to understand cryptography. But the techniques of computer science are none of those– they have to do with learning to decompose a problem into a set of (hopefully simpler) problems, of knowing how to build a set of interfaces between different components, or how to design a user interface so that it is intuitive and easy to learn. The notion of computational thinking is all the rage at the moment, but the real core of that kind of thinking is learning how to approach problems the way a computer scientist would.
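The decomposition technique described above can be shown concretely. Here is a toy sketch (all names and the word-counting task are invented for illustration, not from the original post): one problem, counting the most common words in a text, broken into small pieces, each hiding its details behind a simple interface.

```python
# A toy illustration of problem decomposition: the overall task is
# split into small functions, each with a narrow, easy-to-test interface.
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words, stripping surrounding punctuation."""
    return [w.strip(".,;:!?\"'()") for w in text.lower().split()]


def count_words(words: list[str]) -> Counter:
    """Count occurrences of each non-empty word."""
    return Counter(w for w in words if w)


def top_words(text: str, n: int = 3) -> list[tuple[str, int]]:
    """Compose the simpler pieces into a solution to the full problem."""
    return count_words(tokenize(text)).most_common(n)
```

Each piece can be understood, tested, or replaced (say, swapping in a smarter tokenizer) without touching the others, which is exactly the habit of mind the paragraph above is pointing at.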

Other fields have other ways of approaching problems, which are the techniques of that field. You need to learn the content of the field to become a practitioner, but it is far more important to learn the ways of thinking. When I studied philosophy, it seemed that most of the content of the field was uninteresting (which may be why I’m no longer a philosopher), but the techniques of analytic philosophy were very interesting (and have served me well as a computer engineer and scientist).

Bringing this back to on-line education– I think that the real promise of on-line education is the ability to teach the content of a field. But it is going to be much harder to teach the techniques of thinking in an on-line fashion. The best ways to teach technique tend to look like the apprenticeship model. I’ve talked about this for system design elsewhere, but I believe it is true for lots of other fields as well. That is where the residential (or at least face-to-face) form of teaching will still be needed.

In fact, I think the proper use of on-line learning materials will enhance the residential experience. If we can get most of the content taught on-line, we will have more time to mentor students in the techniques of a field. I’d love not to have to do lectures again, and just work on problems and code and designs with students. That sort of work needs the content to be known, but is much more rewarding for both the student and the teacher.

So don’t think of edX as replacing the residential experience. The real goal is to enhance that experience.