Technology and government

1

Our discussion in our Freshman Seminar this last week concerned how technology in general and the Internet in particular could be used to help governments (at all levels) be more responsive and deliver services better. We were fortunate to have David Eaves join us; he has been an advocate for open data and using technology in government for some time, so getting advice from someone who has been there is really helpful.

What I keep thinking about, after the discussion, is the number of large technical projects within various parts of government that fail. There are the really visible failures like healthcare.gov (on its initial rollout) or the attempts to implement a case management system for the FBI or attempts to replace the air traffic control system (which has failed multiple times). Depending on the statistics you want to cite, 72% of government IT projects were in trouble in 2009, or that 70% of government IT projects fail, or that only 6.4% are successful. David claimed that things are not that different outside of the government, and you can certainly find studies that agree with this. In fact, reading these studies, it is surprising that anything ever works at all.

My problem with all these studies is that they fly in the face of my own experience. I spent about 30 years developing software in industry. I was on lots of projects. There were some that weren’t successful in the market for one reason or another. There were a couple that were stopped when the company I was working for got bought and the new owners decided that those projects weren’t the sorts of things they were interested in. But I was never on one that failed because we just couldn’t get the system to work.

David did distinguish between projects that were done in “technology” companies verses those done by everyone else, and I certainly worked in technology companies. But over the past 6 years I’ve been working in one part or another of Harvard Information Technology. Harvard is hardly a technology company (don’t get me started…), but in that time we have successfully rolled out a new course management system, a new student information system, re-vamped the Identity and Access Management system, moved most of email from local servers to the cloud, and done a ton of other projects as well. Not all of them have done exactly what everyone hoped they would do, but that have all pretty much worked. None had to be abandoned, or re-written from scratch, or got deployed and then turned into a disaster.

So what is the difference between the facts and figures that we see about project failure and my own experience? Maybe I have some sort of magic about me, so that projects I join or observe are somehow saved from the fate of all of these others. That would be really useful, but I don’t think it is the right explanation. I think I’m good, but I don’t think I’m magic.

I’m more inclined to think that the difference has to do with what the managers of the projects care about. In most of the government projects I’ve heard about, and in lots of the non-governmental projects that have failed, managers have been more concerned about how things get done than anything else. That is, the worry is what kind of process gets followed. Is the project being run using the waterfall model (which was first discussed in a paper saying that it was the wrong way to manage a software project) or various forms of agile development (which is a new cult), or some other method? These are approaches that managers really hope will make the development of software predictable, manageable, and most importantly, independent of the people who are on the project. All of these models try to make the developers interchangeable parts who just do their job in the same way. Doesn’t matter who is on the project, as long as the right process is followed.

This is in contrast to what I saw through my career, and what I see in companies that might be thought of as “tech” companies now. In these projects, the worry was all about who was on the project. There was a time I gave talks about what I called the Magnificent Seven approach to software projects. The process was straightforward: hire a small group of experienced professionals, let them deal with the problem as they saw fit, and if you found a kid who could catch fish barehanded ask him or her along. This was hardly an idea that I came up with by myself; you can see it in The Mythical Man Month and other things written by Fred Brooks.

A process-based approach seems a lot more egalitarian, and in some ways a lot more fair. It means that you never have to tell someone that they aren’t good enough to do the job. It is good for the company, because you don’t have to pay outrageous salaries to engineers who are probably a pain in the tail to manage because they think (often rightly) that the company need them more than they need the job (since, if they really are that good, they can find another job easily). So I certainly understand why managers, and large bureaucracies like various levels of government, want to focus on process rather than individual talent.

But then you have to explain the difference in success rates. If focusing on process gives a success rate somewhere between 40% and 5%, and focusing on talent does a lot better (I don’t have numbers, but my anecdotal experience would put the success rate of really high performance teams at the 85%+ range), then maybe making quality distinctions isn’t a bad idea. I’m not sure how you get the various levels of government to accept this, but I think if we are going to have governments that are dependent on good technology, we need to figure out a way.

The Singularity, and Confessional Language

0

In our seminar this last week, we talked about the Singularity, that point at which machines become smarter than people, and start designing machines that are even smarter so that the gap between humans and computers running AI programs just gets larger and larger. Depending on who you listen to, this could happen in 2045 (when the computing power of a CPU will, if current trends continue, be greater than that of the human brain), or sooner, or later. There are people who worry about this a lot, and in the past couple of weeks there have even been a couple of Presidential studies that address the issue.

I always find these discussions fascinating, as much for what is presupposed in the various positions as for the content of the discussion. The claim that machines will be as “smart” as humans when the complexity of the chips equals the complexity of the human brain assumes a completely reductionist view of human intelligence, where it is just a function of the number of connections. This may be true, but whether it is or not is a philosophical question that has been under discussion at least since Descartes. Consciousness is not something that we understand well, and while it might be a simple function of the number of connections, it might be something else again. In which case, the creation of a computer that has the same level of complexity as the human brain would not be the same as creating a conscious computer, although it might be a step in that direction.

Then there is the assumption that when we have a conscious computer, we will be able to recognize it. I’m not at all sure what a conscious computer would think about, or even how it would think. It doesn’t have the kinds of inputs that we have, nor the millions of years of evolution built into the hardware. We have trouble really understanding other humans who don’t live like we do (that is the study of anthropology), and this even goes back to Wittgenstein’s dictum that “to understand a language is to understand a way of life.” How could we understand the way of life of a computer, and how would it understand ours? For all we know, computers are in some way conscious now, but in a way so different than we can’t recognize it as consciousness. Perhaps the whole question is irrelevant; Dijkstra’s aphorism that “The question of whether machines can think is about as relevant as the question of whether submarines can swim” seems relevant here.

Beyond the question of whether machines will become more intelligent than humans, I find the assumptions of what the result of such a development would be to tell us something about the person doing the speculation. There are some (like Bill Joy) who think that the machines won’t need us, and so will become a threat to our existence. Others, like Ray Kurzweil, believe we will merge with the machines and become incredibly more intelligent (and immortal). Some think the intelligent machines will become benevolent masters, others that we are implementing Skynet.

I do wonder if all of these speculations aren’t more in the line of what John L. Austin talked about as confessional language. While it appears that they are talking about the Singularity, in fact each of these authors is telling us about himself– how he would react to being a “superior” being, or his fears or hopes of how such a being would be. These things are difficult to gauge, but the discussion was terrific…

Money, bits, and the network

1

We had an interesting discussion in the freshman seminar on the impact of the internet on the economy. We all see some of the disruptions– Uber and Lyft are causing major changes to the taxi companies, Amazon has done away with most brick-and-mortar bookstores (and many other kinds, as well), and the notion that you need to actually to somewhere in the physical world to buy something (that you will see, touch, and perhaps try out before you buy) is getting restricted more and more to items that seem oddly special for just the reason that you need to see before you buy. Even the paradigm of this, the car, is having buying habits changed by companies like Tesla.

We only touched on a more fundamental change in the economy that the computer and networking world is bringing about– the very notion of money. It wasn’t that long ago that what stood behind our currencies was a hot political topic– there were those who worried about currency that was only backed by silver, rather than by gold. The idea was that the value of money needed to be directly traced to some precious metal that the money represented. A $100 bill got its worth by being capable (at least in theory) of being exchanged for $100 in gold (or at least gold coins).

This notion was abandoned by most countries in the early part of the 20th century. What made $100 worth something is that someone else was willing to part with some set of goods and services in exchange for the bill. The final arbiter was that you could pay your taxes with such bills; the government always needs to be paid, so that kind of worth is a form of guarantee.

Now, the real money is just bits in a computer. Banks exchange the bits with each other, and we all hold our money in those banks, where they are represented as bits, and where we can transfer the bits via computers, or credit cards, or by direct electronic means. On occasion we exchange some bits for bits of paper (bills) that we can carry around with us and use to pay, but that is become less and less needed as we all become used to the notion of money as bits.

This can be very convenient. I was recently in England for most of a week. My ATM card could be used to get English currency because bits are easy to transfer internationally. But I didn’t need the currency all that often; mostly I paid with credit cards that were tokens allowing me to directly move bits from one account to another. I remember many years ago when I first went to England, and needed to worry about getting traveller’s checks beforehand so I would have money while I was there. Just not needed any more.

All of which leads to the question of what money is now. It isn’t a representation of a precious metal. It’s more a consensual hallucination that we all believe will continue to be exchanged for goods, services, or (as a last resort) taxes. But it isn’t much of a thing so much as a representation in the Internet.

All of which brings me to one of my favorite characters in this space, J.S.G. Boggs. Boggs became (in)famous for drawing complex pictures of obviously fake U.S. bills (he often put his own face on the bill), and then passing the art at the value shown on the bill. So if the bill showed $100, he would ask for $100 worth of goods (or change). Merchants were happy to do this, since the bills were worth much more than their face value (as art). But it drove the U.S. Secret Service (which has as part of its job the enforcement of anti-counterfeiting laws) somewhat bananas. The courts finally sided with Boggs, who claimed that the work was performance art. But it was also something that had value of a sort that, at least in the initial transaction, was not like other forms of art.

Things are getting more complex today. Bitcoin is hardly performance art, but whether or not it is currency is under debate. It appears to have value (although it fluctuates considerably), it can be used for payments, and is certainly causing some authorities consternation. How much it changes the economy is yet to be seen, but it will certainly have an impact no matter what happens.

Design principles

0

As part of the seminar on the Internet that I help to teach, we read End-to-end arguments in system design, one of my favorite papers in computer science. What makes this such a great paper is that it takes the notion of design seriously. The reason to read the paper isn’t to learn about a nice implementation, or the proof of some theorem, or even how to make an algorithm that runs faster than others when doing some task. Instead, this paper talks about a general principle that informs any work that might be done that involves a network.

Simply put, the end-to-end argument tells us to keep the network simple, and do the work at the endpoints, since those endpoints will need to do most of the work themselves anyway. Worried about who you are talking to? Well, you need to authenticate at the endpoints, so there is no need to do that in the network. Need to check that the message hasn’t been corrupted in transit? That has to be done at the endpoint, so there is no reason to do so in the network, as well. It is an outgrowth of the idea that you don’t want to do work twice. So find the right place to do the work, and ignore it everyplace else. The result of following this principle is that the network we now use is simple, has scaled remarkably, and can be used for all sorts of things it was never intended to be used for. To introduce a new bit of functionality, all that needs to be changed are the end points that need that functionality, not the network itself.

It seems obvious now, but this principle was pretty radical at the time it was proposed. A lot of people thought it would never really work, never really scale, never really perform. It was fine for experimentation and toy applications, but not for real work (that needed token rings, or something far more reliable and guaranteed).

Articles that enunciate general design principles are few and far between, and should be treasured when they are found. My other favorite is Butler Lampson’s Hints on Computer System Design, a paper written in 1983 but still relevant today. The examples may be somewhat outdated, but the hints are still important to understand. The details of the work may change, but the underlying design principles are much closer to being timeless.

In the seminar, we also talked about the design notion that you can solve a problem by introducing a level of indirection. When the first ARPAnet was built, the networking was taken care of by simple, small computers called interface message processors (IMPs), that were responsible for talking to each other, with each IMP connected to a host system. This isolated the differences in the host systems to a local phenomenon; if I wanted to connect I had to deal with local characteristics locally, but not with the special characteristics of the remote hosts I wanted to talk with or connect to. The IMPs offered a level of indirection. When it was realized that there were local networks that wanted to be connected, another level of indirection was introduced. This level looked to an IMP like a host, and to the local network like part of that network. Thus was a gateway born, allowing the local idiosyncrasies of the local networks to be dealt with locally, not globally.

Each of these levels of indirection can also be seen as adding a layer of software into the system that will translate from the local environment to the more global one. Each local environment may have a different gateway, but that is masked from the global environment. The power of software is that it allows a common external interface to be implemented in a different way that is hidden to those who don’t need to know about it.

Discussions of software design have recently centered around various sorts of patterns. While these may be interesting, I do wish we as a community would talk more about the general principles that can inform a wide variety of designs. They are hard to find, often look trivial once they are known, but are important ways to think about how we build our systems.

Back again

1

It has been way too long…amazing how life intrudes on the best of intentions. But time to get back.

To help force myself, I’m teaching a freshman seminar (along with Mike Smith) in which we are requiring that the students keep a blog of their thoughts about the content of the class. And since it seems unfair to ask others to do what you are unwilling to do yourself, we committed to do the same. It’s one way to get back to writing.

The seminar’s topic is What is the Internet, and What Will It Become? One of the pleasures of teaching a freshman seminar is that the topic can be wide open, pretty much unconstrained, and far more interesting than tractable. This topic fits the bill pretty well. It reminds me of my past as a philosopher– the more I think about the topic, the less sense I can make of it. Is the Internet just TCP/IP? Is it a suite of protocols, or a consensual hallucination?

Beyond the topic, we get to discuss this (and in the process get to know) a small group of what appear to be spectacular students. I always learn more from them than they learn from me, and I’m looking forward to being taught by them.

We are starting by looking at the history of the development of the Internet. We have been reading Hafner and Lyon’s Where Wizards Stay Up Late, as accurate a single-volume history as we could find. History is a funny thing, especially when there are still those around who were involved in the events. It is hard to get everyone to agree who did what when, and even more difficult to get everyone to agree on the impact and import of much of what went on. It’s so much easier when no one is around who can say “well, I was there, and it didn’t really happen that way.”

There are lots of interesting lessons to learn from the way the early Internet was constructed. There seemed to be some ideas that permeated the air but were completely counter to the existing orthodoxy, such as packet switching. It was clear that there was no real agreement on what the end state of the experiment that was ARPAnet was going to be. And reading the history it becomes apparent that then, as now, much of the real work was done by graduate students, who seemed to have a better idea of what it was all about than the people who were supposedly running the project.

What I find most interesting, though, is the contrast in notions of how to build a reliable network. The packet network advocates started with the assumption that the network could never be made reliable, and that was just the way the world was. So they spent a lot of time on figuring out how to build reliable transmission on top of an unreliable network, thinking through things like re-tries, congestion control, and dynamic routing. Errors, on this design philosophy, are a given, and so the users of the network need to acknowledge that and build reliability on top of the unreliable pieces.

This is a huge contrast to the network engineers of the time at, say, the Bell system. The phone company (and there was only one in the U.S. back then) was all about building a reliable network. They did a pretty good job of this; I remember when not getting a dial tone on your (AT&T owned) phone was a sign of the Zombie Apocalypse (or, given the times, the beginnings of nuclear war). But making the system reliable was difficult, and expensive, and limited what could be done on the network (since lots of assumptions about use got built in). It is hard to remember, now that the Internet is the backbone of most everything, that which of these approaches was going to be best wasn’t clear for about 20 years. Big companies backed “reliable” networks well into the 90s. But in the end, simplicity at the network level won out, giving us the networks we have today.

I suppose my interest in this evolution is not surprising, given that I have spent most of my life working in distributed systems, where the same argument went on for a long time (and may still be going on). Building a reliable computing platform can be done by trying to insure that the individual components never fail. When you build like this, you worry about how many 9s your system has, which is a reflection of what percentage of the time your system is guaranteed to be up. Four 9s is good (the system is guaranteed to be up 99.99% of the time), five 9s is better (now you have a guarantee of 99.999% uptime). But moving from four 9s to five 9s is expensive.

The alternative, best exemplified by cloud computing, or the approach taken by Amazon, Google, or Facebook, is to build a reliable system out of lots of unreliable components. You assume that any of the servers in your server farm is going to fail at any time, but build your system around redundancy so that the failure of one doesn’t mean that the system is no longer available. It is a more challenging design, since you have to worry about failure all the time. But it is one that works well.

Just like the Internet.

Furthering Human Knowledge

1

A recent graduate with whom I did considerable work recently wrote and asked me the following question:

What are your thoughts about how academia versus industry contribute to the expansion of the world’s knowledge? My thoughts from speaking to people from both sides are that it seems that in academia, knowledge is much more open to sharing, but progress is much slower (because of multiple things: the need to apply for grant money, the multiple roles that professors must juggle, the difficulty of the problems being tackled, the lack of readily accessible consumer data, etc.), and that in industry, there are probably tens or hundreds of companies trying to do the same thing (like spam detection, for example) but who do not want to share their knowledge, but that the competition and that the money incentive makes development faster. Do you think that working in one versus the other is more effective or efficient if the furthering of knowledge is the ultimate goal? This is one of the big questions I’m thinking about as I’m trying to figure out where I best see myself in 10 years.

I’ve been thinking about this problem, and some variants, for a considerable period of time, so the response was fairly length. On the off chance that others might be interested, I post it here:

And this is why I love working with students— they always ask the simple questions :-).

There may have been a time when the answer to this was actually simple. The idea was that academics worked on long-term, highly speculative research topics, generally funded by the government, and published papers for all the world to see. Then industry would pick up some of these ideas, figure out the commercial application, apply the pure research, and take the product that resulted from applying that research to market. Academics worked on the long-term (5-10 year) problems, and industry worked on the short term (.5-2 years). A simple and rational pipeline, with each group knowing what it was doing and how it interacted with the other. If you wanted to work on the high-risk, high-impact (but likely to fail) stuff, you became an academic. If you wanted to have immediate impact on the real world, with some better guarantee of success, you worked in industry.

This is a nice picture, but if it has ever actually been true in the computer/software sector, it was before my time. As long as I’ve been watching and taking part, real research was done all over a combination of academia, industrial research labs (Bell Labs, BBN, Xerox PARC, Digital’s SRC, Sun Labs, MSR) and people in product groups at various companies, both large and small. I remember a time when it seemed that everyone in the industry worked in Research and Development; the difference between being an academic and working for a company was whether you wrote the paper before or after releasing the product.

But there are some changes that have occurred over the past 10 or more years that have made the picture even more muddled than it was, and thus make the question harder to answer.

One of these changes is that the amount of research funding going to academics in the CS field has been going down, and the risk profile of the funding agencies has been going down with it. There was a time when NSF and DARPA would give out lots of money, and be willing to take chances on things (like the internet) that didn’t have any obvious payback and might not have worked at all (there were plenty of research projects that didn’t). As the amount of money has decreased, the willingness for risk has decreased, as well— while there are some projects that seem pretty risky, for the most part the perception is that for a project to get funded, you need to show that it will succeed. This leads to people getting startup funding to do a year or so of research, applying to funding agencies for that work (or a small follow on to the work), and then using the funding to do something new that they can then ask for the next round of funding to support. Again, it isn’t always like this, but it is like this often enough that it can be problematic. By the way, it may not be that the actual amount of money has gone down (I believe that it has), but the amount hasn’t gone up in proportion to the number of academics who are trying to get it. So that makes things strange, as well.

At the same time, the number of industrial research labs seems to be decreasing, along with the funding available for such labs. Big places are willing to do some really adventurous stuff (look at Google X, or at least the rumors from Google X), but the work is not done in the open, may not be shared, and when it is often is covered by patents. Which is natural; these companies want a return on their investment. But it does limit the scope of the spread of the innovation or knowledge. In a lot of cases, companies are now willing to take more of a chance on research than academics, because they realize that the payoff to being the first to figure something out is so huge. So some of the really speculative, long-range work is being done in industry, but you hardly hear about it (think of the self-driving car).

And then there is a third trend. What seems to be happening more and more is that innovation and real research is being outsourced to startup companies. If you have an innovative idea, you start a company and work on the idea. Then if the idea works out, you either take over the world or (more likely) get bought by a larger company. This is a really attractive model for those who fund innovation; they have essentially out-sourced the problem to the VC community. The government doesn’t have to spend any money at all. The existing companies only have to buy when they know that the idea works. And the VCs are willing to put up the initial money, because the payback for the companies that get bought or get to go public is large enough to make the initial investment profitable. This certainly seems to be the way Oracle does research these days (they don’t even do much hiring; most of the people they add to the company come in through company acquisition). Ryan Adams recently had his little company bought by Yahoo, so sometimes the line between academic, startup, and established company can be even more blurred.

Of course, such outsourcing also means that the time depth of start-up company research is dictated by the patience of the VC community, which seems to be limited to a couple of years (at best). And the research better have a clear commercial application. 

All of this has to do only with how the initial research gets funded. The real question centers on how you could most effectively add to human knowledge. Which is a lot harder than just getting funding, because once you get funding, you then need some way to get people to recognize what you have done.

Academics do this by writing papers and giving talks, which sometimes works. But there are a lot of papers that get written and never really get read, a lot of talks that are heard and immediately forgotten. By the same token, there are lots of products that are really great pieces of technology that, for one reason or another, never get adopted. Before inventing Java, James Gosling invented NeWS, an elegant and fully functional window system. But NeWS went nowhere; instead the industry of the time adopted X-windows, which a lot of us thought was not technically as nice. Dick Gabriel and I have been arguing over why LISP lost out to C or Multics lost out to Unix for a long time, but whatever the reason was it was not purely technical. I remember being told by Ivan Sutherland, who has done more great technology than just about anyone I know, that all you can do as a technologist is make sure that the work is good; adoption is a question outside of your control. A hard lesson, but true.

After all of this evasion, let me try to answer the real question, which is what should you do if you want to push forward the boundaries of knowledge? And the answer will depend on how you want to live your life, not on which is more likely to push those boundaries successfully. As an academic, you have much more freedom, deciding on your own research agenda and working with students who you will get to advise and direct. In industry, you probably won’t have that sort of freedom for 5 to 10 years (and that’s if you enter with a Ph.D.); you will be part of someone else’s group for that time, working on their problems. But while in industry you will not have the worries over funding (my funding cycle when at Sun Labs was a couple of weeks), and the people you will be working with will have a higher level of skill and discipline than is generally found in students. But the problems you will work on will be constrained, to some extent, by the market and you may not be able to share all you know. The environment of a startup gives you some of the freedoms of an academic, but also brings the pressures of the market. 

And, of course, a final consideration is just what is meant by “furthering human knowledge.” One way of doing this is to come up with something that no one has ever thought of before, and getting everyone to understand it. This might be a new theorem, a new system, or a better way of doing things. Java, for all its flaws, certainly contributed to human knowledge in some way; the IP protocols did the same. But these sorts of contributions are few and far between. When they happen, it is a combination of insight, perspiration, and luck; no one knows how it really happens, but when it does it is pretty amazing.

But the other way to further human knowledge is to train the next generation in how to further knowledge. This can also be done in all of the contexts spoken about above— I mentored a lot of people when I was in industry, start-ups teach in their own kind of way, and being a professor is (supposedly) explicitly about that. As my career has moved along, I’ve grown to appreciate this way of furthering knowledge more and more; it may not be as romantic or immediately satisfying, but you end up playing the long game, knowing that your contributions will outlast you and not be subject to quite so many whims. Which is the other reason that I love working with students— it is part of my attempted contribution to the expansion of human knowledge.

It was 20 years ago today…

1

A lot of Beatle’s songs have been running through my head this last little while, but Sgt. Pepper is getting most of the mental traffic. In part this is because of a recent personal anniversary (will you still need me?) but mostly because it was 20 years ago that Java(tm) was first made public. My, how time flies.

There was actually a bit of a debate on the JavaSoft (that was the part of Sun that was responsible for Java) mailing lists; the first public release of the alpha technology preview of Java went out March 27, 1995, while the first announcement of the language and environment was during the Sun Developers Conference held May 20-something in that year. To give a bit of context, that was the same Sun Developer’s Conference when the Memo of Understanding between Sun and Netscape that placed Java in the browser was announced. For those of you who don’t know what Netscape was, go look it up. At the time, Netscape was a much bigger deal than Java, but time has a way of changing those things.

Java had actually been wandering around Sun internally for a couple of years before that; during the 10 year anniversary I found that I had the electronic copy of the oldest known (then) spec for the language, which you can see here. First note that the language was not then called Java. It was originally named Oak, after the tree outside of James Gosling’s office. But the trademark folks said that name was already taken, so the commercial name was changed to Java. The rest, as they say, is history.

The Oak spec is recognizably Java, but is also different. The exception mechanism at the time did not require that exceptions be declared or caught; this was changed before the first official release. The language was a lot smaller (full description in under 30 pages), and the mass of libraries, frameworks, and imported features from other languages weren’t there. This is a snapshot of Java in its purer form, before it became important enough to be fought over by different companies, before the community process, and before all those who wanted it to be something else had their way with it.

I still like to program in Java (the Standard Edition; don’t get me started on the Enterprise Edition), since at its core it is still the simple, object-oriented, strongly typed language that it started out being. I do use generics, even though I hate the way they were implemented and argued against them (and still think they aren’t needed). I’m glad to see lambdas introduced, although I could have lived without them. And I do sometimes wish that we had had a few more years to develop the language before it became the global phenomenon it became. But it ruled much of my life in technology, and it was fun.

So happy birthday, Java. And thanks for all the good times. The bad ones weren’t your fault…

Architecture at HUIT

0

It’s been quite a while since my last post; lots has happened and it is past time to start talking about some of it.

Information technology at Harvard is changing a lot, as all of the planning that started when Anne Margulies came in as CIO is beginning to be implemented. We are well into the implementation of a new Student Information System, and the move from a classroom support technology from iSites to Canvas is happening (sometimes faster than we thought). All of these are good, and visible. But this is hardly all of it, or even the most (technically) interesting.

In some important ways, the biggest change is the move to the cloud. If done right, this will be pretty much transparent to the users of our services, but it is a huge change for the organization itself. It makes sense in lots of ways, but it takes new sets of skills and new ways of looking at the problems we are trying to solve. But we have committed to moving 75% of our services to the cloud in the next three years, which is a lot more than just testing the waters.

As we do this, it is also an opportunity to start putting some architectural principles in place. HUIT has traditionally treated most applications as “one-of-a-kind” entities, with machinery and underlying software stacks selected, optimized, and maintained for each system. When we were trying to squeeze all we could from each application, this made sense. But as computing power grows, the complexity of such an approach is overwhelming any advantage in performance we might be able to gain.

To help bring some regularity to this, I’ve convened a new group, the Architecture Decision Group. As the name implies, this is a group that comes together to make decisions about what the architecture is and will be going forward, at least for HUIT. If the wider Enterprise Architecture work was complete, this group would spend most of its time making sure that architecture was being followed. But since that architecture is in-process, this group is trying to decide on issues that need answers now. Since that wider effort is just starting, we needed something to make decisions now, so we can avoid the lack of regularity that is our current state.

The group is intentionally designed to be small and technical. Permanent members are the CTO (natch), the deputy CIO, the Managing Directors of Engineering and Architecture and Strategy and Planning, the Chief Information Security Officer, and the Director of Networking. Depending on the subject being discussed, we will ask other (technical) people to attend.

An important part of the work that we are doing is writing it down. We have a backlog list, and then a set of decisions and rationales for those decisions. All of this is kept on a publicly viewable wiki.

While the deliberations of the group are invitation-only, we are looking for ways that the more general engineering community can contribute. For any of the topics in the backlog, we invite opinions to be written up (on the wiki) and submitted. The group will read these, and those that seem particularly relevant may cause us to invite the writer to join for a session or two. We also invite comments on our decisions. The assumption is that nothing we decide is set in stone, but unless there is good reason to follow some other design everything that HUIT does should follow the decisions made by the group.

We have already made a number of decisions around the cloud and the network architecture that impact our move to the cloud; take a look and file a comment if you think we have not understood something important. We will next be looking at some of the patterns for deployment in the cloud; opinions on those topics are being sought. So take a look and get involved…this is the technical future of HUIT that is being worked out here, so we would love to hear from you.

Thoughts on time between…

0

It has been a full semester since the last post– seems like just a moment. Fall is a favorite time of mine for lots of reasons, but Fall Semester is fun because I get to teach my privacy course. It is small, it is intense, and it is a lot of work, but it is more fun than I can imagine having in any other context. We had our usual count of the surveillance cameras on and around campus (a lot), a lab on unprotected web cameras that ended up with us all watching a wedding in a Las Vegas chapel (the bride went as Marilyn Monroe, and groom as early Elvis and the official doing the ceremony as the later Elvis), and a slew of projects that were amazing, creepy, and often both.

This year was especially interesting, given all of the privacy-related things that were going on in the rest of the world. We had more Snowden revelations, new and interesting hacks, Facebook and Google doing things that were questionable, and goings on in our own community. All of which had privacy implications.

One event led to some interesting discussions, both in and out of class. The first was the revelation that some educational research had taken place on the effect of attendance on the outcomes in a course. The research entailed placing cameras in some courses to determine how many were attending on any given day. This was done without the knowledge of either the students in the class or the professors teaching those classes, and once revealed lead to considerable discussion in the general Harvard community. I was a part of some of these in my role as CTO. But the most interesting conversations were the ones I had with my class.

The first thing that the students wanted to know was more details about the study itself– did the study get reviewed by the Institutional Review Board, and if so what did the application look like and what was the determination? How were the cameras used to determine attendance, and were they counting raw numbers or could the cameras tell if a particular student was attending or not? How were the courses that were part of the study selected? All of these were great questions, and at the time of maximum controversy, none were answered publicly. This didn’t stop a lot of the discussion, but it did cause the students to stop and wonder. We never got answers to all of the questions (in particular, we never saw the IRB application), but the answers that came out about some (the experiment did get submitted to the IRB; the cameras could only tell wether or not someone was in a seat, not track individuals; classes were picked by being in the right Science Center rooms), things seemed a lot less sinister than they had at first. The students tended to think that the privacy of the students was not compromised (after all, what is the reasonable expectation of privacy in a lecture hall) but perhaps the instructor’s privacy was (although there was continuing discussion about that).

What did become clear is that the issue really revolved around trust, and the more information that was available, the easier it was to trust that the researchers were not compromising privacy. When there were few facts, it was easy to speculate and ascribe questionable motives to those involved, which raised all of the privacy hackles. But as we all learned more, it was easier to see what was being done in a clearer way, and to become less concerned about the motives of the actors. It still seemed sort of creepy, but it was hard to see the actions as immoral (and we never got to the discussion of illegal).

Anonymity, Privacy, and Big Data

0

It’s been a while since my last post. Some interesting things have been going on, especially in the area of privacy, big data, and the work I’ve been doing with the researchers looking at the information coming out of the Massive Open On-line Classes (MOOCs) being offered by HarvardX on the edX platform. The latest result, just published on-line, shows that if you equate privacy with anonymity (which is commonly done in laws), you can either preserve privacy or do good science, but not both. Which is a pretty interesting result, I think. Let me lead you through the thinking, with apologies for the long and rambling nature of the post.

When I first started offering the course Privacy and Technology, I did so thinking that it would be a topic that would allow discussion of the interaction of technology and policy without being overly trendy or taken over by events. I remember my co-instructor at the time, Mike Smith, saying in the shopping lecture that he thought privacy was going to be the issue of the next decade; it was a shopping lecture, so I figured being a bit over-the-top was fine. Turns out he was a much better predictor of the future than I; privacy has become a huge issue that has now become one of my major research areas.

One reason for the worry is the advances in the ability to use Big Data to analyze all kinds of things. People worry about all the information the NSA appears to be gathering about everyone, creeped-out by how much Google or other web properties know about them, and generally worried that nothing can be kept secret (and it will just get worse as more sensors are connected to the internet). There are very few general regulations or laws in the U.S. about what data can be gathered and what that data can be used for. Google and Facebook can pretty much do what they want with the data they gather as long as they don’t lie to you in their privacy statements (although how you know what those statements say is an exercise left to the reader).

There are some areas that are regulated. Everyone seems to know about the privacy requirements imposed by the Health Insurance Portability and Accountability Act (better known as HIPPA), although as my colleague Latanya Sweeney often points out, this regulation is more about what data can be shared than it is about what is private. The other industry that is highly regulated with respect to privacy is education, which is covered by the Family Educational Rights and Privacy Act (FERPA). Most of these regulations center on the notion of de-identified data– that is, data that has been changed so that you can’t identify individuals from the data. HIPAA specifically defines what it means to de-identify the data– you have to take out names, addresses, social security numbers, and the like. The advantage of the definition is that you can tell when your data has been HIPAA de-identified; the disadvantage is that it isn’t all that hard to re-identify individuals given the de-identified data and other, pretty easily acquired information.

The Family Education Rights and Privacy Act (FERPA) is much less well defined; it talks about data that could not be re-identified by a reasonable person (is Latanya reasonable? Probably not, but how about all of the students she has trained?). Where it does get something like specific it seems to rest on a notion k-anonymity (invented by none other than Professor Sweeney), that in it’s simplest formulation says that for any set of properties on any individual in your data set, there must be at least k-1 other individuals within that set with the very same set of properties. So if your data set is, say, k-anonymous for k=5, it means that for any information about some particular individual in the data set, there will be at least 4 other individuals in the set that can’t be distinguished from the original.

Now, the MOOCs that Harvard has been running have been generating a lot of data about how students taking the MOOCs interact with the course material. What videos get viewed, which get sped up or slowed down, what questions are answered correctly and after how many tries– all of this information is available on anywhere from thousands to hundreds of thousands of students. Over the past year we (a research group from Harvard and MIT) have been able to do some pretty interesting data analysis on this, and are starting to learn some things about how students learn on-line.

But if you are really committed to science, you want to allow others to test your work by running their own analysis on the data. We also thinks others will ask questions we have failed to ask, which is a good thing. So we would like to share the data. But this being data about students, we (or, more accurately, the lawyers) worry about FERPA. So before we can share it, we needed to make it k-anonymous (where, in this case, k=5). Using a variety of techniques, we were able to produce a basic data set that was 5-anonymous and gave data on over 800,000 students who had taken MOOCs. Which we then released to other researchers.

But then we thought, just to be sure, that we should run our original analyses over the de-identified data set. When we did, the results we got were very different. Significantly so. For details, see the paper, but the difference was enough that it wasn’t clear that the de-identified set would lead to good science. Which is a huge problem.

We hope that the paper will cause some discussion, both among researchers and policy makers. There are others that have worried that there is a conflict between big data science and the preservation of privacy through anonymity (most notably, the President’s Council of Advisors on Science and Technology in a recent report), but as far as I know our paper is the first to show that the problem is real.

So now I’m thinking maybe Mike Smith was wrong. This is a problem that is going to be around for a lot longer than a decade.