Everyday Mission-Critical Computing
Boston’s Silver Line is possibly the best bus in town. From the South End to downtown in 15 minutes, with double-length buses every 2-3 minutes during rush hour. Except that, a few weeks ago, they installed new, fancy bus pass readers. Instead of a half-second swipe, it’s now a 2-second let-the-machine-suck-your-card-in-read-it-and-spit-it-back-to-you maneuver. On top of that, they no longer take single-dollar bills from riders without passes: they force travelers to buy $20 bus debit cards. On the bus. At the same machine. So now, loading time on a Silver Line bus is multiplied by at least 4, and that’s before someone tries to feed a $20 bill into the machine to get their new bus debit card.
But it gets worse. Boston bus riders know that, every now and then, the bus announcement panel displays “Out of Service” instead of the route number. In those cases, bus drivers have taken to sticking a piece of paper up on the windshield to indicate the route number. But now, with this new fancy Silver Line bus card system, when the on-board computer crashes and displays “Out of Service,” the card system also goes out of service. And the bus driver is forced to wave people in, collecting no cash. I suppose that’s better than having the driver wait until the system reboots.
Have people noticed how scenarios like these are more and more frequent?
Two weeks ago, I was helping out with the A/V during the Theory of Cryptography Conference held at MIT’s new Stata Center. The amphitheater has a state-of-the-art integrated A/V system with touch-screen central control. Plug in a laptop, touch the right buttons, and the lights dim, the screen comes down from the ceiling, and the projector turns on. Except that, two hours into the conference, the whole system spontaneously shut down. The screen retracted, the projector turned off, and the lights came on. We restarted the system, hoping this was a one-time problem. It was not. Thirty minutes later, the system spontaneously shut down again. Lo and behold, there is no manual override. Our solution involved bringing in a backup projector, keeping the lights mostly on, and keeping the screen in place by literally pulling the fuse to prevent it from retracting.
Meanwhile, a friend recently told me how car mechanics are having to deal with increasingly complex and bizarre accidents. A modern car easily has 10 or more CPUs for various functions. When one of them becomes faulty, the result is unpredictable, and sometimes dramatic. Brakes may be activated because traction control is getting the wrong signal from the ABS subsystem. A driver might be cruising down the highway and suddenly find himself fishtailing for no apparent reason.
What’s happening here is that we’ve forgotten that most of these systems are mission-critical. Getting onto that bus needs to happen quickly and efficiently. A conference projector needs to work immediately, all the time. And a car, well, we generally know that if something goes wrong with a car, we’re in trouble.
But there’s something more here. There’s something unintuitive. The thing is, when a mechanical system breaks down, it generally fails in a way that makes sense and results in somewhat graceful degradation. A car’s brakes might become weaker as the pads wear down, and that makes sense. A projector might become dimmer as the lightbulb fades. A screen might get stuck halfway unrolled. But when a computer system breaks down, even a small fault can have dramatic effects.
It’s for that reason that computerized systems in mission-critical applications need physical failovers wherever possible.