Send in the consultants

I’ve recently moved into a new job at Novell, working on our strategy for worldwide services, and planning for our next fiscal year is keeping me busy. But I still, fortunately, deal with real clients and real problems too. This one is classic: the client has several hundred old Unix and RHEL servers that they want to move to SLES. Great! We want to help. So they negotiate the server deal and then want to know the cost to migrate. How much is it going to cost, in total, to go from what they have today to what they want tomorrow? They ask for estimates on a per-server basis: how many hours would it take to migrate a Solaris server to SLES? Ten hours? A thousand hours? So they bring in the consultants, the dreaded consultants. They’ve tried to avoid slowing down the deal but there’s no avoiding it now.

Well, you’ve done this before, they say, you’re grizzled veterans of the data center; is it two or ten hours for a server? And the consultant — and I’ve been in this situation, believe me, it sucks — has to say, “Well, it depends. It could be a thousand hours.” Which is what everyone is expecting him to say because you can’t get a straight answer out of a consultant. They’re always going to tell you “it depends.” Right.

And even if there is all the time in the world, this particular answer needs to be in writing on the buyer’s desk by EOD today or the sales guy isn’t going to make his number for the quarter, which means that he’s not going to make ‘club’ (his incentive travel event), which his wife is really looking forward to. So this damned consultant is not only failing to answer a simple, reasonable question from the beloved customer but is also very directly making the sales guy’s wife mad at him, with the attendant consequences.

Perhaps you think I joke? Or exaggerate?

Making matters worse, some nerd named Chad has downloaded openSUSE onto a machine in their testing lab and moved a couple of apps without incident (some directory changes, a few lines of code) and, based on that experience, has estimated that moving the three hundred servers will take approximately an hour each. Seriously: we have clients who want us to tell them that moving unknown production workloads from one operating system to another will take less than two hours per server.

So the consultant sighs and starts to ask questions: What do the workloads on these servers actually do? Online banking is different from warehouse management. What platforms are they running? (What version of J2EE? What version of RHEL? What version of Manugistics?) Are they going to change anything else besides the operating system when they do this move? Is the software custom or off-the-shelf? What’s it written in? If they say something like current Java apps running on a 2.6 kernel going to the same JVM on another distribution, that would be one thing. If you are looking at custom non-ANSI C code in a complex multi-tiered app on RHEL 3, that’s something else. (Moving from the 2.4 kernel to the 2.6 kernel on any distribution is much harder than moving from one current distribution to another.) What about storage, and backup, and disaster recovery? Systems management? There are a thousand more architectural details that you need to understand (one data center or many? resource utilization?) but everyone is getting impatient with you and your endless questions.

Then you start getting into the enterprise-y aspects, which is where the real time and cost come in. There’s a difference between Chad moving an app from one platform to another as a technical exercise and the actual time that it takes production applications to go from one to another. What’s the testing regime? I would expect that production code moving from one distribution to another would require real testing (stress/performance, UAT, etc.). Would you include that in the estimate? What about security? Does the new OS have to go through a security audit at the company? (Answer: yes, and it’s going to take a long time for the online banking app, believe me.) Documentation?

This is all super-boring and bureaucratic and definitely not technical so the nerds aren’t interested and think it’s worthless and the sales guy is hearing his wife screaming at him and the buyer is saying, “Why is this so complicated?”

So, should we skip the backup part?

Really, the way to do this kind of thing is to do a quick assessment and figure out some kind of prioritization and rough sequencing, but that would require the client to spend time and money helping you to figure out how much to charge them, and they are naturally leery of such a thing. You desperately want to avoid getting locked into a fixed figure because you still have no real idea how complex the problem you’re being asked to solve is, but that is what the client and the others are asking for.
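For what it’s worth, the first pass of that kind of assessment can be as crude as a scoring exercise over the server inventory. Here is a minimal sketch in Python, assuming an entirely made-up set of risk traits and weights. `WEIGHTS`, `Workload`, `risk_score` and `sequence` are hypothetical names, and the numbers are illustrative only, not drawn from any real engagement. It orders workloads from least to most risky; it does not pretend to produce hours.

```python
from dataclasses import dataclass

# Hypothetical risk weights, for illustration only. A real assessment
# would have to derive its own factors from the client's environment.
WEIGHTS = {
    "custom_code": 3,     # custom rather than off-the-shelf software
    "kernel_jump": 3,     # e.g. moving from a 2.4 to a 2.6 kernel
    "security_audit": 2,  # the new OS build must pass an internal audit
    "multi_tier": 2,      # complex multi-tiered application
}


@dataclass(frozen=True)
class Workload:
    name: str
    traits: frozenset


def risk_score(workload):
    """Sum the weights of the workload's traits.

    An unknown trait raises KeyError, so typos fail loudly
    instead of silently scoring zero.
    """
    return sum(WEIGHTS[trait] for trait in workload.traits)


def sequence(workloads):
    """Return workloads ordered from lowest to highest risk score.

    sorted() is stable, so ties keep their original inventory order.
    """
    return sorted(workloads, key=risk_score)


# A toy inventory; the names and traits are invented.
inventory = [
    Workload("online-banking",
             frozenset({"custom_code", "security_audit", "multi_tier"})),
    Workload("edge-proxy", frozenset()),
    Workload("warehouse-mgmt", frozenset({"kernel_jump"})),
]

for workload in sequence(inventory):
    print(risk_score(workload), workload.name)
```

With this toy inventory the edge proxy sorts first and the online banking app last. The ordering is only as good as the traits someone has bothered to record, which is exactly why the assessment needs the client’s time.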

So you end up with a fudge; you commit to moving some edge servers and a cluster of supposedly simple apps and you sign up to do a security-approved core build and an assessment for the rest so that the project can get started and the customer can show progress to their boss and the sales guy can make his number.

Now you’re faced with months in the lab at the client site with Chad explaining to you how completely screwed up their environment is and how there’s no way that he’s going to give up his Solaris servers and anyway they’ve tried to do this themselves a bunch of times already and it never works because it’s not really a current release of Manugistics and they did some customization that they probably shouldn’t have…