I dunno why the New York Times appeared on my doorstep this morning, along with our usual Boston Globe (Sox lost, plus other news) — while our Wall Street Journal did not. (Was it a promo? There was no response envelope or anything. And none of the neighbors gets a paper at all, so it wasn’t a stray, I’m pretty sure.) Anyway, while I was paging through the Times over breakfast, I was thinking, “It’s good, but I’m not missing much here–” when I hit Hot Story to Has-Been: Tracking News via Cyberspace, by Patricia Cohen, on the front page of the Arts section. It’s about MediaCloud, a Berkman Center project, and features quotage from Ethan Zuckerman and Yochai Benkler


(pictured above at last year’s Berkman@10).

The home page of MediaCloud explains,

The Internet is fundamentally altering the way that news is produced and distributed, but there are few comprehensive approaches to understanding the nature of these changes. Media Cloud automatically builds an archive of news stories and blog posts from the web, applies language processing, and gives you ways to analyze and visualize the data.

This is a cool thing. It also raises the same question that is asked far too often in other contexts: Why doesn’t Google do that? Here’s the short answer: Because the money’s not there. For Google, the money is in advertising.

Plain enough, but let’s go deeper.

It’s an interesting fact that Google’s index covers the present, but not the past. When somebody updates their home page, Google doesn’t remember the old one, except in cache, which gets wiped out after a period of time. It doesn’t remember the one before that, or the one before that. If it did it might look, at least conceptually, like Apple’s Time Machine:


If Google were a time machine, you could not only see what happened in the past, but do research against it. You could search for what’s changed. Not on Google’s terms, as you can, say, with Google Trends, but on your own, with an infinite variety of queries.

I don’t know if Google archives everything. I suspect not. I think they archive search and traffic histories (or they wouldn’t be able to do stuff like this), and other metadata. (Maybe a Googler can fill us in here.)

I do know that Technorati keeps (or used to keep) an archive of all blogs (or everything with an RSS feed). This was made possible by the nature of blogging, which is part of the Live Web. It comes time-stamped, and with the assumption that past posts will accumulate in a self-archiving way. Every blog has a virtual directory path that goes domainname/year/month/day/post. Stuff on the Static Web of sites (a real estate term) was self-replacing and didn’t keep archives on the Web. Not by design, anyway.
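To make the time-stamping point concrete, here is a minimal sketch of pulling the date out of a permalink shaped like domainname/year/month/day/post. The regular expression, the function name parse_permalink and the example.com URLs are all my own assumptions for illustration. Plenty of real blogs use other path layouts, and this is not how Technorati or anybody else actually did it.

```python
from __future__ import annotations

import re
from datetime import date

# Hypothetical pattern for permalinks shaped like domainname/year/month/day/post.
# Real blogs vary (some omit the day, some use numeric post IDs), so treat this
# as an illustration of the idea, not a general-purpose parser.
PERMALINK = re.compile(
    r"^(?:https?://)?(?P<domain>[^/]+)"
    r"/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})"
    r"/(?P<post>[^/?#]+)"
)


def parse_permalink(url: str) -> tuple[str, date, str] | None:
    """Return (domain, posting date, post slug), or None if the URL does not fit."""
    match = PERMALINK.match(url)
    if match is None:
        return None
    try:
        posted = date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:
        # Digits were present but do not form a real calendar date.
        return None
    return match["domain"], posted, match["post"]
```

Given a made-up URL like example.com/2009/06/15/cluetrain-at-ten, the function hands back the domain, a date object for June 15, 2009, and the slug. A Static Web page like example.com/about gives back nothing, because there is no date in its address to recover.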

I used to be on the Technorati advisory board and talked with the company quite a bit about what to do with those archives. I thought there should be money to be found through making them searchable in some way, but I never got anywhere with that.

If there isn’t an advertising play, or a traffic-attraction play (same thing in most cases), what’s the point? So goes the common thinking about site monetization. And Google is in the middle of that.

So this got me to thinking about research vs. advertising.

If research wants to look back through time (and usually it does), it needs data from the past. That means the past has to be kept as a source. This is what MediaCloud does. For research on news topics, it does one of the many things I had hoped Technorati would do.

Advertising cares only about the future. It wants you to buy something, or to know about something so you can act on it at some future time.

So, while research’s time scope tends to start in the present and look back, advertising’s time scope tends to start in the present and look forward.

To be fair, I commend Google for all the stuff it does that is not advertising-related or -supported, and it’s plenty. And I commend Technorati for keeping archives, just in case some business model does finally show up.

But in the meantime I’m also wondering if advertising doesn’t have some influence on our sense of how much the past matters. And my preliminary response is, Yes, it does. It’s an accessory to forgetfulness. (Except, of course, to the degree it drives us to remember — through “branding” and other techniques — the name of a company or product.)

Just something to think about. And maybe research as well. If you can find the data.

With the 10th anniversary edition of Cluetrain coming out, I thought I’d try to keep up with postings that mention “Cluetrain” through five Live Web* search engines: BlogPulse, Google BlogSearch, Technorati, FriendFeed Search and Twitter Search. I’ve got all five feeding into an aggregator.

As of 3:33pm EDST, BlogPulse finds 20 posts so far in the month of June. Google Blogsearch finds 22. Technorati is currently down. Twitter Search finds 28 in the last day. (I didn’t go back any farther there.) Not sure I want to make this a more formal research effort. I just thought it was worth vetting a bit about how I’m following stuff.

[Later…] Thanks to Chris Heath for suggesting I add FriendFeed Search. There I just gave up counting at 50 postings.

* I much prefer “live” to “real time”, mostly because my son Allen came up with the “Live Web” line way back in 2003, and correctly observed that the Web of sites was essentially a static one, and that the World Live Web would branch off of it. The language alone is a give-away. The Static Web is full of real estate language: sites, domains and locations that you architect, design and build. While the Live Web is one with feeds where you write, post, update, syndicate and now also tweet and re-tweet. To me the differences between static and live are much clearer than those between ______ (find a word) and real time.

Dave asks, When Google has to cut its own revenue stream by enhancing search, will they do it?

Good question. Here is another: Has Google’s success at advertising slowed its innovations around search? And, How far will Google go with search engine improvements if there’s clearly no advertising money in it?

I’m not suggesting answers here. I’m just asking.

There are many things I would love to search for that Google doesn’t cover. But then, nobody does. For example, a date-range search just of blogs. Google Blogsearch does feature date-based search, with the most recent on top. But what if I want to search just in November and December of 2004? Near as I can tell, it can’t be done. (Correct me if I’m wrong. I’m glad to be.) [Later…] I am corrected by the first two comments.
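For what it’s worth, the kind of query I mean is simple to state once posts carry time stamps. Here is a rough sketch, assuming an archive that is just a list of dated posts held in memory. The Post class, the search_range function and the sample entries are all hypothetical, and no real search engine’s API is being shown here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Post:
    url: str
    published: date
    text: str


def search_range(posts, term, start, end):
    """Posts mentioning `term` (case-insensitive) published from start to end,
    both dates inclusive, oldest first."""
    needle = term.lower()
    hits = [
        p for p in posts
        if start <= p.published <= end and needle in p.text.lower()
    ]
    return sorted(hits, key=lambda p: p.published)


# Made-up sample archive, for illustration only.
SAMPLE_POSTS = [
    Post("example.com/2004/10/30/a", date(2004, 10, 30), "Cluetrain before the window"),
    Post("example.com/2004/12/24/c", date(2004, 12, 24), "More on Cluetrain and markets"),
    Post("example.com/2004/11/02/b", date(2004, 11, 2), "Rereading cluetrain this week"),
    Post("example.com/2004/12/01/d", date(2004, 12, 1), "Nothing relevant here"),
    Post("example.com/2005/01/05/e", date(2005, 1, 5), "Cluetrain after the window"),
]

# Everything mentioning "Cluetrain" in November and December of 2004.
results = search_range(SAMPLE_POSTS, "Cluetrain", date(2004, 11, 1), date(2004, 12, 31))
```

The hard part is not the filter. It is having somebody keep the dated archive in the first place.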

I once had high hopes that Technorati would support that kind of search, but both Technorati and Google Blogsearch are playing the What’s Popular game. (For what it’s worth, I used to be on Technorati’s advisory board, but now David Sifry is gone and I’m not sure the company even has one any more.)

Anyway, it’s hard for me not to appreciate the many different ways Google lets me search for stuff. Their geographic services, for example, are amazing. So is stuff like this. But I can’t help but notice that the basic search offering has changed relatively little over the years. Is it because of the advertising? You tell me. I really don’t know.

Here’s a job for the Citizen Media’s long tail: find the first time anybody used the terms “Craigslist killer”, “Craigslist case” or “Craigslist murder”. What the effort will highlight are two issues for journalism. One is the absence of an engine that allows easy first-date or date-range search. (Unless I’m mistaken about that, which I’d be glad to be. [Later… I am.]) The other is the unfairness of turning the name of a business into an adjective that modifies the noun for a crime, essentially re-branding that business as a criminal accessory.

Why “Craigslist killer”? Well, the easy answer is that the killer apparently targeted victims he found on Craigslist, and that’s interesting. Meaning it’s kind of new and different. Murder goes digital. Hey, you don’t hear about “the phone book killer” or “the newspaper killer,” do you? (Well, actually, Craigslist has been called that too.)

My point here isn’t about how natural and easy it is to name a case “Craigslist murder”, but about what that does to Craigslist. I think it’s unfair, as well as a bummer for Craig Newmark and the rest of the Craigslist folks, even if the label is hard to avoid using.

Meanwhile, I’d love to see better chronological search on Google Blogsearch and Technorati, both of which offer it, at least for syndicated sources.

Dr. Weinberger covers this, and adjacent topics.

