Media Cloud, relaunched

Today, the Berkman Center is relaunching Media Cloud, a platform designed to let scholars, journalists and anyone interested in the world of media ask and answer quantitative questions about media attention. For more than a year, we’ve been collecting roughly 50,000 English-language stories a day from 17,000 media sources, including major mainstream media outlets, left- and right-leaning American political blogs, and 1,000 popular general interest blogs. (For much more about what Media Cloud does and how it does it, please see this post on the system from our lead architect, Hal Roberts.)

We’ve used what we’ve discovered from this data to analyze the differences in coverage of international crises in professional and citizen media and to study the rapid shifts in media attention that have accompanied the flood of breaking news that’s characterized early 2011. In the next weeks, we’ll be publishing some new research that uses Media Cloud to help us understand the structure of professional and citizen media in Russia and in Egypt.

With our relaunch of the site, many of our most powerful tools are now available for your use. We’re hoping Media Cloud proves useful to anyone interested in asking questions about what bloggers and journalists are paying attention to, ignoring, celebrating or condemning.

We hope the tools we’re providing are a complement to amazing efforts like the Project for Excellence in Journalism’s News Coverage and New Media indices – we consider their tools the gold standard for understanding what topics are discussed in American media. PEJ works their magic using talented teams of coders, who sample different corners of the media ecosystem to find out what’s being discussed. We use huge data sets, algorithms and automation to give a different picture, one focused on language instead of topic.

At its most basic, Media Cloud gives a picture of what journalists and bloggers are writing about by counting the words used in recent stories. Above is a cloud of language used in our set of political blogs during the week starting on Monday, May 2nd. We can see language about the US raid on Osama bin Laden’s compound, including obvious words like Abbottabad, Bin Laden and raid, as well as words that suggest particular interests within those stories: helicopter, SEALs, intelligence, interrogation, Pakistan. Even with a major story dominating discussion, we see glimpses of other issues, like the US Congress Caregiver’s Act and speculation that Indiana governor Mitch Daniels will enter the Presidential race. You can click each word in the cloud and see what sentences in different blogs contained the term in question, how often it was used, and how that source compared to others.
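For readers who want a feel for the counting step, here is a minimal sketch in Python. It is not Media Cloud's actual code: the tiny stop-word list, the tokenizing rule and the two sample sentences are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real system would use a far longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is", "was", "by"}


def word_counts(sentences):
    """Count how often each non-stop word appears across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase the text and keep runs of letters and apostrophes as words.
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


# Made-up sample sentences, not real Media Cloud data.
sentences = [
    "The raid on the compound in Abbottabad was carried out by SEALs.",
    "Helicopter footage of the raid was not released.",
]

counts = word_counts(sentences)
top_words = counts.most_common(3)  # the most frequent words would be drawn largest
```

In a word cloud, the size of each word would then be scaled by its count, so the words at the front of `top_words` are the ones you would see biggest.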

Comparison is where our tool is most powerful. The cloud above shows the differences between words used in left- and right-wing blogs during the same time period. We start to see differences in what aspects of the Bin Laden story bloggers focused on. Bloggers on the left used the words “torture” and “waterboarding” while bloggers on the right used “interrogation” and “terrorist”. Other comparisons are less obvious – we see more discussion of the debate about releasing raid photos on the right than on the left, and a discussion about expanding the Hyde Amendment (which affects congressional funding for abortion) on the left.

We’re also able to make general statements about the similarity or difference in word usage in these comparisons. While the left and right may both be focused on the raid in Pakistan, the similarity score (near the bottom of the word cloud, towards the right) suggests a larger disparity in agendas than we saw looking at these two sets of media a year ago, when both sides were talking primarily about Arizona’s tightened immigration laws. I’ve been taking an in-depth look at similarity scores to understand how media attention can shift at moments of international crisis, and how the recent, internationally-focused media cycle may differ from the news we often get in the US.
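This post doesn't spell out how the similarity score is computed, so treat the following as one plausible way to compare two sets of word counts rather than a description of our method. The sketch assumes cosine similarity over word-frequency vectors, and the counts are invented for illustration.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count mappings (0.0 = no overlap, 1.0 = identical mix)."""
    # Only words present in both sets contribute to the dot product.
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty set of counts is treated as sharing nothing
    return dot / (norm_a * norm_b)


# Invented counts for illustration only, not real Media Cloud data.
left = Counter({"raid": 10, "torture": 6, "waterboarding": 4})
right = Counter({"raid": 10, "interrogation": 6, "terrorist": 4})

score = cosine_similarity(left, right)
```

With a measure like this, two sets of blogs that share a dominant word such as "raid" but differ in the rest of their vocabulary land somewhere between 0 and 1, and a lower score points to a wider gap in what each side is talking about.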

What our tools let you do with Media Cloud is really just the tip of the iceberg. The code behind our system is published under an open source license, so other researchers can build systems to monitor media in other countries and other languages. (We’ve got a system monitoring Russian media and blogs that you’ll hear more about soon.) We are publishing huge sets of data that include information on word frequencies in different stories for researchers who want to analyze American media without collecting their own data. And we’re hoping to collaborate with researchers around the world who’d like to use our tools and data to ask and answer pressing questions about what’s covered and how.

This new release is thanks to the hard work of Hal Roberts, architect of the project, David Larochelle, developer extraordinaire, and Zoe Fraade-Blanar, whose skill at interface design has made our work vastly more usable as well as more attractive. Thanks to them and everyone else involved with the Media Cloud project. Hope you’ll check our work out and let us know what you discover.

6 Responses to Media Cloud, relaunched

  1. Jay L says:

    Very cool service. I had a question about an odd behavior I saw. The cloud generated by the default “Political Blog” search includes the term “socrates”. I thought that was interesting so I clicked to see what the context(s) would be. But it looks as though there was only one source, a longish blog post at, that happens to mention “socrates” a lot, like maybe 100+ times. This made me wonder if you are (or should be) doing some normalization of term counts or document length in whatever algorithm generates the word clouds? A single blog post doesn’t seem to justify the size at which “socrates” is represented in the cloud.

  2. Jay – these happen. They’re often parsing errors – we do lots and lots of work to sort story text from navigational cruft when we evaluate the text on a page. The vast majority of the time, we get it right – in cases like this, we don’t. Often, when we get a bug report like this, we’re able to go in and tweak the parsers and get cleaner output going forward. Normalization may be one way to go – I think we might do better simply flagging cases where a word utterly dominates a source and queueing them for review, rather than putting them into word clouds…

  3. really interesting. how can we know which publications/blogs and geographic scope are included as sources of the raw data? are these all american-based?

  4. Ramin says:

    I love this idea. I like that you are already broadening it out to international news too. It will be interesting to compare and contrast US media to International media on specific news events.

  5. I concur with what Ramin says. However, the challenge the developers will have to face is the geographic appeal of this tool: as it is, the site’s data is predominantly useful ‘only’, and I stress this, for the American user. Popular blogs on Bloglines do not directly, or even in the least bit, translate to popular blogs internationally. However, I am certain that the idea will morph to capture the geographic appeal in time!
