The following is an overview of the methods that Media Cloud uses to collect, download, and analyze content from the media ecosystem. Media Cloud collects stories from 30,000 feeds belonging to 17,000 media sources, a combination of mainstream and new media sources. It stores both the full html and the extracted story text from about 50,000 stories per day from those media sources. It converts that story text into per story word counts that it makes publicly available as daily data dumps. We use those word counts to perform a variety of modes of analysis of the media ecosystem, including word clouds, clustering, mapping, and a variety of regular and custom reports written by the Media Cloud team.
For full details, see the Media Cloud source code, which is published under a free software license at http://mediacloud.svn.sourceforge.net/. Where helpful in the following overview, we include the modules implementing the given functionality so that the reader can reference the relevant code.
The goal of Media Cloud is to collect and analyze the content from a moderate number of different media sources without relying on private feeds (which are not accessible to others and therefore make our research hard for others to reproduce). A core problem for this work is that there is no single place to find associations between media sources (like the New York Times or Power Line) and the feeds that publish their headlines. For big mainstream media sources like the New York Times, it is usually (but not always) possible to find a list of feeds for the site, but there are usually a large number of feeds (we currently collect 131 feeds for the New York Times, for example), and there is almost never a single feed for a large mainstream media source that includes all stories for that source. For individual blogs, there is usually one feed that includes all stories published by the blog, but the larger number of blogs makes it necessary to automate the discovery of the single feed.
For both mainstream media and blogs, we use an automated but moderated feed scraper to discover the feeds for each source we track. For blogs hosted on some large blog hosting systems like Live Journal, we are able to determine the url of the feed directly from the blog url (for example by adding ‘/rss’ to the end of the url). For all other blogs, the feed scraper looks for any html tag with an attribute value consisting of a url including any of the text “feed|rss|syndication|sitemap|xml|rdf|atom” and then downloads each of those urls and tries to parse the resulting content as an RSS, RDF, or ATOM feed. If the content downloaded at the url can be parsed as a feed, it is added to the media source as a potential feed. Many sources do not include a list of feeds on the front page, however, so we use a recursive version of this scraper to find all feeds within a site: we find all urls matching the above feed pattern on the home page of the media source, and then also look for feed urls in any html pages downloaded via those urls and try to parse those as feeds. For example, the feed scraper finds a link with the url “http://www.nytimes.com/rss” on the home page and downloads the html of that page. That page includes a list of the New York Times’s feed urls, each of which is downloaded and parsed to verify that it is a feed. After discovering the feeds for a set of newly added media sources, a human manually reviews each new media source to verify that the system discovered feeds for the source, that the feeds are not duplicates, and that the feeds at a glance belong to the given media source (are not links to the feeds of another media source).
For full details of the Media Cloud feed discovery methods, see the Feed::Scrape module.
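The candidate-finding step of the feed scraper can be sketched roughly as follows. This is illustrative Python, not the Feed::Scrape code: the names `FeedLinkFinder` and `find_feed_candidates` are hypothetical, and the case-insensitive match and the test for what counts as a url are our assumptions. Downloading each candidate and parsing it as a feed is left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Pattern text taken from the description above; ignoring case is an assumption.
FEED_HINT = re.compile(r"feed|rss|syndication|sitemap|xml|rdf|atom", re.IGNORECASE)


class FeedLinkFinder(HTMLParser):
    """Collect attribute values that look like urls and contain a feed hint."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        for _name, value in attrs:
            if not value:
                continue
            # Assumption: a value is a url if it is absolute or root-relative.
            looks_like_url = value.startswith(("http://", "https://", "/"))
            if looks_like_url and FEED_HINT.search(value):
                url = urljoin(self.base_url, value)
                if url not in self.candidates:
                    self.candidates.append(url)


def find_feed_candidates(html, base_url):
    """Return candidate feed urls found in any tag attribute of the page."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.candidates
```

Each returned url would still need to be downloaded and parsed as RSS, RDF, or ATOM before being treated as a feed, and any candidate that turns out to be an html page would be scanned again in the same way.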
Media Set Definition
In addition to associating feeds with media sources, a key function of Media Cloud is to group media sources into sets to allow for larger scale analysis. Many of the core research questions we want to answer through Media Cloud are based not on individual media sources (“How does coverage of the health care debate differ in the New York Times vs. Power Line”) but on larger sets of media sources (“How does coverage of the health care debate differ in mainstream media vs. blogs?”). To be able to answer those larger scale questions, we have to define what “mainstream media” and “blogs” mean by defining which individual media sources belong to each label. We have defined many media sets, but for our main dashboard tool, we have defined the following media sets: U.S. Top 25 Mainstream Media, Political Blogs, Political Blogs – Center, Political Blogs – Right, Political Blogs – Left, Popular Blogs, and White House.
The U.S. Top 25 Mainstream Media set consists of the 25 news related sites with the most monthly unique users according to Google AdPlanner on 2010-02-18. Specifically, we performed a search using Google AdPlanner for the sites with the largest audience reach for English speakers in the U.S. belonging to the ‘news’ category. We manually edited the resulting list to remove sites that clearly were not related to general news (for example, linkedin.com) or were purely portal / aggregation sites. The resulting list of media sources is available at http://www.mediacloud.org/dashboard/medi…. Media Cloud has the capability to handle hundreds or thousands of sources, instead of merely 25, but we found that beyond 25, the mainstream media sources quickly become regional and local rather than national, so we limited this set to only 25 sources.
The Political Blogs sets were generated by starting from Technorati’s list of the 1000 most influential political blogs, as ranked by its proprietary ranking mechanism. We manually coded the resulting list to remove sources that either are not at least largely politically oriented or are clearly mainstream media (such as cnn.com). We erred on the side of including blogs that might fit either of these criteria, so we included blogs that were not primarily about political topics as long as they had some political content, and we included blogs owned or hosted by mainstream media organizations, such as politico.com. We also coded each source as belonging to either the Center, Left, or Right set based on our own judgment of the content of the blog.
The Popular Blogs set consists of all feeds listed in Bloglines’ list of its 1000 most popular feeds on 2009-10-12. The White House set consists of all feeds published by http://www.whitehouse.gov.
These set definitions are intended to be portable across different languages and countries. Media Cloud has back end support for languages other than English and currently supports Russian as well as English. For our Russian sources, we use similar definitions to those above. We have a Russian Top 25 Mainstream Media set that consists of an edited list of the most popular news sites for Russian speakers in Russia, as well as a Popular Blogs set that consists of the 1000 most popular blogs according to Yandex, a rough Russian equivalent for Technorati.
Media Cloud runs a crawler that downloads each of its feeds about once every four hours. After downloading a feed, it parses the feed to find any new stories referenced in the feed. A new story is a story whose url does not belong to an existing story within the system and whose title is not the same as that of an existing story belonging to the same media source with a publication date within 12 hours of the new story. The title test is necessary because many sources publish the same story many times in the same day with different urls (usually either publishing the same story with different urls for each associated feed, or publishing edited versions of the same story). Each new story is added to the Media Cloud database along with an instruction to the crawler to download the html for that story. The html for the story is usually downloaded within an hour of discovering the new story in a feed. The html for each story is only downloaded once, so edits to a story after the initial download are not captured.
For full details of the Media Cloud crawling methods, see the MediaWords::Crawler::Engine, MediaWords::Crawler::Provider, and MediaWords::Crawler::Handler modules.
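The new story test can be sketched as below. This is a minimal illustration rather than the crawler's code: the `Story` record, its field names, and the in-memory list of existing stories are assumptions made for the example (the real system checks against its database).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Story:
    media_id: int          # hypothetical id of the media source
    url: str
    title: str
    publish_date: datetime


TITLE_WINDOW = timedelta(hours=12)  # from the description above


def is_new_story(candidate, existing_stories):
    """A story is new if its url is unknown and no story from the same
    media source has the same title within 12 hours of it."""
    for story in existing_stories:
        if story.url == candidate.url:
            return False
        same_source = story.media_id == candidate.media_id
        same_title = story.title == candidate.title
        close_in_time = abs(story.publish_date - candidate.publish_date) <= TITLE_WINDOW
        if same_source and same_title and close_in_time:
            return False
    return True
```

Under this sketch, the same title published by a different media source, or by the same source more than 12 hours apart, still counts as a new story.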
Many mainstream media sources use several pages for many individual stories. Each time the crawler downloads a story, it looks for a ‘next page’ link and adds an instruction to the crawler to download that url and add the content to the story (each single story can include several downloads of html content). The paging module of the crawler finds next page links by looking for links within the story that include the word ‘next’ in the link text with at most one word before and two words after the word ‘next.’ The paging module applies a handful of other heuristics to try to weed out ‘next’ links that are not for pages (the most common are next-story and next-page-of-comments links); for example, the paging module ignores all ‘next’ links that link to a url that includes the text ‘blog,’ ‘forum,’ or ‘discussion’.
For full details of the Media Cloud paging methods, see the MediaWords::Crawler::Pager module.
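The two paging rules described above can be sketched as follows. The function name `is_next_page_link` is hypothetical, and splitting the link text into words with a simple regular expression is our assumption; the other heuristics of the paging module are not covered.

```python
import re

# Url fragments that mark 'next' links that are not story pages (from the text above).
EXCLUDED_URL_TEXT = ("blog", "forum", "discussion")


def is_next_page_link(link_text, url):
    """Accept links whose text has 'next' with at most one word before it
    and at most two words after it, unless the url looks like a blog,
    forum, or discussion link."""
    words = re.findall(r"\w+", link_text.lower())
    if "next" not in words:
        return False
    position = words.index("next")
    words_before = position
    words_after = len(words) - position - 1
    if words_before > 1 or words_after > 2:
        return False
    lowered_url = url.lower()
    return not any(text in lowered_url for text in EXCLUDED_URL_TEXT)
```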
The crawler downloads the full html for the url (or urls) associated with each story, but that html includes not just the substantive content of the news story but also all of the surrounding content that constitutes the web page, including ads, navigation, and other contextual content. For sources that include the full text of the stories in their feeds (according to a detection algorithm we run), we use that text instead of the downloaded HTML. For all other sources, Media Cloud extracts just the substantive content of the news story from the downloaded HTML, discarding the rest of the page. The text extraction engine uses the HTML density of each line as the basic metric to determine whether a given line is part of the substantive content of the story. The HTML density is the ratio of characters within HTML tags to all characters in a given line. As a first approach, lines with an HTML density of less than 0.10 are treated as part of the substantive text of the story (extracted text), and lines with an HTML density of greater than 0.10 are omitted from extracted text. This basic heuristic works because ads, navigation, and other contextual content tend to be very heavily marked up, with links, formatting, and other html combined with relatively little text.
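As a rough illustration of the basic metric, the sketch below computes the HTML density of each line and keeps the lines under the 0.10 threshold. It assumes that "characters within HTML tags" means the characters of the tag markup itself (from `<` to `>`), and it treats empty lines as non-content; both are our assumptions, and the function names are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")
DENSITY_THRESHOLD = 0.10  # from the description above


def html_density(line):
    """Ratio of characters inside tag markup to all characters in the line."""
    if not line:
        return 1.0  # assumption: an empty line is never story text
    tag_chars = sum(len(tag) for tag in TAG.findall(line))
    return tag_chars / len(line)


def extract_lines(html):
    """Keep only the lines whose HTML density is below the threshold."""
    return [line for line in html.splitlines()
            if html_density(line) < DENSITY_THRESHOLD]
```

For example, the line `<b>ab</b>` has 7 tag characters out of 9, a density of about 0.78, so it would be dropped, while a long sentence with no markup has a density of 0 and would be kept.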
But the basic heuristic is not accurate enough on its own, so the extractor uses a collection of about a dozen other signals to modify the basic HTML density score. For example, for lines that are longer than 256 characters the HTML density is halved because those lines are more likely to be part of the substantive text regardless of HTML density (and for lines longer than 512 characters, the HTML density is halved again). Discounts are similarly given for lines that share many words with the RSS title or description of the story; that are close to lines already marked for extraction; that are within comments that indicate the text belongs on the printable version of the page; and that have <p> or <a> tags. Various other signals are used to increase the HTML density score of each line, including whether the line is close to an html tag including the text ‘comment’ and whether the line has one of a number of words like ‘copyright’ that tend to be in meta statements like copyright or discussion policy lines. We also remove a number of lines from consideration altogether, including lines that are outside of the <body> tags or outside of printable page comments if present.
For full details of the Media Cloud text extraction methods, see the MediaWords::Crawler::Extraction and HTML::CruftText modules.
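The line length discount is the one adjustment whose size is stated above, so it can be shown as a small worked example. The function name `adjusted_density` is hypothetical, and we assume the line length counts all characters including markup; the other signals are omitted because their weights are not given here.

```python
def adjusted_density(raw_density, line_length):
    """Halve the HTML density for lines longer than 256 characters,
    and halve it again for lines longer than 512 characters."""
    density = raw_density
    if line_length > 256:
        density /= 2
    if line_length > 512:
        density /= 2
    return density
```

A 600 character line with a raw density of 0.3 ends up at 0.075, which falls below the 0.10 threshold, so a long but moderately marked up paragraph can still be extracted.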
After the text from each story is extracted, Media Cloud removes the html from the extracted text and stores the plain, extracted text of each story in the database. This extracted text is parsed into individual sentences. The sentence parser looks for the end of each sentence as a word followed by a period followed by a space, with exceptions for a number of common English (or Russian) abbreviations like ‘Mr.’ and ‘Mrs.’ To avoid over-counting words from sentences that are wrongly extracted, such as navigation taglines or copyright disclosures, the vectoring engine ignores any sentence that has already appeared within the given media source in the given week (this method possibly undercounts words from quotes, if the quotes include exactly identical sentences). The vectoring engine then generates a stem for each word using a Porter stemming implementation, reducing, for example, ‘economies’ and ‘economy’ to ‘econom’. The system then counts how many times each stem appears in each sentence, keeping only stems that are at least 3 characters long, that are not among a list of 176 very common stopwords (like ‘he’, ‘she’, ‘a’, and ‘the’), that do not include any non-printable characters, and that include at least one letter. The system stores this per sentence per story count of each stem.
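A simplified sketch of the sentence splitting, sentence de-duplication, and stem filtering steps follows. The abbreviation and stopword sets are tiny illustrative subsets, the function names are hypothetical, and stemming is left as a pluggable `stem` argument (identity by default) rather than a real Porter stemmer.

```python
import re
from collections import Counter

ABBREVIATIONS = {"mr", "mrs", "dr"}      # illustrative subset only
STOPWORDS = {"he", "she", "a", "the"}    # illustrative subset of the 176-word list


def split_sentences(text):
    """Split at a word followed by a period and whitespace, except abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"(\w+)\.\s+", text):
        if match.group(1).lower() in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def new_sentences(sentences, seen):
    """Drop sentences already seen for this media source in this week.
    `seen` is a set that the caller keeps per media source and week."""
    fresh = []
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            fresh.append(sentence)
    return fresh


def keep_stem(stem):
    """Apply the four stem filters described above."""
    return (len(stem) >= 3 and stem not in STOPWORDS and stem.isprintable()
            and any(ch.isalpha() for ch in stem))


def count_stems(sentence, stem=lambda word: word):
    """Count the stems of one sentence that pass the filters."""
    stems = (stem(word) for word in re.findall(r"\w+", sentence.lower()))
    return Counter(s for s in stems if keep_stem(s))
```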
The per sentence per story counts are stored in a large table that is expensive to query, so that table is aggregated into a daily count of words for each media source and media set. Also added to the daily counts table are daily counts for topics, which are defined as the words that appear in the same sentence as some other word. So for the ‘health’ topic, Media Cloud stores the counts for all the words that appear in the same sentence as the word ‘health’ within each media source or each media set. Aggregating the per sentence per story counts into daily counts allows us to query that daily count data from the database quickly. The daily counts are further aggregated into weekly counts, which are further aggregated into weekly counts for the top 500 words for each media source or media set. The top 500 weekly word counts eliminate words within a much larger stopword list that includes over 4000 words, generated through a combination of analyzing the most common words within our system and manually including words that do not provide useful data for analysis.
To generate the daily and weekly counts for media sets, the vectoring system adds the word counts for each of the media sources within a media set. So the daily and weekly word counts for the Political Blogs set consist of the sum of the word counts from each of the individual media sources that make up the Political Blogs set.
For full details of the Media Cloud vectoring and aggregation methods, see the MediaWords::StoryVectors, MediaWords::Util::StopWords, and MediaWords::Util::Stemmer modules.
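The aggregation steps can be sketched with in-memory counters. The row layout `(media_id, day, counts)` and the function names are assumptions made for the example (the real system aggregates database tables), and we assume that a topic's counts include the topic word itself.

```python
from collections import Counter, defaultdict


def daily_counts(sentence_rows):
    """Sum per sentence counts into one Counter per (media source, day).
    Each row is (media_id, day, Counter of stems for one sentence)."""
    totals = defaultdict(Counter)
    for media_id, day, counts in sentence_rows:
        totals[(media_id, day)].update(counts)
    return totals


def topic_counts(sentence_rows, topic):
    """Daily counts restricted to sentences that mention the topic word."""
    return daily_counts(row for row in sentence_rows if topic in row[2])


def media_set_counts(per_source, media_ids, day):
    """Counts for a media set are the sum over its member media sources."""
    total = Counter()
    for media_id in media_ids:
        total.update(per_source.get((media_id, day), Counter()))
    return total


def top_words(counts, n=500):
    """The n most frequent words, as (word, count) pairs."""
    return counts.most_common(n)
```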
Analysis – Queries
For all analysis functions, we use queries to represent the content associated with:
- a set of any combination of media sets and individual media sources
- a date range rounded to the nearest week
- a set of topics
So a query might consist of all of the content within the New York Times during the week starting 2011-04-04, or it might consist of all of the content within the U.S. Top 25 Mainstream Media and Political Blogs media sets, from the week starting 2011-04-04 through the week starting 2011-04-18 covering the topics of ‘Egypt’, ‘Tunisia’, or ‘Protests’. In either case, the system will return a word vector — a list of the word counts — for the top 500 weekly words (as described in the Vectoring section above) within the content described by the query. For queries that include a set of topics, the query returns the set of words that appear in the same sentence as any of the query keywords (in the above example, ‘tunisia’, ‘egypt’, or ‘protest’).
Analysis – Similarity
To determine the similarity of two media sets, we use the cosine similarity of the word vector of the 500 most used words within each query. The cosine similarity function returns a number between 0 (sharing no words at all) and 1 (sharing exactly the same set of words and counts). To normalize the word counts between sources with differing total numbers of words, we divide the count for each word by the total of all word counts within the top 500 words for the query (otherwise, word counts for ‘obama’ would be much higher in the New York Times than in any blog simply because the New York Times publishes much more content).
For full details of the Media Cloud similarity methods, see the MediaWords::Cluster and MediaWords::Cluster::Map modules.
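The similarity calculation can be sketched as follows, with word vectors represented as plain dictionaries from word to count. The function names are hypothetical and this is not the MediaWords::Cluster code.

```python
import math


def normalize(counts):
    """Divide each count by the total of all counts in the vector."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)
```

With `{"obama": 100, "budget": 50}` for a large source and `{"obama": 10, "budget": 5}` for a small one, the normalized vectors are identical and the similarity is 1, even though the raw counts differ by a factor of ten. Two sources that share no words at all score 0.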
Analysis – Word Clouds
The word clouds displayed on the mediacloud.org dashboard for individual media sets consist of the top 100 words from the top 500 weekly word counts for the query. The size of each word is scaled according to the log of the number of uses of the word during the given week. For comparative word clouds (word clouds that compare two different queries), the system starts with the top 75 words of each query but colors words according to their relative frequency within each query. Words in red or blue are words within the top 75 words for one query that are ranked at least 25 spots higher in that query than in the other query. Words in purple are ranked within 25 spots of each other in the two queries. So the word ‘elephant’ would be blue if it were the 15th most popular word in the first query but only the 50th most popular word in the second query. If ‘elephant’ were the 15th most popular word in the first query and the 35th most popular word in the second query, it would be colored purple. Words are sized according to the number of mentions divided by the number of matching queries. So a word that is found in both queries is sized according to the mentions in both queries divided by two, but a word that is found in only one query is sized according to the number of mentions in the one query (with no division).
For full details of the Media Cloud word cloud methods, see the MediaWords::Util::WordCloud module.
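The coloring and sizing rules for comparative word clouds can be sketched as below. The function names are hypothetical. Two points are our assumptions: a word missing from the other query's top 75 is colored for the query that has it, and red is the color for the second query (the text above only states that blue marks the first).

```python
def ranks(counts, top_n=75):
    """Map each of the top_n words to its rank (1 = most frequent).
    Ties are broken alphabetically, which is an assumption."""
    ordered = sorted(counts, key=lambda word: (-counts[word], word))[:top_n]
    return {word: index + 1 for index, word in enumerate(ordered)}


def word_color(word, ranks_first, ranks_second, gap=25):
    """Blue or red if one query ranks the word at least `gap` spots
    higher than the other, purple otherwise."""
    first, second = ranks_first.get(word), ranks_second.get(word)
    if second is None:
        return "blue"
    if first is None:
        return "red"
    if second - first >= gap:
        return "blue"
    if first - second >= gap:
        return "red"
    return "purple"


def word_size(word, counts_first, counts_second):
    """Mentions divided by the number of queries that contain the word.
    Both count dictionaries are assumed to hold only top 75 words."""
    present = [counts[word] for counts in (counts_first, counts_second) if word in counts]
    return sum(present) / len(present)
```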
Analysis – Clustering
We use Kmeans clustering to assign individual media sources within a given query to a set of clusters based on the similarity of the word vector of each individual media source. For this clustering, we use only the top 100 words for each media source as the word vector. Within each word vector, we divide the number of mentions of each word by the total number of mentions of all of the top 500 words for the given media source for the given query. To begin the clustering, we choose a number of clusters, starting at ( ( N ^ 2 ) / 2 ). We adjust if manual inspection indicates too many general purpose clusters (indicating not enough clusters) or too many split clusters (meaning too many clusters).
To begin the Kmeans clustering run, we randomly assign N media sources to be the starting cluster centroids, where N is the number of clusters. We then assign each media source to the centroid with the highest similarity to itself, recalculate the centroid for each cluster as the mean vector of the cluster member word vectors, and then repeat: reassign cluster membership and recalculate each cluster centroid. We repeat this process up to 20 times, stopping early if there are no more reassignments. Because of the randomly seeded start, the results of this process are not deterministic. So to find the best solution we can, we repeat the whole clustering process 20 times (or more for clustering runs of fewer than 200 sources). We assign each clustering run a quality score of the sum for each cluster of the square root of the sum of the similarity scores of each cluster member to each other cluster member. We keep the clustering run with the highest quality score. To automatically assign names to clusters, we take the most highly ranked word within each cluster that is more highly ranked than in any other cluster.
For full details of the Media Cloud clustering methods, see the MediaWords::Cluster, MediaWords::Dashboard::Clusters, and MediaWords::Cluster::Kmeans modules.
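The clustering loop and quality score described above can be sketched as follows. This is a simplified illustration, not the MediaWords::Cluster::Kmeans code: word vectors are plain dictionaries, the number of clusters `k` is passed in by the caller, the function names are hypothetical, and we assume the quality score sums similarities over ordered pairs of distinct members.

```python
import math
import random


def cosine(a, b):
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    total = {}
    for vector in vectors:
        for word, value in vector.items():
            total[word] = total.get(word, 0.0) + value
    return {word: value / len(vectors) for word, value in total.items()}


def kmeans(vectors, k, max_iterations=20, rng=random):
    """One clustering run: random starting centroids, then at most
    max_iterations rounds of reassignment and centroid recalculation."""
    centroids = rng.sample(vectors, k)
    assignment = [None] * len(vectors)
    for _ in range(max_iterations):
        new_assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                          for v in vectors]
        if new_assignment == assignment:
            break  # no more reassignments
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment


def quality(vectors, assignment, k):
    """Sum over clusters of sqrt(sum of member-to-member similarities)."""
    score = 0.0
    for c in range(k):
        members = [v for v, a in zip(vectors, assignment) if a == c]
        pair_sum = sum(cosine(members[i], members[j])
                       for i in range(len(members))
                       for j in range(len(members)) if i != j)
        score += math.sqrt(pair_sum)
    return score


def best_clustering(vectors, k, runs=20, seed=0):
    """Repeat the whole run and keep the assignment with the best quality."""
    rng = random.Random(seed)
    results = [kmeans(vectors, k, rng=rng) for _ in range(runs)]
    return max(results, key=lambda assignment: quality(vectors, assignment, k))
```

The fixed `seed` is only there to make the example repeatable; the process described above is randomly seeded.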
Analysis – Cluster Mapping
We use a force based layout algorithm to visualize how the individual media sources within a clustering run relate to one another. Each media source is represented by a dot on the map, and the color of each dot represents the cluster (from the Kmeans clustering as described above) to which the given media source belongs. In some clustering maps, the shape of the node indicates the media set to which the given media source belongs. The position of each node is determined by a force based layout algorithm that places related sources close to each other. To determine whether two sources are related, we generate a matrix of the cosine similarities of the top 100 word vector of each individual media source. From that matrix of similarity scores, we pick a number (N) of the highest similarity scores. The sources whose similarities are present in the top N similarity scores are considered related to one another. The force based layout tries to push all nodes apart with a force of 1 / d, where d is the distance between the nodes. Sources that are related to one another attract one another with a force of d ^ 2. The number N of top similarity scores is manually chosen by researchers to be the highest number of scores that minimizes clumping of all nodes in the middle of the map. The number N is generally about (1.5 * the number of sources).
For full details of the Media Cloud cluster mapping methods, see the MediaWords::Cluster::Map module.
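The two ingredients of the layout, choosing the related pairs and computing the forces, can be sketched as below. The function names and data layout are hypothetical: similarities are keyed by `(i, j)` pairs with `i < j`, positions are 2D tuples, and only the net force on each node is computed (the iteration that moves the nodes is omitted).

```python
import math
from itertools import combinations


def related_pairs(similarity, n):
    """Return the n pairs with the highest similarity scores.
    `similarity` maps (i, j) pairs with i < j to a cosine similarity."""
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    return set(ranked[:n])


def net_forces(positions, related):
    """Every pair repels with force 1 / d; related pairs also attract
    with force d ** 2. Returns the net [x, y] force on each node."""
    forces = {node: [0.0, 0.0] for node in positions}
    for a, b in combinations(sorted(positions), 2):
        dx = positions[b][0] - positions[a][0]
        dy = positions[b][1] - positions[a][1]
        d = math.hypot(dx, dy) or 1e-9  # avoid dividing by zero
        ux, uy = dx / d, dy / d         # unit vector pointing from a to b
        magnitude = -1.0 / d            # negative: pushes a away from b
        if (a, b) in related:
            magnitude += d ** 2         # positive: pulls a toward b
        forces[a][0] += magnitude * ux
        forces[a][1] += magnitude * uy
        forces[b][0] -= magnitude * ux
        forces[b][1] -= magnitude * uy
    return forces
```

For two related nodes at distance 2, the attraction (4) outweighs the repulsion (0.5), so each node is pulled toward the other with a net force of 3.5; unrelated nodes at the same distance are pushed apart with a force of 0.5.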
Analysis – Polar Mapping
We use polar maps to visualize how the individual media sources within a clustering run relate to a separate query. As in a cluster map, each node represents an individual media source, the color of each node represents the cluster membership of the media source, and in some maps the shape of the node indicates the media set of the media source. For polar maps, a black dot in the middle of the map represents the polar query for the map (for instance, the polar query might be the Russian Government media set from 2011-04-04 to 2011-04-18). The distance of every other node from the center of the map is determined by the similarity (represented as the cosine similarity of the top 100 word vectors) between the individual media source and the polar query. Media sources that are more similar to the polar query are closer to the central black dot and media sources that are less similar are farther away. The angular position of a node in the polar map has no meaning.
For full details of the Media Cloud polar mapping methods, see the MediaWords::Cluster::Map module.
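A minimal sketch of a polar layout follows. The mapping from similarity to distance (here `1 - similarity`) and the even spacing of nodes around the circle are assumptions made for illustration; the text above only says that more similar sources sit closer to the center. The function name is hypothetical.

```python
import math


def polar_positions(similarities):
    """Place each media source at a distance of 1 - similarity from the
    center. `similarities` maps a source name to its cosine similarity
    with the polar query. Angles are spread evenly and carry no meaning."""
    names = sorted(similarities)
    positions = {}
    for index, name in enumerate(names):
        radius = 1.0 - similarities[name]
        angle = 2 * math.pi * index / len(names)
        positions[name] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions
```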
Analysis – Country Coverage
To determine which countries are covered by which stories, we count the number of times keywords identifying the country are mentioned in the given query. For countries with a single keyword (e.g. Egypt or Canada), all mentions of the single keyword are counted. For countries with multiple keywords (e.g. United States or Saudi Arabia), the number of sentences including both terms is counted. This is a rough method that is liable to both over counting (‘All states were united against the effort’) and under counting (‘Protests in Cairo continued today’).
For full details of the Media Cloud country coverage methods, see the MediaWords::StoryVectors and MediaWords::Util::Countries modules.
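The counting rule can be sketched as below, working directly on sentence text. The function name is hypothetical, and the sketch matches whole lowercase words rather than the stored stem counts that the real system would use.

```python
import re


def country_mentions(sentences, keywords):
    """Single keyword: count every mention. Multiple keywords: count the
    sentences that contain all of the keywords."""
    keywords = [keyword.lower() for keyword in keywords]
    total = 0
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        if len(keywords) == 1:
            total += words.count(keywords[0])
        elif all(keyword in words for keyword in keywords):
            total += 1
    return total
```

The over counting example above shows up directly: ‘All states were united against the effort’ counts as one mention of the United States, while ‘Protests in Cairo continued today’ counts as zero mentions of Egypt.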
Why does the Media Cloud crawler only crawl every 4 hours? Do you think that is enough to get a legitimate sample of news feed blog updates? My concern is that stories can be on and off of a large media news feed in less than an hour if it is not popular. Just a thought when I read your article. Anyway, please let me know what you think, I’m definitely interested in your response.
In practice, the only sources that scroll stories that fast are msm sources, and for every big msm source we capture lots of feeds, most of which scroll pretty slowly, allowing us to catch all of the stories. So we might not catch every story in the front page feed of the nytimes, but we have > 200 of their feeds, and every story that appears in that front page feed will appear in a few other feeds as well (so a front page story about the Libyan conflict will be at least in the world feed as well as the front page feed). We are almost certain to catch each story in at least one of the feeds.
I’d love to add some intelligence to our crawler to pay attention when a given feed is scrolling more quickly than every four hours and to adjust the crawl interval accordingly on a per crawl basis. Just hasn’t made it to the top of the todo list yet.