One thing that we can use quantitative text analysis for is to get a sense for the overall landscape of a set of blogs. The following map of popular blogs in the U.S. (the top 1000 blogs according to Bloglines) gives a good sense of what topics people write about on popular blogs and how those topics relate to one another:
Click the image above for an interactive version of this map along with a list of clusters and which blogs belong to which clusters.
This map is generated by comparing the top 100 words of each blog in the popular blogs set during the first five months of 2011. Each dot represents a blog. We compute the cosine similarity between every pair of blogs in the set and then designate the 3000 blog pairs with the highest similarity scores as being similar to one another. The map is then laid out using a force-based mechanism that pushes all of the nodes apart, with the 3000 similarity pairs acting as forces holding those pairs together. So blogs that are similar to one another end up closer to one another in the map. The color of each node is the result of using the same cosine similarity scores to group the blogs into 25 separate clusters according to the similarity of the blogs’ content.
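To make the similarity step concrete, here is a minimal sketch in Python of scoring blog pairs and keeping the strongest links. It is not the code behind the map: the function names, the whitespace tokenizer, and the toy blog texts are all assumptions made for illustration, and the real run presumably handled stop words and feed parsing in ways not described here.

```python
import math
from collections import Counter
from itertools import combinations


def top_words(text, n=100):
    """Return a Counter holding the n most frequent lowercase words in text."""
    counts = Counter(text.lower().split())
    return Counter(dict(counts.most_common(n)))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_pairs(blogs, max_pairs=3000, n_words=100):
    """Score every pair of blogs and keep the max_pairs highest-scoring pairs."""
    vectors = {name: top_words(text, n_words) for name, text in blogs.items()}
    scored = [
        (cosine_similarity(vectors[x], vectors[y]), x, y)
        for x, y in combinations(sorted(vectors), 2)
    ]
    scored.sort(reverse=True)
    return scored[:max_pairs]


# Hypothetical toy texts standing in for the real feeds.
blogs = {
    "knit_a": "yarn knit sock yarn pattern knit",
    "knit_b": "knit yarn sweater pattern yarn",
    "news_a": "obama senate vote obama budget",
    "tech_a": "iphone app google app android",
}
edges = most_similar_pairs(blogs, max_pairs=2)
```

In this toy data the two knitting blogs share most of their vocabulary, so they come out as the top pair, while the others share no words at all and score zero. The retained pairs are what a force-based layout would treat as the links pulling nodes together.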
This map tells us that these U.S. popular blogs (to the degree they are represented by the particular Bloglines sample) can be grouped into three big meta-clusters: Crafts, News, and Technology. Of these, the Crafts meta-cluster is the biggest, and the biggest single cluster, with 137 blogs, is the big ‘love’ cluster (cluster labels are automatically generated as the most common word within the cluster’s member blogs). This is the most general cluster in the map, as indicated by the fact that it sits closer to the center of the map than any other cluster. It is a mix of general crafting blogs and general personal life blogs. A few of the blogs include: [ Wee Wonderfuls ] [ Loobylu ] [ A Dress A Day ] [ Sea of Shoes ] [ PostSecret ] [ still me ] [ Loldogs, Dogs ‘n’ Puppy Dog Pictures – I Has A Hotdog! ] [ six and a half stitches ] [ CrazyAuntPurl ]. The word cloud for the ‘love’ cluster is strongly dominated by the title word, with mostly crafting words among the next most common:
The next largest cluster within the crafting meta-cluster, with 44 blogs, is the ‘knit’ cluster, which is exactly what it sounds like. According to the clustering run, roughly half as many blogs within the Bloglines top 1000 feeds focus on knitting as do on politics (82, which we’ll discuss below). A few of the blogs in this cluster are: [ Lolly Knitting Around ] [ SO MUCH YARN, SO LITTLE TIME! ] [ Grumperina goes to local yarn shops and Home Depot ] [ Twist Blog ] [ turkey feathers ] [ Hello Yarn ] [ Sheri at The Loopy Ewe ] [ Yarn-A-Go-Go ]. The word cloud for the cluster is exactly what one would expect a knitting word cloud to look like, with a very strong focus on knitting and clear differentiation from other clusters:
Another prominent cluster within the crafting meta-cluster is the ‘recipe’ cluster, with 29 blogs. The blogs in this cluster include: [ (Inside A Black Apple) ] [ This American Life ] [ Baking Banter ] [ Cooking For Engineers ] [ Cheap Healthy Good – Frugal Recipes and Food Tips ] [ CRAFT ] [ smitten kitchen ] [ Accidental Hedonist ] [ not martha ]. The word cloud for this cluster consists almost entirely of food-related words:
Almost all of the blogs and the top words in this cluster are on the topic of food. But the inclusion of This American Life, a radio show not focused at all on food, is a typical artifact of a content-based clustering process. In this particular case, This American Life posts a short summary of each week’s show to its RSS feed. Two of the twelve weekly stories posted to its feed during the time period of this clustering run are about the original recipe for Coke and about the Onion, respectively. Because the rest of its stories are so diverse, those two stories, especially the recipe story, are enough to throw it into this recipe cluster.
The news meta-cluster is dominated by the ‘obama’ cluster of news sources. The ‘obama’ cluster has 82 feeds that are a mix of general news, soft news, and political news in blogs and in mainstream media, including: [ Calculated Risk ] [ All Salon ] [ nybooks.com: Latest articles ] [ youtube :: most viewed videos – today ] [ Right Wing Nut House ] [ Overheard in New York ] [ NYT > Arts ] [ NYT > Opinion ] [ Yahoo! News: Top Stories ] [ Waiter Rant ]. This cluster contains such a variety of sources because the differences among types of news are swamped by the difference between any kind of news and the sorts of content in the crafting or technology meta-clusters. The word cloud for the ‘obama’ cluster shows that the content in the cluster is dominated by politics, regardless of the diversity of the sources. In fact, the cloud acts as a pretty good description of political news coverage for the period covered by the clustering run, with top words including not only the obvious ongoing ‘obama,’ ‘republican,’ and ‘democrat’ topics but also more specific topics dominating particular news cycles, like ‘egypt,’ ‘libya,’ and ‘japan.’
The final meta-cluster is on technology, dominated by the ‘google,’ ‘app,’ and ‘iphone’ clusters. In this case, the k-means clustering that picks out cluster members (and assigns colors in the map) is able to subtly differentiate between these computer and Internet topics even though they are very closely related, even to close human inspection. The mapping process (which assigns the position of the blogs on the map) is not able to differentiate among these blogs as well, which leads to the dense clump of blogs on the right of the map.
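For readers unfamiliar with k-means, the sketch below shows one way a cosine-based variant (sometimes called spherical k-means) could separate closely related technology blogs. It is an illustration under stated assumptions, not the clustering code used for the map: the word counts are invented, the three columns stand for hypothetical counts of ‘google,’ ‘app,’ and ‘iphone,’ and the starting centroids are picked by hand for reproducibility, where a real run would typically start from random centroids and use 25 clusters over far more words.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def dot(a, b):
    """Dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def kmeans_cosine(vectors, initial_centroids, iterations=20):
    """Assign each vector to its most similar centroid, recompute each
    centroid as the normalized mean of its members, and repeat until the
    assignments stop changing (or the iteration limit is reached)."""
    points = [normalize(v) for v in vectors]
    centroids = [normalize(c) for c in initial_centroids]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            max(range(len(centroids)), key=lambda c: dot(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignment


# Invented counts of the words (google, app, iphone) for six toy blogs.
counts = [
    [9, 1, 0], [8, 2, 1],  # mostly 'google'
    [1, 9, 2], [0, 8, 3],  # mostly 'app'
    [1, 2, 9], [0, 3, 8],  # mostly 'iphone'
]
labels = kmeans_cosine(counts, [counts[0], counts[2], counts[4]])
```

Here `labels` gives each toy blog a cluster index, which is the kind of assignment used to color nodes. Because the clustering works on the full word vectors while the layout only sees a limited set of pairwise links, it is plausible for the two steps to disagree about how separable these blogs are.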
The ‘google’ cluster mostly consists of blogs that focus on topics directly related to Google, including search engines (‘keyword’), search engine optimization (‘seo’), and online advertising (‘adsense’). The blogs in this cluster include: [ TEDTalks (video) ] [ Google Earth Blog ] [ Techdirt ] [ Coding Horror ] [ Google Sightseeing ] [ Daily Blog Tips ] [ Small Business Marketing Blog from Duct Tape Marketing ] [ Adverblog ] [ Gmail Blog ]. There is potential for this cluster to contain some artifact members, since ‘google’ is such a common word on the Internet and is often used for meta-comment (e.g. ‘use google to search this site…’). But in fact, every member of this cluster is actually substantively focused on Google or on the issues closely related to Google described in the cluster word cloud:
The ‘app’ cluster is less strongly focused, including a variety of more generalist computer and technology blogs: [ Hack a Day ] [ Random Acts Of Reality ] [ USATODAY.com Tech – Top Stories ] [ Download Squad ] [ Springwise ] [ Linux Journal – The Original Magazine of the Linux Community ] [ BBC News | Technology | UK Edition ] [ Scobleizer ] [ PortableApps.com – Portable software for USB drives ] [ Betanews ]. The cluster notably includes a wide variety of current platforms, including ‘android,’ ‘iphone,’ ‘linux,’ ‘mac,’ and ‘microsoft’ (as well as ‘facebook,’ ‘google,’ and ‘twitter,’ arguably the dominant online platforms):
Finally, the ‘iphone’ cluster includes blogs that focus on personal technology in general and iPhones and iPads in particular: [ Pogue’s Posts ] [ Going My Way ] [ ThinkGeek :: What’s New ] [ AnandTech Article Channel ] [ MobileCrunch ] [ TechCrunch Japan ] [ Web Designer Wall – Design Trends and Tutorials ] [ Apple Hot News ] [ Cool Hunting ]. Even though the ‘app’ cluster above includes desktop computing platforms (‘microsoft’ and ‘mac’), blogs about those platforms are only strongly represented in that general cluster. Blogs in the ‘iphone’ cluster, in contrast, focus strongly on iPhones and gadgets, creating a strongly separated cluster:
I’ll dig into more interesting details about this U.S. Popular Blog map in future posts. Please feel free to poke at the full results of the clustering run and add your observations below.