Ample room for improvement in spam filtering practices

A recent article from Slashdot contributor Bennett Haselton highlights the risks inherent in any automated filtering system, even a well-intentioned one. Haselton runs a mailing list through which he informs users of web proxies that can be used to circumvent filtering, and he regularly distributes information about new proxies as existing ones become blocked. In September, Haselton emailed the list with information about ten new proxies he had created. Two weeks later, two of the proxies were placed on the domain blocklist of Spamhaus, a spam-tracking organization. That in itself was not new; spam filters have flagged Haselton's proxies before. What was new was that following Spamhaus's action, Haselton's domain registrar preemptively disabled all ten of his new proxy domains, not just the two erroneously identified as spam.

Why were all ten of Haselton's proxies disabled? It was actually the fault of two organizations working in concert: Spamhaus and Haselton's registrar, Afilias. Spamhaus is one of several organizations maintaining databases to help flag spam. It maintains blacklists that flag, for example, IP addresses associated with spam operations. One of those blacklists, which Spamhaus calls the DBL, covers domains typically found in spam messages. Spamhaus recommends that ISPs and other entities use its DNS-based IP blacklists to reject mail from bad IP addresses before it is processed by the mail server, then use the DBL to scan the content of the remaining messages for blacklisted domains. Spamhaus's lists are publicly accessible and receive, according to its website, billions of queries every day.
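For readers curious about the mechanics: these blacklists are queried over ordinary DNS. A mail server reverses the octets of a connecting IP address and appends the blacklist zone, or appends a sender's domain to the DBL zone; an answer in 127.0.0.0/8 means the entry is listed, while NXDOMAIN means it is not. A minimal sketch of how those query names are constructed (the helper names here are my own; the zone names are Spamhaus's public ones):

```python
# Sketch of the standard DNSBL query convention used by mail servers.
# Helper function names are illustrative, not from any particular library.

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build an IP-blacklist query: reverse the IPv4 octets, append the zone."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

def dbl_query_name(domain: str, zone: str = "dbl.spamhaus.org") -> str:
    """Build a domain-blocklist query: the domain is simply prepended to the zone."""
    return domain + "." + zone

# A mail server resolves these names with a normal DNS A-record lookup;
# a 127.0.0.x answer means "listed", a lookup failure means "not listed".
# For example, with the standard library (requires network access):
#   import socket
#   try:
#       socket.gethostbyname(dnsbl_query_name("203.0.113.7", "zen.spamhaus.org"))
#       listed = True
#   except socket.gaierror:
#       listed = False
```

Because the lookup is just DNS, any mail server, spam filter, or registrar can consult the same lists at negligible cost, which is how a single listing can ripple out to so many different services.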

The problem with Haselton's proxies began when Spamhaus placed two of the proxy domains on its DBL. An effective domain blocklist must be careful not to flag domains that are actually legitimate. In fact, Spamhaus crows that its DBL has a "zero-false-positive reputation." But as Haselton discovered, there is no way to guarantee a zero-false-positive rate. It might even be possible for a malicious party to force a domain onto the DBL by repeatedly inserting it into spam messages.

Afilias, Haselton's registrar, suspended all ten of the proxies once Spamhaus placed two of them on its DBL. As a domain registrar, Afilias uses the DBL to try to shut down spammers. When it saw two domains it had registered appear on the DBL, it noticed that Haselton had registered eight others at the same time and preemptively suspended all ten. Haselton was not notified; instead, he had to wade through a circuitous series of calls to three different companies before he was told that his proxies had been placed on a blacklist. Haselton then found that he could instantly and automatically remove his sites from the DBL by submitting a form on Spamhaus's site. This itself gave him pause: a blacklist is defeated if every site on it can be removed so easily.

Haselton's experience demonstrates the draconian spam-prevention policies of some domain registrars. First, Afilias should have notified Haselton of the reason his sites were to be taken down; instead, Haselton found out only when members of his mailing list emailed him. Furthermore, Afilias should not have automatically suspended all of Haselton's domains. At the very least, it should have examined the content of each site to see whether it was actually connected to spam operations. Finally, Afilias should have given Haselton better resources for dealing with the suspension of his domains. Haselton received no help from Afilias and had to investigate by himself how to get his sites removed from the blacklist. Afilias's current spam-filtering policy casts too wide a net and seems to offer no due process to site owners.

Haselton's experience also underscores the need for transparency in spam-filtering practices. According to Spamhaus, the suggested implementation for its blacklists "will identify and reject approximately 85% of an average mail relay's incoming mail traffic." That is, 85% of messages sent to a mail server will be rejected outright; the intended recipients have no way of ever accessing or seeing those messages. This is not necessarily a problem, but the potential for abusive filtering needs to be kept in check. Both email providers and blacklist maintainers should be as transparent and public in their practices as possible and should give reasonable recourse to parties who have been wrongly marked as spammers.

China blocks NY Times after Wen Jiabao article

Herdict reports confirm that as of Sunday, October 28, China is continuing to block access to both the English- and Chinese-language New York Times websites. On Sina Weibo, China's version of Twitter, search terms related to the family or wealth of Wen Jiabao are also blocked, according to the China Digital Times, a US-based Chinese news website that is itself blocked in China. Although China has previously blocked individual stories, this is the first time that China has fully blocked access to the New York Times.

On the afternoon of Thursday, October 25, the New York Times broke a story on the multibillion-dollar wealth of the family of Wen Jiabao, the prime minister of China. According to the Times, much of this wealth accumulated after Wen’s political rise to power. The story was originally posted on the English website at 4:34 pm, Eastern time. By 7 pm Eastern time, access to both the English- and Chinese-language Times sites had been blocked in China, before the Chinese-translated version of the story had even been posted to the Chinese-language site. On social media sites, references to the story were quickly deleted.

The extent to which China’s swift response has actually been effective is unclear. Chinese users with VPN access can still reach the New York Times websites, and posts from Weibo users indicate that quite a few users read and responded to the content of the Times story before it was blocked.

Let Herdict know whether nytimes.com or cn.nytimes.com is blocked in your area.

Little “i” Internet: The “Innocence of Muslims” Video’s Impact on Free Speech

Over the past couple of years, Iran has not-so-secretly been building the foundations for an internal, Iran-only version of the Internet, separated from the rest of the world. Two weeks ago it took that system out for a spin, shutting off access to Google properties such as Gmail and YouTube. When many Iranians voiced their displeasure at losing access to their e-mail, the government quickly reversed course and restored access to Gmail. This was hailed by some very smart people, such as EFF's Jillian York, as evidence that cutting off access to the global Internet is doomed to failure. But I think we may be celebrating prematurely.

If you believe that the Internet has become "too big to block" (or contains too many cute kittens) without risking open revolt, then it naturally follows that companies faced with the threat of blockage should call the bully's bluff; the aims of free speech could never be furthered by removing a video, tweet, or blog. Thus, it should come as no surprise that Jillian and others at EFF have taken Google to task for its decision to voluntarily remove the controversial "Innocence of Muslims" video in Libya and Egypt. The video did not violate Google's terms of service, and Google did not receive any formal government requests; taking down the video in the absence of those prerequisites was a departure from past practice. That departure led Eva Galperin of EFF to accuse Google of "turn[ing] its back" on free speech, and Jillian expressed concern that Google has placed "itself in the role of arbiter" and is unilaterally deciding what people in different countries should or should not see. This criticism is justified if a countrywide blockade is indeed doomed to fail.

But what if it is not a bluff and a country could successfully block Google or totally separate itself from the global Internet?  In that case, Google’s decision seems more justifiable.  Forestalling a total block preserves access to 99.99% of YouTube and all of the speech therein.  Should Google have waited for a formal takedown request while Libyan citizens are still waiting for a government? Egypt and Libya are two countries with new regimes and old histories of filtering.  By removing the video, Google ensured that Libyan and Egyptian citizens could continue to benefit from all the non-censored speech that remains on YouTube.

Iran was ham-handed in what it did with Google, and it paid a political price for it. But we must be careful not to conflate a failure of execution with a failure of policy. Given a defter political touch, it is less clear that highly separated national networks—little "i" internets—are doomed to failure.

What Google did in pulling down the video was not as radical as it might seem.  Is Google acting as the arbiter of content?  Yes, but it also acts as the arbiter of content when it decides whether videos contain gratuitous violence (and therefore violate the terms of service).  And it exercises its discretion with every single search you make.  Every time you search for something on Google, the results you get back represent Google’s editorial discretion as to what constitutes relevant results.  In fact, Google has claimed these decisions are so subjective that they represent the company’s constitutionally protected speech.  Recently Facebook removed the New Yorker’s entire Facebook page after it posted a cartoon featuring an image of a topless Eve.  Thus, it is nothing new for these corporations, who are in many ways the very heart of the Internet, to exercise their discretion in determining the content that they present to their users.

Similarly, the idea of an Internet that provides a fundamentally different experience depending upon the geographic location of a visitor is not as radical as it might seem.  In 2000, the French High Court in LICRA v. Yahoo! ruled that Yahoo’s auction site must not display Nazi memorabilia to visitors from France, forcing the company to show different sites to different people.  Professor Larry Lessig was writing about the Balkanization of the Internet in 2004, noting that the global Internet was quickly becoming “a collection of nation-state networks” that are for all intents and purposes separate.  Similarly, in 2008, Doc Searls described the “Splinternet,” where entities such as Apple and NPR use geographic boundaries to limit the scope of their services.  As Professors Jack Goldsmith and Timothy Wu said in 2006: “Geography turns out to be one of the most important ways to organize information on this medium that was supposed to destroy geography.”  Thus, where you live has long determined what you can see.

These trends continue today. Google's transparency report shows that from July to December 2011, Google received over 1,000 government requests to remove content from their respective jurisdictions. And that does not include the material that Google chooses to remove of its own volition, usually for violations of the company's terms of service. In contrast, Twitter has thus far resisted the march toward Balkanization. Although Twitter created a system that can remove Tweets from a given locale (and only that locale) following a government request, as of June 30, 2012, it had not actually used this tool. That said, the fact that Twitter built the tool suggests that it too will eventually move toward Balkanization. Today, more often than not, political boundaries and their concomitant legal restrictions determine how the Internet is experienced in a given locale.

That said, even when these companies act as arbiters of content, their fundamental offerings remain relatively constant.  For example, Google has made the editorial decision to remove The Pirate Bay from its autocomplete listings, but it still appears in search results.  And even if Google entirely removed The Pirate Bay from its search results, Google would still function very much in the same manner as it does today, running billions of searches per day.  True, the results would be a little less useful and representative of the total web, but on the whole the experience would be largely the same.

If by removing the "Innocence of Muslims" video, Google prevented an outright block of YouTube or other Google services, it may have preserved more speech than it surrendered. Censorship that goes video-by-video, link-by-link, or Tweet-by-Tweet is ultimately a losing game for the censors. Every minute, 72 hours of video are uploaded to YouTube. Twitter handles more than 400 million Tweets per day. For every "Innocence of Muslims" movie that is removed, how many offensive videos remain? Removing every offensive Tweet or video is simply impossible. Free speech is not in short supply, and with each passing day the job of the censors becomes ever more challenging.

It is easy to believe the Internet is a single cohesive experience. Its very name suggests as much: written with a capital "I," the proper noun implies that there is only one Internet. IBM, in its tutorial on Internet protocols, states that "the words internetwork and internet are simply a contraction of the phrase interconnected network. However, when written with a capital 'I', the Internet refers to the worldwide set of interconnected networks. Therefore, the Internet is an internet, but the reverse does not apply." In other words, the Internet is the one network that connects everyone.

But the Big I Internet is not a given.  Due to Google’s actions, Libya and Egypt lost access to a single offensive video.  In contrast, Afghanistan and Pakistan chose to block YouTube entirely, depriving their citizens of not only that video but every other video on the platform.  Such a shift moves us ever further from an Internet, toward a series of similar but independent little i internets.

As I noted above, perhaps the clearest example of this is Iran’s attempt to construct a self-sufficient internal internet.  Recent studies have uncovered a shadow network operating in parallel to the existing Internet but accessible only within Iran.  The recent Google debacle in Iran demonstrated that simply turning off access to a popular site can be politically perilous; that, however, was a relatively clumsy approach.  Instead, Iran could slowly move toward supporting two separate and very unequal networks: one that is fast, cheap, and dependable but wholly internal to Iran, and another that is slow, expensive, and unreliable but connected to the outside world.  Under such a scenario, many Iranians may voluntarily place themselves into a little i internet.

This is not unlike what China already does. China uses its so-called Great Firewall to block Twitter, YouTube, and Facebook. In their stead, it has encouraged the development of tightly regulated national equivalents, which have become extraordinarily popular. By some counts, China's Twitter-like service, Sina Weibo, has 300 million users, twice the number of people using Twitter.

Such little "i" networks may only become more common as the enabling technology becomes ever cheaper. Should that happen, then we are no longer talking about a handful of videos or Tweets being removed amid the deluge of other content. Instead, access to entire swaths of the Internet would become difficult, if not impossible, from many parts of the world. If Google forestalled that future by removing access to a single video, then it is perhaps not the unequivocal handicap to free speech that some suggest.

Perhaps I am like Neville Chamberlain, ceding the Czechoslovakia of free speech in the name of appeasement.  But I believe that as long as content removals are limited and transparently documented through sites such as Google’s and Twitter’s transparency reports, Chilling Effects, and Herdict, free speech is supported by keeping some of the Internet’s most important sites generally accessible.
