The Problems With Stemming: A Practical Example

This post provides an overview of stemming and presents a real-world case in which it led to undesirable behavior.

Stemming is a common technique in natural language processing and information retrieval. The idea is that different forms of a word refer to the same concept, so when a user searches for a word, the system should return documents containing all forms of that word. For example, a user who searches for ‘running’ probably also wants to see documents containing ‘run’, ‘runs’, and ‘runner’. Stemming enables this by converting each form of a word to a common base or ‘stem’: ‘run’, ‘runs’, ‘runner’, and ‘running’ would all be converted to ‘run’. Usually this is done with algorithmic techniques that strip word suffixes rather than with dictionary lookups. Algorithmic stemmers perform almost as well as dictionary lookups, are simpler to implement, and can handle words that aren’t in any dictionary, e.g. ‘iPhone’.

Stemming is intuitively appealing. But is it actually helpful in practice? The textbook example of stemming causing problems is that ‘business’ and ‘busy’ map to the same stem but represent different concepts. However, this example feels artificial.

Here’s a real example. I recently searched for “Withings” on Slickdeals.net, one of the top sites for finding deals and coupons. As you can see in the screenshot below, Slickdeals returned results containing the word ‘with’:

Screenshot of Slickdeals.net search results for ‘Withings’

What’s happening here? It’s likely that Slickdeals has a stemmer that converts ‘withings’ to ‘with’ using the programmatic rule of removing the ‘ings’ suffix. (Usually this would be the correct behavior; consider ‘clip’, ‘clipping’, and ‘clippings’.) Thus instead of returning results that contain ‘withings’, it returns results that contain ‘with’.
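
Slickdeals hasn’t published its stemmer, but a toy suffix-stripping rule in Perl is enough to show how the desirable and the undesirable behavior fall out of the same rule. The strip_ings function and the doubled-consonant cleanup below are my own illustration, not any particular search engine’s code:

    use strict;
    use warnings;

    sub strip_ings {
        my ($word) = @_;
        $word = lc $word;
        $word =~ s/ings$//;      # remove the 'ings' suffix
        $word =~ s/(.)\1$/$1/;   # collapse a doubled final consonant (clipp -> clip)
        return $word;
    }

    print strip_ings('clippings'), "\n";   # 'clip' -- exactly what we want
    print strip_ings('Withings'),  "\n";   # 'with' -- now matches every document containing 'with'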

Often, sites can mitigate these types of problems by changing the order in which results are presented so that documents matching the exact search term appear before those matching only the stem. For example, most users only look at the first few pages of Google results, so false positives don’t matter if they rank too low for users to actually see them. (Ranking search results is a complex science. Google became successful largely because it was better at determining which matches were most relevant, not because it delivered more total matches.) However, Slickdeals needs to sort results by time to meet the needs of its users, because deals expire quickly. Knowing that there was a brief sale on an item 3 years ago isn’t particularly useful if you want to buy one now.
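
For a site that isn’t locked into Slickdeals’ strict time ordering, that mitigation might look roughly like the sketch below. The result fields and function names are hypothetical; this reflects the general idea, not how Slickdeals or Google actually rank anything:

    use strict;
    use warnings;

    # Exact-term matches first, then newest first within each group.
    sub rank_results {
        my ( $query, @results ) = @_;
        return sort {
            has_exact_match( $b, $query ) <=> has_exact_match( $a, $query )
                || $b->{posted_epoch} <=> $a->{posted_epoch}
        } @results;
    }

    sub has_exact_match {
        my ( $result, $query ) = @_;
        return $result->{title} =~ /\b\Q$query\E\b/i ? 1 : 0;
    }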

Stemming can be a useful tool but it’s important to understand its drawbacks. While there are certainly use cases in which the benefits outweigh the drawbacks, stemming should not be blindly adopted.

Speaking about Perlbrew and Carton Tuesday 9 October 2012

I’m excited to be speaking at the Boston Perl Mongers meeting this Tuesday, 9 October 2012. I’ll post slides and a summary next week, but I thought that I’d post an abstract here as a teaser. If you’re in Boston and are interested in attending, information on the Boston Perl Mongers meetings is available here.

Abstract:

Managing Complexity With Perlbrew and Carton

Deploying Perl programs to multiple systems can be challenging.  Even when they run the same operating system version, different systems often contain different versions of the same CPAN modules. Different module versions are often incompatible in subtle ways that may not be detected immediately. In the worst case, software works fine on the development machines but malfunctions in production.

Things are even more complicated when deploying to different operating systems or different versions of the same operating system. Different distribution versions may contain different versions of the system Perl, and on many systems the distributed Perl version lags the latest release. Writing for the oldest Perl version present on any deployment system means missing out on newer features, which reduces programmer productivity and makes code less readable and reliable. Perl has excellent backwards compatibility, but some differences still exist. Ubuntu LTS versions are often still in use after the Perl they distribute is no longer supported by the Perl community. Because Perl is deeply integrated into Debian and Ubuntu systems, vendor patches are usually limited to security issues, and manually upgrading the system Perl is difficult and risky. Thus users of older distributions may encounter program crashes due to internal bugs in a system Perl that they can neither patch nor upgrade.

This talk will show how Perlbrew and Carton can be used to address these problems. I will discuss how the Media Cloud project used Perlbrew and Carton to decouple the system Perl from the application’s Perl and achieve a consistent environment across different machines running different operating systems.
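
As a teaser, here is roughly what the Carton side of that setup looks like. The cpanfile below is a minimal sketch with illustrative module names and versions, not Media Cloud’s actual dependency list:

    # cpanfile -- declares the application's CPAN dependencies
    requires 'Moose',        '2.0402';
    requires 'DBIx::Simple', '1.35';
    requires 'LWP::UserAgent';

    # With an application-owned perl installed via perlbrew
    # (perlbrew install perl-5.16.0; perlbrew switch perl-5.16.0),
    # `carton install` builds these modules into ./local and records
    # the exact versions it installed, and `carton exec -- perl myapp.pl`
    # runs the program against them, independently of whatever the
    # system perl provides.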

Uploading to the Cloud by Email

This post is about a problem and a solution.

I’m trying to go digital and replace my paper documents with digital ones. I have access to a fast scanner with an auto-document feeder, but unfortunately essentially the only way that I can get documents off the scanner is by using scan-to-email. For a while, I had documents sent to my Gmail account, but that was suboptimal for a number of reasons. Firstly, it cluttered my inbox; disk space was less of a concern than the clutter in general. Secondly, finding files was difficult since the scanner generated emails with identical subjects, no body text, and PDF attachments with auto-generated names. Finally, getting files out of Gmail can be a hassle, especially when they are spread out over many different messages.

Thankfully I found a solution that enables me to have the files stored directly in a cloud storage drive.

I signed up for a Dropbox account and then used the third-party service sendtodropbox.com to facilitate email uploads. sendtodropbox.com is a great little service: it gives you a special email address and then uploads any attachments sent to that address to your Dropbox. It also has a number of special features, such as allowing the folder location to be determined automatically based on sender or date.

Now I simply have the scanner scan documents and send them to this address, and voilà, they’re conveniently stored in Dropbox.

Alternatives:

SugarSync and Box.net have email uploading built in, so there is no need to use a third-party service. However, I decided to go with Dropbox despite this. SugarSync does not have a Linux client, and I knew that would end up driving me crazy since Ubuntu is my primary OS. The free version of Box.net seemed overly limited and the paid versions were expensive. Finally, I just wanted to try Dropbox, since I’ve known many people who use it and I wanted to see what all the fuss was about.

Credits:
Finding this solution took a lot of Googling, but the following articles were helpful.

http://www.makeuseof.com/tag/5-ways-send-files-dropbox-dropbox/
http://www.codingbeaver.com/myblog/index.php/2011/12/09/sugarsync-vs-dropbox-vs-spideroak-which-should-i-choose/
http://gizmodo.com/5828035/the-best-way-to-store-stuff-in-the-cloud

 

The State of Perl

The last decade of Perl could be summarized as follows. Around 2000 Perl was essentially the go-to scripting language for system administration and dynamic web sites. The design process for Perl 6 was announced in June of that year by Perl creator Larry Wall, who hoped to remove “historical warts” from the language. It was expected that Perl 6 design and development would proceed quickly and that within a few years Perl 6 would displace Perl 5 the way Perl 5 had displaced Perl 4. As a result, development of Perl 5 stagnated while the Perl community waited for a Perl 6 that never really materialized. Perl 5.8 was released in 2002 and the 5.8.x branch remained the current version for five and a half years, until Perl 5.10 was finally released in December of 2007. The Perl community eventually began to view Perl 6 as a parallel project rather than a replacement for Perl 5, but the damage had already been done. Perl had lost much of its mindshare and was increasingly being displaced by languages such as PHP, Python, and Ruby.

The Perl community has since returned to improving Perl 5 with renewed vigor. The community aims to release a new point release of Perl 5 every 6 months, and numerous modules have been created to improve the Perl 5 experience. It’s hard to determine what the mindshare of Perl will be a few years from now.

Perl now faces both technical and marketing challenges. In many ways, it is a victim of its own success. Because Perl has been around so long, numerous outdated tutorials are still floating around the web. When Googling, it can be difficult to figure out which pages describe the best way to do something in Perl now and which describe what was the best way 10 years ago. Much of the core tooling of Debian GNU/Linux and derivative distributions such as Ubuntu is written in Perl. While this increases Perl’s mindshare, it also means that the system Perl cannot be easily upgraded. Unlike updating the Java version, changing the Perl version has widespread implications that require extensive testing. As a result, Ubuntu releases tend to include the Perl version that was current 6 months before the actual release rather than the latest version.

More generally, Perl suffers from outdated perceptions. Yes, it was once possible and common for people to write unreadable Perl programs that resembled line noise. (The old joke was that “Perl is like a toothbrush — you should use it but you don’t want to use someone else’s”.) But this is no longer the case. Simply adding ‘use strict’ prevents a lot of bad code, and good style practices prevent most of the rest. It’s certainly true that Perl allows people to do things that might not be a good idea, such as using arbitrary regular expression delimiters, but in real programs this is the exception rather than the rule.
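
For instance, here is a contrived two-liner that shows exactly what ‘use strict’ buys you: without it, a misspelled variable silently becomes a new global; with it, the typo is caught at compile time.

    use strict;
    use warnings;

    my $total_price = 42;
    print "Total: $total_prise\n";
    # Without 'use strict' the typo slips through and an empty value is printed at runtime.
    # With it, perl refuses to compile:
    #   Global symbol "$total_prise" requires explicit package name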

This has been a short discussion of the state of Perl and the issues facing it. I write this as someone who is knowledgeable about the language but isn’t a Perl partisan. I’ve known Perl for over a decade and used it as a primary development language for the last three years. However, I’ve also used other languages and know that every language has its strengths and weaknesses. In subsequent blog posts, I plan to further explore the state of Perl and the efforts of the community to improve it.

iPhones and the Limits of Curated Security

Abstract: The Apple App Store uses a curated model in which all applications must be approved by Apple before they are published. It has been argued that this provides better security than Android, where applications are published immediately and only removed if they are found to be problematic. We study the case of Smurfs’ Village and similar applications and argue that the protection offered by this form of security is limited. We conclude that the image of Apple as a benevolent dictator acting as a neutral arbitrator to promote the best user experience is dangerously naive. Apple’s economic interests will often differ from its users’ security interests.

In the Apple App Store for iOS devices such as the iPhone and iPad, all applications must be approved by Apple before they are published. This is in contrast to Android, where submitted applications are published immediately and only removed if they are determined to be problematic. The Apple approval process has been criticized as arbitrary and confusing, and many accuse Apple of exercising excessive control. The App Store Review Guidelines run to 8 pages, and the introduction contains statements like the following:

  • We view Apps different than books or songs, which we do not curate. If you want to criticize a religion, write a book. If you want to describe sex, write a book or a song, or create a medical app.
  • We have over 350,000 apps in the App Store. We don’t need any more Fart apps.
  • [If] you’re trying to get your first practice App into the store to impress your friends, please brace yourself for rejection. We have lots of serious developers who don’t want their quality Apps to be surrounded by amateur hour.
  • If your app is rejected, we have a Review Board that you can appeal to. If you run to the press and trash us, it never helps.
  • If it sounds like we’re control freaks, well, maybe it’s because we’re so committed to our users and making sure they have a quality experience with our products.

The rest of the Guidelines contains specific rules about what is and is not allowed. It is generally acknowledged that not all submissions are reviewed equally. For example, Apple bans offensive or mean-spirited commentary but specifically exempts professional political satirists and humorists. Additionally, well-known companies and organizations are thought to have an easier time getting their apps approved than individual developers.

Though many have criticized those restrictions as excessive, it has been argued that they provide protection against malware. Indeed, Jonathan Zittrain, among others, has worried that the iPhone will be adopted because it provides increased security, even though this security comes at the cost of the freedom and flexibility that has led to significant innovation in the computer industry. But just how effectively will the Apple review process protect users? Software auditing is time-consuming and expensive. There are a variety of technical measures that would allow malicious developers to hide unapproved functionality from Apple’s reviewers. I may discuss these techniques in another, geekier, post.

In this post, I use the existence of Smurfs’ Village and similar apps to argue that a curated app store is not a security panacea. These types of apps, many of which are targeted at children, are free to download but sell very expensive in-app items. For example, Smurfs’ Village, which is based on the Smurfs TV show, sold a wheelbarrow of Smurf Berries for $99. Many children amassed huge bills playing these applications on their parents’ phones. One 8-year-old girl ran up a $1400 bill playing Smurfs’ Village. (See http://www.digitaltrends.com/mobile/publisher-greed-little-girl-amasses-1400-iphone-bill-playing-smurfs-village/ ) There have been allegations of similar spending on other games, such as a 5-year-old girl spending $99.99 in Fishies, a 9-year-old boy spending $670 on virtual weapons for X-Mas Resort and F.A.S.T, and a 9-year-old girl spending $200 on Zombie Toxins and other items in Zombie Cafe. (http://newsandinsight.thomsonreuters.com/New_York/News/2011/05_-_May/Class_action_firms_go_after__free__Apple_apps/ ) Whether these types of applications should count as malware or badware is a question of semantics that we will not attempt to address. However, it is worth noting that the financial harm suffered by the parents of these children is likely greater than that suffered by those whose PCs or Android phones are infected with traditional malware.

Why does Apple allow these applications in its App Store? An app like Smurfs’ Village might be justifiable based on laissez-faire arguments in an open market, but that argument makes no sense in the context of Apple’s curated market. It is impossible to know Apple’s exact motivation, but economic incentives may provide the answer. Apple requires that all game items be purchased using its In-App Purchase payment system and takes a 30% cut of every transaction. Thus Apple has an incentive for users to spend as much money as possible. Indeed, Apple’s guidelines contained no restriction on excessive pricing of in-game items even though they contained restrictions on just about everything else. (Apple’s 30% commission is much greater than the amount taken by credit card companies, and consequently Apple has a much greater incentive to allow expensive purchases.)

Apple has since taken some steps to mitigate the problem. It now requires a password to be entered before every purchase (previously the password would be cached for 15 minutes) and allows in-app purchases to be disabled in device settings. Whether or not these mechanisms are sufficient to protect parents from this specific type of application, in-app purchases remain a rich target for exploitation. In the future, there will likely be new and different attempts to exploit them that are both creative and disturbing. In the media, there is a popular stereotype that undesirable software originates from super hackers who are either antisocial Western males or Russian gangs. The danger of this myth is that it causes us to be overly trusting of corporate software. The extent to which Apple uses developer reputation in the approval process may make it more likely to approve unsavory software that has the veneer of corporate legitimacy.

It is clear that Apple’s review process does not prevent users from being harmed by unsavory applications. Indeed, the image of Apple as a benevolent dictator acting as a neutral arbitrator to promote the best user experience is dangerously naive. Apple’s economic interests will often differ from its users’ security interests.

E-books in Translation: The Possibility of Dynamic Rendering


Abstract: E-books could fundamentally transform translated works. We argue for a paradigm shift in the role of the translator. Instead of taking a foreign-language work and producing a single rendering of it in the reader’s language, the translator would annotate the original work in order to provide data that could be used by different readers to electronically render different versions of the work according to their individual criteria. We focus mainly on the Bible and discuss cases in which readers may disagree with the translator’s decisions and wish for alternate renderings. We also discuss other works in which reader-controlled rendering might be desirable, such as anime.

Currently, translators take a text in one language and create a rendering of it in a language that’s accessible to their readers. There will never be a perfect mapping from one language to another, so a text can be rendered in multiple ways. For example, a term may be translatable in several ways, each with different connotations and with arguments in its favor. In the print world, a book can only contain a single rendering, so a single choice must be made; the translator makes that choice and the reader is left with it. Compromises will also need to be reached between literary flow and literal faithfulness. The translator makes this choice too, and again the reader is stuck with it. Translators may try to mitigate this limitation by offering rationale for their decisions in introductions, including footnotes with alternate translations, or providing the original-language term in parentheses, but a printed translation will always be very much a single rendering.

In the world of e-books, things could be different. Instead of taking a foreign-language work and producing a single rendering of it in the reader’s language, the translator could annotate the original work in order to provide data that could be used by different readers to render different versions of the work according to their individual criteria. Readers would be able to have their own bespoke translations.

In this post, I use the Bible as the primary example of a translated work. There are both technical and non-technical reasons to use the Bible as our example. Firstly, if its various translations and editions are treated as a single book, it is the best-selling book of all time. Secondly, the Bible is read and studied at a deeper level than most other books, so translation decisions that would seem trivial in other cases are actively debated. The Bible’s interpretation has been filled with political and religious controversy, and whether they intend to or not, its translators are arguably making a political and religious statement. Indeed, the first English translations of the Bible altered the existing power structure in such a way that their mere existence was controversial, and they were initially violently suppressed. Finally, there are also some interesting technical aspects to Bible translation: there are multiple source documents whose languages are no longer spoken, there is no living author to consult, and the source material is out of copyright.

An example of a biblical translation decision is the rendering of the Hebrew יהוה. The Oxford Annotated Bible translates it as “The Lord”. In college, my religion professor disagreed with this translation and instead used the term “Yahweh”, going so far as to say “Yahweh” when reading aloud even though the translation said “The Lord” on the printed page. Others have used “Jehovah”. Since a printed page can only contain a single term, he and his students were stuck with the printed rendering; reading an alternate term aloud was his only, and somewhat quixotic, form of protest. Imagine instead that readers were free to choose whatever rendering they desire, whether one of the three terms above or something else of their own devising. If a book is electronic, there’s no reason an alternate term couldn’t be selected. (See http://en.wikipedia.org/wiki/Yahweh for more discussion of the translation of יהוה.)
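
A minimal sketch of what such an annotation might look like, using a toy data model invented for this post rather than any real e-book or translation format:

    use strict;
    use warnings;

    # One annotated token: the source term plus the renderings the translator
    # has prepared, with one marked as the default.
    my %token = (
        source   => 'YHWH',
        default  => 'The Lord',
        variants => {
            yahweh  => 'Yahweh',
            jehovah => 'Jehovah',
        },
    );

    # The reader (or a preference pack she has installed) picks the rendering.
    sub render {
        my ( $token, $preference ) = @_;
        return $token->{variants}{ $preference // '' } // $token->{default};
    }

    print render( \%token, 'yahweh' ), "\n";   # prints "Yahweh"
    print render( \%token, undef    ), "\n";   # prints "The Lord"

The same mechanism extends naturally to the stylistic choices discussed below, such as modern vs. archaic pronouns.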

Other examples might include aesthetic and stylistic decisions, such as modern vs. archaic pronouns (‘you’ vs. ‘thee’ and ‘thou’, ‘your’ vs. ‘thine’), or the use of the phrase “become the father of” vs. “begot”. The former uses modern conventional language while the latter is more concise and easier to read when used many times in the same paragraph. The decision to use archaic language (or not) is effectively a political and religious statement: archaic language is a stylistic attempt to give the Bible the status of an ancient and venerated text.

An interesting challenge of Bible translation is that there is no single authoritative original text. Different fragments come from different surviving documents, and translators must decide which source to use as the basis for the translation of a given passage when multiple sources contain it. However, if the translation is rendered electronically, readers could choose how different documents are weighted. For example, they might decide that the Dead Sea Scrolls should be used wherever possible, or they might think the Dead Sea Scrolls should be avoided.
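
The same could be done for source preferences. A crude sketch, using a simple priority list rather than true weighting; the source names and reading text are placeholders:

    use strict;
    use warnings;
    use List::Util qw(first);

    # Hypothetical readings of a single passage, keyed by source document.
    my %readings = (
        masoretic        => 'reading as preserved in the Masoretic Text',
        septuagint       => 'reading as back-translated from the Septuagint',
        dead_sea_scrolls => 'reading as preserved in the Dead Sea Scrolls',
    );

    # The reader ranks sources; the first one that attests the passage wins.
    my @source_priority = qw( dead_sea_scrolls masoretic septuagint );

    my $source = first { exists $readings{$_} } @source_priority;
    print "[$source] $readings{$source}\n";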

Dynamic Translation for Other Works

Differing translations of a term are a common problem in many religious and philosophical texts. The Sanskrit Śūnyatā, though now usually translated as ‘emptiness’, was once translated as ‘void’, a subtle but important difference that is thought to have led to confusion.

One decision that translators make is whether to use original-language terms instead of less exact translations. Scholars and serious connoisseurs are likely to prefer original-language terms, while the casual reader is likely to prefer translations. In the anime community, otaku (https://en.wikipedia.org/wiki/Otaku) often prefer fansubs to studio translations because they retain more Japanisms. (A pet peeve of mine is movies that insist on converting the local currency into US dollars.)

Why not just have more print translations?

There are a few cases in which multiple print translations of a work exist. Though readers cannot create a bespoke translation, they can at least choose among the different static translations available. However, translating a work requires significant time and expertise, so the number of different translations will always be limited. Additionally, with a copyrighted work, there may also be rights issues that limit the available translations. It is unlikely that a reader will find precisely the translation she’s looking for.

An objection is that readers do not want to make translation decisions. Indeed, many readers will happily accept a default rendering just as many users accept the default software settings. However, a set of translation decisions could be packaged by experts and shared with others. To make an analogy to web browsers, few users write their own add-ons but many use add-ons written by others.

Conclusion

E-books have the potential to change the nature of translations. The key is to effect a paradigm shift that moves power from the translator to the reader. Instead of being someone who produces a single rendering, the translator would annotate the original work in such a way as to provide data. This data could then be used by the reader to create the precise rendering she desires.