» computational linguistics The Occasional Pamphlet

Moderating principles

July 25th, 2022

Some time around April 1994, I founded the Computation and Language E-Print Archive, the first preprint repository for a subfield of computer science. It was hosted on Paul Ginsparg’s arXiv platform, which at the time had been hosting only physics papers, built out from the original arXiv repository for high-energy physics theory, hep-th. The repository, cmp-lg (as it was then called), was superseded in 1999 by an open-access preprint repository for all of computer science, the Computing Research Repository (CoRR), which covered a broad range of subject areas, including computation and language. The CoRR organizing committee also decided to host CoRR on arXiv. I switched over to moderating for the CoRR repository from cmp-lg, and have continued to do so for the last – oh my god – 23 years.[1]

Articles in the arXiv are classified with a single primary subject class, and may have other subject classes as secondary. The switchover folded cmp-lg into the arXiv as articles tagged with the cs.CL (computation and language) subject class. I thus became the moderator for cs.CL.

A preprint repository like the arXiv is not a journal. There is no peer review applied to articles. There is essentially no quality control. That is not the role of a preprint repository. The role of a preprint repository is open distribution, not vetting. Nonetheless, some kind of control is needed in making sure that, at the very least, the documents being submitted are in fact scholarly articles and are appropriately tagged as to subfield, and that need has expanded with the dramatic increase in submissions to CoRR over the years. The primary duty of a moderator is to perform this vetting and triage: verifying that a submission meets the minimum standards for being characterized as a scholarly article, and that it falls within the purview of, say, cs.CL, as a primary or secondary subject class.

I am (along with the other arXiv moderators) thus regularly in the position of having to make decisions as to whether a document is a scholarly article or not. To a large extent, Justice Potter Stewart’s approach works reasonably well for scholarly articles: you know them when you see them. But over time, as more marginal cases come up, I’ve felt that tracking my thinking on the matter would be useful for maintaining consistency in my own practice. And now that I’ve done that for a while, I thought it might be useful to share my approach more broadly. That is the goal of this post.

The following thus constitutes (some of) the de facto policies that I use in making decisions as the moderator for the cs.CL collection in the CoRR part of the arXiv repository. I emphasize that these are my policies, not those of CoRR or the moderators of other CoRR subjects. (The arXiv folks themselves provide a more general guide for arXiv moderators.)

Posted in computational linguistics, computer science, language, open access, scholarly communication, writing

For Ada Lovelace Day 2012: Karen Spärck Jones

October 16th, 2012

Karen Spärck Jones, 1935-2007

In honor of Ada Lovelace Day 2012, I write about the only female winner of the Lovelace Medal awarded by the British Computer Society for “individuals who have made an outstanding contribution to the understanding or advancement of Computing”. Karen Spärck Jones was the 2007 winner of the medal, awarded shortly before her death. She also happened to be a leader in my own field of computational linguistics, a past president of the Association for Computational Linguistics. Because we shared a research field, I had the honor of knowing Karen and the pleasure of meeting her on many occasions at ACL meetings.

One of her most notable contributions to the field of information retrieval was the idea of inverse document frequency. Well before search engines were a “thing”, Karen was among the leaders in figuring out how such systems should work. Already in the 1960s there had arisen the idea of keyword searching within sets of documents, and the notion that the more “hits” a document receives, the higher ranked it should be. Karen noted in her seminal 1972 paper “A statistical interpretation of term specificity and its application in retrieval” that not all hits should be weighted equally. For terms that are broadly distributed throughout the corpus, their occurrence in a particular document is less telling than occurrence of terms that occur in few documents. She proposed weighting each term by its “inverse document frequency” (IDF), which she defined as log(N/(n + 1)) where N is the number of documents and n the number of documents containing the keyword under consideration. When the keyword occurs in all documents, n = N and IDF is log(N/(N + 1)), which approaches 0 for large N, but as the keyword occurs in fewer and fewer documents (making it a more specific and presumably more important keyword), IDF rises. The two notions of weighting (frequency of occurrence of the keyword together with its specificity as measured by inverse document frequency) are combined multiplicatively in the by now standard tf*idf metric; tf*idf or its successors underlie essentially all information retrieval systems in use today.
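To make the weighting concrete, here is a minimal sketch in Python of the IDF formula as stated above, log(N/(n + 1)), together with a tf*idf score. The toy corpus, the function names, the use of the natural logarithm, and the use of a raw term count for tf are all assumptions made for illustration; none of them comes from the 1972 paper.

```python
import math
from collections import Counter


def idf(term, documents):
    """IDF as given in the text: log(N / (n + 1)).

    N is the number of documents, n the number containing `term`.
    The natural log is an assumption; the text does not fix a base.
    """
    N = len(documents)
    n = sum(1 for doc in documents if term in doc)
    return math.log(N / (n + 1))


def tf_idf(term, doc, documents):
    """Raw count of `term` in `doc` (an assumed choice of tf) times its IDF."""
    tf = Counter(doc)[term]
    return tf * idf(term, documents)


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the retrieval of the documents".split(),
    "the index".split(),
]

# Print the IDF of a common, a middling, and a rare term.
for term in ("the", "cat", "retrieval"):
    print(term, idf(term, corpus))
```

In this corpus "the" occurs in every document, "cat" in two, and "retrieval" in one, so their IDF values come out in increasing order, which is the behavior the paragraph describes. With only four documents the IDF of "the" is slightly negative rather than near zero; the approach to 0 only shows up as N grows.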

In Karen’s interview for the Lovelace Medal, she opined that “Computing is too important to be left to men.” Ada Lovelace would have agreed.

Posted in computational linguistics, computer science, language

Switching to open access for the new year

January 4th, 2012

“…time to switch…”
A very old light switch (2008) by RayBanBro66 via flickr. Used by permission (CC by-nc-nd)

The journal Research in Learning Technology has switched its approach from closed to open access as of New Year’s 2012. Congratulations to the Association for Learning Technology (ALT) and its Central Executive Committee for this farsighted move.

This isn’t the first journal to make the switch. The Open Access Directory lists about 130 of them. In my own research field, the Association for Computational Linguistics (ACL) converted its flagship journal Computational Linguistics to OA as of 2009, and has just announced a new open-access journal Transactions of the Association for Computational Linguistics. Each such transition is a reminder of the trajectory that journal publishing ought to follow.

The ALT has done lots of things right in this change. They’ve chosen the ideal licensing regime for papers, the Creative Commons Attribution (CC-BY) license. They’ve jettisoned one of the largest commercial subscription journal publishers, and gone with a small but dedicated professional open-access publisher, Co-Action Publishing. They’ve opened access to the journal retrospectively, so that the entire archive, back to 1993, is available from the publisher’s web site.

Here’s hoping that other scholarly societies are inspired by the examples of the ALT and ACL, and join the many hundreds of scholarly societies that publish their journals open access. It’s time to switch.

Posted in computational linguistics, open access, scholarly communication

C’est le bouquet, or why translation is hard

July 17th, 2011

Grégoire Bouillier

I used to use as my standard example of why translation is hard — and why fully automatic high-quality translation (FAHQT) is unlikely in our lifetimes however old we are — the translation of the first word of the first sentence of the first book of Proust’s Remembrance of Things Past. The example isn’t mine. Brown et al. cite a 1988 New York Times article about the then-new translation by Richard Howard. Howard chose to translate the first word of the work, longtemps, as time and again (rather than, for example, the phrase for a long time as in the standard Moncrieff translation) so that the first word time would resonate with the temporal aspect of the last word of the last volume, temps, some 3000 pages later. How’s that for context?

I now have a new example, from the Lorin Stein translation of Grégoire Bouillier’s The Mystery Guest. Stein adds a translator’s note to the front matter: “For reasons the reader will understand, I have refrained from translating the expression ‘C’est le bouquet.’ It means, more or less, ‘That takes the cake.’” That phrase occurs on page 14 in the edition I’m reading.

The fascinating thing is that the reader does understand, fully and completely, why the translator chose this route. But the reason is, more or less, a sentence that occurs on page 83, a sentence that shares no words with the idiom in question. True, the protagonist perseverates on this latter sentence for the rest of the novella, but still, I challenge anyone to explain the reasoning behind this crucial decision, perfectly clear to any reader, in less than totally abstract terms, as far from the words actually used as you can imagine.

Language is ineffable.

Posted in computational linguistics, language
