
I’m generally a big fan of peer review. I think it plays an important role in the improvement and “chromatography” of the scholarly literature. But sometimes. Sometimes.

The Boyer-Moore MJRTY algorithm allows efficient determination of which shape (triangle, circle, square) is in the majority without counting each shape.

This past week I was reading Robert Boyer and J Strother Moore’s paper on computing the majority element of a multiset, which presents a very clever, simple algorithm for this fundamental problem and a description of a mechanical proof of its correctness. The authors aptly consider the work a “minor landmark in the development of formal verification and automated reasoning”.
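For those who want to see the idea in executable form, here is a minimal Python sketch of the pairing trick at the heart of the algorithm (the paper itself gives a recursive definition and a verified Fortran implementation, so this is an illustration rather than the authors’ code). One pass maintains a single candidate and a counter; since that candidate is meaningless when no strict majority exists, a second counting pass confirms it.

```python
def majority(items):
    """Boyer-Moore-style majority vote: an illustrative sketch, not the authors' Fortran.

    First pass: keep one candidate and one counter; a matching item increments
    the counter, a non-matching item cancels against it.  Second pass: check
    that the surviving candidate really occurs more than n/2 times (only the
    candidate is counted, never each distinct element).
    """
    items = list(items)
    candidate, count = None, 0
    for x in items:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    if candidate is not None and 2 * sum(x == candidate for x in items) > len(items):
        return candidate
    return None

# The shapes example from the figure caption:
print(majority(["circle", "triangle", "circle", "square", "circle"]))  # -> circle
```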

Below is the postscript to that paper, in its entirety, which describes the history of the paper including how and why it was “repeatedly rejected for publication”. (It was eventually published as a chapter in a 1991 festschrift for Woody Bledsoe, ten years after it was written, and is now also available from Moore’s website.)

In this paper we have described a linear time majority vote algorithm and discussed the mechanically checked correctness proof of a Fortran implementation of it. This work has a rather convoluted history which we would here like to clarify.

The algorithm described here was invented in 1980 while we worked at SRI International. A colleague at SRI, working on fault tolerance, was trying to specify some algorithms using the logic supported by “Boyer-Moore Theorem Prover.” He asked us for an elegant definition within that logic of the notion of the majority element of a list. Our answer to this challenge was the recursive expression of the algorithm described here.

In late 1980, we wrote a Fortran version of the algorithm and proved it correct mechanically. In February, 1981, we wrote this paper, describing that work. In our minds the paper was noteworthy because it simultaneously announced an interesting new algorithm and offered a mechanically checked correctness proof. We submitted the paper for publication.

In 1981 we moved to the University of Texas. Jay Misra, a colleague at UT, heard our presentation of the algorithm to an NSF site-visit team. According to Misra (private communication, 1990): “I wondered how to generalize [the algorithm] to detect elements that occur more than n/k times, for all k, k ≥ 2. I developed algorithm 2 [given in Section 3 of [9]] which is directly inspired by your algorithm. Also, I showed that this algorithm is optimal [Section 5, op. cit.]. On a visit to Cornell, I showed all this to David Gries; he was inspired enough to contribute algorithm 1 [Section 2, op. cit.].” In 1982, Misra and Gries published their work [9], citing our technical report appropriately as “submitted for publication.”

However, our paper was repeatedly rejected for publication, largely because of its emphasis on Fortran and mechanical verification. A rewritten version emphasizing the algorithm itself was rejected on the grounds that the work was superseded by the paper of Misra and Gries!

When we were invited to contribute to the Bledsoe festschrift we decided to use the opportunity to put our original paper into the literature. We still think of this as a minor landmark in the development of formal verification and automated reasoning: here for the first time a new algorithm is presented along with its mechanically checked correctness proof—eleven years after the work.

I have to think the world would have been better off if Boyer and Moore had just posted the paper to the web in 1981 and been done with it. Unfortunately, the web hadn’t been developed yet.

Stamps to mark “restricted data” (modified from “atomic stamps 1” by flickr user donovanbeeson, used by permission under CC by-nc-sa)

Ten years ago today was the largest terrorist action in United States history, an event that highlighted the importance of intelligence, and its reliance on information classification and control, for the defense of the country. This anniversary precipitated Peter Suber’s important message, which starts from the fact that access to knowledge is not always a good. He addresses the question of whether open access to the scholarly literature might make information too freely available to actors who do not have the best interests of the United States (or your country here) at heart. Do we really want everyone on earth to have information about public-key cryptosystems or exothermic chemical reactions? Should our foreign business competitors freely reap the fruits of research that American taxpayers funded? He says,

You might think that no one would seriously argue that using prices to restrict access to knowledge would contribute to a country’s national and economic security. But a vice president of the Association of American Publishers made that argument in 2006. He “rejected the idea that the government should mandate that taxpayer financed research should be open to the public, saying he could not see how it was in the national interest. ‘Remember — you’re talking about free online access to the world,’ he said. ‘You are talking about making our competitive research available to foreign governments and corporations.’ “

Suber’s response is that “If we’re willing to restrict knowledge for good people in order to restrict knowledge for bad people, at least when the risks of harm are sufficiently high, then we already have a classification system to do this.” (He provides a more detailed response in an earlier newsletter.) He is exactly right. Placing a $30 paywall in front of everyone to read an article in order to keep terrorists from having access to it is both ineffective (relying on al Qaeda’s coffers to drop below the $30 point is not a counterterrorism strategy) and overreaching (since a side effect is to disenfranchise the overwhelming majority of human beings who are not enemies of the state). Instead, research that the country deems too dangerous to distribute should be, and is, classified, and therefore kept from both open access and toll access journals.

This argument against open access, that it might inadvertently abet competitors of the state, is an instance of a more general worry about open distribution being too broad. Another instance is the “corporate free-riding” argument. It is argued that moving to an open-access framework for journals would be a windfall to corporations (the canonical example is big pharma) who would no longer have to subscribe to journals to gain the benefit of their knowledge and would thus be free-riding. To which the natural response would be “and what exactly is wrong with that?” Scientists do research to benefit society, and corporate use of the fruits of the research is one of those benefits. Indeed, making research results freely available is a much fairer system, since it allows businesses both large and small to avail themselves of the results. Why should only businesses with deep pockets be able to take advantage of research, much of which is funded by the government?

But shouldn’t companies pay their fair share for these results? Who could argue with that? To assume that the subscription fees that companies pay constitute their fair share for research requires several implicit assumptions that bear examination.

Assumption 1: Corporate subscriptions are a nontrivial sum. Do corporate subscriptions constitute a significant fraction of journal revenues? Unfortunately, there are to my knowledge no reliable data on the degree to which corporate subscriptions contribute to revenue. Estimates range from 0% (certainly the case in most fields of research outside the life sciences and technology) to 15-17% to 25% (a figure that has appeared informally and been challenged in favor of a 5-10% figure). (Thanks to Peter Suber for help in finding these references.) None of these estimates were backed up in any way. Without any well-founded figures, it doesn’t seem reasonable to be worrying about the issue. The onus is on those proposing corporate free-riding as a major problem to provide some kind of transparently supportable figures.

Assumption 2: Corporations would pay less under open access. The argument assumes that in an open-access world, journal revenues from corporations would drop, because they would save money on subscriptions but would not be supporting publication of articles through publication fees. That is, corporate researchers “read more than they write.” Of course, corporate researchers publish in the scholarly literature as well (as I did for the first part of my career when I was a researcher at SRI International), and thus would be contributing to the financial support of the publishing ecology. Here again, I know of no data on the percentage of articles with corporate authors and how that compares to the percentage of revenue from corporate subscriptions.

Assumption 3: Corporations shouldn’t be paying less than they now are, perhaps for reasons of justice, or perhaps on the more mercenary basis of financial reality. It is presumed that if corporations are not paying subscription fees (and, again by assumption, publication fees) then academia will have to pick up the slack through commensurately higher publication fees, so the total expenditure by academia will be higher. This is taken to be a bad thing, but the reason for that is not clear. Why is it assumed that the “right” apportionment of fees between academia and business is whatever we happen to have at the moment, resulting as it does from historical happenstance based on differential subscription rates and corporate and university budget decisions? Free riding in the objectionable sense is to get something without paying when one ought to pay.  But the latter condition doesn’t apply to the open-access scholarly literature any more than it applies to broadcast television.

Assumption 4: Corporations only support research through subscription fees. However, corporations also provide support for funded research through the corporate taxes that they pay to the government, which funds the research. And this mode of payment has the advantage that it covers all parts of the research process, not just the small percentage that constitutes the publishing of the final results. Corporate taxes constitute some 10% of total US tax revenue according to the IRS, so we can impute corporate underwriting of US-government funded research at that same 10% level. (In fact, since many non-corporate taxes, like FICA taxes, are earmarked for particular programs that don’t fund research, the imputed percentage should perhaps be even higher.) The subscription fees companies pay are above and beyond that. Is the corporate 10% not already a fair share? Might it even be too much?

If we collectively thought that the amount corporations are paying is insufficient, then the right response would be to increase the corporate taxes accordingly, so that all corporations contribute to the underwriting of scientific research that they all would be benefitting from. Let’s take a look at some numbers. The revenue from the 2.5 million US corporations paying corporate tax for 2009 (the last year for which data are available) was about $225 billion. The NSF budget for 2009 was $5.4 billion. So, for instance, a 50% increase in the NSF budget would require increasing corporate tax revenues by a little over 1%, that is, from a 35% corporate tax rate (say) to something like 35.4%. I’m not advocating an increase in corporate taxes for this purpose. First, I’m in no way convinced that corporations aren’t already supporting research sufficiently. Second, there are many other effects of corporate taxes that may militate against raising them. Instead, the point is that it is naive to pick out a single revenue source, subscription fees, as the sum total of corporate support of research.
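For concreteness, here is a quick sketch that reproduces the back-of-the-envelope arithmetic above; the dollar figures are the ones cited in the paragraph, and the 35% rate is the illustrative statutory rate, not a claim about effective rates.

```python
# Back-of-the-envelope check of the figures cited above.
corporate_tax_revenue = 225e9   # 2009 US corporate tax receipts (as cited)
nsf_budget = 5.4e9              # FY2009 NSF budget (as cited)
nominal_rate = 0.35             # illustrative statutory corporate rate ("say")

extra_needed = 0.5 * nsf_budget                        # funding a 50% NSF increase
relative_increase = extra_needed / corporate_tax_revenue
implied_rate = nominal_rate * (1 + relative_increase)

print(f"revenue increase needed: {relative_increase:.1%}")  # ~1.2%, "a little over 1%"
print(f"implied tax rate: {implied_rate:.2%}")              # ~35.42%, "something like 35.4%"
```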

Assumption 5: Subscription fees actually pay for research, or some pertinent aspect of research. But those fees do not devolve to the researchers or cover any aspect of the research process except for the publication aspect, and publishing constitutes only a small part of the costs of doing research. To avoid disingenuousness, shouldn’t anyone worrying about whether corporations are doing their fair share in underwriting that aspect be worrying about whether they are doing their fair share in underwriting the other aspects as well? Of course, corporations arguably are underwriting other aspects — through internal research groups, grants to universities and research labs, and their corporate taxes (the 10% discussed above). And in an open-access world, they would be covering the publication aspect as well, namely publication fees, through those same streams.

In summary, defending the subscription revenue model as a means of distribution control — whether to keep knowledge from enemies of the state or to prevent corporate free-riding — is misguided.

Cover of the first issue of the Philosophical Transactions of the Royal Society, dated March 6, 1665. Available from JSTOR’s Early Journal Content collection.

JSTOR, the non-profit online journal distributor, announced yesterday that they would be making pre-1923 US articles and pre-1870 non-US articles available for free in a program they call “Early Journal Content”. The chosen dates are not random of course; they guarantee that the articles have fallen out of copyright, so such distribution does not run into rights issues. Nonetheless, that doesn’t mean that JSTOR could take this action unilaterally. JSTOR is further bound by agreements with the publishers who provided the journals for scanning, which may have precluded them contractually from distributing even public domain materials that were derived from the provided originals. Thus such a program presumably requires cooperation of the journal publishers. In addition, JSTOR requires goodwill from publishers for all of its activities, so unilateral action could have been problematic for its long-run viability. (Such considerations may even in part underlie JSTOR’s not including all public domain material in the opened collection.)

Arranging for the necessary permissions — whether legal or pro forma — takes time, and JSTOR claims that work towards the opening of these materials started “about a year ago”, that is, prior to the recent notorious illicit download program that I have posted about previously. Predictably, the Twittersphere is full of speculation about whether the actions by Aaron Swartz affected the Early Journal Content program:

@grimmelm: JSTOR makes pre-1923 journals freely available http://about.jstor.org/participate-jstor/individuals/early-journal-content Would this have happened earlier or later w/o @aaronsw?

@mecredis: JSTOR makes all their public domain content available for free: http://about.jstor.org/news-events/news/jstor%E2%80%93free-access-early-journal-content I think this means @aaronsw wins.

@maxkaiser: Breaking: @JSTOR to provide free #openaccess to pre-1923 content in US & pre-1870 elsewhere – @aaronsw case had impact: http://about.jstor.org/news-events/news/jstor%E2%80%93free-access-early-journal-content

@JoshRosenau: JSTOR “working on releasing pre-1923 content before [@aaronsw released lotsa their PDFs], inaccurate to say these events had no impact.”

@mariabustillos: Stuff that in yr. pipe and smoke it, JSTOR haters!! http://bit.ly/qtrxdV Also: how now, @aaronsw?

So, did Aaron Swartz’s efforts affect the existence of JSTOR’s new program or its timing? As to the former, it seems clear that with or without his actions, JSTOR was already on track to provide open access to out-of-copyright materials. As to the latter, JSTOR says that

[I]t would be inaccurate to say that these events have had no impact on our planning. We considered whether to delay or accelerate this action, largely out of concern that people might draw incorrect conclusions about our motivations. In the end, we decided to press ahead with our plans to make the Early Journal Content available, which we believe is in the best interest of our library and publisher partners, and students, scholars, and researchers everywhere.

On its face, the statement implies that JSTOR acted essentially without change, but we’ll never know if Swartz’s efforts sped up or slowed down the release.

What the Early Journal Content program does show is JSTOR’s interest in providing broader access to the scholarly literature, a goal they share with open-access advocates, and even with Aaron Swartz. I hope and expect that JSTOR will continue to push, and even more aggressively, towards broader access to its collection. The scholarly community will be watching.

On guerrilla open access

July 28th, 2011

William G. Bowen, founder of JSTOR

[Update January 13, 2013: See my post following Aaron Swartz’s tragic suicide.]

Aaron Swartz has been indicted for wire fraud, computer fraud, unlawfully obtaining information from a protected computer, and recklessly damaging a protected computer. The alleged activities that led to this indictment were his downloading massive numbers of articles from JSTOR by circumventing IP and MAC address limitations and breaking and entering into restricted areas of the MIT campus to obtain direct access to the network, for the presumed intended purpose of distributing the articles through open file-sharing networks. The allegation is in keeping with his previous calls for achieving open access to the scholarly literature by what he called “guerrilla open access” in a 2008 manifesto: “We need to download scientific journals and upload them to file sharing networks.” Because many theorize that Swartz was intending to further the goals of open access by these activities, some people have asked my opinion of his alleged activities.

Before going further, I must present the necessary disclaimers: I am not a lawyer. He is presumed innocent until proven guilty. We don’t know if the allegations in the indictment are true, though I haven’t seen much in the way of denials (as opposed to apologetics). We don’t know what his intentions were or what he planned to do with the millions of articles he downloaded, though none of the potential explanations I’ve heard make much sense even in their own terms other than the guerrilla OA theory implicit in the indictment. So there is a lot we don’t know, which is typical for a pretrial case. But for the purpose of discussion, let’s assume that the allegations in the indictment are true and his intention was to provide guerrilla OA to the articles. (Of course, if the allegations are false, as some seem to believe, then my claims below are vacuous. If the claims in the indictment turn out to be false, or colored by other mitigating facts, I for one would be pleased. But I can only go by what I have read in the papers and the indictment.)

There’s a lot of silliness that has been expressed on both sides of this case. The pro-Swartz faction is quoted as saying “Aaron’s prosecution undermines academic inquiry and democratic principles.” Hunh? Or this one: “It’s incredible that the government would try to lock someone up for allegedly looking up articles at a library.” Swartz could, of course, have looked up any JSTOR article he wanted using his Harvard library privileges, and could even have text-mined the entire collection through JSTOR’s Data for Research program, but that’s not what he did. Or this howler: “It’s like trying to put someone in jail for allegedly checking too many books out of the library.” No, it isn’t, and even a cursory reading of the indictment reveals why. On the anti-Swartz side, the district attorney says things like “Stealing is stealing whether you use a computer command or a crowbar, and whether you take documents, data or dollars.” If you can’t see a difference between, say, posting one of your articles on your website and lifting a neighbor’s stereo, well then I don’t know what. There’s lots of hyperbole going on on both sides.

Here’s my view: Insofar as his intentions were to further the goals of proponents of open access (and no one is more of a proponent than I), the techniques he chose to employ were, to quote Dennis Blair, “not moral, legal, or effective.”

If the claims in the indictment are true, his actions certainly were not legal. The simple act of downloading the articles en masse was undoubtedly a gross violation of the JSTOR terms and conditions of use, which would have been incorporated into the agreement Swartz had entered into as a guest user of the MIT network. Then there is the breaking and entering, the denial of service attack on JSTOR shutting down its servers, the closing of MIT access to JSTOR.  The indictment is itself a compendium of the illegalities that Swartz is alleged to have committed.

One could try to make an argument that, though illegal, the acts were justified on moral grounds as an act of civil disobedience, as Swartz says in his manifesto. “There is no justice in following unjust laws. It’s time to come into the light and, in the grand tradition of civil disobedience, declare our opposition to this private theft of public culture.” If this was his intention, he certainly made an odd choice of target. JSTOR is not itself a publisher “blinded by greed”, or a publisher of any sort. It merely aggregates material published by others. As a nonprofit organization founded by academics and supported by foundations, its mission has been to “vastly improve access to scholarly papers”, by providing online access to articles previously unavailable, and at subscription rates that are extraordinarily economical. It has in fact made good on that mission, for which I and many other OA proponents strongly support it. This is the exemplar of Swartz’s villains, his “[l]arge corporations … blinded by greed”? God knows there’s plenty of greed to go around in large corporations, including large commercial publishing houses running 30% profit margins, but you won’t find it at JSTOR. As a side effect of Swartz’s activities, large portions of the MIT community were denied access to JSTOR for several days as JSTOR blocked the MIT IP address block in an attempt to shut Swartz’s downloads down, and JSTOR users worldwide may have been affected by Swartz’s bringing down several JSTOR servers. In all, his activities reduced access to the very articles he hoped to open, vitiating his moral imperative. And if it is “time to come into the light”, why the concerted active measures to cover his tracks (using the MIT network instead of the access he had through his Harvard library privileges, obscuring his face when entering the networking closet, and the like)?

Finally, and most importantly, this kind of action is ineffective. As Peter Suber predicted in a trenchant post that we can now see as prescient, it merely has the effect of tying the legitimate, sensible, economically rational, and academically preferable approach of open access to memes of copyright violation, illegality, and naiveté. There are already sufficient attempts to inappropriately perform this kind of tying; we needn’t provide further ammunition. Unfortunate but completely predictable statements like “It is disappointing to see advocates of OA treat this person as some kind of hero.” tar those who pursue open access with the immorality and illegality that self-proclaimed guerrillas exhibit. In so doing, guerrilla OA is not only ineffective, but counterproductive.

I believe, as I expect Aaron Swartz does, that we need an editorially sound, economically sustainable, and openly accessible scholarly communication system. We certainly do not have that now. But moving to such a system requires thoughtful efforts, not guerrilla stunts.

Grégoire Bouillier

I used to use as my standard example of why translation is hard — and why fully automatic high-quality translation (FAHQT) is unlikely in our lifetimes however old we are — the translation of the first word of the first sentence of the first book of Proust’s Remembrance of Things Past. The example isn’t mine. Brown et al. cite a 1988 New York Times article about the then-new translation by Richard Howard. Howard chose to translate the first word of the work, longtemps, as time and again (rather than, for example, the phrase for a long time as in the standard Moncrieff translation) so that the first word time would resonate with the temporal aspect of the last word of the last volume, temps, some 3000 pages later. How’s that for context?

I now have a new example, from the Lorin Stein translation of Grégoire Bouillier’s The Mystery Guest. Stein adds a translator’s note to the front matter: “For reasons the reader will understand, I have refrained from translating the expression ‘C’est le bouquet.’ It means, more or less, ‘That takes the cake.’” That phrase occurs on page 14 in the edition I’m reading.

The fascinating thing is that the reader does understand, fully and completely, why the translator chose this route. But the reason is, more or less, because of a sentence that occurs on page 83, a sentence that shares no words with the idiom in question. True, the protagonist perseverates on this latter sentence for the rest of the novella, but still, I challenge anyone to explain the reasoning behind this crucial decision — reasoning perfectly clear to any reader — in anything less than totally abstract terms, as far from the words actually used as you can imagine.

Language is ineffable.

Front steps of National Library of Medicine, 2008, photo courtesy of NIH Image Bank

Imagine my surprise when I actually received a response to my letters in recognition of the NIH public access policy, a form letter undoubtedly, but nonetheless gratefully received. And as a side effect, it allows us to gauge the understanding of the issues in the pertinent offices.

The letter, which I’ve duplicated below in its entirety, addresses two of the issues that I raised in my letter: the expansion of the policy to other agencies and the desirability of a reduction in the embargo period.

With regard to expanding the NIH policy to other funding agencies, the response merely notes the America COMPETES Act’s charge to establish a working group to study the matter — fine as far as it goes, but not an indication of support for expansion itself.

With regard to the embargo issue, the response seems a bit confused as to how things work in the real world. Let’s look at some sentences from the pertinent paragraph:

  • “As you may know, the 12-month delay period specified by law (Division G, Title II, Section 218 of P.L. 110-161) is an upper limit. Rights holders (sometimes the author, and sometimes they transfer some or all of these rights to publishers) are free to select a shorter delay period, and many do.” This is of course true. My hope, and that of many others, is to decrease this maximum.
  • “The length of the delay period is determined through negotiation between authors and publishers as part of the copyright transfer process.” Well, not so much. Authors don’t so much negotiate with publishers as just sign whatever publishers put in their path. When one actually attempts to engage in negotiation, sadly rare among academic authors, things often go smoothly, but sometimes take a turn for the odd, and authors in the thrall of publish or perish are short on negotiating leverage.
  • “These negotiations can be challenging for authors, and our guidance (http://publicaccess.nih.gov/FAQ.htm#778) encourages authors to consult with their institutions when they have questions about copyright transfer agreements.” I have a feeling that the word challenging is a euphemism for something else, but I’m not sure what. The cited FAQ doesn’t in fact provide guidance on negotiation, but just language to incorporate into a publisher agreement to make it consistent with the 12-month embargo. No advice on what to do if the publisher refuses, much less how to negotiate shorter embargoes. As for the excellent advice to “consult with their institutions”, in the case of Harvard, that kind of means to talk with my office, doesn’t it? Which, I suppose, is a vote of confidence.

So there is some room for improvement in understanding the dynamic at play in author-publisher relations. But overall, I’m gratified that NIH folks are on top of this issue and making a good faith effort to bring the fruits of research to the scholarly community and the public at large, and I reiterate my strong support of NIH’s policy.

Here’s the full text of the letter:

DEPARTMENT OF HEALTH & HUMAN SERVICES
Public Health Service
National Institutes of Health
Bethesda, Maryland 20892

May 27 2011

Stuart M. Shieber, Ph.D.
Welch Professor of Computer Science, and
Director, Office for Scholarly Communication
1341 Massachusetts Avenue
Cambridge, Massachusetts 02138

Dear Dr. Shieber:

Thank you for your letters to Secretary Sebelius and Dr. Collins regarding the NIH Public Access Policy. I am the program manager for the Policy, and have been asked to respond to you directly.

We view the policy as an important tool for ensuring that as many Americans as possible benefit from the public’s investment in research through NIH.

I appreciate your suggestions about reducing the delay period between publication and availability of a paper on PubMed Central. As you may know, the 12-month delay period specified by law (Division G, Title II, Section 218 of P.L. 110-161) is an upper limit. Rights holders (sometimes the author, and sometimes they transfer some or all of these rights to publishers) are free to select a shorter delay period, and many do. The length of the delay period is determined through negotiation between authors and publishers as part of the copyright transfer process. These negotiations can be challenging for authors, and our guidance (http://publicaccess.nih.gov/FAQ.htm#778) encourages authors to consult with their institutions when they have questions about copyright transfer agreements.

I also appreciate your suggestion to expand this Policy to other Federal science funders, and the confidence it implies in our approach. The National Science and Technology Council (NSTC) has been charged by the America COMPETES Reauthorization Act of 2010 (P.L. 111-358) to establish a working group to explore the dissemination and stewardship of peer reviewed papers arising from Federal research funding. I am copying Dr. Celeste Rohlfing at the Office of Science and Technology Policy on this correspondence, as she is coordinating the NSTC efforts on Public Access.

Sincerely,

Neil M. Thakur, Ph.D.
Special Assistant to the NIH Deputy Director for Extramural Research

cc: Ms. Celeste M. Rohlfing
Assistant Director for Physical Sciences
Office of Science and Technology Policy
Executive Office of the President
725 17th Street, Room 5228
Washington, DC 20502

Dictionary and red pencil, photo by novii, on Flickr

Sanford Thatcher has written a valuable, if anecdotal, analysis of some papers residing on Harvard’s DASH repository (Copyediting’s Role in an Open-Access World, Against the Grain, volume 23, number 2, April 2011, pages 30-34), in an effort to get at the differences between author manuscripts and the corresponding published versions that have benefited from copyediting.

“What may we conclude from this analysis?” he asks. “By and large, the copyediting did not result in any major improvements of the manuscripts as they appear at the DASH site.” He finds that “the vast majority of changes made were for the sake of enforcing a house formatting style and cleaning up a variety of inconsistencies and infelicities, none of which reached into the substance of the writing or affected the meaning other than by adding a bit more clarity here and there” and expects therefore that the DASH versions are “good enough” for many scholarly and educational uses.

Although more substantive errors did occur in the articles he examined, especially in the area of citation and quotation accuracy, they were typically carried over to the published versions as well. He notes that “These are just the kinds of errors that are seldom caught by copyeditors.”

One issue that goes unmentioned in the column is the occasional introduction of errors by the typesetting and copyediting process itself. This used to happen with great frequency in the bad old days when publishers rekeyed papers to typeset them. It was especially problematic in fields like my own, in which papers tend to have large amounts of mathematical notation, whose niceties the typesetting staff had little clue about. These days more and more journals allow authors to submit LaTeX source for their articles, to which the publisher merely applies the house style file. This practice has been a tremendous boon to the accuracy and typesetting quality of mathematical articles. Still, copyediting can introduce substantive errors in the process. Here’s a nice example from a paper in the Communications of the ACM:

“Besides getting more data, faster, we also now use much more sophisticated learning algorithms. For instance, algorithms based on logistic regression and that support vector machines can reduce by half the amount of spam that evades filtering, compared to Naive Bayes.” (Joshua Goodman, Gordon V. Cormack, and David Heckerman, Spam and the ongoing battle for the inbox, Communications of the Association for Computing Machinery, volume 50, number 2, 2007, page 27.  Emphasis added.)

Any computer scientist would immediately see that the sentence as published makes no sense. There is no such thing as a “vector machine” and in any case algorithms don’t support them. My guess is that the author manuscript had the sentence “For instance, algorithms based on logistic regression and support vector machines can reduce by half…” — without the word that. The copyeditor apparently didn’t realize that the noun phrase support vector machine is a term of art in the machine learning literature; the word support was not intended to be a verb here. (Do a Google search for vector machine. Every hit has the phrase in the context of the term support vector machine, at least for the pages I looked at before boredom set in.)

Presumably, the authors didn’t catch the error introduced by the copyeditor. The occurrence of errors of this sort is no argument against copyediting, but it does demonstrate that it should be viewed as a collaborative activity between copyeditors and authors, and better tools for collaboratively vetting changes would surely be helpful.

In any case, back to Dr. Thatcher’s DASH study. Ellen Duranceau at MIT Libraries News views the study as “support for the MIT faculty’s approach to sharing their articles through their Open Access Policy”, and the same could be said for Harvard as well. However, before we declare victory, it’s worth noting that Dr. Thatcher did find differences between the versions, and in general the edits were beneficial.

The title of Dr. Thatcher’s column gets at the subtext of his conclusions, that in an open-access world, we’d have to live with whatever errors copyediting would have caught, since we’d be reading uncopyedited manuscripts. But open-access journals can and do provide copyediting as one of their services, and to the extent that doing so improves the quality of the articles they publish and thus the imprimatur of the journal, it has a secondary benefit to the journal of improving its brand and its attractiveness to authors.

I admit that I’m a bit of a grammar nerd (with what I think is a nuanced view that manages to be linguistically descriptivist and editorially prescriptivist at the same time) and so I think that copyediting can have substantial value. (My own writing was probably most improved by Savel Kliachko, an outstanding editor at my first employer SRI International.) To my mind, the question is how to provide editing services in a rational way. Given that the costs of copyediting are independent of the number of accesses, and that the value accrues in large part to the author (by making him or her look like less of a halfwit for exhibiting “inconsistencies and infelicities” and occasionally more substantive errors), it seems reasonable that authors ought to pay publishers a fee for these services. And that is exactly what happens in open-access journals. Authors can decide if the bargain is a good one on the basis of the services that the publisher provides, including copyediting, relative to the fee the publisher charges. As a result, publishers are given incentive to provide the best services for the dollar. A good deal all around.

Most importantly, in a world of open-access journals the issue of divergence between author manuscripts and publisher versions disappears, since readers are no longer denied access to the definitive published version. Dr. Thatcher concludes that the benefits of copyediting were not as large as he would have thought. Nonetheless, however limited the benefits might be, properly viewed, those benefits argue for open access.

The owld White Harse wants zettin to rights
And the Squire hev promised good cheer,
Zo we’ll gee un a scrape to kip un in zhape,
And a’ll last for many a year.

— Thomas Hughes, The Scouring of the White Horse, 1859

On a recent trip to London, I had an extra day free, and decided to visit the Uffington White Horse with a friend. The Uffington White Horse is one of the most mysterious human artifacts on the planet. In the south of Oxfordshire, less than two hours west of London by Zipcar, it sits atop White Horse Hill in the Vale of White Horse to which it gives its name. It is the oldest of the English chalk figures, which are constructed by removing turf and topsoil to reveal the chalk layer below.

The Uffington White Horse, photo by flickr user superdove, used by permission

The figure is sui generis in its magnificence, far surpassing any of the other hill figures extant in England. The surrounding landscape — with its steep hills, the neighboring Roman earthworks castle, and pastoral lands still used for grazing sheep and cows — is spectacular.

The Uffington horse is probably best known for its appearance in Thomas Hughes’s 1857 novel Tom Brown’s Schooldays. The protagonist Tom Brown, like Hughes himself, hails from Uffington, and Hughes uses that fact as an excuse to spend a few pages detailing the then-prevalent theory of the origin of the figure, proposed by Francis Wise in 1738, that the figure was carved into the hill in honor of King Æthelred’s victory over the Danes there in 871.[1]

As it turns out, in a triumph of science over legend, Oxford archaeologists have dated the horse more accurately within the last twenty years. They conclude that the trenches were originally dug some time between 1400 and 600 BCE, making the figure about three millennia old.[2]

How did the figure get preserved over this incredible expanse of time? The longevity of the horse is especially remarkable given its construction, which is a bit different from its popular presentation as a kind of huge shallow intaglio revealing the chalk substrate. Instead, it is constructed as a set of trenches dug several feet deep and backfilled with chalk. Nonetheless, over time, dirt will overfill the chalk areas and grass will encroach. Over a period of decades, this process leads chalk figures to become “lost”. In fact, several such lost chalk figures are known in England.

Chalk figures thus require regular maintenance to prevent overgrowing. Thomas Baskerville[3] captures the alternatives: “some that dwell hereabout have an obligation upon their lands to repair and cleanse this landmark, or else in time it may turn green like the rest of the hill and be forgotten.”

Figure from Hughes’s The Scouring of the White Horse depicting the 1857 scouring. From the 1859 Macmillan edition.

This “repairing and cleansing” has been traditionally accomplished through semi-regular celebrations, called scourings, occurring at approximately decade intervals, in which the locals came together in a festival atmosphere to clean and repair the chalk lines, at the same time participating in competitions, games, and apparently much beer. Hughes’s 1859 book The Scouring of the White Horse is a fictionalized recounting of the 1857 scouring that he attended.[4]

These days, the regular maintenance of the figure has been taken over by the National Trust, which has also arranged for repair of vandalism damage and even for camouflaging of the figure during World War II.

The author at the Uffington White Horse, 19 March 2011, with Dragon Hill in the background. Note the beginnings of plant growth on the chalk substrate.

Thus, the survival of the Uffington White Horse is witness to a continuous three-millennium process of active maintenance of this artifact. As such, it provides a perfect metaphor for the problems of digital preservation. (Ah, finally, I get to the connection with the topic at hand.) We have no precedent for long-term preservation of interpretable digital objects. Unlike books printed on acid-free paper, which survive quite well in a context of benign neglect, but quite like the White Horse, bits degrade over time. It requires a constant process of maintenance and repair — mirroring,[5] verification, correction, format migration — to maintain interpretable bits over time scales longer than technology-change cycles. By coincidence, those time scales are about commensurate with the time scales for chalk figure loss, on the order of decades.
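To make the “verification” step of that maintenance loop concrete, here is a minimal sketch of a fixity check, assuming, purely for illustration, a plain-text manifest of SHA-256 digests stored alongside the preserved files; real preservation systems use more elaborate machinery, but the principle is the same.

```python
import hashlib
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def check_fixity(manifest_path):
    """Compare each file against its recorded digest and report any drift.

    Assumes a hypothetical manifest format: one '<hex digest>  <relative path>'
    entry per line, with paths relative to the manifest's directory.
    """
    root = Path(manifest_path).parent
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        digest, rel_path = line.split(maxsplit=1)
        target = root / rel_path
        if not target.is_file() or sha256(target) != digest:
            failures.append(rel_path)
    return failures  # an empty list means every file still matches its digest
```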

The tale of the Uffington White Horse provides some happy evidence that humanity can, when sufficiently motivated to establish appropriate institutions, maintain this kind of active process over millennia, but also serves as a reminder of the kind of loss we might see in the absence of such a process. The figure is to my knowledge the oldest extant human artifact that has survived due to continual maintenance. In recognition of this, I propose that we adopt as an appropriate term for the regular processes of digital preservation “the scouring of the White Horse”.

[A shout out to the publican at Uffington’s Fox and Hounds Pub for the lunch and view of White Horse Hill after our visit to the horse.]


[1]Francis Wise, A Letter to Dr. Mead concerning some antiquities in Berkshire; Particularly shewing that the White Horse, which gives name to the Vale, is a Monument of the West Saxons, made in memory of great Victory obtained over the Danes A.D. 871, 1738.

[2]David Miles and Simon Palmer, “White Horse Hill,” Current Archaeology, volume 142, pages 372-378, 1995.

[3]Thomas Baskerville, The Description of Towns, on the Road from Faringdon to Bristow and Other Places, 1681.

[4]One of the salutary byproducts of the recent mass book digitization efforts is the open availability of digital versions of both Hughes books: through Open Library and Google Books.

[5]Interestingly, the Uffington White Horse has been “mirrored” as well, with replicas in Hogansville, GA, Juarez, Mexico, and Canberra, Australia.

In recognition of the third anniversary of the establishment of the NIH Public Access Policy on April 7, 2008, I’ve sent letters to John Holdren, Director of the Office of Science and Technology Policy; Francis Collins, Director of the National Institutes of Health; and Kathleen Sebelius, Secretary of Health and Human Services. The letter to Dr. Holdren is duplicated below; the others are substantially similar. The Alliance for Taxpayer Access provides further background.

April 13, 2011

John Holdren
Assistant to the President for Science and Technology
Director, Office of Science and Technology Policy, Executive Office of the President
New Executive Office Building
725 – 17th Street NW
Washington, DC 20502

Dear Dr. Holdren:

I write to you in my role as the Director of the Office for Scholarly Communication at Harvard University, where I lead efforts to broaden access to the research and scholarly results of our university. I and others at Harvard working towards these goals so central to the university’s mission have been inspired by the National Institutes of Health Public Access Policy, now celebrating its third anniversary. The NIH policy has had an enormous impact in increasing availability of government-funded research to the citizens that have supported it through their tax dollars. Every day nearly half a million people access the over two million articles that the NIH policy makes available through the PubMed Central repository. I am especially proud that Harvard affiliates have contributed over thirty thousand of these articles.

The NIH should be applauded for these efforts to bring the fruits of scientific research to the public, and should be encouraged to provide even more timely access by shortening the embargo period in the policy. I believe that the NIH example should be broadly followed by all government agencies engaged in substantial research funding, as envisioned in the Federal Research Public Access Act (FRPAA) that has several times been introduced in Congress, and encourage you to extend this kind of policy to other science and technology funding agencies as soon as possible.

The tremendous success of the NIH policy should be celebrated.  It provides a sterling example of government acting in the public interest, leading to broader access to the important scientific results that inform researchers and lay citizens alike.

Sincerely,

Stuart M. Shieber
Welch Professor of Computer Science, and
Director, Office for Scholarly Communication

Hubble’s Dark Matter Map from flickr user NASA Goddard Photo and Video, used by permission

The Harvard repository, DASH, comprises several thousand articles in all fields of scholarship. These articles are stored and advertised through an item page providing metadata — such as title, author, citation, abstract, and link to the definitive version of record — which typically allows downloading of the article as well. But not all articles are distributed. On some of the item pages, the articles themselves can’t be downloaded; they are “dark”. The decision whether or not to allow for dark articles in a repository comes up sufficiently often that it is worth rehearsing the several reasons to allow it.

  1. Posterity: Repositories have a role in providing access to scholarly articles of course. But an important part of the purpose of a repository is to collect the research output of the institution as broadly as possible. Consider the mission of a university archives, well described in this Harvard statement: “The Harvard University Archives (HUA) supports the University’s dual mission of education and research by striving to preserve and provide access to Harvard’s historical records; to gather an accurate, authentic, and complete record of the life of the University; and to promote the highest standards of management for Harvard’s current records.” Although the roles of the university archives and the repository are different, that part about “gather[ing] an accurate, authentic, and complete record of the life of the University” reflects this role of the repository as well. Since at any given time some of the articles that make up that output will not be distributable, the broadest collection requires some portion of the collection to be dark.
  2. Change: The rights situation for any given article can change over time — especially over long time scales, librarian time scales — and having materials in the repository dark allows them to be distributed if and when the rights situation allows. An obvious case is articles under a publisher embargo. In that case, the date of the change is known, and repository software can typically handle the distributability change automatically (a minimal sketch of such logic appears after this list). There are also changes that are more difficult to predict. For instance, if a publisher changes its distribution policies, or releases backfiles as part of a corporate change, this might allow distribution where not previously allowed. Having the materials dark means that the institution can take advantage of such changes in the rights situation without having to hunt down the articles at that (perhaps much) later date.
  3. Preservation: Dark materials can still be preserved. Preservation of digital objects is by and large an unknown prospect, but one thing we know is that the more venues and methods available for preservation, the more likely the materials will be preserved. Repositories provide yet another venue for preservation of their contents, including the dark part.
  4. Discoverability: Although the articles themselves can’t be distributed, their contents can be indexed to allow for the items in the repository to be more easily and accurately located. Articles deposited dark can be found based on searches that hit not only the title and abstract but the full text of the article. And it can be technologically possible to pass on this indexing power to other services indexing the repository, such as search engines.
  5. Messaging: When repositories allow both open and dark materials, the message to faculty and researchers can be made very simple: Always deposit. Everything can go in; the distribution decision can be made separately. If authors have to worry about rights when making the decision whether to deposit in the first place, the cognitive load may well lead them to just not deposit. Since the hardest part about running a successful repository is getting a hold of the articles themselves, anything that lowers that load is a good thing. This point has been made forcefully by Stevan Harnad. It is much easier to get faculty in the habit of depositing everything than in the habit of depositing articles subject to the exigencies of their rights situations.
  6. Availability: There are times when an author has distribution rights only to unavailable versions of an article. For instance, an author may have rights to distribute the author’s final manuscript, but not the publisher’s version. Or an art historian may not have cleared rights for online distribution of the figures in an article and may not be willing to distribute a redacted version of the article without the figures. The ability to deposit dark enables depositing in these cases too. The publisher’s version or unredacted version can be deposited dark.
  7. Education: Every time an author deposits an article dark is a learning moment reminding the author that distribution is important and distribution limitations are problematic.
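To make points 2 and 4 concrete, here is a minimal sketch of how a repository record might represent dark and embargoed items. This is a hypothetical data model, not DASH’s actual implementation: the metadata stays indexable regardless, and distributability flips automatically once an embargo lapses.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class RepositoryItem:
    """A hypothetical repository record: metadata is always public, but the
    full text may be dark or under a publisher embargo."""
    title: str
    abstract: str
    full_text_path: Optional[str] = None  # stored (and preserved) even when dark
    dark: bool = False                    # never distribute, unless rights change
    embargo_until: Optional[date] = None  # distribute automatically after this date

    def distributable(self, today: Optional[date] = None) -> bool:
        today = today or date.today()
        if self.dark or self.full_text_path is None:
            return False
        return self.embargo_until is None or today >= self.embargo_until

    def indexable_text(self) -> str:
        # Dark and embargoed items still contribute to search indexing.
        return f"{self.title}\n{self.abstract}"
```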

For all these reasons, I believe that it is important to allow for dark items in an article repository. Better dark than missing.

[Hat tip to Sue Kriegsman for discussions on this issue.]