## Switching to Markdown for scholarly article production

### August 29th, 2014

With few exceptions, scholars would be better off writing their papers in a lightweight markup format called Markdown, rather than using a word-processing program like Microsoft Word. This post explains why, and reveals a hidden agenda as well.1

### Microsoft Word is not appropriate for scholarly article production

*“Old two pan balance” image from Nikodem Nijaki at Wikimedia Commons. Used by permission.*

Before turning to lightweight markup, I review the problems with Microsoft Word as the lingua franca for producing scholarly articles. This ground has been heavily covered. (Here’s a recent example.) The problems include:

**Substantial learning curve.** Microsoft Word is a complicated program that is difficult to use well.

**Appearance versus structure.** Word-processing programs like Word conflate composition with typesetting. They have you specify how a document should look, not how it is structured. A classic example is section headings. In a typical markup language, you specify that something is a heading by marking it as a heading. In a word-processing program, you might specify that something is a heading by increasing the font size and making it bold. Yes, Word has “paragraph styles”, and some people sometimes use them more or less properly, if you can figure out how. But most people don’t, or don’t do so consistently, and the resulting chaos has been well documented. It has led to a whole industry of people who specialize in massaging Word files into some semblance of consistency.

**Backwards compatibility.** Word-processing file formats have a tendency to change. Word itself has gone through multiple incompatible file formats in the last decades, one every couple of years. Over time, you have to keep up with the latest version of the software to do anything at all with a new document, but updating your software may well mean that old documents are no longer rendered identically. With Markdown, no software is necessary to read documents. They are just plain text files with relatively intuitive markings, and the underlying file format (UTF-8, née ASCII) is backward compatible to 1963. Further, typesetting Markdown documents to get the “nice” version relies on free and open-source software (markdown, pandoc) built on other longstanding open-source standards (LaTeX, BibTeX).

**Poor typesetting.** Microsoft Word does a generally poor job of typesetting, as exemplified by its handling of hyphenation, kerning, and mathematical typesetting. This shouldn’t be surprising: the whole premise of a word-processing program means that the same interface must handle both specification and typesetting in real time, a recipe for compromise.

**Lock-in.** Because Microsoft Word’s file format is effectively proprietary, users are locked in to a single software provider for any and all functionality. The file formats are so complicated that alternative implementations are effectively impossible.

### Lightweight markup is the solution

The solution is to use a markup format that allows specification of the document (providing its logical structure) separate from the typesetting of that document. Your document is specified – that is, generated and stored – as straight text. Any formatting issues are handled not by changing the formatting directly via a graphical user interface but by specifying the formatting textually using a specific textual notation. For instance, in the HTML markup language, a word or phrase that should be emphasized is textually indicated by surrounding it with <em>…</em>. HTML and other powerful markup formats like LaTeX and various XML formats carry relatively large overheads. They are complex to learn and difficult to read. (Typing raw XML is nobody’s idea of fun.) Ideally, we would want a markup format to be lightweight, that is, simple, portable, and human-readable even in its raw state.

Markdown is just such a lightweight markup language. In Markdown, emphasis is textually indicated by surrounding the phrase with asterisks, as is familiar from email conventions, for example, *lightweight*. See, that wasn’t so hard. Here’s another example: A bulleted list is indicated by prepending each item on a separate line with an asterisk, like this:

* First item
* Second item

which specifies the list

• First item
• Second item
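To make the separation between specification and typesetting concrete, here is a deliberately tiny Python sketch (my own toy illustration, nothing like pandoc's actual implementation) that converts just these two constructs, emphasis and bulleted lists, into HTML:

```python
import re

def render(md: str) -> str:
    """Toy converter: handles only *emphasis* and top-level bulleted lists."""
    out, in_list = [], False
    for raw in md.splitlines():
        line = raw.strip()
        if line.startswith("* "):               # a bulleted list item
            if not in_list:
                out.append("<ul>")
                in_list = True
            item = re.sub(r"\*([^*]+)\*", r"<em>\1</em>", line[2:])
            out.append(f"<li>{item}</li>")
            continue
        if in_list:                             # a non-item line closes the list
            out.append("</ul>")
            in_list = False
        if line:                                # ordinary text becomes a paragraph
            out.append("<p>" + re.sub(r"\*([^*]+)\*", r"<em>\1</em>", line) + "</p>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)

print(render("A *lightweight* format.\n\n* First item\n* Second item"))
```

Running this prints a paragraph with `<em>lightweight</em>` followed by a `<ul>` containing the two list items. The point is only that the plain-text specification carries the structure; a separate program decides how that structure is rendered.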

Because specification and typesetting are separated, software is needed to convert from one to the other, to typeset the specified document. For reasons that will become clear later, I recommend the open-source software pandoc. Generally, scholars will want to convert their documents to PDF (though pandoc can convert to a huge variety of other formats). To convert file.md (the Markdown-format specification file) to PDF, the command

 pandoc file.md -o file.pdf

suffices. Alternatively, there are many editing programs that allow entering, editing, and typesetting Markdown. I sometimes use Byword. In fact, I’m using it now.

Markup languages range from the simple to the complex. I argue for Markdown for four reasons:

1. Basic Markdown, sufficient for the vast majority of non-mathematical scholarly writing, is dead simple to learn and remember, because the markup notations were designed to mimic the kinds of textual conventions that people are used to – asterisks for emphasis and for indicating bulleted items, for instance. The coverage of this basic part of Markdown includes: emphasis, section structure, block quotes, bulleted and numbered lists, simple tables, and footnotes.
2. Markdown is designed to be readable and the specified format understandable even in its plain text form, unlike heavier weight markup languages such as HTML.
3. Markdown is well supported by a large ecology of software systems for entering, previewing, converting, typesetting, and collaboratively editing documents.
4. Simple things are simple. More complicated things are more complicated, but not impossible. The extensions to Markdown provided by pandoc cover more or less the rest of what anyone might need for scholarly documents, including links, cross-references, figures, citations and bibliographies (via BibTeX), mathematical typesetting (via LaTeX), and much more. For instance, this equation (the Cauchy-Schwarz inequality) will typeset well in generated PDF files, and even in HTML pages using the wonderful MathJax library:

    $$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

    (Pandoc also provides some extensions that simplify and extend basic Markdown in quite nice ways, for instance, definition lists, strikeout text, and a simpler notation for tables.)
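For instance, a pandoc-Markdown source combining several of these extensions might look like the following (the citation key, footnote label, and file names here are hypothetical, for illustration only):

```markdown
As shown by @smith2010, the bound follows from the
Cauchy-Schwarz inequality:[^details]

$$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq
  \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)$$

[^details]: See the appendix for the full derivation.
```

Given a BibTeX file containing an entry with key `smith2010`, a command along the lines of `pandoc --bibliography refs.bib file.md -o file.pdf` resolves the citation and typesets the mathematics.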

Above, I claimed that scholars should use Markdown “with few exceptions”. The exceptions are:

1. The document requires nontrivial mathematical typesetting. In that case, you’re probably better off using LaTeX. Anyone writing a lot of mathematics gave up word processors long ago and ought to know LaTeX anyway. Still, I’ll often do a first draft in Markdown with LaTeX for the math-y bits. Pandoc allows LaTeX to be included within a Markdown file (as I’ve done above) and preserves the LaTeX markup when converting the Markdown to LaTeX. From there, the document can be typeset with LaTeX. Microsoft Word would certainly not be appropriate for this case.
2. The document requires typesetting with highly refined or specialized aspects. I’d probably go with LaTeX here too, though desktop publishing software (InDesign) is also appropriate if there’s little or no mathematical typesetting required. Microsoft Word would not be appropriate for this case either.

Some have proposed that we need a special lightweight markup language for scholars. But Markdown is sufficiently close, and has such a strong community of support and software infrastructure, that it is more than sufficient for the time being. Further development would of course be helpful, so long as the urge to add “features” doesn’t overwhelm its core simplicity.

### The hidden agenda

I have a hidden agenda. Markdown is sufficient for the bulk of cases of composing scholarly articles, and simple enough to learn that academics might actually use it. Markdown documents are also typesettable according to a separate specification of document style, and retargetable to multiple output formats (PDF, HTML, etc.).2 Thus, Markdown could be used as the production file format for scholarly journals, which would eliminate the need for converting between the authors’ manuscript version and the publisher’s internal format, with all the concomitant errors that process is prone to produce.

In computer science, we have by now moved almost completely to a system in which authors provide articles in LaTeX so that no retyping or recomposition of the articles needs to be done for the publisher’s typesetting system. Publishers just apply their LaTeX style files to our articles. The result has been a dramatic improvement in correctness and efficiency. (It is in part due to such an efficient production process that the cost of running a high-end computer science journal can be so astoundingly low.)

Even better, there is a new breed of collaborative web-based document editing tools being developed that use Markdown as their core file format, tools like Draft and Authorea. They provide multi-author editing, versioning, version comparison, and merging. These tools could constitute the system by which scholarly articles are written, collaborated on, revised, copyedited, and moved to the journal production process, generating efficiencies for a huge range of journals, efficiencies that we’ve enjoyed in computer science and mathematics for years.

As Rob Walsh of ScholasticaHQ says, “One of the biggest bottlenecks in Open Access publishing is typesetting. It shouldn’t be.” A production ecology built around Markdown could be the solution.

1. Many of the ideas in this post are not new. Complaints about WYSIWYG word-processing programs have a long history. Here’s a particularly trenchant diatribe pointing out the superiority of disentangling composition from typesetting. The idea of “scholarly Markdown” as the solution is also not new. See this post or this one for similar proposals. I go further in viewing certain current versions of Markdown (as implemented in Pandoc) as practical already for scholarly article production purposes, though I support coordinated efforts that could lead to improved lightweight markup formats for scholarly applications. Update September 22, 2014: I’ve just noticed a post by Dennis Tenen and Grant Wythoff at The Programming Historian on “Sustainable Authorship in Plain Text using Pandoc and Markdown” giving a tutorial on using these tools for writing scholarly history articles.
2. As an example, I’ve used this very blog post. Starting with the Markdown source file (which I’ve attached to this post), I first generated HTML output for copying into the blog using the command
pandoc -S --mathjax --base-header-level=3 markdownpost.md -o markdownpost.html

A nicely typeset version using the American Mathematical Society’s journal article document style can be generated with

pandoc markdownpost.md -V documentclass:amsart -o markdownpost-amsart.pdf

To target the style of ACM transactions instead, the following command suffices:

pandoc markdownpost.md -V documentclass:acmsmall -o markdownpost-acmsmall.pdf

Attachments:

* markdownpost.md: The source file for this post in Markdown format
* markdownpost-amsart.pdf: The post rendered using pandoc according to AMS journal style
* markdownpost-acmsmall.pdf: The post rendered using pandoc according to ACM journal style

## No, the Turing Test has not been passed.

### June 10th, 2014

*“Turing Test” image from xkcd. Used by permission.*

There has been a flurry of interest in the Turing Test in the last few days, precipitated by a claim that (at last!) a program has passed the Test. The program in question is called “Eugene Goostman” and the claim is promulgated by Kevin Warwick, a professor of cybernetics at the University of Reading and organizer of a recent chatbot competition there.

The Turing Test is a topic that I have a deep interest in (see this, and this, and this, and this, and, most recently, this), so I thought I’d give my view on Professor Warwick’s claim that “We are therefore proud to declare that Alan Turing’s Test was passed for the first time on Saturday.” The main points are these. The Turing Test was not passed on Saturday, and “Eugene Goostman” seems to perform qualitatively about as poorly as many other chatbots in emulating human verbal behavior. In summary: There’s nothing new here; move along.

First, the Turing Test that Turing had in mind was a criterion of indistinguishability in verbal performance between human and computer in an open-ended wide-ranging interaction. In order for the Test to be passed, judges had to perform no better than chance in unmasking the computer. But in the recent event, the interactions were quite time-limited (only five minutes) and in any case, the purported Turing-Test-passing program was identified correctly more often than not by the judges (almost 70% of the time in fact). That’s not Turing’s test.

Update June 17, 2014: The time limitation was even worse than I thought. According to my colleague Luke Hunsberger, computer science professor at Vassar College, who was a judge in this event, the five minute time limit was for two simultaneous interactions. Further, there were often substantial response delays in the system. In total, he estimated that a judge might average only four or five rounds of chat with each interlocutor. I’ve argued before that a grossly time-limited Turing Test is no Turing Test at all.

Sometimes, people trot out the prediction from Turing’s seminal 1950 Mind article that “I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about $$10^9$$, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent. chance of making the right identification after five minutes of questioning.” As I explain in my book on the Test:

The first thing to note about the prediction is that it is not a prediction about the Test per se: Turing expects 70 percent prediction accuracy, not the more difficult 50 percent expected by chance, and this after only a limited conversation of five minutes. He is therefore predicting passage of a test much simpler than the Test.

Not only does the prediction not presuppose a full Turing Test, but it could well be argued that it had already come to pass with the advent of Joseph Weizenbaum’s Eliza some thirty-five years early. Weizenbaum developed a simple computer program that attempted to imitate the parroting conversational style of a Rogerian psychotherapist…. Although the methods used were quite simple – repeating the user’s question after adjusting some pronouns, throwing in a stock phrase every now and then – the result was, in its own way, extremely convincing.

Second, “Eugene Goostman” uses the very techniques that began with Weizenbaum’s “Eliza” program from the 1960s. We see the same tricks: repeating the judge’s statements with simple substitutions, keyword-triggered responses, falling back on vague or unresponsive replies, and the like. Those tricks are no more successful than they have been in the two decades of running the Loebner Prize Competition, another ill-conceived attempt at running a Turing-like test. And there too, entrants used the trick of having their programs emulate humans with built-in excuses. “Eugene Goostman” purports to be a non-English-fluent child, rather than a fluent adult.1

In fact, anyone with the slightest familiarity with chatbots, the modern incarnations of “Eliza”, could unmask “Eugene Goostman” quickly. (The judges at the recent competition were apparently not so familiar.) Scott Aaronson has provided a transcript of his own interaction with “Eugene Goostman”, which shows the telltale signs as soon as the computer’s first reply, which blatantly ignores an utterly trivial question and tries to change the subject:

Scott: Which is bigger, a shoebox or Mount Everest?

Eugene: I can’t make a choice right now. I should think it out later. And I forgot to ask you where you are from…

Here’s an appeal to cute, vague language:

Scott: How many legs does a camel have?

Eugene: Something between 2 and 4. Maybe, three? :-))) By the way, I still don’t know your specialty – or, possibly, I’ve missed it?

And here’s the repetition of a canned response:

Scott: No, I need to know that you’re not a chatbot. Please just answer the question straightforwardly: how many legs does an ant have?

Eugene: Something between 2 and 4. Maybe, three? :-))) Oh, what a fruitful conversation;-)

Here’s a canned keyword-triggered response, nonresponsive as usual:

Scott: In your opinion, does your existence demonstrate any idea or principle that wasn’t demonstrated just as convincingly by ELIZA, Joseph Weizenbaum’s chatbot from the 1960s?

Eugene: Eliza was a break-thru. All the bots after it were nothing but weak parodies, claiming to have “revolutionary improvements”.

And here’s “Eliza”’s trick of rewriting the judge’s utterance after pronoun replacement and (poorly executed) synonym substitution:

Scott: OK, I’ll count that as the second sensible thing you’ve said (grading generously). Don’t you think that a more convincing chatbot could be created, by using Google to trawl the entire web for plausible-sounding answers (analogous to what IBM’s Watson did)?

Eugene: Yes, I do think that a more convincing chatbot can be built by using google to trawl the all web for plausible – sounding – I hope I have right to have my own opinion. Maybe, let’s talk about something else? What would you like to discuss?

Literally every one of “Eugene”’s responses reflects its “Eliza”-like programming. It would be amusing, if it weren’t so predictable.

In summary, “Eugene Goostman” is not qualitatively superior to other chatbots, and certainly has not passed a true Turing Test. It isn’t even close.

1. In a parody of this approach, the late John McCarthy, professor of computer science at Stanford University and inventor of the term “artificial intelligence”, wrote a letter to the editor responding to a publication about an “Eliza”-like program that claimed to emulate a paranoid psychiatric patient. He presented his own experiments that I described in my Turing Test book: “He had designed an even better program, which passed the same test. His also had the virtue of being a very inexpensive program, in these times of tight money. In fact you didn’t even need a computer for it. All you needed was an electric typewriter. His program modeled infantile autism. And the transcripts – you type in your questions, and the thing just sits there and hums – cannot be distinguished by experts from transcripts of real conversations with infantile autistic patients.”

## How universities can support open-access journal publishing

### To university administrators and librarians:

*“Shelf of journals” image from Flickr user University of Illinois Library. Used by permission.*

As a university administrator or librarian, you may see the future in open-access journal publishing and may be motivated to help bring that future about.1 I would urge you to establish or maintain an open-access fund to underwrite publication fees for open-access journals, but to do so in a way that follows the principles that underlie the Compact for Open-Access Publishing Equity (COPE). Those principles are two:

Principle 1: Our goal should be to establish an environment in which publishers are enabled2 to change their business model from the unsustainable closed access model based on reader-side fees to a sustainable open access model based on author-side fees.

If publishers could and did switch to the open-access business model, in the long term the moneys saved in reader-side fees would more than cover the author-side fees, with open access added to boot.

But until a large proportion of the funded research comes with appropriately structured funds usable to pay author-side fees, publishers will find themselves in an environment that disincentivizes the move to the preferred business model. Only when the bulk of research comes with funds to pay author-side fees underwriting dissemination will publishers feel comfortable moving to that model. Principle 1 argues for a system where author-side fees for open-access journals should be largely underwritten on behalf of authors, just as the research libraries of the world currently underwrite reader-side fees on behalf of readers.3 But who should be on the hook to pay the author-side fees on behalf of the authors? That brings us to Principle 2.

Principle 2: Dissemination is an intrinsic part of the research process. Those that fund the research should be responsible for funding its dissemination.

Research funding agencies, not universities, should be funding author-side fees for research funded by their grants. There’s no reason for universities to take on that burden on their behalf.4 But universities should fund open-access publication fees for research that they fund themselves.

We don’t usually think of universities as research funders, but they are. They hire faculty to engage in certain core activities – teaching, service, and research – and their job performance and career advancement typically depends on all three. Sometimes researchers obtain outside funding for the research aspect of their professional lives, but where research is not funded from outside, it is still a central part of faculty members’ responsibilities. In those cases, where research is not funded by extramural funds, it is therefore being implicitly funded by the university itself. In some fields, the sciences in particular, outside funding is the norm; in others, the humanities and most social sciences, it is the exception. Regardless of the field, faculty research that is not funded from outside is university-funded research, and the university ought to be responsible for funding its dissemination as well.

The university can and should place conditions on funding that dissemination. In particular, it ought to require that if it is funding the dissemination, then that dissemination be open – free for others to read and build on – and that it be published in a venue that provides openness sustainably – a fully open-access journal rather than a hybrid subscription journal.

Organizing a university open-access fund consistent with these principles means that the university will, at present, fund few articles, for reasons detailed elsewhere. Don’t confuse slow uptake with low impact. The import of the fund is not to be measured by how many articles it makes open, but by how it contributes to the establishment of the enabling environment for the open-access business model. The enabling environment will have to grow substantially before enablement becomes transformation. It is no less important in the interim.

What about the opportunity cost of open-access funds? Couldn’t those funds be better used in our efforts to move to a more open scholarly communication system? Alternative uses of the funds are sometimes proposed, such as university libraries establishing and operating new open-access journals or paying membership fees to open-access publishers to reduce the author-side fees for their journals. But establishing new journals does nothing to reduce the need to subscribe to the old journals. It adds costs with no anticipation, even in the long term, of corresponding savings elsewhere. And paying membership fees to certain open-access publishers puts a finger on the scale so as to preemptively favor certain such publishers over others and to let funding agencies off the hook for their funding responsibilities. Such efforts should at best be funded after open-access funds are established to make good on universities’ responsibility to underwrite the dissemination of the research they’ve funded.

1. It should go without saying that efforts to foster open-access journal publishing are completely consistent with, in fact aided by, fostering open access through self-deposit in open repositories (so-called “green open access”). I am a long and ardent supporter of such efforts myself, and urge you as university administrators and librarians to promote green open access as well. [Since it should go without saying, comments recapitulating that point will be deemed tangential and attended to accordingly.]
2. I am indebted to Bernard Schutz of Max Planck Gesellschaft for his elegant phrasing of the issue in terms of the “enabling environment”.
3. Furthermore, as I’ve argued elsewhere, disenfranchising readers through subscription fees is a more fundamental problem than disenfranchising authors through publication fees.
4. In fact, by being willing to fund author-side fees for grant-funded articles, universities merely delay the day that funding agencies do their part by reducing the pressure from their fundees.

## Public underwriting of research and open access

### April 4th, 2014

*Title page of the first octavo edition of Rousseau’s Social Contract.*

[This post is based loosely on my comments on a panel on 2 April 2014 for Terry Fisher‘s CopyrightX course. Thanks to Terry for inviting me to participate and provoking this piece, and to my Berkman colleagues for their wonderful contributions to the panel session.]

Copyright is part of a social contract: You the author get a monopoly to exploit rights for a while in return for us the public gaining “the progress of Science and the Useful Arts”. The idea is that the direct financial benefit of exploiting those rights provides incentive for the author to create.

But this foundation for copyright ignores the fact that there are certain areas of creative expression in which direct financial benefit is not an incentive to create: in particular, academia. It’s not that academics who create and publish their research don’t need incentives, even financial incentives, to do so. Rather, the financial incentives are indirect. They receive no direct payment for the articles that they publish describing their research. They benefit instead from the personal uplift of contributing to human knowledge and seeing that knowledge advance science and the useful arts. Plus, their careers depend on the impact of their research, which is a result of its being widely read; it’s not all altruism.

In such cases, a different social contract can be in force without reducing creative expression. When the public underwrites the research that academics do – through direct research grants for instance – they can require in return that the research results must be made available to the public, without allowing for the limited period of exclusive exploitation. This is one of the arguments for the idea of open access to the scholarly literature. You see it in the Alliance for Taxpayer Access slogan “barrier-free access to taxpayer-funded research” and the White House statement that “The Obama Administration agrees that citizens deserve easy access to the results of research their tax dollars have paid for.” It is implemented in the NIH public access policy, requiring all articles funded by NIH grants to be made openly available through the PubMed Central website, where millions of visitors access millions of articles each week.

But here’s my point, one that is underappreciated even among open access supporters. The penetration of the notion of “taxpayer-funded research”, of “research their tax dollars have paid for”, is far greater than you might think. Yes, it includes research paid for by the $30 billion invested by the NIH each year, the $7 billion in research funded by the NSF, and the $150 million funded by the NEH. But all university research benefits from the social contract with taxpayers that makes universities tax-exempt.1 The Association of American Universities makes this social contract clear:

The educational purposes of universities and colleges – teaching, research, and public service – have been recognized in federal law as critical to the well-being of our democratic society. Higher education institutions are in turn exempted from income tax so they can make the most of their revenues…. Because of their tax exemption, universities and colleges are able to use more resources than would otherwise be available to fund: academic programs, student financial aid, research, public extension activities, and their overall operations.

It’s difficult to estimate the size of this form of support to universities. The best estimate I’ve seen puts it at something like $50 billion per year for the income tax exemption. That’s more than the NIH, NSF, and (hardly worth mentioning) the NEH put together. It’s on par with the total non-defense federal R&D funding.

And it’s not just exemption from income tax that universities benefit from. They are also exempt from property taxes on their campuses. Their contributors are exempt from tax on their charitable contributions to the university, which results, ceteris paribus, in larger donations. Their students are exempt from taxes on educational expenses. They receive government funding for scholarships, freeing up funds for research. Constructing an estimate of the total benefit to universities from all these sources is daunting. One study places the total value of all direct tax exemptions, federal, state, and local, for a single university, Northeastern University, at $97 million, accounting for well over half of all government support to the university. (Even this doesn’t count several of the items noted above.)

All university research, not just the grant-funded research, benefits from the taxpayer underwriting implicit in the tax exemption social contract. It would make sense, then, for taxpayers to require open access to all university research in return for continued tax-exempt status. Copyright is the citizenry paying authors with a monopoly in return for social benefit. But where the citizenry pays authors through some other mechanism, like $50 billion worth of tax exemption, it’s not a foregone conclusion that we should pay with the monopoly too.

Some people point out that just because the government funds something doesn’t mean that the public gets a free right of access. Indeed, the government funds various things that the public doesn’t get access to, or at least, not free access. The American Publisher’s Association points out, for instance, that although taxpayers pay for the national park system “they still have to pay a fee if they want to go in, and certainly if they want to camp.” On the other hand, you don’t pay when the fire department puts out a fire in your house, or to access the National Weather Service forecasts. It seems that the social contract is up for negotiation.

And that’s just the point. The social contract needs to be designed, and designed keeping in mind the properties of the goods being provided and the sustainability of the arrangement. In particular, funding of the contract can come from taxpayers or users or a combination of both. In the case of national parks, access to real estate is an inherently limited resource, and the benefit of access redounds primarily to the user (the visitor), so getting some of the income from visitors puts in place a reasonable market-based constraint.

Information goods are different. First, the benefits of access to information redound widely. Information begets information: researchers build on it, journalists report on it, products are based on it. The openness of NWS data means that farms can generate greater yields to benefit everyone (one part of the fourth of six goals in the NWS Strategic Plan). The openness of MBTA transit data means that a company can provide me with an iPhone app to tell me when my bus will arrive at my stop. Second, access to information is not an inherently limited resource. As Jefferson said, “He who receives an idea from me, receives instruction himself without lessening mine.” If access is to be restricted, it must be done artificially, through legal strictures or technological measures. The marginal cost of providing access to an academic article is, for all intents and purposes, zero. Thus, it makes more sense for the social contract around distributing research results to be funded exclusively from the taxpayer side rather than the user side, that is, funding agencies requiring completely free and open access for the articles they fund, and paying to underwrite the manifest costs of that access. (I’ve written in the past about the best way for funding agencies to organize that payment.)

It turns out that we, the public, are underwriting directly and indirectly every research article that our universities generate. Let’s think about what the social contract should provide us in return. Blind application of the copyright social contract would not be the likely outcome.

1. Underappreciated by many, but as usual, not by Peter Suber, who anticipated this argument, for instance, in his seminal book Open Access:

All scholarly journals (toll access and OA) benefit from public subsidies. Most scientific research is funded by public agencies using public money, conducted and written up by researchers working at public institutions and paid with public money, and then peer-reviewed by faculty at public institutions and paid with public money. Even when researchers and peer reviewers work at private universities, their institutions are subsidized by publicly funded tax exemptions and tax-deductible donations. Most toll-access journal subscriptions are purchased by public institutions and paid with taxpayer money. [Emphasis added.]

## A true transitional open-access business model

### March 28th, 2014

 …provide a transition path… “The Temple of Transition, Burning Man 2011” photo by flickr user Michael Holden, used by permission

David Willetts, the UK Minister for Universities and Research, has written a letter to Janet Finch responding to her committee’s “A Review of Progress in Implementing the Recommendations of the Finch Report”. Notable in Minister Willetts’ response is this excerpt:

Government wants [higher education institutions] to fully participate in the take up of Gold OA and create a better functioning market. Hence, Government looks to the publishing industry to develop innovative and sustainable solutions to address the ‘double-dipping’ issue perceived by institutions. Publishers have an opportunity to incentivise early adoption of Gold OA by moderating the total cost of publication for individual institutions. This would remove the final obstacle to greater take up of Gold OA, enabling universal acceptance of ‘hybrid’ journals.

It is important for its recognition, first, that the hybrid journal model as currently implemented has inherent obstacles (consistent with a previous post of mine), and second, that the solution is to make sure that individual institutions (as emphasized in the original) are properly incentivized for underwriting hybrid fees.

This development led me to dust off a pending post that has been sitting in my virtual filing cabinet for several years now, updated every once in a while as developments warranted. It addresses exactly this issue in some detail.

## A document scanning smartphone handle

### March 13th, 2014

 …my solution to the problem… (Demonstrating the Scan-dle to my colleagues from the OSC over a beer in a local pub. Photo: Reinhard Engels)

They are at the end of the gallery; retired to their tea and scandal, according to their ancient custom.

William Congreve

For a project that I am working on, I needed to scan some documents in one of the Harvard libraries. Smartphones are a boon for this kind of thing, since they are highly portable and now come with quite high-quality cameras. The iPhone 5 camera, for instance, has a resolution of 3,264 x 2,448, which comes to about 300 dpi when scanning a letter-size sheet of paper, and its brightness depth of 8 bits per pixel provides an effective resolution that is higher still.
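As a quick sanity check on the 300 dpi figure, the per-axis resolution follows from dividing the sensor dimensions by the page dimensions; the function here is a sketch assuming the sensor frames the letter-size page exactly:

```python
def scan_dpi(pixels_long, pixels_short, inches_long, inches_short):
    """Effective dots per inch along each axis of the scanned page."""
    return pixels_long / inches_long, pixels_short / inches_short

# iPhone 5 sensor (3,264 x 2,448 px) framing an 11" x 8.5" sheet:
dpi_long, dpi_short = scan_dpi(3264, 2448, 11.0, 8.5)
print(round(dpi_long), round(dpi_short))  # 297 288 — roughly 300 dpi on each axis
```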

The downside of a smartphone, and any handheld camera, is the blurring that inevitably arises from camera shake when holding the camera and pressing the shutter release. True document scanners have a big advantage here. You could use a tripod, but dragging a tripod into the library is none too convenient, and staff may even disallow it, not to mention the expense of a tripod and smartphone tripod mount.

My solution to the problem of stabilizing my smartphone for document scanning purposes is a kind of document scanning smartphone handle that I’ve dubbed the Scan-dle. The stabilization that a Scan-dle provides dramatically improves the scanning ability of a smartphone, yet it’s cheap, portable, and unobtrusive.

The Scan-dle is essentially a triangular cross-section monopod made from foam board with a smartphone platform at the top. The angled base tilts the monopod so that the smartphone’s camera sees an empty area for the documents.[1] Judicious use of hook-and-loop fasteners allows the Scan-dle to fold small and flat in a couple of seconds.

The plans at right show how the device is constructed. From a sheet of foam board, cut the shape indicated by the solid lines. (You can start by cutting out a 6″ x 13.5″ rectangle of board, then cutting out the bits at the four corners.) Then, along the dotted lines, carefully cut through the top paper and foam but not the bottom layer of paper. This allows the board to fold along these lines. (I recommend adding a layer of clear packaging tape along these lines on the uncut side for reinforcement.) Place four small binder clips along the bottom where indicated; these provide a flatter, more stable base. Stick on six 3/4″ hook-and-loop squares where indicated, and cut two 2.5″ pieces of 3/4″ hook-and-loop tape.

When the board is folded along the “fold for storage” line (see image at left), you can use the tape pieces to hold it closed and flat for storage. When the board is folded along the two “fold for use” lines (see image at right), the same tape serves to hold the board together into its triangular cross section. Hook-and-loop squares applied to a smartphone case hold the phone to the platform.

To use the Scan-dle, hold the base to a desk with one hand and operate the camera’s shutter release with the other, as shown in the video below. An additional trick for iPhone users is to use the volume buttons on a set of earbuds as a shutter release for the iPhone camera, further reducing camera shake.

The Scan-dle has several nice properties:

1. It is made from readily available and inexpensive materials. I estimate that the cost of the materials used in a single Scan-dle is less than $10, of which about half is the iPhone case. In my case, I had everything I needed at home, so my incremental cost was $0.
2. It is extremely portable. It folds flat to 6″ x 7″ x .5″, and easily fits in a backpack or handbag.
3. It sets up and breaks down quickly. It switches between its flat state and ready-to-go in about five seconds.
4. It is quite sufficient for stabilizing the smartphone.

The scanning area covered by a Scan-dle is about 8″ x 11″, just shy of a letter-size sheet. Of course, you can easily change the device’s height in the plans to increase that area. But I prefer to leave it short, which improves the resolution in scanning smaller pages. When a larger area is needed you can simply set the base of the Scan-dle on a book or two. Adding just 1.5″ to the height of the Scan-dle gives you coverage of about 10″ x 14″. By the way, after you’ve offloaded the photos onto your computer, programs like the freely available Scantailor can do a wonderful job of splitting, deskewing, and cropping the pages if you’d like.
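The book trick works because, to a first approximation, the covered area scales linearly with camera height (a pinhole-camera assumption). The 5.5-inch base height below is inferred from the figures in the text, not stated there:

```python
def coverage(base_w, base_h, base_height, new_height):
    """Scanned area at a new camera height, assuming linear scaling
    with height (pinhole-camera model)."""
    scale = new_height / base_height
    return base_w * scale, base_h * scale

# Raising an assumed 5.5" camera height by 1.5" (e.g., one book):
w, h = coverage(8, 11, 5.5, 5.5 + 1.5)
print(round(w, 1), round(h, 1))  # 10.2 14.0 — about the 10" x 14" quoted above
```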

Let me know in the comments section if you build a Scan-dle and how it works for you, especially if you come up with any use tips or design improvements.

### Materials:

(Links are for reference only; no need to buy in these quantities.)

1. The design bears a resemblance to a 2011 Kickstarter-funded document scanner attachment called the Scandy, though there are several differences. The Scandy was a telescoping tube that attached with a vise mount to a desk; the Scan-dle simplifies by using the operator’s hand as the mount. The Scandy’s telescoping tube allowed the scan area to be sized to the document; the Scan-dle must be rested on some books to increase the scan area. Because of its solid construction, the Scandy was undoubtedly slightly heavier and bulkier than the Scan-dle. The Scandy cost some €40 ($55); the Scan-dle comes in at a fraction of that. Finally, the Scandy seems no longer to be available; the open-source Scan-dle never varies in its availability.

## A model OA journal publication agreement

### February 19th, 2014

 …decided to write my own…

In a previous post, I proposed that open-access journals use the CC-BY license for their scholar-contributed articles:

As long as you’re founding a new journal, its contents should be as open as possible consistent with appropriate attribution. That exactly characterizes the CC-BY license. It’s also a tremendously simple approach. Once the author grants a CC-BY license, no further rights need be granted to the publisher. There’s no need for talk about granting the publisher a nonexclusive license to publish the article, etc., etc. The CC-BY license already allows the publisher to do so. There’s no need to talk about what rights the author retains, since the author retains all rights subject to the nonexclusive CC-BY license. I’ve made the case for a CC-BY license at length elsewhere.

Recently, a journal asked me how to go about doing just that. What should their publication agreement look like? It was a fair question, and one I didn’t have a ready answer for. The “Online Guide to Open Access Journals Publishing” provides a template agreement that is refreshingly minimalist, but by my lights misses some important aspects. I looked around at various journals to see what they did, but didn’t find any agreements that seemed ideal either. So I decided to write my own. Herewith is my proposal for a model OA publication agreement.

## Thoughts on founding open-access journals

### November 21st, 2013

 …altogether too much concern with the contents of the journal’s spine text… “reference” image by flickr user Sara S. used by permission.

Precipitated by a recent request to review some proposals for new open-access journals, I spent some time gathering my own admittedly idiosyncratic thoughts on some of the issues that should be considered when founding new open-access journals. I make them available here. Good sources for more comprehensive information on launching and operating open-access journals are SPARC’s open-access journal publishing resource index and the Open Access Directory’s guides for OA journal publishers.

Unlike most of my posts, I may augment this post over time, and will do so without explicit marking of the changes. Your thoughts on additions to the topics below—via comments or email—are appreciated. A version number (currently version 1.0) will track the changes for reference.

### It is better to flip a journal than to found one

The world has enough journals. Adding new open-access journals as alternatives to existing ones may be useful if there are significant numbers of high-quality articles being generated in a field for which there is no reasonable open-access venue for publication. Such cases are quite rare, especially given the rise of open-access “megajournals” covering the sciences (PLoS ONE, Scientific Reports, AIP Advances, SpringerPlus, etc.) and the social sciences and humanities (SAGE Open). Where there are already adequate open-access venues (even if no one journal is “perfect” for the field), scarce resources are probably better spent elsewhere, especially on flipping journals from closed to open access.

Admittedly, the world does not have enough open-access journals (at least high-quality ones). So if it is not possible to flip a journal, founding a new one may be a reasonable fallback position, but it is definitely the inferior alternative.

### Licensing should be by CC-BY

As long as you’re founding a new journal, its contents should be as open as possible consistent with appropriate attribution. That exactly characterizes the CC-BY license. It’s also a tremendously simple approach. Once the author grants a CC-BY license, no further rights need be granted to the publisher. There’s no need for talk about granting the publisher a nonexclusive license to publish the article, etc., etc. The CC-BY license already allows the publisher to do so. There’s no need to talk about what rights the author retains, since the author retains all rights subject to the nonexclusive CC-BY license. I’ve made the case for a CC-BY license at length elsewhere.

### It’s all about the editorial board

The main product that a journal is selling is its reputation. A new journal with no track record needs high-quality submissions to bootstrap that reputation, and at the start, nothing is more convincing to authors to submit high-quality work to the journal than its editorial board. Getting high-profile names somewhere on the masthead at the time of the official launch is the most important thing for the journal to do. (“We can add more people later” is a risky proposition. You may not get a second chance to make a first impression.) Getting high-profile names on your board may occur naturally if you use the expedient of flipping an existing closed-access journal, thereby stealing the board, which also has the benefit of acquiring the journal’s previous reputation and eliminating one more subscription journal.

Another good idea for jumpstarting a journal’s reputation is to prime the article pipeline by inviting leaders in the field to submit their best articles to the journal before its official launch, so that the journal announcement can provide information on forthcoming articles by luminaries.

### Follow ethical standards

Adherence to the codes of conduct of the Open Access Scholarly Publishers Association (OASPA) and the Committee on Publication Ethics (COPE) should be fundamental. Membership in the organizations is recommended; the fees are extremely reasonable.

### You can outsource the process

There is a lot of interest among certain institutions in founding new open-access journals, institutions that may have no particular expertise in operating journals. A good solution is to outsource the operation of the journal to an organization that does have such expertise, namely, a journal publisher. There are several such publishers who have experience running open-access journals effectively and efficiently. Some are exclusively open-access publishers, for example, Co-Action Publishing, Hindawi Publishing, and Ubiquity Press. Others handle both open- and closed-access journals: HighWire Press, Oxford University Press, ScholasticaHQ, Springer/BioMed Central, Wiley. This is not intended as a complete listing (the Open Access Directory has a complementary offering), nor in any sense an endorsement of any of these organizations, just a comment that shopping the journal around to a publishing partner may be a good idea. Especially given the economies of scale that exist in journal publishing, an open-access publishing partner may allow the journal to operate much more economically than having to establish a whole organization in-house.

### Certain functionality should be considered a baseline

Geoffrey Pullum, in his immensely satisfying essays “Stalking the Perfect Journal” and “Seven Deadly Sins in Journal Publishing”, lists his personal criteria in journal design. They are a good starting point, but need updating for the era of online distribution. (There is altogether too much concern with the contents of the journal’s spine text, for instance.)

- Reviewing should be anonymous (with regard to the reviewers) and blind (with regard to the authors), except where a commanding argument can be given for experimenting with alternatives.
- Every article should be preserved in one (or better, more than one) preservation system. CLOCKSS, Portico[1], or a university or institutional archival digital repository are good options.
- Every article should have complete bibliographic metadata on the first page, including license information (a simple reference to CC-BY; see above), and (as per Pullum) first and last page numbers.
- The journal should provide DOIs for its articles. OASPA membership is an inexpensive way to acquire the ability to assign DOIs. An article’s DOI should be included in the bibliographic metadata on the first page.

There’s additional functionality beyond this baseline that would be ideal, though the tradeoff against the additional effort required would have to be evaluated:

- Provide article-level metrics, especially download statistics, though other “altmetrics” may be helpful.
- Provide access to the articles in multiple formats in addition to PDF: HTML, and XML with the NLM DTD.
- Provide the option for readers to receive alerts of new content through emails and RSS feeds.
- Encourage authors to provide the underlying data to be distributed openly as well, and provide the infrastructure for them to do so.

### Take advantage of the networked digital era

Many journal publishing conventions of long standing are no longer well motivated in the modern era. Here are a few examples. They are not meant to be exhaustive; you can probably think of others. The point is that certain standard ideas can and should be rethought.

- There is no longer any need for “issues” of journals. Each article should be published as soon as it is finished, no later and no sooner. If you’d like, an “issue” number can be assigned that is incremented for each article. (Volumes, incremented annually, are still necessary because many aspects of the scholarly publishing and library infrastructure make use of them. They are also useful for the purpose of characterizing a bolus of content for storage and preservation purposes.)
- Endnotes, a relic of the day when typesetting was a complicated and fraught process that was eased by a human being not having to determine how much space to leave at the bottom of a page for footnotes, should be permanently retired. Footnotes are far easier for readers (which is the whole point, really), and computers do the drudgery of calculating the space for them.
- Page limits are silly. In the old physical-journal days, page limits had two purposes. They were necessary because journal issues came in quanta of page signatures, and therefore had fundamental physical limits to the number of pages that could be included. A network-distributed journal no longer has this problem. Page limits also serve the purpose of constraining the author to write relatively succinctly, easing the burden on reviewer and (eventually) reader. But for this purpose the page is not a robust unit of measurement of the constrained resource, the reviewers’ and the readers’ attention. One page can hold anything from a few hundred to a thousand or more words. If limits are to be required, they should be stated in appropriate units such as the number of words. The word count should not include figures, tables, or bibliography, as they impinge on readers’ attention in a qualitatively different way.
- Author-date citation is far superior to numeric citation in every way except for the amount of space and ink required. Now that digital documents use no physical space or ink, there is no longer an excuse for numeric citations. Similarly, ibid. and op. cit. should be permanently retired. I appreciate that different fields have different conventions on these matters. That doesn’t change the fact that those fields that have settled on numeric citations or ibidded footnotes are on the wrong side of technological history.
- Extensive worry about and investment in fancy navigation within and among the journal’s articles is likely to be a waste of time, effort, and resources. To a first approximation, all accesses to articles in the journal will come from sites higher up in the web food chain—the Googles and Bings, the BASEs and OAIsters of the world. Functionality that simplifies navigation among articles across the whole scholarly literature (cross-linked DOIs in bibliographies, for instance, or linked open metadata of various sorts) is a different matter.

### Think twice

In the end, think long and hard about whether founding a new open-access journal is the best use of your time and your institution’s resources in furthering the goals of open scholarly communication. Operating a journal is not free, in cash and in time. Perhaps a better use of resources is making sure that academic institutions and funders are set up to underwrite the existing open-access journals in the optimal way. But if it’s the right thing to do, do it right.

1. A caveat on Portico’s journal preservation service: The service is designed to release its stored articles when a “trigger event” occurs, for instance, if the publisher ceases operations. Unfortunately, Portico doesn’t release the journal contents openly, but only to its library participants, even for OA journals. However, if the articles were licensed under CC-BY, any participating library could presumably reissue them openly.

## The Affordable Care Act’s contradictory free market stance

### October 30th, 2013

 …in the upper 90’s… apparently from Health Care for America Now! via logarchism.com.
The Patient Protection and Affordable Care Act (ACA) limits the “medical loss ratio” (MLR) that an insurer can have — the percentage of collected medical premiums that must go to medical services for the insured. The minimum MLR mandated by the law is 80-85% depending on the particular market. (For simplicity, let’s call it 80%.) On its face, this seems like a good idea. If an insurer’s MLR is really low, say 50%, they’re keeping an awful lot of money for administration and profit, and it looks like the premium-payers are getting a raw deal. By limiting MLR to at least 80%, premium-payers are guaranteed that at most 20% of their money will go to those costs that benefit them not at all. But there may be unintended consequences of the MLR limit, and alternatives to achieving its goal.

Because of the MLR limit, an insurance company that spends $1,000,000 on medical services can generate at most $250,000 in profit. They’d reach this limit by charging premiums totalling $1,250,000, yielding an MLR of 1,000,000/1,250,000 = .80. (Of course, they’d generate even less profit than this, since they have other costs than medical services, but $250,000 is an upper bound on their profit.) They can’t increase their profit by charging higher premiums alone, since this would just blow the MLR limit. The only way to increase the profits (governed by the denominator in the MLR calculation) is to increase medical services (the numerator) as well — pay for more doctor visits, longer stays, more tests, just the kinds of things we’re already spending too much on with our moral-hazard–infested medical care system. The MLR limit embeds an incentive for insurance companies to push for more medical services, whether needed or not.
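The arithmetic above can be sketched directly: with medical spending M and an MLR floor r, premiums are capped at M / r, so the non-medical margin is capped at M(1/r − 1). The function names here are illustrative:

```python
def max_premiums(medical_spend, mlr_floor):
    """Largest total premiums the MLR floor permits."""
    return medical_spend / mlr_floor

def max_margin(medical_spend, mlr_floor):
    """Upper bound on administration-plus-profit under the MLR floor."""
    return max_premiums(medical_spend, mlr_floor) - medical_spend

print(max_margin(1_000_000, 0.80))  # 250000.0, as in the example above
# The only way to raise the bound is to raise medical spending itself:
print(max_margin(1_250_000, 0.80))  # 312500.0
```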

And why 80%? Medicare has had an MLR in the upper 90%’s for a couple of decades, and private insurers used to make a go of it in that range as well in the early 1990’s. (See graph.) Other countries have MLR’s in the mid-90’s as well. An MLR limit of 80% means that once an insurer reaches 80% MLR, the regulation provides no incentive to improve further.

Wasn’t this moral hazard and inefficiency just the sort of thing the ACA was supposed to resolve by using market forces? When people buy insurance premiums on a transparently priced exchange, if one insurer is less efficient or egregious in profit-taking (therefore with a low MLR), it should end up outcompeted by more efficient and leaner insurers. No need to mandate a limit; the market will solve the problem.

If you think that the market forces in the health care exchanges won’t compete down administrative overheads and profits (that is, raise MLR) on their own and that regulation is necessary to prevent abuse, then you’re pretty much conceding that the market doesn’t work under the ACA, and that we should move to a single-payer system. MLR limits are not a way of achieving a more efficient insurance system but rather an admission that our insurance system is inherently broken. The MLR limit looks to me like a crisis of faith in the free market. What am I missing?

## Can gerrymandering be solved with cut-and-choose?

### October 28th, 2013

Update March 25, 2019: Wesley Pegden, Ariel D. Procaccia, and Dingli Yu have an elegant working out of the proposal below that they call “I cut, you freeze.” Pegden and Procaccia describe it in a Washington Post opinion piece.

 …how to split a cupcake… “Halves” image by flickr user Julie Remizova.

Why is gerrymandering even possible in a country with a constitutional right to equal protection?

No State shall make or enforce any law which shall…deny to any person within its jurisdiction the equal protection of the laws.

By reshaping districts to eliminate the voting power of particular individuals, as modern district mapping software allows, some persons are being denied equal protection, I’d have thought. And so have certain Supreme Court justices.

It’s hard to know what to do about the problem. Appeals to fairness aren’t particularly helpful, since who decides what’s fair? It would be nice to think that requirements of “compact districts of contiguous territory” (as Chief Justice Harlan put it) would be sufficient. But this reduces the problem of districting to a mathematical optimization problem; James Case proposes something like minimum isoperimetric quotient tessellation of a polygon. But such purely mathematical approaches may yield results that violate our intuitions about what is fair. They ignore other criteria, such as “natural or historical boundary lines”, determined for instance by geographical features like rivers and mountains or shared community interests. These boundaries may not coincide with the mathematical optima, so any mathematical formulation would need to be defeasible to take into account such features. This leads us right back to how to decide in which cases the mathematical formulation should be adjusted: who should decide what is fair?

A comment at a ProPublica article about gerrymandering from “damien” caught my attention as a nice way out of this quandary. In essence, he proposes that the parties themselves choose what’s fair.

The first solution to gerrymandering is to have a fitness measure for a proposed districting (e.g. the sum of the perimeters), and then to allow any individual or organisation to propose a districting, with the winner having the best fitness value.

What “damien” is proposing, I take it, is the application of an algorithm somewhat like one familiar from computer science (especially cryptography) and grade school cafeterias known as “cut and choose”. How do you decide how to split a cupcake between two kids? One cuts; the other chooses. The elegance of cut-and-choose is that it harmonizes the incentives of the two parties. The cutter is incentivized to split equally, since the chooser can punish inequity.

Cut-and-choose is asymmetrical; the two participants have different roles. A symmetrical variant has each participant propose a cut and an objective third party selecting whichever is better according to the pertinent objective measure. This variant shares the benefit that each participant has an incentive to be more nearly equal than the other. If Alice proposes a cut that gives her 60% of the cupcake and Bob 40%, she risks Bob proposing a better split that gives her only 45% with him taking the remaining 55%. To avoid getting taken advantage of, her best bet is to propose a split as nearly equal as possible.

In the anti-gerrymandering application of the idea, the two parties propose districtings, which they could gerrymander however they wanted. Whichever of the two proposals has the lower objective function (lower isoperimetric quotient, say) is chosen. Thus, if one party gerrymanders too much, their districting will be dropped in favor of the other party’s proposal. Each party has an incentive to hew relatively close to a compact partition, while being allowed to deviate in appropriate cases.

A nice property of this approach is that the optimization problem doesn’t ever need to be solved. All that is required is the evaluation of the objective function for the two proposed districtings, which is computationally far simpler. (In fact, I’d guess the minimum isoperimetric quotient optimization problem might well be NP-hard.)

There are problems of course. The procedure is subject to gaming when the proposal-generating process is not private to the parties. It is unclear how to extend the method to more than two parties. Of course, the obvious generalization works once the eligible parties are determined. The hard part is deciding what parties are eligible to propose a redistricting. Most critically, the method is subject to collusion, especially in cases where both parties benefit from gerrymandering. In particular, both parties benefit from a districting that protects incumbencies for both parties. The parties could agree, for instance, not to disturb each other’s safe districts, and would benefit from observing the agreement.

Nonetheless, once districting is thought of in terms of mechanism design, the full range of previous algorithms can be explored. Somewhere in the previous literature there might be a useful solution. (Indeed, the proposal here is essentially the first step in Brams, Jones, and Klamler’s surplus procedure for cake-cutting.)

Of course, as with many current political problems (campaign financing being the clearest example), the big question is how such new mechanisms would be instituted, given that it is not in the incumbent majority party’s interest to do so. Until that’s sorted out, I’m not holding out much hope.