Crowdsourcing and Open Access

I gave a short talk earlier today to my colleagues about the open access movement in legal scholarship, about which the three of us here at Info/Law have blogged from time to time (check out our open access tag for more). I used the occasion to go public with my own minor contribution to improving access to primary legal source materials—nothing remotely on the order of Tim Wu’s important work, just something I’ve been tinkering with on my own, but hopefully it will bear fruit and perhaps inspire similar projects in the future.

The House Report on the Copyright Act of 1976 is a key reference in the intellectual property domain, routinely cited by courts in copyright cases. It has been indispensable in resolving disputes as to legislative intent in the face of uncertain statutory text. But so far as I’ve been able to determine, it’s not freely available online (it is, of course, available in the big commercial databases). That’s unfortunate. As has often been noted, the copyright statute is intractably, even maddeningly, vague in places, and the legislative reports have been crucial tools in figuring out just what Congress was trying to do across a host of issues.

Taking advantage of our spiffy new copier, I scanned the entire House Report, working a few pages at a time over the course of a couple of weeks. That left me with a big folder full of TIFF files on my PC, which I scrubbed with the wonderful tool unpaper before converting to PDF. You can now download the completed PDF here, although be warned that it’s a very large file (155 MB): House Report No. 94-1476 (PDF).

Getting the scanned page images online, though, is only part of the battle. What I ultimately would like to see online is the text of the report, freely searchable, copyable, and indexable, rather than just the images. Because I don’t have the time or energy to convert the images to text myself, I’ve thrown the project open as an experiment in crowdsourcing. All my page scans are now available on Wikimedia Commons, and volunteers are slowly converting the raw OCR output to intelligible text on Wikisource. It’s a lengthy document, but given enough eyeballs, as they say. The Wikisource index to the scanned pages already appears on the first page of the Google search results for “House Report 94-1476.” Eventually, this process should produce a fairly well cleaned-up version of the source text.

Assuming this ultimately works (a big “if,” to be sure), what are some other public domain legal source texts that should get the crowdsourcing treatment? Perhaps some old, pre-1923 treatises on copyright or patent law? Some of my colleagues who teach Civil Procedure mentioned the legislative reports on the Rules Enabling Act of 1934. Or should we stick with more contemporary references? The most recent stuff is mostly already available in electronic form, although not exclusively, of course. Where is the greatest need?

UPDATE: The (admittedly daunting) dimensions of the problem of opening access to older legal materials, along with some interesting suggestions for increasing the size of the pool of potential editors and proofreaders, are now being discussed on Wikisource itself following a link to this blog posting. At bottom, like everything else, it’s an issue of resources. Some years ago here at Cincinnati, I’m told, we had a couple of employees in the library scanning and digitizing old UN reports, with a view towards ultimately making them available in electronic form. I’m not sure what ultimately came of that project, which was defunct before I joined the faculty, but it could be a nice proof of concept. Imagine if the law library at every law school in the United States had a person or two on staff whose full-time job it was to digitize old texts and assist in the proofreading process!

3 Responses to “Crowdsourcing and Open Access”

  1. That’s a great idea. We have in the German Wikisource a lot of legal PD texts (laws and court decisions) at:

  2. Distributed Proofreaders has open source Web-based software to facilitate the task of crowdsourced OCR. So far they’ve done 11,000 texts in several different languages. Unfortunately, the instance they host on their Web site accepts only texts destined for Project Gutenberg, i.e. books in the public domain. But the software is available, and someone could install it and run their own instance to crowdsource the OCR of whatever they like.

  3. I think this is a great idea, but you could be looking at a sizeable project, with staff needed to sort and proofread the texts in each student library in the US. It would take some real persuasion to get colleges and professors on board.