Open Access Directory’s List of Blogs: Wikidata Ingest

[N.B. Views are my own, not in any way officially representative of Harvard. I started this blog while working for Harvard, where I was employed until mid-October 2019. To date I still have access to the account, even though I am no longer affiliated.]

 

Simmons University in the U.S. hosts the Open Access Directory, a community-maintained wiki about the global movement for open access to scholarship. The directory was founded in 2008 by open access experts Peter Suber and the late Robin Peek. 

In this post, I describe preparation and ingest of the directory’s list of “Blogs about Open Access” (oad.simmons.edu/oadwiki/Blogs_about_OA) to Wikidata, the free and open knowledge base. The goal: to increase discoverability. 

1. Determine notability

The Open Access Directory is a standard reference source about the open access movement. Most of its listings meet the Wikidata notability requirement (www.wikidata.org/wiki/Wikidata:Notability).

2. Structure data

In this case, the directory’s wiki page contained a list of more than 300 blog titles and URLs in a somewhat consistent pattern. This semi-structured data still required some reformatting before upload to Wikidata.

First, I pasted the content into a plain text document, then used the find and replace function to clean up the formatting.

Eventually the text was ready to be pasted into a spreadsheet, where a bit more cleanup was required.
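
The find-and-replace cleanup can also be scripted. Here is a minimal sketch, assuming each list entry looks like a MediaWiki external link such as "* [https://example.org Example Blog]". The directory’s actual markup may differ, and parseWikiLinks is a hypothetical helper name of my own.

```javascript
// Hypothetical helper: turn MediaWiki-style list lines into {title, url} rows.
// Assumes entries look like "* [https://example.org Example Blog]".
function parseWikiLinks(text) {
  var pattern = /\[(https?:\/\/\S+)\s+([^\]]+)\]/;
  return text.split(/\r?\n/)
    .map(function (line) { return line.match(pattern); })
    .filter(function (match) { return match !== null; }) // skip non-link lines
    .map(function (match) {
      return { title: match[2].trim(), url: match[1] };
    });
}
```

Lines that do not match the pattern are dropped silently, so it is worth comparing the row count against the original list.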

3. Evaluate data

Some of the directory’s URLs had become outdated. To check for broken links, I followed Adham El Banhawy’s illustrated instructions for creating a short JavaScript script in Google Sheets.

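function getStatusCode(url) {
  // Return error responses instead of throwing, and do not follow redirects,
  // so that 301 and 302 codes are reported as-is.
  var options = {
    'muteHttpExceptions': true,
    'followRedirects': false
  };
  // Remove stray whitespace left over from copying and pasting.
  var url_trimmed = url.trim();
  var response = UrlFetchApp.fetch(url_trimmed, options);
  return response.getResponseCode();
}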

The returned HTTP status codes (200, 301, 302, 403, 404, 410, 500, 503, etc.) were interpreted with httpstatuses.com.
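
To sort a long column of codes into groups for review, something like the following could help. It is only a sketch: classifyStatus and its category labels are my own invention, not part of the script above.

```javascript
// Hypothetical helper: group an HTTP status code into a review category.
function classifyStatus(code) {
  if (code >= 200 && code < 300) return "ok";
  if (code >= 300 && code < 400) return "redirect: check the new location";
  if (code === 404 || code === 410) return "gone: look for an archived copy";
  if (code >= 400 && code < 500) return "client error: check by hand";
  if (code >= 500 && code < 600) return "server error: retry later";
  return "unknown";
}
```

A 403, for example, lands in "check by hand" because some sites refuse automated requests even when the page loads fine in a browser.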

4. Reconcile

I downloaded the cleaned spreadsheet as a CSV file and uploaded it to OpenRefine, where I reconciled the title and URL columns against Wikidata. For this particular content, I was not surprised that only a few blogs matched existing Wikidata items. For the unmatched blogs, I selected Reconcile > Actions > Create a new item for each cell.

5. Build schema 

OpenRefine’s Wikidata extension includes an “edit Wikidata schema” option, which allows one to build an import template for a group of items that have properties and references in common.

A schema is put together by dragging and dropping the reconciled column icons and by typing the remaining statement components manually. In this case, all items were instances of blogs, all had official websites, and all had open access as their subject.

Each statement also included the same source: the Wikidata representation of the Open Access Directory page “Blogs About Open Access” (www.wikidata.org/wiki/Q56229362).
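
The shape of each item can be pictured as a small data structure. This is only a sketch and not OpenRefine’s own schema format. The property IDs (P31, P856, P921, P248) and the item IDs for “blog” and “open access” are my assumptions and should be verified on wikidata.org; only Q56229362 comes from this post. buildStatements is a hypothetical helper.

```javascript
// IDs below are assumptions to verify on wikidata.org before use,
// except Q56229362, which is the source item named in this post.
var BLOG = "Q30849";          // assumed: item for "blog"
var OPEN_ACCESS = "Q232932";  // assumed: item for "open access"
var OAD_BLOG_LIST = "Q56229362";

// Hypothetical helper: describe the shared statements for one blog row.
function buildStatements(title, url) {
  var reference = { P248: OAD_BLOG_LIST }; // assumed: "stated in"
  return {
    label: title,
    statements: [
      { property: "P31", value: BLOG, reference: reference },         // instance of
      { property: "P856", value: url, reference: reference },         // official website
      { property: "P921", value: OPEN_ACCESS, reference: reference }  // main subject
    ]
  };
}
```

Only the label and URL vary from row to row, which is why a single schema can cover the whole batch.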

I previewed the schema prior to uploading. OpenRefine automatically found a few duplicates, which I then fixed. 
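
Duplicates can also be spotted before upload. The sketch below is one possible approach and not how OpenRefine itself detects them; normalizeUrl and findDuplicateUrls are hypothetical helpers. Lowercasing the whole URL is a deliberate simplification, since paths can be case-sensitive.

```javascript
// Hypothetical helper: reduce a URL to a rough comparison key.
function normalizeUrl(url) {
  return url.trim().toLowerCase()
    .replace(/^https?:\/\//, "") // ignore the scheme
    .replace(/^www\./, "")       // ignore a leading "www."
    .replace(/\/+$/, "");        // ignore trailing slashes
}

// Return every URL whose key was already seen earlier in the list.
function findDuplicateUrls(urls) {
  var seen = new Set();
  var duplicates = [];
  urls.forEach(function (url) {
    var key = normalizeUrl(url);
    if (seen.has(key)) {
      duplicates.push(url);
    } else {
      seen.add(key);
    }
  });
  return duplicates;
}
```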

6. Check quality 

After the OpenRefine batch uploader created new items in Wikidata, I carefully reviewed each one. As a next step, further details such as language, editor, license, motto, and date of inception would be welcome additions.

Several blogs had live websites but otherwise seemed inactive. This information will be recorded in both Wikidata and the directory.

Finally, Wikidata Q numbers for each blog will be listed in the directory.

 

 

Resources

OpenRefine reconciliation guide by Antonin Delpeuch
https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation

OpenRefine Google group
https://groups.google.com/forum/#!forum/openrefine

Wikidata help
https://www.wikidata.org/wiki/Help:Contents

Getting started as an Open Access Directory contributor
http://oad.simmons.edu/oadwiki/Getting_started_as_an_OAD_contributor

 

 

Anniversaries of Open Access Policies


 

A dynamic list of more than 650 open access policies and their dates of adoption can be found in Wikidata: https://w.wiki/8DF. The list contains many facts first collected in the UK-based Registry of Open Access Repositories Mandates and Policies (ROARMAP).

Czech Academy of Sciences example

According to the list, the Czech Academy of Sciences adopted an open access policy on 14 October 2010. To verify the date I followed the ROARMAP link:
http://roarmap.eprints.org/99/

…which led to a broken link to the policy:
http://www.cas.cz/o_avcr/zakladni_informace/dokumenty/politika-otevreneho-pristupu.html

The Internet Archive’s Wayback Machine has a copy of the policy as it appeared online in 2016:
https://web.archive.org/web/20160305014822/http://www.avcr.cz/o_avcr/zakladni_informace/dokumenty/politika-otevreneho-pristupu.html
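
Wayback Machine addresses like the one above follow a visible pattern: a 14-digit timestamp (YYYYMMDDhhmmss) followed by the original address. A small sketch, with buildWaybackUrl as a hypothetical helper name:

```javascript
// Build a Wayback Machine URL from a 14-digit timestamp (YYYYMMDDhhmmss)
// and the original address.
function buildWaybackUrl(timestamp, originalUrl) {
  if (!/^\d{14}$/.test(timestamp)) {
    throw new Error("Expected a 14-digit timestamp, got: " + timestamp);
  }
  return "https://web.archive.org/web/" + timestamp + "/" + originalUrl;
}
```

This only assembles the address. It does not check that a snapshot exists for that timestamp.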

In Wikidata, I added the Waybacked URL as a reference, along with the Czech-language policy title.
https://www.wikidata.org/w/index.php?title=Q42297364&type=revision&diff=1027885811&oldid=728433059

Then I turned to the Wikidata item for the Czech Academy of Sciences:
https://www.wikidata.org/wiki/Q374686

…and added the date of the policy as a “significant event” in the history of the academy.

https://www.wikidata.org/wiki/Property:P793

https://www.wikidata.org/w/index.php?title=Q374686&type=revision&diff=1027888236&oldid=1004027816

 

In English-language Wikipedia, I added a “History” section to the Czech Academy of Sciences article, along with a sentence about the OA policy.
https://en.wikipedia.org/w/index.php?title=Czech_Academy_of_Sciences&type=revision&diff=920220137&oldid=918041186

As a reference, I used a Waybacked copy of an OpenAIRE country profile from 2016, found by following links in the Wikidata item for open access in the Czech Republic.
https://www.wikidata.org/wiki/Q56064875

https://web.archive.org/web/20160805223240/https://www.openaire.eu/oa-czech

 

Want to contribute? Get started here. 

 

Update: Open-Access Society Publishers in Wikidata


 

As described in my post of 6 September 2019, I am working on adding SOAR Catalog references to Wikidata. Here is an ongoing list as of 11 September 2019. It will be revised in the near future.

 

Open-Access Society Publishers in Wikidata


 

In September 2013, Caroline Sutton, Peter Suber, and Amanda Page launched a third edition of the Societies and Open Access Research (SOAR) Catalog. As of 2019 it exists online as a continuously updated Google spreadsheet.

Recently I chatted with the catalog’s editors and began an independent, volunteer effort to amplify the carefully maintained contents of this list via the public-domain knowledge base Wikidata. My project aims to include a reference to the SOAR Catalog in each relevant Wikidata society and journal entry. Below is a description of the steps involved and some of the issues that have arisen so far.

PREPARATION

To find out which societies already had representation in Wikidata, I downloaded the SOAR Catalog as a CSV file, then imported it to a new Google sheet. Each row in the Google spreadsheet was assigned a unique identifier.

The SOAR editors had entered society names in abbreviated form, so I changed each name to its fullest form. For example “Soc” was changed to “Society” and “Amer” to “American” and so on.
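
The same expansion could be scripted. In this sketch only “Soc” and “Amer” come from the catalog. The helper name expandAbbreviations and the sample input are hypothetical, and a real list would need many more entries.

```javascript
// Only "Soc" and "Amer" come from the post; extend the map as needed.
var ABBREVIATIONS = { "Soc": "Society", "Amer": "American" };

// Hypothetical helper: replace each known abbreviation with its full form.
function expandAbbreviations(name) {
  return Object.keys(ABBREVIATIONS).reduce(function (result, abbr) {
    // Match the abbreviation as a whole word, with an optional trailing period.
    var pattern = new RegExp("\\b" + abbr + "\\b\\.?", "g");
    return result.replace(pattern, ABBREVIATIONS[abbr]);
  }, name);
}
```

The word-boundary match leaves words that are already spelled out, such as “Society” or “America”, unchanged.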

The revised spreadsheet was downloaded as a CSV file, then uploaded to OpenRefine as a new project.

OPENREFINE

In OpenRefine I reconciled the society names against Wikidata. The retrieved information was saved as a new column “Q” in OpenRefine. Wikidata items were found for 526 societies, roughly half the total on the SOAR list.

Names were reconciled first as type “organization” (Q43229, including subclasses “learned society”, “scientific society”, “association”, etc.), and then again with no type. During reconciliation, a society sometimes turned out to have multiple Wikidata items, which I examined and merged into one.

I exported the revised data as a CSV file, then imported it back to the Google sheet.

ANALYSIS

In the Google spreadsheet, I then analyzed the full catalog of 1,043 journals to find patterns. Were there some societies that published more journals than others? What was the geographic distribution by country? When was the catalog data last updated?

I found roughly 881 unique societies listed in the SOAR Catalog.

The following Google Sheets functions were useful:
=UNIQUE
=VLOOKUP
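
For readers who prefer a script to spreadsheet formulas, the question of which societies publish the most journals can be sketched as follows. The row shape, with a “society” field, is my own assumption and not the catalog’s actual column name. countJournalsBySociety is a hypothetical helper.

```javascript
// Hypothetical helper: count journals per society, most prolific first.
// Assumes each row is an object with a "society" field (my naming).
function countJournalsBySociety(rows) {
  var counts = new Map();
  rows.forEach(function (row) {
    counts.set(row.society, (counts.get(row.society) || 0) + 1);
  });
  return Array.from(counts.entries()).sort(function (a, b) {
    return b[1] - a[1];
  });
}
```

The length of the returned array is the number of unique societies, which is what =UNIQUE yields in the sheet.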

A few of the journals with multiple publishers had been entered into single cells. The journal Psychology, Community, and Health, for example, is published by the “SPPS – Sociedade Portuguesa de Psicologia da Saúde, SPPC – Sociedade Portuguesa de Psicologia Comunitária, SPSC – Sociedade Portuguesa de Sexologia Clínica, ISPA – Instituto Universitário”. The format of this data is not easily searchable as is, and should be parsed into several cells for each publisher name.
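
One way to parse such a cell is sketched below. It assumes publishers are separated by commas and that each acronym is joined to its name by a spaced en dash, as in the example above. A society name that itself contains a comma would be split wrongly, so the output needs review. splitPublishers is a hypothetical helper.

```javascript
// Hypothetical helper: split a multi-publisher cell into separate entries.
function splitPublishers(cell) {
  return cell.split(/,\s*/).map(function (part) {
    var pieces = part.split(/\s+–\s+/); // en dash, as used in the catalog cell
    return pieces.length === 2
      ? { acronym: pieces[0].trim(), name: pieces[1].trim() }
      : { acronym: "", name: part.trim() };
  });
}
```

Applied to the Psychology, Community, and Health cell, this gives four entries, one per publisher.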

WIKIDATA

For societies not found through OpenRefine reconciliation, I searched for each name individually in Wikidata and, if found, added its Q identifier to my spreadsheet. If not found, I created a new Wikidata item. Some of the society websites required a bit of translation. To create a new item for the Croatian Physical Society, for example, I verified the society website and double-checked the spelling in Croatian (“Hrvatsko fizikalno društvo”). Google Translate came in handy, as did the browser’s page source view.

As of this writing, some eight new Wikidata items have been created for societies (Q67167604, Q67167730, Q67167802, Q67167898, Q67167964, Q67168062, Q67168156, Q67168296). It will take some time to finish the remaining 350 names on the list.

QUICKSTATEMENTS

The tool QuickStatements was used to annotate Wikidata with references. The CSV-formatted syntax looks like this:

qid,P31,S248,s585
Q1376656,Q45400320,Q55823083,+2019-00-00T00:00:00Z/9
Q8035326,Q45400320,Q55823083,+2019-00-00T00:00:00Z/9
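
The date value at the end of each row encodes a year: the “/9” suffix is the Wikidata precision code for year, and the month and day are zeroed out. A small sketch for producing such values, with formatYear as a hypothetical helper name:

```javascript
// Format a year as a Wikidata time value with year precision.
// Precision codes: 9 = year, 10 = month, 11 = day.
function formatYear(year) {
  return "+" + String(year).padStart(4, "0") + "-00-00T00:00:00Z/9";
}
```

For 2019 this yields the same value used in the rows above.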

RESULTS

A dynamic list of societies in Wikidata with references to the SOAR Catalog can be found here: https://w.wiki/8Cw

NEXT STEPS

The next phase of this effort will address Wikidata representation of the 1,043 journals listed in the SOAR Catalog.

RECOMMENDED RESOURCES

OpenRefine reconciliation guide by Antonin Delpeuch
https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation

OpenRefine Google group
https://groups.google.com/forum/#!forum/openrefine

Google Sheets function list
https://support.google.com/docs/table/25273

QuickStatements help
https://www.wikidata.org/wiki/Help:QuickStatements#Add_statement_with_sources

Societies and Open Access Research project, part of the Harvard Open Access Project
bit.ly/hoap-soar