Open-Access Society Publishers in Wikidata

[N.B. Views are my own, not in any way officially representative of Harvard.]


In September 2013 Caroline Sutton, Peter Suber, and Amanda Page launched a third edition of the Society Open Access Research (SOAR) Catalog. As of 2019 it exists online as a continuously updated Google spreadsheet.

Recently I chatted with the catalog’s editors and began an independent, volunteer effort to amplify the carefully maintained contents of this list via the public-domain knowledge base Wikidata. My project aims to include a reference to the SOAR Catalog in each relevant Wikidata society and journal entry. Below is a description of the steps involved and some of the issues that have arisen so far.


To find out which societies already had representation in Wikidata, I downloaded the SOAR Catalog as a CSV file, then imported it to a new Google sheet. Each row in the Google spreadsheet was assigned a unique identifier.

The SOAR editors had entered society names in abbreviated form, so I expanded each name to its full form. For example, “Soc” became “Society” and “Amer” became “American”.
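This kind of expansion could also be scripted rather than done by hand. The following is a minimal Python sketch, not the method actually used for this project; the function name `expand_abbreviations` and the mapping are illustrative, and only the “Soc” and “Amer” entries come from the catalog examples above.

```python
import re

# Illustrative mapping: only "Soc" and "Amer" are taken from the article;
# a real run would need a longer, hand-checked list.
ABBREVIATIONS = {
    "Soc": "Society",
    "Amer": "American",
}


def expand_abbreviations(name, mapping=ABBREVIATIONS):
    """Expand whole-word abbreviations, dropping an optional trailing period."""
    alternatives = "|".join(re.escape(key) for key in mapping)
    # \b on both sides keeps "Soc" from matching inside "Society".
    pattern = r"\b(" + alternatives + r")\b\.?"
    return re.sub(pattern, lambda match: mapping[match.group(1)], name)


print(expand_abbreviations("Amer Soc of Agronomy"))
```

Matching on word boundaries means names that are already spelled out pass through unchanged, so the function can be run over the whole column safely.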

The revised spreadsheet was downloaded as a CSV file, then uploaded to OpenRefine as a new project.


In OpenRefine I reconciled the society names against Wikidata. The retrieved information was saved as a new column “Q” in OpenRefine. Wikidata items were found for 526 societies, roughly 60 percent of the 881 unique societies on the SOAR list.

Names were reconciled first as type “organization” (Q43229, including subclasses “learned society”, “scientific society”, “association”, etc.), and then again with no type. During reconciliation, a society sometimes turned out to have multiple Wikidata items, which I examined and merged into one.

I exported the revised data as a CSV file, then imported it back to the Google sheet.


In the Google spreadsheet, I then analyzed the full catalog of 1,043 journals to find patterns. Were there some societies that published more journals than others? What was the geographic distribution by country? When was the catalog data last updated?

I found roughly 881 unique societies listed in the SOAR Catalog.
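The same questions can be answered outside a spreadsheet. Below is a rough Python sketch using `collections.Counter`; the column names `society` and `country` and the three sample rows are made up for illustration and are not taken from the SOAR Catalog.

```python
from collections import Counter


def summarize(rows):
    """Count journals per society and per country.

    `rows` is a sequence of dicts, one per journal, with the
    hypothetical keys "society" and "country".
    """
    rows = list(rows)
    societies = Counter(row["society"] for row in rows)
    countries = Counter(row["country"] for row in rows)
    return {
        "journal_count": len(rows),
        "unique_societies": len(societies),
        "top_societies": societies.most_common(3),
        "journals_by_country": countries,
    }


# Invented sample data, for demonstration only.
sample_rows = [
    {"society": "Society A", "country": "Brazil"},
    {"society": "Society A", "country": "Brazil"},
    {"society": "Society B", "country": "Croatia"},
]
summary = summarize(sample_rows)
print(summary["unique_societies"], summary["top_societies"])
```

With the real catalog, the rows could be read from the exported CSV file with `csv.DictReader`, after adjusting the keys to the actual column headers.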

The following Google Sheets functions were useful:

A few journals with multiple publishers had all of their publishers entered in a single cell. The journal Psychology, Community, and Health, for example, is published by the “SPPS – Sociedade Portuguesa de Psicologia da Saúde, SPPC – Sociedade Portuguesa de Psicologia Comunitária, SPSC – Sociedade Portuguesa de Sexologia Clínica, ISPA – Instituto Universitário”. Data in this format is not easily searchable as is, and should be split into separate cells, one per publisher name.
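One possible way to split such a cell is sketched below in Python. It assumes every publisher entry starts with an abbreviation followed by an en dash, as in the example above; cells that follow a different pattern would need their own rule. The function name `split_publishers` is illustrative.

```python
import re


def split_publishers(cell):
    """Split a multi-publisher cell into one string per publisher.

    Assumption: each entry looks like "ABBR – Full name", so a comma is
    treated as a separator only when an "ABBR – " prefix follows it.
    """
    parts = re.split(r",\s*(?=\S+\s+–\s)", cell)
    return [part.strip() for part in parts if part.strip()]


cell = (
    "SPPS – Sociedade Portuguesa de Psicologia da Saúde, "
    "SPPC – Sociedade Portuguesa de Psicologia Comunitária, "
    "SPSC – Sociedade Portuguesa de Sexologia Clínica, "
    "ISPA – Instituto Universitário"
)
for publisher in split_publishers(cell):
    print(publisher)
```

Splitting only on commas that precede a new abbreviation avoids breaking up a publisher name that itself contains a comma.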


For societies not found through OpenRefine reconciliation, I searched for each name individually in Wikidata and, if found, added its Q identifier to my spreadsheet. If not found, I created a new Wikidata item. Some of the society websites required a bit of translation. To create a new item for the Croatian Physical Society, for example, I verified the society website and double-checked the spelling in Croatian (“Hrvatsko fizikalno društvo”). Google Translate came in handy, as did the browser’s page-source view.
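The individual lookups were done by hand, but they could in principle be scripted against the Wikidata API's `wbsearchentities` action. The Python sketch below only builds the request URL and reads candidate matches out of a decoded JSON response; sending the request is left out, and the helper names `build_search_url` and `extract_candidates` are illustrative.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en", limit=5):
    """Build a wbsearchentities URL that searches Wikidata items by name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def extract_candidates(response):
    """Return (Q identifier, label) pairs from a decoded JSON response."""
    return [
        (entry["id"], entry.get("label", ""))
        for entry in response.get("search", [])
    ]


print(build_search_url("Hrvatsko fizikalno društvo", language="hr"))
```

Any candidates returned this way would still need a human check against the society's website before a Q identifier is recorded or a new item is created.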

As of this writing, eight new Wikidata items have been created for societies (Q67167604, Q67167730, Q67167802, Q67167898, Q67167964, Q67168062, Q67168156, Q67168296). It will take some time to finish the remaining 350 or so names on the list.


The tool QuickStatements was used to annotate Wikidata with references. The CSV-formatted syntax looks like this:
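As a rough illustration only, and not the batch actually submitted for this project, a QuickStatements CSV file could be generated with a few lines of Python. The sketch assumes the link to the catalog is expressed with P1343 (“described by source”), which is a guess at the modeling, and `Q_SOAR_PLACEHOLDER` is a stand-in that must be replaced with the real Wikidata item for the SOAR Catalog.

```python
import csv
import io

# Placeholder, NOT a real Wikidata identifier: substitute the SOAR Catalog item.
SOAR_CATALOG_QID = "Q_SOAR_PLACEHOLDER"


def build_quickstatements_csv(society_qids, catalog_qid=SOAR_CATALOG_QID):
    """Return QuickStatements CSV text with one statement per society.

    Assumption: P1343 ("described by source") is the property used to
    point each society item at the catalog.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["qid", "P1343"])  # header row: item column, then property
    for qid in society_qids:
        writer.writerow([qid, catalog_qid])
    return buffer.getvalue()


print(build_quickstatements_csv(["Q67167604", "Q67167730"]))
```

In the QuickStatements CSV format, the first row is a header whose first column is `qid`, and each later column names a property. Source columns use an `S` prefix, such as `S248` for “stated in”, and could be added if a formal reference on each statement is wanted.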



A dynamic list of societies in Wikidata with references to the SOAR Catalog can be found here:


The next phase of this effort will address Wikidata representation of the 1,043 journals listed in the SOAR Catalog.


OpenRefine reconciliation guide by Antonin Delpeuch

OpenRefine Google group

Google Sheets function list

QuickStatements help

Societies and Open Access Research project, part of the Harvard Open Access Project



Follow me on Twitter: @abritton1630.