
The Longest Now


Anonymizing data on the users of Wikipedia
Wednesday July 25th 2018, 12:22 pm
Filed under: chain-gang,citation needed,Glory, glory, glory,wikipedia

Updated for the new year: with specific things we can all start doing 🙂

Wikipedia currently tracks and stores almost no data about its readers and editors.  This persistently foils researchers and analysts inside the WMF and its projects, and it is largely unnecessary.

Not tracked last I checked: sessions, clicks, where on a page readers spend their time, time spent on page or site, returning users.  There is a small exception: data that can fingerprint a user’s use of the site is stored for a limited time, made visible only to developers and checkusers, in order to combat sockpuppets and spam.

This is all done in the spirit of preserving privacy: not gathering data that could be used by third parties to harm contributors or readers for reading or writing information that some nation or other powerful group might want to suppress.  That is an essential concern, and Wikimedia’s commitment to privacy and pseudonymity is wonderful and needed.

However, the data we need to improve the site and understand how it is used in aggregate doesn’t require storing personally identifiable data that could meaningfully be used to target specific editors.  Rather than throwing out data that we worry would expose users to risk, we should be fuzzing and hashing it to preserve the aggregates we care about.  Browser fingerprints, including the username or IP, can be hashed; timestamps and anything that could be interpreted as geolocation can have noise added to them.
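A minimal sketch of what “hash and fuzz at write time” could look like, assuming a rotating secret salt and a fixed jitter window; this is not existing Wikimedia code, and the field names are illustrative:

```python
import hashlib
import hmac
import random
from datetime import datetime, timedelta

DAILY_SALT = b"rotate-me-every-24h"   # hypothetical secret, discarded on rotation
NOISE_WINDOW_MINUTES = 30             # hypothetical jitter bound for timestamps

def pseudonymize_fingerprint(fingerprint: str) -> str:
    """Keyed hash of a browser fingerprint (user agent + IP + username, etc.).

    While the salt lives, the output can still link requests within the window
    (sessions, return visits); once the salt is discarded, the hash can no
    longer be tied back to a person.
    """
    return hmac.new(DAILY_SALT, fingerprint.encode("utf-8"), hashlib.sha256).hexdigest()

def fuzz_timestamp(ts: datetime) -> datetime:
    """Add uniform noise so exact request times are never stored,
    while hourly and daily aggregates stay usable."""
    jitter = random.uniform(-NOISE_WINDOW_MINUTES, NOISE_WINDOW_MINUTES)
    return ts + timedelta(minutes=jitter)

# Example: what would be written to the analytics store instead of raw data.
record = {
    "visitor": pseudonymize_fingerprint("Mozilla/5.0|198.51.100.7|ExampleUser"),
    "seen_at": fuzz_timestamp(datetime.utcnow()).isoformat(timespec="minutes"),
    "region": "Western Europe",   # coarse bucket, never precise geolocation
}
print(record)
```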

We could then know things such as, for instance:

  • the number of distinct users in a month, by general region
  • how regularly each visitor comes to the projects; which projects + languages they visit [throwing away user and article-title data, but seeing this data across the total population of ~1B visitors]
  • in particular, bounce rates and times: people finding the site, perhaps running one search, and leaving
  • the number of pages viewed in a session, the session’s tempo, and which namespaces those pages are in [throwing away titles]
  • the reading + editing flows of visitors on any single page, aggregated by day or week
  • clickflows from the main page or from search results [this data is gathered to some degree; I don’t know how reusably]

These are just rough descriptions; great care must be taken to vet each aggregate for preserving privacy, but this is a known practice that we could carry out well with expert attention.
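For concreteness, here is one way a single vetted aggregate (distinct visitors per region per month) might be computed from pseudonymized records and released with noise added to the published counts.  The epsilon value and the sensitivity handling are illustrative assumptions, not an existing pipeline:

```python
import random
from collections import defaultdict

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def monthly_distinct_visitors(records, epsilon: float = 1.0):
    """records: dicts like the earlier sketch's output, each with a
    pseudonymous 'visitor' id, a coarse 'region', and a 'seen_at' ISO string.
    Returns noised distinct-visitor counts keyed by (month, region)."""
    buckets = defaultdict(set)
    for r in records:
        month = r["seen_at"][:7]                       # "YYYY-MM"
        buckets[(month, r["region"])].add(r["visitor"])
    # One visitor changes a single bucket's count by at most one; a real
    # release would also cap how many buckets one visitor can appear in.
    return {
        key: max(0, round(len(ids) + laplace_noise(1.0 / epsilon)))
        for key, ids in buckets.items()
    }
```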

What keeps us from doing this today?  Some aspects of this are surely discussed in places, but those discussions are hard to find.  Past discussions I recall were brought to an early end by [devs worrying about legal] or [legal worrying about what is technically possible].

Discussion of obstacles and negative-space is generally harder to find on wikis than discussion of works-in-progress and responses to them: a result of a noun-based document system that requires discussions to be attached to a clearly-named topic!

What we can do, both researchers and data fiduciaries:

  • As site-maintainers: Start gathering this data, and appoint a couple of privacy-focused data analysts to propose how to share it.
    • Identify challenges, open problems, solved problems that need implementing.
  • Name the (positive, future-crafting, project-loving) initiative to do this at scale, and the reasons to do so.
    • By naming the positive aspect, distinguish this from a tentative caveat to a list of bad things to avoid, which leads to inaction.  (“never gather data!  unless you have extremely good reasons, someone else has done it before, it couldn’t possibly be dangerous, and no one could possibly complain.”)
  • As data analysts (internal and external): write about what better data enables.  Expand the list above, include real-world parallels.
    • How would this illuminate the experience of finding and sharing knowledge?
  • Invite other sociologists, historians of knowledge, and tool-makers to start working with stub APIs that at first may not return much data (see the sketch after this list).
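As a seed for that last item, a stub API can be very small.  The endpoint path and response fields below are hypothetical placeholders, and every answer stays empty until a vetted aggregation pipeline exists to fill it:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubAggregatesAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /aggregates/distinct-visitors?month=2018-07&region=Western+Europe
        if self.path.startswith("/aggregates/"):
            body = json.dumps({
                "query": self.path,
                "status": "stub",   # no data released yet
                "results": [],      # vetted aggregates will appear here
            }).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), StubAggregatesAPI).serve_forever()
```

Researchers and tool-makers can write queries and tooling against such a stub today; the responses gain value as each vetted aggregate comes online.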

Without this we remain in the dark; and, like libraries that have found patrons leaving their privacy-preserving (but less helpful) environs for data-hoarding (and very handy) book-explorers, we remain vulnerable to disuse.




Bringing together the people who know how to do something, how much it costs to do, how much it costs over time to leave it undone, and how it fits into broader strategy: this helps make possible many important things.

I was reminded last week of the similar case of organizing national + international coalition-level funds for provisioning vaccines. Getting the finance, development, and health ministers together to analyze the decade-level costs and benefits of vaccination allowed a rational decision that had been prevented when looked at only locally and within a single facet of community-health at a time.

Comment by Samhain 12.28.18 @ 4:55 pm




