The Longest Now


Anonymizing data on the users of Wikipedia
Wednesday July 25th 2018, 12:22 pm
Filed under: chain-gang,citation needed,Glory, glory, glory,wikipedia

Wikipedia currently tracks and stores almost no data about its readers and editors.  This persistently foils researchers and analysts inside the WMF and its projects, and is largely unnecessary.

Not tracked last I checked: sessions, clicks, where on a page readers spend their time, time spent on page or site, returning users.  There is a small exception: data that can fingerprint a user’s use of the site is stored for a limited time, made visible only to developers and checkusers, in order to combat sockpuppets and spam.

This is all done in the spirit of preserving privacy: not gathering data that could be used by third parties to harm contributors or readers for reading or writing information that some nation or other powerful group might want to suppress.  That is an essential concern, and Wikimedia’s commitment to privacy and pseudonymity is wonderful and needed.

However, the data we need to improve the site and understand how it is used in aggregate doesn’t require storing personally identifiable data that can be meaningfully used to target specific editors. Rather than throwing out data that we worry would expose users to risk, we should be fuzzing and hashing it to preserve the aggregates we care about.  Browser fingerprints, including the username or IP, can be hashed; timestamps and anything that could be interpreted as geolocation can have noise added to them.
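A minimal sketch of that hashing and fuzzing, under some assumptions of mine: the function names, the keyed hash (HMAC-SHA256), and the noise ranges are illustrative choices, not anything Wikimedia runs. A keyed hash is used because a plain hash of an IPv4 address can be reversed by simply enumerating the address space; the key here is generated per run, whereas a real pipeline would need a protected, persistent (and periodically rotated) key.

```python
import hashlib
import hmac
import random
import secrets

# Hypothetical secret key; a real system would store it apart from the data.
SECRET_KEY = secrets.token_bytes(32)


def hash_identifier(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of a username, IP, or browser fingerprint."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def add_noise(value: float, max_jitter: float) -> float:
    """Shift a number by uniform noise of at most max_jitter in either direction."""
    return value + random.uniform(-max_jitter, max_jitter)


def anonymize_event(user: str, epoch_seconds: float, lat: float, lon: float) -> dict:
    """Replace the identifying fields of one page-view event with fuzzed stand-ins."""
    return {
        "user": hash_identifier(user),
        "time": add_noise(epoch_seconds, 300.0),  # assumed: up to 5 minutes of jitter
        "lat": add_noise(lat, 0.5),               # assumed: up to half a degree
        "lon": add_noise(lon, 0.5),
    }
```

The same visitor still maps to the same token, so distinct-user and return-visit counts survive, while the stored record no longer carries the raw username, IP, or exact time and place.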

We could then know things such as, for instance:

  • the number of distinct users in a month, by general region
  • how regularly each visitor comes to the projects; which projects + languages they visit [throwing away user and article-title data, but seeing this data across the total population of ~1B visitors]
  • bounce rates and times in particular: people finding the site, perhaps running one search, and leaving
  • the number of pages viewed in a session, its tempo, or the namespaces they are in [throwing away titles]
  • the reading + editing flows of visitors on any single page, aggregated by day or week
  • clickflows from the main page or from search results [this data is gathered to some degree; I don’t know how reusably]
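As a rough illustration of the first item in this list, here is a sketch that counts distinct visitors per region from already-hashed identifiers and suppresses small groups. The event shape, the function name, and the threshold of 50 are all assumptions for the example; choosing a safe threshold is exactly the kind of vetting that needs expert attention.

```python
from collections import defaultdict

# Hypothetical cut-off: regions with fewer distinct visitors are not reported.
MIN_COUNT = 50


def distinct_users_by_region(events, min_count=MIN_COUNT):
    """Count distinct hashed visitors per region, dropping small groups.

    `events` is an iterable of (hashed_user_id, region) pairs.
    """
    users = defaultdict(set)
    for user_hash, region in events:
        users[region].add(user_hash)
    return {
        region: len(ids)
        for region, ids in users.items()
        if len(ids) >= min_count
    }
```

Only the per-region totals leave this function; the hashed identifiers themselves can be discarded once the month is tallied.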

These are just rough descriptions; great care must be taken to vet each aggregate for preserving privacy.  But this is a known practice that we could carry out with expert attention.

What keeps us from doing this today?  This may be discussed somewhere, but the past discussions I recall were brought to an early end by [devs worrying about legal] or [legal worrying about what is technically possible].  It seems to me this would help tremendously in improving our understanding of the projects, our participants, and their experience of the wikis.  Without that we’re rather in the dark.  And we would draw in many great sociologists, historians of knowledge, and data scientists (as outside researchers, perhaps as staff) by having richer material for them to work with.



Archiving Web links: Building global layers of caches and mirrors
Sunday June 12th 2016, 4:23 pm
Filed under: international,knowledge,meta,metrics,popular demand,wikipedia

The Web is highly distributed and in flux; the people using it, even more so.  Many projects exist to optimize its use, including:

  1. Reducing storage and bandwidth:  compressing parts of the web; deduplicating files that exist in many places, replacing many with pointers to a single copy of the file [Many browsers & servers, *Box]
  2. Reducing latency and long-distance bandwidth:  caching popular parts of the web locally around the world [CDNs, clouds, &c]
  3. Increasing robustness & permanence of links: caching linked pages (with timestamps or snapshots, for dynamic pages) [Memento, Wayback Machine, perma, amber]
  4. Increasing interoperability of naming schemes for describing or pointing to things on the Web, so that it’s easier to cluster similar things and find copies or versions of them [HvdS’s 15-year overview of advancing interop]

This week I was thinking about the 3rd point. What would a comprehensively backed-up Web of links look like?  How resilient can we make references to all of the failure modes we’ve seen and imagined?  Some threads for a map:

  1. Links should include timestamps; important ones should request archival permalinks.
    • When creating a reference, sites should notify each of the major cache-networks, asking them to store a copy.
    • Robust links can embed information about where to find a cache in the <a> tag that generates the link (and possibly a fuzzy content hash?).
    • Permalinks can use an identifier system that allows searching for the page across any of the nodes of the local network, and across the different cache-networks. (Browsers can know how to attempt to find a copy.)
  2. Sites should have a coat of amber: a local cached snapshot of anything linked from that site, stored on their host or a nearby supernode.  So as long as that site is available, snapshots of what it links to are, too.
    • We can comprehensively track whether sites have signalled they have an amber layer.  If a site isn’t yet caching what they link to, readers can encourage them to do so or connect them to a supernode.
    • Libraries should host amber supernodes: caches for sites that can’t host those snapshots on their host machine.
  3. Snapshots of entire websites should be archived regularly
    • Both public snapshots for search engines and private ones for long-term archives.
  4. A global network of mirrors (a la [C]LOCKSS) should maintain copies of permalink and snapshot databases
    • Consortia of libraries, archives, and publishers should commit to a broad geographic distribution of mirrors.
      • mirrors should be available within any country that has expensive interconnects with the rest of the world;
      • prioritization should lead to a kernel of the cached web that is stored in ‘seed bank’ style archives, in the most secure vaults and other venues
  5. There should be a clear way to scan for fuzzy matches for a broken link. Especially handy for anyone updating a large archive of broken links.
    • Is the base directory there? Is the base URL known to have moved?
    • Are distant-timestamped versions of the file available?  [some robustlink implementations do this already]
    • Are there exact matches elsewhere in the web for a [rare] filename?  Can you find other documents with the same content hash? [if a hash was included in the link]
    • Are there known ways to contact the original owner of the file/directory/site?
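To make threads 1 and 5 a little more concrete, here is a small sketch of a link that carries its own recovery hints, plus a lookup for copies by content hash. The `data-versionurl` and `data-versiondate` attribute names follow the Robust Links convention as I understand it; `data-contenthash` is a hypothetical attribute of my own, and the hash is a plain SHA-256, so it finds only exact copies rather than fuzzy matches.

```python
import hashlib
from html import escape


def content_hash(body: bytes) -> str:
    """Exact (not fuzzy) fingerprint of a page body."""
    return hashlib.sha256(body).hexdigest()


def robust_link(url, snapshot_url, snapshot_date, text, body=None):
    """Build an <a> tag that records where and when a cached copy was made."""
    attrs = {
        "href": url,
        "data-versionurl": snapshot_url,
        "data-versiondate": snapshot_date,
    }
    if body is not None:
        # Hypothetical attribute, not part of any convention I know of.
        attrs["data-contenthash"] = content_hash(body)
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<a {rendered}>{escape(text)}</a>"


def find_copies(wanted_hash, candidates):
    """Return the URLs in `candidates` (a url -> bytes mapping) whose body matches."""
    return [url for url, body in candidates.items() if content_hash(body) == wanted_hash]
```

A link-repair tool could walk an archive, read these attributes from each broken link, try the snapshot URL first, and fall back to `find_copies` across whatever mirrors it can reach.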

Related questions: What other aspects of robustness need consideration? How are people making progress at each layer?  What more is needed to have a mesh of archived links at every scale? For instance, WordPress supports a chunk of the Web; top CDNs cache more than that. What other players can make this happen?  What is needed for them to support this?



Reader: Discover the effect of happiness on your health today
Wednesday November 25th 2015, 11:44 pm
Filed under: %a la mod,wikipedia

“When I was 5 years old, my mother always told me that happiness was the key to life.  When I went to school, they asked me what I wanted to be when I grew up.  I wrote down happy. They told me I didn’t understand the assignment, and I told them they didn’t understand life.”  —Lennon

From the BODYWORLDS exhibit in Amsterdam, full of flayed and preserved human bodies.



WMF Audit Committee update – Call for Volunteers
Friday June 05th 2015, 7:07 pm
Filed under: wikipedia

The Wikimedia Foundation has an Audit Committee that represents its Board in overseeing financial and accounting matters.  This includes reviewing the foundation’s financials, its annual tax return, and an independent audit by KPMG. For details, and the current committee members, see the WMF’s Audit Committee page and the Audit Committee charter.

I currently serve as the Audit Committee chair.  We are forming the committee for 2015-16, and are looking for volunteers from the community.

Members serve on the Committee for one year, from July through July.  The Foundation files its annual tax return in the U.S. in April, and publishes its annual plan in June.  Committee members include trustees from the Foundation’s board and contributors from across the Wikimedia movement.

Time commitment for the committee is modest: reviews are carried out via three or four conference calls over the course of the year.  The primary requirement is financial literacy: some experience with finance, accounting or auditing.

If you are interested in joining the Committee for the coming year, please email me at sj at wikimedia.org with your CV, and your thoughts on how you could contribute. Thank you!



Soft, distributed review of public spaces: Making Twitter safe
Monday October 27th 2014, 2:56 pm
Filed under: %a la mod,ideonomy,knowledge,popular demand,wikipedia

Successful communities have learned a few things about how to maintain healthy public spaces. We could use a handbook for community designers gathering effective practices. It is a mark of the youth of interpublic spaces that spaces such as Twitter and Instagram [not to mention niche spaces like Wikipedia, and platforms like WordPress] rarely have architects dedicated to designing and refining this aspect of their structure, toolchains, and workflows.

Some say that ‘overly’ public spaces enable widespread abuse and harassment. But the “publicness” of large digital spaces can make them more welcoming, in some ways, than physical spaces (where it is harder to remove graffiti or eggs from homes or buildings) and niche ones (where clique formation and systemic bias can dominate). For instance, here are a few ‘soft’ (reversible, auditable, post-hoc) tools that let a mixed ecosystem review and maintain their own areas in a broad public space:

Allow participants to change the visibility of comments:  Let each control what they see, and promote or flag it for others.

  • Allow blacklists and whitelists, in a way that lets people block out harassers or keywords entirely if they wish. Make it easy to see what has been hidden.
  • Ratings (both average and variance) and tags for abuse or controversy can allow for locally flexible display.  Some simple models make this hard to game.
  • Allow things to be incrementally hidden from view.  Group feedback is more useful when the result is a spectrum.
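One way to picture the rating idea: the sketch below turns a list of ratings into a visibility level on a spectrum, and uses the variance to mark a comment as controversial rather than simply bad. The 1-to-5 scale, the variance threshold, and the function name are assumptions for illustration only.

```python
from statistics import mean, pvariance

# Hypothetical threshold: ratings this spread out suggest disagreement.
CONTROVERSY_VARIANCE = 2.0


def summarize_ratings(ratings):
    """Map ratings on a 1-5 scale to a visibility in [0, 1] and a controversy tag."""
    if not ratings:
        # Unrated comments are shown in full by default.
        return {"visibility": 1.0, "controversial": False}
    average = mean(ratings)
    spread = pvariance(ratings)
    visibility = min(1.0, max(0.0, (average - 1) / 4))
    return {"visibility": visibility, "controversial": spread >= CONTROVERSY_VARIANCE}
```

A reader's own preferences could then decide what to do with these two values: fade low-visibility comments, collapse them, or show controversial ones with a marker.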

Increase the efficiency ratio of moderation and distribute it: automate review, filter and slow down abuse.

  • Tag contributors by their level of community investment. Many who spam or harass try to cloak in new or fake identities.
  • Maintain automated tools to catch and limit abusive input. There’s a spectrum of response: from letting only the poster and moderators see the input (cocooning), to tagging and not showing by default (thresholding), to simply tagging as suspect (flagging).
  • Make these and other tags available to the community to use in their own preferences and review tools
  • For dedicated abuse: hook into penalties that make it more costly for those committed to spoofing the system.
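The spectrum of responses above (flagging, thresholding, cocooning) can be sketched as a small decision function. The abuse score is assumed to come from some automated classifier that is not shown, and the cut-offs and the new-account penalty are made-up numbers, not recommendations.

```python
from enum import Enum


class Action(Enum):
    SHOW = "show"            # displayed normally
    FLAG = "flag"            # displayed, tagged as suspect
    THRESHOLD = "threshold"  # tagged and hidden unless a reader opts in
    COCOON = "cocoon"        # visible only to the poster and moderators


def choose_action(abuse_score: float, account_age_days: int) -> Action:
    """Pick a response from a 0-1 abuse score and the poster's community investment."""
    if account_age_days < 7:
        # Assumed penalty: brand-new accounts get slightly less benefit of the doubt.
        abuse_score = min(1.0, abuse_score + 0.1)
    if abuse_score >= 0.9:
        return Action.COCOON
    if abuse_score >= 0.7:
        return Action.THRESHOLD
    if abuse_score >= 0.5:
        return Action.FLAG
    return Action.SHOW


def can_see(action: Action, is_author: bool, is_moderator: bool, opted_in: bool = False) -> bool:
    """Decide whether a given viewer is shown the input."""
    if action is Action.COCOON:
        return is_author or is_moderator
    if action is Action.THRESHOLD:
        return is_author or is_moderator or opted_in
    return True
```

Because every outcome is a tag rather than a deletion, each decision stays reversible and auditable, and the same tags can feed readers' own filters.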

You can’t make everyone safe all of the time, but you can dial down behavior that is socially unwelcome (by any significant subgroup) by a couple of orders of magnitude.  Of course these ideas are simple and only go so far.  For instance, in a society at civil war, where each half is literally threatened by the sober political and practical discussions of the other half, public speech may simply not be safe.



Utter License, n.: A minimal way to grant all rights to a work
Tuesday October 21st 2014, 3:03 am
Filed under: %a la mod,Aasw,null,poetic justice,wikipedia

[You may do UTTERLY ANYTHING with this work.]

UTTER ♥2

 

Utter details and variants



Lila Tretikov named as Wikimedia’s upcoming ED
Thursday May 01st 2014, 5:49 pm
Filed under: fly-by-wire,ideonomy,knowledge,popular demand,wikipedia

And there was much rejoicing. Welcome, Lila!



Aaron Swartz hackfests this weekend around the world: honoring his work
Friday November 08th 2013, 7:04 pm
Filed under: Aasw,Glory, glory, glory,international,knowledge,meta,metrics,popular demand,wikipedia

Help continue projects Aaron believed in, in person or online.
I’ll be at the Cambridge event and aftermath throughout the long weekend.

Related project summaries:



Wikipedia The Movie: the maddest thing I’ve read in some time
Wednesday October 30th 2013, 9:30 pm
Filed under: Glory, glory, glory,indescribable,poetic justice,wikipedia

Wikipedia The Movie, a wiki-amusement started by Mark Pellegrini during the Chrome Age of Wikipedia, is hard to describe, and not exactly what you might expect. It is a surreal cataclysm of in-jokes pretzled together into a tilable shape. Its pieces have been polished by those who appreciate it: editors with a sense of humor, reflecting on a larger community whose relationship with humor is more nuanced. In short Dalí-scented scenes, and the language of cafeteria gossip, it captures something about the projects in a way that is honest to the madness of humanity. Enough to make any committed editor wince/smile.  It makes me wonder what a similarly frank slice of subtext would look like for other large-scale projects.

While I remember the original being written – not called ‘Episode 1’ at the time – I only discovered last year that it had been turned into a franchise, slowly unfolding year after year. And I can’t complain that I was cast as my favorite Shakespearean fragranceur.

Wikipedia the Movie

And then there is the musical version. For you who have shared the private hallucinations of those who breathe too deep of lemony Huggle vapour, give over a few minutes of your day to a stroll down memory’s phantasmagoria:

Wikipedia the Movie, Episode 1 ·  Wikipedia: The Musical



Women wield wiki: Ada Lovelaceathon returns next week:
Thursday October 10th 2013, 7:03 pm
Filed under: Uncategorized,wikipedia

Thanks to Maia W and all who are making the Adathons in Boston happen!



LOC is down. Archive.org remains up. What can this teach us?
Tuesday October 01st 2013, 3:51 pm
Filed under: popular demand,wikipedia

Thanks to the US government shutdown, the Library of Congress website went down today.  So did NASA, the NSF, the USDA, the FTC, and the National Park Service.  On the other hand, privately-run websites such as the Internet Archive (and, thankfully, its glorious Wayback Machine!) remain online and unaffected by these sorts of government changes.

As we plan for making our Internet more robust in the future, we should make sure to avoid single (or small-group) points of failure, as even services run by major corporations or governments can go offline in a hurry.  (People who live their lives on Google tools: I’m looking at you 🙂 )

Wikipedia itself should be sure to support a thriving mirror network, and should probably move towards a fully distributed cache-and-forward model where possible.



A New ‘Pedia: planning for the future of Wikipedia
Saturday August 10th 2013, 2:58 am
Filed under: citation needed,Glory, glory, glory,Uncategorized,wikipedia

Wikipedia has gotten more elaborate and complex to use. Adding a reference, marking something for review, uploading a file or creating a new article now take many steps, and failing to follow them can lead to starting all over. The curators of the core projects are concerned with uniformly high quality, and impatient with contributors who don’t have the expertise and wiki-experience to create something according to policy. Good stubs or photos are deleted for failing to comply with one of a dozen policies, or for inadequate cites or license templates, even when they are in fact derived from reliable sources and freely licensed.

The Article Creation Wizard has a five-step process for drafting an article, after which it is submitted for review by a team of experienced editors, and finally moved to the article namespace. Seven steps for approval is too much overhead for many.  And the current notability guidelines on big Wikipedias exclude most local and specialist knowledge.

We need a simpler scratch-space to develop new material:

  • A place not designed to be high quality, where everything can be in flux, possibly wrong, in need of clarification and polishing and correction.
  • A place that can be used to build draft articles, images, and other media before posting them to Wikipedia
  • A place where everyone is welcome to start a new topic, and share what they know: relying on verifiability over time (but not requiring it immediately), and without any further standard for notability
  • A place with no requirements to edit: possibly style guidelines to aspire to, but where newbies who don’t know how the tools or system work are welcomed and encouraged to contribute more, and not chastised for getting things wrong.

Since this will be a new sort of compendium or comprehensive cyclopedia, covering all topics, it should have a new name. Something simple, say Newpedia. Scripts can be written to help editors work through the most polished Newpedia items and push them to Wikipedia and Wikisource and Commons. We could invite editors to start doing their rough work on Newpedia, to avoid the conflict and fast reversion on the larger wiki references that make it hard to use for quick new work.

Update: Mako discussed Newpedia (or double-plus-newpedia) in his panel about “Wikipedia in 2022”, and Erik Moeller talked about how the current focus on notability is keeping all of our projects from growing, in his “Ghosts of Wikipedia Future”.  I look forward to the video and transcripts.

What do you think?  I started a mailing list for people who are interested in developing such a knowledge-project.  I look forward to your thoughts, both serious and otherwise 😉



Wikimedia is protected as a host, Italian courts rule
Wednesday June 26th 2013, 7:13 pm
Filed under: Blogroll,international,wikipedia

Former Italian Minister of Defense Cesare Previti, angry that his Wikipedia biography had at one point contained a statement he found offensive, recently filed suit against the Wikimedia Foundation, claiming that it published defamatory statements.

An Italian court ruled that as a hosting provider, not a content provider (allowing others to create content but not creating it), Wikimedia was protected from such claims. The court’s two considerations (whether the role of the WMF is clear, and whether there are procedures for readers to address errors) both marked Wikimedia as a host.

This is covered in more detail on the WMF blog. Thanks to our Italian counsel Hogan Lovells for their support in this case.

Ilario Valdelli, from the national chapter Wikimedia Italia, commented:

The justice in Italy is really slow in general, but in this case the timeline has been short and the result of the court of justice really good for the future.

The definition that the prosecutor is able to change the definition in Wikipedia and for this reason there is no sense to accuse Wikipedia or WMF, it’s a good way to give an indication for the future.



Annotation Notes from a recent discussion with this year’s Berkterns
Thursday June 13th 2013, 10:18 pm
Filed under: citation needed,knowledge,meta,popular demand,wikipedia

Anno-notes.  (thanks, piratepad)



Wikipedian forced to delete article by the French police
Tuesday April 30th 2013, 11:06 pm
Filed under: international,metrics,Not so popular,wikipedia

In France, a Wikipedia admin was sought out by France’s homeland intelligence agency, the Direction Centrale du Renseignement Intérieur, brought physically to their offices, and forced to delete an article about a military base (which they claimed contained classified information) under threat of being held overnight.

This sort of bullying tactic is one up with which we should not put. The issue later became a minor cause célèbre in the French press for a short time.



Wiki AuditCom : Yearly call for volunteers to help vet our books
Friday April 26th 2013, 3:28 pm
Filed under: international,knowledge,metrics,wikipedia

As noted last week by AuditCom chair Stu West.



Wikipedians in Residence: Seven new positions open
Wednesday April 24th 2013, 11:44 pm
Filed under: poetic justice,popular demand,wikipedia

Gathered by the Generalist.



