Archiving Web links: Building global layers of caches and mirrors
Sunday June 12th 2016, 4:23 pm
The Web is highly distributed and in flux; the people using it, even more so.  Many projects exist to optimize its use, including:

  1. Reducing storage and bandwidth:  compressing parts of the web; deduplicating files that exist in many places, replacing many with pointers to a single copy of the file [Many browsers & servers, *Box]
  2. Reducing latency and long-distance bandwidth:  caching popular parts of the web locally around the world [CDNs, clouds, &c]
  3. Increasing robustness & permanence of links: caching linked pages (with timestamps or snapshots, for dynamic pages) [Memento, Wayback Machine, perma, amber]
  4. Increasing interoperability of naming schemes for describing or pointing to things on the Web, so that it’s easier to cluster similar things and find copies or versions of them [HvdS’s 15-year overview of advancing interop]

This week I was thinking about the 3rd point. What would a comprehensively backed-up Web of links look like?  How resilient can we make references against all of the failure modes we’ve seen and imagined?  Some threads for a map:

  1. Links should include timestamps; important ones should request archival permalinks.
    • When creating a reference, sites should notify each of the major cache-networks, asking them to store a copy.
    • Robust links can embed information about where to find a cache (and possibly a fuzzy content hash?) in the <a> tag that generates the link.
    • Permalinks can use an identifier system that allows searching for the page across any of the nodes of the local network, and across the different cache-networks. (Browsers can know how to attempt to find a copy.)
  2. Sites should have a coat of amber: a local cached snapshot of anything linked from that site, stored on their host or a nearby supernode.  So as long as that site is available, snapshots of what it links to are, too.
    • We can comprehensively track whether sites have signalled they have an amber layer.  If a site isn’t yet caching what they link to, readers can encourage them to do so or connect them to a supernode.
    • Libraries should host amber supernodes: caches for sites that can’t host those snapshots on their host machine.
  3. Snapshots of entire websites should be archived regularly
    • Both public snapshots for search engines and private ones for long-term archives.
  4. A global network of mirrors (a la [C]LOCKSS) should maintain copies of permalink and snapshot databases
    • Consortia of libraries, archives, and publishers should commit to a broad geographic distribution of mirrors.
      • mirrors should be available within any country that has expensive interconnects with the rest of the world;
      • prioritization should lead to a kernel of the cached web that is stored in ‘seed bank’ style archives, in the most secure vaults and other venues
  5. There should be a clear way to scan for fuzzy matches for a broken link. Especially handy for anyone updating a large archive of broken links.
    • Is the base directory there? Is the base URL known to have moved?
    • Are distant-timestamped versions of the file available?  [some robustlink implementations do this already]
    • Are there exact matches elsewhere in the web for a [rare] filename?  Can you find other documents with the same content hash? [if a hash was included in the link]
    • Are there known ways to contact the original owner of the file/directory/site?
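A rough sketch of threads 1 and 5 above, in Python. It shows one way a link could carry its own timestamp, cache location, and content hash, and how a scanner might list places to look once the link breaks. The attribute names data-versionurl and data-versiondate are modeled on the Robust Links proposal; data-contenthash, the function names, and the example.org URLs are hypothetical. The hash is an exact SHA-256, not a fuzzy one, and nothing here touches the network.

```python
import hashlib
import html
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit


def content_hash(body: bytes) -> str:
    """Exact SHA-256 of the linked content (a fuzzy hash would need other tools)."""
    return hashlib.sha256(body).hexdigest()


def robust_link(url, snapshot_url, linked_at, body, text):
    """Render an <a> tag that carries snapshot location, link date, and hash."""
    attrs = {
        "href": url,
        "data-versionurl": snapshot_url,
        "data-versiondate": linked_at.strftime("%Y-%m-%d"),
        # Hypothetical attribute: not part of any published convention.
        "data-contenthash": "sha256:" + content_hash(body),
    }
    rendered = " ".join(
        f'{name}="{html.escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<a {rendered}>{html.escape(text)}</a>"


def fallback_candidates(url, linked_at):
    """List places to look when `url` is broken: archive first, then parents."""
    parts = urlsplit(url)
    base_dir = parts.path.rsplit("/", 1)[0] + "/"
    stamp = linked_at.strftime("%Y%m%d%H%M%S")
    return [
        # Wayback Machine lookup near the time the link was made.
        f"https://web.archive.org/web/{stamp}/{url}",
        # Is the base directory still there?
        urlunsplit((parts.scheme, parts.netloc, base_dir, "", "")),
        # Is the site itself still there?
        urlunsplit((parts.scheme, parts.netloc, "/", "", "")),
    ]


when = datetime(2016, 6, 12, 16, 23, tzinfo=timezone.utc)
target = "http://example.org/papers/2016/report.pdf"
snapshot = fallback_candidates(target, when)[0]
print(robust_link(target, snapshot, when, b"example content", "report"))
for candidate in fallback_candidates(target, when):
    print(candidate)
```

A real scanner would fetch each candidate in turn and compare content hashes where the link supplied one; the remaining questions in thread 5 (rare filenames, contacting the owner) need search and human follow-up rather than code.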

Related questions: What other aspects of robustness need consideration? How are people making progress at each layer?  What more is needed to have a mesh of archived links at every scale? For instance, WordPress supports a chunk of the Web; top CDNs cache more than that. What other players can make this happen?  What is needed for them to support this?

Wednesday November 25th 2015, 11:44 pm
“When I was 5 years old, my mother always told me that happiness was the key to life.  When I went to school, they asked me what I wanted to be when I grew up.  I wrote down happy. They told me I didn’t understand the assignment, and I told them they didn’t understand life.”  —Lennon

From the BODYWORLDS exhibit in Amsterdam, full of flayed and preserved human bodies.

WMF Audit Committee update – Call for Volunteers
Friday June 05th 2015, 7:07 pm
The Wikimedia Foundation has an Audit Committee that represents its Board in overseeing financial and accounting matters.  This includes reviewing the foundation’s financials, its annual tax return, and an independent audit by KPMG. For details, and the current committee members, see the WMF’s Audit Committee page and the Audit Committee charter.

I currently serve as the Audit Committee chair.  We are forming the committee for 2015-16, and are looking for volunteers from the community.

Members serve on the Committee for one year, from July through July.  The Foundation files its annual tax return in the U.S. in April, and publishes its annual plan in June.  Committee members include trustees from the Foundation’s board and contributors from across the Wikimedia movement.

Time commitment for the committee is modest: reviews are carried out via three or four conference calls over the course of the year.  The primary requirement is financial literacy: some experience with finance, accounting or auditing.

If you are interested in joining the Committee for the coming year, please email me at sj at with your CV, and your thoughts on how you could contribute. Thank you!

Soft, distributed review of public spaces: Making Twitter safe
Monday October 27th 2014, 2:56 pm
Successful communities have learned a few things about how to maintain healthy public spaces. We could use a handbook of effective practices for community designers. It is a mark of the youth of interpublic spaces that spaces such as Twitter and Instagram [not to mention niche spaces like Wikipedia, and platforms like WordPress] rarely have architects dedicated to designing and refining this aspect of their structure, toolchains, and workflows.

Some say that ‘overly’ public spaces enable widespread abuse and harassment. But the “publicness” of large digital spaces can help make them more welcoming in some ways than physical ones – where it is harder to remove graffiti or eggs from homes or buildings – and niche ones – where clique formation and systemic bias can dominate. For instance, here are a few ‘soft’ (reversible, auditable, post-hoc) tools that let a mixed ecosystem review and maintain its own areas in a broad public space:

Allow participants to change the visibility of comments:  Let each control what they see, and promote or flag it for others.

  • Allow blacklists and whitelists, in a way that lets people block out harassers or keywords entirely if they wish. Make it easy to see what has been hidden.
  • Rating (both average and variance) and tags for abuse or controversy can allow for locally flexible display.  Some simple models make this hard to game.
  • Allow things to be incrementally hidden from view.  Group feedback is more useful when the result is a spectrum.
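A minimal sketch of how average and variance together could drive a graded display, assuming ratings between -1 and 1. The function names, display levels, and cutoffs are all hypothetical; each reader could set their own cutoffs.

```python
from statistics import mean, pvariance


def summarize(ratings):
    """Return (average, variance) for a list of ratings in [-1, 1]."""
    return mean(ratings), pvariance(ratings)


def visibility(ratings, hide_below=-0.5, controversy_above=0.5, min_votes=3):
    """Map a comment's ratings to a display level under one reader's settings."""
    if len(ratings) < min_votes:
        return "shown"  # too few votes to act on
    average, variance = summarize(ratings)
    if average <= hide_below:
        return "collapsed"  # hidden by default, still one click away
    if variance >= controversy_above:
        return "shown, tagged controversial"  # readers disagree sharply
    if average < 0:
        return "dimmed"  # mildly disliked: de-emphasized, not hidden
    return "shown"


for sample in ([1, 1, 1], [-1, -1, -1], [1, -1, 1, -1], [-1, 0, 0]):
    print(sample, "->", visibility(sample))
```

Requiring a minimum number of votes is one simple guard against gaming; a single hostile rating cannot hide anything by itself.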

Increase the efficiency ratio of moderation and distribute it: automate review, filter and slow down abuse.

  • Tag contributors by their level of community investment. Many who spam or harass try to cloak in new or fake identities.
  • Maintain automated tools to catch and limit abusive input. There’s a spectrum of response: from letting only the poster and moderators see the input (cocooning), to tagging and not showing by default (thresholding), to simply tagging as suspect (flagging).
  • Make these and other tags available to the community to use in their own preferences and review tools
  • For dedicated abuse: hook into penalties that make it more costly for those committed to spoofing the system.
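The spectrum of response above (cocooning, thresholding, flagging) could be sketched as follows. This assumes some upstream classifier yields an abuse score between 0 and 1; the cutoffs, the new-account penalty, and all names are hypothetical.

```python
from enum import Enum


class Response(Enum):
    ALLOW = "allow"
    FLAG = "flag"            # shown, but tagged as suspect
    THRESHOLD = "threshold"  # tagged, and not shown by default
    COCOON = "cocoon"        # visible only to the poster and moderators


def choose_response(abuse_score, account_age_days,
                    cocoon_at=0.9, threshold_at=0.7, flag_at=0.4,
                    new_account_days=7, new_account_penalty=0.1):
    """Pick the mildest response that fits the score and community investment."""
    score = abuse_score
    if account_age_days < new_account_days:
        score += new_account_penalty  # new or throwaway identities get less slack
    if score >= cocoon_at:
        return Response.COCOON
    if score >= threshold_at:
        return Response.THRESHOLD
    if score >= flag_at:
        return Response.FLAG
    return Response.ALLOW


def can_see(response, viewer, poster, is_moderator=False, show_thresholded=False):
    """Decide whether `viewer` sees a post that received `response`."""
    if viewer == poster or is_moderator:
        return True
    if response is Response.COCOON:
        return False
    if response is Response.THRESHOLD:
        return show_thresholded  # the reader's own preference
    return True


print(choose_response(0.65, account_age_days=1))
print(can_see(Response.COCOON, viewer="bob", poster="alice"))
```

Every outcome here is reversible and leaves a tag behind, so the community can audit it or reuse it in their own filters.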

You can’t make everyone safe all of the time, but can dial down behavior that is socially unwelcome (by any significant subgroup) by a couple of orders of magnitude.  Of course these ideas are simple and only work so far.  For instance, in a society at civil war, where each half is literally threatened by the sober political and practical discussions of the other half, public speech may simply not be safe.

Utter License, n.: A minimal way to grant all rights to a work
Tuesday October 21st 2014, 3:03 am
[You may do UTTERLY ANYTHING with this work.]



Utter details and variants

Lila Tretikov named as Wikimedia’s upcoming ED
Thursday May 01st 2014, 5:49 pm
And there was much rejoicing. Welcome, Lila!

Aaron Swartz hackfests this weekend around the world: honoring his work
Friday November 08th 2013, 7:04 pm
Help continue projects Aaron believed in, in person or online.
I’ll be at the Cambridge event and aftermath throughout the long weekend.

Related project summaries:

Wikipedia The Movie: the maddest thing I’ve read in some time
Wednesday October 30th 2013, 9:30 pm
Wikipedia The Movie, a wiki-amusement started by Mark Pellegrini during the Chrome Age of Wikipedia, is hard to describe, and not exactly what you might expect. It is a surreal cataclysm of in-jokes pretzled together into a tilable shape. Its pieces have been polished by those who appreciate it: editors with a sense of humor, reflecting on a larger community whose relationship with humor is more nuanced. In short Dalí-scented scenes, and the language of cafeteria gossip, it captures something about the projects in a way that is honest to the madness of humanity. Enough to make any committed editor wince/smile.  It makes me wonder what a similarly frank slice of subtext would look like for other large-scale projects.

While I remember the original being written – not called ‘Episode 1’ at the time – I only discovered last year that it had been turned into a franchise, slowly unfolding year after year. And I can’t complain that I was cast as my favorite Shakespearean fragranceur.

Wikipedia the Movie

And then there is the musical version. For you who have shared the private hallucinations of those who breathe too deep of lemony Huggle vapour, give over a few minutes of your day to a stroll down memory’s phantasmagoria:

Wikipedia the Movie, Episode 1 ·  Wikipedia: The Musical

Women wield wiki: Ada Lovelaceathon returns next week:
Thursday October 10th 2013, 7:03 pm
Thanks to Maia W  and all who are making the Adathons in Boston happen!

LOC is down. remains up. What can this teach us?
Tuesday October 01st 2013, 3:51 pm
Thanks to the US government shutdown, the Library of Congress website went down today.  So did NASA, the NSF, the USDA, the FTC, and the National Park Service.  On the other hand, privately-run websites such as the Internet Archive (and, thankfully, its glorious Wayback Machine!) remain online and unaffected by these sorts of government changes.

As we plan for making our Internet more robust in the future, we should make sure to avoid single (or small-group) points of failure, as even services run by major corporations or governments can go offline in a hurry.  (People who live their lives on Google tools: I’m looking at you 🙂)

Wikipedia itself should be sure to support a thriving mirror network, and should probably move towards a fully distributed cache-and-forward model where possible.

A New ‘Pedia: planning for the future of Wikipedia
Saturday August 10th 2013, 2:58 am
Wikipedia has gotten more elaborate and complex to use. Adding a reference, marking something for review, uploading a file or creating a new article now take many steps, and failing to follow them can lead to starting all over. The curators of the core projects are concerned with uniformly high quality, and impatient with contributors who don’t have the expertise and wiki-experience to create something according to policy. Good stubs or photos are deleted for failing to comply with one of a dozen policies, or for inadequate cites or license templates, even when they are in fact derived from reliable sources and freely licensed.

The Article Creation Wizard has a five-step process for drafting an article, after which it is submitted for review by a team of experienced editors, and finally moved to the article namespace. 7 steps for approval is too much overhead for many.  And the current notability guidelines on big Wikipedias exclude most local and specialist knowledge.

We need a simpler scratch-space to develop new material:

  • A place not designed to be high quality, where everything can be in flux, possibly wrong, in need of clarification and polishing and correction.
  • A place that can be used to build draft articles, images, and other media before posting them to Wikipedia
  • A place where everyone is welcome to start a new topic, and share what they know: relying on verifiability over time (but not requiring it immediately), and without any further standard for notability
  • A place with no requirements to edit: possibly style guidelines to aspire to, but where newbies who don’t know how the tools or system works are welcomed and encouraged to contribute more, and not chastised for getting things wrong.

Since this will be a new sort of compendium or comprehensive cyclopedia, covering all topics, it should have a new name. Something simple, say Newpedia. Scripts can be written to help editors work through the most polished Newpedia items and push them to Wikipedia and Wikisource and Commons. We could invite editors to start doing their rough work on Newpedia, to avoid the conflict and fast reversion on the larger wiki references that make it hard to use for quick new work.

Update: Mako discussed Newpedia (or double-plus-newpedia) in his panel about “Wikipedia in 2022”, and Erik Moeller talked about how the current focus on notability is keeping all of our projects from growing, in his “Ghosts of Wikipedia Future”.  I look forward to the video and transcripts.

What do you think?  I started a mailing list for people who are interested in developing such a knowledge-project.  I look forward to your thoughts, both serious and otherwise 😉

Wikimedia is protected as a host, Italian courts rule
Wednesday June 26th 2013, 7:13 pm
Former Italian Minister of Defense Cesare Previti, angry that his Wikipedia biography had at one point contained a statement he found offensive, recently filed suit against the Wikimedia Foundation: claiming that it published defamatory statements.

An Italian court ruled that as a hosting provider, not a content provider – allowing others to create content but not creating it – Wikimedia was protected from such claims. The court’s two considerations, whether the role of the WMF is clear and whether there are procedures for readers to address errors, both marked Wikimedia as a host.

This is covered in more detail on the WMF blog. Thanks to our Italian counsel Hogan Lovells for their support in this case.

Ilario Valdelli, from the national chapter Wikimedia Italia, commented:

The justice in Italy is really slow in general, but in this case the timeline has been short and the result of the court of justice really good for the future.

The definition that the prosecutor is able to change the definition in Wikipedia and for this reason there is no sense to accuse Wikipedia or WMF, it’s a good way to give an indication for the future.

Annotation Notes from a recent discussion with this year’s Berkterns
Thursday June 13th 2013, 10:18 pm
Anno-notes.  (thanks, piratepad)

Wikipedian forced to delete article by the French police
Tuesday April 30th 2013, 11:06 pm
In France, a Wikipedia admin was sought out by France’s homeland intelligence agency, the Direction Centrale du Renseignement Intérieur, brought physically to their offices, and forced to delete an article about a military base (which they claimed contained classified information) under threat of being held overnight.

This sort of bullying tactic is one up with which we should not put. The issue later became a minor cause célèbre in the French press for a short time.

Wiki AuditCom : Yearly call for volunteers to help vet our books
Friday April 26th 2013, 3:28 pm
As noted last week by AuditCom chair Stu West.

Wikipedians in Residence: Seven new positions open
Wednesday April 24th 2013, 11:44 pm
Gathered by the Generalist.

The Wikidata Revolution: enabling structured data love
Wednesday April 24th 2013, 4:00 pm
A year after its announcement as the first new Wikimedia project since 2006, Wikidata has now begun to serve the over 280 language versions of Wikipedia as a common source of structured data that can be used in more than 25 million articles of the free encyclopedia.

By providing Wikipedia editors with a central venue for their efforts to collect and vet such data, Wikidata leads to a higher level of consistency and quality in Wikipedia articles across the many language editions of the encyclopedia. Beyond Wikipedia, Wikidata’s universal, machine-readable knowledge database will be freely reusable by anyone, enabling numerous external applications.

“Wikidata is a powerful tool for keeping information in Wikipedia current across all language versions. Before Wikidata, Wikipedians needed to manually update hundreds of Wikipedia language versions every time a famous person died or a country’s leader changed. With Wikidata, such new information, entered once, will automatically appear across all Wikipedia language versions. That makes life easier for editors and makes it easier for Wikipedia to stay current.” – Sue Gardner

The development of Wikidata began in March 2012, led by Wikimedia Deutschland, the German chapter of the Wikimedia movement. Since the site went live on October 30, a growing community of around 3,000 active contributors has been building its database of ‘items’ (e.g. things, people or concepts), first by collecting topics that are already the subject of Wikipedia articles in several languages. An item’s central page on Wikidata replaces the complex web of language links which previously connected these articles about the same topic in different Wikipedia versions. Wikidata’s collection of these items now numbers over 10 million. The community also began to enrich Wikidata’s database with factual statements about these topics (data like the mayor of a city, the ISBN of a book, the languages spoken in a country, etc.). This information has now become available for use on Wikipedia itself.
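As a toy illustration of the item model just described, here is a simplified in-memory sketch in Python. It is not the real Wikidata data model or API; the class, the functions, the identifier "Q-EXAMPLE", and the city and mayor names are all made up. It shows one central item supplying both the language links and a fact that, entered once, is read by every language edition.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One central record for a topic (hypothetical, heavily simplified)."""
    item_id: str
    sitelinks: dict = field(default_factory=dict)   # language code -> article title
    statements: dict = field(default_factory=dict)  # property name -> value


def language_links(item, from_lang):
    """Links an article shows to the same topic in other languages."""
    return {lang: title
            for lang, title in sorted(item.sitelinks.items())
            if lang != from_lang}


def infobox_value(item, prop):
    """Value any language edition would display for a property."""
    return item.statements.get(prop)


city = Item(
    item_id="Q-EXAMPLE",
    sitelinks={"en": "Example City", "de": "Beispielstadt", "fr": "Ville-Exemple"},
    statements={"mayor": "First Mayor"},
)

# The leader changes: one edit to the central item...
city.statements["mayor"] = "Second Mayor"

# ...and every language edition reads the new value.
per_language = {lang: infobox_value(city, "mayor") for lang in city.sitelinks}
print(per_language)
print(language_links(city, "en"))
```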

“It is the goal of Wikidata to collect the world’s complex knowledge in a structured manner so that anybody can benefit from it.  Whether that’s readers of Wikipedia who are able to be up to date about certain facts or engineers who can use this data to create new products that improve the way we access knowledge.” – Denny Vrandečić, Wikidata project lead

The next phase of Wikidata will allow for the automatic creation of lists and charts based on the data in Wikidata. Wikimedia Deutschland will continue to support the project with an engineering team that is dedicated to Wikidata’s second year of development and maintenance.

Wikidata is operated by the Wikimedia Foundation and its fact database is published under a Creative Commons 0 public domain dedication. Funding of Wikidata’s initial development was provided by the Allen Institute for Artificial Intelligence [AI]², the Gordon and Betty Moore Foundation, and Google, Inc.

More information available here:

Volunteers can get involved with Wikidata in many ways.  Some of the first applications demonstrating Wikidata’s potential as a platform:

  • The simia “tree of life” drawn from relations among biological species in Wikidata’s database
  • “GeneaWiki” generates a graph showing a person’s family relations as recorded in Wikidata.  See for example: the Bach family

