The Longest Now

Archiving Web links: Building global layers of caches and mirrors
Sunday June 12th 2016, 4:23 pm
The Web is highly distributed and in flux; the people using it, even more so.  Many projects exist to optimize its use, including:

  1. Reducing storage and bandwidth:  compressing parts of the web; deduplicating files that exist in many places, replacing many with pointers to a single copy of the file [Many browsers & servers, *Box]
  2. Reducing latency and long-distance bandwidth:  caching popular parts of the web locally around the world [CDNs, clouds, &c]
  3. Increasing robustness & permanence of links: caching linked pages (with timestamps or snapshots, for dynamic pages) [Memento, Wayback Machine, perma, amber]
  4. Increasing interoperability of naming schemes for describing or pointing to things on the Web, so that it’s easier to cluster similar things and find copies or versions of them [HvdS’s 15-year overview of advancing interop]

This week I was thinking about the 3rd point. What would a comprehensively backed-up Web of links look like?  How resilient can we make references to all of the failure modes we’ve seen and imagined?  Some threads for a map:

  1. Links should include timestamps; important ones should request archival permalinks.
    • When creating a reference, sites should notify each of the major cache-networks, asking them to store a copy.
    • Robust links can embed information about where to find a cache in the <a> tag that generates the link (and possibly a fuzzy content hash?).
    • Permalinks can use an identifier system that allows searching for the page across any of the nodes of the local network, and across the different cache-networks. (Browsers can know how to attempt to find a copy.)
  2. Sites should have a coat of amber: a local cached snapshot of anything linked from that site, stored on their host or a nearby supernode.  So as long as that site is available, snapshots of what it links to are, too.
    • We can comprehensively track whether sites have signalled they have an amber layer.  If a site isn’t yet caching what they link to, readers can encourage them to do so or connect them to a supernode.
    • Libraries should host amber supernodes: caches for sites that can’t host those snapshots on their host machine.
  3. Snapshots of entire websites should be archived regularly
    • Both public snapshots for search engines and private ones for long-term archives.
  4. A global network of mirrors (a la [C]LOCKSS) should maintain copies of permalink and snapshot databases
    • Consortia of libraries, archives, and publishers should commit to a broad geographic distribution of mirrors.
      • mirrors should be available within any country that has expensive interconnects with the rest of the world;
      • prioritization should lead to a kernel of the cached web that is stored in ‘seed bank’ style archives, in the most secure vaults and other venues
  5. There should be a clear way to scan for fuzzy matches for a broken link. Especially handy for anyone updating a large archive of broken links.
    • Is the base directory there? Is the base URL known to have moved?
    • Are distant-timestamped versions of the file available?  [some robustlink implementations do this already]
    • Are there exact matches elsewhere in the web for a [rare] filename?  Can you find other documents with the same content hash? [if a hash was included in the link]
    • Are there known ways to contact the original owner of the file/directory/site?
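A couple of these threads can be sketched in code. This is a minimal sketch under stated assumptions, not a reference implementation: the data-versionurl and data-versiondate attribute names follow the Robust Links convention, data-contenthash is a hypothetical attribute invented here, the hash is an exact SHA-256 rather than a fuzzy one, and the Wayback Machine URL pattern (including its acceptance of a partial timestamp) is assumed.

```python
import hashlib
from html import escape
from urllib.parse import urlsplit, urlunsplit


def content_hash(body: bytes) -> str:
    """Exact (not fuzzy) SHA-256 hash of a page body, as hex."""
    return hashlib.sha256(body).hexdigest()


def robust_link(url, text, snapshot_url, snapshot_date, body=None):
    """Render an <a> tag that carries its own archival hints.

    data-versionurl / data-versiondate follow the Robust Links convention;
    data-contenthash is a hypothetical extension for this sketch.
    """
    attrs = {
        "href": url,
        "data-versionurl": snapshot_url,
        "data-versiondate": snapshot_date,
    }
    if body is not None:
        attrs["data-contenthash"] = "sha256:" + content_hash(body)
    rendered = " ".join(f'{k}="{escape(v, quote=True)}"' for k, v in attrs.items())
    return f"<a {rendered}>{escape(text)}</a>"


def fallback_candidates(url, timestamp="2016"):
    """List places to look when `url` is broken, most specific first."""
    parts = urlsplit(url)
    base_dir = parts.path.rsplit("/", 1)[0] + "/"
    return [
        # Capture nearest to `timestamp` in the Wayback Machine (assumed URL pattern).
        f"https://web.archive.org/web/{timestamp}/{url}",
        # Is the base directory still there?
        urlunsplit((parts.scheme, parts.netloc, base_dir, "", "")),
        # Is the site itself still there?
        urlunsplit((parts.scheme, parts.netloc, "/", "", "")),
    ]
```

Someone repairing a large archive of broken links could walk the fallback_candidates list for each dead URL, and compare the stored hash against whatever they find to confirm an exact match.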

Related questions: What other aspects of robustness need consideration? How are people making progress at each layer?  What more is needed to have a mesh of archived links at every scale? For instance, WordPress supports a chunk of the Web; top CDNs cache more than that. What other players can make this happen?  What is needed for them to support this?

Soft, distributed review of public spaces: Making Twitter safe
Monday October 27th 2014, 2:56 pm
Successful communities have learned a few things about how to maintain healthy public spaces. We could use a handbook for community designers gathering effective practices. It is a mark of the youth of interpublic spaces that spaces such as Twitter and Instagram [not to mention niche spaces like Wikipedia, and platforms like WordPress] rarely have architects dedicated to designing and refining this aspect of their structure, toolchains, and workflows.

Some say that ‘overly’ public spaces enable widespread abuse and harassment. But the “publicness” of large digital spaces can help make them more welcoming in some ways than physical ones (where it is harder to remove graffiti or eggs from homes or buildings) and niche ones (where clique formation and systemic bias can dominate). For instance, here are a few ‘soft’ (reversible, auditable, post-hoc) tools that let a mixed ecosystem review and maintain their own areas in a broad public space:

Allow participants to change the visibility of comments:  Let each control what they see, and promote or flag it for others.

  • Allow blacklists and whitelists, in a way that lets people block out harassers or keywords entirely if they wish. Make it easy to see what has been hidden.
  • Rating (both average and variance) and tags for abuse or controversy can allow for locally flexible display.  Some simple models make this hard to game.
  • Allow things to be incrementally hidden from view.  Group feedback is more useful when the result is a spectrum.
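As a rough illustration of how average and variance together can drive a spectrum of display states, here is a minimal sketch. Everything in it is an assumption made for the example: ratings are taken to lie between -1 and 1, and the state names and thresholds are invented; each reader would set their own.

```python
from statistics import mean, pvariance


def summarize(ratings):
    """Return (average, variance) for a list of ratings in [-1, 1]."""
    return mean(ratings), pvariance(ratings)


def visibility(ratings, hide_below=-0.5, controversy_above=0.5, min_votes=5):
    """Map a comment's ratings to a display state for one reader.

    All thresholds are per-reader preferences, not global rules.
    """
    if len(ratings) < min_votes:
        return "shown"            # too few votes to act on
    avg, var = summarize(ratings)
    if var > controversy_above:
        return "controversial"    # readers disagree sharply: tag, don't hide
    if avg < hide_below:
        return "collapsed"        # hidden by default, still easy to reveal
    if avg < 0:
        return "dimmed"           # incrementally less prominent
    return "shown"
```

Requiring a minimum number of votes, and treating high variance as controversy rather than abuse, are two simple ways to make such a scheme harder for a small group to game.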

Increase the efficiency ratio of moderation and distribute it: automate review, filter and slow down abuse.

  • Tag contributors by their level of community investment. Many who spam or harass try to cloak in new or fake identities.
  • Maintain automated tools to catch and limit abusive input. There’s a spectrum of response: from letting only the poster and moderators see the input (cocooning), to tagging and not showing by default (thresholding), to simply tagging as suspect (flagging).
  • Make these and other tags available to the community to use in their own preferences and review tools
  • For dedicated abuse: hook into penalties that make it more costly for those committed to spoofing the system.
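The spectrum of responses above (cocooning, thresholding, flagging) can be sketched as follows. This is a sketch under assumptions, not a description of any real platform: the abuse_score is imagined to come from some automated classifier, and the cutoffs and the small penalty for new accounts are made-up numbers.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    NONE = 0
    FLAG = 1        # shown to all, tagged as suspect
    THRESHOLD = 2   # tagged and not shown by default
    COCOON = 3      # visible only to the poster and moderators


@dataclass
class Post:
    author: str
    text: str
    abuse_score: float            # 0.0 to 1.0, from an assumed classifier
    author_is_new: bool = False   # proxy for low community investment


def classify(post, flag_at=0.3, threshold_at=0.6, cocoon_at=0.9):
    """Pick the strongest response whose cutoff the score reaches."""
    score = post.abuse_score
    if post.author_is_new:
        score = min(1.0, score + 0.1)   # new or throwaway accounts get less slack
    if score >= cocoon_at:
        return Response.COCOON
    if score >= threshold_at:
        return Response.THRESHOLD
    if score >= flag_at:
        return Response.FLAG
    return Response.NONE


def can_see(post, viewer, is_moderator=False, show_hidden=False):
    """Decide whether `viewer` sees `post` under the soft responses above."""
    response = classify(post)
    privileged = is_moderator or viewer == post.author
    if response is Response.COCOON:
        return privileged
    if response is Response.THRESHOLD:
        return privileged or show_hidden   # readers can opt in to hidden posts
    return True
```

Each response is reversible: nothing is deleted, and the tag produced by classify can be exposed to the community for use in their own filters.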

You can’t make everyone safe all of the time, but you can dial down behavior that is socially unwelcome (by any significant subgroup) by a couple of orders of magnitude.  Of course these ideas are simple and only work so far.  For instance, in a society at civil war, where each half is literally threatened by the sober political and practical discussions of the other half, public speech may simply not be safe.

Lila Tretikov named as Wikimedia’s upcoming ED
Thursday May 01st 2014, 5:49 pm
And there was much rejoicing. Welcome, Lila!

Women’s Public Voice: points left out of Mary Beard’s history of speech
Sunday March 02nd 2014, 10:38 pm
Bruce recently recommended an essay on the historical public voice of women, by noted classicist Mary Beard.

Beard is a fine and provocative writer; it is good rhetoric.

But I don’t think it gives much insight into historical causes, or ways we can bring about change. Women face deeply gendered and hateful criticism today, particularly online. The argument that this is due to Greco-Roman rhetorical traditions, or the Western literary canon, is unconvincing. I see selection bias in Beard’s examples.

I would love to see a version of this essay that gets nuances right, and tries to explain changes in the past century based on its arguments.

Left out:
+ The complexity of women’s voice in Rome, from Fulvia and Livia to Irene of Athens;
+ Greek admiration of Gorgo, Roman admiration of Zenobia;
+ Conflicting views of leaders in adjacent cultures (Boudica, Cleopatra, Dido);
+ The Old Testament (Deborah and Esther ?)

Misused for effect:
– Ovid: No metamorphs of any gender could speak; Io for one was changed back.
– Fulvia: First by describing her as someone’s wife, though she was one of the most powerful figures in Rome; then by framing her hatred of Cicero as a matter of gender.

On a tangent: Two speeches I love, to lift the spirits. (Both American; I know less oratory from the rest of the world. Suggestions welcome!):

Frances Wright on global patriotism and change:
# Independence Day speech at New Harmony (1828)

Margaret Chase Smith on an issue too great to be obscured by eloquence, thankfully no longer a concern today:
# Declaration of Conscience (1950)

Aaron Swartz hackfests this weekend around the world: honoring his work
Friday November 08th 2013, 7:04 pm
Help continue projects Aaron believed in, in person or online.
I’ll be at the Cambridge event and aftermath throughout the long weekend.

Related project summaries:

LOC is down. remains up. What can this teach us?
Tuesday October 01st 2013, 3:51 pm
Thanks to the US government shutdown, the Library of Congress website went down today.  So did NASA, the NSF, the USDA, the FTC, and the National Park Service.  On the other hand, privately-run websites such as the Internet Archive (and, thankfully, its glorious Wayback Machine!) remain online and unaffected by these sorts of government changes.

As we plan for making our Internet more robust in the future, we should make sure to avoid single (or small-group) points of failure, as even services run by major corporations or governments can go offline in a hurry.  (People who live their lives on Google tools: I’m looking at you 🙂)

Wikipedia itself should be sure to support a thriving mirror network, and should probably move towards a fully distributed cache-and-forward model where possible.

OLPC’s new XO Tablet is up for grabs, now on Amazon
Tuesday August 06th 2013, 6:15 am
The XO Tablet is now available for purchase on Amazon, for $150. It’s a beautiful little device, and worth having to play with for children of all ages.

This combines the joys of the smaller iPads with those of the XO design and experience, though it’s roughly half the experience of each. It runs an Android build with a number of constructionist activities and content, and you can install any other Android apps you like on it. It is based on a stock tablet; while not as brilliant and robust a design as the original laptop or the XO-Touch, it is still a real pleasure to put through its paces.

Give it a try, or borrow your friend’s, and let me know what you think.

Rep Mark Takano, California Congressman, Internet Hero
Friday July 12th 2013, 11:28 pm
There Will Be Charts.

This is what I always assumed the best political minds in our country would spend their time doing: solving difficult problems at the highest level of social- and legal-norm creation. Thank you, Mark Takano.

Annotation Notes from a recent discussion with this year’s Berkterns
Thursday June 13th 2013, 10:18 pm
Anno-notes.  (thanks, piratepad)

Future Conduct and the Limits of Class-Action Settlements – James G.
Monday May 20th 2013, 1:25 am
The coruscating James Grimmelmann recently published a crisp, clean exorcism of “future conduct” releases in class action suits, in the North Carolina Law Review.  Using a number of recent class actions as motivation, including the Google Books case, he patiently and eloquently dissects the ideas behind such carte blanche releases, and the rare cases in which they might be called for.

This is a gem of a monograph – worth reading even if you are not a copyright geek.

From the opening salvo (emphasis mine):

This Article identifies a new and previously unrecognized trend in class-action settlements: releases for the defendant’s future conduct. Such releases, which hold the defendant harmless for wrongs it will commit in the future, are unusually dangerous to class members and to the public… [F]uture-conduct releases pose severe informational problems for class members and for courts… create moral hazard for the defendant, give it concentrated power, and thrust courts into a prospective planning role they are ill-equipped to handle.

Courts should guard against the dangers of future-conduct releases with a standard and a rule. The standard is heightened scrutiny for all settlements containing such releases; the Article describes the warning signs courts must be alert to and the safeguards courts should insist on. The rule is parity of preclusion: a class-action settlement may release future-conduct claims if and only if they could have been lost in litigation. […] The Article concludes by applying its recommendations to seven actual future-conduct settlements, in each case yielding a better result or clearer explanation than the court was able to provide.

If you’re in a hurry and don’t have time to savor all 90 pages of finely referenced background and analysis, a handy comparative timeline is on p.410, the standard and rule start on p.431, and the 7 brief case studies start on p.458.

via the Laboratorium.

Wikipedians in Residence: Seven new positions open
Wednesday April 24th 2013, 11:44 pm
Gathered by the Generalist.

The Wikidata Revolution: enabling structured data love
Wednesday April 24th 2013, 4:00 pm
A year after its announcement as the first new Wikimedia project since 2006, Wikidata has now begun to serve the over 280 language versions of Wikipedia as a common source of structured data that can be used in more than 25 million articles of the free encyclopedia.

By providing Wikipedia editors with a central venue for their efforts to collect and vet such data, Wikidata leads to a higher level of consistency and quality in Wikipedia articles across the many language editions of the encyclopedia. Beyond Wikipedia, Wikidata’s universal, machine-readable knowledge database will be freely reusable by anyone, enabling numerous external applications.

“Wikidata is a powerful tool for keeping information in Wikipedia current across all language versions. Before Wikidata, Wikipedians needed to manually update hundreds of Wikipedia language versions every time a famous person died or a country’s leader changed. With Wikidata, such new information, entered once, will automatically appear across all Wikipedia language versions. That makes life easier for editors and makes it easier for Wikipedia to stay current.” – Sue Gardner

The development of Wikidata began in March 2012, led by Wikimedia Deutschland, the German chapter of the Wikimedia movement. Since it went live on October 30, a growing community of around 3,000 active contributors has started building its database of ‘items’ (e.g. things, people or concepts), first by collecting topics that are already the subject of Wikipedia articles in several languages. An item’s central page on Wikidata replaces the complex web of language links which previously connected these articles about the same topic in different Wikipedia versions. Wikidata’s collection of these items now numbers over 10 million. The community also began to enrich Wikidata’s database with factual statements about these topics (data like the mayor of a city, the ISBN of a book, the languages spoken in a country, etc.). This information has now become available for use on Wikipedia itself.

“It is the goal of Wikidata to collect the world’s complex knowledge in a structured manner so that anybody can benefit from it.  Whether that’s readers of Wikipedia who are able to be up to date about certain facts or engineers who can use this data to create new products that improve the way we access knowledge.” – Denny Vrandečić, Wikidata project lead

The next phase of Wikidata will allow for the automatic creation of lists and charts based on the data in Wikidata. Wikimedia Deutschland will continue to support the project with an engineering team that is dedicated to Wikidata’s second year of development and maintenance.

Wikidata is operated by the Wikimedia Foundation and its fact database is published under a Creative Commons 0 public domain dedication. Funding of Wikidata’s initial development was provided by the Allen Institute for Artificial Intelligence [AI]², the Gordon and Betty Moore Foundation, and Google, Inc.

More information available here:

Volunteers can get involved with Wikidata in many ways.  Some of the first applications demonstrate Wikidata’s potential as a platform:

  • The simia “tree of life” drawn from relations among biological species in Wikidata’s database
  • “GeneaWiki” generates a graph showing a person’s family relations as recorded in Wikidata.  See for example: the Bach family

Pope Francis is now Q450675 : In WikiData
Tuesday April 23rd 2013, 10:05 pm

Wikidata, the newest Wikimedia Project, is now in very active development, and being used on all of the Wikipedias. Here for instance is the item record for Pope Francis:

Paus Fransiskus, Францыск, Папа Рымскі, পোপ ফ্রান্সিস, Pave Frans, Pápa Proinsias, Ferenc pápa, Popiežius Pranciškus, Francés I (papa), Ransisku (Tayta Papa), I. Franciscus, 方濟各 (教宗),
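To make the idea of an item record concrete, here is a minimal sketch of one item with per-language labels and shared statements. It is a simplification for illustration only, not Wikidata's actual data model or API: the Item class, the plain-text property name, and the render function are all hypothetical, and the language codes attached to the three labels (taken from the list above) are my own assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    qid: str
    labels: dict = field(default_factory=dict)       # language code -> label
    statements: dict = field(default_factory=dict)   # property name -> value

    def label(self, lang, fallback="en"):
        """Label in `lang`, else in the fallback language, else the bare ID."""
        return self.labels.get(lang) or self.labels.get(fallback) or self.qid


def render(item, lang):
    """What one language edition might show, drawing on the shared record."""
    facts = ", ".join(f"{k}: {v}" for k, v in sorted(item.statements.items()))
    return f"{item.label(lang)} ({item.qid}) {facts}".strip()


pope = Item(
    qid="Q450675",
    labels={"hu": "Ferenc pápa", "da": "Pave Frans", "lt": "Popiežius Pranciškus"},
)

# A fact entered once on the item shows up in every language rendering.
pope.statements["position"] = "pope"   # hypothetical property name
for lang in ("hu", "da", "lt"):
    print(render(pope, lang))
```

The point of the design is that the labels vary by language while the identifier and the statements do not, so an update to a statement needs to happen only once.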

Annotation Hacks: Hypothesis XXX begins to converge
Thursday March 28th 2013, 1:12 pm
The various threads around, the Open Annotation spec, and the campus-wide annotation projects at MIT, Yale, and Harvard are starting to converge. It’s nice to see a future pillar of the global web take shape – with no less friction but a more diverse audience than gathered to create the early Internet specs.

I’m at the Convergence Workshop at Harvard on the topic today, and will be at the iAnnotate workshop in San Francisco in 3 weeks. Consolidating notes on a “Hypothesis XXX” hackpad. [Btw: We dearly need a fully open hackpad equivalent with more reliable uptime than piratepad et al.! I default to HP when I have a doc that needs to sustain heavy editing and be guaranteed available during a narrow window of time at a conference… but I would much rather use a Wikimedia or similarly hosted service, with a more explicit guarantee of ongoing availability, at no cost ever.]

A personal note from MIT President L. Rafael Reif
Sunday January 13th 2013, 5:40 pm
This just went out by email, from MIT President Reif, who was inaugurated president in September:

To the members of the MIT community:

Yesterday we received the shocking and terrible news that on Friday in New York, Aaron Swartz, a gifted young man well known and admired by many in the MIT community, took his own life. With this tragedy, his family and his friends suffered an inexpressible loss, and we offer our most profound condolences. Even for those of us who did not know Aaron, the trail of his brief life shines with his brilliant creativity and idealism.

Although Aaron had no formal affiliation with MIT, I am writing to you now because he was beloved by many members of our community and because MIT played a role in the legal struggles that began for him in 2011.

I want to express very clearly that I and all of us at MIT are extremely saddened by the death of this promising young man who touched the lives of so many. It pains me to think that MIT played any role in a series of events that have ended in tragedy.

I will not attempt to summarize here the complex events of the past two years. Now is a time for everyone involved to reflect on their actions, and that includes all of us at MIT. I have asked Professor Hal Abelson to lead a thorough analysis of MIT’s involvement from the time that we first perceived unusual activity on our network in fall 2010 up to the present. I have asked that this analysis describe the options MIT had and the decisions MIT made, in order to understand and to learn from the actions MIT took. I will share the report with the MIT community when I receive it.

I hope we will all reach out to those members of our community we know who may have been affected by Aaron’s death. As always, MIT Medical is available to provide expert counseling, but there is no substitute for personal understanding and support.

With sorrow and deep sympathy,

L. Rafael Reif

Funeral (Tuesday), Memorials (Next Week), and other events
Sunday January 13th 2013, 3:11 pm
Aaron’s funeral will be on Tuesday in Chicago:

10am at Chabad of Highland Park, Chicago
followed by interment at Shalom Memorial Park in Arlington Heights, Illinois.

Memorials will take place in a few cities over the next two weeks, including:
Boston (at the MIT Media Lab)
New York
Washington DC
San Francisco

A Boston-area protest was planned for 12:00 today at the MIT Counsel’s Office and the MA District Attorney’s office.

“My new idea is… like a Star Wars Convention” – from my dearest friend
Saturday January 05th 2013, 11:26 pm
Now I want to hear more… but I’m bullish on it.

