The Longest Now

Archiving Web links: Building global layers of caches and mirrors
Sunday June 12th 2016, 4:23 pm
Filed under: international,knowledge,meta,metrics,popular demand,wikipedia

The Web is highly distributed and in flux; the people using it, even more so.  Many projects exist to optimize its use, including:

  1. Reducing storage and bandwidth:  compressing parts of the web; deduplicating files that exist in many places, replacing many with pointers to a single copy of the file [Many browsers & servers, *Box]
  2. Reducing latency and long-distance bandwidth:  caching popular parts of the web locally around the world [CDNs, clouds, &c]
  3. Increasing robustness & permanence of links: caching linked pages (with timestamps or snapshots, for dynamic pages) [Memento, Wayback Machine, perma, amber]
  4. Increasing interoperability of naming schemes for describing or pointing to things on the Web, so that it’s easier to cluster similar things and find copies or versions of them [HvdS’s 15-year overview of advancing interop]

This week I was thinking about the 3rd point. What would a comprehensively backed-up Web of links look like?  How resilient can we make references against all of the failure modes we’ve seen and imagined?  Some threads for a map:

  1. Links should include timestamps, important ones should request archival permalinks.
    • When creating a reference, sites should notify each of the major cache-networks, asking them to store a copy.
    • Robust links can embed information about where to find a cache in the anchor (a) tag that generates the link (and possibly a fuzzy content hash?).
    • Permalinks can use an identifier system that allows searching for the page across any of the nodes of the local network, and across the different cache-networks. (Browsers can know how to attempt to find a copy.)
  2. Sites should have a coat of amber: a local cached snapshot of anything linked from that site, stored on their host or a nearby supernode.  So as long as that site is available, snapshots of what it links to are, too.
    • We can comprehensively track whether sites have signalled they have an amber layer.  If a site isn’t yet caching what they link to, readers can encourage them to do so or connect them to a supernode.
    • Libraries should host amber supernodes: caches for sites that can’t host those snapshots on their host machine.
  3. Snapshots of entire websites should be archived regularly
    • Both public snapshots for search engines and private ones for long-term archives.
  4. A global network of mirrors (a la [C]LOCKSS) should maintain copies of permalink and snapshot databases
    • Consortia of libraries, archives, and publishers should commit to a broad geographic distribution of mirrors.
      • mirrors should be available within any country that has expensive interconnects with the rest of the world;
      • prioritization should lead to a kernel of the cached web that is stored in ‘seed bank’ style archives, in the most secure vaults and other venues
  5. There should be a clear way to scan for fuzzy matches for a broken link. Especially handy for anyone updating a large archive of broken links.
    • Is the base directory there? Is the base URL known to have moved?
    • Are distant-timestamped versions of the file available?  [some robustlink implementations do this already]
    • Are there exact matches elsewhere in the web for a [rare] filename?  Can you find other documents with the same content hash? [if a hash was included in the link]
    • Are there known ways to contact the original owner of the file/directory/site?
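The first thread above is already partly conventionalized: the Robust Links approach embeds an archived snapshot URL and a version date directly in the anchor tag, via data-versionurl and data-versiondate attributes, so a reader (or browser extension) can fall back to the snapshot when the original link rots. A sketch of emitting such a link — the snapshot URL here stands in for whatever an archive returned when the reference was created:

```python
from datetime import datetime, timezone
from html import escape

def robust_link(href, snapshot_url, text, when=None):
    """Render an <a> tag in the Robust Links style:
    data-versionurl points at an archived snapshot, and
    data-versiondate records when the link was created."""
    when = when or datetime.now(timezone.utc)
    return (
        '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'
        .format(escape(href, quote=True),
                escape(snapshot_url, quote=True),
                when.strftime("%Y-%m-%d"),
                escape(text))
    )
```

Because the fallback lives in the markup itself, it survives even if the link-checking service that created it disappears — which is the point of pushing robustness down to the link layer.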
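The fifth thread — scanning for fuzzy matches — can begin with purely mechanical candidates derived from the broken URL itself, before any content-hash search: walk up the path to find the nearest base directory that still resolves, then fall back to the site root. A sketch of the candidate generation (actually probing each candidate over the network, and querying timestamped archives for old versions, is left to the caller):

```python
from urllib.parse import urlsplit, urlunsplit

def repair_candidates(broken_url):
    """List URLs worth probing for a broken link, in order:
    successively shorter base directories (is the parent still there?),
    ending at the site root.  A caller would fetch each, and separately
    consult timestamped archives for older versions of the exact URL."""
    scheme, netloc, path, _, _ = urlsplit(broken_url)
    parts = [p for p in path.split("/") if p]
    candidates = []
    # Walk up the directory tree: /a/b/c.html -> /a/b/ -> /a/ -> /
    for i in range(len(parts) - 1, -1, -1):
        parent = "/" + "/".join(parts[:i])
        if not parent.endswith("/"):
            parent += "/"
        candidates.append(urlunsplit((scheme, netloc, parent, "", "")))
    if not candidates:  # the broken URL was already the root
        candidates.append(urlunsplit((scheme, netloc, "/", "", "")))
    return candidates
```

This answers only the first two questions in the thread; the content-hash and contact-the-owner checks need external indexes that no single site can maintain — one more argument for the shared cache-networks above.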

Related questions: What other aspects of robustness need consideration? How are people making progress at each layer? What more is needed to have a mesh of archived links at every scale? For instance, WordPress powers a sizeable chunk of the Web; top CDNs cache more than that. What other players can make this happen? What is needed for them to support this?
