The Web is highly distributed and in flux; the people using it, even more so. Many projects exist to optimize its use, including:
- Reducing storage and bandwidth: compressing parts of the web; deduplicating files that exist in many places by replacing the copies with pointers to a single instance [Many browsers & servers, *Box]
- Reducing latency and long-distance bandwidth: caching popular parts of the web locally around the world [CDNs, clouds, &c]
- Increasing robustness & permanence of links: caching linked pages (with timestamps or snapshots, for dynamic pages) [Memento, Wayback Machine, perma, amber]
- Increasing interoperability of naming schemes for describing or pointing to things on the Web, so that it’s easier to cluster similar things and find copies or versions of them [HvdS’s 15-year overview of advancing interop]
This week I was thinking about the 3rd point. What would a comprehensively backed-up Web of links look like? How resilient can we make references against all of the failure modes we’ve seen and imagined? Some threads for a map:
- Links should include timestamps; important ones should request archival permalinks.
- When creating a reference, sites should notify each of the major cache-networks, asking them to store a copy (sketched below).
- Robust links can embed information about where to find a cache in the <a> tag that generates the link (and possibly a fuzzy content hash?); a sketch follows this list.
- Permalinks can use an identifier system that allows searching for the page across any of the nodes of the local network, and across the different cache-networks. (Browsers can know how to attempt to find a copy; sketched below.)
- Sites should have a coat of amber: a local cached snapshot of anything linked from that site, stored on their host or a nearby supernode (sketched below). That way, as long as the site is available, snapshots of what it links to are, too.
- We can comprehensively track whether sites have signalled they have an amber layer. If a site isn’t yet caching what it links to, readers can encourage it to do so or connect it to a supernode.
- Libraries should host amber supernodes: caches for sites that can’t host those snapshots on their host machine.
- Snapshots of entire websites should be archived regularly:
- Both public snapshots for search engines and private ones for long-term archives.
- A global network of mirrors (à la [C]LOCKSS) should maintain copies of permalink and snapshot databases.
- Consortia of libraries, archives, and publishers should commit to a broad geographic distribution of mirrors.
- mirrors should be available within any country that has expensive interconnects with the rest of the world;
- prioritization should lead to a kernel of the cached web that is stored in ‘seed bank’-style archives, in the most secure vaults and other venues.
- There should be a clear way to scan for fuzzy matches for a broken link, which would be especially handy for anyone updating a large archive with broken links (see the last sketch after this list).
- Is the base directory there? Is the base URL known to have moved?
- Are timestamped versions of the file available elsewhere? [some Robust Links implementations do this already]
- Are there exact matches elsewhere on the Web for a [rare] filename? Can you find other documents with the same content hash? [if a hash was included in the link]
- Are there known ways to contact the original owner of the file/directory/site?
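A minimal sketch of the robust-link idea above: the data-versionurl and data-versiondate attributes come from the Robust Links spec (robustlinks.mementoweb.org); the data-contenthash attribute is my hypothetical addition for the fuzzy content hash, not part of any spec.

```python
import hashlib

def robust_link(href, text, snapshot_url, version_date, content=None):
    """Render an <a> tag that carries its own archival hints."""
    attrs = [
        f'href="{href}"',
        f'data-versionurl="{snapshot_url}"',   # where a cached copy lives (Robust Links spec)
        f'data-versiondate="{version_date}"',  # when the link was made (Robust Links spec)
    ]
    if content is not None:
        # Hypothetical extension: a content hash, so fuzzy matches can be
        # found later even if the original and the snapshot both move.
        digest = hashlib.sha256(content).hexdigest()
        attrs.append(f'data-contenthash="sha256:{digest}"')
    return f'<a {" ".join(attrs)}>{text}</a>'

print(robust_link(
    "https://example.com/report",
    "the report",
    "https://web.archive.org/web/20240101000000/https://example.com/report",
    "2024-01-01",
))
```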
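For notifying the cache-networks, here is a sketch assuming the Internet Archive’s Save Page Now endpoint, which accepts a plain GET to https://web.archive.org/save/<url> (heavy use may be rate-limited or require an account); other cache-networks would each need their own call.

```python
import requests

def request_archival(urls):
    """Ask the Wayback Machine to store a copy of each URL."""
    for url in urls:
        resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
        # On success the snapshot address comes back in the
        # Content-Location header; fall back to the final URL otherwise.
        snapshot = resp.headers.get("Content-Location") or resp.url
        print(f"{url} -> {snapshot}")

request_archival(["https://example.com/report"])
```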
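For finding a copy across cache-networks, the Memento TimeTravel aggregator already federates many public archives. A sketch, assuming its JSON API at timetravel.mementoweb.org and the documented response shape:

```python
import requests

def find_copy(url, when="20240101"):
    """Return the archived copy closest to `when` (YYYYMMDD), or None."""
    api = f"http://timetravel.mementoweb.org/api/json/{when}/{url}"
    resp = requests.get(api, timeout=30)
    if resp.status_code != 200:
        return None  # no archive in the aggregator has a copy
    closest = resp.json().get("mementos", {}).get("closest", {})
    uris = closest.get("uri", [])
    return uris[0] if uris else None

print(find_copy("http://example.com/"))
```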
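The core loop of a coat of amber, as a sketch: fetch a page, snapshot everything it links to into a local cache directory. (The real Amber plugin does this properly for WordPress and Drupal sites; the regex link extraction here is a stand-in for a real HTML parser.)

```python
import hashlib
import pathlib
import re
import requests

CACHE = pathlib.Path("amber-cache")
CACHE.mkdir(exist_ok=True)

def amber(page_url):
    """Snapshot every outbound link on page_url into the local cache."""
    html = requests.get(page_url, timeout=30).text
    # Crude link extraction; a real implementation would use an HTML parser.
    for target in set(re.findall(r'href="(https?://[^"]+)"', html)):
        path = CACHE / (hashlib.sha256(target.encode()).hexdigest() + ".html")
        if path.exists():
            continue  # already snapshotted
        try:
            path.write_text(requests.get(target, timeout=30).text, encoding="utf-8")
            print(f"cached {target} -> {path}")
        except requests.RequestException as err:
            print(f"skipped {target}: {err}")

amber("https://example.com/")
```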
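Finally, a sketch of the broken-link triage checklist above: check the page, then its base directory, then ask the Wayback Machine’s CDX API for timestamped captures. The content-hash step is stubbed out, since it assumes a hash was embedded in the link and some index exists to look it up in.

```python
from urllib.parse import urljoin
import requests

def triage(url, content_hash=None):
    """Run the triage checklist against one broken link."""
    def alive(u):
        try:
            return requests.head(u, timeout=15, allow_redirects=True).status_code < 400
        except requests.RequestException:
            return False

    print("page alive:    ", alive(url))
    print("base dir alive:", alive(urljoin(url, ".")))  # parent directory

    # Timestamped captures from the Wayback Machine's CDX index.
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": url, "output": "json", "limit": "3"},
        timeout=30,
    )
    rows = resp.json() if resp.text.strip() else []
    for row in rows[1:]:  # first row is the column header
        timestamp, original = row[1], row[2]
        print(f"capture: https://web.archive.org/web/{timestamp}/{original}")

    if content_hash:
        # Hypothetical: look the hash up in an index mapping hashes to URLs.
        print("content-hash lookup: no public index assumed here")

triage("http://example.com/old-page")
```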
Related questions: What other aspects of robustness need consideration? How are people making progress at each layer? What more is needed to have a mesh of archived links at every scale? For instance, WordPress supports a chunk of the Web; top CDNs cache more than that. What other players can make this happen? What is needed for them to support this?