Archiving Web links: Building global layers of caches and mirrors
The Web is highly distributed and in flux; the people using it, even more so. Many projects exist to optimize its use, including:
- Reducing storage and bandwidth: compressing parts of the web; deduplicating files that exist in many places, replacing many with pointers to a single copy of the file [Many browsers & servers, *Box]
- Reducing latency and long-distance bandwidth: caching popular parts of the web locally around the world [CDNs, clouds, &c]
- Increasing robustness & permanence of links: caching linked pages (with timestamps or snapshots, for dynamic pages) [Memento, Wayback Machine, perma, amber]
- Increasing interoperability of naming schemes for describing or pointing to things on the Web, so that it’s easier to cluster similar things and find copies or versions of them [HvdS’s 15-year overview of advancing interop]
This week I was thinking about the 3rd point. What would a comprehensively backed-up Web of links look like? How resilient can we make references to all of the failure modes we’ve seen and imagined? Some threads for a map:
- Links should include timestamps; important ones should request archival permalinks.
- When creating a reference, sites should notify each of the major cache-networks, asking them to store a copy.
- Robust links can embed information about where to find a cache (and possibly a fuzzy content hash?) in the anchor tag that generates the link.
- Permalinks can use an identifier system that allows searching for the page across any of the nodes of the local network, and across the different cache-networks. (Browsers can know how to attempt to find a copy.)
- Sites should have a coat of amber: a local cached snapshot of anything linked from that site, stored on their host or a nearby supernode. So as long as that site is available, snapshots of what it links to are, too.
- We can comprehensively track whether sites have signalled they have an amber layer. If a site isn’t yet caching what they link to, readers can encourage them to do so or connect them to a supernode.
- Libraries should host amber supernodes: caches for sites that can’t host those snapshots on their host machine.
- Snapshots of entire websites should be archived regularly.
  - Both public snapshots for search engines and private ones for long-term archives.
- A global network of mirrors (a la [C]LOCKSS) should maintain copies of permalink and snapshot databases.
  - Consortia of libraries, archives, and publishers should commit to a broad geographic distribution of mirrors.
  - Mirrors should be available within any country that has expensive interconnects with the rest of the world.
  - Prioritization should lead to a kernel of the cached web that is stored in ‘seed bank’ style archives, in the most secure vaults and other venues.
- There should be a clear way to scan for fuzzy matches for a broken link. Especially handy for anyone updating a large archive of broken links.
  - Is the base directory there? Is the base URL known to have moved?
  - Are distant-timestamped versions of the file available? [some robustlink implementations do this already]
  - Are there exact matches elsewhere in the web for a [rare] filename? Can you find other documents with the same content hash? [if a hash was included in the link]
  - Are there known ways to contact the original owner of the file/directory/site?
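A few of the threads above (timestamped links, cache pointers embedded in the link markup, and the first checks for a broken link) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `data-versionurl` and `data-versiondate` attribute names follow the Robust Links convention as I understand it, `data-contenthash` is a hypothetical attribute made up here, the hash is an exact SHA-256 rather than a fuzzy one, and all function names are invented for the example. The Wayback Machine URL pattern is the one that service uses publicly. Nothing here touches the network; it only builds markup and a list of places to look.

```python
import hashlib
from datetime import date
from html import escape
from urllib.parse import urlsplit, urlunsplit


def content_hash(body: bytes) -> str:
    """Exact SHA-256 digest of a page body (a stand-in for a fuzzy hash)."""
    return hashlib.sha256(body).hexdigest()


def robust_link(url, text, snapshot_url, linked_on, body=None):
    """Render an anchor tag that carries a timestamp and a cache location."""
    attrs = {
        "href": url,
        "data-versionurl": snapshot_url,             # where a cached copy lives
        "data-versiondate": linked_on.isoformat(),   # when the link was made
    }
    if body is not None:
        # Hypothetical attribute: not part of any convention I know of.
        attrs["data-contenthash"] = "sha256:" + content_hash(body)
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<a {rendered}>{escape(text)}</a>"


def parent_urls(url):
    """List each enclosing directory of a URL, nearest first ("is the base directory there?")."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    while segments:
        segments.pop()
        path = "/" + "/".join(segments) + ("/" if segments else "")
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


def wayback_url(url, linked_on):
    """Wayback Machine address for the capture nearest to the link date."""
    return f"https://web.archive.org/web/{linked_on:%Y%m%d}/{url}"


def recovery_candidates(url, linked_on):
    """Places to try, in order, when a link turns out to be broken."""
    return [wayback_url(url, linked_on)] + parent_urls(url)
```

A link checker could walk `recovery_candidates(url, linked_on)` in order and stop at the first address that responds; searching for a matching content hash or a rare filename would need an index that this sketch does not provide.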
Related questions: What other aspects of robustness need consideration? How are people making progress at each layer? What more is needed to have a mesh of archived links at every scale? For instance, WordPress supports a chunk of the Web; top CDNs cache more than that. What other players can make this happen? What is needed for them to support this?
Digital rights groups in Europe are gaining ground: a model to watch
The recent historic wins for net neutrality in the EU demonstrate an organized and informed advocacy network that is still not echoed in the US or in many other parts of the world. We should celebrate and learn from their work.
Thanks to Axel Arnbak for his thorough and delightful writeup of this.
Aksyonov predicts Crimean takeover in ’79 novel
Vassily Aksyonov wrote The Island of Crimea in 1979 – about an imagined future. It looks surprisingly like the present.
Kudos to Michael Idov at the New Yorker for writing about it beautifully, with all of its spooky accuracy.
(Night Wolves! Aksyonov again!)
Pope Francis won’t stop being awesome: please enjoy these sweet papal memes
Here is a gallery of great pope memes celebrating the awesomeness emanating from Catholicism’s new Pope.
After a Pope who sometimes made one despair that global religious leaders could inspire perspective, this is a daily source of happiness.
Aaron Swartz hackfests this weekend around the world: honoring his work
Friday November 08th 2013, 7:04 pm
Filed under: Aasw, Glory, glory, glory
Help continue projects Aaron believed in, in person or online.
I’ll be at the Cambridge event and aftermath throughout the long weekend.
Related project summaries:
Real estate investors in today’s Chile
In Puerto Varas, to be precise. An article by Sebastian. ᔥmadre.
“There are extraordinary landscapes, I think, and then there is this one. Those fields and villages hold a century-old pride that is moving.”
Women hefting weight: a global physical meme of strength and focus
I ran across a random development conference today. It included a classic photo of a woman in a wrap carrying a load on her head. This was being used to represent the “members of the local community” in Bali, in political discourse about the use of their land. Even the little thumbnail image used in diagrams to represent community vs. industry showed men wearing typical clothing and women wearing typical clothing… and carrying 15kg.
The same thing happens in images from across Africa, where women more often than men are the ones carrying heavy loads to and fro; at least on their heads (rather than on motorbikes or other vehicles). This strikes me as a meme reflective of strength and work ethic, though so omnipresent it is simply taken as a fact of life and not worth remarking on. I wonder what other such memes are out there.
HT: Studie Rede via: IEEE Actionable Data Book
Aside: I mistakenly began to write something here about ‘development porn’. A concept worth discussing but riddled with untruth… and not behind this meme, which is widely present in media in all cultures, not just in the eye of ‘international development’.
Kenya’s laptop dream: reaching for the firmament, and rote naysaying
Over at ZeroGeography, Mark Graham shares a prepub version of an essay he wrote for the Guardian, about the new Kenyan drive to provide laptops to its primary students. First, thanks to the author for posting his thoughts on his blog as well.
The argument that “this [money] could be better spent”, however, is a bit stale. I don’t generally go in for critical theory and analysis (despite the obvious rightness of tvtropes!), but sometimes patterns show up so strongly in someone’s writing or argument that they are clearly part of a larger social norm and can be understood as such.
This essay is one part imperialist critique of developing countries investing in new tools, one part assumption of bad implementation, and one part missed context.
Graham worries that Kenya’s new e-learning plan, which extends to all primary schools the recent efforts to make laptops available to older students, is not part of a larger strategy, though the budget speech he cites describes such a strategy. His assumptions about how much of the national budget goes to different basic needs also don’t seem to be accurate.
The central argument is one I hear often about why underdeveloped regions should slow down technological & educational change. It runs something like this:
A) don’t introduce new things, fix old things first.
B) come up with a strategy addressing all possible issues before including modern tech.
C) if your country is poorer than mine, there must be something basic and low-tech you need more.
D) technology amplifies existing skills; it is wasteful to subsidize it for the less privileged, who can’t use it properly anyway.
These arguments don’t stand up to a second look. Sure, it would be ideal to fix “all the things” — various underlying inequalities, inadequacies of the existing system of experts and mentors and teachers, gaps in the quality of textbooks and in local job opportunities for better-educated youth. But no single effort will do all of that. If you are lacking many things, your primary long-term bottleneck is often your ability to develop new solutions: you need more seed corn, not more ugali. Outside of an immediate crisis, you need knowledge, tools, factories, and other local capacity, so you can go on to invest in your own community while resolving other problems, basic and complex.
And lastly, the idea of not offering a powerful opportunity to those less privileged, because it might take them some time to make the best use of it… that sort of argument is not even wrong. In the short term, any opportunity would be used ‘more thoroughly’ by the already-privileged. But they have usually had that opportunity to begin with; government programs simply subsidize it for those without. Comparing who could “use it better” is a fallacy.
By definition, when you start bootstrapping you don’t have a lot; you get there step by step. And every individual and community deserves access to bootstrapping tools: blackboards, electricity, glasses, phones, bikes, computers, and other technology. Not necessarily for free, sometimes requiring sweat and barnraising by the community, but as part of a civil campaign to make this part of society everywhere. These are all generative technologies, catalysing other new work, returning far more than their cost in what they enable. This is true three times over for computers: they are communication devices, creative tools for making and sharing, and factories for new tools. So the results of a community learning to use them include trying and discovering new things not currently imagined.
Graham does make the following excellent point in his essay:
There is a long history of people and states framing information and communication technologies as a solution to economic, social, political, and even environmental problems.
So there is. Kenya should be clear that having tools, capacity, knowledge, stronger social networks, and access to more markets and jobs is not the same as solving specific problems. This will make it easier to solve some problems; it will create others; it will add to the general standard of living and also the expectations that come with it. It will empower people to do both good and bad things. It will be a boon to gamers and activists and gambling and muckraking and cottage industry and artists and pornography and transparency.
But it will surely prepare the country’s youth to be an active part of the internetworked world in which we live, and to help design its future.
Wikimedia is protected as a host, Italian courts rule
Former Italian Minister of Defense Cesare Previti, angry that his Wikipedia biography had at one point contained a statement he found offensive, recently filed suit against the Wikimedia Foundation, claiming that it published defamatory statements.
An Italian court ruled that as a hosting provider, not a content provider – allowing others to create content but not creating it – Wikimedia was protected from such claims. The court’s two considerations – whether the role of the WMF is clear, and whether there are procedures for readers to address errors – both marked Wikimedia as a host.
This is covered in more detail on the WMF blog. Thanks to our Italian counsel Hogan Lovells for their support in this case.
Ilario Valdelli, from the national chapter Wikimedia Italia, commented:
The justice in Italy is really slow in general, but in this case the timeline has been short and the result of the court of justice really good for the future.
The definition that the prosecutor is able to change the definition in Wikipedia and for this reason there is no sense to accuse Wikipedia or WMF, it’s a good way to give an indication for the future.
Wikipedian forced to delete article by the French police
In France, a Wikipedia admin was sought out by France’s homeland intelligence agency, the Direction Centrale du Renseignement Intérieur, brought physically to their offices, and forced to delete an article about a military base (which they claimed contained classified information) under threat of being held overnight.
This sort of bullying tactic is one up with which we should not put. The issue later became a minor cause célèbre in the French press for a short time.
The Wikidata Revolution: enabling structured data love
A year after its announcement as the first new Wikimedia project since 2006, Wikidata has now begun to serve the over 280 language versions of Wikipedia as a common source of structured data that can be used in more than 25 million articles of the free encyclopedia.
By providing Wikipedia editors with a central venue for their efforts to collect and vet such data, Wikidata leads to a higher level of consistency and quality in Wikipedia articles across the many language editions of the encyclopedia. Beyond Wikipedia, Wikidata’s universal, machine-readable knowledge database will be freely reusable by anyone, enabling numerous external applications.
“Wikidata is a powerful tool for keeping information in Wikipedia current across all language versions. Before Wikidata, Wikipedians needed to manually update hundreds of Wikipedia language versions every time a famous person died or a country’s leader changed. With Wikidata, such new information, entered once, will automatically appear across all Wikipedia language versions. That makes life easier for editors and makes it easier for Wikipedia to stay current.” – Sue Gardner
The development of Wikidata began in March 2012, led by Wikimedia Deutschland, the German chapter of the Wikimedia movement. Since Wikidata.org went live on October 30, a growing community of around 3,000 active contributors has been building its database of ‘items’ (e.g. things, people or concepts), first by collecting topics that are already the subject of Wikipedia articles in several languages. An item’s central page on Wikidata replaces the complex web of language links which previously connected these articles about the same topic in different Wikipedia versions. Wikidata’s collection of these items now numbers over 10 million. The community also began to enrich Wikidata’s database with factual statements about these topics (data like the mayor of a city, the ISBN of a book, the languages spoken in a country, etc.). This information has now become available for use on Wikipedia itself.
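To make the idea of an ‘item’ concrete, here is a minimal Python sketch of one central record that holds both the language links and the factual statements. It is a toy model, not Wikidata’s actual data model or API: the `Item` class, its field and method names, the identifier `Q-EXAMPLE`, the property name `mayor`, and the city and its mayor are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Item:
    """One central record per topic (hypothetical structure, not Wikidata's schema)."""
    item_id: str
    # wiki code -> article title in that language edition
    sitelinks: Dict[str, str] = field(default_factory=dict)
    # property name -> list of values, e.g. "mayor" -> ["B. Example"]
    statements: Dict[str, List[str]] = field(default_factory=dict)

    def language_links(self, from_wiki: str) -> Dict[str, str]:
        """Links to show on one edition's article: every other edition's title."""
        return {
            wiki: title
            for wiki, title in sorted(self.sitelinks.items())
            if wiki != from_wiki
        }

    def set_statement(self, prop: str, values: List[str]) -> None:
        """Record a fact once, centrally."""
        self.statements[prop] = list(values)


def render_field(item: Item, prop: str) -> str:
    """What any language edition would display for this property."""
    return ", ".join(item.statements.get(prop, []))


# A made-up city, linked from three language editions.
city = Item(
    item_id="Q-EXAMPLE",
    sitelinks={"enwiki": "Exampleville", "dewiki": "Beispielstadt", "frwiki": "Exempleville"},
)
city.set_statement("mayor", ["A. Placeholder"])
# The mayor changes: one edit, visible to every edition that reads the item.
city.set_statement("mayor", ["B. Example"])
```

Each edition calls `city.language_links("enwiki")` (or its own code) instead of maintaining its own list of links, and `render_field(city, "mayor")` returns the same current value everywhere.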
“It is the goal of Wikidata to collect the world’s complex knowledge in a structured manner so that anybody can benefit from it. Whether that’s readers of Wikipedia who are able to be up to date about certain facts or engineers who can use this data to create new products that improve the way we access knowledge.” – Denny Vrandečić, Wikidata project lead
The next phase of Wikidata will allow for the automatic creation of lists and charts based on the data in Wikidata. Wikimedia Deutschland will continue to support the project with an engineering team that is dedicated to Wikidata’s second year of development and maintenance.
Wikidata is operated by the Wikimedia Foundation and its fact database is published under a Creative Commons 0 public domain dedication. Funding of Wikidata’s initial development was provided by the Allen Institute for Artificial Intelligence [AI]², the Gordon and Betty Moore Foundation, and Google, Inc.
More information available here:
Volunteers can get involved with Wikidata in many ways. Some of the first applications demonstrating the potential of Wikidata as a platform:
- The simia “tree of life” drawn from relations among biological species in Wikidata’s database
- “GeneaWiki” generates a graph showing a person’s family relations as recorded in Wikidata. See for example: the Bach family
The Unscrupulous Institute for Cultural Diplomacy Fails
I had never heard of the Institute for Cultural Diplomacy before they started harassing and legally threatening my friend Mako Hill. But they are clearly an organization that understands neither cross-cultural communication nor diplomacy.
There is also no possible outcome of this dispute that is in their favor. The Streisand effect can’t be reversed by filing lawsuits. Filing lawsuits will never make a diplomacy institute look good. And Wikipedia policy isn’t structured in such a way that it could ever have an article about them after this without mentioning the drama in the first place – even if enough editors one day changed their minds about notability.
Update: Cory D. was thoughtful enough to put up a note about the case on BoingBoing.
Genius And The Soil / Inspired By Aaron: Thoughts From me, mako, jwyg
From the latest issue of the UK magazine Red Pepper. With photos by Sage Ross from a memorable Boston Wikipedia meetup in 2009. Click on the pages for higher resolution:
Annotation Hacks: Hypothesis XXX begins to converge
The various threads around Hypothes.is, the Open Annotation spec, and the campus-wide annotation projects at MIT, Yale, and Harvard are starting to converge. It’s nice to see a future pillar of the global web take shape – with no less friction but a more diverse audience than gathered to create the early Internet specs.
I’m at the Convergence Workshop at Harvard on the topic today, and will be at the iAnnotate workshop in San Francisco in 3 weeks. Consolidating notes on a “Hypothesis XXX” hackpad. [Btw: We dearly need a fully open hackpad equivalent with more reliable uptime than piratepad et al.! I default to HP when I have a doc that needs to sustain heavy editing and be guaranteed available during a narrow window of time at a conference… but I would much rather use a Wikimedia or similarly hosted service, with a more explicit guarantee of ongoing availability, at no cost ever.]
Half-Baked Idea (fit for the half-bakery): a Coffee Presser
Java-Logs exist. They’re just like firestarters but they smell nice and are made of coffee grounds, wrapped in paper just like regular firestarters. Coffee machines produce lots and lots of grounds that get thrown out (ideally into compost, but usually not; often they just sit around, uncompacted, and grow mold).
So: someone should design a “Coffee Presser” add-on to traditional popular coffeemakers – and standalone for dumping the dregs from your French Press – that produces mini Java Logs and wraps them so you can hold them when they are spit out. Perfect for those climes suitable for both lots of hot coffee and nightly fires.