Is De-Identification Dead Again?

Earlier this year, the journal Science published a study called “Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata” by Yves-Alexandre de Montjoye et al. The article has reinvigorated claims that deidentified research data can be reidentified easily. These claims are not new, but their recitation in a vaunted science journal led to a new round of panic in the popular press.

The particulars of the actual study are neither objectionable nor enlightening. The authors demonstrate that in highly dimensional databases (for example, those with a lot of variables that can take a lot of different values), each person in the database is distinguishable from the others. Indeed, each person looks distinguishable from the others based on just a small subset of details about them. This will not surprise anybody who actually uses research data because the whole point of accessing individual-level data is to make use of the unique combinations of factors that the people represented in the database possess. Otherwise, aggregated tables would do. What is surprising, however, is the authors’ bold conclusions that their study somehow proves that data anonymization is an “inadequate” concept and that “the open sharing of raw deidentified metadata data sets is not the future.” How Science permitted this sweeping condemnation of open data based on such thin evidence is itself a study in the fear and ideology that drive policy and scientific discourse around privacy.

What the de Montjoye Study Actually Demonstrated

The credit card metadata study used a database consisting of three months of credit card records for 1.1 million clients in an unspecified OECD country. The bank removed names, addresses, and other direct identifiers, but did nothing else to mask the data. The authors used this database to evaluate the chance that any given person is unique among clients in the database based on X number of purchase transactions. So, using an example from the paper, if Scott was the only person who made a purchase at a particular bakery on September 23rd and at a particular restaurant on September 24th, he would be unique with only two transactions within the database. The authors use these “tuples” (place-date combinations) to estimate the chance that a person in the database looks unique compared to the other data subjects. They found that 90% of the data subjects were unique in the database based on just four place-date tuples. And the rate of uniqueness increased if approximate price information was added to each tuple.
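To make the measurement concrete, here is a minimal Python sketch of how one might compute this kind of uniqueness rate on a toy dataset. It is not the authors' code: the `unicity` function, the three-person dataset, and the names other than Scott are hypothetical, invented only to illustrate the idea of checking whether k known place-date tuples match exactly one person in the database.

```python
import random
from collections import defaultdict


def unicity(records, k, trials, rng):
    """Estimate the share of people who are the only match in `records`
    for k place-date tuples drawn from their own purchase history.

    `records` maps a person label to a set of (place, date) tuples.
    """
    # Index every tuple to the set of people who share it.
    index = defaultdict(set)
    for person, tuples in records.items():
        for t in tuples:
            index[t].add(person)

    # Only people with at least k tuples can be tested at this k.
    eligible = [p for p, ts in records.items() if len(ts) >= k]
    unique = 0
    for _ in range(trials):
        person = rng.choice(eligible)
        known = rng.sample(sorted(records[person]), k)
        # Everyone in the database consistent with all k known tuples.
        matches = set.intersection(*(index[t] for t in known))
        if matches == {person}:
            unique += 1
    return unique / trials


# Hypothetical toy data: every single tuple is shared by two people,
# but every pair of tuples belongs to exactly one person.
records = {
    "scott": {("bakery", "09-23"), ("restaurant", "09-24")},
    "ana": {("bakery", "09-23"), ("cafe", "09-24")},
    "li": {("restaurant", "09-24"), ("cafe", "09-24")},
}

rng = random.Random(0)
rate_one_tuple = unicity(records, k=1, trials=100, rng=rng)
rate_two_tuples = unicity(records, k=2, trials=100, rng=rng)
print(rate_one_tuple, rate_two_tuples)
```

Note what the function measures: whether a person is the only match among the people in `records`. It says nothing about anyone who is not in the database.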

The authors treat database uniqueness and reidentifiability as one and the same. That is, the authors treat the chance that a person is unique in the dataset based on X number of tuples as the chance that the person can be reidentified.

I am sympathetic to the authors’ goal of finding a concrete, quantifiable measure of privacy risk. But database uniqueness should not be its measure. Measures of sample uniqueness systematically exaggerate the risk of reidentification. Consequently, any research and data sharing policy that relies only on sample uniqueness as the measure of re-identification risk will strike the balance of privacy and data utility interests in the wrong place.

Problem 1: Sample Uniqueness is Not Reidentification. (It’s Not Even Actual Uniqueness.)

The greatest defect in the Science article is treating uniqueness within a sample database as equivalent to “reidentification,” which the authors do several times. For example, the authors state that 90% of individuals can be “uniquely reidentified” with just four place-date tuples. I suspect that most readers interpreted the article and its subsequent coverage in the popular media to mean that if you know just four pieces of place-date purchase information for a person, you are 90% likely to be able to figure out who they are in the de-identified research database. But the authors did not come close to proving that.

The problem is that uniqueness in a deidentified research database cannot tell us whether the data subject is actually unique in the general population. The research database will describe only a sample of the population, and may be missing a lot of information about each of its data subjects. Inferring actual uniqueness from database uniqueness requires some extra information and modeling about what proportion of the population is sampled, and how complete the data about them is.

To give an extreme example, let’s go back to “Scott,” the credit card-holder who went to a bakery on September 23rd and a restaurant on September 24th. Suppose that his data was part of a research dataset that included the purchase histories of just ten credit card customers. Using this database on ten people, could we reliably say anything about whether Scott was the only person in his city to go to the bakery and the restaurant? Of course not. We may have a hunch that the city’s inhabitants are unlikely to go to this bakery and that restaurant on the same days that Scott did, but we’d be using our intuitions rather than the research data to draw our conclusions about uniqueness.
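The ten-person example can be simulated. The sketch below is a toy model under loudly stated assumptions, not an analysis of the study's data: each person in a hypothetical city is assigned one of a fixed number of equally likely purchase "profiles," and a small sample is drawn at random. The function name and every parameter value are invented for illustration.

```python
import random
from collections import Counter


def uniqueness_rates(population_size, sample_size, n_profiles, rng):
    """Return two shares, both measured over the sampled people:
    (unique within the sample, unique within the whole population)."""
    # Assumption: profiles are assigned uniformly at random.
    population = [rng.randrange(n_profiles) for _ in range(population_size)]
    sampled_ids = rng.sample(range(population_size), sample_size)
    sample = [population[i] for i in sampled_ids]

    sample_counts = Counter(sample)
    population_counts = Counter(population)

    unique_in_sample = sum(1 for p in sample if sample_counts[p] == 1)
    unique_in_population = sum(1 for p in sample if population_counts[p] == 1)
    return unique_in_sample / sample_size, unique_in_population / sample_size


# Hypothetical city of 20,000 people, 200 possible profiles, sample of ten.
rng = random.Random(2015)
sample_rate, population_rate = uniqueness_rates(20_000, 10, 200, rng)
print("unique in sample:", sample_rate)
print("unique in population:", population_rate)
```

Anyone who is unique in the population is necessarily unique in the sample, but not the other way around, so the first number can never be smaller than the second. How far apart they sit depends on the sampling fraction and on how complete the data is, which is exactly the extra information that database uniqueness alone does not supply.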

Privacy in a Data Collection Society

Jane and I are here with a great group of presenters and attendees at a conference at Loyola University Chicago School of Law, Privacy in a Data Collection Society. I’m speaking this afternoon on the folly of information sharing as a means of improving cybersecurity, and I’ll post a cleaned-up draft of my remarks here (hopefully, eventually to become an essay). And, I’ll try to post some ad hoc updates on what the speakers have to say.

Update 1: Here is Jane’s abstract:

All Life Is an Experiment. (Sometimes It’s a Controlled Experiment.)

What the Facebook Emotion Contagion Study Can Teach Us About the Policy and Public Perception of Research

Thesis: Our unexamined instincts about social science research lead us to craft laws and public opinions that are backwards. Our disapprobation and legal restrictions apply most strongly to research that is performed by academics and other neutral investigators, that is more methodologically sound, that distributes its burdens more evenhandedly, and that shares its insights with the general public.

Update 2: Meg Jones, on the Right to be Forgotten

  • Google v. Spain – Spanish newspaper had right to process information on González, but Google did not.
  • Google assesses individual’s claims under national law
  • Lauber / Werle v. Wikipedia – brothers convicted of murdering actor, and sought to have references to them removed from Web sites referring to the crime.
  • Martin v. Hearst – CT erasure statute nullified Martin’s arrest. She sued newspaper for publishing about her arrest. Second Circuit: newspaper’s truth is different from her truth.
  • Clash of values between Europe and U.S. over forgetting
  • [shows clip of Phineas and Ferb “Cyberspace Rules of the Road”]
  • Link rot and other ways that information disappears
  • Digital immortality? Internet is not the perfect memory we’re afraid of
  • Poll on whether Americans ought to have right to remove irrelevant information from search results (39% Yes, 21% No too hard to define, 18% No public record, 15% Yes only minors, 6% Yes except public figures)

Update 3: Felix Wu, How EU Right to be Forgotten Relates to US Law

  • Conventional wisdom: EU approach is crazy and would never work in US
  • Felix: less incompatible than we think, and the incompatibility is different than commonly believed
  • US does have areas where information is removed: Fair Credit Reporting Act (bankruptcies – 10 years)
  • Key is sectoral vs. over-arching approach
  • We would be surprised to see US adopt, as first omnibus right, a right to be forgotten
  • Why not adopt a sector-specific RtbF?
  • HIPAA – already specifies certain sensitive information where access is restricted (though HIPAA applies only to covered entities)
  • How to think about Google in this context? Is it a new sort of credit report?
  • Credit report is defined, in part, by use – Google is used for commercial and non-commercial purposes
  • Removal in certain contexts as intermediate step


  • Mention of data obscurity as term rather than RtbF! Hailing Woody Hartzog!
  • How do we know about periods of data retention by companies?

Update 4: Jane Bambauer, All Life Is an Experiment

  • Using Facebook emotional contagion study as vehicle for instincts and laws about research
  • Reactions most harsh when research most legitimate – we criticize academics far more than industry
  • Sanctions are strongest when study authors disclose results to public
  • Facebook’s alteration of scale of emotion in postings led to effect on postings by users seeing them
  • Why did this experiment engender controversy, rather than “poke to vote,” for example?
  • Objections to ethics of research: lack of informed consent, surreptitious intervention, violation of Common Rule
    • FB study undoubtedly violates FIPPs (respect for context)
    • God punishes King David with plague for taking a census – only God is to know that information
    • Good research requires repurposing data – Google has identified unreported side effects of drugs this way
    • Piketty repurposed tax data for his book on wealth distribution
  • Surreptitious manipulation of Newsfeed
    • Standard part of metrics-driven research
    • Bricks-and-mortar retail observes traffic to optimize shelf display
    • Individual physicians may select among equally effective treatment options for each patient – may be useful to formalize the experiment since it has better controls
    • Sunstein’s “50 Shades of Manipulation” – promotes self interest of manipulator, and designed to bypass cognitive reasoning
    • Downstream use of research can fit within this definition, but the research itself does not – it’s a cost to the company, and the company does not know if it bypasses reasoning
    • How do we know status quo is preferable?
    • Research is less self-serving when it’s shared publicly
    • Researchers at Cornell were the ones who took the real hit, but Cornell’s IRB says it’s in compliance
    • Even if their research was not exempt from IRB review, it would have qualified for expedited review and exemption from informed consent
    • The most legally exposed people were the researchers, not FB or the journal
  • Problematic outcomes
    • Companies are at a disadvantage when they work with neutral / academic researchers
    • Firms benefit when they avoid formally testing hypotheses and assumptions using randomized controlled trials
    • It’s safer to avoid sharing results with media / public
  • Sensible to reform Common Rule
    • Require IRB review when intervention would create physical or legal risk if performed for non-research purposes

Brett Frischmann – Being Human in the Twenty-First Century: How Social and Technological Tools are Reshaping Humanity

  • Machines and technologies steer us in ways that make us increasingly predictable and manipulable, and ultimately less human
  • Post-WWII: concerns about computers overtaking humans – Turing test as exemplar
  • We want to be humans who use computers, not humans who are computers
  • When does technology replace or diminish our humanity? Can we detect it?
  • Hard definitional baseline – what is human?
  • Interconnected sensor networks, Internet of Things, Big Data will expand scale / scope of human engineering – ubiquity is key
  • Technology / humanity are abstract and complex
  • 3 parts to project
    • Humans and tools – technological dehumanization
    • Human-focused Turing-type tests
    • Applications (critique of nudging) – each incremental nudge can be justified, but path of nudging itself may be unjustifiable
  • Focus is techno-social engineering of humans: influence, manipulate, construct
  • Internet has transformed environments within which we live our lives
  • Demand for Big Data is dependent upon sensors on / around humans
  • IBM’s Watson as an example of technology approaching Turing line
  • Brett is interested in whether humans are approaching Turing line – conditions under which they’re indistinguishable from a machine
    • What happens if human passes test and appears machine? Consequences?
    • On-line contracting: designed to nudge you to click I Agree


Deven Desai – Associational Freedom and Data Hoarding

  • FBI has stated preference for using warrant for GPS tracking
  • Concern for associational freedom and interplay with Fourth Amendment
  • Freedom to develop ideas before speaking – vital to self-governance
  • Sedition Act criminalizes speech and assembly separately
  • Meet-ups and activists are current incarnations of assembly concerns – fear of backward-looking surveillance
  • Protect precursors to speech
  • Bugging in public places undercuts associational freedom
  • Digital data can be hoarded, and lack of rules on law enforcement use leads inexorably to accumulation
  • Key limits
    • Duration
    • Minimization
    • Apply limits retrospectively as well for searches in data troves
    • Return – government must return or delete data


Helen Nissenbaum – Big Data’s End Run Around Informed Consent

  • Full title of paper: Anonymity and Consent
  • Big Data: epistemological paradigm – faith in power of data to produce knowledge
  • Ethics of big data – what happens when the data is about individuals?
    • Anonymity breaks link between data and identifiable individual
    • Thesis: big data poses insurmountable challenges to anonymity and consent – renders them ineffective in the quest for privacy
  • Notice & consent enshrined in U.S. privacy regulation (FIPPs, GLBA, FERPA, VPPA, notice and opt-out requirements)
    • Require consent from subjects if one deviates from substantive rules
    • Notice and choice regime of ToS online
    • GLBA gives you very little chance to opt-out
    • Critiques of notice and consent as theoretical matter and in operational challenges
  • Challenges to N&C increasing
    • More actors, information, flow
    • Impossible to predict future uses or consequences
  • Transparency dilemma: impossible to have a policy that is both comprehensible and comprehensive
  • Public lives of others: inferences based on network analysis, social networks, representative sample
  • Informed consent may have to be abandoned, which is acceptable because informed consent is a means rather than an end (which is privacy)
  • Privacy as control over information is wrong definitional approach
  • Instead, privacy as contextual integrity
    • Ideal informational norms: settle competing interests / preferences / desires best; promote ethical and political values; promote context-specific ends and values for social integrity
  • Patient consent operates as permission for limited departures from standards / expectations
  • Key role of background assumptions and societal constraints
  • Privacy policies should shrink in importance, and societal limitations should wax in importance, in terms of constraining information flow

Against Jawboning

I’d be grateful for feedback on a new draft article, Against Jawboning, coming out in volume 100 of the Minnesota Law Review. Here’s the abstract:

Despite the trend towards strong protection of speech in U.S. Internet regulation, federal and state governments still seek to regulate on-line content. They do so increasingly through informal enforcement measures, such as threats, at the edge of or outside their authority – a practice this Article calls “jawboning.” The Article argues that jawboning is both pervasive and normatively problematic. It uses a set of case studies to illustrate the practice’s prevalence. Next, it explores why Internet intermediaries are structurally vulnerable to jawboning. It then offers a taxonomy of government pressures based on varying levels of compulsion and specifications of authority. To assess jawboning’s legitimacy, the Article employs two methodologies, one grounded in constitutional structure and norms, and the second driven by process-based governance theory. It finds the practice troubling on both accounts. To remediate, the Article considers four interventions: implementing limits through law, imposing reputational consequences, encouraging transparency, and labeling jawboning as normatively illegitimate. In closing, it extends the jawboning analysis to other fundamental constraints on government action, including the Second Amendment. The Article concludes that the legitimacy of informal regulatory efforts should vary based on the extent to which deeper structural limits constrain government’s regulatory power.

The Antidote for “Anecdata”: A Little Science Can Separate Data Privacy Facts from Folklore

Guest post by Daniel Barth-Jones

For anyone who follows the increasingly critical topic of data privacy closely, it would have been impossible to miss the remarkable chain reaction that followed the New York TLC’s (Taxi and Limousine Commission) recent release of data on more than 173 million taxi rides in response to a FOIL (Freedom of Information Law) request by Urbanist and self-described “Data Junkie” Chris Whong.  It wasn’t long at all after the data went public that the sharp eyes and keen wit of software engineer Vijay Pandurangan detected that taxi drivers’ license numbers and taxi plate (or medallion) numbers hadn’t been anonymized properly and could be decoded due to the failed encryption process.
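The failure is easy to reproduce in miniature. The sketch below assumes the identifiers were obscured with an unsalted MD5 hash and, purely for illustration, uses a hypothetical four-character medallion format (digit, letter, digit, digit). The real identifier formats may differ, and this is not Pandurangan's code. The point is only that when the space of possible inputs is small, every possible hash can be precomputed and looked up.

```python
import hashlib
import itertools
import string


def md5_hex(value):
    """Unsalted MD5 of an ASCII identifier, as a hex string."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def build_lookup_table():
    """Precompute hash -> identifier for every identifier in the
    (hypothetical) digit-letter-digit-digit format."""
    table = {}
    for d1, letter, d2, d3 in itertools.product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        identifier = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(identifier)] = identifier
    return table


lookup = build_lookup_table()

# A "de-identified" value as it might appear in a released file.
published_hash = md5_hex("7B82")  # made-up identifier
recovered = lookup[published_hash]
print(len(lookup), "candidates hashed; recovered:", recovered)
```

A secret salt or key, or random replacement identifiers with no mathematical link to the originals, would defeat this kind of table lookup. An unsalted hash over a small input space does not.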

Soon after Pandurangan’s revelation of the botched unsalted MD5 cryptographic hash in the TLC data, Anthony Tockar, working on a summer Data Science internship with Neustar, posted his blog “Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset” with the aim of introducing the concept of “differential privacy” and announcing Neustar’s expertise in this area. (It’s well worth checking out both Tockar’s short, but informative, tutorial on differential privacy and his application of the method to the maps of the TLC taxi data as his smartly designed graphics allow you to interactively adjust differential privacy’s “epsilon” parameter and see its impact on the results.)
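For readers who have not met the “epsilon” parameter before, a minimal sketch may help. It uses the textbook Laplace mechanism on a simple count, which is an assumption on my part and may not match what Tockar’s maps actually do. The function names and the count of 100 are hypothetical. Smaller epsilon means more noise and stronger privacy; larger epsilon means less noise and more accurate results.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two independent
    exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.
    For a counting query, adding or removing one person changes the
    answer by at most 1, hence the default sensitivity of 1."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon, rng, true_count=100, trials=2000):
    """Average distance between the noisy and true counts."""
    total = 0.0
    for _ in range(trials):
        total += abs(private_count(true_count, epsilon, rng) - true_count)
    return total / trials


rng = random.Random(0)
for epsilon in (0.1, 1.0, 10.0):
    print("epsilon:", epsilon, "mean absolute error:", mean_abs_error(epsilon, rng))
```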

To illustrate possible rider privacy risks for the TLC taxi-data, Tockar, armed with some celebrity paparazzi photos and some clever insights as to when, where and how to find potential vulnerabilities, produced a blog post replete with attention-grabbing tales of miserly celebrities who stiffed drivers on their tips and of cyber-stalking strip club patrons, which quickly went viral. And so as to up the fear, uncertainty, and dread (FUD) factors surrounding his attacks, Tockar further gravely warned us all in his post that:

Equipped with this [TLC Taxi] dataset, and just a little auxiliary information about you, it would be quite trivial for someone to follow your movements, collecting data on your whereabouts and habits, while you remain blissfully unaware. A stalker could find out where you live and work. Your partner may spy on you. A thief could work out when you’re away from home, based on your habits.

However, as I’ll explain in more detail, sorting out these quite concerning claims rationally, so that we can weigh the possible trade-offs between Freedom of Information and open government principles and data privacy concerns, requires that we move beyond mere citation of anecdotes (or worse, collections of anecdotes in which carefully targeted and especially vulnerable, non-representative cases have been repackaged as “anecdata”). Instead, we must base our risk assessment on a systematic investigation appropriately founded in the principles of scientific study design and statistically representative samples. Regrettably though, this wasn’t the case here and has quite often not been the case for many headline-snatching re-identification attacks that have repeatedly made the news in recent years.


Big Pharma: the New Hustler

That’s the provocative thesis of Jane’s post over at Balkinization for the conference Public Health in the Shadow of the First Amendment. Worth a read! And here’s her second post.

The Cambridge University Press decision and Educational Fair Use

The Eleventh Circuit released its 129-page opinion in Cambridge University Press v. Patton (which most of us probably still think of as the Becker case) last Friday. Although the appeals court reversed what I thought was a pretty solid opinion of the district court upholding Georgia State University’s practice of distributing digital “course packs” of reading materials to its students, it is very far from a big win for the publishers who challenged the practice. There is a lot to like in the opinion for advocates of educational fair use, and it is difficult to imagine that the district court on remand will rule in favor of the publisher plaintiffs with respect to very many of the works at issue even though the appeals court directed changes in some aspects of its fair use analysis. Although it found some errors in the district court’s treatment of the second and third fair use factors, the appeals court sensibly and correctly rejected several arguments that would have materially constricted the scope of educational fair use in the digital arena. (Full disclosure: I joined Jason Schultz’s excellent amicus brief on behalf of Georgia State.)

Although the Court of Appeals’ opinion deserves a close look, I’ll confine myself here just to noting a few highlights.

Celebrities, Copyright, and Cybersecurity

The fall began with a wave of hacked nude celebrity photos (as Tim notes in his great post). The release generated attention to the larger problem of revenge porn – or, more broadly, the non-consensual sharing of intimate media. Legislators and scholars have moved to tackle the problem. Danielle Citron proposes a model statute for criminalizing revenge porn in a Slate article (excerpted from her new book), and California finally got around to dealing with the random selfie-only coverage of its law.

I’ve written an article that proposes using copyright law to address non-consensual sharing (but see Rebecca Tushnet’s critique). It’s worth noting that Reddit took down the illicit celebrity images after receiving a copyright claim – which sites have to respect, on pain of getting sued into oblivion (since Section 230’s immunity doesn’t apply to IP claims). Apparently others have the same idea – one attorney is threatening Google with a $100M lawsuit for failure, in his view, to comply with the DMCA’s takedown requirements. (The letter bloviates and any suit has as much chance of winning as this plaintiff did.) The revenge porn contretemps raises at least four issues:

1. Everyone does it – The sharing of intimate media (videos and images of people nude or engaged in sexual activity) is ubiquitous. Jennifer Lawrence, Kate Upton, Kirsten Dunst – somehow, it took leaks of celebrity intimate media to drive home this point. This has two helpful consequences (one hopes). First, “just say no” should go the way of Nancy Reagan’s campaign, since it had about the same efficacy. Partners sharing intimate media is the new normal, and it’s foolish to pretend otherwise. Second, the moral critique attached to the practice should fade. One common response to revenge porn is “He/she took the risk, so too bad.” That approach focuses culpability on the victim, not the offender. The risk is not in using intimate media – it’s in trusting the wrong person. Most of us have done that at some point.

2. Stupid is as stupid does – Regulating revenge porn properly matters. Here in Arizona, it’s only a matter of time before the state’s terribly-drafted revenge porn bill is enjoined by a federal judge. (The ACLU is suing to block the bill, along with a coalition of bookstores, journalists, and others.) I, along with many others, pointed out that the bill was fatally flawed the moment it passed. This means that victims in Arizona are going to be without protection because their legislators failed them – and that all of us Arizonans are going to fund the state’s defense of a statute that is without hope. The Arizona legislature could have gotten it right – my understanding is that they consulted law professor and revenge porn expert Mary Anne Franks during the drafting – but they whiffed: the drafters apparently ignored Prof. Franks’s good advice. (It’s not as though she drafted a model statute they could have used.) So too with Texas, where the legislature messed up a statute that is arguably underinclusive. So too California, although the Golden State just fixed its law. The lesson is simple: legislators should take their time, get diverse input, and ask the experts.

3. Changing norms – One interesting and hopeful development with the celebrity revenge porn hack is a new wave of calls for people not to look at the pictures. Those calls aren’t likely to be highly effective; there are plenty of people all too eager to see Jennifer Lawrence nude. But this could herald a shift towards disapprobation not only of leaking intimate media, but of viewing it if it’s shared without consent. Norms are powerful regulators, and this change would mark a useful riposte to the gleeful distribution of revenge porn.

4. Everyone needs cybersecurity – It appears that the celebrity photos were obtained through a combination of guessing security questions on Apple’s iCloud service and, perhaps, social engineering. Early reports also suggested that attackers may have simply used dictionary attacks to guess passwords on iCloud. The truth is probably a mix. But it means we all must start to care about cybersecurity. We all have something to hide: credit card numbers, trade secrets, job applications, nude selfies – the list goes on. We carry that information on an ever-increasing array of devices and Internet services. That means we have to invest some time and effort to do things like check privacy and security policies, figure out whether your smartphone encrypts its data, and use good passwords for things you care about. Cybersecurity: it’s not just for geeks anymore.

This isn’t the last we’ll hear of this topic, unfortunately. But perhaps the discourse is shifting in a useful direction…


On Accuracy in Cybersecurity

I have a new article on how to address questions of accuracy in cybersecurity up on SSRN. It’s titled Schrödinger’s Cybersecurity; here’s the abstract:

Both law and cybersecurity prize accuracy. Cyberattacks, such as Stuxnet, demonstrate the risks of inaccurate data. An attack can trick computer programs into making changes to information that are technically authorized but incorrect. While computer science treats accuracy as an inherent quality of data, law recognizes that accuracy is fundamentally a socially constructed attribute. This Article argues that law has much to teach cybersecurity about accuracy. In particular, law’s procedural mechanisms and contextual analysis can define concepts such as authorization and correctness that are exogenous to code. The Article assesses why accuracy matters, and explores methods law and cybersecurity deploy to attain it. It argues both law and cybersecurity have but two paths to determining accuracy: hierarchy, and consensus. Then, it defends the controversial proposition that accuracy is constructed through social processes, rather than emerging from information itself. Finally, it offers a proposal styled on the common law to evaluate when accuracy matters, and suggests that regulation should bolster technological mechanisms through a combination of mandates and funding. Like the cat of Schrödinger’s famous thought experiment, information is neither accurate nor inaccurate until observed in social context.

Cite: Derek E. Bambauer, Schrödinger’s Cybersecurity, 48 UC Davis Law Review (forthcoming 2014).


ACLU Challenges Arizona Revenge Porn Law

The ACLU, ably assisted by Dentons US LLP, has filed a challenge to Arizona’s revenge porn law in federal district court (complaint, ACLU blog, WIRED story). This is great news for Arizonans: the bill was terribly drafted and unconstitutional from the moment it was signed into law. Fighting revenge porn is important, but as Arizona is about to learn, you don’t get to trample the Constitution even in the service of a good cause. (Here’s my earlier post on the law.)


Alan Trammell and I have a new article coming out on the problems of personal jurisdiction analysis when it involves Internet contacts. (The title is Personal Jurisdiction and “teh Interwebs”; I tried very hard to convince Alan to go with the title of this post, to no avail.) Abstract is below; we’d love your comments and thoughts.

For nearly twenty years, lower courts and scholars have struggled to figure out how personal jurisdiction doctrine should apply in the Internet age. When does virtual conduct make someone amenable to jurisdiction in any particular forum? The classic but largely discredited response by courts has been to give primary consideration to a commercial Web site’s interactivity. That approach distorts the current doctrine and is divorced from coherent jurisdictional principles. Moreover, scholars have not yielded satisfying answers. They typically have argued either that the Internet is thoroughly exceptional and requires its own rules, or that it is largely unexceptional and can be subject to current doctrinal tests. 

The difficult relationship between the Internet and modern personal jurisdiction doctrine is a symptom of a much larger problem. We argue that the Supreme Court’s current approach has bifurcated physical and intangible harm. Viewed through that lens, the overarching problem comes into focus because rules that sensibly govern the physical world apply awkwardly — sometimes incoherently — to intangible harm. Accordingly, we propose a return to personal jurisdiction’s first principles, particularly a concern for fairness and predictability. We argue that courts should dispense with the fiction that purely virtual conduct creates any meaningful contact with a particular forum. The narrow approach that we advocate likely will restrict the number of places where a plaintiff can sue for intangible harm, but through three test cases we demonstrate why such a rule will enhance fairness and predictability while also ensuring sufficient access to justice.

Cite: Alan M. Trammell & Derek E. Bambauer, Personal Jurisdiction and “teh Interwebs,” 100 Cornell Law Review (forthcoming 2015).