Applying Information Privacy Norms to Re-Identification Demonstrations (Re-Identification Symposium)

This is the first post in Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. We’ll have more contributions throughout the week. Background on the symposium is here. You can call up all of the symposium contributions by clicking here (or by clicking on the “Re-Identification Symposium” category link at the bottom of any symposium post). —MM

By Stephen Wilson

I’m fascinated by the methodological intersections of technology and privacy – or rather the lack of intersection, for it appears that a great deal of technology development occurs in blissful ignorance of information privacy norms. By “norms” I mean, in the main, the widely legislated OECD Data Protection Principles (see Graham Greenleaf, Global data privacy laws: 89 countries, and accelerating, Privacy Laws & Business International Report, Issue 115, Special Supplement, February 2012).

Standard data protection and information privacy regulations world-wide are grounded in a reasonably common set of principles; these include, amongst other things, that personal information should not be collected if it is not needed for a core business function, and that personal information collected for one purpose should not be re-used for unrelated purposes without consent. These sorts of privacy formulations tend to be technology neutral; they don’t much care about the methods of collection but focus instead on the obligations of data custodians regardless of how personal information has come to be in their systems. That is, whether you collect personal information from the public domain, obtain it from a third party, or synthesise it from other data sources, you are generally accountable under the Collection Limitation and Use Limitation principles in the same way as if you had collected that personal information directly from the individuals concerned.

I am aware of two distinct re-identification demonstrations that have raised awareness of the issues recently. In the first, Yaniv Erlich used what I understand are new statistical techniques to re-identify a number of subjects who had donated genetic material anonymously to the 1000 Genomes project. He did this by correlating genes in the published anonymous samples with genes in named samples available from genealogical databases. The 1000 Genomes consent form reassured participants that re-identification would be “very hard”. In the second notable demo, Latanya Sweeney re-identified volunteers in the Personal Genome Project using her previously published method of matching on a few demographic values (such as date of birth, sex and postal code) extracted from the otherwise anonymous records.
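To make the second technique concrete, here is a minimal sketch of linking on demographic values. It is not Sweeney’s actual code or data: every record, name and field below is invented, and the function `link` and the constant `QUASI_IDENTIFIERS` are hypothetical names chosen for illustration. The sketch assumes two lists, one of “anonymous” records and one of named records, that happen to share date of birth, sex and postal code.

```python
from collections import defaultdict

# All records below are synthetic and hypothetical; no real people or datasets.
deidentified = [
    {"record_id": "A1", "dob": "1970-03-14", "sex": "F", "postcode": "02138"},
    {"record_id": "A2", "dob": "1985-11-02", "sex": "M", "postcode": "02139"},
    {"record_id": "A3", "dob": "1990-06-30", "sex": "F", "postcode": "02140"},
]
named_register = [
    {"name": "Alice Example", "dob": "1970-03-14", "sex": "F", "postcode": "02138"},
    {"name": "Bob Example", "dob": "1985-11-02", "sex": "M", "postcode": "02139"},
    {"name": "Brian Example", "dob": "1985-11-02", "sex": "M", "postcode": "02139"},
]

# The demographic fields shared by both sources (an assumption of this sketch).
QUASI_IDENTIFIERS = ("dob", "sex", "postcode")


def key(row):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def link(deid_rows, named_rows):
    """Map record_id -> name wherever exactly one named person matches."""
    names_by_key = defaultdict(list)
    for row in named_rows:
        names_by_key[key(row)].append(row["name"])

    matches = {}
    for row in deid_rows:
        candidates = names_by_key.get(key(row), [])
        # Only a unique match counts; zero or several candidates stay unnamed.
        if len(candidates) == 1:
            matches[row["record_id"]] = candidates[0]
    return matches


print(link(deidentified, named_register))
# → {'A1': 'Alice Example'}
```

Record A2 stays unnamed because two people share its demographic values, and A3 has no counterpart in the named list. The line that writes a name against A1 is, in privacy terms, the moment at which personal information is collected.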

A great deal of the debate around these cases has focused on the consent forms and the research subjects’ expectations of anonymity. These are important matters for sure, yet for me the ethical issue in re-identification demonstrations is more about the obligations of the third parties doing the identification, who had nothing to do with the original informed consent arrangements. The act of recording a person’s name against erstwhile anonymous data represents a collection of personal information. The implications for genomic data re-identification are clear.

Let’s consider Subject S who donates her DNA, ostensibly anonymously, to a Researcher R1, under some consent arrangement which concedes there is a possibility that S will be re-identified. And indeed, some time later, an independent researcher R2 does identify S and links her to the DNA sample. The fact is that R2 has collected personal information about S. If R2 has no relationship with S, then S has not consented to this new collection of her personal information.

Even if the consent form signed at the time of the original collection includes a disclaimer that absolute anonymity cannot be guaranteed, re-identifying the DNA sample later represents a new collection, one that has been undertaken without any consent.  Given that S has no knowledge of R2, there can be no implied consent in her original understanding with R1, even if absolute anonymity was disclaimed.

Naturally the re-identification demonstrations have served a purpose. It is undoubtedly important that the limits of anonymity be properly understood, and the work of Yaniv and Latanya contributes to that. Nevertheless, these demonstrations were undertaken without the knowledge, much less the consent, of the individuals concerned. I contend that bioinformaticians using clever techniques to attach names to anonymous samples need ethics approval, just as they would if they were taking fresh samples from the people concerned.

5 thoughts on “Applying Information Privacy Norms to Re-Identification Demonstrations (Re-Identification Symposium)”

  1. In your example, S consented to R1, but there is no consent with R2. Absent some law, therefore, R2 does not seem to have a legal obligation to protect the privacy of S.

    Now imagine that we add another variable to the equation, P (for Personally Identifiable Information, PII). R2 presumably has to get P from somewhere to re-identify the DNA sample. If that “somewhere” is, for example, a website, then the entity hosting the information could presumably put legal restrictions on the use of its data. In other words, the entity could create a contract and a legal cause of action against R2 if he (I always assume bad actors are males) uses P in a way inconsistent with the Terms of Use. More specifically, the Terms of Use could prohibit the use of the information for purposes other than those allowed by the service and could specifically call out re-identification as a prohibited use.


    • Posted on behalf of Steve Wilson:

      Ken makes a good point, that PII used in a re-identification has to come from somewhere, and the corresponding data custodian could try and restrict its use. In the case of Yaniv’s work, P came from public genealogical databases. I have not chased down the Ts&Cs for those databases — but I bet the lawyers are!

      But there’s another thing: In many jurisdictions, data protection or information privacy laws *do* put obligations on R2 despite the fact they have no relationship with S. This is actually the point of such laws — to protect consumers. Companies and governments in Europe, Australia, New Zealand and many Asian nations are subject to legal Collection Limitation and Use Limitation principles.

      In the US there is famously no such cross-sector generalised privacy law. So what to do in America? Should researchers stick to the letter of the law, or should they adopt for themselves a higher duty of care, and refrain from collecting and using PII without consent if it is not really needed?

      I believe that many technologists and informaticians are simply unaware that across the world, re-identification is a form of indirect collection of PII, and as such may breach privacy laws. Once they understand this, they might ask themselves whether an act that is unlawful in many parts of the world is, by extension, unethical in the USA.

  2. Stephen, great post. What should we make of the fact that we are “leaking” genetic information constantly? Does it matter that the user in one case is a researcher, in another is just an ordinary citizen interested in learning something about you from what you leave on a water bottle, and in yet another is law enforcement or another arm of the state?

  3. For those interested in the connections and divergences between data privacy protections in the U.S. and the E.U., and apropos of Steve’s post, Paul Schwartz and Daniel Solove have posted a new draft on SSRN, “Reconciling Personal Information in the United States and European Union.” The abstract:

    US and EU privacy law diverge greatly. At the foundational level, they diverge in their underlying philosophy: In the US, privacy law focuses on redressing consumer harm and balancing privacy with efficient commercial transactions. In the EU, privacy is hailed as a fundamental right that trumps other interests. Even at the threshold level – determining what information is covered by the regulation – the US and EU differ significantly. The existence of personal information — commonly referred to as “personally identifiable information” (PII) – is the trigger for when privacy laws apply. PII is defined quite differently in US and EU privacy law. The US approach involves multiple and inconsistent definitions of PII that are often quite narrow. The EU approach defines PII to encompass all information identifiable to a person, a definition that can be quite broad and vague. This divergence is so basic that it significantly impedes international data flow. A way to bridge the divergence remains elusive, and many commentators have generally viewed the differences between US and EU privacy law as impossible to reconcile.

    In this essay, we argue that there is a way to bridge these differences at least with PII. We contend that a tiered approach to the concept of PII (which we call “PII 2.0”) represents a superior way of defining PII than the current approaches in the US and EU. We also argue that PII 2.0 is consistent with the different underlying philosophies of the US and EU privacy law regimes. Under PII 2.0, all of the Fair Information Practices (FIPs) should apply when data refers to an identified person or where there is a significant risk of the data being identified. Only some of the FIPs should apply when data is merely identifiable, and no FIPs should apply when there is a minimal risk that the data is identifiable. We demonstrate how PII 2.0 advances the goals of both US and EU privacy law and is consistent with their different underlying philosophies. PII 2.0 thus begins the process of bridging the current gap between US and EU privacy law.

    Download it here:
