Is De-Identification Dead Again?

Earlier this year, the journal Science published a study called “Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata” by Yves-Alexandre de Montjoye et al. The article has reinvigorated claims that deidentified research data can be reidentified easily. These claims are not new, but their recitation in a vaunted science journal led to a new round of panic in the popular press.

The particulars of the actual study are neither objectionable nor enlightening. The authors demonstrate that in high-dimensional databases (that is, those with many variables that can take many different values), each person in the database is distinguishable from the others. Indeed, each person looks distinguishable from the others based on just a small subset of details about them. This will not surprise anybody who actually uses research data, because the whole point of accessing individual-level data is to make use of the unique combinations of factors that the people represented in the database possess. Otherwise, aggregated tables would do. What is surprising, however, is the authors’ bold conclusion that their study somehow proves that data anonymization is an “inadequate” concept and that “the open sharing of raw deidentified metadata data sets is not the future.” How Science permitted this sweeping condemnation of open data based on such thin evidence is itself a study in the fear and ideology that drive policy and scientific discourse around privacy.

What the de Montjoye Study Actually Demonstrated

The credit card metadata study used a database consisting of three months of credit card records for 1.1 million clients in an unspecified OECD country. The bank removed names, addresses, and other direct identifiers, but did nothing else to mask the data. The authors used this database to evaluate the chance that any given person is unique among clients in the database based on X number of purchase transactions. So, using an example from the paper, if Scott was the only person who made a purchase at a particular bakery on September 23rd and at a particular restaurant on September 24th, he would be unique with only two transactions within the database. The authors use these “tuples” (place-date combinations) to estimate the chance that a person in the database looks unique compared to the other data subjects. They found that 90% of the data subjects were unique in the database based on just four place-date tuples. And the rate of uniqueness increased if approximate price information was added to each tuple.
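The study’s “unicity” measure can be sketched with a toy simulation. Nothing below uses the authors’ data; the number of clients, the universe of place-date tuples, and the purchase frequencies are all invented for illustration.

```python
import random

random.seed(0)

# Toy synthetic data (NOT the study's data): each client holds a set of
# (place, day) purchase tuples drawn from a small invented universe.
PLACES, DAYS = 50, 90
clients = {
    cid: {(random.randrange(PLACES), random.randrange(DAYS))
          for _ in range(random.randint(10, 30))}
    for cid in range(1000)
}

def unicity(clients, k, trials=200):
    """Estimate the fraction of clients whose k randomly chosen tuples
    are held by no other client in the database (the paper's 'unicity')."""
    unique = 0
    for cid in random.sample(list(clients), trials):
        tuples = clients[cid]
        # The attacker is assumed to know k of the client's tuples.
        known = set(random.sample(sorted(tuples), min(k, len(tuples))))
        matches = [c for c, t in clients.items() if known <= t]
        if matches == [cid]:
            unique += 1
    return unique / trials

for k in (1, 2, 4):
    print(f"fraction unique with {k} known tuple(s): {unicity(clients, k):.0%}")
```

As in the paper, unicity climbs quickly with the number of known tuples: one shared place-date pair narrows the field only a little, but four together rarely describe more than one record in the database.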

The authors treat database uniqueness and reidentifiability as one and the same. That is, the authors treat the chance that a person is unique in the dataset based on X number of tuples as the chance that the person can be reidentified.

I am sympathetic to the authors’ goal of finding a concrete, quantifiable measure of privacy risk. But database uniqueness should not be that measure. Measures of sample uniqueness systematically exaggerate the risk of reidentification. Consequently, any research and data sharing policy that relies only on sample uniqueness as the measure of reidentification risk will strike the balance of privacy and data utility interests in the wrong place.

Problem 1: Sample Uniqueness is Not Reidentification. (It’s Not Even Actual Uniqueness.)

The greatest defect in the Science article is treating uniqueness within a sample database as equivalent to “reidentification,” which the authors do several times. For example, the authors state that 90% of individuals can be “uniquely reidentified” with just four place-date tuples. I suspect that most readers interpreted the article and its subsequent coverage in the popular media to mean that if you know just four pieces of place-date purchase information for a person, you are 90% likely to be able to figure out who they are in the de-identified research database. But the authors did not come close to proving that.

The problem is that uniqueness in a deidentified research database cannot tell us whether the data subject is actually unique in the general population. The research database will describe only a sample of the population, and may be missing a lot of information about each of its data subjects. Inferring actual uniqueness from database uniqueness requires some extra information and modeling about what proportion of the population is sampled, and how complete the data about them is.

To give an extreme example, let’s go back to “Scott”—the credit card-holder who went to a bakery on September 23rd and a restaurant on September 24th. Suppose that his data was part of a research dataset that included the purchase histories of just ten credit card customers. Using this database on ten people, could we reliably say anything about whether Scott was the only person in his city to go to the bakery and the restaurant? Of course not. We may have a hunch that the city’s inhabitants are unlikely to go to this bakery and that restaurant on the same days that Scott did, but we’d be using our intuitions rather than the research data to draw our conclusions about uniqueness.

The de Montjoye study had a much larger database, of course. Their deidentified database covered 1.1 million people. But the authors did not identify the country served by the credit card company, so 1.1 million people, while large in absolute numbers, likely represents only a small fraction of the unnamed OECD country’s population. If so, estimating population uniqueness would require more work and more information. For this reason, the authors’ failure to disclose the country of origin prevents any reader (including the editors at Science) from knowing how badly off the de Montjoye estimates of population uniqueness are likely to be.

However, as Dan Barth-Jones, a Columbia University epidemiologist and colleague of mine, points out, there are only 34 possible OECD countries where the data could have been obtained. Two are too small to be the country in question, and for the majority of the remaining countries, 1.1 million persons would constitute a small fraction of the adult population. Barth-Jones further points to an unusual statistic buried in the de Montjoye paper’s supplemental materials indicating that only 24% of the individuals in the credit card sample data were women, which provides a valuable clue as to the likely source of this data. We’ll have to wait for him to publish the additional supporting details, but his critical review of data for OECD populations has given me good reason to think that the mystery OECD country may well be Turkey. So let’s just run with that scenario as a hypothetical example of how far off the de Montjoye results could be for demonstrating actual reidentification risks. Turkey has over 80 million people, and at least half are in prime consumer ages. The fact that a person in a database containing 1.1 million cardholders looks unique would actually say little about whether any of those other 40 million people may have made similar purchases, at least without further information and modeling.

Worse still, credit card data by its nature cannot capture complete place-date purchasing information since consumers often use multiple credit cards and cash to make their purchases. Invariably, the sample uniqueness of an incomplete database (containing incomplete information for a small fraction of the population) would greatly overestimate uniqueness in the general population.
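The gap between sample uniqueness and population uniqueness is easy to see in a toy simulation. All of the numbers below (city size, sampling fraction, tuple universe) are invented for illustration; nothing comes from the study’s data.

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical city of 50,000 consumers, each described by two purchase
# tuples drawn from a small invented universe of place-day combinations.
UNIVERSE = 100
population = [
    frozenset(random.randrange(UNIVERSE) for _ in range(2))
    for _ in range(50_000)
]
# The bank's deidentified research database samples only 5% of them.
sample = random.sample(population, 2_500)

def frac_unique(targets, reference):
    """Fraction of targets whose tuple set occurs exactly once in reference."""
    counts = Counter(reference)
    return sum(counts[t] == 1 for t in targets) / len(targets)

print(f"looks unique in the sample:    {frac_unique(sample, sample):.0%}")
print(f"actually unique in the city:   {frac_unique(sample, population):.0%}")
```

Under these assumed numbers, a majority of sampled records look unique within the database even though virtually none of them is unique in the city: the appearance of uniqueness is mostly an artifact of sampling, not a property of the people.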

Earlier this month, Barth-Jones and I coauthored a letter to Science along with other colleagues raising these glaring flaws. The authors did not directly address this criticism. In their response, the authors instead criticize us for suggesting that an attacker “cannot know” whether an individual is a client of the bank. They insisted that this assumption “artificially lowers the estimated, and thus perceived, risks of reidentification.” But this is a naked mischaracterization of our critique. We haven’t claimed the attacker “cannot know” whether a target banks at the particular institution releasing data. Rather, the authors assumed in their work that the attacker does know, without making this assumption explicit, and without conditioning their findings on, or reporting, the extra effort or special information that would be required to successfully reidentify someone.

Problem 2: The Study Does Not Identify Its Threat Model.

A related problem is that the authors do not make explicit how a reidentification attack would work. The authors seem to acknowledge that there is no publicly available comprehensive resource that provides purchase and credit card institution information in an identified form (no identified database to match, or link to, the research data). If there were such an identified database with purchasing information available, e.g. from consumer data aggregators, it isn’t clear what would be left for an adversary to reidentify.

The authors don’t rule out a data matching attack because, they explain, Internet service providers, telecommunications companies, and Google may have sufficient data to mount such a linking attack even if their data is not publicly available. I don’t doubt that Google might have enough information to conduct a large-scale matching attack, but it has little incentive to do so. Why would it risk the lawsuits and public relations disaster that would follow from reidentifying people in a research database when it has plenty of information to make detailed inferences and predictions using its own data?

Another option is the “nosy neighbor” threat model. Even if an intruder doesn’t have access to a large database of purchasing information to make a linking attack, the intruder could use information that he learns on Facebook or from casual conversations to conduct a targeted attack on a few people. But this runs directly into the small sample problem described above: in a country with 75 million people, why would an intruder assume that his target is in the (relatively) small research database? Even if the nosy neighbor did match enough pieces of data to find a unique line in the deidentified data, the match is likely to be a false one. A purported reidentification would be difficult to substantiate without modeling the population and purchases that are not included in the database.
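A toy simulation of this targeted attack (again with invented numbers, not the study’s data) shows how often a “unique” match in a small sample points at the wrong person. Here a nosy neighbor knows both of the target’s purchase tuples and searches a 5% sample for a record that matches them exactly.

```python
import random
from collections import defaultdict

random.seed(2)

# Hypothetical city of 50,000 consumers; the deidentified database
# samples 5% of them. All parameters are invented for illustration.
UNIVERSE = 100
population = [(pid, frozenset(random.randrange(UNIVERSE) for _ in range(2)))
              for pid in range(50_000)]
sample = random.sample(population, 2_500)

# Index the sample by tuple set for fast lookup.
index = defaultdict(list)
for pid, tuples in sample:
    index[tuples].append(pid)

false_matches = true_matches = 0
for _ in range(5_000):
    target_id, target_tuples = random.choice(population)
    hits = index.get(target_tuples, [])
    if len(hits) == 1:  # the neighbor finds exactly one 'unique' record
        if hits[0] == target_id:
            true_matches += 1
        else:
            false_matches += 1

total = false_matches + true_matches
print(f"'unique' matches that were the wrong person: {false_matches / total:.0%}")
```

Under these assumed numbers, most “unique” matches are false: the target is usually not in the sample at all, and the matching record belongs to someone else who happens to share the same purchases.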

Problem 3: Deidentification Techniques Include More Than Removal of Identifiers and Coarsening.

Finally, the authors of the Science article studied a crudely deidentified database, but they draw dramatic conclusions about all forms of deidentification. Some techniques, like data-swapping and synthetic data techniques, would thwart the steps necessary to link a person’s identity to the research data even though a small number of tuples in the research database will continue to look “unique.” These sophisticated techniques have been developed over several decades by disclosure risk experts, and they are meticulously designed to minimize quantified measures of privacy risk, just as the authors propose at the end of their Science article.

Of course, these more sophisticated techniques are costly, and risk degrading the data’s utility. Given the low incidence of attack, it is not at all clear to me that we should insist on any technically complex methods of deidentification, especially if the data is not sensitive or is covered by research licenses restricting reidentification and redisclosure.

As I’ve said before, if I were hell-bent on finding out secrets, I would exploit a data security vulnerability to get raw, identified data. That’s what our real adversaries do. Reidentification research and the fantasies it inspires have consumed more than their fair share of public anxiety considering that we have yet to experience a large-scale bona fide reidentification attack. We have not had a “Target Breach” of reidentification, and we probably never will. This is not to say that disclosure risk research is pointless—the more utility we can eke out while minimizing risks to data subjects, the better. But there is no reason to abandon or disparage the decades of good work done in the field of deidentification because of a phantom credit card attack.

Broad access to research data is crucial for public health and wellbeing. Concluding that “data sharing” is “not our future” is a bleak prediction, especially considering the authors themselves recognize that “in science, it is essential for the data to be available and shareable.”
