De-identify, re-identify: Anonymised data’s dirty little secret

Publishing data of all kinds offers big benefits for government, academic, and business users. Regulators demand that we make that data anonymous to deliver its benefits while protecting personal privacy. But what happens when people read between the lines?

Making data anonymous is known as de-identifying it, but doing it properly is more challenging than it seems, says Wei Wang, professor of computer science and director of the Scalable Analytics Institute at UCLA.

“It’s one thing to remove the identity, but we also need to keep in mind that the remaining data right after we remove that entity is still useful,” she says.

With a little work, people can often recreate your identity from these remaining data points. This process is called re-identification, and it can ruin lives.

In a recent case, an online newsletter outed a Catholic priest who was a frequent user of the Grindr gay hookup app. The newsletter purchased the Grindr usage data from a third-party data broker. Even though the data set had no identifying information, the newsletter found him using his device ID and location data. The ID showed up in gay bars, his work address, and family addresses, which was enough to find his name and out him. He later resigned.

The spectre of re-identification has grave implications for us all, and should give us pause as we rush to publish anonymous data sets. It has become a sport for some researchers, such as those who mined anonymised AOL search queries in 2006 and those who identified individuals in de-identified Netflix usage data. Both organisations had published the data in the name of research. Back in 2009, a gay woman sued Netflix, alleging that the data could have outed her.

How de-identification works

There are different ways to de-identify data. These include deleting identifiable fields from records, which theoretically should let researchers use the data without linking it back to an individual.

The danger here is that smart third parties could re-identify someone using data elements that were deemed innocuous enough to leave in the records. In an explainer on the topic, the Georgetown University Law School describes multiple levels of identifiability.

These levels begin with data such as a phone number and social security number that can directly identify a person. At the level below that are items such as gender, birth date, and zip code. These might not identify an individual alone but can quickly single a person out when combined. At still lower levels, the data points relate less specifically to a single person, such as favourite restaurants and movies.

In the mid-nineties, the state of Massachusetts published scrubbed data on every state employee’s hospital visits, but left in some level-two data: zip code, gender, and age.

Re-identification researcher Latanya Sweeney used public voter registration records, matching zip code together with the other two data points, to single out the one person matching them all: state governor William Weld. His full medical history, gleaned from the data set, landed on his desk shortly afterwards.
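The linkage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every record below is made up, and the field names and the `link` helper are hypothetical rather than anything Sweeney actually used. It joins a "scrubbed" data set to a named public list on the three level-two fields and reports only the rows where exactly one named person fits.

```python
# All records here are invented for illustration; none of this is real data.
hospital_records = [
    {"zip": "02138", "gender": "M", "age": 51, "diagnosis": "condition A"},
    {"zip": "02138", "gender": "F", "age": 51, "diagnosis": "condition B"},
    {"zip": "02139", "gender": "M", "age": 34, "diagnosis": "condition C"},
]
public_register = [
    {"name": "Person One", "zip": "02138", "gender": "M", "age": 51},
    {"name": "Person Two", "zip": "02138", "gender": "F", "age": 51},
    {"name": "Person Three", "zip": "02139", "gender": "M", "age": 34},
    {"name": "Person Four", "zip": "02139", "gender": "M", "age": 34},
]

# The fields left in the scrubbed data that also appear in the public list.
QUASI_IDENTIFIERS = ("zip", "gender", "age")


def key(record):
    """Build the combined lookup key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(scrubbed, named):
    """Return (name, diagnosis) pairs where exactly one named person matches."""
    names_by_key = {}
    for person in named:
        names_by_key.setdefault(key(person), []).append(person["name"])

    matches = []
    for record in scrubbed:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:  # a unique match singles the person out
            matches.append((candidates[0], record["diagnosis"]))
    return matches


reidentified = link(hospital_records, public_register)
```

In this toy data the first two hospital rows each match a single named person, while the third stays ambiguous because two people share the same zip code, gender, and age.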

A token gesture

Another approach to de-identification replaces identifiable data with a token. This theoretically allows the data set’s producer to map the tokens back to the user’s real ID while leaving others guessing.

This is also sometimes vulnerable to attack. If those tokens aren’t truly random and an attacker can reverse-engineer them to retrieve a real-world data attribute, they could find the data’s owner. This happened in 2014, when someone reverse-engineered tokens created from New York taxi medallions and mined information about specific taxi rides.
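To see why non-random tokens are fragile, consider a sketch that assumes the tokens are an unsalted MD5 hash of a short identifier. Both the hashing scheme and the identifier format used here (digit, letter, digit, digit) are assumptions made for illustration, not details taken from the taxi data itself. When the set of possible identifiers is small, an attacker can hash every one of them and simply look the token up.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def tokenise(identifier):
    """Assumed tokenisation scheme: an unsalted MD5 hash of the identifier."""
    return hashlib.md5(identifier.encode("ascii")).hexdigest()


def build_lookup():
    """Hash every identifier in the hypothetical digit-letter-digit-digit format."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        identifier = f"{d1}{letter}{d2}{d3}"
        table[tokenise(identifier)] = identifier
    return table


lookup = build_lookup()  # only 10 * 26 * 10 * 10 = 26,000 candidates

# A token as it might appear in a published data set (made-up identifier).
published_token = tokenise("5X41")
recovered = lookup[published_token]
```

A keyed hash with a secret key, or a random token stored in a private mapping table, would not be open to this kind of exhaustive lookup, although it would still allow the activity correlation described next.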

Even if you can’t reverse-engineer the token, you can use it to correlate a single data subject’s activity over time. That’s how researchers pinpointed people in the 2006 AOL dataset; tokens representing individuals allowed them to group search queries and attribute them to a single person, gleaning lots of information about them.
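A rough sketch of that grouping step, using an invented query log and hypothetical token names, might look like this. Nothing here reverses the token; it only collects everything filed under it.

```python
from collections import defaultdict

# Made-up log entries of the form (token, search query).
query_log = [
    ("user_017", "landscapers in springfield"),
    ("user_042", "weather tomorrow"),
    ("user_017", "numb fingers"),
    ("user_017", "homes sold on oak street springfield"),
]


def profile_by_token(log):
    """Group every query under the token that issued it."""
    profiles = defaultdict(list)
    for token, query in log:
        profiles[token].append(query)
    return dict(profiles)


profiles = profile_by_token(query_log)
```

Each query on its own says little, but the three entries gathered under `user_017` start to suggest a town, a street, and a health concern.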

Using additional sources

The availability of multiple data sets compounds the problem of re-identification, warns Wang. “There’s a lot of information that you can collect from different sources and correlate them together,” she says. Taken individually, each data set might seem innocuous enough. Put them together, and you can cross-reference that information. “Then you can figure out a lot of information that’s going to surprise you,” she adds.

The problem, as the UK’s ICO outlines in its own Anonymisation Code (PDF), is that you can never be sure what other data is out there and how someone might map it against your anonymous data set. Neither can you tell what data will surface tomorrow, or how re-identification techniques might evolve. Data brokers that readily sell location data without the data subjects’ knowledge amplify the dangers.

Other de-identification techniques include aggregating data. This, the fourth level of data on Georgetown Law’s list, includes summarised data such as census records.

You could aggregate neighbourhood-level health records at a county level. Even that can be dangerous, warns Wang. You might be able to correlate aggregate data with other data sets, especially if the number of people with a specific attribute at the aggregated level is low enough.
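One common mitigation for small counts, which the article does not itself describe, is to suppress any aggregate cell that falls below a minimum size. The sketch below assumes a threshold of five and uses invented county and condition names; the `aggregate_with_suppression` helper is hypothetical.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # assumed threshold; real publishers choose their own

# Made-up individual-level rows of the form (county, condition).
records = (
    [("County A", "condition X")] * 12
    + [("County A", "condition Y")] * 2
    + [("County B", "condition X")] * 7
)


def aggregate_with_suppression(rows, min_cell_size=MIN_CELL_SIZE):
    """Count rows per cell, replacing counts below the threshold with None."""
    counts = Counter(rows)
    return {
        cell: (count if count >= min_cell_size else None)
        for cell, count in counts.items()
    }


table = aggregate_with_suppression(records)
```

Suppression on its own is not a complete answer: if row or column totals are also published, a suppressed cell can sometimes be recovered by subtraction.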

Concerns about re-identification have surfaced of late with NHS Digital’s recent push to collect the public’s health data en masse under its General Practice Data for Planning and Research initiative. The scheme would have transferred GP medical records for all of England’s residents to a central research store, giving people a short window to opt out.

NHS Digital had outlined specific data fields that it would transfer under the scheme, which would have allowed it to share that data with third parties. After delaying the deadline in response to pressure from GPs and relaxing opt-out deadlines, it had to put the project on hold.

Solving the re-identification problem

One theoretical way to cut through the whole tangled mess is to just keep removing data points that could reveal someone’s identity. Taking out age, zip (post) code, and gender might have stopped Sweeney’s Weld discovery, for example. But each piece of data that you take out lessens the data set’s value, warns Eerke Boiten, professor of cybersecurity at De Montfort University’s School of Computer Science and Informatics.

“If your objective is to make the information less specific, less specifically pinpointing one specific person, you’re also taking out the utility,” he says.
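The trade-off can be made concrete by measuring the smallest group of people who share the same combination of fields, a measure often called k-anonymity (a term the article does not use). The records and the `smallest_group` helper below are invented for illustration. A result of 1 means at least one person is unique on those fields.

```python
from collections import Counter

# Made-up records containing only the fields left in after scrubbing.
records = [
    {"zip": "02138", "gender": "M", "age": 51},
    {"zip": "02138", "gender": "M", "age": 47},
    {"zip": "02138", "gender": "F", "age": 51},
    {"zip": "02139", "gender": "F", "age": 29},
    {"zip": "02139", "gender": "F", "age": 33},
    {"zip": "02139", "gender": "M", "age": 29},
]


def smallest_group(rows, fields):
    """Size of the smallest set of rows sharing the same values for fields."""
    groups = Counter(tuple(row[field] for field in fields) for row in rows)
    return min(groups.values())


k_all_fields = smallest_group(records, ("zip", "gender", "age"))
k_zip_gender = smallest_group(records, ("zip", "gender"))
k_zip_only = smallest_group(records, ("zip",))
```

In this toy set, dropping age is not enough to stop someone being unique; only when zip code alone remains does every person hide in a group of three, and by then most of the detail a researcher wanted has gone.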

One way to reconcile anonymity and usefulness could be differential privacy. This technique adds statistical noise to the data by subtly altering parameters, perhaps shifting someone’s age or zip code slightly, which makes it harder to correlate them.

Scientists can still filter out that noise by repeating database queries and averaging the answers, so another factor of differential privacy is a limit on how much they can query the data. This limit is known as a privacy budget, usually expressed through a parameter called epsilon: each query spends part of the budget, and a smaller budget means stronger anonymity but noisier, less useful answers.
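A minimal sketch of how noise and a budget fit together, assuming the textbook Laplace mechanism applied to a simple counting query, is shown below. The `BudgetedCounter` class and all the numbers are hypothetical. It uses only the standard library, drawing Laplace noise as the difference of two exponential samples.

```python
import random


class BudgetedCounter:
    """Answers a counting query with Laplace noise until the budget runs out."""

    def __init__(self, true_count, total_epsilon, rng=None):
        self._true_count = true_count
        self._remaining = total_epsilon
        self._rng = rng or random.Random()

    def query(self, epsilon):
        """Spend epsilon from the budget and return a noisy count."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self._remaining -= epsilon

        # A count changes by at most 1 when one person is added or removed,
        # so the noise scale is 1 / epsilon.
        scale = 1.0 / epsilon
        # The difference of two exponential samples is Laplace-distributed.
        noise = self._rng.expovariate(1 / scale) - self._rng.expovariate(1 / scale)
        return self._true_count + noise


counter = BudgetedCounter(true_count=120, total_epsilon=1.0, rng=random.Random(0))
answers = [counter.query(0.25) for _ in range(4)]  # uses the whole budget

try:
    counter.query(0.25)
    fifth_query_allowed = True
except RuntimeError:
    fifth_query_allowed = False
```

Refusing the fifth query is the point: without the cap, an analyst could keep asking and average the answers until the noise washed out.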

That involves retaining control over the data, Boiten says, pointing out: “Control and accountability disappears when you hand it over.” An alternative is to avoid publishing the data openly and instead make it available in a controlled research environment. “Rather than sharing the data set you share the access,” he explains.

The ICO’s Anonymisation Code makes it clear that in some scenarios, where re-identification could be damaging, organisations should seek consent before distributing anonymous data sets. Some situations might demand restricting disclosure to a closed community, it adds, and in some cases the data shouldn’t be shared at all.

Regulating our way out of it

Scientists also call for more legislation around de-identification. The GDPR places data that it deems anonymous outside its scope.

The ICO warns that if the data can be re-identified using “any reasonably available means,” then it won’t pass muster under the EU General Data Protection Regulation. Olivier Thereaux, head of research and development for the non-profit Open Data Institute, says that misjudging this can get companies into hot water.

“GDPR does state that it does not apply to anonymous information, so anonymisation has sometimes been seen as a way to ‘get out’ of data protection obligations,” he says. “That is often a mistake as there are many ways to anonymise data, and some may be regarded by data protection authorities as ‘not reasonably anonymised’.”

Danish taxi service Taxa 4×35 is a case in point. After two years, it deleted only the names associated with trip records from its database, keeping the rest of each record. The regulator found that the customers were still re-identifiable and penalised the company.

It’s a question of risk

No de-identification technique is completely foolproof, though, warns Omer Tene, chief knowledge officer at the International Association of Privacy Professionals.

“While there are scientific remedies, most practical remedies are limited in terms of really being risk-based,” he says. “They minimise or limit risk but don’t completely eliminate it.”

The ICO makes this clear in its Code, pointing out that it’s “impossible to assess re-identification risk with absolute certainty.”

It recommends what it calls a ‘motivated intruder’ test, which asks whether a person without prior knowledge could re-identify individuals using publicly available tools.

Does this mean that we shouldn’t publish data at all? Not at all, says Thereaux. To do so would have a chilling effect on research. “Statistics bodies like the ONS do publish data that is anonymised to a minute risk of re-identification, and that publication is hugely valuable to our society,” he says.

Lowering risk involves taking a careful and multi-faceted approach to de-identification. Thereaux points to the UK Anonymisation Network, which is a non-profit originally created by the ICO to share best practices in de-identification. It publishes a decision-making framework to help navigate the de-identification process.

The framework emphasises the need to engage with people who might be affected. “Making sure you are transparent and honest about how risks were mitigated, and how you are responding to a breach is key,” Thereaux warns. “Organisations who fail to engage and plan for what they might do if anonymisation is breached are the ones who end up at the heart of data scandals.”

The data broker who sold Grindr data without considering the implications could perhaps have done with some of that thinking. Come to that, so could everyone involved in that supply chain. Clearly, when it comes to understanding and protecting identities in anonymous data, there’s still a lot of work to be done. ®