De-identification … it’s the latest buzzword.
With all the press it’s been getting recently, you could be forgiven for thinking that de-identification is the magic solution to all the privacy problems facing open data and Big Data projects. But like other forms of magic, this may prove to be just an illusion. Resolving privacy risks is easier said than done.
Increasingly, our clients want advice on how to do data-matching, release datasets under Open Data initiatives, or conduct Big Data analytics in a privacy-protective manner. Some are seeking to participate in cross-agency research projects; others are facing requests to hand over their data to the NSW Data Analytics Centre; and others are simply seeking policy or operational insights by leveraging their own data through business intelligence systems. All are worried about the privacy risks.
There is big picture advice available, like the OAIC’s new guide on how the APPs apply to Big Data, and our own guide to resolving the ethical issues raised by data analytics. But the one aspect of the discussion that I see causing the most angst is de-identification.
Is de-identification the answer? Is it the same thing as anonymisation? How do we even do it?
The Australian Privacy Commissioner Timothy Pilgrim recently described de-identification as “privacy’s rocket science – the technology that unlocks the potential of where we want to go, while protecting individual rights”. But he also warned that just like space flight, “the risks of getting it wrong can be substantial and very public”.
Thud. Ouch. That’s the sound of over-excited data analysts falling back to earth.
As a society, we want privacy protection because it is the oil that lubricates trust, and without trust we cannot function. The fear of being monitored and targeted for what we say or do has a chilling effect on our freedom of speech. Public health outcomes cannot be realised if people don’t trust the anonymity of their health information; think of the clients of sexual health, mental health and substance abuse services in particular. But we also want the full value of data to be realised. If big data analytics can help find a cure for cancer, or prevent child abuse, we’re all for it. Bring it on, we all say.
And for the organisation holding data, de-identification sounds like a magic solution, because if you can reach a state with your data where it is not possible for any individual to be identified or re-identified from the data, then it no longer meets the legal definition of “personal information”. And that means you don’t have to comply with the Privacy Act when you collect, store, use or disclose that data. Legal risks resolved, hooray, let’s all go home.
So de-identification seems to promise that we can have our cake and eat it too. It’s the holy grail of data management.
BUT … and this is a big but … can true de-identification ever be achieved, without the utility of the data also being lost?
I have written before about how easily an individual’s identity, pattern of behaviour, physical movements and other traits can be extrapolated from a supposedly ‘anonymous’ set of data, published with good intentions in the name of ‘open data’, public transparency or research. The examples are many: Netflix, AOL, the Thousand Genomes Project, the London bike-sharing scheme, Washington State health data, and my personal favourite, the NYC taxi data.
So should we throw in the towel, and give up on trying to pursue data analytics? (Or even worse, give up on privacy?) No, I don’t believe so. I think we just need to get better at de-identification, because there is more than one way to skin this particular cat.
But we’re not going to get better at de-identification unless we understand it. Privacy professionals should not be seduced by boffins who whisper techy sweet nothings in our ears like ‘SLK’, ‘k-anonymity’, ‘differential privacy’ and ‘encryption’. Instead, we need to better understand the language and the techniques involved in de-identification for ourselves, so that we can perform proper risk assessments, and know which privacy controls to apply when.
(For what it’s worth: statistical linkage keys, or SLKs, are codes used to link records about the same person across datasets with reasonable confidence, generated from details like their name, gender and date of birth. An SLK works only as a pseudonym, so don’t even think about describing SLKs as offering true anonymity, or you’ll get a grumpy tweet from me.)
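To make that concrete, here is a minimal sketch of how such a key is typically built. It assumes the SLK-581-style recipe used in some Australian health and community services collections (2nd, 3rd and 5th letters of the surname, 2nd and 3rd letters of the given name, date of birth, sex); the post itself doesn’t spell out the recipe, and the function names, padding rule and example values below are illustrative only, not a compliant implementation.

```python
def _letters(name, positions):
    """Pick letters at the given 1-based positions, padding with '2' when the name is too short."""
    cleaned = "".join(ch for ch in name.upper() if ch.isalpha())
    return "".join(cleaned[p - 1] if p <= len(cleaned) else "2" for p in positions)


def make_slk(surname, given_name, dob, sex):
    """Build a pseudonymous linkage key from identifying details.

    dob is expected as DDMMYYYY; sex as '1' (male) or '2' (female).
    The same person always gets the same key, which is exactly what makes
    it a pseudonym rather than anonymisation: anyone who knows the recipe
    and the person's details can regenerate the key.
    """
    return _letters(surname, (2, 3, 5)) + _letters(given_name, (2, 3)) + dob + sex


# Two agencies holding records for the same person derive the same 14-character
# key, so they can link records without exchanging names -- but the key itself
# can still be personal information in the hands of anyone able to regenerate it.
print(make_slk("Citizen", "Jane", "02071985", "2"))  # -> "ITZAN020719852"
```

The point of the sketch is the limitation, not the trick: because the key is deterministic, it buys you linkage, not anonymity.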
Privacy professionals need to better understand the relative merits and limitations of different de-identification techniques. Open data advocates and data analysts need to develop deeper understanding of the full spectrum of privacy threats that can impact on individuals. And we all need clearer guidance on how to balance data utility and data protection, within the scope of privacy law.
The UK’s Information Commissioner’s Office has a really useful Anonymisation Code of Practice – but at 108 pages it’s not a light read. In the US, the National Institute of Standards and Technology has published a 54-page paper on de-identification which laments the absence of standards for testing the effectiveness of de-identification techniques. And just this month, academics from the Berkman Center for Internet & Society at Harvard University produced a 107-page tome proposing “a framework for a modern privacy analysis informed by recent advances in data privacy from disciplines such as computer science, statistics, and law”.
But in the meantime I think we need a brief, lay person’s guide to de-identification. A non-boffin’s set of crib notes, if you like.
Perhaps that will be my blog for another day. Just as soon as I’ve mastered pulling a rabbit out of a hat.