If de-identification is the new black, then data analytics is the new ‘it’ black handbag: trendy, sexy despite its increasing ubiquity, and capable of holding – and hiding – anything. It’s the opacity of the data analytics handbag that has me worried.
The NSW Government is throwing serious money at data analytics. For example, Data61, a business arm of the CSIRO, has received close to $4M from the NSW Government to tackle Sydney's traffic congestion. The project includes using data collected in real time from Opal Cards (Sydney's public transport smart cards), as well as 'anonymised' data from in-vehicle GPS devices. But did Opal Card users sign up for that? Did car drivers agree to that? And can geolocation data ever be anonymous?
(In some exquisite timing, that news came in the same week that Charlie Pickering of The Weekly poked his funny-but-serious stick at the privacy threats from Big Data, including in particular the public fallout and subsequent apology after the sale of cars’ GPS data by TomTom to police in the Netherlands. Ooops.)
Even more important for those of us in NSW is the establishment of the Data Analytics Centre (DAC). Minister for Innovation and Better Regulation Victor Dominello says the DAC will be his lasting legacy. Special legislation was passed last year to establish the regime for information-sharing between public sector agencies and the DAC, but crucially, that legislation makes very clear that it does not override the privacy principles governing agencies. The limitations on using and disclosing personal information for purposes unrelated to the original purpose of collection still apply. Agencies can’t just disclose personal information about their clients – be they students, patients, prisoners, tenants, licence-holders, consumers, ratepayers, passengers or whatever – willy-nilly. (Yes, ‘willy-nilly’ is a technical privacy legal term, Mum. Because I said so.)
Yet I have seen presentations about the work of the DAC which suggest that any privacy concerns have been ‘resolved’ simply because DAC got its own legislation. Huh? DAC CEO Dr Ian Oppermann has said that the DAC legislation has dealt with the “not allowed” argument that agencies previously gave for not sharing their data. Minister Dominello has also been quoted as saying that the “barrier” posed by privacy and confidentiality has been dramatically reduced because of the new DAC legislation: “instead of being Mount Everest, (the barrier) is just a small molehill”.
But how can that be, when the DAC legislation explicitly states that it does not alter the legal privacy obligations on agencies? The DAC website even notes that privacy laws have not been changed, and says sharing of personal data is excluded from the DAC.
Yet public sector agencies have started receiving requests from the DAC to hand over their data, without any advice as to how the disclosure of the requested data will comply with their privacy legal obligations. These requests might ask for 'anonymous' data, but then specify unit record data, with direct identifiers intact, which would easily enable identification of the agencies' clients.
Even if DAC does not know names yet, identifiability is surely within reach. Their stated goal, in a current project mapping data in the South Sydney region, is to "get it down to 30-minute intervals of not only who lives where with whom, but who travels in, who travels out, who travels around, or who stays put". They claim to be collating data not only from public sector agencies including Opal Card data, but from energy and water utilities, telcos, banks and car-share companies. Given the power of geolocation data and metadata to enable the individuation and identification of individuals, the privacy implications are enormous.
This is serious, Big Brother stuff.
How can agencies possibly hand over unit record data, in a state that would surely risk identification of the individual, without breaching their privacy obligations? Rightly, agencies are concerned about the impact on public trust if they get it wrong, and about the “unexpected consequences of sharing”. And yet Minister Dominello is also quoted as saying he is getting close to using his “sledgehammer” coercive powers to demand data be handed over to the DAC.
Is the problem that we are all talking at cross-purposes here? Perhaps there isn't a shared understanding within the NSW Government about what 'anonymity' means, or about just how far the definition of 'personal information' extends. Just because you don't know someone's name doesn't mean you're not breaching their privacy.
It would be no surprise if there is not a shared understanding on these issues. Even now, 16 years after the NSW privacy laws commenced, I often hear public servants quite incorrectly assert that their privacy obligations only relate to ‘private’ information (and disturbingly, this presentation from the ABS makes the same claim); or that privacy laws don’t cover information observed in the public domain (like CCTV footage, say); or that celebrities don’t have privacy rights. Or that who ‘owns’ the data is somehow relevant; or that children have no privacy rights; or that information linked to an address is not personal information about the people who live there. Misunderstanding of what our privacy laws actually say is sadly very common.
Even amongst the research community there are conflicting definitions of 'anonymity' and 'de-identification'. Some use the terms interchangeably; others maintain a distinction. In the presentation referenced above, the Chief Methodologist at the ABS uses a definition of 'anonymity' which means only removing direct identifiers like name and address. In the words of The Economist, the "stripping of a few details as the only means of assuring anonymity, in a world choked with data exhaust, cannot work". So when the ABS uses the word 'anonymity', it means something weaker than k-anonymity, which also generalises or suppresses indirect identifiers (date of birth, ethnicity and many other attributes) until each record is indistinguishable from at least k-1 others in the dataset. De-identification, by contrast, suggests a final state: the point where privacy laws can cease to apply, because there is no longer a reasonable chance of identifying an individual.
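To make that gap concrete, here is a toy sketch in Python (all records and column names invented for illustration): a dataset that is 'anonymous' in the weak, names-stripped sense, yet only 1-anonymous, because one record remains unique on its indirect identifiers.

```python
# Toy illustration: stripping names and addresses is not k-anonymity.
# All records and column names here are invented.
import pandas as pd

# 'Anonymised' in the weak sense: direct identifiers already removed.
records = pd.DataFrame([
    {"postcode": "2010", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "2010", "birth_year": 1975, "sex": "F", "diagnosis": "diabetes"},
    {"postcode": "2010", "birth_year": 1982, "sex": "M", "diagnosis": "asthma"},
    {"postcode": "2031", "birth_year": 1990, "sex": "F", "diagnosis": "flu"},
])

quasi_identifiers = ["postcode", "birth_year", "sex"]

# k is the size of the smallest group of records sharing the same
# combination of indirect (quasi-)identifiers. k = 1 means someone
# is unique in the dataset, and so potentially re-identifiable.
k = records.groupby(quasi_identifiers).size().min()
print(f"This dataset is {k}-anonymous on {quasi_identifiers}")
# Prints k = 1: anyone who knows a man born in 1982 living in
# postcode 2010 can now learn his diagnosis.
```

In a genuinely k-anonymous release (for, say, k = 3), the birth years might be generalised to decades and the postcodes truncated until every combination of indirect identifiers appears at least three times.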
Public trust in government agencies doing data analytics depends on getting privacy right. But do the researchers, statisticians and data scientists have enough guidance on their ethical and legal obligations?
And do the public servants acting as custodians of our personal information have enough guidance on how to conduct de-identification in a way that protects our privacy, relieves them of their legal obligation not to disclose the information, yet still offers up data to the DAC and elsewhere that is of public utility? Getting de-identification right is very, very hard.
The NSW Privacy Commissioner’s office is so shamefully under-resourced that it struggles to produce the type of sector-wide guidance that could genuinely add value to both the DAC and the rest of the public sector. (I know they are working on various guidelines, but timing is everything when horses are already bolting.)
Any such guidance also needs to distinguish between the ethical and legal considerations when doing big data analytics for the purpose of generating insights, compared with using analytics to apply or operationalise those insights.
For example, it is one thing to discover the insight that apparently people who buy lots of pasta pose a higher risk to car insurers than people who buy lots of red meat. It is another thing to 'operationalise' that insight by offering different insurance products based on the shopping habits of the individual customer. Or in another context: if a university learns from its data analytics program that students from an Indigenous background are at significantly higher risk of failure than non-Indigenous students, what is the proper ethical and legal response to this insight? Surely we would not want the university to stop admitting Indigenous students. (For more guidance on developing an ethical framework for Big Data projects, check out our eBook on this topic.)
So I am wondering how the DAC is resolving these complex issues. It’s hard to know; there are lots of media articles about the what, but not much transparency about the how.
Have the six Cabinet-approved DAC projects undergone privacy impact assessment? Do they use Privacy by Design methodology? Do they follow the requirements of the National Statement on Ethical Conduct in Human Research, which would necessitate the approval of a Human Research Ethics Committee (HREC) before any personal information was collected, used or disclosed for a research purpose? (I don’t know about you, but I doubt NSW Cabinet meetings bear much similarity to the deliberations of a HREC mulling over the public interest considerations at stake.)
Can they satisfy the various legal tests usually applied to research involving personal information? What method is being applied to de-identify information, when, and by whom? Has it been tested to ensure it could withstand a re-identification attack?
And will the considerations be different, if the purpose is not only research to guide a public policy response, but a project to actually track down individuals in order to penalise them? Minister Dominello is reportedly planning to use the DAC to identify and target slumlords, by combining data from energy and water utilities, local council records, and Fair Trading complaints.
Hmmm, let’s just think about that for a moment. How might this data analytics project work, and what might the privacy risks be?
Let’s say you have some examples of properties known to have been illegally over-crowded with tenants, and you have data about their history of water and electricity consumption. And let’s say you get data from the water and power companies, showing everyone’s water and electricity consumption levels. You could start with aggregated data; no need to identify anyone yet.
Now render all that data into a bell curve of water and electricity use. (You remember bell curves from Stats 101 at Uni, right?) Plot the consumption data from the known illegal tenancies on the bell curve and see if they are outliers. If not, maybe you need to narrow your range of data subjects by adding more data elements. Throw in the council zoning records. First you might want to exclude all commercial premises from your dataset, but that still might not be enough to give you an accurate picture of what might be suspicious levels of residential use. Next, you might realise that many apartment buildings have a single shared meter, so you will need to figure out how to break down the shared figure into something more indicative of use-per-household. Even then, you can't necessarily distinguish which apartment resident is the one hogging all the resources. I don't know how to resolve those problems for now (and nor can bodies corporate with residents cranky about their utility bills), but for the sake of this exercise let's just imagine that DAC has magically figured that one out already.
Hmmm, what else? Swimming pools use a lot of water and power, so to exclude those people you might also need data from another dataset which knows where all the swimming pools are. Maybe local council records again, or Google Earth? But to match the data together, you might now need to know the specific address or GIS coordinates of all water and electricity customers. Not so de-identified anymore; we’re now into ‘personal information’ and hence prohibitions on use and disclosure, but regardless, let’s keep going with this hypothetical exercise.
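Just to show how quickly that matching step unravels the 'anonymity', here is a toy sketch (datasets, addresses and figures all invented for illustration):

```python
# Toy illustration of the matching step: to exclude the pool owners,
# every consumption record must carry an address. All data invented.
import pandas as pd

consumption = pd.DataFrame({
    "address": ["1 Smith St", "2 Jones Ave", "3 Brown Rd"],
    "water_kl": [48.0, 51.5, 190.2],
})

# Hypothetical council (or Google Earth-derived) register of pools.
pools = pd.DataFrame({
    "address": ["2 Jones Ave"],
    "has_pool": [True],
})

linked = consumption.merge(pools, on="address", how="left")
linked["has_pool"] = linked["has_pool"].fillna(False).astype(bool)

# The join only works because each row names a specific dwelling:
# this is address-linked data about identifiable households, i.e.
# 'personal information', not a de-identified dataset.
suspects = linked[~linked["has_pool"]]
print(suspects)
```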
Now let’s say you are left with a bell curve of water and power consumption data for all residential properties which don’t have swimming pools, and you can somehow tell the houses from the apartments, and the high-consumption apartments from their low-consumption fellow apartments. And let’s say that you have been able to refine the data to the point where the consumption data from the properties known to have been used as illegal tenancies are clustered in the top 1% or so – the outliers.
The question then becomes: who else is going to be an outlier – and thus under suspicion – when it comes to unusually high consumption of water and electricity? People using hydroponics to cultivate certain green leafy products, that's who. Well, you might not have much sympathy for them, but what if we're also talking about Aunty Mildred, who lovingly tends her hot-housed roses? And how about people with home dialysis machines?
What about those of us who just really really like long hot showers? (Honestly, if taking long hot showers was an Olympic sport, my family would be well stocked with green and gold tracksuits.) Or how about people who have cats who have been known to knock the kitchen tap on full, immediately before going on holidays? (Yeah, I'm looking at you Felix.) Does that mean we will be targeted by the Minister's door-knocking inspectors as well?
Of course, once data analytics is being used to identify and target individuals or households for some kind of intervention, the activity has moved well beyond 'research' or even public policy development. If DAC wants the address of all the 'outlier' customers, the use or disclosure of that personal information needs to be authorised on other grounds, such as law enforcement. In NSW, the test for non-health, non-sensitive personal information is that a disclosure must be "reasonably necessary … in order to investigate an offence where there are reasonable grounds to believe that an offence may have been committed". Is being plotted as an outlier on a bell curve enough to provide reasonable grounds to believe an offence has actually been committed? (Or if you are going to also use data about known properties with a history of Fair Trading complaints, that raises the question: why collect energy and water data on all households to start with?)
At what point does data analytics become just a fancy name for social surveillance on a mass scale? Fishing expeditions masquerading as law enforcement or public safety initiatives are the very type of activity that privacy laws are intended to protect us from. Allowing our lives to be ruled by algorithms means surrendering not only our privacy, but our autonomy as individuals, and as citizens.
These are fascinating, big picture philosophical questions. There are no easy answers. As a recent article in Wired notes, scientists are just as confused about the ethics of Big Data research as the rest of us.
I have had some wonderfully engaged discussions recently with researchers, scientists, philosophers, lawyers and lay people on these very questions, as I run a series of workshops around Australia on behalf of PRAXIS, on navigating privacy considerations in research, for the research community and members of HRECs. (I also like to pose these questions to Felix, but he just ignores me and continues playing with the kitchen tap.)
Where do we turn, to help resolve these ethical questions? Privacy legislation can be horribly tangled, but it is the closest thing we have to help navigate a way forward. Privacy principles were developed deliberately, as a way of codifying our society’s values and ethics. They represent a considered balancing act between the public interest served by protecting privacy, and other social objectives such as law enforcement, research in the public interest, and the proper administration of government.
So I have faith that our privacy laws can guide the way, so long as in the rush to develop ‘big data’ analytics, the data scientists actually pause long enough to develop a nuanced understanding of what their privacy legal obligations entail.
Like many women, I love my black handbags – not just because they look good, but for what they can hold, and what they can hide. But when it comes to the DAC, I would like to see that black handbag turned inside out, so we can all see what’s going on inside, with our data – and judge for ourselves whether or not we think it is ethical and appropriate.