As you might expect, the recent investigation by the Office of the Victorian Information Commissioner (OVIC) into the public release of data about myki card users includes important insights into de-identification and re-identification, which attracted media commentary. But a secondary impact of the report, less remarked upon to date, is that OVIC has offered important critiques of Privacy Impact Assessment (PIA) as a methodology.
The report highlights the importance of approaching PIAs in a thorough, defensible and iterative way; of being careful about assumptions concerning the safe use of data; and of ensuring that all parties involved in a project understand who is responsible for what.
I have distilled these critiques into eight lessons we can all learn about PIAs.
The background
In mid-2018, Public Transport Victoria (PTV), the agency responsible for public transport administration across Victoria, released a dataset of 1.8 billion records of transport users’ activity to Data Science Melbourne for use in the Melbourne Datathon. The Datathon is an annual event in which entrants (typically data scientists, academics and students) compete to find innovative uses of a dataset. The dataset contained the records of ‘touch on’ and ‘touch off’ activity of 15.1 million myki cards over a three-year period to June 2018.
PTV maintained that the dataset was disclosed in response to a request from the Department of Premier and Cabinet (DPC), which oversees the government’s open data platform, under the DataVic Access Policy and Guidelines. DPC had been represented on the Datathon judging panel and provided sponsorship to Data Science Melbourne for the Datathon.
Based on advice that certain de-identification techniques would be applied to the data prior to release, PTV completed a threshold PIA checklist and gave the ‘all clear’ for the release. However, on receiving the dataset, a number of Datathon competitors reported concerns that it was still readily re-identifiable. While names had been excised and card numbers randomised, in a number of cases knowledge of as little as one shared trip with an acquaintance was enough to deduce every trip that person had made over the three-year period.
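To make the mechanics concrete, the following is a minimal sketch of this kind of linkage attack. The schema (card_id, timestamp, stop_id) and the data are hypothetical simplifications; the real myki release was far richer, so this illustrates the technique rather than reproducing it.

```python
import pandas as pd

# Hypothetical, simplified release: one row per touch-on event.
trips = pd.DataFrame({
    "card_id": ["a91", "a91", "b02", "b02", "b02", "c77"],
    "timestamp": pd.to_datetime([
        "2018-03-01 08:02", "2018-03-01 17:31",  # the attacker's own card
        "2018-03-01 08:02", "2018-06-12 09:15",  # the acquaintance's card
        "2017-11-03 07:58",
        "2018-03-01 12:40",
    ]),
    "stop_id": [1042, 2210, 1042, 3301, 1042, 2210],
})

MY_CARD = "a91"  # attackers can recognise, and exclude, their own card

# The attacker remembers one shared trip with an acquaintance:
# touching on together at stop 1042 at 08:02 on 1 March 2018.
matches = trips[
    (trips["stop_id"] == 1042)
    & (trips["timestamp"] == pd.Timestamp("2018-03-01 08:02"))
    & (trips["card_id"] != MY_CARD)
]["card_id"].unique()

if len(matches) == 1:
    # A single matching card identifies the acquaintance, and with it
    # every trip that card made over the entire release period.
    print(trips[trips["card_id"] == matches[0]])
```

Note that randomising the card numbers does nothing to prevent this: the attack keys off the travel events themselves, not the identifier.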
A re-identification exercise conducted by a team from the University of Melbourne also found that combining information from other sources with information in the dataset about relatively small categories of card holder (such as police and politician cards) rendered the dataset ‘personally identifying’. As a consequence, it revealed a significant amount of location data about individuals and their likely travel patterns.
Due to a number of factors, including the heightened risk posed by the sheer size of the dataset and the potential impact of the breach on public trust, OVIC determined to investigate the circumstances, and in particular the steps taken to assess and approve the data for release.
There is a degree of irony in assessing re-identification risk for a massive dataset intended for use by data enthusiasts. Unlike most other proactive public releases of data, made for general human interest or to foster public sector accountability, this release went to a group itching to investigate what powerful insights could be drawn from the data when it was manipulated, drilled, mixed and matched. In those circumstances, one would expect the most robust de-identification techniques to be employed, in addition to other data security and assurance measures.
There are eight pointed lessons to be learned from the investigation, on the purpose of, and the approach to, PIAs.
Plan for the unknown
PIAs that are thorough, measured, defensible and iterative are powerful tools. They can anticipate foreseeable (and not-so-foreseeable) privacy risks. They highlight current controls (general data governance frameworks and established processes) that deal with ‘known knowns’. They suggest new treatments or risk mitigations for ‘known unknowns’, such as the risk of re-identification, and data breach management plans for ‘unknown unknowns’. Above all, PIAs foreshadow and plan for the consequences in the event that things do not go to plan.
Use existing frameworks
In relying exclusively on ‘de-identification’ of the data to manage the risk of ‘re-identification’, PTV overlooked other possible means of protecting the information, such as the Five Safes Framework for managing statistical disclosure risk, which would have suggested:
- Limiting disclosure to a known and fixed list of Datathon participants
- Ensuring participants were subject to contractual or legal obligations to not attempt to re-identify the data, or on-disclose the data, and to destroy the data at the end of the Datathon
- Ensuring the data was held on a secure system, to limit extraction and retention by participants, and
- Testing whether the data was ‘safe’ for the planned release.
The importance of data literacy
Deficiencies in governance and risk management in relation to data can undermine the protection of privacy, even for well-intentioned projects. As OVIC saw it, this matter demonstrated the significant challenges in identifying privacy risks in large, complex datasets, and the need for the Victorian public sector, which possesses many large and sensitive data holdings, to have a high level of data literacy.
In particular, over the many months during which PTV considered the proposed data release, consulted, undertook its PIA, and prepared the data for release, a raft of guidance on managing re-identification risks was published by both OVIC and the Office of the Australian Information Commissioner (OAIC).
The OAIC’s March 2018 report into the problems associated with the public release of Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Scheme (PBS) data highlighted the risks of taking a simplistic approach to de-identification before an open data release. The OAIC’s guidance on de-identification was also updated in March 2018. Even before that, in late 2017, the OAIC and CSIRO had released a detailed De-identification Decision-Making Framework.
Meanwhile, in May 2018, OVIC’s guidance on de-identification suggested that analysis of unit-level data “is most appropriately performed in a controlled environment by data scientists” rather than the data being released publicly, which OVIC described as “a risky enterprise”. Most pertinently, in that guidance OVIC had called out PTV’s counterpart agency in NSW, Transport for NSW, as offering a benchmark for how to safely share and analyse public transport smartcard users’ data without unnecessarily raising re-identification risks.
Yet this critical regulatory guidance on de-identification had not filtered through to the key decisions made by PTV and DPC before the PTV data was released in July 2018.
Be clear about responsibilities
At its broadest, the OVIC investigation highlighted the consequences of a manifest disconnect between PTV, as data custodian, and DPC, as the more distant sponsor, supporter and public sector lead for the initiative: what each thought its role was, and what its obligations and accountabilities were in relation to Datathon competitors’ use of the data.
Throughout the process of developing and disclosing the myki dataset to the Datathon, and OVIC’s investigation, both PTV and DPC displayed a lack of clarity about which agency was responsible for protecting the dataset and identifying and managing privacy risks.
Use experts and test assumptions
The PIA was premised on the assumption that the dataset had been successfully ‘anonymised’ by one area of PTV, and so concluded that it could safely be released for use in the Datathon.
For example, as to whether the program was going to collect, use or disclose re-identifiable information, PTV’s PIA stated:
“No. There is no way to link the public transport travel patterns of individual mykis to specific people via the encrypted internal card ID – this is not publicly available and will be encrypted in any case. The only remaining risk is that someone may attempt to identify a specific myki card based on the travel patterns but this would require a detailed knowledge of when and where a person had used public transport – basically a travel diary – and it would be very difficult to distinguish from other cards with similar travel patterns. In the unlikely event that this succeeded it would only reveal which Public Transport modes and stops the card had appeared at”.
This view, that the dataset had been de-identified, formed the basis for the subsequent governance of the released data. However, as OVIC and the University of Melbourne team demonstrated, re-identification from the dataset was considerably easier than PTV had imagined.
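Assumptions like this can be stress-tested empirically before any release. Research on mobility traces (for example, de Montjoye et al.’s ‘Unique in the Crowd’, 2013) found that as few as four spatio-temporal points uniquely identify the large majority of individuals. Below is a rough sketch of that kind of uniqueness check, reusing the hypothetical simplified schema from the earlier example:

```python
import pandas as pd

def share_unique(trips: pd.DataFrame, k: int = 2, seed: int = 0) -> float:
    """Estimate the share of cards uniquely re-identifiable from k known
    (timestamp, stop_id) observations of that card's travel."""
    cards = trips["card_id"].unique()
    unique_count = 0
    for card in cards:
        own = trips[trips["card_id"] == card]
        # Sample k of this card's events to stand in for an adversary's
        # auxiliary knowledge (e.g. trips observed in person).
        known = own.sample(min(k, len(own)), random_state=seed)
        # Which cards are consistent with ALL k observations?
        candidates = set(cards)
        for _, row in known.iterrows():
            hits = trips[
                (trips["timestamp"] == row["timestamp"])
                & (trips["stop_id"] == row["stop_id"])
            ]["card_id"]
            candidates &= set(hits)
        if candidates == {card}:
            unique_count += 1
    return unique_count / len(cards)
```

A result close to 1.0 means nearly every card is pinned down by just a couple of known touch events. Run against the full dataset, a check along these lines would have flagged, well before release, how little protection the ‘encrypted’ card IDs actually offered.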
OVIC noted that:
- PTV did not seek any external expertise to assist with de-identifying the dataset, and
- PTV’s decision-making processes were not clear or well documented, and appeared to lack both the support of an effective enterprise risk management framework and suitable rigour in the application of a risk management process.
OVIC cautioned that appropriate processes and expertise should sit behind any decision to release de-identified personal information.
Templates are useful, but only to a point
PTV had used a PIA template report issued by the predecessor agency to OVIC. But by design, templates are generic in nature, and will not be a neat fit for every project. Answering template questions without critical analysis will not magically produce the answers needed to properly document, assess and manage risk.
The PIA did not describe in detail what data would be released, other than to say it would be anonymised myki data. The PIA concluded that “no personal information capable of identifying an individual” would be used, but without sufficient analysis or reasoning for that conclusion.
The template did ask users to consider whether there was a risk of information being re-identifiable, and why. But because PTV had firmly concluded that no ‘personal information’ was involved, the remaining sections of the template, which were designed to manage risks including re-identification, were left incomplete.
Know the project’s scope (and keep the PIA current)
Unfortunately, the PIA did not appear to envisage the dataset being released as ‘open data’, despite contemporaneous documents providing mixed accounts of how the data would be released or used as part of the Datathon. Further, the scale of release was significantly broadened following the PIA, but the PIA was never revisited to align with the change in project scope.
OVIC cautioned on the importance of revisiting a PIA and its assessment of risk if, over time, the scope or quantity of data to be managed is expanded or clarified.
A PIA is not a project approval
The PIA was approved by the PTV ‘owner’ of the myki dataset and the PTV chief information officer, and was the only authorising documentation for PTV’s decision to release the data. There was no other written agreement governing Data Science Melbourne’s use of the dataset for the Datathon. A PIA should inform a decision to proceed; it is not a substitute for formal approval, or for agreements that bind the recipients of the data.
Conclusions
- Be thorough and factually correct. Properly account for the relationships between the parties involved in the project and the data flows, and describe the data governance arrangements to be put in place.
- Ask, “Where has something similar been done elsewhere?”, “What approach was taken?” and “Was it successful?”
- Approach the consultations with stakeholders that will inform the PIA’s content in an exploratory, collective way, not as shuttle discussions. Both PIAs and the consultations that support them are exercises in translation, ensuring everyone is on the same page about a project’s objectives, its risks, and how those risks will be managed.
- Plan alternative strategies to deal with incorrect or misunderstood conclusions about how identifiable your data might be, in both planned and alternative contexts.
- Keep the PIA scope broad, considering whether links between purpose of collection and disclosure are logical, and try to gauge community expectations about use (aka ‘the pub test’).
- Remember that PIAs conducted at a single point in time must be revisited as projects evolve in scope.
- Don’t rely on a template to achieve more than it is designed to do, or use it as the sole source of approval for the release of a significant and complex dataset. Doing either gives all stakeholders a false sense of security.
In its investigation report, OVIC acknowledged that while data-driven insights can bring great benefit, they can also put individuals at risk, and some old assumptions about data de-identification require revisiting. For datasets containing unit level data about individuals, and particularly longitudinal data about behaviour, OVIC noted that some research now indicates that such material may not be safe for open release, even where extensive attempts have been made to de-identify it.
No one can gaze into a crystal ball, but a well-considered PIA that foreshadows and plans for all risks, even those considered unlikely, will at least go some way towards identifying and mitigating privacy risks for your projects.