All it takes for an individual to be reidentified in an anonymous dataset is three crucial pieces of information: birth date, gender and postcode. These are arguably among the details most commonly requested when filling out a form, and with every interaction with digital technology recorded, protecting people’s privacy and information can be challenging.
But data is crucial to enhancing the capabilities of products, services and research. For example, an online retailer can learn which products its customers prefer to purchase, and a video streaming platform can suggest content similar to what a user watches most. Personal data is also critical for research, such as medical studies designed to better understand and treat diseases.
So, when it came to reporting the spread of the COVID-19 pandemic in New South Wales, data scientists needed to ensure that the information provided to the public was accurate, transparent and secure.
“Given the very strong community interest in growing COVID-19 cases, we needed to release critical and timely information at a fine-grained level detailing when and where COVID-19 cases were identified,” explains Dr Ian Oppermann, NSW Government’s Chief Data Scientist.
“This also included information such as the likely cause of infection and, earlier in the pandemic, the age range of people confirmed to be infected.
“We wanted the data to be as detailed and granular as possible, but we also needed to protect the privacy and identity of the individuals associated with those datasets.”
Since March 2020, the NSW Government has been using an early version of the Personal Information Factor (PIF) tool to analyse a dataset’s risk of reidentification and cyber-attacks and apply appropriate levels of protection before the data is released as open data.
Developed as a collaboration between CSIRO’s Data61, the NSW Government, the Australian Computer Society (ACS) and several other groups, the PIF privacy tool assesses the risks to an individual’s data within any dataset, allowing targeted and effective protection mechanisms to be put in place.
“There’s no other piece of software like the PIF tool,” Dr Oppermann said.
“It was developed through a long and very collaborative process involving many states, Commonwealth and industry colleagues. CSIRO’s Data61 really brought it to life and made it useable.
“Every day, it helps us analyse the security and privacy risks of releasing de-identified datasets of people infected with COVID-19 in NSW and the testing cases for COVID-19, allowing us to minimise the re-identification risk before releasing to the public.”
According to PIF’s lead researcher Dr Sushmita Ruj of CSIRO’s Data61, simple data de-identification is not enough to provide the level of data privacy needed.
“Having studied other privacy metrics, the team concluded a one-size-fits-all approach to estimating the information content and hence the reidentification risks for all applications and data was insufficient,” she says.
“PIF takes a tailored approach to each dataset, singling out the most suitable privacy metrics to assess the risks of re-identification for different data types and application scenarios.”
By approaching a de-identified dataset’s security and privacy from the perspective of an attacker wanting to identify individuals, Dr Ruj and her team developed an artificial intelligence (AI) based planning algorithm to mitigate reidentification risks.
“PIF’s algorithm continuously selects de-identification operations such as aggregation, obfuscation and differential privacy, and considers various attack scenarios used to re-identify datasets, with the system providing a score and recommendation on the risk and safety of each,” she explains.
“This method enables PIF to consider information theoretic metrics (a system of measurement for information) and provably secure algorithms, like differential privacy, to design a secure and safe framework to share data.”
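As a rough illustration of what an information-theoretic metric can look like, the Python sketch below computes the self-information, in bits, of each record's combination of identifying attributes: the rarer the combination, the more bits it reveals about the person. This is not PIF's actual metric, which the article does not specify. The function name, field names and sample records are all hypothetical.

```python
import math
from collections import Counter


def record_information_bits(records, quasi_identifiers):
    """Return, for each record, -log2(p) where p is the share of records
    that have the same combination of quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    n = len(records)
    return [-math.log2(counts[k] / n) for k in keys]


# Hypothetical sample data, for illustration only.
records = [
    {"birth_year": 1980, "gender": "F", "postcode": "2000"},
    {"birth_year": 1980, "gender": "F", "postcode": "2000"},
    {"birth_year": 1975, "gender": "M", "postcode": "2010"},
    {"birth_year": 1990, "gender": "F", "postcode": "2020"},
]

bits = record_information_bits(records, ["birth_year", "gender", "postcode"])
for record, b in zip(records, bits):
    print(record, f"{b:.1f} bits")
```

In this toy dataset the two records that share a combination each carry 1 bit, while the two unique records carry 2 bits, the maximum possible among four records.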
If the personal information factor is above the desired threshold, the program suggests data transformation techniques such as anonymisation, obfuscation and aggregation to help ensure the dataset is safe to be released.
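The general idea of scoring a dataset, comparing the score with a threshold and applying transformations can be sketched in Python as follows. This is a simplified illustration, not the PIF tool's algorithm. It assumes a stand-in risk score (the fraction of records that are unique on their identifying attributes, where higher means riskier) and two hypothetical aggregation steps. Every function name, field name and threshold value here is made up for the example.

```python
from collections import Counter


def uniqueness_risk(records, quasi_identifiers):
    """Stand-in risk score: fraction of records whose quasi-identifier
    combination appears exactly once in the dataset."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


def coarsen_birth_year(record):
    """Aggregate birth year to its decade (1984 -> 1980)."""
    out = dict(record)
    out["birth_year"] = record["birth_year"] // 10 * 10
    return out


def coarsen_postcode(record):
    """Keep only the first two characters of the postcode."""
    out = dict(record)
    out["postcode"] = record["postcode"][:2] + "**"
    return out


def protect(records, quasi_identifiers, threshold, transformations):
    """Apply transformations in order until the risk score is at or below
    the threshold, or until no transformations remain."""
    current = list(records)
    applied = []
    for name, transform in transformations:
        if uniqueness_risk(current, quasi_identifiers) <= threshold:
            break
        current = [transform(r) for r in current]
        applied.append(name)
    return current, applied, uniqueness_risk(current, quasi_identifiers)


# Hypothetical sample data, for illustration only.
records = [
    {"birth_year": 1981, "gender": "F", "postcode": "2000"},
    {"birth_year": 1984, "gender": "F", "postcode": "2007"},
    {"birth_year": 1975, "gender": "M", "postcode": "2010"},
    {"birth_year": 1979, "gender": "M", "postcode": "2015"},
]
qi = ["birth_year", "gender", "postcode"]

released, applied, risk = protect(
    records,
    qi,
    threshold=0.25,
    transformations=[
        ("birth year to decade", coarsen_birth_year),
        ("postcode to prefix", coarsen_postcode),
    ],
)
print(applied, risk)
```

Here every record starts out unique, so both aggregation steps are needed before no record can be singled out. A real tool would also weigh how much usefulness each transformation costs, which this sketch ignores.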
“With PIF, you have a scale on which you can understand the risk, and that is something other tools don’t provide,” explained Prof Helge Janicke, Research Director of the Cyber Security Cooperative Research Centre (CSCRC).
“Data analysation is well understood, but how good the output is once shared is very difficult to understand.
“The metrics-based approach and analysis that underpins PIF is hugely valuable in achieving the ethical and responsible sharing of critical data, with this technology allowing data owners to fully assess the risks and residual impacts associated with data sharing.”
PIF is also being used to examine other datasets before public release, in areas such as public transport usage and domestic violence data collected during the COVID-19 lockdown. The tool will continue to be developed by CSIRO’s Data61 and the CSCRC and is expected to be made available for wider public use by June 2022.
CSIRO would like to acknowledge and thank the Government of New South Wales, the Government of Western Australia and the Australian Computer Society (ACS) for providing the datasets needed to test PIF and for supporting the research, along with our partners in advancing the Cyber Security Cooperative Research Centre.