
Confidentiality Information Series

Introduction

There is a growing recognition within many sectors of Australian society that data is a strategic resource. The Australian Government Public Data Policy Statement acknowledges that data 'holds considerable value for growing the economy, improving service delivery and transforming policy outcomes for the nation'.

Organisations that collect data, including the Australian Bureau of Statistics (ABS), are being encouraged to make their data holdings more widely available. This includes releasing aggregate and unit record datasets (microdata) in ways that optimise their usefulness while still protecting the secrecy and privacy of those providing the information, as required by Australian legislation. This series focuses on methods and management techniques to securely release data while maintaining the confidentiality of the individuals or organisations to which the information relates.

The Australian Bureau of Statistics has safely and effectively made data holdings available for over 110 years. By law the ABS must disseminate official statistics while making sure that information is not released in a way that is likely to enable individuals or organisations to be identified.

The Confidentiality Information Series was first released in 2009 as part of the Cross Portfolio Statistical Data Integration Initiative. The ABS worked with data partners in the National Statistical Service to release the Series to provide guidance on how to safely maximise the usefulness of datasets for statistical and research purposes. The Series has now been updated in this second release.

The updated Series provides a brief legislative background to keeping data confidential, explains how to assess and manage confidentiality risks by using tools such as the Five Safes Framework, and outlines methods for treating data as part of this risk management approach.

Parts 1–3 are intended as a non-technical introduction. They provide conceptual background on approaches to managing data access for statistical and research purposes and the re-identification risks that occur in a statistical context.

Parts 4–5 provide statistical guidance and explain statistical methods for treating data to protect confidentiality, while Part 6 provides an example of the potential impact of data treatment on research.

Important note: All data examples (except those noted in Part 6 which use published data) presented in these sheets are entirely invented.

Agencies collecting information from people and organisations have a legal and ethical responsibility to ensure:

  • they respect the privacy of those providing the information; and
  • that individuals and organisations cannot be identified in a disseminated dataset.

There is a clear relationship between confidentiality and privacy. A breach of confidentiality can result in disclosure of information which might intrude on the privacy of a person or an organisation.

Confidentiality refers to the obligation of data custodians (agencies that collect information) to keep the confidential information they are entrusted with secret.

Why is confidentiality important?

Agencies collecting data often rely on the trust and goodwill of the Australian people to provide information.

Maintaining public trust helps to achieve better quality data and a higher response to data collections.

Protecting confidentiality is a key element in maintaining the trust of data providers.

This leads to reliable data to inform governments, researchers and the community.

Confidentiality and therefore trust can be broken when a person or organisation can be identified in a disseminated dataset, either directly or indirectly.

For example, a person could be directly identified in a dataset if that dataset contains their name and address. However, a person or an organisation could also be indirectly identified if there is a combination of information in the dataset from which their identity can be deduced.

Example: the combination of date of birth and a detailed area code (for example, a town where 300 people live) may enable identification as there will be some unique dates of birth in such a small area.

What does ‘confidentialise’ mean?

The term confidentialise refers to the steps a data custodian must take to mitigate the risk that a particular person or organisation could be identified in a dataset, either directly or indirectly. Confidentialisation requires two key steps:

  1. de-identification of the data, that is, the removal of any direct identifiers (e.g. name and address) from the data; and
  2. assessment and management of the risk of indirect identification occurring in the de-identified dataset.

De-identified data does not necessarily protect the identity of individuals or organisations.

Removing identifying information such as name and address protects data providers from direct identification.

However, it may still be possible to indirectly identify a person or an organisation in a de-identified dataset. If enough detail is available, the identity of a particular person or organisation may be derived from the presence of a very rare characteristic or the combination of unique or remarkable characteristics.

Example: the identity of a person could be deduced if a dataset indicates the person is over 85 years old, has yearly income of more than one million dollars, and resides in a town of 400 people.

Example: the identity of a person with a very rare disease or health condition could be deduced even in highly aggregated data.

To protect the identity of individuals and organisations, both direct and indirect identification need to be considered.

Confidentialising data involves removing or altering information, or collapsing detail, to ensure that no person or organisation is likely to be identified in the data (either directly or indirectly).

There are various methods used to confidentialise data. These methods aim to protect the identity of individuals and organisations while enabling sufficiently detailed information to be released to make the data useful for statistical and research purposes.

The main techniques for confidentialising data are described below in "How to confidentialise data: the basic principles".

For more information about assessing and managing the risks of indirect identification in microdata, see "Managing the risk of disclosure in the release of microdata" below.

The confidentiality information series

This information sheet is part of a series designed to explain, and provide advice on, a range of issues around confidentialising data. The other sheets are below.

Confidentiality refers to the obligation of data custodians (agencies that collect information) to keep the confidential information they are entrusted with secret.

This obligation is recognised in the Privacy Act 1988. The obligation to protect confidential information is also reflected in legislation governing the collection, use and dissemination of information for specific government activities. Examples include the Social Security (Administration) Act 1999, the Taxation Administration Act 1953, and the Census and Statistics Act 1905 (see examples below). Penalties apply if the secrecy provisions set out in these Acts are breached.

As well as the requirements set out in legislation, obligations to protect a person’s or organisation’s identity and privacy are also outlined in government policies and principles. These provide advice on the protocols and procedures required to manage information safely. ‘High Level Principles for Data Integration Involving Commonwealth Data for Statistical and Research Purposes’ is one example of a set of principle-based obligations for Commonwealth government agencies.

Confidentiality, and therefore trust, can be broken when a person or an organisation can be identified in a disseminated dataset, either directly or indirectly.

One of the biggest challenges in making data publicly available is ensuring that no person or organisation is likely to be identified in the data.

Identification (often referred to as disclosure) occurs when someone learns something that they did not already know about another person or organisation through data that has been disseminated. This may be in the form of aggregate data (typically data presented in tables) or microdata (unit record data where each record represents observations for a person or an organisation).

Definitions

Microdata are unit record data containing individual responses to questions on survey questionnaires, or administrative forms. For example, data in response to the question ‘In what year were you born?’.

Aggregate data (or macrodata) refers to aggregated microdata. For example, a count of the number of people of a particular age (obtained from the question ‘In what year were you born?’).

Managing the risks of identification in disseminated data (also called disclosure control) involves taking steps to evaluate and mitigate the risk that the identity of a particular person or organisation may be disclosed.

Risks of identification can be managed by confidentialising data. The aim is to protect the identity of a person or an organisation, while at the same time maximising the usefulness of the data.

In simple cases, data can be manually confidentialised. However, the use of software is sometimes necessary. Specialised skills and knowledge of the data are also required to correctly confidentialise a dataset to minimise the risk of identification.

This information sheet outlines some common techniques for confidentialising data. The information is provided as a guide only and gives simple examples, using data presented in tables, to illustrate the concepts. However, most of the techniques apply to both aggregate and microdata.

Issues specific to the management of the confidentiality of microdata are discussed in Information sheet 5, ‘Managing the risk of disclosure in the release of microdata’.

"As the world becomes more complex and computing capabilities increasingly advance, the role of microdata in statistics has become much more important. Decision makers are increasingly turning to and requesting access to microdata." Statistical Journal of the International Association for Official Statistics 26 (2009/2010) 57-63

Microdata are unit record data where each record represents observations for a person or an organisation. Microdata contain individual responses to questions on survey questionnaires, or administrative forms, including identifying information such as name, address, telephone number and age.

Microdata are a valuable resource for researchers and policy makers. The challenge for data custodians is striking the right balance between fulfilling obligations to protect the identity of individuals and organisations, and maximising the information available for statistical and research purposes. This requires careful weighing of the identification risks and benefits.

"Agencies and users should work together to promote legislative, regulatory, and dissemination policies and practices that facilitate timely and cost-effective access to data for statistical research and policy analysis but do not permit full and open access by all of the public for any use. If confidentiality issues are not fully addressed in constructive and proactive ways, users face the very real risk of losing access to high quality data."

Source: Doyle P, Lane JW, Theeuwes JJM, Zayatz LM (2001) Confidentiality, disclosure and data access. Theory and practical applications for statistical agencies. North-Holland.

Confidentialisation techniques are applied to microdata (unit record data where each record represents observations for a person or organisation) to enable them to be made available to analysts and researchers. Without these techniques, access to valuable information for research and analytic purposes would be severely restricted. Although application of confidentialisation techniques generally leads to losses in information availability, when confidentialisation is done well and with knowledge of the key research objectives in mind, the information loss can be minimal. This fact sheet looks at the impact of confidentialisation on information availability for use in research and analysis.

There has always been a debate over the delicate balance between gaining full, unrestricted access to data (for researchers), and the application of confidentiality techniques to protect the privacy of data providers (by data custodians). Selecting the confidentiality techniques to be applied to microdata is a careful balance between fulfilling obligations to protect the identity of individuals and organisations and maximising the information available for statistical and research purposes.

The information required by the research sector is becoming more sophisticated over time. Increases in the use of techniques such as data linkage, data modelling, and data mining mean that researchers’ requirements are more detailed and more varied than ever before. As a consequence, there is more pressure on data custodians to provide greater access to microdata through high quality, detailed unit record files.

Users of microdata may be concerned that any reductions or changes made to datasets during the confidentialisation process may affect their ability to undertake analysis or research using the data, or may impact on the results of an analysis. However, generally very few changes need to be made to the dataset, and in most cases these have no impact on statistical analyses.

The goal of confidentialisation is to protect the identity of individual respondents. The main types of data cells affected when confidentialising a dataset are:

  • rare events or characteristics;
  • unusual data (extreme high or low reported values); and
  • low count cross-classification cells.

Generally, statistical analysis is based on observing trends and patterns in data, and most statistical techniques rely on multiple events or individuals with similar characteristics from which to draw inferences. Where there are multiple observations with similar characteristics, the risk of individual identification is low, and the confidentialisation process generally would not result in any change to the data. The low frequency events and unusual values that are targeted in confidentialisation procedures are generally not amenable to statistical analysis.

Managing the risk of identification (disclosure control)

The process for managing the risk of identification is the same for any data that are disseminated, whether it is aggregate data or microdata.

The first step is to assess potential identification risks. The second step is to manage the risks of identification by using an appropriate method to confidentialise the data.

The key elements in the process of managing the risk of identification are outlined in the following diagram.

A process for managing identification risk

  1. Understand the legal and ethical obligations to protect the confidentiality of data.
  2. Establish policies and procedures to meet confidentiality obligations.
  3. De-identify the data (remove any direct identifiers from the dataset).
  4. Assess potential identification risks by evaluating the factors that contribute to the likelihood of identification.
  5. Test and evaluate to mitigate risk.
  6. Manage the risks of identification (confidentialise).
  7. Provide safe access to data.

Assessment of potential identification risks

The likelihood of a person or an organisation being identified, and the confidentiality methods used to minimise the risk of identification, will vary depending on a range of factors including:

  • the amount of detail included in the data (the more detail, the higher the risk of identification);
  • the sensitivity of the variables within the data (data items such as financial, health and medical or criminal record information may increase the risk of identification and would cause serious public concern if disclosed); and
  • how the information is presented (aggregate data released in tables poses less risk of identification than the release of microdata).

Confidentialisation to manage potential identification risk

The objective of confidentialisation is to meet legal and ethical obligations to protect the identity and privacy of individuals and organisations, while at the same time maximising the usefulness of the data for statistical and research purposes.

While confidentialisation can be used to manage the risk of identification, it may not completely eliminate the risk.

There are two general methods (often referred to as statistical disclosure control methods) used to confidentialise data that are to be disseminated:

  • data modification methods (perturbation) which involve changing the data slightly to reduce the risk of disclosure, while retaining as much content and structure as possible; and
  • data reduction methods which aim to control or limit the amount of detail available, without compromising the overall usefulness of the information available for research.

The techniques used to confidentialise data are similar for both aggregate data and microdata. However, the way in which the risks are assessed and the confidentiality techniques which are applied are different:

  • for aggregate data, the techniques are applied to cells in a table identified as unsafe, after aggregation; and
  • for microdata, the techniques are applied to data items within individual unit records, or to individual unit records, identified as unsafe before aggregation or analysis.

This information sheet provides a broad overview of the process involved in managing the risks of identification. More detailed information is given in the subsequent information sheets in this series.

How to confidentialise data: the basic principles focuses on the risk assessment that applies to aggregate data presented in tables, and the popular confidentiality techniques that apply to both aggregate and microdata.

Managing the risk of disclosure in the release of microdata focuses on the risk assessment and confidentiality considerations that apply to microdata.

See Part 1 - What is confidentiality and why is it important? for the complete series.

Meeting legislative obligations

Agencies protect the secrecy of information by implementing policies and procedures that address all aspects of data protection.

They do this by ensuring that identifiable information about individuals and organisations:

  • is not released publicly;
  • is available to authorised people on a need to know basis only;
  • cannot be derived from disseminated data; and
  • is maintained and accessed securely.

Privacy legislation

The Privacy Act 1988 sets out people’s rights in relation to the collection, use and sharing of information that they provide to the Commonwealth and ACT governments. These governments are bound by the privacy protections set out in the Information Privacy Principles of the Act.

Some private sector organisations, and all health service providers, are bound by rules of conduct called the National Privacy Principles, outlined in Schedule 3 of the Act.

State and territory government agencies, except Western Australia, are bound by their state privacy legislation. Currently, various confidentiality provisions and privacy principles provided in the Freedom of Information Act 1992 apply to Western Australian government agencies.

Information Privacy Principles

The Information Privacy Principles for Commonwealth and ACT government agencies cover:

  • how personal information is collected;
  • the storage and security of personal information;
  • accuracy and completeness of personal information;
  • the use of personal information and its disclosure to third parties; and
  • the general right of individuals to access and correct their own records.

National Privacy Principles

The National Privacy Principles for business cover:

  • what an organisation should do when collecting personal information;
  • the use and disclosure of personal information;
  • information quality and security;
  • openness;
  • the general right of individuals to access and correct their own records; and
  • rules around sensitive information (e.g. health, racial or ethnic background, or criminal record).

Examples

Example 1: Social Security (Administration) Act 1999

The confidentiality provisions in the Social Security (Administration) Act 1999 prohibit any person from misusing information about a person that is, or was, held in government records for social security purposes. The provisions specify offences related to the unauthorised disclosure or use of protected information. They also specify circumstances where obtaining, recording, disclosing, or otherwise using protected information may be authorised. The penalty for breaking the confidentiality provisions is up to two years imprisonment.

Reference: Social Security Act 1991 and Social Security (Administration) Act 1999 Part 5 Division 3 Confidentiality.

Example 2: Australian Institute of Health and Welfare Act 1987

The provisions of the Australian Institute of Health and Welfare (AIHW) Act 1987 ensure that data collections managed by AIHW are kept under strict conditions with respect to confidentiality. The penalty for breaking the confidentiality provisions is $2,000 or imprisonment for 12 months, or both. The AIHW Act 1987 provides for AIHW to release health and welfare-related data for research purposes, with the approval of the AIHW Ethics Committee, under certain terms and conditions. However, AIHW is also subject to the Privacy Act 1988, which restricts AIHW’s ability to release identifiable data about living individuals.

The combined effect of these Acts is that AIHW may make health data about living individuals available for research with the approval of the AIHW Ethics Committee, provided certain terms are met. Release of identifiable welfare data may only be approved by the AIHW Ethics Committee in respect of deceased individuals. Under Section 29 of the AIHW Act, a person to whom such information is divulged for any reason is subject to the same confidentiality obligations as apply to AIHW staff.

Reference: Australian Institute of Health and Welfare Act 1987, Section 29.

Example 3: Taxation Administration Act 1953

The disclosure of information about the tax affairs of a particular entity is prohibited except in certain specified circumstances under the Taxation Administration Act 1953. Those exceptions are designed to meet the principle that disclosure of information should be permitted only if the public benefit derived outweighs the entity’s privacy. The penalty for breaking these provisions of the Taxation Act is two years imprisonment.

Reference: Taxation Administration Act 1953, Schedule 1, Division 355, Confidentiality of taxpayer information.

Example 4: Census and Statistics Act 1905

The Census and Statistics Act 1905 gives the Australian Bureau of Statistics (ABS) authority to collect data for statistical purposes. Under this Act, information supplied to the ABS cannot be published or disseminated in a manner that is likely to enable the identification of a particular person or organisation. The Act contains provisions obliging past and present employees of the ABS to maintain the secrecy of data collected under the Census and Statistics Act. A fine of up to $20,400, or a penalty of two years imprisonment, or both, applies to an unauthorised disclosure of information collected under the Act.

Reference: Census and Statistics Act 1905, Sections 12 and 19.

Example 5: High Level Principles for Data Integration Involving Commonwealth Data for Statistical and Research Purposes

Commonwealth Portfolio Secretaries have endorsed a set of principles for a safe and effective environment for data integration involving Commonwealth data for statistical and research purposes. Principle six, Preserving Privacy and Confidentiality, says that policies and procedures used in data integration must minimise any potential impact on privacy and confidentiality. For example, access to potentially identifiable data for statistical and research purposes outside secure and trusted institutional environments should only occur where: legislation allows; it is necessary to achieve the approved purposes; and it meets agreements with source data agencies.

Identification risks in aggregate data

Aggregate data can be disseminated in various forms including tables, maps and graphs. Aggregate data in any form can present an identification risk if individual responses can be estimated or derived from the output (for example, outliers in a graph).

Tables are the most common way of presenting aggregate data and are the focus of the following discussion.

There are two main types of tables:

  • Count (frequency) tables: cells contain the number of individuals or organisations contributing to the cell (e.g. the number of people in various age groups, or the number of businesses in each industry).
  • Magnitude tables: cells contain values calculated from a numeric response (e.g. total income or profit).

It is sometimes possible to deduce information about a particular person or organisation from these two types of tables.

For example, when a cell in a frequency table has a low count (that is, only a very small number of contributors), it may be possible to deduce information about a particular person or organisation using the information you already know and the additional information presented in the table. This poses an identification (or disclosure) risk.

In magnitude tables, table cells present an identification (or disclosure) risk when they are dominated by values relating to one or two businesses or individuals.

Further identification risks exist if users have access to multiple tables that contain some common elements. It may be possible to use information from one table to determine the identity of a person or organisation contributing to a second table. This means it is important to keep track of all information that is released from the dataset.

When should a cell be confidentialised?

To identify table cells that pose an identification (or disclosure) risk, confidentiality rules must be determined and applied to each cell in a table. If a cell fails such rules, then further investigation or action is needed to minimise the risk of identification.

Two common rules used to assess identification risk in a table cell are:

  • The frequency rule (also called the threshold rule), which sets a threshold value for the minimum number of units (contributors) in any cell. Common threshold values are 3, 5 and 10.
  • The cell dominance rule (also called the cell concentration rule) which applies to cells where a small number of data providers contribute a large percentage to the cell total.

Setting values for these rules is the responsibility of individual data custodians. The value used will depend on that custodian’s assessment of the identification risk. Some datasets may be more sensitive than others while legislation and organisational policies will also influence the values that are set and applied.

Frequency Rule

The frequency rule is best applied to count tables. While it can be used for magnitude tables, the underlying frequency data (i.e. the number of contributing units for each cell) is needed to apply the rule.

Sometimes zero cells (cells with no contributors, or cells where all respondents reported a zero value for a magnitude table), or 100% cells (where all contributors within a category have a particular characteristic) are also considered confidential.

A 100% cell may not always require protection. For example, if all 15–19 year olds in a table (20 people) have a low income, the result may be neither unexpected nor sensitive, and the cell may not need protection. However, if the same table related to a very small community and the 100% cell was instead for 15–19 year olds with high income, then the cell may need to be protected from an identification risk, as the result is unexpected and may be considered sensitive.
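The frequency rule described above can be sketched in a few lines of code. This is an illustrative example only: the table, the field names, and the threshold value of 3 are invented, and data custodians set their own thresholds.

```python
# Sketch of the frequency (threshold) rule applied to a count table.
# The threshold of 3 is an illustrative choice; custodians set their own.

THRESHOLD = 3  # minimum number of contributors per cell

def unsafe_cells(table):
    """Return the cells that fail the frequency rule.

    `table` maps (row_label, column_label) -> count. Zero cells are
    treated as safe here; some custodians also protect them.
    """
    return [cell for cell, count in table.items() if 0 < count < THRESHOLD]

# Hypothetical age-by-income count table.
counts = {
    ("15-19", "low"): 20,
    ("15-19", "high"): 0,
    ("20-24", "low"): 12,
    ("20-24", "high"): 2,   # fails: fewer than 3 contributors
}

print(unsafe_cells(counts))  # → [('20-24', 'high')]
```

A cell flagged by this check would then be examined and, if necessary, treated with one of the techniques described later (suppression, collapsing categories, or rounding).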

Cell Dominance Rule

The cell dominance rule is also referred to as the (n,k) rule and is used for magnitude tables.

According to the cell dominance rule, a cell is regarded as unsafe if the combined contributions of the ‘n’ largest members of the cell represent more than ‘k’% of the total value of the cell.

The 'n' and 'k' values are determined by the data custodian.
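The (n,k) check can be sketched as follows. The contribution figures and the choices n=2 and k=85 are invented for illustration; real values are set by the data custodian.

```python
# Sketch of the cell dominance (n,k) rule for one magnitude-table cell.
# n=2 and k=85 are illustrative values chosen for this example.

def cell_is_unsafe(contributions, n=2, k=85.0):
    """True if the n largest contributions exceed k% of the cell total."""
    total = sum(contributions)
    if total == 0:
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k

# Hypothetical turnover figures for businesses contributing to one cell.
print(cell_is_unsafe([900, 50, 30, 20]))     # 2 largest: 950/1000 = 95% → True
print(cell_is_unsafe([300, 250, 250, 200]))  # 2 largest: 550/1000 = 55% → False
```

In the first cell, two businesses dominate the total, so their values could be estimated closely by the other contributors; the cell would need treatment before release.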

Techniques to confidentialise data

There are various techniques used to confidentialise data that pose an identification risk – either as an unsafe cell in aggregated data, or as unsafe records in microdata. These are grouped into two broad groups: data reduction methods and data modification methods.

Data reduction methods

These methods maintain the confidentiality of respondents by reducing the amount of detail available, either through appropriate aggregation or in the presentation of the data.

  1. Combining (or collapsing) categories
    This method involves combining several response categories into one, or reducing the amount of classificatory detail available in a table or in microdata. It is often used for magnitude data and sometimes for count data. Combining or collapsing categories is best used when a handful of responses have a small number of contributors, or when a table is very detailed and there are many small cells. A good knowledge of the subject is important when combining or collapsing categories to ensure that the new category groupings are relevant to data users.
  2. Data suppression (primary and secondary suppression)
    This involves not releasing information for unsafe cells, or, if nothing else works, deleting individual records or data items from the microdata file. If a table contains totals, it may be possible to calculate the value of a suppressed cell by subtracting the value of other cells from the total. At least one additional cell may also need to be suppressed to prevent identification. A cell that is suppressed because it fails one of the confidentiality rules is called a primary suppression cell. The suppression of other cells (to prevent the disclosure of a primary suppression cell) is called secondary suppression or consequential suppression.
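The need for secondary suppression can be shown with a small sketch. The counts below are invented, and the threshold of 3 is an illustrative choice:

```python
# Sketch of why secondary suppression is needed: when a row total is
# published, a single suppressed cell can be recovered by subtraction.

row = {"A": 50, "B": 2, "C": 30}   # hypothetical counts; "B" fails a threshold of 3
total = sum(row.values())          # 82, published alongside the row

# Primary suppression: replace unsafe cells with a symbol.
published = {k: ("*" if v < 3 else v) for k, v in row.items()}
# published == {'A': 50, 'B': '*', 'C': 30}

# An intruder recovers the suppressed value from the published total:
recovered = total - sum(v for v in published.values() if v != "*")
print(recovered)  # → 2
```

Because the suppressed value can be recovered exactly, at least one more cell in the row (a secondary suppression) must also be withheld, or the total itself must be treated.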

Data modification methods (Perturbation)

These methods maintain respondent confidentiality by altering the identifiable data in a small way without affecting aggregate results.

Data rounding

Data rounding involves slightly altering small cells in a table to ensure results from analysis based on the data are not significantly affected, but the original values cannot be known with certainty. Data rounding may be random or controlled.

  1. Random rounding is used in count tables and involves replacing small values that would appear in a table with other small random numbers. This results in some data distortion so that the sum of cell values within or between tables will not equal the table total. Random rounding to base X involves randomly changing every number in a table to a multiple of X. For example, random rounding to base 3 (RR3) means that all values are rounded to the nearest multiple of 3. Each value, including the totals, is rounded independently. Values which are already a multiple of 3 are left unchanged.
  2. Graduated random rounding is a rounding method used for magnitude tables. It is similar to random rounding; however, the rounding base increases as cell values get larger. That is, a small number will have a smaller rounding base than a large number. This ensures the protection offered does not diminish for large-valued cells.
  3. Controlled rounding is a form of random rounding but it is constrained to have the sum of the cells equal to the appropriate row or column totals within a table. This method may not provide consistency between tables.
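Random rounding to base 3 can be sketched as below. This is an illustrative implementation, not the procedure used by any particular agency; it rounds up with probability remainder/3 so that the rounded values are unbiased on average.

```python
# Sketch of random rounding to base 3 (RR3): every value is independently
# moved to a multiple of 3; values already a multiple of 3 are unchanged.
import random

def rr3(value, rng=random):
    """Randomly round a count to a multiple of 3, unbiased on average."""
    remainder = value % 3
    if remainder == 0:
        return value
    down = value - remainder
    # Round up with probability remainder/3, preserving the expected value.
    return down + 3 if rng.random() < remainder / 3 else down

counts = [7, 3, 11, 1]
print([rr3(v) for v in counts])  # e.g. [6, 3, 12, 0] — all multiples of 3
```

Note that because each value (including any total) is rounded independently, the rounded cells will not generally sum to the rounded total; that inconsistency is what controlled rounding is designed to avoid.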

Sampling

Sampling provides some protection against identification risks because it reduces the certainty about whether a particular individual or organisation is in the data. A value may be unique in the sample, but not necessarily unique in the population.

Types of disclosure risk in microdata

There are two key types of disclosure risk associated with microdata:

  • risk that identification is made without any deliberate attempt to identify a person or organisation (spontaneous recognition); and
  • risk associated with a deliberate (malicious) attempt to identify a person or organisation.

Spontaneous recognition (an identification made without any deliberate attempt) can occur if individuals with rare characteristics are present in the data. The identification risk this poses depends on how remarkable the characteristic(s) are. For example, the dataset may include people with unusual jobs (e.g. pop star or judge) or very large incomes, which are highly visible in the data and could lead to their identification.

Deliberate attempts to identify a person or an organisation in a dataset may include, for example, list matching (matching unique records to external files using a combination of characteristics common to both datasets) or a ‘record attack’ (where a user tries to find a particular person or organisation with a set of characteristics known to the user).
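A list-matching attack amounts to joining a de-identified release to an external list on the characteristics common to both. The sketch below is hypothetical: the records, field names and values are invented for illustration.

```python
# Hypothetical sketch of list matching: joining a de-identified release
# to an external named list on shared quasi-identifiers.

released = [  # de-identified microdata: (age_group, postcode, occupation, income)
    ("45-49", "2600", "judge", 310_000),
    ("30-34", "2600", "teacher", 72_000),
    ("30-34", "2615", "teacher", 68_000),
]
external = [  # public list with names: (name, age_group, postcode, occupation)
    ("J. Citizen", "45-49", "2600", "judge"),
]

for name, *quasi_ids in external:
    matches = [r for r in released if list(r[:3]) == quasi_ids]
    if len(matches) == 1:
        # A unique match re-identifies the record, disclosing income.
        print(name, "->", matches[0])
```

The attack only succeeds when the combination of quasi-identifiers is unique in the released file, which is why uniqueness assessment (below) matters.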

Managing the risks of identification

Assessing microdata for identification risks is a subjective process which requires a detailed examination of the data. Methods to assess identification risk in microdata include:

  • cross-tabulation of variables (for example looking at age by income or marital status) to determine unique combinations that may enable a person or an organisation to be identified;
  • comparing sample data with population data to determine whether the unique characteristics in the sample are unique in the population; and
  • acquiring knowledge of other datasets and publicly available information that could be used for list matching.
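Cross-tabulation to find unique combinations can be sketched simply by counting combinations of quasi-identifiers. The records below are invented; note that a sample unique is not necessarily a population unique, so this is only a first screen.

```python
from collections import Counter

# Hypothetical records: (age_group, marital_status, income_band)
records = [
    ("25-29", "married", "low"),
    ("25-29", "married", "low"),
    ("15-19", "widowed", "high"),   # a rare, remarkable combination
    ("25-29", "single",  "low"),
    ("25-29", "single",  "low"),
]

counts = Counter(records)
sample_uniques = [combo for combo, n in counts.items() if n == 1]
print(sample_uniques)  # -> [('15-19', 'widowed', 'high')]
```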

The risk of identification can also be assessed by considering factors that contribute to the likelihood of identification (see below).

Various software packages are available to help assess potential identification risks. These include:

  • Mu-ARGUS - a software package developed by Statistics Netherlands. The software is designed to protect against spontaneous recognition only and does not attempt to protect against list matching.
  • SUDA - software developed by the University of Manchester. SUDA stands for 'Special Unique Detection Algorithm'. It examines unit record data files and looks for records that are at risk of identification because they have unique combinations of characteristics.

Protecting microdata

The first level of protection for microdata is to remove direct identifiers such as names and Australian Business Numbers. Direct identifiers should always be removed from the data before release. However, there is still a risk of indirect identification occurring in the de-identified data.

Two common approaches to protecting microdata are confidentialising the file and restricting access to it; the two are often used in combination.

Confidentialising microdata

Data perturbation and data reduction methods are used to confidentialise microdata. These are the same basic principles used to protect aggregate data. Popular techniques to confidentialise microdata include:

  • limiting the number of variables included in the dataset;
  • introducing small amounts of random error (e.g. rounding or data swapping);
  • combining categories that are likely to enable identification (e.g. giving age in five year ranges);
  • top/bottom coding extreme values of continuous variables like income or age;
  • suppressing particular values or records that cannot otherwise be protected from the risk of identification; and
  • data swapping - this involves swapping a value in an identifiable record with a value in another record with similar characteristics to hide the uniqueness of the record. For example, a record with a unique language spoken in the region could be swapped with a similar record (based on age, sex, income etc.) in another region where the language is more commonly spoken.
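Two of the techniques above, combining categories and top coding, can be sketched as below. The band width and cap are illustrative choices, not prescribed values.

```python
def band_age(age, width=5):
    """Combine single years of age into width-year ranges, e.g. 23 -> '20-24'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

def top_code(value, cap):
    """Collapse extreme values at or above `cap` into a single top category."""
    return f"{cap}+" if value >= cap else value

print(band_age(23))                # -> '20-24'
print(top_code(350_000, 200_000))  # -> '200000+'
print(top_code(85_000, 200_000))   # -> 85000
```

Both steps trade detail for protection: a 103-year-old or a very high income is no longer distinguishable within its top category.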

Restricting access to the microdata file

Providing controlled access to microdata is important in protecting the data from identification (or disclosure) risk.

Access to detailed microdata should only be provided under the strictest conditions to approved researchers for an approved purpose. Generally researchers must sign undertakings to abide by specified conditions for access and use of the data.

The extent to which the files are confidentialised will determine how the files are accessed. The more detailed the information, the more protection is required when providing access to microdata.

One way of releasing microdata is in the form of Confidentialised Unit Record Files (CURFs). These are files that have been confidentialised to ensure that the direct and indirect identification of individuals or organisations is highly unlikely.

A highly confidentialised microdata file, such as a CURF, may be released publicly on CD-ROM. However, if more detail is left in the CURF, more secure ways of accessing the data need to be used.

(Figure: the level of protection required increases with the level of detail retained in the data.)

Researchers who need lots of detail may have to access the data through a very secure environment. Secure on-site data laboratories are one way of achieving this. The microdata are de-identified and should have some level of confidentialisation to avoid spontaneous recognition, but may still contain data that would allow indirect identification. For this reason, access to this data is available only at data custodian access sites so that all output generated is confidentialised before it leaves the premises.

Another option is providing access to microdata through a remote access facility. Remote access facilities are used by statistical agencies and research organisations around the world and mean data users can access microdata from their desktop. Approved researchers can submit data queries through a secure internet-based interface. Requests are generally run against the microdata which is securely stored within the data custodian’s computing environment. The results of the queries are confidentialised.

Factors that increase the risk of identification

Motivation to attempt identification

Motivation is very hard to assess but one issue to consider is whether an organisation or individual would receive any tangible benefit if identification was made.

Level of detail disclosed by the data

The more detail that is included in a unit record, the more likely identification becomes. Data items containing detailed categories, or unit records containing a large number of data items, could reveal enough information to enable identification to be made through a unique combination of characteristics.

Presence of rare characteristics in the data

Even if the number of data items and the number of categories within the data items are limited, there may be a risk of identification if there is a rare and remarkable characteristic in the data. The identification risk posed by the presence of a rare characteristic (or combination of characteristics) depends how remarkable the characteristic is. For example, a 19 year old girl who is widowed is likely to be noticeable in the data and further action would be needed to ensure confidentiality.

Accuracy of the data

The more accurate the dataset, the higher the risk of identification. Conversely, data that are subject to reporting errors, or contain some data items that are not maintained and consequently are out of date, decrease the likelihood of identification.

Age of the data

As a general rule, disclosing an individual's area of residence or marital status 10 years ago is less likely to enable a successful identification than disclosing their current characteristics. However, in some circumstances, the risk of identification can increase as the data ages - for example, current quarter earnings for a business may be confidential and known only to a small number of people, while previous year earnings may be published in an annual report and so be publicly available, enabling identification.

Coverage of the data (completeness)

Complete coverage of a dataset increases the risk of disclosure because a researcher knows that any individual they are aware of in a sub-population will be represented somewhere in the dataset. The sampling process (and the fact that the specific sample selections that have been made are kept confidential) provides some protection against identification.

Presence of other information that can assist in identification

A de-identified dataset cannot, of itself, lead to an identification being made. For an identification to occur a researcher must, implicitly or explicitly, match the data to some other data source, either publicly available information or personal knowledge. Given this, other information that is likely to be accessible or known to the researcher(s) must be assessed.

Types of research

While the data requirements for researchers are highly diverse, research studies can be broadly divided into two main groups: quantitative and non-quantitative.

  • quantitative studies require sufficient amounts of data for reliable estimates or models; and
  • non-quantitative studies may focus on an in-depth analysis of an unusual characteristic or small cohort.

Quantitative research and confidentiality

When undertaking quantitative research, analysts require sufficient amounts of data to ensure high quality outputs. If the required variables of interest are available, Confidentialised Unit Record Files (CURFs) should be able to meet the needs of researchers. This is because the confidentiality techniques applied to these files are the same steps that analysts typically apply to the data to ensure robust outputs. That is:

  • collapse cells with small counts;
  • collapse numeric values into groups;
  • recode variables with long tails (extreme high or low reported values); and
  • remove, or otherwise treat, outliers in distributions.
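The first step above, collapsing cells with small counts, mirrors what a data custodian does during confidentialisation. A minimal sketch, using an invented threshold and invented category data:

```python
from collections import Counter

def collapse_small_categories(values, threshold=3, other="Other"):
    """Recode categories with fewer than `threshold` observations into a
    single 'Other' category -- the same step an analyst would take to
    avoid unreliable estimates for sparse cells."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else other for v in values]

langs = ["English"] * 6 + ["Mandarin"] * 3 + ["Warlpiri"]  # hypothetical data
print(collapse_small_categories(langs))
```

The single Warlpiri speaker, who might otherwise be identifiable, is absorbed into "Other"; the analysis of the larger groups is unchanged.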

For these types of analyses the outputs achieved using CURFs should not differ greatly from the results that would have been achieved using the original non-confidentialised unit record files.

Example: How does confidentiality affect statistical analysis?

Professor David Lawrence, Telethon Institute of Child Health Research (TICHR)

Case study: Cigarette smoking and anxiety disorders

TICHR conducted a study using the ABS’ 2007 Survey of Mental Health and Wellbeing expanded Confidentialised Unit Record File (CURF) to investigate the association between smoking and mental health problems. The expanded CURF was analysed through the Remote Access Data Laboratory (RADL). The RADL is a remote analysis environment where a client can submit analysis code to the data custodian, via the internet, and subsequently receive the output without ever having direct access to the unit record level information.

After the project was completed, an evaluation was conducted to determine whether the confidentiality techniques of perturbation and re-coding of variables (high and low data values) applied to the CURF had any effect on the results of the analysis. An ABS officer re-ran the analyses using the Survey of Mental Health and Wellbeing master file (the original file before confidentialisation techniques were applied). The analysis included running several multi-way weighted tables of proportions, and fitting several logistic and proportional hazards regression models. Results from the study were published to three significant digits. None of the figures varied between the CURF and master file analyses at this level of precision.

Table 1 shows estimated hazard ratios for quitting smoking, to five significant digits, as obtained by performing the same analyses on both the expanded CURF and the original master file. The largest difference appeared at the fourth significant digit and represented 1% of the standard error of the relevant estimate.

The minor changes between the CURF and master file analyses were of neither practical nor statistical significance. This result is consistent with the very small level of change to the dataset during the confidentialisation process as described in the User’s Guide for the survey CURF.

Summary: When our research on smoking and mental illness was run on the master file, the outputs were not practically different from those achieved using the expanded CURF. Therefore, the confidentiality techniques applied in producing the CURF (perturbation and recoding of variables) did not affect the analyses conducted or the conclusions drawn from them during the project.

Full details of the weighted analyses and models undertaken for this project are described in: Lawrence D, Considine J, Mitrou F, Zubrick SR (2010) Anxiety disorders and cigarette smoking. Results from the Australian Survey of Mental Health and Wellbeing. Australian and New Zealand Journal of Psychiatry. 44: 521-528.

Table 1: Hazard ratios* for smoking cessation, people with anxiety disorders compared with people with no lifetime mental disorder, by type and nature of anxiety disorder

Disorder                                        Hazard ratio   Hazard ratio    Standard error    Difference between hazard ratios
                                                (from CURF)    (from master    of hazard ratio   as proportion of standard error
                                                               file)
No lifetime mental disorder                     1.00           1.00            (reference category)
Anxiety disorders, by type -
 Panic disorder                                 0.59678        0.59615         0.10335           0.0102
 Agoraphobia                                    0.44936        0.44936         0.08420           0
 Social phobia                                  0.55374        0.55374         0.07973           0
 Generalised anxiety disorder                   0.33631        0.33631         0.07846           0
 Obsessive-compulsive disorder                  0.47782        0.47782         0.10674           0
 Post-traumatic stress disorder                 0.63221        0.63221         0.07865           0
Anxiety disorders, by severity -
 Mild                                           0.74901        0.74882         0.11495           0.00218
 Moderate                                       0.58942        0.58920         0.08275           0.00447
 Severe                                         0.39196        0.39202         0.06016           0.00249
Anxiety disorders, by use of services -
 Has accessed services                          0.54875        0.54858         0.07046           0.00426
 No use of services                             0.59683        0.59669         0.07113           0.00323
Anxiety disorders, by years since first onset -
 0-2 years                                      0.73114        0.73089         0.16758           0.00203
 2-5 years                                      0.74544        0.74492         0.20212           0.00346
 5-10 years                                     0.59126        0.59086         0.12845           0.00522
 More than 10 years                             0.53100        0.53093         0.05969           0.00201

* Hazard ratios measure how often a particular event occurs in one group compared to how often it occurs in another group, over time.

Non-quantitative research

Non-quantitative research may include the qualitative investigation of the circumstances surrounding rare events of interest. CURFs do not meet the data requirements for these studies because they are de-identified and subject to confidentialisation techniques, so alternative approaches are needed. These may require the consent of the subjects involved to participate in the research.

Example: A qualitative study of children born with a rare birth defect

This type of study may aim to better understand the factors that led to the occurrence of a rare birth defect, with the ultimate goal of preventing or reducing its incidence. Clearly these studies play an important role in understanding rare cases or phenomena, but they require an approach other than standard statistical analysis.

In this example, administrative microdata may be useful to identify potential participants in a more detailed study. The researcher would require special approval from the custodian to obtain the sensitive data records. This permission could be granted by a body such as a Human Research Ethics Committee constituted under the National Health and Medical Research Council (NHMRC).

Further to this, permission and further information may be needed from the parents of the child born with the rare birth defect. This type of research is best conducted by approaching the affected individuals and directly seeking their consent to participate in a research study. Statistical analysis of a microdata file, whether from a survey or an administrative data source, is rarely the most appropriate research design for this type of investigation.

Conclusion

There are generally two types of research: quantitative and non-quantitative.

For quantitative research, data custodians are turning to a diverse range of products to meet researchers’ needs, including CURFs and data laboratories. Because the small changes made to unit record files during confidentialisation usually target only unusual or extreme values, the confidentiality techniques used to make the microdata available have little impact on the analyses performed in these studies. In many cases the small amount of information lost from the original microdata has no bearing on the question being analysed, and the researcher would often take similar steps to prepare the data for statistical analysis anyway. As a result there will be little to no difference between analysis using the confidentialised file and the original file.

For non-quantitative studies, such as examining unusual characteristics or small cohorts, or linking datasets, the publicly available data products and methods currently cannot meet researchers’ needs. In these cases, researchers are required to complete a more rigorous application process. This may include sign-off from an ethics committee, permission from the data custodian and, possibly, permission from the data provider.

These types of data requests are increasing and researchers and data custodians will need to work together to come up with innovative ways to streamline access to datasets while ensuring that the confidentiality of the data provider is always respected and maintained. This is critical to ensuring the continuation of Australia’s high quality data collections.

Administrative data

Information (including personal information) collected by agencies for the administration of programs, policies or services and with the potential to be used for statistical purposes. Administrative data is one type of microdata.

Aggregate data

Aggregate data are produced by grouping information into categories and aggregating values within these categories. For example, a count of the number of people of a particular age (obtained from the question ‘In what year were you born?’).

Aggregate data are typically presented in tables, and are also referred to as tabular data or macrodata.

Anonymised data

This term is most commonly used to refer to data from which direct identifiers have been removed (de-identified data), but it is sometimes also used to refer to confidentialised data. To avoid confusion the more specific terms de-identified data and confidentialised data are used in the Confidentiality Information Series.

Cell concentration rule

See 'Cell dominance rule'.

Cell dominance rule

A rule commonly applied to cells in a table to assess whether a cell may enable identification. The cell dominance rule (also called the cell concentration rule) is used to identify cells where a small number of data providers contribute a large percentage to the cell. If a cell fails this rule further investigation or action is needed to ensure that identification is unlikely.

Confidential data

"An obligation to the provider of information to maintain the secrecy of that information."

Source: UN Economic Commission for Europe, 2009

Confidentiality rules

Rules that are applied to each cell in a table to identify table cells that pose a risk of identification (disclosure). Two common rules are the frequency rule and the cell dominance rule.

Data custodian

The organisation or agency which is responsible for the collection, use and disclosure of information in a dataset. Data custodians have an obligation to keep the confidential information they are entrusted with secret.

Data laboratories

On-site data laboratories provide access to detailed microdata at a secure site controlled by the data custodian.

Data modification

See 'Perturbation'.

Data provider

An individual, household, business or other organisation which supplies data either for statistical or administrative purposes.

Data reduction

A technique used to confidentialise data. Data reduction methods aim to control or limit the amount of detail available to avoid identification of a particular individual or business. Data reduction methods include combining categories of information or suppressing information for unsafe cells.

Data rounding

Involves slightly altering small cells in a table to ensure results from analysis based on the data are not significantly affected, but the original values cannot be known with certainty. Data rounding may be random or controlled.

De-identified data

Data from which direct identifiers have been removed. May also be referred to as unidentified data.

Direct identification

Occurs when a direct identifier is included with the data that can be used to establish the identity of an individual or organisation.

Disclosure

Disclosure occurs when a person or an organisation recognises or learns something that they did not already know about another person or organisation through released data.

Disclosure control

Managing the risks of an individual or organisation being identified either directly or indirectly.

Frequency rule

A rule commonly applied to cells in a table to assess whether a cell may enable identification. The frequency rule (also called the threshold rule) sets a threshold value for the minimum number of individuals or businesses in any cell. Common threshold values are 3, 5 and 10. If a cell fails this rule further investigation or action is needed to ensure that identification is unlikely.

Identifiable

Able to be identified either directly or indirectly.

Identification

See 'Direct identification', 'Identified data' and 'Indirect identification'.

Identified data

Data that include an identifier.

Identifier(s)

An identifier (direct identifier) is information that directly establishes the identity of an individual or organisation. Examples of identifiers are: name, address, driver's licence number, Medicare number and Australian Business Number.

Indirect identification

Occurs when the identity of an individual or organisation is disclosed, not through the use of direct identifiers, but through a combination of unique characteristics.

Macrodata

See 'Aggregate data'.

Microdata

Unit record data where each record represents observations for an individual or organisation. Unit record data may contain individual responses to questions on a survey questionnaire or administrative forms. For example, answers given to the question ‘In what year were you born?’.

Outlier

An unusual value that is correctly reported but is not typical of the rest of the population.

Personal information

"Information or an opinion (including information or an opinion forming part of a database), whether true or not, and whether recorded in a material form or not, about an individual whose identity is apparent, or can reasonably be ascertained, from the information or opinion." (Privacy Act 1988)

Personal information is information that identifies, or could identify, a person. There are some obvious examples of personal information, such as name or address. Personal information can also include medical records, bank account details, photos, videos, and even information about what a person likes, their opinions and where they work - basically, any information through which a person is reasonably identifiable.

Information does not have to include a name to be personal information. For example, in some cases, date of birth and post code may be enough to identify someone.

Perturbation

A technique used to confidentialise data. Perturbation is a data modification method that involves changing the data slightly to reduce the risk of disclosure while retaining as much content and structure as possible.

Perturbation techniques include data rounding or data swapping.

Privacy

The right of individuals to control or influence what information about them is collected and how that information is used and disclosed. In Australia, the handling of personal information is regulated by the Privacy Act 1988.

Record attack

Occurs when a user tries to find a particular person or organisation with a set of characteristics known to the user.

Remote access facility

Remote access facilities are used by agencies around the world to enable approved researchers to submit data queries through a secure internet-based interface from their desktop. The request is run against a confidentialised unit record file, which is securely stored within the data custodian's computing environment.

Remarkable characteristics

The presence of a rare characteristic in the data can pose an identification risk, depending on how remarkable (or extraordinary or noticeable) the characteristic is. This might include people in unusual jobs, very large families or young people with very high educational qualifications.

Risk management

In the context of confidentiality, risk management involves identification and management of the risk of disclosure in accordance with the impact and likelihood of a disclosure occurring and within the constraints provided by legislation and policies.

Security

Safe storage and access to held data, including physical security of buildings and IT security.

Spontaneous recognition

An identification made without any deliberate attempt.

Statistical purposes

Purposes which support the collection, storage, compilation, analysis and transformation of data for the production of statistical output, and the dissemination of those outputs and the information describing them.

This means that information cannot be used for administrative, regulatory, law enforcement or other purpose that affects the rights, privileges or benefits of particular individuals or organisations.

Suppression

Data suppression involves not releasing information that is considered unsafe because it fails confidentiality rules being applied.

Tabular data

See 'Aggregate data'.

Threshold rule

See 'Frequency rule'.

Unidentified data

See 'De-identified data'.

Uniqueness

Used to characterise the situation where an individual can be distinguished from all other members in a population or sample in terms of information available on microdata records. The existence of uniqueness is determined by the size of the population or sample, the degree to which it is segmented by geographic information and the number and detail of characteristics provided for each unit in the dataset.

Unit record data

See 'Microdata'.