This is my second blog on the topic of anonymization, which I’ve spent some time over the past several months researching. My first blog, Anonymization for data managers, focused on the technical process. Now let’s dive into the role for analysts, report designers and information owners.
To analysts and reporting experts, anonymization means something quite different. For them, it means the process of rendering personal information non-personal, in much more general terms.
Anonymization in this context includes a wider set of practices for ensuring that personal data is well-governed, and that when summary statistics about a group of individuals are published, the number of individuals making up any one aggregate group is large enough that those individuals cannot be personally identified from the characteristics of the group.
For example, suppose a high school publishes details of its exam results, and breaks them down in several ways as part of a national initiative to be open about its performance, allowing parents and local government to see how the school is performing. The school is at risk of breaching the anonymity of some of its pupils, by inadvertently revealing sensitive personal information about them, if it is not cautious about choosing which figures to publish and which to keep confidential. (There is nothing wrong with calculating those figures, just with publishing them.)
Now suppose there are 30 students in a particular class in the school, of whom 15 are male and 15 are female. In the whole class, 6 (1 boy and 5 girls) are in an ethnic group which is in a minority in the school’s local area. The remaining 24 students in the class are from the local majority ethnic group.
The school may publish figures on the students’ representation in the class by gender and ethnic origin, in order to demonstrate fairness in admission procedures, or its success or otherwise in supporting all students equally:
- Publishing examination scores in a subject by class, and by gender of the students is probably okay – there are 15 students in each gender group, so an unusually low or high score is not attributable to any individual student.
- It would require a judgement call to decide whether publishing the mean examination scores by the student’s ethnic origin was a good idea or not: one would wish to exercise some care and sensitivity.
- But if the school breaks down the examination scores by both gender and the student’s ethnic origin, they have a problem. The average figure for boys in the minority ethnic group is in fact made from just one student’s exam score. If they publish a summary table including this figure, they are publishing the exam result of one easily-identified student, with the potential to cause that student considerable distress, and expose the school to risk of litigation if the student has not consented to that information being made public.
The solution, which is supported by features for suppressing small-contributing-population fields in reports in several of SAS’ products, is to publish the table but suppress values in cells with small contributing counts, e.g. fewer than 5 or fewer than 10.
One way to do this is to maintain a ‘cell count’ value alongside each measure value, containing the number of individuals whose values (whether or not those values are missing) were aggregated to produce the measure value. Where the cell count for a measure value is less than the threshold for suppression of small cells, the published version of the report should display some other value or symbol instead of the actual measure value for the cell (e.g. ‘suppressed’, or ‘*’).
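The cell-count approach above can be sketched in a few lines of Python. This is a minimal illustration, not how any SAS product implements suppression: the function name `summarise`, the threshold of 5, the `*` marker and the roster of 30 invented scores are all assumptions made for this example.

```python
from collections import defaultdict

SUPPRESSION_THRESHOLD = 5  # assumed; the right threshold is a policy decision
SUPPRESSED = "*"           # symbol shown in place of a suppressed value


def summarise(records, group_fields, measure, threshold=SUPPRESSION_THRESHOLD):
    """Return {group: (cell_count, published_value)} for the mean of `measure`."""
    cells = defaultdict(list)
    for record in records:
        group = tuple(record[field] for field in group_fields)
        # Keep missing values (None) so they still count towards the cell count.
        cells[group].append(record.get(measure))

    table = {}
    for group, values in cells.items():
        cell_count = len(values)  # individuals in the cell, missing or not
        present = [v for v in values if v is not None]
        if cell_count < threshold:
            published = SUPPRESSED
        elif not present:
            published = None  # every value is missing, so there is no mean
        else:
            published = round(sum(present) / len(present), 2)
        table[group] = (cell_count, published)
    return table


def make_class():
    """Invented roster: 14 + 1 boys and 10 + 5 girls, with made-up scores."""
    def rows(n, gender, ethnicity, score):
        return [{"gender": gender, "ethnicity": ethnicity, "score": score}
                for _ in range(n)]
    return (rows(13, "M", "majority", 57) + rows(1, "M", "majority", 54)
            + rows(1, "M", "minority", 30)
            + rows(5, "F", "majority", 64) + rows(5, "F", "majority", 65)
            + rows(5, "F", "minority", 66))


if __name__ == "__main__":
    roster = make_class()
    for fields in (["gender"], ["gender", "ethnicity"]):
        table = summarise(roster, fields, "score")
        for group, (count, value) in sorted(table.items()):
            print(group, count, value)
```

With this invented roster, the breakdown by gender is published in full, while the breakdown by gender and ethnic group shows `*` for the single boy in the minority group.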
However, suppression alone leaves a vulnerability: by combining information from several summary tables which you publish, or by combining your published summary tables with related data published elsewhere or previously, it may be possible to work out the suppressed value. For example:
In the school we considered earlier, suppose you publish summary statistics as follows, based on data which I will not show:
- The mean score for Chemistry for the class of 30 students in 2015 was 60%.
- For the 15 boys, the mean score is 55%, and for the 15 girls, 65%. These can be published without revealing anyone’s identity or revealing overly-personal information about any individual.
- For both ethnic groups in this class (having 24 students in the local majority group, and 6 in the local minority group), the mean score is 60%. Although 6 is a fairly small cell count for the local minority ethnic group, I would feel encouraged to go ahead and publish this figure because it is not hugely different from the local majority’s mean score, and is therefore unlikely to be controversial or cause anyone much distress. If there were a big difference, I would wish to consider sharing this with the relevant education bodies and acting to redress that difference, but might decide not to publish this figure.
- For boys in the local majority ethnic group, the mean score for Chemistry is 56.79% (rounded to 2 decimal places). This is not problematic in itself, but...
- For boys in the local minority ethnic group, the mean score for Chemistry is suppressed because, as those who know the makeup of the class will know, there is only one student in this group.
Is it safe to publish the data? I would argue that it is not. The reason is that someone familiar with the school and the composition of the class may know that there is only one male student in the local minority ethnic group. That person can calculate his chemistry exam score as follows:
Suppressed score = 55*15 - 56.79*14 = 29.94% (≈30%)
The information above shares enough data about the class that a strikingly low exam score for one identifiable student can easily be calculated. You should not publish all of this data, as it may easily cause that student distress. Moreover, to publish it as above may be a breach of that student’s privacy which could be legally actionable.
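The arithmetic of this attack is simple enough to automate, which is part of why it matters. The sketch below is illustrative only: the function names are hypothetical, and the single-row check is a deliberately crude heuristic, not a complete disclosure-control test.

```python
def recover_hidden_mean(total_mean, total_count, known_mean, known_count):
    """Recover the mean of the unpublished part of a group by differencing."""
    hidden_count = total_count - known_count
    if hidden_count <= 0:
        raise ValueError("the known subgroup must be smaller than the total")
    hidden_sum = total_mean * total_count - known_mean * known_count
    return hidden_sum / hidden_count


def row_is_recoverable(cell_counts, threshold=5):
    """True if exactly one non-empty cell in a row would be suppressed.

    When the row total is also published, a single suppressed cell can be
    recovered by subtraction, as in recover_hidden_mean.
    """
    suppressed = [count for count in cell_counts if 0 < count < threshold]
    return len(suppressed) == 1


# Published: all 15 boys average 55%; the 14 majority-group boys average 56.79%.
hidden = recover_hidden_mean(55.0, 15, 56.79, 14)
print(round(hidden, 2))  # 29.94

# The boys' row has cells of 14 and 1, so its one suppressed cell is exposed.
print(row_is_recoverable([14, 1]))  # True
```

One common remedy in statistical disclosure control is complementary (secondary) suppression: withholding at least one further cell, or the row total, so that the subtraction no longer works.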
So, we must consider the overall set of available data which relates to the individual subjects of summary data, and think of how it can be combined to ‘attack’ the anonymization we have applied when choosing what information to publish. But more generally, how do you address both this and your wider governance responsibilities, and publish data responsibly?
The UK Anonymisation Network’s anonymization framework
Just over a month ago, I attended a workshop titled “Save the Titanic: Hands-on anonymisation and risk control of publishing open data”. Several things presented in the workshop were so interesting that I thought they were worth sharing here.
The workshop was held at the Open Data Institute (ODI) on behalf of the UK Anonymisation Network (http://ukanon.net/; of course it uses the British spelling of anonymization), and was led by Ulrich Atz (Twitter: @statshero) of the ODI. He presented anonymization as a process: if you follow certain steps in your organisation, you can help maintain anonymity for individuals even when you handle and publish potentially-sensitive data about them.
The British Information Commissioner’s Office (ICO) offers this definition of anonymization:
Anonymisation* is the process of turning data into a form which does not identify individuals and where identification is not likely to take place. This allows for a much wider use of the information.
The UKAN offers this definition:
When a dataset is anonymised, the identifiers are removed, obscured, aggregated or altered to prevent identification. The term “identifiers” is often misunderstood to simply mean formal identifiers such as the name, address or, for example, the [National Health Service patient] number. But, identifiers could in principle include any piece of information; what is identifying will depend on context. For instance, if a group of individuals is known to contain only one woman, then the gender will be identifying for the woman (and gender is not a typical identifier). Identifiers can also be constructed out of combinations of attributes (for example, consider a “sixteen year old widow” or a “15 year old male University Student” or a “female Bangladeshi bank manager living in Thurso”). [adapted]
Creative Commons BY-SA license
The information taken from the workshop is presented here with permission. The workshop content is licensed under a Creative Commons Attribution-ShareAlike license (CC BY-SA, see https://creativecommons.org/licenses/), which “…lets others remix, tweak, and build upon your work even for commercial purposes, as long as they credit you and license their new creations under the identical terms.” Under the terms of that license, the full content of this blog post (but only this post) may also be shared and used under the same terms.
(If you’re curious about the title of the workshop, it used a sample dataset containing personal information about victims of the Titanic disaster in 1912 to illustrate various anonymization principles. This data is publicly available.)
When thinking about how closed or open data is, Ulrich Atz talked about having a spectrum where, from most closed to most open, example data could be:
- Your thoughts [Most closed]
- National security data
- Commercially sensitive data
- Personal finances
- Combined/aggregate health data
- Bus timetables
- The value of π [Most open]
In the workshop, Ulrich also presented 10 key parts of an anonymization decision-making framework, which I felt were worth sharing. This list was developed by the UKAN, to assist organisations in minimising the risk of identification in data they publish. The opening sentence of each item was taken directly from this framework. The commentary following each point is based on my notes from the workshop:
- Know your data and its origins. You must analyse and explore your data, and be familiar with the specification defining what each field represents, the distribution of values (especially unique and low-cardinality values, or unusual value combinations between fields – such as the gender-ethnicity combination in the example above). Also understand its provenance, what you are allowed to do with it, the methodology of collection, and the culture of control for the data.
- Understand the use cases (& abuse cases). How might the data be used? How might it be misused, and how can you prevent that misuse?
- Understand the legal issues and pre-share/release governance. Have an anonymization policy, and a governance structure for management of your sensitive data. Do you currently have sufficient need to collect and hold the data? Do you share the data with third parties, and have you defined their responsibilities?
- Understand the issue of consent and your ethical obligations. What would constitute fair processing – using the data for its intended purposes only, and not for other purposes? Do you have informed consent from the owners of the data – the individuals it concerns – to collect, process and use it? Can your data model reflect the different types of consent individuals may have given to the ways you may or may not use their data? Have you a process in place to support removal of an individual’s data if they withdraw their consent for you to hold it? This is a serious consideration for application or data architects and technical consultants!
- Know the processes you will need to go through to address the risk of re-identification. The factors that influence this risk are the sensitivity of the data (your banking or salary details, medical details, voting and the content of private communications are probably more sensitive than what was in your shopping basket or which brand of clothing you prefer), how accessible the data will be (published vs internal), how disclosive the information is (e.g. your location now, or how you voted, is more disclosive than your postal code or the fact that you voted), and who might be interested in the data and what their motivation is. Several of these factors can change in sensitivity over time.
- Know the processes you will need to go through to anonymize your data. The processes are usually iterative – produce your first anonymized data, then consider how you would attack it to exploit it, and revise the anonymization process.
- Understand the data environment. What other data is out there which may be combined with your data to facilitate a new exploit? How should you restrict access to your data, to allow access only by appropriate need? If you publish school exam statistics, does your local board of education have a legitimate need for statistics that you should not publish to the general population?
- Know your audience and how you will communicate. How do you license your audience to use your data? Be as open as possible about your methodology to your audience, tell your subjects about what you do with the data you hold, and about how it is useful to them. Be open and talk about the benefits of publishing the data. Take care with use of technical terms – e.g. some audiences understand what pseudonymisation is, others do not – be ready to explain your terms.
- Know what to do if things go wrong. Have a breach notification process, and know who you need to tell in case of a breach. In some situations, informing the subject of the breach may cause more distress, and it may not be appropriate. Know how to take the data out of circulation, or restrict sharing. Face the media, and maintain good, open communication about the breach. Talk with the person or group who caused the breach; senior execs need to know different things than line staff. Identify the cause and how to ensure it doesn’t happen in future. Inform your insurer if you have data breach insurance. Know your appetite for risk: how will you choose to react (and to not overreact) to a breach, depending on its severity?
- Know what happens next once you have shared and/or released data. Understand that your responsibility does not end once you have published the data. The data environment may change, e.g. when someone else publishes correlated data, so that data which was previously anonymous is no longer safe, or when a new version of your source data becomes available. Regularly review your anonymization, and the usefulness of the data. Establish who will be the point of contact in your organisation for the published data if the current point of contact leaves. This would ideally be someone with a job title like Data Protection Officer, who has a holistic view of issues, incidents and queries across all of your published sources of data.
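As a concrete starting point for the first item in the list, knowing your data, the sketch below counts how many individuals share each combination of field values and reports the combinations that fall below a chosen size. The function name, the field names and the threshold `k` are assumptions for illustration; real data would usually need more fields checked than the two used here.

```python
from collections import Counter


def rare_combinations(records, fields, k=5):
    """Return {combination: count} for value combinations shared by fewer than k records."""
    counts = Counter(tuple(record[field] for field in fields) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Invented roster matching the class makeup in the school example earlier.
roster = (
    [{"gender": "M", "ethnicity": "majority"}] * 14
    + [{"gender": "M", "ethnicity": "minority"}] * 1
    + [{"gender": "F", "ethnicity": "majority"}] * 10
    + [{"gender": "F", "ethnicity": "minority"}] * 5
)

print(rare_combinations(roster, ["gender"]))               # {}
print(rare_combinations(roster, ["gender", "ethnicity"]))  # {('M', 'minority'): 1}
```

Each field looks safe on its own; only the combination isolates one student, which is why combinations need checking as well as single fields.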
I’d be interested to hear if you want to know more about either type of anonymization. Is there an organisation in your country which provides similar support to businesses and government, free of charge? Would you like more information about techniques that can be used to create and manage a system of hashed surrogate IDs, so that you can store data anonymously within a database, and still use it effectively for analytics, but where the anonymized values can be ‘de-anonymized’ to retrieve the original clear values e.g. so that a subject may be contacted?
Or do you have experience of creating, working with or trying to attack anonymized data ethically, that would be worth sharing?