Over the last few years I’ve seen and worked on many transformational projects with big data, especially those that tap into big data’s ability to provide new and improved services for the public good. But there’s also a danger that analytics, unchecked, can do social harm by indirectly discriminating against certain groups.
The London Fire Brigade uses data analytics to predict fire risk areas, while New York City uses analytics to improve parking and traffic control. There’s even a project involving a food bank run by churches in Liverpool. It uses data analytics to get quick insights into how often the food bank is used, who is using it, why they’re using it and where they’re from. These projects are noteworthy, but crucially they all rely on "someone" capturing and analysing information about the general public.
Recently I attended and presented at the Data for Policy 2015 conference at the University of Cambridge in the UK. There were people from all walks of life, including government policy makers, actively thinking about the ethical and security implications of data and coming up with new ways to address them.
The event delivered a real debate about the ethics and transparency required with big data. The discussion was around how we, as consumers and suppliers, should apply the right rigour and processes to ensure the reality (and perceptions) of big data are positive.
Today, being legally compliant isn’t enough. We need to be cognisant of public perceptions of how data is used and protected. We also need to make sure that our application of machine learning doesn’t create profiles or segments that stereotype negatively, for example along racial lines, or that misrepresent segments of our customer base.
I’ve learnt a lot about how to manage this from the projects I’ve worked on, and from my colleagues in the industry. If you are thinking of using public data for a project, here are my top five things to remember:
- Set a goal. Start with a clear understanding of where you might be going; if it’s exploratory be clear and open about it. Set up the project for that aim from the outset.
- Ethics and the legal position for data are two different things. Factor in time to carefully consider the wider implications of your project.
- Security first. Plan security in from the start of any big data project, particularly if it’s a Hadoop project. The security work will inevitably take longer than you expect.
- Anonymise the data. Tokenisation or obfuscation has to be used carefully to hide an individual’s connection to the data. Analytics can make it easy to identify people. Hashing addresses and names isn’t enough to truly anonymise the information.
- Visualise the data to see the full picture. It is relatively easy to ensure a machine learning model is robust and predictive. However, thorough post-analysis is essential to see if unexpected bias or segmentation has been introduced. Visualisation can often be a great way of exploring this quickly. There’s some useful background reading here on the ethics of machine learning.
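To make the anonymisation point concrete, here is a minimal Python sketch of why hashing names on their own offers little protection. Everything in it is an assumption for illustration: the records, the names, the `hash_identifier` and `reidentify` helpers, and the choice of unsalted SHA-256 are all hypothetical and not taken from any real project. The idea is simply that anyone holding a list of likely names can hash them the same way and match the results against the "anonymised" data.

```python
import hashlib


def hash_identifier(value: str) -> str:
    """Naive 'anonymisation': unsalted SHA-256 of a normalised string."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


# Hypothetical released data set: names replaced by their hashes.
released = [
    {"name_hash": hash_identifier("Alice Smith"), "visits": 4},
    {"name_hash": hash_identifier("Bob Jones"), "visits": 1},
]

# Hypothetical list of candidate names an attacker might already hold,
# for example from a public register.
candidates = ["Carol White", "Alice Smith", "Bob Jones", "Dan Brown"]

# Hash every candidate once to build a reverse lookup table.
lookup = {hash_identifier(name): name for name in candidates}


def reidentify(records, lookup):
    """Return the records whose hashed name matches a known candidate."""
    return [
        {"name": lookup[r["name_hash"]], "visits": r["visits"]}
        for r in records
        if r["name_hash"] in lookup
    ]


reidentified = reidentify(released, lookup)
for row in reidentified:
    print(row["name"], row["visits"])
```

In this toy example both individuals are recovered, because the space of plausible names and addresses is small enough to enumerate. A keyed hash or tokenisation with a secret held separately makes this kind of lookup harder, but it still leaves the other attributes in the data available for linking, which is why the step needs care.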
For more information, check out this simple five-step guide to big data privacy.