Anonymization Does Not Guarantee User Privacy
We live in a world in which, more than ever, companies gather personal data about their users: to target them with more relevant ads, provide personalized recommendations, hone news feeds and search results, and power myriad other applications. But while users may be happy to share this valuable information in some cases, there is also more awareness than ever of the importance of data privacy online.
Whether it’s users embracing privacy-centric options (such as opting out of app tracking in recent versions of Apple’s iOS mobile operating system) or the growing number of legislative frameworks like the European Union’s GDPR (General Data Protection Regulation), public conversations about privacy have never been louder or more fiercely argued.
How do you balance the benefits of data-driven technology with people’s justifiable concerns about privacy? For many, the answer might appear simple: anonymization, that is, erasing or encrypting the identifiers that link a stored record to a particular individual. For instance, a company that holds Personally Identifiable Information (PII) can apply a data anonymization process to fields like names, addresses, or social security numbers so that the people behind the records remain anonymous. This can allow it to use or even publish data sets without potentially violating the privacy of the individuals who made that data collection possible, and without breaching data protection regulations in the process.
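To make this concrete, here is a minimal sketch in Python of what identifier erasure often looks like in practice. The field names and the record are invented for illustration; real pipelines vary widely.

```python
# A minimal sketch of identifier erasure: drop the fields that directly name a
# person before the records are used or shared. Field names and the sample
# record are hypothetical.

DIRECT_IDENTIFIERS = {"name", "address", "ssn", "email"}

def erase_identifiers(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

records = [
    {"name": "Jane Doe", "ssn": "123-45-6789", "city": "Springfield",
     "prescriptions": ["drug_a", "drug_b"]},
]

anonymized = [erase_identifiers(r) for r in records]
print(anonymized)
# [{'city': 'Springfield', 'prescriptions': ['drug_a', 'drug_b']}]
# Note: the remaining fields (city, prescription combination) are
# quasi-identifiers, which is exactly what the rest of this article is about.
```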
Unfortunately, like many approaches that sound like simple fixes for complex problems, there is no guarantee that it works. Data can often be de-anonymized, frequently by cross-referencing it with other data sources in order to identify an individual who was previously anonymous.
The re-identification process
Re-identification can be carried out in several different ways. For instance, if a government publishes statistical data drawn from medical records, it may be possible to identify supposedly anonymized individuals by combining that database with information from other databases, such as the city people live in and the combination of prescriptions they use. Another possibility is that weak data masking algorithms were used to obscure information such as telephone numbers and names; if the masking is not sufficiently strong, it may be possible to recover the personal information by running the transformation in reverse. A third approach, known as field matching, identifies records across multiple databases that correspond to the same person and links them together.
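As an illustration, the sketch below shows a simple linkage (field-matching) attack. Every field name and record here is invented: an "anonymized" medical table and an identified side table are joined on the quasi-identifiers they share, and any record that matches on all of them is re-identified.

```python
# A minimal sketch of a linkage / field-matching attack on invented data.
# The anonymized medical records still contain quasi-identifiers (city,
# prescription combination); a second, identified data set shares those
# same fields, so joining on them re-identifies the individual.

anonymized_medical = [
    {"city": "Springfield", "prescriptions": ("drug_a", "drug_b"), "diagnosis": "X"},
    {"city": "Shelbyville", "prescriptions": ("drug_c",), "diagnosis": "Y"},
]

# e.g. a pharmacy loyalty database, a data-broker file, or a public register
identified_side_data = [
    {"name": "Jane Doe", "city": "Springfield", "prescriptions": ("drug_a", "drug_b")},
]

def link(anon_rows, known_rows, keys=("city", "prescriptions")):
    """Yield (name, anonymized record) pairs that agree on every key in `keys`."""
    index = {tuple(row[k] for k in keys): row for row in known_rows}
    for anon in anon_rows:
        match = index.get(tuple(anon[k] for k in keys))
        if match is not None:
            yield match["name"], anon

for name, record in link(anonymized_medical, identified_side_data):
    print(f"{name} is likely the subject of: {record}")
# Jane Doe is likely the subject of: {'city': 'Springfield', ...}
```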
There have been some concerning examples of the power of de-anonymization. For instance, in mid-2021, a high-ranking Catholic priest in the US was forced out of his job after an online newsletter published information regarding his usage of Grindr, a gay dating app. The report was based on de-identified information purchased from a third-party data broker. Although the data did not directly identify the priest in question, the newsletter’s authors were able to positively ID him based on information such as location data, place of work, device ID, and more.
It was a chilling illustration of how a positive identification of an individual can be reverse-engineered by treating different pieces of information as clues, or digital breadcrumbs, that confirm a match. Even if no single piece of information identifies someone on its own, combining and correlating data from multiple sources can make it possible to correctly identify an individual. Depending on how this re-identification is used, the outcome can range from mildly invasive to life-altering, as it was for the Catholic priest, who had broken no laws through his usage of Grindr.
Better ways to protect user data
What is needed is a way of better protecting user data. That does not have to mean tearing up the data protection playbook and starting again. In many cases, what is required is simply a better way of carrying out some of the processes already used to protect individuals’ identities. Data masking, for example, can be very effective, so long as the process layers multiple transformation techniques to hide identities and sensitive information; done that way, rather than relying on easily breakable approaches, it offers vastly superior data protection and anonymization.
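The contrast can be sketched in a few lines of Python. This is illustrative only, not a complete masking solution: the weak mask is a deliberately bad example, and the keyed-hash approach shown is pseudonymization, which real systems would combine with generalization, tokenization, and access controls.

```python
# A minimal sketch contrasting a trivially reversible mask with keyed
# pseudonymization. Names and values are invented for illustration.
import hmac
import hashlib
import secrets

def weak_mask(phone: str) -> str:
    """Reversible 'masking': shift each digit by 1. An attacker can undo this instantly."""
    return "".join(str((int(c) + 1) % 10) if c.isdigit() else c for c in phone)

# A secret key held only by the data owner; rotating it breaks old mappings.
PSEUDONYM_KEY = secrets.token_bytes(32)

def pseudonymize(value: str) -> str:
    """Keyed HMAC of the value: stable enough for joins, but not reversible without the key."""
    return hmac.new(PSEUDONYM_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

print(weak_mask("555-0123"))      # easily reversed by shifting digits back
print(pseudonymize("555-0123"))   # opaque token, useless without the key
```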
Related to this, companies should strengthen the measures they use to safeguard sensitive data, such as data loss prevention (DLP) tools that can monitor data in motion, in cloud storage, on endpoint devices, and at rest on servers. Meanwhile, database activity monitoring can keep tabs on data in various locations and send alerts when there are possible policy violations.
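As a rough idea of what such an alert rule might look like, here is a hypothetical sketch: the log format, column names, and threshold are all invented, and real monitoring products are far more sophisticated.

```python
# A hypothetical database-activity-monitoring rule: scan query-log entries and
# flag bulk reads that touch sensitive columns. Purely illustrative.

SENSITIVE_COLUMNS = {"ssn", "email", "address"}
BULK_ROW_THRESHOLD = 1_000

def check_entry(entry: dict):
    """Return an alert string if the log entry looks like a risky bulk read."""
    touched = SENSITIVE_COLUMNS & set(entry["columns"])
    if touched and entry["rows_returned"] > BULK_ROW_THRESHOLD:
        return (f"ALERT: user {entry['user']} read {entry['rows_returned']} rows "
                f"including sensitive columns {sorted(touched)}")
    return None

query_log = [
    {"user": "analyst_7", "columns": ["city", "ssn"], "rows_returned": 250_000},
    {"user": "app_service", "columns": ["order_id"], "rows_returned": 12},
]

for entry in query_log:
    alert = check_entry(entry)
    if alert:
        print(alert)
```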
Organizations must do whatever they can to improve data privacy and protection. Whether you’re concerned about reputational damage, compliance failures and data protection violations, or you simply want to do right by your users, this is one of the best investments you can make. The world will thank you for it.