Here are some new tips for masking. The new EU General Data Protection Regulation (GDPR) requires your company to implement, to quote, “all necessary technical and organizational measures and to take into consideration the available technology at the time of the processing and technological developments.” So, how can you comply with this requirement in the real world? In Part 1, we anonymized field content or replaced it with aliases. That can be sufficient, but it doesn’t have to be. That’s why this article covers beta functions (the ideal solution for pseudonymization), personal data that has slipped through the cracks, and the exciting question of ...
Read part 1 of this series: Pseudonymagical: masking data to get up to speed with GDPR
How random can your birth be?
The exact date of your birth is important to you, naturally. The analytics experts working with your data, on the other hand, aren’t looking to send you birthday wishes anyway (missing opt-in?!). What they’re interested in is your approximate age, maybe even just the decade. The SQL code from Part 1 moves the date of birth randomly plus or minus five days. Someone who knows your birth date would therefore be unable to locate your records within a stolen database. Privacy risk abated!
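The Part 1 code itself is SQL and is not repeated here. As a rough illustration of the idea only, the following Python sketch shifts a date by a random whole number of days between minus five and plus five. The function name `shift_birthdate` and the sample date are hypothetical and do not come from the original code.

```python
import random
from datetime import date, timedelta
from typing import Optional


def shift_birthdate(birthdate: date, max_days: int = 5,
                    rng: Optional[random.Random] = None) -> date:
    """Move a date by a random whole number of days in [-max_days, +max_days]."""
    rng = rng or random.Random()
    offset = rng.randint(-max_days, max_days)  # both bounds are inclusive
    return birthdate + timedelta(days=offset)


# Hypothetical sample value, not real customer data.
original = date(1985, 3, 14)
masked = shift_birthdate(original, rng=random.Random(42))
print(original, "->", masked)
```

Note that an offset of zero is possible in this sketch, so an individual date can occasionally come out unchanged.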
But even that should be verified when it comes to providing proof of “appropriate measures”; in other words, check the cluster size. In our example of around 5,000 VIP customers, there is only one who is in their 20s and has a postal code beginning with the numeral 1. The time required to indirectly identify that individual (Recital 26, GDPR) could be rather low here. In the worst-case scenario, legally too low.
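A cluster-size check of this kind can be sketched in a few lines of Python. The sketch groups records by age decade and first postal-code digit and reports every group that falls below a minimum size. The function name `small_clusters`, the threshold of five, and the sample records are assumptions made for illustration; the GDPR does not name a specific minimum.

```python
from collections import Counter


def small_clusters(records, min_size=5):
    """Return the (age decade, first postal digit) groups with fewer than min_size records."""
    counts = Counter(
        (age // 10 * 10, postal_code[0]) for age, postal_code in records
    )
    return {group: n for group, n in counts.items() if n < min_size}


# Hypothetical (age, postal code) pairs, not real customer data.
records = [
    (27, "10115"), (34, "80331"), (36, "80333"), (31, "81675"),
    (38, "80469"), (33, "80335"), (45, "20095"),
]

for (decade, digit), n in small_clusters(records).items():
    print(f"{decade}s, postal code {digit}...: only {n} record(s)")
```

Any group reported here is a candidate for coarser masking, for example wider age bands or a shorter postal-code prefix.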
Enter the beta function: the ideal solution for pseudonymization
Luckily, Recital 29 of the General Data Protection Regulation tells us how to handle this problem. The information required to pinpoint an individual is simply stored separately. That can be accomplished using a key or a mathematical function, in other words a macro with a secret key: I merely call it, without knowing the math hidden behind it. The law doesn’t tell us how tricky this logic has to be, though. From an analytical standpoint, this so-called beta function should satisfy two additional conditions:
- It must be invertible (a hash is not, for instance).
- The result of the masking should be monotonic, which means: high original value = high new value (encryption doesn’t do this).
Why? Well, we don’t want to affect the analytic modelling too much - ideally, the function would output something linear or slightly exponential… Here is a √2 example I’ve kept simple:
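/* Store the masking function and its inverse in a central function library. */
proc fcmp outlib=SH.GDPR.BETA;
   /* Forward transformation: scale ages by sqrt(2), shift dates by floor(3650*sqrt(2)) days. */
   function BETA1(type $, value);
      if type = 'AGE1' then return(value*sqrt(2));
      if type = 'DATE1' then return(value+floor(3650*sqrt(2)));
   endsub;
   /* Inverse transformation: restore the original value. */
   function BETA1I(type $, value);
      if type = 'AGE1' then return(value/sqrt(2));
      if type = 'DATE1' then return(value-floor(3650*sqrt(2)));
   endsub;
run;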
Mathematically, this is a coordinate transformation - or you can also think of it in terms of Star Trek: people are being beamed to an unfamiliar planet. There is a different gravity field than the earth there (a different coordinate system), but it applies to everyone equally — which means that lightweight visitors on the planet can still jump higher there than their heavyweight colleagues. The same applies accordingly to age etc.
proc DS2;
   package GDPR / overwrite=yes language='fcmp' table='SH.GDPR';
   run;

   data pdp_de_demo.data.crm_customerbase_ds2_beta(type=view overwrite=yes
        keep=(birthdate birthdate_after_beta age age_after_beta));
      dcl package GDPR p();
      dcl date birthdate_after_beta having format ddmmyyp10.;
      dcl double age;
      dcl double age_after_beta;
      method run();
         set pdp_de_demo.data.CRM_CUSTOMERBASE;
         age = round((today()-to_double(birthdate))/365.25,1);
         age_after_beta = round(p.BETA1('AGE1',age),1);
         birthdate_after_beta = to_date(p.BETA1('DATE1',to_double(birthdate)));
      end;
   enddata;
   run;
quit;
We don’t lose the “true” age. It can be recalculated using another beta function, known as the inverse, which is available only to authorized employees, for instance fraud or legal staff during data protection lawsuits. In these cases, your customer can safely be beamed back to earth, so to speak.
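The two conditions named earlier, invertible and monotonic, can be checked mechanically. The following Python sketch mirrors the √2 example purely to demonstrate those properties; it is not the SAS implementation, and the function names `beta1` and `beta1_inverse` are stand-ins for BETA1 and BETA1I.

```python
import math

FACTOR = math.sqrt(2)                    # same constant as the SAS example
DATE_OFFSET = math.floor(3650 * FACTOR)  # 5161 days, roughly 14 years


def beta1(kind, value):
    """Forward transformation: scale ages, shift day counts."""
    if kind == "AGE1":
        return value * FACTOR
    if kind == "DATE1":
        return value + DATE_OFFSET
    raise ValueError(f"unknown kind: {kind}")


def beta1_inverse(kind, value):
    """Inverse transformation: undo beta1."""
    if kind == "AGE1":
        return value / FACTOR
    if kind == "DATE1":
        return value - DATE_OFFSET
    raise ValueError(f"unknown kind: {kind}")


ages = [18, 25, 40, 73]
masked = [beta1("AGE1", a) for a in ages]

# Monotonic: the masked values keep the order of the originals.
assert masked == sorted(masked)

# Invertible: the round trip returns the original value.
assert all(math.isclose(beta1_inverse("AGE1", m), a) for a, m in zip(ages, masked))

# The masked age is rounded to whole years in the view above. The rounding
# error of at most 0.5 shrinks to about 0.35 after dividing by sqrt(2), so
# rounding the inverse still recovers whole-year ages.
assert all(round(beta1_inverse("AGE1", round(beta1("AGE1", a)))) == a for a in ages)
```

Dates are treated here as plain day counts, which matches how the `DATE1` branch adds a fixed number of days.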
A complaint from my office mate
“But how do I explain my model’s behavior for these 300-year-olds to the boss?!” ... Well, in this era of machine learning, neural networks are gaining in popularity, and they are as selective as they are indescribable. On our side, the math behind the masking is at least deterministic and explainable. It is also good to know that the key code is no longer stored on your PC or glued to its data source and target, but kept remote and safe, as modern data protection demands, to protect both you and the data. And that’s a good thing.
Final aspect: the data for relevant columns has now been subjected to smart masking, the logic is in a central repository, and it’s working in secret. But what about those seemingly harmless fields way in the back, mostly empty and irrelevant, which then in the form of a sales memo or notice suddenly reveal the name of the wife, the second email address, or the former employer? The author who created them thought it was extremely practical, since they didn’t find anywhere else in the contract template where they could enter and save the information.
(CASE
   WHEN (SYSPROC.DQ.DQUALITY.DQEXTRACT(A.COMMENTFIELD, 'PDP - Personal Data (Core)', 'Individual', 'DEDEU') ne '')
   THEN '* * *'
   ELSE A.COMMENTFIELD
 END) AS commentfield_without_name,
SYSPROC.DQ.DQUALITY.DQIDENTIFY(A.ANNOTATION, 'PDP - Personal Data (Core)', 'DEDEU') AS ANNOTATION_IDENTIFY
SAS Data Quality has pre-configured, transparent sets of rules that you can tweak as necessary to detect many of these types of cases using heuristics. That’s indispensable because if I don’t know about it, I can’t protect against it. (If I forget about the tiny basement window when installing the security system, I can be sure that the robbers won’t cooperate by breaking down the front door).
That is a prerequisite for an inventory of the data warehouse and for estimating the GDPR implementation expense, and here it also serves as an additional safeguard: in the code above, a firewall filter is applied to the data. If the name of a human being slips through the cracks, only asterisks are displayed on output. The “Note” field (ANNOTATION in the code) is always replaced by the description of the category, such as “This is where a telephone number is hidden. After approval by the data protection officer, you may read it – but not for now.”
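The filtering principle can be illustrated with a deliberately crude Python sketch. It uses two simple regular expressions as stand-ins for the SAS Data Quality rule sets; they are assumptions for illustration only, will produce false positives (a date can look like a phone number), and cannot detect personal names, which is exactly where a curated, heuristic rule base earns its keep. The names `PATTERNS`, `identify` and `filter_field` are hypothetical.

```python
import re

# Crude stand-in patterns, NOT the SAS Data Quality definitions.
PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "telephone number": re.compile(r"\+?\d[\d /-]{6,}\d"),
}


def identify(text):
    """Return the first category whose pattern matches, or None."""
    for category, pattern in PATTERNS.items():
        if pattern.search(text):
            return category
    return None


def filter_field(text):
    """Replace a free-text field by its category description if it looks personal."""
    category = identify(text)
    if category is None:
        return text
    return f"[{category} hidden; approval by the data protection officer required]"


# Hypothetical memo texts, not real customer data.
for memo in ("Prefers delivery on Fridays",
             "Call back on +49 30 1234567",
             "Second address: jane.doe@example.com"):
    print(filter_field(memo))
```

Harmless text passes through unchanged, while anything that matches is replaced as a whole, so the sensitive value never reaches the output.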
Disclaimer: The author of this blog is not an attorney. None of the statements in this article can be construed as legal advice nor can they serve as a substitute for professional legal consultation. All code samples are for illustrative purposes only.
Use of the beta function is an attractive idea, but is it vulnerable to identification of just one individual?
For example, suppose you have a pseudonymized table containing age and duration of employment for 10,000 individuals working in a company. One of them puts the following on his Facebook page:
"Today is my birthday! At 73, I am the oldest (and longest employed) member of the company!"
It wouldn't take much work to find this record in the table, then "reverse engineer" the beta function algorithm.
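This concern can be made concrete with a short Python sketch. It assumes the attacker guesses that ages were multiplied by a constant, as in the article's √2 example, and knows a single true value (the 73-year-old). Because the function is monotonic, the oldest person still has the largest masked age. The sample ages and the name `estimate_factor` are hypothetical.

```python
import math


def estimate_factor(known_age, masked_ages):
    """Estimate a multiplicative factor from one known original value.

    Monotonic masking preserves order, so the oldest person is still
    the record with the largest masked age.
    """
    return max(masked_ages) / known_age


# Hypothetical table, masked the way the article's AGE1 branch would.
true_ages = [31, 45, 58, 73]
masked_ages = [round(a * math.sqrt(2)) for a in true_ages]

factor = estimate_factor(73, masked_ages)
recovered = [round(m / factor) for m in masked_ages]

print(f"estimated factor: {factor:.4f}")
print("recovered ages:", recovered)
```

In this small example a single known record is enough to approximate the factor and unmask the remaining ages, which suggests that a simple linear function needs to be combined with other safeguards.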