Data Anonymization - Removing Personal Identifiers
Anonymization is a technique that irreversibly removes personally identifiable information from data, transforming it into a state where the original individual can no longer be identified. Under GDPR, properly anonymized data falls outside the definition of personal data and is therefore exempt from regulation. However, achieving "complete anonymization" is extremely difficult technically, and there are multiple reported cases where inadequate anonymization was broken by re-identification attacks. Anonymization is not mere data processing but an advanced technical field that lies at the core of privacy protection.
The Crucial Difference from Pseudonymization
Anonymization: An irreversible transformation. It creates a state in which no means exist to identify the original individual, so the data is exempt from GDPR regulation. However, proving complete anonymization is difficult.
Pseudonymization: A reversible transformation. It replaces identifiers with placeholder values, but the original individual can be restored if a mapping table exists, so the data remains subject to GDPR regulation. Tokenization is one method of pseudonymization.
A distinction often confused in practice is the difference from data masking. Masking is concealment at the display level (for example, replacing all but the last 4 digits of a credit card number with *), and if the original data remains, it cannot be called anonymization. When handling personally identifiable information, you must understand the differences between these techniques accurately.
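The difference between the techniques can be illustrated with a minimal sketch. The `Pseudonymizer` class and `mask_card_number` function are hypothetical names introduced here for illustration, not part of any library. The point is that both outputs still leave a path back to the original data, so neither counts as anonymization.

```python
import secrets


class Pseudonymizer:
    """Replaces identifiers with random tokens and keeps the mapping table."""

    def __init__(self):
        self._token_by_id = {}
        self._id_by_token = {}

    def tokenize(self, identifier: str) -> str:
        # Reuse the same token for a repeated identifier so records stay linkable.
        if identifier not in self._token_by_id:
            token = secrets.token_hex(8)
            self._token_by_id[identifier] = token
            self._id_by_token[token] = identifier
        return self._token_by_id[identifier]

    def restore(self, token: str) -> str:
        # Anyone holding the mapping table can reverse the transformation.
        return self._id_by_token[token]


def mask_card_number(card_number: str) -> str:
    """Display-level masking: hide all but the last 4 digits."""
    digits = card_number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - 4) + digits[-4:]


pseudonymizer = Pseudonymizer()
token = pseudonymizer.tokenize("alice@example.com")
print(token)                                     # random token, differs per run
print(pseudonymizer.restore(token))              # alice@example.com
print(mask_card_number("4111 1111 1111 1111"))   # ************1111
```

As long as the mapping table (or the unmasked source record) exists somewhere, the data remains personal data in the sense described above.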
Technical Approaches to Anonymization
| Method | Overview | Strength |
|---|---|---|
| k-anonymity | Ensures that each record is indistinguishable from at least k-1 other records with respect to its quasi-identifiers | Basic. Vulnerable to attribute disclosure when a group shares the same sensitive value |
| l-diversity | In addition to k-anonymity, guarantees that the sensitive attribute takes at least l distinct values within each group | Moderate. Vulnerable to skewed distributions |
| t-closeness | Guarantees that the distribution of sensitive attributes within each group stays within a difference of t from the overall distribution | High. Complex to implement |
| Differential privacy | Adds calibrated noise to query results, mathematically bounding how much the presence or absence of any one individual can change the statistics | Highest. Adopted by Apple and Google |
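To make the first two rows of the table concrete, the following is a minimal sketch that measures the k and l a dataset actually achieves. The function names, column names, and records are all made up for illustration. It assumes the records have already been generalized and that the dataset is non-empty.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group: the k the dataset achieves."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return min(len(group) for group in groups.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any group."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


# Illustrative records only.
records = [
    {"age_band": "30-39", "zip3": "100", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "100", "diagnosis": "flu"},
    {"age_band": "40-49", "zip3": "101", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "101", "diagnosis": "diabetes"},
]
qi = ["age_band", "zip3"]
print(k_anonymity(records, qi))               # 2
print(l_diversity(records, qi, "diagnosis"))  # 1
```

The sample data is 2-anonymous but only 1-diverse: anyone known to be in the first group is revealed to have the same diagnosis, which is the weakness of k-anonymity noted in the table.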
Practical Examples of Differential Privacy
Differential privacy is used by Apple for iOS keyboard input statistics and Safari browsing data collection, and by Google for Chrome usage statistics (RAPPOR) and Google Maps congestion information. Because random noise is added to each individual user's data before aggregation, the amount that can be inferred about a specific individual's behavior from the aggregated results is mathematically bounded. The smaller the value of the privacy parameter ε (epsilon), the stronger the protection, but there is a trade-off in that the usefulness of the data declines.
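The idea of adding noise on each user's side can be sketched with randomized response, a classic mechanism for a single yes/no attribute. This is a simplified illustration, not the actual Apple or Google implementation. The function names and the simulated population are assumptions made for the example. Each user reports the truth with probability e^ε / (1 + e^ε) and the opposite otherwise, and the aggregator corrects for the known flip rate. ε must be greater than 0, since the estimator divides by 2p - 1.

```python
import math
import random


def truth_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under epsilon-local DP."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomize(bit: bool, epsilon: float, rng: random.Random) -> bool:
    """Runs on the user's side: report the true bit or its opposite."""
    if rng.random() < truth_probability(epsilon):
        return bit
    return not bit


def estimate_true_rate(reports, epsilon: float) -> float:
    """Runs on the aggregator's side: undo the expected effect of the flips."""
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
# Simulated population: 30% of 10,000 users have the attribute.
true_bits = [i < 3000 for i in range(10_000)]
for epsilon in (0.1, 1.0, 4.0):
    reports = [randomize(bit, epsilon, rng) for bit in true_bits]
    print(epsilon, round(estimate_true_rate(reports, epsilon), 3))
```

With a small ε each report is close to a coin flip, so the estimate of the true 30% rate tends to be noisier; with a larger ε the estimate is more accurate but each individual report reveals more.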
The Reality of Re-identification Attacks - The Netflix Prize Dataset Case
In 2006, Netflix released the movie rating data of about 500,000 people in "anonymized" form for a contest to improve its recommendation algorithm. Usernames were removed and IDs were replaced with random numbers. However, in 2007, researchers at the University of Texas at Austin demonstrated that individuals could be identified from Netflix's anonymized data by cross-referencing it with public data from IMDb (a movie review site). According to their analysis, knowing about 8 of a subscriber's movie ratings and approximate dates, even when 2 of the ratings were wrong and the dates were off by up to 14 days, was enough to uniquely identify 99% of the records in the dataset.
This case shows that merely removing direct identifiers (names, email addresses) is insufficient as anonymization. There is always a re-identification risk arising from combinations of quasi-identifiers such as behavioral patterns, purchase history, and location data.
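The mechanics of such a linkage attack can be sketched in a few lines. This is a deliberately simplified illustration, not the researchers' actual algorithm, and all IDs, titles, ratings, and dates below are made up. It scores each anonymized record by how many of a public profile's ratings it matches within a date tolerance.

```python
from datetime import date


def similarity(anon_ratings, public_ratings, max_day_gap=14):
    """Count movies rated identically in both profiles within max_day_gap days."""
    matches = 0
    for movie, (rating, rated_on) in public_ratings.items():
        if movie in anon_ratings:
            anon_rating, anon_date = anon_ratings[movie]
            close_in_time = abs((anon_date - rated_on).days) <= max_day_gap
            if anon_rating == rating and close_in_time:
                matches += 1
    return matches


def best_match(public_ratings, anonymized_dataset):
    """Return (best_id, best_score, runner_up_score) for one public profile."""
    scores = {
        anon_id: similarity(ratings, public_ratings)
        for anon_id, ratings in anonymized_dataset.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return best_id, best_score, runner_up


# "Anonymized" release: random IDs, but ratings and dates are intact.
anonymized = {
    1001: {"Movie A": (5, date(2005, 3, 1)), "Movie B": (2, date(2005, 3, 9)),
           "Movie C": (4, date(2005, 4, 2))},
    1002: {"Movie A": (5, date(2005, 6, 1)), "Movie D": (3, date(2005, 6, 3))},
    1003: {"Movie B": (2, date(2005, 1, 15)), "Movie C": (1, date(2005, 1, 20))},
}
# Ratings a named person posted publicly on a review site.
public_profile = {
    "Movie A": (5, date(2005, 3, 3)),
    "Movie B": (2, date(2005, 3, 10)),
    "Movie C": (4, date(2005, 4, 5)),
}
print(best_match(public_profile, anonymized))  # (1001, 3, 0)
```

A large gap between the best score and the runner-up indicates that the public profile singles out exactly one "anonymous" record, which then exposes all of that person's other ratings.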
Its Position Under GDPR
Recital 26 of the GDPR states that the principles of data protection do not apply to anonymous information, which means that anonymized data does not constitute personal data. In other words, properly anonymized data is exempt from GDPR's consent requirements, the exercise of data subject rights, cross-border transfer restrictions, and the like. This is a major incentive for companies, but the criteria for "proper anonymization" are strict. Opinion 05/2014 on Anonymisation Techniques, issued by the Article 29 Working Party (the predecessor of the European Data Protection Board, EDPB), requires that all three risks of "singling out (identifying an individual)," "linkability (linking data)," and "inference" be eliminated as the assessment criteria for anonymization.
Following the principle of privacy by design, it is recommended to build anonymization into the design from the data collection stage. Please also refer to balancing privacy and convenience and the privacy settings guide.
Practical Decision Criteria
When carrying out anonymization, a risk assessment that assumes "an attacker with both the motive and the capability to re-identify individuals from this data" is essential. For data used in internal analysis, k-anonymity may be sufficient, but for datasets released externally, the application of differential privacy should be considered. From the perspective of digital identity protection as well, the quality of anonymization is directly linked to an organization's trustworthiness.
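As one input to such a risk assessment, the sketch below compares the smallest group size before and after generalizing quasi-identifiers. The field names, the 10-year age bands, the 3-character postal prefix, and the records are assumptions chosen for the example. A smallest group of size 1 means at least one record is unique and can be singled out.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest set of records sharing the same quasi-identifiers."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


def generalize(record):
    """Coarsen quasi-identifiers: 10-year age band and 3-character postal prefix."""
    low = (record["age"] // 10) * 10
    return {
        "age": f"{low}-{low + 9}",
        "zip": record["zip"][:3] + "**",
        "diagnosis": record["diagnosis"],
    }


# Illustrative records only.
raw = [
    {"age": 31, "zip": "10001", "diagnosis": "flu"},
    {"age": 36, "zip": "10002", "diagnosis": "asthma"},
    {"age": 38, "zip": "10009", "diagnosis": "flu"},
    {"age": 42, "zip": "10103", "diagnosis": "diabetes"},
    {"age": 47, "zip": "10108", "diagnosis": "flu"},
]
qi = ["age", "zip"]
print(smallest_group(raw, qi))           # 1: every raw record is unique
generalized = [generalize(r) for r in raw]
print(smallest_group(generalized, qi))   # 2: each record now hides among at least 2
```

Coarser generalization raises k but lowers analytical value, so the acceptable level depends on who the assumed attacker is and what auxiliary data they could hold.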