
When Anonymized Data Isn't Private: How Large Language Models Reveal Emerging Weak Points

Classic anonymisation was built for smaller, slower-moving data worlds. Today's massive, linkable datasets and off-the-shelf LLMs make privacy attacks easier by lowering skill barriers and automating expert-style reasoning. Resilience now depends on data minimisation, prompt controls, and formal guarantees like differential privacy to counter uncontrolled linkage.

Alejandro Russo

Thanks to Cybercampus Sverige, Alejandro Russo and the rest of DPella's team were able to explore how simple privacy attacks can now be amplified by modern AI tools. What began as a bachelor thesis on data anonymization weaknesses in the presence of AI evolved into hands-on experiments with realistic datasets, resulting in clickable, reproducible examples that reveal how easily data can be reconstructed under seemingly safe conditions. The work also highlights a growing risk: large language models (LLMs) are not only powerful tools for data analysis; they can also be repurposed for de-anonymization, lowering the technical barrier for privacy attacks and challenging long-held assumptions about data safety.

Description

Classic anonymisation methods (k-anonymity, l-diversity, etc.) were built for a world of smaller, slower-paced, and less connected datasets. In today's larger and more linkable environment, their practical resilience is strained, not because a sudden theoretical break appeared, but because the context changed. Off‑the‑shelf LLMs now let an ordinary user take well‑known privacy attack ideas (background knowledge, linkage, similarity) and, through plain-language prompts, receive structured, analyst‑style reasoning about likely hidden attributes. The interesting part is less the final guess (which is still often correct given a well-crafted prompt) than the reasoning path: combining demographic cues, cross‑table hints, and population priors much as a privacy specialist would do manually. This lowers the skill barrier and makes probing anonymised releases easier in practice. Improving resilience now means pairing data minimisation and careful prompt boundaries with methods that offer formal guarantees (e.g., differential privacy), while reducing uncontrolled auxiliary linkage. The sketch below shows the core pattern in miniature.
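The following is a minimal, self-contained Python sketch, not the project's actual code: the dataset, column names, and prompt wording are all hypothetical. It walks through a release that drops direct identifiers but keeps quasi-identifiers, a k-anonymity check that flags a unique row, a linkage step against auxiliary background data, and the kind of plain-language prompt that would hand the resulting evidence to an off-the-shelf LLM for analyst-style inference.

```python
# Hypothetical illustration of the linkage-plus-LLM pattern described above.
# All records, column names, and prompt text are invented for this sketch.
from collections import defaultdict

# "Anonymised" release: direct identifiers removed, quasi-identifiers kept.
released = [
    {"zip": "41107", "age_band": "30-39", "sex": "F", "diagnosis": "asthma"},
    {"zip": "41107", "age_band": "30-39", "sex": "F", "diagnosis": "diabetes"},
    {"zip": "41322", "age_band": "50-59", "sex": "M", "diagnosis": "hypertension"},
]

# Public auxiliary data an attacker can link on (e.g. a social profile).
auxiliary = {"name": "J. Doe (hypothetical)", "zip": "41322", "age_band": "50-59", "sex": "M"}

QUASI_IDENTIFIERS = ("zip", "age_band", "sex")


def equivalence_classes(rows, keys=QUASI_IDENTIFIERS):
    """Group released records that are indistinguishable on the quasi-identifiers."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[k] for k in keys)].append(row)
    return classes


def k_anonymity(rows, keys=QUASI_IDENTIFIERS):
    """k is the size of the smallest equivalence class; k == 1 means some row is unique."""
    return min(len(group) for group in equivalence_classes(rows, keys).values())


def linkable_records(rows, aux, keys=QUASI_IDENTIFIERS):
    """Linkage step: released rows consistent with the attacker's background knowledge."""
    return [row for row in rows if all(row[k] == aux[k] for k in keys)]


def analyst_style_prompt(candidates, aux):
    """Plain-language prompt handing an LLM the linkage evidence and asking for
    analyst-style reasoning about the hidden attribute (wording is illustrative only)."""
    return (
        "You are auditing an anonymised medical release.\n"
        f"Background knowledge about the target: {aux}\n"
        f"Released records matching that background: {candidates}\n"
        "Reason step by step, using demographic cues and population base rates, "
        "about the target's most likely diagnosis."
    )


if __name__ == "__main__":
    print("k-anonymity of the release:", k_anonymity(released))  # prints 1: a unique row exists
    matches = linkable_records(released, auxiliary)
    print("rows linkable to the target:", matches)                # exactly one candidate row
    print(analyst_style_prompt(matches, auxiliary))
```

The same pipeline also shows where the mitigations bite: generalising or suppressing quasi-identifiers raises k, while releasing only noised aggregate answers (as differential privacy requires) removes the one-to-one link the prompt relies on.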

Read on for the story, methodology, examples, and how to reproduce the experiments yourself.

Resources

Contact

alejandro@dpella.io