When Anonymized Data Isn't Private: How Large Language Models Reveal Emerging Weak Points
Classic anonymisation was built for smaller, slower data worlds. Today’s massive, linkable datasets—and off-the-shelf LLMs—make privacy attacks easier by lowering skill barriers and automating expert-style reasoning. Resilience now depends on data minimisation, prompt controls, and formal guarantees like differential privacy to counter uncontrolled linkage.
Description
Classic anonymisation methods (k-anonymity, l-diversity, etc.) were built for a world of smaller, slower-paced, and less connected datasets. In today's larger and more linkable environment, their practical resilience is strained, not because of a sudden theoretical break but because the context has changed. Off-the-shelf LLMs now let an ordinary user take well-known privacy attack ideas (background knowledge, linkage, similarity) and, through plain-language prompts, receive structured, analyst-style reasoning about likely hidden attributes. The interesting part is less the final guess (still often correct given a well-crafted prompt) than the reasoning path: combining demographic cues, cross-table hints, and population priors much as a privacy specialist would do manually. This lowers the skill barrier and makes probing anonymised releases easier in practice. Improving resilience now means pairing data minimisation and careful prompt boundaries with methods that offer formal guarantees (e.g., differential privacy), and reducing uncontrolled auxiliary linkage.
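To make the idea concrete, here is a minimal sketch of how such a plain-language probe might be assembled. The records, attribute names, and the `ask_llm` helper are purely illustrative assumptions, not the methodology used in the experiments described below; the point is only that the prompt encodes a classic background-knowledge/linkage attack in ordinary language.

```python
# Hypothetical k-anonymised equivalence class from a health release:
# quasi-identifiers are generalised, the sensitive attribute is kept.
released_rows = [
    {"age": "30-39", "zip": "481**", "sex": "F", "diagnosis": "heart disease"},
    {"age": "30-39", "zip": "481**", "sex": "F", "diagnosis": "viral infection"},
    {"age": "30-39", "zip": "481**", "sex": "F", "diagnosis": "heart disease"},
]

# Auxiliary background knowledge an attacker might hold about one person
# (e.g., from a public register or social media), matching the same
# quasi-identifiers.
auxiliary = {"name": "Jane Doe", "age": 34, "zip": "48109", "sex": "F"}

prompt = f"""You are assisting with a record-linkage exercise.
Released (anonymised) rows: {released_rows}
Background knowledge about one individual: {auxiliary}
Reason step by step about which released row(s) could belong to this
individual and what her most likely diagnosis is, taking population
base rates into account. Give a probability for each candidate."""

# ask_llm(prompt) would be a thin wrapper around whichever chat model is
# available; here we only print the prompt to show what the attacker sends.
print(prompt)
```

Nothing in the prompt requires expertise: the generalised rows, the auxiliary record, and a request for step-by-step reasoning are enough for the model to reproduce the analyst-style combination of cues described above.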
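On the mitigation side, "formal guarantees" can be illustrated with the simplest differentially private primitive, the Laplace mechanism for counting queries. This is a generic textbook sketch under assumed parameter names, not the specific mechanism evaluated in the article.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, rng=np.random.default_rng()) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1, so adding Laplace noise with
    scale 1/epsilon bounds what any single individual's presence can
    reveal, regardless of the attacker's auxiliary knowledge.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: noisy number of released rows with a given diagnosis.
print(dp_count(2, epsilon=0.5))  # rerunning gives a different noisy answer
```

Unlike k-anonymity, the guarantee here does not depend on assumptions about which auxiliary datasets exist, which is exactly the property that matters once LLMs make linkage cheap.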
Read on for the story, methodology, examples, and how to reproduce the experiments yourself.
Resources
Contact
alejandro@dpella.io