One step closer to decreasing linkage risk

A linkage attack is an attempt to re-identify individuals in an anonymized dataset by combining that dataset with background information from other sources. The link is built on attributes that appear in both sets, such as postcode, gender, payment details and many more.

Many organizations are not aware of the linkage risk associated with their data. Although they remove or anonymize the direct identifiers in a dataset, they often overlook the embedded risk: the remaining attributes can, in combination, still single out an individual.
We took part in a hackathon to gather up-to-date experience and knowledge that help us secure our clients' data.
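
The embedded risk described above can be made concrete by counting how many records share each combination of the remaining attributes. The following is a minimal sketch, not the method used at the hackathon: the records, the column names and the helper functions `group_sizes` and `unique_fraction` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical de-identified records: names are removed,
# but postcode, gender and birth year are still present.
records = [
    {"postcode": "1011", "gender": "F", "birth_year": 1985},
    {"postcode": "1011", "gender": "F", "birth_year": 1985},
    {"postcode": "1011", "gender": "M", "birth_year": 1990},
    {"postcode": "1052", "gender": "F", "birth_year": 1978},
    {"postcode": "1052", "gender": "M", "birth_year": 1990},
]

# Assumed set of attributes an attacker could also find elsewhere.
QUASI_IDENTIFIERS = ("postcode", "gender", "birth_year")


def group_sizes(rows, keys):
    """Count how many rows share each combination of attribute values."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_fraction(rows, keys):
    """Fraction of rows whose attribute combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    unique_rows = sum(1 for row in rows if sizes[tuple(row[k] for k in keys)] == 1)
    return unique_rows / len(rows)


sizes = group_sizes(records, QUASI_IDENTIFIERS)
k = min(sizes.values())  # size of the smallest group
risk = unique_fraction(records, QUASI_IDENTIFIERS)
print(f"k = {k}, unique fraction = {risk:.2f}")
```

In this toy data three of the five records are unique on the chosen attributes, so anyone who knows those three values for a person can single out that person's record even though no name is stored.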

The Privacy Preserving Machine Learning (#PPML) with an ethical approach hackathon for students, professionals and enthusiasts was held on the 16th and 17th of June as part of the International AI Ethicon event. The participating E-Group team excelled at the event: our colleagues won second place, competing against international and domestic teams.

The organizers set two different challenges:

  1. Differential privacy & synthetic data generation
  2. Analysis of vulnerable datasets

Our team (Marcell Zoltay, Krisztián Schlepp and Marcell Gál) took part in challenge 2. They demonstrated the limitations of data de-identification methods and, in doing so, made the case for more sophisticated Privacy Enhancing Technologies (PETs), such as differential privacy.

Data de-identification anonymizes a dataset by removing any obvious “personally identifiable information” (PII), such as names, addresses and dates of birth. However, privacy attacks and advances in privacy-preserving research have shown that datasets anonymized in this way can be compromised by “linkage attacks”, in which the published data is de-anonymized by linking it to auxiliary information obtained from a different source.
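
A linkage attack of this kind can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not the team's hackathon solution: both datasets, the names, the column names and the helpers `key_of` and `link` are invented. It joins a de-identified release to an auxiliary source on the attributes they share and keeps only unambiguous matches.

```python
from collections import Counter, defaultdict

# Hypothetical "anonymized" release: direct identifiers removed.
released = [
    {"postcode": "1011", "gender": "F", "birth_year": 1985, "credit_score": 712},
    {"postcode": "1011", "gender": "M", "birth_year": 1990, "credit_score": 640},
    {"postcode": "1052", "gender": "F", "birth_year": 1978, "credit_score": 580},
]

# Hypothetical auxiliary data from a different source, including names.
auxiliary = [
    {"name": "Alice Example", "postcode": "1011", "gender": "F", "birth_year": 1985},
    {"name": "Bob Example", "postcode": "1011", "gender": "M", "birth_year": 1990},
    {"name": "Carol Example", "postcode": "1077", "gender": "F", "birth_year": 1978},
]

# Assumed attributes present in both datasets.
KEYS = ("postcode", "gender", "birth_year")


def key_of(row, keys):
    """Build the linking key from the shared attributes of one row."""
    return tuple(row[k] for k in keys)


def link(released_rows, auxiliary_rows, keys):
    """Return (name, credit_score) pairs for one-to-one matches on the keys."""
    names_by_key = defaultdict(list)
    for row in auxiliary_rows:
        names_by_key[key_of(row, keys)].append(row["name"])
    released_counts = Counter(key_of(row, keys) for row in released_rows)

    matches = []
    for row in released_rows:
        key = key_of(row, keys)
        candidates = names_by_key.get(key, [])
        # Keep a match only if exactly one person and one record share the key.
        if len(candidates) == 1 and released_counts[key] == 1:
            matches.append((candidates[0], row["credit_score"]))
    return matches


reidentified = link(released, auxiliary, KEYS)
for name, score in reidentified:
    print(f"{name} -> credit score {score}")
```

Two of the three released records are re-identified here, although the release itself contains no names. The third stays unlinked only because the auxiliary source happens to list a different postcode for that person.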

The task was to find as many connections as possible between the data in four different datasets:

  • Credit card transactions
  • Credit score data based on address location
  • Credit card churn details
  • Anonymized personal data

The team gained very useful experience, not just in discovering connections between data but also in Python data manipulation and data analysis. The challenge demonstrated how easy it is to link data and how vulnerable the datasets we share so often, such as credit card transaction data, really are.

E-Group never rests: we are constantly working to improve our solutions, to eliminate potential data vulnerabilities and to discover new prevention methods.
