GULYÁS, Gábor György, Ph.D.

Blog

Use kmap to visualize anonymity/uniqueness in your data

2016-05-26 | Gabor

While I was working on a paper recently, I was asking myself the question how to visualize the uniqueness (or anonymity set sizes) in the data. The only visualization that I am still aware of is Fig. 3 in the Panopticlick experiment, which shows anonymity set sizes created by each value of each attribute. This is it:

While this is a nice figure, it is quite hard to understand it quantitatively, and it can be even more complicated if you want to compare different datasets by using this visualization method. However, it would be nice to understand the state of uniqueness in datasets, especially if you consider different attributes in each case, apply anonymization or other countermeasures to decrease uniqueness.

This is why I started looking for another option, which finally lead to creating a simple, but heavily customizable plotting function I call kmap [code]. This tool can be used for multiple purposes, either if you are a data scientist experimenting or looking for a way that enables explanatory visualization to non-experts. It is useful to

visualize how different attributes partitionate your data into anonymity sets,
how different anonymization schemes or their setup affect uniqueness,
show how precise a certain fingerprinting method is.

Let's see a nice example based on UCI Adult Data Set. This tabular dataset contains attributes like age, sex or workclass of more than 30k adults. Let's pretend that we are considering releasing this dataset, and we would like to know how many (and which) attributes could be safely released. In order to get a better understanding of this, let's visualize the level of identification (uniqueness) if we release only 3, 6 or 9 attributes of each user. This looks like this with kmap:

3 attributes	6 attributes	9 attributes

It is quite easy to tell the differences by looking at the figures: releasing only 3 or 6 attributes is relatively safe (*) as less than 25% of the dataset can be uniquely identified. On the other hand, if 9 attributes are released, that would make almost 75% of users concerned by the release unique.

If you would like to try out kmap for your self, you can find the code and the files for the example in this git repository. Plus, our paper got accepted where this visualization was used, thus more useful examples can be expected.

I would like to hereby thank Gergely Acs, Claude Castelluccia, Amrit Kumar and Luca Melis for their comments while I was developing kmap.

(*) What is safe or not is another question; in some scenarios even having 6% of the users identifiable can be considered as a problem.

Tags: anonymity, visualization, plot, kmap, anonymity set, uniqueness

Back to the archives

Blog tagcloud

CSP (1), Content-Security-Policy (1), ad industry (1), adblock (1), ads (1), advertising wars (1), amazon (1), announcement (1), anonymity (9), anonymity measure (2), anonymity paradox (3), anonymity set (1), boundary (1), bug (2), code (1), control (1), crawling (1), data privacy (1), data retention (1), data surveillance (1), de-anonymization (2), definition (1), demo (1), device fingerprint (2), device identifier (1), disposable email (1), ebook (1), el capitan (1), email privacy (1), encryption (1), end (1), extensions (1), fairness (1), false-beliefs (1), fingerprint (3), fingerprint blocking (1), fingerprinting (3), firefox (1), firegloves (1), font (1), future of privacy (2), google (1), google glass (1), home (1), hungarian keyboard layout (1), inkscape (1), interesting paper (1), internet measurement (1), keys (1), kmap (1), latex (1), location guard (1), location privacy (1), logins (1), mac (1), machine learning (3), neural networks (1), nsa (2), osx (2), paper (2), pet symposium (2), plot (1), price of privacy (1), prism (1), privacy (8), privacy enhancing technology (1), privacy-enhancing technologies (2), privacy-enhancing technology (1), profiling (2), projects (1), raising awareness (1), rationality (1), re-identification (1), simulation (1), social network (2), surveillance (2), tbb (1), thesis contest (1), tor (1), tracemail (1), tracking (12), tracking cookie (1), transparency (1), tresorit blog (4), uniqueness (3), visualization (1), web bug (3), web privacy (3), web security (1), web tracking (3), win (1), you are the product (1)

@GulyasGG