GULYÁS, Gábor György, Ph.D.

Social Network De-anonymization

This page demonstrates the power of social network de-anonymization; it explains what it is, provides insights how these attacks work, and contains a visual simulator, where you can play around with the attacks.

This could be a useful resource for:

privacy-pro activists for demonstrating the power of re-identification attacks,
teachers explaining privacy risk to their students,
curious people to learn how much of our privacy is exposed when meta-data is revealed,
researchers, who would like to take a look at what we did and try it out. We proposed a new attack called Bumblebee, our related paper is here.

You can also find a Python starter kit here. It contains the algorithm source codes and the datasets you can find below.

If you would like to go for a deeper test, you should take a look at the SALab framework, with which you can create your own datasets, and run experiments on large datasets having tens of thousands of nodes.

So, what is social network de-anonymization?

For the intuitive explanation check out the following demo, flavored a bit for the presidential elections.

Click me!

to unveil the visual explanation!

These attacks can have a severe impact on privacy, as they could be used in multiple adversarial settings – beside re-identification attacks like in this example, e.g., linking unrelated datasets, like email and telephone communication networks. It wouldn't be surprising to find out someday that secret services had been relying on these attacks in some form.

Changing the scoring method – heart transplantation?

(Note: this part concerns our paper. If you would like to understand better what are the differences on the algorithms below, read this part, otherwise, you can directly jump to the simulator. )

Remember the scoring part in the visual explanation? Actually, that plays a crucial part in the first attack of this kind, which we will call Nar later on. In that attack, they derived the scoring function from the cosine similarity. They showed on two real social networks that Nar can do its job pretty well: it mapped roughly 8,000 users correctly between a Twitter and a Flickr crawl, while it only made a bit more than 3,000 errors (out of ~27k users that were available in both social networks).

As said, the Nar scoring scheme was derived from the cosine similarity and looks like this:

where each set is the set of neighbors of each node. This scheme is biased, as it penalizes high degree nodes in favor of low degree nodes. We proposed the following metric, called BlbSim, that eliminates the bias:

and incorporated in an attack called Bumblebee ( more details here). We'll use Blb for short.

It turned out that this changes the whole behavior of the attack. As the similarity functions provides a wider diversity of scores, but also changes several characteristics of the attack.

Below, you can run the Nar and Blb attacks and compare them directly in your browser! Here, we provide some tips on what differences might be worth checking for, along our main findings.
(Note: these are very small toy networks, handle results with precaution. Do real experiments in the SALab framework.)

So, some of our key findings:

Blb has a higher number of correct re-identifications (TPR), while keeps the error low (FPR).
Demo: for example run Nar over Dataset #2 with default settings, then Blb with setting Θ=0.1.
Blb is more robust to noise than Nar.
Demo: switch to Dataset #3, where you'll see that attacks have a more difficult task. Run Nar with Θ=0.5, #seeds=25, then Blb with setting Θ=0.5 or Θ=1.0, #seeds=25, δ=0.5.
Blb can be run with less initial mappings than Nar.
Demo: with Dataset #1 and default settings run Nar with #seeds ∈ [1, 2, 3], then Blb with setting #seeds=1.
(In larger networks we observed differences in minimum seed set sizes like 1 vs. 60.)
The new scoring system of Blb gives the attack a more fine-grained control.
We evaluated Blb against several other attacks under different anonymization schemes. We did this by replicating the results of a recent Usenix paper that compared social network de-anonymization attacks. We found that Bumblebee had superior results to all: read more in this blog post

DIY: test de-anonymization for yourself – in the browser!

Below, you can run de-anonymization attacks and compare results on a scoreboard. Attacks run on two graphs: the grey is the background information of the attacker where user identities are known, and the blue is the anonymized network with sensitive information.

Parameter Θ controls the greediness for both algorithms: higher Θ rates will make the algorithms more precautios (less mappings with less error).
The #seeds determines the number of initial mappings before the algorithms are run.
With parameter δ you can set the sensitiveness of the Blb attack to node degrees. For example, with low δ it will not seriously consider node degree differences, but with high δ the difference between node degrees will play a significant role. We found that it is also a control of greediness, and as a rule of thumb it is good to have it around 0.1-0.5.
You can choose between three datasets to play with. The first one is quite small and an easy prey for de-anonymization (the two graphs are identical). It will surely run on older smartphones, though. Datset #2 and #3 are larger ones with varying levels of differences in the background and anonymized graphs.

Controls

Algorithm

Start

Params

Θ =
#seeds =
δ =

Click me!

to see the attacks in action – right in your browser!

Scoreboard

@GulyasGG