(Originally written, Apr 8, 2012.)

Anonymity on the internet is a very fragile thing; every anonymous online identity on this planet is only about $31$ bits of information away from being completely exposed. This is because the total number of internet users on this planet is about $2$ billion, or approximately $2^{31}$. Initially, all one knows about an anonymous internet user is that he or she is a member of this large population, which has a Shannon entropy of about $31$ bits. But each piece of new information about this identity will reduce this entropy. For instance, knowing the gender of the user will cut down the size of the population of possible candidates for the user’s identity by a factor of approximately two, thus stripping away one bit of entropy. (Actually, one loses a little less than a whole bit here, because the gender distribution of internet users is not perfectly balanced.) Similarly, any tidbit of information about the nationality, profession, marital status, location (e.g. timezone or IP address), hobbies, age, ethnicity, education level, socio-economic status, languages known, birthplace, appearance, political leaning, etc. of the user will reduce the entropy further. (Note though that entropy loss is not always additive; if knowing $X$ removes $2$ bits of entropy and knowing $Y$ removes $3$ bits, then knowing both $X$ and $Y$ does not necessarily remove $5$ bits of entropy, because $X$ and $Y$ may be correlated instead of independent, and so much of the information gained from $Y$ may already have been present in $X$.)
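
The bit counts in this paragraph are easy to check numerically. The following is only a minimal sketch: the helper names `bits` and `binary_entropy` are made up for illustration, the $55/45$ gender split is a hypothetical figure rather than a measured one, and the "nested" case (where $Y$ already implies $X$) is just one extreme way in which two facts can fail to be independent.

```python
import math

def bits(probability):
    """Bits of entropy removed by learning a fact that holds with this probability."""
    return -math.log2(probability)

def binary_entropy(p):
    """Average number of bits revealed by a yes/no attribute with a p : (1 - p) split."""
    return p * bits(p) + (1 - p) * bits(1 - p)

# About 2 billion internet users (the figure used in the text).
population = 2_000_000_000
initial_entropy = math.log2(population)

# ASSUMPTION: a hypothetical 55/45 gender split, chosen only for illustration.
gender_bits = binary_entropy(0.55)

# X holds for 1/4 of the population (2 bits); Y holds for 1/8 of it (3 bits).
p_x, p_y = 1 / 4, 1 / 8
bits_if_independent = bits(p_x * p_y)  # the two facts are independent
bits_if_nested = bits(p_y)             # Y already implies X, so X adds nothing

print(f"initial entropy: {initial_entropy:.2f} bits")
print(f"gender reveals:  {gender_bits:.4f} bits on average")
print(f"X and Y independent: {bits_if_independent:.1f} bits")
print(f"Y implies X:         {bits_if_nested:.1f} bits")
```

In the independent case the losses add up to $5$ bits; in the nested case, learning both facts removes only the $3$ bits already carried by $Y$.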

One can reveal quite a few bits of information about oneself without any serious loss to one’s anonymity; for instance, if one has revealed a net of $20$ independent bits of information over the lifetime of one’s online identity, this still leaves one in a crowd of about $2^{11} \approx 2000$ other people, enough to still enjoy some reasonable level of anonymity. But as one approaches the threshold of $31$ bits, the level of anonymity drops exponentially fast. Once one has revealed more than $31$ bits, it becomes theoretically possible to deduce one’s identity, given a sufficiently comprehensive set of databases about the population of internet users and their characteristics. Of course, such an ideal set of databases does not actually exist; but one can imagine that government intelligence agencies may have enough of these databases to deduce one’s identity from, say, $50$ or $60$ bits of information, and even publicly available databases (such as what one can access from popular search engines) are probably enough to do the job given, say, $100$ bits of information, assuming sufficient patience and determination. Thus, in today’s online world, a crowd of billions of other people is considerably less protection for one’s anonymity than one may initially think, and just because the first $20$ or $30$ bits of information you reveal about yourself lead to no apparent loss of anonymity, this does not mean that the next $20$ or $30$ bits revealed will be equally harmless.
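
To see how quickly the crowd thins out, one can tabulate its size against the number of independent bits revealed. This is a minimal sketch under the idealised assumptions of the text (a population of exactly $2^{31}$ and perfectly independent, evenly split bits); the function name `crowd_size` is hypothetical.

```python
def crowd_size(bits_revealed, total_bits=31):
    """Number of candidates still consistent with the revealed bits.

    Assumes an idealised population of 2**total_bits users in which every
    revealed bit is independent and halves the set of candidates.
    """
    remaining_bits = max(total_bits - bits_revealed, 0)
    return 2 ** remaining_bits

for revealed in (0, 10, 20, 25, 30, 31):
    print(f"{revealed:2d} bits revealed -> crowd of {crowd_size(revealed):,}")
```

Each further bit halves the crowd, so the last ten bits before the threshold take one from a crowd of about two thousand down to a single candidate.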

Restricting access to online databases may recover a handful of bits of anonymity, but one will not return to anything close to pre-internet levels of anonymity without extremely draconian information controls. Completely discarding a previous online identity and starting afresh can reset one’s level of anonymity to near-maximum levels, but one has to be careful never to link the new identity to the old one, or else the protection gained by switching will be lost, and the information revealed by the two online identities, when combined together, may cumulatively be enough to destroy the anonymity of both.

But one additional way to gain more anonymity is through deliberate disinformation. For instance, suppose that one reveals $100$ independent bits of information about oneself. Ordinarily, this would cost $100$ bits of anonymity (assuming that each bit was a priori equally likely to be true or false), by cutting the number of possibilities down by a factor of $2^{100}$; but if $5$ of these $100$ bits (chosen randomly and not revealed in advance) are deliberately falsified, then the number of possibilities increases again by a factor of $\binom{100}{5} \approx 2^{26}$, recovering about $26$ bits of anonymity. In practice one gains even more anonymity than this, because to dispel the disinformation one needs to solve a satisfiability problem, which can be notoriously intractable computationally, although this additional protection may dissipate with time as algorithms improve (e.g. by incorporating ideas from compressed sensing).

It is perhaps worth pointing out that disinformation is only a partial defence at best, and to protect anonymity it is better not to emit any information in the first place. For instance, in the above example, even with disinformation, one has still given away about $74$ bits of information, which already is more than enough (in principle, at least) to determine the identity.
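
The arithmetic in the disinformation example can be checked directly. The sketch below only counts possibilities, as the text does; it makes no attempt to model the computational difficulty of actually undoing the disinformation, and the variable names are chosen purely for illustration. It assumes Python 3.8 or later for `math.comb`.

```python
import math

revealed_bits = 100   # independent bits revealed, as in the example
falsified_bits = 5    # bits deliberately falsified, positions kept secret

# Number of ways to choose which of the revealed bits are the false ones.
ways_to_place_lies = math.comb(revealed_bits, falsified_bits)

# Anonymity recovered, in bits, and the net amount still given away.
recovered = math.log2(ways_to_place_lies)
net_loss = revealed_bits - recovered

print(f"C({revealed_bits}, {falsified_bits}) = {ways_to_place_lies:,}")
print(f"recovered: {recovered:.2f} bits")
print(f"net loss:  {net_loss:.2f} bits")
```

The net loss stays far above the roughly $31$ bits needed in principle, which is why the disinformation only partially helps.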