Slightly off-topic for text analytics but still pretty good from the perspective of social media analytics more generally, I saw Paul Krugman claiming that Twitter followers follow a power law. As a good reader of the esteemed Cosma Shalizi, I’m well aware that when someone who is not a statistician claims that something is a power law, they’re very likely to be wrong. Why? I can’t say it better than Prof. Shalizi and his coauthors, but in short: processes where many independent factors add together to generate a result are, by the central limit theorem, approximately normally distributed, and processes where many factors *multiply* together to generate a result are, by the central limit theorem applied to their logarithms, approximately log-normally distributed. For the practical data analyst, this fact means that “fat tails” should make you think “log-normal,” not “power law.” On top of this, looking at line-like things on log-log plots is unlikely to help you – lots of different distributions make lines on log-log plots, meaning a test that simple can’t help you distinguish one from another. You need to do it properly.
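To make the multiplicative argument concrete, here is a small simulation sketch in Python. Everything in it is an illustrative assumption rather than anything from the Twitter data: the factor distribution (uniform on 0.5 to 1.5), the number of factors, and the sample size are arbitrary. It multiplies many independent positive factors together and compares the skewness of the raw products with the skewness of their logarithms; if the products are roughly log-normal, the logs should look roughly symmetric while the raw values are heavily right-skewed.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def multiplicative_outcome(n_factors=50):
    """Product of many independent positive factors (hypothetical process)."""
    product = 1.0
    for _ in range(n_factors):
        product *= random.uniform(0.5, 1.5)
    return product

def skewness(xs):
    """Sample skewness: mean of cubed standardized deviations."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

samples = [multiplicative_outcome() for _ in range(20000)]
logs = [math.log(s) for s in samples]

raw_skew = skewness(samples)  # large and positive: a fat right tail
log_skew = skewness(logs)     # near zero: the logs look roughly normal

print("skewness of products:", raw_skew)
print("skewness of log-products:", log_skew)
```

The fat right tail shows up with no power law anywhere in the generating process, which is the point of the "think log-normal first" advice.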

But hey, Prof. Shalizi and coauthors are good citizens and put their code on the internet. There’s even a nice recent example of bad power law-dom and example code. So, being both incredibly pedantic and having wanted to try out the code for some time, I thought I’d give it a whirl. All mistakes in the following are of course mine and nobody else’s.

First, data. To be sure of comparability, I grabbed data on the top 1000 people from twitaholic, which was Prof. Krugman’s source, dumped it into Excel (thanks, copy-paste!), deleted the images, spat it out as .csv, and read it into R; if anybody asks, I’ll put up the code and data. This basically follows Prof. Shalizi’s serial killer example, so it’s easy to follow along for yourself.

Correspondingly, I can show the upper cumulative distribution function on a log-log plot. This is kind of a weighted version of Prof. Krugman’s ln(followers) versus ln(rank) plot, just with swapped axes:
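For anyone following along without R, the upper cumulative distribution function is simple to compute by hand. The sketch below is in Python and is not the code I actually ran; the function name `upper_cdf` and the tiny follower list are made up for illustration. For each distinct value x it returns the fraction of accounts with at least x followers, and the log-log plot is just those pairs with a logarithm taken of both coordinates.

```python
import math
from bisect import bisect_left

def upper_cdf(values):
    """Return (x, P(X >= x)) for each distinct x, in increasing order of x."""
    xs = sorted(values)
    n = len(xs)
    # bisect_left gives the count of observations strictly below x,
    # so n minus that count is the number of observations >= x.
    return [(x, (n - bisect_left(xs, x)) / n) for x in sorted(set(xs))]

# Hypothetical follower counts, NOT the twitaholic data.
followers = [500_000, 650_000, 650_000, 1_200_000, 4_000_000]

points = upper_cdf(followers)
loglog_points = [(math.log10(x), math.log10(p)) for x, p in points]

for x, p in points:
    print(x, p)
```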

I can then fit a power law to it by maximum likelihood; while people do try to fit power laws by linear regression on log-log plots, this basically violates all the hypotheses that go into a linear regression model. Fitting a distribution by maximum likelihood is, from the user’s perspective, as easy as typing a slightly different command, so there’s no reason not to do it. Then I plot the best-fit power law and log-normal:
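As a rough idea of what the maximum likelihood fit does for the power-law case, here is a Python sketch using the standard closed-form estimator for a continuous power law above a threshold: the estimated exponent is 1 + n / Σ ln(x_i / x_min). This is not the R code I used. The function names, the synthetic data, and the "true" exponent of 2.5 are assumptions chosen only to show that the estimator recovers a known value. The log-normal fit to a tail is more involved (the likelihood has to account for the truncation at the threshold), so it is left out here.

```python
import math
import random

def fit_power_law_tail(values, x_min):
    """MLE of the exponent alpha for a continuous power law on x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum

def sample_power_law(alpha, x_min, n, rng):
    """Draw n values from a continuous power law by inverse-transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

rng = random.Random(42)

# Synthetic data with a known exponent (illustrative, not Twitter data).
synthetic = sample_power_law(alpha=2.5, x_min=500_000, n=20_000, rng=rng)
alpha_hat = fit_power_law_tail(synthetic, x_min=500_000)

print("estimated exponent:", alpha_hat)
```

Note that no binning and no regression line are involved; the estimate comes straight from the sum of log ratios.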

Log-normal is a vastly better fit – comparing likelihood ratios, about two hundred quintillion times more likely. Vuong’s model choice statistic** is 5.19 in favor of the log-normal, indicating that the chance of a power law producing data that fits the log-normal this well is about one in ten million.
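The step from the test statistic to "one in ten million" is a one-sided normal tail probability, which is easy to check. The Python sketch below assumes the statistic is treated as a standard normal z-score, which is how I am reading it here. It also shows one common way to build the statistic from pointwise log-likelihood ratios (square root of n times their mean, divided by their standard deviation); the function names and the choice of the 1/n standard deviation are my assumptions, not necessarily what the published code does.

```python
import math

def vuong_statistic(log_lik_ratios):
    """Normalized log-likelihood ratio: sqrt(n) * mean / sd (1/n variance)."""
    n = len(log_lik_ratios)
    mean = sum(log_lik_ratios) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in log_lik_ratios) / n)
    return math.sqrt(n) * mean / sd

def one_sided_p(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p_top_1000 = one_sided_p(5.19)  # on the order of 1e-7
p_top_100 = one_sided_p(1.4)    # roughly 0.08

print("p for z = 5.19:", p_top_1000)
print("p for z = 1.4:", p_top_100)
```

The second value corresponds to a statistic of about 1.4, which is far weaker evidence than 5.19.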

But wait – Prof. Krugman just posted the top 100. Maybe if I use less data, the power law will look better:

It does! Much better, but still not good – in the top tail, the log-normal has a likelihood about ten times higher than the power law, and Vuong’s model choice statistic favors the log-normal by about 1.4, meaning a power law could produce a fit to a log-normal this good about 8% of the time.

Takeaways: Twitter followers follow a fat-tailed distribution at the top end, but not as fat-tailed as a power law. Doing this right is easy – I spent way more time remembering how to plot in R than actually computing. Prof. Krugman is sometimes wrong. Gauss is not mocked. And I should probably do more work at work.

*One small undocumented point – Prof. Shalizi’s code uses 3 throughout as the threshold, the minimum of his dataset. I use the minimum of my dataset, which is more like 500k than 3.

**Read their paper for more details.

Um, Krugman actually says that it _doesn’t_ follow a power law.

He says it doesn’t “exactly” follow a power law, but does try to imply that the fit to a power law is pretty good. It isn’t.

But of course this is just a pedantic excuse to pretend there’s a news hook for some statistics 😉

Power law in what way? Does that mean if it takes a person 10 months to obtain 1000 followers then it will take 1 month to increase that to 10000 and 3 more days to reach 100000?

A power law here is about the distribution of followers: how many followers does the person with the most followers have versus the second most and so on. There’s no time involved. The question is really about how many followers Lady Gaga has and how many more Lady Gagas there are (answer: not as many as a power law would predict).

But is there then some asymptote – when everyone will eventually follow everyone?

hoof, I was making a presentation after many years being away from the literature on this topic, which brought me to your blog. I am glad I got out of academia in time to keep my sanity …

http://en.wikipedia.org/wiki/Sayre%27s_law

Is it possible that the twitter follower distribution follows a power-law distribution when you include ALL accounts (not just the top 1000)?

It’s possible, but usually when people talk about things following power laws they’re talking about weight of the top tail, which is why we looked at the top 1000. Without knowing anything, it’s much more likely that the bottom looks normal or log-normal than power law-distributed.

In my experience, people are referring to the entire distribution. For example: http://www.ams.org/notices/201006/rtx100600726p.pdf

Hello – fun read 😛 I’m trying to understand Twitter’s underlying distribution – would you be able to share the code and data please? Thank you!