Slightly off-topic for text analytics but still pretty good from the perspective of social media analytics more generally: I saw Paul Krugman claiming that Twitter followers follow a power law. As a good reader of the esteemed Cosma Shalizi, I’m well aware that when someone who is not a statistician claims that something is a power law, they’re very likely to be wrong. Why? I can’t say it better than Prof. Shalizi and his coauthors, but in short: processes where many factors add together to generate a result are, by the central limit theorem, approximately normally distributed, and processes where many factors multiply together to generate a result are, by the same theorem applied to their logarithms, approximately log-normally distributed. For the practical data analyst, this means that “fat tails” should make you think “log-normal,” not “power law.” On top of this, looking for line-like shapes on log-log plots is unlikely to help you – lots of different distributions produce lines on log-log plots, so a test that simple can’t distinguish one from another. You need to do it properly.
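The add-versus-multiply point is easy to see in a quick simulation. This is a Python sketch of my own (not Prof. Shalizi’s code, which is in R); the factor distribution and sample sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 outcomes, each generated by 50 positive random factors
factors = rng.uniform(0.5, 1.5, size=(10_000, 50))

sums = factors.sum(axis=1)       # many factors added together
products = factors.prod(axis=1)  # many factors multiplied together

def skewness(x):
    x = np.asarray(x, dtype=float)
    return float(np.mean(((x - x.mean()) / x.std()) ** 3))

print(skewness(sums))              # near 0: roughly normal
print(skewness(products))          # large and positive: fat right tail
print(skewness(np.log(products)))  # near 0: log of a product is a sum of logs
```

The sums come out symmetric and bell-shaped, the products come out with a heavy right tail, and taking logs of the products makes the bell shape reappear – which is exactly the log-normal story.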
But hey, Prof. Shalizi and coauthors are good citizens and put their code on the internet. There’s even a nice recent example of bad power law-dom and example code. So, being both incredibly pedantic and having wanted to try out the code for some time, I thought I’d give it a whirl. All mistakes in the following are of course mine and nobody else’s.
First, data. To be sure of comparability, I grabbed data on the top 1000 people from twitaholic, which was Prof. Krugman’s source, dumped it into Excel (thanks, copy-paste!), deleted the images, spat it out as .csv, and read it into R; if anybody asks, I’ll put up the code and data. This basically follows Prof. Shalizi’s serial killer example, so it’s easy to follow along for yourself.
Correspondingly, I can show the upper cumulative distribution function – the fraction of accounts with at least a given number of followers – on a log-log plot. This is essentially a weighted version of Prof. Krugman’s ln(followers) versus ln(rank) plot, just with the axes swapped:
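The upper CDF is trivial to compute. A Python sketch (the actual analysis was in R, and the synthetic log-normal “follower counts” here are a stand-in for the twitaholic data):

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic stand-in for the real follower counts
followers = rng.lognormal(mean=13.0, sigma=1.5, size=1000)

x = np.sort(followers)
# upper CDF: fraction of accounts with at least x followers
ccdf = 1.0 - np.arange(len(x)) / len(x)

# the log-log plot is then just:
#   import matplotlib.pyplot as plt
#   plt.loglog(x, ccdf)
print(ccdf[0], ccdf[-1])  # 1.0 at the smallest count, 1/n at the largest
```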
I can then fit a power law to it by maximum likelihood; while people do try to fit power laws by linear regression on log-log plots, doing so violates essentially every assumption that goes into a linear regression model. From the user’s perspective, fitting a distribution by maximum likelihood is as easy as typing a slightly different command, so there’s no reason not to do it. Then plot the best-fit power law and log-normal:
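The maximum-likelihood fits really are one-liners. Here is a Python sketch of the standard Clauset–Shalizi–Newman estimator for a continuous power law, plus the log-normal fit; the function names are mine, and the real code also estimates the threshold, which I’m taking as given:

```python
import numpy as np

def fit_power_law(x, xmin):
    """MLE for a continuous power law above xmin
    (the Clauset-Shalizi-Newman estimator)."""
    x = np.asarray(x, dtype=float)
    x = x[x >= xmin]
    return 1.0 + len(x) / np.sum(np.log(x / xmin))

def fit_log_normal(x):
    """MLE for a log-normal: mean and sd of log(x)."""
    logs = np.log(np.asarray(x, dtype=float))
    return logs.mean(), logs.std()

# sanity check: recover alpha = 2.5 from synthetic power-law data
rng = np.random.default_rng(2)
u = 1.0 - rng.uniform(size=10_000)          # uniform draws in (0, 1]
xmin = 1.0
samples = xmin * u ** (-1.0 / (2.5 - 1.0))  # inverse-CDF sampling
print(fit_power_law(samples, xmin))         # close to 2.5
```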
Log-normal is a vastly better fit – comparing likelihoods, the log-normal comes out about two hundred quintillion times more likely. Vuong’s model choice statistic** is 5.19 in favor of the log-normal, meaning the chance of a power law producing data that fits a log-normal this well is about one in ten million.
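Vuong’s statistic is just a normalised sum of pointwise log-likelihood differences between the two fitted models. A simplified Python sketch (Vuong’s actual test, as used in Prof. Shalizi’s code, normalises both densities above the threshold; I use the plain log-normal density here for brevity):

```python
import numpy as np

def vuong(x, alpha, xmin, mu, sigma):
    """Vuong statistic for power law vs log-normal.
    Positive favours the power law, negative the log-normal."""
    x = np.asarray(x, dtype=float)
    x = x[x >= xmin]
    # pointwise log-densities under each fitted model
    ll_pl = np.log(alpha - 1.0) - np.log(xmin) - alpha * np.log(x / xmin)
    ll_ln = (-np.log(x) - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
             - (np.log(x) - mu) ** 2 / (2.0 * sigma ** 2))
    d = ll_pl - ll_ln
    # normalised log-likelihood ratio: sum(d) / (sqrt(n) * sd(d))
    return float(d.sum() / (np.sqrt(len(d)) * d.std()))
```

A two-sided normal tail probability on this statistic gives p-values like the ones quoted above.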
But wait – Prof. Krugman just posted the top 100. Maybe if I use less data, the power law will look better:
It does! Much better, but still not good – in the top tail, the log-normal has a likelihood about ten times higher than the power law, and Vuong’s model choice statistic favors the log-normal by about 1.4, meaning a power law could produce a fit to a log-normal this good about 8% of the time.
Takeaways: Twitter followers follow a fat-tailed distribution at the top end, but not as fat-tailed as a power law. Doing this right is easy – I spent way more time remembering how to plot in R than actually computing. Prof. Krugman is sometimes wrong. Gauss is not mocked. And I should probably do more work at work.
*One small undocumented point – Prof. Shalizi’s code uses 3 throughout as the threshold below which the power law doesn’t apply, that being the minimum of his dataset. I use the minimum of my dataset, which is more like 500k than 3.
**Read their paper for more details.