Does Weibo Really Follow Zipf’s Law?

A recent paper used the frequency of Weibo keywords to predict air pollution and argued that social media data can add finer detail to environmental monitoring. That idea is appealing. But if we set aside the environmental angle for a moment, the text-analysis side of Weibo is not especially hard to implement. A simple way to demonstrate that is to test whether a Weibo corpus follows Zipf’s law.

The dataset used here is an open Weibo corpus from NLPIR, containing about 230,000 posts. From that corpus, the goal is to extract words and their frequencies, then check a classic bibliometric pattern: Zipf’s law. In its usual form, if a word’s frequency is (f) and its rank is (r), then their product should be approximately constant:

[ r \times f = C ]

Zipf illustration

The figure above comes from Wikipedia.

Code used for the analysis

The following code can be reproduced once the dataset has been downloaded and the file path is set correctly.

# 读入xml包
library(XML)
# 读取数据并提取文本信息
doc <- xmlTreeParse('NLPIR微博内容语料库.xml',useInternal=TRUE)
rootNode <- xmlRoot(doc)
doc1 <- xpathSApply(rootNode,"//article",xmlValue)
# 去除无关标点与数字
doc2 <- gsub(pattern="http:[a-zA-Z\\/\\.0-9]+","",doc1)
# 中文分词
library(Rwordseg)
doc3 <- segmentCN(doc2)
# 构建语料库 去掉标点与数字与高频词
library(tm)
doc4 <- Corpus(VectorSource(doc3))
doc5 <- tm_map(doc4, removePunctuation)
doc6 <- tm_map(doc5, removeNumbers)
# 高频无意义词在这里可以搞到 https://github.com/yufree/democode/tree/master/data
x <- scan("stopwords.txt", what="")
doc7 <- tm_map(doc6, removeWords, x)
doc8 <- tm_map(doc7, stripWhitespace)
# 构建全范围的词频矩阵
control=list(minDocFreq=5,wordLengths = c(1, Inf),bounds = list(global = c(5,Inf)),weighting = weightTf,encoding = 'UTF-8')
doc.tdm=TermDocumentMatrix(doc8,control)
# 这里截取词频高于5长度为2的词
control2=list(minDocFreq=5,wordLengths = c(2, 2),bounds = list(global = c(5,Inf)),weighting = weightTf,encoding = 'UTF-8')
doc.tdm2=TermDocumentMatrix(doc8,control2)
# 这里截取词频高于5长度为3的词
control3=list(minDocFreq=5,wordLengths = c(3, 3),bounds = list(global = c(5,Inf)),weighting = weightTf,encoding = 'UTF-8')
doc.tdm3=TermDocumentMatrix(doc8,control3)
# 这里截取词频高于5长度为4的词
control4=list(minDocFreq=5,wordLengths = c(4, 4),bounds = list(global = c(5,Inf)),weighting = weightTf,encoding = 'UTF-8')
doc.tdm4=TermDocumentMatrix(doc8,control4)
# 这里截取词频高于5长度为5的词
control5=list(minDocFreq=5,wordLengths = c(5, 5),bounds = list(global = c(5,Inf)),weighting = weightTf,encoding = 'UTF-8')
doc.tdm5=TermDocumentMatrix(doc8,control5)
# 得到词频列表
library(slam)
freq <- rowapply_simple_triplet_matrix(doc.tdm,sum)
freq2 <- rowapply_simple_triplet_matrix(doc.tdm2,sum)
freq3 <- rowapply_simple_triplet_matrix(doc.tdm3,sum)
freq4 <- rowapply_simple_triplet_matrix(doc.tdm4,sum)
freq5 <- rowapply_simple_triplet_matrix(doc.tdm5,sum)
# save(freq,freq2,doc8,file ='constellation.RData')
# 验证齐普夫定律
order <- order(freq[order(freq,decreasing = T)],decreasing = T)
freq0 <- freq[order(freq,decreasing = T)]
order2 <- order(freq2[order(freq2,decreasing = T)],decreasing = T)
freq20 <- freq2[order(freq2,decreasing = T)]
order3 <- order(freq3[order(freq3,decreasing = T)],decreasing = T)
freq30 <- freq3[order(freq3,decreasing = T)]
order4 <- order(freq4[order(freq4,decreasing = T)],decreasing = T)
freq40 <- freq4[order(freq4,decreasing = T)]
order5 <- order(freq5[order(freq5,decreasing = T)],decreasing = T)
freq50 <- freq5[order(freq5,decreasing = T)]
# 结果可视化
# plot(log(order)~log(freq0))
png('logzipfplot.png')
par(mfrow=c(2,2))
plot(log(order2)~log(freq20),main="word length: 2")
plot(log(order3)~log(freq30),main="word length: 3")
plot(log(order4)~log(freq40),main="word length: 4")
plot(log(order5)~log(freq50),main="word length: 5")
dev.off()
png('czipfplot.png')
par(mfrow=c(2,2))
plot(order2*freq20,main="word length: 2")
plot(order3*freq30,main="word length: 3")
plot(order4*freq40,main="word length: 4")
plot(order5*freq50,main="word length: 5")
dev.off()

Log-rank vs. log-frequency plots

Rank times frequency plots

What showed up in the plots

The result was more surprising than expected.

Zipf’s law gets invoked very casually in many discussions, as if it were a dependable argument by itself. But it is only an empirical law to begin with, and in this Weibo analysis it does not seem to match actual usage patterns very well.

In theory, the first set of plots should look roughly linear. Here, that only becomes barely visible when words are grouped in sets of five by character length. Looking at the top 50 words in those groups, that apparent linearity seems to be driven more by English words than by Chinese ones. That raises a possibility: maybe the part of the corpus that resembles Zipf behavior is the portion shaped by heavy English usage on Weibo.

Here are some high-frequency examples from that group:

     zynga 发展中国家      happy      phone
       504        189        148        131
     webos      china 南京大屠杀      party
       130        125        119        108
     gmail      world      store      style
       106        104         98         98
个人所得税      ilook      never      there
        93         90         89         88
     gucci 高尔夫球场      touch      adobe
        81         79         59         57
     weico      weibo      hello      rovio
        57         56         53         53
     heart 印度尼西亚      icann      green
        50         49         49         47
     belle      kitty      leave 人民检察院
        46         45         45         44
原教旨主义      first      light 毛泽东思想
        44         44         44         43
中国科学院      black      david      kevin
        42         42         42         42
     brian      yahoo 中央气象台      nexon
        41         41         40         40
     nexus      apple      muddy      still
        40         39         39         39
     would      ralph
        39         38

But once the segmentation focuses on two-character and three-character words, Chinese vocabulary dominates. And that is exactly where the distribution departs most clearly from Zipf’s law. Instead of a straight-line relationship, it looks more like a parabolic curve.

Two-character words:

中国  腐败  城管  一个  北京  微博  问题  政府
38951 37655 28787 24242 15834 14984 12972 12651
 社会  今天  美国  国家  工作  公司  经济  已经
11418 11310 11264 11261 10326  9722  8402  8227
 现在  时间  表示  事件  香港  发现  世界  发生
 8222  7941  7913  7755  7710  7466  7258  7196
 进行  知道  人员  安全  生活  目前  新闻  调查
 7189  7184  7077  7032  6985  6916  6901  6378
 记者  今年  孩子  市场  官员  昨天  事故  企业
 6043  6037  6034  5944  5919  5809  5748  5706
 看到  图片  朋友  部门  视频  成为  全国  认为
 5614  5587  5572  5496  5440  5386  5370  5339
 大学  媒体
 5317  5219

Three-character words:

北京市 越来越 房地产 嫌疑人 公安局 老百姓 互联网
  3266   2122   2074   2033   1928   1913   1804
公务员 候选人 联合国 公安部 人民币 电视台 消费者
  1789   1747   1728   1715   1657   1620   1577
亿美元 负责人 委员会 国务院 铁道部 办公室 哈哈哈
  1515   1514   1513   1457   1417   1243   1243
进一步 中纪委 第一次 派出所 开发商    ceo    the
  1241   1213   1125   1099   1096   1090   1068
俄罗斯 幼儿园 临时工 董事长 发言人 意味着 机动车
  1054   1006    982    952    916    902    897
利比亚 大学生    you 万美元 全世界 新华社 朝阳区
   821    798    796    781    766    764    762
领导人 身份证 志愿者 一个月 总经理 自行车 发布会
   750    741    717    705    688    661    651
奢侈品
   642

Does removing stopwords explain the mismatch?

A natural objection is that high-frequency function words were removed, and that operation alone might break Zipf’s law.

That objection is fair as far as it goes. If the original corpus did follow Zipf’s law, then removing frequent stopwords would indeed disrupt the constant (C). Once the ranking is shifted upward, the values at the front should become smaller. But that effect should not force the later part of the curve downward in the same way. If the underlying pattern still held, the tail should gradually approach the constant instead of collapsing away from it.

That is where the second set of plots matters. If the Weibo corpus followed Zipf’s law, the graph of (r \times f) should be roughly flat. Instead, it forms a hump—steep on the front side and more gradual on the back.

So the next question becomes obvious: what words are sitting near that peak?

Two-character words near the peak:

人心 阻止 一套 原文 随便 展开 男性 下次 无能 征集
 421  442  422  443  420  643  421  442  422  441
中方 轻松 手中 名人 走红 审判 借贷 滋生 水果 早安
 423  443  420  421  425  442  422  426  643  423
团购 森林   lt 债务 买房 观众 咖啡 环卫 遭到 警告
 642  441  641  640  420  639  443  440  638  421
嘉宾 城区 火锅 争取 选项 转型 原本 姐姐 震惊 角色
 425  442  422  426  423  448  444  441  424  420
军事 欧元 好多 好处 富人 商务 会长 有所 沉默 真心
 443  643  440  421  425  641  422  640  442  445

Three-character words near the peak:

执行官 侦查员 著作权 原材料 总公司    see 男朋友
    91     91     91     92     84     63     91
徐家汇 一口气 知情人 婴幼儿 交易日 信息化 小金库
    90     92     84     63     91     90     92
针对性    lte 写字楼 中文版    lee 银行家 直辖市
    84     81     63     89     80     88     83
专卖店 孙悟空 多一些 太阳能    tot 中南海 微生物
    62     63     91     90     61     81     92
抑郁症 手续费    fbi 自己人 小学校 新街口 一周年
    84     89     80     64     62     83     88
尼古丁    gif 从业者 国家队 一两个 演播室 芝加哥
    63     61     91     90     81     84     71
星期六 玉泉营 想象力 生产力 小博士 十多年 科技界
    80     64     87     92     62     89     63
债权人
    60

Staring at those lists does not immediately reveal a clean explanation. Still, one intuition seems worth keeping.

The product of frequency and rank may work as a rough measure of a term’s influence inside the corpus. In text collections that do follow Zipf’s law, the most influential words would simply be the highest-frequency ones. But those are often the least useful terms analytically, which is why routine NLP pipelines usually remove them.

Once a corpus does not follow Zipf’s law—and once high-frequency stopwords are already filtered out—the question becomes different: how do you find representative terms? The terms you want cannot be too common, because then they become generic noise, but they also cannot be too rare, because then no stable pattern emerges.

By that logic, words with large (r \times f) values may be especially interesting. They are not huge in scale, yet they may differ sharply across groups, and that difference could easily get diluted when Weibo is treated as one broad pool of everyday discussion.

A tentative takeaway

Several points stand out from this experiment:

Even after removing high-frequency stopwords, the Weibo environment—especially Chinese-language usage—does not appear to fit Zipf’s law very well.
In Chinese corpora, the hump-shaped pattern in frequency times rank may offer a way to identify comparatively influential keywords within groups.
The mixture of Chinese and English on Weibo is not a minor detail; it materially affects the distribution.
There is plenty of room to dig further.

Code used for the analysis

What showed up in the plots

Does removing stopwords explain the mismatch?

A tentative takeaway

Related Posts

Maybe Things Will Feel Better Once School Starts

How I Made My Blog Feel Almost Instant

Love as the Adventure of Seeing It Through

Stop Hunting for Avatars—Make One That Actually Feels Like You with Midjourney

When Speaking Online Feels Easier Than Speaking to People