Rank-frequency distribution of natural languages: a difference of probabilities approach

The time variation of the rank k of words in six Indo-European languages is obtained from Google Books data. At low ranks the languages behave differently, possibly because of syntactic rules, whereas for k > 50 the law of large numbers predominates. The dynamics of k is described stochastically through a master equation governing the time evolution of its probability density; this equation is approximated by a Fokker-Planck equation, which is solved analytically. The difference between the data and the asymptotic solution is identified with the transient solution, and good agreement is obtained.
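
As a rough illustration of the quantity studied here, the rank k of a word in a given year can be computed by sorting words by their yearly counts. The sketch below is not the authors' pipeline: the function name `rank_by_year`, the alphabetical tie-breaking rule, and the toy counts are all assumptions made for this example, and the numbers are invented rather than taken from Google Books.

```python
from typing import Dict, List


def rank_by_year(counts: Dict[int, Dict[str, int]]) -> Dict[str, List[int]]:
    """Return, for each word, its rank k in each year (1 = most frequent)."""
    years = sorted(counts)
    # Collect every word seen in any year so each trajectory has equal length.
    words = sorted({w for year in years for w in counts[year]})
    trajectories: Dict[str, List[int]] = {w: [] for w in words}
    for year in years:
        # Sort by descending count; ties are broken alphabetically
        # (an assumption of this sketch) so the result is deterministic.
        ordered = sorted(words, key=lambda w: (-counts[year].get(w, 0), w))
        for k, w in enumerate(ordered, start=1):
            trajectories[w].append(k)
    return trajectories


# Toy counts, invented for illustration only (not Google Books data).
toy_counts = {
    1900: {"the": 100, "of": 60, "and": 50},
    1950: {"the": 120, "and": 70, "of": 65},
    2000: {"the": 130, "and": 80, "of": 60},
}

trajectories = rank_by_year(toy_counts)
for word, ranks in trajectories.items():
    print(word, ranks)
```

Each list is the rank trajectory k(t) of one word over the sorted years; in this toy input "the" stays at rank 1 while "and" and "of" swap places between the first and second year.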

Germinal Cocho, R. F. Rodríguez, Sergio Sánchez, Jorge Flores, Carlos Pineda, Carlos Gershenson

Source: arxiv.org