Tunnicliffe, Martin and Hunter, Gordon (2025) The classical model of type-token systems compared with items from the Standardized Project Gutenberg Corpus. Analytics, 4(2), p. 16. ISSN (online) 2813-2203
Abstract
We compare the “classical” equations of type-token systems, namely Zipf’s laws, Heaps’ law and the relationships between their indices, with data selected from the Standardized Project Gutenberg Corpus (SPGC). Selected items all exceed 100,000 word-tokens and are trimmed to 100,000 word-tokens each. With the most egregious anomalies removed, a dataset of 8432 items is examined in terms of the relationships between the Zipf and Heaps’ indices computed using the Maximum Likelihood algorithm. Zipf’s second (size) law indices suggest that the types vs. frequency distribution is log–log convex, with the high and low frequency indices showing weak but significant negative correlation. Under certain circumstances, the classical equations work tolerably well, though the level of agreement depends heavily on the type of literature and the language (Finnish being notably anomalous). The frequency vs. rank characteristics exhibit log–log linearity in the “middle range” (ranks 100–1000), as characterised by the Kolmogorov–Smirnov significance. For most items, the Heaps’ index correlates strongly with the low frequency Zipf index in a manner consistent with classical theory, while the high frequency indices are largely uncorrelated. This is consistent with a simple simulation.
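As a rough illustration of the quantities discussed above, the following sketch estimates a Zipf rank-frequency index and a Heaps index from a list of word-tokens. It is only a sketch under stated assumptions: it uses ordinary least-squares fits on log-log axes rather than the Maximum Likelihood algorithm used in the paper, and every function name (`loglog_slope`, `rank_frequency`, `heaps_curve`, `zipf_index`, `heaps_index`) is hypothetical rather than taken from the authors' code. The default rank window of 100 to 1000 mirrors the "middle range" mentioned in the abstract.

```python
import math
from collections import Counter


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x).

    Assumption: a simple stand-in for the paper's maximum-likelihood fit.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx


def rank_frequency(tokens):
    """Type frequencies in descending order, so index 0 holds rank 1."""
    return sorted(Counter(tokens).values(), reverse=True)


def heaps_curve(tokens):
    """Number of distinct types seen after each successive token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def zipf_index(tokens, r_min=100, r_max=1000):
    """Estimate alpha in f(r) ~ r**(-alpha) over ranks r_min..r_max."""
    freqs = rank_frequency(tokens)
    r_max = min(r_max, len(freqs))  # clamp to the available vocabulary
    ranks = list(range(r_min, r_max + 1))
    return -loglog_slope(ranks, [freqs[r - 1] for r in ranks])


def heaps_index(tokens):
    """Estimate beta in V(N) ~ N**beta from the type-token growth curve."""
    curve = heaps_curve(tokens)
    return loglog_slope(range(1, len(curve) + 1), curve)
```

For a text trimmed to 100,000 word-tokens, as in the study, one might call `zipf_index(tokens)` and `heaps_index(tokens)` on the token list and compare the two values. Least-squares slopes on log-log axes are known to be biased for power-law data, so the resulting numbers should not be expected to match indices obtained by maximum likelihood.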