Modelling the growth of vocabulary in textual documents

Tunnicliffe, Martin and Hunter, Gordon (2023) Modelling the growth of vocabulary in textual documents. In: UK Speech 2023; 14-15 Jun 2023, Sheffield, U.K. (Unpublished)


Over the past few decades, many attempts have been made to unify the type-token statistics of textual documents, specifically the two Zipfian distributions (frequency vs. rank and size vs. frequency) and the Heaps characteristic (vocabulary vs. document length), within a single mathematical model. Here we review a number of approaches, ranging from the early work of Mandelbrot (1953) and Efron & Thisted (1976) to more recent studies by Gerlach & Altmann (2013), Corral & Font-Clos (2017) and the authors’ own work (Tunnicliffe & Hunter, 2022). We show that while static probabilistic models can be tuned tantalizingly close to the experimental data, there are complexities which they cannot reproduce. Of particular interest is the heterogeneity of textual corpora: while some texts are better represented by closed vocabularies, others are better accommodated by infinite or continuously evolving word-pools. A related issue is that the Zipf indices, considered constant in most conventional models, behave anomalously in the lower frequency ranges, where values measured using different techniques often differ significantly. Finally, wide statistical variations make analytical relationships between the different parameters hard to test, even when large volumes of data are analysed. We present a framework for approaching these issues using data from the Standardised Project Gutenberg Corpus (SPGC), together with some preliminary results.

References

Mandelbrot, B (1953), “An Informational Theory of the Statistical Structure of Language”, in Willis Jackson (Ed.), “Applications of Communication Theory”, pp.486-802, Butterworths, London.

Efron, B, Thisted, R (1976), “Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?”, Biometrika, 63(3), pp.435-47.

Gerlach, M, Altmann, E.G. (2013), “Stochastic Model for the Vocabulary Growth in Natural Languages”, Phys. Rev. X, 3, 021006.

Corral, A, Font-Clos, F (2017), “Dependence of Exponents on Text Length versus Finite-Size Scaling for Word-Frequency Distributions”, Phys. Rev. E, 96 (2-1), 022318.

Tunnicliffe, M, Hunter, G (2022), “Random Sampling of the Zipf-Mandelbrot Distribution as a Representation of Vocabulary Growth”, Physica A, 608, 128259.
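
As an illustration of the three type-token statistics named in the abstract, the following minimal Python sketch computes them from a list of word tokens. It is not the authors' code: the function name type_token_statistics, the whitespace tokenisation and the toy sentence are all assumptions made for this example, and a real study on the SPGC would need proper tokenisation and far larger texts.

```python
from collections import Counter


def type_token_statistics(tokens):
    """Return (rank_frequency, size_frequency, vocabulary_growth) for a token list.

    Hypothetical helper written for illustration only.
    """
    counts = Counter(tokens)

    # Zipf, frequency vs. rank: type frequencies in descending order,
    # so that position i holds the frequency of the type of rank i + 1.
    rank_frequency = sorted(counts.values(), reverse=True)

    # Zipf, size vs. frequency: number of distinct types that occur
    # exactly f times, keyed by f (the frequency spectrum).
    size_frequency = dict(sorted(Counter(counts.values()).items()))

    # Heaps characteristic: vocabulary size after each successive token.
    seen = set()
    vocabulary_growth = []
    for token in tokens:
        seen.add(token)
        vocabulary_growth.append(len(seen))

    return rank_frequency, size_frequency, vocabulary_growth


# Toy input (an assumption for this example, not data from the study).
tokens = "the cat sat on the mat and the dog sat on the log".split()
rank_frequency, size_frequency, vocabulary_growth = type_token_statistics(tokens)
```

The three return values correspond to the frequency vs. rank distribution, the size vs. frequency distribution and the vocabulary vs. document length curve respectively; fitting power laws to them is a separate step not shown here.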
