The predictive capabilities of mathematical models for the type-token relationship in English language corpora

Tunnicliffe, Martin and Hunter, Gordon (2021) The predictive capabilities of mathematical models for the type-token relationship in English language corpora. Computer Speech & Language, 70, p. 101227. ISSN (print) 0885-2308

Abstract

We investigate the predictive capability of mathematical models of the type-token relationship applied to the vocabulary growth profiles of selected English language documents. We compare the existing Good-Toulmin and Heaps formulae with an alternative approach based on Bernoulli trial word selection from a fixed finite vocabulary using the Zipf and Zipf-Mandelbrot probability distributions. We make two major observations. Firstly, while the Zipf-Mandelbrot model makes better predictions of vocabulary growth than the Zipf model, the optimized parameters of the latter correlate better than those of the former with statistics gleaned independently from the data. Secondly, the mean of the Zipf-Mandelbrot, Good-Toulmin and Heaps models provides a more consistent and unbiased prediction of vocabulary than any individual model alone.
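
As a rough illustration of the kind of models compared in the abstract, the sketch below computes an expected type count for independent word draws from a fixed vocabulary with Zipf-Mandelbrot rank probabilities, alongside the Heaps power law. It is not the authors' implementation: the function names, parameter names and numerical values are assumptions chosen for illustration, and the Good-Toulmin estimator is omitted. The expected number of distinct types after N tokens is taken to be the sum over words of 1 - (1 - p_r)^N, which follows if each token is drawn independently with probability p_r for the word of rank r.

```python
def zipf_mandelbrot_probs(vocab_size, alpha, beta=0.0):
    """Rank probabilities p_r proportional to 1 / (r + beta)**alpha.

    With beta = 0 this reduces to the plain Zipf distribution.
    All parameter names here are illustrative assumptions.
    """
    weights = [1.0 / (rank + beta) ** alpha for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_types(n_tokens, probs):
    """Expected number of distinct words after n_tokens independent draws.

    A word with probability p is absent from all draws with probability
    (1 - p)**n_tokens, so it contributes 1 - (1 - p)**n_tokens.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def heaps_types(n_tokens, k, exponent):
    """Heaps power law: predicted types = k * n_tokens**exponent."""
    return k * n_tokens ** exponent


if __name__ == "__main__":
    # Hypothetical parameter values, not fitted to any corpus.
    probs = zipf_mandelbrot_probs(vocab_size=5000, alpha=1.1, beta=2.7)
    for n in (100, 1000, 10000):
        print(n, round(expected_types(n, probs), 1), round(heaps_types(n, 5.0, 0.6), 1))
```

In this sketch the finite-vocabulary curve saturates at `vocab_size` as the token count grows, whereas the Heaps curve grows without bound, which is one reason the two families can extrapolate differently.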
