Inadequacy of the chi-squared test to examine vocabulary differences between corpora
Literary and Linguistic Computing (2014)
  • Yves Bestgen, Université catholique de Louvain
Pearson’s chi-squared test is probably the most popular statistical test used in corpus linguistics, particularly for studying linguistic variations between corpora. Oakes and Farrow (2007) proposed various adaptations of this test to allow for the simultaneous comparison of more than two corpora while also yielding an almost correct Type I error rate (i.e. claiming that a word is most frequently found in a variety of English, when in actuality this is not the case). By means of resampling procedures, the present study shows that when used in this context, the chi-squared test produces far too many significant results, even in its modified version. Several potential approaches to circumventing this problem are discussed in the conclusion.
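To make the contrast in the abstract concrete, the following is a minimal sketch of the two kinds of test it mentions: a Pearson chi-squared statistic computed on one word's frequency across several corpora, and a resampling (permutation) check that reshuffles whole texts between corpora. It is not the procedure used in the study, whose details are not given here. The function names, the `(word_count, n_tokens)` text representation, and the choice to permute texts rather than individual tokens are all assumptions made for illustration.

```python
import random


def chi_squared(word_counts, corpus_sizes):
    """Pearson chi-squared statistic for a 2 x k table:
    one word versus all other tokens, across k corpora."""
    total_word = sum(word_counts)
    total_tokens = sum(corpus_sizes)
    stat = 0.0
    for observed, size in zip(word_counts, corpus_sizes):
        expected_word = size * total_word / total_tokens
        expected_other = size - expected_word
        if expected_word > 0:
            stat += (observed - expected_word) ** 2 / expected_word
        if expected_other > 0:
            stat += ((size - observed) - expected_other) ** 2 / expected_other
    return stat


def permutation_p_value(texts, labels, n_resamples=999, seed=0):
    """Estimate a p-value by reshuffling corpus labels over whole texts.

    texts  -- list of (word_count, n_tokens) pairs, one per text (hypothetical format)
    labels -- corpus label for each text, in the same order
    """
    rng = random.Random(seed)
    corpora = sorted(set(labels))

    def statistic(current_labels):
        counts = [sum(w for (w, _), lab in zip(texts, current_labels) if lab == c)
                  for c in corpora]
        sizes = [sum(n for (_, n), lab in zip(texts, current_labels) if lab == c)
                 for c in corpora]
        return chi_squared(counts, sizes)

    observed = statistic(labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)  # reassign texts to corpora at random
        if statistic(shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)
```

Under these assumptions, comparing the permutation p-value with the one obtained by referring `chi_squared` to the chi-squared distribution gives a rough sense of how much the classical test may overstate significance when a word's occurrences cluster within a few texts.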
  • corpus linguistics
  • resampling test
  • lexical variation
  • British English versus American English
Publication Date
Citation Information
Yves Bestgen. "Inadequacy of the chi-squared test to examine vocabulary differences between corpora" Literary and Linguistic Computing Vol. 29 (2014)
Available at: