Significant or random?: A critical review of sociolinguistic generalisations based on large corpora

Bibliographic Details
Main Authors: Brezina, V, Meyerhoff, M
Format: Journal article
Language: English
Published: John Benjamins Publishing Company 2014
Description
Summary: This article offers a critical review of a methodology often employed in corpus-based sociolinguistic studies that make use of aggregate data. This methodology relies on a general comparison of the frequencies of a target linguistic variable in socially defined sub-corpora. The main issue with this procedure is that it emphasises inter-group differences and ignores within-group variation. The methodology thus often yields false positive results (with highly significant log-likelihood scores). This article presents evidence which shows that sociolinguistic studies based on aggregate data are in principle unreliable. Using BNC 32, a one-million-word corpus of informal speech, it demonstrates that random (and therefore sociolinguistically irrelevant) speaker groupings can often yield statistically significant results. The article offers suggestions for an alternative methodology (using the Mann-Whitney U test), which takes into account within-group differences and therefore produces more meaningful results. © John Benjamins Publishing Company.
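
The contrast described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the speaker counts are invented toy data, the log-likelihood formula is the two-corpus form commonly used in corpus linguistics, and the Mann-Whitney p-value uses a normal approximation without tie correction, which is only rough for samples this small. For real analyses a library routine such as `scipy.stats.mannwhitneyu` would be the more appropriate choice.

```python
import math


def log_likelihood(hits_a, size_a, hits_b, size_b):
    """Two-corpus log-likelihood (G2) computed on aggregated counts."""
    total_hits = hits_a + hits_b
    expected_a = size_a * total_hits / (size_a + size_b)
    expected_b = size_b * total_hits / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((hits_a, expected_a), (hits_b, expected_b)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def mann_whitney_u(xs, ys):
    """Mann-Whitney U on per-speaker values, with average ranks for ties.

    The two-sided p-value uses a normal approximation without tie
    correction (a simplifying assumption made for this sketch).
    """
    pooled = sorted((value, idx) for idx, value in enumerate(xs + ys))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(xs), len(ys)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = math.erfc(abs(u - mu) / sigma / math.sqrt(2))
    return u, p


# Hypothetical toy data: (hits, tokens) per speaker. Group A contains one
# speaker who uses the target variable far more often than anyone else.
group_a = [(5, 10000), (6, 10000), (4, 10000), (5, 10000), (120, 10000)]
group_b = [(5, 10000), (6, 10000), (4, 10000), (7, 10000), (5, 10000)]

# Aggregate approach: pool all speakers in each group, then compare.
g2 = log_likelihood(
    sum(h for h, _ in group_a), sum(t for _, t in group_a),
    sum(h for h, _ in group_b), sum(t for _, t in group_b),
)

# Per-speaker approach: compare relative frequencies (per 10,000 tokens).
rel_a = [h * 10000 / t for h, t in group_a]
rel_b = [h * 10000 / t for h, t in group_b]
u_stat, p_value = mann_whitney_u(rel_a, rel_b)

print(f"log-likelihood G2 = {g2:.2f}")
print(f"Mann-Whitney U = {u_stat}, p = {p_value:.3f}")
```

In this toy case the aggregate comparison is driven almost entirely by a single speaker and gives a G2 far above the 3.84 threshold for p < 0.05, whereas the rank-based test on per-speaker frequencies finds no reliable difference between the groups.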