Significant or random?: A critical review of sociolinguistic generalisations based on large corpora
Main Authors: | , |
---|---|
Format: | Journal article |
Language: | English |
Published: | John Benjamins Publishing Company, 2014 |
Summary: | This article offers a critical review of a methodology often employed in corpus-based sociolinguistic studies that make use of aggregate data. This methodology relies on a general comparison of the frequencies of a target linguistic variable in socially defined sub-corpora. The main issue with this procedure is that it emphasises inter-group differences and ignores within-group variation. The methodology thus often yields false positive results (with highly significant log-likelihood scores). This article presents evidence that sociolinguistic studies based on aggregate data are in principle unreliable. Using BNC 32, a one-million-word corpus of informal speech, it demonstrates that random (and therefore sociolinguistically irrelevant) speaker groupings can often yield statistically significant results. The article offers suggestions for an alternative methodology (using the Mann-Whitney U test), which takes within-group differences into account and therefore produces more meaningful results. © John Benjamins Publishing Company. |
---|
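The contrast drawn in the summary can be illustrated with a minimal sketch. It is not the article's own code or data: the per-speaker counts are invented for illustration, and the function names `log_likelihood` and `mann_whitney_u` are hypothetical. The log-likelihood score uses the common two-corpus formulation on pooled group totals, while the Mann-Whitney U test ranks individual speakers. The p-value comes from a normal approximation without tie correction, which is rough for groups this small.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood (G2) on aggregate frequencies."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

def mann_whitney_u(sample_a, sample_b):
    """Return (U, two-sided p) using average ranks for ties.

    Assumption: p is a normal approximation with no tie correction.
    """
    pooled = sorted([(v, "a") for v in sample_a] + [(v, "b") for v in sample_b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_a += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == "a")
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)
    mean = n_a * n_b / 2
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    p = math.erfc(abs(u - mean) / sd / math.sqrt(2))
    return u, p

# Invented data: hits per speaker, each speaker contributing 10,000 tokens.
# One speaker in group A uses the target item far more than anyone else.
group_a = [2, 3, 1, 2, 60]
group_b = [3, 2, 4, 2, 3]
tokens_per_speaker = 10_000

ll = log_likelihood(sum(group_a), tokens_per_speaker * len(group_a),
                    sum(group_b), tokens_per_speaker * len(group_b))
u, p = mann_whitney_u(group_a, group_b)
print(f"aggregate log-likelihood: {ll:.2f}")
print(f"Mann-Whitney U: {u}, approximate p: {p:.2f}")
```

In this invented case the aggregate score is well above the usual 3.84 cut-off for p < 0.05, although a single speaker drives the difference. The rank-based test on per-speaker frequencies does not find a significant group difference.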