Automatic coding of occupation and cause-of-death records

The Digitising Scotland project aims to digitise 24 million Scottish vital event records of births, marriages and deaths from 1856 to 1973. To use these records effectively for large-scale research they must not only be made machine-readable, but also coded in a form suitable for statistical analysi...

Bibliographic Details
Main Authors: Richard Tobin, Elaine Farrow, Claire Grover, Beatrice Alex
Format: Article
Language: English
Published: Swansea University 2019-11-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/1202
collection DOAJ
description The Digitising Scotland project aims to digitise 24 million Scottish vital event records of births, marriages and deaths from 1856 to 1973. To use these records effectively for large-scale research they must not only be made machine-readable, but also coded in a form suitable for statistical analysis. The digitised birth, marriage, and death certificates include textual descriptions of occupations and causes of death. Our aim is to map these descriptions to standard HISCO and ICD-10 codes. It is impractical to have experts code all the records manually, so we treat the problem as a text classification task and apply machine learning techniques. A proportion of the records will be manually coded and used to train the system. More recent records are already coded and these can also be used for training. Following earlier work by [Kirby et al] and [Carson et al] we are experimenting with Bayesian classifiers for this task. By combining exact matching for texts that have been seen in the training data and Bayes for the rest, we get an accuracy in cross-validation of 92% for causes of death and 94-97% for occupations. We are investigating methods to improve this, including automatic spelling correction and synonym detection, use of age and sex information, and (for causes of death) the presence of co-occurring causes. We are also investigating the value of coarser-grained but more reliable coding, and reporting second- and third-choice codes. This is work in progress, and the final paper will consider whether the improvements we are making are sufficient to produce useful data for further research. We will also make recommendations about further manual annotation to provide training data covering the whole timespan of the records.
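The hybrid approach described in the abstract (exact matching for texts already seen in training, a Bayesian classifier for the rest, and optional second- and third-choice codes) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the class name `HybridCoder`, the whitespace tokenizer, the add-one smoothing, and the placeholder labels `FARM`, `MINE` and `WEAVE` (standing in for real HISCO or ICD-10 codes) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return text.lower().split()


class HybridCoder:
    """Exact-match lookup with a multinomial naive Bayes fallback (hypothetical sketch)."""

    def __init__(self):
        self.exact = {}                          # normalised text -> Counter of codes
        self.class_counts = Counter()            # code -> number of training texts
        self.word_counts = defaultdict(Counter)  # code -> token counts
        self.vocab = set()

    def fit(self, texts, codes):
        for text, code in zip(texts, codes):
            tokens = tokenize(text)
            key = " ".join(tokens)
            self.exact.setdefault(key, Counter())[code] += 1
            self.class_counts[code] += 1
            self.word_counts[code].update(tokens)
            self.vocab.update(tokens)
        return self

    def _bayes_ranking(self, tokens):
        """Rank all codes by log prior plus add-one-smoothed log likelihoods."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for code, n in self.class_counts.items():
            denom = sum(self.word_counts[code].values()) + vocab_size
            score = math.log(n / total)
            for tok in tokens:
                if tok in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((self.word_counts[code][tok] + 1) / denom)
            scores[code] = score
        return sorted(scores, key=scores.get, reverse=True)

    def predict(self, text, k=1):
        """Return up to k candidate codes, best first."""
        tokens = tokenize(text)
        key = " ".join(tokens)
        if key in self.exact:
            # Seen verbatim in training: use the most frequent code for that text,
            # then fill any remaining slots from the Bayes ranking.
            best = self.exact[key].most_common(1)[0][0]
            rest = [c for c in self._bayes_ranking(tokens) if c != best]
            return ([best] + rest)[:k]
        return self._bayes_ranking(tokens)[:k]


# Toy training data with placeholder labels, not real HISCO codes.
train_texts = ["farm servant", "farm labourer", "coal miner",
               "coal miner", "cotton weaver", "wool weaver"]
train_codes = ["FARM", "FARM", "MINE", "MINE", "WEAVE", "WEAVE"]
coder = HybridCoder().fit(train_texts, train_codes)

print(coder.predict("coal miner"))             # exact-match path
print(coder.predict("retired coal miner"))     # Bayes fallback path
print(coder.predict("hand loom weaver", k=3))  # first, second and third choices
```

Passing `k=2` or `k=3` to `predict` is one simple way to surface the second- and third-choice codes mentioned in the abstract; the extra signals the authors are investigating, such as spelling correction or age and sex information, are outside the scope of this sketch.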
first_indexed 2024-03-09T07:20:42Z
id doaj.art-3811cf2247e84aa595ff190c0513b413
institution Directory Open Access Journal
issn 2399-4908
last_indexed 2024-03-09T07:20:42Z
spelling doaj.art-3811cf2247e84aa595ff190c0513b413; 2023-12-03T07:23:13Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2019-11-01; 4; 3; 10.23889/ijpds.v4i3.1202; Automatic coding of occupation and cause-of-death records; Richard Tobin, Elaine Farrow, Claire Grover, Beatrice Alex (all The University of Edinburgh); https://ijpds.org/article/view/1202