Automatic coding of occupation and cause-of-death records
The Digitising Scotland project aims to digitise 24 million Scottish vital event records of births, marriages and deaths from 1856 to 1973. To use these records effectively for large-scale research they must not only be made machine-readable, but also coded in a form suitable for statistical analysi...
Main Authors: | Richard Tobin, Elaine Farrow, Claire Grover, Beatrice Alex |
---|---|
Format: | Article |
Language: | English |
Published: | Swansea University, 2019-11-01 |
Series: | International Journal of Population Data Science |
Online Access: | https://ijpds.org/article/view/1202 |
_version_ | 1827608861348986880 |
---|---|
author | Richard Tobin, Elaine Farrow, Claire Grover, Beatrice Alex |
collection | DOAJ |
description | The Digitising Scotland project aims to digitise 24 million Scottish vital event records of births, marriages and deaths from 1856 to 1973. To use these records effectively for large-scale research they must not only be made machine-readable, but also coded in a form suitable for statistical analysis.
The digitised birth, marriage, and death certificates include textual descriptions of occupations and causes of death. Our aim is to map these descriptions to standard HISCO and ICD-10 codes.
It is impractical to have experts code all the records manually, so we treat the problem as a text classification task and apply machine learning techniques. A proportion of the records will be manually coded and used to train the system. More recent records are already coded, and these can also be used for training. Following earlier work by [Kirby et al] and [Carson et al], we are experimenting with Bayesian classifiers for this task.
By combining exact matching for texts that have been seen in the training data with a Bayesian classifier for the rest, we obtain a cross-validation accuracy of 92% for causes of death and 94-97% for occupations.
We are investigating methods to improve this, including automatic spelling correction and synonym detection, use of age and sex information, and (for causes of death) the presence of co-occurring causes.
We are also investigating the value of coarser-grained but more reliable coding, and reporting second- and third-choice codes.
This is work in progress, and the final paper will consider whether the improvements we are making are sufficient to produce useful data for further research. We will also make recommendations about further manual annotation to provide training data covering the whole timespan of the records. |
first_indexed | 2024-03-09T07:20:42Z |
format | Article |
id | doaj.art-3811cf2247e84aa595ff190c0513b413 |
institution | Directory Open Access Journal |
issn | 2399-4908 |
language | English |
last_indexed | 2024-03-09T07:20:42Z |
publishDate | 2019-11-01 |
publisher | Swansea University |
record_format | Article |
series | International Journal of Population Data Science |
spelling | doaj.art-3811cf2247e84aa595ff190c0513b413; 2023-12-03T07:23:13Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2019-11-01; vol. 4, no. 3; doi:10.23889/ijpds.v4i3.1202; Automatic coding of occupation and cause-of-death records; Richard Tobin, Elaine Farrow, Claire Grover, Beatrice Alex (all The University of Edinburgh); https://ijpds.org/article/view/1202 |
title | Automatic coding of occupation and cause-of-death records |
url | https://ijpds.org/article/view/1202 |
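
The abstract describes combining exact matching for previously seen texts with a Bayesian classifier for the rest, and reporting second- and third-choice codes. The sketch below illustrates one way such a hybrid could be put together; it is not the authors' implementation. The class name `HybridCoder`, the whitespace tokenisation, the Laplace smoothing constant, and the labels `RESP`, `CARD` and `TB` are all assumptions made for illustration (the labels are placeholders, not real ICD-10 or HISCO codes), and the training strings are toy examples rather than project data.

```python
import math
from collections import Counter, defaultdict


class HybridCoder:
    """Exact-match lookup with a multinomial naive Bayes fallback (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing (assumed value)
        self.exact = {}                           # normalised text -> Counter of labels
        self.label_counts = Counter()             # label -> number of training texts
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.vocab = set()

    @staticmethod
    def normalise(text):
        # Lower-case and collapse whitespace; real records would need more care.
        return " ".join(text.lower().split())

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            key = self.normalise(text)
            self.exact.setdefault(key, Counter())[label] += 1
            self.label_counts[label] += 1
            for tok in key.split():
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def _log_scores(self, tokens):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + self.alpha * vocab_size
            score = math.log(n / total)           # log prior
            for tok in tokens:
                if tok in self.vocab:             # tokens never seen in training are skipped
                    count = self.token_counts[label][tok]
                    score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text, k=3):
        """Return (method, labels): one label for an exact match, else the top k."""
        key = self.normalise(text)
        if key in self.exact:
            return "exact", [self.exact[key].most_common(1)[0][0]]
        scores = self._log_scores(key.split())
        ranked = sorted(scores, key=scores.get, reverse=True)
        return "bayes", ranked[:k]


# Toy training data with placeholder labels (not real codes).
train = [
    ("lobar pneumonia", "RESP"),
    ("acute bronchitis", "RESP"),
    ("heart failure", "CARD"),
    ("valvular heart disease", "CARD"),
    ("pulmonary tuberculosis", "TB"),
    ("phthisis pulmonalis", "TB"),
]
coder = HybridCoder().fit([t for t, _ in train], [c for _, c in train])

print(coder.predict("Heart  Failure"))         # seen text: answered by exact lookup
print(coder.predict("chronic heart disease"))  # unseen text: ranked Bayes candidates
```

Returning a ranked list rather than a single label is what makes second- and third-choice reporting possible; a coarser-grained variant could be approximated by truncating the labels to a shared prefix before training, although whether that suits a given coding scheme is a separate question.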