Semi-supervised machine learning with word embedding for classification in price statistics

The Office for National Statistics (ONS) is currently undertaking a substantial research program into using price information scraped from online retailers in the Consumer Prices Index including occupiers’ housing costs (CPIH). In order to make full use of these data, we must classify it into the pr...

Bibliographic Details
Main Authors:	Hazel Martindale, Edward Rowland, Tanya Flower, Gareth Clews
Format:	Article
Language:	English
Published:	Cambridge University Press 2020-01-01
Series:	Data & Policy
Subjects:	Classification machine learning natural language processing semi-supervised learning
Online Access:	https://www.cambridge.org/core/product/identifier/S2632324920000139/type/journal_article

collection	DOAJ
description	The Office for National Statistics (ONS) is currently undertaking a substantial research program into using price information scraped from online retailers in the Consumer Prices Index including occupiers’ housing costs (CPIH). In order to make full use of these data, we must classify them into the product types that make up the basket of goods and services used in the current collection. It is a common problem that the amount of labeled training data is limited and it is either impossible or impractical to manually increase the size of the training data, as is the case with web-scraped price data. We make use of a semi-supervised machine learning (ML) method, Label Propagation, to develop a pipeline to increase the number of labels available for classification. In this work, we use several techniques in succession and in parallel to enable higher confidence in the final increased labeled dataset to be used in training a traditional ML classifier. We find promising results using this method on a test sample of data, achieving good precision and recall values for both the propagated labels and the classifiers trained from these labels. We have shown that by combining several techniques and averaging the results, we are able to increase the usability of a dataset with limited labeled training data, a common problem in using ML in real-world situations. In future work, we will investigate how this method can be scaled up for use in future CPIH calculations and the challenges this brings.
first_indexed	2024-04-10T04:52:36Z
format	Article
id	doaj.art-5a29c6395d154d0da91a9480b26d45db
institution	Directory Open Access Journal
issn	2632-3249
language	English
last_indexed	2024-04-10T04:52:36Z
publishDate	2020-01-01
publisher	Cambridge University Press
record_format	Article
series	Data & Policy
spelling	doaj.art-5a29c6395d154d0da91a9480b26d45db | 2023-03-09T12:31:28Z | eng | Cambridge University Press | Data & Policy | 2632-3249 | 2020-01-01 | 2 | 10.1017/dap.2020.13 | Semi-supervised machine learning with word embedding for classification in price statistics | Hazel Martindale (0, https://orcid.org/0000-0001-6953-7760); Edward Rowland (1); Tanya Flower (2); Gareth Clews (3) | Methodology Division, Office for National Statistics, Newport, United Kingdom; Methodology Division, Office for National Statistics, Newport, United Kingdom; Prices Division, Office for National Statistics, Newport, United Kingdom; Methodology Division, Office for National Statistics, Newport, United Kingdom | https://www.cambridge.org/core/product/identifier/S2632324920000139/type/journal_article | Classification; machine learning; natural language processing; semi-supervised learning
title	Semi-supervised machine learning with word embedding for classification in price statistics
topic	Classification machine learning natural language processing semi-supervised learning
url	https://www.cambridge.org/core/product/identifier/S2632324920000139/type/journal_article
work_keys_str_mv	AT hazelmartindale semisupervisedmachinelearningwithwordembeddingforclassificationinpricestatistics AT edwardrowland semisupervisedmachinelearningwithwordembeddingforclassificationinpricestatistics AT tanyaflower semisupervisedmachinelearningwithwordembeddingforclassificationinpricestatistics AT garethclews semisupervisedmachinelearningwithwordembeddingforclassificationinpricestatistics
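
The description in this record names Label Propagation as the semi-supervised step: a small set of labelled product names passes its labels to similar unlabelled ones. The following is a minimal illustrative sketch of that general idea, not the authors' pipeline. It assumes bag-of-words count vectors as a simple stand-in for word embeddings, cosine similarity as the graph weight, and entirely hypothetical product names, class labels, and function names. It uses only the Python standard library.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words counts (assumption: a stand-in for a real word embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def propagate_labels(texts, labels, n_iter=20):
    """Spread known labels over a similarity graph.

    labels[i] is a class name, or None when item i is unlabelled.
    Returns one (label, confidence) pair per item.
    """
    classes = sorted({lab for lab in labels if lab is not None})
    vecs = [vectorize(t) for t in texts]
    n = len(texts)

    # Fully connected similarity graph without self-loops.
    weights = [
        [cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]

    def one_hot(label):
        return [1.0 if c == label else 0.0 for c in classes]

    # Labelled items start as one-hot rows, unlabelled items as all zeros.
    dist = [
        one_hot(lab) if lab is not None else [0.0] * len(classes)
        for lab in labels
    ]

    for _ in range(n_iter):
        new_dist = []
        for i in range(n):
            if labels[i] is not None:
                # Clamp: known labels are never overwritten.
                new_dist.append(one_hot(labels[i]))
                continue
            # Weighted vote of all neighbours, normalised to sum to 1.
            scores = [
                sum(weights[i][j] * dist[j][k] for j in range(n))
                for k in range(len(classes))
            ]
            total = sum(scores)
            new_dist.append([s / total for s in scores] if total else scores)
        dist = new_dist

    results = []
    for row in dist:
        if sum(row) == 0:
            results.append((None, 0.0))  # no similar labelled neighbour found
        else:
            best = max(range(len(classes)), key=row.__getitem__)
            results.append((classes[best], row[best]))
    return results


# Hypothetical scraped product names; only two carry a label.
texts = [
    "mens blue cotton shirt",
    "mens white cotton shirt",
    "white formal shirt",
    "ladies leather boots",
    "ladies black leather boots",
    "black ankle boots",
]
labels = ["shirt", None, None, "boots", None, None]

results = propagate_labels(texts, labels)
for text, (label, confidence) in zip(texts, results):
    print(f"{text!r}: {label} ({confidence:.2f})")
```

The confidence returned with each propagated label could be thresholded before the enlarged labelled set is used to train a conventional classifier; that threshold, like everything else in the sketch, is an assumption rather than a detail taken from the record.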

Similar Items