A learner corpus is born this way: From raw data to processed dataset

This data article presents the development of a learner corpus (i.e. a systematic computerized web-based repository of written texts produced by language learners) from the initial phase of the development where written assignments were collected from language learners as raw data to the critical ph...

Bibliographic Details
Main Authors:	Chung Hong Danny Leung, Mei Yung Vanliza Chow, Haoyan Ge
Format:	Article
Language:	English
Published:	Elsevier 2022-10-01
Series:	Data in Brief
Subjects:	Learner language corpus; Written data; Meta data; Data processing; ‘Regular expression’ text processing technique; Natural language toolkit
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340922007211

_version_	1818049553762353152
author	Chung Hong Danny Leung, Mei Yung Vanliza Chow, Haoyan Ge
author_sort	Chung Hong Danny Leung
collection	DOAJ
description	This data article presents the development of a learner corpus (i.e. a systematic computerized web-based repository of written texts produced by language learners) from the initial phase of the development where written assignments were collected from language learners as raw data to the critical phases where the processed text data and meta data were aligned and transformed to the web interface of the corpus. The corpus developed is called the CELL (Chinese and English Learner Language) Corpus, which comprises: i) text data containing 4.2 million English words and 18 million Chinese characters; and ii) meta data including the demographic information of the participants whose text data were collected. This article first outlines the steps for collecting the text data and meta data and then explains the processes for cleaning, annotating and tagging the text data. Discussion of the problems the research team encountered with segmentation of the Chinese text data and accuracy check of the processed datasets is also included in this article. The CELL Corpus comes with the concordance and word list features which will enable language teachers and researchers to investigate frequency, accuracy and complexity of vocabulary use in learner language. The steps and processes reported in this article will inform future development of learner language corpora of different languages.
first_indexed	2024-12-10T10:39:25Z
format	Article
id	doaj.art-1dfae20857984e1699c625dceee7b77f
institution	Directory Open Access Journal
issn	2352-3409
language	English
last_indexed	2024-12-10T10:39:25Z
publishDate	2022-10-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj.art-1dfae20857984e1699c625dceee7b77f; 2022-12-22T01:52:21Z; eng; Elsevier; Data in Brief; 2352-3409; 2022-10-01; 44; 108527; A learner corpus is born this way: From raw data to processed dataset; Chung Hong Danny Leung (0), Mei Yung Vanliza Chow (1), Haoyan Ge (2); (0, 1, 2) School of Education and Languages, Hong Kong Metropolitan University, Ho Man Tin, Kowloon, Hong Kong Special Administrative Region
title	A learner corpus is born this way: From raw data to processed dataset
topic	Learner language corpus; Written data; Meta data; Data processing; ‘Regular expression’ text processing technique; Natural language toolkit
url	http://www.sciencedirect.com/science/article/pii/S2352340922007211

Similar Items