The textcat Package for n-Gram Based Text Categorization in R
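Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.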
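The Cavnar and Trenkle (1994) approach named in the summary can be illustrated with a minimal Python sketch. This is not the textcat package's implementation, which is written in R. The function names `ngram_profile`, `out_of_place`, and `classify`, the profile size of 300, the underscore padding, and the two toy training sentences are all assumptions made for illustration. Each text is reduced to a ranked list of its most frequent character n-grams, and a document is assigned to the category whose profile has the smallest rank-displacement ("out-of-place") distance.

```python
import re
from collections import Counter

# Runs of Unicode letters, optionally joined by apostrophes (digits and punctuation are dropped).
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def ngram_profile(text, max_n=5, size=300):
    """Return the `size` most frequent character n-grams, most frequent first."""
    counts = Counter()
    for token in WORD.findall(text.lower()):
        padded = "_" + token + "_"  # "_" marks word boundaries (an illustrative choice)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; ties are broken alphabetically to keep results deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    penalty = len(cat_profile)
    return sum(
        abs(r - rank[g]) if g in rank else penalty
        for r, g in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the key of the category profile closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))


# Toy training data (hypothetical; real profiles need far more text per language).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(classify("the dog and the fox", profiles))
```

With training text this short the result is only indicative; the sketch is meant to show the profile-and-distance mechanics, not to reproduce the accuracy reported for the package.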
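Main Authors: | Kurt Hornik, Patrick Mair, Johannes Rauch, Wilhelm Geiger, Christian Buchta, Ingo Feinerer |
---|---|
Format: | Article |
Language: | English |
Published: | Foundation for Open Access Statistics, 2013-01-01 |
Series: | Journal of Statistical Software |
Subjects: | text mining; text categorization; language identification; n-grams; textcat; R |
Online Access: | http://www.jstatsoft.org/v52/i06/paper |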
_version_ | 1817976348206956544 |
---|---|
author | Kurt Hornik, Patrick Mair, Johannes Rauch, Wilhelm Geiger, Christian Buchta, Ingo Feinerer |
author_facet | Kurt Hornik, Patrick Mair, Johannes Rauch, Wilhelm Geiger, Christian Buchta, Ingo Feinerer |
author_sort | Kurt Hornik |
collection | DOAJ |
description | Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods. |
first_indexed | 2024-04-13T22:02:25Z |
format | Article |
id | doaj.art-bdb160c3a046471287c271c14cb13ca8 |
institution | Directory Open Access Journal |
issn | 1548-7660 |
language | English |
last_indexed | 2024-04-13T22:02:25Z |
publishDate | 2013-01-01 |
publisher | Foundation for Open Access Statistics |
record_format | Article |
series | Journal of Statistical Software |
spelling | doaj.art-bdb160c3a046471287c271c14cb13ca8 2022-12-22T02:28:03Z eng Foundation for Open Access Statistics Journal of Statistical Software 1548-7660 2013-01-01 52 6 The textcat Package for n-Gram Based Text Categorization in R Kurt Hornik Patrick Mair Johannes Rauch Wilhelm Geiger Christian Buchta Ingo Feinerer Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods. http://www.jstatsoft.org/v52/i06/paper text mining text categorization language identification n-grams textcat R |
spellingShingle | Kurt Hornik Patrick Mair Johannes Rauch Wilhelm Geiger Christian Buchta Ingo Feinerer The textcat Package for n-Gram Based Text Categorization in R Journal of Statistical Software text mining text categorization language identification n-grams textcat R |
title | The textcat Package for n-Gram Based Text Categorization in R |
title_full | The textcat Package for n-Gram Based Text Categorization in R |
title_fullStr | The textcat Package for n-Gram Based Text Categorization in R |
title_full_unstemmed | The textcat Package for n-Gram Based Text Categorization in R |
title_short | The textcat Package for n-Gram Based Text Categorization in R |
title_sort | textcat package for n gram based text categorization in r |
topic | text mining; text categorization; language identification; n-grams; textcat; R |
url | http://www.jstatsoft.org/v52/i06/paper |