The textcat Package for n-Gram Based Text Categorization in R

Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram...


Bibliographic Details
Main Authors:	Kurt Hornik, Patrick Mair, Johannes Rauch, Wilhelm Geiger, Christian Buchta, Ingo Feinerer
Format:	Article
Language:	English
Published:	Foundation for Open Access Statistics 2013-01-01
Series:	Journal of Statistical Software
Subjects:	text mining; text categorization; language identification; n-grams; textcat; R
Online Access:	http://www.jstatsoft.org/v52/i06/paper

author	Kurt Hornik Patrick Mair Johannes Rauch Wilhelm Geiger Christian Buchta Ingo Feinerer
collection	DOAJ
description	Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization, which implements both the Cavnar and Trenkle approach and a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
first_indexed	2024-04-13T22:02:25Z
format	Article
id	doaj.art-bdb160c3a046471287c271c14cb13ca8
institution	Directory Open Access Journal
issn	1548-7660
language	English
last_indexed	2024-04-13T22:02:25Z
publishDate	2013-01-01
publisher	Foundation for Open Access Statistics
record_format	Article
series	Journal of Statistical Software
title	The textcat Package for n-Gram Based Text Categorization in R
topic	text mining; text categorization; language identification; n-grams; textcat; R
url	http://www.jstatsoft.org/v52/i06/paper

The textcat Package for n-Gram Based Text Categorization in R

Similar Items