Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package

This article presents the PST R package for categorical sequence analysis with probabilistic suffix trees (PSTs), i.e., structures that store variable-length Markov chains (VLMCs). VLMCs make it possible to model high-order dependencies in categorical sequences with parsimonious models based on simple estimati...

Bibliographic Details
Main Authors:	Alexis Gabadinho, Gilbert Ritschard
Format:	Article
Language:	English
Published:	Foundation for Open Access Statistics 2016-08-01
Series:	Journal of Statistical Software
Subjects:	state sequences; categorical sequences; sequence visualization; sequence data mining; variable-length Markov chains; probabilistic suffix trees; R
Online Access:	https://www.jstatsoft.org/index.php/jss/article/view/2801

author	Alexis Gabadinho, Gilbert Ritschard
collection	DOAJ
description	This article presents the PST R package for categorical sequence analysis with probabilistic suffix trees (PSTs), i.e., structures that store variable-length Markov chains (VLMCs). VLMCs make it possible to model high-order dependencies in categorical sequences with parsimonious models based on simple estimation procedures. The package is specifically adapted to the field of social sciences, as it allows VLMC models to be learned from sets of individual sequences possibly containing missing values; in addition, the package is extended to account for case weights. This article describes how a VLMC model is learned from one or more categorical sequences and stored in a PST. The PST can then be used for sequence prediction, i.e., to assign a probability to whole observed or artificial sequences. This feature supports data mining applications such as the extraction of typical patterns and outliers. This article also introduces original visualization tools for both the model and the outcomes of sequence prediction. Other features, such as functions for pattern mining and artificial sequence generation, are described as well. The PST package also allows for the computation of probabilistic divergence between two models and the fitting of segmented VLMCs, where sub-models fitted to distinct strata of the learning sample are stored in a single PST.
first_indexed	2024-12-14T14:41:23Z
format	Article
id	doaj.art-10791a9b056d40df849a14834b060d1f
institution	Directory Open Access Journal
issn	1548-7660
language	English
last_indexed	2024-12-14T14:41:23Z
publishDate	2016-08-01
publisher	Foundation for Open Access Statistics
record_format	Article
series	Journal of Statistical Software
spelling	doaj.art-10791a9b056d40df849a14834b060d1f; 2022-12-21T22:57:24Z; eng; Foundation for Open Access Statistics; Journal of Statistical Software; ISSN 1548-7660; 2016-08-01; vol. 72, pp. 1-39; doi:10.18637/jss.v072.i03
title	Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package
topic	state sequences; categorical sequences; sequence visualization; sequence data mining; variable-length Markov chains; probabilistic suffix trees; R
url	https://www.jstatsoft.org/index.php/jss/article/view/2801

Similar Items