Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package

This article presents the PST R package for categorical sequence analysis with probabilistic suffix trees (PSTs), i.e., structures that store variable-length Markov chains (VLMCs). VLMCs make it possible to model high-order dependencies in categorical sequences with parsimonious models based on simple estimati...
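
As a rough illustration of the idea summarized above, the following Python sketch stores next-state probabilities for variable-length contexts in a dictionary, then scores a whole sequence by always conditioning on the longest stored suffix of its history. It is a simplified stand-in under stated assumptions, not the PST package's implementation or API: the function names and the `max_depth` and `min_count` parameters are hypothetical, and no pruning, missing-value handling, or case weighting is attempted.

```python
from collections import Counter, defaultdict


def learn_pst(sequences, max_depth=2, min_count=2):
    """Count next-state frequencies for every context of length 0..max_depth.

    Hypothetical helper: returns {context tuple: {state: probability}}.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, state in enumerate(seq):
            # The context is the k states immediately preceding position i.
            for k in range(min(i, max_depth) + 1):
                counts[tuple(seq[i - k:i])][state] += 1

    tree = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        # Always keep the empty (root) context; keep longer contexts
        # only if they were observed at least min_count times.
        if context == () or total >= min_count:
            tree[context] = {s: n / total for s, n in counter.items()}
    return tree


def next_state_probs(tree, history, max_depth=2):
    """Return the distribution attached to the longest stored suffix of history."""
    for k in range(min(len(history), max_depth), -1, -1):
        context = tuple(history[len(history) - k:])
        if context in tree:
            return tree[context]
    return tree[()]


def sequence_probability(tree, seq, max_depth=2):
    """Multiply the conditional probabilities of each state given its history."""
    p = 1.0
    for i, state in enumerate(seq):
        p *= next_state_probs(tree, seq[:i], max_depth).get(state, 0.0)
    return p


# Toy data (invented for illustration only).
sequences = [list("abab"), list("abab")]
tree = learn_pst(sequences, max_depth=1, min_count=1)
print(sequence_probability(tree, list("abab"), max_depth=1))  # 0.5
```

In this toy model a sequence containing a transition never seen in the learning data receives probability zero; a practical implementation would typically smooth the estimated probabilities to avoid that.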

Bibliographic Details
Main Authors: Alexis Gabadinho, Gilbert Ritschard
Format: Article
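Language: English
Published: Foundation for Open Access Statistics 2016-08-01
Series: Journal of Statistical Software
Subjects: state sequences, categorical sequences, sequence visualization, sequence data mining, variable-length Markov chains, probabilistic suffix trees, R
Online Access: https://www.jstatsoft.org/index.php/jss/article/view/2801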
_version_ 1828976585224486912
author Alexis Gabadinho
Gilbert Ritschard
author_facet Alexis Gabadinho
Gilbert Ritschard
author_sort Alexis Gabadinho
collection DOAJ
description This article presents the PST R package for categorical sequence analysis with probabilistic suffix trees (PSTs), i.e., structures that store variable-length Markov chains (VLMCs). VLMCs make it possible to model high-order dependencies in categorical sequences with parsimonious models based on simple estimation procedures. The package is specifically adapted to the field of social sciences, as it allows for VLMC models to be learned from sets of individual sequences possibly containing missing values; in addition, the package is extended to account for case weights. This article describes how a VLMC model is learned from one or more categorical sequences and stored in a PST. The PST can then be used for sequence prediction, i.e., to assign a probability to whole observed or artificial sequences. This feature supports data mining applications such as the extraction of typical patterns and outliers. This article also introduces original visualization tools for both the model and the outcomes of sequence prediction. Other features such as functions for pattern mining and artificial sequence generation are described as well. The PST package also allows for the computation of probabilistic divergence between two models and the fitting of segmented VLMCs, where sub-models fitted to distinct strata of the learning sample are stored in a single PST.
first_indexed 2024-12-14T14:41:23Z
format Article
id doaj.art-10791a9b056d40df849a14834b060d1f
institution Directory Open Access Journal
issn 1548-7660
language English
last_indexed 2024-12-14T14:41:23Z
publishDate 2016-08-01
publisher Foundation for Open Access Statistics
record_format Article
series Journal of Statistical Software
spelling doaj.art-10791a9b056d40df849a14834b060d1f | 2022-12-21T22:57:24Z | eng | Foundation for Open Access Statistics | Journal of Statistical Software | 1548-7660 | 2016-08-01 | 721139 | 10.18637/jss.v072.i03 | 1030 | Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package | Alexis Gabadinho | Gilbert Ritschard | This article presents the PST R package for categorical sequence analysis with probabilistic suffix trees (PSTs), i.e., structures that store variable-length Markov chains (VLMCs). VLMCs make it possible to model high-order dependencies in categorical sequences with parsimonious models based on simple estimation procedures. The package is specifically adapted to the field of social sciences, as it allows for VLMC models to be learned from sets of individual sequences possibly containing missing values; in addition, the package is extended to account for case weights. This article describes how a VLMC model is learned from one or more categorical sequences and stored in a PST. The PST can then be used for sequence prediction, i.e., to assign a probability to whole observed or artificial sequences. This feature supports data mining applications such as the extraction of typical patterns and outliers. This article also introduces original visualization tools for both the model and the outcomes of sequence prediction. Other features such as functions for pattern mining and artificial sequence generation are described as well. The PST package also allows for the computation of probabilistic divergence between two models and the fitting of segmented VLMCs, where sub-models fitted to distinct strata of the learning sample are stored in a single PST. | https://www.jstatsoft.org/index.php/jss/article/view/2801 | state sequences | categorical sequences | sequence visualization | sequence data mining | variable-length Markov chains | probabilistic suffix trees | R
title Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package
title_sort analyzing state sequences with probabilistic suffix trees the pst r package
topic state sequences
categorical sequences
sequence visualization
sequence data mining
variable-length Markov chains
probabilistic suffix trees
R
url https://www.jstatsoft.org/index.php/jss/article/view/2801
work_keys_str_mv AT alexisgabadinho analyzingstatesequenceswithprobabilisticsuffixtreesthepstrpackage
AT gilbertritschard analyzingstatesequenceswithprobabilisticsuffixtreesthepstrpackage