A stochastic memoizer for sequence data

Full description

We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce fragmentation operators necessary to do predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.
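
The back-off recursion underlying this family of models can be illustrated with a short sketch. The Python fragment below is not the authors' implementation; it only shows the standard hierarchical Pitman-Yor style predictive rule, in which a context discounts its own counts and passes the reserved mass to its longest proper suffix, with concentration parameters set to zero as in the sequence memoizer parameterization. The vocabulary, counts, table counts, and discount value are hypothetical toy values.

    # Illustrative sketch (not the paper's code) of hierarchical Pitman-Yor
    # back-off: each context smooths toward its longest proper suffix, and the
    # empty context falls back to a uniform base distribution. Concentration
    # parameters are taken to be zero; counts and discount are toy values.

    VOCAB = ["a", "b", "c"]
    DISCOUNT = 0.5  # single discount for all levels, chosen arbitrarily

    # customer counts c[context][symbol] and table counts t[context][symbol]
    # (in Chinese-restaurant terms); a real model maintains these by sampling.
    c = {"ab": {"c": 3, "a": 1}, "b": {"c": 4, "a": 2, "b": 1},
         "": {"a": 5, "b": 4, "c": 6}}
    t = {"ab": {"c": 1, "a": 1}, "b": {"c": 2, "a": 1, "b": 1},
         "": {"a": 2, "b": 2, "c": 3}}

    def predictive(symbol, context):
        """P(symbol | context) via recursive back-off to shorter contexts."""
        if context not in c:                     # unseen context: defer to suffix
            if context == "":
                return 1.0 / len(VOCAB)          # base case: uniform over vocab
            return predictive(symbol, context[1:])
        c_total = sum(c[context].values())
        t_total = sum(t[context].values())
        c_s = c[context].get(symbol, 0)
        t_s = t[context].get(symbol, 0)
        # mass reserved for the parent (one-symbol-shorter) context
        backoff = DISCOUNT * t_total / c_total
        parent = 1.0 / len(VOCAB) if context == "" else predictive(symbol, context[1:])
        return max(0.0, c_s - DISCOUNT * t_s) / c_total + backoff * parent

    print(predictive("c", "ab"))  # probability of "c" after the context "ab"

The paper's contribution is to marginalize away intermediate contexts (via coagulation operators) so that the unbounded-depth hierarchy can be represented in time and space linear in the length of the training sequence, something the naive recursion above does not attempt.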

Bibliographic Details
Main Authors: Wood, F; Archambeau, C; Gasthaus, J; James, L; Teh, Y
Format: Journal article
Language: English
Published: 2009
Collection: OXFORD
Record ID: oxford-uuid:d0e4822f-555b-49c0-995a-96a63293e48a
Institution: University of Oxford