Statistical Models for Co-occurrence Data

Modeling and predicting co-occurrences of events is a fundamental problem of unsupervised learning. In this contribution we develop a statistical framework for analyzing co-occurrence data in a general setting where elementary observations are joint occurrences of pairs of abstract objects from two...


Bibliographic Details
Main Authors: Hofmann, Thomas; Puzicha, Jan
Language: en_US
Published: 2004
Online Access: http://hdl.handle.net/1721.1/7253
author Hofmann, Thomas
Puzicha, Jan
collection MIT
description Modeling and predicting co-occurrences of events is a fundamental problem of unsupervised learning. In this contribution we develop a statistical framework for analyzing co-occurrence data in a general setting where elementary observations are joint occurrences of pairs of abstract objects from two finite sets. The main challenge for statistical models in this context is to overcome the inherent data sparseness and to estimate the probabilities for pairs which were rarely observed or even unobserved in a given sample set. Moreover, it is often of considerable interest to extract grouping structure or to find a hierarchical data organization. A novel family of mixture models is proposed which explains the observed data by a finite number of shared aspects or clusters. This provides a common framework for statistical inference and structure discovery and also includes several recently proposed models as special cases. Adopting the maximum likelihood principle, EM algorithms are derived to fit the model parameters. We develop improved versions of EM which largely avoid overfitting problems and overcome the inherent locality of EM-based optimization. Among the broad variety of possible applications, e.g., in information retrieval, natural language processing, data mining, and computer vision, we have chosen document retrieval, the statistical analysis of noun/adjective co-occurrence, and the unsupervised segmentation of textured images to test and evaluate the proposed algorithms.
first_indexed 2024-09-23T12:08:11Z
id mit-1721.1/7253
institution Massachusetts Institute of Technology
language en_US
last_indexed 2024-09-23T12:08:11Z
publishDate 2004
record_format dspace
spelling mit-1721.1/7253 2019-04-14T06:54:03Z Statistical Models for Co-occurrence Data Hofmann, Thomas Puzicha, Jan 2004-10-20T21:04:18Z 2004-10-20T21:04:18Z 1998-02-01 AIM-1625 CBCL-159 http://hdl.handle.net/1721.1/7253 en_US AIM-1625 CBCL-159 1827298 bytes 1464297 bytes application/postscript application/pdf application/postscript application/pdf
title Statistical Models for Co-occurrence Data
url http://hdl.handle.net/1721.1/7253
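
Illustrative sketch: the description mentions mixture models with shared aspects fitted by EM, but this record gives no equations. The Python below is a minimal sketch under the assumption that the joint probability of a pair factorizes as P(x, y) = sum_a P(a) P(x|a) P(y|a). The names fit_aspect_model and pair_probability are hypothetical, and this is plain EM, not the improved variants the description refers to.

```python
import random


def fit_aspect_model(counts, n_aspects, n_iters=50, seed=0):
    """Fit P(a), P(x|a), P(y|a) by plain EM (assumed model form, see lead-in).

    counts[i][j] is how often the pair (x_i, y_j) was observed; at least
    one entry must be positive.
    """
    rng = random.Random(seed)
    n_x, n_y = len(counts), len(counts[0])

    def random_dist(n):
        # Strictly positive random starting distribution.
        w = [rng.random() + 0.1 for _ in range(n)]
        s = sum(w)
        return [v / s for v in w]

    p_a = random_dist(n_aspects)
    p_x_a = [random_dist(n_x) for _ in range(n_aspects)]
    p_y_a = [random_dist(n_y) for _ in range(n_aspects)]

    for _ in range(n_iters):
        new_a = [0.0] * n_aspects
        new_x = [[0.0] * n_x for _ in range(n_aspects)]
        new_y = [[0.0] * n_y for _ in range(n_aspects)]
        for i in range(n_x):
            for j in range(n_y):
                if counts[i][j] == 0:
                    continue
                # E-step: posterior over aspects for the pair (i, j).
                joint = [p_a[a] * p_x_a[a][i] * p_y_a[a][j]
                         for a in range(n_aspects)]
                total = sum(joint)
                for a in range(n_aspects):
                    r = counts[i][j] * joint[a] / total
                    new_a[a] += r
                    new_x[a][i] += r
                    new_y[a][j] += r
        # M-step: renormalize the expected counts.
        n = sum(new_a)
        for a in range(n_aspects):
            if new_a[a] > 0:
                p_x_a[a] = [v / new_a[a] for v in new_x[a]]
                p_y_a[a] = [v / new_a[a] for v in new_y[a]]
            p_a[a] = new_a[a] / n
    return p_a, p_x_a, p_y_a


def pair_probability(model, i, j):
    """Model probability of the pair (x_i, y_j)."""
    p_a, p_x_a, p_y_a = model
    return sum(p_a[a] * p_x_a[a][i] * p_y_a[a][j] for a in range(len(p_a)))
```

Calling fit_aspect_model on a small count matrix and then pair_probability(model, i, j) yields an estimate for any pair, including pairs with a zero count, since the estimate is pooled over the shared aspects rather than read off the raw count.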