kamila: Clustering Mixed-Type Data in R and Hadoop

In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require strong parametric assumptions, or difficult-to-specify tuning parameters. We describe the kamila package, w...

Full description

Bibliographic Details
Main Authors: Alexander H. Foss, Marianthi Markatou
Format: Article
Language:English
Published: Foundation for Open Access Statistics 2018-02-01
Series:Journal of Statistical Software
Subjects:
Online Access:https://www.jstatsoft.org/index.php/jss/article/view/2812
Description
Summary:In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require strong parametric assumptions, or difficult-to-specify tuning parameters. We describe the kamila package, which includes a weighted k-means approach to clustering mixed-type data, a method for estimating weights for mixed-type data (ModhaSpangler weighting), and an additional semiparametric method recently proposed in the literature (KAMILA). We include a discussion of strategies for estimating the number of clusters in the data, and describe the implementation of one such method in the current R package. Background and usage of these clustering methods are presented. We then show how the KAMILA algorithm can be adapted to a map-reduce framework, and implement the resulting algorithm using Hadoop for clustering very large mixed-type data sets.
ISSN:1548-7660