Principal Component Analysis of Process Datasets with Missing Values

Datasets with missing values arising from causes such as sensor failure, inconsistent sampling rates, and merging data from different systems are common in the process industry. Methods for handling missing data typically operate during data pre-processing, but can also occur during model building....

Bibliographic Details
Main Authors: Severson, Kristen, Molaro, Mark, Braatz, Richard D
Other Authors: Massachusetts Institute of Technology. Department of Chemical Engineering
Format: Article
Language: English
Published: MDPI AG 2020
Online Access: https://hdl.handle.net/1721.1/125630
collection MIT
description Datasets with missing values arising from causes such as sensor failure, inconsistent sampling rates, and merging data from different systems are common in the process industry. Methods for handling missing data typically operate during data pre-processing, but can also occur during model building. This article considers missing data within the context of principal component analysis (PCA), which is a method originally developed for complete data that has widespread industrial application in multivariate statistical process control. Due to the prevalence of missing data and the success of PCA for handling complete data, several PCA algorithms that can act on incomplete data have been proposed. Here, algorithms for applying PCA to datasets with missing values are reviewed. A case study is presented to demonstrate the performance of the algorithms and suggestions are made with respect to choosing which algorithm is most appropriate for particular settings. An alternating algorithm based on the singular value decomposition achieved the best results in the majority of test cases involving process datasets.
Keywords: principal component analysis; missing data; process data analytics; chemometrics; machine learning; multivariable statistical process control; process monitoring; Tennessee Eastman problem
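The description singles out an alternating algorithm based on the singular value decomposition but does not spell it out. The following is only a minimal sketch of one common alternating scheme (iterative SVD imputation), not the authors' implementation: missing entries are filled, a rank-limited PCA model is fitted by SVD, and the missing entries are re-estimated from that model until they stop changing. The function name `alternating_svd_pca`, its parameters, and the use of NaN to mark missing values are all assumptions made for illustration.

```python
import numpy as np


def alternating_svd_pca(X, n_components, max_iter=500, tol=1e-8):
    """Hypothetical helper: PCA on data where NaN marks a missing entry.

    Assumes every column has at least one observed value.
    Returns the completed data, the loadings, and the scores.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: replace each missing entry with its column's observed mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(max_iter):
        # Step 1: fit a rank-limited PCA model to the currently completed data.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean

        # Step 2: overwrite only the missing entries with the model's estimates.
        change = np.linalg.norm(recon[missing] - filled[missing])
        filled[missing] = recon[missing]

        # Stop once the imputed values have effectively stopped moving.
        if change <= tol * max(1.0, np.linalg.norm(filled)):
            break

    # Final PCA model on the completed data.
    mean = filled.mean(axis=0)
    _, _, Vt = np.linalg.svd(filled - mean, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = (filled - mean) @ loadings
    return filled, loadings, scores
```

Observed entries are never modified; only the positions that were NaN are updated at each pass. The number of components is taken as given here, whereas in practice it would need to be chosen, for example by cross-validation.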
first_indexed 2024-09-23T10:02:56Z
format Article
id mit-1721.1/125630
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T10:02:56Z
publishDate 2020
publisher MDPI AG
record_format dspace
spelling mit-1721.1/125630 2022-09-26T15:24:10Z
Dates: 2020-06-02T18:39:46Z 2020-06-02T18:39:46Z 2017-07 2017-05 2019-08-14T18:20:09Z
Type: Article http://purl.org/eprint/type/JournalArticle
ISSN: 2227-9717
Handle: https://hdl.handle.net/1721.1/125630
Citation: Severson, Kristen et al. “Principal Component Analysis of Process Datasets with Missing Values.” Processes 5, 4 (July 2017): 38. © 2017 The Authors
Language: en
DOI: http://dx.doi.org/10.3390/pr5030038
Journal: Processes
License: Creative Commons Attribution 4.0 International license https://creativecommons.org/licenses/by/4.0/
File format: application/pdf
Publisher: MDPI AG
Source: MDPI