mvp – an open‐source preprocessor for cleaning duplicate records and missing values in mass spectrometry data

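Mass spectrometry (MS) data are used to analyze biological phenomena based on chemical species. However, these data often contain unexpected duplicate records and missing values due to technical or biological factors. These ‘dirty data’ problems increase the difficulty of performing MS analyses because they lead to performance degradation when statistical or machine‐learning tests are applied to the data. Thus, we have developed missing values preprocessor (mvp), an open‐source software for preprocessing data that might include duplicate records and missing values. mvp uses the property of MS data in which identical chemical species present the same or similar values for key identifiers, such as the mass‐to‐charge ratio and intensity signal, and forms cliques via graph theory to process dirty data. We evaluated the validity of the mvp process via quantitative and qualitative analyses and compared the results from a statistical test that analyzed the original and mvp‐applied data. This analysis showed that using mvp reduces problems associated with duplicate records and missing values. We also examined the effects of using unprocessed data in statistical tests and examined the improved statistical test results obtained with data preprocessed using mvp.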
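
The record's description says that mvp links records with the same or similar key identifiers (mass‐to‐charge ratio and intensity signal) and forms cliques via graph theory. The Python sketch below illustrates that general idea only. It is not the mvp R package or its API: the tolerances `mz_tol` and `rel_tol`, the `similar` rule, and the merge-by-averaging step are all assumptions made for illustration.

```python
from itertools import combinations


def similar(a, b, mz_tol=0.01, rel_tol=0.1):
    """Hypothetical rule: close m/z, and close intensity wherever both are observed."""
    if abs(a["mz"] - b["mz"]) > mz_tol:
        return False
    for x, y in zip(a["intensity"], b["intensity"]):
        if x is not None and y is not None:
            if abs(x - y) > rel_tol * max(abs(x), abs(y)):
                return False
    return True


def maximal_cliques(adj):
    """Enumerate maximal cliques with the Bron-Kerbosch algorithm (no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def merge_records(records):
    """Group similar records into cliques and merge each group into one record."""
    n = len(records)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if similar(records[i], records[j]):
            adj[i].add(j)
            adj[j].add(i)

    merged, used = [], set()
    # Largest cliques first; each record is assigned to at most one group.
    for clique in sorted(maximal_cliques(adj), key=lambda c: (-len(c), min(c, default=0))):
        members = sorted(clique - used)
        if not members:
            continue
        used.update(members)
        group = [records[i] for i in members]
        intensity = []
        for values in zip(*(r["intensity"] for r in group)):
            observed = [v for v in values if v is not None]
            # A value missing in one duplicate is filled from the others.
            intensity.append(sum(observed) / len(observed) if observed else None)
        merged.append({
            "members": members,
            "mz": sum(r["mz"] for r in group) / len(group),
            "intensity": intensity,
        })
    return merged


# Toy data: records 0 and 1 look like duplicates with complementary gaps.
records = [
    {"mz": 180.063, "intensity": [1000.0, None, 980.0]},
    {"mz": 180.064, "intensity": [1010.0, 2000.0, None]},
    {"mz": 255.232, "intensity": [500.0, 510.0, None]},
]
merged = merge_records(records)
```

In this toy input the first two records form one clique, so they collapse into a single record whose missing intensities are filled from the other duplicate. The third record stays on its own and keeps its missing value.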


Bibliographic Details
Main Authors:	Geunho Lee, Hyun Beom Lee, Byung Hwa Jung, Hojung Nam
Format:	Article
Language:	English
Published:	Wiley 2017-07-01
Series:	FEBS Open Bio
Subjects:	dirty data; duplicate record; mass spectrometry; missing value; MS data preprocessor; R package
Online Access:	https://doi.org/10.1002/2211-5463.12247

author	Geunho Lee, Hyun Beom Lee, Byung Hwa Jung, Hojung Nam
collection	DOAJ
description	Mass spectrometry (MS) data are used to analyze biological phenomena based on chemical species. However, these data often contain unexpected duplicate records and missing values due to technical or biological factors. These ‘dirty data’ problems increase the difficulty of performing MS analyses because they lead to performance degradation when statistical or machine‐learning tests are applied to the data. Thus, we have developed missing values preprocessor (mvp), an open‐source software for preprocessing data that might include duplicate records and missing values. mvp uses the property of MS data in which identical chemical species present the same or similar values for key identifiers, such as the mass‐to‐charge ratio and intensity signal, and forms cliques via graph theory to process dirty data. We evaluated the validity of the mvp process via quantitative and qualitative analyses and compared the results from a statistical test that analyzed the original and mvp‐applied data. This analysis showed that using mvp reduces problems associated with duplicate records and missing values. We also examined the effects of using unprocessed data in statistical tests and examined the improved statistical test results obtained with data preprocessed using mvp.
first_indexed	2024-04-11T16:32:00Z
format	Article
id	doaj.art-e15c06d1be4d46a490c815c3cfa9b053
institution	Directory Open Access Journal
issn	2211-5463
language	English
last_indexed	2024-04-11T16:32:00Z
publishDate	2017-07-01
publisher	Wiley
record_format	Article
series	FEBS Open Bio
spelling	doaj.art-e15c06d1be4d46a490c815c3cfa9b053; 2022-12-22T04:14:01Z; eng; Wiley; FEBS Open Bio; 2211-5463; 2017-07-01; vol. 7, no. 7, pp. 1051-1059; 10.1002/2211-5463.12247; Geunho Lee (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Korea); Hyun Beom Lee (Molecular Recognition Research Center, Korea Institute of Science and Technology (KIST), Seoul, Korea); Byung Hwa Jung (Molecular Recognition Research Center, Korea Institute of Science and Technology (KIST), Seoul, Korea); Hojung Nam (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Korea)
title	mvp – an open‐source preprocessor for cleaning duplicate records and missing values in mass spectrometry data
topic	dirty data; duplicate record; mass spectrometry; missing value; MS data preprocessor; R package
url	https://doi.org/10.1002/2211-5463.12247

Similar Items