Ch3MS-RF: a random forest model for chemical characterization and improved quantification of unidentified atmospheric organics detected by chromatography–mass spectrometry techniques
Main Authors: | E. B. Franklin, L. D. Yee, B. Aumont, R. J. Weber, P. Grigas, A. H. Goldstein |
---|---|
Format: | Article |
Language: | English |
Published: | Copernicus Publications, 2022-06-01 |
Series: | Atmospheric Measurement Techniques |
Online Access: | https://amt.copernicus.org/articles/15/3779/2022/amt-15-3779-2022.pdf |
_version_ | 1811236236807897088 |
---|---|
author | E. B. Franklin L. D. Yee B. Aumont R. J. Weber P. Grigas A. H. Goldstein |
author_facet | E. B. Franklin L. D. Yee B. Aumont R. J. Weber P. Grigas A. H. Goldstein |
author_sort | E. B. Franklin |
collection | DOAJ |
description | <p>The chemical composition of ambient organic aerosols plays a critical role in driving their climate- and health-relevant properties and holds important clues to the sources and formation mechanisms of secondary aerosol material. In most ambient atmospheric environments, this composition remains incompletely characterized, with the number of identifiable species consistently outnumbered by those that have no mass spectral matches in the literature or the National Institute of Standards and Technology/National Institutes of Health/Environmental Protection Agency (NIST/NIH/EPA) mass spectral databases, making them nearly impossible to definitively identify. This creates significant challenges in utilizing the full analytical capabilities of techniques that separate and generate spectra for complex environmental samples. In this work, we develop the use of machine learning techniques to quantify and characterize novel, or unidentifiable, organic material. This work introduces Ch3MS-RF (Chemical Characterization by Chromatography–Mass Spectrometry Random Forest Modeling), an open-source, R-based software tool for efficient machine-learning-enabled characterization of compounds separated in chromatography–mass spectrometry applications but not identifiable by comparison to mass spectral databases. A random forest model is trained and tested on a known 130-component representative external standard to predict the response factors of novel environmental organics based on position in volatility–polarity space and mass spectrum, enabling the reproducible, efficient, and optimized quantification of novel environmental species. Quantification accuracy on a reserved 20 % test set randomly split from the external standard compound list indicates that random forest modeling significantly outperforms the commonly used methods in both precision and accuracy, with a median response factor percent error of −2 % for modeled response factors, compared to > 15 % for typically used proxy-assignment-based methods. Chemical properties modeling, evaluated on the same reserved 20 % test set and an extrapolation set of species identified in ambient organic aerosol samples collected in the Amazon rainforest, also demonstrates robust performance. Extrapolation set property prediction mean absolute errors for carbon number, oxygen-to-carbon ratio (O : C), average carbon oxidation state ($\overline{\mathrm{OS}_{\mathrm{c}}}$), and vapor pressure are 1.8, 0.15, 0.25, and 1.0 (log(atm)), respectively. Extrapolation set out-of-sample <i>R</i><sup>2</sup> values for all properties modeled are above 0.75, with the exception of vapor pressure. While predictive performance for vapor pressure is less robust compared to the other chemical properties modeled, random-forest-based modeling was significantly more accurate than other commonly used methods of vapor pressure prediction, decreasing the mean vapor pressure prediction error to 0.24 (log(atm)) from 0.55 (log(atm)) (chromatography-based vapor pressure prediction) and 1.2 (log(atm)) (chemical-formula-based vapor pressure prediction). The random forest model significantly advances untargeted analysis of the full scope of chemical speciation yielded by two-dimensional gas chromatography (GCxGC-MS) techniques and can be applied to gas chromatography coupled with electron ionization mass spectrometry (GC-MS) as well. It enables the accurate estimation of key chemical properties commonly utilized in the atmospheric chemistry community, which may be used to more efficiently identify important tracers for further individual analysis and to characterize compound populations uniquely formed under specific ambient conditions.</p> |
first_indexed | 2024-04-12T12:05:33Z |
format | Article |
id | doaj.art-372c556193bb491e9faf635087153019 |
institution | Directory Open Access Journal |
issn | 1867-1381 1867-8548 |
language | English |
last_indexed | 2024-04-12T12:05:33Z |
publishDate | 2022-06-01 |
publisher | Copernicus Publications |
record_format | Article |
series | Atmospheric Measurement Techniques |
spelling | doaj.art-372c556193bb491e9faf635087153019; 2022-12-22T03:33:43Z; eng; Copernicus Publications; Atmospheric Measurement Techniques; ISSN 1867-1381, 1867-8548; 2022-06-01; vol. 15, pp. 3779–3803; doi:10.5194/amt-15-3779-2022; Ch3MS-RF: a random forest model for chemical characterization and improved quantification of unidentified atmospheric organics detected by chromatography–mass spectrometry techniques; E. B. Franklin (Department of Civil and Environmental Engineering, University of California Berkeley, Berkeley 94720, USA); L. D. Yee (Department of Environmental Science, Policy and Management, University of California Berkeley, Berkeley 94720, USA); B. Aumont (Université Paris-Est Créteil and Université de Paris, CNRS, LISA, 94010 Créteil, France); R. J. Weber (Department of Environmental Science, Policy and Management, University of California Berkeley, Berkeley 94720, USA); P. Grigas (Department of Industrial Engineering and Operations Research, University of California Berkeley, Berkeley 94720, USA); A. H. Goldstein (Department of Civil and Environmental Engineering, University of California Berkeley, Berkeley 94720, USA; Department of Environmental Science, Policy and Management, University of California Berkeley, Berkeley 94720, USA); https://amt.copernicus.org/articles/15/3779/2022/amt-15-3779-2022.pdf |
title | Ch3MS-RF: a random forest model for chemical characterization and improved quantification of unidentified atmospheric organics detected by chromatography–mass spectrometry techniques |
url | https://amt.copernicus.org/articles/15/3779/2022/amt-15-3779-2022.pdf |
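
The evaluation metrics named in the abstract (a random 20 % held-out test set, median response factor percent error, mean absolute error, and out-of-sample R²) can be illustrated with a minimal Python sketch. This is not the Ch3MS-RF code, which the record describes as R-based. The function names, the signed percent-error definition, the placeholder compound labels, and all numeric values below are assumptions made for illustration only, and no random forest is fitted here.

```python
import random
import statistics


def split_train_test(items, test_fraction=0.2, seed=0):
    """Randomly reserve a fraction of the items as a held-out test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def median_percent_error(predicted, observed):
    """Median of signed percent errors, 100 * (pred - obs) / obs (assumed definition)."""
    return statistics.median(
        100.0 * (p - o) / o for p, o in zip(predicted, observed)
    )


def mean_absolute_error(predicted, observed):
    """Mean of absolute differences between predictions and observations."""
    return statistics.fmean(abs(p - o) for p, o in zip(predicted, observed))


def r_squared(predicted, observed):
    """Out-of-sample R^2 = 1 - SS_res / SS_tot, computed on held-out data."""
    mean_obs = statistics.fmean(observed)
    ss_res = sum((o - p) ** 2 for p, o in zip(predicted, observed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels standing in for the 130-component external standard.
compounds = [f"compound_{i}" for i in range(130)]
train, test = split_train_test(compounds)

# Made-up values for illustration only; these are not results from the paper.
observed = [1.0, 2.0, 4.0, 5.0]
predicted = [1.1, 1.8, 4.0, 5.5]
mpe = median_percent_error(predicted, observed)
mae = mean_absolute_error(predicted, observed)
r2 = r_squared(predicted, observed)
```

A signed percent error is used because the abstract reports a negative median (−2 %), which an absolute error could not produce; the paper's exact definition may differ.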