Generative data augmentation and automated optimization of convolutional neural networks for process monitoring

Chemometric modeling for spectral data is considered a key technology in biopharmaceutical processing to realize real-time process control and release testing. Machine learning (ML) models have been shown to increase the accuracy of various spectral regression and classification tasks, remove challe...


Bibliographic Details
Main Authors:	Robin Schiemer, Matthias Rüdt, Jürgen Hubbuch
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2024-01-01
Series:	Frontiers in Bioengineering and Biotechnology
Subjects:	chemometrics; convolutional neural networks; process analytical technology; data augmentation; hyperparameter optimization; feature importance
Online Access:	https://www.frontiersin.org/articles/10.3389/fbioe.2024.1228846/full

collection	DOAJ
description	Chemometric modeling for spectral data is considered a key technology in biopharmaceutical processing to realize real-time process control and release testing. Machine learning (ML) models have been shown to increase the accuracy of various spectral regression and classification tasks, remove challenging preprocessing steps for spectral data, and promise to improve the transferability of models when compared to commonly applied, linear methods. The training and optimization of ML models require large data sets which are not available in the context of biopharmaceutical processing. Generative methods to extend data sets with realistic in silico samples, so-called data augmentation, may provide the means to alleviate this challenge. In this study, we develop and implement a novel data augmentation method for generating in silico spectral data based on local estimation of pure component profiles for training convolutional neural network (CNN) models using four data sets. We simultaneously tune hyperparameters associated with data augmentation and the neural network architecture using Bayesian optimization. Finally, we compare the optimized CNN models with partial least-squares regression models (PLS) in terms of accuracy, robustness, and interpretability. The proposed data augmentation method is shown to produce highly realistic spectral data by adapting the estimates of the pure component profiles to the sampled concentration regimes. Augmenting CNNs with the in silico spectral data is shown to improve the prediction accuracy for the quantification of monoclonal antibody (mAb) size variants by up to 50% in comparison to single-response PLS models. Bayesian structure optimization suggests that multiple convolutional blocks are beneficial for model accuracy and enable transfer across different data sets. Model-agnostic feature importance methods and synthetic noise perturbation are used to directly compare the optimized CNNs with PLS models. This enables the identification of wavelength regions critical for model performance and suggests increased robustness against Gaussian white noise and wavelength shifts of the CNNs compared to the PLS models.
first_indexed	2024-03-08T09:29:38Z
id	doaj.art-1d6deebc14b142d98d4769d585f3ad23
institution	Directory Open Access Journal
issn	2296-4185
last_indexed	2024-03-08T09:29:38Z
doi	10.3389/fbioe.2024.1228846
volume	12
article_number	1228846
author_affiliations	Robin Schiemer: Institute of Process Engineering in Life Sciences, Section IV: Biomolecular Separation Engineering, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany; Matthias Rüdt: Institute of Life Technologies, HES-SO Valais-Wallis, Sion, Switzerland; Jürgen Hubbuch: Institute of Process Engineering in Life Sciences, Section IV: Biomolecular Separation Engineering, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
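
Illustrative note (not part of the catalog record or the article): the abstract describes generating in silico spectra from estimated pure component profiles. The sketch below is a simplified, dependency-free illustration of that general idea and is not the authors' method. It assumes a linear two-component mixing model, estimates the profiles globally by least squares (the article uses local estimation), and samples new concentrations uniformly within the observed ranges. The function names `estimate_pure_profiles` and `augment`, the noise model, and all parameters are hypothetical.

```python
import random


def estimate_pure_profiles(concs, spectra):
    """Least-squares estimate of two pure component profiles.

    Assumption: each spectrum is a linear mixture c1 * s1 + c2 * s2.
    concs is a list of (c1, c2) pairs; spectra is a list of
    equal-length lists of absorbance values.
    """
    # Entries of the 2x2 normal matrix C^T C.
    a = sum(c1 * c1 for c1, _ in concs)
    b = sum(c1 * c2 for c1, c2 in concs)
    d = sum(c2 * c2 for _, c2 in concs)
    det = a * d - b * b
    if abs(det) < 1e-12:
        raise ValueError("collinear concentrations; profiles not identifiable")

    s1, s2 = [], []
    for j in range(len(spectra[0])):
        # Right-hand side C^T x_j for wavelength channel j.
        r1 = sum(c1 * x[j] for (c1, _), x in zip(concs, spectra))
        r2 = sum(c2 * x[j] for (_, c2), x in zip(concs, spectra))
        # Apply the closed-form inverse of the 2x2 normal matrix.
        s1.append((d * r1 - b * r2) / det)
        s2.append((a * r2 - b * r1) / det)
    return s1, s2


def augment(concs, spectra, n_new, noise_sd=0.0, seed=0):
    """Generate n_new synthetic (concentration, spectrum) pairs.

    Assumption: new concentrations are drawn uniformly within the
    observed range of each component, and independent Gaussian noise
    with standard deviation noise_sd is added per channel.
    """
    rng = random.Random(seed)
    s1, s2 = estimate_pure_profiles(concs, spectra)
    lo1, hi1 = min(c[0] for c in concs), max(c[0] for c in concs)
    lo2, hi2 = min(c[1] for c in concs), max(c[1] for c in concs)

    samples = []
    for _ in range(n_new):
        c1 = rng.uniform(lo1, hi1)
        c2 = rng.uniform(lo2, hi2)
        x = [c1 * p + c2 * q + rng.gauss(0.0, noise_sd)
             for p, q in zip(s1, s2)]
        samples.append(((c1, c2), x))
    return samples
```

In such a scheme the synthetic pairs would be appended to the measured training set before fitting a regression model. A global estimate like this one cannot capture concentration-dependent changes in the profiles, which is the limitation the local estimation described in the abstract is intended to address.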


Similar Items