Summary: | Partial Least Squares Regression (PLSR) is a multivariate method
commonly used to build predictive models from Near Infrared (NIR) spectral data.
In our experience, PLSR has several weaknesses in robustness, both in the
pre-processing and in the in-processing stages, when outliers and High Leverage
Points (HLP) exist in the dataset. To address these problems, several robust
procedures for PLSR are developed.
In the pre-processing stage, a pretreatment procedure is needed to remove both
additive and multiplicative baseline effects and to distinguish the scattering
effect in the raw spectra. The existing methods are not very successful in
removing those effects. Hence, a new robust Generalized Multiplicative Scatter
Correction (GMSC) algorithm is proposed to correct the additive and/or
multiplicative baseline effects when pre-processing spectra. The results
indicate that the proposed method outperforms the existing methods considered
in this study.
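For orientation, the additive and multiplicative baseline model can be illustrated with the classical (non-robust) Multiplicative Scatter Correction. The sketch below is not the proposed GMSC algorithm; the function name `msc`, the use of the mean spectrum as reference, and the synthetic data are assumptions made only for illustration.

```python
import numpy as np

def msc(spectra, reference=None):
    """Classical MSC: model each spectrum as a + b * reference, then undo a and b."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        # Assumption: use the mean spectrum as the reference.
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        # Ordinary least squares fit of x on the reference (slope b, intercept a).
        b, a = np.polyfit(reference, x, deg=1)
        corrected[i] = (x - a) / b
    return corrected

# Synthetic spectra: one underlying shape with different offsets and gains.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
offsets = np.array([0.1, -0.2, 0.3])   # additive baseline effects
gains = np.array([0.9, 1.1, 1.3])      # multiplicative baseline effects
raw = offsets[:, None] + gains[:, None] * base
corrected = msc(raw)
```

Because the fit is ordinary least squares, a single outlying spectrum or wavelength can distort the estimated offset and gain, which is the kind of sensitivity a robust variant is meant to reduce.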
In the in-processing stage, the PLSR model is very sensitive to the number of
PLS components used in the model fitting process. Several procedures for
selecting the optimal number of PLS components have been developed in
this regard. However, each procedure yields a different result, and to date no
method has been shown to be superior. Hence, a Robust
Reliable Weighted Average (RRWA-PLS) method, which does not require the selection
of an optimal number of PLS components, is developed by taking a weighted average
of multiple PLSR models fitted with different numbers of
PLS components.
The PLSR model also has no variable selection procedure
that is able to remove irrelevant wavelengths. To fill this gap in the
literature, a new robust wavelength selection procedure based on an input
scaling method is developed using a Filter-Wrapper method.
Furthermore, PLSR fails to
discover nonlinear structure in the original input space, so
classical PLSR might not be appropriate. In addition, contamination by
outliers and HLP in the dataset might damage the whole data processing
procedure. To address these problems, robust nonlinear solutions of PLSR
are developed through kernel-based learning by nonlinearly mapping the
original input data matrix to a high-dimensional feature space corresponding
to the kernel. Nonlinear solutions coupled with improved
robust methods such as the Diagnostic Robust Generalized Potential (DRGP)
method and the GM6-Estimator are also introduced.
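The kernel mapping mentioned above is usually handled through a Gram matrix rather than an explicit feature map. The following is a minimal sketch of that step only; the Gaussian (RBF) kernel, the `gamma` value, and the function names are assumptions, since the abstract does not state which kernel is used, and no robust weighting is included.

```python
import numpy as np

def rbf_gram(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) (assumed RBF kernel)."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point rounding.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.exp(-gamma * sq_dists)

def center_gram(K):
    """Center the mapped data in feature space: Kc = H K H with H = I - 11'/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

# Three toy samples with two variables each (illustrative values).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
K = rbf_gram(X, gamma=0.5)
Kc = center_gram(K)
```

A kernel PLS fit would then operate on the centered matrix `Kc` in place of the original data matrix.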
Several statistical measures, such as Root Mean Squared Error (RMSE),
Coefficient of Determination (R2), Ratio of Performance to Deviation (RPD), and
Standard Error (SE), are used to evaluate the performance of the proposed
methods. The results of a simulation study and of two NIR spectral data sets,
namely the NIR spectra of oil palm (Elaeis guineensis Jacq.) fresh and dried
ground fruit mesocarp, show that all the proposed methods are superior
to the existing methods considered in this study.
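As a rough illustration of the evaluation measures named above, the sketch below computes them from reference and predicted values. The exact definitions used in the study are not given here, so two choices are assumptions: RPD is taken as the sample standard deviation of the reference values divided by RMSE, and SE as the sample standard deviation of the residuals. The function name and the toy data are also illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, R2, SE and RPD under the assumed definitions described above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rmse = np.sqrt(np.mean(residuals ** 2))
    r2 = 1.0 - ss_res / ss_tot
    se = np.std(residuals, ddof=1)        # assumed: bias-corrected residual spread
    rpd = np.std(y_true, ddof=1) / rmse   # assumed: SD of reference values over RMSE
    return {"rmse": rmse, "r2": r2, "se": se, "rpd": rpd}

# Toy reference and predicted values (not from the study).
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.0])
```

Lower RMSE and SE and higher R2 and RPD indicate better predictive performance under these definitions.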
|