Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree

There are relatively few studies that explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods in digital soil mapping (DSM). Since several more powerful black-box models suc...

Bibliographic Details
Main Authors: Ndiye M. Kebonye, Prince C. Agyeman, James K.M. Biney
Format: Article
Language: English
Published: Elsevier 2023-02-01
Series: Smart Agricultural Technology
Subjects: Intelligible models; Model parsimony; Czech Republic; Generalization; Digital soil mapping (DSM)
Online Access: http://www.sciencedirect.com/science/article/pii/S2772375522000715
_version_ 1811214719387697152
author Ndiye M. Kebonye
Prince C. Agyeman
James K.M. Biney
collection DOAJ
description There are relatively few studies that explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods in digital soil mapping (DSM). Since several more powerful black-box models such as Random Forest (RF) exist, regular models like the Classification and Regression Tree (CART) are applied less often despite being more intelligible. We demonstrate a simple yet relevant way to improve the performance of a CART model for DSM while still benefiting from its intelligibility, interpretability and intuition potential. Soil organic carbon (SOC) levels across the Czech Republic are predicted at 30 m × 30 m resolution using selected covariates coupled with respective CART models. For this work, 440 topsoil samples (0–20 cm) from the Czech Republic were retrieved from the LUCAS soil database. Across the distinct CART models, data splitting strategies (Random, SPlit and Conditional Latin Hypercube Sampling, cLHS) and seven feature selection methods were varied. Model results were compared using accuracy metrics including the root mean square error (RMSE). One of the satisfactory SOC model validation results, based on SPlit, has an RMSE of 17.30 g/kg and a coefficient of determination (R²) of 0.52. cLHS proves robust for model data splitting. Feature selection methods including Stepwise Regression (SWR), Recursive Feature Elimination (RFE) and the Genetic Algorithm (GA) were considered computationally efficient for identifying relevant covariates. Overall, the study demonstrates the relevance and effectiveness of varying data splitting strategies and feature selection methods for improving SOC modelling via a decision tree (CART).
first_indexed 2024-04-12T06:08:13Z
format Article
id doaj.art-5b616c36f9ed48a6b0877e625aa1290b
institution Directory Open Access Journal
issn 2772-3755
language English
last_indexed 2024-04-12T06:08:13Z
publishDate 2023-02-01
publisher Elsevier
record_format Article
series Smart Agricultural Technology
spelling doaj.art-5b616c36f9ed48a6b0877e625aa1290b
2022-12-22T03:44:46Z
eng
Elsevier
Smart Agricultural Technology
2772-3755
2023-02-01
3
100106
Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree
Ndiye M. Kebonye (0)
Prince C. Agyeman (1)
James K.M. Biney (2)
(0) Department of Geosciences, Chair of Soil Science and Geomorphology, University of Tübingen, Rümelinstr. 19-23, Tübingen, Germany; DFG Cluster of Excellence “Machine Learning: New Perspectives for Science”, University of Tübingen, AI Research Building, Maria-von-Linden-Str. 6, Tübingen 72076, Germany; Corresponding author at: Department of Geosciences, Chair of Soil Science and Geomorphology, University of Tübingen, Rümelinstr. 19-23, Tübingen, Germany.
(1) Department of Soil Science and Soil Protection, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 129, Prague, Suchdol 165 00, Czech Republic
(2) Department of Landscape Ecology, The Silva Tarouca Research Institute for Landscape and Ornamental Gardening, Lidická 25/27, Brno 602 00, Czech Republic
http://www.sciencedirect.com/science/article/pii/S2772375522000715
Intelligible models
Model parsimony
Czech Republic
Generalization
Digital soil mapping (DSM)
title Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree
topic Intelligible models
Model parsimony
Czech Republic
Generalization
Digital soil mapping (DSM)
url http://www.sciencedirect.com/science/article/pii/S2772375522000715