Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree
There are relatively few studies that explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods in digital soil mapping (DSM). Since several more powerful black-box models suc...
Main Authors: | Ndiye M. Kebonye, Prince C. Agyeman, James K.M. Biney |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2023-02-01 |
Series: | Smart Agricultural Technology |
Subjects: | Intelligible models; Model parsimony; Czech Republic; Generalization; Digital soil mapping (DSM) |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2772375522000715 |
_version_ | 1811214719387697152 |
---|---|
author | Ndiye M. Kebonye; Prince C. Agyeman; James K.M. Biney |
author_facet | Ndiye M. Kebonye; Prince C. Agyeman; James K.M. Biney |
author_sort | Ndiye M. Kebonye |
collection | DOAJ |
description | There are relatively few studies that explicitly evaluate the performance of machine learning algorithms (MLAs) such as decision trees while varying conditions like data splitting strategies and feature selection methods in digital soil mapping (DSM). Since several more powerful black-box models such as Random Forest (RF) exist, regular models like the Classification and Regression Tree (CART) are applied less often despite being more intelligible than the former. We demonstrate a simple yet relevant way to improve the performance of a CART model for DSM while still benefiting from its intelligibility, interpretability and intuition potential. Soil organic carbon (SOC) levels across the Czech Republic are predicted at 30 m × 30 m resolution using selected covariates coupled with respective CART models. For this work, 440 topsoil samples (0–20 cm) for the Czech Republic were retrieved from the LUCAS soil database. Across the distinct CART models, data splitting strategies (Random, SPlit and Conditional Latin Hypercube Sampling: cLHS) and 7 feature selection methods were varied. Overall model results were compared using accuracy metrics including the root mean square error (RMSE). One of the satisfactory SOC model validation results, based on SPlit, has an RMSE of 17.30 g/kg and a coefficient of determination (R²) of 0.52. The cLHS proves robust for model data splitting. Feature selection methods including Stepwise Regression (SWR), Recursive Feature Elimination (RFE) and the Genetic Algorithm (GA) were considered computationally efficient for identifying relevant covariates. Generally, the study demonstrates the relevance and effectiveness of varying data splitting strategies and feature selection methods for improving SOC modelling via a decision tree (CART). |
first_indexed | 2024-04-12T06:08:13Z |
format | Article |
id | doaj.art-5b616c36f9ed48a6b0877e625aa1290b |
institution | Directory of Open Access Journals |
issn | 2772-3755 |
language | English |
last_indexed | 2024-04-12T06:08:13Z |
publishDate | 2023-02-01 |
publisher | Elsevier |
record_format | Article |
series | Smart Agricultural Technology |
spelling | doaj.art-5b616c36f9ed48a6b0877e625aa1290b; 2022-12-22T03:44:46Z; eng; Elsevier; Smart Agricultural Technology; 2772-3755; 2023-02-01; 3; 100106; Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree; Ndiye M. Kebonye (0); Prince C. Agyeman (1); James K.M. Biney (2); (0) Department of Geosciences, Chair of Soil Science and Geomorphology, University of Tübingen, Rümelinstr. 19-23, Tübingen, Germany; DFG Cluster of Excellence “Machine Learning: New Perspectives for Science”, University of Tübingen, AI Research Building, Maria-von-Linden-Str. 6, Tübingen 72076, Germany; Corresponding author at: Department of Geosciences, Chair of Soil Science and Geomorphology, University of Tübingen, Rümelinstr. 19-23, Tübingen, Germany. (1) Department of Soil Science and Soil Protection, Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences Prague, Kamýcká 129, Prague, Suchdol 165 00, Czech Republic. (2) Department of Landscape Ecology, The Silva Tarouca Research Institute for Landscape and Ornamental Gardening, Lidická 25/27, Brno 602 00, Czech Republic. http://www.sciencedirect.com/science/article/pii/S2772375522000715; Intelligible models; Model parsimony; Czech Republic; Generalization; Digital soil mapping (DSM) |
title | Optimized modelling of countrywide soil organic carbon levels via an interpretable decision tree |
topic | Intelligible models; Model parsimony; Czech Republic; Generalization; Digital soil mapping (DSM) |
url | http://www.sciencedirect.com/science/article/pii/S2772375522000715 |
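
The description names RMSE and R² as accuracy metrics and lists Random splitting as one of the data splitting strategies. A minimal sketch of those three pieces follows, using only the Python standard library. The function names, the 25% validation fraction, the seed, and the sample values are assumptions made for illustration; none of them come from the study, and the SPlit and cLHS strategies are not implemented here.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean square error, in the units of the target (e.g. g/kg)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def random_split(n_samples, validation_fraction=0.25, seed=0):
    """Shuffle sample indices, then split them into calibration and validation sets.

    The validation fraction and seed are hypothetical defaults.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * validation_fraction))
    return indices[n_val:], indices[:n_val]


# Made-up SOC values (g/kg) for illustration only; not data from the study.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print("RMSE:", rmse(observed, predicted))
print("R2:", r_squared(observed, predicted))

# 440 matches the number of topsoil samples mentioned in the description.
calibration_idx, validation_idx = random_split(440)
print(len(calibration_idx), len(validation_idx))
```

Because RMSE is reported in the units of the target, a value such as the 17.30 g/kg quoted in the description can be read directly against the range of measured SOC, whereas R² is unitless and describes the share of variance in the observations that the predictions explain.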