DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation.

Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of vari...

Bibliographic Details
Main Authors: Ghadi S Al Hajj, Johan Pensar, Geir K Sandve
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2023-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0284443
author Ghadi S Al Hajj
Johan Pensar
Geir K Sandve
collection DOAJ
description Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of variables in both inference and simulation settings. However, while modern machine learning is applied to data of an increasingly complex nature, DAG-based simulation frameworks are still confined to settings with relatively simple variable types and functional forms. We here present DagSim, a Python-based framework for DAG-based data simulation without any constraints on variable types or functional relations. A succinct YAML format for defining the simulation model structure promotes transparency, while separate user-provided functions for generating each variable based on its parents ensure simulation code modularization. We illustrate the capabilities of DagSim through use cases where metadata variables control shapes in an image and patterns in bio-sequences. DagSim is available as a Python package at PyPI. Source code and documentation are available at: https://github.com/uio-bmi/dagsim.
first_indexed 2024-04-09T16:59:08Z
format Article
id doaj.art-200c49e1d6a24b1586dcb03b8f809a45
institution Directory of Open Access Journals
issn 1932-6203
language English
last_indexed 2024-04-09T16:59:08Z
publishDate 2023-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-200c49e1d6a24b1586dcb03b8f809a45; 2023-04-21T05:33:40Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2023-01-01; 18; 4; e0284443; 10.1371/journal.pone.0284443; https://doi.org/10.1371/journal.pone.0284443
title DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation.
url https://doi.org/10.1371/journal.pone.0284443
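
The description in this record says that each variable is generated by a separate user-provided function that takes the values of its parents in the DAG. The following is a minimal sketch of that general idea in plain Python, not DagSim's actual API: the names `MODEL`, `simulate`, `draw_age`, `draw_shape`, and `draw_image` are hypothetical, the model is written as a Python dictionary rather than in the YAML format the abstract mentions, and the "image" is a small nested list standing in for real image data.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical per-node generating functions. Each one receives a random
# number generator plus the sampled values of its parents, and may return
# any Python object (number, string, nested list, ...).
def draw_age(rng):
    return rng.randint(20, 80)


def draw_shape(rng, age):
    return "circle" if age < 50 else "square"


def draw_image(rng, shape):
    # Stand-in for an image: a 3x3 grid whose fill value depends on the shape.
    fill = 1 if shape == "circle" else 2
    return [[fill] * 3 for _ in range(3)]


# Hypothetical model structure: node name -> (function, list of parent names).
MODEL = {
    "age": (draw_age, []),
    "shape": (draw_shape, ["age"]),
    "image": (draw_image, ["shape"]),
}


def simulate(model, num_samples, seed=0):
    """Draw samples by visiting nodes so that parents come before children."""
    rng = random.Random(seed)
    # TopologicalSorter expects node -> predecessors; it raises
    # graphlib.CycleError if the structure is not acyclic.
    predecessors = {name: set(parents) for name, (_, parents) in model.items()}
    order = list(TopologicalSorter(predecessors).static_order())

    samples = []
    for _ in range(num_samples):
        values = {}
        for name in order:
            func, parents = model[name]
            values[name] = func(rng, *(values[p] for p in parents))
        samples.append(values)
    return samples
```

Calling `simulate(MODEL, 5)` returns a list of five dictionaries, one per sample, each holding a value for `age`, `shape`, and `image`. Keeping the structure (`MODEL`) separate from the generating functions mirrors the separation between model definition and simulation code that the description attributes to DagSim.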