DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation.
Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of vari...
Main Authors: | Ghadi S Al Hajj, Johan Pensar, Geir K Sandve |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2023-01-01 |
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0284443 |
_version_ | 1797843091535691776 |
---|---|
author | Ghadi S Al Hajj Johan Pensar Geir K Sandve |
collection | DOAJ |
description | Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of variables in both inference and simulation settings. However, while modern machine learning is applied to data of an increasingly complex nature, DAG-based simulation frameworks are still confined to settings with relatively simple variable types and functional forms. We here present DagSim, a Python-based framework for DAG-based data simulation without any constraints on variable types or functional relations. A succinct YAML format for defining the simulation model structure promotes transparency, while separate user-provided functions for generating each variable based on its parents ensure simulation code modularization. We illustrate the capabilities of DagSim through use cases where metadata variables control shapes in an image and patterns in bio-sequences. DagSim is available as a Python package at PyPI. Source code and documentation are available at: https://github.com/uio-bmi/dagsim. |
first_indexed | 2024-04-09T16:59:08Z |
format | Article |
id | doaj.art-200c49e1d6a24b1586dcb03b8f809a45 |
institution | Directory of Open Access Journals |
issn | 1932-6203 |
language | English |
last_indexed | 2024-04-09T16:59:08Z |
publishDate | 2023-01-01 |
publisher | Public Library of Science (PLoS) |
record_format | Article |
series | PLoS ONE |
title | DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation. |
url | https://doi.org/10.1371/journal.pone.0284443 |
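
The description above characterizes a simulation model as a DAG in which each variable is generated by a separate user-provided function of its parents, with no constraint on variable types. The following is a minimal sketch of that idea using only the Python standard library; it does not use the DagSim package or reproduce its API. The names `simulate`, `draw_shape`, `draw_image`, and the layout of the `model` dictionary are hypothetical, chosen for illustration: a categorical metadata variable controls which shape is drawn in a small image.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def simulate(model, num_samples, seed=0):
    """Sample from a DAG given as {node: (function, [parent names])}.

    Hypothetical helper for illustration only, not part of DagSim.
    """
    rng = random.Random(seed)
    # TopologicalSorter expects {node: predecessors}; static_order()
    # yields parents before children and raises CycleError on a cycle.
    parents_of = {name: parents for name, (_, parents) in model.items()}
    order = list(TopologicalSorter(parents_of).static_order())

    samples = []
    for _ in range(num_samples):
        values = {}
        for name in order:
            func, parents = model[name]
            # Each node's function receives the values of its parents.
            values[name] = func(rng, *(values[p] for p in parents))
        samples.append(values)
    return samples


def draw_shape(rng):
    """Root node: a categorical metadata variable."""
    return rng.choice(["square", "cross"])


def draw_image(rng, shape, size=5):
    """Child node: a size x size binary image determined by `shape`."""
    mid = size // 2
    image = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if shape == "square":
                on = 1 <= i <= size - 2 and 1 <= j <= size - 2
            else:  # "cross"
                on = i == mid or j == mid
            image[i][j] = int(on)
    return image


# Model structure: node name -> (generating function, parent names).
model = {
    "shape": (draw_shape, []),
    "image": (draw_image, ["shape"]),
}

samples = simulate(model, num_samples=3)
for sample in samples:
    print(sample["shape"])
    for row in sample["image"]:
        print("".join("#" if pixel else "." for pixel in row))
```

Keeping the structure (`model`) separate from the per-node functions mirrors the modularization the description mentions: a node's generating function can be swapped without touching the graph definition.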