Language-Agnostic Reproducible Data Analysis Using Literate Programming.

A modern biomedical research project can easily contain hundreds of analysis steps, and the lack of reproducibility of these analyses has been recognized as a severe issue. While thorough documentation enables reproducibility, the number of analysis programs used can be so large that in reality reproducibi...

Bibliographic Details
Main Authors: Boris Vassilev, Riku Louhimo, Elina Ikonen, Sampsa Hautaniemi
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2016-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC5053501?pdf=render
author Boris Vassilev
Riku Louhimo
Elina Ikonen
Sampsa Hautaniemi
collection DOAJ
description A modern biomedical research project can easily contain hundreds of analysis steps, and the lack of reproducibility of these analyses has been recognized as a severe issue. While thorough documentation enables reproducibility, the number of analysis programs used can be so large that in reality reproducibility cannot be easily achieved. Literate programming is an approach to presenting computer programs to human readers. The code is rearranged to follow the logic of the program, and that logic is explained in a natural language. The code executed by the computer is extracted from the literate source code. As such, literate programming is an ideal formalism for systematizing analysis steps in biomedical research. We have developed the reproducible computing tool Lir (literate, reproducible computing) that allows a tool-agnostic approach to biomedical data analysis. We demonstrate the utility of Lir by applying it to a case study. Our aim was to investigate the role of endosomal trafficking regulators in the progression of breast cancer. In this analysis, a variety of tools were combined to interpret the available data: a relational database, standard command-line tools, and a statistical computing environment. The analysis revealed that the lipid transport-related genes LAPTM4B and NDRG1 are coamplified in breast cancer patients, and identified genes potentially cooperating with LAPTM4B in breast cancer progression. Our case study demonstrates that with Lir, an array of tools can be combined in the same data analysis to improve efficiency, reproducibility, and ease of understanding. Lir is open-source software available at github.com/borisvassilev/lir.
first_indexed 2024-12-13T14:58:29Z
format Article
id doaj.art-62217ed05f944b61be8b09d1793e3648
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-12-13T14:58:29Z
publishDate 2016-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-62217ed05f944b61be8b09d1793e3648
2022-12-21T23:41:10Z
eng
Public Library of Science (PLoS)
PLoS ONE
1932-6203
2016-01-01
11
10
e0164023
10.1371/journal.pone.0164023
Language-Agnostic Reproducible Data Analysis Using Literate Programming.
Boris Vassilev
Riku Louhimo
Elina Ikonen
Sampsa Hautaniemi
http://europepmc.org/articles/PMC5053501?pdf=render
title Language-Agnostic Reproducible Data Analysis Using Literate Programming.
url http://europepmc.org/articles/PMC5053501?pdf=render