Reproducible Science

What will this session teach me?

  • The notion of a "reproducible paper": a research object that goes beyond a PDF and includes everything necessary to reproduce a scientific result.
  • A concept and technology overview, with a short introduction to containers and computational environments: What is a container, how can a container help reproducibility, and how can I get or make containers and add them to my dataset?


You can get a PDF of the slides here.


Hands-on 1: Write a reproducible paper (LaTeX, Python, Make & DataLad)

Tested to work on Linux and macOS

datalad clone the repository found at

datalad clone
cd repro-paper-sketch

Check that you have all requirements installed (latexmk and Python 3).
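
The check can also be scripted. The following is a minimal sketch, assuming both tools are exposed on the PATH under the names latexmk and python3; check_tool is a hypothetical helper name and not part of the template repository:

```shell
# Report whether a command is available on the PATH.
# check_tool is a hypothetical helper, not part of the template repository.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check every required tool and remember whether any of them is absent.
missing=0
for tool in latexmk python3; do
    check_tool "$tool" || missing=1
done

if [ "$missing" -eq 0 ]; then
    echo "all requirements installed"
else
    echo "install the missing tools before running make"
fi
```

The loop keeps going after a missing tool, so a single run lists everything that still needs to be installed.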

Create a virtual environment and install the Python modules listed in requirements.txt with pip:

virtualenv --python=python3 ~/env/repro
. ~/env/repro/bin/activate
pip install -r requirements.txt

Run make to see the template in action, and open the resulting PDF (main.pdf) with a PDF reader.


Then, take a look at the script inside code/ and at the Makefile, and try to find out how the setup works. Change the color palette from "muted" to "Blues" (in the function plot_relationships()) and rerun make. Take another look at the PDF to see the updated figure.
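
As a hint for reading the Makefile: at its core, make rebuilds a target when the target is missing or when one of its prerequisites is newer than the target. That rule can be sketched in shell as follows. needs_rebuild is a hypothetical helper, and main.tex in the example call is an assumed file name, not one taken from the template's Makefile:

```shell
# Return 0 (true) if TARGET must be rebuilt, 1 otherwise.
# Usage: needs_rebuild TARGET PREREQUISITE...
needs_rebuild() {
    target=$1
    shift
    # A missing target always needs to be built.
    [ -e "$target" ] || return 0
    for dep in "$@"; do
        # -nt is true when the prerequisite was modified after the target.
        if [ "$dep" -nt "$target" ]; then
            return 0
        fi
    done
    return 1
}

# Example call; main.tex is an assumed file name.
if needs_rebuild main.pdf main.tex; then
    echo "main.pdf is out of date"
else
    echo "main.pdf is up to date"
fi
```

This is why editing the plotting script is enough to trigger a fresh figure and PDF on the next make run, provided the Makefile lists the script as a prerequisite.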

Hands-on 2: Run a containerized neuroimaging workflow

Tested to work on Linux and macOS, taken from

This short example runs a containerized neuroimaging pipeline (MRIQC, a pipeline for creating image quality metrics from structural and functional magnetic resonance imaging data) on fMRI data.


Create a new dataset to contain the MRIQC output:

datalad create -d ds000003-qc -c text2git
cd ds000003-qc

Install the ReproNim container collection for convenience:

datalad install -d . ///repronim/containers
# (optionally) Freeze container of interest to the specific version desired
# to facilitate reproducibility of some older results
datalad run -m "Downgrade/Freeze mriqc container version" containers/scripts/freeze_versions bids-mriqc=0.15.1

Install input data as a subdataset. The dataset installed here is a slimmed-down OpenNeuro dataset with only two subjects.

datalad clone -d . sourcedata

Execute a containerized pipeline using datalad containers-run to create a re-executable run record (this takes about 15 minutes of execution time):

datalad containers-run \
        -n containers/bids-mriqc \
        -m "Run MRIQC on the input data" \
        --input sourcedata \
        --output . \
        '{inputs}' '{outputs}' participant group

Check the Git log to find the run record's commit hash:

git log -n 1

Rerun the run record (execution is much faster because we saved the work dir):

datalad rerun <INSERT HASH>
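
Instead of copying the hash by hand, it can be captured with a small shell function. This is a minimal sketch, assuming the run record is the most recent commit in the dataset; latest_commit_hash is a hypothetical helper name:

```shell
# Print the full hash of the most recent commit in a repository
# (defaults to the current directory).
latest_commit_hash() {
    git -C "${1:-.}" log -n 1 --format=%H
}

# Inside the dataset, the rerun step could then be written as:
#   datalad rerun "$(latest_commit_hash)"
```

If other commits were made after the run record, the hash still has to be picked from the log manually.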

Use Git tools to explore what changed between runs (if you end up in a pager, pressing q gets you out):

# get the difference between the most recent and second most recent commit for one file
git diff HEAD..HEAD~1 -- group_bold.html

# list all files that were changed between the two runs
git diff HEAD..HEAD~1 --name-only