Paper preprint: Reproducible genomics analysis pipelines with GNU Guix

Discussion:

Ricardo Wurmus

2018-04-11 12:18:38 UTC

Hey all,

I’m happy to announce that the group I’m working with has released a
preprint of a paper on reproducibility with the title:

Reproducible genomics analysis pipelines with GNU Guix
https://www.biorxiv.org/content/early/2018/04/11/298653

We built a collection of bioinformatics pipelines and packaged them with
GNU Guix, and then looked at the degree to which the software achieves
bit-reproducibility (spoiler: ~98%), analysed sources of non-determinism
(e.g. time stamps), discussed experimental reproducibility at runtime
(e.g. random number generators, kernel+glibc interface, etc) and
commented on the idea of using “containers” (or application bundles)
instead.
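
As a rough illustration of what a bit-reproducibility check boils down to, here is a minimal sketch in Python. It assumes the same source has been built twice into two separate output directories; the function names are made up for this example, and this is not the procedure used in the paper, whose ~98% figure is its own measurement.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def reproducibility_report(build_a, build_b):
    """Compare two independent builds of the same source, file by file.

    Returns (identical_count, total_count, differing_names). A file that
    exists in only one of the two builds counts as differing.
    """
    a, b = tree_digests(build_a), tree_digests(build_b)
    all_names = a.keys() | b.keys()
    differing = sorted(n for n in all_names if a.get(n) != b.get(n))
    return len(all_names) - len(differing), len(all_names), differing
```

A build is bit-reproducible in this sense only when the list of differing files is empty; an embedded time stamp in a single file is enough to make it non-empty.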

The middle section is a bit heavy on genomics to showcase the features
of the pipelines, but I think the introduction and the
discussion/conclusion may be of general interest.

--
Ricardo

GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net

Ricardo Wurmus

2018-04-11 18:40:47 UTC

Hi Holger,

thanks for your comments!

Post by Holger Levsen
just one thing/question: in the keywords you have "reproducible
software" but not "reproducible builds", which is kind of our "marketing
term". Do you think you could squeeze that in?

Heh, it used to be “reproducible builds”, but the term was deemed too
abstract for the audience of biologists, so it was decided to change it
to “reproducible software”…

Lots of small compromises need to be made when writing a paper together,
and that was one of them :)

--
Ricardo

Holger Levsen

2018-04-11 19:00:55 UTC

Heh, it used to be “reproducible builds”, but the term was deemed too
abstract for the audience of biologists, so it was decided to change it
to “reproducible software”…

hehe.

Lots of small compromises need to be made when writing a paper together,
and that was one of them :)

I understand.

& thanks again, super cool!

--
cheers,
Holger

Holger Levsen

2018-04-11 18:31:31 UTC

hi again,

and extra kudos and thanks for releasing this under a free licence! \o/

--
cheers,
Holger

Roel Janssen

2018-04-11 21:16:19 UTC

This looks really great! I also like how you leverage GNU Autotools.

Finally there is a paper that uses GNU Guix as a deployment tool for
scientific purposes. :)

Kind regards,
Roel Janssen

Amirouche Boubekki

2018-04-15 07:50:19 UTC

Wow very great, thanks for sharing.

Ludovic Courtès

2018-04-23 08:20:26 UTC

Hello Ricardo & all!

Very impressive piece of work! I think it’s important to stress that
reproducible builds is a crucial foundation for reproducible
computational experiments, and this paper does a great job at this.

Also nice that you show you can have these bit-reproducible pipelines
formalized in Guix *and* produce a ready-to-use “container image.”

Hopefully we can soon address the remaining sources of non-determinism
shown in Table 3 (I think you already addressed some of them in the
meantime, didn’t you?).

The bit I’m less comfortable with is Autotools. I do understand how it
helps capture configure-time dependencies, and how it generally helps
people package and use the software; I think it’s one of the best tools
for the job. However it’s also hard to learn and, whether it’s
justified or not, it’s considered “scary.”

Given the intended audience, I wonder how we could provide a simpler
path to achieve the same goal. It could be a set of Autoconf macros
leading to high-level ‘configure.ac’ files without any line of shell
code, or it could be Guix interpreting a top-level .scm or JSON file,
both of which would ideally be easier to write for bioinformaticians.

What are your thoughts on this?

Anyway, kudos on this, thank you!

Ludo’.

Catonano

2018-05-13 08:58:44 UTC

Ricardo, I don't understand the problem you're raising here (I didn't read
the article yet, though).
Would you mind elaborating on that?
Why would you want to record the environment?

I want to record the detected build environment so that I can restore it
at execution time. Autoconf provides macros that probe the environment
and record the full path to detected tools. For example, I’m looking
for Samtools, and the user may provide a particular variant of Samtools
at configure time.

Thanks for clarifying!

Let me vent some thoughts on the issue!

Under Guix, the way to provide a specific version of Samtools would be
to run the configuration in an environment that offers a specific Samtools
package, so that the configuration tool can pick that up

Under a traditional distro, it'd be to feed file paths to the configuration
tool

So, how much of the traditional way of doing things do we want to support,
in our pipelines?

I record the full path to the executable at configure time and embed
that path in a configuration file that is read when the pipeline is run.
This works fine for tools, but doesn’t work very well at all for modules
in language environments. Take R for example. I can detect and record
the location of the R and Rscript executables, but I cannot easily
record the location of build-time R packages (such as r-deseq2) in a way
that allows me to rebuild the environment at runtime.
Instead of writing an Autoconf macro that records the exact location of
each of the detected R packages and their dependencies I chose to solve
the problem in Guix by wrapping the pipeline executables in R_LIBS_SITE,
because I figured that on systems without Guix you aren’t likely to
install R packages into separate unique locations anyway; on most
systems R packages end up being installed to one and the same directory.
I think the desire to restore the configured environment at runtime is
valid and we do this all the time when we run binaries that have
embedded absolute paths (to libraries or other tools).
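
The record-at-configure-time, restore-at-runtime idea described above can be sketched roughly as follows. This is a hypothetical Python stand-in, not PiGx's Autoconf code: `configure`, `run_tool`, and the JSON file layout are assumptions made for illustration only.

```python
import json
import os
import shutil
import subprocess
from pathlib import Path


def configure(tools, config_path, overrides=None):
    """Probe the environment once and record absolute paths to tools.

    `overrides` maps a tool name to a user-supplied path, e.g. a
    particular variant of Samtools chosen at configure time.
    """
    overrides = overrides or {}
    recorded = {}
    for name in tools:
        path = overrides.get(name) or shutil.which(name)
        if path is None:
            raise FileNotFoundError(f"required tool not found: {name}")
        recorded[name] = os.path.abspath(path)
    Path(config_path).write_text(json.dumps(recorded, indent=2))
    return recorded


def run_tool(config_path, name, *args):
    """Run a tool via the path recorded at configure time, ignoring PATH."""
    recorded = json.loads(Path(config_path).read_text())
    return subprocess.run([recorded[name], *args], check=True)
```

Note that this only covers executables. It does nothing for R packages or Python modules, which is exactly the awkward case raised in the quoted text.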

I didn't mean to imply it's not valid
I was just trying to understand what the concerns on the ground and the
context are

It’s just that
it gets pretty awkward to do this for things like R packages or Python
modules (or Guile modules for that matter).
The Guix workflow language solves this problem by depending on Guix for
software deployment. For PiGx we picked Snakemake early on and it does
not have a software deployment solution (it expects to either run inside
a suitable environment that the user provides or to have access to
pre-built Singularity application bundles). I don’t like to treat
pipelines like some sort of collection of scripts that must be invoked
in a suitable environment. I like to see pipelines as big software
packages that should know about the environment they need, that can be
configured like regular tools, and thus only require the packager to
assemble the environment, not the end-user.

I understand your wish to treat pipelines as packages

But say, for example, that a pipeline gets distributed as a .deb package
with dependencies on R (or Guile) modules

Or, say, that a pipeline is distributed with a bundled guix.scm file
containing R modules (or Guile modules) as inputs

Would that break the idea of a pipeline as a package?

I'm afraid that the idea of a pipeline as a package shouldn't be entrusted
to the configuration tool, but rather to the package management tool

And the pipeline author shouldn't be assumed to work in isolation,
confident that any package management environment will be able to run their
pipeline smoothly

The pipeline authors should be concerned with where their pipeline sits
in the package graph; that shouldn't be a concern of the packager
only

Maybe the software authors should provide dependency information in a
standardized format (RDF?) and that should be leveraged by packagers in
order to prepare .deb packages or guix.scm files

And if you are a developer and you want to test the software with a
specific version of a dependency, then you should run the configuration
tool in an environment where that version of the dependency is available,
so that the configuration tool can pick that up

If you are on Guix, you will probably create that environment with the Guix
environment tool

If you are on Debian or Fedora, you will have to rely on those distros'
development tools

On traditional distros, you can install packages in your user folder or in
/opt or in other locations

And then, you can feed those to the configuration tool

On Guix, the conditions are different

The idea of pipelines as packages will be treated differently by the
configuration tool under Guix and the configuration tool under Debian/Fedora

So, in my view a configuration tool should be quite dumb and assume that
the package management is smarter

You object that this implies the idea of the pipeline as an ugly hack

That is not necessarily so

It's just that I don't think that the pipeline authors can fully resolve
the issue in their configuration management

Guix introduces the idea of the whole dependency stack and that can't be
of concern to packagers only.
I don't think so

Maybe I'm too pessimistic, I don't know

Thanks for this discussion!