Provenance or Audit Trail of computation

We are evaluating Nextflow to replace an existing in-house workflow platform.

Our existing system lets us track a detailed provenance/audit trail for each computation: inputs, outputs, software version, workflow version, and so on. We think this is a key feature. For example, last week, in response to some poor scientific outcomes, the head of our lab asked, ‘What exactly did you run to get these outputs?’ We were able to tell them the version of the scientific software that was run, the version of the workflow, the exact input parameters to each computation, and the unique IDs of the input and output files, and we were able to retrieve the input files from our S3 storage.

The last time we evaluated Nextflow (maybe 6 years ago), this was not possible. Is it possible now? If not, could we build it?

Hi Michael!

Welcome to the Seqera community! Glad to have you here!

What you’re asking about is definitely possible with Nextflow. In fact, it’s an area where Nextflow excels compared with other workflow managers. Adding a detailed provenance/audit trail to any Nextflow pipeline is as easy as enabling the nf-prov plugin by adding the following lines to your nextflow.config file:

plugins {
  id 'nf-prov'
}
prov {
  enabled = true
  formats {
    bco {
      file = 'bco.json'
      overwrite = true
    }
  }
}

This enables automatic generation of BioCompute Objects, which adhere to IEEE standards for pipeline reproducibility.
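
Alongside the plugin, Nextflow's built-in trace and report scopes can record per-task details such as the task hash, container, and the exact script that was run. Here is a minimal sketch of what that could look like in nextflow.config; the pipeline_info/ output directory and the particular choice of trace fields are my assumptions, so adjust them to your setup:

```groovy
// Built-in execution records; no plugin required.
// The 'pipeline_info/' paths below are hypothetical.
trace {
    enabled   = true
    overwrite = true
    file      = 'pipeline_info/trace.txt'
    // One row per task: unique hash, exit status, container image, and the resolved script
    fields    = 'task_id,hash,name,status,exit,container,workdir,script'
}

report {
    enabled   = true
    overwrite = true
    file      = 'pipeline_info/report.html'
}

timeline {
    enabled   = true
    overwrite = true
    file      = 'pipeline_info/timeline.html'
}
```

The trace file is plain tab-separated text, so it is easy to archive next to the BCO file or load into whatever system you use for audit queries.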

With a bit of extra work, you can also collect the tool versions and other details into an easily digestible summary as part of your pipeline summary report. For an example, download the files from some recent sample output using the aws command below and open multiqc-report.html:

aws s3 cp --recursive \
    s3://nf-core-awsmegatests/rnaseq/results-4e34945f6ca86621a08e7d573cd6b4fbb7fb1f0e/aligner_star_rsem/multiqc/star_rsem/ \
    .

This is from nf-core/rnaseq, one of the gold-standard pipelines from the open-source organization nf-core. All nf-core pipelines include similar reports as part of their outputs.
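
To give a feel for the "extra work" involved, one common convention in nf-core modules is for each process to write a small versions.yml file alongside its real outputs. The sketch below illustrates that idea only; the process name SORT_BAM, the output file names, and the use of samtools are hypothetical choices for the example, not something taken from a specific pipeline:

```groovy
// Hypothetical process: sorts a BAM file and records the tool version it used.
process SORT_BAM {
    input:
    path bam

    output:
    path 'sorted.bam',   emit: bam
    path 'versions.yml', emit: versions   // per-process version record

    script:
    """
    samtools sort -o sorted.bam ${bam}

    # Write the process name and the tool version as YAML.
    # The escaped \$ defers evaluation to the shell at run time.
    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        samtools: \$(samtools --version | head -n 1 | sed 's/^samtools //')
    END_VERSIONS
    """
}
```

The versions outputs of all processes can then be combined (for example with the mix and collectFile operators) into a single file that is fed into the summary report.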
