Steps for reproducible example for broken Nextflow cache, after successful retries

This is linked to the Slack message here: Slack
and probably linked to this old/closed post: https://community.seqera.io/t/resume-not-loading-retries-from-cache/892

My initial message was:

A pipeline step succeeds (but only on a retry attempt), the pipeline continues and fails on a later step, and then on resume the retry-succeeded step is not picked up from the cache and runs again? :/

I figured out that this isn’t always the case; you also need to have attempted to run that step at least once before (i.e., a work folder for that module must already exist from a previous run).

Here are the steps to reproduce the cache failure:

  1. Clone proteinfamilies.
  2. Alter conf/test.config to make the step fail on the 1st attempt but pass on the 2nd:
   withName: 'NFCORE_PROTEINFAMILIES:PROTEINFAMILIES:FAA_SEQFU_SEQKIT:SEQKIT_SEQ' {
       memory = { task.attempt == 1 ? 1.MB : 2.GB } // memory = { 1.4.GB * task.attempt }
       time   = { 2.m * task.attempt }
   }
       
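For context, a sketch of where this override sits in conf/test.config: it belongs inside the process scope, and the retry only happens if the process has a retry error strategy. nf-core pipelines typically set errorStrategy = 'retry' for OOM exit codes in conf/base.config; the explicit errorStrategy/maxRetries lines below are an assumption, added in case that default is absent:

```groovy
// Sketch (assumed placement): process scope of conf/test.config.
process {
    withName: 'NFCORE_PROTEINFAMILIES:PROTEINFAMILIES:FAA_SEQFU_SEQKIT:SEQKIT_SEQ' {
        memory        = { task.attempt == 1 ? 1.MB : 2.GB } // fail on attempt 1, pass on attempt 2
        time          = { 2.m * task.attempt }
        errorStrategy = 'retry' // assumption: make sure a 2nd attempt actually happens
        maxRetries    = 1
    }
}
```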
  3. Your executor has to be able to kill a process when it uses more memory than what was defined (or use Docker), so I used Slurm with this slurm.config:
profiles {
   slurm {
       executor {
           name              = "slurm"
           queueSize         = 100
           queueGlobalStatus = true

       }
       workDir = "/path/to/work_proteinfamilies_cache_test/"
       process {
           queue  = 'standard'
           cache  = 'lenient'
       }
       params {
           // Boilerplate options
           outdir = "${launchDir}/results"
       }
   }
}
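If you don't have a Slurm cluster, a local Docker profile should behave the same way, since Docker enforces the container memory limit and the over-limit attempt gets killed (exit code 137) and retried. This is a hedged sketch, not tested with this pipeline:

```groovy
// Assumed alternative profile: local execution with Docker enforcing memory limits.
profiles {
    docker_local {
        docker.enabled = true
        process {
            cache = 'lenient'
        }
        params {
            outdir = "${launchDir}/results"
        }
    }
}
```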
  4. NXF_VER=25.10.4 nextflow run proteinfamilies -c conf/slurm.config -profile singularity,test,slurm -resume

  5. As soon as the work folder for SEQKIT_SEQ is created, kill the pipeline (Ctrl+C). (This step is important: if the step passes on its 2nd attempt without having failed at least once in a previous run, the cache works properly.)

  6. Resume the pipeline with the same command and wait for SEQKIT_SEQ to complete successfully on its 2nd attempt.

  7. Kill the pipeline again.

  8. Resume the pipeline with the same command. Verdict: SEQKIT_SEQ starts running again, while SEQFU_STATS_BEFORE, which ran in parallel and completed on its first attempt, is cached properly.
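To confirm which task attempts Nextflow recorded for each run, the task-level log is useful. A sketch (the `last` run name and the field list are standard `nextflow log` options, but verify them against your Nextflow version):

```shell
# List past runs, then dump cache-relevant per-task fields for the latest one.
nextflow log
nextflow log last -f name,attempt,status,hash,workdir | grep SEQKIT_SEQ
```

On the final resume, the SEQKIT_SEQ hash from the successful 2nd attempt should match a cached entry but doesn't, which is the bug being reproduced.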

I fed these steps to the bots, and they suggested a potential fix.

I am leaving the relevant PR here: bot attempt to figure the bug, based on reproducible example by vagkaratzas · Pull Request #6882 · nextflow-io/nextflow · GitHub