Show cached tasks in nextflow run preview

Hi!

Is it possible to know in advance which tasks are going to be resumed from cache?

The output of nextflow run -resume -preview does not show this information, and I have not found a way to do it.

I am using version 24.04.4.

Thanks in advance!

Hello @Gullumluvl.

Unfortunately, it’s not possible to know precisely which tasks will be resumed without actually running the pipeline with resume enabled (e.g. -resume). The -preview option makes Nextflow run the workflow script while skipping the execution of all processes. Because execution is skipped, Nextflow never reaches the stage where it checks whether a task must run or can be restored from the cache.

Thanks.

Wouldn’t it be a useful feature?

I have been very enthusiastic about Nextflow since I started 3 months ago. However, at the moment I have one big problem with it: how to deal with the caching mechanism. I end up involuntarily restarting many time-consuming tasks because of this.

More precisely:

  1. I think a less strict criterion for resuming would be useful. Possible solutions:
    • a configuration option to loosen it, like cache = "filenames+timestamp"
    • a way to prevent relaunching processes with a specific name.
  2. In the absence of the above point, a way to predict caching/resuming without executing any process. The -preview option seems like the most natural option to enhance with this.

From my recent experience, here are examples of modifications to processes that prevented resuming:

  • removing one of the files from the output. E.g. from:

    output:
    tuple path('out.ext1'), path('out.ext2')
    

    to:

    output:
    path('out.ext1')
    
  • adding input/output vals that don’t impact the executed command, for example to pass metadata.

  • adding line breaks or other whitespace, or reordering arguments in a way that should not change the result of the script block. This might be impossible to detect programmatically, which is why a looser criterion for resuming would be nice.

  • many cases where I don’t have an explanation. For example, I currently have this independent block in a workflow which got entirely restarted (222 long-running processes):

    Channel.fromPath('data/raw/*.bam') | bam_coverage
    

    I cannot tell if it’s because I changed the maxForks directive of the process, or if it’s because I updated one of nextflow.config or the other config and param files I set up. I don’t remember changing anything more closely related to this process but I might be wrong.

I’d like to know what you and other developers at Nextflow think. Actually I would imagine a similar request has already been made.

a way to prevent relaunching processes with a specific name.

You can use the cache process directive to turn off caching when resuming pipelines.

process ProcessIDontWantToUseCache {
  cache false
  ...
}

Also, using process selectors, you can expand this to a group of processes, processes with a specific label, specific name, and so on. For example, add this to a configuration file:

process {
    withLabel: noCache {
        cache = false
    }
}

Now, all processes with the label noCache will never use the cache.

process SomeProcessInMyPipeline {
  label 'noCache'

  script:
  """
  your_command --here
  """
}

I am a bit confused: this would cause the task to re-execute, right? What if I want the opposite, i.e., “force resuming”?

I can use the storeDir directive, but it’s not ideal either, because it does not check timestamps.

Sorry, I read it as “prevent resuming”, but you can still use the same approach to answer your question by inverting it.

My example showed how to resume most tasks but NOT those with a given label (noCache). You could instead turn off the cache for all tasks (process.cache = false) and keep it enabled only for those with a specific label (withCache). Of course, you still need -resume or resume = true.
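The inversion described above could look roughly like the configuration fragment below. This is only a sketch: the label name withCache is the illustrative one mentioned in this reply, and it assumes that a withLabel selector overrides the generic process.cache setting, as selectors normally do for other directives.

```groovy
// nextflow.config (sketch)
process {
    // Disable caching for every process by default
    cache = false

    // Re-enable caching only for processes carrying the (illustrative) label 'withCache'
    withLabel: withCache {
        cache = true
    }
}
```

Processes would then opt in with label 'withCache' in their definition, and the pipeline still has to be launched with -resume (or resume = true) for any cached task to be reused.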

I can use the storeDir directive, but it’s not ideal either, because it does not check timestamps.

You’re correct, but depending on what you want to do, you could add a timestamp to the filename. Then, by checking whether the file is there, you’d automatically be checking the timestamp too.

I see, but my point is that the conditions to meet for resuming a previous task are too strict for my use case. Filename and timestamp would be enough, because while developing a workflow, everything else in the code might change anytime, despite the results still being valid.

For example, the doc says

Changing any of the inputs included in the task hash will invalidate the cache, for example:

  • Resuming from a different session ID
  • Changing the process name
  • Changing the task container image or Conda environment
  • Changing the task script
  • Changing an input file or bundled script used by the task

While the following examples would not invalidate the cache:

  • Changing the value of a directive (other than ext), even if that directive is used in the task script

And most of these modifications could just be code “cleanup” (renaming a process, changing input names, etc.).

Sorry, I have strayed a bit from the original subject, but I am curious whether others feel the need for less strict caching (i.e. resuming more easily). Meanwhile, predicting what will be resumed would be useful.

I’d recommend you open an issue on the Nextflow repo :smiley:. That is a better place to gather support from other users for a new feature.
