Avoiding clobbers when process input and output file have same name

bskubi · June 27, 2024, 10:11pm

I have a process that ingests a user-provided file and generates a new output file named after a user-provided sample_id tag. However, this can give the input and output files the same name, which causes clobbering. I’m leery of renaming the input file because I don’t know how that would interact with caching and resume.

Is there a convenient way to ensure that an input file won’t clobber the output, perhaps by giving it some sort of permanent temporary filename or something like that?

mribeirodantas · June 27, 2024, 10:30pm

You can use ‘name’ or ‘stageAs’ to have an input named some specific way in the task work directory. For example:

input:
path query_file, name: 'query.fa'

or, using a shorter syntax:

input:
path 'query.fa'

You can read more about this in the Nextflow documentation.

bskubi · June 27, 2024, 11:40pm

That makes sense, thank you!

bskubi · June 27, 2024, 11:42pm

Actually, I just realized my need is slightly more complex. The program I’m calling parses the filename extension to determine if the input is gzipped. I therefore would need to stage the file in a way that takes the given input name into account. For example if the user supplies input.txt.gz, then I’d like to stage it as something like “__temp__input.txt.gz”. Is this possible?

mribeirodantas · June 28, 2024, 12:21pm

Does the snippet below work as a solution to what you have?

process FOO {
  input:
  path ifile

  output:
  path ifile

  script:
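  """
  # Rename the staged symlink so the original name is free for the output
  mv ${ifile} __tmp__${ifile}
  # Placeholder for the real command, which reads the renamed input
  echo ./do_something_with __tmp__${ifile} --output ${ifile}
  # Stand-in for the output file the real command would write
  touch ${ifile}
  """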
}

workflow {
  Channel
    .of(file('foo.txt'))
    | FOO
}

Task directory:

tree work/66/248471eedbbc030568510c47fcc7c5/
work/66/248471eedbbc030568510c47fcc7c5/
├── __tmp__foo.txt -> /Users/mribeirodantas/foo.txt
└── foo.txt

1 directory, 2 files
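
The rename step above can also be tried outside Nextflow. Below is a minimal shell sketch, assuming a throwaway directory and an empty placeholder file called input.txt.gz (both are made up for illustration, and the variable names are hypothetical). It only shows that prefixing the name keeps the extension at the end, so a tool that inspects the extension should still see ".gz".

```shell
# Sketch only: run in a throwaway directory, outside Nextflow.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for the user-supplied file (empty, not real gzip data).
ifile="input.txt.gz"
: > "$ifile"

# Prefix the name; the extension stays at the end.
staged="__tmp__${ifile}"
mv -f -- "$ifile" "$staged"

# A tool that checks the extension still sees ".gz".
case "$staged" in
  *.gz) is_gzipped=yes ;;
  *)    is_gzipped=no ;;
esac

# The original name is now free for the output file.
: > "$ifile"
echo "staged=$staged gzipped=$is_gzipped"
```

Inside a real process the staged input is a symlink, so the mv renames the link and leaves the user's file untouched, as the task directory listing shows.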

mahesh.binzerpanchal · July 2, 2024, 6:01pm

Bash has a -C flag (noclobber) that prevents output redirection with > from overwriting an existing file.

You can do:

process.shell = ['/bin/bash', '-Ceuo', 'pipefail']

to apply it to all processes, or set it in the script block of a single process:

script:
"""
set -C 

commands
"""
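
It may help to see what set -C does and does not cover. The sketch below is an illustration run in a scratch directory, with made-up file names; it assumes a POSIX-style shell such as bash. It suggests that noclobber guards plain > redirection only, so a tool that opens and writes its own output file, or a mv, would not be stopped by it.

```shell
# Sketch only: noclobber guards ">" redirection, nothing else.
scratch=$(mktemp -d)
cd "$scratch"

set -C                      # same as: set -o noclobber
echo first > out.txt        # out.txt does not exist yet, so this succeeds

# A second plain redirection onto the existing file is refused.
if { echo second > out.txt; } 2>/dev/null; then
  redirect_blocked=no
else
  redirect_blocked=yes
fi

# ">|" explicitly overrides noclobber for a single redirection.
echo forced >| out.txt

# Commands that write files themselves are not affected by noclobber.
echo other > other.txt      # new file, so the redirection is allowed
mv -f other.txt out.txt     # replaces out.txt without complaint

set +C                      # restore the default behaviour
```

So this option is most useful when the output is produced through shell redirection; for tools that take an output path as an argument, renaming the staged input is the safer route.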
