Hi all, I am stuck on understanding how `publishDir` works across multiple processes. I wrote two processes and the workflow runs, but I am confused about what happens when more than one process publishes to the same `publishDir` path. Here is my script. It runs without errors, but the output structure is not what I want.
#!/usr/bin/env nextflow
process GEF2GEM {
    publishDir "bin20/${chip_ID}/", mode: 'copy'

    input:
    tuple val(chip_ID), path(tif), path(gef)

    output:
    tuple val(chip_ID), path(tif), path("${chip_ID}.tissue.gem.gz"), emit: gem

    script:
    """
    /stomics_data/sif_files/saw-8.1.2/bin/saw convert gef2gem --gef ${gef} --bin-size 1 --gem ${chip_ID}.tissue.gem
    tar -czvf ${chip_ID}.tissue.gem.gz ${chip_ID}.tissue.gem
    rm ${chip_ID}.tissue.gem
    """
}

process GEM2RDS {
    publishDir "bin20/${chip_ID}/", mode: 'copy'
    //publishDir "./", mode: 'copy'

    input:
    tuple val(chip_ID), path(tif), path(gem)

    output:
    path "bin20/${chip_ID}/", emit: output_folder

    script:
    """
    export NUMBA_CACHE_DIR=/tmp/numba_cache
    mkdir -p /tmp/numba_cache && chmod 777 /tmp/numba_cache
    addImage.sh -t ${tif} -i ${gem} -H 6 -l 2 -d ${chip_ID} -b 20 -o bin20/${chip_ID}/
    """
}

params.input_csv = "sampleInfo.csv"

workflow {
    file_ch = channel.fromPath(params.input_csv)
        .splitCsv(header: true)
        .take(1)
        .map { row -> [row.chip_id, file(row.regist_tif), file(row.gef_file)] }
        .view()

    GEF2GEM(file_ch)
    GEM2RDS(GEF2GEM.out.gem)
}
My questions are:
- When I run only the first process, `GEF2GEM`, it copies both the `tif` and `${chip_ID}.tissue.gem.gz` to the folder `bin20/${chip_ID}/`. Is that because I declared both the tif and the gem.gz file in the output block?
$ tree bin20/   # output of running first process GEF2GEM
bin20/
└── A04539C2
    ├── A04539C2_HE_regist.tif
    └── A04539C2.tissue.gem.gz
1 directory, 2 files
- When I run the whole script, the output structure is as shown below. Why is `bin20/${chip_ID}/` created twice? The bash script in process GEM2RDS was originally a docker run command, which takes `-o` to specify the output path, and this path must already exist before the command runs. As far as I can tell, the script block itself cannot create `bin20/${chip_ID}/` under `bin20/${chip_ID}/`.
$ tree -L 4 bin20   # output when `publishDir "bin20/${chip_ID}/", mode: 'copy'` in process GEM2RDS
bin20
└── A04539C2
    ├── A04539C2_HE_regist.tif
    ├── A04539C2.tissue.gem.gz
    └── bin20
        └── A04539C2
            └── addimage
4 directories, 2 files
- If I change the `publishDir` in the GEM2RDS process to `publishDir "./", mode: 'copy'`, the output is saved to `bin20/${chip_ID}/`. I guess GEM2RDS uses the `bin20/${chip_ID}/` path generated by the GEF2GEM process? Otherwise, the script block should throw an error such as "path `bin20/${chip_ID}/` not found". As far as I remember, the copy step for published outputs is sometimes delayed, so it is better not to rely on this path. Does that mean writing the code this way is not recommended? This setting also deletes the `tif` and `gem.gz` files published by process GEF2GEM. How can I avoid this?
$ tree -L 2 bin20/   # output with `publishDir "./", mode: 'copy'`; this is the output structure I want, but it deletes the tif and gem.gz files from the first process
bin20/
└── A04539C2
    └── addimage
2 directories, 0 files
My ideal output is the structure shown in the third point, but with the outputs from process GEF2GEM kept as well.
Could anyone please help me figure out what is wrong with my understanding of how `publishDir` works in this case?
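For concreteness, one possible variant of GEM2RDS is sketched below. It is untested and rests on two assumptions: that `addImage.sh` creates the `addimage` folder inside whatever path is passed to `-o` (as the tree output above suggests), and that `publishDir` keeps an output's path relative to the task work directory when it copies. If both hold, declaring only `addimage` as the output should place it next to the files published by GEF2GEM instead of nesting or replacing the sample folder.

```nextflow
process GEM2RDS {
    // Publish into the same sample folder that GEF2GEM publishes to
    publishDir "bin20/${chip_ID}/", mode: 'copy'

    input:
    tuple val(chip_ID), path(tif), path(gem)

    output:
    // Declare only the result folder, located at the top level of the task work directory
    path "addimage", emit: output_folder

    script:
    """
    export NUMBA_CACHE_DIR=/tmp/numba_cache
    mkdir -p /tmp/numba_cache && chmod 777 /tmp/numba_cache
    # Assumption: addImage.sh writes an 'addimage' folder inside the -o path
    addImage.sh -t ${tif} -i ${gem} -H 6 -l 2 -d ${chip_ID} -b 20 -o ./
    """
}
```

The idea behind this sketch is that the script writes into the task work directory (which always exists), so it does not depend on the published `bin20/${chip_ID}/` folder being present when the task runs.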
Thank you so so much!!
Best,
LC