Set up publishDir in more than one process

Hi all, I'm stuck on understanding how publishDir works across multiple processes. I wrote two processes and the workflow did the job for me. However, I'm confused about how more than one process can deal with the same publishDir path. Here is my script; it runs with no issue, but the output structure is not what I want.

#!/usr/bin/env nextflow

process GEF2GEM {
    publishDir "bin20/${chip_ID}/", mode: 'copy'

    input:
    tuple val(chip_ID), path(tif), path(gef)

    output:
    tuple val(chip_ID), path(tif), path("${chip_ID}.tissue.gem.gz"), emit: gem

    script:
    """
    /stomics_data/sif_files/saw-8.1.2/bin/saw convert gef2gem --gef ${gef} --bin-size 1 --gem ${chip_ID}.tissue.gem
    tar -czvf ${chip_ID}.tissue.gem.gz ${chip_ID}.tissue.gem
    rm ${chip_ID}.tissue.gem
    """
}

process GEM2RDS {
    publishDir "bin20/${chip_ID}/", mode: 'copy'
    //publishDir "./", mode: 'copy'
    input:
    tuple val(chip_ID), path(tif), path(gem)

    output:
    path "bin20/${chip_ID}/", emit: output_folder

    script:
    """
    export NUMBA_CACHE_DIR=/tmp/numba_cache
    mkdir -p /tmp/numba_cache && chmod 777 /tmp/numba_cache
    addImage.sh -t ${tif} -i ${gem} -H 6 -l 2 -d ${chip_ID} -b 20 -o bin20/${chip_ID}/
    """
}
params.input_csv = "sampleInfo.csv"
workflow {
    file_ch = channel.fromPath(params.input_csv)
                     .splitCsv(header:true)
                     .take(1)
                     .map { row -> [row.chip_id, file(row.regist_tif), file(row.gef_file)] }
                     .view()
    GEF2GEM(file_ch)
    GEM2RDS(GEF2GEM.out.gem)
}

My problem is:

  1. When I only run the first process GEF2GEM, it copies both the tif and ${chip_ID}.tissue.gem.gz to the folder bin20/${chip_ID}/. Is that because I defined the tif and gem.gz files in the output block?
$ tree bin20/  # output of running only the first process GEF2GEM
bin20/
└── A04539C2
    ├── A04539C2_HE_regist.tif
    └── A04539C2.tissue.gem.gz

1 directory, 2 files
  2. When I run the whole script together, the output structure is shown below. Why was bin20/${chip_ID}/ created twice? The bash script in process GEM2RDS was originally a docker run, which takes -o to specify the output path; this path must exist before running. There is no way the script block created bin20/${chip_ID}/ under bin20/${chip_ID}/.
$ tree -L 4 bin20  # output when `publishDir "bin20/${chip_ID}/", mode: 'copy'` in process GEM2RDS
bin20
└── A04539C2
    ├── A04539C2_HE_regist.tif
    ├── A04539C2.tissue.gem.gz
    └── bin20
        └── A04539C2
            └── addimage

4 directories, 2 files
  3. If I change the publishDir in the GEM2RDS process to publishDir "./", mode: 'copy', it does save the output to bin20/${chip_ID}/. I guess GEM2RDS uses the bin20/${chip_ID}/ path generated by the GEF2GEM process? Otherwise the script block should throw an error like "path bin20/${chip_ID}/ not found". I remember that the copy step for outputs is sometimes delayed, so it's better not to rely on this path. Does that mean it is not recommended to write the code my way? It also deletes the tif and gem.gz files from process GEF2GEM; how can I avoid this?
$ tree -L 2 bin20/  # output when `publishDir "./", mode: 'copy'` is set; this is the structure I want, but it deletes the tif and gem.gz files from the first process
bin20/
└── A04539C2
    └── addimage

2 directories, 0 files

My ideal output is the structure in point 3, but still keeping the outputs from process GEF2GEM.
Could anyone please help me figure out what is wrong with my understanding of setting up publishDir in this case?

Thank you so so much!!

Best,
LC

The reason the second example is bin/ID/bin/ID is that the output: path is bin/ID in the second process, and then you also tell Nextflow with publishDir that you want this output in bin/ID; concatenating them gives bin/ID/bin/ID.

The final output folder will always be publishDir_path/output_path. Ideally, make sure that publishDir_path doesn't overlap with another process's output_path.

You can move files in the working directory so the output_path is only the files.

script:
"""
mkdir -p bin/chipID
touch bin/chipID/file{1,2}.txt
mv bin/chipID/*.txt .
"""

output:
path "*.txt", emit: txt // Only captures the txts in the base of the task directory and the folder names don't become part of the output_path.
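Applied to the GEM2RDS process from the question, a sketch could look like the following. This assumes (based on the tree output above) that addImage.sh writes an addimage/ folder into the directory given by -o; the tool is write to a neutral local directory, the result is moved to the task dir, and only that folder is captured, so publishDir places it next to the files published by GEF2GEM without the nested bin20/${chip_ID}/bin20/${chip_ID}.

process GEM2RDS {
    publishDir "bin20/${chip_ID}/", mode: 'copy'

    input:
    tuple val(chip_ID), path(tif), path(gem)

    output:
    // Capture only the result folder, so the bin20/${chip_ID} prefix
    // does not become part of the output_path
    path "addimage", emit: addimage

    script:
    """
    mkdir -p out
    addImage.sh -t ${tif} -i ${gem} -H 6 -l 2 -d ${chip_ID} -b 20 -o out/
    mv out/addimage .
    """
}

With this, both processes publish into bin20/${chip_ID}/, and the tif, gem.gz, and addimage/ end up side by side.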

Does that help?


when I only run the first process GEF2GEM , it copied both the tif and ${chip_ID}.tissue.gem.gz to the folder bin20/${chip_ID}/ . Because I defined the tif and gem.gz file in the output block, is that correct?

Exactly. If you want a subset of your outputs to be published, you can use the option pattern, e.g.

process {
    withName: GEF2GEM {
        publishDir = [
                path: { "${params.outdir}/my_folder" },
                mode: params.publish_dir_mode,
                pattern: '*.gz',
        ]
    }
}
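The same filtering also works with the publishDir directive written inline in the process, without touching the config file. A sketch using the paths from the question:

process GEF2GEM {
    // publish only the compressed gem, not the tif
    publishDir "bin20/${chip_ID}/", mode: 'copy', pattern: '*.gz'

    input:
    tuple val(chip_ID), path(tif), path(gef)

    // ... output and script blocks as in the original process
}

The tif still flows downstream through the output channel; pattern only controls what gets copied to the publish directory.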

If I change the publishDir in GEM2RDS process to be publishDir "./", mode: 'copy' , it does save the output to the bin20/${chip_ID}/

I wouldn’t go this way. You never know where someone will run your pipeline from, or where it’s stored. It’s safer to have a dedicated publishDir folder and store your outputs in there.

My ideal output is this the output in point 3, but still keeping the outputs from process GEF2GEM.

If having the intermediate outputs in the work dir works for you, then forget publishDir for those. We’re usually interested in the final outputs, the real outputs of our analysis, and we add the publishDir directive to the processes nearing the end of the pipeline run. If you want to keep some intermediate file in the publishDir folder, fine, but I usually just leave them in the work dir, where they’re automatically stored.
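A minimal, self-contained sketch of that advice, with publishDir only on the final process (the script bodies here are placeholders, not the real tools from the question):

process INTERMEDIATE {
    // no publishDir: the gem.gz stays in the work dir,
    // but is still reachable downstream via the channel
    input:
    val id

    output:
    tuple val(id), path("${id}.gem.gz"), emit: gem

    script:
    """
    echo data | gzip > ${id}.gem.gz
    """
}

process FINAL {
    publishDir "results/${id}", mode: 'copy'

    input:
    tuple val(id), path(gz)

    output:
    path "report.txt"

    script:
    """
    zcat ${gz} > report.txt
    """
}

workflow {
    INTERMEDIATE(channel.of('A04539C2'))
    FINAL(INTERMEDIATE.out.gem)
}

Only results/A04539C2/report.txt is published; the intermediate gem.gz lives under work/ and can be recovered from there if needed.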


Thank you so much! That helps a lot. I thought setting publishDir and the output block to the same path would end up using a single path, instead of concatenating both. And my script by default creates the path specified by -o if it doesn't exist. Now everything is clear. I need to keep in mind that Nextflow's working dir is always work/. Thanks a lot!!


Thank you for your step-by-step confirmation and explanation. Really appreciate it. Thanks for sharing your suggestions and useful tips; they deepened my understanding of how Nextflow works.
