Generating a List of File Paths for Parallel Jobs from the Output of Another Process

EZUY · March 20, 2025, 4:28pm

Hi all,

Quick question that always bothers me. I would like to create a list of strings using a process as follows:

greeting_ch = Channel.fromPath("/scratch/mismatch.csv")
                    .view { csv -> "Before splitCsv: $csv" }
                    .splitCsv()
                    .view { csv -> "After splitCsv: $csv" }

However, this does not work for me. Additionally, the following process doesn’t work either:

process Get_List_From_TXT {
    input:
    path path_file_txt

    output:
    val list_each_line

    script:
    list_each_line = path_file_txt.readLines()
    """
    echo ${path_file_txt}
    """
}

Can anyone give me some recommendation? Thank you so much!

The goal is to get a list of files in the output (path "list.relative.path.txt"), then use the list for parallel jobs later.

ewels · March 20, 2025, 4:37pm

I assume this is when working through the Hello Nextflow training material, channels section.

Can you elaborate on what the problem is - in what way it’s not working?

EZUY · March 20, 2025, 4:59pm

I generate a file, list.relative.path.mismatch.txt, in a process. Each line in this file represents a path that I need to process in parallel using another process. However, I’m unsure how to properly utilize this file as input for the next process.

Additionally, I noticed that nf_log does not contain any output from my process, even though the expected files are created correctly in the publishDir.

My Workflow:

Process 1: Generating the List of Paths

process Check_size_manifest {
    tag "Check_size_manifest"
    label 'slurm_cpu_1'

    publishDir "${params.path_folder_output}/${params.id_run}", mode: 'copy', overwrite: true

    input:
    path path_file_json_manifest

    output:
    path "size_check.all.csv", emit: path_csv_size_check_all
    path "size_check.mismatch.csv", emit: path_csv_size_check_mismatch
    path "list.relative.path.mismatch.txt", emit: path_txt_list_relative_path_mismatch

    script:
    path_folder_output_id_run = path_file_json_manifest.getParent().toString()

    """
    module load python/2023-09-0
    python3 $projectDir/modules/download/check_path.py \
    "." \
    "$path_file_json_manifest" \
    "size_check.all.csv" \
    "size_check.mismatch.csv" \
    "list.relative.path.mismatch.txt"
    """
}

Workflow Snippet

workflow {
    path_file_json_manifest = Download_manifest()
    (path_csv_size_check_all, path_csv_size_check_mismatch, path_txt_list_relative_path_mismatch) = Check_size_manifest(path_file_json_manifest)

    path_txt_list_relative_path_mismatch.view { "ZZZ - " + it }
    mismatchPaths = path_txt_list_relative_path_mismatch.splitText().view { "ZZZ - " + it }
}

Questions:

How can I use list.relative.path.mismatch.txt as input for another process, treating each line as a separate item to process in parallel?
What is the best way to properly split this text file and pass each line to a process in Nextflow?
Why does nf_log not contain the expected output, and how can I ensure proper logging of the process execution?

Thank you so much Phil!

EZUY · March 20, 2025, 7:42pm

I solved the parallel problem by using:

path_relative_path_mismatch = path_txt_list_relative_path_mismatch.splitText().map { it.trim() }.filter { it }

However, I still cannot figure out why path_relative_path_mismatch.view() is not showing anything in nf_log.
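
For anyone landing here later, a minimal sketch of how the split channel might feed a downstream process, one task per line. The process name Process_one_path and the parameter params.list_file are hypothetical stand-ins (in the real pipeline the channel would come from Check_size_manifest), so treat this as an illustration rather than the actual pipeline code:

```nextflow
// Hypothetical parameter: path to a text file with one relative path per line.
params.list_file = 'list.relative.path.mismatch.txt'

// Hypothetical downstream process: one task is launched per channel item.
process Process_one_path {
    tag "${relative_path}"

    input:
    val relative_path

    output:
    stdout

    script:
    """
    echo "Processing ${relative_path}"
    """
}

workflow {
    // Stand-in for the list file emitted by the upstream process.
    list_ch = Channel.fromPath(params.list_file)

    paths_ch = list_ch
        .splitText()                    // one item per line, newline included
        .map { line -> line.trim() }    // strip the trailing newline
        .filter { line -> line }        // drop empty lines

    // Each remaining line becomes a separate task.
    Process_one_path(paths_ch).view()
}
```

Because each line is emitted as its own channel item, Nextflow schedules the tasks independently, which is what gives the parallelism.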

ewels · March 20, 2025, 8:17pm

Ok, bunch of minor things that I see as I go:

tag "Check_size_manifest"
- Tags are for task-specific things really. As it’s the same as the process name, this will just duplicate what’s shown. Better to leave it blank (defaults to an index count) or use something specific to the input
module load python/2023-09-0
- If you really need environment modules, better to use the module directive (Nextflow has built in support for these)
- Better still to use docker / singularity / conda
path_folder_output_id_run
- Unused, I think you can remove this
python3 $projectDir/modules/download/check_path.py
- Better to use module binaries or the pipeline bin directory.
(path_csv_size_check_all, path_csv_size_check_mismatch, path_txt_list_relative_path_mismatch) = Check_size_manifest(path_file_json_manifest)
- This looks like weird / incorrect syntax. Just call the process without it returning anything, then access the outputs. See training.
```
Check_size_manifest(path_file_json_manifest)
process_two(Check_size_manifest.out.path_txt_list_relative_path_mismatch.splitText())
```

I think that it’s the final part that is causing your code to not work as expected.
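
The directive suggestions above could be sketched roughly as follows. This assumes check_path.py has been moved to the pipeline bin directory and made executable, and the tag value (the manifest file name) is only one possible input-specific choice:

```nextflow
process Check_size_manifest {
    // Tag with something input-specific instead of repeating the process name.
    tag "${path_file_json_manifest.name}"
    label 'slurm_cpu_1'

    // Load the environment module via the directive rather than
    // calling `module load` inside the script block.
    module 'python/2023-09-0'

    input:
    path path_file_json_manifest

    output:
    path "size_check.all.csv", emit: path_csv_size_check_all
    path "size_check.mismatch.csv", emit: path_csv_size_check_mismatch
    path "list.relative.path.mismatch.txt", emit: path_txt_list_relative_path_mismatch

    script:
    // Assumption: check_path.py lives in the pipeline bin/ directory and is
    // executable, so it can be called by name without $projectDir.
    """
    check_path.py \
    "." \
    "$path_file_json_manifest" \
    "size_check.all.csv" \
    "size_check.mismatch.csv" \
    "list.relative.path.mismatch.txt"
    """
}
```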

Hope that helps!

EZUY · March 21, 2025, 2:40pm

Hi Phil,

Thanks a lot for the valuable gold standard recommendations! I’ve updated my code accordingly and will follow these instructions for all future projects.

Updated Code

process Check_md5_manifest {
    label 'slurm_cpu_1'
    module 'python/2023-09-0'
    publishDir "${params.path_folder_output}/${params.id_run}/nf_output/check_md5", mode: 'copy', overwrite: true

    input:
    val list_path_relative_path_mismatch_download
    path path_file_json_manifest

    output:
    path "md5_check.all.csv", emit: path_csv_md5_check_all
    path "md5_check.mismatch.csv", emit: path_csv_md5_check_mismatch
    path "list.relative.path.mismatch.md5.txt", emit: path_txt_list_relative_path_mismatch_md5

    script:
    path_folder_output_id_run = "${params.path_folder_output}/${params.id_run}"

    """
    python3 ica_check_md5.py \
    "$path_folder_output_id_run" \
    "$path_file_json_manifest" \
    "md5_check.all.csv" \
    "md5_check.mismatch.csv" \
    "list.relative.path.mismatch.md5.txt"
    """
}

Two Minor Issues

Python script for moduleBinaries
I have the following structure in my module folder (./modules/projectdata):

[projectdata]$ tree
.
├── main.nf
└── resources
    └── usr
        └── bin
            ├── ica_check_size.py

Additionally, nextflow.enable.moduleBinaries = true is set in nextflow.config. However, the script isn’t being detected properly. I noticed that if I remove python3, Nextflow resolves the absolute path, but the script still fails since .py files require the python3 prefix. The error message I get is:

python3: can't open file 'ica_check_md5.py': [Errno 2] No such file or directory

Or, if I remove python3:

.command.sh: line 2: /nf_test/modules/projectdata/resources/usr/bin/ica_check_md5.py: Permission denied

Behavior of channel assignment in process calls
I’ve been using this approach to capture process outputs:

(path_csv_size_check_all, path_csv_size_check_mismatch, path_txt_list_relative_path_mismatch) = Check_size_manifest(path_file_json_manifest)

This has worked consistently for me, but I want to confirm whether this behavior is officially supported or if I’ve just been lucky. If it’s not intended behavior, I’ll switch to using:

Check_size_manifest.out.path_txt_list_relative_path_mismatch

in combination with emit: path_txt_list_relative_path_mismatch in the process definition.

Thank you so much!

EZUY · March 21, 2025, 2:54pm

I fixed the Python script for moduleBinaries issue by:

Making the .py file executable with chmod +x.
Adding #!/usr/bin/env python3 at the beginning of the script.
Removing python3 before calling ica_check_md5.py.

However, I’d still appreciate your guidance on the legitimacy of behavior in channel assignment for process calls.

ewels · March 22, 2025, 10:13am

Great! Glad you got it working.

I think that the channel assignment is fine; I’m just not used to seeing it written that way. Personally, I find the other approach easier to read, as there is no intermediate variable that you need to assign; instead you just access the outputs directly. But up to you.
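
For comparison, a small sketch of the two styles side by side. The process Make_files and its outputs are hypothetical and exist only to illustrate the syntax:

```nextflow
// Hypothetical process with two named outputs, for illustration only.
process Make_files {
    output:
    path "a.txt", emit: file_a
    path "b.txt", emit: file_b

    script:
    """
    echo a > a.txt
    echo b > b.txt
    """
}

workflow {
    // Style 1: assign the returned outputs to variables, in declaration order.
    (ch_a, ch_b) = Make_files()
    ch_a.view { f -> "assigned variable: ${f.name}" }

    // Style 2: access a named output directly via .out and its emit name.
    Make_files.out.file_b.view { f -> "via .out: ${f.name}" }
}
```

Style 1 depends on the order of the output declarations, while style 2 depends only on the emit names, so it should keep working if outputs are reordered.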

Phil
