Generating a List of File Paths for Parallel Jobs from the Output of Another Process

Hi all,

Quick question that always bothers me. I would like to create a list of strings using a process as follows:

greeting_ch = Channel.fromPath("/scratch/mismatch.csv")
                    .view { csv -> "Before splitCsv: $csv" }
                    .splitCsv()
                    .view { csv -> "After splitCsv: $csv" }

However, this does not work for me. Additionally, the following process doesn’t work either:

process Get_List_From_TXT {
    input:
    path path_file_txt

    output:
    val list_each_line

    script:
    list_each_line = path_file_txt.readLines()
    """
    echo ${path_file_txt}
    """
}

Can anyone give me some recommendation? Thank you so much!

The goal is to get a list of files from a process output (path "list.relative.path.txt"), then use that list for parallel jobs later.

I assume this is from working through the channels section of the Hello Nextflow training material.

Can you elaborate on what the problem is - in what way it’s not working?

I generate a file, list.relative.path.mismatch.txt, in a process. Each line in this file represents a path that I need to process in parallel using another process. However, I’m unsure how to properly utilize this file as input for the next process.

Additionally, I noticed that nf_log does not contain any output from my process, even though the expected files are created correctly in the publishDir.

My Workflow:

Process 1: Generating the List of Paths

process Check_size_manifest {
    tag "Check_size_manifest"
    label 'slurm_cpu_1'

    publishDir "${params.path_folder_output}/${params.id_run}", mode: 'copy', overwrite: true

    input:
    path path_file_json_manifest

    output:
    path "size_check.all.csv", emit: path_csv_size_check_all
    path "size_check.mismatch.csv", emit: path_csv_size_check_mismatch
    path "list.relative.path.mismatch.txt", emit: path_txt_list_relative_path_mismatch

    script:
    path_folder_output_id_run = path_file_json_manifest.getParent().toString()

    """
    module load python/2023-09-0
    python3 $projectDir/modules/download/check_path.py \
    "." \
    "$path_file_json_manifest" \
    "size_check.all.csv" \
    "size_check.mismatch.csv" \
    "list.relative.path.mismatch.txt"
    """
}

Workflow Snippet

workflow {
    path_file_json_manifest = Download_manifest()
    (path_csv_size_check_all, path_csv_size_check_mismatch, path_txt_list_relative_path_mismatch) = Check_size_manifest(path_file_json_manifest)

    path_txt_list_relative_path_mismatch.view { "ZZZ - " + it }
    mismatchPaths = path_txt_list_relative_path_mismatch.splitText().view { "ZZZ - " + it }
}

Questions:

  1. How can I use list.relative.path.mismatch.txt as input for another process, treating each line as a separate item to process in parallel?
  2. What is the best way to properly split this text file and pass each line to a process in Nextflow?
  3. Why does nf_log not contain the expected output, and how can I ensure proper logging of the process execution?

Thank you so much Phil!

I solved the parallel problem by using:

path_relative_path_mismatch = path_txt_list_relative_path_mismatch.splitText().map { it.trim() }.filter { it }

However, I still cannot figure out why path_relative_path_mismatch.view() is not showing anything in nf_log.

Ok, bunch of minor things that I see as I go:

  • tag "Check_size_manifest"
    • Tags are really for task-specific things. As this one is the same as the process name, it will just duplicate what’s shown. Better to leave it blank (defaults to an index count) or use something specific to the input
  • module load python/2023-09-0
    • If you really need environment modules, better to use the module directive (Nextflow has built in support for these)
    • Better still to use docker / singularity / conda :wink:
  • path_folder_output_id_run
    • Unused, I think you can remove this
  • python3 $projectDir/modules/download/check_path.py
  • (path_csv_size_check_all, path_csv_size_check_mismatch, path_txt_list_relative_path_mismatch) = Check_size_manifest(path_file_json_manifest)
    • This looks like weird / incorrect syntax. Just call the process without it returning anything, then access the outputs. See training.
    • Check_size_manifest(path_file_json_manifest)
      process_two(Check_size_manifest.out.path_txt_list_relative_path_mismatch.splitText())
      

I think that it’s the final part that is causing your code to not work as expected.
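A minimal, self-contained sketch of that pattern follows. The processes Make_list and Process_one_path, the file name list.txt, and the emit name path_list are hypothetical stand-ins (they are not from the code above); Make_list plays the role of Check_size_manifest, and the trim/filter step is the clean-up of trailing newlines and blank lines that splitText() otherwise leaves in place.

```nextflow
// Hypothetical stand-in for Check_size_manifest: writes one relative path per line.
process Make_list {
    output:
    path "list.txt", emit: path_list

    script:
    """
    echo "a/file1.txt" > list.txt
    echo "b/file2.txt" >> list.txt
    """
}

// Hypothetical downstream process: one task is launched per channel item.
process Process_one_path {
    tag "${relative_path}"

    input:
    val relative_path

    output:
    stdout

    script:
    """
    echo "Processing ${relative_path}"
    """
}

workflow {
    // Call the process without assigning its result, then use the named output.
    Make_list()

    // splitText() emits one item per line, newline included;
    // trim the newline and drop any empty lines.
    mismatch_paths = Make_list.out.path_list
        .splitText()
        .map { line -> line.trim() }
        .filter { line -> line }

    Process_one_path(mismatch_paths)
    Process_one_path.out.view()
}
```

Because mismatch_paths is a queue channel with one item per line, the tasks of Process_one_path are independent and can run in parallel, subject to the executor's limits.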

Hope that helps!

Hi Phil,

Thanks a lot for the valuable gold standard recommendations! I’ve updated my code accordingly and will follow these instructions for all future projects.

Updated Code

process Check_md5_manifest {
    label 'slurm_cpu_1'
    module 'python/2023-09-0'
    publishDir "${params.path_folder_output}/${params.id_run}/nf_output/check_md5", mode: 'copy', overwrite: true

    input:
    val list_path_relative_path_mismatch_download
    path path_file_json_manifest

    output:
    path "md5_check.all.csv", emit: path_csv_md5_check_all
    path "md5_check.mismatch.csv", emit: path_csv_md5_check_mismatch
    path "list.relative.path.mismatch.md5.txt", emit: path_txt_list_relative_path_mismatch_md5

    script:
    path_folder_output_id_run = "${params.path_folder_output}/${params.id_run}"

    """
    python3 ica_check_md5.py \
    "$path_folder_output_id_run" \
    "$path_file_json_manifest" \
    "md5_check.all.csv" \
    "md5_check.mismatch.csv" \
    "list.relative.path.mismatch.md5.txt"
    """
}

Two Minor Issues

  1. Python script for moduleBinaries
    I have the following structure in my module folder (./modules/projectdata):
[projectdata]$ tree
.
├── main.nf
└── resources
    └── usr
        └── bin
            ├── ica_check_size.py

Additionally, nextflow.enable.moduleBinaries = true is set in nextflow.config. However, the script isn’t being detected properly. I noticed that if I remove python3, Nextflow resolves the absolute path, but the script still fails since .py files require the python3 prefix. The error message I get is:

python3: can't open file 'ica_check_md5.py': [Errno 2] No such file or directory

or if I remove python3

.command.sh: line 2: /nf_test/modules/projectdata/resources/usr/bin/ica_check_md5.py: Permission denied

  2. Behavior of channel assignment in process calls
    I’ve been using this approach to capture process outputs:
(path_csv_size_check_all, path_csv_size_check_mismatch, path_txt_list_relative_path_mismatch) = Check_size_manifest(path_file_json_manifest)

This has worked consistently for me, but I want to confirm whether this behavior is officially supported or if I’ve just been lucky. If it’s not intended behavior, I’ll switch to using:

Check_size_manifest.out.path_txt_list_relative_path_mismatch

in combination with emit: path_txt_list_relative_path_mismatch in the process definition.

Thank you so much!

I fixed the Python script for moduleBinaries issue by:

  1. Making the .py file executable with chmod +x.
  2. Adding #!/usr/bin/env python3 at the beginning of the script.
  3. Removing python3 before calling ica_check_md5.py.

However, I’d still appreciate your guidance on whether that channel assignment pattern for process calls is legitimate.

Great! Glad you got it working.

I think that the channel assignment is fine, I’m just not used to seeing it written that way. Personally I find the other approach easier to read as there is no intermediate variable that you need to assign, instead you just access the outputs directly. But up to you :slight_smile:
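For comparison, here is a rough sketch of the two styles side by side. The process Make_reports and the emit names csv_all and csv_mismatch are hypothetical and not taken from the code in this thread. The positional form appears only as a comment, because a process can be invoked only once in a workflow unless it is included under an alias.

```nextflow
// Hypothetical process with two named outputs.
process Make_reports {
    output:
    path "all.csv", emit: csv_all
    path "mismatch.csv", emit: csv_mismatch

    script:
    """
    echo "a,1" > all.csv
    echo "b,2" > mismatch.csv
    """
}

workflow {
    // Style 1: call the process, then read outputs by their emit names.
    Make_reports()
    Make_reports.out.csv_all.view { f -> "all: ${f.name}" }
    Make_reports.out.csv_mismatch.view { f -> "mismatch: ${f.name}" }

    // Style 2 (alternative to the call above, not in addition to it):
    // positional assignment, which relies on the order of the output block.
    // (ch_all, ch_mismatch) = Make_reports()
}
```

The named form keeps working if the order of the output declarations changes, whereas the positional form silently depends on that order, which is one reason to prefer the named form for readability.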

Phil