Using Reference File Name to Emit Reference Path

MTDouglas · June 24, 2024, 1:18pm

Hi All,

I’ve created a process that uses a Python script to iterate over a directory of per-sample results and determine the best reference fasta for that sample for downstream analysis. The script’s final output is the fasta name. The inputs to the process are the sample results and a path to the candidate fastas. The goal is for the process to emit only the fasta named by the Python script. I think I’m missing how to take the text printed by the Python script and have Nextflow treat it as the file name of the output path. Below is my code and output. Thanks!

process GETBESTREF {
    tag "$meta.id"
    label 'process_medium'

    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/pandas%3A1.5.2':
        'quay.io/biocontainers/pandas%3A1.5.2' }"

    input:
    tuple val(meta) , path(results)
    path(fastas)

    output:
    tuple val(meta), path("\${best_ref_path}") , emit: sample_ref
    path "versions.yml"           , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"

    """
    #ls
    best_ref_path="\$(find_best_ref.py $results)"
    echo "\${best_ref_path}"


    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        getbestref: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//' ))
    END_VERSIONS
    """
}

ERROR ~ Error executing process > 'NFCORE_HEPATITISCDENOVO:HEPATITISCDENOVO:GETBESTREF (Sample_1)'

Caused by:
  Missing output file(s) `${best_ref_path}` expected by process `NFCORE_HEPATITISCDENOVO:HEPATITISCDENOVO:GETBESTREF (Sample_1)`

Command executed:

  #ls
  best_ref_path="$(find_best_ref.py Sample_1)"
  echo "${best_ref_path}"
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_HEPATITISCDENOVO:HEPATITISCDENOVO:GETBESTREF":
      getbestref: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//' ))
  END_VERSIONS

Command exit status:
  0

Command output:
  D17763.fasta

Work dir:
 /nf-core-hepatitiscdenovo/work/1b/b0e4a0e47332954a851dc79c5204eb

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

Adam_Talbot · June 25, 2024, 3:45pm

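This means "find a file at the path \${best_ref_path}" (taken literally, not interpolated), and no such file exists. The variable is set in Bash inside the task, so Nextflow never sees its value when it resolves the output declaration.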

You can do two things here. Firstly, you could capture the output as a value, which means using a val, stdout or env output. The downside is that it is just a value, so you can’t use it as a file later without some Nextflow scripting that converts it back with the file() function.
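As a rough sketch of that first option, the name emitted by an env output can be turned back into a file in the workflow with file(). Everything here beyond GETBESTREF and D17763.fasta is an assumption: params.fasta_dir is a hypothetical parameter naming the directory that holds the candidate fastas, and Channel.of stands in for GETBESTREF.out.sample_ref so the snippet runs on its own.

```nextflow
// Hypothetical parameter: directory containing the candidate reference fastas
params.fasta_dir = 'refs'

workflow {
    // Stand-in for GETBESTREF.out.sample_ref when the process declares
    //   tuple val(meta), env('best_ref_path'), emit: sample_ref
    ch_sample_ref = Channel.of( [ [id: 'Sample_1'], 'D17763.fasta' ] )

    // Convert the plain string into a file object so downstream
    // processes can take it as a path input
    ch_sample_fasta = ch_sample_ref.map { meta, ref_name ->
        [ meta, file("${params.fasta_dir}/${ref_name}") ]
    }

    ch_sample_fasta.view()
}
```

Passing checkIfExists: true to file() would make the run fail early if the script ever returns a name that is not in that directory.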

Alternatively, if the file is available inside the process, you could create a new file within the process and capture that. This isn’t very efficient but it’s relatively straightforward. Your process would look like this (abridged):

process GETBESTREF {
    input:
    tuple val(meta) , path(results)
    path(fastas)

    output:
    tuple val(meta), path("*.fa*") , emit: sample_ref
    path "versions.yml"                , emit: versions

    script:
    """
    best_ref_path="\$(find_best_ref.py $results)"
    mv "\${best_ref_path}" new_fasta.fasta
    """
}

Here, you are moving the “best” fasta to a new file, then capturing it with the glob. Nextflow does not include input files in the outputs by default, so it should ignore the staged input fasta files. This is inefficient but not the worst solution.

MTDouglas · June 25, 2024, 5:26pm

Hi Adam,

Thank you for your response! I ended up emitting it as an env output. The reason is that there are cases in my pipeline where more than one fasta may be the best reference for a sample (multiple genotypes in a sample). I also added a text file that records which fasta(s) are the best reference for each sample, for easy identification downstream. Below is the code in case others find it useful in the future.

process GETBESTREF {
    tag "$meta.id"
    label 'process_medium'

    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/pandas%3A1.5.2':
        'quay.io/biocontainers/pandas%3A1.5.2' }"

    input:
    tuple val(meta) , path(results)
    //path(fastas)

    output:
    //tuple val(meta), path(final_fasta) , emit: sample_ref
    tuple val(meta), env(best_ref_path) , emit: sample_ref
    path("*_fastas.txt"), emit: txt
    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    //def final_fasta = file("\${best_ref_path}")

    """
    
    best_ref_path="\$(find_best_ref.py $results)"
    echo "\${best_ref_path}" > ${prefix}_fastas.txt

    """
}
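If the env value can hold several fasta names, one way to use it downstream is to split it and emit one tuple per reference. This is only a sketch under assumptions not stated in the thread: the names are taken to be separated by whitespace, params.fasta_dir is a hypothetical parameter, another_ref.fasta is a made-up second name, and Channel.of stands in for GETBESTREF.out.sample_ref.

```nextflow
// Hypothetical parameter: directory containing the candidate reference fastas
params.fasta_dir = 'refs'

workflow {
    // Stand-in for GETBESTREF.out.sample_ref; the second name is invented
    ch_sample_ref = Channel.of( [ [id: 'Sample_1'], 'D17763.fasta another_ref.fasta' ] )

    // tokenize() splits on whitespace; flatMap then emits one
    // [ meta, fasta ] tuple for each name in the list
    ch_sample_fastas = ch_sample_ref.flatMap { meta, ref_names ->
        ref_names.tokenize().collect { name ->
            [ meta, file("${params.fasta_dir}/${name}") ]
        }
    }

    ch_sample_fastas.view()
}
```

Emitting one tuple per reference keeps the per-sample meta map attached to each fasta, so a downstream process can run once per sample and reference pair.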

