Using Reference File Name to Emit Reference Path

Hi All,

I’ve created a process that uses a Python script to iterate over a directory of per-sample results and determine the best reference fasta for that sample for downstream analysis. The script’s final output is the fasta name. The inputs to the process are the sample results and a path to the candidate fastas. The goal is for the process to emit only the fasta named by the Python script. I’m stuck on how to take the text printed by the script and have Nextflow treat it as the file name of the output path. Below is my code and the resulting error. Thanks!

process GETBESTREF {
    tag "$meta.id"
    label 'process_medium'

    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/pandas%3A1.5.2':
        'quay.io/biocontainers/pandas%3A1.5.2' }"

    input:
    tuple val(meta) , path(results)
    path(fastas)

    output:
    tuple val(meta), path("\${best_ref_path}") , emit: sample_ref
    path "versions.yml"           , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"

    """
    #ls
    best_ref_path="\$(find_best_ref.py $results)"
    echo "\${best_ref_path}"


    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        getbestref: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//' ))
    END_VERSIONS
    """
}



ERROR ~ Error executing process > 'NFCORE_HEPATITISCDENOVO:HEPATITISCDENOVO:GETBESTREF (Sample_1)'

Caused by:
  Missing output file(s) `${best_ref_path}` expected by process `NFCORE_HEPATITISCDENOVO:HEPATITISCDENOVO:GETBESTREF (Sample_1)`

Command executed:

  #ls
  best_ref_path="$(find_best_ref.py Sample_1)"
  echo "${best_ref_path}"
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_HEPATITISCDENOVO:HEPATITISCDENOVO:GETBESTREF":
      getbestref: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//' ))
  END_VERSIONS

Command exit status:
  0

Command output:
  D17763.fasta

Work dir:
 /nf-core-hepatitiscdenovo/work/1b/b0e4a0e47332954a851dc79c5204eb

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

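This means “find a file at the literal path ${best_ref_path}”, which doesn’t exist. The output path is evaluated by Nextflow, not by Bash, and because the dollar sign is escaped Nextflow takes the string as-is instead of interpolating it. The best_ref_path variable only exists in the shell that runs the script.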

You can do one of two things here. First, you could capture the output as a value, which means using a val, stdout or env output. The downside is that it is just a value, so you can’t use it as a file later without some Nextflow scripting with the file method, as described in the Nextflow documentation.
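For the value route, a minimal sketch of turning the emitted name back into a file in the workflow might look like the following. It assumes the process declares "tuple val(meta), env(best_ref_path), emit: sample_ref", and params.fasta_dir is a hypothetical parameter pointing at the directory of candidate fastas; neither is confirmed by your pipeline, so adjust to taste.

```nextflow
// Assumes GETBESTREF emits [ meta, name ], where name is a plain string such as "D17763.fasta".
// params.fasta_dir is a hypothetical parameter, not something defined in the original pipeline.
GETBESTREF.out.sample_ref
    .map { meta, ref_name ->
        // file() turns the string into a path object; checkIfExists fails early on a bad name
        [ meta, file("${params.fasta_dir}/${ref_name}", checkIfExists: true) ]
    }
    .set { ch_sample_ref }
```

The resulting ch_sample_ref channel can then be passed to a downstream process that declares "tuple val(meta), path(fasta)" as its input.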

Alternatively, if the file is available inside the process, you could create a new file within the process and capture that. This isn’t very efficient, but it’s relatively straightforward. Your process would look like this (abridged):

process GETBESTREF {
    input:
    tuple val(meta) , path(results)
    path(fastas)

    output:
    tuple val(meta), path("*.fa*") , emit: sample_ref
    path "versions.yml"                , emit: versions

    script:
    """
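    # find_best_ref.py prints the name of the chosen fasta (e.g. D17763.fasta)
    best_ref_path="\$(find_best_ref.py $results)"
    # Rename it so the "*.fa*" output glob matches a file that is not a staged input
    mv "\${best_ref_path}" new_fasta.fasta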
    """
}

Here, you move the “best” fasta to a new file and then capture it with the glob. By default Nextflow does not include input files in outputs, so the other staged input fastas should be ignored. This is inefficient, but not the worst solution.

Hi Adam,

Thank you for your response! I ended up emitting the name as an env output. The reason is that in some cases in my pipeline more than one fasta may be the best reference for a sample (multiple genotypes in a sample). I also added a text file that records which fasta(s) are the best reference for each sample, for easy identification downstream. Below is the code in case others find it useful in the future:

process GETBESTREF {
    tag "$meta.id"
    label 'process_medium'

    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/pandas%3A1.5.2':
        'quay.io/biocontainers/pandas%3A1.5.2' }"

    input:
    tuple val(meta) , path(results)
    //path(fastas)

    output:
    //tuple val(meta), path(final_fasta) , emit: sample_ref
    tuple val(meta), env(best_ref_path) , emit: sample_ref
    path("*_fastas.txt"), emit: txt

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    //def final_fasta = file("\${best_ref_path}")

    """
    
    best_ref_path="\$(find_best_ref.py $results)"
    echo "\${best_ref_path}" > ${prefix}_fastas.txt

    """
}
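Because the env output is a single string, a sample with several best references needs one more step downstream. A rough sketch of that step, assuming find_best_ref.py prints the names separated by whitespace (spaces or newlines) and that params.fasta_dir is a hypothetical parameter holding the candidate fastas:

```nextflow
// Assumes sample_ref emits [ meta, names ], where names is one string holding one or more fasta names.
// params.fasta_dir is a hypothetical parameter, not part of the process above.
GETBESTREF.out.sample_ref
    .flatMap { meta, ref_names ->
        // tokenize() splits on whitespace, so space- or newline-separated names both work
        ref_names.tokenize().collect { name ->
            [ meta, file("${params.fasta_dir}/${name}", checkIfExists: true) ]
        }
    }
    .set { ch_sample_refs }   // one [ meta, fasta ] item per reference
```

Emitting one item per reference keeps downstream processes simple, since each task then sees exactly one fasta per sample and genotype.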
