Why does Nextflow overwrite my input?

cleanfq.nf (1.8 KB)
sub.fq.gz (46.7 KB)

Dear Seqera community,

I am new to Nextflow. Attached is my first Nextflow script, which tries to clean some weird characters found in the headers of the attached sub.fq.gz. The first two processes ran without errors, but they seem to have overwritten the input, and the third step failed. I would appreciate it if anyone could test the script and suggest what I am doing wrong.

Thank you so much.
Best,
LC

Hi @LChan, have you had a chance to go through our newcomer training course, Hello Nextflow? It covers the basics of managing inputs and outputs so I’d recommend you check that out and see if it helps you understand what’s going wrong with your script.

Hi @LChan,

In your workflow, each process creates an output file with the same name as its input file, so the script writes over the existing file. For example, this process:

process clean_fq {
    publishDir "${params.output_dir}", mode: 'copy'

    input:
    path input_file1
    val input_filename

    output:
    path "${input_filename}", emit: cleaned_file

    script:
    """
    zcat ${input_file1} | awk '{
        if (NR % 4 == 1) {
            gsub(/\\x00/, "")
        }
        if (\$0 != "") {
            print
        }
    }' | gzip > ${input_filename}

    """
}

workflow {
    // Resolve the full path to the input file
    input_file1 = file(params.input_file1)

    // Extract the filename (excluding dir path)
    input_filename = input_file1.name

    clean_fq(input_file1, input_filename)
}

becomes:

zcat sub.fq.gz | awk '{
    if (NR % 4 == 1) {
        gsub(/\x00/, "")
    }
    if ($0 != "") {
        print
    }
}' | gzip > sub.fq.gz
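
The overwrite reaches the original file because Nextflow, by default, stages each input into the task work directory as a symbolic link, and a shell redirect to a symlink writes to the file it points to. Here is a minimal shell sketch of that effect outside Nextflow; the scratch directory and file names are made up for illustration:

```shell
set -eu

demo=$(mktemp -d)                      # hypothetical scratch directory
mkdir -p "$demo/fastq" "$demo/work"

# The "original" input file
printf 'original reads\n' > "$demo/fastq/sub.txt"

# Stage the input into the work directory as a symlink
ln -s "$demo/fastq/sub.txt" "$demo/work/sub.txt"

# Write an "output" that has the same name as the staged input
( cd "$demo/work" && printf 'cleaned reads\n' > sub.txt )

# The redirect followed the symlink, so the original content is gone
original_now=$(cat "$demo/fastq/sub.txt")
echo "$original_now"
```

This prints `cleaned reads`: the file under fastq/ holds the new content even though the write happened in work/.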

Instead, give the output file a different name, built at runtime from the input file name:

#! /usr/bin/env nextflow

// Define the input parameters
params.input_files = "fastq/sub.fq.gz"
params.output_dir = "cleaned_fastq"

// Define the processes

// process to clean the fastq file
process clean_fq {
    publishDir "${params.output_dir}", mode: 'copy'

    input:
    path input_file

    output:
    path "${output_filename}", emit: cleaned_file

    script:
    // Derive the output name from the input, e.g. sub.fq.gz -> sub.fq.trim.fastq.gz
    output_filename = "${input_file.baseName}.trim.fastq.gz"
    """
    zcat ${input_file} | awk '{
        if (NR % 4 == 1) {
            gsub(/\\x00/, "")
        }
        if (\$0 != "") {
            print
        }
    }' | gzip > ${output_filename}
    """
}

workflow {
    // Create a channel that emits each file matching the pattern
    input_files = Channel.fromPath(params.input_files)

    clean_fq(input_files)
}

Differences:

  • Replace file with Channel.fromPath so the workflow handles as many inputs as you like (the file call in your script handles just one)
  • Determine the output filename inside the process, using the methods of the input file object (here baseName)
  • Use that output filename in the script instead of passing the name in as a val input
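
For reference, baseName removes only the last extension, so sub.fq.gz becomes sub.fq and the output is named sub.fq.trim.fastq.gz. Nextflow's simpleName removes every extension, if you prefer a shorter name. The shell sketch below mimics both with parameter expansion. It is only an analogy for the naming logic, not Nextflow code, and the variable names are illustrative:

```shell
name="sub.fq.gz"

# Like baseName: drop the last extension only
base="${name%.*}"

# Like simpleName: drop everything from the first dot onwards
simple="${name%%.*}"

echo "${base}.trim.fastq.gz"
echo "${simple}.trim.fastq.gz"
```

The first line printed is `sub.fq.trim.fastq.gz` and the second is `sub.trim.fastq.gz`. Either way the output name differs from the input name.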

Benefits:

  • This will run as many samples as you like using a glob! (--input_files "fastq/*.fq.gz")
  • No file collisions within the process (i.e. the output never overwrites the input)
  • Easier to pass to the next process

Hopefully this gets you started and helps you fix the other two.

Hi Adam,

Thank you so much for pointing out my issues. I thought that using the same file name in different input and output folders wouldn’t be an issue, but my understanding was wrong. All three of my processes work now.

In my terminal, the pattern in --input_files "fastq/*.fq.gz" must be in double quotes.

Really appreciate your help.

Best,
LC

Hi @GeraldineVdA,

Thanks for the suggestion. I watched the tutorial and it helped a lot.

Best,
LC

Ah, whoops! That’s my mistake. Yes, you want to pass the string to Nextflow and not let your shell expand the glob.

In your terminal, this:

--input_files fastq/*.fq.gz

becomes:

--input_files fastq/1.fq.gz fastq/2.fq.gz fastq/3.fq.gz fastq/4.fq.gz ...

That doesn’t make sense to Nextflow. Quoting the pattern prevents the shell from expanding it, so Nextflow receives the glob itself.
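
You can see the difference in a plain shell without running Nextflow. This is a small sketch, assuming a scratch directory with three dummy files; count_args is a hypothetical stand-in for any command and just reports how many arguments it received:

```shell
set -eu

demo=$(mktemp -d)                      # hypothetical scratch directory
mkdir "$demo/fastq"
touch "$demo/fastq/1.fq.gz" "$demo/fastq/2.fq.gz" "$demo/fastq/3.fq.gz"
cd "$demo"

# Stand-in for a command: prints the number of arguments it was given
count_args() { echo "$#"; }

# Unquoted: the shell expands the glob before the command starts
unquoted=$(count_args --input_files fastq/*.fq.gz)

# Quoted: the pattern is passed through as a single string
quoted=$(count_args --input_files "fastq/*.fq.gz")

echo "unquoted: $unquoted arguments"
echo "quoted:   $quoted arguments"
```

With three matching files, the unquoted call receives 4 arguments (the option plus three paths) and the quoted call receives 2 (the option plus the pattern).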