LChan
(LChan)
March 13, 2025, 4:59am
1
cleanfq.nf (1.8 KB)
sub.fq.gz (46.7 KB)
Dear Seqera community,
I am new to Nextflow. Attached is my first Nextflow script, in which I tried to clean some weird characters found in the headers of the attached sub.fq.gz. The first two processes ran without error, but they seem to have overwritten the input. The third step also failed. I would appreciate it if anyone could test the script and suggest what I am doing wrong.
Thank you so much.
Best,
LC
GeraldineVdA
(Geraldine Van der Auwera)
March 14, 2025, 4:58pm
2
Hi @LChan , have you had a chance to go through our newcomer training course, Hello Nextflow ? It covers the basics of managing inputs and outputs so I’d recommend you check that out and see if it helps you understand what’s going wrong with your script.
Hi @LChan ,
In your workflow, you are creating an output file with the same name as an input file so it tries to write over the existing file, e.g.:
process clean_fq {
    publishDir "${params.output_dir}", mode: 'copy'

    input:
    path input_file1
    val input_filename

    output:
    path "${input_filename}", emit: cleaned_file

    script:
    """
    zcat ${input_file1} | awk '{
        if (NR % 4 == 1) {
            gsub(/\\x00/, "")
        }
        if (\$0 != "") {
            print
        }
    }' | gzip > ${input_filename}
    """
}

workflow {
    // Resolve the full path to the input file
    input_file1 = file(params.input_file1)
    // Extract the filename (excluding dir path)
    input_filename = input_file1.name
    clean_fq(input_file1, input_filename)
}
becomes:
zcat sub.fq.gz | awk '{
    if (NR % 4 == 1) {
        gsub(/\x00/, "")
    }
    if ($0 != "") {
        print
    }
}' | gzip > sub.fq.gz
Instead, you should make sure to rename the output file created at runtime.
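Using a different output folder does not protect the input here, because by default Nextflow stages input files into the task's work directory as symbolic links. Redirecting output to the same name therefore writes through the link into the original file. The following is a minimal shell sketch of that effect, assuming hypothetical temporary directories named fastq and work (it does not reproduce the real Nextflow work directory layout):

```shell
# Hypothetical demo: directory and file names are illustrative only.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/fastq" "$demo_dir/work"

# The "original" input file.
printf 'original reads\n' > "$demo_dir/fastq/sub.fq"

# Stage it into the work directory as a symlink, as Nextflow does by default.
ln -s "$demo_dir/fastq/sub.fq" "$demo_dir/work/sub.fq"

# Writing to the same name inside the work directory follows the link...
printf 'cleaned reads\n' > "$demo_dir/work/sub.fq"

# ...so the original file now holds the new content.
original_content=$(cat "$demo_dir/fastq/sub.fq")
echo "$original_content"
```

In the real process the situation is worse than in this sketch, since zcat is still reading the file while the redirect truncates it, so giving the output a distinct name is the safer choice.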
#!/usr/bin/env nextflow

// Define the input parameters
params.input_files = "fastq/sub.fq.gz"
params.output_dir = "cleaned_fastq"

// Define the processes
// process to clean the fastq file
process clean_fq {
    publishDir "${params.output_dir}", mode: 'copy'

    input:
    path input_file

    output:
    path "${output_filename}", emit: cleaned_file

    script:
    // baseName strips only the last extension, so sub.fq.gz becomes sub.fq.trim.fastq.gz
    output_filename = "${input_file.baseName}" + ".trim.fastq.gz"
    """
    zcat ${input_file} | awk '{
        if (NR % 4 == 1) {
            gsub(/\\x00/, "")
        }
        if (\$0 != "") {
            print
        }
    }' | gzip > ${output_filename}
    """
}

workflow {
    // Create a channel that emits every file matching the input path or glob
    input_files = Channel.fromPath(params.input_files)
    clean_fq(input_files)
}
Differences:
Replace file with Channel.fromPath to handle as many inputs as you like (file will just do one)
Determine the output filename within the process using the file methods
Use the output filename within the script instead of the value
Benefits:
This will run as many samples as you like using a glob (--input_files "fastq/*.fq.gz").
No file collisions within the process, ever (i.e. no overwriting the input with the output).
Easier to pass to the next process
Hopefully this gets you started and helps you fix the other two.
LChan
(LChan)
March 19, 2025, 8:13pm
4
Hi Adam,
Thank you so much for pointing out my issues. I thought that using the same file name in different input and output folders wouldn’t be an issue. My understanding was wrong. All three of my processes work now.
In my terminal, the --input_files "fastq/*.fq.gz" argument must be in double quotes.
Really appreciate your help.
Best,
LC
LChan
(LChan)
March 19, 2025, 8:14pm
5
Hi @GeraldineVdA ,
Thanks for the suggestion. I watched the tutorial and it helped a lot.
Best,
LC
Ah whoops! That’s my mistake. Yes, you want to pass the string to Nextflow and not let your shell expand the glob.
In your terminal, this:
--input_files fastq/*.fq.gz
becomes:
--input_files fastq/1.fq.gz fastq/2.fq.gz fastq/3.fq.gz fastq/4.fq.gz ...
Which doesn’t make sense to Nextflow. Quoting it prevents the expansion.
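To see the difference for yourself, here is a small shell sketch. The count_args helper and the temporary file names are hypothetical and only serve to show how many arguments a command receives in each case:

```shell
# Hypothetical helper: reports how many arguments it received.
count_args() {
  echo "$#"
}

# Create three empty placeholder files to match against.
demo_dir=$(mktemp -d)
touch "$demo_dir/1.fq.gz" "$demo_dir/2.fq.gz" "$demo_dir/3.fq.gz"

# Unquoted: the shell expands the glob into one argument per matching file.
unquoted_count=$(count_args "$demo_dir"/*.fq.gz)

# Quoted: the pattern is passed through as a single literal string.
quoted_count=$(count_args "$demo_dir/*.fq.gz")

echo "unquoted: $unquoted_count, quoted: $quoted_count"
```

With the quoted form, Nextflow receives the pattern itself and Channel.fromPath can do the matching.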