Hello everyone!
Could you share any best practices for writing Nextflow code that flexibly configures pipeline error handling?
When the samples are fine and the pipeline runs without failures, the scenarios from the Nextflow Training materials work perfectly. However, I've found virtually no examples of how to handle process errors competently and flexibly.
For example, in my pipeline I created a module that checks the md5sums of the raw reads:
process MD5_CHECKSUM {
    conda "${CONDA_PREFIX_1}/envs/multiqc"
    tag "Md5sum on ${sample_id}"
    publishDir "${params.outdir}/md5sum/${sample_id}", mode: "copy"
    cpus 1
    maxForks 20
    errorStrategy 'ignore'  // directives must come before input/output

    input:
    tuple val(sample_id), path(reads)
    path md5sum_txt

    output:
    tuple val(sample_id), path(reads), emit: sample_id__reads
    path("${sample_id}.md5sum.ok"), emit: md5_ok

    script:
    """
    set -euo pipefail

    r1="${reads[0]}"
    r2="${reads[1]}"
    b1=\$(basename "\$r1")
    b2=\$(basename "\$r2")

    md5_1=\$(grep -E "([[:space:]]|\\\\*)\${b1}\$" "${md5sum_txt}" | awk '{print \$1}' | head -n1)
    md5_2=\$(grep -E "([[:space:]]|\\\\*)\${b2}\$" "${md5sum_txt}" | awk '{print \$1}' | head -n1)

    if [[ -z "\$md5_1" || -z "\$md5_2" ]]; then
        echo "ERROR: md5 not found for \$b1 or \$b2 in ${md5sum_txt}" >&2
        echo "Matches for b1/b2 in md5 file:" >&2
        grep -n -F "\$b1" "${md5sum_txt}" >&2 || true
        grep -n -F "\$b2" "${md5sum_txt}" >&2 || true
        exit 2
    fi

    # two spaces between hash and name: the format 'md5sum -c' expects
    printf "%s  %s\\n" "\$md5_1" "\$r1" > check.md5
    printf "%s  %s\\n" "\$md5_2" "\$r2" >> check.md5

    md5sum -c check.md5 | tee md5sum_for_pair.log
    touch "${sample_id}.md5sum.ok"
    """
}
Currently this is implemented with errorStrategy 'ignore': if an md5 doesn't match, the task fails, its outputs are simply not emitted, and only valid samples are passed to the subsequent steps (for example, FastQC). This generally works, but is this approach considered correct?
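One variation I've thought about is making the strategy itself configurable via a pipeline parameter, so a strict run can abort instead of silently skipping samples. A rough sketch (params.strict_md5 is just a name I made up, not something in my current code):

```nextflow
process MD5_CHECKSUM {
    // 'ignore' drops failed samples; 'terminate' aborts the run on a mismatch
    errorStrategy { params.strict_md5 ? 'terminate' : 'ignore' }

    // ... input/output/script as in my current module ...
}
```

Is a dynamic errorStrategy like this considered idiomatic, or is it better handled another way?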
Or would it be better to intercept the exit status within the process itself (so the task always exits 0), emit the check status as a flag in the output, and filter the valid samples at the workflow level before the next steps? And if we go that route, what's the best way to collect summary statistics on the problematic samples?
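Something like the following is what I have in mind — just a sketch, where the PASS/FAIL values, the checked emit name, the ch_reads/ch_md5 channels, and the failed_md5.tsv file name are all names I've invented for illustration:

```nextflow
process MD5_CHECKSUM {
    tag "Md5sum on ${sample_id}"

    input:
    tuple val(sample_id), path(reads)
    path md5sum_txt

    output:
    // env() captures a shell variable, so the check result travels with the sample
    tuple val(sample_id), path(reads), env('STATUS'), emit: checked

    script:
    """
    # ... build check.md5 as in my current script ...
    STATUS=PASS
    md5sum -c check.md5 > md5sum_for_pair.log 2>&1 || STATUS=FAIL
    """
}

workflow {
    // ch_reads and ch_md5 would be defined elsewhere in the pipeline
    MD5_CHECKSUM(ch_reads, ch_md5)

    MD5_CHECKSUM.out.checked
        .branch {
            pass: it[2] == 'PASS'
            fail: true
        }
        .set { checked }

    // only valid pairs continue downstream
    FASTQC( checked.pass.map { sid, reads, status -> tuple(sid, reads) } )

    // one summary file listing the problematic samples
    checked.fail
        .map { sid, reads, status -> "${sid}\tmd5 mismatch" }
        .collectFile(name: 'failed_md5.tsv', storeDir: params.outdir, newLine: true)
}
```

Is this branch/collectFile pattern the recommended way to do it, or is there a more standard mechanism for aggregating per-sample failure reports?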
I’d be grateful for any advice or practical examples!