Jobs marked as failed by nextflow but slurm exit code 0

asherps · June 9, 2025, 5:13pm

Hi, we’ve been encountering an error with some of our Nextflow pipelines (both nf-core and ones we are developing): Nextflow reports that a job failed and was terminated by our HPC job scheduler (we’re on a Slurm-based HPC), yet when we go into the work directory the job appears to have run successfully (the outputs are there and the exit code is 0). I’ve included a snippet of the error message at the end. We found some discussion of this here:

github.com/nextflow-io/nextflow

Process terminated for an unknown reason -- Likely it has been terminated by the external system

opened 03:13PM - 04 May 22 UTC

closed 05:22AM - 16 Jan 23 UTC

hukai916

executor/lsf stale

## Bug report

### Expected behavior and actual behavior

I expect the NF engine to correctly capture the job status code submitted to LSF scheduler, but it fails to do so occasionally.

### Steps to reproduce the problem

I didn't have this problem using the same workflow over the past a few months, but starts to encounter it recently. This problem is hard to reproduce because it happens sporadically. I was able to capture one of such cases and traced the .nextflow.log file and found the following facts:

1. The job (jobId: 5535024) was actually completed without problem according to its .command.log file:

```
Started at Tue May 3 09:20:54 2022
Terminated at Tue May 3 18:24:37 2022
Results reported at Tue May 3 18:24:37 2022
```

2. The job was submitted to LSF at `May-03 09:10:36.762`, it was still PENDING 10 minutes later at `May-03 09:20:40.795`, but was considered as COMPLETED only a few milliseconds later at `May-03 09:20:40.803`, and NF engine couldn't locate it output files because they were not being generated at that time. Therefore, NF generated an error and terminated all other processes and exited.

3. Relevant log is below:

```
May-03 09:10:36.762 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11) > jobId: 5535024; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642
May-03 09:20:40.795 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: RUNNING; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270000
Current queue status:
> job: 5535024: PENDING
May-03 09:20:40.803 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: COMPLETED; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ]
May-03 09:20:40.820 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.out
May-03 09:20:40.822 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.err
May-03 09:20:40.823 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.log
May-03 09:20:40.825 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11)'
```

The question is boiled down to "Why a job actually finished at `May 3 18:24:37 2022` was considered as COMPLETED at `May-03 09:20:40.803`?" I noticed similar issues here (https://github.com/nextflow-io/nextflow/issues/2540), here (https://github.com/nextflow-io/nextflow/issues/1045), and here (https://github.com/nextflow-io/nextflow/issues/1644). Any input to further debug will be highly appreciated.

### Environment

* Nextflow version: [nextflow version 21.10.6.5660]
* Java version: [openjdk version "1.8.0_92"]
* Bash version: (use the command `$SHELL --version`) [GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)]
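For anyone triaging the same symptom, the "Failed to get exit status" entries quoted in the issue above can be pulled out of a `.nextflow.log` with a short script. This is only a minimal sketch and not part of Nextflow: the function name `find_exit_status_timeouts` is hypothetical, and the regular expression is modelled solely on the single DEBUG line shown above, so the wording may differ in other Nextflow versions.

```python
import re

# Assumption: the pattern mirrors the one DEBUG entry quoted in the issue;
# other Nextflow versions may word or order these fields differently.
TIMEOUT_RE = re.compile(
    r"Failed to get exit status for process TaskHandler\[jobId: (?P<job_id>\d+);"
    r".*?name: (?P<name>.+?); status: (?P<status>\w+);"
    r".*?exitStatusReadTimeoutMillis: (?P<timeout_ms>\d+); delta: (?P<delta_ms>\d+)",
    re.DOTALL,  # a single log entry may be wrapped over several lines
)


def find_exit_status_timeouts(log_text):
    """Return one dict per 'Failed to get exit status' entry found in log_text."""
    hits = []
    for match in TIMEOUT_RE.finditer(log_text):
        hits.append({
            "job_id": match.group("job_id"),
            "name": match.group("name"),
            "status": match.group("status"),
            "timeout_ms": int(match.group("timeout_ms")),
            "delta_ms": int(match.group("delta_ms")),
        })
    return hits
```

Calling it on the contents of `.nextflow.log` gives the scheduler job IDs to look up in the scheduler's own accounting records.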

I’d also like to note that, according to Slurm, the job completed successfully and there is a .exitcode file in the work directory, yet the exit status is marked as null in the execution trace (attached as well).
Any suggestions would be much appreciated.

Thanks,
Asher

======
ERROR ~ Error executing process > 'SHAHCOMPBIO_BAMBU_NF:BAMBU_NF:BAMBU_MERGE_QUANT (SHAH_H003842_T01_01_TR01_NDR_DEFAULT)'

Caused by:
Process SHAHCOMPBIO_BAMBU_NF:BAMBU_NF:BAMBU_MERGE_QUANT (SHAH_H003842_T01_01_TR01_NDR_DEFAULT) terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

mkdir -p transcriptome_NDR_DEFAULT
transcript_assembly.R
--rds=SHAH_H003842_T01_01_TR01_R1.sorted.bam
--yieldsize=100000
--ref_genome=GRCh38.primary_assembly.genome.fa
--ref_gtf=extended_annotations.gtf
--out_dir=transcriptome_NDR_DEFAULT
--ncore=1

--discovery=FALSE
cat <<-END_VERSIONS > versions.yml
"SHAHCOMPBIO_BAMBU_NF:BAMBU_NF:BAMBU_MERGE_QUANT":
r-base: $(echo $(R --version 2>&1) | sed 's/^.*R version //; s/ .*//')
bambu: $(Rscript -e "library(bambu); cat(as.character(packageVersion('bambu')))")
END_VERSIONS

Command exit status:

Command output:
[1] "processing 1 read classes"

execution_trace_2025-06-03_13-44-19.txt (1.6 KB)
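
The cross-check described above (the trace reports a null exit status while the work directory holds a `.exitcode` file) can be scripted. The following is a minimal sketch, assuming the trace file is tab-separated with `name`, `native_id` and `exit` columns and that a missing exit status is written as `-`; both function names are hypothetical helpers, not part of Nextflow.

```python
import csv
from pathlib import Path


def read_exitcode(work_dir):
    """Return the integer stored in <work_dir>/.exitcode, or None if unavailable."""
    try:
        return int((Path(work_dir) / ".exitcode").read_text().strip())
    except (FileNotFoundError, ValueError):
        # File not written (or not yet visible on this node), or not a number.
        return None


def tasks_with_null_exit(trace_path):
    """Return the trace rows whose 'exit' column is empty or '-'.

    Assumption: the trace is tab-separated and has an 'exit' column.
    """
    with open(trace_path, newline="") as handle:
        rows = csv.DictReader(handle, delimiter="\t")
        return [row for row in rows if (row.get("exit") or "").strip() in ("", "-")]
```

The `native_id` of each returned row can then be passed to `sacct -j <native_id> --format=JobID,State,ExitCode` to see what Slurm recorded for that job, and `read_exitcode` can be pointed at the task's work directory to see what the job itself wrote.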
