asherps
(Asher Preska Steinberg)
June 9, 2025, 5:13pm
Hi, we've been running into an error with some of our Nextflow pipelines (both nf-core pipelines and ones we are developing). The error message suggests that one of the jobs failed and was terminated by our job scheduler (we're on a Slurm-based HPC), yet when we look in the work directory the task appears to have run successfully: the outputs are there and the exit code is 0. I've included a snippet of the error message at the end. We found some discussion of this here:
opened 03:13PM - 04 May 22 UTC
closed 05:22AM - 16 Jan 23 UTC
executor/lsf
stale
## Bug report
### Expected behavior and actual behavior
I expect the NF engine to correctly capture the status code of the job submitted to the LSF scheduler, but it occasionally fails to do so.
### Steps to reproduce the problem
I didn't have this problem using the same workflow over the past few months, but I have started to encounter it recently.
This problem is hard to reproduce because it happens sporadically. I was able to capture one such case, traced it through the .nextflow.log file, and found the following facts:
1. The job (jobId: 5535024) was actually completed without problem according to its .command.log file:
```
Started at Tue May 3 09:20:54 2022
Terminated at Tue May 3 18:24:37 2022
Results reported at Tue May 3 18:24:37 2022
```
2. The job was submitted to LSF at `May-03 09:10:36.762`. It was still PENDING 10 minutes later at `May-03 09:20:40.795`, but was considered COMPLETED only a few milliseconds later at `May-03 09:20:40.803`. The NF engine couldn't locate its output files because they had not been generated yet, so NF raised an error, terminated all other processes, and exited.
3. Relevant log is below:
```
May-03 09:10:36.762 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [LSF] submitted process SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11) > jobId: 5535024; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642
May-03 09:20:40.795 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: RUNNING; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270000
Current queue status:
> job: 5535024: PENDING
May-03 09:20:40.803 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: COMPLETED; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ]
May-03 09:20:40.820 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.out
May-03 09:20:40.822 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.err
May-03 09:20:40.823 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.log
May-03 09:20:40.825 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11)'
```
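To line these entries up, it can help to pull the job ID, the timeout fields, and the epoch `started` value out of the "Failed to get exit status" line and convert `started` to a wall-clock time. Below is a minimal Python sketch, assuming the log line layout shown above (other Nextflow versions may format it differently). The function name `parse_exit_status_line` is hypothetical, and the time is reported in UTC because the log does not state its own time zone.

```python
import re
from datetime import datetime, timedelta, timezone

# Field layout assumed from the DEBUG line quoted above.
PATTERN = re.compile(
    r"jobId: (?P<job_id>\d+);.*?"
    r"started: (?P<started>\d+);.*?"
    r"exitStatusReadTimeoutMillis: (?P<timeout>\d+); delta: (?P<delta>\d+)"
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_exit_status_line(line):
    """Return job ID, start time (UTC), and timeout fields, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    started_ms = int(match.group("started"))
    return {
        "job_id": match.group("job_id"),
        # Integer milliseconds since the epoch, converted without float rounding.
        "started_utc": EPOCH + timedelta(milliseconds=started_ms),
        "timeout_ms": int(match.group("timeout")),
        "delta_ms": int(match.group("delta")),
    }


# Abbreviated copy of the log line above (name and workDir fields omitted).
example = (
    "TaskHandler[jobId: 5535024; id: 25; status: RUNNING; exit: -; error: -; "
    "started: 1651583530790; exited: -; ] "
    "-- exitStatusReadTimeoutMillis: 270000; delta: 270000"
)
info = parse_exit_status_line(example)
print(info["job_id"], info["started_utc"].isoformat(), info["timeout_ms"])
```

For the line above, `started` converts to 2022-05-03 13:12:10.790 UTC, and both `exitStatusReadTimeoutMillis` and `delta` are 270000 ms (4.5 minutes).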
The question boils down to: why was a job that actually finished at `May 3 18:24:37 2022` considered COMPLETED at `May-03 09:20:40.803`?
I noticed similar issues here (https://github.com/nextflow-io/nextflow/issues/2540), here (https://github.com/nextflow-io/nextflow/issues/1045), and here (https://github.com/nextflow-io/nextflow/issues/1644).
Any input to further debug will be highly appreciated.
### Environment
* Nextflow version: [nextflow version 21.10.6.5660]
* Java version: [openjdk version "1.8.0_92"]
* Bash version: (use the command `$SHELL --version`) [GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)]
I'd also like to note that, according to Slurm, the job completed successfully, and there is a .exitcode file in the work directory, yet the exit status appears as null in the execution trace (attached as well).
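One way to make this check repeatable across several work directories is a small script that reads `.exitcode` and reports which of the usual task files are present. Below is a minimal Python sketch. The helper name `inspect_work_dir` is hypothetical, and the file names are the ones mentioned in this thread (`.exitcode`, `.command.out`, `.command.err`, `.command.log`).

```python
from pathlib import Path

# Task files referenced in the logs quoted in this thread.
TASK_FILES = (".command.out", ".command.err", ".command.log")


def inspect_work_dir(work_dir):
    """Read .exitcode (if any) and report which task files exist."""
    work = Path(work_dir)
    exit_file = work / ".exitcode"
    exit_code = None
    if exit_file.is_file():
        text = exit_file.read_text().strip()
        # An empty or non-numeric file is treated as "no exit code".
        if text.lstrip("-").isdigit():
            exit_code = int(text)
    present = {name: (work / name).is_file() for name in TASK_FILES}
    return {"exit_code": exit_code, "files": present}
```

Calling `inspect_work_dir("/path/to/work/ab/cdef...")` on a task directory (the path here is a placeholder) returns the parsed exit code, or `None` if the file is missing or empty, which can then be compared with what the execution trace reports for the same task.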
If anyone has suggestions it would be much appreciated.
Thanks,
Asher
======
ERROR ~ Error executing process > 'SHAHCOMPBIO_BAMBU_NF:BAMBU_NF:BAMBU_MERGE_QUANT (SHAH_H003842_T01_01_TR01_NDR_DEFAULT)'
Caused by:
Process SHAHCOMPBIO_BAMBU_NF:BAMBU_NF:BAMBU_MERGE_QUANT (SHAH_H003842_T01_01_TR01_NDR_DEFAULT)
terminated for an unknown reason -- Likely it has been terminated by the external system
Command executed:
mkdir -p transcriptome_NDR_DEFAULT
transcript_assembly.R
--rds=SHAH_H003842_T01_01_TR01_R1.sorted.bam
--yieldsize=100000
--ref_genome=GRCh38.primary_assembly.genome.fa
--ref_gtf=extended_annotations.gtf
--out_dir=transcriptome_NDR_DEFAULT
--ncore=1
--discovery=FALSE
cat <<-END_VERSIONS > versions.yml
"SHAHCOMPBIO_BAMBU_NF:BAMBU_NF:BAMBU_MERGE_QUANT":
r-base: $(echo $(R --version 2>&1) | sed 's/^.*R version //; s/ .*//')
bambu: $(Rscript -e "library(bambu); cat(as.character(packageVersion('bambu')))")
END_VERSIONS
Command exit status:
Command output:
[1] "processing 1 read classes"
execution_trace_2025-06-03_13-44-19.txt (1.6 KB)