Hi Seqera,
I’m running a custom Nextflow pipeline on a Slurm HPC cluster, and I’ve noticed that it sometimes stalls at random steps. When I resubmit the job, it completes without issue. For a job that usually takes a handful of minutes, a stall can stretch the run to hours, and sometimes the job never finishes before reaching a generous time limit.
This has occurred across two pipelines, on different samples, steps, and nodes, so it appears to be random. I’ve attached the nextflow.log and the job output log.
A distinctive pattern in the nextflow.log is these two lines being repeated over and over:
~> TaskHandler[id: 54; name: pipeline:medakaVariants (8); status: RUNNING; exit: -; error: -; workDir: /hpf/largeprojects/pray/microbiology_testing/rsv/samples/2024-09-05_RSVSEQ_FG/wf-rsv/work/eb/216477a136d0f36df98c045c65d641]
Dec-12 13:38:44.073 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 54; name: pipeline:medakaVariants (8); status: RUNNING; exit: -; error: -; workDir: /hpf/largeprojects/pray/microbiology_testing/rsv/samples/2024-09-05_RSVSEQ_FG/wf-rsv/work/eb/216477a136d0f36df98c045c65d641]
Dec-12 13:43:44.088 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
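To quantify this pattern across the whole log, a small Python sketch like the one below could count how often each task is dumped as RUNNING and over what time span. It is not part of Nextflow: the function name `summarise_stalls` is hypothetical, and it assumes the layout seen in the excerpt, where a timestamped "tasks to be completed" line is followed by one or more `TaskHandler[...]` lines. The log timestamps carry no year, so a placeholder year is assumed for parsing.

```python
import re
from datetime import datetime

# Timestamp prefix as seen in the excerpt, e.g. "Dec-12 13:38:44.073 ".
STAMP = re.compile(r"^(\w{3}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")
# TaskHandler dump line; captures the task id, name, and status.
HANDLER = re.compile(r"TaskHandler\[id: (\d+); name: (.+?); status: (\w+);")

# Assumption: the log has no year, so a placeholder is prepended.
# Spans that cross a year boundary would therefore be computed incorrectly.
PLACEHOLDER_YEAR = "2000"


def summarise_stalls(lines):
    """Map task name -> (first_seen, last_seen, count) for RUNNING dumps.

    Each TaskHandler line is attributed to the most recent timestamped
    line before it; handler lines with no preceding timestamp are skipped.
    """
    summary = {}
    current = None
    for line in lines:
        stamp = STAMP.match(line)
        if stamp:
            current = datetime.strptime(
                PLACEHOLDER_YEAR + "-" + stamp.group(1),
                "%Y-%b-%d %H:%M:%S.%f",
            )
        handler = HANDLER.search(line)
        if handler and handler.group(3) == "RUNNING" and current is not None:
            name = handler.group(2)
            if name in summary:
                first, _, count = summary[name]
                summary[name] = (first, current, count + 1)
            else:
                summary[name] = (current, current, 1)
    return summary
```

It could be run on the attached log with something like `with open("nextflow.log") as fh: print(summarise_stalls(fh))`. A task whose count keeps growing while the first and last timestamps drift hours apart is a candidate for the stall described here.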
A similar issue has been reported here:
Pipeline getting frozen (nf-core)
I’ve tried increasing memory and CPUs and using a newer Nextflow version (nextflow/24.11.0), but the stall still occurs. Any advice?
qsub_1.15123185.txt (2.2 MB)
nextflow.log (192.1 KB)