My jobs are being canceled and their logs cannot be recovered

Hello, I have a SLURM workflow that fails randomly at pretty much any step.

The terminal shows this error:
terminated for an unknown reason -- Likely it has been terminated by the external system

When I check the working directory, the only files I found were … and …

In the Nextflow logs I see errors about recovering any of the metadata for the run:

Process ... terminated for an unknown reason -- Likely it has been terminated by the external system
[Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process ... Cause: java.nio.file.NoSuchFileException:
[Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process ...  Cause: java.nio.file.NoSuchFileException:

Does anyone know what’s happening here? A random step in the workflow fails one time and works the next, and I am having a really hard time debugging with minimal explanation as to why. I can get the workflow to complete by restarting the run again and again through brute force, but that defeats the purpose of Nextflow workflows.

What is happening is that Nextflow submits the process, but the machine that runs the process is killing it. This implies something is wrong with your infrastructure; normally it’s an issue with permissions or configuration.

Given it works sometimes but not others, I expect there is an element of randomness that differs between instances of the process. For example, do you have multiple machines in your Slurm cluster? Perhaps the permissions differ between them.
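If Slurm accounting is enabled on your cluster, it is also worth asking the scheduler itself why the job ended; the Slurm job id for each task is recorded in .nextflow.log. A sketch, with a placeholder job id:

```shell
# <jobid> is a placeholder -- take the real id from .nextflow.log
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,MaxRSS,NodeList
```

A State of CANCELLED or OUT_OF_MEMORY, or a pattern in NodeList, would point at the node or limit that is killing the tasks.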

To start debugging, can you go to the task's working directory and run bash .command.run? This will mimic Nextflow running the process, and you may see a log in the terminal that tells you what is happening.
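Spelled out, and assuming the standard wrapper files Nextflow stages into each task directory (the hashed path below is a placeholder taken from the error message):

```shell
cd work/ab/cdef1234...        # placeholder: the task hash dir from the error
bash .command.run             # re-runs the task exactly as Nextflow would
cat .command.log .command.err # wrapper and task stderr, if they were written
```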

Thanks for your help!
I have been able to run the jobs manually by submitting the script, but any time I have a queue size above ~40 or so I run into this issue.

After talking with the sys admins, they can’t seem to find anything that indicates why these jobs are being killed. And there’s no explanation for why the jobs aren’t able to print error logs or any intermediate files either.

The error happens to a seemingly random job on each attempt, so this definitely seems to be an issue agnostic of the sample or step in the workflow.

Do you have a cap on how many jobs you can submit at once to your cluster? You can control this with the executor.queueSize config option.
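For example, assuming a Slurm executor, a cap below the level where the failures start might look like this in nextflow.config (the values here are only illustrative — tune them for your cluster):

```groovy
// nextflow.config -- illustrative values
executor {
    name            = 'slurm'
    queueSize       = 25         // keep at most 25 jobs queued/running at once
    submitRateLimit = '10/1min'  // optionally throttle submissions as well
}
```

If the failures stop with a lower queueSize, that points at a per-user job or submission limit on the scheduler side.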

I am running into a similar issue, although it does not seem to be a cap on the number of jobs, because my Nextflow jobs are killed even when 8 or fewer jobs are running. This only happens to jobs submitted using Seqera Platform; I haven’t run into problems running pipelines on the command line.

Our cluster uses Slurm as well. Seqera Platform submits the jobs using my user profile via SSH, and the input data and output directory are in my home directory, so it shouldn’t be a permissions issue. Could it be the way the sys admins configured the cluster/Slurm for security purposes?