If, like me, you’re running Nextflow with SLURM on a cluster that experiences node evictions, you probably want a way to automatically rerun jobs when they fail due to eviction.
According to the documentation on Spot Instance failures and retries, this feature is implemented when using AWS Batch or Google Batch. However, I found a relatively simple solution to achieve similar behaviour on a SLURM‑based cluster.
When a job fails due to an eviction, SLURM reports the job state NODE_FAIL, and Nextflow treats this as the node being “terminated for an unknown reason.” You can use this message to trigger automatic retries by adding the following to your config:
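// Goes in the process scope of nextflow.config
errorStrategy = {
    // Retry when the previous failure looks like an eviction and this is attempt 3 or earlier;
    // otherwise ignore the failure. The second ?. avoids a NullPointerException when there is no previous exception.
    if (task.previousException?.toString()?.contains("terminated for an unknown reason") && task.attempt <= 3) {
        return 'retry'
    } else {
        return 'ignore'
    }
}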
This was the relatively simple “trick” I found.
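For anyone wiring this into a full config, here is a minimal sketch of how the closure might sit inside the `process` scope. It rests on a few assumptions that are not from the post above: that the SLURM executor is set in the same scope, that your Nextflow version exposes `task.previousException` (as far as I know this arrived in 24.10), and that `maxRetries` needs raising because the documentation says only one retry is allowed by default.

```groovy
// nextflow.config (sketch under the assumptions stated above)
process {
    executor = 'slurm'

    // Assumption: raise the retry limit so it matches the attempt check below;
    // the documented default allows only one retry.
    maxRetries = 3

    errorStrategy = {
        // Null-safe check for the eviction message, then the attempt limit.
        // Everything that is not an eviction is ignored, as in the snippet above.
        (task.previousException?.toString()?.contains('terminated for an unknown reason') && task.attempt <= 3) ? 'retry' : 'ignore'
    }
}
```

If you would rather have non-eviction failures stop the run instead of being skipped, you could swap `'ignore'` for another documented strategy such as `'terminate'` or `'finish'`.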