AWS Batch: Spot Instance - How does Nextflow log an interruption?

I am trying to decide which processes in my Nextflow pipeline should use spot instances.

In theory, one should pick fault-tolerant workloads that can handle interruptions gracefully, or that are cheap to rerun if they fail. If you have an extremely long task that costs a lot of money for each sample, using spot machines may not be the best approach. However, with Fusion Snapshots, one shouldn’t worry about this: when the task is retried, Fusion will make it resume from where it stopped before the interruption.

How does Nextflow log spot instance interruptions? Is there specific information I can look for in the .command.log or trace file that would clue me in to these interruptions?

The cloud provider is the one killing the task, so we rely on them to give us enough information to identify that the task ended because the spot machine was reclaimed. For AWS Batch, there is a Nextflow configuration option, aws.batch.maxSpotAttempts, that sets the number of times Nextflow can retry a task that was running on a spot machine when it was reclaimed. You can read more about it here.
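
As a rough illustration, the option could be set in nextflow.config along these lines. This is only a sketch: the queue name, region, and attempt count are placeholders I made up, and the default value of maxSpotAttempts depends on your Nextflow version, so check the docs for the release you run.

```groovy
// nextflow.config (sketch; queue name, region, and values are placeholders)
process {
    executor = 'awsbatch'
    queue    = 'my-spot-queue'   // hypothetical job queue backed by a spot compute environment
}

aws {
    region = 'us-east-1'         // placeholder region
    batch {
        // Maximum number of attempts for a job that was interrupted
        // because its spot instance was reclaimed.
        maxSpotAttempts = 3      // illustrative value, not a recommendation
    }
}
```

Setting it under the aws.batch scope applies it to every job submitted to AWS Batch, so which processes actually run on spot machines is still decided by the queue each process is sent to.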

Be aware that this is different from other cloud providers. For Google Cloud Batch, for example, you have to turn on spot machines in your Nextflow configuration and watch for a specific exit status to retry.
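
For comparison, a Google Cloud Batch setup might look something like the sketch below. The exit status 50001 is my assumption of the code Google Batch reports when a spot VM is preempted, and the retry count is an arbitrary example, so please verify both against the current Nextflow and Google Batch documentation before relying on them.

```groovy
// nextflow.config (sketch; exit status and retry count are assumptions)
process {
    executor = 'google-batch'

    // Retry only when the task ended with the exit status assumed to
    // signal a spot preemption; terminate the run on any other failure.
    errorStrategy = { task.exitStatus == 50001 ? 'retry' : 'terminate' }
    maxRetries    = 3            // illustrative value
}

google {
    batch {
        spot = true              // request spot machines for Google Batch jobs
    }
}
```

Here the retry is handled by Nextflow itself through errorStrategy, which is why you need to match on the exit status explicitly.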
