Frequently failing runs using AWS Batch Spot

Hi,

I’m relatively new to Nextflow/Seqera and I’m hoping to get some help using AWS Batch compute environments for running pipelines.

I am currently using a spot provisioning model to run the nf-core/methylseq pipeline, and my runs keep failing because instances are being taken away from me (see screenshot). I understand that this is part of the deal with the spot model, but it is now happening frequently enough that it’s affecting project timelines.

Reading around, it looks like Batch parameters such as the maximum number of retries and the spot price bid could help a bit. However, I’m hoping that there may be some better solutions out there to reduce the frequency of such events.

Appreciate any help/insight that you can provide!

Hi @sameer_abraham,

Options will vary depending on your cloud provider.

In Nextflow, the standard approach to handle spot instance interruptions is to automatically retry the task, which means the task will restart from the beginning each time it’s interrupted. While effective, this can add extra time and resource usage if interruptions happen frequently.
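
As a rough sketch, the retry behaviour could be set in a `nextflow.config` along these lines. The counts below are illustrative assumptions rather than recommendations, and the default for `aws.batch.maxSpotAttempts` differs between Nextflow versions, so please check the docs for the version you are running:

```groovy
// nextflow.config (illustrative values only, adjust for your pipeline)
process {
    // Resubmit a failed task instead of failing the whole run
    errorStrategy = 'retry'
    // Upper bound on resubmissions per task; 3 is an arbitrary example
    maxRetries    = 3
}

aws {
    batch {
        // How many times AWS Batch itself re-runs a job that was
        // terminated by a spot reclaim; 5 is an example value
        maxSpotAttempts = 5
    }
}
```

The two settings work at different levels: `maxSpotAttempts` lets AWS Batch re-run a reclaimed job without Nextflow counting it as a task failure, while `errorStrategy` and `maxRetries` cover failures that do reach Nextflow. If the pipeline's own config already defines an `errorStrategy`, a blanket setting like this would override it, so it may be worth checking that first.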

For more advanced handling, solutions like Fusion Snapshots in our enterprise offerings allow tasks to resume from their last checkpoint instead of starting over, significantly reducing the impact of interruptions.

Unfortunately, we can’t control how often spot instances are reclaimed; this depends on the policies of the specific cloud provider. You may want to check with them for any options to manage or minimize these interruptions.

Let us know if you have further questions!