AWS issues and optimisations

I have a pipeline that I have been running on our local HPC system and am trying to move to AWS Batch so that we can access more compute at once.

Currently my batch setup is as follows:

  • AWS Batch queue and compute environment
  • AWS Batch GPU queue and compute environment
  • S3 workdir
  • S3 output results dir
  • S3 reference data dir
  • Custom AMI with AWS CLI installed
  • Custom ECS timeouts extending the defaults to 10m
  • Launch template with increased IOPS, throughput, and volume size for EBS

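For reference, the setup above corresponds to a nextflow.config along roughly these lines (queue names, bucket paths, and the CLI path are placeholders rather than my real values):

```groovy
// nextflow.config -- a minimal sketch of the setup described above.
// Queue names, bucket paths, and the CLI path are placeholders.

workDir = 's3://my-bucket/work'                 // S3 work directory

params.outdir = 's3://my-bucket/results'        // S3 output results dir
params.refdir = 's3://my-bucket/references'     // S3 reference data dir

process {
    executor = 'awsbatch'
    queue    = 'cpu-queue'                      // Batch queue + compute environment

    withLabel: 'gpu' {
        queue = 'gpu-queue'                     // Batch GPU queue + compute environment
    }
}

aws {
    region = 'eu-west-2'
    batch {
        cliPath = '/usr/local/bin/aws'          // AWS CLI installed on the custom AMI
    }
}
```
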
However, I am seeing several issues. If it is decided that some of these would be better split into their own issues, I am happy to do that. I am also aware that some of these are underlying problems with AWS Batch itself, but I am hopeful that there is some way to mitigate or work around them when running Nextflow pipelines (as I would suspect they are fairly common).

  1. Docker create timeouts

I am getting a lot of errors of the form `DockerTimeoutError: Could not transition to created; timed out after waiting 10m0s`. I am using gp3 EBS volumes, so there is no burst balance to exhaust, and I have tried increasing the throughput to 500 MiB/s and the IOPS to a fairly high 20,000 (versus the default of 3,000), but I am still seeing these errors, typically when more than 10 tasks are launched on a single EC2 instance. I am reluctant to increase the IOPS further as the cost starts to balloon, and IOPS are not an issue beyond the initial startup step.

There is an open GitHub issue which puts some of the blame for this on AWS, but surely I am not the first person to see many jobs launched onto a single instance with Nextflow?

I have considered whether lowering the job submission rate or setting the executor queue size to a much smaller value (e.g. 5) might mean fewer jobs are fired off at once, but this would seem to negate some of the benefit of AWS's horizontal scaling capacity?
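For concreteness, this is the sort of throttling I mean (the numbers are just illustrative):

```groovy
// Throttle how quickly Nextflow fires jobs at Batch -- values are illustrative only.
executor {
    queueSize       = 50          // max jobs queued/running at once
    submitRateLimit = '10/1min'   // at most 10 job submissions per minute
}
```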

  2. EBS volume sizes

Since EBS autoscaling has been deprecated by AWS, their recommended replacements are Mountpoint for S3 (which cannot act as a workDir) and EFS (expensive, and does not work out of the box as a workDir). I can give my instances large EBS volumes using a launch template, but this can result in a very large, mostly unused volume being attached to a small compute instance. The underlying issue is that AWS Batch tries to pack as many tasks as possible onto the largest instance it can provision. One possible bodge would be to have the launch template grow the EBS file system based on the number of vCPUs on an instance, but I am not clear on whether this is wise or even possible.

  3. Dead instance time

As mentioned above, AWS Batch tries to provision fewer, larger instances and run as many tasks as possible on as few instances as possible. This creates problems such as the EBS sizing issue above, but it can also leave large, expensive instances running mostly idle for a considerable time if a long-running but low-resource (or even mid-resource, e.g. 12 CPUs / 144 GB) task lands on one of them and the other tasks complete. Typically this happens because, early in a pipeline, 10-12 large instances might be spun up; long-running tasks get spread across them, with smaller, shorter tasks filling the extra space. You are then left with e.g. 10 partially utilised, expensive, large instances.

The only real solution I can think of is assigning jobs of different resource levels to different queues, each backed by a compute environment restricted to a smaller subset of instance types, but this seems like a lot of overhead.
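A rough sketch of what I am imagining (queue names and labels are made up):

```groovy
// Route processes to different Batch queues by resource "size".
// Each queue would sit on a compute environment restricted to appropriately
// sized instance types. Queue names and labels are illustrative only.
process {
    executor = 'awsbatch'
    queue    = 'small-task-queue'            // default: small/short tasks

    withLabel: 'large_resources' {
        queue = 'large-task-queue'           // large instances only
    }
    withLabel: 'long_running' {
        queue = 'long-running-queue'         // smaller, cheaper instances
    }
}
```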

  4. I/O

As with many bioinformatics pipelines, there are many large files and a lot of I/O. Sometimes instances appear to be kept alive long after their tasks have completed, and I believe this may be due to the publishing of large outputs to S3. Is this the case, and can it be improved in some way?
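For context, these are the transfer-related settings I know of on the Nextflow side; I am not sure whether any of them actually address the lingering instances:

```groovy
// Transfer-related knobs -- included for context, not a confirmed fix
// for instances staying alive after tasks complete.
aws {
    batch {
        maxParallelTransfers = 8      // parallel S3 up/downloads per job
        maxTransferAttempts  = 3      // retries for failed S3 transfers
    }
    client {
        uploadMaxThreads = 8          // threads used for multipart S3 uploads
        maxConnections   = 20         // max open HTTP connections to S3
    }
}
```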

Fusion
I’m aware that I am likely to hear that Fusion can solve many of these issues. However, we do not use Platform, and as we cannot use the public Platform we would need a self-hosted instance, which would likely be prohibitively costly and a lot of overhead simply to access Fusion.

The free tier unfortunately would not be an option, as halfway through a single run of this one pipeline the throughputs are:
Total rchar: 295.77 TB
Total wchar: 103.56 TB

As you can see, this is roughly 400 TB combined at the halfway point, far in excess of the 100 TB/month free tier limit.

I’m hoping we can find solutions for some of these issues, or at least open a dialogue about best practices to mitigate them.