Getting Started: Disk and Memory Management Questions

Hey all,

I’m getting started with Nextflow and AWS Batch and am a little confused by my options for managing memory and disk space. Any input is appreciated!

High-level background:
I am processing biological datasets that are currently stored in S3. The datasets and files vary widely in size, and the tools used to process them can be memory-hungry, with memory usage that doesn’t necessarily scale linearly with file size.

Given that, here are my high-level goals:

  • Generally, minimize costs.
  • Avoid pipeline failures from running out of disk space (e.g. No space left on device errors).
  • Avoid pipeline failures from running out of memory.
  • Eliminate transfers (or minimize transfer times) between S3 and AWS Batch instances.

More detailed background:
My current setup is very generic. I followed the setup instructions in the Nextflow documentation here, and opted to use the AWS CLI installed in a custom AMI.
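
For reference, my nextflow.config is currently something like the following (the queue, bucket, region, and CLI path are placeholders rather than my real values):

// Rough sketch of my current, generic setup (placeholder values)
process.executor  = 'awsbatch'
process.queue     = 'my-batch-queue'                    // placeholder AWS Batch job queue
workDir           = 's3://my-bucket/work'               // placeholder S3 work directory
aws.region        = 'us-east-1'                         // placeholder region
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'  // AWS CLI baked into the custom AMI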

So far, I can run Nextflow pipelines, and the processes are successfully submitted as jobs in AWS Batch. However, due to the file sizes, the transfers to and from S3 can take quite a while and I’ve been intermittently getting No space left on device errors.

As I understand it, the available disk space is determined by the storage I configured when I created the custom AMI, which I intentionally made quite small.

To deal with the limited disk space, I think I could attach a larger volume to the instances that AWS Batch launches, but the variability in file size makes it difficult to predict what volume size I’d need ahead of time, and I’d rather not pay for more storage than I need by proactively mounting a huge volume. Given that, I was excited when I came across this blog post explaining how to set up EBS autoscaling on AWS Batch. I followed the guide, but wasn’t able to get autoscaling working due to various bugs. I then noticed that amazon-ebs-autoscale is deprecated, which might explain why I was having so much trouble getting it to work.

The amazon-ebs-autoscale page now recommends using Mountpoint for Amazon S3 or Amazon Elastic File System (EFS) instead. Amazon EFS appears to be an alternative to S3 that costs considerably more, and it’s not clear to me whether Nextflow supports it, so I’d say it’s a no-go. Mountpoint for Amazon S3 appears to be an alternative to Seqera’s Fusion file system, and after reading this article explaining the differences, I’d probably opt for the Fusion file system.
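
Based on that, and on the Fusion docs, my rough understanding is that the configuration would boil down to something like the following (placeholder names again, and I may well be misunderstanding, hence the questions below):

// What I think the Fusion-based setup would look like (placeholders; possibly wrong)
fusion.enabled   = true
wave.enabled     = true                   // the docs say Fusion is delivered via Wave containers
process.executor = 'awsbatch'
process.queue    = 'my-batch-queue'       // placeholder; the compute environment would use NVMe instance types
process.scratch  = false                  // per the Fusion docs
workDir          = 's3://my-bucket/work'  // the S3 bucket is used directly as the work directory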

Where I need input:
Looking at the Fusion file system documentation, it seems like I can do away with the custom AMI I made, but I still have some outstanding questions that aren’t addressed by the documentation:

Q. The documentation states that “Fusion file system allows the use of an S3 bucket as a pipeline work directory with the AWS Batch executor” and also that one should “make sure to use instance types that provide a NVMe disk as instance storage”. Given that, where does Nextflow store the data that it processes in each step of the pipeline? Does it copy from S3 to the AWS Batch EC2 instance NVMe storage, process the data there, and then copy it back to the S3 work directory?

Q. Assuming that the data gets copied from S3 to the instance’s NVMe storage, is there a way to configure autoscaling so that you don’t run out of disk space but also don’t use more storage than you need?

Q. The documentation provides this example configuration for using AWS Batch with NVMe disks:

aws.batch.volumes = '/path/to/ec2/nvme:/tmp'
process.scratch = false

What is the “/path/to/ec2/nvme” one should use, and where do you find it?
And does Nextflow automatically know to mount that NVMe path as /tmp inside the Docker container, and to stage and process the data there, when process.scratch is set to false?

Q. So far everything I’ve asked about has had to do with managing disk space flexibly. Assuming I go with the Fusion file system, how can I also manage memory flexibly to avoid running out of RAM? I know that (per the documentation) I can set the memory requirements for each process. However, in bioinformatics work it can be difficult to know how much memory to request ahead of time, given (frequently) opaque tool documentation and (at times) significant variability in file size. The solution would seemingly be to enable swap memory in my containers. However, the documentation doesn’t explain where the swap space lives (i.e. does it know to use the NVMe disk, if available?) or what the --memory-swappiness parameter does exactly. Any suggestions on memory management are appreciated!
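
For what it’s worth, the closest thing I’ve found so far is the dynamic resources / retry pattern from the Nextflow docs, where a failed task is resubmitted with a larger memory request. It helps, but it still requires a sensible starting value and doesn’t answer the swap question. A minimal sketch (the process name, tool, and values are made up):

// Retry-with-more-memory pattern (illustrative values and a made-up tool)
process MY_TOOL {
    memory { 8.GB * task.attempt }                                          // raise the request on each attempt
    errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }   // retry on OOM-like exit codes
    maxRetries 3

    input:
    path input_file

    script:
    """
    some_memory_hungry_tool --input ${input_file}
    """
}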

Note: I ended up asking many of these questions (with a slightly improved understanding) over in the Seqera Slack, e.g. see here and here.

Once I have all the answers I need to establish a working setup, I plan to post the distilled information here.