With Fusion based on (unlimited) object storage, how much space is needed on the host OS itself?

I’m curious about the use of /tmp storage in the context of Fusion, and I’m just thinking out loud here :thinking:

Since Fusion is effectively powered by (unlimited) object storage backends, what is the maximum storage needed on the host itself?

I understand you’re asking about the minimum storage required on the host system. Fusion does not need to store an entire file in /tmp, because it splits files into chunks of no more than 250 MB. This means your /tmp directory can be smaller than the largest file you intend to process.
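
To illustrate why chunking bounds local disk usage, here is a minimal sketch; this is not Fusion’s actual implementation, just the general streaming pattern, with `upload_chunk` as a hypothetical placeholder for whatever call pushes bytes to the object storage backend:

```python
CHUNK_SIZE = 250 * 1024 * 1024  # 250 MB, the chunk limit described above

def stage_to_object_storage(source, upload_chunk):
    """Stream a file-like `source` in fixed-size chunks.

    Only one chunk needs to be buffered locally at a time, so peak
    local usage stays near CHUNK_SIZE regardless of total file size.
    `upload_chunk` is hypothetical, e.g. an S3 multipart-upload part.
    """
    part = 1
    while True:
        chunk = source.read(CHUNK_SIZE)
        if not chunk:
            break
        upload_chunk(part, chunk)
        part += 1
```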

The recommended size for /tmp depends on the types of pipelines you are running. For instance, we have observed optimal results with a 100 GB /tmp on large EC2 instances handling approximately 10 concurrent tasks for the standard nf-core/rnaseq (full test profile) pipeline. Since all tasks share the same /tmp, this allocation amounts to about 10 GB per task.
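
To make the arithmetic concrete, here is a back-of-the-envelope sizing helper. The 10 GB-per-task figure comes from the 100 GB / 10-task observation above; treat anything else (and the default itself, for other pipelines) as an assumption to tune:

```python
def recommended_tmp_size_gb(concurrent_tasks: int, per_task_gb: float = 10.0) -> float:
    """Rough /tmp sizing: all tasks share /tmp, so total ~= tasks x per-task share."""
    return concurrent_tasks * per_task_gb

# e.g. an instance expected to run 16 concurrent tasks:
print(recommended_tmp_size_gb(16))  # -> 160.0 (GB)
```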

Nonetheless, it is important to note that the Fusion garbage collector is not yet fully efficient, and there is room for improvement. We plan to focus on this aspect in the coming year.

Thanks Jordi! This sheds more light on what I was curious about with Fusion.

One suggestion (as a nice-to-have feature) is to implement a `fusion benchmark` command that can test these metrics on a specific infrastructure; a rough sketch of what it might measure is below. Since the plan is to support multiple S3 API providers, that opens a pathway to using different cloud providers.

It’d be nice to know the capability of a node (+ internet + S3 backend) on non-AWS infrastructure.
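
As a sketch of what such a benchmark might measure, here is a minimal throughput probe using boto3 against any S3-compatible endpoint. This is not an existing `fusion` subcommand, and the endpoint, bucket, key, and object size are all placeholders for your own infrastructure:

```python
import time

import boto3


def s3_throughput_mb_s(endpoint_url: str, bucket: str,
                       key: str = "fusion-bench-test", size_mb: int = 256):
    """Measure sequential PUT/GET throughput against an S3-compatible backend.

    Works with non-AWS providers that expose an S3 API via `endpoint_url`.
    Note the test payload is held in memory, so keep `size_mb` modest.
    """
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    payload = b"\0" * (size_mb * 1024 * 1024)

    # Upload throughput
    start = time.monotonic()
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    put_mb_s = size_mb / (time.monotonic() - start)

    # Download throughput
    start = time.monotonic()
    s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    get_mb_s = size_mb / (time.monotonic() - start)

    s3.delete_object(Bucket=bucket, Key=key)  # clean up the test object
    return put_mb_s, get_mb_s
```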

Eagerly looking forward to what Fusion can enable for cloud-agnostic infra! :star_struck:
