I’m curious regarding the use of /tmp storage in the context of fusionfs, and just thinking out loud here: since Fusion is effectively powered by the (unlimited) object storage backends, what is the maximum storage needed on the host itself?
I understand you’re asking about the minimum storage required on the host system. Fusion does not need to hold an entire file in /tmp, as it splits files into chunks of no more than 250MB. This means that your /tmp directory can be smaller than the largest file you intend to process.
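To make the chunking point concrete, here is a minimal, purely illustrative sketch in Python (boto3). It is not Fusion’s actual implementation, and the bucket, key, and scratch path are hypothetical, but it shows how range-based chunked staging keeps at most one chunk of up to 250MB on local disk at a time:

```python
# Illustrative only: this is NOT Fusion's internals, just the general idea of
# chunked staging. Local disk holds at most one <=250MB chunk per transfer,
# not the whole object. Bucket/key/scratch names are hypothetical.
import os
import boto3

CHUNK = 250 * 1024 * 1024  # 250MB, the chunk ceiling mentioned above

def stream_object(bucket: str, key: str, scratch: str = "/tmp/chunk.bin"):
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    for start in range(0, size, CHUNK):
        stop = min(start + CHUNK, size) - 1
        body = s3.get_object(Bucket=bucket, Key=key,
                             Range=f"bytes={start}-{stop}")["Body"]
        with open(scratch, "wb") as f:  # overwrite: only one chunk on disk
            f.write(body.read())
        yield scratch                   # hand the chunk to the consumer
    if os.path.exists(scratch):
        os.remove(scratch)
```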
The recommended size for /tmp depends on the types of pipelines you are running. For instance, we have observed optimal results with a 100GB /tmp on large EC2 instances handling approximately 10 concurrent tasks for the standard nf-core/rnaseq (full test profile) pipeline. Since all tasks share the same /tmp, this allocation amounts to about 10GB per task.
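As a back-of-envelope helper based on the numbers above (my own extrapolation, not an official sizing formula):

```python
# Rule of thumb from the thread: ~10GB of shared /tmp scratch per concurrent
# task. The per-task figure is pipeline-dependent; treat it as a starting
# point, not a guarantee.
def recommended_tmp_gb(concurrent_tasks: int, gb_per_task: float = 10.0) -> float:
    return concurrent_tasks * gb_per_task

# The EC2 example above: 10 concurrent nf-core/rnaseq tasks -> 100GB /tmp
print(recommended_tmp_gb(10))  # 100.0
```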
Nonetheless, it is important to note that the Fusion garbage collector is not fully efficient yet, and there is potential for enhancement. We plan to focus on improving this aspect in the coming year.
Thanks Jordi! This sheds more light on the questions I had about Fusion.
One suggestion (as a nice-to-have feature) is to implement a fusion benchmark command that can measure these metrics on a specific infrastructure. Since the plan is to support multiple S3 API providers, that opens up the pathway to using different cloud providers, and it’d be nice to know the capability of a node (+ internet + S3 backend) on non-AWS infrastructure. Something along the lines of the sketch below would already be useful.
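For illustration, a rough stand-in for what such a benchmark command could measure (no such command exists today; the endpoint and bucket names below are placeholders, and this only times raw boto3 transfers rather than anything Fusion-specific):

```python
# Hypothetical sketch: raw upload/download throughput from this node to an
# S3-compatible backend. Endpoint and bucket are placeholders.
import os
import time
import boto3

def s3_throughput(endpoint: str, bucket: str, size_mb: int = 256):
    s3 = boto3.client("s3", endpoint_url=endpoint)  # works with non-AWS providers
    payload = os.urandom(size_mb * 1024 * 1024)
    key = "fusion-bench/testfile"

    t0 = time.monotonic()
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    up = size_mb / (time.monotonic() - t0)

    t0 = time.monotonic()
    s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    down = size_mb / (time.monotonic() - t0)

    s3.delete_object(Bucket=bucket, Key=key)
    return up, down  # MB/s for upload and download

up, down = s3_throughput("https://storage.example.com", "bench-bucket")
print(f"upload: {up:.1f} MB/s, download: {down:.1f} MB/s")
```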
Eagerly looking forward to what Fusion can enable for cloud-agnostic infra!