Context
Nextflow uses files to stitch a workflow together. However, creating a large number of small files and directories is ill suited to the parallel filesystems used on HPC systems, which are shared resources with limited, if large, inode quotas and are optimised for large files with high throughput. This is somewhat at odds with how Nextflow operates and how it is often used.
Use on HPC systems
Using ephemeral filesystems
One solution for workflows that create a large number of small files and directories is to use node memory as a filesystem through /tmp, tar up the results, and then move the output and the log from the ephemeral storage of /tmp to longer-term storage. The same would apply to the use of node-local NVMe.
However, this means that the move is not handled by Nextflow.
- Would it be possible to construct a more complex publishDir process which does a tar followed by a mv?
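To make the pattern concrete, here is a minimal Python sketch of the "work in /tmp, tar, then move" steps described above. It is not something Nextflow does for you, and it is not a publishDir feature. The function name run_and_archive, the task callback, and the archive name results.tar.gz are all hypothetical names chosen for illustration.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def run_and_archive(task, dest_dir, scratch_root="/tmp"):
    """Run task(workdir) in node-local scratch, then archive and move the results.

    `task` is a hypothetical callable that writes its output into `workdir`.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Everything under `scratch` lives on the ephemeral filesystem and is
    # deleted when the context manager exits.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        workdir = Path(scratch) / "results"
        workdir.mkdir()
        task(workdir)  # may create many small files and directories

        # Pack the small files into one archive, so the shared filesystem
        # only ever sees a single large file.
        archive = Path(scratch) / "results.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(workdir, arcname="results")

        # Move the single archive to longer-term storage.
        final_path = dest_dir / archive.name
        shutil.move(str(archive), str(final_path))
    return final_path
```

In a real workflow the same three steps would more likely live in the shell script of a process, but the ordering (compute in scratch, archive, move once) is the point of the sketch.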
Using squashfs (or other such filesystems)
Another solution for workflows is to use a squashfs filesystem and overlay it while running a container. The issue here is that one would want all jobs of a certain type to write into the overlay. I am not certain what the best practice would be here, other than ensuring that the Nextflow process ignores the workdir for writing results and writes its output into the overlay, ensuring that these processes do not run a publishDir, and having a separate process manage the movement of the squashfs file.
Large files and moving
Another issue is that, for large files, the data movement can take quite a while. This can needlessly chew up resources and allocation on compute nodes when there are dedicated data mover nodes available.
Currently, the best solution I am aware of is to have a separate process to handle moving data. Here again, a more bespoke publishDir would be useful.
- Is it possible to provide publishDir a new partition in which to submit a new slurm/pbworks job but have it tightly tied in the view of Nextflow to the job that has just been run to produce the data?
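At the scheduler level, one way to tie a move to the job that produced the data is a dependent job on a different partition. The Python sketch below only builds the sbatch command line, it does not submit anything, and it does not make Nextflow aware of the link between the two jobs. It covers Slurm only. The function name mover_job_command and the partition name datamover are hypothetical, and the actual partition name is site-specific.

```python
import shlex


def mover_job_command(producer_job_id, src, dest, partition="datamover"):
    """Build an sbatch command that moves `src` to `dest` after the producer job succeeds.

    `partition` defaults to a hypothetical data mover partition name.
    """
    # Quote the paths so the wrapped shell command survives spaces etc.
    move = f"mv {shlex.quote(src)} {shlex.quote(dest)}"
    return [
        "sbatch",
        f"--partition={partition}",
        # afterok: start only once the producer job has completed successfully.
        f"--dependency=afterok:{producer_job_id}",
        # --wrap lets sbatch run a one-line command without a script file.
        f"--wrap={move}",
    ]
```

The returned list could be passed to subprocess.run by whatever submits the producer job. The compute allocation is then released as soon as the producer finishes, and the transfer runs on the data mover partition.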
Other comments
I haven’t yet read what Fusion offers in detail, so it might solve these problems. I am curious to hear what other people have done.