Handling lots of IO (large files or lots of small files) on HPC systems

Context

Nextflow uses files to stitch a workflow together. However, creating large numbers of small files and directories is ill-suited to the parallel filesystems used on HPC systems, which are shared resources with finite (if large) inode quotas and are optimised for large files and high throughput. This is somewhat at odds with how Nextflow operates and how it is often used.

Use on HPC systems

Using ephemeral filesystems

One solution, for workflows that create large numbers of small files and directories, is to use node memory as a filesystem via /tmp, tar up the results, and then move the output and the log from the ephemeral storage of /tmp to longer-term storage. The same approach applies to node-local NVMe storage.

However, this means that the move step is not handled by Nextflow.

  • Would it be possible to construct a more complex publishDir process which does a tar followed by a mv?
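
For illustration, a minimal sketch of the tar-then-move pattern I have in mind (my_tool, the sample input, and params.outputDirectory are placeholders; the scratch directive stages the task into node-local storage, and only the tarball is published back to the shared filesystem):

process RUN_AND_TAR {

    // run the task in node-local ephemeral storage rather than on the parallel filesystem
    scratch '/tmp'

    // publish only the single tarball back to long-term storage
    publishDir "${params.outputDirectory}", mode: 'move'

    input:
        path(sample)

    output:
        path("${sample.baseName}.tar.gz")

    """
    my_tool ${sample} --outdir results/               # placeholder tool that creates many small files
    tar -czf ${sample.baseName}.tar.gz results/
    """
}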

Using squashfs (or other such filesystems)

Another option is to use a squashfs filesystem and overlay it while running a container. The issue here is that one would want all jobs of a certain type to write into the overlay. I am not certain what the best practice would be, other than ensuring that the Nextflow process ignores the workdir for writing results and writes its output into the overlay, that these processes do not use publishDir, and that a separate process manages the movement of the squashfs file.
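
To illustrate the wiring only (not a recommendation), here is a sketch assuming Singularity/Apptainer, a writable overlay image created beforehand (e.g. with apptainer overlay create), and placeholder names (params.toolImage, params.overlayImage, my_tool). Note that, as far as I understand, a writable overlay can only be attached by one container at a time, which is part of my uncertainty about running many jobs against it:

process WRITE_INTO_OVERLAY {

    container params.toolImage                             // placeholder container image
    containerOptions "--overlay ${params.overlayImage}"    // writable overlay that receives the results

    input:
        path(sample)

    output:
        val(sample), emit: done    // completion signal only; the real output lives inside the overlay

    """
    my_tool ${sample} --outdir /results/${sample.baseName}    # /results exists only inside the overlay
    """
}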

Large files and moving

Another issue is that, for large files, data movement can take quite a while. This can needlessly chew up compute-node resources and allocation, especially when the system has dedicated data-mover nodes that could do the work instead.

Currently, the best solution I am aware of is to have a separate process that handles moving the data (roughly as sketched after the question below). Here again, a more bespoke publishDir would be useful.

  • Is it possible to give publishDir a separate partition in which to submit a new Slurm/PBS job, while keeping it tightly tied, in Nextflow's view, to the job that has just produced the data?
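
Roughly what I mean by a separate data-moving process (copyq and params.finalDestination are placeholders; the point is that the transfer job lands on a data-mover partition instead of consuming compute allocation):

process MOVE_RESULTS {

    executor 'slurm'
    queue 'copyq'                  // placeholder name for a dedicated data-mover partition
    cpus 1
    time '4h'

    input:
        path(result)

    output:
        val(result), emit: moved   // downstream processes can wait on this signal

    """
    rsync -aL ${result} ${params.finalDestination}/
    """
}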

Other comments

I haven’t yet read in detail what Fusion offers, so it may well solve these problems. I am curious to hear what other people have done.


Point 1: You can already move a file to an output using the publishDir option mode: 'move'. It’s risky, though, because you might move a file away from where it is needed (the work directory) into a publish directory. You could create a compressed archive in a final process combined with mode: 'move' to achieve what you wanted (see below).

There is also the scratch process directive, which runs a task’s work in fast node-local storage and may solve some of the problems you are seeing. It applies to the process’s working files rather than to file publishing, though.
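
For example, something along these lines in nextflow.config (the label and path are just illustrative):

process {
    withLabel: 'big_io' {
        scratch = '/tmp'           // or scratch = true to use the node's $TMPDIR
    }
}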

Question 2: I haven’t used squashfs, but this doesn’t sound like a very appropriate solution for bioinformatics workloads, which do a lot of writing. It could be great for reference data, however?

Question 3: If you want to have a separate process for publishing data then honestly I’d suggest…making a separate process. It provides all the control and features you’d expect from a Nextflow process. Here’s a template for you to use:

process PUBLISH {

    publishDir "${params.outputDirectory}", mode: 'copy', overwrite: false

    input:
        val(archiveName)
        path("*")        // stage every file to be archived into the task directory

    output:
        path("*.tar.gz"), emit: tar

    """
    tar -czvf ${archiveName}.tar.gz *
    """
}
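
Wired up at the end of a workflow it would look something like this (MY_TOOL stands in for whatever upstream process produces the files):

workflow {
    samples_ch = Channel.fromPath(params.samples)      // placeholder input channel
    MY_TOOL(samples_ch)                                // upstream analysis producing many files
    PUBLISH('run_archive', MY_TOOL.out.collect())      // bundle everything into one published tarball
}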

One of the big challenges with Nextflow is that it needs to support everyone. Customising it for your situation could negatively impact someone else in an unpredictable manner; it’s deliberately generic, with configuration providing the specificity for your system.
