Hi! I have a situation where I want to copy a large reference DB (for FCS-GX) into local memory on a cluster, only once, and then use it for many invocations of the same process.
Maybe I misread the code, but it seems that the current nf-core fcsgx/rungx module rclones the DB into memory each time the process is called and deletes it when done, which becomes a problem when the process is invoked dozens of times. Is there an elegant way to stage it in memory once, e.g. at the beginning of a dedicated subworkflow, and clean up only when all instances have finished?
Hiya. I’m one of the authors behind that module. The reason for loading and copying each time is that the jobs can be submitted to several nodes. It’s not ideal either, as it doesn’t let parallel jobs run on the same node. We haven’t come up with a best-practice approach yet that satisfies all use cases.
For your case, you can still use as much of the module as you like, then edit the code, and then use nf-core modules patch fcsgx/rungx to save your changes and apply them over any future updates to the module.
In your case, I would comment out the clean-up line (the trap), and modify the conditional to not error but just echo that the database is already in memory. Note you might still have to run this serially, because I don’t know the effects of running fcs on a partially loaded database (e.g. process 1 starts the copy, process 2 sees the directory but the copy is incomplete, and process 2 moves on to the rungx command anyway).
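To sketch what that edited script block could look like: a minimal bash function that loads the DB only if a completion sentinel is absent, echoing instead of erroring when it is already there, and using flock to serialize concurrent loaders on one node (guarding against the partially-loaded-database race described above). This is an illustrative sketch, not the actual module code: the function name, sentinel file, and lock file are hypothetical, and plain cp -r stands in for the module’s rclone copy step.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch of an idempotent, race-safe load step.
# load_gx_db SRC DEST — copy SRC into DEST (e.g. under /dev/shm) exactly once.
load_gx_db() {
    local src=$1 dest=$2
    mkdir -p "$(dirname "$dest")"
    (
        # flock (util-linux) serializes concurrent loaders on the same node,
        # so a second process waits instead of seeing a half-copied directory
        flock 9
        if [ -e "$dest/.load_complete" ]; then
            echo "database already in memory: $dest"
        else
            cp -r "$src" "$dest"          # stand-in for the module's rclone copy
            touch "$dest/.load_complete"  # sentinel: only written after a full copy
            echo "database loaded: $dest"
        fi
    ) 9>"$dest.lock"
}
```

The sentinel file is only created after the copy finishes, so a process that crashes mid-copy never fools later invocations into skipping the load.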
If you’re running your workflow locally on a single node, then you can split the loading of the database into a separate process and supply the path to the downstream rungx process. You can then use .collect on an output channel of the rungx process to trigger a cleanup process that removes the database from memory.
Thanks, Mahesh. The method in your last paragraph is exactly what I started thinking about, but then I became worried that it would mess up caching. Anyway, I will try your suggestions and get back to this thread.
Another solution is probably to .collect the inputs and loop over them one by one in a script body, but that will require additional effort to carry the metadata through it all.
Exactly. I was about to come back and suggest this as well, but as you rightly say, it comes with the issue of managing the metadata.
The method in your last paragraph is exactly what I started thinking about, but then became worried that it would mess up caching.
I have to admit I hadn’t thought it through that much, but that will definitely mess up caching, because the output of the db-loading process will no longer be in memory after the clean-up. However, if you use a file on disk as a dummy output while writing the database to memory, it should still work correctly.
So: load the DB into memory and write out a dummy file. Pass the dummy as input to the rungx processes (and with it, perhaps, the in-memory path as a string), and then do the clean-up. As long as the memory location is not a declared output, caching should also work cleanly.
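Putting the load-once / run-many / collect-then-cleanup idea together, a hypothetical DSL2 sketch could look like the following. All process names, file names, the ch_fasta input channel, and the run_gx command line are illustrative assumptions, not the real module; the key points are the on-disk dummy output (so caching has something to hash) and the .collect() that gates the cleanup.

```nextflow
// Single-node sketch only: in a distributed setting there is no guarantee
// downstream jobs land on the node holding /dev/shm/gxdb.

process LOAD_DB {
    input:
    path db_source

    output:
    path 'db_ready.txt'                  // dummy file on disk, so caching works

    script:
    """
    cp -r ${db_source} /dev/shm/gxdb     # stand-in for the module's rclone copy
    echo /dev/shm/gxdb > db_ready.txt    # record the in-memory path as a string,
                                         # not as a declared path output
    """
}

process RUNGX {
    input:
    tuple val(meta), path(fasta)
    path db_ready                        // dummy input creates the dependency on LOAD_DB

    output:
    tuple val(meta), path("${meta.id}.fcs_gx_report.txt")

    script:
    """
    run_gx --fasta ${fasta} --gx-db \$(cat ${db_ready}) --out-basename ${meta.id}
    """
}

process CLEANUP_DB {
    input:
    val done                             // collected rungx outputs gate the cleanup

    script:
    """
    rm -rf /dev/shm/gxdb
    """
}

workflow {
    db_ready = LOAD_DB(file(params.gxdb_source))
    reports  = RUNGX(ch_fasta, db_ready)
    CLEANUP_DB(reports.collect())        // fires only after every rungx task is done
}
```

Because only db_ready.txt is a declared output, a resumed run re-hashes the dummy file rather than the (now deleted) in-memory copy, which is what keeps caching clean.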
Between runs perhaps (in a local setting), but not between processes. storeDir would first check whether the database is in memory (assuming that works; I haven’t checked), and not rerun if the db is there. That process output would then be passed to the rungx process, which would use it. In a local setting it makes little difference; in a distributed setting there’s no guarantee the subsequent job lands on the same node (which is why the db is loaded into memory in the same process).