Hi! I have a situation where I want to copy a large reference DB (for FCS-GX) into local memory on a cluster, only once, and then use it for many invocations of the same process.
Maybe I misread the code, but it seems that the current nf-core fcsgx/rungx module rclones the DB into memory each time the process is called and deletes it when done, which becomes a problem when the process is invoked dozens of times. Is there an elegant way to stage it in memory once, e.g. at the beginning of a dedicated subworkflow, and clean up only when all instances have finished?
Hiya. I’m one of the authors behind that module. The reason for loading and copying each time is that the jobs can be submitted to several nodes. It’s not ideal either, as it doesn’t let parallel jobs run on the same node. We haven’t come up with a best-practice approach yet that satisfies all use cases.
For your case, you can still use as much of the module as you like, then edit the code, and then use nf-core modules patch fcsgx/rungx to save your changes and apply them over any future updates to the module.
In your case, I would comment out the clean-up line (the trap), and modify the conditional to not error but just echo that the database is already in memory. Note you might still have to run this serially, because I don’t know the effects of running fcs on a partially loaded database (e.g. process 1 starts the copy, process 2 sees the directory but the copy is incomplete, and process 2 moves on to the rungx command anyway).
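To sketch what that edited script block could look like: a minimal bash function that loads the DB only if a completion sentinel is absent, echoing instead of erroring when it is already there, and using flock to serialize concurrent loaders on one node (guarding against the partially-loaded-database race described above). This is an illustrative sketch, not the actual module code: the function name, sentinel file, and lock file are hypothetical, and plain cp -r stands in for the module’s rclone copy step.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch of an idempotent, race-safe load step.
# load_gx_db SRC DEST — copy SRC into DEST (e.g. under /dev/shm) exactly once.
load_gx_db() {
    local src=$1 dest=$2
    mkdir -p "$(dirname "$dest")"
    (
        # flock (util-linux) serializes concurrent loaders on the same node,
        # so a second process waits instead of seeing a half-copied directory
        flock 9
        if [ -e "$dest/.load_complete" ]; then
            echo "database already in memory: $dest"
        else
            cp -r "$src" "$dest"          # stand-in for the module's rclone copy
            touch "$dest/.load_complete"  # sentinel: only written after a full copy
            echo "database loaded: $dest"
        fi
    ) 9>"$dest.lock"
}
```

The sentinel file is only created after the copy finishes, so a process that crashes mid-copy never fools later invocations into skipping the load.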
If you’re running your workflow locally on a single node, then you can split the loading of the database into a separate process and supply the path to the downstream rungx process. You can then use .collect on an output channel of the rungx process to trigger a cleanup process that removes the database from memory.
Thanks, Mahesh. The method in your last paragraph is exactly what I started thinking about, but then I became worried that it would mess up caching. Anyway, I will try your suggestions and get back to this thread.
Another solution is probably to .collect the inputs and loop over them one by one in a script body, but that will require additional effort to carry the metadata through it all.
Exactly. I was about to come back and suggest this as well, but as you rightly say, it comes with the issue of managing the metadata.
The method in your last paragraph is exactly what I started thinking about, but then became worried that it would mess up caching.
I have to admit I hadn’t thought it through that much, but that will definitely mess up caching, because the output of the db-loading process will no longer be in memory after the clean-up. However, if you use a file on disk as a dummy output while writing the database to memory, it should still work correctly.
So: load the DB into memory and write out a dummy file. Pass the dummy as input to the rungx processes (and with it, perhaps, the in-memory path as a string), and then do the clean-up. As long as the memory location is not a declared output, caching should also work cleanly.
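Putting the load-once / run-many / collect-then-cleanup idea together, a hypothetical DSL2 sketch could look like the following. All process names, file names, the ch_fasta input channel, and the run_gx command line are illustrative assumptions, not the real module; the key points are the on-disk dummy output (so caching has something to hash) and the .collect() that gates the cleanup.

```nextflow
// Single-node sketch only: in a distributed setting there is no guarantee
// downstream jobs land on the node holding /dev/shm/gxdb.

process LOAD_DB {
    input:
    path db_source

    output:
    path 'db_ready.txt'                  // dummy file on disk, so caching works

    script:
    """
    cp -r ${db_source} /dev/shm/gxdb     # stand-in for the module's rclone copy
    echo /dev/shm/gxdb > db_ready.txt    # record the in-memory path as a string,
                                         # not as a declared path output
    """
}

process RUNGX {
    input:
    tuple val(meta), path(fasta)
    path db_ready                        // dummy input creates the dependency on LOAD_DB

    output:
    tuple val(meta), path("${meta.id}.fcs_gx_report.txt")

    script:
    """
    run_gx --fasta ${fasta} --gx-db \$(cat ${db_ready}) --out-basename ${meta.id}
    """
}

process CLEANUP_DB {
    input:
    val done                             // collected rungx outputs gate the cleanup

    script:
    """
    rm -rf /dev/shm/gxdb
    """
}

workflow {
    db_ready = LOAD_DB(file(params.gxdb_source))
    reports  = RUNGX(ch_fasta, db_ready)
    CLEANUP_DB(reports.collect())        // fires only after every rungx task is done
}
```

Because only db_ready.txt is a declared output, a resumed run re-hashes the dummy file rather than the (now deleted) in-memory copy, which is what keeps caching clean.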
Between runs perhaps (in a local setting), but not between processes. storeDir would first check whether the database is in memory (assuming that works; I haven’t checked), and not rerun if the db is there. That process output would then be passed to the rungx process, which would use it. In a local setting it makes little difference; in a distributed setting there’s no guarantee the subsequent job lands on the same node (which is why the db is loaded into memory in the same process).