I’m struggling with an issue with nf-core modules I have made for building Kraken2.
For context, Kraken2 uses a directory as a database. To build the database, you have to execute two separate commands. In the first step, you add a collection of files to a directory. In the second, you execute the actual ‘build’ command, which creates the custom files used by Kraken2 itself.
Following nf-core module structure, I have split these two steps into two separate modules. This has the added benefit that, because each of the two commands takes a long time, separating them is more HPC friendly.
A user of the pipeline where I’m using these modules (nf-core/createtaxdb) has reported an issue when running with container systems. The second module (the ‘build’ one that receives the directory as input) fails when the pipeline is supplied with a file on a local file system, because the tool cannot find one of the files inside the received directory.
Command error: build_db: error opening taxonomy//nodes.dmp: No such file or directory
I’ve traced this to Docker not (being able to) follow the two layers of symlinks: the original file on the filesystem is symlinked into the directory, and the directory itself is then symlinked from the first process to the second. I confirmed this by taking the docker run command from the .command.run script, tweaking it to start an interactive session, and looking inside. Files supplied to the module via a URL, and thus staged into the working directory by Nextflow, are accessible within the Docker container, but files from the local system show up in the red ‘broken link’ bash colour.
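To make the layout concrete, here is a minimal sketch of the two symlink layers using plain shell, without Docker or Nextflow. The directory names (local_data, work/task1, work/task2) are made up for illustration, and it assumes the container can only see what sits under the work directory, which is my reading of the problem rather than something I have confirmed. It flags any link whose final target lies outside work/, i.e. the ones I would expect to dangle inside the container.

```shell
set -eu

# Hypothetical layout standing in for a Nextflow run: one file on the local
# filesystem and two task work directories. All names are illustrative.
root=$(mktemp -d)
root=$(cd "$root" && pwd -P)   # physical path, so prefix checks below are reliable
trap 'rm -rf "$root"' EXIT     # clean up the temporary tree when the script ends

mkdir -p "$root/local_data" "$root/work/task1/taxonomy" "$root/work/task2"
echo "nodes" > "$root/local_data/nodes.dmp"

# Layer 1: the first process links the local file into its output directory.
ln -s "$root/local_data/nodes.dmp" "$root/work/task1/taxonomy/nodes.dmp"

# Layer 2: the second process receives that directory as a symlink.
ln -s "$root/work/task1/taxonomy" "$root/work/task2/taxonomy"

# On the host, the whole chain resolves and the file is readable.
cat "$root/work/task2/taxonomy/nodes.dmp"

# Report links whose fully resolved target is outside work/.
# Assumption: only work/ would be visible inside the container.
# Note: readlink -f is not POSIX, but GNU coreutils and recent macOS have it.
outside=""
for link in "$root/work/task2/taxonomy"/*; do
    [ -L "$link" ] || continue
    target=$(readlink -f "$link")
    case "$target" in
        "$root/work/"*) ;;   # target stays inside the work directory
        *)
            outside="$outside$target"
            echo "escapes work dir: $link -> $target"
            ;;
    esac
done
```

Inside the interactive container session, something like find -L taxonomy -type l should list the links that fail to resolve there, since find -L only reports a symlink as type l when it cannot follow it.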
To demonstrate the problem, I’ve made a little-ish reprex of the issue:
kraken2-symlink-issue-reprex.zip (1.4 KB)
You can run it with nextflow run main.nf -c nextflow.config, which gives the error:
Command error: cat: taxonomy/a_staged.txt: No such file or directory
I was wondering whether anyone has suggestions for a workaround that makes Docker follow all the symlinks, or whether there is a way to get Nextflow to do this.
Workarounds I’ve found are:
- Use stageInMode 'copy', but this is unsatisfying as duplicating the very large files will hit the disk
- Merging the two modules into one, but this is unsatisfying as it’s merging two very long running processes
One other suggestion I’ve been given (not yet tested) is to not export a directory from the first module, but instead to emit all the files individually and then ‘reconstruct’ the directory using stageAs in the downstream module.
But the two tested workarounds are not great.