How to handle in Nextflow docker mounting of symlinked files within a symlinked directory

I’m struggling with an issue with nf-core modules I have made for building Kraken2.

For context, Kraken2 uses a directory as its database. Building the database requires two separate commands: in the first step you add a collection of files to a directory; in the second, you execute the actual ‘build’ command that creates the custom files used by Kraken2 itself.
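Concretely, the two Kraken2 steps look something like this (the database path and input file name here are just illustrative):

```shell
# Step 1: populate the database directory with taxonomy and sequences
kraken2-build --download-taxonomy --db my_kraken2_db
kraken2-build --add-to-library genome.fna --db my_kraken2_db

# Step 2: run the actual build on that directory
kraken2-build --build --db my_kraken2_db
```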

Following nf-core module structure, I have split these two steps into two separate modules. This split has the added benefit that, because each of the two commands takes a long time, separating them makes the workflow more HPC friendly.

A user of a pipeline where I’m using this (nf-core/createtaxdb) has reported an issue with this setup when using container systems. The second module (the ‘build’ one that receives the directory as input) fails when the pipeline is supplied with a file on a local filesystem, because the tool cannot find one of the files inside the received directory.

Command error:  build_db: error opening taxonomy//nodes.dmp: No such file or directory

I’ve identified the problem as Docker not (being able to) follow the two layers of symlinks (i.e. the original file on the local filesystem is symlinked into the directory, and that directory is then itself symlinked from the first process into the second). I can confirm this by taking the docker run command from the task’s .command.run script, tweaking it into an interactive session, and looking inside the container. Files supplied to the module via a URL, and thus staged into the working directory by Nextflow, are accessible within the Docker container, but the files from the local filesystem show up with the red ‘broken link’ bash colour.
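The chain can be reproduced without Docker at all; here is a standalone sketch (all paths hypothetical, mimicking Nextflow’s stage-in behaviour) that shows why a container bind-mounting only the work directory ends up with a dangling link:

```shell
# Layout: a "local" file outside the work dir, symlinked into the first
# task's output directory, which is in turn symlinked into the second task
mkdir -p demo/data demo/work/task1/taxonomy demo/work/task2
echo "nodes data" > demo/data/nodes.dmp

# "process 1": the local file is symlinked into the database directory
ln -s "$PWD/demo/data/nodes.dmp" demo/work/task1/taxonomy/nodes.dmp

# "process 2": the whole directory from process 1 is symlinked into the new task dir
ln -s "$PWD/demo/work/task1/taxonomy" demo/work/task2/taxonomy

# On the host, both layers resolve fine:
cat demo/work/task2/taxonomy/nodes.dmp   # prints "nodes data"

# But the final target lives outside demo/work: a container that only
# bind-mounts the work directory has nothing at this resolved path
readlink -f demo/work/task2/taxonomy/nodes.dmp
```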

To demonstrate the problem, I’ve made a little-ish reprex of the issue

kraken2-symlink-issue-reprex.zip (1.4 KB)

You can then run nextflow run main.nf -c nextflow.config, where you will get the error

Command error:  cat: taxonomy/a_staged.txt: No such file or directory

I was wondering if anyone has suggestions for a workaround to make Docker follow all the symlinks, and whether there is a way to get Nextflow to do this.

Workarounds I’ve found are:

  • Use stageInMode 'copy', but this is unsatisfying as duplicating the very large database files wastes a lot of disk space
  • Merging the two modules into one, but this is unsatisfying as it combines two very long-running commands into a single task
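For reference, the first workaround as a config fragment (the BUILD_DB process selector is hypothetical):

```nextflow
// nextflow.config fragment: stage inputs of the build process by copying
// instead of symlinking (BUILD_DB is a hypothetical process name)
process {
    withName: 'BUILD_DB' {
        stageInMode = 'copy'
    }
}
```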

One other suggestion I’ve been given (not yet tested) is to not emit a directory from the first module, but rather the individual files, and then ‘reconstruct’ the directory using stageAs in the downstream module.
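A minimal sketch of that untested suggestion, assuming the first module emits the individual files (the process name and tool invocation are illustrative, not the actual module):

```nextflow
process BUILD_DB {
    input:
    // Re-stage each file under taxonomy/, so Nextflow creates a fresh
    // symlink (and container mount) per file rather than one directory
    // symlink whose targets lie outside the mounted paths
    path taxonomy_files, stageAs: 'taxonomy/*'

    output:
    path 'db'

    script:
    """
    build_db --taxonomy taxonomy/ --db db/
    """
}
```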

But the two tested workarounds are not great.

Doesn’t one also mount the entire work directory if you have multiple symlinks? They’re relative, right? Otherwise the whole filesystem structure would need to be mirrored inside the container.

I thought that should work at first, but I think the problem is that the first inputs are linked from outside of the work directory.


I agree with pontus. I think the problem is that the mounts are correctly resolved when the files are passed as inputs, which is why the symlinking during the first process works. When those files are then passed to process 2, the links still point to the same place, but Nextflow does not resolve the symlink when creating the container mounts, only the path to the symlink itself, so the symlink cannot be followed inside the second container.

Copying the inputs solves that problem. Another option may be to pass the same set of files as inputs to the second process as well, so they are linked again for that process and the correct mounts are added during container start.
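A hypothetical workflow wiring for that idea (process and parameter names invented for illustration): the original files go to both processes, so the second task’s container also gets bind mounts for their real locations on the host filesystem.

```nextflow
workflow {
    // the original local files, collected into a single list value
    taxonomy_files = Channel.fromPath(params.input_files).collect()

    ADD_TO_DB(taxonomy_files)

    // pass the built directory AND the original files, so Nextflow
    // resolves and mounts each file's true location for this task too
    BUILD_DB(ADD_TO_DB.out, taxonomy_files)
}
```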

Yes, exactly: what Pontus described is the problem.

And @nschan also describes the behaviour that I think is happening…

I was hoping someone might have a clever solution to get Nextflow to resolve all the symlinks, but maybe I do indeed just have to ‘reconstruct’ the directory structure of the database in the second process.