I am working on a Nextflow DSL 1 script and have a problem.
At the very end of the pipeline run, I need to process log files (produced by the processes) that were stored in a non-work publish directory during the run. I understand that this is inadvisable, but it seems like the simplest solution.
I would like a way to run a few functions that process the log files after the pipeline has finished running processes but before it exits entirely. I remember finding a hook for this, but as I recall it was unreliable because Nextflow could quit before the additional functions finished.
Reading the Nextflow discussions, I see a question and answer about this problem where the proposed solution is to add a process that collects outputs from the other processes and uses them to do the final processing (Collating all files at the end of a pipeline · nextflow-io/nextflow · Discussion #2925 · GitHub). I am not sure this will work because some of the processes may be turned off with parameters. That is, I added an output channel containing only a completion flag to about 11 processes, but not all of those processes are necessarily enabled, so the channels from disabled processes will never emit completion flags. Does the collect() operator ignore channels from disabled processes?
More precisely, I wonder what happens when a process has more than one input channel but one or more of those channels are 'empty' because the processes that feed them are disabled via their when blocks. Does such a process still execute based on the non-empty channels, as I hope? I would prefer not to replace the when blocks with conditional blocks, because the upstream processes have other output channels in addition to the completion-flag channel. I suppose there are ways to work around this, but workarounds spread widely in a complex script and invite errors.
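To make this concrete, the pattern looks roughly like the following DSL1 sketch; the process names, scripts, and parameter are invented stand-ins for the real ones:

```groovy
// DSL1 sketch with invented names.
params.run_qc = true

process qc {
    when:
    params.run_qc              // may be false, disabling the process

    output:
    val true into qc_done_ch   // completion flag only

    script:
    "run_qc.sh > qc.log"
}

process align {
    output:
    val true into align_done_ch

    script:
    "run_align.sh > align.log"
}

// If qc is disabled, qc_done_ch never emits anything. Does collect()
// still fire for finalize_logs, or does it wait forever?
process finalize_logs {
    input:
    val flags from qc_done_ch.mix(align_done_ch).collect()

    script:
    "process_logs.sh"
}
```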
DSL1 is no longer supported, so I strongly recommend upgrading your workflow to DSL2. Newer versions of Nextflow won't run your pipeline, and many people learned Nextflow with DSL2 from the start, so it's also more difficult to find people trained in DSL1 to help you.
Having said that, what you're describing is a bit confusing. All tasks have their files staged in their work directories. You can also publish some files to other directories, but that doesn't mean they won't also be in the task directory. And if you are talking about output files from other workflow runs, you can provide them as inputs and it should work just fine.
Things get trickier if you want to do something with files after they have been published, such as adding one last process to check the published files. Publishing happens asynchronously, which means some files may only be published once tasks from all processes have finished, so you can't have a process take care of that. Using event handlers may work, but AFAIK it's not guaranteed, for the same reason.
If you have optional inputs, your task will do what you tell it to do. This kind of thing is in your hands for now, though DSL2 should soon have built-in support for optional inputs; see the sketch below for one common way to handle a channel that may never emit.
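For example, one guard people often use is ifEmpty, which injects a placeholder value into a channel that would otherwise stay empty, so a downstream process is not blocked waiting on it (a sketch, reusing the invented names from above):

```groovy
// If qc was disabled, qc_done_ch is empty; ifEmpty supplies a
// placeholder so collect() still receives something and emits.
qc_done_safe = qc_done_ch.ifEmpty('SKIPPED')

process finalize_logs {
    input:
    val flags from qc_done_safe.mix(align_done_ch).collect()

    script:
    "process_logs.sh"
}
```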
If you can provide a minimal reproducible example, it's more likely you will get help, but I still recommend upgrading your workflow to DSL2.
I intend to convert our scripts to DSL2 when time permits, but that may be far in the future.
On the question of processing at the end of the run: at one time I tried using the onComplete event handler, but it appeared that Nextflow could terminate before the command run from onComplete had finished, which made it unattractive. Perhaps I am mistaken, or perhaps Nextflow no longer terminates before onComplete finishes.
I am aware that Nextflow publishes files asynchronously, but the log files are small, so I imagine publishing them happens quickly, and a short sleep at the start of the onComplete handler should ensure that the copies finish before the handler begins processing them. Roughly what I have in mind is shown below.
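Concretely, something like this, where the publish path and the processing step are placeholders for the real ones:

```groovy
// DSL1 sketch: pause to let asynchronous publishDir copies settle,
// then post-process the published log files.
workflow.onComplete {
    sleep(30_000)                                // 30 s grace period
    def logDir = file("${params.outdir}/logs")   // invented path
    logDir.eachFile { f ->
        println "post-processing ${f.name}"      // real work goes here
    }
}
```

If Nextflow really can exit before the handler finishes, the sleep only narrows the race rather than eliminating it, so I accept this as a best-effort approach.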