Is there a way to pass an input to a process that does not affect the task hash? Example:
process P {
    input:
    val(cacheable)
    val(uncacheable)

    output:
    path(cacheable)
    val(uncacheable)

    shell:
    'touch !{cacheable}'
}
In the above, is there a way to cause P to compute a task hash that takes “cacheable” into account but not “uncacheable?” The goal behavior is that every time a channel item is input to P with a previously seen value of “cacheable” but a new value of “uncacheable”, it uses the cached output for cacheable, and outputs the new value of uncacheable. For my purposes, it would not be necessary to use or modify “uncacheable” within the process in any way – I just want to be able to link a particular uncacheable channel item with a particular set of cacheable channel items.
I don’t have an answer, but I am wondering if by ‘cacheable’ you are intending this behavior to be used with the -resume feature? So that a resumed run would skip some items and not others?
just guessing, but this also makes it sound as if you are trying to maintain and introspect some pipeline state between runs, where the state would be the list of all ‘cacheable’ / ‘uncacheable’ pairs that have already been seen.
for something like that, it seems like maybe you would instead want to skip the actual caching part and just build a Groovy list or map of all pairs that have already been completed, then use a conditional that selects one of several variations of the ‘script’ block depending on whether the input combination of items is present in the list. The list would be populated at run-time from some external file your pipeline could write and maintain with the list of completed tasks.
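Very roughly, something like this (the record file name and its format are made up, and writing outside the work directory like this sits outside Nextflow’s normal caching model, so treat it as a sketch of the idea rather than a recommendation):

process P {
    input:
    val(cacheable)
    val(uncacheable)

    script:
    // hypothetical run-maintained record of completed (cacheable, uncacheable) pairs
    def record    = file("${launchDir}/completed_pairs.tsv")
    def completed = record.exists()
        ? record.readLines().collect { it.tokenize('\t') }
        : []

    if( completed.contains([cacheable, uncacheable]) )
        """
        echo 'pair already recorded; skipping the expensive step'
        """
    else
        """
        touch ${cacheable}
        printf '%s\\t%s\\n' '${cacheable}' '${uncacheable}' >> ${launchDir}/completed_pairs.tsv
        """
}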
tbh I am not entirely sure how different this is from the normal -resume behavior
Thanks for your response. Let me give more context.
My workflow is structured around channel items that are LinkedHashMaps storing process outputs and per-sample attributes under meaningful keys. Over the course of the workflow, samples accumulate the whole history of their individual processing. This is extremely powerful for extensibility and maintainability, but has some downsides stemming from how Nextflow handles caching, staging, and channels.
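For illustration, a single channel item in this style might look roughly like the following (keys and filenames invented):

samples = Channel.of(
    [
        id       : 'sampleA',
        condition: 'wt',
        // downstream steps keep adding keys such as 'bam', 'pairs', 'stats', ...
        fastq    : ['sampleA_R1.fq.gz', 'sampleA_R2.fq.gz']
    ]
)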
Since Nextflow seems to have no way of telling it to stage a file to the work directory other than explicitly declaring it as a path in the input block, I have to extract files from the hashmap and put them in tuples to pass into the process, then turn the outputs back into a hashmap.
My goal is to add the process outputs to a sample hashmap. But since I can’t pass the entire hashmap to a process without irrelevant attribute changes altering the task hash and triggering an unnecessary process rerun, I have to extract only the specific inputs required by the process, then figure out a way to update the right channel item hashmap with the right process output. Currently, I handle this by passing a unique sample id as a process input and writing an SQL-like join feature (going beyond Nextflow’s join operator) to update the sample hashmaps with the process output. I learned about the fair directive today, which might simplify this somewhat by letting me update process outputs by order of execution rather than by id.
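In reduced form, the unpack/repack pattern I’m describing looks something like this (process and key names are invented; my actual code generalizes the join step):

process ALIGN {
    input:
    tuple val(id), path(fastq)

    output:
    tuple val(id), path("${id}.bam")

    script:
    """
    touch ${id}.bam    # stand-in for a real aligner
    """
}

workflow {
    samples = Channel.of(
        [id: 'sampleA', fastq: 'sampleA.fq.gz', notes: 'irrelevant to ALIGN'],
        [id: 'sampleB', fastq: 'sampleB.fq.gz', notes: 'also irrelevant']
    )

    // unpack: only the id and the staged file contribute to the task hash
    aligned = ALIGN( samples.map { it -> tuple(it.id, file(it.fastq)) } )

    // repack: join the process output back onto the full sample map by id
    samples
        .map  { it -> tuple(it.id, it) }
        .join ( aligned )                             // [id, sampleMap, bam]
        .map  { id, sample, bam -> sample + [bam: bam] }
        .set  { samples_with_bam }
}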
What would simplify this whole problem, though, would be a way to pass the entire sample hashmap as a process input and output without it affecting the task hash. This would let me update the hashmap directly within the process and simply collect the hashmap from the process output. Even better would be if, in addition to being able to pass inputs that were inert with respect to caching, Nextflow had a way to stage a file from within the shell/script block. That way I could have a hashmap as potentially the sole process input and output, and stage files as needed within the shell/script block.
I’ve never seen that pattern of accumulating task outputs in a Map. Sounds fascinating! Do you have a public example of such a workflow?
From what I understand of the pattern from your description, I think the proposal in GitHub issue #5308 would address your need. Do you think that this would be an appropriate feature?
That GitHub issue is precisely what I’m looking for. My workflow is in active development but is feature complete and approaching the cleanup stage in prep for publication. The code is on GitHub at bskubi/hich. You may be interested in some of the extraops.nf bits where I have code to facilitate rejoining a map of process outputs with the map of accumulating sample attributes based on an id. It’s this step that simply passing through the map and only caching the id would simplify. Note that there is a lot of cruft, in extraops especially, that I still need to clean up. Sqljoin is the bit worth looking at, I think, along with some of the code to reshape a channel of hash maps to a columnar or row format and to coalesce or group them. You might also be interested in my implementation of the sample selection strategies and analysis plans in the last few steps. If interested, I’d be happy to chat in more detail.
if the goal is to accumulate items from the outputs of various upstream processes and group them in map(s), then these are usually the operators I use.
keep in mind that iirc groupTuple can also do a grouping based on the entire map item being passed in the first slot, which is useful when you want to use something like a sub-map as the grouping key (multiple attributes)
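For example, a rough illustration of grouping by a sub-map of attributes (hypothetical channel contents):

Channel
    .of(
        [[condition: 'wt',  rep: 1], 'a.bam'],
        [[condition: 'wt',  rep: 2], 'b.bam'],
        [[condition: 'mut', rep: 1], 'c.bam']
    )
    .map { attrs, bam -> [attrs.subMap(['condition']), bam] }   // grouping key = sub-map
    .groupTuple()
    .view()   // e.g. [[condition:wt], [a.bam, b.bam]] and [[condition:mut], [c.bam]]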
of course, passing a map as an input item still has issues, as I think you noticed: the embedded files do not get staged, so you still need a pre-process step that yanks the desired files out of the map and passes them as path input items (along with the entire map itself, optionally)
the closest situation that has come up for me several times in the past is gathering up the .vcf files output by several variant callers for a set of samples, keeping each under a named key in a map so that I can remember which .vcf came from which caller, because the downstream vcf manipulation steps require using specific args for specific callers’ vcf files in order for things like bcftools to work right.
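Something along these lines (the caller processes and channel names here are made up):

ch_mutect  = MUTECT.out.vcf.map  { sample, vcf -> [sample, [mutect:  vcf]] }
ch_strelka = STRELKA.out.vcf.map { sample, vcf -> [sample, [strelka: vcf]] }

ch_vcf_maps = ch_mutect
    .mix(ch_strelka)
    .groupTuple()                                  // [sample, [ [mutect: ...], [strelka: ...] ]]
    .map { sample, maps -> [sample, maps.sum()] }  // merge into one map keyed by caller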
one of the dangers of using a map as a vehicle for pipeline input items is that as you start to merge different maps together, you run the risk of some maps having missing keys if some upstream task failed, or for other reasons, leading to the situation where you can no longer be sure whether a given map has the required keys for the next pipeline step. This is something that does not happen when you stick with the Nextflow native data-flow paradigm.
I swear there was a GitHub issue specifically about using map objects with embedded file paths as true ‘path’ inputs, but I can’t find it now after searching again.
We probably hit on similar solutions. I wound up creating a pseudo-operator, “sqljoin”, that implements true left, right, full, and inner joins over channels that contain HashMap items, keyed by an arbitrary submap (note that Nextflow’s ‘join’ operator is not a true inner join). It sneaks around the Nextflow requirements for things like join and cross by using groupTuple on the left channel, then iterating over the grouped left channel items, combining them with each corresponding right channel item, collecting the results, flattening, and emitting a new set of hashmaps. Works quite well.
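This is not the actual sqljoin from hich, just a minimal inner-join-only sketch of the same idea, joining two channels of maps on an arbitrary list of key names:

def innerJoinMaps(left, right, keys) {
    left
        .map { it -> [it.subMap(keys), it] }
        .groupTuple()                                              // [keymap, [leftMap, leftMap, ...]]
        .combine( right.map { it -> [it.subMap(keys), it] }, by: 0 )
        .map { keymap, lefts, r -> lefts.collect { l -> l + r } }  // merge each left with the matching right
        .flatMap { it }                                            // emit one merged map per pair
}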
This lets me update the full per-sample hashmap with the outputs from a process. I do unpack and repack the process inputs and outputs to accommodate Nextflow’s staging process. In my workflow, if a process fails, then the whole workflow halts, which is fine for this use case. But in other cases, I think you could just validate that all the hashmaps in the channel have the required set of keys for the next step, either by using map{} coupled with an assert or error statement to make the run fail, or by filtering on channel items containing the required keys, as necessary.
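For example, the key check could look something like this (key names invented; pick the failing or the filtering variant as appropriate):

def required = ['id', 'fastq', 'bam']

// fail fast if any sample map is missing a required key ...
checked = samples.map { sample ->
    if( !sample.keySet().containsAll(required) )
        error "Sample ${sample.id} is missing one of: ${required}"
    sample
}

// ... or silently drop incomplete samples instead
complete = samples.filter { sample -> sample.keySet().containsAll(required) }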
Hi Ben, your strategy of accumulating the provenance information in the results is really interesting. I hope we can find a way to do this automatically in Nextflow one day.
I’ll repeat what I said on the GitHub issue that Rob linked…
The root problem as I see it is that process invocations are very inflexible. You have to go through a lot of ceremony to prepare input and output channels around a process call, especially if there are multiple inputs. As a result, it’s more convenient to “pass through” data that you don’t need, which leads to unnecessary cache invalidation.
I believe we can avoid all of this convoluted operator logic (as fun as it is) by simply moving the process call into the operator closure (e.g. of a map). Then you can control how each individual task is invoked:
- pass only the values that are needed into the task, no accidental cache invalidation
- no need for joins, just keep related data together from the beginning
I have some internal sketches and am confident that this will work. I hope to share more in the future. But for this reason I’m hesitant to reach for things like new directives or operators, when it’s really a symptom of the language.