Bundling channels into a tuple gives different results in process output vs. workflow

I’m trying to understand how to bundle channels into tuples and pass them as process or subworkflow inputs and outputs.

Here is a minimal example that confuses me.

process p1 {
    input:
    path A
    path B
    
    output:
    tuple(path(A), path(B))

    script:
    "touch ${A}; touch ${B}"
}

process p2 {
    input:
    tuple(path(A), path(B))

    script:
    ":"
}

workflow {
    A = Channel.fromPath("A.txt")
    B = Channel.fromPath("B.txt")

    // This works
    //p2(p1(A, B))

    // This gives the error
    p2(tuple(A, B))
}

Running the above example with nextflow run demo.nf (Nextflow 23.10.1 on Ubuntu 22.04.4 LTS) gives the error:

Not a valid path value type: groovyx.gpars.dataflow.DataflowBroadcast (DataflowBroadcast around DataflowStream[?])

This surprised me. Both the commented-out example labeled ‘this works’ and the ‘this gives the error’ lines seem like they ought to produce equivalent results.

It seems like bundling channels into a tuple in the process output section leads to a different result than doing it in the workflow section. What’s the difference under the hood? Is there a way to modify the ‘this gives the error’ example to make it work as an input to p2?

The p1 process outputs a channel whose element is a tuple with two items (two paths). The p2 process expects exactly that as input: a channel element containing a tuple with two paths, so the commented-out line works just fine. tuple(A, B), however, creates a tuple containing two channels, which is not what p2 expects. Inside process instances (tasks) there are no channels, just the items inside a channel element: paths, values, and so on. When a channel object ends up there, Nextflow lets you know that a channel (groovyx.gpars.dataflow.DataflowBroadcast) is not a valid path.

I confess it’s a bit confusing if you’re not well versed in the concepts of channels and processes, but what you’re doing is effectively passing a Channel.of(Channel.of(...)). Why? Because when you pass a non-channel value (such as a tuple) to a process, Nextflow implicitly converts it to a value channel, so what you’re actually doing is Channel.value(tuple(Channel.fromPath("A.txt"), Channel.fromPath("B.txt")))
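
A small sketch makes the mismatch visible (the printed class is the same DataflowBroadcast reported in the error above):

workflow {
    A = Channel.fromPath("A.txt")
    B = Channel.fromPath("B.txt")

    // tuple(A, B) is evaluated right away in the workflow body: it is a single
    // Groovy tuple holding the two channel objects, not a channel of tuples
    println tuple(A, B)*.getClass()
    // [class groovyx.gpars.dataflow.DataflowBroadcast, class groovyx.gpars.dataflow.DataflowBroadcast]
}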

Thank you very much for your help. Sounds like the combine operator is what I’d use to assemble an input to p2 via: p2(A.combine(B))
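
In full, the fixed workflow block would look something like this (a minimal sketch, keeping the process definitions above unchanged):

workflow {
    A = Channel.fromPath("A.txt")
    B = Channel.fromPath("B.txt")

    // combine emits the Cartesian product of A and B; with a single path in
    // each channel that is one element, the tuple [A.txt, B.txt], which
    // matches p2's tuple(path(A), path(B)) input declaration
    p2(A.combine(B))
}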

So just to check my understanding: it sounds like every line of a process’s input and output blocks implicitly corresponds to a single channel, and what’s declared in the input/output blocks describes the contents of those channels.

When we call A = Channel.fromPath("A.txt"), we are creating a channel named A which contains a path item (sun.nio.fs.UnixPath according to A.map{ it.getClass() }) whose path string is “[working directory]/A.txt”. A.getClass(), by contrast, is groovyx.gpars.dataflow.DataflowBroadcast, which I’m assuming is the under-the-hood implementation of a channel. Is that correct?

Yes. Every line in the input block refers to a single element from an input channel. That’s why we often say that usually the only places you’re dealing with regular variables (not channels) in Nextflow are inside a process or inside a channel operator.

When you pass a channel to a process, the process consumes an element from that channel (a regular variable). If you pass a channel whose elements are themselves channels (or a multi-channel output, e.g. from branch or from a process with multiple outputs), Nextflow will complain.
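
For example, inside a workflow block (a minimal sketch; getFileName() is just the standard java.nio.file.Path method):

A = Channel.fromPath("A.txt")

// Outside of operators and processes, A refers to the channel object itself
println A.getClass()                 // groovyx.gpars.dataflow.DataflowBroadcast

// Inside an operator closure, 'it' is the element: a regular Path variable
A.map { it.getFileName() }.view()    // A.txt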

Just to clarify terminology: you’re not wrong to refer to the entities inside channels as items, but we usually refer to them as elements, and to the parts of an element (if it’s a list, for example) as items. It makes things easier to follow when they get complicated.

For example:

Channel.of([1, 2, 3], [4, 5, 6])

The channel above has two elements with three items each.
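
You can check this with view(), which fires once per element (a quick sketch):

// Prints two lines, one per element; each element is a list with three items
Channel.of([1, 2, 3], [4, 5, 6]).view()
// [1, 2, 3]
// [4, 5, 6]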

Marcel has given a great explanation, but allow me to use a little analogy.

Your processes are the metro stations, each element in a channel is a train, each line is a channel. Your workflow is the whole tube map.

In your example:
Stations: p1, p2
Lines: A, B
Trains (consisting of either): path(A), path(B) or tuple(path(A), path(B))

And you gotta make sure those trains fit on the platform. That’s what the operators are for.

I have to apologise if I’m explaining that monads are burritos.
