Join and cross operators not behaving as expected

My understanding of an inner join (from Nextflow documentation: “It is equivalent to an inner join in SQL”) is that it is supposed to return every match on the key between the left and right rows. However, in my code it is only returning one match.

Minimal example:

workflow {
    ch1 = Channel.of([0, 1], [0, 2])
    ch2 = Channel.of([0, "A"])
    ch1.join(ch2) | view
} 

Output:

[0, 1, A]

Expected output:

[0, 1, A]
[0, 2, A]

If I do ch2.join(ch1) | view instead, the output is:

[0, A, 1]

And similarly, I would have expected:

[0, A, 1]
[0, A, 2]

If I replace join with cross in the last workflow line, I get:

[[0, 1], [0, A]]

Given that cross is supposed to return ‘every pairwise combination of two channels for which the pair has a matching key,’ I would have instead expected to get:

[[0, 1], [0, A]]
[[0, 2], [0, A]]

If I do ch2.cross(ch1) it gives:

[[0, A], [0, 1]]
[[0, A], [0, 2]]

Which is what I expect.

Thanks for any clarification you can offer.

The join operator is intended to match items with unique keys, so there should never be duplicate matches and such a duplicate would be considered an error in most cases.
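For anyone who would rather have the duplicate reported than silently dropped, join accepts a failOnDuplicate option. A minimal sketch, reusing the two channels from the question above; the exact error text is not reproduced here, and the behavior is as described in the operator documentation rather than something verified in this thread:

```nextflow
workflow {
    ch1 = Channel.of([0, 1], [0, 2])
    ch2 = Channel.of([0, "A"])

    // With failOnDuplicate enabled, the repeated key 0 in ch1 should be
    // reported as an error instead of quietly yielding a single match.
    ch1.join(ch2, failOnDuplicate: true).view()
}
```

According to the documentation, this option is also switched on automatically when strict mode is enabled.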

The cross operator is more of a piece-wise outer product, i.e. a “cross join”, because it only combines values with matching keys, whereas combine can do a full cross product if you don’t specify the by option.
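To make the difference concrete, here is a small sketch of combine with and without the by option. The first call reuses the channels from the question; the second uses made-up scalar values purely for illustration. Emission order is not guaranteed, so the comments list the expected items rather than a fixed sequence:

```nextflow
workflow {
    ch1 = Channel.of([0, 1], [0, 2])
    ch2 = Channel.of([0, "A"])

    // Keyed combination: each ch1 item is paired with each ch2 item
    // that shares element 0. Expected items: [0, 1, A] and [0, 2, A]
    ch1.combine(ch2, by: 0)
        .view { v -> "by key: $v" }

    // No 'by' option: full Cartesian product of the two channels.
    // Expected items: [1, x], [1, y], [2, x], [2, y]
    Channel.of(1, 2)
        .combine(Channel.of("x", "y"))
        .view { v -> "product: $v" }
}
```

Note that the keyed form returns flat tuples (key, left values, right values), unlike cross, which keeps each side as its own nested list.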

I have always found the distinction between join and cross and combine to be confusing, but I’m guessing they simply evolved to serve various use cases. It is something I’d like to improve in a future DSL version, but for now I have simply tried to document their differences as best I can. See the note here: Operators — Nextflow documentation

Gotcha. It may help to note the expectation of unique keys more clearly in the documentation, since coming from a DataFrame/SQL mindset, the Nextflow join operator is really not quite the same as an inner join.

@bentsherman

I was able to implement a true inner join with a bit of creativity:

workflow innerJoin {
    take:
    left
    right

    main:

    left
        | groupTuple
        | cross(right)
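        | map{
            // it[0] is the grouped left tuple [key, col1 values, col2 values, ...];
            // it[1] is one matching right tuple [key, col1, col2, ...]
            def result = []
            // transpose turns the grouped columns back into one list per original left row
            for (left_cols in it[0].drop(1).transpose()) {
                result += [[[it[0][0]], left_cols, it[1].drop(1)]]
            }
            result
        }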
        | collect
        | flatMap{it}
        | set{joined}
    
    emit:
    joined
}

workflow {
    source = channel.of([0, "s1.1", "s1.2"], [0, "s2.1", "s2.2"], [0, "s3.1", "s3.2"], [1, "s4.1", "s4.2"])
    target = channel.of([0, "t1.1", "t1.2"], [0, "t2.1", "t2.2"], [1, "t3.1", "t3.2"])
    innerJoin(source, target) | view
}

Output:

[[0], [s1.1, s1.2], [t1.1, t1.2]]
[[0], [s2.1, s2.2], [t1.1, t1.2]]
[[0], [s3.1, s3.2], [t1.1, t1.2]]
[[0], [s1.1, s1.2], [t2.1, t2.2]]
[[0], [s2.1, s2.2], [t2.1, t2.2]]
[[0], [s3.1, s3.2], [t2.1, t2.2]]
[[1], [s4.1, s4.2], [t3.1, t3.2]]

Take it with a grain of salt as I haven’t put much work into validating it.

I also have implementations of left join, right join, and full join which I can share if interested. Currently they only let you join on the first key, and of course since we’re not actually dealing with a database, they just put the non-key elements from the left channel in the second position and the non-key elements from the right channel in the third position.

Do you think Nextflow would be interested in including these options as new operators?

This kind of inner join can be achieved using the cross operator, or the combine operator with the by option. The cross operator essentially does an “inner-cross” join by default. I guess I have never had to do such an inner join, even in my SQL/pandas days, because I only ever joined on unique keys. I haven’t seen any real-world examples of an inner join with duplicate keys, but I’m sure they’re out there.

Also, join can do an outer join using the remainder option. You would then need to filter out certain remainders to achieve a left or right join.
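As a rough sketch of that approach, assuming two-element tuples with unique keys (the channel contents below are invented for illustration): with remainder enabled, an unmatched item is emitted with null in the position of the missing side, so a left join amounts to filtering out rows whose left value is null.

```nextflow
workflow {
    left  = Channel.of(["X", 1], ["Y", 2], ["P", 7])
    right = Channel.of(["X", 4], ["Y", 5], ["Q", 8])

    // Outer join: unmatched keys are emitted with null for the missing side,
    // e.g. [P, 7, null] and [Q, null, 8].
    outer = left.join(right, remainder: true)

    // Left join: drop rows that exist only on the right (left value is null).
    outer
        .filter { row -> row[1] != null }
        .view { row -> "left join: $row" }
}
```

A right join would test row[2] instead. This still inherits join's expectation of unique keys, so it does not cover the duplicate-key case discussed above.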

I personally find all of this very confusing. In a future DSL version, there should probably only be a join operator which can do all of the different joins, and maybe a cross operator as a convenience for doing a full cross product. combine is difficult to understand because all of these operators are “combining” channels in some sense…