Complete DAG: show task (not process) dependencies

I know I can generate a DAG or a report from a workflow run. It shows the list of tasks on one side and the process dependencies on the other. What I would like to see is the dependencies between the tasks themselves.

For instance, look at this pseudo workflow:

process A {
 input: path in
 output: path out
}

process B {
 input: path in
 output: path out
}

workflow {
 Channel.of(file1, file2) | A | B
}

Trace would report:

taskid1 A(1)
taskid2 A(2)
taskid3 B(1)
taskid4 B(2)

The information I’m missing is taskid1 → taskid3 and taskid2 → taskid4. This would help, for instance, to track the processing of the files from the input to the output.

Is this information already available? Or would I need to write my own plugin?

Thanks

No plugin required; this is a core feature of Nextflow, but it’s not very obvious at first.

Tag Directive

Firstly, the exact solution you are looking for is the tag directive. This lets you attach a custom label to each task, which then appears in the log.

In this example, I use the simpleName file attribute to get the base name of the input file and use it as the tag, which will appear in the log:

process A {

    tag "${in.simpleName}"

    input: 
        path in
    output: 
        path "*_out"

    script:
    """
    mv $in ${in.simpleName}_out
    """

}

process B {

    tag "${in.simpleName}"

    input: 
        path in
    output: 
        stdout

    script:
    """
    echo $in 
    """
}

workflow {

    def file1 = file("${workDir}/hello.txt")
    file1.text = "Hello, world!"
    
    def file2 = file("${workDir}/morning.txt")
    file2.text = "Good morning!"
    

    Channel.of(file1, file2) | A | B
}
> nextflow run . -ansi-log false
N E X T F L O W  ~  version 24.10.5
Launching `./main.nf` [crazy_jennings] DSL2 - revision: 4a7461e06d
[e8/39fc6d] Submitted process > A (hello)
[96/58ffcd] Submitted process > A (morning)
[d0/d55e07] Submitted process > B (hello_out)
[05/6d8792] Submitted process > B (morning_out)

Metadata Propagation

Of course, this is a very simple example and relies on filenames. Filenames are deeply unreliable and should never be used to hold metadata.

Nextflow supports propagating data with the files, i.e. you can pass sample information such as the ID, treatment etc along with the files themselves and use that information in each process. This is extremely valuable because you can construct complex instructions from all the data you have accessible. For a deep dive into this topic, check out the advanced training: Metadata Propagation - training.nextflow.io

In this example, I build some files using a map from a greeting and return a tuple of [ greeting, file ]. I then use the greeting as the tag to identify the task:

process A {

    tag "${greeting}"

    input: 
        tuple val(greeting), path(in)
    output: 
        tuple val(greeting), path("*_out")

    script:
    """
    mv $in ${greeting}_out
    """

}

process B {

    tag "${greeting}"

    input: 
        tuple val(greeting), path(in)
    output: 
        stdout

    script:
    """
    echo $in 
    """
}

workflow {

    Channel.of("hello", "morning")
        .map { greeting ->
            def greetingFile = file("${workDir}/${greeting}.txt")
            greetingFile.text = "${greeting} world!"
            return [ greeting, greetingFile ]
        }
        .set { greetings }

    greetings | A | B
}
> nextflow run . -ansi-log false
N E X T F L O W  ~  version 24.10.5
Launching `./main.nf` [determined_lamarck] DSL2 - revision: c1789b1008
[0a/a31ba3] Submitted process > A (morning)
[9b/f30dbc] Submitted process > A (hello)
[b7/5b3025] Submitted process > B (hello)
[1b/016608] Submitted process > B (morning)

There’s nothing stopping you from constructing complex tags, e.g. "${sampleId}_${referenceName}", to indicate combinations of inputs.
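As a sketch of that idea (the ALIGN process and its inputs here are hypothetical, not from the thread above):

```nextflow
process ALIGN {

    // Combined tag: identifies the sample/reference pairing in the log
    tag "${sampleId}_${referenceName}"

    input:
        tuple val(sampleId), val(referenceName), path(reads)
    output:
        tuple val(sampleId), path("${sampleId}.bam")

    script:
    """
    echo aligning $reads to $referenceName > ${sampleId}.bam
    """
}
```

With this, the log would show entries like ALIGN (sample1_grch38), making it easy to see which combination of inputs each task processed.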

Provenance and Dependency Chaining

Maybe I’ve missed the point entirely, and you are really after task provenance, which in this situation means knowing which task derives from which earlier tasks. This is more complicated and can’t really be expressed with tags, but it’s something we’re working on.

Thanks Adam. I’m actually looking for the provenance of the tasks, and for some way to get it for existing pipelines. I’ve looked at the entry points for plugins and at the task handlers, but didn’t find any attribute/method with this information. The closest thing I’ve found is looking at the symbolic links in the work directory to see where the data comes from (i.e. from which task).
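In shell terms, the idea looks something like this. The directory names below are mocked up (real work directories use per-run hashes), but the principle is the same: staged inputs in a task directory are symlinks back to the work directory of the task that produced them, so following them reveals the upstream task.

```shell
# Mock layout standing in for a real Nextflow work directory
# (hashes like aa/upstream are placeholders for real task dirs):
mkdir -p work/aa/upstream work/bb/downstream
touch work/aa/upstream/sample_out
ln -sf "$PWD/work/aa/upstream/sample_out" work/bb/downstream/sample_out

# For each symlinked input of the downstream task,
# print the directory of the task that produced it:
for link in work/bb/downstream/*; do
    [ -L "$link" ] && dirname "$(readlink "$link")"
done
```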

Check out the dag format in nf-prov; it will give you the task DAG as a Mermaid diagram. You might also look at the source code to see how I use the TraceObserver to infer the provenance.

The only caveat is that it uses input/output files to track provenance, so it can’t track e.g. a val output being passed to a val input. But this is a rare edge case and there are ways to avoid it if needed.
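For reference, enabling the plugin in nextflow.config looks roughly like this (the exact prov config schema and available options may differ between nf-prov versions, so check the plugin’s README):

```groovy
// nextflow.config
plugins {
    id 'nf-prov'
}

prov {
    enabled = true
    formats {
        // 'dag' renders the task-level DAG as a Mermaid diagram
        dag {
            file = 'provenance.html'   // output file name is an assumption
            overwrite = true
        }
    }
}
```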


Great, this is what I was looking for. Thanks a lot!
