Resume failing due to change in wave container path

Hi there,

We’ve been having some trouble with the resume functionality when launching pipelines on platform. We don’t experience the same issue locally.

When resuming a pipeline on an updated version of the branch with a new commit, the caches fail and the pipeline restarts from the top. We would expect the caches of all processes upstream of the change introduced by the commit to be unchanged, and for the resume to start from the first changed process. When resuming a pipeline without incorporating any new commits (i.e. on an identical version of the branch), the resume functionality works as expected.

I have been investigating this issue via the cache hashes, and have narrowed the difference down to the container fingerprint hash, which differs in the resumed run. We have not made any changes to our containers between runs. When I look at the individual processes on platform, under the ‘resources requested’ header, I can see that the path to the container differs in the resumed run. The format of the path is
wave.seqera.io/wt/some_string/biocontainers/gtfparse:1.2.1--pyh864c0ab_0. It is the string in the middle of the path that is changing.

Our containers are all in quay.io. I don’t know much about the wave/fusion system and how this path is constructed. Is the commit hash used to construct this path?

I’ve had a look through existing issues and couldn’t spot any similar posts.

Any help in diagnosing this issue is much appreciated, as we are losing a lot of time without resume working correctly when we make pipeline updates!


Hello, what version of Nextflow are you using?

Morning Paolo. I’m using v23.10.1. I’ll try updating to Nextflow 24.04.2, test the resume failure again, and get back to you.

Hi again Paolo. We set up a new compute environment using ‘Batch Forge’, which we expected would install the latest version of Nextflow; however, the version still appears as 23.10.1. Are there any additional config options I have overlooked that will force-install v24?

Yes, this issue has been fixed in Nextflow 24.04.x; however, Platform is still using 23.10.x. You can bump the Nextflow version by adding the following environment variable in the launch pre-run script field:

export NXF_VER=24.04.0

We included the version environment variable in the pre-run script field like so:

However, the launch failed. I am unable to download the logs, so I have pasted the output from the GUI below.

Downloading nextflow dependencies. It may require a few seconds, please wait ..
CAPSULE: Downloading dependency ch.qos.logback:logback-core:jar:1.4.14
CAPSULE: Downloading dependency com.fasterxml.jackson.core:jackson-databind:jar:2.17.0
CAPSULE: Downloading dependency org.yaml:snakeyaml:jar:2.2
CAPSULE: Downloading dependency org.eclipse.jgit:org.eclipse.jgit:jar:6.6.1.202309021850-r
CAPSULE: Downloading dependency org.apache.ivy:ivy:jar:2.5.2
CAPSULE: Downloading dependency com.google.guava:guava:jar:33.0.0-jre
CAPSULE: Downloading dependency com.google.errorprone:error_prone_annotations:jar:2.23.0
CAPSULE: Downloading dependency org.apache.groovy:groovy-templates:jar:4.0.21
CAPSULE: Downloading dependency com.google.guava:failureaccess:jar:1.0.2
CAPSULE: Downloading dependency io.nextflow:nextflow:jar:24.04.0
CAPSULE: Downloading dependency com.googlecode.javaewah:JavaEWAH:jar:1.2.3
CAPSULE: Downloading dependency org.apache.groovy:groovy-xml:jar:4.0.21
CAPSULE: Downloading dependency ch.qos.logback:logback-classic:jar:1.4.14
CAPSULE: Downloading dependency com.fasterxml.jackson.core:jackson-annotations:jar:2.17.0
CAPSULE: Downloading dependency net.bytebuddy:byte-buddy:jar:1.14.9
CAPSULE: Downloading dependency org.apache.groovy:groovy-yaml:jar:4.0.21
CAPSULE: Downloading dependency org.pf4j:pf4j:jar:3.10.0
CAPSULE: Downloading dependency org.apache.groovy:groovy-json:jar:4.0.21
CAPSULE: Downloading dependency org.apache.groovy:groovy-nio:jar:4.0.21
CAPSULE: Downloading dependency org.checkerframework:checker-qual:jar:3.41.0
CAPSULE: Downloading dependency io.nextflow:nf-httpfs:jar:24.04.0
CAPSULE: Downloading dependency io.nextflow:nf-commons:jar:24.04.0
CAPSULE: Downloading dependency ch.artecat.grengine:grengine:jar:3.0.2
CAPSULE: Downloading dependency com.fasterxml.jackson.core:jackson-core:jar:2.17.0
CAPSULE: Downloading dependency com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:jar:2.17.0
2/7424 KB
Downloading plugin nf-amazon@2.5.1
WARN: Unable to start plugin 'nf-amazon' required by s3://csg-tower-bucket/scratch/36QWyueAawA304
ERROR ~ Missing plugin 'nf-amazon' required to read file: s3://csg-tower-bucket/scratch/36QWyueAawA304

Same issue here using Platform and the Wave system: the nf-amazon@2.5.1 plugin is missing.

Same log for each task:

I think I have a solution for it: change the containers from x86_64 to ARM64; without this change we get the exec format error. Is Platform prepared to handle Graviton 3 or 4? Thanks guys for your outstanding work leading Seqera. I hope this may help.

@ImogenCarruthers could you please try specifically 24.04.3 and see if that helps?
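(For example, by setting the pre-run script line shown above to export NXF_VER=24.04.3 rather than 24.04.0.)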

Hi Phil. Thank you very much for this suggestion. The compute environment now launches successfully. Unfortunately, the broken resume functionality persists.

I have checked the cache hashes again, and the issue still seems to be with the container fingerprints. I have put an example below.
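(For anyone reproducing this: hash dumps like the ones below can be produced by enabling Nextflow's task hash dumping. A minimal sketch, assuming the dumpHashes config option; the -dump-hashes run flag should be the command-line equivalent:)

// nextflow.config (sketch only): write the per-task hash keys shown below to the log
dumpHashes = true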

This is the cache profile for one of the tasks in our pipeline when run through initially:

[features_file] cache hash: df5b05a0ddcbfc66326dc7c276fbb098; mode: STANDARD; entries: 
  dac367ee37f632595386541e7bc30b7b [java.util.UUID] 123dfbee-1e45-484f-b6e3-9744e9fafdda 
  bd094351ca5ac07f84d3b0a060c0462b [java.lang.String] features_file 
  bc4561159b3f93b57f807f20b4fe55d8 [java.lang.String]   '''
  features_names.py !{gtf} !{gtf.baseName}_features_names_tmp.tsv
  sed 's/mm10___/mm10_/g' !{gtf.baseName}_features_names_tmp.tsv > !{gtf.baseName}_features_names.tsv
  '''
 
  83740b1e082aa32d3a4b64609590a8c3 [java.lang.String] 338af31aede8fad8e8bf1d241cbe770d 
  f1d1b4b60efcd9b60bb2ee1df7986b3d [java.lang.String] gtf 
  cc889c9bd61cf2f9eb6ef7e0a4232ab1 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/csgx.public.readonly/resources/references/Ensembl_Gencode_resources/GRCh38.Ensembl109.GENCODEv44_GRCm39.Ensembl110.GENCODEvM33_with_pcLoF_without_readthrough/genes/gencode.v44.vM33.primary_assembly.annotation.modified_seq_names.gene_subset.species_tagged.gtf, storePath:/csgx.public.readonly/resources/references/Ensembl_Gencode_resources/GRCh38.Ensembl109.GENCODEv44_GRCm39.Ensembl110.GENCODEvM33_with_pcLoF_without_readthrough/genes/gencode.v44.vM33.primary_assembly.annotation.modified_seq_names.gene_subset.species_tagged.gtf, stageName:gencode.v44.vM33.primary_assembly.annotation.modified_seq_names.gene_subset.species_tagged.gtf)] 
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ 
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true 
  bda06a192131339b62484b17664390c0 [sun.nio.fs.UnixPath] /.nextflow/assets/csgenetics/rnaseq/bin/features_names.py

I then committed a change to a process downstream, and resumed the pipeline using the new commit ID. The resume failed and the pipeline restarted from the top. The cache hash of the same features_file task in the resumed pipeline is below:

[features_file] cache hash: dee28bc0cae6f60756fd66b2971ad591; mode: STANDARD; entries: 
  dac367ee37f632595386541e7bc30b7b [java.util.UUID] 123dfbee-1e45-484f-b6e3-9744e9fafdda 
  bd094351ca5ac07f84d3b0a060c0462b [java.lang.String] features_file 
  bc4561159b3f93b57f807f20b4fe55d8 [java.lang.String]   '''
  features_names.py !{gtf} !{gtf.baseName}_features_names_tmp.tsv
  sed 's/mm10___/mm10_/g' !{gtf.baseName}_features_names_tmp.tsv > !{gtf.baseName}_features_names.tsv
  '''
 
  25acd0497914cd8b8ca579889113fca7 [java.lang.String] a6e019fc9352418cc56a8a5d90561f84 
  f1d1b4b60efcd9b60bb2ee1df7986b3d [java.lang.String] gtf 
  cc889c9bd61cf2f9eb6ef7e0a4232ab1 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/csgx.public.readonly/resources/references/Ensembl_Gencode_resources/GRCh38.Ensembl109.GENCODEv44_GRCm39.Ensembl110.GENCODEvM33_with_pcLoF_without_readthrough/genes/gencode.v44.vM33.primary_assembly.annotation.modified_seq_names.gene_subset.species_tagged.gtf, storePath:/csgx.public.readonly/resources/references/Ensembl_Gencode_resources/GRCh38.Ensembl109.GENCODEv44_GRCm39.Ensembl110.GENCODEvM33_with_pcLoF_without_readthrough/genes/gencode.v44.vM33.primary_assembly.annotation.modified_seq_names.gene_subset.species_tagged.gtf, stageName:gencode.v44.vM33.primary_assembly.annotation.modified_seq_names.gene_subset.species_tagged.gtf)] 
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ 
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true 
  bda06a192131339b62484b17664390c0 [sun.nio.fs.UnixPath] /.nextflow/assets/csgenetics/rnaseq/bin/features_names.py

As you can see, the only hash which changes is the one which I believe is the container fingerprint. It appears as this in the initial run:

83740b1e082aa32d3a4b64609590a8c3 [java.lang.String] 338af31aede8fad8e8bf1d241cbe770d

But this in the resumed run:

25acd0497914cd8b8ca579889113fca7 [java.lang.String] a6e019fc9352418cc56a8a5d90561f84

Is there any other info I can provide which may be helpful for resolving this?

Thanks!

Hi @ImogenCarruthers,

Could you please enable trace debugging? We’re having issues replicating this at our end. This trace log message should help:

Phil

Hi all.

I have been working on putting together a reproducible example to demonstrate the issue. Please find it here:

This is a very basic pipeline with two processes. The two processes use different containers, and the second process runs a Python script stored in bin/.
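To make the shape of it concrete, here is a minimal sketch of that kind of pipeline. The process names, containers, and script name below are hypothetical stand-ins, not the actual repo contents:

// Sketch only: two processes with different containers; the second calls a script from bin/

process FIRST_STEP {
    container 'quay.io/biocontainers/coreutils:9.5'    // hypothetical container

    output:
    path 'greeting.txt'

    script:
    """
    echo hello > greeting.txt
    """
}

process SECOND_STEP {
    container 'quay.io/biocontainers/python:3.10'      // hypothetical container

    input:
    path greeting

    output:
    path 'parsed.txt'

    script:
    """
    parse_greeting.py ${greeting} parsed.txt    # parse_greeting.py lives in the pipeline's bin/ directory
    """
}

workflow {
    SECOND_STEP(FIRST_STEP())
}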

To recreate the issue, please follow these steps:

  • Launch the pipeline from the platform launchpad. Use the penultimate commit for the initial run: d9536188e69f570c15050dd7b617b9c5bcbdd303

  • Once the pipeline has run through, resume it on the most recent commit: a54c784346a3bb0096bd0f467134f27f5def770a

You will see that the only change between these two commits is a change in the .py script that is used in the second process.

Expected behaviour: the first process is cached (as it is totally unchanged), the second is rerun (as the .py script was modified).

Actual behaviour: both processes are rerun.

This has been tested on both v24.04.3 and v24.06.0-edge. The same behaviour is seen in both cases.

Hi Imogen

Wave needs to bundle all of the bin/ directory executables into the container it builds and delivers for each Nextflow task. Changing one of the files in the bin/ directory therefore changes the container used to run every task in that workflow, and a change to the container hash results in a new task hash, which forces Nextflow to re-run each task.

The solution is to utilize the “Module binaries” feature of Nextflow (documentation here). Essentially, this feature enables you to include executables only with the specific processes that use them. This allows Wave to build process-specific containers, so changes to the container for ProcessA will not affect the container delivered to ProcessB.
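For reference, a minimal sketch of what that looks like; the module path and file names here are illustrative, and the PR below shows the real changes:

// nextflow.config: enable the module binaries feature
nextflow.enable.moduleBinaries = true

// The script then moves out of the top-level bin/ and into the module's own
// resources/usr/bin/ directory, for example:
//
//   modules/features_file/main.nf                                (process definition)
//   modules/features_file/resources/usr/bin/features_names.py    (used only by this process)
//
// Wave then only bundles that module's binaries into that process's container,
// so editing a script used by one process no longer invalidates the others.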

I have made an example Pull Request here to your example repository which shows what changes are required to make use of the module binaries.

To follow up: one other approach is to use the template directive. This allows a separate script file but tells Nextflow to interpolate it into the script block at run time, so the process task doesn’t run the script file from Bash; it runs the code directly (if that makes sense).

See docs: Processes — Nextflow documentation

You see this approach used in nf-core quite a bit, as it doesn’t require Wave / containers. For example, the rtn module with an R script.
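A minimal sketch of that pattern, using the features_file process from earlier in the thread (the exact inputs and outputs here are illustrative):

// Sketch only: the script lives in templates/features_names.py (relative to the
// project or module directory) instead of bin/, and Nextflow interpolates it into
// the task script at run time.

process features_file {
    container 'quay.io/biocontainers/gtfparse:1.2.1--pyh864c0ab_0'

    input:
    path gtf

    output:
    path '*_features_names.tsv'

    script:
    template 'features_names.py'
}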

Hope that helps!

Phil

Thanks Rob! Thanks Phil!
I’m sorry we haven’t gotten back to you on this yet. We’ll test both of these implementations when we can and get back to you.
