We’ve been having some trouble with the resume functionality when launching pipelines on Seqera Platform. We don’t experience the same issue locally.
When resuming a pipeline on an updated version of the branch with a new commit, the caches fail and the pipeline restarts from the top. We would expect the caches of all processes upstream of the change introduced by the commit to be unchanged, and for the resume to start from the first changed process. When resuming a pipeline without incorporating any new commits (i.e. on an identical version of the branch), the resume functionality works as expected.
I have been investigating this issue via the cache hashes, and have narrowed the difference down to the container fingerprint hash, which differs in the resumed run. We have not made any changes to our containers between runs. When I look at the individual processes on Platform, under the ‘resources requested’ header, I can see that the path to the container differs in the resumed run. The format of the path is wave.seqera.io/wt/some_string/biocontainers/gtfparse:1.2.1--pyh864c0ab_0; it is the string in the middle of the path that changes.
Our containers are all in quay.io. I don’t know much about the Wave/Fusion system and how this path is constructed. Is the commit hash used to construct this path?
I’ve had a look through existing issues and couldn’t spot any similar posts.
Any help in diagnosing this issue is much appreciated, as we are losing a lot of time without resume working correctly when we make pipeline updates!
Hi again Paolo. We set up a new compute environment using ‘Batch Forge’, which we expected would install the latest version of Nextflow; however, the version still appears as 23.10.1. Are there any additional config options I have overlooked that will force-install v24?
Yes, this issue has been fixed in Nextflow 24.04.x; however, Platform is still using 23.10.x. You can bump the Nextflow version by adding the following environment variable in the launch pre-run script field:
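```bash
# pin the Nextflow version used for this launch
export NXF_VER=24.04.0
```

(24.04.0 here is just an example; any 24.04.x release should work.)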
I think I have a solution for it: just change the containers from x86_64 to ARM64; without this change you get the exec format error. Is Platform prepared to handle Graviton 3 or 4? Thanks guys for your outstanding work leading Seqera. I hope this may help.
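As a minimal sketch of what I mean (the image name and tag below are made up for illustration), you point the container directive at an ARM64 build of the tool:

```nextflow
process EXAMPLE_TOOL {
    // illustrative only: an image built for linux/arm64 (aarch64), so the task
    // runs natively on Graviton instead of failing with an exec format error
    container 'quay.io/example/mytool:1.0--linux-arm64'

    script:
    """
    mytool --version
    """
}
```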
Hi Phil. Thank you very much for this suggestion. The compute environment now launches successfully. Unfortunately, the broken resume functionality persists.
I have checked the cache hashes again, and the issue still seems to be with the container fingerprints. I have put an example below.
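For anyone wanting to reproduce this kind of comparison: the per-task hashes can be dumped into the run log with the -dump-hashes option (the pipeline name below is a placeholder), and the two logs can then be diffed:

```bash
# write per-task cache hashes into a dedicated log for each run
nextflow -log initial.log run <pipeline> -dump-hashes
nextflow -log resumed.log run <pipeline> -resume -dump-hashes
```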
This is the cache profile for one of the tasks in our pipeline on the initial run:
I then committed a change to a process downstream, and resumed the pipeline using the new commit ID. The resume failed and the pipeline restarted from the top. The cache hash of the same features_file task in the resumed pipeline is below:
Wave needs to bundle all of the bin/ directory executables into the container it builds and delivers for each Nextflow task. Changing one of the files in the bin directory therefore changes the container used to run every task in that workflow, and a change to the container hash results in a new task hash, which forces Nextflow to re-run each task.
The solution is to utilize the “Module binaries” feature of Nextflow (see the Nextflow documentation). Essentially, this feature enables you to bundle executables only with the specific processes that use them. This allows Wave to build process-specific containers, so changes to the container for ProcessA will not affect the container delivered to ProcessB.
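As a rough sketch of what this looks like (the module path and script name are illustrative, not taken from your repository): the executable moves out of the top-level bin/ into the module’s own resources/usr/bin/ directory, and the feature is switched on in the config:

```
modules/local/features_file/
├── main.nf                      # the process definition
└── resources/
    └── usr/
        └── bin/
            └── make_features.py # executable used only by this process (must be chmod +x)
```

```nextflow
// nextflow.config
nextflow.enable.moduleBinaries = true
```

With this in place, editing one module’s script only changes that module’s container, leaving the cache of every other process intact.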
I have made an example Pull Request here against your example repository, which shows the changes required to make use of the module binaries.
To follow up: one other approach is to use the template directive. This allows a separate script file, but tells Nextflow to interpolate it into the script block at run time. So the task doesn’t execute the script file from bash; it runs the code directly (if that makes sense).
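A minimal sketch, with illustrative file and variable names: the template file lives in a templates/ directory alongside the module, and Nextflow resolves the ${...} placeholders in it before the task runs:

```nextflow
// modules/local/features_file/main.nf (illustrative)
process FEATURES_FILE {
    input:
    path gtf

    output:
    path 'features.tsv'

    script:
    // resolved from the templates/ folder next to this module
    template 'make_features.sh'
}
```

```bash
#!/bin/bash
# templates/make_features.sh — ${gtf} is interpolated by Nextflow before the
# task runs; genuine shell variables would need to be escaped as \$var
extract_features.py --gtf ${gtf} > features.tsv
```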
Thanks Rob! Thanks Phil!
I’m sorry we haven’t gotten back to you on this yet. We’ll test both of these implementations when we can and get back to you.