Background: Running on the enterprise version (23.2.0_d157b0d) with SLURM on an offline HPC, using the GUI to launch the job.
Issue: My head job unfortunately ran out of memory at a very late stage of the pipeline. I would like to resume this job with more memory for the head job, but I can’t figure out how to do this.
- I’ve tried updating the current compute environment, but I’m quite sure that’s not possible.
- I duplicated the compute environment and changed the head job submission options, but I could not select this compute environment in the resume form.
As you say, you can’t resume a previous run on a new compute environment. However, I think that you should be able to resume a run and then put something like this into the Pre-run script under Advanced options:
export NXF_OPTS="-Xms5G -Xmx20G"
This is based on a snippet from this blog post:
I haven’t tried this though, so it’s a best guess at the moment. If it does work, I think we could look into adding it as an option in the Launch form for HPC compute envs. When running on AWS and some other compute env types, there’s already an option in that form for adjusting the head job CPUs + memory.
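As a minimal sketch of what that Pre-run script could contain (untested, like the suggestion itself): the heap sizes are the ones from the line above, and the echo is only an added convenience so the value is visible in the run output. Note that raising -Xmx can only help if the SLURM allocation for the head job is at least that large.

```shell
# Sketch of a Pre-run script body; heap sizes are the values suggested above
# and are an assumption, not a tested recommendation.
# -Xms sets the initial JVM heap, -Xmx the maximum heap for the Nextflow head job.
export NXF_OPTS="-Xms5G -Xmx20G"

# Print the value so it can be checked in the run's output later.
echo "NXF_OPTS=${NXF_OPTS}"
```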
Ooh, @mbosio just told me that it is possible to change the compute environment from Tower 22.4 onwards, apologies:
So the strategy of duplicating + editing the compute environment may work after all. The feature is limited to the case where the work directory is accessible in both compute envs, so it could be disabled due to something along those lines? Are the CE workdir paths definitely the same?
Thank you Phil!
I was thinking it was the SLURM parameters I would have to change, but you might be right that it’s the Java parameters that need tweaking. I can have a look at that as well.
I tried changing the compute environment, but all options were greyed out. The workdir is set to a variable, $TW_AGENT_WORK, so that could be why it’s confused? It’s the same for both though, and I didn’t change the agent argument between runs. But perhaps the issue is just the possibility that it could have changed?
Ah, entirely possible - I’m not sure how that validation step works. I’ve added a note on our internal tracker for the feature, so that we look into it.
Could you try duplicating the CE and setting the work dir to the absolute (base) path used in the previous run?
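One way to find that absolute path, as a rough sketch: run the following in the shell that starts the agent, where $TW_AGENT_WORK is set. The fallback to the current directory and the variable names workdir and abs_workdir are assumptions added only so the snippet runs standalone.

```shell
# Assumes TW_AGENT_WORK is set in the shell that launches the agent;
# falls back to the current directory only so this sketch runs on its own.
workdir="${TW_AGENT_WORK:-$PWD}"

# Turn the value into an absolute path to paste into the work directory
# field of the duplicated compute environment.
abs_workdir="$(cd "$workdir" && pwd)"
echo "$abs_workdir"
```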
Yes, I think that did the trick actually. I can now select the other CE in the GUI. In fact, if I do so, I can’t select the original again. I don’t dare start it right now though, since I have started the same run again from scratch and it’s still running. At least we now know it’s the $TW_AGENT_WORK that is the problem.
Thanks for testing!
OK, then I think we have as much of a solution here as we are going to get in the short term. Longer term, we’ll try to get the env var thing fixed on the Seqera Platform (formerly Tower) side. And hopefully also add an option to tweak head job CPUs + memory on the launch form for HPC.