Resume run with changed head job configuration

Johannes_Alneberg · October 26, 2023, 7:51am

Hello!

Background: Running on the enterprise version (23.2.0_d157b0d) with SLURM on an offline HPC, using the GUI to launch the job.
Issue: My head job unfortunately ran out of memory at a very late stage of the pipeline. I would like to resume this job with more memory for the head job, but I can’t figure out how to do this.

I’ve tried updating the current compute environment, but I’m quite sure that’s not possible.
I duplicated the compute environment and changed the head job submission options, but I could not select this compute environment in the resume form.

Thank you,
Johannes Alneberg

ewels · October 26, 2023, 12:46pm

Hi Johannes!

As you say, you can’t resume a previous run on a new compute environment. However, I think that you should be able to resume a run and then put something like this into the Pre-run script under Advanced options:

export NXF_OPTS="-Xms5G -Xmx20G"

This is based on a snippet from this blog post:

I haven’t tried this though, so it’s a best-guess at the moment. If it does work, I think we could look into adding it as an option into the Launch form for HPC compute envs. When running on AWS and some other compute env types, there’s already an option in that form for adjusting the head job cpus + memory.
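To spell out the idea, here is an untested sketch of what the Pre-run script could contain. The heap sizes are just the placeholder values from above, and the assumption is that the head job's SLURM allocation is large enough to hold the larger heap:

```shell
# Pre-run script sketch (untested): raise the JVM heap for the Nextflow head job.
# 5G / 20G are placeholder values; keep -Xmx below the memory that the
# SLURM allocation for the head job actually provides.
export NXF_OPTS="-Xms5G -Xmx20G"

# Print the setting so it can be verified in the job output.
echo "NXF_OPTS=${NXF_OPTS}"
```

Note that this only changes how much memory the JVM is allowed to use. If the scheduler killed the head job for exceeding its allocation, the SLURM memory request would need raising as well.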

Phil

ewels · October 26, 2023, 12:50pm

ooh @mbosio just told me that it is possible to change compute environment from Tower 22.4 onwards, apologies:

So the strategy of duplicating + editing the compute environment may work, after all. The feature is limited to the case when the work directory is accessible in both compute envs, so it could be disabled due to something along those lines? Are the CE workdir paths definitely the same?

Phil

Johannes_Alneberg · October 26, 2023, 12:54pm

Thank you Phil!

I was thinking it was the SLURM parameters I would have to change, but you might be right that it’s the Java parameters that need tweaking. I can have a look at that as well.

I tried changing the compute environment, but all options were greyed out. The workdir is set to a variable $TW_AGENT_WORK so that could be why it’s confused? It’s the same for both though and I didn’t change the agent argument between runs. But perhaps it’s just the possibility that it could have changed that is the issue?

ewels · October 26, 2023, 1:02pm

Ah, entirely possible - I’m not sure how that validation step works. I’ve added a note on our internal tracker for the feature, so that we look into it.

Could you try duplicating the CE and setting the work dir to the absolute (base) path used in the previous run?
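One way to find that absolute path is sketched below. It assumes the command is run in the same shell environment that starts the agent, so that $TW_AGENT_WORK has the value used by the previous run. The fallback to the current directory is only there to keep the example self-contained:

```shell
# Sketch: print the absolute path behind $TW_AGENT_WORK so it can be pasted
# into the work directory field of the duplicated compute environment.
# The fallback to $PWD is an assumption made for this example only.
work_dir="${TW_AGENT_WORK:-$PWD}"

# cd + pwd -P resolves symlinks and relative components with POSIX tools only.
abs_work_dir="$(cd "$work_dir" && pwd -P)"
echo "$abs_work_dir"
```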

Johannes_Alneberg · October 26, 2023, 1:08pm

Yes, I think that did the trick actually. I can now select the other CE in the GUI. In fact, if I do so, I can’t select the original again. I don’t dare start it at this time though, since I have started the same run again from scratch and it’s still running. At least we now know it’s the $TW_AGENT_WORK that is the problem.

ewels · October 26, 2023, 1:13pm

Thanks for testing!

Ok, then I think we have as much of a solution here as we are going to get in the short term. Longer term, we’ll try to get the env var thing fixed on the ~~Tower~~ Seqera Platform side. And hopefully also add an option to tweak head job cpus + memory on the launch form for HPC.
