Zombie runs: jobs have been killed locally but the state is still "running" on Seqera

Hi, Seqera community.

When I launch a pipeline from the CLI and the run fails or is canceled (I have confirmed the job state with Slurm), the run is still shown as running on Seqera. Runs that complete successfully are correctly shown as succeeded.

As far as I can see, this happens especially with this pipeline.
What are the possible causes in a pipeline that lead to the “zombie runs” problem in general? Also, how do I cancel a “zombie” job on Seqera?

Reviewing the development history, I see:

They ran into the same problem. It seems it was solved by providing the workflow_id so that the Tower administrator could cancel the run for them. Is this still possible?

Thanks for the great platform.

I always get these “zombie runs” when I use “scancel” to kill a Nextflow run that I submitted to the cluster with Slurm. The PID gets killed as expected and the run is interrupted, but the pipeline is shown as running in Seqera Cloud forever.

I have the same problem. How do I remove these zombies?

If the Nextflow process is forcibly killed, it cannot report back to the Platform that the jobs have terminated. To force-delete the run from the Platform side, you can call the API directly with the force=true parameter. You’ll need:
The run ID (which you can pull from the run URL)
The workspace ID (the numeric ID available on the org page for each workspace)
A token (which can be created under “User Tokens” in the top-right user menu)
For example, to delete a single run:

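# Run ID taken from the run URL, and the numeric workspace ID
RUN_ID='2vEhA4EADamYMz'
WORKSPACE_ID='64430228002560'
# TOWER_ACCESS_TOKEN must already be set to your user token.
# workflowIds is passed as an array, matching the bulk form.
curl -X POST "https://api.cloud.seqera.io/workflow/delete?workspaceId=$WORKSPACE_ID&force=true" \
  -H "Authorization: Bearer $TOWER_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"workflowIds\":[\"$RUN_ID\"]}"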

To bulk delete runs, you can provide an array of run ids as the workflowIds value:

curl -X POST "https://api.cloud.seqera.io/workflow/delete?workspaceId=$WORKSPACE_ID&force=true" \
  -H "Authorization: Bearer $TOWER_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"workflowIds":["4GviYJvYgPzetw","1g40rRkH80oHAT"]}'
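
If you have more than a couple of zombie runs, building the JSON array by hand gets tedious. Here is a minimal sketch of a wrapper around the same request. The function names build_delete_payload and delete_runs are hypothetical helpers made up for this example, not part of any Seqera tool. It assumes run IDs are plain alphanumeric strings (so no JSON escaping is done) and that WORKSPACE_ID and TOWER_ACCESS_TOKEN are already set in your shell.

```shell
# Hypothetical helper: turn a list of run IDs into the JSON body
# expected by the delete request shown above.
# Assumes IDs are plain alphanumeric strings that need no JSON escaping.
build_delete_payload() {
  ids=""
  for id in "$@"; do
    # Append a quoted ID, comma-separated after the first one.
    ids="${ids:+$ids,}\"$id\""
  done
  printf '{"workflowIds":[%s]}' "$ids"
}

# Hypothetical helper: send the same force-delete request as above
# for every run ID given as an argument.
# Assumes WORKSPACE_ID and TOWER_ACCESS_TOKEN are set in the environment.
delete_runs() {
  curl -X POST "https://api.cloud.seqera.io/workflow/delete?workspaceId=$WORKSPACE_ID&force=true" \
    -H "Authorization: Bearer $TOWER_ACCESS_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(build_delete_payload "$@")"
}
```

With these defined, `delete_runs 4GviYJvYgPzetw 1g40rRkH80oHAT` should send the same request as the bulk example.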

@robsyme I have tried your solution and got a response:

{"failedWorkflowIds":[]}

However, the zombie run still hasn’t died yet.

That response indicates that the request completed with no failures. If you refresh the page, is the run still present in the list of runs displayed by Platform?
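
If you are scripting this, a rough way to check the response is to test whether failedWorkflowIds came back empty. The sketch below uses a hypothetical helper name (delete_succeeded) and plain string matching instead of a JSON parser. It also assumes, going by the field name only, that a non-empty array lists the run IDs that could not be removed.

```shell
# Hypothetical helper: succeed (exit status 0) only when the response body
# reports an empty failedWorkflowIds array.
# Assumption: a non-empty array means some runs could not be removed.
delete_succeeded() {
  # Strip all whitespace so pretty-printed JSON matches too.
  compact=$(printf '%s' "$1" | tr -d '[:space:]')
  case "$compact" in
    *'"failedWorkflowIds":[]'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with the response quoted in this thread.
if delete_succeeded '{"failedWorkflowIds":[]}'; then
  echo "no failures reported"
else
  echo "some runs were not removed"
fi
```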

@robsyme Thank you so much. When I check the Tower Runs history, the zombies are gone. However, the web links to the zombie runs are still accessible, and the runs still show as running when I open those links.

Is there any way to “force cancel” zombie jobs instead of deleting them from the Runs history?