I am trying to run the nf-core/smrnaseq workflow on Seqera cloud with Google Batch. My files are all in GCP buckets which have been connected to Seqera cloud with a credential. Whenever I try to run the workflow, all of the initial processes submitted fail with a mysterious error:
Process terminated for an unknown reason -- Likely it has been terminated by the external system.
I have increased the quotas on my GCP account so that there is enough memory and CPU, and the Batch jobs are being submitted and attempted, but they fail without producing any logs or an exit code.
Any guidance on how to resolve this issue would be greatly appreciated.
Can you run a simple pipeline such as nextflow-io/hello in this compute environment? The first step in debugging is to make sure it’s not something in the CE (roles, etc).
I have been able to run both nextflow-io/hello and nf-core/smrnaseq with the test profile. As far as I can tell, the main difference is that the test files (tens of MB each) are much smaller than my sample files (~3 GB each).
I don’t think it is a problem with roles or permissions, and I am no longer getting errors saying that my CPU and/or memory quotas are being exceeded.
Sometimes nf-schema gets a bit trigger happy and kills your pipeline before it can report what went wrong. When this happens, I normally run the pipeline locally and try to recreate the error on my machine so I can check the logs. Usually it means one of the parameters failed validation. Could you try running the same pipeline with the same parameters from the command line and see whether it starts? You can cancel it once it gets past startup so that it doesn't actually run on your laptop.
I don’t think it should be parameter validation, as that would happen on the Nextflow head job. Here it’s the process tasks that are being terminated, not the head job.
Is it possible to view the job submissions in GCP and get any additional information?
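If the console is awkward to dig through, here is a rough sketch of pulling the status events for failed Batch jobs with the google-cloud-batch Python client. It assumes that package is installed and that application default credentials are set up; the project ID, region, and job name below are placeholders, and the helper names are just for illustration.

```python
from types import SimpleNamespace


def summarize_events(job):
    """Return one 'job name: event description' line per status event."""
    return [f"{job.name}: {event.description}" for event in job.status.status_events]


def print_failed_job_events(project_id, region):
    """Print status events for every FAILED Batch job in a project/region."""
    # Imported here so summarize_events() works without the client library.
    from google.cloud import batch_v1  # pip install google-cloud-batch

    client = batch_v1.BatchServiceClient()
    parent = f"projects/{project_id}/locations/{region}"
    for job in client.list_jobs(parent=parent):
        if job.status.state == batch_v1.JobStatus.State.FAILED:
            for line in summarize_events(job):
                print(line)


# Offline demo with a stand-in object shaped like a Batch job (made-up values).
fake_job = SimpleNamespace(
    name="projects/my-project/locations/us-central1/jobs/nf-example",
    status=SimpleNamespace(
        status_events=[SimpleNamespace(description="example event text")]
    ),
)
print("\n".join(summarize_events(fake_job)))

# Against a real project (placeholder values, needs credentials):
# print_failed_job_events("my-project", "us-central1")
```

The event descriptions are usually where Batch explains why a task never started, which is more than Nextflow can see from its side.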
Thank you for the responses, everyone. I looked into the logs and each of the child jobs in Batch has the following error/log message:
no VM has agent reporting correctly within the time window 1080 seconds. VM state for instance nf-686847ef-171798-2b22770e-de53-44a10-group0-0-hnff is 2024/06/10-02:05:36+0000,agent,start.
From what I have found online, it appears to be an issue with the configuration of my Batch permissions or networking. I am going to try to resolve the issue following the advice in the linked page, but if anyone has experienced this issue and has further advice I would greatly appreciate it!