I launched a WGS workflow I wrote for PacBio HiFi data that uses Seqera Wave to execute a pipeline process. By enabling the Wave plugin, my understanding is that the Dockerfile is uploaded to the Wave service, a container is built on the fly and pushed to a temporary repository, and the container name is automatically injected into the process execution and used to run the task. Basically, I am doing what is described in example two of the wave-showcase repo.
This had been working great, but then I stress-tested it by submitting 25 samples to run.
executor {
queueSize = 25 // limit to 25 concurrent jobs
}
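For what it's worth, the Nextflow docs describe another option in the same scope, submitRateLimit, which throttles how fast jobs are submitted rather than how many run at once. A sketch of combining the two (the rate value is illustrative, not something I have set):

```groovy
executor {
    queueSize       = 25         // at most 25 jobs queued or running at once
    submitRateLimit = '50/10min' // illustrative: submit at most 50 jobs per 10 minutes
}
```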
But then my workflow exited with the following error message:
Essential container in task exited - CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s
The actual process completed, and the resulting output file is present in the work directory bucket on S3.
My pipeline was working just fine before this, so do I need to reduce the executor queue size? Am I hitting Wave limits? I have my TOWER_ACCESS_TOKEN environment variable set, and it is listed in my config.
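For reference, the Wave-related scopes in my config follow the standard pattern from the Nextflow docs. A minimal sketch (not my exact file, with the token read from the environment):

```groovy
wave {
    enabled = true
}
tower {
    accessToken = System.getenv('TOWER_ACCESS_TOKEN')
}
docker.enabled = true
```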
By enabling the Wave plugin, my understanding is the Dockerfile is uploaded to the Wave service, a container is built on-the-fly, and pushed to a temporary repository
That depends on your wave.strategy. Can you share your configuration file? Does it work if you decrease the queueSize? Do you always get this error with queueSize = 25?
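For context, wave.strategy controls which source Wave uses to provision the container when more than one is available (a container directive, a Dockerfile, or a Conda file). Setting it explicitly looks like this, with the value names taken from the Nextflow docs:

```groovy
wave {
    enabled  = true
    strategy = ['dockerfile', 'container'] // prefer the Dockerfile build over the container directive
}
```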
I restarted and decreased queueSize to 10, but got the same error. Specifically, the process that failed was this one, pbmm2_align.
My Dockerfile is listed here.
It’s the same situation: the exit code is zero, and the aligned BAM is in the workdir.
I wasn’t familiar with wave.strategy; I don’t explicitly set it in my config.
Here are the relevant portions of my nextflow.config:
Both runs failed on my read-alignment process with pbmm2. I subsequently resumed the pipeline, keeping a queueSize of 10, and the remaining pbmm2 processes finished; the workflow is still executing the downstream processes. Not sure if I will encounter the same error, but things are running.
I think I may have found the problem: my error message appears in the FAQ.
When I launched a test job, I checked the container instance running it and saw that the ECS agent was outdated (it was v1.51).
When I made my Batch compute environment, I associated it with an AMI. Do I need to adjust that AMI with the updated ECS agent? It seems like the solution described here is to update the agent while the container instance is running.
But do I have to do that each time I launch a process? When I am using Wave, does it use the AMI associated with my AWS Batch compute environment to launch the jobs?
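In case it helps anyone else: the AWS docs describe updating the agent on an ECS-optimized Amazon Linux 2 instance roughly like this (commands assumed from the ECS documentation; I have not yet baked this into a new AMI, which would avoid doing it per instance):

```sh
# Update the ECS init package and restart the agent (ECS-optimized Amazon Linux 2)
sudo yum update -y ecs-init
sudo systemctl restart ecs
```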