CannotInspectContainerError when launching jobs from localhost to run on AWS Batch

Hi,

I launched a WGS workflow for HiFi data that I wrote, which uses Seqera Wave to execute a pipeline process. By enabling the Wave plugin, my understanding is that the Dockerfile is uploaded to the Wave service, a container is built on the fly and pushed to a temporary repository. The container name is automatically included in the process execution and used to run the task. Basically, I am doing what is described in example two of the wave-showcase repo.
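For reference, this is roughly how the pieces fit together (a minimal sketch only; the module paths in the comments are placeholders following example two of the wave-showcase repo):

// Each module directory holds the process definition plus the Dockerfile
// that Wave is expected to build on the fly, e.g.
//   modules/pbmm2_align/main.nf
//   modules/pbmm2_align/Dockerfile
docker.enabled = true
wave.enabled   = true

tower {
    accessToken = "$TOWER_ACCESS_TOKEN"   // token passed to the Wave service
}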

This had been working great, but then I stress-tested it by submitting 25 samples to run:

executor {
    queueSize = 25 // limit to 25 concurrent jobs
}

But then my workflow exited with the error message:

 Essential container in task exited - CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s

The actual process completed, and the resulting output file is in the workdir bucket on S3.

My pipeline had been working just fine, but do I need to reduce the executor queue size? Am I hitting Wave limits? I have my TOWER_ACCESS_TOKEN environment variable set, and it is listed in my config.

By enabling the Wave plugin, my understanding is that the Dockerfile is uploaded to the Wave service, a container is built on the fly and pushed to a temporary repository

It depends on your wave.strategy. Can you share your configuration file? If you decrease the queueSize, does it work? Do you always get this error with queueSize = 25?
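For reference, wave.strategy controls the order in which Wave resolves the container source for each process. A hedged sketch (the exact defaults depend on your Nextflow version, so check the Wave docs):

wave {
    enabled  = true
    // This ordering would make Wave prefer the module Dockerfile over a
    // `container` directive or a conda definition.
    strategy = ['dockerfile', 'container', 'conda']
}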

I restarted and decreased queueSize to 10, but I got the same error. Specifically, the process that failed was pbmm2_align.
My Dockerfile is listed here.
It's the same situation: the exit code is zero, and the aligned BAM is in the workdir.

I wasn’t familiar with wave.strategy. I don’t explicitly set it in my config.

Here are the relevant portions of my nextflow.config:

process {
    executor = 'awsbatch'
    queue = 'my-queue'

    // Default settings for all processes
    memory = '16GB'

    withLabel: 'high_memory' {
        memory = '64GB'
    }
}

aws {
    region = 'us-east-1'
    accessKey = ''
    secretKey = ''

    batch {
        cliPath = '/usr/local/aws-cli/v2/current/bin/aws'
    }
}

executor {
    queueSize = 10 // decreased to 10 concurrent jobs
}

workDir = 's3://my-nextflow-logs-prod'
docker.enabled = true
wave.enabled = true

tower {
  accessToken = "$TOWER_ACCESS_TOKEN"
}

I was taking a look at the compute environment I am using for Batch:

I think I might be hitting the vCPU limit on the compute environment? These numbers are from before AWS increased my vCPU quota.
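For what it's worth, the arithmetic I'm worried about looks roughly like this (the cpus and maxvCpus numbers are made up for illustration):

// Peak vCPU demand is roughly queueSize x cpus per task, and it has to fit
// under the compute environment's maxvCpus, otherwise jobs wait in RUNNABLE.
// e.g. 25 concurrent tasks x 8 vCPUs = 200 vCPUs against an assumed maxvCpus of 256.
process {
    cpus = 8            // assumed per-task vCPU request
}

executor {
    queueSize = 25      // max tasks Nextflow submits to Batch at once
}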

Please share the .nextflow.log file

nextflow.log_2.txt (232.9 KB)

log_2 is my original log from when I first encountered the error with queueSize = 25.

nextflow.log_1.txt (357.8 KB)

log_1 is from when I reduced queueSize to 10.

Both runs failed on my read alignment process with pbmm2. I subsequently resumed the pipeline, keeping the queueSize of 10, and the remaining pbmm2 processes finished; the workflow is still executing the downstream processes. Not sure if I will encounter the same error, but things are running.

This is the latest failure. At least it’s reproducible :upside_down_face:

Same error: Essential container in task exited - CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s

This time it failed on a process involving SV discovery. Here is my latest nextflow log:

nextflow.log (721.5 KB)

I think I may have found the problem. My error message is listed in the FAQ.

When I launched a test job, I checked the container instance running it and saw that the ECS agent was outdated (it is v1.51).

When I made my Batch compute environment, I associated it with an AMI. Do I need to update that AMI with the newer ECS agent? It seems like the solution described here is to update the agent while the container is running.

But do I have to do that each time I launch a process? When I am using Wave, does it use the AMI associated with my AWS Batch compute environment to launch the jobs?

I updated the ECS agent and the error went away.
