AWS Batch job stuck in RUNNABLE when using a GPU (A10) compute environment: what should I check next?

I’m encountering an issue with a fairly simple Nextflow pipeline running on AWS Batch. I’ve set up two AWS Batch queues and configured Nextflow to route processes to them based on process labels. I specified two different queues because I need two different AMIs, one of them for running a g5.xlarge on which I use a GPU for Nanopore basecalling. Here’s my nextflow.config:


/************************************************
| CONFIGURATION FILE FOR NAO BASECALL WORKFLOW |
************************************************/

params {
    mode = "basecall"
    debug = true

    // Directories
    base_dir = "*****NAO-ONT-20240912-DCS_RNA3" // Parent for working and output directories (can be S3)

    // Run parameters
    nanopore_run = "NAO-ONT-20240912-DCS_RNA3"

    // Files
    pod5_dir = "${base_dir}/pod5/"
    calls_bam = "${base_dir}/bam/calls.bam"
    fastq_file = "${base_dir}/raw/${nanopore_run}.fastq.gz"
}

includeConfig "${projectDir}/configs/containers.config"
includeConfig "${projectDir}/configs/resources.config"
includeConfig "${projectDir}/configs/profiles.config"

docker {
    enabled = true
    runOptions = '--gpus all'
}

fusion {
    enabled = true
    exportStorageCredentials = true
}

wave {
    enabled = true
}

process {
    withLabel: dorado {
        executor = 'awsbatch'
        queue = 'slg-basecall-batch-queue'
        errorStrategy = "retry"
        maxRetries = 3
    }
}

process {
    errorStrategy = "retry"
    maxRetries = 3
    executor = "awsbatch"
    queue = "simon-batch-queue"
}
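
A side note on the GPU part of this config: as far as I understand, docker.runOptions = '--gpus all' only applies when Nextflow launches containers with local Docker, and with the awsbatch executor the GPU has to be requested through the accelerator directive so that Batch adds a GPU resource requirement to the job. This is the variant I’m considering for the dorado label (same queue as above); I haven’t confirmed yet that it’s actually required:

process {
    withLabel: dorado {
        executor = 'awsbatch'
        queue = 'slg-basecall-batch-queue'
        accelerator = 1        // ask Batch for one GPU on the g5.xlarge
        errorStrategy = "retry"
        maxRetries = 3
    }
}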

The problem I’m facing is that while nextflow.log shows PROCESS BASECALL:BASECALL_POD_5 as submitted, it remains in a “runnable” state on the AWS Batch dashboard and doesn’t progress.
I’ve checked the following:

  1. I can see the launched instance on the EC2 dashboard.
  2. Both the compute environment and the job queue have status “Valid” and state “Enabled”.
  3. I’ve attempted to SSH into the instance using the key pair specified in the launch template, but I’m unable to connect (the CLI checks I plan to run next are sketched right after this list):
    ssh -i ~/.ssh/*** ec2-user@ec2-54-196-156-38.compute-
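
In case it helps, these are the CLI checks I’m planning to run next (<job-id> is a placeholder for the Batch job ID from the dashboard):

# Show the job's status, statusReason, and the resource requirements Batch is trying to satisfy
aws batch describe-jobs --jobs <job-id>

# Confirm the compute environment still reports as healthy and uses the expected instance type
aws batch describe-compute-environments --compute-environments simon-batch-3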

Has anyone encountered similar issues, or does anyone have suggestions for troubleshooting this? I’m particularly interested in:

  1. Why might the job be stuck in a “runnable” state?
  2. Are there any logs or specific areas in AWS I should check to diagnose the problem?
  3. Any common misconfigurations that could cause this behavior?

Let me know if further information from my side would be helpful!

Additional info

Compute environment JSON:

{
  "computeEnvironmentName": "simon-batch-3",
  "computeEnvironmentArn": "arn:aws:batch:us-east-1:058264081542:compute-environment/simon-batch-3",
  "ecsClusterArn": "arn:aws:ecs:us-east-1:058264081542:cluster/AWSBatch-simon-batch-3-8151f108-6a73-3faf-b52f-fdb1c5cc38c5",
  "tags": {},
  "type": "MANAGED",
  "state": "ENABLED",
  "status": "VALID",
  "statusReason": "ComputeEnvironment Healthy",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_PRICE_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 1024,
    "desiredvCpus": 4,
    "instanceTypes": [
      "g5.xlarge"
    ],
    "subnets": [
      "subnet-06fbe1854c70d78dd",
      "subnet-0d0c2057c57a07015",
      "subnet-0200df4c5f948f131",
      "subnet-021abc14f7da9660b",
      "subnet-069530171cc8bb793",
      "subnet-03aece59a8e0cffe3"
    ],
    "securityGroupIds": [
      "sg-0ebaa288146c2d8ab"
    ],
    "ec2KeyPair": "simon_sb",
    "instanceRole": "arn:aws:iam::058264081542:instance-profile/ecsInstanceRole",
    "tags": {},
    "launchTemplate": {
      "launchTemplateName": "simon-basecall-template",
      "version": "$Latest"
    },
    "ec2Configuration": [
      {
        "imageType": "ECS_AL2_NVIDIA"
      },
      {
        "imageType": "ECS_AL2"
      }
    ]
  },
  "serviceRole": "arn:aws:iam::058264081542:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch",
  "updatePolicy": {
    "terminateJobsOnUpdate": false,
    "jobExecutionTimeoutMinutes": 30
  },
  "containerOrchestrationType": "ECS",
  "uuid": "08e52f2c-bdea-3c34-9e2e-97b47f57c467"
}
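
Given the ecsClusterArn above, I assume the next thing to verify is whether the g5.xlarge ever registered with the Batch-managed ECS cluster; as I understand it, a job sits in RUNNABLE as long as no registered container instance can satisfy its requirements:

# Registered container instance count for the Batch-managed cluster
aws ecs describe-clusters \
    --clusters AWSBatch-simon-batch-3-8151f108-6a73-3faf-b52f-fdb1c5cc38c5

# List the registered instances individually (an empty list would mean the instance never joined)
aws ecs list-container-instances \
    --cluster AWSBatch-simon-batch-3-8151f108-6a73-3faf-b52f-fdb1c5cc38c5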

Job queue JSON:

{
  "jobQueueName": "slg-basecall-batch-queue",
  "jobQueueArn": "arn:aws:batch:us-east-1:058264081542:job-queue/slg-basecall-batch-queue",
  "state": "ENABLED",
  "status": "VALID",
  "statusReason": "JobQueue Healthy",
  "priority": 100,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "arn:aws:batch:us-east-1:058264081542:compute-environment/simon-batch-3"
    }
  ],
  "serviceEnvironmentOrder": [],
  "jobQueueType": "ECS",
  "tags": {},
  "jobStateTimeLimitActions": []
}
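
And to see what’s actually sitting in this queue:

# Jobs currently stuck in RUNNABLE on the basecall queue
aws batch list-jobs --job-queue slg-basecall-batch-queue --job-status RUNNABLE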

Further details on the AWS launch template:

AMI:
Name: Deep Learning Base OSS Nvidia Driver AMI (Amazon Linux 2) Version 67.0
ID: ami-069699babcf73a576
Architecture: x86_64
Virtualization: hvm
Root device type: ebs
ENA Enabled: Yes

Instance Type: g5.xlarge
4 vCPU
16 GiB Memory

Storage:
Volume 1 (Root): 1000 GiB, EBS, General purpose SSD (gp3)
Total: 2 volumes, 1250 GiB


IAM Instance Profile: 
Not included in launch template

Advanced Settings:
Most advanced options set to "Don't include in launch template"
No CPU options specified (not supported for g5.xlarge)
No user data specified
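
Since I can’t SSH into the instance, my plan is to pull the console output and look for ECS agent or NVIDIA driver errors during boot (<instance-id> is a placeholder for the g5.xlarge’s ID from the EC2 dashboard):

# Dump the instance's console log without needing SSH access
aws ec2 get-console-output --instance-id <instance-id> --latest --output text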

Hi @Simon_Grimm, welcome to the community! Were you able to solve this, or do you still need help?

Yep, I resolved this!