AWS Batch job stuck in RUNNABLE when using a GPU (A10) compute environment: what should I check next?

I’m encountering an issue with a fairly simple Nextflow pipeline running on AWS Batch. I’ve set up two AWS Batch queues and configured Nextflow to route processes to them based on process labels. I specified two different queues because I need two different AMIs, one of them for running a g5.xlarge on which I use a GPU for Nanopore basecalling. Here’s my nextflow.config:


/************************************************
| CONFIGURATION FILE FOR NAO BASECALL WORKFLOW |
************************************************/

params {
    mode = "basecall"
    debug = true

    // Directories
    base_dir = "*****NAO-ONT-20240912-DCS_RNA3" // Parent for working and output directories (can be S3)

    // Run parameters
    nanopore_run = "NAO-ONT-20240912-DCS_RNA3"

    // Files
    pod5_dir = "${base_dir}/pod5/"
    calls_bam = "${base_dir}/bam/calls.bam"
    fastq_file = "${base_dir}/raw/${nanopore_run}.fastq.gz"
}

includeConfig "${projectDir}/configs/containers.config"
includeConfig "${projectDir}/configs/resources.config"
includeConfig "${projectDir}/configs/profiles.config"

docker {
    enabled = true
    runOptions = '--gpus all'
}

fusion {
    enabled = true
    exportStorageCredentials = true
}

wave {
    enabled = true
}

process {
    withLabel: dorado {
        executor = 'awsbatch'
        queue = 'slg-basecall-batch-queue'
        errorStrategy = "retry"
        maxRetries = 3
    }
}

process {
    errorStrategy = "retry"
    maxRetries = 3
    executor = "awsbatch"
    queue = "simon-batch-queue"
}
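
A side note on the GPU part of this config: as far as I understand, docker.runOptions = '--gpus all' only applies when Nextflow launches containers with local Docker, and with the awsbatch executor the GPU has to be requested through the accelerator directive so that Batch adds a GPU resource requirement to the job. This is the variant I’m considering for the dorado label (same queue as above); I haven’t confirmed yet that it’s actually required:

process {
    withLabel: dorado {
        executor = 'awsbatch'
        queue = 'slg-basecall-batch-queue'
        accelerator = 1        // ask Batch for one GPU on the g5.xlarge
        errorStrategy = "retry"
        maxRetries = 3
    }
}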

The problem I’m facing is that while nextflow.log shows PROCESS BASECALL:BASECALL_POD_5 as submitted, it remains in a “runnable” state on the AWS Batch dashboard and doesn’t progress.
I’ve checked the following:

  1. I can see the launched instance on the EC2 dashboard.
  2. Both the compute environment and the job queue have status “Valid” and state “Enabled”.
  3. I’ve attempted to SSH into the instance using the key pair specified in the launch template, but I’m unable to connect (the CLI checks I plan to run next are sketched right after this list):
    ssh -i ~/.ssh/*** ec2-user@ec2-54-196-156-38.compute-
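
In case it helps, these are the CLI checks I’m planning to run next (<job-id> is a placeholder for the Batch job ID from the dashboard):

# Show the job's status, statusReason, and the resource requirements Batch is trying to satisfy
aws batch describe-jobs --jobs <job-id>

# Confirm the compute environment still reports as healthy and uses the expected instance type
aws batch describe-compute-environments --compute-environments simon-batch-3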

Has anyone encountered similar issues, or does anyone have suggestions for troubleshooting this? I’m particularly interested in:

  1. Why might the job be stuck in a “runnable” state?
  2. Are there any logs or specific areas in AWS I should check to diagnose the problem?
  3. Any common misconfigurations that could cause this behavior?

Let me know if further information from my side would be helpful!

Additional info

Compute environment JSON:

{
  "computeEnvironmentName": "simon-batch-3",
  "computeEnvironmentArn": "arn:aws:batch:us-east-1:058264081542:compute-environment/simon-batch-3",
  "ecsClusterArn": "arn:aws:ecs:us-east-1:058264081542:cluster/AWSBatch-simon-batch-3-8151f108-6a73-3faf-b52f-fdb1c5cc38c5",
  "tags": {},
  "type": "MANAGED",
  "state": "ENABLED",
  "status": "VALID",
  "statusReason": "ComputeEnvironment Healthy",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_PRICE_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 1024,
    "desiredvCpus": 4,
    "instanceTypes": [
      "g5.xlarge"
    ],
    "subnets": [
      "subnet-06fbe1854c70d78dd",
      "subnet-0d0c2057c57a07015",
      "subnet-0200df4c5f948f131",
      "subnet-021abc14f7da9660b",
      "subnet-069530171cc8bb793",
      "subnet-03aece59a8e0cffe3"
    ],
    "securityGroupIds": [
      "sg-0ebaa288146c2d8ab"
    ],
    "ec2KeyPair": "simon_sb",
    "instanceRole": "arn:aws:iam::058264081542:instance-profile/ecsInstanceRole",
    "tags": {},
    "launchTemplate": {
      "launchTemplateName": "simon-basecall-template",
      "version": "$Latest"
    },
    "ec2Configuration": [
      {
        "imageType": "ECS_AL2_NVIDIA"
      },
      {
        "imageType": "ECS_AL2"
      }
    ]
  },
  "serviceRole": "arn:aws:iam::058264081542:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch",
  "updatePolicy": {
    "terminateJobsOnUpdate": false,
    "jobExecutionTimeoutMinutes": 30
  },
  "containerOrchestrationType": "ECS",
  "uuid": "08e52f2c-bdea-3c34-9e2e-97b47f57c467"
}
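
Given the ecsClusterArn above, I assume the next thing to verify is whether the g5.xlarge ever registered with the Batch-managed ECS cluster; as I understand it, a job sits in RUNNABLE as long as no registered container instance can satisfy its requirements:

# Registered container instance count for the Batch-managed cluster
aws ecs describe-clusters \
    --clusters AWSBatch-simon-batch-3-8151f108-6a73-3faf-b52f-fdb1c5cc38c5

# List the registered instances individually (an empty list would mean the instance never joined)
aws ecs list-container-instances \
    --cluster AWSBatch-simon-batch-3-8151f108-6a73-3faf-b52f-fdb1c5cc38c5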

Job queue JSON:

{
  "jobQueueName": "slg-basecall-batch-queue",
  "jobQueueArn": "arn:aws:batch:us-east-1:058264081542:job-queue/slg-basecall-batch-queue",
  "state": "ENABLED",
  "status": "VALID",
  "statusReason": "JobQueue Healthy",
  "priority": 100,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "arn:aws:batch:us-east-1:058264081542:compute-environment/simon-batch-3"
    }
  ],
  "serviceEnvironmentOrder": [],
  "jobQueueType": "ECS",
  "tags": {},
  "jobStateTimeLimitActions": []
}
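
And to see what’s actually sitting in this queue:

# Jobs currently stuck in RUNNABLE on the basecall queue
aws batch list-jobs --job-queue slg-basecall-batch-queue --job-status RUNNABLE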

Further details on the AWS launch template:

AMI:
Name: Deep Learning Base OSS Nvidia Driver AMI (Amazon Linux 2) Version 67.0
ID: ami-069699babcf73a576
Architecture: x86_64
Virtualization: hvm
Root device type: ebs
ENA Enabled: Yes

Instance Type: g5.xlarge
4 vCPU
16 GiB Memory

Storage:
Volume 1 (Root): 1000 GiB, EBS, General purpose SSD (gp3)
Total: 2 volumes, 1250 GiB


IAM Instance Profile: 
Not included in launch template

Advanced Settings:
Most advanced options set to "Don't include in launch template"
No CPU options specified (not supported for g5.xlarge)
No user data specified
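
Since I can’t SSH into the instance, my plan is to pull the console output and look for ECS agent or NVIDIA driver errors during boot (<instance-id> is a placeholder for the g5.xlarge’s ID from the EC2 dashboard):

# Dump the instance's console log without needing SSH access
aws ec2 get-console-output --instance-id <instance-id> --latest --output text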

Hi @Simon_Grimm, welcome to the community! Were you able to solve this, or do you still need help?

Yep, I resolved this!