I’m encountering an issue with a fairly simple Nextflow pipeline running on AWS Batch. I’ve set up two AWS Batch queues and configured Nextflow to route processes to them based on process labels. I need two separate queues because they require different AMIs: one of them runs g5.xlarge instances, whose GPU I use for Nanopore basecalling. Here’s my nextflow.config:
/************************************************
| CONFIGURATION FILE FOR NAO BASECALL WORKFLOW |
************************************************/

params {
    mode = "basecall"
    debug = true

    // Directories
    base_dir = "*****NAO-ONT-20240912-DCS_RNA3" // Parent for working and output directories (can be S3)

    // Run parameters
    nanopore_run = "NAO-ONT-20240912-DCS_RNA3"

    // Files
    pod5_dir = "${base_dir}/pod5/"
    calls_bam = "${base_dir}/bam/calls.bam"
    fastq_file = "${base_dir}/raw/${nanopore_run}.fastq.gz"
}

includeConfig "${projectDir}/configs/containers.config"
includeConfig "${projectDir}/configs/resources.config"
includeConfig "${projectDir}/configs/profiles.config"

docker {
    enabled = true
    runOptions = '--gpus all'
}

fusion {
    enabled = true
    exportStorageCredentials = true
}

wave {
    enabled = true
}

process {
    withLabel: dorado {
        executor = 'awsbatch'
        queue = 'slg-basecall-batch-queue'
        errorStrategy = "retry"
        maxRetries = 3
    }
}

process {
    errorStrategy = "retry"
    maxRetries = 3
    executor = "awsbatch"
    queue = "simon-batch-queue"
}
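For reference, the dorado label is attached in the process definition itself. A minimal sketch of how that wiring looks (the process body and dorado command here are illustrative, not my actual code):

process BASECALL_POD_5 {
    label 'dorado'  // matches withLabel: dorado above, so this runs on slg-basecall-batch-queue

    input:
    path pod5_dir

    output:
    path "calls.bam"

    script:
    """
    dorado basecaller hac ${pod5_dir} > calls.bam
    """
}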
The problem I’m facing is that while nextflow.log shows the process BASECALL:BASECALL_POD_5 as submitted, the corresponding job remains in the RUNNABLE state on the AWS Batch dashboard and never progresses.
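In case the job-level status helps, this is the query I’ve been using to pull the job’s status and statusReason from the CLI (the job ID below is a placeholder):

# Inspect the stuck job's status and statusReason (job ID is a placeholder)
aws batch describe-jobs \
    --region us-east-1 \
    --jobs <job-id> \
    --query 'jobs[0].{status: status, statusReason: statusReason}'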
I’ve checked the following:
- I can see the launched instance on the EC2 dashboard.
- Both the compute environment and the job queue have status “Valid” and state “Enabled”.
- I’ve attempted to SSH into the instance using the key specified in the launch template, but I’m unable to connect:
  ssh -i ~/.ssh/*** ec2-user@ec2-54-196-156-38.compute-
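Since jobs stay RUNNABLE when no container instance registers with the underlying ECS cluster, my next step is to confirm the instance actually joined that cluster. A sketch of the check I have in mind, using the ecsClusterArn from the compute environment JSON below:

# List container instances registered with the Batch-managed ECS cluster;
# an empty list would mean the EC2 instance never registered with ECS
aws ecs list-container-instances \
    --region us-east-1 \
    --cluster AWSBatch-simon-batch-3-8151f108-6a73-3faf-b52f-fdb1c5cc38c5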
Has anyone encountered similar issues, or does anyone have suggestions for troubleshooting this? I’m particularly interested in:
- Why might the job be stuck in the RUNNABLE state?
- Are there any logs or specific areas in AWS I should check to diagnose the problem?
- Any common misconfigurations that could cause this behavior?
Let me know if further information from my side would be helpful!
Additional info
Compute environment JSON:
{
  "computeEnvironmentName": "simon-batch-3",
  "computeEnvironmentArn": "arn:aws:batch:us-east-1:058264081542:compute-environment/simon-batch-3",
  "ecsClusterArn": "arn:aws:ecs:us-east-1:058264081542:cluster/AWSBatch-simon-batch-3-8151f108-6a73-3faf-b52f-fdb1c5cc38c5",
  "tags": {},
  "type": "MANAGED",
  "state": "ENABLED",
  "status": "VALID",
  "statusReason": "ComputeEnvironment Healthy",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_PRICE_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 1024,
    "desiredvCpus": 4,
    "instanceTypes": [
      "g5.xlarge"
    ],
    "subnets": [
      "subnet-06fbe1854c70d78dd",
      "subnet-0d0c2057c57a07015",
      "subnet-0200df4c5f948f131",
      "subnet-021abc14f7da9660b",
      "subnet-069530171cc8bb793",
      "subnet-03aece59a8e0cffe3"
    ],
    "securityGroupIds": [
      "sg-0ebaa288146c2d8ab"
    ],
    "ec2KeyPair": "simon_sb",
    "instanceRole": "arn:aws:iam::058264081542:instance-profile/ecsInstanceRole",
    "tags": {},
    "launchTemplate": {
      "launchTemplateName": "simon-basecall-template",
      "version": "$Latest"
    },
    "ec2Configuration": [
      {
        "imageType": "ECS_AL2_NVIDIA"
      },
      {
        "imageType": "ECS_AL2"
      }
    ]
  },
  "serviceRole": "arn:aws:iam::058264081542:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch",
  "updatePolicy": {
    "terminateJobsOnUpdate": false,
    "jobExecutionTimeoutMinutes": 30
  },
  "containerOrchestrationType": "ECS",
  "uuid": "08e52f2c-bdea-3c34-9e2e-97b47f57c467"
}
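One thing I’m unsure about is which AMI actually ends up on the instance, given that ec2Configuration lists two imageTypes while the launch template pins a specific Deep Learning AMI. I was planning to verify against the running instance like this (the instance ID is a placeholder):

# Confirm which AMI the launched instance is actually running
aws ec2 describe-instances \
    --region us-east-1 \
    --instance-ids <instance-id> \
    --query 'Reservations[0].Instances[0].ImageId'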
Job queue JSON:
{
  "jobQueueName": "slg-basecall-batch-queue",
  "jobQueueArn": "arn:aws:batch:us-east-1:058264081542:job-queue/slg-basecall-batch-queue",
  "state": "ENABLED",
  "status": "VALID",
  "statusReason": "JobQueue Healthy",
  "priority": 100,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "arn:aws:batch:us-east-1:058264081542:compute-environment/simon-batch-3"
    }
  ],
  "serviceEnvironmentOrder": [],
  "jobQueueType": "ECS",
  "tags": {},
  "jobStateTimeLimitActions": []
}
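And since the SSH attempt above fails, I’d also like to rule out the security group. Something along these lines should show whether port 22 ingress is allowed (the group ID is the one from the compute environment above):

# Dump the inbound rules of the compute environment's security group
aws ec2 describe-security-groups \
    --region us-east-1 \
    --group-ids sg-0ebaa288146c2d8ab \
    --query 'SecurityGroups[0].IpPermissions'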
Further details on the AWS launch template:

AMI:
- Name: Deep Learning Base OSS Nvidia Driver AMI (Amazon Linux 2), Version 67.0
- ID: ami-069699babcf73a576
- Architecture: x86_64
- Virtualization: hvm
- Root device type: EBS
- ENA enabled: Yes

Instance type: g5.xlarge (4 vCPUs, 16 GiB memory)

Storage:
- Volume 1 (root): 1000 GiB, EBS, General Purpose SSD (gp3)
- Total: 2 volumes, 1250 GiB

IAM instance profile: not included in launch template

Advanced settings:
- Most advanced options set to “Don’t include in launch template”
- No CPU options specified (not supported for g5.xlarge)
- No user data specified
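If I do manage to get onto the instance, my plan is to check whether the ECS agent is running and registering, roughly like this (assuming the AMI ships the agent as the usual ecs systemd service; if it doesn’t, that in itself might explain the stuck job):

# On the instance: check the ECS agent service and its introspection endpoint
sudo systemctl status ecs
curl -s http://localhost:51678/v1/metadata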