Shared memory bug?

We are getting the error "Error executing request, Exception : Invalid shared memory size: 0.", which is appearing for code that was working fine just a few days ago. Our code sets the shared memory for the Docker container using the process directive:

process {
    withName: 'TRAIN' {
        container = "blahblah"
        containerOptions = '--shm-size 32768'
    }
}

Was some new code pushed that resulted in an issue with how shared memory is parsed? This is a major blocker on all of our jobs.

Welcome to the community forum, @Eric_Kofman.

There was this PR two weeks ago fixing a bug. It’s the only recent change I can remember.

Where are you running this pipeline? AWS Batch?

Yes: Service: AWSBatch; Status Code: 400; Error Code: ClientException; Request ID: 25a12649-b96c-4dc3-92e0-b61bc5a42bf5; Proxy: null.

Hi Eric

This is unexpected. Can you share the Nextflow versions of failed and successful runs?

You can also specify containerOptions as a map. Can you try this:

process {
    withName: 'TRAIN' {
        container = "blahblah"
        containerOptions = ['shm-size': 32768]
    }
}

For the run that worked it was version 24.04.3, and for the failed job it was version 24.04.4.

How can we specify which version of Nextflow we want our Tower-launched jobs to use?

Figured it out: we can set the version in the pre-run script, and when we use the old version it works again.

This looks a bit worrying. @robsyme are you able to replicate a shared memory size of 0 with the OP's config?

@Eric_Kofman could you help us put together a minimal reproducible example of this?

Phil

I think the issue is that the shm-size is interpreted differently on AWS batch and on Docker.

When running with Docker, the --shm-size can take a unit suffix, but by default is measured in bytes.
On AWS Batch, the shm-size is interpreted as MiB (docs).
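A quick back-of-the-envelope comparison makes the mismatch concrete (plain Python arithmetic, just for illustration):

```python
# The same bare number means very different sizes on each backend.
value = 32768

docker_bytes = value    # Docker: a unit-less --shm-size is bytes
aws_mib = value         # AWS Batch: sharedMemorySize is MiB

docker_in_mib = docker_bytes / 1024**2   # 0.03125 MiB on Docker
aws_in_bytes = aws_mib * 1024**2         # 34359738368 bytes (32 GiB) on AWS Batch
```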

In 24.04.3, we just passed the argument directly through to the executor, which had two problems:

  1. The number would be interpreted differently on each executor, giving inconsistent behaviours between executors and breaking portability of the workflow.
  2. Docker can accept unit suffixes, but AWS cannot. If a unit suffix was applied, it would work on Docker but break on AWS Batch.

To resolve these issues, 24.04.4 interprets a unit-less integer as a number of bytes (as per the Docker standard). If you specify a unit suffix, Nextflow now does the conversion for you, turning the value into MiB for AWS Batch.
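The new behaviour can be sketched roughly like this (illustrative Python only; shm_size_to_mib is a made-up name, not Nextflow's actual code):

```python
import re

# Rough sketch of the 24.04.4 behaviour described above (not Nextflow's real code).
UNIT_FACTORS = {'b': 1, 'k': 1024, 'm': 1024**2, 'g': 1024**3}

def shm_size_to_mib(value: str) -> int:
    """Interpret an --shm-size value as bytes when unit-less (Docker's
    default unit) and convert it to whole MiB, as AWS Batch requires."""
    match = re.fullmatch(r'(\d+)\s*([bkmg]?)', value.strip().lower())
    if match is None:
        raise ValueError(f'Invalid shm-size: {value!r}')
    number, unit = int(match.group(1)), match.group(2) or 'b'
    size_bytes = number * UNIT_FACTORS[unit]
    return size_bytes // 1024**2   # AWS Batch only accepts whole MiB

shm_size_to_mib('32768')    # unit-less -> 32768 bytes -> 0 MiB (the reported error)
shm_size_to_mib('32768M')   # 32768 MiB -> 32768
```

This also explains the "Invalid shared memory size: 0" message: 32768 bytes is less than one MiB, so the integer conversion yields 0.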

I would recommend adding a unit suffix; Nextflow will turn this into the relevant number of MiB for AWS and pass the string directly to Docker (which does the unit conversion natively):

process {
    withName: 'TRAIN' {
        container = "blahblah"
        containerOptions = '--shm-size 32768M'
    }
}

Makes sense :+1:

Any idea why the error message returns 0 as an invalid memory value, instead of 32768 bytes? I guess they could just be rounding the float number; 0.032768 MB is pretty small…

@Eric_Kofman let us know if the above solves your problem. If so then perhaps we can add a note about this to the Nextflow docs.

The AWS API only accepts integer values in MiB. A request for 0.03 MiB would be invalid.
