Problems with GPU-enabled wave images

Hello,

I am trying to build a GPU-enabled image using wave. I encountered several problems:

Seqera containers (web UI) does not allow selecting a specific build

jaxlib has multiple builds for each version: some with CUDA, some without. To make sure that a GPU (CUDA) version is installed, I need to pin a specific build. This is not possible with the web tool, so I stuck with the wave CLI. That is not a big problem, but I wanted to mention it. The yml file used with the CLI is shown here.
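
As an illustration of build pinning, here is a minimal sketch of such an environment file (not necessarily identical to the one linked above). A conda spec can take the form name=version=build, and wildcards are allowed in the build part. The `*cuda*` pattern is an assumption that the CUDA builds of jaxlib on conda-forge contain "cuda" in their build string; it is worth checking the actual build strings with `conda search` first.

```shell
# Write a hypothetical environment.yml that selects a CUDA build of jaxlib.
# Spec format: name=version=build, where "*" acts as a wildcard.
# Assumption: CUDA builds on conda-forge have "cuda" in the build string.
cat > environment.yml <<'EOF'
channels:
  - conda-forge
dependencies:
  - jaxlib=*=*cuda*
EOF
```

A file like this can then be passed to the wave CLI with --conda-file.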

Installing CUDA requires the virtual __cuda package

Conda virtual packages are explained here. As the wave servers do not have CUDA installed, it is necessary to set the CONDA_OVERRIDE_CUDA environment variable. This is possible neither with Seqera Containers nor in combination with --conda-file in the wave CLI (unless one uses a --conda-base-image that already sets the environment variable).
Thus, I created a Dockerfile that does this. It was not straightforward to figure out, but not a huge problem.
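
For anyone hitting the same issue, a Dockerfile along these lines is one way to do it. This is a rough sketch under several assumptions, not the exact setup linked below: it assumes the public mambaorg/micromamba base image, an environment.yml in the build context, and "12.0" as a placeholder CUDA version that should match the builds being installed.

```shell
# Write a hypothetical Dockerfile that sets CONDA_OVERRIDE_CUDA before the
# solver runs, so the __cuda virtual package is reported even without a GPU.
# Assumptions: mambaorg/micromamba base image, placeholder CUDA version 12.0,
# and an environment.yml present in the build context.
cat > Dockerfile <<'EOF'
FROM mambaorg/micromamba:latest
ENV CONDA_OVERRIDE_CUDA="12.0"
COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yml /tmp/environment.yml
RUN micromamba install -y -n base -f /tmp/environment.yml \
    && micromamba clean --all --yes
EOF
```

The override only affects dependency resolution at build time; the container still needs a host with a working NVIDIA driver to actually use the GPU at run time.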

Error 137 (out of memory)

You can see my final setup here. The wave build log can be accessed here. Apparently it fails with exit code 137 (OOM). CUDA dependencies are often rather large (several gigabytes).

Wrapping up

I would love to see these issues fixed, but I understand if they are not top priority. GPU computing is on the rise, though, so these problems may become more prominent with time.

I don't think it's an out-of-memory issue. More likely the build was killed because it reached the 15-minute build timeout.

If you authenticate your request by adding your Tower (aka Platform) token, you will get a 25-minute build timeout.

Hey,
this is now the build log with a platform token provided: log

It appears to fail in the same way, again after around 15:30, so apparently the longer time limit did not help :thinking: