Seqera Platform Compute Environment error

Our network was upgraded to RHEL9, with some other changes made along the way. When I try to update the compute environment, I get an error:

Failed to validate HPC scheduler availability - command: [sinfo, --version]; status=1; output:

However, when I try running from the command line it works fine:

>ssh mbsr@login2.vacc.uvm.edu sinfo --version
slurm 24.05.4

Any suggestions for figuring this out?

Hey Ramiro!

Did they change the networking policies for the HPC? When you log into the HPC do you have to connect to a VPN or be on campus first?

Many/most HPC environments don’t allow ssh connections from public IP addresses. Usually you have to ask the network admins to whitelist Seqera Cloud’s IP addresses to allow inbound SSH connections, or switch to the Tower Agent credential type instead.

I found the page in the docs where you can find the IP addresses used by Seqera Cloud:

I’m one of Ramiro’s system administrators.

This is not an issue with a firewall. The logs on our machine indicate successful network connection from 18.171.4.252, and the ssh key is correct and being accepted. We see a session being opened, some time elapses, and then the connection is closed from the Seqera end. For whatever reason, Seqera does not seem to be recognizing that the session is open, so it closes the connection after a few minutes.

Please note that using a real ssh client from a remote host terminal window other than Seqera and running sinfo --version works fine. The problem seems to be that Seqera’s connection method is not recognizing that a session has been granted.

Note that we have older machines, and the sinfo command seems to work when Seqera is pointed to those. The new and old machines share a common network home directory.

What Ramiro really needs is a way to get debugging information from the Seqera server to see what it thinks is happening after the session opens. As far as we can tell, everything is right on our end.


I met with the sysadmins at the University and did some troubleshooting. We verified with tcpdump that packets are going between both hosts. Something in your python connection doesn’t seem to be recognizing that a session is open. Note: the change from the old cluster to the new cluster is that session startup is now handled by systemd, which it was not before. Testing occurred March 28 14:20 to 14:40 Eastern Time (New York time). Our host name is login4.vacc.uvm.edu and we were using user mbsr. We are happy to send you more login information privately.

Hi @ramirobarrantes and @Crows_Laughing !

I passed your report on to our engineers, who cross-referenced it with our internal logs. We were able to validate that the ssh connection is formed correctly and that Seqera’s services recognize that the ssh session is open.

The problem comes during the second step in the connection process, when trying to determine if sinfo is available. The full command that Seqera Platform runs from within the ssh session during this check is bash --login -ec 'sinfo --version'.

In your case, the error is likely simply what the message says. Somehow in the bash environment that Seqera Platform gets through SSH, sinfo and other slurm commands aren’t added to $PATH. Some places to look at to help you troubleshoot are:

  • Instead of running ssh mbsr@login2.vacc.uvm.edu sinfo --version as a single command, run sinfo --version within an already created ssh session.
  • Check if your local ssh client has some configuration (e.g., some files in the ~/.ssh folder) that may be influencing what’s in $PATH.
  • Check for differences in the bash profiles of the older and newer machines (unless that is included in “common network home directory”).
  • Check to ensure all of the network paths are properly mounted on the newer server nodes.
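
The first and third checks above can be sketched as a small script run on the login node. This is only an illustration: probe and CMD are hypothetical names, and it uses command -v to ask whether the command resolves on $PATH rather than running sinfo itself. The last invocation mirrors the flags Seqera Platform uses.

```shell
# Compare how a command resolves under different bash invocations.
# CMD defaults to sinfo; set CMD to another command name to try it elsewhere.
CMD="${CMD:-sinfo}"

# probe LABEL BASH_FLAGS...
# Runs `command -v $CMD` in a child bash started with the given flags and
# reports the exit status together with whatever was printed.
probe() {
    local label="$1"; shift
    local out status
    out=$(bash "$@" "command -v $CMD" 2>&1) && status=0 || status=$?
    printf '%-24s status=%s output=%s\n' "$label" "$status" "${out:-<empty>}"
}

probe "plain -c"        -c
probe "login -c"        --login -c
probe "login -ec"       --login -ec
```

If only the last line reports a non-zero status, the difference is tied to the -e flag rather than to $PATH as such.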

Hope this helps in your troubleshooting!

Thanks for suggesting running the shell and the command it is to invoke; that revealed the issue.

Our newer installation runs

$ bash --version
GNU bash, version 5.1.8(1)-release (x86_64-redhat-linux-gnu)

as distributed with Red Hat 9 (and equivalents). The version of bash on the older installation is

$ bash --version
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)

as distributed with Red Hat 7 (and equivalents). The combination of -ec flags does not seem to work with bash version 5, as it comes with Red Hat 9.

Using your suggestion, on the older installation we get

$ bash --login -ec 'sinfo --version'
slurm 23.02.7
$ echo $?
0

Whereas on the newer installation, we get

$ bash --login -ec 'sinfo --version'
$ echo $?
1
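
For reference, an empty output with exit status 1 is what the -e flag produces whenever some command fails before the one of interest is reached. A minimal sketch of that behavior, using false as a stand-in for a failing command; this only illustrates what -e does in general and is not a diagnosis of what fails on our system:

```shell
# Without -e, the failing command is ignored and the next one still runs.
out_plain=$(bash -c 'false; echo reached') && st_plain=0 || st_plain=$?

# With -e, bash exits at the first failing command, so nothing is printed
# and the exit status is the one returned by the failed command.
out_e=$(bash -ec 'false; echo reached') && st_e=0 || st_e=$?

echo "without -e: status=$st_plain output='$out_plain'"
echo "with -e:    status=$st_e output='$out_e'"
```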

This can be worked around on our end by setting the BASH_COMPAT variable to 4.2 on the command line, as in

$ BASH_COMPAT=4.2 bash --login -c 'sinfo --version'
slurm 24.05.4

Alternately, one can drop the -e option from the bash command,

$ bash --login -c 'sinfo --version'
slurm 24.05.4

and that also works on the newer installation.

Unfortunately, this seems to work only from the interactive prompt, as neither adding the command

export BASH_COMPAT=4.2

nor adding

shopt -s compat42

to either the user’s .bashrc file or the .bash_profile seems to make the command work without the explicit variable declaration on the bash command line.

I haven’t been able to find a way to get a compatibility mode set up that will enable bash 5.1 to Just Work with the bash --login -ec 'sinfo --version' command. That -e flag seems to be the problem.

We have a pretty generic RH9 installation, so I don’t think we’ve done anything to make this behavior occur.
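
One way to narrow this down further might be to add -x to the same invocation, so that bash traces each command it executes, which should include the login startup files. The sketch below assumes that the last trace lines before the shell exits point at whatever returned a non-zero status; trace_tail is a hypothetical helper name, not part of any tool.

```shell
# trace_tail COMMAND [LINES]
# Runs COMMAND under `bash --login -e` with xtrace (-x) added, then prints
# the last LINES lines of the combined output (default 15) and the status.
trace_tail() {
    local cmd="$1" lines="${2:-15}"
    local trace status
    trace=$(bash --login -exc "$cmd" 2>&1) && status=0 || status=$?
    printf '%s\n' "$trace" | tail -n "$lines"
    echo "exit status: $status"
}

trace_tail 'sinfo --version'
```

Running the same helper on the older installation and comparing the two traces would show where the two environments diverge.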

Hello, actually, this just worked after adding the following to our .bashrc file:

export BASH_COMPAT=4.2
