Wondering if anyone has experience actually administering Slurm clusters and could shed a little light on something I’ve never found a definite answer on.
When assigning jobs to specific nodes and partitions and requesting memory: I believe our nodes boot with their OS image loaded into RAM (or so I was told), since there’s no physical disk in each node. That works out to roughly 10 GB of memory overhead for the OS, according to what our HPC admins quoted me a few years ago.
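I’m guessing one way to see that overhead directly is to compare what the kernel on a node reports with what Slurm is configured to hand out, something like the following (the node name is just an example, and you may need to add -p with the right partition on the srun):

srun -w compute-1-1-0 grep MemTotal /proc/meminfo                 # memory the kernel sees, in kB
scontrol show node compute-1-1-0 | grep -o 'RealMemory=[0-9]*'    # memory Slurm will allocate, in MB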
I’ve used this command to get node specs:
sinfo --Node --long
Sun Jan 28 17:07:52 2024
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
compute-1-1-0 1 256i allocated 20 2:10:1 256000 0 1 (null) none
compute-1-1-1 1 256i allocated 20 2:10:1 256000 0 1 (null) none
Which shows these nodes as having 256 GB (the MEMORY column is in MB).
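For just the memory numbers, I think sinfo can also print specific fields (this format string is my guess at the relevant ones; both memory columns are in MB):

sinfo -N -o "%N %P %m %e"    # node, partition, configured memory, currently free memory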
I’ve also used
scontrol show nodes
NodeName=compute-7-6-39 Arch=x86_64 CoresPerSocket=8
CPUAlloc=16 CPUTot=16 CPULoad=21.24
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=compute-7-6-39 NodeHostName=compute-7-6-39 Version=18.08
OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
RealMemory=32106 AllocMem=0 FreeMem=188 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=normal
BootTime=2023-11-30T14:30:57 SlurmdStartTime=2023-11-30T14:32:06
CfgTRES=cpu=16,mem=32106M,billing=16
AllocTRES=cpu=16,mem=32106M,billing=16
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Which says these nodes have about 32 GB (RealMemory=32106 MB), but occasionally a job will run over the memory limit when I set it to 32 GB, with something like this samtools sort error: “couldn't allocate memory for bam_mem” (GUDMAP_RBK / RNA-seq issue #108 on GitLab).
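What I’ve ended up doing, and what I’d like a sanity check on, is requesting a bit less than RealMemory and keeping the tool’s own per-thread memory under that. A rough sketch of the kind of batch script I mean (memory numbers, thread count, and file names are just placeholders):

#!/bin/bash
#SBATCH --cpus-per-task=8
# RealMemory on these nodes is 32106 MB, so a round --mem=32G (32768 MB)
# is more than Slurm can allocate on them; ask for a little less instead:
#SBATCH --mem=30000M

# samtools sort's -m is per thread: 8 threads x 3G = 24G total, which
# leaves headroom under the 30000M limit for samtools' own overhead.
samtools sort -@ 8 -m 3G -o sorted.bam input.bam

Is that roughly the right way to think about it, or is there a better rule of thumb for how much to leave for the OS image?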