Cheat Sheet

Command                                                  Description
squeue                                                   Display jobs in the queue.
sbatch [--test-only] job.sh                              Submit a job (--test-only validates the job without submitting it).
sinfo                                                    Show the status of nodes.
scancel <jobid>                                          Cancel a job.
scontrol show jobid -dd <jobid>                          Display a particular job by ID.
scontrol update jobid=<jobid> TimeLimit=20-00:00:00      Modify an attribute of a pending job, e.g. priority or time limit.
sacctmgr show qos                                        Show QOS definitions, including preemption settings and submission limits.
sacctmgr show assoc format=cluster,user,qos              Show user and QOS associations.
sacctmgr modify user where name=username set qos=normal  Set a user to the normal QOS. Multiple QOS can be passed as a comma-separated list.

You can define a custom squeue format by exporting the SQUEUE_FORMAT environment variable. For example: export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.20R %q"

Installation

Installation on a RHEL/CentOS system involves building the RPM packages and then installing them. The full instructions can be found at https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#installing-rpms

Obtain the latest Slurm source tarball from https://www.schedmd.com/downloads.php, then install the prerequisites and build the RPM packages:

# yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker
# yum install munge-devel munge-libs mariadb-server mariadb-devel man2html
# export VER=19.05.3-2
# rpmbuild -ta slurm-$VER.tar.bz2

Configuration

The main Slurm configuration file is slurm.conf. Documentation: https://slurm.schedmd.com/slurm.conf.html

Custom resources such as GPUs are defined in gres.conf. See: https://slurm.schedmd.com/gres.conf.html

Nodes

The cluster name is defined with ClusterName=name.

Nodes are typically defined near the end of slurm.conf. The format is:

NodeName=node01 NodeAddr=<ip addr node01> CPUs=4 State=UNKNOWN
NodeName=node02 NodeAddr=<ip addr node02> CPUs=4 State=UNKNOWN
NodeName=node03 NodeAddr=<ip addr node03> CPUs=4 State=UNKNOWN
NodeName=node04 NodeAddr=<ip addr node04> CPUs=4 State=UNKNOWN
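
Rather than typing these values by hand, you can ask slurmd what hardware it detects. The sketch below assumes slurmd is already installed on the node and that slurmd -C prints a line starting with NodeName= in slurm.conf syntax. The function name detected_node_line is made up for this example.

```shell
#!/bin/bash
# Print the NodeName line that slurmd detects for the local host,
# with State=UNKNOWN appended to match the format used above.
# Assumption: "slurmd -C" emits one line beginning with "NodeName=".
detected_node_line() {
    slurmd -C | awk '/^NodeName=/ {print $0, "State=UNKNOWN"}'
}

# Example usage (run on each compute node, then paste into slurm.conf):
# detected_node_line
```

Review the output before copying it into slurm.conf; the detected values may include more fields than the CPUs= form shown above.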

Partitions

A Slurm partition defines a group of nodes that a job can run on, along with attributes such as time limits, the job priority, and which users have access. You may define multiple partitions for the same set of nodes. For example, a high-priority partition can target the same nodes as a normal-priority partition so that its jobs run before the jobs in the normal partition.

When jobs are submitted to a certain partition, the scheduler will schedule the job to a node defined in the partition. Jobs that do not specify a partition will use the default partition.

The default partition can be defined with:

PartitionName=cluster-name Nodes=nodes[0-5] Default=Yes

Resource Selector Algorithm

The SelectType parameter defines which resource selection algorithm the scheduler uses. The two common choices are:

  • Multiple jobs per node, where each job is allocated individual "consumable resources" such as CPU cores, using select/cons_res or select/cons_tres
  • A single job per node, where whole nodes are allocated, using select/linear or by specifying OverSubscribe=Exclusive

The consumable resource that the algorithm should count is defined with SelectTypeParameters.

For example, to allow jobs to be scheduled to nodes based on number of available CPU cores on each node:

SelectType=select/cons_res
SelectTypeParameters=CR_Core

CGroups

Slurm supports cgroups, which control the resources a job has access to. This is useful for limiting the amount of memory, CPU, swap, or devices that a job can use.

Cgroup settings are loaded from /etc/slurm/cgroup.conf. If you run an older version of Slurm on a newer system, you may need to change the cgroup mount point from /cgroup to /sys/fs/cgroup.

If you don't need cgroups, you may leave them disabled.

Example cgroup config:

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
AllowedDevicesFile="/etc/slurm-llnl/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

The allowed devices file (the AllowedDevicesFile path above) lists the devices that jobs may access:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/any-other-mounts

Example Config

Example configuration on a single-node instance:

ControlMachine=dragen
ControlAddr=dragen
AuthType=auth/munge
CryptoType=crypto/munge
JobRequeue=0
MaxJobCount=500000
MpiDefault=none
ReturnToService=0
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=1024
FastSchedule=1
MaxArraySize=65535
SchedulerType=sched/backfill
SelectType=select/linear
PriorityType=priority/multifactor #basic means strict fifo
PriorityDecayHalfLife=7-0
PriorityFavorSmall=YES
PriorityWeightAge=1000
PriorityWeightFairshare=100000
ClusterName=dragen
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
NodeName=dragen NodeAddr=dragen Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=256000
PartitionName=defq Default=YES MinNodes=1 DefaultTime=5-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 Pren

Usage

Job Submission

Use the sbatch job.sh command to submit a job. A job can be a simple shell script, but can also include job parameters with the addition of #SBATCH directives.

For example:

#!/bin/bash
#SBATCH --job-name=slurm-job
#SBATCH --chdir=/home/leo/some-project
#SBATCH --error=error.txt
#SBATCH --output=stdout.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --time=99-00:00:00
#SBATCH --nodes=1
#SBATCH --exclude=bad-node[00-05]

echo "Hello. Start at $(date)"
sleep 30
echo "Hello. Done at $(date)"

Submit the job by running sbatch job.sh and then view its status with squeue.

# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                27      defq  test.sh     root  R       0:01      1 node01

Unless --output or --error is set, job output is written to slurm-<jobid>.out in the directory the job was submitted from.
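
When scripting around sbatch, it helps to capture the job ID rather than parse the "Submitted batch job" message. A minimal sketch using sbatch --parsable follows. The helper name submit_job is made up for this example.

```shell
#!/bin/bash
# Submit a job script and print only the numeric job ID.
# --parsable makes sbatch print "jobid" (or "jobid;cluster" on
# multi-cluster setups) instead of the human-readable message.
submit_job() {
    local out
    out=$(sbatch --parsable "$@") || return 1
    # Strip the optional ";cluster" suffix.
    echo "${out%%;*}"
}

# Example usage:
# jobid=$(submit_job job.sh)
# squeue -j "$jobid"
```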

Job Information

Jobs can be listed using squeue [-u username] [-t RUNNING|PENDING|COMPLETED] [-p partition].

The NODELIST(REASON) column shows the reason for a job's state.

  • If there is no reason, the scheduler hasn't attended to your submission yet.
  • Resources means your job is waiting for an appropriate compute node to open.
  • Priority indicates your priority is lower relative to others being scheduled.
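
To see at a glance why jobs are waiting, the reasons can be tallied. This is a rough sketch that relies on the squeue %r format field (the job's reason). The function name pending_reasons is made up for this example.

```shell
#!/bin/bash
# Count pending jobs by reason, most common first.
# -h suppresses the header, -t PENDING filters by state, and
# -o "%r" prints only the reason. Extra arguments (e.g. -u username)
# are passed through to squeue.
pending_reasons() {
    squeue -h -t PENDING -o "%r" "$@" |
        awk '{count[$0]++} END {for (r in count) print count[r], r}' |
        sort -rn
}

# Example usage:
# pending_reasons -u username
```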

A job's configured parameters can be displayed by running scontrol show jobid -dd <jobid> and can be updated with scontrol update.

See: https://www.rc.fas.harvard.edu/resources/faq/fix_pending_job/

Modifying a Job

Many parameters of a job can be edited, especially while the job is still pending. First, check the job's current values:

# scontrol show jobid -dd <jobid>

To update a specific value (such as TimeLimit), run:

# scontrol update jobid=<jobid> TimeLimit=20-00:00:00

To update the TimeLimit to 20 days for all jobs by a particular user:

# squeue -h -u username -o %i | while read -r i ; do scontrol update jobid="$i" TimeLimit=20-00:00:00 ; done

Tasks

Adding nodes to Slurm

  1. Ensure the munge and slurm users have the same UID on both the login node and the new node.
  2. Verify munge authentication from the login node to the new node: munge -n | ssh <new-node> unmunge
  3. Add the node to slurm.conf on the login node, then redistribute the file to all other nodes.
  4. Restart slurmd on all nodes.
  5. Restart slurmctld on the login node.
  6. Ensure all nodes are visible with sinfo -lN.
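
The first two checks can be scripted. The sketch below assumes passwordless ssh from the login node to the new node. The function names uid_matches and munge_works and the host name node05 are placeholders for this example.

```shell
#!/bin/bash
# Step 1: succeed only if the user has the same UID locally and on the node.
uid_matches() {
    local user=$1 node=$2 local_uid remote_uid
    local_uid=$(id -u "$user") || return 1
    remote_uid=$(ssh "$node" id -u "$user") || return 1
    [ "$local_uid" = "$remote_uid" ]
}

# Step 2: succeed only if a credential created here decodes on the node.
munge_works() {
    local node=$1
    munge -n | ssh "$node" unmunge > /dev/null
}

# Example usage:
# for u in munge slurm; do
#     uid_matches "$u" node05 || echo "UID mismatch for $u"
# done
# munge_works node05 && echo "munge OK"
```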

Cron-like Jobs

A job can re-submit itself to run again at a later time, which gives cron-like behavior. Example from University of Chicago RCC: https://rcc.uchicago.edu/docs/running-jobs/cron/index.html#cron-jobs

This requires a script or program that prints the next timestamp for the job to run from a given cronjob schedule.

Slurm accepts the following for the --begin option:

--begin=16:00
--begin=now+1hour
--begin=now+60           (seconds by default)
--begin=12/26            (Next December 26)
--begin=2020-01-20T12:34:00
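
As a rough illustration of the kind of helper this needs, here is a deliberately limited stand-in that handles only daily schedules of the form "M H * * *" and prints the next run time in the last format above. It assumes GNU date. The function name next_daily_run is made up for this example and it is not the next-cron-time program used by RCC.

```shell
#!/bin/bash
# Print the next run time for a daily cron schedule ("M H * * *")
# as YYYY-MM-DDTHH:MM:SS. The optional second argument is the current
# time in epoch seconds (defaults to now). Requires GNU date.
next_daily_run() {
    local schedule=$1 now=${2:-$(date +%s)}
    local minute hour dom mon dow today candidate
    read -r minute hour dom mon dow <<< "$schedule"
    # Only fixed minute/hour with wildcard day, month and weekday is supported.
    if [[ ! $minute =~ ^[0-9]+$ || ! $hour =~ ^[0-9]+$ || "$dom$mon$dow" != '***' ]]; then
        echo "unsupported schedule: $schedule" >&2
        return 1
    fi
    today=$(date -d "@$now" +%Y-%m-%d) || return 1
    candidate=$(date -d "$today $hour:$minute" +%s) || return 1
    # If today's slot has already passed, use the same time tomorrow.
    if (( candidate <= now )); then
        candidate=$(date -d "$today $hour:$minute tomorrow" +%s) || return 1
    fi
    date -d "@$candidate" +%Y-%m-%dT%H:%M:%S
}

# Example usage inside a job script:
# sbatch --quiet --begin="$(next_daily_run '15 5 * * *')" cron.sbatch
```

A full cron parser has to handle ranges, steps and weekday fields, so for anything beyond a fixed daily time a dedicated tool is the better choice.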

An example submission would look like:

#!/bin/bash

#SBATCH --time=00:05:00
#SBATCH --output=cron.log
#SBATCH --open-mode=append
#SBATCH --account=cron-account
#SBATCH --partition=cron
#SBATCH --qos=cron

# Specify a valid cron string for the schedule. This one runs the
# job once per day at 5:15 AM.
SCHEDULE='15 5 * * *'

# Here is an example of a simple command that prints the host name and
# the date and time.
echo "Hello on $(hostname) at $(date)."

# This schedules the next run.
sbatch --quiet --begin=$(next-cron-time "$SCHEDULE") cron.sbatch

Troubleshooting

To troubleshoot issues, start with the logs: /var/log/slurmd.log on compute nodes and /var/log/slurmctld.log on the controller.
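
A small helper for pulling the latest errors out of a log can save some scrolling. This sketch assumes the log format shown in the examples below, where error lines contain "error:". The function name recent_errors is made up for this example.

```shell
#!/bin/bash
# Print the most recent error lines from a Slurm log file.
# Arguments: log file (default /var/log/slurmd.log) and the
# maximum number of lines to show (default 20).
recent_errors() {
    local logfile=${1:-/var/log/slurmd.log} count=${2:-20}
    grep -F 'error:' "$logfile" | tail -n "$count"
}

# Example usage:
# recent_errors /var/log/slurmctld.log 50
```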

Missing Cgroup namespace 'freezer'

[2019-10-30T12:45:32.578] error: cgroup namespace 'freezer' not mounted. aborting
[2019-10-30T12:45:32.578] error: unable to create freezer cgroup namespace
[2019-10-30T12:45:32.578] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2019-10-30T12:45:32.578] error: cannot create proctrack context for proctrack/cgroup
[2019-10-30T12:45:32.578] error: slurmd initialization failed

You need to define the cgroup path location in the cgroup.conf configuration file:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes

Fix a downed node

If sinfo reports a node as down even though slurmd and munge are running on it, the node might need to be returned to the idle state manually. For example, I saw the following:

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1   down node01

Ensure that the controller (slurmctld) is responding using scontrol ping:

# scontrol ping
Slurmctld(primary/backup) at node01/(NULL) are UP/DOWN

Then, update the status with scontrol update:

# scontrol
scontrol: update NodeName=node01 State=RESUME

If the node is functional, its state should return to idle and it should begin accepting new jobs.

# sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1   idle node01
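
With several nodes down, the same fix can be applied in a loop. The sketch below lists down nodes with sinfo and resumes each one. The function names down_nodes and resume_down_nodes are made up for this example. Only use this when you know the nodes are healthy; sinfo -R shows the recorded reason for each down node.

```shell
#!/bin/bash
# List the names of nodes currently in the down state, one per line.
# -h drops the header, -N prints one line per node, -t down filters
# by state, and -o "%N" prints only the node name. sort -u removes
# duplicates for nodes that belong to more than one partition.
down_nodes() {
    sinfo -h -N -t down -o "%N" | sort -u
}

# Return every down node to service.
resume_down_nodes() {
    local node
    down_nodes | while read -r node; do
        echo "Resuming $node"
        scontrol update NodeName="$node" State=RESUME < /dev/null
    done
}

# Example usage:
# sinfo -R
# resume_down_nodes
```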

See Also

Installing Slurm on a Raspberry Pi cluster:
