Slurm is an open-source job scheduler that runs on Linux and is typically used in high-performance computing (HPC) environments.

Cheat Sheet

User Commands

Some useful commands when using Slurm as a user.

Command
    Description

squeue
    Displays jobs in the queue.

squeue -u username
squeue -p partition
squeue -t state
    Shows jobs in the queue, filtered by user, partition, or job state. State can be:
      • R for running
      • PD for pending, etc.

scancel jobid
    Cancels a specific job.

scancel -u username
    Cancels all jobs by a specific user.

scancel -t PD -u username
    Cancels all pending jobs by a specific user.

scontrol hold jobid
scontrol release jobid
    Holds a job, preventing it from being scheduled, or releases a held job.

sbatch [--test-only] job.sh
    Submits a job. With --test-only, the script is validated and an estimated start time is reported, but no job is submitted.

sacct -u username
    Shows jobs by a specific user (by default, only jobs since midnight of the current day).

sacct -S 2020-02-22 -u username
    Shows all of that user's jobs since Feb 22 2020.

sacct -S 2020-02-22 -u username --format=JobID,JobName,MaxRSS,Start,Elapsed
    Same, with a custom selection of output columns.

scontrol show jobid -dd jobid
    Displays the details of a particular job by ID.

System-related commands

Not strictly for admins, but useful for understanding and managing the system.

Command
    Description

sinfo
    Shows the status of nodes.

sinfo -p partition
    Shows the status of nodes, limited to one partition.

scontrol update jobid=<jobid> TimeLimit=20-00:00:00
    Modifies an attribute of a pending job, e.g. priority, time limit, etc.

sacctmgr show qos
    Shows the preemption settings and submission limits defined for each QOS.

sacctmgr show assoc format=cluster,user,qos
    Shows user and QOS associations.

sacctmgr modify user where name=username set qos=normal
    Sets a user to the normal QOS. You can pass multiple QOS as a comma-separated list.

sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j jobid --allsteps
    Shows the status of a running job. Requires privileges to run.

sacctmgr dump cluster-name
sacctmgr dump cluster-name 2> dump-file
    Dumps the state of accounts and user accounts. Since it writes errors to stdout and the actual content to stderr, you might want to use the second form.

sacctmgr -i load clean file=dump-file
    Loads all accounts from a file containing dump output.

sacctmgr add account account-name
    Creates a new account.

sacctmgr add user username DefaultAccount=account
    Creates a new user belonging to a default account.

sacctmgr update user set key=value where user=username
    Updates some field (key) to some value for a user.

sacctmgr show associations
    Shows the associations that are made to users or accounts.

scontrol write batch_script jobid
    Writes the batch script of a job out to a file.

Custom squeue Format

You can define a custom squeue format by exporting the SQUEUE_FORMAT variable. E.g. place in your .bashrc:

export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.20R %q"


Installation

Installation on a RHEL/CentOS system involves building the RPM packages and then installing them. The full instructions can be found at https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#installing-rpms

Obtain the latest Slurm packages at https://www.schedmd.com/downloads.php, then install the prerequisites and build the RPM packages:

# yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker
# yum install munge-devel munge-libs mariadb-server mariadb-devel man2html
# export VER=19.05.3-2
# rpmbuild -ta slurm-$VER.tar.bz2

Configuration

The main Slurm configuration file is slurm.conf; its options are documented in the slurm.conf man page.

Custom resources such as GPUs are defined in gres.conf. See: https://slurm.schedmd.com/gres.conf.html

Nodes

The cluster name is defined with ClusterName=name.

Nodes are defined near the end of the file. Format is:

NodeName=node01 NodeAddr=<ip addr node01> CPUs=4 State=UNKNOWN
NodeName=node02 NodeAddr=<ip addr node02> CPUs=4 State=UNKNOWN
NodeName=node03 NodeAddr=<ip addr node03> CPUs=4 State=UNKNOWN
NodeName=node04 NodeAddr=<ip addr node04> CPUs=4 State=UNKNOWN

Partitions

A Slurm partition defines a group of nodes that a job can run on, with additional attributes such as the maximum allowed CPU time, the job priority, which users have access, etc. You may have multiple partitions defined for the same set of nodes. An example use case for having multiple partitions target the same set of nodes would be to create a priority queue that allows its jobs to run before jobs in another partition with a normal priority.

When jobs are submitted to a certain partition, the scheduler will schedule the job to a node defined in the partition. Jobs that do not specify a partition will use the default partition.

The default partition can be defined with:

PartitionName=cluster-name Nodes=nodes[0-5] Default=Yes

Resource Selector Algorithm

The SelectType defines which resource selector algorithm the scheduler will use.

  • Multiple jobs per node by allocating each job with individual "consumable resources" such as CPU cores using select/cons_res or select/cons_tres
  • Single jobs per node by allocating whole nodes using select/linear, or by specifying OverSubscribe=Exclusive

The consumable resource that the algorithm should count is defined with SelectTypeParameters.

For example, to allow jobs to be scheduled to nodes based on number of available CPU cores on each node:

SelectType=select/cons_res
SelectTypeParameters=CR_Core

CGroups

Slurm supports cgroups, which allow control over the resources a job has access to. This is useful to limit the amount of memory, CPU, swap, or devices such as GPUs that a job can access. If you have no resources that require this restriction, you may leave this feature disabled.

CGroups configs are loaded from /etc/slurm/cgroup.conf. If running an older version of Slurm on a newer system, you may need to configure the cgroup path from /cgroup to /sys/fs/cgroup.

Example cgroup config:

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
AllowedDevicesFile="/etc/slurm-llnl/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30

The allowed devices file (cgroup_allowed_devices_file.conf) lists the devices that jobs may access:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/any-other-mounts

If cgroups are working, you should see cgroups applied to your session inside an interactive job:

$ cat /proc/$$/cgroup                                                            
12:rdma:/
11:blkio:/system.slice/slurmd.service
10:freezer:/slurm/uid_16311891/job_8497967/step_0
9:pids:/system.slice/slurmd.service
8:memory:/slurm/uid_16311891/job_8497967/step_0/task_0
7:cpu,cpuacct:/slurm/uid_16311891/job_8497967/step_0/task_0
6:cpuset:/slurm/uid_16311891/job_8497967/step_0
5:perf_event:/
4:hugetlb:/
3:net_cls,net_prio:/
2:devices:/slurm/uid_16311891/job_8497967/step_0
1:name=systemd:/system.slice/slurmd.service

These cgroup policies can be found in either /cgroup (old) or /sys/fs/cgroup.
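
The check above can be scripted. The following is a minimal sketch, assuming the cgroup v1 path layout shown in the sample output (/slurm/uid_<uid>/job_<jobid>/...); the function names in_slurm_cgroup and slurm_cgroup_jobid are hypothetical and not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: reads the contents of a /proc/<pid>/cgroup file on
# stdin and succeeds if any controller path sits under a Slurm job cgroup.
# Assumes the cgroup v1 layout /slurm/uid_<uid>/job_<jobid>.
in_slurm_cgroup() {
    grep -Eq ':/slurm/uid_[0-9]+/job_[0-9]+'
}

# Hypothetical helper: prints the job ID found in the memory controller path,
# or nothing if the memory controller is not under a Slurm job cgroup.
slurm_cgroup_jobid() {
    sed -n 's|^[0-9]*:memory:/slurm/uid_[0-9]*/job_\([0-9]*\).*|\1|p'
}

# Typical use inside an interactive job:
#   in_slurm_cgroup < /proc/$$/cgroup && echo "session is inside a Slurm cgroup"
#   slurm_cgroup_jobid < /proc/$$/cgroup
```

Reading from stdin rather than opening /proc directly keeps the helpers easy to try against saved output from another node.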

Example Config

Example configuration on a single-node instance:

ControlMachine=dragen
ControlAddr=dragen
AuthType=auth/munge
CryptoType=crypto/munge
JobRequeue=0
MaxJobCount=500000
MpiDefault=none
ReturnToService=0
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=1024
FastSchedule=1
MaxArraySize=65535
SchedulerType=sched/backfill
SelectType=select/linear
PriorityType=priority/multifactor #basic means strict fifo
PriorityDecayHalfLife=7-0
PriorityFavorSmall=YES
PriorityWeightAge=1000
PriorityWeightFairshare=100000
ClusterName=dragen
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
NodeName=dragen NodeAddr=dragen Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=256000
PartitionName=defq Default=YES MinNodes=1 DefaultTime=5-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 Pren

Usage

Submitting a Job (with demo)

Use the sbatch job.sh command to submit a job. A job can be a simple shell script, but can also include job parameters with the addition of #SBATCH directives.

For example:

#!/bin/bash
#SBATCH --job-name=demo-slurm-job
#SBATCH --chdir=/home/leo/some-project
#SBATCH --error=error.txt
#SBATCH --output=stdout.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0-01:00:00
#SBATCH --nodes=1
## #SBATCH --exclude=bad-node[00-05]

echo "Hello. Start at $(date)"
sleep 30
echo "Hello. Done at $(date)"

Submit the job by running sbatch job.sh and then view its status with squeue.

# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                27      defq  test.sh     root  R       0:01      1 node01

Unless --output is specified (as in the script above), job output is written to slurm-JOB_ID.out in the job's working directory.

Retrieving Job Information

All jobs can be listed using squeue [-u username] [-t RUNNING|PENDING|COMPLETED] [-p partition].

The NODELIST(REASON) column gives the reason a job is in its current state:

  • If there is no reason, the scheduler hasn't attended to your submission yet.
  • Resources means your job is waiting for an appropriate compute node to open.
  • Priority indicates your priority is lower relative to others being scheduled.
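
To see at a glance why jobs are waiting, the reasons can be tallied. This is a minimal sketch, assuming the reasons arrive one per line, for instance from squeue -h -t PD -o %r (where %r is squeue's reason field); the function name tally_reasons is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads one pending reason per line on stdin and prints
# "<count> <reason>" lines, most frequent reason first.
tally_reasons() {
    sort | uniq -c | sort -k1,1nr -k2 | sed 's/^ *//'
}

# Typical use on a cluster:
#   squeue -h -t PD -o %r | tally_reasons
```

A large Priority count suggests jobs are simply queued behind others, while a large Resources count points at a shortage of suitable nodes.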

A job's configured parameters can be displayed by running scontrol show jobid -dd jobid and can be updated with scontrol update.

See: https://www.rc.fas.harvard.edu/resources/faq/fix_pending_job/

Modifying a Job

Any parameters of a job can be edited. First, determine what values need to be updated for a particular job:

# scontrol show jobid -dd <jobid>

To update a specific value (such as TimeLimit), run:

# scontrol update jobid=<jobid> TimeLimit=20-00:00:00

To update the TimeLimit to 20 days for all jobs by a particular user:

# squeue -h -u username -o %i | while read -r i ; do scontrol update jobid=$i TimeLimit=20-00:00:00 ; done

Administration

Database Configuration

Show the current accounting configuration with sacctmgr show configuration.

Associations are accounting records that link a specific user (by username, cluster, account, partition) to some attribute relating to their account.

# sacctmgr show associations
# sacctmgr show associations format=account,user,fairshare,QOS,GrpTRES,GrpTRESRunMin

Account Management

A Slurm account (henceforth simply 'account') is like a UNIX group and consists of one or many users. Accounts can be nested in a hierarchical manner. Every user must have a DefaultAccount.

When either adding or modifying an account, the following sacctmgr options are available:

  • Cluster= Only add this account to these clusters. The account is added to all defined clusters by default.
  • Description= Description of the account. (Default is account name)
  • Name= Name of account. Note the name must be unique and can not represent different bank accounts at different points in the account hierarchy
  • Organization= Organization of the account. (Default is parent account unless parent account is root then organization is set to the account name.)
  • Parent= Make this account a child of this other account (already added).

For example:

# sacctmgr add account dtu Description="DTU departments" Organization=dtu
# sacctmgr add account fysik Description="Physics department" Organization=fysik parent=dtu
# sacctmgr add account deptx Description="X department" Organization=deptx parent=dtu

# sacctmgr show account
# sacctmgr show account -s   # Show also associations in the accounts

When a user belonging to multiple accounts submits a job, they may specify which account the job belongs to with sbatch -A account.

User Management

Task
    Command

Create a new user (the username and the default account are required)
    # sacctmgr create user name=xxx DefaultAccount=yyy

Add a user to an account
    # sacctmgr add user xxx Account=zzzz

Change a user's default account
    # sacctmgr modify user where name=xxx set DefaultAccount=zzzz

Remove users whose default account is test
    # sacctmgr remove user where default=test

List users
    # sacctmgr show user
    # sacctmgr show user <username>

List users with more information (-s)
    # sacctmgr show user -s
    # sacctmgr show user -s <username>

Other specifications can also be set. For example, Fairshare, DefaultQOS, etc.

See: https://slurm.schedmd.com/sacctmgr.html

# sacctmgr modify user where name=xxx account=zzzz set fairshare=0

User parameters that can be modified or set:

  • Account= Account(s) to add user to (see also DefaultAccount).
  • AdminLevel= Accounting privilege level of the user. Valid options are:
    • None
    • Operator: can add, modify, and remove any database object (user, account, etc), and add other operators. On a SlurmDBD served slurmctld these users can:
      • View information that is blocked to regular users by a PrivateData flag (see slurm.conf).
      • Create/Alter/Delete Reservations
    • Admin: These users have the same level of privileges as an operator in the database. They can also alter anything on a served slurmctld as if they were the slurm user or root.
  • Cluster= Only add to accounts on these clusters (default is all clusters)
  • DefaultAccount= Default account for the user, used when no account is specified when a job is submitted. (Required on creation)
  • DefaultWCKey= Default WCkey for the user, used when no WCkey is specified when a job is submitted. (Only used when tracking WCkey.)
  • Name= User name
  • Partition= Name of Slurm partition this association applies to.


TRES and Limits

Slurm has the concept of trackable resources (shortened as TRES) which covers the resources of CPU, Memory, and Nodes. Additional resources that you wish to track such as licenses, GPUs, or specialized hardware can be defined as generic resources (GRES).

You must first enable accounting and enforce limits by setting AccountingStorageEnforce=limits in slurm.conf. Generic resource types are defined with GresTypes (e.g. GresTypes=gpu,bandwidth). TRES that you wish to track and enforce are then listed under AccountingStorageTRES (e.g. AccountingStorageTRES=cpu,mem,node,gres/gpu).

See: https://slurm.schedmd.com/tres.html

Limits

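Limits are imposed on a user with sacctmgr. For example, to limit a user's running jobs to 1000 CPUs in total and to 2,000,000 CPU-minutes allocated to running jobs: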

# sacctmgr modify user xxx set GrpTRES=CPU=1000 GrpTRESRunMin=CPU=2000000


QOS

QOS defines different job classes with different priority values (a factor taken into account by the priority plugin) and enforces resource restrictions.

Job limits are outlined here: https://lost-contact.mit.edu/afs/pdc.kth.se/roots/ilse/v0.7/pdc/vol/slurm/2.3.1-1/src/slurm-2.3.1/doc/html/qos.shtml

RCS Example

At RCS, the QOS that is defined looks like:

$ sacctmgr show qos format=name,priority,usagefactor,maxtres,maxwall,maxtrespu,maxsubmitpu,mintres
      Name   Priority UsageFactor       MaxTRES     MaxWall     MaxTRESPU MaxSubmitPU       MinTRES
---------- ---------- ----------- ------------- ----------- ------------- ----------- -------------
    normal          0    1.000000                7-00:00:00                      2000
   cpu2019          0    1.000000       cpu=240  7-00:00:00       cpu=240        2000
  gpu-v100          0    1.000000 cpu=80,gres/+  1-00:00:00 cpu=160,gres+        2000    gres/gpu=1
    single          0    1.000000       cpu=200  7-00:00:00 cpu=200,node+        2000
      razi          0    1.000000                7-00:00:00                      2000
   apophis          0    1.000000                7-00:00:00                      2000
   razi-bf          0    1.000000       cpu=546    05:00:00       cpu=546        2000
apophis-bf          0    1.000000       cpu=280    05:00:00       cpu=280        2000
   lattice          0    1.000000       cpu=408  7-00:00:00       cpu=408        2000
  parallel          0    1.000000       cpu=624  7-00:00:00       cpu=624        2000
    bigmem          0    1.000000        cpu=80  1-00:00:00        cpu=80          10        mem=4G
   cpu2013          0    1.000000                7-00:00:00                      2000
    pawson          0    1.000000                7-00:00:00                      2000
 pawson-bf          0    1.000000       cpu=480    05:00:00       cpu=480        2000
     theia      10000    1.000000                7-00:00:00                      2000
  theia-bf          0    1.000000       cpu=280    05:00:00                      2000

MinTRES is set to disallow small jobs from the bigmem queue and non-GPU jobs from the GPU queue.

Partitions

PartitionName=single Default=NO Maxtime=10080 Nodes=cn[001-168] QOS=single State=Up                                                  
PartitionName=lattice Default=NO Maxtime=10080 Nodes=cn[169-415,417-448] QOS=lattice State=Up                                        
PartitionName=parallel Default=NO Maxtime=10080 Nodes=cn[0513-548,0553-1096] QOS=parallel State=Up                                   
PartitionName=cpu2019 Default=YES Maxtime=10080 Nodes=fc[22-61] QOS=cpu2019 State=Up                                                 
PartitionName=cpu2013 Default=NO Maxtime=10080 Nodes=h[1-14] QOS=cpu2013 State=Up                                                    
PartitionName=gpu-v100 Default=NO Maxtime=1440 Nodes=fg[1-13] QOS=gpu-v100 State=Up

Users

Users can be granted access to specific QOS that they can use when submitting a job.

# sacctmgr add qos high priority=10 MaxTRESPerUser=CPU=256

# sacctmgr show qos
# sacctmgr show qos format=name
# sacctmgr --noheader show qos format=name

# sacctmgr -i modify user where name=XXXX set QOS=normal,high
# sacctmgr -i modify user where name=XXXX set QOS+=high

## User's default QOS can be set
# sacctmgr -i modify user where name=XXXX set DefaultQOS=normal

To use a non-default QOS, users must specify it when submitting with sbatch:

# sbatch --qos=high ...

Priority

The priority value that Slurm calculates determines the order in which jobs execute. The priority of a particular job is not set in stone and can change over time. Priority values are recalculated at the interval specified by PriorityCalcPeriod. It may help to see how priorities are assigned by looking at jobs with sprio and sshare -al.

There are two priority plugins for Slurm: priority/basic which provides FIFO scheduling and the priority/multifactor which sets the priority based on several factors.

priority/multifactor

The priority of a job is calculated by a set of parameters and their associated weights. The higher the priority value, the higher the job will be positioned in the queue.

Multifactor priority takes into account the following parameters.

  • Nice: User controlled (higher = lower priority)
  • Job Age (length of time in queue, eligible to be scheduled)
  • Job Size: Number of nodes/cpus allocated by job
  • Fairshare: promised resources minus consumed resources
  • TRES: Each TRES type has its own factor
  • Partition: Value set by partition
  • QOS: Value set by QOS
  • Association (since 19.05)
  • Site (since 19.05): Value set by job_submit or site_factor plugin

More information on each of the parameters above at https://slurm.schedmd.com/priority_multifactor.html#mfjppintro

Each parameter has a weight factor (32-bit integer) and a factor value (0.0-1.0) to allow different weights to be set on each parameter. The calculated priority value is an integer. To avoid losing precision, use at least 1000 for each factor weight.

Additional explanation on some of the factors:

  • The fair-share factor takes into consideration the currently allocated and consumed computing resources of each charging account and gives priority to queued jobs under under-utilized accounts. By default, resource consumption is calculated as CPU time (cores * duration) but can be adjusted with the use of TRES factors. What Slurm takes into account can be seen in the 'Effective Usage' field of sshare -al. How much resource a user can use is defined by how many 'shares' they have within the account (i.e. how big a slice of the pie the user gets). Slurm adjusts the fair-share factor periodically so that a user's actual compute usage stays close to their normalized shares. Users that have recently used lots of CPU will find that their pending jobs have reduced priority, allowing jobs by other users to be scheduled. Additional priority can be given to some users by increasing their share count.
  • TRES factors give weights (weight * factor) to each type of trackable resource. The weight that each resource type contributes to the job priority is set with PriorityWeightTRES as a comma-separated list. The related per-partition option TRESBillingWeights (for example, TRESBillingWeights=CPU=1.0,Mem=0.25G) defines how each resource type is billed. By default the weighted resources are summed; the MAX_TRES priority flag replaces the sum with the maximum.
  • Job size takes into account the number of cores requested. Jobs that take up the entire cluster get a size factor of 1.0 while jobs that take only 1 node get a size factor of 0.0. This prevents large jobs from being starved; small jobs will backfill while resources for larger jobs are being freed. The PriorityFavorSmall option flips this behavior so that small jobs get a size factor of 1.0 and vice versa. Additionally, the job size factor can be calculated as requested CPU time over all available CPU time by setting the SMALL_RELATIVE_TO_TIME priority flag.
  • QOS takes the priority of the job's QOS and divides it by the largest priority that is set on any QOS.

QOS and partition priority on jobs using multiple partitions

The QOS and partition weights used in the priority calculation use the maximum QOS and partition values that are available to that job. The consequence is that if a user submits a job targeting multiple partitions, the partition with the highest priority weight and factor will influence the priority of this job against the other partitions.
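
The QOS factor described above is a simple normalization, which can be sketched as follows. This is an illustration only: the function name qos_factor is hypothetical, the sample priorities are taken from the RCS QOS table earlier in this document (theia at 10000, the other QOS at 0), and Slurm's own calculation may differ in detail.

```shell
#!/bin/bash
# Hypothetical helper: normalizes a QOS priority against the largest QOS
# priority defined on the cluster, giving a factor between 0.0 and 1.0.
#   $1 = priority of the job's QOS
#   $2 = largest priority set on any QOS
qos_factor() {
    awk -v p="$1" -v max="$2" 'BEGIN {
        if (max > 0) printf "%.2f\n", p / max
        else         print "0.00"
    }'
}

qos_factor 10000 10000   # a job in the theia QOS
qos_factor 0 10000       # a job in the normal QOS
```

With only one QOS carrying a non-zero priority, jobs in that QOS receive the full PriorityWeightQOS contribution and all other jobs receive none.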


The job priority formula, pulled from the Slurm documentation at https://slurm.schedmd.com/priority_multifactor.html#general:

Job_priority =
    site_factor +
    (PriorityWeightAge)       * (age_factor) +
    (PriorityWeightAssoc)     * (assoc_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize)   * (job_size_factor) +
    MAX[ (PriorityWeightPartition) * (partition_factor) ] +
    MAX[ (PriorityWeightQOS)       * (QOS_factor) ] +
    SUM(TRES_weight_cpu    * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
    - nice_factor

You can see the priority factors weights that are applied to the cluster using sprio -w.

$ sprio -w
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE
        Weights                               1       1000     100000
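
As a worked instance of the formula, the sketch below sums weight * factor terms and subtracts a nice value. It uses the age and fair-share weights reported by sprio -w above (1000 and 100000) and treats every other term as zero. The function name job_priority and the sample factor values are hypothetical, and the result is simply truncated to an integer, which may not match Slurm's own rounding.

```shell
#!/bin/bash
# Hypothetical helper: computes an illustrative job priority.
#   stdin = one "weight factor" pair per line (factor expected in 0.0-1.0)
#   $1    = nice value to subtract (defaults to 0)
# The sum is truncated to an integer for display.
job_priority() {
    awk -v nice="${1:-0}" '
        NF == 2 { total += $1 * $2 }
        END     { printf "%d\n", total - nice }
    '
}

# Age weight 1000 with factor 0.5, fair-share weight 100000 with factor 0.25:
# 1000 * 0.5 + 100000 * 0.25 = 25500
printf '%s\n' "1000 0.5" "100000 0.25" | job_priority 0
```

The same arithmetic shows why small weights lose precision: a weight of 10 with a factor of 0.0625 contributes 0.625, which truncates to 0.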

You can find the current fairshare factors for all users in your system with this nifty command, derived from Princeton's RC documentation.

$ join -1 4 -2 2 -o 2.7,1.4,2.1 \
  <(squeue | sort -k 4) \
  <(sshare -a | awk 'NF>=7' | grep -v class | sort -k 2) \
  | sort -r | uniq | cat -n | column -t
#   Fairshare Username  Account
1   0.967464  xxxx      razi       
2   0.964593  xxxx      razi       
3   0.936842  xxxx      theia      
4   0.934928  xxxx      all        
5   0.934928  xxxx      all        
6   0.164593  xxxx      all        
7   0.133014  xxxx      all        
8   0.044019  xxxx      all        
9   0.040191  xxxx      all        
10  0.022967  xxxx      all        
11  0.018182  xxxx      all        
12  0.009569  xxxx      all

job_submit Plugin

The job_submit plugin allows further customization by letting you write a middleware layer that intercepts and changes submitted jobs before they are queued in the system. This allows for great flexibility in implementing policies on an HPC cluster. For example, your script could define default partitions based on any requested TRES or GRES resources.

The job_submit middleware can be implemented in Lua.

slurm_job_submit(job, partition, uid): job is a table describing the submitted job, partition is a comma-delimited list of partitions, and uid is the submitting user's UID. Returns either slurm.ERROR or slurm.SUCCESS.

At RCS

The lua script at RCS:

  1. Ensures partitions are set, or sets defaults if none are specified (namely gpu-v100 for GPU jobs; otherwise cpu2019, razi-bf, apophis-bf, pawson-bf, and any other partitions in the user's account without GPU GRES).
  2. Ensures the user has permission to use the selected partitions (some partitions are hard-coded to specific accounts; users without membership in these accounts cannot use them).
  3. Ensures that the job time limit is defined (not 4294967294).
  4. Sets max tasks per node and CPUs per task to 1 if unset (65534).
  5. Removes partitions whose nodes cannot satisfy the requested taskspernode * cpuspertask.
  6. Removes partitions where the requested time limit exceeds the partition's limit.

Tasks

Adding nodes to Slurm

  1. Ensure the munge and slurm users have the same UID on both the login node and the new node.
  2. Verify that a munge credential created on the login node can be decoded on the new node: munge -n | ssh <new-node> unmunge
  3. Add the node to slurm.conf on the login node, then redistribute the file to all other nodes.
  4. Restart slurmd on all nodes.
  5. Restart slurmctld on the login node.
  6. Ensure all nodes are visible with sinfo -lN.

Cron-like Jobs

A job that runs and re-schedules itself at a later time is possible. Example from University of Chicago RCC: https://rcc.uchicago.edu/docs/running-jobs/cron/index.html#cron-jobs

This requires a script or program that prints the next timestamp for the job to run from a given cronjob schedule.

Slurm accepts the following for the --begin option:

--begin=16:00
--begin=now+1hour
--begin=now+60           (seconds by default)
--begin=12/26            (Next December 26)
--begin=2020-01-20T12:34:00

An example submission would look like:

#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --output=cron.log
#SBATCH --open-mode=append
#SBATCH --account=cron-account
#SBATCH --partition=cron
#SBATCH --qos=cron


# Here is an example of a simple command that prints the host name and
# the date and time.
echo "Hello on $(hostname) at $(date)."

# Determine amount of time to wait, then use the
# Now + seconds  begin time format
Interval=3600
WaitFor=$(($Interval - $(date +"%s") % $Interval))
echo "Next submission in $WaitFor seconds"

# Schedule next job
sbatch --quiet --begin=now+$WaitFor  job.sh
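
The wait computation in the script above can be checked in isolation. This is a minimal sketch with the current time passed in as an argument so that the arithmetic is reproducible; the function name seconds_until_next is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints the number of seconds until the next multiple
# of an interval, mirroring the WaitFor arithmetic in the job script.
#   $1 = interval in seconds
#   $2 = current Unix timestamp (defaults to the real current time)
seconds_until_next() {
    interval="$1"
    now="${2:-$(date +%s)}"
    echo $(( interval - now % interval ))
}

# 5400 seconds past the epoch is half way through an hourly interval,
# so the next run is 1800 seconds away.
seconds_until_next 3600 5400
```

Note that when the timestamp falls exactly on a boundary the result is a full interval rather than zero, so the job never resubmits itself for immediate execution.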

PAM Slurm Adopt Module

The PAM Slurm Adopt module allows access to a compute node only when a job is scheduled and running on that node. More information available from: https://slurm.schedmd.com/pam_slurm_adopt.html

Installation requires the slurm-pam_slurm package and configuration of the PAM system-auth and password-auth files. On CentOS 8, you may wish to create a custom authselect profile and add the following lines after pam_unix.so in both system-auth and password-auth:

account     sufficient                                   pam_access.so                                          {include if "with-slurm"}
account     required                                     pam_slurm_adopt.so                                     {include if "with-slurm"}

By making pam_access sufficient, anyone that's allowed via the /etc/security/access.conf will be allowed access while those denied by pam_access can still authenticate if pam_slurm_adopt allows access. Complete the setup by ensuring that access.conf has at the very end:

# Deny all other users, except root
-:ALL EXCEPT root:ALL

Troubleshooting

Troubleshoot issues by checking the slurmd log at /var/log/slurmd.log on the affected node.

Missing non-primary group membership

A user's job was having issues reading a group directory. Upon further investigation, it turned out that the context the job was running in was missing all of the user's non-primary group memberships. This issue is described on the Slurm mailing list at https://lists.schedmd.com/pipermail/slurm-users/2018-November/002275.html

The fix is to set LaunchParameters=send_gids which passes the extended group id list for a user as part of the launch credential. This stops slurmd/slurmstepd from looking this information up on the compute node.

Additional info at: https://slurm.schedmd.com/SLUG18/field_notes2.pdf

Missing Cgroup namespace 'freezer'

slurmd fails to start and logs the following errors:

[2019-10-30T12:45:32.578] error: cgroup namespace 'freezer' not mounted. aborting
[2019-10-30T12:45:32.578] error: unable to create freezer cgroup namespace
[2019-10-30T12:45:32.578] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2019-10-30T12:45:32.578] error: cannot create proctrack context for proctrack/cgroup
[2019-10-30T12:45:32.578] error: slurmd initialization failed

You need to define the cgroup path location in the cgroup.conf configuration file:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes

Fix a downed node

If a node is reporting as down from sinfo even though slurmd and munge are running, the node might require a manual update to the idle state. For example, I saw the following.

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1   down node01

Ensure that the controller is responding using scontrol ping:

# scontrol ping
Slurmctld(primary/backup) at node01/(NULL) are UP/DOWN

Then, update the status with scontrol update:

# scontrol
scontrol: update NodeName=node01 State=RESUME

If the node is functional, the state should return to idle and should begin accepting new jobs.

# sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1   idle node01
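
When several nodes are down, they can be picked out of sinfo output and resumed in a loop. This is a minimal sketch, assuming node-oriented input of the form "<node> <state>" per line, such as that produced by sinfo -h -N -o '%N %T'; the function name down_nodes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads "<node> <state>" lines on stdin and prints the
# unique names of nodes whose state starts with "down" (e.g. down, down*).
down_nodes() {
    awk '$2 ~ /^down/ { print $1 }' | sort -u
}

# Typical use on the controller (run only after confirming that slurmd and
# munge are healthy on the listed nodes):
#   sinfo -h -N -o '%N %T' | down_nodes | while read -r node; do
#       scontrol update NodeName="$node" State=RESUME
#   done
```

The sort -u step matters because a node that belongs to several partitions appears once per partition in node-oriented sinfo output.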
