Slurm
Slurm is an open-source job scheduler that runs on Linux and is typically used in high-performance computing (HPC) environments.
Cheat Sheet
User commands
Some useful commands when using Slurm as a user.
Command | Description |
---|---|
squeue | Displays jobs in the queue |
squeue -u username | Shows queued jobs for a specific user; you can also filter by partition with -p or by state with -t |
scancel jobid | Cancels a specific job |
scancel -u username | Cancels all jobs by a specific user |
scancel -u username -t pending | Cancels all pending jobs by a specific user |
scontrol hold jobid / scontrol release jobid | Holds or releases a job, preventing it from being scheduled |
sbatch [--test-only] job.sh | Submits a job (--test-only validates the script without submitting it) |
sacct -u username | Shows all jobs by a specific user; add -S 2020-02-22 to show only jobs started since a date, or --format=... for nicer formatting |
scontrol show jobid -dd jobid | Displays a particular job by ID |
salloc --ntasks=1 --cpus-per-task=2 --time=1:00:00 --partition $partition | Starts a small interactive job on a particular partition for 1 hour |
Admin commands
Not strictly for admins, but useful for understanding and managing the system.
Command | Description |
---|---|
sinfo | Shows the status of nodes |
sinfo -p partition | Status of nodes, limited by partition |
sinfo -Rl | Status of nodes along with the reason for any down/drained state |
scontrol update jobid=<jobid> TimeLimit=20-00:00:00 | Extends or changes the time limit of an existing job |
scontrol update nodename=$nodelist state=fail reason="fail" | Sets one or more nodes into the 'FAIL' state to prevent scheduling. Useful if some hardware on the node has failed |
scontrol reboot reason="reboot" $nodelist | Reboots one or more nodes when they become idle |
sacctmgr show qos | Shows partition preemption settings and submission limits |
sacctmgr show assoc format=cluster,user,qos | Shows user and QOS associations |
sacctmgr modify user where name=username set qos=normal | Sets a user to the normal QOS. You can pass multiple QOS as a comma separated list |
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j job-id --allsteps | Shows the status of a running job. Requires privileges to run |
sacctmgr dump cluster-name | Dumps the state of accounts and users. Note that it writes errors to stdout and the actual content to stderr, so redirect accordingly |
sacctmgr -i load clean file=dump-file | Loads all accounts from a file containing dump output |
sacctmgr add account account-name | Creates a new account |
sacctmgr add user username DefaultAccount=account | Creates a new user belonging to a default account |
sacctmgr modify user where name=username set key=value | Updates some field (key) to some value for a user |
sacctmgr show associations | Shows associations that are made to users or accounts |
scontrol write batch_script jobid | Writes the batch script of a job out to a file |
Custom squeue Format
You can define a custom squeue format by exporting the SQUEUE_FORMAT variable. Eg. place in your .bashrc:
export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.20R %q"
Installation
Installation on a RHEL/CentOS system involves building the RPM packages and then installing them. The full instructions can be found at https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#installing-rpms, but you can also build the RPMs for CentOS 8 using the Dockerfile at https://git.steamr.com/leo/slurm-rpm-builder.
Configuration
The main Slurm configuration file is slurm.conf. See: https://slurm.schedmd.com/slurm.conf.html
Custom resources such as GPUs are defined in gres.conf. See: https://slurm.schedmd.com/gres.conf.html
Nodes
The cluster name is defined with ClusterName=name.
Nodes are defined near the end of the file. Format is:
NodeName=node01 NodeAddr=<ip addr node01> CPUs=4 State=UNKNOWN
NodeName=node02 NodeAddr=<ip addr node02> CPUs=4 State=UNKNOWN
NodeName=node03 NodeAddr=<ip addr node03> CPUs=4 State=UNKNOWN
NodeName=node04 NodeAddr=<ip addr node04> CPUs=4 State=UNKNOWN
Partitions
A Slurm partition defines a group of nodes that a job can run on, with additional attributes such as maximum allowed CPU time, job priority, which users have access, etc. You may have multiple partitions defined for a set of nodes. An example use case for having multiple partitions target the same set of nodes would be a priority queue that allows jobs to run before jobs sitting in another partition with a normal priority.
When jobs are submitted to a certain partition, the scheduler will schedule the job to a node defined in the partition. Jobs that do not specify a partition will use the default partition.
The default partition can be defined with:
PartitionName=cluster-name Nodes=nodes[0-5] Default=Yes
Resource Selector Algorithm
The SelectType parameter defines which resource selector algorithm the scheduler will use. Jobs can be scheduled as:
- Multiple jobs per node, by allocating each job individual "consumable resources" such as CPU cores, using select/cons_res or select/cons_tres
- One job per node, by allocating whole nodes using select/linear, or by specifying OverSubscribe=Exclusive
The consumable resource that the algorithm should count is defined with SelectTypeParameters.
For example, to allow jobs to be scheduled to nodes based on number of available CPU cores on each node:
SelectType=select/cons_res
SelectTypeParameters=CR_Core
CGroups
Slurm supports cgroups, which allow control over the resources a job has access to. This is useful to limit the amount of memory, CPU, swap, or devices such as GPUs that a job can access. If you have no resources that require this restriction, you may leave this feature disabled.
CGroup configs are loaded from /etc/slurm/cgroup.conf. If running an older version of Slurm on a newer system, you may need to change the cgroup path from /cgroup to /sys/fs/cgroup.
Example cgroup config:
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
AllowedDevicesFile="/etc/slurm-llnl/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
The whitelisted device list (in cgroup_allowed_devices_file.conf):
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/any-other-mounts
If cgroups are working, you should see cgroup policies applied to your session inside an interactive job:
$ cat /proc/$$/cgroup
12:rdma:/
11:blkio:/system.slice/slurmd.service
10:freezer:/slurm/uid_16311891/job_8497967/step_0
9:pids:/system.slice/slurmd.service
8:memory:/slurm/uid_16311891/job_8497967/step_0/task_0
7:cpu,cpuacct:/slurm/uid_16311891/job_8497967/step_0/task_0
6:cpuset:/slurm/uid_16311891/job_8497967/step_0
5:perf_event:/
4:hugetlb:/
3:net_cls,net_prio:/
2:devices:/slurm/uid_16311891/job_8497967/step_0
1:name=systemd:/system.slice/slurmd.service
These cgroup policies can be found in either /cgroup (old) or /sys/fs/cgroup.
Example Config
Example configuration on a single-node instance:
ControlMachine=dragen
ControlAddr=dragen
AuthType=auth/munge
CryptoType=crypto/munge
JobRequeue=0
MaxJobCount=500000
MpiDefault=none
ReturnToService=0
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=1024
FastSchedule=1
MaxArraySize=65535
SchedulerType=sched/backfill
SelectType=select/linear
PriorityType=priority/multifactor #basic means strict fifo
PriorityDecayHalfLife=7-0
PriorityFavorSmall=YES
PriorityWeightAge=1000
PriorityWeightFairshare=100000
ClusterName=dragen
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
NodeName=dragen NodeAddr=dragen Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=256000
PartitionName=defq Default=YES MinNodes=1 DefaultTime=5-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 Pren
Usage
Submitting a job
All work has to be submitted to the scheduler as a job. A job is a unit of work that the scheduler will orchestrate onto one of the compute nodes in the cluster. There are two types of jobs: interactive and batch jobs.
- Interactive jobs give you an interactive shell to do your work in when the job is scheduled, similar to connecting to a host via SSH and running your work manually. This is typically useful if you need to try out a new program or debug a new workflow and haven't automated the tasks into a shell script.
- A batch job runs without user intervention. Basically, you 'batch' a set of commands together into a shell script and have Slurm run the job for you when the cluster is available.
Before we jump into starting a job, we need to first understand how to specify a job's resource requirements (memory, CPU, GPU, nodes, licenses, etc.) and its job configuration (where to log to, who to email, where the working directory is, etc).
Job configuration and resource requests
When using the sbatch/srun/salloc commands to run a job, you will need to specify the job's configuration and resource requirements. Common flags you'll encounter are given in the table below. For interactive jobs, pass these options to the srun/salloc command. For batch jobs, add these to your job script with the #SBATCH directive.
Flag | Example | Description |
---|---|---|
--time | --time=96:00:00 | The maximum duration of your job |
--partition / -p | -p bigmem | The partition your job will use. Use sinfo to see what partitions the cluster has. If not given, you will use your cluster's default partition |
--account / -A | -A default | The account your job should run under. Check with your HPC staff. This is likely set for you by default |
--qos | | The QOS you want your job to run under. Check with your HPC staff. This is likely set for you by default |
--nodes=<min>[-<max>], -N | -N 1-2 | The minimum (and optionally maximum) number of nodes your job should be given |
--mem=<mem> | --mem=48gb | The amount of memory per node that your job needs |
--ntasks=<tasks>, -n | -n 1 | The total number of tasks your job requires |
--mem-per-cpu= | | The amount of memory per CPU your job requires |
--ntasks-per-node=# | | Number of tasks per node |
--gres gpu:# | --gres gpu:1 | The number of GPUs per node you need in your job |
--exclusive | | Job must be exclusive on the given nodes |
--constraint=<feature> | --constraint=intel&gpu | Constrain your job to nodes supporting a particular feature set. You can find the available features on your cluster by running sinfo -o "%20N %10c %10m %25f %10G" and looking under the AVAIL_FEATURES column |
Interactive jobs
Interactive jobs let you use the compute resources through an interactive shell. The scheduler will allocate the requested resources and drop you into an interactive session on one of the compute nodes in the cluster. Interactive jobs only do work when the user runs a command and only end when the user explicitly exits the session or when the job hits its time limit. As a result, interactive jobs are typically frowned upon by HPC admins since they tend to tie up resources and sit idle.
There are two ways to start an interactive session:
- srun --pty bash along with any job options (such as the ones listed above). This command will create a 'step 0' and drop you into it (don't worry if that makes no sense). As a result, you won't be able to launch subsequent steps with additional resource allocations.
- salloc along with any job options. This is typically the preferred option and a requirement if you need to spawn additional steps in your interactive job. If the LaunchParameters=use_interactive_step option is set, salloc will drop you on the first allocated node rather than on the node you ran salloc on.
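For example, a salloc-based interactive session might look like the following (the partition name and resource sizes are illustrative):

```shell
# Request an interactive allocation: 1 task, 2 CPUs, 1 hour
salloc --ntasks=1 --cpus-per-task=2 --time=1:00:00 --partition=defq

# Inside the allocation, launch work as job steps with srun
srun hostname

# Release the allocation when done
exit
```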
Batch jobs
Submitting a batch job is done by first writing a shell script with your commands. Next, you will need to specify any additional job parameters using the #SBATCH directive. While they appear to be comments, the Slurm scheduler will read these directives and add the specified arguments to your job accordingly. You may also specify these by passing the appropriate arguments to sbatch at the time of submission, but specifying the parameters within the job script is less error prone and makes it clear what resources are required.
An example job script is given below. Notice how we can specify the job name and resources using the #SBATCH directive and the appropriate sbatch arguments. If you want to see what other options are available, run man sbatch.
#!/bin/bash
#SBATCH --job-name=demo-slurm-job
#SBATCH --workdir=/home/leo/some-project
#SBATCH --error=error.txt
#SBATCH --output=stdout.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0-01:00:00
#SBATCH --nodes=1
## #SBATCH --exclude=bad-node[00-05]
scontrol show job $SLURM_JOB_ID
echo "Hello. Start at `date`"
sleep 30
echo "Hello. Done at `date`"
Submit the job by running sbatch myjobscript.sh. You may also override any #SBATCH directives or specify any missing parameters by passing the arguments directly to the sbatch command. For example:
$ sbatch myjobscript.sh
## Specify or override the partition
$ sbatch --partition gpu myjobscript.sh
Once the job is submitted, you can view its status with squeue. More on this later.
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
27 defq test.sh root R 0:01 1 node01
Any job output is placed by default in the current working directory as slurm-JOB_ID.out.
Monitoring your jobs
After submitting your job, it will be placed in the scheduler's queue. Depending on how busy the cluster is, your job will either be scheduled immediately or will sit in the queue. To check the status of your job, use the squeue command.
The basic usage for squeue is squeue [-u username] [-t RUNNING|PENDING|COMPLETED] [-p partition].
Description | Command |
---|---|
Display all jobs | squeue -a |
Display information on a specific job ID | squeue -j <jobid> |
Display information in long format | squeue -l |
Display jobs in the specified states. You can specify one or more states in a comma separated list. Common states you might care about are PENDING, RUNNING, SUSPENDED, COMPLETED, CANCELLED, FAILED, with short forms PD, R, S, CD, CA, F, respectively. There are more states that aren't listed, but those are less common | squeue -t pd,r |
Display jobs by a specific user | squeue -u <username> |
You can mix and match the options listed above to your liking. An example squeue output:
# squeue -t pd,r -l
Tue Feb 22 12:42:06 2022
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
13174861 compute Samuel_j bob RUNNING 3:22:57 1-00:00:00 1 cm1
13172632 compute cazyme_d bob RUNNING 12:28 10:00:00 1 cm2
13175052 compute interact bob RUNNING 2:50:58 12:00:00 1 cm1
13174975 compute interact bob RUNNING 3:07:30 12:00:00 1 cm1
You might have noticed that there is a (REASON) column at the end of the output. For some job states, the reason column will be populated and will specify the reason for a job's state.
- If there is no reason, the scheduler hasn't attended to your submission yet.
- Resources means your job is waiting for an appropriate compute node to open.
- Priority indicates your priority is lower relative to others being scheduled.
A job's configured parameters can be displayed by running scontrol show jobid -dd jobid and can be updated with scontrol update.
See: https://www.rc.fas.harvard.edu/resources/faq/fix_pending_job/
Modifying a Job
Most of a job's parameters can be edited after submission. First, determine which values need to be updated for a particular job:
# scontrol show jobid -dd <jobid>
To update a specific value (such as TimeLimit), run:
# scontrol update jobid=<jobid> TimeLimit=20-00:00:00
To update the TimeLimit to 20 days for all jobs by a particular user:
# squeue -h -u username -o %i | while read i ; do scontrol update jobid=$i TimeLimit=20-00:00:00 ; done
Administration
Database Configuration
Show the current accounting configuration with sacctmgr show configuration.
Associations are accounting records that link a specific user (by username, cluster, account, partition) to attributes relating to their account.
# sacctmgr show associations
# sacctmgr show associations format=account,user,fairshare,QOS,GrpTRES,GrpTRESRunMin
Account Management
A Slurm Account (henceforth simply 'account') is like a UNIX group and consists of one or many users. Accounts can be nested in a hierarchical manner. Every user must have a DefaultAccount.
When adding or modifying an account, the following sacctmgr options are available:
- Cluster= Only add this account to these clusters. The account is added to all defined clusters by default.
- Description= Description of the account. (Default is account name)
- Name= Name of account. Note the name must be unique and can not represent different bank accounts at different points in the account hierarchy
- Organization= Organization of the account. (Default is parent account unless parent account is root then organization is set to the account name.)
- Parent= Make this account a child of this other account (already added).
# sacctmgr add account dtu Description="DTU departments" Organization=dtu
# sacctmgr add account fysik Description="Physics department" Organization=fysik parent=dtu
# sacctmgr add account deptx Description="X department" Organization=deptx parent=dtu
# sacctmgr show account
# sacctmgr show account -s # Show also associations in the accounts
When a user belonging to multiple accounts submits a job, they may specify which account the job belongs to with sbatch -A account.
User Management
Task | Command |
---|---|
Create a new user; the username and the default account are required | # sacctmgr create user name=xxx DefaultAccount=yyy |
Add a user to an account | # sacctmgr add user xxx Account=zzzz |
Change the default account | # sacctmgr add user xxx DefaultAccount=zzzz |
Remove users (here: all users whose default account is test) | # sacctmgr remove user where default=test |
List users; add -s for more information, or a username to limit the output | # sacctmgr show user [-s] [<username>] |
Other specifications can also be set. For example, Fairshare, DefaultQOS, etc. | # sacctmgr modify user where name=xxx account=zzzz set fairshare=0 |
User parameters that can be modified or set:
- Account= Account(s) to add user to (see also DefaultAccount).
- AdminLevel= This field is used to allow a user to add accounting privileges to this user. Valid options are:
- None
- Operator: can add, modify, and remove any database object (user, account, etc), and add other operators. On a SlurmDBD served slurmctld these users can:
- View information that is blocked to regular users by a PrivateData flag (see slurm.conf).
- Create/Alter/Delete Reservations
- Admin: These users have the same level of privileges as an operator in the database. They can also alter anything on a served slurmctld as if they were the slurm user or root.
- Cluster= Only add to accounts on these clusters (default is all clusters)
- DefaultAccount= Default account for the user, used when no account is specified when a job is submitted. (Required on creation)
- DefaultWCKey= Default WCkey for the user, used when no WCkey is specified when a job is submitted. (Only used when tracking WCkey.)
- Name= User name
- Partition= Name of Slurm partition this association applies to.
TRES and Limits
Slurm has the concept of trackable resources (shortened as TRES) which covers the resources of CPU, Memory, and Nodes. Additional resources that you wish to track such as licenses, GPUs, or specialized hardware can be defined as generic resources (GRES).
You must first enable accounting limit enforcement by defining the AccountingStorageEnforce=limits option in slurm.conf. Generic resource types are defined with GresTypes=gpu,bandwidth,etc. TRES that you wish to enforce are then listed under AccountingStorageTRES (Eg. AccountingStorageTRES=cpu,mem,node,gres/gpu,gres/etc).
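Putting those options together, the relevant slurm.conf fragment might look like this (the GRES name is an example):

```
AccountingStorageEnforce=limits
GresTypes=gpu
AccountingStorageTRES=cpu,mem,node,gres/gpu
```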
See: https://slurm.schedmd.com/tres.html
Limits
Impose limits on a specific user:
# sacctmgr modify user xxx set GrpTRES=CPU=1000 GrpTRESRunMin=CPU=2000000
Quality of Service (QOS)
The quality of service levels allow an administrator to specify additional factors used to determine a job's priority, whether jobs will preempt other jobs, and any additional resource restrictions such as the total CPU time and number of jobs that a particular user/group/account can have.
A QOS can be set as a default on the entire cluster or on specific partitions, and can be overridden by specifying the --qos option to the sbatch/salloc/srun commands.
The following limits can be imposed by a QOS:
Type | Limit | Description |
---|---|---|
Job | MaxCPUMinsPerJob | Maximum amount of CPU * wall time any one job can have |
Job | MaxCpusPerJob | Maximum number of CPUs any one job can be allocated |
Job | MaxNodesPerJob | Maximum number of nodes any one job can be allocated |
Job | MaxWallDurationPerJob | Wall clock limit for any jobs running with this QOS |
User | MaxCpusPerUser | Maximum number of CPUs any user with this QOS can be allocated |
User | MaxJobsPerUser | Maximum number of jobs a user can have pending or running |
User | MaxNodesPerUser | Maximum number of nodes that can be allocated to any user |
User | MaxSubmitJobsPerUser | Maximum number of jobs that can be in the system |
Job limits are outlined here: https://lost-contact.mit.edu/afs/pdc.kth.se/roots/ilse/v0.7/pdc/vol/slurm/2.3.1-1/src/slurm-2.3.1/doc/html/qos.shtml
RCS Example
At RCS, the QOS that is defined looks like:
$ sacctmgr show qos format=name,priority,usagefactor,maxtres,maxwall,maxtrespu,maxsubmitpu,mintres
Name Priority UsageFactor MaxTRES MaxWall MaxTRESPU MaxSubmitPU MinTRES
---------- ---------- ----------- ------------- ----------- ------------- ----------- -------------
normal 0 1.000000 7-00:00:00 2000
cpu2019 0 1.000000 cpu=240 7-00:00:00 cpu=240 2000
gpu-v100 0 1.000000 cpu=80,gres/+ 1-00:00:00 cpu=160,gres+ 2000 gres/gpu=1
single 0 1.000000 cpu=200 7-00:00:00 cpu=200,node+ 2000
razi 0 1.000000 7-00:00:00 2000
apophis 0 1.000000 7-00:00:00 2000
razi-bf 0 1.000000 cpu=546 05:00:00 cpu=546 2000
apophis-bf 0 1.000000 cpu=280 05:00:00 cpu=280 2000
lattice 0 1.000000 cpu=408 7-00:00:00 cpu=408 2000
parallel 0 1.000000 cpu=624 7-00:00:00 cpu=624 2000
bigmem 0 1.000000 cpu=80 1-00:00:00 cpu=80 10 mem=4G
cpu2013 0 1.000000 7-00:00:00 2000
pawson 0 1.000000 7-00:00:00 2000
pawson-bf 0 1.000000 cpu=480 05:00:00 cpu=480 2000
theia 10000 1.000000 7-00:00:00 2000
theia-bf 0 1.000000 cpu=280 05:00:00 2000
MinTRES is set to disallow small jobs from the bigmem queue and non-GPU jobs from the GPU queue.
Partitions
PartitionName=single Default=NO Maxtime=10080 Nodes=cn[001-168] QOS=single State=Up
PartitionName=lattice Default=NO Maxtime=10080 Nodes=cn[169-415,417-448] QOS=lattice State=Up
PartitionName=parallel Default=NO Maxtime=10080 Nodes=cn[0513-548,0553-1096] QOS=parallel State=Up
PartitionName=cpu2019 Default=YES Maxtime=10080 Nodes=fc[22-61] QOS=cpu2019 State=Up
PartitionName=cpu2013 Default=NO Maxtime=10080 Nodes=h[1-14] QOS=cpu2013 State=Up
PartitionName=gpu-v100 Default=NO Maxtime=1440 Nodes=fg[1-13] QOS=gpu-v100 State=Up
Users
Users can be set to specific QOS that they can use when submitting a job.
# sacctmgr add qos high priority=10 MaxTRESPerUser=CPU=256
# sacctmgr show qos
# sacctmgr show qos format=name
# sacctmgr --noheader show qos format=name
# sacctmgr -i modify user where name=XXXX set QOS=normal,high
# sacctmgr -i modify user where name=XXXX set QOS+=high
## User's default QOS can be set
# sacctmgr -i modify user where name=XXXX set DefaultQOS=normal
Users must submit jobs to a non-default QOS by specifying the QOS in sbatch:
# sbatch --qos=high ...
Priority
The priority value that Slurm calculates determines the order in which jobs execute. The priority value for a particular job can change over time and isn't set in stone. Priority values are recalculated at the interval specified by PriorityCalcPeriod. It may help to see how priorities are assigned by looking at jobs with sprio and sshare -al.
There are two priority plugins for Slurm: priority/basic, which provides FIFO scheduling, and priority/multifactor, which sets the priority based on several factors.
priority/multifactor
The priority of a job is calculated by a set of parameters and their associated weights. The higher the priority value, the higher the job will be positioned in the queue.
Multifactor priority takes into account the following parameters.
- Nice: User controlled (higher = lower priority)
- Job Age (length of time in queue, eligible to be scheduled)
- Job Size: Number of nodes/cpus allocated by job
- Fairshare: promised resources minus consumed resources
- TRES: Each TRES type has its own factor
- Partition: Value set by partition
- QOS: Value set by QOS
- Association (since 19.05)
- Site (since 19.05): Value set by job_submit or site_factor plugin
More information on each of the parameters above at https://slurm.schedmd.com/priority_multifactor.html#mfjppintro
Each parameter has a weight factor (32-bit integer) and a factor value (0.0-1.0) to allow different weights to be set on each parameter. The calculated priority value is an integer. To avoid losing precision, use at least 1000 for each factor weight.
Additional explanation on some of the factors:
- The fair-share factor takes into consideration the currently allocated and consumed computing resources for each charging account and gives priority to queued jobs under under-utilized accounts. By default, resource consumption is calculated as CPU time (cores * duration) but can be adjusted with the use of TRES factors. What Slurm actually takes into account can be seen under the 'Effective Usage' field in sshare -al. How much resource a user can use is defined by how many 'shares' they have within the account (ie. how much slice of the pie a user gets). Slurm will adjust the fair-share factor periodically so that each user's actual compute usage stays close to their normalized shares. Users that have recently used lots of CPU will find that their pending jobs have reduced priority, allowing jobs by other users to be scheduled. Additional priority can be given to some users by increasing their share count.
- TRES factors give weights (weight * factor) to each type of trackable resource. The weight of each resource type is defined by TRESBillingWeights as a comma separated list. For example: TRESBillingWeights=CPU=1.0,Mem=0.25G. The sum over all resources is added to the job priority. You can also replace the sum with a max using the MAX_TRES priority flag.
- Job size takes into account the number of cores requested. Jobs that take up the entire cluster get a size factor of 1.0 while jobs that take only 1 node get a size factor of 0.0. This is to prevent large jobs from being starved: small jobs will backfill while resources for larger jobs are being freed. The PriorityFavorSmall option flips this behavior so that small jobs get a size factor of 1.0 and vice versa. Additionally, job size can be calculated as requested CPU time over all available CPU time by setting SMALL_RELATIVE_TO_TIME.
- QOS takes into account the user's QOS priority and divides it by the largest priority that is set.
QOS and partition priority on jobs using multiple partitions
The QOS and Partition weights used in the priority calculation use the maximum QOS and partition values available to that job. The consequence is that if a user submits a job targeting multiple partitions, the partition with the highest priority weight and factor will influence the priority of this job against the other partitions.
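A job targets multiple partitions by passing a comma-separated list; the scheduler starts the job on whichever listed partition can run it earliest (the partition names here are illustrative):

```shell
sbatch --partition=cpu2019,razi-bf job.sh
```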
The job priority formula, pulled from the Slurm documentation at https://slurm.schedmd.com/priority_multifactor.html#general:
Job_priority =
site_factor +
(PriorityWeightAge) * (age_factor) +
(PriorityWeightAssoc) * (assoc_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
MAX[ (PriorityWeightPartition) * (partition_factor) ] +
MAX[ (PriorityWeightQOS) * (QOS_factor) ] +
SUM(TRES_weight_cpu * TRES_factor_cpu,
TRES_weight_<type> * TRES_factor_<type>,
...)
- nice_factor
You can see the priority factor weights that are applied to the cluster using sprio -w.
$ sprio -w
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE
Weights 1 1000 100000
You can find the current fairshare factors for all users in your system with this nifty command, derived from Princeton's RC documentation (though this version supports long usernames).
$ join -1 1 -2 2 -o 2.7,1.1,2.1 \
<( SQUEUE_FORMAT="%32u" squeue | sort | uniq ) \
<(sshare -a --parsable | tr '|' ' ' | awk 'NF>=7' | grep -v class | sort -k 2) \
| sort -r | uniq | cat -n | column -t
# Fairshare Username Account
1 0.967464 xxxx razi
2 0.964593 xxxx razi
3 0.936842 xxxx theia
4 0.934928 xxxx all
5 0.934928 xxxx all
6 0.164593 xxxx all
7 0.133014 xxxx all
8 0.044019 xxxx all
9 0.040191 xxxx all
10 0.022967 xxxx all
11 0.018182 xxxx all
12 0.009569 xxxx all
Reservations
A reservation can be created to reserve resources (eg. nodes and cores) for specific users or accounts.
Description | Command |
---|---|
Create a reservation | scontrol create reservation followed by the reservation parameters |
Show reservations | scontrol show reservation |
Update a reservation | scontrol update ReservationName=testing followed by any updated attributes |
Delete a reservation | scontrol delete ReservationName=testing |
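A sketch of creating and using a reservation (the name, times, users, and node list are hypothetical):

```shell
# Reserve two nodes for user bob for two hours
scontrol create reservation ReservationName=testing \
    StartTime=2022-03-01T20:00:00 Duration=02:00:00 \
    Users=bob Nodes=node[01-02]

# Jobs then opt in to the reservation at submission time
sbatch --reservation=testing job.sh
```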
Reservation flags and options
Flags outline how the reservation behaves. Some useful ones to know are:
- maint - maintenance
- ignore_jobs - ignore running jobs when creating the reservation
- time_float - allow start/end time to be specified relative to 'now'. Eg: starttime=now+60minutes
- spec_nodes - reserve specific nodes, specified by the nodes= parameter
See: https://slurm.schedmd.com/reservations.html
job_submit Plugin
The job_submit plugin allows further customization by letting you write a middleware layer that intercepts and changes any submitted jobs before they are queued in the system. This allows for great flexibility in implementing policies on an HPC cluster. For example, your script could choose default partitions based on any requested TRES or GRES resources.
The job_submit middleware can be implemented in Lua. The entry point is slurm_job_submit(job, partition, uid): job contains the job descriptor, partition is a comma delimited list of partitions, and uid is the submitting user's uid. It returns either slurm.ERROR or slurm.SUCCESS.
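A minimal sketch of a job_submit.lua, using the plugin's conventional argument names; the default-partition logic is illustrative, not a recommended policy:

```lua
-- /etc/slurm/job_submit.lua (sketch)
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Route jobs to a default partition if none was requested
    if job_desc.partition == nil then
        job_desc.partition = "defq"
        slurm.log_info("job_submit: defaulting uid %d to partition defq", submit_uid)
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```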
At RCS
The lua script at RCS:
- Ensures partitions are set (or sets defaults if not specified; namely gpu-v100 with GPU, or cpu2019, razi-bf, apophis-bf, pawson-bf, and any other partitions in their account without GPU GRES)
- Ensures the user has permission to use the selected partitions (some partitions are hard coded to specific accounts; users without membership in these accounts cannot use these partitions)
- Ensures that the job time limit is defined (not 4294967294)
- Sets max tasks per node and cpus per task to 1 if unset (65534)
- Removes partitions where nodes cannot satisfy the requested taskspernode * cpuspertask
- Removes partitions where the requested time limit exceeds the partition's limit
Pyxis Plugin
Pyxis is a container plugin for Slurm using the SPANK framework. Containers are created using Enroot and function without any elevated privileges.
Installation
You will need to compile Pyxis from source with the Slurm development packages appropriate for your cluster.
- Install slurm-devel.
- Clone the pyxis repo and run make. You should get a spank_pyxis.so library.
- Enable this library on your slurm cluster by copying it to /usr/local/lib/slurm/spank_pyxis.so
- Ensure that slurm reads the plugstack directory: echo "include /etc/slurm/plugstack.conf.d/*" > /etc/slurm/plugstack.conf
- Ensure that pyxis is loaded: echo "required /usr/local/lib/slurm/spank_pyxis.so" > /etc/slurm/plugstack.conf.d/pyxis.conf
- Restart slurmctld
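After restarting, a quick sanity check is possible (a sketch; pyxis adds --container-* options to srun once it is loaded):

```shell
# Confirm the plugstack configuration references the plugin
grep -r pyxis /etc/slurm/plugstack.conf.d/
# When pyxis is loaded, srun gains --container-* options
srun --help 2>&1 | grep -- --container
```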
Usage
Cheat sheet
Task | Command |
---|---
Import a new container | enroot import docker://centos:8 |
Create an instance of a container | enroot create --name centos centos+8.sqsh |
Running with srun | srun --time 00:10:00 --container-image=centos --container-remap-root cat /etc/os-release |

Running with sbatch: the sbatch script can invoke srun to target a specific container. Eg:
#!/bin/bash
#SBATCH --error=error.txt
#SBATCH --output=stdout.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0-00:10:00
#SBATCH --mem=100m
srun --time 00:10:00 \
--container-image=centos \
--container-remap-root \
bash -c "date ; echo ; whoami ; echo ; grep NAME /etc/os-release ; sleep 10 ; echo ; date" &
# wait for the backgrounded srun, otherwise the batch script exits immediately
wait
Tasks
Adding nodes to Slurm
- Ensure munge and slurm users have the same UID on both the login node and the new node.
- Munge from the login node to the new node: munge -n
- Add the node to slurm.conf on the login node, then redistribute the file to all other nodes
- Restart slurmd on all nodes
- Restart slurmctld on the login node
- Ensure all nodes are visible with sinfo -lN.
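A rough sketch of the steps above as commands (the hostnames, node ranges, and the use of pdsh/pdcp are assumptions):

```shell
# 1. Check munge connectivity from the login node to the new node
munge -n | ssh node99 unmunge

# 2. Distribute the updated slurm.conf to all nodes
pdcp -w node[01-99] /etc/slurm/slurm.conf /etc/slurm/slurm.conf

# 3. Restart slurmd everywhere, then slurmctld on the login node
pdsh -w node[01-99] systemctl restart slurmd
systemctl restart slurmctld

# 4. Confirm the new node is visible
sinfo -lN
```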
Cron-like Jobs
It is possible to have a job that runs and re-schedules itself at a later time. Example from the University of Chicago RCC: https://rcc.uchicago.edu/docs/running-jobs/cron/index.html#cron-jobs
This requires a script or program that, given a cron schedule, prints the next timestamp at which the job should run.
Slurm accepts the following for the --begin option:
--begin=16:00
--begin=now+1hour
--begin=now+60 (seconds by default)
--begin=12/26 (Next December 26)
--begin=2020-01-20T12:34:00
An example submission would look like:
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --output=cron.log
#SBATCH --open-mode=append
#SBATCH --account=cron-account
#SBATCH --partition=cron
#SBATCH --qos=cron
# Here is an example of a simple command that prints the host name and
# the date and time.
echo "Hello on $(hostname) at $(date)."
# Determine amount of time to wait, then use the
# Now + seconds begin time format
Interval=3600
WaitFor=$(($Interval - $(date +"%s") % $Interval))
echo "Next submission in $WaitFor seconds"
# Schedule next job
sbatch --quiet --begin=now+$WaitFor job.sh
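The WaitFor arithmetic above rounds the current time up to the next Interval boundary. It is plain shell arithmetic, so it can be checked without a cluster (the fixed timestamp below is an arbitrary example standing in for $(date +"%s")):

```shell
#!/bin/sh
# Seconds to wait so the next run lands on an interval boundary.
interval=3600                 # one hour, as in the script above
now=1700000000                # a fixed example epoch instead of $(date +"%s")
wait_for=$(( interval - now % interval ))
echo "$wait_for"              # 1700000000 % 3600 = 800, so this prints 2800
```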
PAM Slurm Adopt Module
The PAM Slurm Adopt module allows access to a compute node only when a job is scheduled and running on that node. More information available from: https://slurm.schedmd.com/pam_slurm_adopt.html
Installation requires the slurm-pam_slurm package and configuration of the PAM system-auth and password-auth files. On CentOS 8, you may wish to create a custom authselect profile and add the following lines after pam_unix.so in both system-auth and password-auth:
account sufficient pam_access.so {include if "with-slurm"}
account required pam_slurm_adopt.so {include if "with-slurm"}
By making pam_access sufficient, anyone allowed via /etc/security/access.conf is granted access, while those denied by pam_access can still authenticate if pam_slurm_adopt allows access. Complete the setup by ensuring that access.conf ends with:
# Deny all other users, except root
-:ALL EXCEPT root:ALL
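On CentOS 8, the custom authselect profile workflow mentioned above might look like the following (the profile name and the sssd base profile are assumptions):

```shell
# Create a custom profile based on the stock sssd profile (name is a placeholder)
authselect create-profile slurm-adopt -b sssd
# Edit /etc/authselect/custom/slurm-adopt/system-auth and password-auth to add
# the pam_access/pam_slurm_adopt lines, then activate the profile with the
# custom "with-slurm" feature enabled:
authselect select custom/slurm-adopt with-slurm
```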
Troubleshooting
Issues can be diagnosed by examining the logs at /var/log/slurmd.log.
Missing non primary group membership
A user's job was having issues reading a group directory. Upon further investigation, it turned out the context the job was running in was missing all of the user's non-primary group memberships. This issue is described on the slurm mailing list at https://lists.schedmd.com/pipermail/slurm-users/2018-November/002275.html
The fix is to set LaunchParameters=send_gids, which passes the extended group ID list for a user as part of the launch credential. This stops slurmd/slurmstepd from looking up this information on the compute node.
Additional info at: https://slurm.schedmd.com/SLUG18/field_notes2.pdf
Missing Cgroup namespace 'freezer'
[2019-10-30T12:45:32.578] error: cgroup namespace 'freezer' not mounted. aborting
[2019-10-30T12:45:32.578] error: unable to create freezer cgroup namespace
[2019-10-30T12:45:32.578] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2019-10-30T12:45:32.578] error: cannot create proctrack context for proctrack/cgroup
[2019-10-30T12:45:32.578] error: slurmd initialization failed
You need to define the cgroup path location in the cgroup.conf
configuration file:
CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
Fix a downed node
If a node is reporting as down from sinfo
even though slurmd and munge are running, the node might require a manual update to the idle state. For example, I saw the following:
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 1 down node01
Ensure that the node is reachable using the ping command:
# scontrol ping
Slurmctld(primary/backup) at node01/(NULL) are UP/DOWN
Then, update the status with scontrol update
:
# scontrol
scontrol: update NodeName=node01 State=RESUME
If the node is functional, the state should return to idle
and should begin accepting new jobs.
# sinfo -a
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 1 idle node01
Fixing jobs stuck in completing state
If a job is stuck in the completing (CG) state, check if the node's filesystems are responding. If the user has nothing running, then you can try taking down the node and resuming it to clear the stuck job.
# scontrol update nodename=$node state=down reason=hung_job
# scontrol update nodename=$node state=resume
Prolog error
Some nodes began going into the down state when the prolog script started erroring out in some situations. This results in the node draining with a reason of 'Prolog error'.
On a failing node, slurmd.log
showed the following:
error: Waiting for JobId=17567840 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
Could not launch job 17567840 and not able to requeue it, cancelling job
[17567840.extern] task/cgroup: _memcg_initialize: job: alloc=16384MB mem.limit=16384MB memsw.limit=16384MB job_swappiness=18446744073709551614
[17567840.extern] task/cgroup: _memcg_initialize: step: alloc=16384MB mem.limit=16384MB memsw.limit=16384MB job_swappiness=18446744073709551614
[17567840.extern] done with job
The prolog script we have on the system is extremely simple and is likely not the cause (the return code is OK; otherwise it would fail with something like error: [job 17671841] prolog failed status=1:0):
# cat /etc/slurm/prologue.sh
#!/bin/bash
mkdir /scratch/$SLURM_JOB_ID
chown $SLURM_JOB_USER:$SLURM_JOB_USER /scratch/$SLURM_JOB_ID
chmod 0770 /scratch/$SLURM_JOB_ID
I have a hunch that this is caused by some sort of race condition since it seems to occur with job arrays. This error also appears to happen in bunches as observed from the logs:
# pdsh -w $(sinfo -Rl | grep Prolog | awk '{print $6}' | tr '\n' ,) grep REQUEST_LAUNCH /var/log/slurmd.log
mc59: [2023-01-28T14:26:36.892] error: Waiting for JobId=17587106 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc64: [2023-01-28T20:00:08.289] error: Waiting for JobId=17594449 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc68: [2023-01-28T14:29:00.192] error: Waiting for JobId=17588052 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc67: [2023-01-28T12:32:22.818] error: Waiting for JobId=17567840 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc54: [2023-01-19T12:20:34.011] error: Waiting for JobId=17488316 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc54: [2023-01-19T12:20:34.036] error: Waiting for JobId=17488331 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc54: [2023-01-28T21:01:10.667] error: Waiting for JobId=17607542 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc54: [2023-01-28T21:01:10.671] error: Waiting for JobId=17607540 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc52: [2023-01-28T12:34:08.240] error: Waiting for JobId=17568872 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc62: [2023-01-29T17:42:13.629] error: Waiting for JobId=17628106 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc53: [2023-01-28T12:52:46.579] error: Waiting for JobId=17569708 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc53: [2023-01-28T12:52:46.579] error: Waiting for JobId=17569707 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc49: [2023-01-28T12:32:23.110] error: Waiting for JobId=17567925 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc49: [2023-01-28T12:32:23.115] error: Waiting for JobId=17567932 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc66: [2023-01-29T13:05:13.652] error: Waiting for JobId=17612693 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc66: [2023-01-29T13:05:13.654] error: Waiting for JobId=17612695 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc50: [2023-01-28T09:29:43.758] error: Waiting for JobId=17558698 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc50: [2023-01-28T09:29:43.760] error: Waiting for JobId=17558696 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc50: [2023-01-28T09:29:43.766] error: Waiting for JobId=17558695 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc50: [2023-01-28T09:29:43.767] error: Waiting for JobId=17558697 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc51: [2023-01-28T13:01:24.437] error: Waiting for JobId=17580437 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc51: [2023-01-28T13:01:24.446] error: Waiting for JobId=17580428 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
mc51: [2023-01-28T13:01:24.446] error: Waiting for JobId=17580438 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
On the Slurm control node, the error happens when a user submits a large job array. In slurmctld.log, each of these failures occurred in the middle of a huge submission. Take the last job (17580438) for example: it's one of many that were submitted at the same time.
[2023-01-28T13:01:24.108] sched: Allocate JobId=17580179_9230(17580413) NodeList=mc72 #CPUs=1 Partition=cpu2022-bf24
[2023-01-28T13:01:24.109] sched: Allocate JobId=17580179_9231(17580414) NodeList=mc72 #CPUs=1 Partition=cpu2022-bf24
[2023-01-28T13:01:24.110] sched: Allocate JobId=17580179_9232(17580415) NodeList=mc72 #CPUs=1 Partition=cpu2022-bf24
[2023-01-28T13:01:24.110] sched: Allocate JobId=17580179_9233(17580416) NodeList=mc72 #CPUs=1 Partition=cpu2022-bf24
[2023-01-28T13:01:24.111] sched: Allocate JobId=17580179_9234(17580417) NodeList=mc72 #CPUs=1 Partition=cpu2022-bf24
[2023-01-28T13:01:24.112] sched: Allocate JobId=17580179_9235(17580418) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.113] sched: Allocate JobId=17580179_9236(17580419) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.114] sched: Allocate JobId=17580179_9237(17580420) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.114] sched: Allocate JobId=17580179_9238(17580421) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.115] sched: Allocate JobId=17580179_9239(17580422) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.116] sched: Allocate JobId=17580179_9240(17580423) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.116] sched: Allocate JobId=17580179_9241(17580424) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.117] sched: Allocate JobId=17580179_9242(17580425) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.118] sched: Allocate JobId=17580179_9243(17580426) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.118] sched: Allocate JobId=17580179_9244(17580427) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.119] sched: Allocate JobId=17580179_9245(17580428) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.120] sched: Allocate JobId=17580179_9246(17580429) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.120] sched: Allocate JobId=17580179_9247(17580430) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.121] sched: Allocate JobId=17580179_9248(17580431) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.122] sched: Allocate JobId=17580179_9249(17580432) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.122] sched: Allocate JobId=17580179_9250(17580433) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.123] sched: Allocate JobId=17580179_9251(17580434) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.124] sched: Allocate JobId=17580179_9252(17580435) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.124] sched: Allocate JobId=17580179_9253(17580436) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.125] sched: Allocate JobId=17580179_9254(17580437) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.126] sched: Allocate JobId=17580179_9255(17580438) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.126] sched: Allocate JobId=17580179_9256(17580439) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.127] sched: Allocate JobId=17580179_9257(17580440) NodeList=mc51 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.128] sched: Allocate JobId=17580179_9258(17580441) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.128] sched: Allocate JobId=17580179_9259(17580442) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.129] sched: Allocate JobId=17580179_9260(17580443) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.130] sched: Allocate JobId=17580179_9261(17580444) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.130] sched: Allocate JobId=17580179_9262(17580445) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.131] sched: Allocate JobId=17580179_9263(17580446) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.131] sched: Allocate JobId=17580179_9264(17580447) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.132] sched: Allocate JobId=17580179_9265(17580448) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.133] sched: Allocate JobId=17580179_9266(17580449) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.133] sched: Allocate JobId=17580179_9267(17580450) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.134] sched: Allocate JobId=17580179_9268(17580451) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.135] sched: Allocate JobId=17580179_9269(17580452) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.135] sched: Allocate JobId=17580179_9270(17580453) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.136] sched: Allocate JobId=17580179_9271(17580454) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.137] sched: Allocate JobId=17580179_9272(17580455) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.137] sched: Allocate JobId=17580179_9273(17580456) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.138] sched: Allocate JobId=17580179_9274(17580457) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.139] sched: Allocate JobId=17580179_9275(17580458) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.139] sched: Allocate JobId=17580179_9276(17580459) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.140] sched: Allocate JobId=17580179_9277(17580460) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.140] sched: Allocate JobId=17580179_9278(17580461) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.141] sched: Allocate JobId=17580179_9279(17580462) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
[2023-01-28T13:01:24.142] sched: Allocate JobId=17580179_9280(17580463) NodeList=mc54 #CPUs=1 Partition=cpu2021-bf24
You can find 'REQUEST_LAUNCH' in the req.c source code to see that it's timing out waiting for another thread to run rpc_prolog.
Looking at slurmctld.log further, it looks like some RPC calls are being dropped as the pending count reaches our max_rpc_cnt value:
[2023-01-28T13:01:24.910] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-01-28T13:01:24.924] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-01-28T13:01:24.935] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-01-28T13:01:24.941] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-01-28T13:01:24.950] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2023-01-28T13:01:24.984] sched: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
Likely, this is happening for job arrays that exceed 256 jobs. We could probably mitigate the issue by raising the value; see the slurm.conf documentation, which states:
If the number of active threads in the slurmctld daemon is equal to or larger than this value, defer scheduling of jobs. The scheduler will check this condition at certain points in code and yield locks if necessary. This can improve Slurm's ability to process requests at a cost of initiating new jobs less frequently. Default: 0 (option disabled), Min: 0, Max: 1000.
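If raising it, max_rpc_cnt is set through SchedulerParameters in slurm.conf. The value below is only an illustration; append it to any existing SchedulerParameters list rather than replacing it:

```shell
# slurm.conf fragment (the value 400 is an example; tune for your workload)
SchedulerParameters=max_rpc_cnt=400
```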
See Also
Other Projects
- http://edf-hpc.github.io/slurm-web/index.html - A web application that shows the state of the cluster
- https://github.com/mercanca/spart - CLI tool to show partition states
- https://openondemand.org/ - A web portal that exposes HPC resources for users. See Open OnDemand for more information.
- https://open.xdmod.org/9.0/ - XDMoD gathers and shows resource usage on Slurm
Resources
- https://www.rc.fas.harvard.edu/resources/documentation/convenient-slurm-commands/
- https://rcc.uchicago.edu/docs/using-midway/index.html
- https://www.dkrz.de/up/systems/mistral/running-jobs/slurm-introduction
- https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting
- https://lost-contact.mit.edu/afs/pdc.kth.se/roots/ilse/v0.7/pdc/vol/slurm/14.11.5/build/slurm-14.11.5/doc/html/accounting.shtml
- Yale has some good notes on Slurm too: https://docs.ycrc.yale.edu/clusters-at-yale/job-scheduling/resource-usage/
Installing Slurm on a Raspberry Pi cluster:
- https://medium.com/@glmdev/building-a-raspberry-pi-cluster-784f0df9afbd
- https://medium.com/@glmdev/building-a-raspberry-pi-cluster-aaa8d1f3d2ca
- https://medium.com/@glmdev/building-a-raspberry-pi-cluster-f5f2446702e8
Slurm Lua Scripting