
Installation

Install prerequisites, then build the RPM.

# yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker
# yum install munge-devel munge-libs mariadb-server mariadb-devel man2html
# export VER=19.05.3-2
# rpmbuild -ta slurm-$VER.tar.bz2

The latest Slurm source package can be downloaded from https://www.schedmd.com/downloads.php.
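Once the build completes, the resulting packages land under the rpmbuild output tree (the exact path depends on your rpmbuild setup; /root/rpmbuild/RPMS/x86_64 is typical when building as root). A minimal sketch of installing them:

# yum localinstall /root/rpmbuild/RPMS/x86_64/slurm-*.rpm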


Configuration

Documentation for the main Slurm configuration file, slurm.conf: https://slurm.schedmd.com/slurm.conf.html

Custom resources such as GPUs are defined in gres.conf. See: https://slurm.schedmd.com/gres.conf.html
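As a hedged illustration (the node name and device path are hypothetical), a node with a single NVIDIA GPU might be described in gres.conf like this; the matching slurm.conf would also need GresTypes=gpu and Gres=gpu:1 on the node's NodeName line:

NodeName=node01 Name=gpu File=/dev/nvidia0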

Cgroup settings are loaded from /etc/slurm/cgroup.conf. If you are running an older version of Slurm on a newer system, you may need to change the cgroup mount point from /cgroup to /sys/fs/cgroup.

Example configuration:

ControlMachine=dragen
ControlAddr=dragen
AuthType=auth/munge
CryptoType=crypto/munge
JobRequeue=0
MaxJobCount=500000
MpiDefault=none
ReturnToService=0
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=1024
FastSchedule=1
MaxArraySize=65535
SchedulerType=sched/backfill
SelectType=select/linear
PriorityType=priority/multifactor # priority/basic means strict FIFO
PriorityDecayHalfLife=7-0
PriorityFavorSmall=YES
PriorityWeightAge=1000
PriorityWeightFairshare=100000
ClusterName=dragen
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
NodeName=dragen NodeAddr=dragen Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=256000
PartitionName=defq Default=YES MinNodes=1 DefaultTime=5-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 Pren


Quick Usage

Start slurmd (on each compute node) and slurmctld (on the controller).
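A sketch using the systemd units shipped with the Slurm packages (slurmctld runs on the controller, slurmd on each compute node):

# systemctl enable --now slurmctld
# systemctl enable --now slurmd

Commonly used commands: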

  • squeue to see jobs on the queue.
  • sbatch [--test-only] job.sh to submit a job.
  • sinfo to show the status of nodes.
  • scancel to cancel a job.
  • scontrol show jobid -dd jobid to see a particular job by ID.

Output from jobs will be placed in the current working directory as slurm-JOB_ID.out.

Job Submission

Use the sbatch job.sh command to submit a job. A job script is an ordinary shell script with added #SBATCH directives that set job parameters, as sketched below.
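A minimal sketch of a job script (the job name, resource requests, and workload are hypothetical):

#!/bin/bash
#SBATCH --job-name=example       # name shown in squeue
#SBATCH --ntasks=1               # number of tasks
#SBATCH --cpus-per-task=4        # cores per task
#SBATCH --mem=4G                 # memory for the job
#SBATCH --time=01:00:00          # wall-clock limit (HH:MM:SS)

srun hostname                    # replace with the real workload

Submitting it with sbatch job.sh writes its output to slurm-JOB_ID.out in the current working directory.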

Job Information

Submitted jobs can be described using scontrol show jobid -dd jobid.
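For example, with a hypothetical job ID of 12345:

scontrol show jobid -dd 12345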

squeue [-u username] [-t RUNNING|PENDING|COMPLETED] [-p partition]

The reason column:

  • If there is no reason, the scheduler hasn't attended to your submission yet.
  • Resources means your job is waiting for an appropriate compute node to become available.
  • Priority indicates your job's priority is lower relative to other jobs being scheduled.
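For instance, to list a user's pending jobs and see their reasons (the username is hypothetical):

squeue -u alice -t PENDING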

See: https://www.rc.fas.harvard.edu/resources/faq/fix_pending_job/


See Also

  • https://www.rc.fas.harvard.edu/resources/documentation/convenient-slurm-commands/

Troubleshooting

Issues can be diagnosed by examining the logs at /var/log/slurmd.log. For example:

[2019-10-30T12:45:32.578] error: cgroup namespace 'freezer' not mounted. aborting
[2019-10-30T12:45:32.578] error: unable to create freezer cgroup namespace
[2019-10-30T12:45:32.578] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2019-10-30T12:45:32.578] error: cannot create proctrack context for proctrack/cgroup
[2019-10-30T12:45:32.578] error: slurmd initialization failed

You need to define the cgroup mount point in the cgroup.conf configuration file:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
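After updating cgroup.conf, restart slurmd and check the log again (a sketch, assuming the systemd unit from the package install):

# systemctl restart slurmd
# tail /var/log/slurmd.log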