Getting Started with SLURM


Note: This post has been updated to reflect the changes in the queueing system after the software upgrade of Beskow in June, 2019.


Our supercomputer clusters at PDC, equipped with thousands of multi-core processors, can be used to solve large scientific/engineering problems. Because of their much higher performance compared to desktop computers or workstations, supercomputer clusters are also called high performance computing (HPC) clusters. Common application fields of HPC clusters include machine learning, galaxy simulation, climate modelling, bioinformatics, computational physics, quantum chemistry, etc.

Building an HPC cluster demands sophisticated technologies and hardware, but fortunately a regular HPC user doesn’t have to worry too much about that. As an HPC user you can submit jobs that request the compute nodes (physical groups of processors) to do the calculations/simulations you want. But note that you are not the only user of an HPC cluster; there are typically many users sharing the cluster at the same time, all submitting their own jobs. You may have realized by now that there needs to be some sort of queueing system that organizes the jobs and distributes them to the compute nodes. This post will briefly introduce SLURM, which is used in all PDC clusters and is the most widely used workload manager for HPC clusters.

What is SLURM?

SLURM, or Simple Linux Utility for Resource Management, is an open-source cluster management and job scheduling system. It provides three key functions

  • Allocation of resources (compute nodes) to users’ jobs
  • Framework for starting/executing/monitoring jobs
  • Queue management to avoid resource contention

In other words, SLURM oversees all the resources in the whole HPC cluster. Users then send their jobs (requests to run calculations/simulations) to SLURM for later execution. SLURM will keep all the submitted jobs in a queue and decide what priorities the jobs have and how the jobs are distributed to the compute nodes. SLURM provides a series of useful commands for the user which we will now go through.

How to submit jobs

The sbatch command submits a job script to the SLURM queue for later execution

$ sbatch job_script.sh

where job_script.sh is a batch script written in bash syntax. The script usually contains options preceded with #SBATCH to tell SLURM what kind of resources are to be allocated to this job. The most common options include the account (time-allocation) to be charged, the expected duration of the job, and the number of compute nodes needed by the job

#SBATCH -A <time_allocation>
#SBATCH -t 00:10:00
#SBATCH --nodes=2

Here, -A (--account) specifies the project account to be used by this job, -t (--time) sets a limit on the total run time of this job, and -N (--nodes) requests the number of compute nodes to be allocated to this job.

In addition to --nodes, it is also common to use the --ntasks-per-node option for a finer control of parallel execution. As the name indicates, --ntasks-per-node requests that the number of tasks/processes to be executed on each node will be set to the specified number.
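
For example, the following lines request two nodes with 32 tasks on each node, i.e. 64 tasks in total (the numbers here are only for illustration; adjust them to your cluster and application)

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32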

Other useful options include

  • -J or --job-name: specifies the name for the job.
  • -e or --error: specifies the output file for the job script’s standard error.
  • -o or --output: specifies the output file for the job script’s standard output.

Please refer to this page for a complete list of sbatch options.

Below is a typical job script

#!/bin/bash -l

#SBATCH -J myjob
#SBATCH -A <time_allocation>
#SBATCH -t 01:00:00

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32

#SBATCH -e error_file.e
#SBATCH -o output_file.o

# Run the executable in parallel
<my_parallel_launcher> -n 128 ./myexe > my_output_file

On the two PDC clusters, Beskow and Tegner, srun and mpirun, respectively, are usually used to launch parallel jobs. srun is the command to run parallel applications via SLURM. Parallel programs on Beskow should always be invoked by the srun command

srun -n 128 ./myexe > my_output_file

where -n specifies the total number of cores to be used. Note that the -n option of srun should match the product of the --nodes and --ntasks-per-node options of sbatch.

On Tegner, mpirun is the command for parallel execution

# Load the Intel MPI module
module load intelmpi/17.0.1
mpirun -n 128 ./myexe > my_output_file

Note that the mpirun launcher is only available after the IntelMPI module (or another MPI module such as OpenMPI) is loaded.

How to request nodes

Different types of nodes are available on PDC’s clusters. Beskow has two types of nodes.

  • 1676 nodes have Intel Xeon Haswell CPUs (32 cores/node).
  • 384 nodes have Intel Xeon Broadwell CPUs (36 cores/node).

As you can see, the majority of the compute nodes on Beskow are equipped with Haswell CPUs. To request that your job should exclusively use nodes with Haswell CPUs, you need to specify

#SBATCH --ntasks-per-node=32
#SBATCH --constraint=Haswell

where the “--constraint” option makes sure that the job runs only on Haswell nodes. The Broadwell CPUs, however, are mostly reserved for users from Scania. If you would like to exclusively execute your job on Broadwell nodes, please contact PDC Support.

Tegner has thin and fat nodes that differ in their available memory.

  • All the 55 thin nodes have 512 GB memory and 24 cores per node.
  • 5 of the 10 fat nodes have 1 TB memory and 48 cores per node.
  • 5 of the 10 fat nodes have 2 TB memory and 48 cores per node.

If you need to use the fat nodes on Tegner, specify “#SBATCH --mem=1000000” for 1 TB memory and “#SBATCH --mem=2000000” for 2 TB memory.
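
As a minimal sketch, a request for one of the 1 TB fat nodes could look like this (the time limit and allocation are placeholders)

#SBATCH -A <time_allocation>
#SBATCH -t 01:00:00
#SBATCH --nodes=1
#SBATCH --mem=1000000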

How to request GPUs

PDC’s Tegner cluster is also equipped with GPU (graphics processing unit) cards. You have probably heard of GPUs – they are specialized electronic circuits designed to process computer graphics with a highly parallel structure. Many modern GPUs have also been designed for general purpose computing, taking advantage of the SIMD (single instruction multiple data) architecture. General purpose GPUs are now very popular in fields such as machine learning, bioinformatics, and molecular dynamics.

Two types of GPU cards are available on the fat and thin nodes of Tegner.

  • All the 10 fat nodes have 2 Nvidia Quadro K420 cards/node.
  • 46 of the thin nodes have 1 Nvidia Quadro K420 card/node.
  • 9 of the thin nodes have 1 Nvidia Tesla K80 card/node.

If you need to use GPUs in your calculations, use the --gres=<gpu_list> option in the job script. The format for the <gpu_list> entry is “name[[:type]:count]”. Some examples are given below. Note that one K80 card is counted as two GPUs.

  • #SBATCH --gres=gpu:1
  • #SBATCH --gres=gpu:K420:1
  • #SBATCH --gres=gpu:K80:2
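
Putting it together, a minimal GPU job script on Tegner could look like the sketch below. It requests one node with a K80 card and simply lists the allocated GPUs; replace the last line with your own GPU application. (This is an illustration only; it assumes nvidia-smi is available on the GPU nodes.)

#!/bin/bash -l

#SBATCH -J gpujob
#SBATCH -A <time_allocation>
#SBATCH -t 00:30:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:K80:2

# List the allocated GPUs (assumes nvidia-smi is installed on the GPU nodes)
nvidia-smi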

How to monitor jobs

We have discussed how to submit jobs with sbatch and how to request certain nodes with specific CPUs and GPUs. Submitted jobs will stay in the SLURM queue and you can monitor the state of your jobs with squeue before they are finished

$ squeue -u <username>

The output of the squeue command consists of several columns including job ID, partition, job name, username, job state, elapsed time, number of nodes, node list, etc.

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  733608      main     job1    user1  R    6:11:55      1 t03n03
  733814      main     job2    user2 PD       0:00      1 (Priority)

Here, a partition is a logical group of nodes that allows certain types of jobs to run. You can imagine that short jobs (e.g. 10 min) and long jobs (e.g. 24 h) belong to different partitions, and the same applies to small jobs (e.g. 1 node) and large jobs (e.g. 128 nodes). A partition is like a subset of the SLURM queue. Note, however, that there is no need to specify the partition in your job submission script, as jobs are automatically routed to the proper partitions based on the requested time, job size, type of nodes, etc.
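
If you are curious about which partitions exist on a cluster, you can list them with the standard SLURM command sinfo

$ sinfo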

Job state is listed in the ST column of the output of the squeue command. The most common job state codes are

  • R: Running
  • PD: Pending
  • CG: Completing
  • CA: Cancelled

For more job state codes, please refer to this page.

For a job that is pending, squeue gives the reason, which in most cases is Resources or Priority. The former indicates that there currently are not enough resources to run your job and you have to wait somewhat longer. The latter suggests that your job has a low priority, possibly because you have recently used a significant amount of resources or because your project has exceeded its time allocation quota.

For a job that is running, squeue shows the node (or node list if there’s more than one node) that the job is running on. On Tegner, it is possible to log in to the compute node to check the status of your calculation. If you have a single-node job running on node t03n03, you may log in to the compute node using the command below. For security reasons this should be done from your local computer:

$ ssh <username>@t03n03.pdc.kth.se

Note, however, that it is not possible to log in to the compute nodes of Beskow.

If you want to know more about your jobs listed by squeue, use scontrol with the job ID

$ scontrol show job <job_id>

The output will show detailed information about the job, including priority, submit time, start time, end time, working directory, standard output, etc. Read this page to learn more about the scontrol command.

To cancel a specific job, use scancel with the job ID

$ scancel <job_id>
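
scancel also accepts a username, which is convenient if you want to cancel all of your own jobs at once

$ scancel -u <username>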

Interactive nodes

Sometimes you may want to use a compute node interactively, for example for debugging purposes. This can be done with salloc

$ salloc -A <time_allocation> -t 01:00:00 -N 1

If the cluster is busy, you may need to wait a while before an interactive node is allocated to you. After an interactive node has been allocated, you can run parallel programs using srun (on Beskow) or mpirun (on Tegner). Note that srun and mpirun are mandatory to make sure that the job runs on the compute node. If you execute your program without srun or mpirun, it will run on the login node, which will cause performance issues or may even kill the node.

You can verify this with the following example on Tegner. First, ask for an interactive node and print the SLURM node list via the environment variable “$SLURM_NODELIST”. Then check the output of hostname without and with mpirun. You can see that the plain hostname command runs on the login node, while “mpirun -n 1 hostname” runs on the actual compute node. Similarly, on an interactive node of Beskow, parallel programs need to be invoked by the srun command.

tegner-login-1$ salloc -A <time_allocation> -t 01:00:00 -N 1
salloc: Granted job allocation 733877
bash-4.2$ echo $SLURM_NODELIST
t03n03
bash-4.2$ hostname
tegner-login-1.pdc.kth.se
bash-4.2$ module load intelmpi/17.0.1
bash-4.2$ mpirun -n 1 hostname
t03n03.pdc.kth.se
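
A corresponding minimal check on Beskow would use srun instead (a sketch; the prompt and the allocation parameters are illustrative)

beskow-login2$ salloc -A <time_allocation> -t 01:00:00 -N 1
beskow-login2$ srun -n 1 hostname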

We have shown in the previous section that you can log in to the compute node of Tegner provided that you have an active job running on it. If you have an interactive node on Tegner, it is also possible to log in to the compute node from your local computer via e.g. ssh <username>@t03n03.pdc.kth.se.

The interactive allocation will be revoked if the run time exceeds the requested time limit. You may also terminate an interactive session manually with exit or Ctrl-D.

Environment variables

Sometimes it is useful to extract certain information from SLURM. This can be done via SLURM environment variables. For instance, the following script uses the SLURM environment variables to copy files from the working directory to a unique scratch directory. Note that $USER and ${USER:0:1} stand for your username and the first letter of it, respectively.

workdir=$SLURM_SUBMIT_DIR
scratch=/cfs/klemming/scratch/${USER:0:1}/$USER/$SLURM_JOB_ID
mkdir -p $scratch

echo working dir: $workdir
echo scratch dir: $scratch

# copy data to scratch folder
cp -r $workdir/* $scratch
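
If your program writes its results in the scratch directory, you can continue the same script by running the calculation there and copying the results back at the end (a minimal sketch; the launcher, executable and output file name are placeholders)

# run the calculation in the scratch folder
cd $scratch
<my_parallel_launcher> -n 128 ./myexe > my_output_file

# copy the results back to the directory the job was submitted from
cp -r $scratch/* $workdir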

See this page for a complete list of SLURM environment variables.

Moreover, Beskow and Tegner are SNIC resources and therefore support SNIC environment variables that can be convenient to use for portability of batch scripts across SNIC clusters. These environment variables are available after loading the snic-env module. Below are the SNIC environment variables on Beskow and Tegner

user@beskow-login2:~> salloc -A <time_allocation> -t 00:10:00 -N 1
salloc: Granted job allocation 3432703
user@beskow-login2:~> module load snic-env
user@beskow-login2:~> printenv | grep SNIC
SNIC_RESOURCE=beskow
SNIC_BACKUP=/afs/pdc.kth.se/home/u/user
SNIC_NOBACKUP=/cfs/klemming/nobackup/u/user
SNIC_TMP=/cfs/klemming/scratch/u/user
SNIC_SITE=pdc
tegner-login-1$ salloc -A <time_allocation> -t 00:10:00 -N 1
salloc: Granted job allocation 733885
bash-4.2$ module load snic-env
bash-4.2$ printenv|grep SNIC
SNIC_RESOURCE=tegner
SNIC_BACKUP=/afs/pdc.kth.se/home/u/user
SNIC_NOBACKUP=/cfs/klemming/nobackup/u/user
SNIC_TMP=/cfs/klemming/scratch/u/user
SNIC_SITE=pdc

SLURM job array

In scientific computing it is not uncommon that a user needs to submit a large number of similar jobs. One possible solution is to write a program or script that goes through each job folder, creates a job script, and submits the job. Unfortunately, such batch-processing programs and scripts are not only error-prone but also difficult to troubleshoot. A better way of handling many jobs is to use a SLURM job array, which allows you to submit a series of jobs via a single submission script.

Suppose you have 10 jobs residing in folders data-0, data-1, …, data-9. You may use the following script to submit all the jobs in one shot (note the -a option)

#!/bin/bash -l

#SBATCH -J array
#SBATCH -A <time_allocation>
#SBATCH -N 1
#SBATCH -t 01:00:00
#SBATCH -a 0-9

# Fetch one directory from the array based on the task ID
# Note: index starts from 0
CURRENT_DIR=data-${SLURM_ARRAY_TASK_ID}
echo "Running simulation $CURRENT_DIR"

# Go to job folder
cd $CURRENT_DIR
echo "Simulation in $CURRENT_DIR" > result

# Run job
<my_parallel_launcher> -n 32 ./myexe > my_output_file

See this page for detailed documentation of SLURM job arrays.

SLURM job arrays come in handy in many scenarios. You may want to cross-validate the results of your statistical analysis, carry out simulations from many different initial conditions, or benchmark your code/method on a variety of systems. In any case, SLURM job arrays can help you handle large numbers of similar jobs. The only restriction is that all the jobs have to use the same number of compute nodes and the same wall-clock time limit, which in practice is usually not a problem.
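
If the array is large, you can also limit how many array tasks are allowed to run at the same time by adding a percent sign and a number to the array specification. For example, the following line submits the same ten tasks but lets at most two of them run simultaneously

#SBATCH -a 0-9%2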

Resource reservation

With SLURM it is possible to reserve resources (e.g. some compute nodes) for jobs to be executed in a specific project. A reservation gives the project exclusive rights to a set of nodes during a specific time period. This is very useful for arranging lab exercises for courses like the PDC summer school, as it minimizes the waiting time for course participants. If you have been informed that a reservation is available during a course, it is important to use it; otherwise the reserved nodes will sit idle. Resource reservations can be created, updated, or removed by sysadmins, and viewed by normal users with scontrol

user@beskow-login2:> scontrol show reservation
ReservationName=summer-2018-08-15 StartTime=2018-08-15T13:45:00 
   EndTime=2018-08-15T17:15:00 Duration=03:30:00
   Nodes=nid00[004-027,029-067,072-108] NodeCnt=100 CoreCnt=3200 
   Features=(null) PartitionName=4h1-800 Flags= TRES=cpu=6400
   Users=(null) Accounts=edu18.summer Licenses=(null) 
   State=INACTIVE BurstBuffer=(null) Watts=n/a

To use the reserved resources, add the --reservation option to sbatch

#SBATCH -A edu18.summer
#SBATCH -t 01:30:00
#SBATCH --nodes=1
#SBATCH --reservation=summer-2018-08-15

or salloc

$ salloc -A edu18.summer -t 01:30:00 -N 1 --reservation=summer-2018-08-15

Further information

You can read more about SLURM on this page.

Summary

We hope that you have learned something about SLURM and its usage on HPC clusters after reading this post. Some key points are summarized below.

  • sbatch submits jobs and accepts options preceded by #SBATCH
  • srun and mpirun are launchers for parallel programs
  • There are different types of nodes in terms of CPU and memory
  • GPUs are available on Tegner
  • You can monitor and manage your jobs with squeue, scontrol and scancel
  • You can request interactive nodes with salloc
  • SLURM and SNIC environment variables are sometimes handy
  • SLURM job arrays can be used to handle large numbers of similar jobs
  • SLURM can reserve resources for courses run at PDC