KTH Logo

Using Jupyter Notebooks to manage SLURM jobs

Jupyter Notebooks are gaining in popularity across many academic and industrial research fields . The in-browser, cell-based user interface of the Notebook application enables researchers to interleave code (in many different programming languages) with rich text, graphics, equations, and so forth. Typical use cases involve quick prototyping of code, interactive data analysis and visualisation, keeping digital notebooks for daily tasks, and as a teaching tool. Notebooks are also being used for reproducible workflows and to share scientific analysis with colleagues or whole research communities. High Performance Computing (HPC) is rapidly catching up with this trend and many HPC providers, including PDC, now offer Jupyter Notebooks as a way to interact with their HPC resources.

In October last year, PDC organised a workshop on “HPC Tools for the Modern Era“. One of the workshop modules focused on possible use cases of Jupyter Notebooks in an HPC environment. The notebooks that were used during the workshop are available from the PDC Support GitHub repository (https://github.com/PDC-support/jupyter-notebook), and the rest of this post discusses an example use case based on one of those workshop notebooks.

Jupyter Notebooks are suitable for various HPC usage patterns and workflows. In this blog post we will demonstrate one possible use case: Interacting with the SLURM scheduler from a notebook running in your browser, submitting and monitoring batch jobs and performing light-weight interactive analysis of running jobs. Note that this usage of Jupyter is possible on both Tegner and Beskow. In a future blog post, we will demonstrate how to run interactive analysis directly on a Tegner compute node to perform heavy analysis on large datasets, which might, for instance, be generated from other jobs running at Tegner or Beskow (since both clusters share the klemming file system). This use case will however not be possible on Beskow since the Beskow compute nodes have restricted network access to the outside world.

The following overview will assume that you have some familiarity with Jupyter Notebooks, but if you’ve never tried them out, there are plenty of online resources to get you started. Apart from using the notebooks from the PDC/PRACE workshop (which were mentioned earlier), you can, for example:

(more…)

Getting Started with SLURM


Note: This post has been updated to reflect the changes in the queueing system after the software upgrade of Beskow in June, 2019.


Our supercomputer clusters at PDC, equipped with thousands of multi-core processors, can be used to solve large scientific/engineering problems. Because of their much higher performance compared to desktop computers or workstations, supercomputer clusters are also called high performance computing (HPC) clusters. Common application fields of HPC clusters include machine learning, galaxy simulation, climate modelling,  bioinformatics, computational physics, quantum chemistry, etc.

Building an HPC cluster demands sophisticated technologies and hardware, but fortunately a regular HPC user doesn’t have to worry too much about that. As an HPC user you can submit jobs that request the compute nodes (physical groups of processors) to do the calculations/simulations you want. But note that you are not the only user of an HPC cluster, there are typically many users using the cluster for the same time period and all of them will be submitting their own jobs. You may have realized by now that there needs to be some soft of queueing system that organizes the jobs and distributes them to the compute nodes. This post will briefly introduce SLURM, which is used in all PDC clusters and is the most widely used workload manager for HPC clusters.

What is SLURM?

SLURM, or Simple Linux Utility for Resource Management, is an open-source cluster management and job scheduling system. It provides three key functions

  • Allocation of resources (compute nodes) to users’ jobs
  • Framework for starting/executing/monitoring jobs
  • Queue management to avoid resource contention

In other words, SLURM oversees all the resources in the whole HPC cluster. Users then send their jobs (requests to run calculations/simulations) to SLURM for later execution. SLURM will keep all the submitted jobs in a queue and decide what priorities the jobs have and how the jobs are distributed to the compute nodes. SLURM provides a series of useful commands for the user which we will now go through.

(more…)