Getting all the Kerberos and SSH configurations right on your first attempt at using the PDC systems can be a daunting task. Nearly every day, PDC Support receives help requests and questions from researchers who have run into configuration problems. PDC has therefore introduced an alternative way of logging in to our clusters: using Docker containers with pre-configured Kerberos and SSH files.
What is Docker?
Docker is a tool used to deploy and run one or more applications within what are known as containers. A container is packed with everything needed to run an application: the application itself, any relevant tools and libraries, and other necessary information. Each Docker container is delivered as a single package with all the necessary material included. This means that the person using the container does not have to install or configure any new tools or libraries before using them. Since everything is preloaded, the applications inside the containers are ready to use and can be executed regardless of any customized settings on the host machine.
Later in this blog article, we will talk about a Docker container that we have developed for the purpose of logging in to PDC clusters.
Our supercomputer clusters at PDC, equipped with thousands of multi-core processors, can be used to solve large scientific/engineering problems. Because of their much higher performance compared to desktop computers or workstations, supercomputer clusters are also called high performance computing (HPC) clusters. Common application fields of HPC clusters include machine learning, galaxy simulation, climate modelling, bioinformatics, computational physics, quantum chemistry, etc.
Building an HPC cluster demands sophisticated technologies and hardware, but fortunately a regular HPC user doesn’t have to worry too much about that. As an HPC user you can submit jobs that request the compute nodes (physical groups of processors) to do the calculations/simulations you want. But note that you are not the only user of an HPC cluster. There are typically many users using the cluster at the same time, and all of them will be submitting their own jobs. You may have realized by now that there needs to be some sort of queueing system that organizes the jobs and distributes them to the compute nodes. This post will briefly introduce SLURM, which is used in all PDC clusters and is the most widely used workload manager for HPC clusters.
What is SLURM?
SLURM, or Simple Linux Utility for Resource Management, is an open-source cluster management and job scheduling system. It provides three key functions:
Allocation of resources (compute nodes) to users’ jobs
Framework for starting/executing/monitoring jobs
Queue management to avoid resource contention
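To make these three functions more concrete, here is a minimal toy sketch in Python. It is not how SLURM is implemented, and it does not use any SLURM API. The class `ToyScheduler`, its methods, and the job names are all hypothetical and exist only for illustration. The sketch assumes that each job asks for a fixed number of nodes, that a higher priority number means the job starts earlier, and that ties are broken by submission order.

```python
import heapq
from itertools import count


class ToyScheduler:
    """Toy model of a workload manager (illustration only, not SLURM)."""

    def __init__(self, total_nodes):
        self.free_nodes = total_nodes  # nodes currently not allocated
        self.queue = []                # heap of (-priority, order, name, nodes)
        self.running = {}              # job name -> number of allocated nodes
        self._order = count()          # tie-breaker: earlier submissions first

    def submit(self, name, nodes, priority=0):
        """Put a job in the queue; nothing is started yet."""
        heapq.heappush(self.queue, (-priority, next(self._order), name, nodes))

    def schedule(self):
        """Start queued jobs in priority order while the first job fits."""
        started = []
        while self.queue and self.queue[0][3] <= self.free_nodes:
            _, _, name, nodes = heapq.heappop(self.queue)
            self.free_nodes -= nodes   # allocate nodes to the job
            self.running[name] = nodes
            started.append(name)
        return started

    def finish(self, name):
        """Mark a running job as done and release its nodes."""
        self.free_nodes += self.running.pop(name)


sched = ToyScheduler(total_nodes=4)
sched.submit("climate", nodes=3, priority=1)
sched.submit("chem", nodes=2, priority=5)
sched.submit("bio", nodes=2, priority=5)

print(sched.schedule())  # ['chem', 'bio']: both fit, 'climate' must wait
sched.finish("chem")
print(sched.schedule())  # []: only 2 nodes are free, 'climate' needs 3
sched.finish("bio")
print(sched.schedule())  # ['climate']
```

Note that in this sketch a large job at the front of the queue blocks the smaller jobs behind it. Real workload managers use more elaborate policies to decide job priorities and placement, so treat this only as a mental model.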
In other words, SLURM oversees all the resources in the whole HPC cluster. Users send their jobs (requests to run calculations/simulations) to SLURM for later execution. SLURM keeps all the submitted jobs in a queue and decides what priorities the jobs have and how they are distributed to the compute nodes. SLURM also provides a series of useful commands for users, which we will now go through.