Slurm

Slurm is the open-source workload manager and job scheduler used on the Discovery cluster. It sits between the login nodes and the compute nodes: users on the login nodes submit jobs to Slurm, and Slurm dispatches those jobs to the compute nodes for processing.

Slurm has three key functions. First, it gives users exclusive or non-exclusive access to resources on the compute nodes for a set amount of time so that they can run their computations. Second, it provides a framework for starting, executing, and monitoring work on the set of allocated compute nodes. Third, it manages the queue of pending jobs according to the availability of resources.

Explore different topics under the Slurm folder for more information.

Slurm Scheduler

Most modern clusters rely on a scheduling system. Its purpose is to spare you from having to know what each individual computer is doing: the scheduler monitors the system and aggregates that information for you.

A scheduler keeps an accurate, up-to-date picture of which resources are available and where. Beyond tracking resources, it lets you submit instructions for running your program and then runs the program on your behalf once the necessary resources are available.
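The idea of tracking resources and running queued work on a user's behalf can be sketched in a few lines of Python. This is a toy model only, not how Slurm is implemented: the class `ToyScheduler`, its methods, and the node names are hypothetical, each job is assumed to fit on a single node, and jobs are started in strict first-come, first-served order. Real Slurm also weighs priorities, time limits, partitions, and more.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical toy scheduler: tracks free cores per node and
    starts queued jobs once a node has enough free cores."""

    def __init__(self, cores_per_node):
        self.free = dict(cores_per_node)  # node name -> free cores
        self.pending = deque()            # jobs waiting, oldest first
        self.running = {}                 # job name -> (node, cores)

    def submit(self, name, cores):
        """Queue a job, then start whatever can run right now."""
        self.pending.append((name, cores))
        self._dispatch()

    def finish(self, name):
        """Release a finished job's cores and re-check the queue."""
        node, cores = self.running.pop(name)
        self.free[node] += cores
        self._dispatch()

    def _dispatch(self):
        # Strict FIFO (an assumption of this sketch): stop at the
        # first pending job that does not fit on any node.
        while self.pending:
            name, cores = self.pending[0]
            node = next((n for n, c in self.free.items() if c >= cores), None)
            if node is None:
                return
            self.pending.popleft()
            self.free[node] -= cores
            self.running[name] = (node, cores)


sched = ToyScheduler({"node1": 4, "node2": 4})
sched.submit("a", 4)  # starts on node1
sched.submit("b", 3)  # starts on node2
sched.submit("c", 2)  # no node has 2 free cores, so it waits
waiting = [name for name, _ in sched.pending]
sched.finish("a")     # node1 is freed, so "c" starts there
print(waiting, sched.running)
```

The user only calls `submit`; deciding where and when the job runs is left entirely to the scheduler, which is the point of the paragraph above.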

Slurm is open-source software backed by a large community, commercially supported by its original developers, and installed on many of the Top 500 supercomputers.

As an analogy, think of Slurm as a restaurant hostess and the high-performance computing cluster as the restaurant. Just as the hostess keeps a map of the available tables and chairs, Slurm keeps track of the available nodes and cores on Discovery. If you want to sit down in the restaurant, the hostess puts you on a list, and you are seated when a table with enough chairs becomes available.

Similarly, Slurm lets you use resources (nodes and cores) as soon as they’re available. For example, if you request 3 nodes to run a job, you will have to wait even if 2 nodes are free; Slurm grants your request only once all 3 nodes are available. In the restaurant setting, a party of 3 still waits even though a table for 2 is free. Because larger requests can mean longer waits, request only the resources your job actually needs.
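The all-or-nothing behavior in the 3-node example can be illustrated with a short Python sketch. The function `try_allocate` and the node names are hypothetical and exist only for this illustration; the sketch assumes a request is described by a node count alone, which is far simpler than a real Slurm request.

```python
def try_allocate(free_nodes, requested):
    """Grant `requested` nodes only if that many are free.

    Returns the list of granted nodes, or None if the job must wait.
    Nothing is handed out partially.
    """
    if requested > len(free_nodes):
        return None
    granted = free_nodes[:requested]
    del free_nodes[:requested]  # granted nodes are no longer free
    return granted


free = ["node1", "node2"]            # only 2 nodes are free
first_try = try_allocate(free, 3)    # None: the job waits
still_free = list(free)              # both nodes remain untouched

free.append("node3")                 # a third node becomes free
second_try = try_allocate(free, 3)   # all 3 nodes are granted at once

print(first_try, still_free, second_try)
# None ['node1', 'node2'] ['node1', 'node2', 'node3']
```

Had the job asked for 2 nodes instead, the first call would have succeeded immediately, which is why the size of a request affects how long it waits.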