Creating and Submitting Jobs

How to Create and Submit a Job in Slurm

Suppose you have a script written in a programming language such as Python, MATLAB, C, or Java. How would you run it using Slurm?

This section walks through creating and submitting a simple job, step by step. Along the way, you will write an SBATCH script that runs a Python script.

  1. Log in to Discovery.

  2. Create a new folder called myproject in your home directory and switch into it.

    $ mkdir myproject && cd myproject

  3. Create two new files, script.sh and script.py, and paste in the contents shown below for each file.

    $ vi script.sh && chmod +x script.sh

    The second command, chmod +x script.sh (after the double ampersand), makes the file executable once you save it and exit the text editor.

    $ vi script.py

    script.sh:

    #!/bin/bash
    
    #SBATCH --job-name=maxFib      ## Name of the job
    #SBATCH --output=maxFib.out    ## Output file
    #SBATCH --time=10:00           ## Job Duration
    #SBATCH --ntasks=1             ## Number of tasks (analyses) to run
    #SBATCH --cpus-per-task=1      ## The number of threads the code will use
    #SBATCH --mem-per-cpu=100M     ## Real memory(MB) per CPU required by the job.
    
    ## Load the python interpreter
    module load python
    
    ## Execute the python script and pass the argument/input '90'
    srun python script.py 90

    Here, one CPU with 100 MB of memory and 10 minutes of wall time is requested for a single task. If --ntasks is set to 2, the Python program is executed twice.

    Note that the number of tasks requested from Slurm is the number of processes that srun starts. Once the job has been submitted and resources have been allocated, the batch script runs on the allocated node, and srun launches the processes. If your program is a parallel MPI program, srun takes care of creating all the MPI processes. If not, srun runs your program as many times as specified by the --ntasks option.

    script.py:

    import sys
    import os
    
    if len(sys.argv) != 2:
      print('Usage: %s MAXIMUM' % (os.path.basename(sys.argv[0])))
      sys.exit(1)
    
    maximum = int(sys.argv[1])
    
    n1 = 1
    n2 = 1
    
    while n2 <= maximum:
      n1, n2 = n2, n1 + n2
    
    print('The greatest Fibonacci number up to %d is %d' % (maximum, n1))

    The Python program accepts an integer as its argument and finds the greatest Fibonacci number that is less than or equal to that value.

  4. Now, submit the batch script with the following command.

    $ sbatch script.sh

    After the job has been submitted, you should see output similar to the line below, but with a different job ID.

    Submitted batch job 215578

    You can use the command below to check the progress of your submitted job in the queue.

    syntax: squeue -u <your username>

    $ squeue -u vaduaka

    Output

    JOBID   PARTITION  NAME    USER     ST  TIME  NODES  NODELIST(REASON)
    215578  normal     maxFib  vaduaka  R   0:01  1      discovery-c3

  5. Once your job has completed and is no longer in the queue, run the ls command to list the files in your working directory.

    $ ls
    maxFib.out  script.py  script.sh

    A new file called maxFib.out has been generated. If you view its contents with the cat command, you should see something similar to the output below.

    $ cat maxFib.out

    Output

    The greatest Fibonacci number up to 90 is 89
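
Step 3 explains that srun starts one process per requested task. As a rough illustration, the variant of script.py below moves the search into a function and prefixes its output with the task it is running as. The function names greatest_fib_up_to, task_label, and report are hypothetical. SLURM_PROCID and SLURM_NTASKS are environment variables that Slurm sets for launched tasks; the fallback values used here are an assumption so that the script also runs outside a job.

```python
import os
import sys


def greatest_fib_up_to(maximum):
    """Return the largest Fibonacci number <= maximum (assumes maximum >= 1)."""
    n1, n2 = 1, 1
    while n2 <= maximum:
        n1, n2 = n2, n1 + n2
    return n1


def task_label(environ):
    """Describe which task this process is, for example 'task 0 of 2'.

    The defaults (rank 0, one task) are an assumption for runs outside Slurm.
    """
    rank = int(environ.get("SLURM_PROCID", "0"))
    total = int(environ.get("SLURM_NTASKS", "1"))
    return "task %d of %d" % (rank, total)


def report(argv, environ):
    """Build the line to print, or a usage message if the argument is missing."""
    if len(argv) != 2:
        return "Usage: %s MAXIMUM" % os.path.basename(argv[0])
    maximum = int(argv[1])
    return "[%s] The greatest Fibonacci number up to %d is %d" % (
        task_label(environ), maximum, greatest_fib_up_to(maximum))


if __name__ == "__main__":
    print(report(sys.argv, os.environ))
```

If this were launched with srun under --ntasks=2, each of the two processes would be expected to print its own line with a different task prefix, which makes the repeated execution visible in the output file.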

Practice Examples

Example 1

Step 1 - Create a directory

Create a directory called slurm-test in your home directory. Change your working directory to slurm-test afterward.

cd ~
mkdir slurm-test
cd slurm-test
pwd

Output

/home/<your-username>/slurm-test

The cd command changes your working directory to the new slurm-test directory, which you can confirm with the pwd command.

Step 2 - Create Job Script

Create the job script file test.sh using any text editor. The test.sh file is a Bash shell script that serves as the initial executable for the job. The SBATCH directives at the top of the script inform the scheduler of the job’s requirements.

Create the test.sh file.

vi test.sh

Next, copy the following script and paste it into the file.

#!/bin/bash

#SBATCH --job-name test
#SBATCH --time 05:00
#SBATCH --nodes 1
#SBATCH --output test.out
#SBATCH --mail-user <your-nmsu-username>@nmsu.edu
#SBATCH --mail-type BEGIN,END,FAIL

echo "The job has begun."
echo "Wait one minute..."
sleep 60
echo "Wait a second minute..."
sleep 60
echo "Wait a third minute..."
sleep 60
echo "Enough waiting: job completed."

Save and quit the file.

This script describes a job named "test" that will run for no longer than five minutes. The job consists of a single task running on a single node, with the output directed to a test.out file. You can use the cat command to confirm the content of the new test.sh script.
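
The --time value 05:00 in the script uses the minutes:seconds form. Slurm also accepts minutes, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds. The sketch below converts such a string to seconds, which can help when checking a limit before submitting. The function name slurm_time_to_seconds is hypothetical, and the code is an illustration rather than Slurm's own parser.

```python
def slurm_time_to_seconds(text):
    """Convert a Slurm time limit string to a number of seconds.

    Handles M, M:S, H:M:S, D-H, D-H:M, and D-H:M:S.
    """
    days = 0
    if "-" in text:
        # A leading "D-" means the remaining fields start with hours.
        day_part, rest = text.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError("too many fields in %r" % text)
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in text.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:
            hours, minutes, seconds = fields
        else:
            raise ValueError("too many fields in %r" % text)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(slurm_time_to_seconds("05:00"))  # the limit used in test.sh
```

For the five-minute limit in test.sh this gives 300 seconds. Note that a bare number such as 90 means 90 minutes, not 90 seconds.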

Step 3 - Make the Job Script Executable

Use the chmod command to make the file executable.

chmod +x test.sh

You can use the ls -l command to see the change in file permissions. In many terminals, executable files are also shown in green.

ls -l

Output

total 1
-rwxrwxr-x 1 mushfiq mushfiq 394 Mar 20 21:49 test.sh
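
To make the effect of chmod +x on the permission string concrete, here is a minimal Python sketch that adds the execute bits to a temporary file and prints the mode before and after. It assumes a Linux or other POSIX system, and a starting mode of 664 chosen for illustration. The function name make_executable is hypothetical, and the sketch ignores the umask, which the real chmod +x consults.

```python
import os
import stat
import tempfile


def make_executable(path):
    """Add execute permission for user, group, and others; return the new mode string."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return stat.filemode(os.stat(path).st_mode)


# Demonstrate on a throwaway file so nothing in the working directory changes.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "test.sh")
    with open(script, "w") as handle:
        handle.write("#!/bin/bash\necho hello\n")
    os.chmod(script, 0o664)  # assumed starting permissions: rw-rw-r--
    before = stat.filemode(os.stat(script).st_mode)
    after = make_executable(script)
    print(before, "->", after)
```

The three x characters in the second string correspond to the execute permission for the owner, the group, and everyone else.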

Step 4 - Submit the Job

Use the sbatch command to submit the script to Slurm. When Slurm accepts a new job, it responds with the job id (a number) that can be used to identify the specific job in the queue.

sbatch test.sh

Output

Submitted batch job 10761
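
If you submit jobs from a script, the job ID in this message is usually what you need for later steps. The sketch below extracts it from the message text with a regular expression. The function name parse_job_id is hypothetical, and the message format is assumed to match the output shown above.

```python
import re


def parse_job_id(sbatch_output):
    """Return the numeric job ID from an sbatch confirmation message."""
    match = re.search(r"submitted batch job (\d+)", sbatch_output, re.IGNORECASE)
    if match is None:
        raise ValueError("no job ID found in: %r" % sbatch_output)
    return int(match.group(1))


print(parse_job_id("Submitted batch job 10761"))
```

On a system with Slurm installed, the message could be captured with subprocess.run(["sbatch", "test.sh"], capture_output=True, text=True) and its stdout passed to this function.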

Step 5 - Monitor the Job

Use the squeue command to check the status of pending or running jobs.

After submission, the job enters the queue in the PENDING (PD) state. Once resources become available and the job has the highest priority, an allocation is created for it and it moves to the RUNNING (R) state. If the job completes successfully, it moves to the COMPLETED state. Otherwise, it is set to the FAILED state.

squeue -u mushfiq

Output

JOBID   PARTITION   NAME     USER  ST    TIME  NODES NODELIST(REASON)
10761      normal   test  mushfiq   R    0:15      1 discovery-c12
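
The ST column holds the short code for the job state described above. As a small illustration, the sketch below parses squeue output in this default layout and expands the code into the full state name. The names STATE_NAMES and parse_squeue are hypothetical, only a few common state codes are listed, and the parsing assumes that job names contain no spaces.

```python
# A few common squeue state codes and their full names.
STATE_NAMES = {
    "PD": "PENDING",
    "R": "RUNNING",
    "CG": "COMPLETING",
    "CD": "COMPLETED",
    "F": "FAILED",
}


def parse_squeue(text):
    """Turn squeue's default table into a list of dicts keyed by column header."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Limit the split so the last column stays in one piece.
        values = line.split(None, len(header) - 1)
        job = dict(zip(header, values))
        job["STATE"] = STATE_NAMES.get(job["ST"], job["ST"])
        jobs.append(job)
    return jobs


SAMPLE = """\
JOBID   PARTITION   NAME     USER  ST    TIME  NODES NODELIST(REASON)
10761      normal   test  mushfiq   R    0:15      1 discovery-c12
"""

for job in parse_squeue(SAMPLE):
    print(job["JOBID"], job["NAME"], job["STATE"], job["NODELIST(REASON)"])
```

Unknown codes are passed through unchanged, so a state missing from the table shows up as its short code rather than causing an error.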

After the job has started, the output it generates will be directed to the test.out file referenced in the job script. You can watch the output as it’s written using the tail command.

tail -F test.out

Output

The job has begun.
Wait one minute...
Wait a second minute...
Wait a third minute...

Use Ctrl + C to return to the terminal prompt. Your job leaves the queue once it completes.

Your slurm-test directory should now contain both the job script and the output file.

ls
test.out test.sh

To view the complete output, run the following command.

cat test.out

Output

The job has begun.
Wait one minute...
Wait a second minute...
Wait a third minute...
Enough waiting: job completed.

Example 2

Task: Suppose you have the following requirements.

Your resource request:

  1. Job name: test

  2. Output filename: res.txt

  3. Request 1 CPU for 10 minutes

  4. Use 100 MB of RAM

Your job steps:

  1. srun hostname

  2. srun sleep 60

Steps:

  1. Prepare the job directory (create the file test2.sh in the existing slurm-test directory).

  2. Create the script and add the required resource request and job steps.

  3. Make the file executable.

  4. Submit the job.

  5. Monitor its progress.

  6. Check the output file.

Remember: the resource request section lists the resources the job needs to run, while the job steps describe the tasks to carry out and the software to run.

Step 1 - Prepare the Job Directory (Create a File Called test2.sh in the slurm-test Folder)

touch test2.sh
ls
test2.sh test.out test.sh

Step 2 - Write the Job Script

Copy the following script into test2.sh:

#!/bin/bash

#SBATCH --job-name test
#SBATCH --output res.txt
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --time 10:00
#SBATCH --mem-per-cpu 100M

srun hostname
srun sleep 60

This script requests one CPU for 10 minutes, along with 100 MB of RAM, in the default queue (partition). When the job starts, it runs the first job step, srun hostname, which launches the UNIX command hostname on the node where the requested CPU was allocated. A second job step then runs the sleep command. Note that the --job-name option gives the job a meaningful name, and the --output option sets the file to which the job output is written.
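
The split between the resource request and the job steps can also be expressed in code. The following sketch builds the text of test2.sh from a dictionary of directives and a list of steps. The function name render_job_script is hypothetical, and the sketch only formats text; it does not check that the options are valid for Slurm.

```python
def render_job_script(directives, steps):
    """Return a batch script: shebang, one #SBATCH line per directive, then the steps."""
    lines = ["#!/bin/bash", ""]
    for option, value in directives.items():
        lines.append("#SBATCH --%s %s" % (option, value))
    lines.append("")
    lines.extend(steps)
    return "\n".join(lines) + "\n"


# Resource request from Example 2.
directives = {
    "job-name": "test",
    "output": "res.txt",
    "ntasks": 1,
    "cpus-per-task": 1,
    "time": "10:00",
    "mem-per-cpu": "100M",
}

# Job steps from Example 2.
steps = ["srun hostname", "srun sleep 60"]

print(render_job_script(directives, steps), end="")
```

Keeping the two parts in separate structures mirrors how Slurm reads the file: the #SBATCH lines are consumed by the scheduler before the job starts, and the remaining lines run as an ordinary shell script once resources are allocated.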

Step 3 - Make the File Executable

chmod +x test2.sh

Step 4 - Submit the Job

Use the sbatch command, which, on success, responds with the ID assigned to the job.

sbatch test2.sh

Output

Submitted batch job 10763

Step 5 - Monitor the Job

You can check the status of the job with the squeue command.

squeue -u mushfiq

Output

JOBID   PARTITION   NAME     USER  ST    TIME  NODES NODELIST(REASON)
10763      normal   test  mushfiq   R    0:05      1 discovery-c12

Step 6 - View the Output

Upon completion, the output file contains the result of the commands run in the script file. To view the output, run the following command.

cat res.txt

Output

discovery-c12.cluster.local