Creating and Submitting Jobs
How to Create and Submit a Job in Slurm
Suppose you have a script written in a programming language such as Python, MATLAB, C, or Java. How would you run it using Slurm?
The section below walks step by step through creating and submitting a simple job. It also creates an SBATCH script and uses it to run a Python script.
- Log in to Discovery.
- Create a new folder called myproject in your home directory and switch into it.
$ mkdir myproject && cd myproject
- Create two new files called script.sh and script.py, then copy and paste the script.sh and script.py code shown below into them, respectively.
$ vi script.sh && chmod +x script.sh
The second command above, chmod +x script.sh (after the double ampersand), makes the file executable once you have saved it and exited the text editor.
$ vi script.py
#!/bin/bash
#SBATCH --job-name=maxFib     ## Name of the job
#SBATCH --output=maxFib.out   ## Output file
#SBATCH --time=10:00          ## Job duration
#SBATCH --ntasks=1            ## Number of tasks (analyses) to run
#SBATCH --cpus-per-task=1     ## The number of threads the code will use
#SBATCH --mem-per-cpu=100M    ## Real memory (MB) per CPU required by the job

## Load the python interpreter
module load python

## Execute the python script and pass the argument/input '90'
srun python script.py 90
Here, script.sh requests 1 CPU with 100 MB of memory per CPU and 10 minutes of walltime for the task (job step). If --ntasks is set to two, the Python program will be executed twice.
Note that the number of tasks requested from Slurm is the number of processes that srun will start. Once your script has been submitted and resources have been allocated, srun runs the given command on the allocated node; it is the command that actually launches the processes. If your program is a parallel MPI program, srun takes care of creating all the MPI processes. If not, srun runs your program as many times as specified by the --ntasks option.
import sys
import os

if len(sys.argv) != 2:
    print('Usage: %s MAXIMUM' % (os.path.basename(sys.argv[0])))
    sys.exit(1)

maximum = int(sys.argv[1])

n1 = 1
n2 = 1
while n2 <= maximum:
    n1, n2 = n2, n1 + n2

print('The greatest Fibonacci number up to %d is %d' % (maximum, n1))
The Python program (script.py) accepts an integer as an argument and finds the greatest Fibonacci number that does not exceed the value you provide.
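To check the loop logic without submitting a job, the same calculation can be wrapped in a function and tried on a few inputs. This is only a sketch: the function name greatest_fib_up_to is illustrative and not part of the tutorial, and it assumes the argument is at least 1.

```python
def greatest_fib_up_to(maximum):
    """Return the largest Fibonacci number that does not exceed maximum.

    Assumes maximum >= 1; this mirrors the loop used in script.py.
    """
    n1, n2 = 1, 1
    while n2 <= maximum:
        # Advance one step along the sequence: (1, 1) -> (1, 2) -> (2, 3) ...
        n1, n2 = n2, n1 + n2
    return n1


for limit in (1, 10, 90, 144):
    print(limit, greatest_fib_up_to(limit))
```

For the value 90 used in script.sh, the function returns 89, which matches the output the job is expected to write.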
- Now, submit the batch script with the following command.
$ sbatch script.sh
After the job has been submitted, you should get output similar to the line below, but with a different job ID.
Submitted batch job 215578
You can use the command below to check the progress of your submitted job in the queue.
Syntax: squeue -u <your username>
$ squeue -u vaduaka
Output
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
215578 normal maxFib vaduaka R 0:01 1 discovery-c3
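If you want to inspect the queue listing from a program rather than by eye, the default squeue table can be split into one record per job. The sketch below is a rough illustration: the helper name parse_squeue is hypothetical, the sample text is copied from the output above, and it assumes that job names and user names contain no spaces.

```python
def parse_squeue(text):
    """Turn default squeue output into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Limiting the split keeps a multi-word NODELIST(REASON) in one field.
        fields = line.strip().split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


sample = """\
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
215578 normal maxFib vaduaka R 0:01 1 discovery-c3
"""
for job in parse_squeue(sample):
    print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
```

Each job becomes a dictionary, so the state is available as job["ST"] and the node as job["NODELIST(REASON)"].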
- Once your job has completed and is no longer in the queue, you can run the ls command to list the files in your working directory.
$ ls
maxFib.out script.py script.sh
A new file called maxFib.out has been generated. If you view its contents with the cat command, you should see something similar to the output below.
$ cat maxFib.out
Output
The greatest Fibonacci number up to 90 is 89
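Step 3 noted that srun starts one copy of the program for each requested task. When --ntasks is greater than one, each copy usually needs to know which task it is. The sketch below assumes the SLURM_PROCID and SLURM_NTASKS environment variables that Slurm sets for processes launched by srun; the function name task_identity and the fallback values are illustrative choices, not part of the tutorial.

```python
import os


def task_identity(environ=None):
    """Return (rank, total) for the current Slurm task.

    Falls back to (0, 1) when the variables are absent, for example
    when the script is run directly rather than through srun.
    """
    env = os.environ if environ is None else environ
    rank = int(env.get("SLURM_PROCID", "0"))
    total = int(env.get("SLURM_NTASKS", "1"))
    return rank, total


rank, total = task_identity()
print("I am task %d of %d" % (rank, total))
```

A program could use the rank to pick a different input for each copy, so that two tasks do not simply repeat the same work.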
Practice Examples
Example 1
Step 1 - Create a directory
Create a directory called slurm-test in your home directory. Change your working directory to slurm-test afterward.
cd ~
mkdir slurm-test
cd slurm-test
pwd
Output
/home/<your-username>/slurm-test
The cd command changes your working directory to the new slurm-test directory.
Step 2 - Create Job Script
Create the job script file test.sh using any text editor. The test.sh file is a Bash shell script that serves as the initial executable for the job. The SBATCH directives at the top of the script inform the scheduler of the job’s requirements.
Create the test.sh file.
vi test.sh
Next, copy the following script and paste it into the file.
#!/bin/bash
#SBATCH --job-name test
#SBATCH --time 05:00
#SBATCH --nodes 1
#SBATCH --output test.out
#SBATCH --mail-user <your-nmsu-username>@nmsu.edu
#SBATCH --mail-type BEGIN,END,FAIL
echo "The job has begun."
echo "Wait one minute..."
sleep 60
echo "Wait a second minute..."
sleep 60
echo "Wait a third minute..."
sleep 60
echo "Enough waiting: job completed."
Save and quit the file.
This script describes a job named "test" that will run for no longer than five minutes. The job consists of a single task running on a single node, with the output directed to a test.out file. You can use the cat command to confirm the content of the new test.sh script.
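To double-check what a job script asks the scheduler for, the #SBATCH lines can be collected into a dictionary. This is a minimal sketch rather than a reimplementation of Slurm's own parser: the function name parse_sbatch_directives is hypothetical, it only understands long options written as "--name value" or "--name=value", and dropping anything after "##" is a choice made for this example.

```python
import re


def parse_sbatch_directives(script_text):
    """Collect long #SBATCH options from a job script into a dict."""
    options = {}
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        body = line[len("#SBATCH"):].strip()
        # Drop a trailing "## comment", as a convention of this sketch.
        body = body.split("##", 1)[0].strip()
        match = re.match(r"--([A-Za-z0-9-]+)(?:[=\s]+(.*))?$", body)
        if match:
            options[match.group(1)] = match.group(2) or ""
    return options


job_script = """\
#!/bin/bash
#SBATCH --job-name test
#SBATCH --time 05:00
#SBATCH --nodes 1
#SBATCH --output test.out
#SBATCH --mail-type BEGIN,END,FAIL
echo "The job has begun."
"""
for name, value in parse_sbatch_directives(job_script).items():
    print(name, "=", value)
```

Running it on the directives from test.sh lists the job name, time limit, node count, output file, and mail settings in the order they appear.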
Step 3 - Make the Job Script Executable
Use the chmod command to make the file executable.
chmod +x test.sh
You can use the ls -l command to see the change in file permissions. Also note that many terminals display executable files in green.
ls -l
Output
total 1
-rwxrwxr-x 1 mushfiq mushfiq 394 Mar 20 21:49 test.sh
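The same permission change can be made and inspected from Python using the standard os and stat modules. The sketch below works on a throwaway file in a temporary directory; the helper names make_executable and is_user_executable are illustrative, and the code assumes a POSIX file system. Unlike chmod +x, it sets the execute bits regardless of the umask.

```python
import os
import stat
import tempfile


def is_user_executable(path):
    """Return True when the owner has execute permission on path."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


def make_executable(path):
    """Add execute permission for user, group, and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


with tempfile.TemporaryDirectory() as workdir:
    script = os.path.join(workdir, "test.sh")
    with open(script, "w") as handle:
        handle.write("#!/bin/bash\necho hello\n")
    # stat.filemode renders the mode in the same style as ls -l.
    print("before:", stat.filemode(os.stat(script).st_mode))
    make_executable(script)
    print("after: ", stat.filemode(os.stat(script).st_mode))
```

The exact permission strings printed depend on your umask, but the second line should show the x bits that the first one lacks.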
Step 4 - Submit the Job
Use the sbatch command to submit the script to Slurm. When Slurm accepts a new job, it responds with the job ID (a number) that can be used to identify the specific job in the queue.
sbatch test.sh
Output
Submitted batch job 10761
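If a wrapper program needs the job ID, for example to monitor that job later, the number can be pulled out of the line that sbatch prints. This is a small sketch: the function name parse_job_id is hypothetical, and the match is case-insensitive so that it accepts either capitalization of the message.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation line."""
    match = re.search(r"submitted batch job (\d+)", sbatch_output, re.IGNORECASE)
    if match is None:
        raise ValueError("no job id found in: %r" % sbatch_output)
    return int(match.group(1))


print(parse_job_id("Submitted batch job 10761"))
```

The function raises ValueError when the text does not look like a successful submission, which makes a failed sbatch call easier to notice.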
Step 5 - Monitor the Job
Use the squeue command to check the status of pending or running jobs.
A submitted job first enters the queue in the PENDING (PD) state. Once resources become available and the job has the highest priority, an allocation is created for it and it moves to the RUNNING (R) state. If the job completes correctly, it goes to the COMPLETED state. Otherwise, it is set to the FAILED state.
squeue -u mushfiq
Output
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10761 normal test mushfiq R 0:15 1 discovery-c12
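The ST column uses short codes for the job states described above. The sketch below maps a few of the codes that squeue reports to their full names; the table covers only the states discussed in this tutorial plus COMPLETING, and the helper names describe_state and is_finished are illustrative.

```python
# A partial table: squeue defines more state codes than are listed here.
STATE_NAMES = {
    "PD": "PENDING",
    "R": "RUNNING",
    "CG": "COMPLETING",
    "CD": "COMPLETED",
    "F": "FAILED",
}


def describe_state(code):
    """Return the full state name for a squeue ST code."""
    return STATE_NAMES.get(code, "UNKNOWN (%s)" % code)


def is_finished(code):
    """Return True for the two end states covered in this tutorial."""
    return code in ("CD", "F")


for code in ("PD", "R", "CD"):
    print(code, describe_state(code), is_finished(code))
```

In the sample output above, the code R means that job 10761 is currently running.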
After the job has started, the output it generates will be directed to the test.out file referenced in the job script. You can watch the output as it’s written using the tail command.
tail -F test.out
Output
The job has begun.
Wait one minute...
Wait a second minute...
Wait a third minute...
Press Ctrl + C to stop tail and return to the terminal prompt. Your job will leave the queue once it has completed.
Your slurm-test directory should now contain both the Bash script and the output file.
ls
test.out test.sh
To view the output, run the command below.
tail -F test.out
Output
The job has begun.
Wait one minute...
Wait a second minute...
Wait a third minute...
Enough waiting: job completed.
Example 2
Task: Suppose you have the following requirements.
Your resource request:
- Job name: test
- Output filename: res.txt
- Request 1 CPU for 10 minutes
- Use 100 MB of RAM
Your job steps:
- srun hostname
- srun sleep 60
Steps:
- Prepare the job directory (create the file test2.sh in the existing slurm-test directory).
- Create the script and add the required resource request and job steps.
- Make the file executable.
- Submit the job.
- Monitor its progress.
- Check the output file.
Remember: the resource request lists the resources the job needs in order to run, while the job steps describe the tasks to carry out and the software to run.
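The split between a resource request and job steps can be made concrete by generating the script text from the two parts. The sketch below is one possible way to do it: the function name build_job_script is hypothetical, and it writes each option in the "--name=value" form, which sbatch accepts alongside the space-separated form.

```python
def build_job_script(resources, steps):
    """Assemble job script text from a resource request and job steps."""
    lines = ["#!/bin/bash"]
    # Resource request: one #SBATCH directive per option.
    for name, value in resources.items():
        lines.append("#SBATCH --%s=%s" % (name, value))
    lines.append("")
    # Job steps: the commands to run once resources are allocated.
    lines.extend(steps)
    return "\n".join(lines) + "\n"


resources = {
    "job-name": "test",
    "output": "res.txt",
    "ntasks": 1,
    "cpus-per-task": 1,
    "time": "10:00",
    "mem-per-cpu": "100M",
}
steps = ["srun hostname", "srun sleep 60"]
print(build_job_script(resources, steps), end="")
```

Keeping the two parts in separate data structures makes it easy to reuse the same job steps with a different resource request.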
Step 1 - Prepare the Job Directory (Create a File Called test2.sh in the slurm-test Folder)
touch test2.sh
ls
test2.sh test.out test.sh
Step 2 - Write the Job Script
Copy the following script into test2.sh:
#!/bin/bash
#SBATCH --job-name test
#SBATCH --output res.txt
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --time 10:00
#SBATCH --mem-per-cpu 100M
srun hostname
srun sleep 60
This script requests one CPU for 10 minutes, along with 100 MB of RAM, in the default queue. When the job starts, it runs the first job step, srun hostname, which launches the UNIX command hostname on the node where the requested CPU was allocated. A second job step then runs the sleep command. Note that the --job-name flag gives the job a meaningful name and the --output flag defines the file to which the job's output is written.
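The value 10:00 is read as minutes and seconds, not hours and minutes. The sketch below converts the time formats that sbatch accepts (minutes, minutes:seconds, hours:minutes:seconds, and the days-hours variants) into seconds so the difference is easy to see. The function name time_limit_seconds is illustrative, and the code does not handle special values such as an unlimited time limit.

```python
def time_limit_seconds(value):
    """Convert a Slurm time limit string to a number of seconds."""
    if "-" in value:
        # Forms "D-HH", "D-HH:MM" and "D-HH:MM:SS".
        day_part, rest = value.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        days = 0
        parts = [int(p) for p in value.split(":")]
        if len(parts) == 1:
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:
            hours, minutes, seconds = 0, parts[0], parts[1]
        elif len(parts) == 3:
            hours, minutes, seconds = parts
        else:
            raise ValueError("unrecognized time limit: %r" % value)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


for limit in ("10:00", "05:00", "1:30:00", "1-12"):
    print(limit, time_limit_seconds(limit))
```

For the script above, 10:00 comes out as 600 seconds, which is the 10 minutes named in the task.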
Step 3 - Make the File Executable
chmod +x test2.sh
Step 4 - Submit the Job
Use the sbatch command, which, upon success, responds with the job ID assigned to the job.
sbatch test2.sh
Output
Submitted batch job 10763
Step 5 - Monitor the Job
You can check the status of the job with the squeue command.
squeue -u mushfiq
Output
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10763 normal test mushfiq R 0:05 1 discovery-c12
Step 6 - View the Output
Upon completion, the output file contains the result of the commands run in the script file. To view the output, run the following command.
cat res.txt
Output
discovery-c12.cluster.local