Discovery User Guide

Login

To log in to Discovery, you need to be on the NMSU network or connected through the VPN. Please use your NMSU credentials to log in.

[user@yourComp ~]$ ssh username@discovery.nmsu.edu
username@discovery.nmsu.edu's password: 
*******************************************************************************
Want some software installed? Just let us know and we'll do our best to get
it setup for you.  http://hpc.nmsu.edu/software-request

Need help? Please email hpc-team@nmsu.edu 

You can report issues with this system by sending email to hpc-team@nmsu.edu.
*******************************************************************************
[username@discovery ~]$

Slurm

Slurm is the job scheduler currently used on Discovery. All users are required to submit their jobs through Slurm in order to run programs on the compute nodes. A submitted job may be held in the queue rather than starting right away; the system is configured to give all users fair access to the available resources, and each job will start as soon as it reaches the front of the queue and the resources it requested are available.

There are several basic commands that users can use to manage their jobs:

  1. sacct is used to report job or job step accounting information about active or completed jobs; use sacct -h for more information.
  2. sinfo reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options; use sinfo --help for more information.
  3. squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order; use squeue --help for more information.
  4. scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step; use scancel --help for more information.
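As a rough illustration of how these commands are typically called, the sketch below wraps each one in a small helper so that the snippet also runs on a machine without Slurm installed. The helper name run_if_available and the job ID 253296 are assumptions made for illustration only; the options shown (squeue -u, sinfo -p, sacct -j) are standard Slurm options, but it is worth checking each command's help output on Discovery before relying on them.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if it exists on this machine,
# otherwise print a short note instead of failing.
run_if_available() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1"
    fi
}

me=${USER:-$(id -un)}   # current user name
jobid=253296            # placeholder job ID, replace with one of your own

run_if_available squeue -u "$me"      # your pending and running jobs
run_if_available sinfo -p normal      # node states in the normal partition
run_if_available sacct -j "$jobid"    # accounting record for one job
# run_if_available scancel "$jobid"   # uncomment to cancel that job
```

The scancel line is left commented out on purpose, since cancelling a job cannot be undone.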

Some frequently asked questions can be found here. A more detailed tutorial is provided by the University of Utah.

 

Discovery details

Discovery has 25 compute, 11 GPU and 2 high memory nodes:

discovery-h1 — The head node.

discovery-s1 — The storage node.

discovery-l1 – This is the node that you will be on when you log in.  This is *not* a place to run any computation.  You can install software from here, but do not run programs here, because a badly behaved program can harm the node.

discovery-g1 – Discovery’s GPU node that doubles as a computational node. This node has 2 Nvidia Tesla K40 GPUs, 64GB of RAM, and two Intel E5-2640 v3 CPUs with 8 cores each. This means there are 16 cores (or 32 threads) on this node.

discovery-g[2-6] – Discovery’s Durip GPU nodes. Each of these nodes has 2 Nvidia Tesla P100 GPUs, 192GB of RAM, and two Intel Xeon Gold 5117 2.0G CPUs with 14 cores each. This means there are 28 cores (or 56 threads) on each node.

discovery-g[7] – Discovery’s IIPLAB GPU node. This node has 2 Nvidia Tesla P100 GPUs, 192GB of RAM, and two Intel Xeon Gold 5117 2.0G CPUs with 14 cores each. This means there are 28 cores (or 56 threads) on this node.

discovery-g[8-11] – Discovery’s EPSCoR GPU nodes. Each of these nodes has 2 Nvidia Tesla V100 GPUs, 192GB of RAM, and two Intel Xeon Gold 5218 CPUs with 16 cores each. This means there are 32 cores (or 64 threads) on each node.

discovery-hm1 – Discovery’s EPSCoR high memory node. This node has 3TB of RAM and two Intel Xeon Gold 5218 CPUs with 16 cores each. This means there are 32 cores (or 64 threads) on this node.

discovery-hhm1 – Discovery’s EPSCoR hybrid high memory node. This node has 3TB of RAM and two Intel Xeon Gold 5218 CPUs with 16 cores each. This means there are 32 cores (or 64 threads) on this node.

discovery-c[1-6] — Discovery’s “old” nodes with 64GB of RAM, two Intel E5-2640 v3 CPUs with 8 cores each. This means there are 16 cores (or 32 threads) on these nodes.

discovery-c[7-13] — Discovery’s “new” nodes with 128GB of RAM, two Intel E5-2650 v4 CPUs with 12 cores each. This means there are 24 cores (or 48 threads) on these nodes.

discovery-c[14-15] — Discovery’s “new” nodes with 256GB of RAM, two Intel E5-2650 v4 CPUs with 12 cores each. This means there are 24 cores (or 48 threads) on these nodes.

discovery-c[16-25] — Discovery’s Durip nodes with 192GB of RAM, two Intel Xeon Gold 5117 2.0G CPUs with 14 cores each. This means there are 28 cores (or 56 threads) on these nodes.

 

Discovery has 9 Queues/Partitions:

Four partitions are usable by everyone, while five (listed last) are reserved.

normal – The default queue.  It has a maximum wall-time of 7 days 1 hour (--time 7-01:00:00).  This queue contains nodes discovery-c[1-15].  To make sure you land on a particular node type (“old” or “new”), please learn about how to use slurm.

gpu – The queue that ensures your job lands on a node with a GPU.  It has a maximum wall-time of 7 days 1 hour (--time 7-01:00:00) and currently contains only node discovery-g1.

debug – The queue to be used to debug your code.  This was created so that you don’t have to wait in line (in the normal queue) for hours or days to debug your code.  It has a maximum wall-time of 1 hour (--time 0-01:00:00) and contains both the discovery-g1 and discovery-c[1-15] nodes.

backfill – This queue scavenges nodes from all partitions for use (discovery-c[1-25], discovery-g[1-11], discovery-hm1, and discovery-hhm1). It has the lowest priority, so its jobs may be paused multiple times (or indefinitely) depending on the demand of higher priority jobs. It has a maximum wall-time of 14 days 2 hours (--time 14-02:00:00).

osg — As a part of Open Science Grid we contribute our CPU hours when they are not in use. This partition is usable only by OSG.

cfdlab – This partition is for the discovery-c[16-25] and discovery-g[2-6] nodes, and has a maximum wall-time of 7 days 1 hour (--time 7-01:00:00). It is a condo partition and is restricted to Dr. Gross’s lab.

cfdlab-debug – This partition is for the discovery-g[2-6] nodes and has a maximum wall-time of 1 hour (--time 0-01:00:00). It is a condo partition and is restricted to Dr. Gross’s lab.

iiplab – This partition is for the discovery-g7 node and has a maximum wall-time of 7 days 1 hour (--time 7-01:00:00). It is a condo partition and is restricted to Dr. Boucheron’s lab.

epscor – This partition is for the discovery-g[8-11], discovery-hm1, and discovery-hhm1 nodes, and has a maximum wall-time of 7 days 1 hour (--time 7-01:00:00). It is a condo partition and is restricted to the EPSCoR group.
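To make the wall-time limits easier to compare, here is a minimal sketch that converts a --time value into seconds and checks a request against a partition limit. The function names are hypothetical, and the sketch assumes the full "days-hours:minutes:seconds" form only; the shorter time formats Slurm accepts are not handled.

```shell
#!/bin/sh
# Strip leading zeros so that fields such as "08" are not read as octal.
to_decimal() {
    n=$1
    while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]; do
        n=${n#0}
    done
    echo "$n"
}

# Convert "days-hours:minutes:seconds" to seconds.
# Assumption: only this one format is handled, not the shorter forms.
walltime_to_seconds() {
    days=$(to_decimal "${1%%-*}")
    rest=${1#*-}
    hours=$(to_decimal "${rest%%:*}")
    rest=${rest#*:}
    minutes=$(to_decimal "${rest%%:*}")
    seconds=$(to_decimal "${rest#*:}")
    echo $(( days * 86400 + hours * 3600 + minutes * 60 + seconds ))
}

# Succeeds (exit status 0) when the request fits within the limit.
fits_walltime() {
    [ "$(walltime_to_seconds "$1")" -le "$(walltime_to_seconds "$2")" ]
}

# A 2.5 hour request checked against the 1 hour debug limit.
if fits_walltime 0-02:30:00 0-01:00:00; then
    verdict="fits"
else
    verdict="too long"
fi
echo "0-02:30:00 in debug: $verdict"
```

A request that exceeds the partition limit is better moved to a partition with a longer limit than shortened blindly.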

Note: You can only run 10 jobs at a time.  You can submit as many as you desire, but only 10 will ever run at a time.  To make better use of resources and jobs, please consult “Example 4: How to run programs in parallel” to group several analyses into 1 job.

 

Discovery runs CentOS7:

Of the several flavors of Linux/Unix available, Discovery uses CentOS7 as its operating system.  Knowing the OS may be important for installing software.  This also means that the generic Linux/Unix commands and programs can be used on Discovery. Please use tab completion whenever possible; it is helpful for filling in both paths and file names.  If you need assistance becoming familiar with Linux/Unix, consider joining us for a workshop; we hold them throughout the year.  You can also join us during office hours (recommended) or email hpc-team@nmsu.edu.

sbatch

To submit a job to the queue, use the sbatch script generator or follow the instructions below.

sbatch is used to submit a job script for later execution. The script will typically contain one or more commands to launch parallel tasks; use sbatch -h for more information.

NOTE: slurm is sloppy with its word usage.  For the computer literate, each node consists of 2 CPUs with Y number of cores that can be threaded resulting in 2xYx2 threads.  For slurm, each node has 2xYx2 CPUs (also referred to as cores)…  This can cause a lot of confusion for those who understand the differences between the definitions of CPU, core, and thread.  Please understand that the true thread is referred to as CPU by slurm. (This may cause you a headache, and for that we blame the slurm developers).
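The counting in this note can be written out as a small calculation. The sketch below is only illustrative: the function name slurm_cpus is hypothetical, and it assumes 2 hardware threads per core, which is what the thread counts in the node list imply.

```shell
#!/bin/sh
# Slurm "CPUs" per node = sockets x cores per socket x threads per core.
# Assumption: 2 hardware threads per core unless a third argument is given.
slurm_cpus() {
    sockets=$1
    cores_per_socket=$2
    threads_per_core=${3:-2}
    echo $(( sockets * cores_per_socket * threads_per_core ))
}

slurm_cpus 2 8     # E5-2640 v3 nodes
slurm_cpus 2 12    # E5-2650 v4 nodes
slurm_cpus 2 14    # Xeon Gold 5117 nodes
slurm_cpus 2 16    # Xeon Gold 5218 nodes
```

The four results are the largest --cpus-per-task values that can fit on one node of each type.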

Also note: Some programs don’t recognize threads.  In this case, if you want to occupy the whole node, you will need to reserve the maximum number of threads, but your program will only see the number of cores available (# threads/slurm CPUs divided by 2).  For example, Matlab doesn’t recognize threads, so if you reserve 48 threads (--cpus-per-task 48), Matlab will report 24 when you check the number of workers available to it.

Below are examples of how to write scripts and submit them to slurm using sbatch:

  • Example 1 — Simple submission; “boiler plate” example
    • The first step in creating a batch job is to write a batch file. This is a simple shell script that tells Slurm how and what to do for your job. In the example below, let’s assume the batch file is named example1.sh.
      #!/bin/sh
      #SBATCH --job-name myJobName ##name that will show up in the queue
      #SBATCH --output myJobName.o%j ##filename of the output; the "%j" will append the jobID to the end of the name making the output files unique despite the same job name; default is slurm-[jobID].out
      #SBATCH --partition normal ##the partition to run in [options: normal, gpu, debug]; default = normal
      #SBATCH --nodes 1 ##number of nodes to use; default = 1
      #SBATCH --ntasks 3 ##number of tasks (analyses) to run; default = 1
      #SBATCH --cpus-per-task 16 ##the number of threads the code will use; default = 1
      #SBATCH --time 0-00:05:00 ##time for analysis (day-hour:min:sec) -- Max walltime will vary by partition; time formats are: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds"
      #SBATCH --mail-user yourID@nmsu.edu   ##your email address
      #SBATCH --mail-type BEGIN ##slurm will email you when your job starts
      #SBATCH --mail-type END ##slurm will email you when your job ends
      #SBATCH --mail-type FAIL ##slurm will email you when your job fails
      #SBATCH --get-user-env ##passes along environmental settings 
      
      module load myprogram
      
      ##-- After all the modules/programs needed are called, put in your code
      ## myprogram input4myprogram 
      ## Ex:
      /bin/hostname

      sbatch scripts are unique in how they are read. In shell scripts, any line that starts with # is considered a comment. Slurm, however, treats any comment line that starts with the word SBATCH in all caps as a directive. This means that to comment out a slurm directive, put a second # at the beginning of the line (ex: #SBATCH means slurm command, ##SBATCH means skip). Note: the “cpus-per-task” value times the “ntasks” value must fit within the number of threads on the requested nodes (ex: 1 node has a maximum of 48 threads, so “cpus-per-task” times “ntasks” must not exceed 48; otherwise you will get back an error).
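The sizing rule in this note can be checked before submitting. The following is a minimal sketch with a hypothetical function name; it treats the limit as inclusive, which is what the 3 x 16 = 48 request in example1.sh implies, and it assumes a 48-thread node.

```shell
#!/bin/sh
# Succeeds when ntasks x cpus-per-task fits in the threads of one node.
# Usage: fits_on_node NTASKS CPUS_PER_TASK NODE_THREADS
fits_on_node() {
    [ $(( $1 * $2 )) -le "$3" ]
}

for ntasks in 3 4; do
    if fits_on_node "$ntasks" 16 48; then
        echo "$ntasks tasks x 16 threads: fits on a 48-thread node"
    else
        echo "$ntasks tasks x 16 threads: too large for a 48-thread node"
    fi
done
```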

    • To submit an sbatch script to slurm, simply type “sbatch [inputSbatchScript].sh”.
      [user@Discovery ~]$ sbatch example1.sh
      Submitted batch job 253296
      [user@Discovery ~]$ ls
      example1.sh  myJobName.o253296

      In this example, a file named myJobName.o253296 has been created. This is the output from our job; slurm always creates an output file for a batch job when it starts to execute. By default the output file will be named slurm-<job #>.out, unless otherwise specified.

    • Looking into the output file:
      [user@Discovery ~]$ cat myJobName.o253296
      Discovery-c12

      The /bin/hostname command on a Unix system prints the name of the system it is being run on. So we can see that Slurm ran our job on the compute node named Discovery-c12.
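When a submission is scripted, the job ID usually has to be pulled out of the "Submitted batch job" message in order to find the matching output file. Here is a small sketch of that step, using the sample message from the session above rather than a live sbatch call; the variable names are illustrative.

```shell
#!/bin/sh
# Sample text copied from the session above; in practice this would be
# captured with: submit_output=$(sbatch example1.sh)
submit_output="Submitted batch job 253296"

# Keep only the text after the last space, which is the job ID.
jobid=${submit_output##* }

# Rebuild the output filename implied by "--output myJobName.o%j".
output_file="myJobName.o${jobid}"

echo "job ID:      $jobid"
echo "output file: $output_file"
```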

  • Example 2 – Picking a queue

    This example covers three of Discovery’s queues: normal, gpu, and debug.

    The normal queue has a limit of 12 nodes (496 threads), but a request for that many resources may wait a long time before it starts. The normal queue is the default queue; if you don’t specify a queue, your job will be placed in the normal queue. The gpu queue gives you access to the GPU and is the least used at the moment.  The debug queue has access to 2 nodes (Discovery-g1 and Discovery-c1), but a much shorter walltime.

  • Example 3 – How to use a program that you installed and run programs in series
    #!/bin/sh
    #SBATCH --ntasks 1
    #SBATCH --nodes 1
    
    module load matlab/r2015a
    
    cd somePath/nextPartOfPath
    
    matlab <input for matlab>
    ./path2myProgram/myFavProgram <input for my program>
    
    

    How to read this script: In this example, we request 1 node and 1 task.  The default is 1 “cpus-per-task” (read: 1 thread per task), so we have requested 1 thread on 1 node. As these are the only parameters we specified, the others take their defaults, including the job name, job output, etc.  Also note that we have not asked slurm to notify us about the state of our submission, so we won’t know when the run starts or whether it errors out until we log back onto Discovery.

    What this script does: 1, Loads the matlab module. 2, Changes directories.  The sbatch command remembers what directory you were in when submitting the job.  To maintain good notes, it’s useful to write directory changes/starting locations into the sbatch script. 3, Runs a script using matlab. 4, Runs myFavProgram. To run a program installed in your own environment (i.e. somewhere within your home directory), you need to preface it with “./” and give the path to the executable. You don’t have to include “/home/userID/”; the path is read relative to the directory the script is in at that point. Note: This script is designed such that the matlab process will run before myFavProgram does (in series).  Be aware that a plain shell script moves on to the next line even when a command fails, so if matlab errors out, myFavProgram will still be started unless you add “set -e” near the top of the script or join the two commands with “&&”.  Running in series is a good setup if the second program depends on the one before it (ex: the output of program 1 is the input for program 2).
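When the second program really depends on the first, it helps to make that dependency explicit, because a plain sh script does not by itself stop after a failing command. The sketch below uses two stand-in functions (hypothetical names) in place of matlab and myFavProgram to show the difference between a plain sequence and a chain joined with &&.

```shell
#!/bin/sh
# Stand-ins for the two programs; first_step simulates a failure.
first_step()  { echo "first step ran";  return 1; }
second_step() { echo "second step ran"; }

# Plain sequence: the second command runs even though the first failed.
# "set +e" just makes the shell's default (keep going) explicit here.
plain=$(set +e; first_step; second_step)

# Chained with &&: the second command is skipped after a failure.
chained=$(first_step && second_step) || true

echo "plain:"
echo "$plain"
echo "chained:"
echo "$chained"
```

In a batch script the same idea would read "matlab ... && ./path2myProgram/myFavProgram ...", with the elided arguments filled in as in the example above.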

  • Example 4 – How to run programs in parallel (very useful if you need lots of small-resource, independent jobs run)
    #!/bin/sh
    #SBATCH --nodes 1
    #SBATCH --ntasks 2 
    #SBATCH --cpus-per-task 24 
    
    module load matlab/r2015a
    
    srun --preserve-env --multi-prog ./myfile.conf
    

    How to read this script: In this example, we request 1 node, 2 tasks and 24 cpus-per-task (threads).  This means we have requested 48 threads on 1 node and plan on running 2 tasks. As these are the only parameters we specified, the others take their defaults.  Note: We are running 2 programs/analyses for the price of 1 job.

    What this script does: 1, Loads the matlab module. 2, Calls srun on our conf file.  srun allows us to use the “--preserve-env” flag, which means whatever is in our environment, including parameters we might have changed, is preserved.  The “--multi-prog” flag tells srun that we will be calling multiple programs.  Calling srun like this means that we will be running our two programs in parallel.  In this case the two programs will be started at the same time, but will finish (or error out) independently of each other.  The resources of the programs must match.  In this case, both programs will get 24 threads to work with.  If one needed 24 and the other 16, we would need to submit two independent sbatch jobs.

    myfile.conf:

    0 matlab <path2inputfile/inputFile>
    1 ./path2myProgram/myFavProgram <input for my program>
    

    The above is the content of myfile.conf. The first column is the number of the task (task numbering starts at 0, not 1!), the second column is the program to be run, and the third column is the input for the program. If you have a way of monitoring the output of your programs (ex: things are being continuously written to an output file), you can watch both files increase in size. If the programs are run in series (see: Example 3) you will see the first file increase in size until complete and then the second one will appear and grow.
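For more than a handful of tasks, the configuration file can be generated rather than typed by hand. This is a sketch under stated assumptions: every task runs the same myFavProgram, and each one reads a hypothetical input file named input_<task number>.txt, a naming scheme that does not come from this guide.

```shell
#!/bin/sh
# Write a multi-prog configuration with one line per task.
# Assumption: each task reads its own hypothetical file input_<N>.txt.
ntasks=4
conf=$(mktemp)

task=0
while [ "$task" -lt "$ntasks" ]; do
    # Columns: task number, program, input for the program.
    printf '%s ./path2myProgram/myFavProgram input_%s.txt\n' "$task" "$task"
    task=$(( task + 1 ))
done > "$conf"

cat "$conf"
```

The matching batch script would then request --ntasks 4 so that task numbers 0 to 3 all exist.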

  • Example 5
    #!/bin/sh
    #SBATCH -n 1
    #SBATCH -N 1
    srun tar zxf julia-0.3.11.tar.gz
    echo "prefix=/software/julia-0.3.11" > julia/Make.user
    cd julia
    srun make

    The example is a batch file used to compile the Julia programming language on Discovery. The -n 1 tells Slurm we’re going to have one task per job step. Each time we invoke srun, that is a job step. If we had set -n 2, then srun would start the tar command twice because we asked for two tasks per step. In this case the tar command will be run on one compute node. The next two commands don’t have the srun prefix. This is an important detail: anything they do will affect the environment for any later commands run via srun. For example, the step “cd julia” changes the current working directory to “julia”. This directory was created by the tar command. By default, batch jobs will start in whatever directory you were in when you issued the sbatch command. Because we change the current working directory, when srun starts the make command it will be run from the julia directory.

  • Example 6 – MPI jobs

    In this example we’ll run an MPI program on 20 cores.

    #!/bin/sh
    #SBATCH -n 20
    #SBATCH --mpi=pmi2
    #SBATCH -o myoutputfile.txt
    module load mpi/mpich-x86_64
    mpirun -np 20 mpiprogram < inputfile.txt

    First note that we’ve asked for 20 tasks at each job step (anything that starts with srun). For MPI programs it is one CPU per process, so -n 20 will create 20 processes using 20 cores. Next we ask slurm to use the “pmi2” MPI type. This is the appropriate type for MPICH programs. We then tell Slurm that the output should be written to “myoutputfile.txt” instead of “slurm-<job #>.out”. After that we use the module command (more details about the module command are provided later in this document) to load the MPICH environment. This will adjust your PATH and LD_LIBRARY_PATH to include MPICH. Those settings will be passed on to anything run via srun. Finally we launch our program with the mpirun command.
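One way to keep the -n request and the mpirun -np count from drifting apart is to read the task count from the environment Slurm sets up for the job. The sketch below assumes the SLURM_NTASKS variable is set inside the batch job (Slurm sets it when a task count is requested) and falls back to 1 elsewhere; the function name mpi_np is hypothetical.

```shell
#!/bin/sh
# Number of MPI processes to start: the task count Slurm granted,
# or 1 when the script is run outside of a Slurm job.
mpi_np() {
    echo "${SLURM_NTASKS:-1}"
}

# Print the launch line instead of running it, so the sketch works anywhere.
echo "would run: mpirun -np $(mpi_np) mpiprogram"
```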

  • Example 7 – OpenMP and other multithreaded jobs

    OpenMP differs from MPI jobs in that we have a single process using multiple threads. So we want to tell Slurm to give us as many cores as necessary, up to the maximum number of cores in a single node, but only launch a single process.

    #!/bin/sh
    #SBATCH -n 1
    #SBATCH -N 1
    #SBATCH -c 16
    ./multithreaded_program

    In this batch file, we ask Slurm for one node (-N 1), one process/task (-n 1), and 16 cores assigned to that task (-c 16).

    #!/bin/sh
    #SBATCH -n 2
    #SBATCH -N 1
    #SBATCH -c 8
    srun ./multithreaded_program

    In this case, we’ve asked slurm to run two instances of our program (-n 2), each getting 8 cores (-c 8). Since each compute node has 16 cores, and we specified -N 1 they will run on the same node. If we didn’t specify -N 1, Slurm would be free to run the two processes on different nodes.
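OpenMP programs generally take their thread count from the OMP_NUM_THREADS environment variable, so a common pattern is to set it from the -c value Slurm granted. This is a sketch, assuming SLURM_CPUS_PER_TASK is set inside the job when -c is given; the function name is hypothetical and the fallback of 1 thread is a choice made for illustration.

```shell
#!/bin/sh
# Match the OpenMP thread count to what Slurm allocated with -c,
# falling back to 1 thread outside of a Slurm job.
set_omp_threads() {
    OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    export OMP_NUM_THREADS
}

set_omp_threads
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# ./multithreaded_program would be launched here
```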

srun

To run a job interactively:

srun is used to submit a job for execution or initiate job steps in real time; use srun -h for more information. Below are examples using srun:

  • Example 1 – run a simple program interactively
    [user@Discovery ~]$ module load R/322
    [user@Discovery ~]$ srun -c 2 -p normal R --vanilla
    
    R version 3.2.2 (2015-08-14) -- "Fire Safety"
    Copyright (C) 2015 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)
    
    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.
    
      Natural language support but running in an English locale
    
    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.
    
    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.
    
    > 
    

    Note how the prompt is now a “>”; this shows us that we are in the R program command line and no longer working on the OS command line.
    Let’s check what the job ID for this process is:

    [user@Discovery ~]$ squeue --account userID
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                253348    normal        R userID  R       1:51      1 Discovery-c2

    This gives us the JobID [253348], what queue/partition we are in [normal], the name of the job [R], who we are [userID], status [R] (running), time our job has been running [1:51] (min:sec), the number of nodes we’re using [1], and the node our job is on [Discovery-c2].
    Let’s see if we got the resources (2 threads) we asked for:

    [user@Discovery ~]$ scontrol show job 253348
    JobId=253348 JobName=R
       UserId=userID(#####) GroupId=userID(#####)
       Priority=3691 Nice=0 Account=userID QOS=normal
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
       RunTime=00:04:14 TimeLimit=7-01:00:00 TimeMin=N/A
       SubmitTime=2017-03-24T11:33:56 EligibleTime=2017-03-24T11:33:56
       StartTime=2017-03-24T11:33:56 EndTime=2017-03-31T12:33:56
       PreemptTime=None SuspendTime=None SecsPreSuspend=0
       Partition=normal AllocNode:Sid=Discovery:14975
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=Discovery-c2
       BatchHost=Discovery-c2
       NumNodes=1 NumCPUs=2 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=2 MinMemoryNode=0 MinTmpDiskNode=0
       Features=(null) Gres=(null) Reservation=(null)
       Shared=OK Contiguous=0 Licenses=(null) Network=(null)
       Command=R
       WorkDir=/home/userID
    

    “NumNodes=1 NumCPUs=2 CPUs/Task=2” tells us that we have 1 node, 2 threads and 2 threads per task (task=the R program). So we got the resources we requested and are now safely working interactively on node Discovery-c2 (NodeList=Discovery-c2).  Please note that our ability to work in our R session is still limited to the maximum wall-time for the partition we are in (Partition=normal; TimeLimit=7-01:00:00).
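Picking individual values out of the scontrol listing by eye gets tedious, so the following sketch shows one way to extract them. It works on a sample line copied from the output above rather than a live scontrol call, and the helper name field_value is hypothetical.

```shell
#!/bin/sh
# One line of sample output copied from the scontrol listing above.
sample='   NumNodes=1 NumCPUs=2 CPUs/Task=2 ReqB:S:C:T=0:0:*:*'

# Print the value of KEY from a line of space-separated KEY=VALUE pairs.
# Usage: field_value KEY LINE
field_value() {
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s|^$1=||p"
}

numcpus=$(field_value NumCPUs "$sample")
cpus_per_task=$(field_value CPUs/Task "$sample")
echo "NumCPUs=$numcpus CPUs/Task=$cpus_per_task"
```

On Discovery the same helper could be fed each line of the real "scontrol show job" output in place of the sample text.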

  • Example 2 – run an MPI job on 20 cores interactively
    [user@Discovery ~]$ module load mpi/mpich-x86_64
    [user@Discovery ~]$ srun mpirun -np 20 mpi_program < input_file
    <program output>

    Please note: a flag you can use is --mpi= [values include none, pmi2, mvapich, openmpi; defaults to none; pmi2 works best for MPICH]

  • Example 3 – run Octave interactively
    [user@Discovery ~]$ module load octave/400
     JDK 8u25 added to environment.
     GNU Octave 4.0.0 added to environment.
    [user@Discovery ~]$ srun --pty -c 8 octave

    In this example, we’ve loaded Octave into our environment. Then we run it via srun, requesting 8 cores. For programs that we want to interact with, we add the “--pty” option. This tells slurm to set up the environment the program is running under to look like a regular login session. If you run an interactive job, like Octave or R, and don’t get the prompt you’re expecting, try adding the “--pty” option.

  • Example 4 – run CUDA jobs interactively

    Discovery has several NVIDIA Tesla K40 GPUs available. In this example we’ll add CUDA 7.0 to our environment, compile a simple hello world CUDA program and run it on a GPU.

    [user@Discovery ~]$ module load cuda7
    [user@Discovery ~]$ nvcc helloworld.cu
    [user@Discovery ~]$ srun --gres=gpu a.out
    Hello World!
    [user@Discovery ~]$

    The --gres=gpu option tells Slurm we want to use a GPU. The default is one GPU; you can specify the number of GPUs with --gres=gpu:<count>. “gres” stands for Generic Resource Scheduling, and is Slurm’s mechanism for managing arbitrary resources like GPUs and licensed software.
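As a small illustration of the <count> form, the sketch below builds the option string and rejects counts that a single node could not satisfy. The function name gres_flag is hypothetical, and the limit of 2 GPUs per node is an assumption taken from the node descriptions in this guide.

```shell
#!/bin/sh
# Build a --gres option for a given number of GPUs (default: 1).
# Assumption: at most 2 GPUs per node, per the node descriptions in this guide.
gres_flag() {
    count=${1:-1}
    if [ "$count" -lt 1 ] || [ "$count" -gt 2 ]; then
        echo "GPU count must be 1 or 2" >&2
        return 1
    fi
    echo "--gres=gpu:$count"
}

# Print the launch line instead of running it, so the sketch works anywhere.
echo "srun $(gres_flag 2) a.out"
```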