HTC user guide

System Overview

Twoface is a Linux cluster for supporting research at NMSU. This cluster runs CentOS 7.x and shares user file space with the Joker cluster. Twoface uses the HTCondor scheduling system instead of Slurm. Our goal with Twoface is to support a heterogeneous computing environment, and to eventually leverage idle desktop and workstation computers in support of research computing.

OS: CentOS 7
CPUs: 48
RAM: 4GB/core
Scheduler: HTCondor 8.4.x
Universes supported: Standard, Vanilla, JVM, Docker (planned)

Getting Started

To request access to Twoface, visit our account request page. Most requests are honored within 2 business days.

Once your account has been created, you can log in to Twoface via SSH from on campus. If you need to access Twoface from off campus, you must use the NMSU VPN.

Using HTCondor

HTCondor is the scheduler we use on Twoface. The scheduler is how you request that your program be executed. Depending on the workload already present, your job may not start right away. One of the advantages of HTCondor is that it supports checkpointing your program, so that no work is lost if your job has to be restarted.

Batch Jobs

With HTCondor, almost everything you do will be done with a batch job. To create a batch job, you must write a text file that explains what your job needs to do, and what resources it requires. The examples below will illustrate the typical types of batch jobs you might want to submit.

Example 1 – a very simple job

For our first example, we’ll run a simple shell script that sleeps for 30 seconds and prints out the name of the compute node it ran on.

#!/bin/bash
TIMETOWAIT="30"
HOSTNAME=`/bin/hostname`
echo "Sleeping for $TIMETOWAIT seconds on $HOSTNAME"
/bin/sleep $TIMETOWAIT

Save this into a file named “sleep.sh”; it does not need to be executable.

Now we’ll create a job submission file. This file tells HTCondor what we want it to do, and what resources we want allocated.

# HTCondor job submission example
executable = sleep.sh
log = sleep.log
output = sleep.out
error = sleep.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
queue

Save this into a file named “submit”.

Any line starting with a # in the submit file is treated as a comment. The remaining lines are commands to HTCondor, which typically take the form “command name = value” (there are a few exceptions to this rule, like the queue command).

  • executable identifies the exact command to run.
  • log causes a log file to be created on the submit host (the login node); this file can be very valuable when you’re trying to diagnose problems with your job submission.
  • output tells HTCondor where to write the standard output of the executable.
  • error specifies where to write anything the executable sends to standard error.
  • should_transfer_files instructs HTCondor to send a copy of the executable to whatever machine is selected to run your job. If your job runs on a machine that shares the same filesystems as the login node you can skip this, but it never hurts to include it.
  • when_to_transfer_output controls when HTCondor copies output from the compute node back to the login node.
  • queue tells HTCondor to actually queue the job described by the options above.

To submit your job, we’ll use the condor_submit command. We’ll also use the condor_q command to see the list of pending and running jobs.

[user@twoface ~]$ condor_submit submit
Submitting job(s).
1 job(s) submitted to cluster 16.

The message, “1 job(s) submitted to cluster 16” means that our job has been submitted successfully. The job number is 16 in this case.

[user@twoface ~]$ condor_q

-- Schedd: twoface.nmsu.edu : <128.123.211.55:52216?...
 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
 16.0 user 10/13 15:24 0+00:00:14 R 0 0.0 sleep.sh

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

If we then run the condor_q command, we can see the status of our job. Notice that the ID column shows our job number, 16. The ST column shows the state of the job; it is normal for a job to sit in the Idle (I) state for about a minute after submission.

Example 2 – jobs with command line arguments

In this example, we’ll alter our sleep.sh script to accept a command line argument – the number of seconds to sleep. In Bash, $1 is the first command line argument, $2 the second, and so on.

#!/bin/bash
TIMETOWAIT=$1
HOSTNAME=`/bin/hostname`
echo "Sleeping for $TIMETOWAIT seconds on $HOSTNAME"
/bin/sleep $TIMETOWAIT
cat $2
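As a standalone illustration of how positional parameters behave, this sketch can be run directly in a shell, outside of HTCondor (the function name show_args is hypothetical):

```shell
#!/bin/bash
# show_args: print the first two positional parameters and the total count.
# $1 is the first argument, $2 the second, and $# is the number of arguments.
show_args() {
    echo "first=$1 second=$2 count=$#"
}

show_args 30 input.txt   # prints: first=30 second=input.txt count=2
```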

Next create a text file named “input.txt” containing whatever text you’d like; we’ll use the string “This is from the input file!” in our example.

Next we’ll alter the job submission file to include command line arguments for our executable.

# HTCondor job submission example
executable = sleep.sh
arguments = "17 input.txt"
transfer_input_files = input.txt
log = sleep.log
output = sleep.out
error = sleep.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
queue

After running the job, we have the following in the output file.

[user@twoface ~]$ cat sleep.out
Sleeping for 17 seconds on twoface-c2.nmsu.edu
This is from the input file!

The arguments command specifies all of the arguments that the executable should be run with. If you don’t specify it, the executable will be run without any arguments. The transfer_input_files command arranges for HTCondor to copy the named files to the execution host before running the job. You can list multiple files by separating them with commas, e.g. “transfer_input_files = file1,file2,file3”.
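For instance, a submit file for a job that needs several input files might look like this sketch (process.sh, data1.txt, data2.txt, and params.cfg are hypothetical names):

```
# HTCondor job submission example - multiple input files
# (the executable and file names here are hypothetical)
executable = process.sh
arguments = "data1.txt data2.txt params.cfg"
transfer_input_files = data1.txt,data2.txt,params.cfg
log = process.log
output = process.out
error = process.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
queue
```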

Example 3 – job arrays

The Slurm scheduler has a feature called job arrays that lets you submit a single job that spawns many sub-jobs, all running the same executable with possibly different parameters. HTCondor can do the same thing, and is arguably even more flexible.

In this job submission file, we queue 6 jobs simultaneously. The initialdir command causes each job to run in a different directory. The variable $(Process) changes for each job that is created, and ranges from 0 to 5.

#!/bin/bash
# echo.sh HTCondor example
PROC=$1
HOSTNAME=`/bin/hostname`
echo "Process $PROC on $HOSTNAME"

Save the script above as “echo.sh”. Then create the job submission file:

# HTCondor job submission example
executable = echo.sh
arguments = "$(Process)"
log = echo.log
output = echo.out
error = echo.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
initialdir = run$(Process)
queue 6

The one downside to using the initialdir command is that all of the directories must exist before you submit the job. The Bash loop below is an easy way to pre-create them.

for (( i=0; i<=5; i++ )); do
    mkdir run$i
done
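The same loop can be wrapped in a small standalone script that also verifies the directories were created (a sketch; it creates run0 through run5 in the current working directory):

```shell
#!/bin/bash
# Pre-create the run0 ... run5 directories required by initialdir.
# mkdir -p makes the script safe to re-run if some directories already exist.
for (( i=0; i<=5; i++ )); do
    mkdir -p "run$i"
done

# Confirm all six directories now exist.
ls -d run0 run1 run2 run3 run4 run5
```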

If you don’t want to create a directory for each job, you can alternatively change the output file names to include the process number.

# HTCondor job submission example
executable = echo.sh
arguments = "$(Process)"
log = echo$(Process).log
output = echo$(Process).out
error = echo$(Process).err
queue 79

It is also possible to do arithmetic on variables inside of your job submission file. For example, if you want to run 1000 processes with the index starting at 3700, you might do something like this (the $INT() macro forces the expression to be evaluated as an integer):

# HTCondor job submission example - arithmetic on variables
executable = echo.sh
MyIndex = $(Process) + 3700
arguments = "$INT(MyIndex)"
queue 1000

Example 4 – Job Requirements

Inside of your job submission file you can specify a large number of options to ensure your job gets the resources it needs.

# HTCondor job submission example
executable = echo.sh
arguments = "$(Process)"
request_memory = 1024MB
request_cpus = 4
log = echo.log
output = echo.out
error = echo.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
queue

This job will request 4 CPUs and 1GB of memory.

Example 5 – Using the Software Library

Because most of the software in our library requires that you first issue a module command, we need to provide a wrapper shell script that loads the module and then runs the job. Since we’re using a shell script as a wrapper, we’ll need to use the Vanilla universe.

In this example we’ll use GNU Octave, with all of our Octave commands saved into a .m file (input.m).

#!/bin/bash -l
#The shell needs to be a login shell for the module command to work right
#wrapper script to load Octave module
module load octave/400
#Disable octave history file (-H) and ini files (-f) -- they're useless under condor
octave -f -H "$@"


Save this wrapper script as “octave.sh”. Then create the job submission file:

#HTCondor job submission example - wrapper scripts
universe = vanilla
executable = octave.sh
arguments = input.m
transfer_input_files = input.m
#octave can make use of multiple CPUs so request 4 CPUs and 4GB RAM.
request_cpus = 4
request_memory = 4GB
log = octave.log
error = octave.err
output = octave.out
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
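A side note on wrapper scripts: unquoted $* word-splits any argument that contains spaces, while quoted "$@" passes each argument through intact, so "$@" is generally the safer form when forwarding arguments. A standalone demonstration (count_args is a hypothetical helper):

```shell
#!/bin/bash
# count_args: report how many arguments the wrapped command would receive.
count_args() { echo "$#"; }

set -- "file one.m" "file two.m"   # two arguments, each containing a space

count_args $*      # unquoted $* word-splits: prints 4
count_args "$@"    # quoted "$@" preserves the arguments: prints 2
```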

Additional Information on job submission

The full set of options for a job submission file are documented in the condor_submit manual page.

HTCondor Universes

HTCondor can provide many different types of execution environments; each of these environments is called a Universe. The default universe for Twoface is the Vanilla universe.

  • Standard
  • Vanilla
  • Java
  • Parallel
  • VM
  • Docker

Presently, we support the Standard, Vanilla, and Java universes. We plan to add support for Parallel, VM, and Docker universe jobs in the future. There are other universes built into HTCondor, but at this time we do not plan to support them.

Choosing a Universe

The default universe on Twoface is the Vanilla universe.

The Standard Universe has many desirable features, including the ability to automatically checkpoint your job. To use the Standard Universe you must recompile your program with the condor_compile command. There are several restrictions on what a program can do in the Standard Universe; please see section 2.4.1.1 of the official documentation for details.

The Vanilla Universe is the default choice on Twoface. It is ideal for jobs that require shell scripts, or can not be recompiled.

The Java Universe makes running Java code much easier. HTCondor will take care of finding the JVM on the compute node, and setting things like the Classpath for you automatically.
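A Java universe submit file follows the same pattern as the vanilla examples above; the HTCondor convention is that the executable is the .class file and the first argument names the class containing main. A hypothetical sketch (Hello.class is an assumed file name):

```
# HTCondor Java universe example (Hello.class is a hypothetical file)
universe = java
executable = Hello.class
arguments = Hello
log = hello.log
output = hello.out
error = hello.err
queue
```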

The Parallel, VM, and Docker Universes are not presently available. They are targeted for Spring 2016 availability and will be documented once they are setup.

You specify the job universe with the universe command in your submit file.

# HTCondor example submit file
#choices for universe are vanilla, standard, java
universe = vanilla
executable = myprogram
queue

Compiling for the Standard Universe

Recompiling your code to work with HTCondor is very easy: simply prefix your compile commands with condor_compile. Suppose you have a C version of Hello World. Normally we’d compile it with something like “gcc -o hello hello.c”. Instead, we wrap the gcc command with condor_compile, as in the example below.

[user@twoface ~]$ condor_compile gcc -o hello hello.c
LINKING FOR CONDOR : /usr/bin/ld -L/usr/lib64/condor -Bstatic --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -o hello /usr/lib64/condor/condor_rt0.o /usr/lib/gcc/x86_64-redhat-linux/4.8.3/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.8.3/crtbeginT.o -L/usr/lib64/condor -L/usr/lib/gcc/x86_64-redhat-linux/4.8.3 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.3/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.3/../../.. /tmp/cctnaRXI.o /usr/lib64/condor/libcondorsyscall.a /usr/lib64/condor/libcondor_z.a /usr/lib64/condor/libcomp_libstdc++.a /usr/lib64/condor/libcomp_libgcc.a /usr/lib64/condor/libcomp_libgcc_eh.a -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /usr/lib64/condor/libcomp_libgcc.a /usr/lib64/condor/libcomp_libgcc_eh.a /usr/lib/gcc/x86_64-redhat-linux/4.8.3/crtend.o /usr/lib/gcc/x86_64-redhat-linux/4.8.3/../../../../lib64/crtn.o

The condor_compile command will automatically add the libraries necessary for your program to support checkpointing. It will also force your program to be compiled as a static executable.