Twoface is a Linux cluster for supporting research at NMSU. This cluster runs CentOS 7.x and shares user file space with the Joker cluster. Twoface uses the HTCondor scheduling system instead of Slurm. Our goal with Twoface is to support a heterogeneous computing environment, and to eventually leverage idle desktop and workstation computers in support of research computing.
Universes supported: Standard, Vanilla, JVM, Docker (planned)
To request access to Twoface, visit our account request page. Most requests are honored within 2 business days.
Once your account has been created, you can log in to Twoface via SSH from on campus. If you need to access Twoface from off campus, you must use the NMSU VPN.
HTCondor is the scheduler we use on Twoface. The scheduler is how you request that your program be executed. Depending on the workload already present, your job may not start right away. One of the advantages of HTCondor is that it supports checkpointing your program, so that no work is lost if your job has to be restarted.
With HTCondor, almost everything you do will be done with a batch job. To create a batch job, you must write a text file that explains what your job needs to do, and what resources it requires. The examples below will illustrate the typical types of batch jobs you might want to submit.
Example 1 – a very simple job
For our first example, we’ll run a simple shell script that sleeps for 30 seconds and prints out the name of the compute node it ran on.
#!/bin/bash
TIMETOWAIT="30"
HOSTNAME=`/bin/hostname`
echo "Sleeping for $TIMETOWAIT seconds on $HOSTNAME"
/bin/sleep $TIMETOWAIT
Save this into a file named “sleep.sh”. It does not need to be executable.
Now we’ll create a job submission file. This file tells HTCondor what we want it to do and what resources we want allocated.
# HTCondor job submission example
executable = sleep.sh
log = sleep.log
output = sleep.out
error = sleep.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
queue
Save this into a file named “submit”.
Any line starting with a # in the submit file is treated as a comment. The remaining lines are the actual commands to HTCondor. Commands typically take the form “command name = value”, with a few exceptions such as the queue command. The executable command identifies the exact program to run. The log command causes a log file to be created on the submit host (login node); this file can be very valuable when you’re trying to figure out problems with your job submission. The output command tells HTCondor where to write the standard output of the executable, and the error command specifies where to write anything the executable writes to standard error.
The next command, should_transfer_files, instructs HTCondor to send a copy of the executable to whatever machine is selected to run your job. If your job runs on a machine that shares the same filesystems as the login node you can skip this, but it never hurts to include it. The when_to_transfer_output command tells HTCondor when to copy the output from the compute node back to the login node; ON_EXIT means the copy happens when the job exits. The last command, “queue”, tells HTCondor to actually queue the job described by the commands above.
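As a rough illustration of the “command name = value” layout, the following shell sketch reads a value back out of a submit file and checks that the named executable exists before anything is submitted. The helper name get_submit_value and the temporary working directory are assumptions made for this example; they are not part of HTCondor.

```shell
#!/bin/bash
# Sketch: look up one "command = value" line in a submit file.
# get_submit_value is a hypothetical helper, not an HTCondor tool.
get_submit_value() {
    # $1 = submit file, $2 = command name (matched case-insensitively)
    awk -v key="$2" '
        /^[ \t]*#/ { next }               # skip comment lines
        index($0, "=") == 0 { next }      # skip lines such as "queue"
        {
            name = substr($0, 1, index($0, "=") - 1)
            gsub(/[ \t]/, "", name)
            if (tolower(name) == tolower(key)) {
                value = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                print value
                exit
            }
        }' "$1"
}

# Build a throwaway copy of the Example 1 files so the sketch is self-contained.
workdir=$(mktemp -d)
printf '#!/bin/bash\n/bin/sleep 30\n' > "$workdir/sleep.sh"
cat > "$workdir/submit" <<'EOF'
# HTCondor job submission example
executable = sleep.sh
log = sleep.log
output = sleep.out
error = sleep.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
queue
EOF

# Confirm that the executable named in the submit file is really there.
exe=$(get_submit_value "$workdir/submit" executable)
if [ -f "$workdir/$exe" ]; then
    preflight="ok"
else
    preflight="missing"
fi
echo "executable = $exe ($preflight)"
```

A check like this is optional; condor_submit reports its own errors, but catching a misspelled file name early can save a round trip through the queue.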
To submit your job, we’ll use the condor_submit command. We’ll also use the condor_q command to see the list of pending and running jobs.
[user@twoface ~]$ condor_submit submit
Submitting job(s).
1 job(s) submitted to cluster 16.
The message, “1 job(s) submitted to cluster 16” means that our job has been submitted successfully. The job number is 16 in this case.
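If you submit jobs from a script, it can be handy to capture that job number. The sketch below pulls the number out of the confirmation message; the function name extract_cluster_id is made up for this example, and the sample text is simply the message shown above rather than live output.

```shell
#!/bin/bash
# Sketch: extract the cluster number from condor_submit's confirmation message.
# extract_cluster_id is a hypothetical helper name.
extract_cluster_id() {
    # Reads condor_submit output on stdin and prints the number that
    # follows "submitted to cluster".
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\).*/\1/p'
}

# Sample text copied from the example above, standing in for real output.
sample_output='Submitting job(s).
1 job(s) submitted to cluster 16.'

cluster_id=$(printf '%s\n' "$sample_output" | extract_cluster_id)
echo "Cluster ID: $cluster_id"
```

On the cluster, the same function could be fed directly with `condor_submit submit | extract_cluster_id`, and the resulting number passed to commands such as condor_q or condor_rm.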
[user@twoface ~]$ condor_q
-- Schedd: twoface.nmsu.edu : <126.96.36.199:52216?...
 ID      OWNER   SUBMITTED     RUN_TIME   ST PRI SIZE CMD
 16.0    user    10/13 15:24   0+00:00:14 R  0   0.0  sleep.sh
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
The condor_q output above shows the status of our job. Notice that the ID column has our job number, 16, and the ST column shows the state of the job, here Running (R). It’s normal for your job to be in a state of Idle (I) for about a minute after job submission.
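The one-letter codes in the ST column can be spelled out with a small lookup. This is only a sketch: Idle (I) and Running (R) come from the text above, while the other letters follow the usual condor_q conventions and should be checked against the condor_q manual page for the installed version. The function name describe_state is hypothetical.

```shell
#!/bin/bash
# Sketch: translate the one-letter ST codes from condor_q into words.
# Codes other than I and R are assumed from common condor_q usage.
describe_state() {
    case "$1" in
        I) echo "Idle" ;;
        R) echo "Running" ;;
        H) echo "Held" ;;
        C) echo "Completed" ;;
        X) echo "Removed" ;;
        S) echo "Suspended" ;;
        *) echo "Unknown ($1)" ;;
    esac
}

describe_state R
describe_state I
```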
Example 2 – jobs with command line arguments
In this example, we’ll alter our sleep.sh script to accept command line arguments: the number of seconds to sleep, and the name of a file to print afterwards. In Bash, $1 is the first command line argument, $2 the second, and so on.
#!/bin/bash
TIMETOWAIT=$1
HOSTNAME=`/bin/hostname`
echo "Sleeping for $TIMETOWAIT seconds on $HOSTNAME"
/bin/sleep $TIMETOWAIT
cat $2
Next, create a text file named “input.txt” with whatever text you’d like. We’ll use the string “This is from the input file!” in our example.
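Before handing the script to HTCondor, it may be worth running it once by hand with a short sleep to confirm that both arguments are picked up. The sketch below does this in a temporary directory; the one-second sleep and the directory are choices made for this example only.

```shell
#!/bin/bash
# Sketch: run the Example 2 script by hand with a short sleep before submitting it.
workdir=$(mktemp -d)

# Copy of the Example 2 script.
cat > "$workdir/sleep.sh" <<'EOF'
#!/bin/bash
TIMETOWAIT=$1
HOSTNAME=`/bin/hostname`
echo "Sleeping for $TIMETOWAIT seconds on $HOSTNAME"
/bin/sleep $TIMETOWAIT
cat $2
EOF

echo "This is from the input file!" > "$workdir/input.txt"

# Run from inside the working directory so the relative file name resolves.
dry_run_output=$(cd "$workdir" && bash sleep.sh 1 input.txt)
echo "$dry_run_output"
```

If the last line printed is the text from input.txt, the script handles its arguments as intended and is ready to be described in a submit file.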
Next we’ll alter the job submission file to include command line arguments for our executable.
# HTCondor job submission example
executable = sleep.sh
arguments = "17 input.txt"
transfer_input_files = input.txt
log = sleep.log
output = sleep.out
error = sleep.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
queue
After running the job, we have the following in the output file.
[user@twoface ~]$ cat sleep.out
Sleeping for 17 seconds on twoface-c2.nmsu.edu
This is from the input file!
The arguments command specifies all of the arguments that the executable should be run with. If you don’t specify this, the executable will be run without any arguments. The transfer_input_files command will arrange for HTCondor to copy the named files to the execution host before running the job. You can specify multiple files for transfer_input_files by separating them with commas, for example “transfer_input_files = file1,file2,file3”.
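When the list of input files is long or generated by a script, the comma-separated value can be assembled rather than typed. The following is a minimal sketch; the file names and the join_with_commas helper are placeholders invented for this example.

```shell
#!/bin/bash
# Sketch: build a transfer_input_files line from a list of file names.
# join_with_commas and the file names are hypothetical.
join_with_commas() {
    # Join all arguments with commas; IFS is changed only inside the subshell.
    ( IFS=,; printf '%s\n' "$*" )
}

joined=$(join_with_commas params.txt data1.csv data2.csv)
transfer_line="transfer_input_files = $joined"
echo "$transfer_line"
```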
Example 3 – job arrays
The Slurm scheduler has a feature called job arrays, which lets you submit a single job that invokes many sub-jobs, all running the same executable with possibly different parameters. HTCondor can do the same thing and might even be more flexible.
In this example, the job submission file queues 6 jobs simultaneously, each running the echo.sh script below. The initialdir command causes each job to be run in a different directory. The variable $(Process) changes for each job that is created and ranges from 0 to 5.
#!/bin/bash
# echo.sh HTCondor example
PROC=$1
HOSTNAME=`/bin/hostname`
echo "Process $PROC on $HOSTNAME"
# HTCondor job submission example
executable = echo.sh
arguments = "$(Process)"
log = echo.log
output = echo.out
error = echo.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
initialdir = run$(Process)
queue 6
The one downside to using the initialdir command is that you must make sure all of the directories exist before submitting the job. The Bash script listed below is an easy way to precreate those directories.
for (( i=0; i<=5; i++ )) ; do
    mkdir run$i
done
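Once the six jobs have finished, each run directory holds its own echo.out. The sketch below shows one way the results might be gathered into a single file. The output files are faked in a temporary directory so the loop can be tried without submitting anything; the node name and the all.out file name are invented for this example.

```shell
#!/bin/bash
# Sketch: gather the per-job output files written under run0 ... run5.
# The files are faked here; "example-node" is a made-up host name.
workdir=$(mktemp -d)
for i in 0 1 2 3 4 5; do
    mkdir -p "$workdir/run$i"
    echo "Process $i on example-node" > "$workdir/run$i/echo.out"
done

# Concatenate the outputs in process order, noting any that are missing.
combined="$workdir/all.out"
: > "$combined"
for i in 0 1 2 3 4 5; do
    if [ -f "$workdir/run$i/echo.out" ]; then
        cat "$workdir/run$i/echo.out" >> "$combined"
    else
        echo "run$i: no output yet" >> "$combined"
    fi
done
cat "$combined"
```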
If you don’t want to create directories for each job, you can alternatively change the output file names to include the process number.
# HTCondor job submission example
executable = echo.sh
arguments = "$(Process)"
log = echo$(Process).log
output = echo$(Process).out
error = echo$(Process).err
queue 79
It is also possible to do arithmetic on variables inside of your job submission file. For example, if you want to run 1000 processes with the index starting at 3700 you might do something like this:
# HTCondor job submission example - arithmetic on variables
executable = echo.sh
MyIndex = $(Process) + 3700
# $INT() evaluates the expression and substitutes the integer result
arguments = "$INT(MyIndex)"
queue 1000
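To make the resulting range concrete, the same arithmetic can be worked through in the shell. This sketch only mirrors the calculation; it does not talk to HTCondor, and the variable names are chosen for this example.

```shell
#!/bin/bash
# Sketch: the index values produced by adding 3700 to $(Process) for 1000 jobs.
offset=3700
count=1000

first_index=$(( 0 + offset ))            # Process 0
last_index=$(( (count - 1) + offset ))   # Process 999

echo "Process 0 -> $first_index"
echo "Process $((count - 1)) -> $last_index"
```

So the 1000 jobs receive indices 3700 through 4699 inclusive.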
Example 4 – Job Requirements
Inside of your job submission file you can specify a large number of options to ensure your job gets the resources it needs.
# HTCondor job submission example
executable = echo.sh
arguments = "$(Process)"
request_memory = 1024MB
request_cpus = 4
log = echo.log
output = echo.out
error = echo.err
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
queue
This job will request 4 CPUs and 1GB of memory.
Example 5 – Using the Software Library
Because most of the software requires that you issue a module command, we need to provide a shell script to load the module and then run our job. Since we’re using a shell script as a wrapper, we’ll need to use the Vanilla universe.
In this example we’ll use GNU Octave, with all of our Octave commands saved into a .m file (input.m).
#!/bin/bash -l
#The shell needs to be a login shell for the module command to work right
#wrapper script to load Octave module
module load octave/400
#Disable octave history file (-H) and ini files (-f) -- they're useless under condor
octave -f -H $*
#HTCondor job submission example - wrapper scripts
universe = vanilla
executable = octave.sh
arguments = input.m
transfer_input_files = input.m
#octave can make use of multiple CPUs so request 4 CPUs and 4GB RAM.
request_cpus = 4
request_memory = 4GB
log = octave.log
error = octave.err
output = octave.out
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
Additional Information on job submission
The full set of options for a job submission file is documented in the condor_submit manual page.
HTCondor offers many different types of environments; each of these environments is called a Universe. The default universe for Twoface is the vanilla universe.
Presently, we support the Standard, Vanilla, and Java universes. We plan to add support for Parallel, VM, and Docker universe jobs in the future. There are other universes built into HTCondor, but at this time we do not plan to support them.
Choosing a Universe
The default universe on Twoface is the Vanilla universe.
The Standard Universe has many desirable features, including the ability to automatically checkpoint your job. You must recompile your job with the condor_compile command in order to use the Standard Universe. There are several restrictions on what a program can do in the Standard Universe; please see the official documentation section 188.8.131.52 for details.
The Vanilla Universe is the default choice on Twoface. It is ideal for jobs that require shell scripts or cannot be recompiled.
The Java Universe makes running Java code much easier. HTCondor will take care of finding the JVM on the compute node, and setting things like the Classpath for you automatically.
The Parallel, VM, and Docker Universes are not presently available. They are targeted for Spring 2016 availability and will be documented once they are set up.
You specify the job universe with the universe command in your submit file.
# HTCondor example submit file
#choices for universe are vanilla, standard, java
universe = vanilla
Executable = myprogram
Queue
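For the Java universe, the submit file looks slightly different. The sketch below writes one for a compiled class; Hello.class, the log and output file names, and the submit.java file name are all hypothetical. It assumes the usual Java universe convention in which the class file is given as the executable and the first argument names the class containing main(), so check the condor_submit manual page for the details that apply to the installed version.

```shell
#!/bin/bash
# Sketch: write a Java universe submit file for a hypothetical Hello.class.
workdir=$(mktemp -d)
cat > "$workdir/submit.java" <<'EOF'
# HTCondor example submit file - Java universe
universe = java
# The compiled class file is given as the executable
executable = Hello.class
# The first argument names the class that contains main()
arguments = Hello
log = hello.log
output = hello.out
error = hello.err
queue
EOF
cat "$workdir/submit.java"
```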
Compiling for the Standard Universe
Recompiling your code to work with HTCondor is very easy: you simply prefix all of your compile commands with condor_compile. Suppose you have a C version of Hello World. Normally we’d compile this with something like “gcc -o hello hello.c”. Instead we wrap the gcc command with condor_compile, as in the example below.
[user@twoface ~]$ condor_compile gcc -o hello hello.c
LINKING FOR CONDOR : /usr/bin/ld -L/usr/lib64/condor -Bstatic --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -o hello /usr/lib64/condor/condor_rt0.o /usr/lib/gcc/x86_64-redhat-linux/4.8.3/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.8.3/crtbeginT.o -L/usr/lib64/condor -L/usr/lib/gcc/x86_64-redhat-linux/4.8.3 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.3/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.3/../../.. /tmp/cctnaRXI.o /usr/lib64/condor/libcondorsyscall.a /usr/lib64/condor/libcondor_z.a /usr/lib64/condor/libcomp_libstdc++.a /usr/lib64/condor/libcomp_libgcc.a /usr/lib64/condor/libcomp_libgcc_eh.a -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /usr/lib64/condor/libcomp_libgcc.a /usr/lib64/condor/libcomp_libgcc_eh.a /usr/lib/gcc/x86_64-redhat-linux/4.8.3/crtend.o /usr/lib/gcc/x86_64-redhat-linux/4.8.3/../../../../lib64/crtn.o
The condor_compile command will automatically add the libraries necessary for your program to support checkpointing. It will also force your program to be compiled as a static executable.
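With the relinked hello binary in place, a submit file for the Standard Universe might look like the one written by the sketch below. The log and output file names and the submit.standard file name are assumptions for this example; only the universe and executable lines follow directly from the text above.

```shell
#!/bin/bash
# Sketch: write a Standard Universe submit file for the relinked hello binary.
# File names other than "hello" are placeholders.
workdir=$(mktemp -d)
cat > "$workdir/submit.standard" <<'EOF'
# HTCondor example submit file - Standard universe
universe = standard
executable = hello
log = hello.log
output = hello.out
error = hello.err
queue
EOF

# Only report how to submit; nothing is sent to the scheduler here.
if command -v condor_submit >/dev/null 2>&1; then
    echo "Submit with: condor_submit $workdir/submit.standard"
else
    echo "condor_submit not found; submit file written to $workdir/submit.standard"
fi
```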