Submitting jobs to the HEX Computer Farm

Introduction

The HEX Computer Farm is a collection of compute nodes with the effective computing power of around 1400 central processing units. Jobs are submitted to these machines using the HTCondor batch system.

In order to make good use of the power of the HEX farm, it must be possible to break down your overall computing task into sub-tasks, each of which can be executed independently of the others.

Prerequisites

Set up HTCondor by entering

source /condor/HTCondor/alma8/condor.sh
You must do this once after you log in, before any of the HTCondor commands will work. If you wish, follow the instructions on this page to have this set up automatically upon each login.
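
For example, assuming your login shell is bash (the page mentioned above describes the recommended way to set this up), one simple approach is to append the setup line to your ~/.bashrc:

echo 'source /condor/HTCondor/alma8/condor.sh' >> ~/.bashrc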

Computing resources available on the batch nodes

To view available resources, you can use the following command that is specific to our instance of HTCondor. It is similar to condor_status -compact but provides some additional formatting.

ru_condor_status
This command also lists how many CPUs are currently unused, how many are in use by local user jobs, and how many are currently running jobs submitted by remote users (e.g. CMS CRAB jobs submitted at other sites that run using our resources). When our queues fill up, additional local user submissions cause this last category of jobs to be preempted, so that we always have priority for using our resources.

Note that typical user jobs are allocated a single CPU thread and, on average, 4 GB of memory per job. If you require more resources than that per job, you can specify RequestMemory = X or RequestCpus = Y in your .jdl file. Our worker nodes have up to 64 CPU threads on a single machine, so Y must be smaller than that, and the larger the resource request, the fewer jobs can run at once. Note also that such multithreaded jobs are only beneficial in specific use cases; it is usually better to run many single-threaded jobs than a smaller number of multithreaded ones.
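
For example, adding the following two lines to a .jdl file would request 4 CPU threads and 8 GB of memory for each job (the values are purely illustrative; a bare number for RequestMemory is interpreted as MB):

RequestCpus = 4
RequestMemory = 8192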

To view all running jobs and the resources they are using, you can use the following command that is specific to our instance of HTCondor (and is similar to condor_q -run -all).

ru_condor_q_run

To view the entire condor queue, you can use condor_q -all or ru_condor_q (which additionally breaks things down into CPU vs. GPU jobs). You can optionally add the -nobatch flag to see the command that each job will run.

An example of job submission

To use the farm, you must prepare two files:

  1. An executable file, which the HTCondor system will schedule for execution on the worker nodes. This can be either a script, or an executable binary created by a compiler/linker.
  2. A job description file, which specifies various details of the job to the HTCondor system.

As an example, suppose we wish to test each of the numbers from 0 to (n-1) to see whether it is prime. Here is a bash script which will (not very efficiently) determine whether its argument is a prime number:

#!/bin/bash
# Determine if argument is a prime number
num=$1
maxd=$((num / 2))
div=2
while [ "$div" -le "$maxd" ] ; do
    dvd=$((num / div))
    rmd=$((num - dvd * div))
    if [ "$rmd" -eq 0 ] ; then
        echo "$num is not prime. Divisor is $div"
        exit
    fi
    div=$((div + 1))
done
echo "$num is prime"
exit

To use this script with the HEX farm, first create a directory /home/username/condortest and copy this script to the file "prime.bash" in that directory. This is the executable file. Be sure that "prime.bash" has the "execute" file permission set! Use chmod u+x prime.bash to do so.
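
For example (with "username" standing in for your own account name):

mkdir -p /home/username/condortest
cd /home/username/condortest
# create prime.bash here with the contents shown above, then:
chmod u+x prime.bash
./prime.bash 7     # quick local test; should print "7 is prime"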

Here is the job description file:

universe = vanilla
error = /home/username/condortest/prime$(Process).error
log = /home/username/condortest/prime$(Process).log
output = /home/username/condortest/prime$(Process).out
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
executable = prime.bash
arguments = $(Process)
queue 30

Copy this into a file named "prime.jdl" in the "condortest" sub-directory.

Here is an explanation of the lines in "prime.jdl":

universe = vanilla
HTCondor has several job submission paradigms. We use only "vanilla".
error, log, and output
Files for error, log, and output information. Note the "$(Process)" construction used in the specification of these files. HTCondor replaces it at submission time with the actual number of the sub-task, so a separate file is created for each sub-task.
should_transfer_files and when_to_transfer_output
These make the job run with a scratch area on the worker node (with a size of at least 14 GB per job on average) as its initial working directory; the output and error files are copied back to your local working area at the end of the job.
executable
The name of the file to execute.
arguments
Arguments to pass to the program. In this case, just the sub-task number is passed as an argument, which will be the number tested for primeness by the prime.bash script.
queue 30
30 is the number of sub-tasks to be submitted for this test. $(Process) will range from 0 to 29; for example, the sub-task with $(Process) = 5 runs prime.bash 5 and writes its results to prime5.out, prime5.error, and prime5.log.

Next, submit the job by entering

condor_submit prime.jdl

To view the HTCondor job queue, enter

condor_q -all
To view only your own jobs, enter
condor_q username

Initially, all of your jobs will have status "I" (idle). Once the HTCondor system schedules them for execution, this will change to "R" (running). Even if there are slots available to run your jobs, it may take a few minutes for the HTCondor system to complete the scheduling.

Notice that "condor_q -all" will return an HTCondor job number of the form "NNNNN.MM", where "NNNNN" will be the same for all of the jobs just submitted, and "MM" will range from 0 to 29 (for this test example).

To watch your jobs' output and error files as they are being written on the worker node, you can make use of

condor_tail NNNNN.MM (standard output), condor_tail -stderr NNNNN.MM (standard error), and condor_ssh_to_job NNNNN.MM (an interactive shell in the job's working directory on the worker node)

To remove jobs, enter

condor_rm NNNNN
to remove all the jobs just submitted, or
condor_rm NNNNN.MM
to remove a single job from the group of 30 which were just submitted.

Once all 30 sub-tasks have finished executing, you will have 30 output files in /home/username/condortest named "prime0.out" through "prime29.out".
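
As a quick check of the results, you can, for example, run the following from the "condortest" directory to list which of the 30 numbers were reported as prime:

grep "is prime" prime*.out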