Submitting jobs to the HEX Computer Farm

Introduction

The HEX Computer Farm contains over 50 worker nodes with over 100 central processing units. Jobs are submitted to these machines using the Condor batch system.

To make good use of the HEX farm, you must be able to break your overall computing task into a large number of sub-tasks, each of which can execute independently of the others. Ideally, each sub-task should require at least 15 minutes of CPU time. With shorter sub-tasks, the system may spend more time on the overhead of setting up and scheduling tasks than on executing them. On the other hand, sub-tasks should ideally require no more than a few hours of CPU time; otherwise the scheduling algorithms may be unable to give all users a fair share of access.
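
As a rough illustration of this sizing guideline, the following bash sketch computes how many work items to group into each sub-task so that a sub-task runs for about a chosen target time. The function name and the figures (10000 items at 6 seconds each, with a 30-minute target) are hypothetical and are not measurements from the HEX farm.

```bash
#!/bin/bash
# Hypothetical helper: choose a batch size so each sub-task runs ~TARGET seconds.
# Usage: plan_subtasks TOTAL_ITEMS SECONDS_PER_ITEM TARGET_SECONDS
plan_subtasks() {
    local total="$1"
    local per_item="$2"
    local target="$3"
    # Items per sub-task (integer division), but never fewer than 1
    local batch=$(( target / per_item ))
    if [ "$batch" -lt 1 ]; then
        batch=1
    fi
    # Number of sub-tasks, rounded up so that every item is covered
    local subtasks=$(( (total + batch - 1) / batch ))
    echo "items_per_subtask=$batch subtasks=$subtasks"
}

# Assumed figures: 10000 items, 6 s each, 30-minute (1800 s) target
plan_subtasks 10000 6 1800
```

With these assumed figures the sketch prints "items_per_subtask=300 subtasks=34", so 34 sub-tasks of about 30 minutes each would cover the whole task.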

An example of job submission

To use the farm, you must prepare two files:

  1. An executable file, which the Condor system will schedule for execution on the worker nodes. This can be either a script, or an executable binary created by a compiler/linker.
  2. A job control file, which specifies various details of the job to the Condor system.

As an example, suppose we wish to test each of the numbers from 0 to (n-1) to see whether it is prime. Here is a bash script which will (not very efficiently) determine whether its argument is a prime number:

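#!/bin/bash
# Determine whether the argument is a prime number (simple trial division)
num=$1
# 0 and 1 are not prime, and the loop below would not catch them
if [ "$num" -lt 2 ] ; then
    echo "$num is not prime."
    exit
fi
# No divisor other than num itself can exceed num/2
maxd=$((num/2))
div=2
while [ "$div" -le "$maxd" ] ; do
    # Remainder of num divided by the candidate divisor
    rmd=$((num % div))
    if [ "$rmd" -eq 0 ] ; then
        echo "$num is not prime.  Divisor is $div"
        exit
    fi
    div=$((div+1))
done
echo "$num is prime"
exit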

Create a sub-directory /home/username/condortest and copy this script to the file "prime.bash" in that directory. This is the executable file. Make certain that "prime.bash" has the "execute" file permission set!
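
The directory setup just described can be sketched as a small bash function. This is only one way to do it; the function name setup_job_dir is hypothetical, and the final test simply confirms that the copied script is executable.

```bash
#!/bin/bash
# Hypothetical helper: create the job directory, copy the script into it,
# and set the execute permission on the copy.
# Usage: setup_job_dir JOB_DIR SCRIPT
setup_job_dir() {
    local dir="$1"
    local script="$2"
    local name
    name=$(basename "$script")
    mkdir -p "$dir" || return 1
    cp "$script" "$dir/" || return 1
    chmod +x "$dir/$name" || return 1
    # Succeed only if the copy really is executable
    [ -x "$dir/$name" ]
}
```

For the example in this section, with "prime.bash" in the current directory, the call would be: setup_job_dir /home/username/condortest prime.bash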

Here is the job control file:

universe = vanilla
initialdir = /home/username/condortest
error = /home/username/condortest/prime$(Process).error
log = /home/username/condortest/prime$(Process).log
output = /home/username/condortest/prime$(Process).out
executable = prime.bash
arguments = $(Process)
queue 30

Copy this file to "prime.jcl" in the "condortest" sub-directory.
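
If you prefer to generate the job control file from the shell, for example to change the directory or the number of sub-tasks, a sketch along the following lines may help. The function name write_jcl is hypothetical; the lines it writes are the same as those of the job control file above, with the directory and the queue count taken from the arguments.

```bash
#!/bin/bash
# Hypothetical helper: write a job control file like the one shown above.
# Usage: write_jcl JOB_DIR COUNT
write_jcl() {
    local dir="$1"
    local count="$2"
    # The backslash keeps "$(Process)" literal, so that Condor expands it
    # at submission instead of the shell expanding it here.
    cat > "$dir/prime.jcl" <<EOF
universe = vanilla
initialdir = $dir
error = $dir/prime\$(Process).error
log = $dir/prime\$(Process).log
output = $dir/prime\$(Process).out
executable = prime.bash
arguments = \$(Process)
queue $count
EOF
}
```

For the example in this section, the call would be: write_jcl /home/username/condortest 30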

Here is an explanation of the lines in "prime.jcl":

universe = vanilla
  Condor has several job submission paradigms, called universes. We use only "vanilla".
initialdir
  The directory in which execution will start.
error, log, and output
  Files for error, log, and output information. Note the "$(Process)" construction in these file names. At submission, Condor replaces it with the number of the sub-task, so a separate file is created for each sub-task.
executable
  The name of the file to execute.
arguments
  Arguments to pass to the program. Here only the sub-task number is passed, and this is the number that the prime.bash script tests for primality.
queue 30
  The number of sub-tasks to submit, 30 for this test. $(Process) will range from 0 to 29.
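
To make the "$(Process)" substitution concrete, the following bash loop prints the values that each sub-task would receive. It is an illustration only: it does not call Condor, and the function name show_expansion is hypothetical.

```bash
#!/bin/bash
# Illustration only: mimic how $(Process) is filled in for each sub-task.
# This loop does not involve Condor; it just prints the expanded values.
# Usage: show_expansion COUNT
show_expansion() {
    local count="$1"
    local p=0
    while [ "$p" -lt "$count" ]; do
        echo "sub-task $p: arguments=$p output=prime$p.out error=prime$p.error"
        p=$(( p + 1 ))
    done
}

# Print the first three sub-tasks (numbers 0, 1 and 2)
show_expansion 3
```

Calling show_expansion 30 lists all of the sub-tasks in the example, ending with number 29.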

Now set up Condor by entering

source /condor/current/setup/condor-setup.(c)sh

choosing the ".sh" or ".csh" variant to match your login shell. You must do this once after you log in, before any of the Condor commands will work. If you wish, you can place this command in your .login or .(c)shrc file.
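
A quick way to see whether the setup has taken effect in the current shell is to check that the Condor commands can be found. The sketch below is one possible check; the function name check_commands is hypothetical, and it only looks for the three commands used in this section.

```bash
#!/bin/bash
# Hypothetical check: report which of the given commands are not on the PATH.
# Usage: check_commands CMD...
# Returns 0 if all commands were found, 1 otherwise.
check_commands() {
    local cmd
    local missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" > /dev/null 2>&1; then
            echo "not found: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# The Condor commands used in this section
check_commands condor_submit condor_q condor_rm || echo "Condor is not set up in this shell"
```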

Next, submit the job by entering

condor_submit prime.jcl

To view the Condor job queue, enter

condor_q

To view only your own jobs, enter

condor_q username

Initially, all of your jobs will have status "I" (idle). Once the Condor system schedules them for execution, this will change to "R" (running). Even if nodes are available to run your jobs, it may take 5 to 10 minutes for the Condor system to complete the scheduling.

Notice that "condor_q" will return a Condor job number of the form "NNNNN.MM", where "NNNNN" will be the same for all of the jobs just submitted, and "MM" will range from 0 to 29 (for this test example).
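
If you handle job numbers in your own scripts, the two parts can be separated with bash parameter expansion. In the sketch below, the function name split_job_id and the job number 12345.7 are made up for illustration; "cluster" and "process" are the names Condor uses for the "NNNNN" and "MM" parts.

```bash
#!/bin/bash
# Split a job number of the form NNNNN.MM into its two parts.
# Usage: split_job_id JOB_ID
split_job_id() {
    local id="$1"
    local cluster="${id%%.*}"   # text before the first dot (NNNNN)
    local process="${id#*.}"    # text after the first dot (MM)
    echo "cluster=$cluster process=$process"
}

# Made-up job number for illustration
split_job_id 12345.7
```

For this made-up job number the sketch prints "cluster=12345 process=7".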

To remove jobs, enter

condor_rm NNNNN

to remove all the jobs just submitted, or

condor_rm NNNNN.MM

to remove a single job from the group of 30 that were just submitted.

Once all 30 sub-tasks have finished executing, you will have 30 output files in /home/username/condortest named "prime0.out" through "prime29.out".
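
Once the jobs have finished, you may want to confirm that every output file is present and collect the results. The following bash sketch does this for the prime example; the function name summarize_outputs is hypothetical, and it assumes the output lines have the form printed by prime.bash ("N is prime" or "N is not prime ...").

```bash
#!/bin/bash
# Hypothetical helper: check that every expected output file exists and
# count how many numbers were reported as prime.
# Usage: summarize_outputs JOB_DIR COUNT
summarize_outputs() {
    local dir="$1"
    local count="$2"
    local p=0
    local missing=0
    local primes=0
    local f
    while [ "$p" -lt "$count" ]; do
        f="$dir/prime$p.out"
        if [ ! -f "$f" ]; then
            echo "missing: prime$p.out"
            missing=$(( missing + 1 ))
        elif grep -q "is prime" "$f"; then
            # "is not prime" does not contain "is prime", so it is not counted
            primes=$(( primes + 1 ))
        fi
        p=$(( p + 1 ))
    done
    echo "primes=$primes missing=$missing"
}
```

For the example in this section, the call would be: summarize_outputs /home/username/condortest 30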

Job queues

Information on the available job queues.