HTCondor job submission best practices

The HEX group computer farm currently has CPU resources that allow 632 jobs to run simultaneously. Each job runs in a job "slot", and once a job is started in a slot it will continue to run in that slot until it completes or is manually terminated. As jobs finish, the HTCondor job scheduler picks jobs from the queue of waiting jobs and starts them running.

The scheduler selects jobs based upon the priority level of each user. Initially, all users have the same default priority. As jobs are run for a given user, that user's priority decreases as an exponential function of the total amount of time the user's jobs have been running. Once a user has no jobs running, the user's priority will begin to increase and eventually return to the original default value.

Users should observe the following points to allow fair and timely access to HEX farm resources by everyone:

  1. While the system can run 632 jobs simultaneously with each job getting full access to a single CPU, all I/O operations are handled by 2 file server systems. Please be aware of the I/O requirements of your jobs, as it is quite possible to saturate the file servers by submitting a large number of I/O-limited jobs at the same time. If you need to run a large number of I/O-limited jobs, please see the section below on how to set a limit on the number of jobs allowed to run simultaneously. (It is not always easy to determine whether a job is I/O-limited. If you have any questions about this, you should check with the HEX farm managers.)
  2. As mentioned above, once a job starts running in a job slot it will continue to occupy that slot until it has finished. As long as jobs don't run too long this is not a problem. However, consider the case where there are no jobs running and a user submits 1000 jobs, each of which will run for 1 day. HTCondor will immediately assign job slots for 632 of those jobs, and after that those slots will be unavailable to other users for an entire day. This is likely to make the submitter of the 1000 jobs very unpopular. If you can't structure your jobs so that individual jobs don't run more than 1 hour, please refer to the section below on limiting the number of jobs allowed to run simultaneously.

Limiting the number of jobs that run simultaneously

As mentioned above, there are circumstances in which a user should limit the number of jobs running simultaneously. This can be done by submitting a limited number of jobs, waiting for that batch to complete, then submitting another batch. An easier solution, however, is to use the max_materialize option. To limit the number of your jobs that can run simultaneously, add this line to your job control file (the submit description file you pass to condor_submit):

max_materialize = N

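Here N is the maximum number of jobs from that submission that will be in the queue (running or idle) at any one time; HTCondor adds further jobs to the queue only as earlier ones finish. A good starting point is to use N=150.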
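
As an illustration, a complete job control file using this option might look like the sketch below. The executable name, argument, and output file names are hypothetical placeholders, not HEX farm conventions; only the max_materialize line comes from the guidance above. The file queues 1000 jobs but allows at most 150 of them to be in the queue at once.

```
# Hypothetical example: replace the executable and file names with your own.
executable      = run_analysis.sh
arguments       = $(ProcId)

# One output/error file per job; a single shared log for the whole batch.
output          = job_$(ProcId).out
error           = job_$(ProcId).err
log             = jobs.log

# Allow at most 150 of these jobs to be in the queue at any one time.
max_materialize = 150

# Request 1000 jobs in total; they enter the queue as earlier ones finish.
queue 1000
```

Submit the file with condor_submit as usual. With this setting the 1000-job example above would occupy at most 150 of the 632 slots at a time, leaving the rest available to other users.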