The HEX Computer Farm is a collection of compute nodes with the effective computing power of around 1400 central processing units. Jobs are submitted to these machines using the HTCondor batch system.
In order to make good use of the power of the HEX farm, it must be possible to break down your overall computing task into sub-tasks, each of which can be executed independently of the others.
Set up HTCondor by entering

source /condor/HTCondor/alma8/condor.sh

You must do this once after you log in, before any of the HTCondor commands will work. If you wish, follow the instructions on this page to have this set up automatically upon each login.
To view available resources, you can use the following command that is specific to our instance of HTCondor. It is similar to condor_status -compact but provides some additional formatting.
ru_condor_status

This command also lists how many CPUs are currently unused, how many are in use by local user jobs, and how many are currently running jobs submitted by remote users (e.g. CMS CRAB jobs submitted at other sites that run using our resources). When our queues fill up, additional local user submissions cause this last category of jobs to be preempted, so that we always have priority for using our resources.
Note that typical user jobs have a single CPU thread allocated to them, and an average memory of 4 GB per job. If you require more resources than that per job, you can specify RequestMemory = X or RequestCpus = Y in your .jdl file. Our worker nodes have up to 64 CPU threads on a single machine, so Y must be smaller than that, and the larger the resource request is, the fewer jobs that can run at once. It is also important to note that such multithreaded jobs are only beneficial in specific use cases, and one would typically benefit from running many more single-threaded jobs rather than a smaller number of multithreaded ones.
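For example, a job description (.jdl) fragment requesting larger resources might look like the following (the values here are illustrative; RequestMemory is specified in MB, so 8192 corresponds to 8 GB):

```
RequestCpus = 4
RequestMemory = 8192
```

A job submitted with these settings would occupy 4 CPU threads and 8 GB of memory on a worker node, correspondingly reducing the number of other jobs that can run there at the same time.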
To view all running jobs and the resources they are using, you can use the following command that is specific to our instance of HTCondor (and is similar to condor_q -run -all).
ru_condor_q_run
To view the entire condor queue, you can use condor_q -all or ru_condor_q (which additionally breaks things down into CPU vs. GPU jobs). You can optionally add the -nobatch flag to see the command that each job will run.
To use the farm, you must prepare two files: an executable (the script or program each job will run) and a job description (.jdl) file that tells HTCondor how to run it.
As an example, suppose we wish to test each of the numbers from 0 to (n-1) to see whether it is prime. Here is a bash script which will (not very efficiently) determine whether its argument is a prime number:
#!/bin/bash
# Determine if argument is a prime number
num=$1
maxd=$((num/2))
div=2
while [ $((div<=maxd)) = 1 ] ; do
    dvd=$((num/div))
    rmd=$((num-dvd*div))
    if [ $rmd = 0 ] ; then
        echo $num "is not prime. Divisor is "$div
        exit
    fi
    div=$((div+1))
done
echo $num "is prime"
exit
To use this script with the HEX farm, first create a directory /home/username/condortest and copy this script to the file "prime.bash" in that directory. This is the executable file. Be sure that "prime.bash" has the "execute" file permission set! Use chmod u+x prime.bash to do so.
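Before submitting 30 jobs, it is worth testing the executable locally. A minimal sketch (it recreates the script from above in a temporary directory so that it is self-contained; on the farm you would simply run prime.bash in your condortest directory):

```shell
#!/bin/bash
# Work in a scratch directory so this sketch does not touch real files
cd "$(mktemp -d)"

# Recreate the prime-testing script from above
cat > prime.bash <<'EOF'
#!/bin/bash
num=$1
maxd=$((num/2))
div=2
while [ $((div<=maxd)) = 1 ] ; do
    dvd=$((num/div))
    rmd=$((num-dvd*div))
    if [ $rmd = 0 ] ; then
        echo $num "is not prime. Divisor is "$div
        exit
    fi
    div=$((div+1))
done
echo $num "is prime"
exit
EOF

# Set the "execute" permission, as required before submission
chmod u+x prime.bash

./prime.bash 13    # prints "13 is prime"
./prime.bash 15    # prints "15 is not prime. Divisor is 3"
```

If the script behaves as expected on a few test arguments, it is ready to be used as the executable in a submit file.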
Here is the job description file:
universe = vanilla
error = /home/username/condortest/prime$(Process).error
log = /home/username/condortest/prime$(Process).log
output = /home/username/condortest/prime$(Process).out
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
executable = prime.bash
arguments = $(Process)
queue 30
Copy this into a file named "prime.jdl" in the "condortest" sub-directory.
Here is an explanation of the lines in "prime.jdl":

universe = vanilla: run in the standard ("vanilla") HTCondor universe, suitable for ordinary executables.
error, log, output: the files to which HTCondor writes each job's standard error, its own job log, and the job's standard output. $(Process) expands to the job's process number (0 through 29 here), so each job gets its own set of files.
should_transfer_files = YES, when_to_transfer_output = ON_EXIT_OR_EVICT: have HTCondor transfer files between the submit machine and the worker node, returning the output when the job exits or is evicted.
executable = prime.bash: the script each job runs.
arguments = $(Process): the argument passed to the script, so job number MM tests whether MM is prime.
queue 30: submit 30 jobs, with $(Process) ranging from 0 to 29.
Next, submit the job by entering
condor_submit prime.jdl
To view the HTCondor job queue, enter
condor_q -all

To view only your own jobs, enter
condor_q username
Initially, all of your jobs will have status "I". Once the HTCondor system schedules them for execution this will change to "R". Even if there are available slots to run your jobs, it may take a few minutes for the HTCondor system to complete the scheduling.
Notice that "condor_q -all" will return an HTCondor job number of the form "NNNNN.MM", where "NNNNN" is the same for all of the jobs just submitted and "MM" ranges from 0 to 29 (for this test example).
To watch the log files of your jobs as they are being populated on the node, you can make use of
condor_tail NNNNN.MM, condor_tail -stderr NNNNN.MM, and condor_ssh_to_job NNNNN.MM
To remove jobs, enter
condor_rm NNNNN

to remove all the jobs just submitted, or

condor_rm NNNNN.MM

to remove a single job from the group of 30 which were just submitted.
Once all 30 sub-tasks have finished executing, you will have 30 output files in /home/username/condortest named "prime0.out" through "prime29.out".
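A quick way to summarize the results is to grep the output files. A minimal sketch (it creates a few stand-in output files in a temporary directory so that it runs anywhere; on the farm you would run the grep directly in /home/username/condortest):

```shell
#!/bin/bash
# Work in a scratch directory with stand-in output files
cd "$(mktemp -d)"
echo "4 is not prime. Divisor is 2" > prime4.out
echo "5 is prime" > prime5.out
echo "6 is not prime. Divisor is 2" > prime6.out

# Print only the numbers reported prime, in numeric order
grep -h "is prime$" prime*.out | sort -n
```

The "$" anchor keeps the "is not prime" lines out of the match, and -h suppresses the filename prefixes so that the output is just the list of primes.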