hexfarm.rutgers.edu More Hints
Here are more detailed hints on using the HEX Farm. Read the
overview first.
Send comments/suggestions to jacques@physics.rutgers.edu.
Description:
- Kerberos is installed on hexfm1 for outgoing connections to Fermilab
(not for connections in the other direction).
- Hexfm1 acts as a software construction node; CDF users, for example,
should do the softreltools gmake there.
- The worker nodes are on a private network, so they can be reached only
from the portal nodes hexfm1, hexcaf, and hexsam.
- All nodes are dual-CPU machines running Fermi Linux.
- The website top page has links that show CPU and network loads.
- To get a login account, send email to Pieter Jacques
(jacques@physics.rutgers.edu). Do the same to request a caf/fbsng queue.
(Caf users do not require a login account.)
- Caf jobs can write to the OTHER areas of the scratch disks, and owners
of login accounts can then access those files; see below.
Data disks:
Batch Jobs
- The execute nodes should be used via hexcaf or the fbsng batch system.
- Hexcaf currently requires a Fermilab kerberos principal (automatic
for Fermilab experimenters; talk to Pieter Jacques if not a
Fermilab experimenter) and the use of some CDF software.
- Fbsng requires neither kerberos nor any CDF software.
- A big advantage of hexcaf/fbsng (fbsng underlies hexcaf) is its ability
to split a job automatically into multiple sections.
- The links to the instructions on using hexcaf and on using fbsng are on
the web page above this one, along with the link to the monitoring pages
for jobs in hexcaf.
Non-Batch Use of Worker Nodes
For those unable to access hexcaf or fbsng, a convenient way to use the
worker nodes (node1, node2, etc.) is to write a script that can be
"submitted" as though it were going to a batch queue, with its output
captured in a log file that is useful for debugging. For example, if
batch.csh is your batch job script, "submit" it to node1 with
ssh -q node1 "./batch.csh arg1 arg2 ... >&! batch1.log &"
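This writes the log file batch1.log in your top (home) area and leaves
your interactive process on hexfm1 free to submit the next batch job
(batch2.log). Here >&! is csh syntax that redirects both standard output
and standard error to the file, overwriting it even if noclobber is set,
and the trailing & runs the job in the background on the node.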
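The same pattern can be looped over several nodes. The following is only
a sketch under stated assumptions: the node list and the arguments are
placeholders, build_cmd and submit are hypothetical helper names, and the
login shell on the worker nodes is assumed to be csh or tcsh (which is why
the quoted remote command keeps the >&! redirection). The loop itself is
plain sh. By default it only prints the ssh commands; set DRY_RUN=0 to
run them.

```shell
#!/bin/sh
# Sketch: "submit" one job per worker node using the ssh pattern above.
# Assumptions: node names and arguments are placeholders; batch.csh is
# your own job script in your home area on the farm.

build_cmd() {
    # $1 = job index; remaining arguments are passed on to batch.csh.
    # The result is csh syntax, to be interpreted on the worker node.
    n=$1
    shift
    printf './batch.csh %s >&! batch%s.log &' "$*" "$n"
}

submit() {
    # $1 = node name, $2 = job index, remaining = batch.csh arguments.
    node=$1
    idx=$2
    shift 2
    cmd=$(build_cmd "$idx" "$@")
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Dry run: show the command instead of executing it.
        printf 'ssh -q %s "%s"\n' "$node" "$cmd"
    else
        ssh -q "$node" "$cmd"
    fi
}

i=1
for node in node1 node2 node3; do
    submit "$node" "$i" arg1 arg2
    i=$((i + 1))
done
```

Each job gets its own log file (batch1.log, batch2.log, ...), so the logs
of different nodes do not overwrite one another.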
For an automated CDF/ac++ version of this procedure, see
~watts/bin/rsub.csh. For example,
rsub node4 job15 path
executes the command "path/job15.exe path/job15.tcl" on node4 and writes
the log file path/job15.log_N, where N advances automatically (1, 2, 3,
...) on repeated rsub calls.
Running rsub without arguments prints some instructions.
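To illustrate the advancing log_N suffix, here is one way such numbering
could be done. This is not taken from rsub.csh, whose implementation is
not shown here; next_log is a hypothetical helper that simply picks the
first N for which path/job.log_N does not yet exist.

```shell
#!/bin/sh
# Sketch only: choose the first unused log name of the form DIR/JOB.log_N.
# This is an assumed scheme, not the actual logic of rsub.csh.

next_log() {
    # $1 = directory, $2 = job name
    n=1
    while [ -e "$1/$2.log_$n" ]; do
        n=$((n + 1))
    done
    printf '%s\n' "$1/$2.log_$n"
}

# Demonstration in a throwaway directory.
demo=$(mktemp -d)
first=$(next_log "$demo" job15)
: > "$first"                      # pretend the first run wrote its log
second=$(next_log "$demo" job15)
echo "first:  ${first##*/}"
echo "second: ${second##*/}"
rm -rf "$demo"
```

If earlier logs have been deleted, this scheme reuses the gap; a version
that always advances past the highest existing N would behave differently.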
Note that your usual CDF software setup is required inside batch.csh,
since ssh performs a new login on the node and does not carry over the
environment of your interactive session.
Jobs executed by this non-batch method interfere with those in the batch
queues, because the batch system does not monitor the load on the worker
nodes. The batch system assumes it has sole control of the CPU cycles on
the worker nodes.
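Because of this, it may help to look at a node's load before using it
non-batch. The sketch below is a suggestion, not part of the farm's
tooling: load1 and node_is_idle are hypothetical helpers, the 0.5
threshold is an arbitrary choice for a dual-CPU node, and the check
cannot stop the batch system from starting a job on the node afterwards.

```shell
#!/bin/sh
# Sketch: decide from `uptime` output whether a node looks idle.
# Assumption: uptime prints "load average: a, b, c" as on Linux.

load1() {
    # Read uptime output on stdin; print the 1-minute load average.
    sed -n 's/.*load average[s]*: *\([0-9.]*\).*/\1/p'
}

node_is_idle() {
    # $1 = one line of uptime output, $2 = load threshold.
    # Succeeds only if a load value was found and is below the threshold.
    load=$(printf '%s\n' "$1" | load1)
    [ -n "$load" ] && awk -v l="$load" -v t="$2" 'BEGIN { exit !(l + 0 < t + 0) }'
}

# Demonstration on a sample line (no network access needed).
sample=' 10:00:00 up 1 day,  2 users,  load average: 0.15, 0.10, 0.05'
if node_is_idle "$sample" 0.5; then
    echo "node looks idle"
else
    echo "node looks busy"
fi

# On the farm one might use (untested here):
#   node_is_idle "$(ssh -q node1 uptime)" 0.5 && echo "node1 looks idle"
```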