In addition to traditional CPU resources, the HEX farm has a handful of machines with GPUs, which can be more efficient for machine learning and other dedicated tasks (e.g. matrix multiplication). We currently have six GPUs integrated into the HEX farm: two that can be used interactively and four that can be accessed via the HTCondor batch system.
From a terminal session logged on to hexcms, you can connect to hexdl via ssh hexdl. This is a machine outfitted with two NVIDIA TITAN Xp GPUs, each with 12 GB of VRAM. These are somewhat older GPUs that are great for debugging, but they can also be used for production workflows if needed.
Once connected to hexdl, you can use the nvidia-smi command to see the GPU resources present on this machine, and their current utilization.
nvidia-smi

Mon Jul 28 12:52:49 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN Xp                Off |   00000000:01:00.0  On |                  N/A |
| 23%   27C    P8             10W /  250W |      59MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA TITAN Xp                Off |   00000000:02:00.0 Off |                  N/A |
| 31%   47C    P2             60W /  250W |   11617MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3363      G   /usr/libexec/Xorg                        39MiB |
|    0   N/A  N/A            3449      G   /usr/bin/gnome-shell                      7MiB |
|    1   N/A  N/A          872601      C   python3                               11612MiB |
+-----------------------------------------------------------------------------------------+

In this case, GPU 1 is occupied by a process that is using a lot of memory, but GPU 0 is not. If we were to try to use GPU 1, our process would likely be blocked from running--or we could inadvertently cause issues for the person who is already using GPU 1. We should plan to use GPU 0 in this case. Next, we need to set up the proper environment to access the needed libraries. One setup script provided by CERN that works out-of-the-box for many standalone purposes is:

source /cvmfs/sft.cern.ch/lcg/views/LCG_107_cuda/x86_64-el8-gcc11-opt/setup.sh
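As a quick sanity check (assuming the LCG view above has been sourced, and that it provides PyTorch, as recent LCG_cuda views do), you can confirm that the environment sees the GPUs:

python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'

On hexdl this should print True followed by the number of visible GPUs.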
In some scenarios, custom versions of libraries may be needed; in that case conda, a CMSSW venv, or a custom Apptainer container for older versions of CMSSW could be used. Contact the HEX farm managers if you think this is necessary for your usage, because such setups tend to be trickier.
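If you do go the conda route, a minimal sketch might look like the following (the environment name, Python version, and package choices are placeholders rather than a supported recipe, so coordinate with the farm managers first):

# illustrative only: create and activate a personal environment, then install what you need
conda create -n my-gpu-env python=3.11
conda activate my-gpu-env
pip install torch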
Next, we must prepare the script that we'll run. Two examples, one using PyTorch and one using TensorFlow, are below; you could save them to files named pytorch_example.py and tensorflow_example.py:
PyTorch example script:
# pytorch_example.py
# based on https://docs.pytorch.org/tutorials/beginner/examples_tensor/polynomial_tensor.html
import torch
import math
import os
print("CUDA_VISIBLE_DEVICES for torch", os.environ["CUDA_VISIBLE_DEVICES"])
dtype = torch.float
device = torch.device("cuda:0")
#device = torch.device("cpu") # uncomment this to run on CPU
try:
    print("using device:", torch.cuda.get_device_name(device))
except ValueError:
    print("using CPU only!")
# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)
# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(20000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)
    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()
    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d
print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')
TensorFlow example script:
# tensorflow_example.py
# based on https://www.tensorflow.org/tutorials/quickstart/beginner
import tensorflow as tf
import os
print("TensorFlow version:", tf.__version__)
print("CUDA_VISIBLE_DEVICES for TF", os.environ["CUDA_VISIBLE_DEVICES"])
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
tf.nn.softmax(predictions).numpy()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test, verbose=2)
To run this we should first do the following:
export CUDA_VISIBLE_DEVICES=0

to only expose GPU 0 to the script. If we wanted to only expose GPU 1, we would instead set this to 1. Then we can run the example via:
python3 pytorch_example.py or python3 tensorflow_example.py
While this runs, you can monitor the GPU usage via nvidia-smi.
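If you prefer a continuously refreshing view rather than rerunning the command by hand, something like the following (using the standard watch utility) updates the display every five seconds:

watch -n 5 nvidia-smi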
The rest of our GPU resources can only be used via the HTCondor batch system. We have two dedicated machines, each outfitted with one NVIDIA RTX 4080 SUPER and one NVIDIA RTX 4060 Ti; each of these GPUs has 16 GB of VRAM. In general, these GPUs are more powerful than those on hexdl. Moreover, the rest of the hardware on these machines (particularly the motherboard/CPU) is much newer than that on hexdl, so these machines tend to be much faster and are better suited to production workflows than to simple tests.
When possible, the 4080 SUPER is recommended for ML training purposes, and the 4060 Ti is recommended for inference. However, both GPUs are capable of either task.
To run a batch job using our GPU resources, you must first set up HTCondor as described in our condor documentation:

source /condor/HTCondor/alma8/condor.sh

Then prepare a job submission .jdl file as described on that page, and add the following lines before the line with the word queue:
RequestGPUs = 1
# For inference, uncomment:
#RequireGPUs = regexp("4060", DeviceName)
# For training, uncomment:
#RequireGPUs = regexp("4080", DeviceName)
If you don't care which GPU is used, you only need the "RequestGPUs" line. If you want to use a specific one, you can uncomment one of the other two lines.
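Putting it together, a minimal submit file might look like the sketch below; the executable, file names, and CPU/memory requests are placeholders to adapt to your own job, and only the GPU-related lines come from this page:

# gpu_example.jdl -- illustrative sketch, adapt to your job
universe                = vanilla
executable              = run_training.sh          # placeholder for your own script
output                  = job_$(Cluster)_$(Process).out
error                   = job_$(Cluster)_$(Process).err
log                     = job_$(Cluster)_$(Process).log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
request_cpus            = 1
request_memory          = 4 GB
RequestGPUs             = 1
# RequireGPUs = regexp("4080", DeviceName)   # uncomment to insist on the 4080 SUPER
queue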
When your job is running, condor will expose the GPU to the job via the CUDA_VISIBLE_DEVICES environment variable, as we did manually in the interactive example above. It is important that you do not overwrite CUDA_VISIBLE_DEVICES in the job that you run, and instead simply make use of this environment variable! If you need two GPUs in a single job, you can set RequestGPUs to 2, and adjust CUDA_VISIBLE_DEVICES for your two models/algorithms in your python script via something like:
orig_cuda_devices = os.environ["CUDA_VISIBLE_DEVICES"]
# pick index 1 for GPU 1 (4060 Ti) or index 0 for GPU 0 (4080 SUPER)
os.environ["CUDA_VISIBLE_DEVICES"] = orig_cuda_devices.split(',')[1]
print(os.environ["CUDA_VISIBLE_DEVICES"])
Since we only have four GPUs integrated into our condor system, you may have to wait longer for two GPUs on a single machine to become available--as a result, you should consider whether two GPUs in a single job are actually necessary for your use case, or whether a single GPU per job would be sufficient.
To view available GPU resources, you can use the following command, which lists the GPU devices currently in use by jobs, followed by details for all of our batch-enabled GPUs, and (at the bottom) details of only those that are available/free:
ru_condor_status_gpus
JobID     Owner       GPUs_DeviceName                GPU_ID    GPU Util.  Peak VRAM  State    Activity  LoadAv  Mem  ActvtyTime  HOST(S)
122141.0  username01  NVIDIA GeForce RTX 4080 SUPER  0de1a744  0.75       2.0 GB     Claimed  Busy      1.590   128  0+07:03:04  slot1_1@hexgpu1.hexfarm.rutgers.edu
122144.0  username01  NVIDIA GeForce RTX 4060 Ti     42ae84eb  0.86       1.7 GB     Claimed  Busy      1.320   128  0+00:03:45  slot1_2@hexgpu1.hexfarm.rutgers.edu
122143.0  username01  NVIDIA GeForce RTX 4060 Ti     199f3f23  0.92       2.5 GB     Claimed  Busy      1.170   128  0+00:15:19  slot1_2@hexgpu2.hexfarm.rutgers.edu

Summary:
Machine                      Slots  CPUs  GPUs  Mem(Gb)  FreeCPUs  FreeGPUs  FreeMem%  CpuUtil
hexgpu1.hexfarm.rutgers.edu  2      48    2     125.33   38        0         99.8      0.07
hexgpu2.hexfarm.rutgers.edu  1      48    2     125.33   43        1         99.9      0.03

All condor-enabled GPU machines and the details of their GPUs:
hexgpu1.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-0de1a744"; ClockMhz = 2565.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15969; Capability = 8.9; DeviceUuid = "0de1a744-6d70-5b05-6321-8c818bd853ed"; DevicePciBusId = "0000:41:00.0"; ComputeUnits = 80; DeviceName = "NVIDIA GeForce RTX 4080 SUPER"; DriverVersion = 12.7; ECCEnabled = false ]
hexgpu1.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-42ae84eb"; ClockMhz = 2535.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15974; Capability = 8.9; DeviceUuid = "42ae84eb-0158-c6c0-2343-107fe50bc70b"; DevicePciBusId = "0000:81:00.0"; ComputeUnits = 34; DeviceName = "NVIDIA GeForce RTX 4060 Ti"; DriverVersion = 12.7; ECCEnabled = false ]
hexgpu2.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-199f3f23"; ClockMhz = 2535.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15974; Capability = 8.9; DeviceUuid = "199f3f23-68ab-a1f9-3229-a65d9046ffe9"; DevicePciBusId = "0000:82:00.0"; ComputeUnits = 34; DeviceName = "NVIDIA GeForce RTX 4060 Ti"; DriverVersion = 12.7; ECCEnabled = false ]
hexgpu2.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-8fbdadaf"; ClockMhz = 2565.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15970; Capability = 8.9; DeviceUuid = "8fbdadaf-6057-d45e-355f-683f327c60b8"; DevicePciBusId = "0000:41:00.0"; ComputeUnits = 80; DeviceName = "NVIDIA GeForce RTX 4080 SUPER"; DriverVersion = 12.7; ECCEnabled = false ]

Current available/free GPUs:
hexgpu2.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-8fbdadaf"; ClockMhz = 2565.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15970; Capability = 8.9; DeviceUuid = "8fbdadaf-6057-d45e-355f-683f327c60b8"; DevicePciBusId = "0000:41:00.0"; ComputeUnits = 80; DeviceName = "NVIDIA GeForce RTX 4080 SUPER"; DriverVersion = 12.7; ECCEnabled = false ]
You can also use ru_condor_q_run, which has a dedicated section at the bottom of its output to monitor GPU jobs that are currently running and the resources they are using. For a longer-term view, you can look at this page to monitor our GPU-enabled machines and various aspects of them.
To view the full condor queue, broken down into CPU vs. those requesting GPUs, you can use ru_condor_q. You can optionally add the -nobatch flag to see the command that each job will run.
For interactive debugging where hexdl will not suffice, you can use condor_submit -interactive RequestGPUs=1. This will allocate the resources for an interactive condor job in which you can run commands and do tests. The job ends when you log out of that session, so this should only be used for debugging. If you receive an authentication error when trying to do this, run ssh-add -D and try again.
Finally, if you have many jobs that you are running for tests, that you have a way to checkpoint along the way, or that are low priority, you can set nice_user = True in your condor submission .jdl file. This sets the priority of your jobs to a low value and allows other jobs to preempt yours. In other words, when the queue of jobs requesting GPU resources is empty, your jobs will run; but when the queue fills up with other users' jobs, your jobs will be evicted from the condor node, set to idle, and will only start running again once the other jobs complete. Depending on how your job is structured, you may benefit from using when_to_transfer_output = ON_EXIT_OR_EVICT in your .jdl file when running nice_user jobs (as opposed to just "ON_EXIT").
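As a sketch, the relevant lines in such a low-priority submit file might be (everything else stays as in a normal GPU job):

# low-priority, preemptable GPU job (illustrative fragment)
nice_user               = True
RequestGPUs             = 1
when_to_transfer_output = ON_EXIT_OR_EVICT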
If your needs are larger than those available on our HEX farm, some other resources available to CMS users include CRAB, CERN's lxplus, FNAL's cmslpc, and the Rutgers Amarel cluster.