In addition to traditional CPU resources, the HEX farm has a handful of machines with GPUs, which can be more efficient for machine learning and other dedicated tasks (e.g. matrix multiplication). We currently have six GPUs integrated into the HEX farm: two that can be used interactively and four that can be accessed via the HTCondor batch system.
From a terminal session logged on to hexcms, you can connect to hexdl via ssh hexdl. This is a machine outfitted with two NVIDIA TITAN Xp GPUs, each with 12 GB of VRAM. These are somewhat older GPUs that are great for debugging, but they can also be used for production workflows if needed.
Once connected to hexdl, you can use the nvidia-smi command to see the GPU resources present on this machine, and their current utilization.
nvidia-smi

Mon Jul 28 12:52:49 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN Xp                Off |   00000000:01:00.0  On |                  N/A |
| 23%   27C    P8             10W /  250W |      59MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA TITAN Xp                Off |   00000000:02:00.0 Off |                  N/A |
| 31%   47C    P2             60W /  250W |   11617MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3363      G   /usr/libexec/Xorg                        39MiB |
|    0   N/A  N/A            3449      G   /usr/bin/gnome-shell                      7MiB |
|    1   N/A  N/A          872601      C   python3                               11612MiB |
+-----------------------------------------------------------------------------------------+

In this case, GPU 1 is occupied by a process that is using a lot of memory, but GPU 0 is not. If we were to try to use GPU 1, our process would likely be blocked from running--or we could inadvertently cause issues for the person who is already using GPU 1. We should plan to use GPU 0 in this case. Next, we need to set up the proper environment to access the needed libraries. One setup script provided by CERN that works out-of-the-box for many standalone purposes is:

source /cvmfs/sft.cern.ch/lcg/views/LCG_107_cuda/x86_64-el8-gcc11-opt/setup.sh
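As a quick sanity check (assuming the LCG view above has been sourced, and that it provides PyTorch, as recent LCG_cuda views do), you can confirm that the environment sees the GPUs:

python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'

On hexdl this should print True followed by the number of visible GPUs.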
In some scenarios, custom versions of libraries may be needed; in that case conda, a CMSSW venv, or a custom Apptainer container for older versions of CMSSW could be used. Contact the HEX farm managers if you think this is necessary for your usage, because such setups tend to be trickier.
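If you do go the conda route, a minimal sketch might look like the following (the environment name, Python version, and package choices are placeholders rather than a supported recipe, so coordinate with the farm managers first):

# illustrative only: create and activate a personal environment, then install what you need
conda create -n my-gpu-env python=3.11
conda activate my-gpu-env
pip install torch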
Next, we must prepare the script that we'll run. Two examples, one using PyTorch and one using TensorFlow, are below; you could save them to files named pytorch_example.py and tensorflow_example.py:
PyTorch example script:
# pytorch_example.py
# based on https://docs.pytorch.org/tutorials/beginner/examples_tensor/polynomial_tensor.html
import torch
import math
import os
print("CUDA_VISIBLE_DEVICES for torch", os.environ["CUDA_VISIBLE_DEVICES"])
dtype = torch.float
device = torch.device("cuda:0")
#device = torch.device("cpu") # uncomment this to run on CPU
try:
    print("using device:", torch.cuda.get_device_name(device))
except ValueError:
    print("using CPU only!")
# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)
# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(20000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)
    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()
    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d
print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')
TensorFlow example script:
# tensorflow_example.py
# based on https://www.tensorflow.org/tutorials/quickstart/beginner
import tensorflow as tf
import os
print("TensorFlow version:", tf.__version__)
print("CUDA_VISIBLE_DEVICES for TF", os.environ["CUDA_VISIBLE_DEVICES"])
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
tf.nn.softmax(predictions).numpy()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test, verbose=2)
To run this we should first do the following:
export CUDA_VISIBLE_DEVICES=0

to only expose GPU 0 to the script. If we wanted to only expose GPU 1, we would instead set this to 1. Then we can run the example via:
python3 pytorch_example.py or python3 tensorflow_example.py
While this runs, you can monitor the GPU usage via nvidia-smi.
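If you prefer a continuously refreshing view rather than rerunning the command by hand, something like the following (using the standard watch utility) updates the display every five seconds:

watch -n 5 nvidia-smi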
The rest of our GPU resources can only be used via the HTCondor batch system. We have two dedicated machines, each outfitted with one NVIDIA RTX 4080 SUPER and one NVIDIA RTX 4060 Ti; each of these GPUs has 16 GB of VRAM. In general, these GPUs are more powerful than those on hexdl. Moreover, the rest of the hardware on these machines (particularly the motherboard/CPU) is much newer than that on hexdl, so these machines tend to be much faster and are better suited to production workflows than to simple tests.
When possible, the 4080 SUPER is recommended for ML training purposes, and the 4060 Ti is recommended for inference. However, both GPUs are capable of either task.
To run a batch job using our GPU resources, you must first set up HTCondor as described in our condor documentation:

source /condor/HTCondor/alma8/condor.sh

Then prepare a job submission .jdl file as described on that page, and add the following lines before the line with the word queue:
RequestGPUs = 1
# For inference, uncomment:
#RequireGPUs = regexp("4060", DeviceName)
# For training, uncomment:
#RequireGPUs = regexp("4080", DeviceName)
If you don't care which GPU is used, you only need the "RequestGPUs" line. If you want to use a specific one, you can uncomment one of the other two lines.
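Putting it together, a minimal submit file might look like the sketch below; the executable, file names, and CPU/memory requests are placeholders to adapt to your own job, and only the GPU-related lines come from this page:

# gpu_example.jdl -- illustrative sketch, adapt to your job
universe                = vanilla
executable              = run_training.sh          # placeholder for your own script
output                  = job_$(Cluster)_$(Process).out
error                   = job_$(Cluster)_$(Process).err
log                     = job_$(Cluster)_$(Process).log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
request_cpus            = 1
request_memory          = 4 GB
RequestGPUs             = 1
# RequireGPUs = regexp("4080", DeviceName)   # uncomment to insist on the 4080 SUPER
queue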
When your job is running, condor will expose the GPU to the job via the CUDA_VISIBLE_DEVICES environment variable, as we did manually in the interactive example above. It is important that you do not overwrite CUDA_VISIBLE_DEVICES in the job that you run, and instead simply make use of this environment variable! If you need two GPUs in a single job, you can set RequestGPUs to 2, and adjust CUDA_VISIBLE_DEVICES for your two models/algorithms in your python script via something like:
orig_cuda_devices = os.environ["CUDA_VISIBLE_DEVICES"]
# pick index 1 for GPU 1 (4060 Ti) or index 0 for GPU 0 (4080 SUPER)
os.environ["CUDA_VISIBLE_DEVICES"] = orig_cuda_devices.split(',')[1]
print(os.environ["CUDA_VISIBLE_DEVICES"])
Since we only have four GPUs integrated into our condor system, you may have to wait longer for two GPUs on a single machine to become available--as a result, you should consider whether two GPUs in a single job are actually necessary for your use case, or whether a single GPU per job would be sufficient.
To view available GPU resources, you can use the following command, which lists the GPU devices currently in use by jobs, followed by details for all of our batch-enabled GPUs, and (at the bottom) details of only those that are available/free:
ru_condor_status_gpus
JobID     Owner       GPUs_DeviceName                GPU_ID    GPU Util.  Peak VRAM  State    Activity  LoadAv  Mem  ActvtyTime  HOST(S)
122141.0  username01  NVIDIA GeForce RTX 4080 SUPER  0de1a744  0.75       2.0 GB     Claimed  Busy      1.590   128  0+07:03:04  slot1_1@hexgpu1.hexfarm.rutgers.edu
122144.0  username01  NVIDIA GeForce RTX 4060 Ti     42ae84eb  0.86       1.7 GB     Claimed  Busy      1.320   128  0+00:03:45  slot1_2@hexgpu1.hexfarm.rutgers.edu
122143.0  username01  NVIDIA GeForce RTX 4060 Ti     199f3f23  0.92       2.5 GB     Claimed  Busy      1.170   128  0+00:15:19  slot1_2@hexgpu2.hexfarm.rutgers.edu

Summary:
Machine                      Slots  CPUs  GPUs  Mem(Gb)  FreeCPUs  FreeGPUs  FreeMem%  CpuUtil
hexgpu1.hexfarm.rutgers.edu  2      48    2     125.33   38        0         99.8      0.07
hexgpu2.hexfarm.rutgers.edu  1      48    2     125.33   43        1         99.9      0.03

All condor-enabled GPU machines and the details of their GPUs:
hexgpu1.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-0de1a744"; ClockMhz = 2565.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15969; Capability = 8.9; DeviceUuid = "0de1a744-6d70-5b05-6321-8c818bd853ed"; DevicePciBusId = "0000:41:00.0"; ComputeUnits = 80; DeviceName = "NVIDIA GeForce RTX 4080 SUPER"; DriverVersion = 12.7; ECCEnabled = false ]
hexgpu1.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-42ae84eb"; ClockMhz = 2535.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15974; Capability = 8.9; DeviceUuid = "42ae84eb-0158-c6c0-2343-107fe50bc70b"; DevicePciBusId = "0000:81:00.0"; ComputeUnits = 34; DeviceName = "NVIDIA GeForce RTX 4060 Ti"; DriverVersion = 12.7; ECCEnabled = false ]
hexgpu2.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-199f3f23"; ClockMhz = 2535.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15974; Capability = 8.9; DeviceUuid = "199f3f23-68ab-a1f9-3229-a65d9046ffe9"; DevicePciBusId = "0000:82:00.0"; ComputeUnits = 34; DeviceName = "NVIDIA GeForce RTX 4060 Ti"; DriverVersion = 12.7; ECCEnabled = false ]
hexgpu2.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-8fbdadaf"; ClockMhz = 2565.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15970; Capability = 8.9; DeviceUuid = "8fbdadaf-6057-d45e-355f-683f327c60b8"; DevicePciBusId = "0000:41:00.0"; ComputeUnits = 80; DeviceName = "NVIDIA GeForce RTX 4080 SUPER"; DriverVersion = 12.7; ECCEnabled = false ]

Current available/free GPUs:
hexgpu2.hexfarm.rutgers.edu [ CoresPerCU = 128; Id = "GPU-8fbdadaf"; ClockMhz = 2565.0; MaxSupportedVersion = 12070; GlobalMemoryMb = 15970; Capability = 8.9; DeviceUuid = "8fbdadaf-6057-d45e-355f-683f327c60b8"; DevicePciBusId = "0000:41:00.0"; ComputeUnits = 80; DeviceName = "NVIDIA GeForce RTX 4080 SUPER"; DriverVersion = 12.7; ECCEnabled = false ]
You can also use ru_condor_q_run, which has a dedicated section at the bottom of its output to monitor GPU jobs that are currently running and the resources they are using. For a longer-term view, you can look at this page to monitor our GPU-enabled machines and various aspects of them.
To view the full condor queue, broken down into CPU vs. those requesting GPUs, you can use ru_condor_q. You can optionally add the -nobatch flag to see the command that each job will run.
For interactive debugging where hexdl will not suffice, you can use condor_submit -interactive RequestGPUs=1. This will allocate the resources for an interactive condor job in which you can run commands and do tests. The job ends when you log out of that session, so this should only be used for debugging. If you receive an authentication error when trying to do this, run ssh-add -D and try again.
Finally, if you have many jobs that you are running for tests, that you have a way to checkpoint along the way, or that are low priority, you can set nice_user = True in your condor submission .jdl file. This sets the priority of your jobs to a low value and allows other jobs to preempt yours. In other words, when the queue of jobs requesting GPU resources is empty, your jobs will run; but when the queue fills up with other users' jobs, your jobs will be evicted from the condor node, set to idle, and will only start running again once the other jobs complete. Depending on how your job is structured, you may benefit from using when_to_transfer_output = ON_EXIT_OR_EVICT in your .jdl file when running nice_user jobs (as opposed to just "ON_EXIT").
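As a sketch, the relevant lines in such a low-priority submit file might be (everything else stays as in a normal GPU job):

# low-priority, preemptable GPU job (illustrative fragment)
nice_user               = True
RequestGPUs             = 1
when_to_transfer_output = ON_EXIT_OR_EVICT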
If your needs are larger than those available on our HEX farm, some other resources available to CMS users include CRAB, CERN's lxplus, FNAL's cmslpc, and the Rutgers Amarel cluster.