Batchfarm
The batchfarm of the KTA Computer System uses Slurm as its workload manager. For an introduction to Slurm itself, see the official Slurm quickstart guide.
Submit Queues
The following queues are available for job submission:
| Queue name | Purpose | Max runtime |
|---|---|---|
| kta | Default queue with all resources | 5 h |
| intermediate | Extended runtime on the most powerful nodes | 8 h |
| xtralong | Very long runtime on the less powerful nodes | 7 d |
| gpu | Jobs requiring a GPU | 1 d |
| test | Short test runs and debugging | 10 min |
To inspect available queues and their current state, run:

```shell
sinfo
scontrol show partition <NAME>
```

Compute Nodes
Compute nodes cannot be accessed directly, e.g. via ssh; all work must be scheduled through Slurm, either as batch jobs or as interactive srun sessions. The following nodes are available:
| Host name | Cores / Threads | CPU | GPUs | RAM (GB) |
|---|---|---|---|---|
| alakazam | 128 / 128 | AMD EPYC 7713 | — | 512 |
| dragonite | 128 / 128 | AMD EPYC 7713 | — | 512 |
| machamp | 128 / 128 | AMD EPYC 7713 | — | 512 |
| pidgeot | 128 / 128 | AMD EPYC 7713 | — | 512 |
| poliwrath | 128 / 128 | AMD EPYC 7713 | — | 512 |
| gengar | 128 / 128 | AMD EPYC 7713 | 3× NVIDIA L40 | 512 |
| cloyster | 40 / 40 | Xeon Gold 6148 | — | 92 |
| marowak | 40 / 40 | Xeon Gold 6148 | — | 92 |
| arcanine | 24 / 24 | Xeon E5-2690 | — | 256 |
| golbat | 40 / 40 | Xeon Gold 6148 | — | 187 |
| muk.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 187 |
| weezing | 40 / 40 | Xeon Gold 6148 | — | 187 |
Hands-on Guide 1: Submitting a Job Array
This guide walks through a realistic example: running many independent ROOT jobs in parallel, each using a different random seed. This is a common pattern for Monte Carlo studies.
The idea
Rather than submitting one job per seed by hand, Slurm’s job array feature lets you submit a single script that is executed once per array index. Each array task reads its own seed from a pre-generated seed file using awk and the SLURM_ARRAY_TASK_ID environment variable.
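The awk line-selection trick can be tried out locally before involving Slurm at all. A small sketch with a toy seed file (the variable is set by hand here; inside a real array job Slurm sets it for you):

```shell
# Create a toy seed file and emulate what array task 3 would read.
printf '101\n202\n303\n404\n' > demo_seeds.txt
SLURM_ARRAY_TASK_ID=3  # set automatically by Slurm inside a real array job
SEED=$(awk "NR==${SLURM_ARRAY_TASK_ID}" demo_seeds.txt)
echo "Task ${SLURM_ARRAY_TASK_ID} uses seed ${SEED}"  # prints: Task 3 uses seed 303
```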
Step 1 — Write the ROOT macro
Save the following as fill_histogram.C. It accepts a seed as a command-line argument, samples 10 000 events from a Gaussian distribution, fills a histogram, and writes the result to a ROOT file.
```cpp
// fill_histogram.C
// Usage: root -l -b -q 'fill_histogram.C(<seed>)'
void fill_histogram(int seed = 42) {
    TRandom3 rng(seed);
    TH1F *h = new TH1F("h_gauss", "Gaussian sample;x;Entries", 100, -5, 5);
    for (int i = 0; i < 10000; i++) {
        h->Fill(rng.Gaus(0.0, 1.0));
    }
    TString outname = TString::Format("output_seed%d.root", seed);
    TFile *f = new TFile(outname, "RECREATE");
    h->Write();
    f->Close();
    Printf("Done. Output written to %s", outname.Data());
}
```

Step 2 — Generate the seed file
Before submitting, generate one seed per job and store them in a plain text file, one seed per line:
```shell
python3 -c "import random; [print(random.randint(1, 1000000)) for _ in range(100)]" > seeds.txt
```

This creates seeds.txt with 100 seeds, one per line. You can verify with:
```shell
wc -l seeds.txt    # should print 100
head -5 seeds.txt  # preview the first five seeds
```

Step 3 — Write the batch script
Save the following as submit_array.sh:
```shell
#!/bin/bash
#SBATCH --job-name=gauss_array
#SBATCH --partition=test
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=00:10:00
#SBATCH --output=logs/job_%A_%a.out
#SBATCH --error=logs/job_%A_%a.err

# Read the seed for this array task from seeds.txt.
# awk selects the line whose number matches SLURM_ARRAY_TASK_ID.
SEED=$(awk "NR==${SLURM_ARRAY_TASK_ID}" seeds.txt)
echo "Task ${SLURM_ARRAY_TASK_ID}: using seed ${SEED}"

# Load the environment (adjust the module name to match your setup)
module load root

# Run the ROOT macro with the seed
root -l -b -q "fill_histogram.C(${SEED})"
```

Step 4 — Submit
Create the log directory and submit the array:
```shell
mkdir -p logs
sbatch submit_array.sh
```

Slurm will respond with a job ID, e.g.:

```
Submitted batch job 384710
```

The --array=1-100 directive launches 100 tasks. Each task picks line $SLURM_ARRAY_TASK_ID from seeds.txt and runs the macro with that seed. Outputs are written to output_seed<N>.root in the working directory, and logs to logs/job_<jobid>_<taskid>.out.
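Once the array has finished (check with squeue -u $USER), it is worth verifying that every task actually produced its output file. A minimal sketch, assuming the output_seed<N>.root naming convention from the macro above:

```shell
# Loop over all expected tasks and flag any missing output files,
# so the corresponding array indices can be resubmitted.
MISSING=0
for i in $(seq 1 100); do
    SEED=$(awk "NR==${i}" seeds.txt 2>/dev/null)
    if [ ! -f "output_seed${SEED}.root" ]; then
        echo "Missing output for task ${i} (seed ${SEED})"
        MISSING=$((MISSING + 1))
    fi
done
echo "${MISSING} task(s) without output"
```

Failed indices can then be resubmitted selectively, e.g. sbatch --array=17,42 submit_array.sh.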
Hands-on Guide 2: Interactive Jobs with srun
When the batchfarm is idle or lightly loaded, you can use srun --pty to open an interactive shell directly on a compute node. This is particularly useful for tasks that are too heavy for the terminal servers but do not fit naturally into a batch script — the most common example being the compilation of large software projects such as O2Physics.
Compiling O2Physics on a terminal server is discouraged as it can consume most of the available CPU cores and memory, impacting other users. Requesting a dedicated compute node via srun gives you the full resources of that node for the duration of the build without affecting anyone else.
Starting an interactive session
The basic command to open an interactive bash shell on a compute node is:
```shell
srun --partition=kta --ntasks=1 --cpus-per-task=128 --mem=256G --time=04:00:00 --pty bash
```

Once the resources are allocated, your prompt will change to reflect the compute node you have landed on, e.g.:
```
[ga12abc@alakazam ~]$
```

You are now running directly on the compute node and can execute commands as you would on the terminal server — including long-running builds:
```shell
# example: build O2Physics using all available cores
cd O2Physics/build
cmake --build . -- -j $(nproc)
```

Requesting more or fewer resources
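Inside the session it can be reassuring to check what was actually granted. Depending on how the cluster enforces limits (e.g. via cgroups), nproc may report only the cores allocated to your job rather than the node's full count; SLURM_CPUS_PER_TASK is exported by srun when --cpus-per-task is given:

```shell
# Sanity-check the resources visible inside the allocation.
nproc                                          # CPU cores usable by this shell
free -h | head -n 2                            # node memory overview
echo "requested: ${SLURM_CPUS_PER_TASK:-unset} CPUs per task"
```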
Adjust the srun flags to match your needs:
| Flag | Description | Example |
|---|---|---|
| --partition | Queue to use | --partition=kta |
| --cpus-per-task | Number of CPU cores | --cpus-per-task=64 |
| --mem | Total memory | --mem=128G |
| --time | Maximum wall time | --time=02:00:00 |
| --gres | Generic resources (e.g. GPUs) | --gres=gpu:1 |
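As a concrete example, an interactive session on the GPU node might be requested like this (the flag values are illustrative; the gpu partition allows up to 1 d of runtime):

```shell
srun --partition=gpu --gres=gpu:1 --ntasks=1 --cpus-per-task=16 --mem=64G --time=08:00:00 --pty bash
```

Once inside, nvidia-smi shows the GPU(s) assigned to your job.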
For a lighter interactive session, e.g. for testing or debugging, the test partition is a good choice:
```shell
srun --partition=test --ntasks=1 --cpus-per-task=4 --mem=8G --time=00:10:00 --pty bash
```

Ending the session
Type exit or press Ctrl+D to end the interactive session and immediately release the allocated resources back to the cluster.