Batchfarm

The batchfarm of the KTA Computer System uses Slurm as its workload manager. For an introduction, see the official Slurm quickstart guide.

Submit Queues

The following queues are available for job submission:

| Queue name | Purpose | Max runtime |
| --- | --- | --- |
| kta | Default queue with all resources | 5 h |
| intermediate | Extended runtime with most powerful nodes | 8 h |
| xtralong | Very extended runtime with less powerful nodes | 7 d |
| gpu | Jobs requiring a GPU | 1 d |
| test | Short test runs and debugging | 10 min |

To inspect available queues and their current state, run:

sinfo
scontrol show partition <NAME>
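The queue for a job is selected with the `--partition` option. As a minimal sketch (the script name `my_job.sh` and the resource values are placeholders to adapt):

```shell
# Write a hypothetical minimal batch script; adjust partition and
# resource requests to your actual job before submitting.
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=kta     # one of: kta, intermediate, xtralong, gpu, test
#SBATCH --time=02:00:00     # must stay below the queue's max runtime
#SBATCH --ntasks=1

echo "Running on $(hostname)"
EOF

# Submit with: sbatch my_job.sh
```

If `--partition` is omitted, Slurm falls back to the cluster's default partition.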

Compute Nodes

Compute nodes cannot be accessed interactively by users — all tasks must be submitted via Slurm. The following nodes are available:

| Host name | Cores / Threads | CPU | GPUs | RAM (GB) |
| --- | --- | --- | --- | --- |
| alakazam.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| dragonite.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| machamp.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| pidgeot.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| poliwrath.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| gengar.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | 3× NVIDIA L40 | 512 |
| cloyster.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 92 |
| marowak.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 92 |
| arcanine.ktas.ph.tum.de | 24 / 24 | Xeon E5-2690 | — | 256 |
| golbat.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 187 |
| muk.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 187 |
| weezing.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 187 |

Hands-on Guide 1: Submitting a Job Array

This guide walks through a realistic example: running many independent ROOT jobs in parallel, each using a different random seed. This is a common pattern for Monte Carlo studies.

The idea

Rather than submitting one job per seed by hand, Slurm’s job array feature lets you submit a single script that is executed once per array index. Each array task reads its own seed from a pre-generated seed file using awk and the SLURM_ARRAY_TASK_ID environment variable.
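The awk line-selection trick can be tried locally before involving Slurm at all, by setting the task ID by hand (the file name and seed values here are purely illustrative):

```shell
# Create a small stand-in for seeds.txt: one seed per line
printf '101\n202\n303\n' > demo_seeds.txt

# Pretend we are array task 2 and pick the matching line;
# awk prints only the line whose number equals NR==2
SLURM_ARRAY_TASK_ID=2
SEED=$(awk "NR==${SLURM_ARRAY_TASK_ID}" demo_seeds.txt)
echo "$SEED"    # prints 202
```

Inside a real array job, Slurm sets SLURM_ARRAY_TASK_ID automatically for each task.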

Step 1 — Write the ROOT macro

Save the following as fill_histogram.C. It accepts a seed as a command-line argument, samples 10 000 events from a Gaussian distribution, fills a histogram, and writes the result to a ROOT file.

// fill_histogram.C
// Usage: root -l -b -q 'fill_histogram.C(<seed>)'

void fill_histogram(int seed = 42) {
    TRandom3 rng(seed);

    TH1F *h = new TH1F("h_gauss", "Gaussian sample;x;Entries", 100, -5, 5);

    for (int i = 0; i < 10000; i++) {
        h->Fill(rng.Gaus(0.0, 1.0));
    }

    TString outname = TString::Format("output_seed%d.root", seed);
    TFile *f = new TFile(outname, "RECREATE");
    h->Write();
    f->Close();

    Printf("Done. Output written to %s", outname.Data());
}

Step 2 — Generate the seed file

Before submitting, generate one seed per job and store them in a plain text file, one seed per line:

python3 -c "import random; [print(random.randint(1, 1000000)) for _ in range(100)]" > seeds.txt

This creates seeds.txt with 100 seeds, one per line. You can verify with:

wc -l seeds.txt   # should print 100
head -5 seeds.txt # preview the first five seeds
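Note that random.randint samples with replacement, so duplicate seeds are possible. If you want guaranteed-unique seeds, shuf from GNU coreutils samples the range without repetition:

```shell
# 100 distinct seeds between 1 and 1000000, one per line;
# shuf never repeats a value unless asked to (-r)
shuf -i 1-1000000 -n 100 > seeds.txt

# Every line is unique, so the de-duplicated count is still 100
sort -u seeds.txt | wc -l
```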

Step 3 — Write the batch script

Save the following as submit_array.sh:

#!/bin/bash
#SBATCH --job-name=gauss_array
#SBATCH --partition=test
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=00:10:00
#SBATCH --output=logs/job_%A_%a.out
#SBATCH --error=logs/job_%A_%a.err

# Read the seed for this array task from seeds.txt.
# awk selects the line whose number matches SLURM_ARRAY_TASK_ID.
SEED=$(awk "NR==${SLURM_ARRAY_TASK_ID}" seeds.txt)

echo "Task ${SLURM_ARRAY_TASK_ID}: using seed ${SEED}"

# Load the environment (adjust the module name to match your setup)
module load root

# Run the ROOT macro with the seed
root -l -b -q "fill_histogram.C(${SEED})"

Step 4 — Submit

Create the log directory and submit the array:

mkdir -p logs
sbatch submit_array.sh

Slurm will respond with a job ID, e.g.:

Submitted batch job 384710

The --array=1-100 directive launches 100 tasks. Each task picks line $SLURM_ARRAY_TASK_ID from seeds.txt and runs the macro with that seed. Outputs are written to output_seed<N>.root in the working directory, and logs to logs/job_<jobid>_<taskid>.out.
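Once the array has finished, it is worth verifying that all 100 output files actually exist before analysing them. A small sketch, assuming the outputs and seeds.txt are in the current directory:

```shell
# Report any missing output files; prints "0 file(s) missing" on success
missing=0
for i in $(seq 1 100); do
    seed=$(awk "NR==${i}" seeds.txt)
    if [ ! -f "output_seed${seed}.root" ]; then
        echo "missing: output_seed${seed}.root (task ${i})"
        missing=$((missing + 1))
    fi
done
echo "${missing} file(s) missing"
```

A failed task can then be resubmitted individually with, e.g., `sbatch --array=17 submit_array.sh`.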


Hands-on Guide 2: Interactive Jobs with srun

When the batchfarm is idle or lightly loaded, you can use srun --pty to open an interactive shell directly on a compute node. This is particularly useful for tasks that are too heavy for the terminal servers but do not fit naturally into a batch script — the most common example being the compilation of large software projects such as O2Physics.

Compiling O2Physics on a terminal server is discouraged as it can consume most of the available CPU cores and memory, impacting other users. Requesting a dedicated compute node via srun gives you the full resources of that node for the duration of the build without affecting anyone else.

Starting an interactive session

The basic command to open an interactive bash shell on a compute node is:

srun --partition=kta --ntasks=1 --cpus-per-task=128 --mem=256G --time=04:00:00 --pty bash

Once the resources are allocated, your prompt will change to reflect the compute node you have landed on, e.g.:

[ga12abc@alakazam ~]$

You are now running directly on the compute node and can execute commands as you would on the terminal server — including long-running builds:

# example: build O2Physics using all available cores
cd O2Physics/build
cmake --build . -- -j $(nproc)
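If you requested fewer cores than the node has, `nproc` may still report the full node, depending on how cgroup limits are configured. A safer sketch is to prefer SLURM_CPUS_PER_TASK when Slurm sets it:

```shell
# Use the Slurm allocation size if available, otherwise all visible cores
NCORES=${SLURM_CPUS_PER_TASK:-$(nproc)}
echo "building with ${NCORES} parallel jobs"
# then, inside the build directory:
# cmake --build . -- -j "${NCORES}"
```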

Requesting more or fewer resources

Adjust the srun flags to match your needs:

| Flag | Description | Example |
| --- | --- | --- |
| --partition | Queue to use | --partition=kta |
| --cpus-per-task | Number of CPU cores | --cpus-per-task=64 |
| --mem | Total memory | --mem=128G |
| --time | Maximum wall time | --time=02:00:00 |
| --gres | Generic resources (e.g. GPUs) | --gres=gpu:1 |

For a lighter interactive session, e.g. for testing or debugging, the test partition is a good choice:

srun --partition=test --ntasks=1 --cpus-per-task=4 --mem=8G --time=00:10:00 --pty bash

Ending the session

Type exit or press Ctrl+D to end the interactive session and immediately release the allocated resources back to the cluster.