Batchfarm

The batchfarm of the KTA Computer System uses Slurm as its workload manager. For an introduction, see the official Slurm quickstart guide.

Submit Queues

The following queues are available for job submission:

| Queue name | Purpose | Max runtime |
| --- | --- | --- |
| kta | Default queue with all resources | 5 h |
| intermediate | Extended runtime with most powerful nodes | 8 h |
| xtralong | Very extended runtime with less powerful nodes | 7 d |
| gpu | Jobs requiring a GPU | 1 d |
| test | Short test runs and debugging | 10 min |

To inspect available queues and their current state, run:

sinfo
scontrol show partition <NAME>
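The queue for a job is selected with the `--partition` option. As a minimal sketch (the script name `my_job.sh` and the resource values are placeholders to adapt):

```shell
# Write a hypothetical minimal batch script; adjust partition and
# resource requests to your actual job before submitting.
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=kta     # one of: kta, intermediate, xtralong, gpu, test
#SBATCH --time=02:00:00     # must stay below the queue's max runtime
#SBATCH --ntasks=1

echo "Running on $(hostname)"
EOF

# Submit with: sbatch my_job.sh
```

If `--partition` is omitted, Slurm falls back to the cluster's default partition.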

Compute Nodes

Compute nodes cannot be accessed interactively by users — all tasks must be submitted via Slurm. The following nodes are available:

| Host name | Cores / Threads | CPU | GPUs | RAM (GB) |
| --- | --- | --- | --- | --- |
| alakazam.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| dragonite.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| machamp.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| pidgeot.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| poliwrath.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | — | 512 |
| gengar.ktas.ph.tum.de | 128 / 128 | AMD EPYC 7713 | 3× NVIDIA L40 | 512 |
| cloyster.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 92 |
| marowak.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 92 |
| arcanine.ktas.ph.tum.de | 24 / 24 | Xeon E5-2690 | — | 256 |
| golbat.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 187 |
| muk.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 187 |
| weezing.ktas.ph.tum.de | 40 / 40 | Xeon Gold 6148 | — | 187 |

Hands-on Guide 1: Submitting a Job Array

This guide walks through a realistic example: running many independent ROOT jobs in parallel, each using a different random seed. This is a common pattern for Monte Carlo studies.

The idea

Rather than submitting one job per seed by hand, Slurm’s job array feature lets you submit a single script that is executed once per array index. Each array task reads its own seed from a pre-generated seed file using awk and the SLURM_ARRAY_TASK_ID environment variable.
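The awk line-selection trick can be tried locally before involving Slurm at all, by setting the task ID by hand (the file name and seed values here are purely illustrative):

```shell
# Create a small stand-in for seeds.txt: one seed per line
printf '101\n202\n303\n' > demo_seeds.txt

# Pretend we are array task 2 and pick the matching line;
# awk prints only the line whose number equals NR==2
SLURM_ARRAY_TASK_ID=2
SEED=$(awk "NR==${SLURM_ARRAY_TASK_ID}" demo_seeds.txt)
echo "$SEED"    # prints 202
```

Inside a real array job, Slurm sets SLURM_ARRAY_TASK_ID automatically for each task.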

Step 1 — Write the ROOT macro

Save the following as fill_histogram.C. It accepts a seed as a command-line argument, samples 10 000 events from a Gaussian distribution, fills a histogram, and writes the result to a ROOT file.

// fill_histogram.C
// Usage: root -l -b -q 'fill_histogram.C(<seed>)'

void fill_histogram(int seed = 42) {
    TRandom3 rng(seed);

    TH1F *h = new TH1F("h_gauss", "Gaussian sample;x;Entries", 100, -5, 5);

    for (int i = 0; i < 10000; i++) {
        h->Fill(rng.Gaus(0.0, 1.0));
    }

    TString outname = TString::Format("output_seed%d.root", seed);
    TFile *f = new TFile(outname, "RECREATE");
    h->Write();
    f->Close();

    Printf("Done. Output written to %s", outname.Data());
}

Step 2 — Generate the seed file

Before submitting, generate one seed per job and store them in a plain text file, one seed per line:

python3 -c "import random; [print(random.randint(1, 1000000)) for _ in range(100)]" > seeds.txt

This creates seeds.txt with 100 seeds, one per line. You can verify with:

wc -l seeds.txt   # should print 100
head -5 seeds.txt # preview the first five seeds
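Note that random.randint samples with replacement, so duplicate seeds are possible. If you want guaranteed-unique seeds, shuf from GNU coreutils samples the range without repetition:

```shell
# 100 distinct seeds between 1 and 1000000, one per line;
# shuf never repeats a value unless asked to (-r)
shuf -i 1-1000000 -n 100 > seeds.txt

# Every line is unique, so the de-duplicated count is still 100
sort -u seeds.txt | wc -l
```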

Step 3 — Write the batch script

Save the following as submit_array.sh:

#!/bin/bash
#SBATCH --job-name=gauss_array
#SBATCH --partition=test
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=00:10:00
#SBATCH --output=logs/job_%A_%a.out
#SBATCH --error=logs/job_%A_%a.err

# Read the seed for this array task from seeds.txt.
# awk selects the line whose number matches SLURM_ARRAY_TASK_ID.
SEED=$(awk "NR==${SLURM_ARRAY_TASK_ID}" seeds.txt)

echo "Task ${SLURM_ARRAY_TASK_ID}: using seed ${SEED}"

# Load the environment (adjust the module name to match your setup)
module load root

# Run the ROOT macro with the seed
root -l -b -q "fill_histogram.C(${SEED})"

Step 4 — Submit

Create the log directory and submit the array:

mkdir -p logs
sbatch submit_array.sh

Slurm will respond with a job ID, e.g.:

Submitted batch job 384710

The --array=1-100 directive launches 100 tasks. Each task picks line $SLURM_ARRAY_TASK_ID from seeds.txt and runs the macro with that seed. Outputs are written to output_seed<N>.root in the working directory, and logs to logs/job_<jobid>_<taskid>.out.
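Once the array has finished, it is worth verifying that all 100 output files actually exist before analysing them. A small sketch, assuming the outputs and seeds.txt are in the current directory:

```shell
# Report any missing output files; prints "0 file(s) missing" on success
missing=0
for i in $(seq 1 100); do
    seed=$(awk "NR==${i}" seeds.txt)
    if [ ! -f "output_seed${seed}.root" ]; then
        echo "missing: output_seed${seed}.root (task ${i})"
        missing=$((missing + 1))
    fi
done
echo "${missing} file(s) missing"
```

A failed task can then be resubmitted individually with, e.g., `sbatch --array=17 submit_array.sh`.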


Hands-on Guide 2: Interactive Jobs with srun

When the batchfarm is idle or lightly loaded, you can use srun --pty to open an interactive shell directly on a compute node. This is particularly useful for tasks that are too heavy for the terminal servers but do not fit naturally into a batch script — the most common example being the compilation of large software projects such as O2Physics.

Compiling O2Physics on a terminal server is discouraged as it can consume most of the available CPU cores and memory, impacting other users. Requesting a dedicated compute node via srun gives you the full resources of that node for the duration of the build without affecting anyone else.

Starting an interactive session

The basic command to open an interactive bash shell on a compute node is:

srun --partition=kta --ntasks=1 --cpus-per-task=128 --mem=256G --time=04:00:00 --pty bash

Once the resources are allocated, your prompt will change to reflect the compute node you have landed on, e.g.:

[ga12abc@alakazam ~]$

You are now running directly on the compute node and can execute commands as you would on the terminal server — including long-running builds:

# example: build O2Physics using all available cores
cd O2Physics/build
cmake --build . -- -j $(nproc)
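If you requested fewer cores than the node has, `nproc` may still report the full node, depending on how cgroup limits are configured. A safer sketch is to prefer SLURM_CPUS_PER_TASK when Slurm sets it:

```shell
# Use the Slurm allocation size if available, otherwise all visible cores
NCORES=${SLURM_CPUS_PER_TASK:-$(nproc)}
echo "building with ${NCORES} parallel jobs"
# then, inside the build directory:
# cmake --build . -- -j "${NCORES}"
```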

Requesting more or fewer resources

Adjust the srun flags to match your needs:

| Flag | Description | Example |
| --- | --- | --- |
| --partition | Queue to use | --partition=kta |
| --cpus-per-task | Number of CPU cores | --cpus-per-task=64 |
| --mem | Total memory | --mem=128G |
| --time | Maximum wall time | --time=02:00:00 |
| --gres | Generic resources (e.g. GPUs) | --gres=gpu:1 |

For a lighter interactive session, e.g. for testing or debugging, the test partition is a good choice:

srun --partition=test --ntasks=1 --cpus-per-task=4 --mem=8G --time=00:10:00 --pty bash

Ending the session

Type exit or press Ctrl+D to end the interactive session and immediately release the allocated resources back to the cluster.