Distributed training

Franken supports data-parallel training across multiple GPUs through PyTorch's distributed support. Each rank processes a shard of the dataset and accumulates a local covariance matrix; the local matrices are then summed across ranks to obtain the global covariance before solving. The following command launches four processes on a single node:

torchrun --standalone --nnodes=1 --nproc-per-node=4 franken.autotune \
    --train-path="train_dataset.xyz" \
    --backbone=mace --mace.path-or-id "mace_mp/small" \
    --rf=gaussian --gaussian.num-rf 4096 --gaussian.length-scale="[5.,10.,20.]"

If you see a FileNotFoundError, call the script by its absolute path, for example ENV_PATH/bin/franken.autotune, where ENV_PATH is the path of the environment in which Franken is installed.
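
As a minimal sketch of why summing the per-rank covariances gives the global covariance described above, the following pure-Python example simulates three ranks in a single process. It is illustrative only: the helper names (gram, add), the toy feature values, and the strided sharding are assumptions made for this example, not Franken's API. In an actual multi-GPU run the sum would typically be carried out by a collective operation such as torch.distributed.all_reduce rather than a Python loop.

```python
# Illustrative only: simulates ranks in one process.
# All names here are hypothetical and are not part of Franken's API.

def gram(rows):
    """Return the unnormalized covariance Z^T Z for a list of feature rows."""
    dim = len(rows[0])
    out = [[0.0] * dim for _ in range(dim)]
    for z in rows:
        for i in range(dim):
            for j in range(dim):
                out[i][j] += z[i] * z[j]
    return out


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy feature matrix: 6 samples, 2 features each (made-up values).
features = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0],
            [2.0, 2.0], [1.0, 0.0], [4.0, 1.0]]
world_size = 3  # number of simulated ranks

# Each simulated rank takes a strided shard of the samples.
shards = [features[rank::world_size] for rank in range(world_size)]

# Local accumulation on each rank, followed by the sum that a
# collective (e.g. an all-reduce) would perform across ranks.
local_covs = [gram(shard) for shard in shards]
global_cov = local_covs[0]
for cov in local_covs[1:]:
    global_cov = add(global_cov, cov)

# The summed result matches the covariance computed on the full dataset.
assert global_cov == gram(features)
```

Because the covariance is a sum over samples, the result does not depend on how the samples are split across ranks, only on every sample being assigned to exactly one rank.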

Below is an example Slurm script, written for the Leonardo HPC cluster:

#!/bin/bash
#SBATCH --account=XXX
#SBATCH --nodes=1                   # number of nodes
#SBATCH --ntasks-per-node=4         # tasks per node (out of 32)
#SBATCH --gres=gpu:4                # GPUs per node (out of 4)
#SBATCH --cpus-per-task=8
#SBATCH -p boost_usr_prod

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Activate the Franken environment
ENV_PATH=...
conda activate "$ENV_PATH"

torchrun --standalone --nnodes=1 --nproc-per-node="${SLURM_NTASKS_PER_NODE}" "$ENV_PATH/bin/franken.autotune" \
    --train-path="train_dataset.xyz" \
    --val-path="valid_dataset.xyz" \
    --backbone=mace --mace.path-or-id "mace_mp/small" --mace.interaction-block 2 \
    --rf=gaussian --gaussian.num-rf 4096 --gaussian.length-scale="[5.,10.,20.]"