Run experiments on multiple GPUs on Kaya
Provided by Kai Niu and Supported by Chris
Multiple GPU Usage - UWA KAYA:
-N 2 requests 2 separate compute nodes, and --gres=gpu:a100:2 is counted per node, so this asks for 2 A100s on each of 2 nodes rather than 2 GPUs on a single node. The pophealth partition cannot provide that node configuration, which is why the request below fails:
(/group/pmc015/kniu/kai_phd/conda_env/champ) bash-4.4$ salloc -p pophealth --mem=80G -N 2 -n 8 --gres=gpu:a100:2
salloc: Job allocation 550543 has been revoked.
salloc: error: Job submit/allocate failed: Requested node configuration is not available
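A single-node request keeps both GPUs on one machine, which matches what pophealth offers. A corrected sketch (memory, task count and time are assumptions, adjust to your job):
salloc -p pophealth --mem=80G -N 1 -n 8 --gres=gpu:a100:2 --time=0-12:00:00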
hostname - get the hostname of the compute node that hosts the GPUs
(/group/pmc015/kniu/kai_phd/conda_env/champ) bash-4.4$ hostname
n006.hpc.uwa.edu.au
Use exit to end the interactive GPU session (this relinquishes the allocation):
(/group/pmc015/kniu/kai_phd/conda_env/champ) bash-4.4$ exit
srun: error: n006: task 0: Exited with exit code 130
salloc: Relinquishing job allocation 550408
salloc: Job allocation 550408 has been revoked.
Check GPU information with sinfo. As shown below, the 'pophealth' partition has 2 compute nodes that host GPUs: 'n002' with 2 V100s and 'n006' with 4 A100s.
(/group/pmc015/kniu/kai_phd/conda_env/champ) bash-4.4$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
work up 3-00:00:00 2 down* n[023,027]
work up 3-00:00:00 3 drain n[026,029,032]
work up 3-00:00:00 5 mix n[010,015-016,024,028]
work up 3-00:00:00 12 idle n[011-013,017-019,022,025,030-031,033-034]
long up 7-00:00:00 1 mix n021
long up 7-00:00:00 1 alloc n020
gpu up 3-00:00:00 13 mix n[001,003-005,037-044,046]
pophealth up 15-00:00:0 2 idle n[002,006]
ondemand up 12:00:00 1 down* n027
ondemand up 12:00:00 1 drain n026
ondemand up 12:00:00 2 mix n[024,028]
ondemand up 12:00:00 1 idle n025
ondemand-gpu up 12:00:00 8 mix n[036-043]
To request GPUs, always start from a login node, then use the 'salloc' command to move onto a GPU compute node. You must specify the GPU usage time for the A100s; the time follows the convention '--time=D-HH:MM:SS', as in the sketch below.
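For example, a one-day single-A100 session could be requested like this (a sketch; the memory and task values are assumptions):
salloc -p pophealth -N 1 -n 4 --mem=40G --gres=gpu:a100:1 --time=1-00:00:00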
To run a job with multiple threads under Slurm, submit a batch script with sbatch, as in the sketch below:
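A minimal batch script along these lines (a sketch: the job name, CPU count and the script name train.py are assumptions) reserves one GPU node on pophealth and runs the training:
#!/bin/bash
#SBATCH --job-name=multi_gpu_train
#SBATCH --partition=pophealth
#SBATCH --nodes=1                  # keep all GPUs on a single node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # CPU threads available to the task
#SBATCH --mem=80G
#SBATCH --gres=gpu:a100:2          # 2 A100s on n006
#SBATCH --time=0-12:00:00          # required for A100 jobs

# match the thread count to the allocated CPUs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# activate the conda environment used in the sessions above (assumes conda is initialised)
conda activate /group/pmc015/kniu/kai_phd/conda_env/champ

# train.py is a placeholder for your own training script
srun python train.py
Submit it from a login node with 'sbatch <script-name>.sh'.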
Use Accelerate for distributed training:
https://huggingface.co/docs/transformers/en/accelerate
https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference
Train with a script
If you are running your training from a script, run the following command to create and save a configuration file, choosing the 'Multiple GPUs' option when prompted:
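From the Accelerate documentation linked above, the configuration file is created interactively with:
accelerate config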
Then launch your training with:
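Again following the linked documentation (train.py is a placeholder for your own script):
accelerate launch train.py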