Running Experiments on Multiple GPUs on Kaya

Provided by Kai Niu and supported by Chris

Multiple GPU Usage - UWA KAYA:

-N 2 requests 2 separate compute nodes, and --gres=gpu:a100:2 is applied per node, so this command asks for 2 A100 GPUs on each of 2 nodes rather than 2 GPUs in total. The pophealth partition has only one node with A100s, which is why the allocation fails:

(/group/pmc015/kniu/kai_phd/conda_env/champ) bash-4.4$ salloc -p pophealth --mem=80G -N 2 -n 8 --gres=gpu:a100:2
salloc: Job allocation 550543 has been revoked.
salloc: error: Job submit/allocate failed: Requested node configuration is not available
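
To get 2 A100s together, request a single node instead (a minimal fix of the command above, keeping the same memory and task counts; the A100s also need a time limit, covered further below):

salloc -p pophealth --mem=80G -N 1 -n 8 --gres=gpu:a100:2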

Use the hostname command to get the hostname of the compute node that hosts the GPUs:

(/group/pmc015/kniu/kai_phd/conda_env/champ) bash-4.4$ hostname
n006.hpc.uwa.edu.au

Use the exit command to end the GPU session and release the allocation:

(/group/pmc015/kniu/kai_phd/conda_env/champ) bash-4.4$ exit
srun: error: n006: task 0: Exited with exit code 130
salloc: Relinquishing job allocation 550408
salloc: Job allocation 550408 has been revoked.

Check the GPU information with sinfo. The partition 'pophealth' has 2 compute nodes that host GPUs: n002 with 2 V100s and n006 with 4 A100s.

(/group/pmc015/kniu/kai_phd/conda_env/champ) bash-4.4$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
work            up 3-00:00:00      2  down* n[023,027]
work            up 3-00:00:00      3  drain n[026,029,032]
work            up 3-00:00:00      5    mix n[010,015-016,024,028]
work            up 3-00:00:00     12   idle n[011-013,017-019,022,025,030-031,033-034]
long            up 7-00:00:00      1    mix n021
long            up 7-00:00:00      1  alloc n020
gpu             up 3-00:00:00     13    mix n[001,003-005,037-044,046]
pophealth       up 15-00:00:0      2   idle n[002,006]
ondemand        up   12:00:00      1  down* n027
ondemand        up   12:00:00      1  drain n026
ondemand        up   12:00:00      2    mix n[024,028]
ondemand        up   12:00:00      1   idle n025
ondemand-gpu    up   12:00:00      8    mix n[036-043]
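
The summary above does not show GPU types directly; a node-oriented query like the following prints each node's generic resources (GRES), which is one way to confirm the V100/A100 counts (the exact output depends on the cluster's Slurm configuration):

sinfo -p pophealth -N -o "%N %G"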

To request GPUs, always start from a login node, then use the 'salloc' command to switch to a GPU compute node. For the A100s you have to specify the usage time, which follows the convention '--time=D-HH:MM:SS'.
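
For example, to hold 2 A100s on one pophealth node for 2 hours (the 2-hour duration is only an illustration):

salloc -p pophealth --mem=80G -N 1 -n 8 --gres=gpu:a100:2 --time=0-02:00:00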

A Slurm batch script example for running multiple threads is sketched below:
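
This is a minimal sketch only, reusing the partition, memory, and GPU request from the salloc examples above; the job name, time limit, and train.py are placeholders for your own values. Submit it with 'sbatch job.sh'.

#!/bin/bash
#SBATCH --job-name=multi_gpu_example   # placeholder job name
#SBATCH --partition=pophealth
#SBATCH --nodes=1                      # keep all GPUs on a single node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8              # multiple CPU threads, e.g. for data loading
#SBATCH --mem=80G
#SBATCH --gres=gpu:a100:2
#SBATCH --time=0-02:00:00              # D-HH:MM:SS

# Activate the conda environment from the transcripts above
# (depending on the setup you may need to 'source' conda's profile script first)
conda activate /group/pmc015/kniu/kai_phd/conda_env/champ

# train.py is a placeholder for your own training script
srun python train.py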

Use Accelerate for distributed training:

https://huggingface.co/docs/transformers/en/accelerate

https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference

Train with a script

If you are running your training from a script, run the following command to create and save a configuration file. Choose 'Multiple GPUs' when prompted:
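
accelerate config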

Then launch your training with:
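
Here train.py is again a placeholder for your own training script; 'accelerate launch' runs it with the settings saved by 'accelerate config'.

accelerate launch train.py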
