Demo project
We will illustrate how to use the GPUs to train a Python project, using conda and PyTorch.
Running a Python project on the GPUs in Kaya involves these steps:
Step 1: Load the code and create the conda environment on the login node
Step 2.1: Interactively allocate a compute node, then run the code
Step 2.2: Queue the job and review the results
Load the code and create the conda environment
We will use a public GitHub repo as the example: https://github.com/soledad921/TeLM
If you want to load your own code, you have several options:
pull it down from GitHub
scp the files to the login node
or use VS Code to edit directly on the login node
But do not run heavy code there!
We will first log in to the login node, and then run module avail.
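For example (the login hostname here is an assumption; use the address provided for your account):

```bash
# Connect to the Kaya login node
ssh your-username@kaya.hpc.uwa.edu.au

# List the software modules available on the cluster
module avail
```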

You should see a list of available modules. As a first step, we need to load the anaconda3 module. Run the command:
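The exact module name and version differ between sites, so match whatever module avail lists (the name below is an assumption):

```bash
# Load the Anaconda module to get the conda command
module load anaconda3
```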
And then run
conda init
Now you have conda loaded (you may need to restart your shell, or run source ~/.bashrc, for conda init to take effect).
Note: on both the login node and the compute nodes, we do not have permission to install software with apt-get install. Instead, we load software from the available modules, and if what we need is not there, we have to request that it be installed. The good news is that a Python project normally needs no extra system software; all we need is pip install or conda install.
Then we create a new conda environment with the command:
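A minimal sketch (the environment name and Python version are assumptions; use whatever the repo's README recommends):

```bash
# Create and activate a fresh environment for the project
conda create -n telm python=3.8
conda activate telm
```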
Then load the code via
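For the example repo used here:

```bash
# Clone the demo project onto the login node
git clone https://github.com/soledad921/TeLM.git
cd TeLM
```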
And follow the instructions in the repo's README; install the requirements with:
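Assuming the repo ships a requirements.txt (check the README for the exact file):

```bash
# Install the Python dependencies into the active conda environment
pip install -r requirements.txt
```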
Now we have the environment ready; next, we will run the code.
Interactively
To allocate a GPU node and run the code interactively, run the command:
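A sketch using SLURM's salloc (the partition name, GPU count, and time limit are assumptions; adjust them to Kaya's configuration):

```bash
# Request an interactive shell on a GPU compute node
salloc --partition=gpu --gres=gpu:1 --time=01:00:00
```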

You can see that, after we run the command above, we are allocated a compute node, and we are still in the same directory we were in on the login node (the filesystem is shared between nodes).
Next we will need to activate the conda environment.
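Using the illustrative environment name from the earlier sketch:

```bash
# Activate the project's conda environment on the compute node
conda activate telm
```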
Verify the installation and run the training process
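A quick sanity check that PyTorch can see the GPU, followed by a placeholder training command (the real entry point and arguments come from the repo's README):

```bash
# Confirm CUDA is visible to PyTorch
python -c "import torch; print(torch.cuda.is_available())"

# Launch training (placeholder script name)
python train.py
```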
If everything is set up correctly, the check prints True and the training logs start to appear.
In my case the allocated node is node001, so while training is running we can monitor the GPU from that node:
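One way to watch utilisation (run on the allocated node itself):

```bash
# Show GPU memory usage and utilisation for the running process
nvidia-smi
```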

Queue the job
Create a file in your project root; name it anything you like, we call it job.slurm.
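A minimal sketch of such a script. The partition, resources, module name, and the train.py entry point are assumptions; the results folder follows the convention described below:

```bash
#!/bin/bash
#SBATCH --job-name=telm-demo
#SBATCH --partition=gpu        # partition name is an assumption
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --time=04:00:00
#SBATCH --mem=16G

# Make conda usable in a non-interactive batch shell, then activate
# the environment created earlier
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate telm

# Run the training (placeholder entry point; use the repo's command)
python train.py

# Copy outputs to the per-job folder described below (assumes the
# training writes its outputs to ./results)
mkdir -p "$MYGROUP/conda_results/$SLURM_JOB_ID"
cp -r results "$MYGROUP/conda_results/$SLURM_JOB_ID/"
```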
Then we can queue the job with the command:
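```bash
# Submit the batch script to the SLURM queue
sbatch job.slurm
```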
It will print the job ID; you can then check the progress in the file slurm-{job-id}.out in the current folder, or find the results in $MYGROUP/conda_results/{job_id}/ after the job has finished.
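To monitor the job while it waits or runs:

```bash
# Show your jobs in the queue
squeue -u $USER

# Follow the log as it is written (replace {job-id} with the real ID)
tail -f slurm-{job-id}.out
```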