Demo project

We will illustrate how to use the GPUs to train a Python project, using conda and PyTorch.

Running a Python project on the GPUs in Kaya involves these steps:

  • Step 1: Load the code and create the conda environment on the login node

  • Step 2.1: Interactively allocate a compute node, then run the code

  • Step 2.2: Queue the job and review the results

Load the code and create the conda environment

We will use a public GitHub repo as the example: https://github.com/soledad921/TeLM

If you want to load your own code, you have several options (a quick scp sketch follows the list):

  • pull it down from GitHub

  • scp the files to the login node

  • or use VS Code to edit directly on the login node

    • But do not run heavy code there!
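
For instance, to copy a local project to the login node with scp; the hostname below is a placeholder, so substitute the actual Kaya login address:

scp -r ./my_project your_username@kaya.example.edu:~/    # hostname is a placeholder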

We will first log in to the login node and then run module avail
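
For example (again, the hostname is a placeholder; use the real Kaya login address):

ssh your_username@kaya.example.edu
module avail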

You should see a list of the available modules. The first step for us is to load the anaconda3 module.
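
The exact module name and version vary by system, so copy them from the module avail output. It will look something like:

module load Anaconda3    # copy the exact name/version listed by `module avail`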

And then run

conda init

At this point conda is available in your shell (you may need to log out and back in for conda init to take effect).

Note: on both the login node and the compute nodes, we do not have permission to install software with the apt-get install command. Instead, we load software from the available modules, and if a module does not exist we need to request that it be installed. The good news is that a Python project normally does not need any extra system software; everything we need can be installed with pip install or conda install.

Then we create a new conda environment with the command:
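
A sketch of the command; the environment name and Python version here are assumptions, so match them to the repo's README:

conda create -n telm python=3.8    # name and Python version are assumptions
conda activate telm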

Then load the code via
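
Using the repo linked above:

git clone https://github.com/soledad921/TeLM.git
cd TeLM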

And follow the instructions in the repo's README to install the requirements:
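
Assuming the repo ships a requirements.txt, as many Python projects do (check the README for the exact steps):

pip install -r requirements.txt    # run inside the activated conda environment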

At this point the environment is ready; next we will run the code.

Interactively

To allocate a GPU node and run the code interactively, run the command:
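
A minimal sketch using salloc; the partition name, GPU count, and time limit are assumptions, so check Kaya's queue documentation for the correct values:

salloc --partition=gpu --gres=gpu:1 --time=01:00:00    # partition and limits are assumptions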

After we run the command above, we are allocated a compute node and land in the same directory we were in on the login node, since the filesystem is shared between nodes.

Next we will need to activate the conda environment.
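
On the compute node, load the module again and activate the environment (same names as before):

module load Anaconda3
conda activate telm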

Verify the installation and run the training process
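
For example, you can confirm PyTorch sees the GPU before starting training; the training entry point below is an assumption, so use the command from the TeLM README:

python -c "import torch; print(torch.cuda.is_available())"    # should print True on a GPU node
python main.py    # assumed entry point; follow the repo's README for the actual command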

If everything is set up correctly, the check prints True and the training output starts to appear.

In this example the allocated node is node001, so while the job is running we can watch the GPU there:
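
For instance, from another shell on the login node (ssh to a compute node usually only works while you hold an allocation on it; node001 is just this example's node):

squeue -u $USER           # shows your running jobs and their nodes
ssh node001 nvidia-smi    # replace node001 with your allocated node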

Queue the job

Create a file in your project root. You can name it anything you like; we will call it job.slurm.
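
A minimal sketch of what job.slurm might contain; the partition, resource limits, environment name, training command, and output directory are all assumptions to adapt:

#!/bin/bash
#SBATCH --job-name=telm
#SBATCH --partition=gpu         # partition name is an assumption; check Kaya's docs
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --mem=16G

module load Anaconda3           # copy the exact name from `module avail`
eval "$(conda shell.bash hook)" # makes `conda activate` work in a batch shell
conda activate telm

python main.py                  # assumed entry point; use the command from the README

# Copy results to the shared folder referenced below; ./results is an assumed output dir
mkdir -p $MYGROUP/conda_results/$SLURM_JOB_ID
cp -r ./results $MYGROUP/conda_results/$SLURM_JOB_ID/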

Then we can queue the job with the command:
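
sbatch is the standard SLURM submission command, and job.slurm is the file we just created:

sbatch job.slurm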

It will print the job ID. You can then follow the progress in the file slurm-{job-id}.out in the current folder, or find the results in the folder $MYGROUP/conda_results/{job_id}/ after the job finishes.
