
Demo project


We will illustrate how to use the GPUs on Kaya to train a Python project, using conda and PyTorch.

Running a Python project on the GPUs in Kaya involves these steps:

  • Step 1: Load the code and create the conda environment on the login node

  • Step 2.1: Interactively, allocate a compute node, then run the code

  • Step 2.2: Queue the job and review the results

Load codes and create the conda environment

We will use a public GitHub repo as the example: https://github.com/soledad921/TeLM

If you want to load your own code, you have several options:

  • pull it down from GitHub

  • scp the files to the login node

  • or use VS Code to edit directly on the login node

    • But do not run heavy code there!

We will first log in to the login node and then run module avail

You should see a list of available modules. As a first step, we need to load the Anaconda3 module. Run the command

module load Anaconda3/2020.11

And then run

conda init

At this point, you have conda loaded.
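A quick way to confirm conda is now on your path (a minimal check, nothing project-specific):

# should print the conda version provided by the Anaconda3 module we just loaded
conda --version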

Note: on the login node and the compute nodes we do not have permission to install software with the apt-get install command. We need to load software from the available modules, and if something is not there we need to request that it be installed. The good news is that for a Python project we normally do not need any extra system software; everything we need can be installed with

pip install or conda install.

Then we create a new conda environment with the command:

conda create --prefix $MYGROUP/env/telm_env  python=3.7
# conda activate your environment after creation
# do not create it under your home directory
# make sure the prefix for conda is your group directory
# this is super important
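After creation, you need to activate the environment. For a prefix-based environment, the most reliable way is usually to activate it by its full path, a minimal sketch assuming the prefix used above:

# activate the environment by its full prefix path
conda activate $MYGROUP/env/telm_env
# confirm which python is now being used
which python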

Then load the code via

git clone  https://github.com/soledad921/TeLM.git

Then follow the instructions in the repo's README and install the requirements:

# because we are on the HPC, we need to check the nvidia-smi CUDA version; it is currently 11.6
# so we install torch with this command
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu116
# after this command succeeds, run the following command
pip install tqdm numpy scikit-learn scipy
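To check that the CUDA-enabled build of PyTorch was installed, a minimal check is the one-liner below; run it inside the activated environment, and note that on the login node it may report False simply because there is no GPU there:

# print the torch version and whether a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"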

At this point the environment is ready; next we will run the code.

Interactively

To allocate a GPU node and run interactively, use the command:

salloc -p gpu --mem=16G -N 1 -n 8 --gres=gpu:v100:1
# v100 is our best GPU

After running the command above, we are allocated a compute node, and we land in the same directory we were in on the login node.
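To confirm the allocation worked, you can check which node you are on and that a GPU is visible (a minimal sketch; the node name will differ for you):

# show the compute node we landed on
hostname
# show the allocated GPU
nvidia-smi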

Next we will need to activate the conda environment.

conda activate telm_env
# then, as instructed by the repo, run
python setup.py install

Verify the installation and run the training process:

cd tkbc
python process_timegran.py --tr 100 --dataset yago11k
python learner.py --dataset yago11k --model TeLM --rank 2000 --emb_reg 0.025 --time_reg 0.001

You should see the training start. In this example the allocated node is node001, and you can watch the job running there.
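If you want to confirm the GPU is actually being used while training runs, one option is to watch nvidia-smi from a second terminal (a minimal sketch; node001 is just the node allocated in this example, and it assumes the cluster allows ssh to nodes where you hold an allocation):

# from the login node, open a shell on the allocated compute node
ssh node001
# refresh the GPU utilisation view every two seconds
watch -n 2 nvidia-smi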

Queue the job

Create a file in your project root; name it anything you like, here we call it job.slurm

```shellscript
#!/bin/bash -l
#SBATCH --job-name=nlp_tlp_test
# the partition is the queue you submit your job to
# there is a day partition with max walltime 24:00:00
# there is a week partition with max walltime 7-00:00:00
#SBATCH --partition=gpu
# nodes is the number of compute nodes
#SBATCH --nodes=1
# ntasks is the number of cpu cores; the max number is 34
#SBATCH --ntasks=10
#  gres defines the number of gpus: 1 or 2 
#SBATCH --gres=gpu:v100:1
# walltime: the max time depends on the partition
#SBATCH --time=3-00:00:00
#SBATCH --export=ALL
#SBATCH --mem=256G
# activate your conda environment (use the full group-path prefix)
conda activate your_env_name_with_your_group_path

# this part below can be universal
#  Note: SLURM_JOBID is a unique number for every job.
# to use NVMe, uncomment the following line
#SCRATCH=/tmp/$USER/run_conda/$SLURM_JOBID
# to use MYSCRATCH space
SCRATCH=$MYSCRATCH/run_conda/$SLURM_JOBID
RESULTS=$MYGROUP/conda_results

###############################################
# Creates a unique directory in the SCRATCH directory for this job to run in.
if [ ! -d $SCRATCH ]; then 
    mkdir -p $SCRATCH 
fi 
echo Working SCRATCH directory is $SCRATCH

###############################################
# Creates a unique directory in your GROUP directory for the results of this job
if [ ! -d $RESULTS ]; then 
     mkdir -p $RESULTS
fi 
echo Results will be stored in $RESULTS/$SLURM_JOBID

#############################################
#   Copy input files to $SCRATCH
#   then change directory to $SCRATCH
cd ${SLURM_SUBMIT_DIR}
# then copy the whole folder to the scratch folder
cp -r * ${SCRATCH}

cd ${SCRATCH}

# under this will be related to your code
ls -al
cd tkbc
ls -al

python learner.py --dataset yago11k --model TeLM --rank 2000 --emb_reg 0.025 --time_reg 0.001

#############################################
#   Move the output in $SCRATCH to the unique results dir
# note this can be a copy or a move
cd $HOME
mv ${SCRATCH} ${RESULTS}


echo Conda ML gpu job finished at  `date`
```

Then we can queue the job with the command

sbatch job.slurm

It will give you the job id. You can then check the progress in the file slurm-{job-id}.out in the current folder, or find the results in $MYGROUP/conda_results/{job_id}/ after the job has finished.
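A few commands that are handy for monitoring the queued job (a minimal sketch; the job id is whatever sbatch printed):

# list your jobs and their state (PD = pending, R = running)
squeue -u $USER
# follow the job output as it is written
tail -f slurm-{job-id}.out
# cancel the job if something went wrong
scancel {job-id}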
