If you want to get your code onto the cluster, you have several options:
- pull it down from GitHub,
- scp the files to the login node (see the example below), or
- edit directly on the login node with VS Code.
But do not run heavy code there! The login node is only for editing, file transfers, and submitting jobs.
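For the scp route, something like the following works from your local machine; the hostname and destination path here are placeholders, so substitute your cluster's address and your own group directory.
```shellscript
# run this on your local machine, not on the cluster
# hpc.example.edu and the destination path are placeholders
scp -r ./TeLM your_username@hpc.example.edu:/path/to/your/group/dir/TeLM
```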
We will first log in to the login node and run module avail.
You should see the list of available modules. The first step for us is to load the Anaconda3 module; run:
module load Anaconda3/2020.11
And then run
conda init
At this point conda is available in your shell (conda init edits your shell startup files, so you may need to log out and back in, or run source ~/.bashrc, for it to take effect).
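A quick sanity check (optional) to confirm that conda is the one provided by the Anaconda3 module:
```shellscript
# both should point at the Anaconda3 module installation
which conda
conda --version
```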
Note: on both the login node and the compute nodes we do not have permission to install software with apt-get. We have to load software from the available modules, and if something is not there we need to request that it be installed. The good news is that for our Python project we normally do not need any extra system software; everything we need can be installed with
pip install or conda install.
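If you want to check whether a particular piece of software is already provided as a module, filtering the module list is usually enough. This is a generic sketch (the module names on your cluster will differ); note that module avail prints to stderr, hence the redirect:
```shellscript
# search the available modules, e.g. for CUDA-related entries
module avail 2>&1 | grep -i cuda
```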
Then we create a new conda environment with:
conda create --prefix $MYGROUP/env/telm_env python=3.7
# activate the environment after creation (see below)
# do not create it under your home directory
# make sure the conda prefix points at your group directory
# this is super important
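Because the environment was created with --prefix, you activate it by its full path rather than by name. The envs_dirs tweak below is optional and not part of the original instructions; it just lets the short name work as well.
```shellscript
# activate the prefix environment by its full path
conda activate $MYGROUP/env/telm_env

# optional: register the group env directory so "conda activate telm_env" also works
conda config --append envs_dirs $MYGROUP/env
```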
Then pull down the code:
git clone https://github.com/soledad921/TeLM.git
and follow the instructions in the repo's README to install the requirements:
# because we are on the HPC, we first check the CUDA version reported by nvidia-smi; it is currently 11.6
# so we install the torch build for CUDA 11.6
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu116
# after that succeeds, install the remaining requirements
pip install tqdm numpy scikit-learn scipy
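To check that the CUDA build of PyTorch went in correctly, a one-liner like this is handy; note that torch.cuda.is_available() only returns True on a GPU node, not on the login node.
```shellscript
# prints the torch version, its CUDA build, and whether a GPU is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```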
At this point the environment is ready; next we run the code.
Interactively
To allocate a GPU node and work on it interactively, run:
salloc -p gpu --mem=16G -N 1 -n 8 --gres=gpu:v100:1
# v100 is our best GPU
After running the command above we are allocated a compute node, and we land in the same directory we were in on the login node.
Next we activate the conda environment (by its full prefix path, since it lives in the group directory):
conda activate $MYGROUP/env/telm_env
# then, as instructed by the repo, run
python setup.py install
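If python setup.py install fails (for example with a permissions error), installing the package in editable mode into the active conda environment is a common workaround. This is my suggestion, not something from the TeLM README.
```shellscript
# editable install into the currently active conda environment
pip install -e .
```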
In my case the allocated node is node001; the node name shows up in your shell prompt (and in squeue) once the allocation starts.
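To confirm which node you got and that the GPU is visible, these two checks work from inside the interactive session:
```shellscript
# show your running/pending jobs and the node they were allocated
squeue -u $USER

# confirm the V100 is visible from the compute node
nvidia-smi
```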
Queue the job
Create a file in your project root; name it anything you like, we call it job.slurm.
```shellscript
#!/bin/bash -l
#SBATCH --job-name=nlp_tlp_test
# the partition is the queue you submit your job to
# there is a day partition with a max walltime of 24:00:00
# and a week partition with a max walltime of 7-00:00:00
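# (optional, not part of the original script) you can list the partitions and
# their walltime limits on the login node with:
#   sinfo -o "%P %l"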
#SBATCH --partition=gpu
# nodes is the number of compute nodes
#SBATCH --nodes=1
# ntasks is the number of CPU cores; the max is 34
#SBATCH --ntasks=10
# gres defines the number of gpus: 1 or 2
#SBATCH --gres=gpu:v100:1
# walltime: the maximum depends on the partition
#SBATCH --time=3-00:00:00
#SBATCH --export=ALL
#SBATCH --mem=256G
# activate the conda environment created earlier (use the full prefix path under your group directory)
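# note (my addition): if conda activate fails inside the batch job, load the
# Anaconda module and enable conda's shell hook first, e.g.:
#   module load Anaconda3/2020.11
#   eval "$(conda shell.bash hook)"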
conda activate your_env_name_with_your_group_path
# this part below can be universal
# Note: SLURM_JOBID is a unique number for every job.
# to use local NVMe scratch, uncomment the following line
#SCRATCH=/tmp/$USER/run_conda/$SLURM_JOBID
# to use MYSCRATCH space
SCRATCH=$MYSCRATCH/run_conda/$SLURM_JOBID
RESULTS=$MYGROUP/conda_results
###############################################
# Creates a unique directory in the SCRATCH directory for this job to run in.
if [ ! -d $SCRATCH ]; then
mkdir -p $SCRATCH
fi
echo Working SCRATCH directory is $SCRATCH
###############################################
# Creates a unique directory in your GROUP directory for the results of this job
if [ ! -d $RESULTS ]; then
mkdir -p $RESULTS
fi
echo Results will be stored in $RESULTS/$SLURM_JOBID
#############################################
# Copy input files to $SCRATCH
# then change directory to $SCRATCH
cd ${SLURM_SUBMIT_DIR}
# then copy the whole folder to the scratch folder
cp -r * ${SCRATCH}
cd ${SCRATCH}
# everything below here is specific to your code
ls -al
cd tkbc
ls -al
python learner.py --dataset yago11k --model TeLM --rank 2000 --emb_reg 0.025 --time_reg 0.001
#############################################
# move the output to the unique results dir
# note: this can be a copy (cp -r) or a move (mv)
cd $HOME
mv ${SCRATCH} ${RESULTS}
echo Conda ML gpu job finished at `date`
```
Then we can queue the job with:
sbatch job.slurm
It will print the job id; you can then follow the progress in the file slurm-{job_id}.out in the submission folder, or find the results in $MYGROUP/conda_results/{job_id}/ after the job finishes.
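A few standard Slurm commands are handy while the job is queued or running (the job id below is a placeholder):
```shellscript
# check the state of your jobs in the queue
squeue -u $USER

# follow the log output live
tail -f slurm-<job_id>.out

# cancel a job if something went wrong
scancel <job_id>
```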