Battle notes

We will collect notes here about problems we encountered and investigated, to save time later and for reference.

Set HuggingFace Cache Folder

If you want to load a model from Hugging Face, the library needs to download it from the Hub and save it somewhere. This "somewhere" is a cache directory, which by default is ~/.cache/huggingface/hub.

This is fine on your local machine. On Kaya, however, it is not, because storage under your home directory (~) is limited. You will need to point the cache at another directory, normally one under your group or scratch folder.
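If you want a quick sanity check of how much space is available, Python's standard library can report filesystem usage (a minimal sketch; note that cluster quotas are often enforced separately from raw filesystem free space, so your site's quota tool is the authoritative source):

import os
import shutil

# Report free space on the filesystem that holds the home directory.
total, used, free = shutil.disk_usage(os.path.expanduser("~"))
print(f"home filesystem: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")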

To achieve that, the official guide gives several ways, either in code or via shell environment variables.
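For reference, the in-code route looks roughly like this (a minimal sketch; the model name is just an example, the cache path is the one used later on this page, and as noted below this approach may not behave reliably on Kaya):

import os

# HF_HOME must be set before transformers is imported,
# because the cache location is resolved at import time.
os.environ["HF_HOME"] = "/group/pmc010/sli/LLM_NER/LLM-Eva/cache_here/"

from transformers import AutoModel

# Alternatively, pass cache_dir explicitly for a single download.
model = AutoModel.from_pretrained(
    "bert-base-uncased",  # example model only
    cache_dir="/group/pmc010/sli/LLM_NER/LLM-Eva/cache_here/",
)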

Specifically on Kaya, setting it inside your Python code will probably not work, likely because the cache location is resolved as soon as the library is imported, so the variable is easy to set too late. Set it inside your .slurm file instead, something like this:

#!/bin/bash -l
#SBATCH --job-name=train
#  the partition is the queue you submit your job to
#SBATCH --partition=gpu
# nodes is the number of compute nodes
#SBATCH --nodes=1
# ntasks is the number of CPU cores; the max is 34
#SBATCH --ntasks=5
#  gres defines the number of GPUs: 1 or 2
#SBATCH --gres=gpu:v100:1
#  walltime: the maximum run time depends on the partition
#SBATCH --time=3-00:00:00
#SBATCH --export=ALL
#SBATCH --mem=256G
# Activate the conda environment for this project
conda activate /group/pmc010/sli/llm_ner

# Tell HuggingFace where to put its cache (downloaded models and datasets)
export HF_HOME=/group/pmc010/sli/LLM_NER/LLM-Eva/cache_here/

#  Note: SLURM_JOBID is a unique number for every job.
SCRIPT=train.py
# Run the training script
python $SCRIPT

The export HF_HOME line is the key one that makes this work.
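To confirm the setting actually took effect inside the job, a quick check at the top of train.py can help (a sketch, assuming a recent version of huggingface_hub; the hub cache should resolve to a path under HF_HOME):

import os
from huggingface_hub import constants

# Print where downloads will actually land; this should point under HF_HOME.
print("HF_HOME   =", os.environ.get("HF_HOME"))
print("hub cache =", constants.HF_HUB_CACHE)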
