Kaya
About the UWA high-performance computational research platform.
What is Kaya
Kaya means "hello" in an Aboriginal language of Western Australia.
The high-performance computational research platform is named after this greeting.
The documentation page for Kaya is here: https://docs.hpc.uwa.edu.au/docs/user/getaccess/ .
To access this page, you need to be within the UWA internal network. If you are outside UWA, you will need to connect via the UWA VPN. Even within the UWA campus network, if you are connected via Ethernet you will not be able to access it; you must be on the Unifi (wireless) network.
Here is the guide on how to connect to the UWA campus network via the UniConnect VPN: https://www.it.uwa.edu.au/it-help/access/uniconnect
How to apply for access?
You will need to apply for access via the university IT Helpdesk; how to apply is described here: https://docs.hpc.uwa.edu.au/docs/user/getaccess/
After your application is approved, you will receive an email with your login details and a temporary password.
The email will contain your username, a random password, your SLURM project, your allocated partitions (the different queues), and your data folder.
How to connect to Kaya?
First, as above, you need to be on the UWA campus network; if you are outside UWA, connect via the UniConnect VPN first.
Then you can log in via the Secure Shell protocol (SSH).
From a Linux machine (or a Mac) you can log in directly using the terminal application, e.g.
ssh your_username@kaya.hpc.uwa.edu.au
From a Windows machine you can use an SSH client (e.g. PuTTY) to get access to the system.
Data transfers can be done using Secure Copy (SCP) or an SFTP application (FileZilla/WinSCP on Windows).
When you log in for the first time, you will need to change your password.
If you do not want to type the password every time, you can use the command ssh-copy-id:
ssh-copy-id your_username@kaya.hpc.uwa.edu.au
then type in your password once; the next time you log in, you will only need to type
ssh your_username@kaya.hpc.uwa.edu.au
Going a bit further, you can edit your ~/.ssh/config file and add an entry with the Kaya settings like the one below. Then you can simply run ssh kaya every time to log in, without typing anything else.
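The exact entry depends on your account, but as a minimal sketch (using the same login hostname as in the ssh example above, with your_username as a placeholder) it could look like this:

Host kaya
    HostName kaya.hpc.uwa.edu.au
    User your_username

With this entry in place, ssh kaya is enough to reach the login node, and combined with ssh-copy-id above no password is needed.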
Architecture of Kaya
So when you log in, you will be on the Login Node, which everyone shares after they log in. This means that if you run heavy-duty jobs here, it will cause drama for everyone using the system, so do not do that.
Compute Nodes
So this is the most interesting part: how many resources do we have?
The documentation page is not always up to date.
For the GPU part (as of 2023-12-15):
We have 5 nodes, each with 2 V100 GPUs; the memory of each card is 32 GB.
We have 10 nodes, each with 4 P100 GPUs; the memory of each card is 16 GB.
Each GPU node should also come with dual Intel Xeon CPUs (36 cores) and 768 GB RAM.
For the CPU part, there are 15 medium nodes (dual Xeon CPUs, 256 GB RAM) and 5 large nodes (dual Xeon CPUs, 512 GB to 1.5 TB RAM).
To make use of the different resources, you will need to queue your jobs into the corresponding partitions.
Storage
In total, we have 120 TB of storage on Kaya.
There are four related folders, but only two of them need our attention:
/group : the main folder where the project data lives, under /group/your_project_id/ and /group/your_project_id/your_user_id . All users within your group can access it.
/scratch : the folder where the intermediate results of running jobs are stored. It is limited to 30 TB of storage. You will need to copy your results back to /group after the job finishes, otherwise you will lose your data, as the system sweeps it periodically (see the example after this list).
/home : used to store small data like source code, scripts, etc.
/tmp : the standard place within the Linux system to store temporary files.
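As a rough sketch of that copy-back step (the project, user, and folder names below are placeholders, not real paths on Kaya), you could add something like this to the end of your job script, or run it manually before the sweep:

cp -r /scratch/your_project_id/your_user_id/my_run_output /group/your_project_id/your_user_id/
# or with rsync, which preserves timestamps and can resume an interrupted copy:
rsync -a /scratch/your_project_id/your_user_id/my_run_output/ /group/your_project_id/your_user_id/my_run_output/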
Transfer data
So if you want to run your project and access your data, you will need to upload your data to the /group folder, and then grab your results from /scratch . To achieve that, we normally use the scp command, for example:
scp myhpcdatafile your_username@kaya.hpc.uwa.edu.au:/group/projectid/myuserid
Reversing the source and destination of the command copies data from Kaya back to your local machine.
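For instance (the remote file name result.txt is only an illustration; use your own path under /group or /scratch), downloading a result file into the current directory on your local machine could look like:

scp your_username@kaya.hpc.uwa.edu.au:/group/projectid/myuserid/result.txt .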
You can also use FileZilla, which provides a graphical interface for this.
After your data is under the /group folder, it will be accessible from the compute nodes, as shown in the diagram above.
Notice: Kaya does not provide persistent data storage, so keep your data stored somewhere else, or use iRDS (which is basically a large Google Drive/Dropbox/OneDrive-style service). So you will need to download the important data and back it up there.
Monitor Tools
To monitor the status of Kaya, you can use the URLs below (both are only accessible via the campus network).
https://monitor.hpc.uwa.edu.au/
This one monitors the job queue and CPU/GPU load; it shows the current status.
https://metrics.hpc.uwa.edu.au/#main_tab_panel:tg_summary
This one is for historical summaries, for example which project used the most resources in the past month.
Access Kaya interactively
You can access it in the normal, standard way via the SLURM system; we will talk about this later. But to actively debug and test your models and code, you will want to access it in a more interactive way.
There are currently two workable, but not perfect, ways you can try:
Command line + VS Code Remote
Access the login node via VS Code Remote SSH, then edit the code in your home directory.
Your code is also accessible from the compute nodes, so you can request an allocation of compute nodes, access them via bash, and manually test your code on the compute node.
If you need to update the code, update it in VS Code Remote SSH via the login node, and then run it on the compute node via the command line.
Ondemand VNC
Log in with your Kaya username/password.
This is a web-based interface for you to queue, view, and manage your jobs and files.
You can do shell access and desktop VNC access here.
Note: desktop VNC access for GPU nodes is currently not supported; the HPC team is working on that.
VSCode Remote SSH + Command Line
To get ready, you will need to have VS Code installed, along with the Remote - SSH extension.
VS Code: https://code.visualstudio.com/
Introduction about Remote SSH VSCode: https://code.visualstudio.com/docs/remote/ssh
If you have edited ~/.ssh/config as described above, when you click the Remote SSH button you should see Kaya pop up; click it. You will be prompted to enter the password if you have not set up ssh-copy-id.
Then you should be able to open the folders on the login node via VS Code, like in the image here:
Then open the terminal and run a command to request a GPU compute node:
salloc -p gpu --mem=16G -N 1 -n 8 --gres=gpu:v100:1
After this, you will be in the compute node's bash shell, and you can run nvidia-smi to check the GPU status.
After this, the setup process is done. Because the /group and /home folders are mounted on both the login and compute nodes, the files are shared: when you update the code on the login node, the scripts on the compute node are updated accordingly. In this way, you can debug your scripts interactively by running them in the bash shell of the compute nodes.
This is not a perfect solution, but it allows you to hold the nodes for a while and do relatively quick debugging.
Ondemand VNC with CPU or GPU
There is a web service set up by the HPC team at https://ondemand.hpc.uwa.edu.au . After you log in with your username and password, you can manage all the HPC work you want to do within the browser, including files, jobs, cluster SSH access, and VNC access. For context, you can think of VNC as a remote desktop, so you can access the machine through a GUI.
You can explore each tab by yourself and easily understand what you can do there; we will focus on Interactive Apps => Kaya Ondemand (this is for CPU resources).
The account and partition are prefilled via the .kaya-env.sh file we mentioned above; if you want to switch to a new project, you will need to update them here. The account is your project ID, and the partition can only be ondemand for now. (It will launch a CPU machine; GPU support with one node is now available.)
To launch it with a GPU, use the interface below.
After clicking Launch, you will see an interface like this:
Click Launch Kaya OnDemand, and you will be able to see a Linux desktop within the browser.
After this, you will be able to do anything you want. We will talk about how to set up a Python project and run it on Kaya later.
Access Kaya via SLURM
The Simple Linux Utility for Resource Management (SLURM) is a standard resource manager widely used on many supercomputers across the world.
In general, it is a batch job submission system. You write a script describing what you want to do on the compute nodes, and then you submit the job to a queue. SLURM will allocate the resources to your job and run all the commands in the script.
So in this context, the most important part is the SLURM job script. It is also the standard way to use Kaya and other HPC systems.
SLURM cheatsheet: https://slurm.schedmd.com/pdfs/summary.pdf
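As a quick reference, these standard SLURM commands are the ones you will use most often (all of them are covered in the cheatsheet above; the script name and job ID below are placeholders):

sbatch job_script.sh        # submit a batch job script to a queue
squeue -u your_username     # list your queued and running jobs
scancel 12345               # cancel the job with this job ID
sinfo                       # show the partitions and the state of their nodes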
Job Scripts
The basic unit of work in the HPC system is a job script. A job script is a shell script that performs a number of important steps:
It defines the resources required by the job, e.g. the number of cores and/or the RAM required.
It sets up the environment for the application code to run.
It sets up the data required for the job.
It executes the application/code.
It cleans up input data, saves output files, etc.
Slurm has a number of different queues (officially called “partitions”). Queues are designed to target different types of machines.
Queues currently configured on Kaya are:
test: 10 min time limit (all nodes)
work: 3 days
gpu: 3 days
long: 1 week
ondemand: 4 hours
Example
Job script
This is the job script file; you put it somewhere in your home directory or project directory.
It will load Python and then run an inline Python script that checks the GPU status and writes the output to result.txt. The job name can be any name you want; put the job into a proper partition.
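A minimal sketch of such a script is below. The SBATCH resource values and the Python module name/version are assumptions for illustration; check module avail python on Kaya for the module you actually need. If you save it as job_scripts.sh, the commands in the next section will work as written.

#!/bin/bash
#SBATCH --job-name=check_gpu       # the job name can be anything you like
#SBATCH --partition=gpu            # put the job into a proper partition
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --mem=16G
#SBATCH --output=result.txt        # stdout (and stderr) goes to result.txt

module load python/3.9             # assumed module name; adjust to what Kaya provides

# inline Python script that prints the GPU status
python - <<'EOF'
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
EOF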
Queue and check
Queue the script:
sbatch job_scripts.sh
Check the job:
squeue -u your_username
You can also create the scripts, queue them, and check the job status via OnDemand.