Kaya
About the UWA high-performance computational research platform.
What is Kaya
Kaya means "hello" in an Aboriginal language of Western Australia.
The high-performance computational research platform is named after this greeting.
The documentation page for Kaya is here: https://docs.hpc.uwa.edu.au/docs/user/getaccess/ .
To access this page, you need to be within the UWA internal network. If you are outside UWA, you will need to connect via the UWA VPN. Even within the UWA campus network, if you are connected via Ethernet you will not be able to access it; you must be on the Unifi (wireless) network.
Here is the guide on how to connect to the UWA campus network via the UniConnect VPN: https://www.it.uwa.edu.au/it-help/access/uniconnect
How to apply for access?
You will need to apply for access via the university IT Helpdesk; how to apply is described here: https://docs.hpc.uwa.edu.au/docs/user/getaccess/
After your application is approved, you will receive an email with your login details and a temporary password.
The email will contain your username, a random password, your SLURM project, your allocated partitions (the different queues), and your data folder.
How to connect to Kaya?
First, as above, you need to be on the UWA campus network; if you are outside UWA, connect via the UniConnect VPN first.
Then you can log in via the Secure Shell protocol (SSH).
From a Linux machine (or a Mac) you can log in directly using the terminal application, e.g.
ssh your_username@kaya.hpc.uwa.edu.au
From a Windows machine you can use an SSH client (e.g. PuTTY) to get access to the system.
Data transfers can be done using Secure Copy (SCP) or an SFTP application (FileZilla/WinSCP on Windows).
When you log in for the first time, you will need to change your password.
If you do not want to type the password every time, you can use the command ssh-copy-id:
ssh-copy-id your_username@kaya.hpc.uwa.edu.au
then type in your password once; the next time you log in, you will only need to type
ssh your_username@kaya.hpc.uwa.edu.au
Going a bit further, you can edit your ~/.ssh/config file and add an entry with the Kaya settings like the one below. Then you can simply run ssh kaya every time to log in, without typing anything else.
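The exact entry depends on your account, but as a minimal sketch (using the same login hostname as in the ssh example above, with your_username as a placeholder) it could look like this:

Host kaya
    HostName kaya.hpc.uwa.edu.au
    User your_username

With this entry in place, ssh kaya is enough to reach the login node, and combined with ssh-copy-id above no password is needed.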
Architecture of Kaya
So when you log in, you will be on the Login Node, which everyone shares after they log in. This means that if you run heavy-duty jobs here, it will cause drama for everyone using the system, so do not do that.
Compute Nodes
So this is the most interesting part: how many resources do we have?
The documentation page is not always up to date.
For the GPU part (as of 2023-12-15):
We have 5 nodes, each with 2 V100 GPUs; the memory of each card is 32 GB.
We have 10 nodes, each with 4 P100 GPUs; the memory of each card is 16 GB.
Each GPU node should also come with dual Intel Xeon CPUs (36 cores) and 768 GB RAM.
For the CPU part, there are 15 medium nodes (dual Xeon CPUs, 256 GB RAM) and 5 large nodes (dual Xeon CPUs, 512 GB to 1.5 TB RAM).
To make use of the different resources, you will need to queue your jobs into the corresponding partitions.
Storage
In total, we have 120 TB of storage on Kaya.
There are four related folders, but only two of them need our attention:
/group : the main folder where the project data lives, under /group/your_project_id/ and /group/your_project_id/your_user_id . All users within your group can access it.
/scratch : the folder where the intermediate results of running jobs are stored. It is limited to 30 TB of storage. You will need to copy your results back to /group after the job finishes, otherwise you will lose your data, as the system sweeps it periodically (see the example after this list).
/home : used to store small data like source code, scripts, etc.
/tmp : the standard place within the Linux system to store temporary files.
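As a rough sketch of that copy-back step (the project, user, and folder names below are placeholders, not real paths on Kaya), you could add something like this to the end of your job script, or run it manually before the sweep:

cp -r /scratch/your_project_id/your_user_id/my_run_output /group/your_project_id/your_user_id/
# or with rsync, which preserves timestamps and can resume an interrupted copy:
rsync -a /scratch/your_project_id/your_user_id/my_run_output/ /group/your_project_id/your_user_id/my_run_output/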
Transfer data
So if you want to run your project and access your data, you will need to upload your data to the /group folder, and then grab your results from /scratch . To achieve that, we normally use the scp command, for example:
scp myhpcdatafile your_username@kaya.hpc.uwa.edu.au:/group/projectid/myuserid
Reversing the source and destination of the command copies data from Kaya back to your local machine.
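For instance (the remote file name result.txt is only an illustration; use your own path under /group or /scratch), downloading a result file into the current directory on your local machine could look like:

scp your_username@kaya.hpc.uwa.edu.au:/group/projectid/myuserid/result.txt .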
You can also use FileZilla, which provides a graphical interface for this.
After your data is under the /group folder, it will be accessible from the compute nodes, as shown in the diagram above.
Notice: Kaya does not provide persistent data storage, so keep your data stored somewhere else, or use iRDS (which is basically a large Google Drive/Dropbox/OneDrive-style service). So you will need to download the important data and back it up there.
Monitor Tools
To monitor the status of Kaya, you can use the URLs below (both are only accessible via the campus network).
https://monitor.hpc.uwa.edu.au/
This one monitors the job queue and CPU/GPU load; it shows the current status.
https://metrics.hpc.uwa.edu.au/#main_tab_panel:tg_summary
This one is for historical summaries, for example which project used the most resources in the past month.
Access Kaya interactively
You can access it in the normal, standard way via the SLURM system; we will talk about this later. But to actively debug and test your models and code, you will want to access it in a more interactive way.
There are currently two workable, but not perfect, ways you can try:
Command line + VS Code Remote
Access the login node via VS Code Remote SSH, then edit the code in your home directory.
Your code is also accessible from the compute nodes, so you can request an allocation of compute nodes, access them via bash, and manually test your code on the compute node.
If you need to update the code, update it in VS Code Remote SSH via the login node, and then run it on the compute node via the command line.
Ondemand VNC
Log in with your Kaya username/password.
This is a web-based interface for you to queue, view, and manage your jobs and files.
You can do shell access and desktop VNC access here.
Note: desktop VNC access for GPU nodes is currently not supported; the HPC team is working on that.
VSCode Remote SSH + Command Line
To get ready, you will need to have VS Code installed, along with the Remote - SSH extension.
VS Code: https://code.visualstudio.com/
Introduction about Remote SSH VSCode: https://code.visualstudio.com/docs/remote/ssh
If you have edited ~/.ssh/config as described above, when you click the Remote SSH button you should see Kaya pop up; click it. You will be prompted to enter the password if you have not set up ssh-copy-id.
Then you should be able to open the folders on the login node via VS Code, like in the image here:
Then open the terminal and run a command to request a GPU compute node:
salloc -p gpu --mem=16G -N 1 -n 8 --gres=gpu:v100:1
After this, you will be in the compute node's bash shell, and you can run nvidia-smi to check the GPU status.
After this, the setup process is done. Because the /group and /home folders are mounted on both the login and compute nodes, the files are shared: when you update the code on the login node, the scripts on the compute node are updated accordingly. In this way, you can debug your scripts interactively by running them in the bash shell of the compute nodes.
This is not a perfect solution, but it allows you to hold the nodes for a while and do relatively quick debugging.
Ondemand VNC with CPU or GPU
There is a web service set up by the HPC team at https://ondemand.hpc.uwa.edu.au . After you log in with your username and password, you can manage all the HPC work you want to do within the browser, including files, jobs, cluster SSH access, and VNC access. For context, you can think of VNC as a remote desktop, so you can access the machine through a GUI.
You can explore each tab by yourself and easily understand what you can do there; we will focus on Interactive Apps => Kaya Ondemand (this is for CPU resources).
The account and partition are prefilled via the .kaya-env.sh file we mentioned above; if you want to switch to a new project, you will need to update them here. The account is your project ID, and the partition can only be ondemand for now. (It will launch a CPU machine; GPU support with one node is now available.)
To launch it with a GPU, use the interface below.
After clicking Launch, you will see an interface like this:
Click Launch Kaya OnDemand, and you will be able to see a Linux desktop within the browser.
After this, you will be able to do anything you want. We will talk about how to set up a Python project and run it on Kaya later.
Access Kaya via SLURM
The Simple Linux Utility for Resource Management (SLURM) is a standard resource manager widely used on many supercomputers across the world.
In general, it is a batch job submission system. You write a script describing what you want to do on the compute nodes, and then you submit the job to a queue. SLURM will allocate the resources to your job and run all the commands in the script.
So in this context, the most important part is the SLURM job script. It is also the standard way to use Kaya and other HPC systems.
SLURM cheatsheet: https://slurm.schedmd.com/pdfs/summary.pdf
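As a quick reference, these standard SLURM commands are the ones you will use most often (all of them are covered in the cheatsheet above; the script name and job ID below are placeholders):

sbatch job_script.sh        # submit a batch job script to a queue
squeue -u your_username     # list your queued and running jobs
scancel 12345               # cancel the job with this job ID
sinfo                       # show the partitions and the state of their nodes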
Job Scripts
The basic unit of work in the HPC system is a job script. A job script is a shell script that performs a number of important steps:
It defines the resources required by the job, e.g. the number of cores and/or the RAM required.
It sets up the environment for the application code to run.
It sets up the data required for the job.
It executes the application/code.
It cleans up input data, saves output files, etc.
Slurm has a number of different queues (officially called “partitions”). Queues are designed to target different types of machines.
Queues currently configured on Kaya are:
test: 10 min time limit (all nodes)
work: 3 days
gpu: 3 days
long: 1 week
ondemand: 4 hours
Example
Job script
This is the job script file; you put it somewhere in your home directory or project directory.
It will load Python and then run an inline Python script that checks the GPU status and writes the output to result.txt. The job name can be any name you want; put the job into a proper partition.
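A minimal sketch of such a script is below. The SBATCH resource values and the Python module name/version are assumptions for illustration; check module avail python on Kaya for the module you actually need. If you save it as job_scripts.sh, the commands in the next section will work as written.

#!/bin/bash
#SBATCH --job-name=check_gpu       # the job name can be anything you like
#SBATCH --partition=gpu            # put the job into a proper partition
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --mem=16G
#SBATCH --output=result.txt        # stdout (and stderr) goes to result.txt

module load python/3.9             # assumed module name; adjust to what Kaya provides

# inline Python script that prints the GPU status
python - <<'EOF'
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
EOF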
Queue and check
Queue the script:
sbatch job_scripts.sh
Check the job:
squeue -u your_username
You can also create the scripts, queue them, and check the job status via OnDemand.