HPC 101

About HPC

What is HPC

HPC stands for High Performance Computing. The term refers to the field of using distributed computing resources to accelerate demanding workloads. In short, when we need to run some intensive calculations, we use many computers and combine their computational power; making that work is what HPC is about.

An HPC system consists of hardware and software components.

The hardware part normally includes a storage server, a login node, compute nodes (GPU/CPU), and the network that connects them.

The software part is normally based on Linux. Its core task is to translate your calculations into jobs on the different compute nodes and to collect the results afterwards. The system designed for this purpose is called SLURM.

So if you have a calculation that needs to be run, you submit it as a job via SLURM; the job enters a queue, and it is allocated resources and executed when they become available.
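
For example, a minimal SLURM batch script might look like the sketch below. The partition name, module name, resource numbers and script name are placeholders; check what your HPC system actually provides before copying it.

#!/bin/bash
#SBATCH --job-name=demo            # name shown in the queue
#SBATCH --partition=gpu            # placeholder partition; pick one listed by sinfo
#SBATCH --nodes=1                  # run on a single compute node
#SBATCH --ntasks=1                 # one task (process)
#SBATCH --cpus-per-task=4          # CPU cores for that task
#SBATCH --gres=gpu:1               # request one GPU (only on GPU partitions)
#SBATCH --mem=16G                  # memory for the job
#SBATCH --time=01:00:00            # walltime limit, hh:mm:ss
#SBATCH --output=demo_%j.out       # %j expands to the job ID

module load python/3.9             # module names differ between systems
python my_experiment.py            # placeholder for your own program

You submit the script with sbatch, SLURM places the job in the queue, and the output appears in the file named by --output once the job has run.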

There are many different HPC systems, and each organisation may have its own. For example, UWA has Kaya, the Pawsey centre has its own HPC, and the company DUG in WA also provides HPC for companies and researchers to use.

HPC systems from different organisations may vary in some command details (for example, Kaya does not support Singularity, a containerisation system for HPC that can run Docker images, but Pawsey does). In general, though, they are all based on the architecture above and all use SLURM to manage jobs.

Software Part

SLURM

This section is a general cheatsheet for managing jobs with SLURM; examples for specific HPC systems are given in later sections.
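
As a quick orientation before the full cheatsheets linked below, these are the everyday SLURM commands (job IDs such as 123456 are placeholders):

sbatch job.slurm          # submit a batch job script to the queue
squeue -u $USER           # list your queued and running jobs
scancel 123456            # cancel a job by its job ID
sinfo                     # show partitions and node availability
sacct -j 123456           # show accounting details for a finished job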

Module

Another important part of the software stack is the module system. On an HPC system you normally do not have permission to install whatever package you want, for security reasons. Instead, the required and commonly used software is pre-built and provided as modules. When you need a specific piece of software, for example Python 3.9 or Conda, instead of installing it via apt-get install xxx, you use module load Python3.9 and module load conda.

Before doing that, you should first check whether the module is available on the HPC system with the command module avail.
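
A typical module workflow therefore looks like the sketch below; module names such as python/3.9 vary between systems, so always check module avail first.

module avail                 # list every module available on this system
module avail python          # narrow the listing to modules matching "python"
module load python/3.9       # load a specific module into your environment
module list                  # show which modules are currently loaded
module purge                 # unload all modules and start from a clean state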

We will give concrete examples later for each HPC system we run on.

Here are some good introductory links for you to have a look at:

Slurm cheatsheet:

https://deic-hpc.github.io/EuroCC-knowledgepool/
https://slurm.schedmd.com/pdfs/summary.pdf
Module cheatsheets:
Modules - HPC Wiki
Cheatsheet for Module command