Kofugaku

Kofugaku is an 8-node ARM cluster built around the Fujitsu A64FX CPU, the same processor used in the Fugaku supercomputer. It is meant as a test and development system rather than for production work.

If you plan to use the Fugaku supercomputer, you can use Kofugaku to see whether your code will run well on it. You can also use Kofugaku to explore programming an ARM-based cluster, or just poke around for fun.

The system

The specs are as follows:

Nodes:  8
Cores per node:  48
Memory per node:  32GB
Network:  56 Gbit/s InfiniBand


The A64FX has 32GB of high-bandwidth memory per node, making memory transfers very fast. With careful instruction scheduling, each core can issue two memory fetches, two vectorized fused multiply-add (FMA) or similar instructions, and one memory store per clock cycle.

As a result, it can be faster per core for numerical code than the AMD Zen 2 CPUs we use on Deigo, while using only a fraction of the power. To achieve that performance you need to use the Scalable Vector Extension (SVE) and schedule instructions carefully. A good compiler is very important, and hand-tuning your code can yield even better results.

The limited memory and the heavy emphasis on floating-point speed mean this system is best suited to traditional HPC applications. Physics simulations such as computational fluid dynamics should run especially well; chemistry and neuroscience simulations should also run fine as long as they can cope with the limited memory.

Environment

/work is a small storage system for running jobs. Please contact us if you need a unit directory here. Please note that it's not as fast as /flash and /work on Deigo and Saion, so try to minimize any I/O access.

Your /home directory is available as usual. As Kofugaku uses a different CPU architecture, be careful not to mix software built for Kofugaku with software built for other systems. Your home directory is much slower than /work, so please use /work whenever you can.

Running your jobs

Kofugaku is part of the Saion system, and you access it through the Saion login nodes. Use the “kofugaku” partition to start a job:

$ srun -p kofugaku ... bash -l

Don’t forget the “-l” parameter to bash. For job scripts, use:

#!/bin/bash -l
#SBATCH -p kofugaku
...

You can use all eight nodes (384 cores) and 32GB of memory per node, for up to 24 hours.

Build your software

We provide no application software on this system, only developer tools. We expect that you are interested in building and running your own code, not running prebuilt software (as you may already have figured out, Kofugaku is not exactly “Baby’s First Cluster”). We do have GCC available, but our recommended toolchain is “acfl”, “armpl” and “OpenMPI”.

ACfL compiler

“ACfL” is “Arm Compiler for Linux” and is based on LLVM 13, with patches to produce good vectorized code using SVE, the Arm vector extension. “armpl” is “Arm Performance Libraries”, the companion BLAS, LAPACK and FFTW implementations for ARM systems with SVE. We build OpenMPI for Kofugaku using this toolchain.

ACfL is based on the same compiler (LLVM) as the Fujitsu compiler on Fugaku, and produces similar performance. It is reasonable to treat ACfL as a stand-in for the Fujitsu compiler, but keep in mind that neither the compilers nor the systems are identical.

Compile code

To build code, first load the modules you need:

# the compiler, MPI, and numerical libraries:
$ module load acfl/22.1 openmpi.acfl/4.1.5 armpl/22.1.0

Note: the “armpl” module will only be visible and available after you load the acfl module, so load them in this order.

Load OpenMPI if you are going to use MPI, and load armpl if you need the BLAS, LAPACK and FFTW libraries.

The compilers are named “armclang” for C, “armclang++” for C++ and “armflang” for Fortran. A typical invocation would be:

$ armclang -Ofast -Wall -Rpass=loop-vectorize -mcpu=a64fx -armpl my_program.c -o my_program

-Ofast generates fast, vectorized floating-point code, while taking a few non-essential liberties with the IEEE floating-point standard.

-Wall enables all warnings, and -Rpass=loop-vectorize informs you when the compiler was able to vectorize a loop. Always good to make sure.

-mcpu=a64fx tells the compiler to target our current CPU specifically.

-armpl links in the SVE-enabled performance libraries described above (no need to load the armpl module for this).

When you build third-party code you may need to specify the compiler through environment variables. This is a good set of settings:

# for C
$ export CC=armclang CFLAGS="-mcpu=a64fx -armpl"

# for C++
$ export CXX=armclang++ CXXFLAGS="-mcpu=a64fx -armpl"

# For Fortran
$ export FC=armflang FCFLAGS="-mcpu=a64fx -armpl"

You may (or may not) also want to use “-Ofast” instead of “-O3”, or add “-ffast-math” to the FLAGS variables above, but whether that is safe really depends on the application.

To run MPI jobs, please use “srun” without any other parameters. For example, run an MPI program called “pi_mpi” on 48 cores:

#!/bin/bash -l
#SBATCH -p kofugaku
#SBATCH -n 48
#SBATCH -t 10:00

srun ./pi_mpi

GCC

We have a couple of different GCC versions available. The “gnu/11.2.0” module is built with support for the “armpl” numerical libraries above, so you can use that combination for numerical code if you really need GCC.

Please be aware that GCC has only rudimentary optimization support for the A64FX and will generate significantly slower code on this system than the ACfL compiler.

HPCX

“hpcx” is an OpenMPI implementation for InfiniBand, but it is built with the older GCC 8 compiler and without explicit optimizations for the A64FX CPU.

Other Software

“likwid” is a performance toolkit.

“fftw” is the Fast Fourier transform library; note that the armpl library above includes FFTW with better performance.

Here are some useful documentation links for developing on Arm and using SVE:

Get Started with C/C++
C/C++ Compiler reference

Get Started with Fortran
Fortran Compiler reference

Get Started with Arm Performance Libraries
armpl reference

Porting and Optimizing code for Arm with SVE

Compilers can be fickle, and automatic vectorization can be hit and miss. If you want the best performance, and you want to control exactly how your calculations are vectorized, you can use the Arm C Language Extensions (ACLE) for SVE. These extensions let you use SVE intrinsics in your C code without having to write assembly.

You can find some good training material on the C language extensions here: ACLE on GitLab. The same repository also has a set of SVE programming examples in assembler and C.

There is a wealth of information and guides on the ARM developer website.