The Open Saion partitions
The "test-" and "intel" partitions on Saion are available to any user with an HPC account. You can use these partitions for developing and testing code, for experimenting with GPU computations or with high core-count nodes, or simply for running short jobs with less waiting time.
These partitions are not well suited for long-running computations. If you need hours or days for your job, you are probably better off using the regular "compute" partition, or you may want to apply for access to a restricted resource that will give you plenty of compute time and resources. Please read the descriptions below before you decide to use them.
The Saion system is set up slightly differently from the main cluster. Please read the Saion introduction here for the best way to organise your computations.
The "test-amd" and "test-gpu" partitions are freely available. They are an easy way for you to get access to high core-count nodes and to modern GPU nodes without applying for restricted resources.
These partitions run on hardware that is normally used for specific high-priority tasks. If such a task is started, jobs running on these partitions will be stopped, then restarted once the high-priority tasks are done. As your jobs can be interrupted at any time, these partitions are not suitable for important long-running jobs unless the job can restart where it left off.
These partitions are best used for short computations, including interactive use and compiling and testing code.
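If you do want to run a longer job on these partitions, it needs to be able to pick up where it left off after an interruption. The following is a minimal sketch of that pattern in shell, not a description of how Saion itself restarts jobs. The file name checkpoint.txt, the step count and the placeholder work loop are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical checkpoint file and step count; adapt both to your own job.
checkpoint="checkpoint.txt"
total_steps=5

# Resume from the last completed step if a checkpoint exists, otherwise start at 0.
if [ -f "$checkpoint" ]; then
    start=$(cat "$checkpoint")
else
    start=0
fi

step=$start
while [ "$step" -lt "$total_steps" ]; do
    # Placeholder for one unit of real work.
    step=$((step + 1))

    # Record progress after each completed step. Writing to a temporary file
    # and renaming it avoids a half-written checkpoint if the job is stopped here.
    echo "$step" > "$checkpoint.tmp" && mv "$checkpoint.tmp" "$checkpoint"
done

echo "finished at step $step"
```

If the job is stopped and started again, the loop continues from the step recorded in the checkpoint file instead of repeating the work already done.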
"test-amd" is a set of 4 dual X86 AMD nodes each with 64 cores and 512GB memory. The per-core performance is slightly lower than the equivalent Intel CPU, but the much higher number of available cores makes it a good choice for thread-parallel workloads such as physics and chemistry modelling, multithreaded bioinformatics tools and the like.
The AMD nodes will be especially slow if you use the Intel compiler and/or the Intel MKL math library, as both are often very slow on non-Intel hardware. If you build your own code, use the GCC compiler and try to use the OpenBLAS or BLIS numerical libraries instead.
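As a rough illustration, a GCC build linked against OpenBLAS might look like the sketch below. The source file myprog.c, the script name build.sh and the -lopenblas link flag are assumptions about a typical installation, so check which libraries or modules are actually available before relying on them.

```shell
# Write a small build script (hypothetical name: build.sh).
cat > build.sh <<'EOF'
#!/bin/bash
# Build with GCC and link against OpenBLAS rather than the Intel compiler and MKL.
# "myprog.c" is a placeholder for your own source file.
gcc -O2 -o myprog myprog.c -lopenblas
EOF

# Make the script executable; run it on a compute node with: ./build.sh
chmod +x build.sh
```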
"test-gpu" is a set of 6 nodes with 4 NVIDIA P100 GPUs each. The layout and usage notes is the same as for the restricted Saion "gpu" partition. In order to take advantage of these nodes, you generally have to start a job on one of the nodes and build your software there. The login nodes do not have the drivers or libraries needed to build GPU software.
Building your code
You should build your application on a GPU compute node. The login nodes use a different CPU model and have no GPU-related libraries, so you can't build your application there.
CUDA 10.0 is installed system-wide, and CUDA 8, CuDNN 6 and nccl2 are available as modules. Please use these if you can. We may install more modules if users request them, but we expect that many users will want to build their own applications.
To build an application you need to log in to a GPU node. The login nodes do not have CUDA and the other libraries that you typically need. To start an interactive job, try this:
$ srun -t 0-1 -c 8 -p test-gpu --mem=32G --gres=gpu:1 --pty bash
This gives you 32G memory, 8 cores and one GPU for 1 hour. If you only intend to compile the code and not test it, you can of course refrain from asking for a GPU at all. That way you can still use a GPU node for building your code even if all GPUs are in use by somebody else.
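The same compile-only request can also be made as a batch job. The sketch below writes a hypothetical job script, build_gpu.slurm, that asks for the same cores, memory and time as the interactive example but for no GPU. The script name and the make build step are assumptions, so substitute your own build commands.

```shell
# Write a job script that builds on a GPU node without reserving a GPU.
cat > build_gpu.slurm <<'EOF'
#!/bin/bash
#SBATCH -p test-gpu
#SBATCH -t 0-1
#SBATCH -c 8
#SBATCH --mem=32G
# No GPU is requested here: the build needs the node's libraries, not a GPU.

# Hypothetical build step; replace with your own build commands.
make -j 8
EOF

# Submit it from a login node with: sbatch build_gpu.slurm
```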
"intel" is a set of 4 dual Intel Xeon nodes, each with 40 cores and 512GB memory. We ask you to use no more memory than 120G per job.
The "intel" nodes have fewer cores than the "test-amd" nodes, but are faster on floating-point calculations. Numerically intensive applications should run well on these nodes.
The "intel" partition is gang scheduled. Your job will start immediately, but if too many other jobs are also running on the partition, the jobs will take turns running on the system. The time slice is 10 seconds, meaning your job runs for 10 seconds at a time, before it is suspended in favour of another job.
This system is good for shorter computations, and especially good when you want to quickly test a program. You might simply want to see if your program runs at all, or you want to run it for a few minutes to see that it does what it's supposed to do. On the "intel" system you don't need to wait for your turn; your jobs will usually start immediately.
The frequent interruptions mean that the "intel" partition is not suitable for interactive use. Also, if a node runs out of memory it cannot accept more jobs, and further submitted applications will have to wait until memory is freed. This is the reason you are not allowed to use more than 120G per job on the "intel" partition.
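A short test job for the "intel" partition could be prepared along the lines of the sketch below, which also refuses to write the job script if the memory request exceeds the 120G limit. The script name intel_test.slurm, the program ./my_program and the requested cores, time and memory are all placeholders.

```shell
# Requested memory in gigabytes (placeholder value) and the per-job limit on "intel".
mem_gb=64
limit_gb=120

if [ "$mem_gb" -gt "$limit_gb" ]; then
    echo "Requested ${mem_gb}G exceeds the ${limit_gb}G per-job limit" >&2
else
    # Write a batch script for a short, non-interactive test run of 10 minutes.
    cat > intel_test.slurm <<EOF
#!/bin/bash
#SBATCH -p intel
#SBATCH -t 0-0:10
#SBATCH -c 4
#SBATCH --mem=${mem_gb}G

# Hypothetical program; replace with your own.
./my_program
EOF
    echo "Wrote intel_test.slurm; submit it with: sbatch intel_test.slurm"
fi
```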
Next Section: The restricted Saion partitions.