Running Matlab Jobs

We have several versions of Matlab installed on Deigo, from 2009 onwards. However, only the latest versions will work well on the cluster.

You can use it interactively or in batch jobs. We have support for parallel execution so you can use multiple cores on a node or, with some programming, several nodes to speed up your jobs.

We are not going to describe how to use Matlab itself. The Matlab documentation on the Mathworks website is a great resource for learning the specifics.

Matlab Options

These are useful command line options for running on the cluster:

Option       Meaning
-nosplash    Don't show the "welcome to Matlab" splash screen at startup.
-nodesktop   Use the text-mode interface. Much faster than the graphical desktop; you can still plot and view graphs.
-nodisplay   Don't allow any graphical display of any kind. Useful for batch jobs.
-nojvm       Start without Java. This disables some graphics and network-related functions, but reduces startup time and memory use.

In general you'll want to use -nosplash and maybe -nodesktop for interactive work, and also -nodisplay and perhaps -nojvm for batch jobs.
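The recommendations above can be summarized in a small shell helper. This is only a sketch: the function name "matlab_opts" and its mode names are made up for illustration and are not installed on Deigo. The option strings are the ones used in the examples on this page; batch jobs that use the Parallel Toolbox need Java, so that mode drops -nojvm.

```shell
#!/bin/bash
# Hypothetical helper: print the Matlab options recommended on this page
# for a given way of running Matlab.
matlab_opts() {
    case "$1" in
        interactive) echo "-nosplash -nodesktop" ;;                   # text mode, plotting still works
        gui)         echo "-nosplash" ;;                              # full graphical desktop
        batch)       echo "-nosplash -nodisplay -nojvm -nodesktop" ;; # plain batch jobs
        batch-java)  echo "-nosplash -nodisplay -nodesktop" ;;        # batch jobs that need Java
        *)           echo "matlab_opts: unknown mode '$1'" >&2; return 1 ;;
    esac
}

# Example: show the command a batch job would run (nothing is executed here)
echo "matlab $(matlab_opts batch) -r \"svdprog\""
```

In a job script you would write matlab $(matlab_opts batch) -r "svdprog". The expansion is deliberately left unquoted so that bash splits the options into separate words.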

We will first run Matlab interactively, with the recommended text interface and with the graphical GUI. Then we'll show how you write a basic Matlab job script so you can start a long-running computation without having to stay logged in. Finally we'll show how you can speed up your Matlab jobs by using multiple cores and nodes on the cluster.

Run Matlab in interactive mode
Run Matlab job scripts
Access the environment from Matlab scripts
Run Matlab jobs in parallel through Slurm job arrays
Run parallel Matlab code using the Parallel Toolbox
Run Matlab code in parallel through the Distributed Computing Server

Run Matlab in interactive mode

If you want to plot data or use the graphical interface, you need to be able to forward and view X windows. On Linux and OSX you add the "-X" parameter to your ssh command:

$ ssh -X your-name@deigo.oist.jp

On OSX you will need the XQuartz server installed. On Windows you need an SSH client that can forward X window graphics. MobaXterm is a free SSH client with a built-in X server. PuTTY plus the Xming X server reportedly also works.

You can run Matlab as an interactive job on Deigo and use it like you would on your own computer. Some Matlab functions (such as vector operations and fft) and toolboxes can take advantage of multiple cores, so it's often a good idea to allocate more than one core even if you're just using it interactively.

Log in to the cluster and load a Matlab module (see "Use the module system"):

$ module load matlab/R2018b

Start an interactive job with the srun command:

$ srun -p short -t 1:00:00 --mem=10G -c8 --x11 --pty bash

In this example we ask for 10GB memory and 8 cores for one hour. --x11 tells Slurm to forward graphical output to our local machine, and --pty bash starts an interactive command line. See Run Computations in the Getting Started page for more on running jobs, and man srun for details on the srun command.

Text mode is the best way to run Matlab interactively on the cluster. It's not as pretty as the GUI, but it's much faster over a network, especially from home or over WiFi, and you still have all the functionality of Matlab, including plotting. You will want to disable the graphical desktop and the start-up splash screen. On the compute node, start Matlab as:

$ matlab -nosplash -nodesktop

If you're impatient, or you know you will only use Matlab this session, you can skip the command line and just run Matlab directly as the interactive command:

$ srun -p short -t 1:00:00 --mem=10G -N 1 -c 12 --x11 --pty matlab -nosplash -nodesktop

Run Matlab interactively with the GUI

If you want you can run Matlab with the full GUI. It will be a bit sluggish on the wireless network, and very slow when you connect from outside OIST. To run Matlab with the GUI you start a job just like above, and run Matlab with just the -nosplash parameter:

$ module load matlab/R2018b
$ srun -p short -t 1:00:00 --mem=10G -N 1 -c 12 --x11 --pty bash
$ matlab -nosplash

This is the Linux desktop version of Matlab, and it should work the same as the OSX and Windows versions.

Run Matlab job scripts

You can run Matlab scripts as batch jobs. This is the best way to run longer, non-interactive calculations. Let's take a very simple example script and call it svdprog.m:

A=rand(msize,msize);
[U,S,V] = svd(A);
save(wsname);

This script generates a random matrix of size msize, calculates the singular-value decomposition, then saves everything to a Matlab ".mat"-type file with the name in wsname. The script needs inputs for msize and wsname.

Matlab scripts can't be run as regular programs, and they don't have a mechanism to read input parameters directly. Instead, we run a script inline that sets the parameter values, then calls our svdprog.m script:

#!/bin/bash

#SBATCH --partition=short
#SBATCH --job-name=mlab_test
#SBATCH --output=mlab_test_%u.out
#SBATCH --time=10:00
#SBATCH --mem=1G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# load matlab. You can see the available versions with 'module avail'
module load matlab/R2018b

# matlab program with options for batch processing
mlab_cmd="matlab -nosplash -nodisplay -nojvm -nodesktop"

# Set our matrix size and output file name
mat_size=15
out_name="${SLURM_JOB_NAME}_${SLURM_JOB_ID}.mat"

# Run a small script that sets our variables then calls the real script
${mlab_cmd} -r "msize=${mat_size}; wsname='${out_name}'; svdprog;"

We set "mlab_cmd" to our Matlab program and options. We set "mat_size" to our desired matrix size, and "out_name" to the output file name. Slurm defines "SLURM_JOB_NAME" and "SLURM_JOB_ID" for us with the current job name and job ID number.

Before bash runs a line, it replaces "${variable}" with the value of variable. "${mlab_cmd}" is replaced with our Matlab command and options, "${mat_size}" is replaced with the value "15" that we set, and so on.

In the end, our script actually runs this command line:

matlab -nosplash -nodisplay -nojvm -nodesktop -r "msize=15; wsname='mlab_test_123456.mat'; svdprog;"

The "-r" option to Matlab tells it to run the following string as a Matlab script. This script sets the variables msize and wsname then runs svdprog.m, like we wanted.
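A handy way to check the substitution before you submit is a dry run: build the command exactly as the job script does and print it instead of running it. The sketch below wraps the expansion in a hypothetical function "build_mlab_cmd" (not part of the job script above); the job name and job ID are placeholders for the values Slurm sets inside a real job.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print (not run) the Matlab command line that the
# job script builds for a given matrix size and output file name.
build_mlab_cmd() {
    local mat_size="$1"
    local out_name="$2"
    local mlab_cmd="matlab -nosplash -nodisplay -nojvm -nodesktop"
    echo "${mlab_cmd} -r \"msize=${mat_size}; wsname='${out_name}'; svdprog;\""
}

# "mlab_test" and "123456" stand in for SLURM_JOB_NAME and SLURM_JOB_ID,
# which are only set inside a real job.
build_mlab_cmd 15 "mlab_test_123456.mat"
```

This prints the same command line as shown above. Inside the job script itself you get a similar effect by temporarily putting "echo" in front of "${mlab_cmd}": bash then prints the expanded command (without the quotes around the -r string) instead of starting Matlab.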

Access the environment from Matlab Scripts

It's often useful to access and change the environment in different ways. Slurm sets a number of environment variables with information on the job name, user, number of cores and so forth. You can access them with "getenv" like this:

jobname = getenv('SLURM_JOB_NAME');
jobid = str2num(getenv('SLURM_JOB_ID'));

getenv returns the value as a string. If you want the numerical value, use str2num to convert it. We will see a number of examples of this further below.
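To see which variables Slurm actually sets for your job, and therefore what getenv can return, you can list them from the job script or from an interactive srun shell. The function name "show_slurm_env" is hypothetical; it only uses the standard env, grep and sort commands.

```shell
#!/bin/bash
# Hypothetical helper: list the SLURM_* environment variables visible to this
# shell, sorted by name. Returns 1 if there are none (not inside a Slurm job).
show_slurm_env() {
    if env | grep -q '^SLURM_'; then
        env | grep '^SLURM_' | sort
    else
        echo "No SLURM_* variables set (not inside a Slurm job?)" >&2
        return 1
    fi
}
```

Run show_slurm_env inside an interactive "srun ... --pty bash" session, or near the top of a job script, to see names such as SLURM_JOB_ID and SLURM_JOB_NAME together with their values.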

You can create and run from a temporary directory like this:

scratchdir = strcat('/flash/YourunitU/scratch/',getenv('USER'),'/',getenv('SLURM_JOB_ID'));
unix(['mkdir -p ' scratchdir]);
cd(scratchdir);

This builds the path /flash/YourunitU/scratch/<user name>/<job id>/; replace "YourunitU" with your real unit directory name. We then create the directory by calling the shell "mkdir" command with the "-p" flag, so that it also creates all intermediate directories in the path. We call the system command (through the Matlab "unix" function) because the Matlab "mkdir" function has a bug. Finally, we change the current directory with "cd".
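If you prefer, the same per-job scratch directory can be created in the job script before Matlab starts. This is a sketch under assumptions: "make_scratch" is a hypothetical helper, and /flash/YourunitU/scratch is the same placeholder path as above that you need to replace with your unit's directory.

```shell
#!/bin/bash
# Hypothetical helper: create <base>/<user>/<job id>, including any missing
# parent directories (mkdir -p), and print the resulting path.
make_scratch() {
    local base="$1" user="$2" jobid="$3"
    local dir="${base}/${user}/${jobid}"
    mkdir -p "${dir}" || return 1
    echo "${dir}"
}

# Example use in a job script, before the matlab line
# (replace YourunitU with your unit's directory name):
#   scratchdir=$(make_scratch /flash/YourunitU/scratch "$USER" "$SLURM_JOB_ID") || exit 1
#   cd "$scratchdir"
```

Keep in mind that Matlab finds scripts such as svdprog.m in the current directory or on its search path, so after changing to a scratch directory (from bash or with cd in Matlab) you may need to add the directory holding your .m files with addpath.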

Run Matlab jobs in parallel through Slurm job arrays

If you need to do multiple independent matlab computations with different parameters, you can use job arrays to run them in parallel. We will give you a quick example here; please see "Array Batch Jobs" to learn more about Slurm job arrays.

This is our script above, changed to run a range of matrix sizes, with a different output file for each size. The main changes are the new "--array" option, the "sizearr" bash array, and the new "mat_size" and "out_name" lines:

#!/bin/bash

#SBATCH --partition=short
#SBATCH --job-name=mlab_test
# %A is the array job ID and %a the task index, so each task gets its own log file
#SBATCH --output=mlab_test_%A_%a.out
#SBATCH --time=10:00
#SBATCH --mem=1G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --array=0-9%5

# load matlab. You can see the available versions with 'module avail'
module load matlab/R2018b

# matlab program with options for batch processing
mlab_cmd="matlab -nosplash -nodisplay -nojvm -nodesktop"

# our range of matrix sizes as a bash array
sizearr=( 8 12 15 22 540 400 234 35 17 42 )

# Set our matrix size and output file name
mat_size=${sizearr[${SLURM_ARRAY_TASK_ID}]}
out_name="${SLURM_JOB_NAME}_${SLURM_JOB_ID}_${mat_size}.mat"

# Run matlab on a small inline script
${mlab_cmd} -r "msize=${mat_size}; wsname='${out_name}'; svdprog;"

We add a new sbatch parameter: "--array=0-9%5" tells Slurm that this is an array job, that it should run one task for each index from 0 to 9, and that no more than 5 of these tasks may run at the same time (we ask you to submit no more than 50 Matlab jobs at any one time).

Slurm runs this script once for each array index, as a separate job. The index itself is stored in the "SLURM_ARRAY_TASK_ID" variable, and you can use it to set parameters and file names.

As an example, we define a bash array sizearr with a range of values. We use SLURM_ARRAY_TASK_ID to index sizearr and store the result in mat_size, which then sets the matrix size and makes each output file name unique.

When you submit this job, Slurm starts the first 5 tasks (indexes 0-4). Each time a task finishes, Slurm starts the next one, so that up to 5 tasks run at any time until all are done. This is more efficient than trying to submit batches of jobs yourself.
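Before submitting, it can be worth checking locally that every array index maps to the value you expect, and that the --array range matches the length of the bash array (0-9 for 10 values). The loop below simply replays what each array task would compute; it does not submit anything, and "JOBID" in the file names is a placeholder for the job ID Slurm assigns.

```shell
#!/bin/bash
# Same bash array as in the job script
sizearr=( 8 12 15 22 540 400 234 35 17 42 )

# The --array range should run from 0 to (number of elements - 1)
last_index=$(( ${#sizearr[@]} - 1 ))
echo "use --array=0-${last_index}"

# Replay what each array task would compute for its SLURM_ARRAY_TASK_ID
for task_id in $(seq 0 "${last_index}"); do
    mat_size=${sizearr[${task_id}]}
    echo "task ${task_id}: msize=${mat_size} -> mlab_test_JOBID_${mat_size}.mat"
done
```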

Remember that if you need it, you can get the value of SLURM_ARRAY_TASK_ID from inside your matlab script with getenv as we described earlier:

i=str2num(getenv('SLURM_ARRAY_TASK_ID'));

Run parallel Matlab code using the Parallel Toolbox

You can use the Parallel Toolbox to write parallel scripts that use the available cores on a single node. You replace a for loop with a "parfor" ("parallel for") loop, which runs the loop iterations in parallel on a pool of Matlab workers, in a manner similar to OpenMP.

Let's rewrite our SVD example to use "parfor". First, here's our Slurm sbatch script:

#!/bin/bash

#SBATCH --partition=short
#SBATCH --job-name=mlab_parbox
#SBATCH --output=mlab_parbox_%u.out
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=4G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# load matlab.
module load matlab/R2018b

# matlab program without "-nojvm", since we need java for this toolbox.
mlab_cmd="matlab -nosplash -nodisplay -nodesktop"

# Run matlab on our script
${mlab_cmd} -r "svdprog_par"

Our new svd script uses no parameters. We ask for 8 cores and 4G per core for a total of 32G memory.

Our matlab script "svdprog_par.m" looks like this:

% our array of matrix sizes
sizearr=[8, 12, 15, 22, 540, 400, 234, 35, 17, 42];

% get number of cores and the job name from environment
cores = str2num(getenv('SLURM_CPUS_PER_TASK'));
jobname = getenv('SLURM_JOB_NAME');

% Create a parallel pool with 'cores' number of workers
pc = parcluster('local');
poolobj = parpool(pc, cores);

% Parallel loop
parfor j = 1:length(sizearr)
    msize = sizearr(j);
    wsname = strcat(jobname,'_',int2str(msize),'.mat');

    % our SVD code
    A=rand(msize,msize);
    [U,S,V] = svd(A);

    % We need to call 'save' from an external function
    pf_save(wsname, A, U, S, V);
end

% delete parallel pool
delete(poolobj)

We get the number of available cores from the SLURM_CPUS_PER_TASK environment variable. We then get a cluster object for the local node with parcluster('local'), and start a pool of that many workers with parpool().
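SLURM_CPUS_PER_TASK is only set when the job asks for --cpus-per-task (or -c). If it is missing, for example when you test the script outside Slurm, getenv returns an empty string and "cores" ends up empty ([]) instead of a number. One way to guard against this is to give the variable a default in the job script before Matlab starts. The function below is a hypothetical helper, and falling back to one worker is an assumption you may want to change.

```shell
#!/bin/bash
# Hypothetical guard: make sure SLURM_CPUS_PER_TASK holds a positive integer
# before Matlab reads it with getenv. Falls back to 1 if unset or invalid.
ensure_cpus_per_task() {
    local n="${SLURM_CPUS_PER_TASK:-}"
    case "$n" in
        ''|*[!0-9]*) n=0 ;;   # unset, empty or not a plain integer
    esac
    if [ "$n" -lt 1 ]; then
        echo "SLURM_CPUS_PER_TASK is unset or invalid; using 1 worker" >&2
        n=1
    fi
    export SLURM_CPUS_PER_TASK="$n"
}
```

Call ensure_cpus_per_task in the job script just before the ${mlab_cmd} -r "svdprog_par" line. Inside a normal job with --cpus-per-task=8 it leaves the value at 8.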

parfor runs a loop from 1 to the number of elements in our parameter array sizearr. Unlike a regular for loop, it runs the iterations in parallel on the workers in the pool. For each iteration we set the msize and wsname parameters, then calculate the SVD of a random matrix of that size.

The Matlab save() function can't be called directly inside a parfor loop. As a workaround, create a function file "pf_save.m" with the following content and call that instead:

% define an external save command
function pf_save(dname, A, U, S, V)
    save(dname, 'A', 'U', 'S', 'V');
end

You can't parallelize every loop. The iterations must be independent, so no calculation may depend on the result of an earlier iteration. Also, parfor adds some overhead, so unless you have a fairly heavy calculation inside the loop, your code may end up slower than without it.

Use Slurm array jobs and the Parallel Toolbox at the same time

The Parallel Toolbox uses multiple CPU cores in parallel in a single Matlab job. With Slurm arrays you can easily run multiple Matlab jobs in parallel. Sometimes you may want to do both: run multiple Matlab jobs, where each job uses the Parallel Toolbox. Ideally, you would be able to run your Matlab Parallel Toolbox code with Slurm arrays without issue.

Unfortunately there is a problem: the Parallel Toolbox records information about the job in a directory in your home. If you start several such jobs at the same time, they may overwrite each other's files, which makes the jobs fail.

The solution is to change the directory that it saves the information in, so that each job has its own unique directory. You need to do this in your Matlab source code, right after you call "parcluster()", and before you call "parpool()":

% here we create a new parallel environment
pc = parcluster('local');

% Find out if we are running a cluster job, and get the job ID
slurmjob = getenv('SLURM_JOB_ID');

% do we have a job id?
if ~isempty(slurmjob)
    % use the job id as a name for a new subdirectory, and set that as the
    % location for Parallel Toolbox to save the information
    cluster_dir = fullfile(pc.JobStorageLocation, slurmjob);
    mkdir(cluster_dir);
    pc.JobStorageLocation = cluster_dir;
end

This snippet of code does a few things:

  1. We look for a Slurm Job ID.
  2. If we have one, we're running on the cluster
    1. Get the old location for Parallel Toolbox save files, then add the Job ID as a subdirectory to that
    2. Create the new directory
    3. Set this as the new directory for Parallel Toolbox to use

We need one more thing: when you create a parpool, you need to give the "pc" parameter above as the first argument:

poolobj = parpool(pc, .......);

That applies the new settings to the parpool so it actually uses the new directory.

Run Matlab code in parallel through the Distributed Computing Server

We can also run distributed computations across multiple nodes directly from Matlab itself, using the MATLAB Distributed Computing Server (MDCS). Matlab creates and submits a Slurm job that runs the computation in parallel for you. This is similar in spirit to using MPI for parallel computation.

About Matlab versions

  • The most recent versions of Matlab (2019a onwards) have good built-in support for the Slurm scheduler, and using the Parallel Toolbox is quite easy. Follow the instructions in "Parallel Toolbox for Matlab 2019a" below.
  • Versions before 2018 don't have good support for Slurm and need more setup to work. Follow the instructions in "Parallel Toolbox for Matlab before 2018" below.
  • Versions 2018a and 2018b can't use the Parallel Toolbox on the cluster due to a bug in their scheduler integration; please use 2019a or newer instead.

Parallel Toolbox for Matlab 2019a

This section covers the Parallel Toolbox on Matlab 2019a and newer, which have built-in support for the Slurm scheduler. If you are using a version older than 2018, follow the "Parallel Toolbox for Matlab before 2018" instructions instead.

NOTE: Matlab 2018a and 2018b suffer from a bug. For Parallel Toolbox, please use Matlab 2019a instead.

As Matlab needs to create and submit a new job, we have to run Matlab on a login node. We need to be very careful not to run any computation on the login node itself.

Load Matlab and start it:

$ module load matlab/R2019b
$ matlab -nosplash -nodisplay -nodesktop

Write a job submission script "mlab_mdcs.m" like this:

% MDCS configuration for Deigo cluster

% Create a temporary directory in /flash/YourunitU/scratch/$USER
% without this, matlab would store temporary files in your 
% current directory instead.
scratchdir = strcat('/flash/YourunitU/scratch/',getenv('USER'));
unix(['mkdir -p ' scratchdir]);

% create a cluster instance, with our scratchdir as location
cluster = parallel.cluster.Slurm('JobStorageLocation', scratchdir);

%% Number of jobs/tasks to use
numWorkers = 8;
set(cluster, 'NumWorkers', numWorkers);

%% Job submission through the SPMD engine
pjob = createCommunicatingJob(cluster,'Type','spmd');

% This is the code we actually want to run
pjob.AttachedFiles={'colsum.m'};
t=createTask(pjob, @colsum, 1, {}) ;
pjob.NumWorkersRange = [8,8];
submit(pjob)

This creates a temporary directory under /flash/YourunitU/scratch/$USER/ (replace "YourunitU" with your unit's directory name) where the server puts its temporary files. We then create a Slurm cluster object that uses this directory, and set the number of worker processes we want to use.

This time we don't use a parallel loop. Instead we use a parallel block (called SPMD for "Single Program, Multiple Data" in Matlab), and call our task in colsum.m, once for each process.

This is "colsum.m" :

function total_sum = colsum

if labindex == 1
    % Send magic square to other labs
    A = labBroadcast(1,magic(numlabs)) ;
else
    % Receive broadcast on other labs
    A = labBroadcast(1) ;
end

% Calculate sum of column identified by labindex for this lab
column_sum = sum(A(:,labindex)) ;
% Calculate total sum by combining column sum from all labs
total_sum = gplus(column_sum);

This code calculates the sum of all elements in a matrix (a magic square) in parallel. The code is called once per process, with the rank in labindex and total number of processes in numlabs. Process 1 creates a magic square and sends it to all processes using labBroadcast. Each process calculates the sum of one column, then gplus() sums (reduces) all column values from the processes and distributes the final value to total_sum in all processes.
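It helps to know the expected answer when you check the output. With pjob.NumWorkersRange = [8,8], numlabs is 8, so process 1 broadcasts magic(8), which contains each of the integers 1 to 64 exactly once. Every column of an n-by-n magic square has the same sum, so for n = 8:

```latex
\text{column sum} = \frac{n(n^2+1)}{2} = \frac{8 \cdot 65}{2} = 260,
\qquad
\text{total\_sum} = n \cdot 260 = \sum_{k=1}^{64} k = \frac{64 \cdot 65}{2} = 2080
```

The value you should get back from t.OutputArguments is therefore 2080.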

You run this code interactively from matlab:

>> mlab_mdcs
...
%% you can wait for the task to finish
>> wait(t)
%% Get the final output
>> t.OutputArguments

Please see the Mathworks documentation for much more on the parallel computing toolbox.
 

Parallel Toolbox for Matlab before 2018

This section covers the Parallel Toolbox on older Matlab versions (before 2018) that lack direct support for the Slurm scheduler. If you are using Matlab 2019a or newer, follow the "Parallel Toolbox for Matlab 2019a" instructions instead.

Matlab needs to create and submit a new job, so we have to run our Matlab code on a login node. We need to be very careful not to run any computation on the login node itself.

Load Matlab and start it:

$ module load matlab/R2015b
$ matlab -nosplash -nodisplay -nodesktop

Write a job submission script "mlab_mdcs.m" like this:

% MDCS configuration for Deigo cluster
slurmpath = strcat(getenv('MATLAB_ROOT_DIR'), '/toolbox/distcomp/examples/integration/slurm/shared');
addpath(genpath(slurmpath));

% Create a temporary directory in /flash/YourunitU/scratch/$USER
scratchdir = strcat('/flash/YourunitU/scratch/',getenv('USER'));
unix(['mkdir -p ' scratchdir]);

% create a cluster instance, with our scratchdir as location
cluster = parallel.cluster.Generic('JobStorageLocation', scratchdir);
set(cluster, 'HasSharedFilesystem', true);
set(cluster, 'RequiresMathWorksHostedLicensing', false);
set(cluster, 'ClusterMatlabRoot', getenv('MATLAB_ROOT_DIR'));
set(cluster, 'OperatingSystem', 'unix');
set(cluster, 'IndependentSubmitFcn', @independentSubmitFcn);
set(cluster, 'CommunicatingSubmitFcn', @communicatingSubmitFcn);
set(cluster, 'GetJobStateFcn', @getJobStateFcn);
set(cluster, 'DeleteJobFcn', @deleteJobFcn);

%% Number of jobs/tasks to use
numWorkers = 8;
set(cluster, 'NumWorkers', numWorkers);

%% Job submission through the SPMD engine
pjob = createCommunicatingJob(cluster,'Type','spmd');

% This is the code we actually want to run
pjob.AttachedFiles={'colsum.m'};
t=createTask(pjob, @colsum, 1, {}) ;
pjob.NumWorkersRange = [8,8];
submit(pjob)

This adds the Slurm integration code for the distributed server, then creates a temporary directory under /flash/YourunitU/scratch/$USER/ for the server. We set up a new instance of the server, with the number of processes we wish to use.

This time we don't use a parallel loop. Instead we use a parallel block (called SPMD for "Single Program, Multiple Data" in Matlab), and call our task in colsum.m, once for each process.

This is "colsum.m" :

function total_sum = colsum

if labindex == 1
    % Send magic square to other labs
    A = labBroadcast(1,magic(numlabs)) ;
else
    % Receive broadcast on other labs
    A = labBroadcast(1) ;
end

% Calculate sum of column identified by labindex for this lab
column_sum = sum(A(:,labindex)) ;
% Calculate total sum by combining column sum from all labs
total_sum = gplus(column_sum);

This code calculates the sum of all elements in a matrix (a magic square) in parallel. The code is called once per process, with the rank in labindex and total number of processes in numlabs. Process 1 creates a magic square and sends it to all processes using labBroadcast. Each process calculates the sum of one column, then gplus() sums (reduces) all column values from the processes and distributes the final value to total_sum in all processes.

You run this code interactively from matlab:

>> mlab_mdcs
...
%% you can wait for the task to finish
>> wait(t)
%% Get the final output
>> t.OutputArguments

Please see the Mathworks documentation for much more on the parallel computing toolbox.