Profile your job

There comes a time in the life of every user when you ask yourself "Is this it? Is this really all?" You start to wonder if your jobs really run as well as they could, or if you can perhaps improve them in some way, whether by rewriting your code or simply by finding a better set of options for your Slurm job.

In HPC, as in science, you can't improve something until you can measure it. So how do you "measure" your job? This page will tell you how. The tools below are roughly ordered by how general and simple they are to use.

General tools

Ruse
time and /usr/bin/time
ssh, ps, top and htop
Strace

Advanced tools

Gdb
Other tools: Valgrind and perf
Python: cProfile and timeit

General tools

These are tools that are useful for a wide range of applications, no matter what language they are written in. You don't need to modify your program, so they can be used with any application, and you don't need to be a programmer to use them.

Ruse

Ruse is a utility that measures the time, memory and number of processes that your job needs, presented in a way that is straightforward to apply to your Slurm job settings.

Ruse periodically samples the process and its subprocesses and keeps track of the CPU, time and maximum memory use. It can also optionally record the sampled values over time. The purpose of Ruse is not to profile processes in detail, but to follow jobs that run for many minutes, hours or days, with no performance impact and without changing the measured application in any way.

You use Ruse by loading the module, then adding the "ruse" command in front of your program. Ruse will wake up periodically (every 10 seconds by default) and measure the memory used and the number of processes (total and active). At the end, it writes a file ending in ".ruse" with the results for you to inspect.

Below we have an example of using Ruse to measure the time, processes and memory of a computational chemistry package run with the "g09" command:

$ module load ruse/2.0
$ ruse g09 <Azulene.inp >azu.out
Time:           00:00:39
Memory:         448.5 MB
Cores:          8
Total_procs:    9
Active_procs:   8
Proc(%): 74.8  50.9  50.8  50.3  49.6  49.0  45.5  44.0  

We get a lot of useful information:

  • Time is the total running time, in the same format you would use with Slurm.
  • Memory is the peak amount of memory used by the application.
  • Cores is the total number of cores available. This is usually what you set with the "-c" parameter to Slurm.
  • Total procs is the maximum number of simultaneous processes started by the application. This includes sleeping processes that don't actually do anything.
  • Active procs is the peak number of simultaneously active processes. For compute-intensive applications this should be no more than the number of available cores.
  • Proc(%) is how much CPU each process used during execution. This can help you figure out whether you are using the cores effectively.

Ruse has a set of useful options. The most common are:

-s, --steps            Print each sample step
-t, --time=SECONDS     Sample every SECONDS (default 10)
-h, --help             Print help

With "--steps" you tell Ruse to print the resources you use every time step, not just a summary at the end. "--time" lets you set a different time step than the default. And "--help" gives you a list of all the options you have available.

Please see our Ruse documentation page for all the details on this tool.

time and /usr/bin/time

"time" and "/usr/bin/time" are commands for measuring the time a command takes to run. You use them in the same way as "ruse" and "perf-report" above, by putting them in front of the command you want to measure.

time

$ time spades.py --test
...
real    0m11.265s
user    0m10.585s
sys     0m1.154s

"time" is not actually a stand-alone program; it is a command built into the bash shell itself. When you run "time command" it is the bash shell itself that measures the time taken. Here we measured spades.py running its own self-test. The real total time was about 11 seconds; time spent in spades.py was 10.5 seconds, and time spent in the operating system (reading and writing files and the like) was 1.1 seconds.

"time" is very rudimentary. It gives you the "real" (total or "wall-clock") time; the "user" time spent in the program itself; and the "sys" time spent in system calls.

/usr/bin/time

$ /usr/bin/time spades.py --test

10.58user 1.20system 0:11.26elapsed 104%CPU (0avgtext+0avgdata 546644maxresident)k
24inputs+20792outputs (1major+125630minor)pagefaults 0swaps

The "time" program in /usr/bin" is a real application, and gives you a lot more information about the program you run. By default it gives you a very telegraphic summary: user, system and total time; amount of CPU; memory; disk IO; memory page faults and swapped pages.

You can expand the output and add more information with the "-v" or "--verbose" option:

$ /usr/bin/time -v spades.py --test

        Command being timed: "spades.py --test"
        User time (seconds): 10.47
        System time (seconds): 1.23
        Percent of CPU this job got: 105%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.09
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 546496
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 125664
        Voluntary context switches: 20591
        Involuntary context switches: 133
        Swaps: 0
        File system inputs: 24
        File system outputs: 20792
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

It gives you the user, system and total time; total CPU and memory consumed; page faults; context switches; and the amount of various kinds of IO. Many of these are not directly useful if you're running somebody else's program, but if you write your own code it can give you some insights into how the program is performing.

You need to specify the full path and run it as "/usr/bin/time"; just typing "time" gives you the simple shell built-in described above.

ssh, ps, top and htop

When you run a job, you are able to log into the nodes that your job runs on, and examine your application directly. This can be very useful to check that everything is running properly.

First start a job (here we use the "MrBayes" phylogenetics program), then check your job to see what nodes you use:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           6488691   compute     bash jan-more  R    1:57:44      1 deigo011920 

The node name here is "deigo011920". If your job uses multiple nodes, pick one. Then ssh into it:

$ ssh deigo011920

You can now examine your job in various ways. You can see any processes you are running with the "ps" command:

$ ps u -u $USER
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
jan-mor+  57243  2.0  0.0 284212 21276 pts/0    Sl+  17:10   0:00 mpirun -np 4 mb testseq1c.nex
jan-mor+  57254 93.5  0.0 389388 76796 pts/0    Rl   17:10   0:06 mb testseq1c.nex
jan-mor+  57255 94.2  0.0 388360 77164 pts/0    Rl   17:10   0:06 mb testseq1c.nex
jan-mor+  57257 94.2  0.0 388360 77064 pts/0    Rl   17:10   0:06 mb testseq1c.nex
jan-mor+  57258 94.0  0.0 389396 78952 pts/0    Rl   17:10   0:06 mb testseq1c.nex
jan-mor+  57470  0.0  0.0  54660  3780 pts/1    R+   17:10   0:00 ps u -u jan-moren
jan-mor+  88494  0.0  0.0  93464  9700 ?        Ss   15:12   0:00 /usr/lib/systemd/systemd --user
jan-mor+  88498  0.0  0.0 346856 20160 ?        S    15:12   0:00 (sd-pam)
...

"$USER" contains your user name; you can put your user name directly there if you want.

This lists your current processes on this node, from all your jobs, and as you can see there are quite a few. You can see the "mpirun" command I used to start the job, the four MrBayes processes ("mb testseq1c.nex"), the ps command itself, and half a dozen background processes we don't need to care about.

For each process, you can see the user; the PID (process ID); the percentage of CPU and memory it is currently using (memory is a percentage of the total on the node, not of what you requested); how much memory it has asked for (VSZ) and how much it is actually using (RSS); how long it has been running; and the command line.

At the very least, this will tell you if your job is actually doing something or not. If the CPU percentage is stuck at 0.0 then it may be hanging, waiting for something to happen. If the CPU percentage is higher, you know it's at least doing something.

You can get more of a live view of the processes using top:

$ top -u $USER

This will give you an updated list of your processes over time, along with some general data on the system:

top - 10:54:05 up 133 days,  1:20,  1 user,  load average: 60,08, 50,77, 43,87
Tasks: 1784 total,  44 running, 1740 sleeping,   0 stopped,   0 zombie
%Cpu(s): 42,9 us,  0,7 sy,  0,0 ni, 56,0 id,  0,0 wa,  0,3 hi,  0,1 si,  0,0 st
MiB Mem : 515319,1 total, 108200,8 free,  37144,6 used, 369973,7 buff/cache
MiB Swap:   8192,0 total,   7142,2 free,   1049,8 used. 473018,2 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 46672 jan-mor+  20   0  388360  78460  16836 R  99,7   0,0   0:04.59 mb
 46673 jan-mor+  20   0  389412  77164  17580 R  99,3   0,0   0:04.57 mb
 46671 jan-mor+  20   0  388368  79128  17508 R  99,0   0,0   0:04.58 mb
 46670 jan-mor+  20   0  389388  78372  16728 R  98,0   0,0   0:04.51 mb
 46665 jan-mor+  20   0   63332   6428   3880 R   1,0   0,0   0:00.16 top
 46666 jan-mor+  20   0  284208  21384  17952 S   0,3   0,0   0:00.16 mpirun

At the top you can see a lot of information about the node. Mostly this is not useful to you, as this reflects the total use by all users using this node. Below you have a list of your processes, with much the same information you get with "ps" above, ordered slightly differently.

You exit with "q" and you can get help with "h". You can sort your processes in various ways, change the information is displays and so on.

Htop is a top-like tool, written as an improved top. It can show you the same kind of display, and has a lot of configuration options, including showing the CPU usage graphically, showing your threads as well as processes, and much more. Feel free to play around with the settings to see what you can find out.
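
To see only your own processes, htop takes the same kind of user option as top:

$ htop -u $USER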

Strace

At times it can be useful to follow what a process does in more detail. You might want to know why a program always hangs at some point. Or you're developing your own code and need a way to look for bugs.

"strace" will list every system call your program does, and print the details of the call. This is very useful! System calls is how programs interact with the world, so it tells us what it's doing.

For example, it can tell us what the program was doing just before it hung or crashed. It can show us what files it is reading and writing. If you need to know whether a program really is reading your configuration file, strace can tell you. Or you can find out where a program is saving some output file you can't find.

To be honest, strace perhaps belongs with the "Advanced tools" in the next section. But it can be so very useful that I believe it's worth mentioning among the general tools.

You can run it at startup like this (we'll strace a program called "myprog"):

$ strace -o strace.out myprog

This will dump all strace output into a file called "strace.out". This will generally be a pretty large file — applications do a lot of things behind the scenes as they start up and this generates lots of system activity.

This is a powerful and often underrated tool. Once you know how to read the (messy, voluminous) output it is useful in a wide range of situations.

A simple one might be when a program just seems to freeze or hang for no apparent reason. You can run "strace" on it and see what it is trying to do when it hangs. Often the reason is that it is trying to read a file when the file system has a problem; or it is trying to contact a remote computer and getting no answer; or an MPI program is attempting an MPI operation but one of the processes has died. With "strace" you can see exactly what file or remote resource it is waiting for.

For a real, if contrived, example, let's strace "ls" trying to show a file that doesn't exist. Run this:

$ strace -o ls.out ls doesnt_exist
ls: cannot access 'doesnt_exist': No such file or directory

We asked ls to show us the file "doesnt_exist", which of course doesn't exist, so ls complains, then stops. We can now examine "ls.out" to see what ls did and how it failed. You can look at the file with less:

$ less ls.out
execve("/bin/ls", ["ls", "doesnt_exist"], 0x7ffe61fa0388 /* 70 vars */) = 0
brk(NULL)                               = 0x55dbd7490000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc4049a3c0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
. . . 

It's about a hundred lines that look very messy at first. Each line is actually a system call — "ls" asks the system to do something, and the operating system responds. Most of this file is startup-related things, such as asking for memory, finding libraries and such. Right toward the end we can see:

stat("doesnt_exist", 0x55f4ee59da88)    = -1 ENOENT (No such file or directory)
lstat("doesnt_exist", 0x55f4ee59da88)   = -1 ENOENT (No such file or directory)
write(2, "ls: ", 4)                     = 4
write(2, "cannot access 'doesnt_exist'", 28) = 28
write(2, ": No such file or directory", 27) = 27
write(2, "\n", 1)                       = 1
close(1)                                = 0
close(2)                                = 0
exit_group(2)                           = ?

"doesnt_exist" is the name of the file we asked ls to list for us. "stat" is a system call asking if there is a file with that name. The answer from Linux is "-1 ENOENT (No such file or directory)" - that is, "no" ("ENOENT" is short for "Error, no such entry"). ls then asks if there is a link to such a file with "lstat", and again, the answer is "no". Then ls uses "write" calls to print out the error message that we got when we tried running it.

You can attach to a running program with "-p PID", you can follow child processes and threads with "-f", and you can filter the output so you only see, say, file operations or network operations if you like.
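
As a sketch of how these options combine (the PID 12345 is just a placeholder for your real process ID), this traces only the file-related system calls of an already-running process:

$ strace -f -e trace=file -p 12345 -o trace.out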

A devops engineer named Julia Evans has written a slightly silly, but really very well written zine-format explanation of strace; it's only 8 hand-drawn pages and you can find the PDF at this link. As an aside, her zines showcase how technical documentation doesn't have to be dry and long-winded to be rigorous or useful.

If you are already comfortable with the idea of system calls, another good place to start learning strace is to look at the manual page:

$ man strace

It has a lot of useful examples in addition to an exhaustive explanation of all options.

Advanced tools

These tools are perhaps mostly useful for those of you who write your own programs. But even if you're not, don't dismiss them out of hand — they can still be helpful from time to time!

Gdb

gdb is the standard debugger for Linux systems. How — and why — to use it is far outside the scope of this page (see here or here for a couple of the many good online tutorials), but I want to highlight that you can use it on the cluster for non-interactive programs.

ssh into a node that's running your program:

$ squeue -u $USER
...
$ ssh deigo012345

Then you can attach the debugger to a running process using the "-p" parameter. Find out the PID of your process, then attach gdb to it:

$ ps -u $USER
....
$ gdb -p <PID>

You can debug a running process anywhere on the cluster this way. A useful trick for MPI programs is to add a delay right at the start of your code. In C or C++, call sleep() for, say, 30-60 seconds at the start of your code, after you have initialized MPI. That gives you time to ssh into a node, find the PID of an MPI process and attach gdb to it before it continues to run.
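
A minimal sketch of that trick in C; the 60-second delay and the placeholder comment are just illustrations, not part of any particular program:

#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Pause so you have time to ssh to the node, find the PID
       with "ps -u $USER" and attach with "gdb -p <PID>" */
    sleep(60);

    /* ... the real work of the program goes here ... */

    MPI_Finalize();
    return 0;
}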

Other tools

Valgrind is a set of tools that let you examine memory-related issues in various ways. It tracks your memory and cache usage, and will help you find memory overruns, use-after-free errors, and other memory-, cache- and thread-related bugs. You can also use Valgrind to profile your code with the "callgrind" tool. The official manual is a good place to start, and there are any number of good tutorials on the net as well.
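
As a quick sketch (with "myprog" standing in for your own program), checking for memory errors and profiling with callgrind look like this; callgrind writes its results to a "callgrind.out.<pid>" file you can inspect afterwards:

$ valgrind --tool=memcheck ./myprog
$ valgrind --tool=callgrind ./myprog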

Perf is the default profiler for Linux systems. You can use this to find out where your code is spending most of its time. This is very important when you want to optimise your running time; you can't improve your code when you don't know what parts actually need to be improved. This site has a lot of good usage examples.
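
A typical perf workflow (again with "myprog" as a stand-in) is to record a profile while the program runs, then inspect it interactively:

$ perf record ./myprog
$ perf report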

Python: cProfile and timeit

There are a number of tools for profiling and examining Python programs. Two of the most used are "timeit" and "cProfile".

timeit is a module in the Python standard library. You can give it any function or snippet of code; timeit will then run it repeatedly and measure the time it takes. This is particularly useful if you have multiple ways to do something and want to know which is faster. This page has more examples of use.
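
A minimal sketch from the command line, comparing two ways of squaring a list of numbers (the snippets themselves are just examples):

$ python -m timeit "[x*x for x in range(1000)]"
$ python -m timeit "list(map(lambda x: x*x, range(1000)))"

Each run prints the time per loop, so you can compare the two directly.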

cProfile is the standard Python profiler. It lets you find out where your Python program spends most of its time, and therefore where you should spend your effort improving it.
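
The simplest way to use it is from the command line; here "myscript.py" is a hypothetical script, and the output is sorted by cumulative time:

$ python -m cProfile -s cumtime myscript.py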