Profile your job
There comes a time in the life of every user when you ask yourself "Is this it? Is this really all?" You start to wonder if your jobs really run as well as they could, or if you can perhaps improve them in some way, whether by rewriting your code or simply by finding a better set of options for your Slurm job.
In HPC, as in science, you can't improve something until you can measure it. So how do you "measure" your job? This page will tell you how. The tools below are vaguely ordered by how general and simple they are to use.
These are tools that are useful for a wide range of applications, no matter what language they are written in. You don't need to modify your program so they can be used for any application, and you don't need to be a programmer to use them.
Ruse is a utility that periodically measures the memory, the number of processes and the time that your job needs, presented in a way that is straightforward to apply to your Slurm job settings.
Ruse periodically samples the process and its subprocesses and keeps track of the CPU, time and maximum memory use. It also optionally records the sampled values over time. The purpose of Ruse is not to profile processes in detail, but to follow jobs that run for many minutes, hours or days, with no performance impact and without changing the measured application in any way.
You use Ruse by loading the module, then adding the "ruse" command in front of your program. Ruse will wake up periodically (every 10 seconds by default) and measure the memory used and the number of processes (total and active). At the end, it writes a file ending in ".ruse" with the results for you to inspect.
Below we have an example of using Ruse to measure the time, processes and memory of a computational chemistry package run with the "g09" command:
$ module load ruse/2.0
$ ruse g09 <Azulene.inp >azu.out

Time:           00:00:39
Memory:         448.5 MB
Cores:          8
Total_procs:    9
Active_procs:   8
Proc(%):        74.8  50.9  50.8  50.3  49.6  49.0  45.5  44.0
We get a lot of useful information:
- Time is the total running time, in the same format you would use with Slurm.
- Memory is the peak amount of memory used by the application.
- Cores is the total number of cores available. This would usually be what you set with the "-c" parameter to Slurm.
- Total procs is the maximum number of simultaneous processes started by the application. This includes sleeping processes that don't actually do anything.
- Active procs is the peak number of simultaneously active processes. For compute-intensive processes this should be no more than the number of available cores.
- Proc% is how much CPU each process used during execution. This can help you figure out if you are using the cores effectively.
Ruse has a set of useful options. The most common are:
-s, --steps           Print each sample step
-t, --time=SECONDS    Sample every SECONDS (default 10)
-h, --help            Print help
With "--steps" you tell Ruse to print the resources you use every time step, not just a summary at the end. "--time" lets you set a different time step than the default. And "--help" gives you a list of all the options you have available.
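As a sketch of how these options fit into a real job, here is what a batch script using Ruse might look like. The resource numbers and the "g09" input are only an illustration (they mirror the example above); adjust them to your own job:

```shell
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00

# Record a sample every 60 seconds and keep the per-step values
# ("--steps"), not just the final summary.
module load ruse/2.0
ruse --steps --time=60 g09 <Azulene.inp >azu.out
```

After the job finishes, look for the ".ruse" file next to your output for the measured values.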
Please see our Ruse documentation page for all the details on this tool.
Allinea Reports mainly measures how efficiently your program uses MPI and threads. It's useful both when you write your own code and when you run installed applications.
To use it, you load the module, then run your program with "perf-report":
$ module load allinea
$ perf-report <my program>
When you profile an MPI program, you should use "perf-report" to run the program directly. In other words, you should not use "srun --mpi=pmix" to run it, and you should not use "mpirun". Instead, perf-report takes an "-np" argument to set the number of processes; "--mpi=<version>" to set the MPI version you use; "--procs-per-node" to set the number of processes on each node; and "--mpiargs" to set MPI options in general.
This is how you would use it in a Slurm job on Deigo:
$ perf-report --mpi=openmpi -np $SLURM_NPROCS <my MPI program>
On the clusters you need to specify the MPI type with "--mpi"; Allinea Reports is not able to autodetect the MPI libraries by itself.
Some useful options:
-np <procs>                 Number of processes
--procs-per-node <procs>    Processes per node
--mpiargs <args>            Set general MPI arguments
--mpi=<version>             Set MPI version - try "openmpi" and "intel-mpi"
--no-mpi                    Don't use MPI (for multithreaded apps)
--openmp-threads <threads>  The number of OpenMP threads to use
--help                      List all options
Allinea will output a text file and an HTML file with the performance information. They contain the same information, though the HTML file has a nicer layout and a few graphs. You can see the HTML file by copying it to your local computer and opening it in your browser.
Here is the result of using Allinea on an example MrBayes job:
$ srun -p compute -n 4 --mem=16G -t 0-1 --pty bash
$ ml MrBayes.mpi allinea
$ perf-report --mpi=openmpi -np 4 mb test_sequence.nex
The output is in sections, with an overview of resources at the top, then a summary that shows how much time you spent on computation, MPI communication, and file IO respectively. This summary is the most useful part for those of you who aren't writing your own code. If the MPI share is fairly high, for instance, it probably means you're trying to use too many processes.
The later sections look at each segment — CPU, MPI, IO, Threading and Memory — in a bit more detail. This can be useful for you if you are writing your own code. The "Energy" section is empty as we don't track energy use at the individual job level.
"time" and "/usr/bin/time" are commands for measuring the time a command takes to run. You use them in the same way as "ruse" and "perf-report" above, by putting them in front of the command you want to measure.
$ time spades.py --test
...

real    0m11.265s
user    0m10.585s
sys     0m1.154s
"time" is not actually a stand-alone program; it is a command built into the bash shell itself. When you run "time command" it is the bash shell itself that measures the time taken. Here we measured spades.py running its own self-test. The real total time was about 11 seconds; time spent in spades.py was 10.5 seconds, and time spent in the operating system (reading and writing files and the like) was 1.1 seconds.
"time" is very rudimentary. It gives you the "real" (total or "wall-clock") time; the "user" time spent in the program itself; and the "sys" time spent in system calls.
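You can try this on any command. One small refinement worth knowing: bash's TIMEFORMAT variable controls how the built-in "time" prints its result (the format string below is just one possible choice):

```shell
# Time a trivial command with bash's built-in "time".
# TIMEFORMAT is a bash variable; %R, %U and %S are the real,
# user and sys times in seconds. The report goes to stderr.
TIMEFORMAT='real %R s, user %U s, sys %S s'
time sleep 1
```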
$ /usr/bin/time spades.py --test
10.58user 1.20system 0:11.26elapsed 104%CPU (0avgtext+0avgdata 546644maxresident)k
24inputs+20792outputs (1major+125630minor)pagefaults 0swaps
The "time" program in "/usr/bin" is a real application, and gives you a lot more information about the program you run. By default it gives you a very telegraphic summary: user, system and total time; amount of CPU; memory; disk IO; memory page faults and swapped pages.
You can expand the output and add more information with the "-v" or "--verbose" option:
$ /usr/bin/time -v spades.py --test
        Command being timed: "spades.py --test"
        User time (seconds): 10.47
        System time (seconds): 1.23
        Percent of CPU this job got: 105%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.09
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 546496
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 125664
        Voluntary context switches: 20591
        Involuntary context switches: 133
        Swaps: 0
        File system inputs: 24
        File system outputs: 20792
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
It gives you the user, system and total time; total CPU and memory consumed; page faults; context switches; and the amount of various kinds of IO. Many of these are not directly useful if you're running somebody else's program, but if you write your own code it can give you some insights into how the program is performing.
You need to specify the path to the application, and run it as "/usr/bin/time"; just calling "time" would give you the simple built-in shell command above.
When you run a job, you are able to log into the nodes that your job runs on, and examine your application directly. This can be very useful to check that everything is running properly.
First start a job (we use "MrBayes" as above), then check your job to see what nodes you use:
$ squeue
  JOBID PARTITION  NAME      USER ST      TIME  NODES NODELIST(REASON)
6488691   compute  bash  jan-more  R    1:57:44     1 deigo011920
The node name here is "deigo011920". If your job uses multiple nodes, pick one. Then ssh into it:
$ ssh deigo011920
You can now examine your job in various ways. You can see any processes you are running with the "ps" command:
$ ps u -u $USER
USER        PID %CPU %MEM    VSZ   RSS TTY   STAT START  TIME COMMAND
jan-mor+  57243  2.0  0.0 284212 21276 pts/0 Sl+  17:10  0:00 mpirun -np 4 mb testseq1c.nex
jan-mor+  57254 93.5  0.0 389388 76796 pts/0 Rl   17:10  0:06 mb testseq1c.nex
jan-mor+  57255 94.2  0.0 388360 77164 pts/0 Rl   17:10  0:06 mb testseq1c.nex
jan-mor+  57257 94.2  0.0 388360 77064 pts/0 Rl   17:10  0:06 mb testseq1c.nex
jan-mor+  57258 94.0  0.0 389396 78952 pts/0 Rl   17:10  0:06 mb testseq1c.nex
jan-mor+  57470  0.0  0.0  54660  3780 pts/1 R+   17:10  0:00 ps u -u jan-moren
jan-mor+  88494  0.0  0.0  93464  9700 ?     Ss   15:12  0:00 /usr/lib/systemd/systemd --user
jan-mor+  88498  0.0  0.0 346856 20160 ?     S    15:12  0:00 (sd-pam)
...
"$USER" contains your user name; you can put your user name directly there if you want.
This will list your current processes, from all your jobs using this node. As you can see, you have quite a few processes running on a node. You can see the "mpirun" command we used to start the job, the four MrBayes processes ("mb testseq1c.nex"), the "ps" command itself, and half a dozen background processes we don't need to care about.
For each process, you can see the user and the PID (process ID); how much CPU and memory it is currently using (the memory figure is a percentage of the total memory on the node, not of what you asked for); how much memory it has requested (VSZ) and how much it is actually using (RSS); how long it's been running; and the command line.
At the very least, this will tell you if your job is actually doing something or not. If the CPU percentage is stuck at 0.0 then it may be hanging, waiting for something to happen. If the CPU percentage is higher, you know it's at least doing something.
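If the default "ps u" output is too noisy, you can ask ps for just the columns you care about and sort by CPU use. This is a standard ps invocation, nothing cluster-specific; the column choice here is just one reasonable selection:

```shell
# List your own processes with PID, CPU%, memory%, resident memory (kB),
# elapsed time and command name, busiest processes first.
ps -u "$USER" -o pid,pcpu,pmem,rss,etime,comm --sort=-pcpu | head -10
```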
You can get more of a live view of the processes using top:
$ top -u $USER
This will give you an updated list of your processes over time, along with some general data on the system:
top - 10:54:05 up 133 days,  1:20,  1 user,  load average: 60,08, 50,77, 43,87
Tasks: 1784 total,  44 running, 1740 sleeping,   0 stopped,   0 zombie
%Cpu(s): 42,9 us,  0,7 sy,  0,0 ni, 56,0 id,  0,0 wa,  0,3 hi,  0,1 si,  0,0 st
MiB Mem : 515319,1 total, 108200,8 free,  37144,6 used, 369973,7 buff/cache
MiB Swap:   8192,0 total,   7142,2 free,   1049,8 used. 473018,2 avail Mem

  PID USER      PR NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
46672 jan-mor+  20  0  388360  78460 16836 R  99,7  0,0  0:04.59 mb
46673 jan-mor+  20  0  389412  77164 17580 R  99,3  0,0  0:04.57 mb
46671 jan-mor+  20  0  388368  79128 17508 R  99,0  0,0  0:04.58 mb
46670 jan-mor+  20  0  389388  78372 16728 R  98,0  0,0  0:04.51 mb
46665 jan-mor+  20  0   63332   6428  3880 R   1,0  0,0  0:00.16 top
46666 jan-mor+  20  0  284208  21384 17952 S   0,3  0,0  0:00.16 mpirun
At the top you can see a lot of information about the node. Mostly this is not useful to you, as this reflects the total use by all users using this node. Below you have a list of your processes, with much the same information you get with "ps" above, ordered slightly differently.
You exit with "q" and you can get help with "h". You can sort your processes in various ways, change the information it displays, and so on.
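top also has a non-interactive batch mode, which is handy if you just want a one-off snapshot that you can save to a file or grep through:

```shell
# -b: batch mode (plain text, no interactive screen)
# -n 1: take a single snapshot and exit
# -u: show only your own processes
top -b -n 1 -u "$USER" | head -15
```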
Htop is a top-like tool (written as an improved top). It can show you the same kind of display, and has a lot of configuration options, including showing the CPU usage graphically, showing your threads as well as processes, and much more. Feel free to play around with the settings to see what you can find out.
At times it can be useful to follow what a process does in more detail. You might want to know why a program always hangs at some point. Or you're developing your own code and need a way to look for bugs.
"strace" will list every system call your program makes, and print the details of the call. This is very useful! System calls are how programs interact with the world, so they tell us what the program is doing.
For example, it can tell us what the program is doing just before it hangs or crashes. It can show us what files it is reading and writing. If you need to know whether a program really is reading your configuration file, strace can tell you. Or you can find out where a program is saving some output file you can't seem to find.
To be honest, strace perhaps belongs to the "Advanced Tools" in the next section. But it can be so very useful that I believe it's worth mentioning among the general tools.
You can run it at startup like this - we'll strace a program called "myprog":
$ strace -o strace.out myprog
This will dump all strace output into a file called "strace.out". This will generally be a pretty large file — applications do a lot of things behind the scenes as they start up and this generates lots of system activity.
This is a powerful and often underrated tool. Once you know how to read the (messy, voluminous) output it is useful in a wide range of situations.
A simple one might be when a program just seems to freeze or hang for no apparent reason. You can use "strace" on it, and see what it is trying to do when it hangs. The reason it hangs can often be that it tries to read a file when the file system has a problem; or it tries to contact a remote computer and gets no answer; or an MPI program tries to do an MPI operation after one of the processes has died. With "strace" you can see exactly what file or remote resource it's waiting for.
For a real, if contrived, example, let's strace "ls" trying to show a file that doesn't exist. Run this:
$ strace -o ls.out ls doesnt_exist
ls: cannot access 'doesnt_exist': No such file or directory
We asked ls to show us the file "doesnt_exist", which of course doesn't exist, so ls complains then stops. We can now examine "ls.out" to see what ls did, and how it failed. You can look at the file with less:
$ less ls.out
execve("/bin/ls", ["ls", "doesnt_exist"], 0x7ffe61fa0388 /* 70 vars */) = 0
brk(NULL)                               = 0x55dbd7490000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc4049a3c0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
.
.
.
It's about a hundred lines that looks very messy at first. Each line is actually a system call — "ls" asks the system to do something, and the operating system responds. Most of this file is startup-related things, such as asking for memory, finding libraries and such. Right toward the end we can see:
stat("doesnt_exist", 0x55f4ee59da88)    = -1 ENOENT (No such file or directory)
lstat("doesnt_exist", 0x55f4ee59da88)   = -1 ENOENT (No such file or directory)
write(2, "ls: ", 4)                     = 4
write(2, "cannot access 'doesnt_exist'", 28) = 28
write(2, ": No such file or directory", 27) = 27
write(2, "\n", 1)                       = 1
close(1)                                = 0
close(2)                                = 0
exit_group(2)                           = ?
"doesnt_exist" is the name of the file we asked ls to list for us. "stat" is a system call asking if there is a file with that name. The answer from Linux is "-1 ENOENT (No such file or directory)" - that is, "no" ("ENOENT" is short for "Error, no such entry"). ls then asks again with "lstat", which checks for the file without following symbolic links, and again the answer is "no". Then ls uses "write" calls to print out the error message that we got when we tried running it.
You can attach to a running program with "-p PID", you can trace multithreaded apps with "-f", and you can filter out the system calls so you only see, say, file operations or network operations if you like.
A devops engineer named Julia Evans has written a slightly silly, but really very well written zine-format explanation of strace; it's only 8 hand-drawn pages and you can find the PDF at this link. As an aside, her zines showcase how technical documentation doesn't have to be dry and long-winded to be rigorous or useful.
If you are already comfortable with the idea of system calls, another good place to start learning strace is to look at the manual page:
$ man strace
It has a lot of useful examples in addition to an exhaustive explanation of all options.
These tools are perhaps mostly useful for those of you who write your own programs. But even if you're not, don't dismiss them out of hand — they can still be helpful from time to time!
gdb is the standard debugger for Linux systems. How — and why — to use it is far outside the scope of this page (see here or here for a couple of the many good online tutorials), but I want to highlight that you can use it on the cluster for non-interactive programs.
ssh into a node that's running your program:
$ squeue -u $USER
...
$ ssh deigo012345
Then you can attach the debugger to a running process using the "-p" parameter. Find out the PID of your process, then attach gdb to it:
$ ps -u $USER
...
$ gdb -p <PID>
You can debug a running process anywhere on the cluster this way. A useful trick for MPI programs is to add a delay right at the start of your code. In C or C++ call sleep() for, say, 30-60 seconds or so at the start of your code, after you have initialized MPI. That gives you time to ssh into a node, find the PID of an MPI process and attach to it before it continues to run.
Valgrind is a set of tools that let you examine memory-related issues in various ways. It tracks your memory and cache usage, and will help you find memory overruns; use after free; and other memory, cache and thread-related bugs. You can also use Valgrind to profile your code with the "callgrind" tool. The official manual is a good place to start, and there are any number of good tutorials on the net as well.
Perf is the default profiler for Linux systems. You can use this to find out where your code is spending most of its time. This is very important when you want to optimise your running time; you can't improve your code when you don't know what parts actually need to be improved. This site has a lot of good usage examples.
There are a number of tools for profiling and examining Python programs. Two of the most used are "timeit" and "cProfile".
Timeit is a standard Python library. You can give it any function (or a whole program); timeit will then run it repeatedly and measure the average time it takes. This is particularly useful if you have multiple ways to do something and want to know which is faster. This page has more examples of use.
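You can also run timeit straight from the command line, without writing any wrapper code. As a small illustration (the statements being timed here are just examples), comparing two ways of squaring a list of numbers:

```shell
# -s runs its setup statement once; the final statement is what gets
# timed, repeatedly. timeit reports the best average time per loop.
python3 -m timeit -s 'xs = range(1000)' '[x*x for x in xs]'
python3 -m timeit -s 'xs = range(1000)' 'list(map(lambda x: x*x, xs))'
```

The faster variant is the one with the lower time per loop.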
cProfile is the default Python profiler. It lets you find out where your Python program spends most of its time, and so where you should spend most of your effort improving it.
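A minimal sketch of how to run it: profile a script with "python3 -m cProfile" and sort the report by cumulative time. The "toy.py" script below is just a stand-in for your own program:

```shell
# Create a small example script to profile ("toy.py" is only an
# illustration - substitute your own script).
cat > toy.py <<'EOF'
def slow():
    return sum(i * i for i in range(100_000))

for _ in range(20):
    slow()
EOF

# -s cumulative sorts the report by cumulative time per function,
# so the most expensive call paths appear at the top.
python3 -m cProfile -s cumulative toy.py | head -15
```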