Profiling Distributed Applications with Perf
Written by Sander Mertens
February 16, 2018
I, like many developers, have been in situations where I needed to take an existing application and make it faster, basically by removing slow code and replacing it with fast code. I know now to follow one simple rule when it comes to optimizing code:
Whatever code I think is slowing down the application is where I should look last.
Profiling is a trade that makes you come to terms with the limitations of your intuition very quickly. I realized early on I needed cold, hard measurements to tell me which parts of my code needed optimizing. Fortunately there is a plethora of profiling tools available that can measure just about anything related to how your code is running.
Tools, however, do not necessarily make profiling easy. Interpreting measurements can be tricky, and variables need to be tightly controlled when conducting experiments. In particular, multi-threaded and distributed applications are hard to profile.
Anyone who has ever had to debug a race condition will be familiar with how time-sensitive the behavior of multi-threaded applications is. Profiling multi-threaded applications has similar challenges, as timing becomes a significant factor in the measurement.
Profilers like callgrind slow down your program significantly, and therefore impact timing. An example that shows a limitation of such profilers is mutex contention. Your application may run slowly because a mutex is being heavily used, causing your code to spend a lot of time in lock functions. A tool like callgrind would not reveal this, as it counts instructions, not time.
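To make the gap between "instructions" and "time" concrete, here is a minimal Python sketch. It is my own illustration and has nothing to do with callgrind's internals or with the Connext example; the names measure_lock_wait and hold_seconds are made up. One thread waits for a lock that another thread holds, and we compare the wall-clock time the waiter loses with the CPU time it actually consumes.

```python
import threading
import time


def measure_lock_wait(hold_seconds=0.3):
    """Return (wall_seconds, cpu_seconds) for a thread blocked on a held lock."""
    lock = threading.Lock()
    result = {}

    def waiter():
        wall_start = time.perf_counter()
        cpu_start = time.thread_time()  # CPU time used by this thread only
        with lock:                      # blocks until the lock is released
            pass
        result["wall"] = time.perf_counter() - wall_start
        result["cpu"] = time.thread_time() - cpu_start

    lock.acquire()                      # the main thread takes the lock first
    thread = threading.Thread(target=waiter)
    thread.start()
    time.sleep(hold_seconds)            # stand-in for a long critical section
    lock.release()
    thread.join()
    return result["wall"], result["cpu"]


if __name__ == "__main__":
    wall, cpu = measure_lock_wait()
    print(f"waiting thread: wall={wall:.3f}s cpu={cpu:.3f}s")
```

The waiting thread should report a wall time close to hold_seconds and a CPU time close to zero: it was stuck for a long time while executing almost nothing, which is exactly the kind of cost an instruction counter cannot see.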
There is another class of profilers which do “statistical profiling.” These profilers allow you to run your program like you normally would, while taking periodic snapshots of where the application is spending its time. These profilers need to run for some time to produce accurate results, but can do so with minimal impact on timing. That makes them a great fit for profiling multithreaded and/or distributed applications!
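The idea behind statistical profiling can be sketched in a few lines of Python. This is a toy analogy under my own assumptions, not how perf works: perf samples from the kernel using timer or hardware-counter events, whereas this sketch simply peeks at another thread from a sampler loop using CPython's sys._current_frames(). The functions sample_thread, slow_part, fast_part and workload are hypothetical names for this example.

```python
import collections
import sys
import threading
import time


def sample_thread(target, interval=0.005):
    """Run target() in a thread and periodically record which function it is in."""
    counts = collections.Counter()
    worker = threading.Thread(target=target)
    worker.start()
    while worker.is_alive():
        # Snapshot of the innermost Python frame of the worker thread, if any.
        frame = sys._current_frames().get(worker.ident)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)
    worker.join()
    return counts


def fast_part():
    deadline = time.perf_counter() + 0.03
    while time.perf_counter() < deadline:
        pass


def slow_part():
    deadline = time.perf_counter() + 0.3
    while time.perf_counter() < deadline:
        pass


def workload():
    fast_part()
    slow_part()


if __name__ == "__main__":
    for name, hits in sample_thread(workload).most_common():
        print(f"{name}: {hits} samples")
```

The program under test runs at nearly full speed, and the function that takes the most time collects the most samples. The counts vary from run to run, which is why a statistical profiler needs to run for a while before its numbers can be trusted.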
I wanted to share a profiling workflow using the Linux perf tool, which I have found especially useful because it allows me to quickly identify performance “hotspots.” I will use the c/hello_dynamic example from RTI Connext 5.3.0 as a target for measuring performance.
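First, make sure that perf is installed on your Linux machine. On Ubuntu, I had to run the command below to install perf on my machine. The version-specific package has to match your running kernel (check with uname -r), so the package name on your machine will probably differ from mine: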
sudo apt-get install linux-tools-common linux-tools-3.13.0-107-generic
Next, you need to download a GitHub project that can convert the output from perf to what is called a “FlameGraph,” which is a visual representation of the collected profiling data. Run this command from a location that is convenient to access (like your home directory):
git clone https://github.com/brendangregg/FlameGraph
Now navigate to the hello_dynamic example in the rti_workspace/examples/c folder. Build the code with these commands (make sure NDDSHOME is set to the RTI Connext installation):
export DEBUG=1
make -f makefile_Hello_x64Linux3gcc4.8.2
The platform in the name of the makefile might be different from your platform. Note how we set the DEBUG environment variable. We do this so that the binary has debugging symbols, which will allow us to see the names of functions in the callstacks that perf outputs.
We can now run perf on our code. Run the following commands:
objs/x64Linux3gcc4.8.2/Hello sub &
sudo perf record -g objs/x64Linux3gcc4.8.2/Hello pub
After some time, press Ctrl-C to stop the publisher. Perf will have produced a file called “perf.data” in the current directory. We now need to translate this file into something the FlameGraph tool understands, using a script from the FlameGraph repository:
perf script -f | ~/FlameGraph/stackcollapse-perf.pl > out.perf-folded
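If you are curious what ends up in out.perf-folded: each line is one unique call stack, written root first with frames separated by semicolons, followed by a space and the number of samples that hit that stack. The Python sketch below mimics that collapsing step on a handful of invented samples. It is not a replacement for stackcollapse-perf.pl, and the frame names (publish_loop, send_sample and so on) are hypothetical, not real Connext functions.

```python
from collections import Counter

# Invented samples: each tuple is one sampled call stack, root frame first.
SAMPLES = [
    ("Hello", "main", "publish_loop", "send_sample"),
    ("Hello", "main", "publish_loop", "send_sample"),
    ("Hello", "main", "publish_loop", "build_sample"),
    ("Hello", "receive_thread", "read_socket"),
]


def fold_stacks(samples):
    """Collapse sampled call stacks into folded 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


if __name__ == "__main__":
    for line in fold_stacks(SAMPLES):
        print(line)
    # Hello;main;publish_loop;build_sample 1
    # Hello;main;publish_loop;send_sample 2
    # Hello;receive_thread;read_socket 1
```

Because the folded file is plain text, it is easy to grep or post-process before turning it into an image.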
From here, we can generate the FlameGraph image:
~/FlameGraph/flamegraph.pl out.perf-folded > perf.svg
When you open the perf.svg file in a web browser, you should see a flame graph of the publisher's call stacks.
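The width of each bar represents the share of samples in which that function was on the call stack, which approximates the time spent in the function and everything it calls. Note that the horizontal axis is not a timeline: the left-to-right order of the bars carries no meaning, as the tool sorts them alphabetically. The stacked bars represent the call stack of your application, with each caller sitting below the functions it calls. You can click on any bar to zoom in to that particular stack.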
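The way bar widths follow from the folded data can be sketched as below. This is a simplified reading of what the FlameGraph script does, under the assumption that each bar corresponds to one call path (a stack prefix) and that its width is the fraction of all samples passing through that path. The function name bar_widths and the folded lines are invented for illustration.

```python
from collections import Counter

# Invented folded lines, in the "frame;frame;frame count" format.
FOLDED = [
    "Hello;main;publish_loop;build_sample 1",
    "Hello;main;publish_loop;send_sample 2",
    "Hello;receive_thread;read_socket 1",
]


def bar_widths(folded_lines):
    """Map each call path (stack prefix) to its fraction of all samples."""
    per_path = Counter()
    total = 0
    for line in folded_lines:
        stack, count_text = line.rsplit(" ", 1)
        count = int(count_text)
        total += count
        frames = stack.split(";")
        # Every prefix of the stack is a bar that this sample contributes to.
        for depth in range(1, len(frames) + 1):
            per_path[";".join(frames[:depth])] += count
    return {path: hits / total for path, hits in per_path.items()}


if __name__ == "__main__":
    widths = bar_widths(FOLDED)
    print(widths["Hello;main"])                           # 0.75
    print(widths["Hello;main;publish_loop;send_sample"])  # 0.5
```

A bar that spans most of the image therefore means "most samples passed through this call path", not "this function ran first" or "this function ran for one long stretch".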
Try re-running the publisher, but without a subscriber. You will notice that the right part of the flamegraph will disappear, as DDS does not send out any data when there are no subscribers!
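If you keep the folded files from both runs, you can also compare them as text instead of eyeballing two images. The sketch below lists the call stacks that show up in one folded profile but not in the other. The function name stacks_only_in and the sample lines are invented, and which stacks actually vanish in your runs is something only your own measurements can tell you.

```python
def stacks_only_in(first_folded, second_folded):
    """Return the folded stacks present in first_folded but absent from second_folded."""
    def stacks(lines):
        # Drop the trailing sample count and keep only the call path.
        return {line.rsplit(" ", 1)[0] for line in lines}

    return sorted(stacks(first_folded) - stacks(second_folded))


# Invented example data for a run with and a run without a subscriber.
WITH_SUBSCRIBER = [
    "Hello;main;publish_loop;build_sample 10",
    "Hello;main;publish_loop;send_sample 30",
]
WITHOUT_SUBSCRIBER = [
    "Hello;main;publish_loop;build_sample 12",
]

if __name__ == "__main__":
    for stack in stacks_only_in(WITH_SUBSCRIBER, WITHOUT_SUBSCRIBER):
        print(stack)
    # Hello;main;publish_loop;send_sample
```

Sample counts from two separate runs are not directly comparable, so this only checks for presence or absence of a call path.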
The perf tool can do a lot more than what this blog describes. If you know of other settings or tools that have made your profiling life easier, let us know!