Profiling is hard. Measuring the right metric and correctly interpreting the obtained data can be difficult even for relatively simple programs.

For performance optimization, I'm a big fan of the poor man's profiler: run the binary to analyze under a debugger, periodically stop the execution, get a backtrace and continue. After doing this a few times, the hotspots will become apparent. This works amazingly well in practice and gives a reliable picture of where time is spent, without the danger of skewed results from instrumentation overhead.

Sometimes it's nice to get a more fine-grained view. That is, not only find the hotspot, but get an overview how much time is spent where. That's where 'real' profilers come in handy.

Under Windows, I like the built-in "Event Tracing for Windows" (ETW), which produces files that can be analyzed with Xperf/Windows Performance Analyzer. It is a really well thought out system, and the Xperf UI is amazing in the analyzing abilities that it offers. Probably the best place to start reading up on this is ETW Central.

Under Linux, I haven't found a profiler I can really recommend, yet. gprof and sprof are both ancient and have severe limitations. OProfile may be nice, but I haven't had a chance to use it yet, as it wasn't available for my Ubuntu LTS release.

I have used Callgrind from the Valgrind toolkit in combination with the KCachegrind GUI analyzer. I typically invoke it like this:

valgrind --tool=callgrind --callgrind-out-file=callgrind-cpu.out ./program-to-profile
kcachegrind callgrind-cpu.out

Callgrind works by instrumenting the binary under test. It slows down program execution, often by a factor of 10. Further, it only measures CPU time, so sleeping times are not included. This makes it unsuitable for programs that wait a significant amount of time for network or disk operations to complete. Despite these drawbacks, it's pretty handy if CPU time is all that you're interested in.

If blocking times are important (as they are for so many modern applications - we generally spend less time computing and more time communicating), gperftools is a decent choice. It includes a CPU profiler that can be run in real-time sampling mode, and the results can viewed in KCachegrind. It is recommended to compile into the binary to analyze, but using LD_PRELOAD works decently well:

CPUPROFILE_REALTIME=1 CPUPROFILE=prof.out LD_PRELOAD=/usr/lib/ ./program-to-profile
google-pprof --callgrind ./program_to_profile prof.out > callgrind-wallclock.out
kcachegrind callgrind-wallclock.out

If it works, this gives a good overall profile of the application. Unfortunately, it sometimes fails: on amd64, there are sporadic crashes from within libunwind. It's possible to just ignore those and rerun the profile, at least interesting data is obtained 50% of the time.

The more serious problem is that CPUPROFILE_REALTIME=1 causes gperftools to use SIGALARM internally, conflicting with any applications that want to use that signal for themselves. Looking at the profiler source code, it should be possible to work around this limitation with the undocumented CPUPROFILE_PER_THREAD_TIMERS and CPUPROFILE_TIMER_SIGNAL environment variables, but I couldn't get that to work yet.

You'd think that perf has something to offer in this area as well. Indeed, it has a CPU profiling mode (with nice flamegraph visualizations) and a sleeping time profiling mode, but I couldn't find a way to combine the two to get a real-time profile.

Overall, there still seems to be room for a good, reliable real-time sampling profiler under Linux. If I'm missing something, please let me know!