Let me save you fifteen minutes, or the rest of your life: They aren’t.
Profilers alter the behavior of the system. Nothing has high enough clock resolution or fidelity to make them accurate. Intel tried to solve this by building profiling into the processor, and that only helped slightly.
Big swaths of my career, and the resulting wins, started with the question,
“What if the profiler is wrong?”
One of the first things I noticed is that no profilers make a big deal out of invocation count, which is a huge source of information for continuing past the tall tent poles or hotspots into productive improvement. I have seen one exception to this, but that tool became defunct sometime around 2005 and nobody has copied it since.
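To make that concrete, here's a minimal sketch of the kind of bookkeeping I mean, with invented function names and workloads. Recording invocation count next to total time lets you split a hot entry into "called too often" versus "slow per call", and the productive fix is different in each case:

```c
/* Minimal sketch, not any particular profiler's output: track invocation
 * count alongside total time. All names and workloads here are made up. */
#include <stdio.h>
#include <time.h>

typedef struct {
    const char   *name;
    unsigned long calls;
    double        total_ns;
} prof_entry;

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

static void record(prof_entry *e, double start_ns) {
    e->calls++;
    e->total_ns += now_ns() - start_ns;
}

/* Stand-ins for real work: one cheap function called very often,
 * one expensive function called rarely. */
static void parse_header(void)  { for (volatile int i = 0; i < 50; i++) ; }
static void rebuild_index(void) { for (volatile int i = 0; i < 5000000; i++) ; }

int main(void) {
    prof_entry entries[2] = { { "parse_header", 0, 0.0 }, { "rebuild_index", 0, 0.0 } };

    for (int i = 0; i < 1000000; i++) {       /* a million cheap calls */
        double t = now_ns();
        parse_header();
        record(&entries[0], t);
    }
    for (int i = 0; i < 10; i++) {            /* ten expensive calls */
        double t = now_ns();
        rebuild_index();
        record(&entries[1], t);
    }

    /* The totals can land in the same ballpark while the remedies differ:
     * call less often versus make each call faster. */
    for (int k = 0; k < 2; k++)
        printf("%-14s calls=%8lu total=%9.2f ms per-call=%10.1f ns\n",
               entries[k].name, entries[k].calls,
               entries[k].total_ns / 1e6,
               entries[k].total_ns / entries[k].calls);
    return 0;
}
```

A real profiler gathers this automatically; the point is only that the calls column carries as much signal as the time column.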
Because of CPU caches, branch prediction, and amortized activities in languages or libraries (memory defragmentation, GC, flushing), many things get tagged by the profiler as expensive when they are really being scapegoated: they get stuck paying someone else’s bill. They exist at the threshold where actions can no longer be deferred and have to be paid for now.
So what you’re really looking for in the tools is everything that looks weird. And that often involves ignoring the fancy visualization and staring at the numbers. Which are wrong. “Reading the tea leaves” as they say.
> Let me save you fifteen minutes, or the rest of your life: They aren’t.
Knowing that all profilers aren't perfectly accurate isn't a very useful piece of information. However, knowing which types of profilers are inaccurate and in which cases is indeed very useful information, and this is exactly what this article is about. Well worth 15 minutes.
> And that often involves ignoring the fancy visualization and staring at the numbers.
Visualisations are incredibly important. I've debugged a large number [1] of performance issues and production incidents highlighted by the async profiler producing Brendan Gregg's flame graphs [2]. Sure, things could be presented as numbers, but what I really care about most of the time when I take a CPU profile from a production instance is which part of the system was taking most of the CPU cycles.
[1]: https://x.com/SerCeMan/status/1305783089608548354
[2]: https://www.brendangregg.com/flamegraphs.html
Heisenberg principle but for programming
I'm pretty sure performance counters count accurately. They're a bit finicky to use, but they don't alter CPU execution.
Last I had to deal with them, which was eons ago, higher-end CPUs like Xeons had more counters, and more useful ones.
I'm sure there are plenty of situations where they're insufficient, but it's absurd to paint the situation as always and completely hopeless.
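For what it's worth, counting events around a region of code on Linux takes only a few dozen lines via perf_event_open(2). A rough sketch following the pattern in the man page (Linux-only, error handling trimmed, and the measured loop is just a placeholder):

```c
/* Sketch: count retired instructions for one region of code using Linux
 * perf_event_open(2). The PMU does the counting, so the measured code
 * itself is not instrumented. Needs permission to count (see
 * perf_event_paranoid); the loop below is a stand-in for real work. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long sum = 0;                      /* the region being measured */
    for (long i = 0; i < 10 * 1000 * 1000; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) { perror("read"); return 1; }
    printf("instructions retired: %lld (sum=%ld)\n", count, sum);

    close(fd);
    return 0;
}
```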
Another option is to use the "processor trace" functionality available in Intel and Apple CPUs. This can give you a history of every single instruction executed and timing information every few instructions, with very little observer effect. Probably way more accurate than the approach in the paper, though you need the right kind of CPU and you have to deal with a huge amount of data being collected.
Those definitely make them less wrong, but still leave you hanging because most functions have side effects and those are exceedingly difficult to trace.
The function that triggers GC is typically not the function that made the mess.
The function that stalls on L2 cache misses often did not cause the miss.
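A toy illustration of the cache case, with arbitrary buffer sizes I made up for the sketch: the function that pays the miss penalty is the one that merely touches data another function evicted, and a profiler will happily charge the time to it:

```c
/* Sketch: the "victim" sweep has a small, cache-friendly working set, but
 * after the "pollute" sweep walks a large buffer and evicts it, the victim
 * is the code that stalls on the misses and gets billed for the time. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SMALL_N (8 * 1024)          /* 64 KiB of longs: fits in L2 */
#define BIG_N   (8 * 1024 * 1024)   /* 64 MiB of longs: flushes the caches */

static double ms_between(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

static long sweep(const long *buf, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += buf[i];
    return sum;
}

int main(void) {
    long *small = calloc(SMALL_N, sizeof(long));
    long *big   = calloc(BIG_N, sizeof(long));
    if (!small || !big) return 1;

    struct timespec t0, t1;
    long sink = 0;

    sink += sweep(small, SMALL_N);              /* warm the small buffer */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink += sweep(small, SMALL_N);              /* "victim", cache warm */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("victim, cache warm:    %.3f ms\n", ms_between(t0, t1));

    sink += sweep(big, BIG_N);                  /* "pollute": evicts small */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sink += sweep(small, SMALL_N);              /* "victim" pays the bill */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("victim, after pollute: %.3f ms\n", ms_between(t0, t1));

    free(small); free(big);
    return (int)(sink & 1);                     /* keep the work observable */
}
```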
Just using the profiler can easily leave 2-3x performance on the table, and in some cases 4x. And in a world where autoscaling exists and computers run on batteries, that’s a substantial delta.
And the fact is that with few exceptions nobody after 2008 really knows me as the optimization guy, because I don’t glorify it. I’m the super clean code guy. If you want fast gibberish, one of those guys can come after me for another 2x if you or I don’t shoo them away. Now you’re creeping into order of magnitude territory. And all after the profiler stopped feeding you easy answers.
Do you have a source for “with very little observer effect”? I don’t know any better; it just seems like a big assumption that the CPU can emit all this extra stuff without behaving differently.
It's not an assumption; it's based on claims made by CPU manufacturers. It's possible to get it down to within 1-2% overhead.
Intuitively this works because the hardware can just spend some extra area to stream the trace data off to the side of the datapath; it doesn't need to be in the critical path.
In the early nineties I was test manager of the Borland Profiler. I didn’t supervise the tester of the profiler closely enough, and discovered only when customers complained that the profiler results were off by a quarter second on every single measurement reported.
It turns out that the tester had not been looking closely at the output, other than to verify that it consisted of numbers. He didn’t have any ideas about how to test it, so he opted for mere aesthetics.
This is one of many incidents that convinced me to look closely and carefully at the work of testers I depend upon. Testing is so easy to fake.
In my experience a very large proportion of all automated testing is like this if you go poking into what it does.
My experience is the same.