Leaving performance on the table

Published 2026-05-23 on Farid Zakaria's Blog

I have been working with LLVM at $DAYJOB$ , and I have gotten to become familiar with the benefits of optimizing your workloads.

I tend to think of optimizing my binaries as thinking about whether I have attached -O3 to my compiler flags; maybe if I’m particularly advanced that day I’ll sprinkle in some -flto (link time optimziation) and call it a day.

Turns out though that’s leaving lots of performance on the table.

Compilers work under the assumption that every branch is equally taken, unless you are hints like [[likely]] (ref). If we can feed the compilers more information about the likely path that our workloads often take, then they can produce much more performant code.

There are two primary ways to optimize a binary: instrumented or statistical. When we instrument our binary, we run our workload with an instrumented binary and capture the exact paths that are executed. We will then optimize the binary perfectly tuned to that workload.

If our workloads however are varied, we can collect profiles via perf over a length of time and create an optimized binary based on the statistical occurence of call graphs.

Both approaches have their benefits however let’s start with the instrumented variant first, as it’s a little easier to follow and understand.

Let’s look at a very simple benchmark. We will calculate fibonocci using SQL in sqlite3. This is an ideal workload because it’s purely CPU-bound and ripe for optimizing.

-- fibonacci.sql (100 Million Iterations)
WITH RECURSIVE fibonacci(n, a, b) AS (
    -- Seed values
    SELECT 0, 0, 1
    
    UNION ALL
    
    -- Loop 100,000,000 times. 
    -- The modulo keeps the integers safely within 64-bit bounds.
    SELECT n + 1, b, (a + b) % 1000000007 
    FROM fibonacci 
    WHERE n < 100000000
)
-- Only output the final computed value to avoid terminal I/O bottlenecks
SELECT a FROM fibonacci WHERE n = 100000000;

We will compile sqlite3 from source by downloading it.

> wget https://sqlite.org/2026/sqlite-amalgamation-3530100.zip
> unzip sqlite-amalgamation-3530100.zip

We can compile a “traditional” optimized binary that merely has -O3 and also a version that has LTO enabled since I was also keen to see how much LTO itself adds.

> clang -O3 shell.c sqlite3.c -o sqlite3_base

> clang -O3 -flto shell.c sqlite3.c -o sqlite3_lto

# let's run it just to see some rough times
> time ./sqlite3_base :memory: < ./fibbonoci.sql
╭───────────╮
│     a     │
╞═══════════╡
│ 908460138 │
╰───────────╯

________________________________________________________
Executed in   14.22 secs    fish           external
   usr time   14.16 secs    0.00 micros   14.16 secs
   sys time    0.00 secs  802.00 micros    0.00 secs

Ok, so it looks like our program takes roughly 14-15 seconds to run. Sounds ok? How much better can we do…. 🤔

Next, we compile our program again but we instrument the binary, which effectively injects counters into the program to count invocations of functions. We get very accurate counts of our calls but the binary itself now runs much slower, which can be a problem if your workload was already very slow. Luckily for us, we are in a time domain (~15 seconds), where that is ok.

After we have our instrumented binary, we run our workload again to generate the profile data and rebuild the binary with that data.

# 1. Build the instrumented version for Clang
> clang -O3 -flto -fprofile-generate=. shell.c sqlite3.c -o sqlite3_instr

# 2. Run the 100M Fibonacci loop to generate Clang profile data
> ./sqlite3_instr :memory: < ../fibonacci.sql

# 3. Merge the raw profile
> ./llvm-profdata merge -output=sqlite3.profdata *.profraw

# 4. Build the PGO-optimized binary, BUT preserve relocations for BOLT (-Wl,-q)
./ clang -O3 -flto -fprofile-use=sqlite3.profdata -Wl,-q shell.c sqlite3.c -o sqlite3_pgo

The last step will be to optimize with BOLT, which is a post-link optimizer, which requires us to keep relocations so I’ve also added -Wl,-q.

When we run our workload with the final optimized binary, we see massive improvement already! 🤯

> time ./sqlite3_pgo :memory: < ./fibbonoci.sql
╭───────────╮
│     a     │
╞═══════════╡
│ 908460138 │
╰───────────╯

________________________________________________________
Executed in   10.86 secs    fish           external
   usr time   10.83 secs    0.00 micros   10.83 secs
   sys time    0.00 secs  770.00 micros    0.00 secs

We’ve cut our workload time down to ~10 seconds which is a nearly a 1.5x improvement.

Now let’s optimize the final binary with LLVM’s BOLT. BOLT is a post-link optimizer designed for “large applications”. What this means, is that it largely works by shuffling code around the binary to keep code-paths that have high temporal locality near each other (spatial locality). This can have positive impact on performance due to the instruction cache for instance.

# 1. Instrument the PGO binary with BOLT
> llvm-bolt sqlite3_pgo -o sqlite3_bolt_instr --instrument --instrumentation-file=bolt.fdata

# 2. Run the workload AGAIN to trace the physical execution path
> ./sqlite3_bolt_instr :memory: < ../fibonacci.sql

# 3. Apply the final BOLT optimizations on top of the PGO binary
> llvm-bolt sqlite3_pgo -o sqlite3_ultimate \
    -data=bolt.fdata \
    -reorder-blocks=ext-tsp \
    -reorder-functions=hfsort+ \
    -split-functions \
    -dyno-stats

> time ./sqlite3_ultimate :memory: < ./fibbonoci.sql
╭───────────╮
│     a     │
╞═══════════╡
│ 908460138 │
╰───────────╯

________________________________________________________
Executed in   10.52 secs    fish           external
   usr time   10.46 secs  591.00 micros   10.46 secs
   sys time    0.01 secs  244.00 micros    0.01 secs

Looks like it was a little faster but not much. That makes sense since sqlite3 itself is a pretty small binary (~6MB), but nontheless was good to run through.

Running a more thorough benchmark with hyperfine we can get a final tally of our results.

Build Configuration	Mean Time ± σ	Min … Max	Relative Speed (vs Fastest)
PGO+BOLT	10.536 s ± 0.051 s	10.491 s … 10.631 s	1.00
PGO	10.805 s ± 0.055 s	10.733 s … 10.889 s	1.03 ± 0.01
LTO	14.252 s ± 0.026 s	14.225 s … 14.315 s	1.35 ± 0.01
Basic (no LTO)	14.292 s ± 0.071 s	14.224 s … 14.435 s	1.36 ± 0.01
Fedora package	14.496 s ± 0.074 s	14.402 s … 14.662 s	1.38 ± 0.01

Looks like the sqlite3 I got from the Fedora ecosystem was the slowest. When all the optimizations were applied I was able to get a maximum of 1.38x faster than what was available.

These optimizations would be even more dramatic for code-bases that are a sprawl and can heavily vary.

Don’t worry also about getting the profile perfectly tuned to your workloads. I have a coworker who often cites that even poor profiles are still much better than no profile at all.

Improve this page @ cf0dfdb
The content for this site is CC-BY-SA.