<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://fzakaria.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://fzakaria.com/" rel="alternate" type="text/html" /><updated>2026-04-13T13:30:46-07:00</updated><id>https://fzakaria.com/feed.xml</id><title type="html">Farid Zakaria’s Blog</title><subtitle>I&apos;m a software engineer, father and wishful amateur surfer. If you&apos;ve come seeking my political views, you&apos;ve found the wrong &lt;a href=&quot;https://fareedzakaria.com/&quot;&gt;Fareed&lt;/a&gt;.</subtitle><entry><title type="html">Does anyone actually use the large code-model?</title><link href="https://fzakaria.com/2026/03/27/does-anyone-actually-use-the-large-code-model" rel="alternate" type="text/html" title="Does anyone actually use the large code-model?" /><published>2026-03-27T09:37:00-07:00</published><updated>2026-03-27T09:37:00-07:00</updated><id>https://fzakaria.com/2026/03/27/does-anyone-actually-use-the-large-code-model</id><content type="html" xml:base="https://fzakaria.com/2026/03/27/does-anyone-actually-use-the-large-code-model"><![CDATA[<p>I have been focused lately on trying to resolve relocation overflows when compiling large binaries in the small &amp; medium code-models.
Often when talking to others about the problem, they are quick to offer the idea of using the large code-model.</p>

<dl>
  <dt><strong>small code-model</strong></dt>
  <dd>Assumes all code and data comfortably fit within a single 2GiB window. The compiler relies on fast, compact 32-bit PC-relative offsets for all function calls and data accesses.</dd>
  <dt><strong>medium code-model</strong></dt>
  <dd>Assumes code stays under 2GiB, but data might exceed it. It splits data into “small” and “large” sections, using 32-bit offsets for code and small data and 64-bit absolute addresses only for the large data.</dd>
  <dt><strong>large code-model</strong></dt>
  <dd>Makes zero assumptions about size or placement, lifting the 2GiB limit entirely. The compiler is forced to use 64-bit absolute addressing for every external reference.</dd>
</dl>

<p>Despite the performance downsides of the instructions the large code-model generates, it’s true that its intent was to support arbitrarily large binaries.
However, does anyone actually use it?</p>

<p>Turns out that large binaries affect not only the instructions generated in the <code class="language-plaintext highlighter-rouge">.text</code> section but also other sections within the ELF file, such as
<code class="language-plaintext highlighter-rouge">.eh_frame</code> (exception handling information), <code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code> (optimized binary search table for <code class="language-plaintext highlighter-rouge">.eh_frame</code>), and even <code class="language-plaintext highlighter-rouge">.gcc_except_table</code>.</p>

<p>Let’s take <code class="language-plaintext highlighter-rouge">.eh_frame</code> and <code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code> as an example. They specifically allow various encodings for the data within them (<code class="language-plaintext highlighter-rouge">sdata4</code> or <code class="language-plaintext highlighter-rouge">sdata8</code> for 4 bytes and 8 bytes respectively) irrespective of the code-model used. However, it turns out that userland support for these encodings is terrible!</p>

<p>If we look at the <code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code> format, we can see how these encodings are applied in practice. The entries marked <code class="language-plaintext highlighter-rouge">encoded</code> in the first column are the ones that resolve to specific DWARF exception header encoding formats (like <code class="language-plaintext highlighter-rouge">sdata4</code>, <code class="language-plaintext highlighter-rouge">sdata8</code>, <code class="language-plaintext highlighter-rouge">udata4</code>, etc.) depending on the values provided in the preceding <code class="language-plaintext highlighter-rouge">*_enc</code> fields.</p>

<p><code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code> format [<a href="https://refspecs.linuxfoundation.org/LSB_1.3.0/gLSB/gLSB/ehframehdr.html">ref</a>]:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Encoding</th>
      <th style="text-align: left">Field</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">unsigned byte</td>
      <td style="text-align: left">version</td>
    </tr>
    <tr>
      <td style="text-align: left">unsigned byte</td>
      <td style="text-align: left">eh_frame_ptr_enc</td>
    </tr>
    <tr>
      <td style="text-align: left">unsigned byte</td>
      <td style="text-align: left">fde_count_enc</td>
    </tr>
    <tr>
      <td style="text-align: left">unsigned byte</td>
      <td style="text-align: left">table_enc</td>
    </tr>
    <tr>
      <td style="text-align: left">encoded</td>
      <td style="text-align: left">eh_frame_ptr</td>
    </tr>
    <tr>
      <td style="text-align: left">encoded</td>
      <td style="text-align: left">fde_count</td>
    </tr>
    <tr>
      <td style="text-align: left"><em>(encoded based on table_enc)</em></td>
      <td style="text-align: left">binary search table</td>
    </tr>
  </tbody>
</table>

<p><em>Note: The <code class="language-plaintext highlighter-rouge">encoded</code> values for <code class="language-plaintext highlighter-rouge">eh_frame_ptr</code> and <code class="language-plaintext highlighter-rouge">fde_count</code> dictate their byte size and format. For example, if <code class="language-plaintext highlighter-rouge">fde_count_enc</code> is set to <code class="language-plaintext highlighter-rouge">DW_EH_PE_sdata4</code>, the <code class="language-plaintext highlighter-rouge">fde_count</code> field will be processed as an <code class="language-plaintext highlighter-rouge">sdata4</code> (signed 4-byte) value.</em></p>
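<p>To make the scheme concrete, here is a minimal decoder sketch (in Python, not from any real unwinder; the <code class="language-plaintext highlighter-rouge">DW_EH_PE_*</code> constants are the standard DWARF exception-header encodings, while the sample header bytes are invented for illustration):</p>

```python
import struct

# Standard DWARF exception-header encoding constants (low nibble = format).
DW_EH_PE_udata4 = 0x03
DW_EH_PE_sdata4 = 0x0B
DW_EH_PE_sdata8 = 0x0C

def read_encoded(buf, off, enc):
    """Decode one value according to its DW_EH_PE_* format nibble."""
    fmt = enc & 0x0F
    if fmt == DW_EH_PE_udata4:
        return struct.unpack_from('<I', buf, off)[0], off + 4
    if fmt == DW_EH_PE_sdata4:
        return struct.unpack_from('<i', buf, off)[0], off + 4
    if fmt == DW_EH_PE_sdata8:
        return struct.unpack_from('<q', buf, off)[0], off + 8
    raise NotImplementedError(hex(enc))

# Synthetic .eh_frame_hdr prefix: version, then the three *_enc bytes,
# then eh_frame_ptr and fde_count in whatever formats those bytes declared.
hdr = bytes([1, DW_EH_PE_sdata8, DW_EH_PE_udata4, DW_EH_PE_sdata4])
hdr += struct.pack('<q', -0x90000000)   # eh_frame_ptr as sdata8: beyond ±2 GiB
hdr += struct.pack('<I', 42)            # fde_count as udata4

off = 4
eh_frame_ptr, off = read_encoded(hdr, off, hdr[1])
fde_count, off = read_encoded(hdr, off, hdr[2])
print(eh_frame_ptr, fde_count)  # -2415919104 42
```

<p>Note how an <code class="language-plaintext highlighter-rouge">sdata8</code> field happily carries a value far outside the ±2 GiB range that <code class="language-plaintext highlighter-rouge">sdata4</code> can represent; the format supports it even when the tooling doesn’t.</p>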

<p>Up until very recently (<a href="https://github.com/llvm/llvm-project/pull/179089">pull#179089</a>), LLVM’s linker <code class="language-plaintext highlighter-rouge">lld</code> would crash if it tried to link exception data (<code class="language-plaintext highlighter-rouge">.eh_frame_hdr</code>) beyond 2GiB.
This section is generated so that stack unwinders can avoid a linear search over the frame entries.</p>

<p>Once we fix that though, it looks like <code class="language-plaintext highlighter-rouge">libgcc</code> (<a href="https://gcc.gnu.org/pipermail/gcc-patches/2026-March/711435.html">gcc-patch@</a>) and <code class="language-plaintext highlighter-rouge">libunwind</code> (<a href="https://github.com/libunwind/libunwind/pull/964">pull#964</a>) either crash outright on <code class="language-plaintext highlighter-rouge">sdata8</code> or skip the binary search table entirely, falling back to linear search.</p>

<p>How devastating is linear search here?</p>

<p>If you have a lot of exceptions, which you theoretically might in the large code-model, it’s brutal: I had benchmarks improve from <strong>~13s</strong> to <strong>~18ms</strong>, a <strong>~700x speedup</strong>.</p>
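<p>The binary search table is what makes that kind of speedup possible: a sorted array of (initial PC, FDE pointer) pairs that an unwinder can bisect. A rough sketch of the two lookup strategies (the table values are hypothetical, not real unwinder data):</p>

```python
import bisect

# Hypothetical sorted binary search table, mirroring what .eh_frame_hdr
# stores: (function start PC, FDE offset) pairs in ascending PC order.
table = [(0x1000 * i, 0x40 * i) for i in range(200_000)]
starts = [pc for pc, _ in table]

def find_fde(pc):
    """O(log n): rightmost entry whose start PC <= pc."""
    i = bisect.bisect_right(starts, pc) - 1
    return table[i][1] if i >= 0 else None

def find_fde_linear(pc):
    """O(n): what libgcc/libunwind degrade to without the table."""
    best = None
    for start, fde in table:
        if start > pc:
            break
        best = fde
    return best

pc = 0x1000 * 123_456 + 0x2A
assert find_fde(pc) == find_fde_linear(pc) == 0x40 * 123_456
```

<p>For 200,000 frame entries that’s ~18 comparisons per unwind step instead of ~100,000 on average, which is where the orders-of-magnitude gap comes from.</p>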

<p>Other fun failure modes that exist:</p>

<dl>
  <dt><strong>Thread Local Storage (.tdata and .tbss)</strong></dt>
  <dd>Highly optimized TLS access models often rely on 32-bit offsets from the thread pointer to fetch thread-local variables. Massive binaries can push these variables too far away, breaking the fast-path TLS instructions and forcing you into slower, more general TLS models.</dd>
  <dt><strong>The String Table (.strtab)</strong></dt>
  <dd>Even in a 64-bit ELF (<code class="language-plaintext highlighter-rouge">Elf64_Sym</code>), the <code class="language-plaintext highlighter-rouge">st_name</code> field, which holds the offset to the symbol’s name in the string table, is only a 32-bit integer. If you have enough heavily mangled C++ templates, your string table can theoretically hit the 4GiB limit, at which point the ELF format itself fundamentally caps out. 🫠</dd>
</dl>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  typedef struct {
	Elf64_Word	st_name;
	unsigned char	st_info;
	unsigned char	st_other;
	Elf64_Half	st_shndx;
	Elf64_Addr	st_value;
	Elf64_Xword	st_size;
  } Elf64_Sym;
</code></pre></div></div>

<p><em>Note: Don’t let <code class="language-plaintext highlighter-rouge">Elf64_Word</code> confuse you; it’s actually 32-bit: <code class="language-plaintext highlighter-rouge">typedef uint32_t	Elf64_Word;</code></em></p>
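<p>A quick way to convince yourself is to pack the struct by hand (a Python sketch, assuming the layout above and the standard ELF type sizes):</p>

```python
import struct

# Layout of Elf64_Sym from the struct above: Elf64_Word (uint32) st_name,
# two unsigned chars, Elf64_Half (uint16), then Elf64_Addr and Elf64_Xword
# (both uint64), little-endian.
ELF64_SYM = struct.Struct('<IBBHQQ')
assert ELF64_SYM.size == 24  # fixed 24-byte symbol table entry

# st_name saturates at 2**32 - 1: a name offset past 4 GiB into .strtab
# simply cannot be represented in the format.
st_name_max = 2**32 - 1
raw = ELF64_SYM.pack(st_name_max, 0, 0, 0, 0, 0)
st_name, *_ = ELF64_SYM.unpack(raw)
print(hex(st_name))  # 0xffffffff
```
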

<p>It seems like the large code-model “exists”, but no one is using it for its intended purpose, which was to build large binaries.
I am working to make massive binaries possible without the large code-model while retaining much of the performance characteristics of the small code-model.</p>

<p>You can read more about it in <a href="https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc/m/J211uZASAgAJ">x86-64-abi</a> google-group where I have also posted an RFC.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I have been focused lately on trying to resolve relocation overflows when compiling large binaries in the small &amp; medium code-models. Often when talking to others about the problem, they are quick to offer the idea of using the large code-model.]]></summary></entry><entry><title type="html">Nix is a lie, and that’s ok</title><link href="https://fzakaria.com/2026/03/07/nix-is-a-lie-and-that-s-ok" rel="alternate" type="text/html" title="Nix is a lie, and that’s ok" /><published>2026-03-07T09:21:00-08:00</published><updated>2026-03-07T09:21:00-08:00</updated><id>https://fzakaria.com/2026/03/07/nix-is-a-lie-and-that-s-ok</id><content type="html" xml:base="https://fzakaria.com/2026/03/07/nix-is-a-lie-and-that-s-ok"><![CDATA[<p>When <a href="https://edolstra.github.io/">Eelco Dolstra</a>, father of Nix, descended from the mountain tops and enlightened us all, one of the main <em>commandments</em> for Nix was to eschew all uses of the <a href="https://www.pathname.com/fhs/">Filesystem Hierarchy Standard (FHS)</a>.</p>

<blockquote>
  <p>The FHS is the “find libraries and files by convention” dogma Nix abandons in the pursuit of purity.</p>
</blockquote>

<p><a href="/assets/images/nix_commandments_large.png"><img src="/assets/images/nix_commandments_50p.png" alt="nix commandments" /></a></p>

<p>What if I told you that was a <em>lie</em> ? 😑</p>

<p>Nix was explicitly designed to eliminate standard FHS paths (like <code class="language-plaintext highlighter-rouge">/usr/lib</code> or <code class="language-plaintext highlighter-rouge">/lib64</code>) to guarantee reproducibility. However, graphics drivers represent a hard boundary between user-space and kernel-space.</p>

<p>The user-space library (<code class="language-plaintext highlighter-rouge">libGL.so</code>) must match the host OS’s kernel module and the physical GPU.</p>

<p>Nearly all derivations avoid bundling <code class="language-plaintext highlighter-rouge">libGL.so</code> because they have no way of predicting the hardware or host kernel the binary will run on.</p>

<p>What about NixOS? Surely, we know what kernel and drivers we have there!? 🤔</p>

<p>Well, if we modified every derivation to include the correct <code class="language-plaintext highlighter-rouge">libGL.so</code> it would cause massive rebuilds for every user and make the NixOS cache effectively useless.</p>

<p>To solve this, NixOS &amp; Home Manager introduce an intentional impurity, a global path at <code class="language-plaintext highlighter-rouge">/run/opengl-driver/lib</code> where derivations expect to find <code class="language-plaintext highlighter-rouge">libGL.so</code>.</p>

<p>We’ve just re-introduced a convention path à la FHS. 🫠</p>

<p>Unfortunately, that leaves users who run Nix on other Linux distributions in a bad state, documented in <a href="https://github.com/NixOS/nixpkgs/issues/9415">issue#9415</a>, which has been open since 2015. If you try to install and run any Nix application that requires graphics, you’ll be hit with the exact error message Nix was designed to thwart:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>error while loading shared libraries: libGL.so.1: 
cannot open shared object file: No such file or directory
</code></pre></div></div>

<p>There are a couple of workarounds for those of us who use Nix on alternate distributions:</p>
<ul>
  <li><a href="https://github.com/nix-community/nixGL">nixGL</a>, a runtime script that injects the library via <code class="language-plaintext highlighter-rouge">$LD_LIBRARY_PATH</code></li>
  <li>manually hacking <code class="language-plaintext highlighter-rouge">$LD_LIBRARY_PATH</code></li>
  <li>creating your own <code class="language-plaintext highlighter-rouge">/run/opengl-driver</code> and symlinking it with the drivers from <code class="language-plaintext highlighter-rouge">/usr/lib/x86_64-linux-gnu</code></li>
</ul>

<p>For those of us who cling to the beautiful purity of Nix, however, it feels like a sad but ultimately necessary trade-off.</p>

<p><em>Thou shall not use FHS, unless you really need to.</em></p>]]></content><author><name></name></author><summary type="html"><![CDATA[When Eelco Dolstra, father of Nix, descended from the mountain tops and enlightened us all, one of the main commandments for Nix was to eschew all uses of the Filesystem Hierarchy Standard (FHS).]]></summary></entry><entry><title type="html">Linker Pessimization</title><link href="https://fzakaria.com/2026/02/18/linker-pessimization" rel="alternate" type="text/html" title="Linker Pessimization" /><published>2026-02-18T07:54:00-08:00</published><updated>2026-02-18T07:54:00-08:00</updated><id>https://fzakaria.com/2026/02/18/linker-pessimization</id><content type="html" xml:base="https://fzakaria.com/2026/02/18/linker-pessimization"><![CDATA[<p>In a <a href="/2026/01/30/crazy-shit-linkers-do-relaxation">previous post</a>, I wrote about <em>linker relaxation</em>: the linker’s ability to replace a slower, larger instruction with a faster, smaller one when it has enough information at link time. For instance, an indirect <code class="language-plaintext highlighter-rouge">call</code> through the GOT can be relaxed into a direct <code class="language-plaintext highlighter-rouge">call</code> plus a <code class="language-plaintext highlighter-rouge">nop</code>. This is a well-known technique to optimize the instructions for performance.</p>

<p>Does it ever make sense to go the <em>other direction</em>? 🤔</p>

<p>We’ve been working on linking some massive binaries that include Intel’s <a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html">Math Kernel Library (MKL)</a>, a prebuilt static archive. MKL ships as object files compiled with the <em>small</em> code-model (<code class="language-plaintext highlighter-rouge">mcmodel=small</code>), meaning its instructions assume everything is reachable within ±2 GiB. The included object files also have some odd relocations where the addend is a very large negative number (magnitude &gt;1GiB).</p>

<p>The calculation for the relocation value is <strong>S + A - P</strong>: the symbol address plus the addend minus the instruction address. With a sufficiently large negative addend, the relocation value can easily exceed the 2 GiB limit and the linker fails with relocation overflows.</p>
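<p>The failure is easy to reproduce with plain arithmetic. A small sketch (the addresses are hypothetical, in the same ballpark as our case):</p>

```python
# Hypothetical addresses plugged into the R_X86_64_PC32 formula S + A - P.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

S = 200 * 2**20        # symbol ~200 MiB into the image
A = -0x44000000        # large negative addend baked into the object file
P = 1200 * 2**20       # instruction ~1200 MiB into .text

value = S + A - P
fits = INT32_MIN <= value <= INT32_MAX
print(value, fits)     # well below INT32_MIN: relocation overflow
```
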

<p>We can’t recompile MKL (it’s a prebuilt proprietary archive), and we can’t simply switch everything to the large code model. What can we do? 🤔</p>

<p>I am calling this technique <strong>linker pessimization</strong>: the reverse of relaxation. Instead of shrinking an instruction, we <em>expand</em> one to tolerate a larger address space. 😈</p>

<h3 id="the-problematic-lea">The Problematic LEA</h3>

<p>The specific instructions that overflow in our case are <code class="language-plaintext highlighter-rouge">LEA</code> (Load Effective Address) instructions.</p>

<p>In x86_64, <code class="language-plaintext highlighter-rouge">lea r9, [rip + disp32]</code> performs pure arithmetic: it computes <code class="language-plaintext highlighter-rouge">RIP + disp32</code> and stores the result in <code class="language-plaintext highlighter-rouge">r9</code> without accessing memory. The <code class="language-plaintext highlighter-rouge">disp32</code> is a <strong>32-bit signed integer</strong> embedded directly into the instruction encoding, and the linker fills it in via an <code class="language-plaintext highlighter-rouge">R_X86_64_PC32</code> relocation.</p>

<p>The relocation formula is <strong>S + A - P</strong>. Let’s look at an example with a large addend.</p>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th>Meaning</th>
      <th>Value (approximate)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>S</strong> (Symbol)</td>
      <td>Address of the symbol</td>
      <td>~200 MB into <code class="language-plaintext highlighter-rouge">.rodata</code></td>
    </tr>
    <tr>
      <td><strong>A</strong> (Addend)</td>
      <td>Constant baked into the object file</td>
      <td><code class="language-plaintext highlighter-rouge">-0x44000000</code> (−1,062 MB)</td>
    </tr>
    <tr>
      <td><strong>P</strong> (Position)</td>
      <td>Address of the instruction being patched</td>
      <td>~1,200 MB into <code class="language-plaintext highlighter-rouge">.text</code></td>
    </tr>
  </tbody>
</table>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S + A - P  =  200 + (−1062) − 1200
           =  −2062 MB
</code></pre></div></div>

<p>A 32-bit signed integer can only represent ±2,048 MB (±2 GiB). Our value of <strong>−2,062 MB</strong> exceeds that range and the linker rightfully complains 💥:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ld.lld: error: libfoo.a(...):(function ...: .text+0x...):
  relocation R_X86_64_PC32 out of range:
  -2160984064 is not in [-2147483648, 2147483647]
</code></pre></div></div>

<blockquote class="alert alert-note">
  <p><strong>Note</strong>
These <code class="language-plaintext highlighter-rouge">LEA</code> instructions appear in MKL because the library uses them as a way to compute an address of a data table relative to the instruction pointer. The large negative addend (<code class="language-plaintext highlighter-rouge">-0x44000000</code>) is <em>intentional</em>; it’s an offset within a large lookup table.</p>
</blockquote>

<h3 id="the-idea-replace-lea-with-mov">The Idea: Replace LEA with MOV</h3>

<p>The core idea is delightful because as engineers we are trained to optimize systems, but in this case we want the opposite. We swap the <code class="language-plaintext highlighter-rouge">LEA</code> for a <code class="language-plaintext highlighter-rouge">MOV</code> that reads through a nearby pointer.</p>

<p>Recall from the <a href="/2026/01/30/crazy-shit-linkers-do-relaxation">relaxation post</a>: relaxation <em>shrinks</em> instructions (e.g. indirect <code class="language-plaintext highlighter-rouge">call</code> -&gt; direct <code class="language-plaintext highlighter-rouge">call</code>). Here we do the opposite: we make the instruction <em>do more work</em> (pure arithmetic -&gt; memory load) in exchange for a reachable displacement. That’s why I consider it a <em>pessimization</em> or <em>reverse-relaxation</em>.</p>

<p>Both instructions use the same encoding length (7 bytes with a REX prefix), so the patch is a <strong>single byte change</strong> in the opcode. 🤓</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LEA:  4C 8D 0D xx xx xx xx    lea r9, [rip + disp32]   (opcode 0x8D)
MOV:  4C 8B 0D xx xx xx xx    mov r9, [rip + disp32]   (opcode 0x8B)
         ^^
 only this byte changes!
</code></pre></div></div>

<p>The difference in behavior is critical:</p>
<ul>
  <li><strong>LEA</strong>: <code class="language-plaintext highlighter-rouge">r9 = RIP + disp32</code> (arithmetic, no memory access). <code class="language-plaintext highlighter-rouge">disp32</code> must encode the entire distance to the far-away data. This overflows.</li>
  <li><strong>MOV</strong>: <code class="language-plaintext highlighter-rouge">r9 = *(RIP + disp32)</code> (memory load). <code class="language-plaintext highlighter-rouge">disp32</code> points to a <em>nearby</em> 8-byte pointer slot. The pointer slot holds the full 64-bit address. This never overflows.</li>
</ul>

<h3 id="visualizing-the-change">Visualizing the Change</h3>

<p><strong>Original</strong> — the <code class="language-plaintext highlighter-rouge">LEA</code> must reach across the entire binary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                    disp32 must encode this entire distance
                 ╭──────────────────────────────────────────╮
                 │           ~2+ GiB  (OVERFLOW!)           │
                 │                                          │
  .text          ▼                                          │
  ┌──────────────────────────┐                              │
  │ lea r9, [rip + disp32]   │─────────── X ────────────────┤
  │        (0x8D)            │  can't fit in 32 bits!       │
  └──────────────────────────┘                              │
                                                            │
  .rodata (far away)                                        │
  ┌──────────────────────────┐                              │
  │ symbol + offset          │◄─────────────────────────────╯
  └──────────────────────────┘
</code></pre></div></div>

<p><strong>Pessimized</strong> — the <code class="language-plaintext highlighter-rouge">MOV</code> reads a nearby pointer that holds the full address:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  .text                          .data.fixup (nearby)
  ┌────────────────────────┐    ┌──────────────────────────┐
  │ mov r9, [rip + disp32] │──▶ │ .quad &lt;64-bit address&gt;   │
  │        (0x8B)          │    │  (R_X86_64_64 reloc)     │
  └────────────────────────┘    └──────────┬───────────────┘
         small offset ✓                    │
         always fits in 32 bits            │  full 64-bit pointer
                                           │  NEVER overflows
  .rodata (far away)                       │
  ┌──────────────────────────┐             │
  │ symbol + offset          │◄────────────╯
  └──────────────────────────┘
</code></pre></div></div>

<p>We’ve traded one direct <code class="language-plaintext highlighter-rouge">LEA</code> computation for an indirect <code class="language-plaintext highlighter-rouge">MOV</code> through a pointer, and we make sure the displacement is now tiny. The 64-bit pointer slot can reach <em>any</em> address in the virtual address space. 👌</p>

<h3 id="implementation-details">Implementation Details</h3>

<p>For each problematic relocation, three changes are needed in the object file:</p>

<p><strong>1. Opcode Patch</strong>: In <code class="language-plaintext highlighter-rouge">.text</code>, change byte <code class="language-plaintext highlighter-rouge">0x8D</code> to <code class="language-plaintext highlighter-rouge">0x8B</code> (1 byte).</p>

<p>This converts the <code class="language-plaintext highlighter-rouge">LEA</code> (compute address) into a <code class="language-plaintext highlighter-rouge">MOV</code> (load from address). The rest of the instruction encoding (ModR/M byte, REX prefix) stays identical because both instructions use the same operand format.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Before:  4C 8D 0D xx xx xx xx    lea  r9, [rip + disp32]
 After:   4C 8B 0D xx xx xx xx    mov  r9, QWORD PTR [rip + disp32]
             ^^
</code></pre></div></div>
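<p>As a sketch of the patch itself (the <code class="language-plaintext highlighter-rouge">disp32</code> bytes are invented; the opcodes are the real x86-64 encodings for <code class="language-plaintext highlighter-rouge">lea</code>/<code class="language-plaintext highlighter-rouge">mov</code> with a <code class="language-plaintext highlighter-rouge">[rip + disp32]</code> operand):</p>

```python
# One-byte LEA -> MOV patch on a raw 7-byte instruction:
# 4C 8D 0D xx xx xx xx  =  lea r9, [rip + disp32]
insn = bytearray([0x4C, 0x8D, 0x0D, 0x00, 0x00, 0x00, 0x44])

assert insn[1] == 0x8D          # LEA opcode
insn[1] = 0x8B                  # becomes MOV r64, r/m64

# REX prefix, ModR/M byte, and the disp32 bytes are all untouched.
assert insn[0] == 0x4C and insn[2] == 0x0D
print(insn.hex())  # 4c8b0d00000044
```
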

<p><strong>2. New Pointer Slot</strong> — Create a new section (<code class="language-plaintext highlighter-rouge">.data.fixup</code>) containing 8 zero bytes per patch site, plus a new <code class="language-plaintext highlighter-rouge">R_X86_64_64</code> relocation pointing to the original symbol with the original addend.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> .data.fixup:
   .quad 0x0000000000000000      # linker fills via R_X86_64_64
         ▲
         └── relocation: R_X86_64_64  sym=symbol  addend=-0x44000000
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">R_X86_64_64</code> is a <strong>64-bit absolute</strong> relocation. Its formula is simply <code class="language-plaintext highlighter-rouge">S + A</code>, no subtraction of <code class="language-plaintext highlighter-rouge">P</code>. There is no 32-bit range limitation; it can address the entire 64-bit address space. This is the key insight that makes the fix work.</p>

<p><strong>3. Retarget the Original Relocation</strong> — In the <code class="language-plaintext highlighter-rouge">.rela.text</code> entry for the patched instruction, change the symbol to point at the new pointer slot in <code class="language-plaintext highlighter-rouge">.data.fixup</code> and update the type to <code class="language-plaintext highlighter-rouge">R_X86_64_PC32</code>. The addend becomes a small offset (the distance from the instruction to the fixup slot), which is guaranteed to fit.</p>
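<p>In terms of raw relocation records, steps 2 and 3 boil down to writing two <code class="language-plaintext highlighter-rouge">Elf64_Rela</code> entries. A sketch (the symbol indices and offsets are invented; the type constants and the 24-byte <code class="language-plaintext highlighter-rouge">Elf64_Rela</code> layout come from the ELF specification):</p>

```python
import struct

# Elf64_Rela: r_offset (u64), r_info (u64), r_addend (s64), little-endian.
# r_info packs (symbol index << 32) | relocation type.
R_X86_64_64, R_X86_64_PC32 = 1, 2
ELF64_RELA = struct.Struct('<QQq')

def rela(offset, sym_index, rtype, addend):
    return ELF64_RELA.pack(offset, (sym_index << 32) | rtype, addend)

# Step 2: the new slot in .data.fixup keeps the original symbol and addend,
# but as a 64-bit absolute relocation (S + A, no range limit).
slot = rela(0x0, sym_index=7, rtype=R_X86_64_64, addend=-0x44000000)

# Step 3: the retargeted .rela.text entry points at the fixup slot's symbol
# with a small PC-relative addend (-4 accounts for the disp32 position).
patched = rela(0x1234, sym_index=9, rtype=R_X86_64_PC32, addend=-4)

assert len(slot) == len(patched) == 24
```
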

<blockquote class="alert alert-note">
  <p><strong>Note</strong>
Because both <code class="language-plaintext highlighter-rouge">LEA</code> and <code class="language-plaintext highlighter-rouge">MOV</code> with a <code class="language-plaintext highlighter-rouge">[rip + disp32]</code> operand are exactly the same length (7 bytes with a REX prefix), we don’t shift any code, don’t invalidate any other relocations, and don’t need to rewrite any other parts of the object file. It’s truly a surgical patch.</p>
</blockquote>

<p>The pessimized <code class="language-plaintext highlighter-rouge">MOV</code> now performs a <strong>memory load</strong> where the original <code class="language-plaintext highlighter-rouge">LEA</code> did pure register arithmetic. That’s an extra cache line fetch and a data dependency. If this instruction is in a tight loop, it could be a performance hit.</p>

<p>Optimization is the root of all evil, what does that make pessimization? 🧌</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In a previous post, I wrote about linker relaxation: the linker’s ability to replace a slower, larger instruction with a faster, smaller one when it has enough information at link time. For instance, an indirect call through the GOT can be relaxed into a direct call plus a nop. This is a well-known technique to optimize the instructions for performance.]]></summary></entry><entry><title type="html">Creating massively huge fake files and binaries</title><link href="https://fzakaria.com/2026/02/11/creating-massively-huge-fake-files-and-binaries" rel="alternate" type="text/html" title="Creating massively huge fake files and binaries" /><published>2026-02-11T16:34:00-08:00</published><updated>2026-02-11T16:34:00-08:00</updated><id>https://fzakaria.com/2026/02/11/creating-massively-huge-fake-files-and-binaries</id><content type="html" xml:base="https://fzakaria.com/2026/02/11/creating-massively-huge-fake-files-and-binaries"><![CDATA[<p>I was writing a test case for <code class="language-plaintext highlighter-rouge">lld</code> to support “thunks” [<a href="https://github.com/llvm/llvm-project/pull/180266">llvm#180266</a>] which uses a linker script to place two sections very far apart (8GiB) in the virtual address space.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SECTIONS {
    .text_low 0x10000: { *(.text_low) }
    .text_high 0x200000000: { *(.text_high) }
}
</code></pre></div></div>

<p>After linking a trivially small assembly file, I ran <code class="language-plaintext highlighter-rouge">ls -l</code> on the resulting binary and was confused:</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span><span class="nb">ls</span> <span class="nt">-lh</span> output
<span class="go">-rwxr-xr-x 1 fzakaria fzakaria 8.0G Feb 11 16:00 output
</span></code></pre></div></div>

<p><strong>8 GiB</strong>. For what amounts to a handful of instructions. 😲</p>

<p>What’s going on? And where did all that space come from?</p>

<h3 id="apparent-size-vs-on-disk-size">Apparent size vs. on-disk size</h3>

<p>Turns out <code class="language-plaintext highlighter-rouge">ls -l</code> reports the <em>logical</em> (apparent) size of the file, which is simply an integer stored in the inode metadata. It represents the offset of the last byte written. Since <code class="language-plaintext highlighter-rouge">.text_high</code> lives at <code class="language-plaintext highlighter-rouge">0x200000000</code> (~8 GiB), the file’s logical size extends out that far even though the actual code is tiny.</p>

<p>The <em>real</em> story is told by <code class="language-plaintext highlighter-rouge">du</code>:</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span><span class="nb">du</span> <span class="nt">-h</span> output
<span class="go">12K     output
</span></code></pre></div></div>

<p>12 KiB on disk. The file is <strong>sparse</strong>. 🤓</p>

<h3 id="what-is-a-sparse-file">What is a sparse file?</h3>

<p>A sparse file is one where the filesystem doesn’t bother allocating blocks for regions that are all zeros. The filesystem (ext4, btrfs, etc.) stores a mapping of logical file offsets to physical disk blocks in the inode’s <em>extent tree</em>. For a sparse file, there are simply no extents for the hole regions.</p>

<p>For our 8 GiB binary, the extent tree looks something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Inode extent tree:
  [offset 0,       12 blocks]  → disk blocks 48392-48403   (.text_low code)
  [offset 0x1FFFF1, 4 blocks]  → disk blocks 48404-48407   (.text_high code)

  (nothing for the ~8 GiB in between — no extents exist)
</code></pre></div></div>

<p>We can use <code class="language-plaintext highlighter-rouge">filefrag</code> to see the same information, albeit a little more condensed.</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>filefrag <span class="nt">-v</span> output
<span class="go">Filesystem type is: 9123683e
File size of output is 8589873896 (2097138 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       1:  461921719.. 461921720:      2:             encoded
   1:  2097137.. 2097137:  461921740.. 461921740:      1:  464018856: last,eof
output: 2 extents found
</span></code></pre></div></div>

<p>When something reads the file:</p>
<ol>
  <li>The virtual filesystem (VFS) receives <code class="language-plaintext highlighter-rouge">read(fd, buf, size)</code> at some offset</li>
  <li>The filesystem looks up the extent tree for that offset</li>
  <li>If <strong>extent found</strong> then read from the physical disk block</li>
  <li>If <strong>no extent (hole)</strong> then the kernel fills the buffer with zeros, no disk I/O</li>
</ol>

<h3 id="creating-sparse-files-yourself">Creating sparse files yourself</h3>

<p>You don’t need a linker to create sparse files. <code class="language-plaintext highlighter-rouge">truncate</code> will do it:</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span><span class="nb">truncate</span> <span class="nt">-s</span> 1P bigfile
<span class="gp">$</span><span class="w"> </span><span class="nb">ls</span> <span class="nt">-lh</span> bigfile
<span class="go">-rw-r--r-- 1 fzakaria fzakaria 1.0P Feb 11 16:00 bigfile

</span><span class="gp">$</span><span class="w"> </span><span class="nb">du</span> <span class="nt">-h</span> bigfile
<span class="go">0       bigfile
</span></code></pre></div></div>

<p>A 1 PiB file that takes zero bytes on disk. <code class="language-plaintext highlighter-rouge">dd</code> with <code class="language-plaintext highlighter-rouge">seek</code> works too:</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span><span class="nb">dd </span><span class="k">if</span><span class="o">=</span>/dev/null <span class="nv">of</span><span class="o">=</span>bigfile <span class="nv">bs</span><span class="o">=</span>1 <span class="nv">seek</span><span class="o">=</span>1P
</code></pre></div></div>

<p>Both produce the same result: a file whose logical size is 1 PiB but whose on-disk footprint is effectively nothing.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I was writing a test case for lld to support “thunks” [llvm#180266] which uses a linker script to place two sections very far apart (8GiB) in the virtual address space.]]></summary></entry><entry><title type="html">Crazy shit linkers do: Common Data (COMDAT) sections</title><link href="https://fzakaria.com/2026/02/03/crazy-shit-linkers-do-common-data-comdat-sections" rel="alternate" type="text/html" title="Crazy shit linkers do: Common Data (COMDAT) sections" /><published>2026-02-03T09:49:00-08:00</published><updated>2026-02-03T09:49:00-08:00</updated><id>https://fzakaria.com/2026/02/03/crazy-shit-linkers-do-common-data-comdat-sections</id><content type="html" xml:base="https://fzakaria.com/2026/02/03/crazy-shit-linkers-do-common-data-comdat-sections"><![CDATA[<p>Managing code at scale is hard and comes with a lot of weird quirks in your toolchain. I wrote <a href="/2026/01/30/crazy-shit-linkers-do-relaxation">previously</a> about some of the <em>crazy shit</em> linkers can do and that is really the tip of the iceberg.</p>

<p>Let’s take a peek at <code class="language-plaintext highlighter-rouge">COMDAT</code> (Common Data) sections and some of the weird hiccups you can run into.</p>

<p>What even is <code class="language-plaintext highlighter-rouge">COMDAT</code>?</p>

<p>Well, to understand what a <code class="language-plaintext highlighter-rouge">COMDAT</code> section is, let’s create a simple example to demonstrate.</p>

<p>Consider this example where we will create a <code class="language-plaintext highlighter-rouge">Cache&lt;T&gt;</code> helper class and leverage it across two different translation units: <code class="language-plaintext highlighter-rouge">library.o</code> and <code class="language-plaintext highlighter-rouge">main.o</code></p>

<blockquote class="alert alert-note">
  <p><strong>Note</strong>
This example was inspired by <a href="https://github.com/grigorypas">@grigorypas</a> from the discussion on the <a href="https://discourse.llvm.org/t/rfc-lld-preferring-small-code-model-comdat-sections-over-large-ones-when-mixing-code-models/89550">LLVM discourse</a>.</p>
</blockquote>

<p>We can compile each individually such as <code class="language-plaintext highlighter-rouge">gcc -std=c++17 -g -O0 -c library.cpp -o library.o</code>. The <code class="language-plaintext highlighter-rouge">-O0</code> is important here otherwise this simple code will be inlined, and <code class="language-plaintext highlighter-rouge">-std=c++17</code> allows us to use inline static variables.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// cache.h</span>
<span class="cp">#pragma once
</span>
<span class="k">template</span><span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">struct</span> <span class="nc">Cache</span> <span class="p">{</span>
    <span class="kr">inline</span> <span class="k">static</span> <span class="n">T</span> <span class="n">data</span><span class="p">;</span>
    <span class="k">static</span> <span class="kt">void</span> <span class="n">set</span><span class="p">(</span><span class="n">T</span> <span class="n">val</span><span class="p">)</span> <span class="p">{</span> <span class="n">data</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span> <span class="p">}</span>
<span class="p">};</span>

<span class="c1">// library.cpp</span>
<span class="cp">#include</span> <span class="cpf">"cache.h"</span><span class="cp">
</span>
<span class="kt">void</span> <span class="nf">foo</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">Cache</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;::</span><span class="n">set</span><span class="p">(</span><span class="mi">42</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// main.cpp</span>
<span class="cp">#include</span> <span class="cpf">"cache.h"</span><span class="cp">
</span>
<span class="kt">void</span> <span class="nf">bar</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">Cache</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;::</span><span class="n">set</span><span class="p">(</span><span class="mi">31</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">extern</span> <span class="kt">void</span> <span class="nf">foo</span><span class="p">();</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">foo</span><span class="p">();</span>
    <span class="n">bar</span><span class="p">();</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Because <code class="language-plaintext highlighter-rouge">Cache&lt;T&gt;</code> is a template, the compiler must generate the machine code for <code class="language-plaintext highlighter-rouge">Cache&lt;int&gt;::set</code> in every object file (<code class="language-plaintext highlighter-rouge">.o</code>) that uses it. If you compile <code class="language-plaintext highlighter-rouge">main.cpp</code> and <code class="language-plaintext highlighter-rouge">library.cpp</code> and they both use <code class="language-plaintext highlighter-rouge">Cache&lt;int&gt;</code>, both object files will contain this code.</p>

<p>We can double-check this with <code class="language-plaintext highlighter-rouge">objdump</code>, and sure enough, both <code class="language-plaintext highlighter-rouge">main.o</code> and <code class="language-plaintext highlighter-rouge">library.o</code> contain a duplicate section (i.e. the same instructions) for <code class="language-plaintext highlighter-rouge">_ZN5CacheIiE3setEi</code>, the mangled name of <code class="language-plaintext highlighter-rouge">Cache&lt;int&gt;::set</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; objdump -d -j .text._ZN5CacheIiE3setEi main.o

Disassembly of section .text._ZN5CacheIiE3setEi:

0000000000000000 &lt;_ZN5CacheIiE3setEi&gt;:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	8b 45 fc             	mov    -0x4(%rbp),%eax
   a:	89 05 00 00 00 00    	mov    %eax,0x0(%rip)
  10:	90                   	nop
  11:	5d                   	pop    %rbp
  12:	c3                   	ret


&gt; objdump -d -j .text._ZN5CacheIiE3setEi library.o

Disassembly of section .text._ZN5CacheIiE3setEi:

0000000000000000 &lt;_ZN5CacheIiE3setEi&gt;:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	8b 45 fc             	mov    -0x4(%rbp),%eax
   a:	89 05 00 00 00 00    	mov    %eax,0x0(%rip)
  10:	90                   	nop
  11:	5d                   	pop    %rbp
  12:	c3                   	ret
</code></pre></div></div>

<p>Wow! Given the prevalence of templates in C++, this already seems incredibly wasteful, since every <code class="language-plaintext highlighter-rouge">.o</code> has to include the instructions for the same templates. 😲</p>

<p>At link time, the linker has to resolve the function to <strong>use only one</strong> of these implementations.</p>

<p>What do we do with all the other duplicate implementations?</p>

<p>That’s where <code class="language-plaintext highlighter-rouge">COMDAT</code> comes in! 🤓</p>

<p>To prevent your final binary from being 10x larger than necessary, the compiler marks these duplicate sections as <code class="language-plaintext highlighter-rouge">COMDAT</code> (Common Data). The linker’s job is simple: pick one, discard the rest.</p>

<p>We can inspect these groupings using <code class="language-plaintext highlighter-rouge">readelf -g</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; readelf -g main.o -W

COMDAT group section [    1] `.group' [_ZN5CacheIiE3setEi] contains 2 sections:
   [Index]    Name
   [    6]   .text._ZN5CacheIiE3setEi
   [    7]   .rela.text._ZN5CacheIiE3setEi
</code></pre></div></div>

<p>Here is the pickle. How does the linker pick which section to use?</p>

<p>Traditionally (not specified by any ABI), the linker selects the first <code class="language-plaintext highlighter-rouge">.o</code> provided to it on the command-line.</p>

<p>Is this problematic?</p>

<p>Well, what if the two object files were built with different code-models (i.e. <code class="language-plaintext highlighter-rouge">mcmodel</code>)? Let’s build <code class="language-plaintext highlighter-rouge">main.cpp</code> with the large code-model: <code class="language-plaintext highlighter-rouge">mcmodel=large</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; gcc -g -O0 -mcmodel=large -c main.cpp -o main.o

&gt; objdump -d -j .text._ZN5CacheIiE3setEi main.o

Disassembly of section .text._ZN5CacheIiE3setEi:

0000000000000000 &lt;_ZN5CacheIiE3setEi&gt;:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	48 ba 00 00 00 00 00 	movabs $0x0,%rdx
   e:	00 00 00 
  11:	8b 45 fc             	mov    -0x4(%rbp),%eax
  14:	89 02                	mov    %eax,(%rdx)
  16:	90                   	nop
  17:	5d                   	pop    %rbp
  18:	c3                   	ret

&gt; objdump -d -j .text._ZN5CacheIiE3setEi library.o

Disassembly of section .text._ZN5CacheIiE3setEi:

0000000000000000 &lt;_ZN5CacheIiE3setEi&gt;:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d fc             	mov    %edi,-0x4(%rbp)
   7:	8b 45 fc             	mov    -0x4(%rbp),%eax
   a:	89 05 00 00 00 00    	mov    %eax,0x0(%rip)
  10:	90                   	nop
  11:	5d                   	pop    %rbp
  12:	c3                   	ret
</code></pre></div></div>

<p>Although the section names are the same, the instructions generated are now different. The large code-model uses <code class="language-plaintext highlighter-rouge">movabs</code> which has worse performance characteristics.</p>

<p>Let’s verify what the linker (here <code class="language-plaintext highlighter-rouge">lld</code>) does by linking them in both orders.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Link library.o first
&gt; gcc library.o main.o -o a.out
&gt; objdump -d a.out
0000000000401117 &lt;_ZN5CacheIiE3setEi&gt;:
  401117:	55                   	push   %rbp
  401118:	48 89 e5             	mov    %rsp,%rbp
  40111b:	89 7d fc             	mov    %edi,-0x4(%rbp)
  40111e:	8b 45 fc             	mov    -0x4(%rbp),%eax
  401121:	89 05 ed 2e 00 00    	mov    %eax,0x2eed(%rip)
  401127:	90                   	nop
  401128:	5d                   	pop    %rbp
  401129:	c3                   	ret

# Link main.o first
&gt; gcc main.o library.o -o a.out
&gt; objdump -d a.out
0000000000401141 &lt;_ZN5CacheIiE3setEi&gt;:
  401141:	55                   	push   %rbp
  401142:	48 89 e5             	mov    %rsp,%rbp
  401145:	89 7d fc             	mov    %edi,-0x4(%rbp)
  401148:	48 ba 14 40 40 00 00 	movabs $0x404014,%rdx
  40114f:	00 00 00 
  401152:	8b 45 fc             	mov    -0x4(%rbp),%eax
  401155:	89 02                	mov    %eax,(%rdx)
  401157:	90                   	nop
  401158:	5d                   	pop    %rbp
  401159:	c3                   	ret
</code></pre></div></div>

<p>We see that the section selected does depend on the <code class="language-plaintext highlighter-rouge">.o</code> order provided. 😬</p>

<p>Why does all this matter?</p>

<p>We are moving some code to the medium code-model to overcome relocation overflows; however, we have some prebuilt code built in the small code-model. We noticed that although our goal was to leverage the medium code-model, the linker might choose the small code-model variant of a section if it happened to be found first.</p>

<p>If the linker blindly picks the “small model” version (which uses 32-bit relative offsets) but places the data more than 2GiB away, we might still end up with the relocation overflow errors we sought to resolve.</p>

<p>But wait, it gets worse.</p>

<p>The rule that lets us instantiate multiple incarnations of a particular symbol but select only one is the <strong>One Definition Rule</strong> (ODR). The ODR requires that the definition of a symbol be identical across all translation units. But the linker generally doesn’t check this (unless you use LTO, and even then, it’s fuzzy). It just checks the symbol name.</p>

<p>Imagine if <code class="language-plaintext highlighter-rouge">library.cpp</code> was compiled with <code class="language-plaintext highlighter-rouge">-DLOGGING_ENABLED</code> which injected <code class="language-plaintext highlighter-rouge">printf</code> calls into <code class="language-plaintext highlighter-rouge">Cache::set</code>, while <code class="language-plaintext highlighter-rouge">main.cpp</code> was compiled in release mode without it.</p>

<p>If the linker picks the <code class="language-plaintext highlighter-rouge">main.o</code> (release) version of the <code class="language-plaintext highlighter-rouge">COMDAT</code> group, your “Debug” library implementation loses its logging features effectively muting your debug logic. Conversely, if it picks the <code class="language-plaintext highlighter-rouge">library.o</code> version, your high-performance release binary suddenly has debug logging in critical hot paths.</p>

<p>You aren’t just gambling with instruction selection that may affect performance, as in the case of code-models; you are gambling with program logic. Given that the section name is purely based on the name of the symbol, it’s easy to see how you can get yourself into oddities if you accidentally link implementations that wildly differ.</p>

<p>I can now see why many languages force symbols to only ever be defined in a single translation unit, as it avoids this whole conundrum. 🙃</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Managing code at scale is hard and comes with a lot of weird quirks in your toolchain. I wrote previously about some of the crazy shit linkers can do and that is really the tip of the iceberg.]]></summary></entry><entry><title type="html">Crazy shit linkers do: Relaxation</title><link href="https://fzakaria.com/2026/01/30/crazy-shit-linkers-do-relaxation" rel="alternate" type="text/html" title="Crazy shit linkers do: Relaxation" /><published>2026-01-30T20:55:00-08:00</published><updated>2026-01-30T20:55:00-08:00</updated><id>https://fzakaria.com/2026/01/30/crazy-shit-linkers-do-relaxation</id><content type="html" xml:base="https://fzakaria.com/2026/01/30/crazy-shit-linkers-do-relaxation"><![CDATA[<p>I have been looking into linkers recently and I’ve been amazed at all the crazy options and optimizations that a linker may perform. Compilers are a well-understood domain, taught in schools with a plethora of books, but few resources exist for linkers aside from what you may find on some excellent technical blogs such as Ian Lance Taylor’s series on <a href="https://www.airs.com/blog/archives/38">writing the gold linker</a> and Fangrui Song’s, also known as MaskRay, <a href="https://maskray.me/">very in-depth blog</a>.</p>

<p>I wanted to write down in my own style, concepts I’m learning from <em>first principles</em>.</p>

<p>Recently, I came across the term “relaxation” as I was fiddling around with LLVM’s <code class="language-plaintext highlighter-rouge">lld</code>.</p>

<p>What is it? 🤔</p>

<blockquote class="alert alert-note">
  <p><strong>Note</strong>
Relaxation looks to be <em>relatively new</em>, and the original RFC to the <a href="https://groups.google.com/g/x86-64-abi/c/n9AWHogmVY0">x86-64-abi google group</a> was proposed in 2015.</p>
</blockquote>

<p>Well, let’s look at a super simple example to understand what it is and why we want it.</p>

<p>If you want to follow along take a look at this <a href="https://godbolt.org/z/oePn7c86n">godbolt</a> example.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Declare it, but don't define it.</span>
<span class="c1">// The compiler assumes it might be in a shared library.</span>
<span class="k">extern</span> <span class="kt">void</span> <span class="nf">external_function</span><span class="p">();</span>

<span class="kt">void</span> <span class="nf">example</span><span class="p">()</span> <span class="p">{</span>
<span class="n">external_function</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If we compile this with <code class="language-plaintext highlighter-rouge">-O0 -fno-plt -fpic -mcmodel=medium -Wa,-mrelax-relocations=no</code> we see the following disassembly in the object file using <code class="language-plaintext highlighter-rouge">objdump</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">example</span><span class="p">()</span><span class="o">:</span>
 <span class="n">push</span>   <span class="n">rbp</span>
 <span class="n">mov</span>    <span class="n">rbp</span><span class="p">,</span><span class="n">rsp</span>
 <span class="n">call</span>   <span class="n">QWORD</span> <span class="n">PTR</span> <span class="p">[</span><span class="n">rip</span><span class="o">+</span><span class="mh">0x0</span><span class="p">]</span>        <span class="err">#</span> <span class="n">a</span> <span class="o">&lt;</span><span class="n">example</span><span class="p">()</span><span class="o">+</span><span class="mh">0xa</span><span class="o">&gt;</span>
    <span class="n">R_X86_64_GOTPCREL</span> <span class="n">external_function</span><span class="p">()</span><span class="o">-</span><span class="mh">0x4</span>
 <span class="n">pop</span>    <span class="n">rbp</span>
 <span class="n">ret</span>
</code></pre></div></div>

<p>Specifically, the compiler has left a “note” for the linker in the form of a <em>relocation</em>: <code class="language-plaintext highlighter-rouge">R_X86_64_GOTPCREL</code>.</p>

<p>You can see that the address in the emitted code is <code class="language-plaintext highlighter-rouge">0x0</code> after compilation. The linker needs to replace that value with the offset to the function’s GOT entry, relative to the <code class="language-plaintext highlighter-rouge">rip</code> register (instruction pointer).</p>

<p>This works great and is necessary for shared libraries, but what if we are building a final static binary? 🤓</p>

<p>Turns out that in some cases this instruction can be further simplified by the linker, since when producing the final executable binary it has <em>all</em> the information.</p>

<p>We will have to look at the actual instruction encoding to understand this further.</p>

<p>If we look at the hex encoding of that assembly, we see the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ff 15 00 00 00 00 call *0x0(%rip)
</code></pre></div></div>

<p>This indirect <code class="language-plaintext highlighter-rouge">call</code> (opcode <code class="language-plaintext highlighter-rouge">ff 15</code>) via the GOT address is <strong>6 bytes long</strong>: 2 bytes for the opcode &amp; 4 bytes for the offset to the GOT entry.</p>

<blockquote class="alert alert-note">
  <p><strong>Note</strong>
Understanding x86-64 is its own whole can of worms. The ISA is incredibly dense and complex, but if you want you can reference <a href="https://www.felixcloutier.com/x86/call">it here</a>.</p>
</blockquote>

<p>x86-64, though, has another <code class="language-plaintext highlighter-rouge">call</code> type (opcode <code class="language-plaintext highlighter-rouge">e8</code>) that operates in a direct mode: it calls an address relative to the instruction pointer.</p>

<p>This direct-mode <code class="language-plaintext highlighter-rouge">call</code> type is only <strong>5 bytes</strong> long with 1 byte for the opcode and 4 bytes for the offset to the function.</p>

<p>If we knew the location of the function ahead of time, it would be nice if we could skip checking the GOT completely and just go to where we want to be.</p>

<p>Why would we want to do this?</p>

<p>Well, it’s more efficient to jump straight to the address we want to end up at. The CPU doesn’t have to load the address stored in the GOT before jumping to it.</p>

<p>When building a static binary the linker should know all the final relative addresses of all the functions, so going through the GOT is no longer necessary.</p>

<p>Since the number of bytes is nearly equal, the linker can effectively patch the binary without disrupting other relative calculations, provided it can fill the small gap.</p>

<p>We only need to find a <em>single byte</em> to pad our more-efficient <code class="language-plaintext highlighter-rouge">call</code>! 🕵️</p>

<p>Turns out, the <code class="language-plaintext highlighter-rouge">nop</code> operation is only <em>a single byte</em>. 👌</p>

<p>We then get the equality:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>call *foo@GOTPCREL(%rip) =&gt; [nop call foo] or [call foo nop]
</code></pre></div></div>

<p>This is what the <code class="language-plaintext highlighter-rouge">R_X86_64_GOTPCRELX</code> relocation indicates. It tells the linker it is safe to “relax” and modify the instructions to the more performant variation.</p>

<p>When we enable relaxation, we now generate the same code as above but with this new relocation type instructing the linker to perform the optimization if possible.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">call</span>   <span class="n">QWORD</span> <span class="n">PTR</span> <span class="p">[</span><span class="n">rip</span><span class="o">+</span><span class="mh">0x0</span><span class="p">]</span>        <span class="err">#</span> <span class="n">a</span> <span class="o">&lt;</span><span class="n">example</span><span class="p">()</span><span class="o">+</span><span class="mh">0xa</span><span class="o">&gt;</span>
    <span class="n">R_X86_64_GOTPCRELX</span> <span class="n">external_function</span><span class="p">()</span><span class="o">-</span><span class="mh">0x4</span>
</code></pre></div></div>
<blockquote class="alert alert-note">
  <p><strong>Note</strong>
Why not just always optimize <code class="language-plaintext highlighter-rouge">R_X86_64_GOTPCREL</code> when possible and forgo introducing a new relocation? My own guess is that it’s important to be backwards compatible and you wouldn’t want the emitted code to vary depending on the linker version but I would be interested to hear something more concrete if you know!</p>
</blockquote>

<p>Interestingly, many linkers optimize this even further!</p>

<p>Rather than generating a <code class="language-plaintext highlighter-rouge">nop</code> instruction, the linker instead prefixes the <code class="language-plaintext highlighter-rouge">call</code> with <code class="language-plaintext highlighter-rouge">0x67</code> (<code class="language-plaintext highlighter-rouge">addr32</code>).</p>

<p>On x86-64, <code class="language-plaintext highlighter-rouge">0x67</code> (<code class="language-plaintext highlighter-rouge">addr32</code>) overrides the address size to 32 bits. For a relative <code class="language-plaintext highlighter-rouge">call</code> instruction, however, the override has no effect, making it a benign prefix that consumes exactly 1 byte.</p>

<p>If we go back to our example and enable relaxation, and produce a final binary, we can disassemble it to see whether it was relaxed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> objdump <span class="nt">-SD</span> main

0000000000401133 &lt;example&gt;:
  401133:	55                   	push   %rbp
  401134:	48 89 e5             	mov    %rsp,%rbp
  401137:	48 8d 05 9a 2e 00 00 	lea    0x2e9a<span class="o">(</span>%rip<span class="o">)</span>,%rax        <span class="c"># 403fd8 &lt;_GLOBAL_OFFSET_TABLE_&gt;</span>
  40113e:	b8 00 00 00 00       	mov    <span class="nv">$0x0</span>,%eax
  401143:	67 e8 bd ff ff ff    	addr32 call 401106 &lt;external_function&gt;
  401149:	90                   	nop
  40114a:	5d                   	pop    %rbp
  40114b:	31 c0                	xor    %eax,%eax
  40114d:	c3                   	ret
</code></pre></div></div>

<p>Here we can see that in fact our <code class="language-plaintext highlighter-rouge">call</code> was relaxed since we can see <code class="language-plaintext highlighter-rouge">addr32 call 401106</code> 🥳.</p>

<p>As it happens, you can do this same “relaxation” optimization for a few other instructions such as <code class="language-plaintext highlighter-rouge">test</code>, <code class="language-plaintext highlighter-rouge">jmp</code> and <code class="language-plaintext highlighter-rouge">mov</code> but the basic premise is the same.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I have been looking into linkers recently and I’ve been amazed at all the crazy options and optimizations that a linker may perform. Compilers are a well understood domain, taught in schools with a plethora of books but few resources exist for linkers aside from what you may find on some excellent technical blogs such as Lance Taylor’s series on writing the gold linker and Fangrui Song’s, also known as MaskRay, very in-depth blog.]]></summary></entry><entry><title type="html">Bespoke software is the future</title><link href="https://fzakaria.com/2026/01/01/bespoke-software-is-the-future" rel="alternate" type="text/html" title="Bespoke software is the future" /><published>2026-01-01T12:00:00-08:00</published><updated>2026-01-01T12:00:00-08:00</updated><id>https://fzakaria.com/2026/01/01/bespoke-software-is-the-future</id><content type="html" xml:base="https://fzakaria.com/2026/01/01/bespoke-software-is-the-future"><![CDATA[<p>At Google, some of the engineers would joke, <em>self-deprecatingly</em>,  that the software internally was not particularly exceptional but rather Google’s dominance was an example of the power of network effects: when software is custom tailored to work well with each other.</p>

<p>Outside of Google, or similar FAANG companies, this is often cited as indulgent “NIH” (Not Invented Here) syndrome, where the prevailing practice is instead to pick generalized software solutions, preferably open-source, off-the-shelf.</p>

<p>The problem with these generalized solutions is that, well, they are generalized and rarely fit well together. 🙄  Engineers are trained to be DRY (Don’t Repeat Yourself), and love abstractions. As a tool tries to solve more problems, the abstraction becomes leakier and ill-fitting. It becomes a general-purpose tax.</p>

<p>If you only need 10% of a software solution, you pay for the remaining 90% via the abstractions they impose. 🫠</p>

<p>Internally to a company, however, we are taught that unused code is a liability. We often celebrate negative pull-requests as valuable clean-up work with the understanding that smaller code-bases are simpler to understand, operate and optimize.</p>

<p>Yet for most of our infrastructure tooling, we continue to bloat solutions and tout support despite minuscule user bases.</p>

<p>This is probably one of the areas I am most excited about: the ability to leverage LLMs for software creation.</p>

<p>I recently spent time investigating linkers in <a href="/2025/12/28/huge-binaries">previous</a> <a href="/2025/12/29/huge-binaries-i-thunk-therefore-i-am">posts</a> such as LLVM’s <a href="http://lld.llvm.org/">lld</a>.</p>

<p>I found LLVM to be a pretty polished codebase with lots of documentation. Despite the high quality, navigating the codebase is challenging, as it’s a mass of interfaces and abstractions needed to support multiple object file formats, 13+ ISAs, a slew of features (e.g. linker scripts), and multiple operating systems.</p>

<p>Instead, I leveraged LLMs to help me design and write <a href="https://github.com/fzakaria/uld">µld</a>, a tiny opinionated linker in Rust that targets only ELF, x86_64, static linking, and a barebones feature set.</p>

<p>It shouldn’t be a surprise to anyone that the end result is a codebase that I can audit, learn from, and easily grow to support additional improvements and optimizations.</p>

<p>The surprising bit, especially to me, was how easy it was to author within a very short period of time (1-2 days).</p>

<p>That means smaller companies, without the coffers of FAANG companies, can also pursue bespoke, custom-tailored software for their needs.</p>

<p>This future is well-suited for tooling such as <a href="https://nixos.org">Nix</a>. Nix is the perfect vehicle for building custom tooling, as it gives you a playground designed to build the world, similar to a monorepo.</p>

<p>We need to begin to cut away legacy in our tooling and build software that solves specific problems. The end-result will be smaller, easier to manage and better integrated. Where this might have seemed unattainable for most, LLMs will democratize this possibility.</p>

<p>I’m excited for the bespoke future.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[At Google, some of the engineers would joke, self-deprecatingly, that the software internally was not particularly exceptional but rather Google’s dominance was an example of the power of network effects: when software is custom tailored to work well with each other.]]></summary></entry><entry><title type="html">Huge binaries: papercuts and limits</title><link href="https://fzakaria.com/2025/12/30/huge-binaries-papercuts-and-limits" rel="alternate" type="text/html" title="Huge binaries: papercuts and limits" /><published>2025-12-30T08:34:00-08:00</published><updated>2025-12-30T08:34:00-08:00</updated><id>https://fzakaria.com/2025/12/30/huge-binaries-papercuts-and-limits</id><content type="html" xml:base="https://fzakaria.com/2025/12/30/huge-binaries-papercuts-and-limits"><![CDATA[<p>In a <a href="/2025/12/28/huge-binaries">previous post</a>, I synthetically built a program that demonstrated a relocation overflow for a <code class="language-plaintext highlighter-rouge">CALL</code> instruction.</p>

<p>However, the demo required adding <code class="language-plaintext highlighter-rouge">-fno-asynchronous-unwind-tables</code> to disable some additional data that might cause <strong>other overflows</strong> beyond the one being demonstrated.</p>

<p>What’s going on? 🤔</p>

<p>This is a good example of how only a select few face the size-pressure of massive binaries.</p>

<p>Even with <code class="language-plaintext highlighter-rouge">-mcmodel=medium</code>, which already tells the compiler &amp; linker “Hey, I expect my binary to be pretty big”, there are surprising gaps where the linker overflows.</p>

<p>On Linux, an ELF binary includes many other sections beyond text and data necessary for code execution. Notably there are sections included for debugging (DWARF) and language-specific sections such as <code class="language-plaintext highlighter-rouge">.eh_frame</code> which is used by C++ to help unwind the stack on exceptions.</p>

<p>Turns out that even with <code class="language-plaintext highlighter-rouge">mcmodel=large</code> you might still run into overflow errors! 🤦🏻‍♂️</p>

<blockquote class="alert alert-note">
  <p><strong>Note</strong>
Funny enough, there is a very recent opened issue for this with <a href="https://github.com/llvm/llvm-project/issues/172777">LLVM #172777</a>; perfect timing!</p>
</blockquote>

<p>For instance, <code class="language-plaintext highlighter-rouge">lld</code> assumes 32-bit <code class="language-plaintext highlighter-rouge">eh_frame_hdr</code> values regardless of the code model. There are similar 32-bit assumptions in the data structures of <code class="language-plaintext highlighter-rouge">eh_frame</code> as well.</p>

<p>I also mentioned earlier a pattern of using multiple GOTs (Global Offset Tables) to avoid the ±2GiB relative offset limitation of a signed 32-bit displacement.</p>

<p>Is there even a need for the large code-model?</p>

<p>How far can that take us before we are forced to use the large code-model?</p>

<p>Let’s think about it:</p>

<p>First, let’s think about the limit due to overflow when accessing the multiple GOTs. Let’s say we decide to space out our duplicate GOTs every 1.5GiB.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|&lt;---- 1.5GiB code -----&gt;|&lt;----- GOT -----&gt;|&lt;----- 1.5GiB code -----&gt;|&lt;----- GOT -----&gt;|
</code></pre></div></div>

<p>That means each GOT can grow at most ~500MiB before there could exist a GOT-relative access from the code section that would result in an overflow.</p>

<p>Each GOT entry is 8 bytes, a 64-bit pointer. That means we have roughly 65 million possible entries.</p>

<p>A typical GOT access sequence looks like the following and requires 9 bytes: 7 bytes for the <code class="language-plaintext highlighter-rouge">movq</code> and 2 bytes for the <code class="language-plaintext highlighter-rouge">movl</code>.</p>

<pre><code class="language-assembly">movq    var@GOTPCREL(%rip), %rax  # R_X86_64_REX_GOTPCRELX
movl    (%rax), %eax
</code></pre>

<p>That means we have 1.5GiB / 9 = ~178 million possible <em>unique</em> relocations.</p>

<p>So theoretically, we can require more <strong>unique</strong> symbols in our code section than we can fit in the nearest GOT, and therefore cause a relocation overflow. 💥</p>

<p>The same problem exists for thunks, since a thunk is larger, in bytes, than the relative call it replaces.</p>

<p>At some point, there is no avoiding the large code-model; however, with multiple GOTs, thunks, and other linker optimizations (e.g. LTO, relaxation), we have a lot of headroom before it’s necessary. 🕺🏻</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In a previous post, I synthetically built a program that demonstrated a relocation overflow for a CALL instruction.]]></summary></entry><entry><title type="html">Huge binaries: I thunk therefore I am</title><link href="https://fzakaria.com/2025/12/29/huge-binaries-i-thunk-therefore-i-am" rel="alternate" type="text/html" title="Huge binaries: I thunk therefore I am" /><published>2025-12-29T08:15:00-08:00</published><updated>2025-12-29T08:15:00-08:00</updated><id>https://fzakaria.com/2025/12/29/huge-binaries-i-thunk-therefore-i-am</id><content type="html" xml:base="https://fzakaria.com/2025/12/29/huge-binaries-i-thunk-therefore-i-am"><![CDATA[<p>In my <a href="/2025/12/28/huge-binaries">previous post</a>, we looked at the “sound barrier” of x86_64 linking: the 32-bit relative <code class="language-plaintext highlighter-rouge">CALL</code> instruction and how it can result in relocation overflows. Changing the code-model to <code class="language-plaintext highlighter-rouge">-mcmodel=large</code> fixes the issue, but at the cost of “instruction bloat” and likely a performance penalty, although I failed to demonstrate the latter via a benchmark 🥲.</p>

<p>Surely there are other interesting solutions? 🤓</p>

<p>First off, probably the simplest solution is to not statically build your code and instead rely on dynamic libraries 🙃. This is what most “normal” software shops (and the rest of the world) do; as a result, this hasn’t been much of an issue elsewhere.</p>

<p>This of course has its own downsides and performance implications, which I’ve written about and produced solutions for (e.g., <a href="/2022/03/14/shrinkwrap-taming-dynamic-shared-objects">Shrinkwrap</a> &amp; <a href="/2024/05/03/speeding-up-elf-relocations-for-store-based-systems">MATR</a>) during my doctoral research. Beyond the performance penalty induced by having thousands of shared libraries, you lose the simplicity of single-file deployments.</p>

<p>A more advanced set of optimizations falls under the umbrella of “LTO”: Link-Time Optimization. The linker at the final stage has all the information necessary to perform a variety of optimizations such as code inlining and tree-shaking. That would seem like a good fit, except these huge binaries would need an enormous amount of RAM to perform LTO, slowing builds to a crawl.</p>

<blockquote class="alert alert-tip">
  <p><strong>Tip</strong>
This is still an active area of research and Google has authored <a href="https://research.google/pubs/thinlto-scalable-and-incremental-lto/">ThinLTO</a>. Facebook has its own set of profile guided LTO optimizations as well via <a href="https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/">Bolt</a>.</p>
</blockquote>

<p>What if I told you that you could keep most callsites in the fast, 5-byte small code-model, even if your binary is 25GiB? 🧐</p>

<p>Turns out there is prior art for “Linker Thunks” [<a href="https://github.com/llvm/llvm-project/blob/main/lld/ELF/Thunks.cpp">ref</a>] within LLVM for various architectures – notably missing for <code class="language-plaintext highlighter-rouge">x86_64</code>, with the comment:</p>

<blockquote>
  <p>“i386 and x86-64 don’t need thunks” [<a href="https://github.com/llvm/llvm-project/blob/144dc7464fcfde796401acf7784e084d0e66d15c/lld/ELF/Thunks.cpp#L19C4-L19C38">ref</a>]</p>
</blockquote>

<p>What is a “thunk”?</p>

<p>You might know it by a different name; in fact, we use them all the time for <em>dynamic-linking</em>: the trampoline through the procedure linkage table (PLT).</p>

<p>A thunk (or trampoline) is a linker-inserted shim that lives within the immediate reach of the caller. The caller branches to the thunk using a standard relative jump, and the thunk then performs an absolute indirect jump to the final destination.</p>

<!-- 

\documentclass[tikz, border=10pt]{standalone}
\usetikzlibrary{positioning, arrows.meta, calc, shapes.multipart, bending}

\begin{document}
\begin{tikzpicture}[
    font=\sffamily,
    % Styles for the labels on the left column
    addr/.style={font=\ttfamily\small, text=gray, anchor=east},
    symb/.style={font=\ttfamily\bfseries\small, anchor=east, xshift=-0.0cm, yshift=0.4cm},
    % Styles for the instruction boxes
    % Use 'style 2 args' to avoid parameter errors
    memory block/.style 2 args={
        draw=#1,
        fill=#1!5,
        line width=1pt,
        rectangle split,
        rectangle split parts=#2,
        text width=4.5cm,
        align=left,
        inner sep=6pt,
        font=\ttfamily\small,
        anchor=north west
    },
    jump path/.style={
        -{Stealth[bend]},
        line width=1.2pt,
        rounded corners=8pt
    }
]

    % --- Low Memory (Main) ---
    \node[memory block={blue}{2}] (main) {
        ...
        \nodepart{second} bl \_\_far\_thunk
    };
    
    % Labels for main (Left side)
    \node[symb, blue] at (main.one west) {main:};
    \node[addr] at (main.one west) {0x400000};
    \node[addr] at (main.two west) {0x400008};

    % --- Thunk (Directly below main) ---
    \node[memory block={orange}{4}, below=0mm of main] (thunk) {
        ldr x16, [pc, \#8]
        \nodepart{second} br x16
        \nodepart{third} .word 0x20000000
        \nodepart{fourth} .word 0x00000001
    };
    
    % Labels for thunk (Left side)
    \node[symb, orange!80!black] at (thunk.one west) {\_\_far\_thunk:};
    \node[addr] at (thunk.one west) {0x400018};
    \node[addr] at (thunk.two west) {0x40001c};
    \node[addr] at (thunk.three west) {0x400020};
    \node[addr] at (thunk.four west) {0x400024};

    % --- The Gap (Centered under the 4.5cm width box) ---
    \coordinate (center_column) at ($(thunk.south west)!0.5!(thunk.south east)$);
    \node[below=1mm of center_column, text=gray, font=\itshape\small] (gap) {
        [ ... $\approx$ 5 GiB Address Space Gap ... ]
    };
    
    % --- High Memory (Target) ---
    % Positioned below the gap
    \node[memory block={green!60!black}{3}, below=7mm of center_column] (far) {
        push x29 \par mov x29, sp
        \nodepart{second} ...
        \nodepart{third} ret
    };
    
    % Labels for far function (Left side)
    \node[symb, green!40!black] at (far.one west) {far\_function:};
    \node[addr] at (far.one west) {0x120000000};

    % --- Control Flow Paths ---
    
    % Jump 1: main to thunk (Right side)
    \draw[jump path, blue] (main.second east) -- ++(0.6,0) |- (thunk.one east)
        node[pos=0.25, right, font=\sffamily\scriptsize, align=left] {1. Relative Jump};

    % Jump 2: thunk to far (Left side)
    % This edge takes the "long way" around the labels on the left
    \draw[jump path, green!60!black] (thunk.second east) -- ++(0.5,0) |- (far.one east)
        node[pos=0.25, right, font=\sffamily\scriptsize, align=left] {2. Absolute Jump\\(via x16)};

\end{tikzpicture}
\end{document}

-->
<p><a href="/assets/images/thunk.png"><img src="/assets/images/thunk_50p.png" alt="thunk image" /></a></p>

<p>LLVM includes support for inserting thunks for certain architectures such as AArch64. Because AArch64 is a fixed-size (32-bit) instruction set, its relative branch instruction is restricted to ±128MiB. As this limit is so low, <code class="language-plaintext highlighter-rouge">lld</code> has support for thunks out of the box.</p>

<p>If we cross-compile our “far function” example for AArch64 using the same linker script to synthetically place it far away to trigger the need for a thunk, the linker magic becomes visible immediately.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> aarch64-linux-gnu-gcc <span class="nt">-c</span> main.c <span class="nt">-o</span> main.o <span class="se">\</span>
<span class="nt">-fno-exceptions</span> <span class="nt">-fno-unwind-tables</span> <span class="se">\</span>
<span class="nt">-fno-asynchronous-unwind-tables</span>

<span class="o">&gt;</span> aarch64-linux-gnu-gcc <span class="nt">-c</span> far.c <span class="nt">-o</span> far.o <span class="se">\</span>
<span class="nt">-fno-exceptions</span> <span class="nt">-fno-unwind-tables</span> <span class="se">\</span>
<span class="nt">-fno-asynchronous-unwind-tables</span>

<span class="o">&gt;</span> ld.lld main.o far.o <span class="nt">-T</span> overflow.lds <span class="nt">-o</span> thunk-aarch64
</code></pre></div></div>

<p>We can now see the generated code with <code class="language-plaintext highlighter-rouge">objdump</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> aarch64-unknown-linux-gnu-objdump <span class="nt">-dr</span> thunk-example 

Disassembly of section .text:

0000000000400000 &lt;main&gt;:
  400000:	a9bf7bfd 	stp	x29, x30, <span class="o">[</span>sp, <span class="c">#-16]!</span>
  400004:	910003fd 	mov	x29, sp
  400008:	94000004 	bl	400018 &lt;__AArch64AbsLongThunk_far_function&gt;
  40000c:	52800000 	mov	w0, <span class="c">#0x0                   	// #0</span>
  400010:	a8c17bfd 	ldp	x29, x30, <span class="o">[</span>sp], <span class="c">#16</span>
  400014:	d65f03c0 	ret

0000000000400018 &lt;__AArch64AbsLongThunk_far_function&gt;:
  400018:	58000050 	ldr	x16, 400020 &lt;__AArch64AbsLongThunk_far_function+0x8&gt;
  40001c:	d61f0200 	br	x16
  400020:	20000000 	.word	0x20000000
  400024:	00000001 	.word	0x00000001

Disassembly of section .text.far:

0000000120000000 &lt;far_function&gt;:
   120000000:	d503201f 	nop
   120000004:	d65f03c0 	ret
</code></pre></div></div>

<p>Instead of branching to <code class="language-plaintext highlighter-rouge">far_function</code> at <code class="language-plaintext highlighter-rouge">0x120000000</code>, it branches to a generated thunk at <code class="language-plaintext highlighter-rouge">0x400018</code> (only 16 bytes away). The thunk, much like the large code-model, loads <code class="language-plaintext highlighter-rouge">x16</code> with the absolute address stored in the two <code class="language-plaintext highlighter-rouge">.word</code> literals, and then performs an absolute jump (<code class="language-plaintext highlighter-rouge">br</code>).</p>

<p>What if <code class="language-plaintext highlighter-rouge">x86_64</code> supported this? Can we now go beyond 2GiB? 🤯</p>
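<p>As a hypothetical sketch (lld generates no such thunk for x86_64 today, and the label name here is invented for illustration), an equivalent thunk could pair a 6-byte RIP-relative indirect jump with an 8-byte address literal:</p>

```
__x86_64_abs_thunk_far_function:    # reachable via a normal rel32 CALL
    jmp  *0f(%rip)                  # ff 25 <disp32>: RIP-relative indirect jump
0:  .quad far_function              # full 64-bit absolute address (8 bytes)
```

<p>Only callsites that actually overflow would pay the extra indirection; everything else keeps its 5-byte relative <code class="language-plaintext highlighter-rouge">CALL</code>.</p>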

<p>There are more relocations beyond <code class="language-plaintext highlighter-rouge">CALL</code> instructions that similar thunks would need to fix. Although we are mostly using static binaries, some libraries such as <code class="language-plaintext highlighter-rouge">glibc</code> may be dynamically loaded. Access to the functions in these shared libraries goes through the PLT (which is itself a thunk 🤯), which in turn jumps through an address stored in the GOT, the Global Offset Table.</p>

<p>The GOT addresses are also loaded via a relative offset, so they will need to be changed to either use thunks or perhaps multiple GOT sections; the latter also has prior art for other architectures such as MIPS [<a href="https://github.com/llvm/llvm-project/blob/5c19f77a7e0c4b35c0efb511a7d9e2e436335e61/lld/ELF/SyntheticSections.h#L315">ref</a>].</p>

<p>With this in mind, the large code-model feels unnecessary. Why pay the cost at every callsite when we can do so piecemeal, as necessary, with the opportunity to use profiles to guide which callsites migrate to thunks?</p>

<p>Furthermore, if our binaries are already tens of gigabytes, size is clearly not a constraint for us. We can duplicate GOT entries, at the cost of even larger binaries, to reduce the need for even more thunks for the PLT <code class="language-plaintext highlighter-rouge">jmp</code>.</p>

<p>What do you think? Let’s collaborate.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In my previous post, we looked at the “sound barrier” of x86_64 linking: the 32-bit relative CALL instruction and how it can result in relocation overflows. Changing the code-model to -mcmodel=large fixes the issue but at the cost of “instruction bloat” and likely a performance penalty although I had failed to demonstrate it via a benchmark 🥲.]]></summary></entry><entry><title type="html">Huge binaries</title><link href="https://fzakaria.com/2025/12/28/huge-binaries" rel="alternate" type="text/html" title="Huge binaries" /><published>2025-12-28T14:13:00-08:00</published><updated>2025-12-28T14:13:00-08:00</updated><id>https://fzakaria.com/2025/12/28/huge-binaries</id><content type="html" xml:base="https://fzakaria.com/2025/12/28/huge-binaries"><![CDATA[<p>A problem I experienced when pursuing my PhD and submitting academic articles was that I had built solutions to problems that required dramatic scale to be effective and worthwhile. Responses to my publication submissions often claimed such problems did not exist; however, I had observed them during my time within industry, such as at Google, but I couldn’t cite it!</p>

<p>One problem that is only present at these mega-codebases is <em>massive binaries</em>. What’s the largest binary (ELF file) you’ve ever seen? I had observed binaries beyond 25GiB, including debug symbols. How is this possible? These companies prefer to statically build their services to speed up startup and simplify deployment. Statically including all code in some of the world’s largest codebases is a recipe for massive binaries.</p>

<p>Similar to the sound barrier, there is a point at which code size becomes problematic and we must re-think how we link and build code. For x86_64, that is the 2GiB “Relocation Barrier.”</p>

<p>Why 2GiB? 🤔</p>

<p>Well, let’s take a look at how position-independent code is put together.</p>

<p>Let’s look at a simple example.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="kt">void</span> <span class="nf">far_function</span><span class="p">();</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">far_function</span><span class="p">();</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If we compile this <code class="language-plaintext highlighter-rouge">gcc -c simple-relocation.c -o simple-relocation.o</code> we can inspect it with <code class="language-plaintext highlighter-rouge">objdump</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> objdump <span class="nt">-dr</span> simple-relocation.o

0000000000000000 &lt;main&gt;:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	b8 00 00 00 00       	mov    <span class="nv">$0x0</span>,%eax
   9:	e8 00 00 00 00       	call   e &lt;main+0xe&gt;
			a: R_X86_64_PLT32	far_function-0x4
   e:	b8 00 00 00 00       	mov    <span class="nv">$0x0</span>,%eax
  13:	5d                   	pop    %rbp
  14:	c3                   	ret
</code></pre></div></div>

<p>There’s a lot going on here, but one important part is <code class="language-plaintext highlighter-rouge">e8 00 00 00 00</code>. <code class="language-plaintext highlighter-rouge">e8</code> is the <code class="language-plaintext highlighter-rouge">CALL</code> opcode [<a href="https://c9x.me/x86/html/file_module_x86_id_26.html">ref</a>] and it takes a <strong>32-bit signed relative offset</strong>, which happens to be 0 (four bytes of 0) right now. <code class="language-plaintext highlighter-rouge">objdump</code> also lets us know there is a “relocation” necessary to fix up this code when we finalize it. We can view this relocation with <code class="language-plaintext highlighter-rouge">readelf</code> as well.</p>

<blockquote class="alert alert-note">
  <p><strong>Note</strong>
If you are wondering why we need <code class="language-plaintext highlighter-rouge">-0x4</code>, it’s because the offset is relative to the instruction pointer, which has already moved to the next instruction. The 4 bytes are the operand it has skipped over.</p>
</blockquote>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> readelf <span class="nt">-r</span> simple-relocation.o <span class="nt">-d</span>

Relocation section <span class="s1">'.rela.text'</span> at offset 0x170 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000a  000400000004 R_X86_64_PLT32    0000000000000000 far_function - 4
</code></pre></div></div>

<p>This is additional information embedded in the binary which tells the linker in subsequent stages that it has code that needs to be fixed. Here we see the offset <code class="language-plaintext highlighter-rouge">00000000000a</code>; <code class="language-plaintext highlighter-rouge">a</code> is 9 + 1, the offset of the start of the operand for our <code class="language-plaintext highlighter-rouge">CALL</code> instruction.</p>

<p>Let’s now create the C file for our missing function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">far_function</span><span class="p">()</span> <span class="p">{</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We will now compile it and link the two object files together using our linker.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> gcc simple-relocation.o far-function.o <span class="nt">-o</span> simple-relocation
</code></pre></div></div>

<p>Let’s now inspect that same callsite and see what it has.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> objdump <span class="nt">-dr</span> simple-relocation

0000000000401106 &lt;main&gt;:
  401106:	55                   	push   %rbp
  401107:	48 89 e5             	mov    %rsp,%rbp
  40110a:	b8 00 00 00 00       	mov    <span class="nv">$0x0</span>,%eax
  40110f:	e8 07 00 00 00       	call   40111b &lt;far_function&gt;
  401114:	b8 00 00 00 00       	mov    <span class="nv">$0x0</span>,%eax
  401119:	5d                   	pop    %rbp
  40111a:	c3                   	ret

000000000040111b &lt;far_function&gt;:
  40111b:	55                   	push   %rbp
  40111c:	48 89 e5             	mov    %rsp,%rbp
  40111f:	90                   	nop
  401120:	5d                   	pop    %rbp
  401121:	c3                   	ret
</code></pre></div></div>

<p>We can see that the linker did the right thing with the relocation and calculated the relative offset of our symbol <code class="language-plaintext highlighter-rouge">far_function</code> and fixed the <code class="language-plaintext highlighter-rouge">CALL</code> instruction.</p>

<p>Okay cool…🤷 What does this have to do with huge binaries?</p>

<p>Notice that this call instruction, <code class="language-plaintext highlighter-rouge">e8</code>, only takes a 32-bit <strong>signed</strong> offset, which means it’s limited to ±2^31 bytes. A callsite can only jump roughly 2GiB forward or 2GiB backward. The “2GiB Barrier” is the reach of a single relative jump in either direction.</p>

<p>What happens if our callsite is over 2GiB away?</p>

<p>Let’s build a synthetic example by asking our linker to place <code class="language-plaintext highlighter-rouge">far_function</code> <em>really, really far away</em>. We can do this using a “linker script”, a mechanism for instructing the linker how we would like our sections laid out in the final binary.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SECTIONS
{
    /* 1. Start with standard low-address sections */
    . = 0x400000;
    
    /* Catch everything except our specific 'far' object */
    .text : { 
        simple-relocation.o(.text.*) 
    }
    .rodata : { *(.rodata .rodata.*) }
    .data   : { *(.data .data.*) }
    .bss    : { *(.bss .bss.*) }

    /* 2. Move the cursor for the 'far' island */
    . = 0x120000000; 
    
    .text.far : { 
        far-function.o(.text*) 
    }
}
</code></pre></div></div>

<p>If we now try to link our code we will see a “relocation overflow”.</p>

<blockquote class="alert alert-tip">
  <p><strong>TIP</strong>
I used <code class="language-plaintext highlighter-rouge">lld</code> from <a href="https://lld.llvm.org/">LLVM</a> because the error messages are a bit prettier.</p>
</blockquote>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> gcc simple-relocation.o far-function.o <span class="nt">-T</span> overflow.lds <span class="nt">-o</span> simple-relocation-overflow <span class="nt">-fuse-ld</span><span class="o">=</span>lld

ld.lld: error: &lt;internal&gt;:<span class="o">(</span>.eh_frame+0x6c<span class="o">)</span>:
relocation R_X86_64_PC32 out of range:
5364513724 is not <span class="k">in</span> <span class="o">[</span><span class="nt">-2147483648</span>, 2147483647]<span class="p">;</span> references section <span class="s1">'.text'</span>
ld.lld: error: simple-relocation.o:<span class="o">(</span><span class="k">function </span>main: .text+0xa<span class="o">)</span>:
relocation R_X86_64_PLT32 out of range:
5364514572 is not <span class="k">in</span> <span class="o">[</span><span class="nt">-2147483648</span>, 2147483647]<span class="p">;</span> references <span class="s1">'far_function'</span>
<span class="o">&gt;&gt;&gt;</span> referenced by simple-relocation.c
<span class="o">&gt;&gt;&gt;</span> defined <span class="k">in </span>far-function.o
</code></pre></div></div>

<p>When we hit this problem, what solutions do we have?
Well, that is a whole other subject, “code models”, and it’s a little more nuanced depending on whether we are accessing data (i.e. static variables) or code that is far away. A great blog post that goes into this is <a href="https://maskray.me/blog/2023-05-14-relocation-overflow-and-code-models">the following</a> by <a href="https://github.com/maskray">@maskray</a>, a maintainer of <code class="language-plaintext highlighter-rouge">lld</code>.</p>

<p>The simplest solution, however, is to use <code class="language-plaintext highlighter-rouge">-mcmodel=large</code>, which changes all the relative <code class="language-plaintext highlighter-rouge">CALL</code> instructions to absolute 64-bit ones: the address is materialized into a register and called indirectly.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> gcc simple-relocation.o far-function.o <span class="nt">-T</span> overflow.lds <span class="nt">-o</span> simple-relocation-overflow

<span class="o">&gt;</span> gcc <span class="nt">-c</span> simple-relocation.c <span class="nt">-o</span> simple-relocation.o <span class="nt">-mcmodel</span><span class="o">=</span>large <span class="nt">-fno-asynchronous-unwind-tables</span>

<span class="o">&gt;</span> gcc simple-relocation.o far-function.o <span class="nt">-T</span> overflow.lds <span class="nt">-o</span> simple-relocation-overflow

./simple-relocation-overflow
</code></pre></div></div>

<blockquote class="alert alert-note">
  <p><strong>Note</strong>
I needed to add <code class="language-plaintext highlighter-rouge">-fno-asynchronous-unwind-tables</code> to disable some additional data that might cause overflow for the purpose of this demonstration.</p>
</blockquote>

<p>What does the disassembly look like now?</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> objdump <span class="nt">-dr</span> simple-relocation-overflow 

0000000120000000 &lt;far_function&gt;:
   120000000:	55                   	push   %rbp
   120000001:	48 89 e5             	mov    %rsp,%rbp
   120000004:	90                   	nop
   120000005:	5d                   	pop    %rbp
   120000006:	c3                   	ret

00000000004000e6 &lt;main&gt;:
  4000e6:	55                   	push   %rbp
  4000e7:	48 89 e5             	mov    %rsp,%rbp
  4000ea:	b8 00 00 00 00       	mov    <span class="nv">$0x0</span>,%eax
  4000ef:	48 ba 00 00 00 20 01 	movabs <span class="nv">$0x120000000</span>,%rdx
  4000f6:	00 00 00 
  4000f9:	ff d2                	call   <span class="k">*</span>%rdx
  4000fb:	b8 00 00 00 00       	mov    <span class="nv">$0x0</span>,%eax
  400100:	5d                   	pop    %rbp
  400101:	c3                   	ret
</code></pre></div></div>

<p>There is no longer a lone <code class="language-plaintext highlighter-rouge">CALL</code> instruction; it has become <code class="language-plaintext highlighter-rouge">MOVABS</code> &amp; <code class="language-plaintext highlighter-rouge">CALL</code> 😲. This grew the callsite from 5 bytes (1 opcode byte + 4 bytes for the 32-bit relative offset) to a whopping 12 bytes (2 bytes for the <code class="language-plaintext highlighter-rouge">MOVABS</code> opcode with its REX prefix + 8 bytes for the absolute 64-bit address + 2 bytes for the indirect <code class="language-plaintext highlighter-rouge">CALL</code>).</p>

<p>This has notable downsides among others:</p>
<ul>
  <li><em>Instruction Bloat</em>: We’ve gone from 5 bytes per call to 12. In a binary with millions of callsites, this can add up.</li>
  <li><em>Register Pressure</em>: We’ve burned a general-purpose register, <code class="language-plaintext highlighter-rouge">%rdx</code>, to perform the jump.</li>
</ul>

<blockquote class="alert alert-caution">
  <p><strong>Caution</strong>
I had a lot of trouble building a benchmark that demonstrated a lower IPC (instructions per cycle) for the large <code class="language-plaintext highlighter-rouge">mcmodel</code>, so let’s just take my word for it. 🤷</p>
</blockquote>

<p>Changing to a larger code-model is possible, but it comes with these downsides. Ideally, we would keep the small code-model wherever we can. What other strategies can we pursue?</p>

<p>More to come in subsequent writings.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A problem I experienced when pursuing my PhD and submitting academic articles was that I had built solutions to problems that required dramatic scale to be effective and worthwhile. Responses to my publication submissions often claimed such problems did not exist; however, I had observed them during my time within industry, such as at Google, but I couldn’t cite it!]]></summary></entry></feed>