# .SUNW_cap arcana

This post is going to be useless to almost everyone, yet hopefully eye opening and fascinating. Mostly, the purpose is so that I don't have to discover this for the third time and can, at some later date, Google this, find my own article, and simply read about it.

This is a tale of linkers and code optimization and perhaps the most elegant ELF loader magic I've ever seen.

## Backgrounder

Modern processors are pretty badass. They can do amazing things. The new Intel chips on the horizon will have a single instruction for computing a SHA256 hash (or so I'm led to believe). The various magical instructions (or classes of instructions) you can issue on a current 64-bit Intel or AMD chip are mind boggling. However, if you are shipping binaries to be run somewhere and you leverage these instructions, you stand to malfunction on older systems that don't support them.

A simple example is the cx16 capability: the CMPXCHG16B instruction, which atomically compares and swaps 16 bytes of data (two 8-byte pointers). This instruction affords the programmer the tooling to build some powerful lockless data structures. So, as a packager you want everyone to have the best they can, but not everyone has the same chips, so what to do?

OpenSSL solves this issue by compiling several implementations of a given hot function (from hand-coded assembly), naming them differently, and checking the CPU identification at startup to map out what the processor can and can't do. This is effective, but suboptimal because the entry into these functions is riddled with branches based on this variable capability set. If you look at other software, the vast majority simply ships the lowest common denominator and washes its hands of it. In the open source world, it would be up to you to compile an optimized copy for yourself; the whole Gentoo community has been the butt of this joke for some time: -funroll-loops!

## Operating Systems

Operating systems run on processors, of course. It stands to reason that if an application can check a set of CPU capabilities, then the operating system can as well. On Linux this is exposed via /proc/cpuinfo, but on Illumos, you can run the command isainfo -v to see what your processors are capable of. This is from a VM on my laptop:

And this is from a production server here:

You'll note that while largely the same, the production server can do some things that my laptop cannot: AVX2 instructions, FMA instructions, and F16C instructions to name a few.

So, if the operating system knows these capabilities are there, and the operating system is responsible for running my binaries, could it assist in selecting the right code to run at link time instead of at run time?

## ELF

ELF stands for "Executable and Linking Format" and is the predominant specification adhered to by modern UNIX platforms (Mac OS X being the standout exception). It is a robust format for laying out all things important to a binary (or shared library) such that they can be assembled into a running process within UNIX. Herein, elfdump is your best friend.

While an ELF object built here is clearly branded a "Solaris" ELF binary, because ELF is a standard format, Linux can read it just fine. Likewise, Illumos can read Linux ELF binary objects.

The various sections of an ELF object help the linker understand where symbols are located (which is a lot more complicated than one might expect). Here's where it gets interesting. The Sun linker (now the Illumos linker) has support for creating (in the object link-editor) and interpreting (in the run-time linker) a set of sections describing system and object capabilities. I shall refer to these as the .SUNW_cap sections.

The capabilities section allows a developer to "name" capabilities that describe certain software or hardware requirements. For example, a developer could name a capability "avx2" that requires that the AVX2 instruction set be available. To see this section, the Illumos-provided elfdump accepts a -H parameter. Before I can jump into this, let's look at some code for reference.

## A sample project

Let's build a shared library libfoo.so that provides a foo() function that can potentially be optimized for different available processor capabilities.
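The original sources aren't reproduced here; below is a stand-in sketch of the generic foo.c, where the filename and return string are assumptions. In this sketch, foo.avx.c and foo.avx2.c are identical apart from the string they return.

```c
/* foo.c — the generic implementation.  The "optimized" variants
 * (foo.avx.c, foo.avx2.c) define the very same symbol, foo(), with a
 * different string, standing in for genuinely optimized code. */
const char *foo(void) {
    return "generic foo";
}
```

Returning a distinct string per variant makes it trivial to observe, at run time, which copy of foo() the linker actually selected.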

Note that in most systems, when something is optimized for a specific instruction set, it is done with hand-coded assembly. In general, though, this isn't a strict requirement. One could simply recompile the same source file with -mavx2 (in the case of AVX2 optimizations) and the compiler will elect to use AVX2 instructions where it sees fit. The described techniques work seamlessly in this scenario as well.

Here we have a function foo and three sets of "optimizations" for the purposes of testing only. In order to build these three differently optimized objects, we can compile them as such:

## Unearthed Arcana

By diving deeply into the Solaris Linker and Libraries Guide we can find obscure mentions of the .SUNW_capchain section. I personally found them to be enough to begin discovery, but the road was riddled with experimental failure (hence this blog post).

Right now, the three object files we created have different names and different code, but identical symbol names. So, if we attempted to link them together, we'd get collisions. Enter the Illumos ld, the mapfile, and an option they managed to leave out of both ld --help and the man page!

#### Step 1. Mark our objects

We need to define a set of capabilities called "avx" and another called "avx2". For that we will create two separate linker mapfiles:

##### mapfile_avx2
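The original mapfile contents aren't shown; a sketch in the mapfile version 2 syntax from the Solaris Linker and Libraries Guide would look like this (mapfile_avx is identical, with avx/AVX in place of avx2/AVX2):

```
$mapfile_version 2
CAPABILITY avx2 {
        HW += AVX2;
};
```

The CAPABILITY directive names the capability group ("avx2") and the HW clause declares the hardware capability flags an object built with this mapfile requires.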

And we will use the link editor to map the symbols.
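The exact invocations aren't shown; a sketch, assuming Illumos ld option spellings (-r emits a relocatable object, -M supplies the mapfile):

```shell
ld -r -M mapfile_avx  -o foo.avx-o  foo.avx-lo
ld -r -M mapfile_avx2 -o foo.avx2-o foo.avx2-lo
```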

If we attempt to dump the capabilities header for our optimized foo.avx-lo using elfdump -H foo.avx-lo, there is no output (because there is no .SUNW_cap section). And while foo.avx-o and foo.avx-lo both contain the foo function, only foo.avx-o has the capabilities information.

The same holds for the avx2 variants we've created.

#### Step 2. Alter our symbols

The (undocumented) ld option -z symbolcap will take the .SUNW_cap section from an object, fold the capability names into the symbol names, and annotate the capability chain to record that a symbol with the stated capability is available.
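Applied to our capability-marked objects, that looks something like this (a sketch; Illumos ld assumed):

```shell
ld -r -z symbolcap -o foo.avx.o  foo.avx-o
ld -r -z symbolcap -o foo.avx2.o foo.avx2-o
```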

These new objects have altered symbols and all the .SUNW_cap(chain) section information they need to be linked together.

Now, you'll note we never did anything special with our foo.lo artifact. That is just a plain old ELF object ready to be linked. We can link our three (plain, avx, and avx2) objects together into a shared library now. We could use gcc -shared to do this, but since we've been using ld directly this whole time, I'll show how to do that with ld:
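A sketch of that final link, assuming -G (the Solaris/Illumos ld flag for producing a shared object):

```shell
ld -G -o libfoo.so foo.lo foo.avx.o foo.avx2.o
```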

We now have a libfoo.so that applications can use. Let's try it out.

## Testing.

Let's create a tiny test program that calls foo.
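The original test program isn't shown; a minimal sketch, assuming foo() returns a string describing the variant that ran:

```c
/* test.c — minimal caller; links against libfoo.so at build time.
 * The foo() signature is an assumption of this sketch. */
#include <stdio.h>

extern const char *foo(void);

int main(void) {
    printf("foo() says: %s\n", foo());
    return 0;
}
```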

We set a run path of '.' to avoid needing to set LD_LIBRARY_PATH. The '-R' would be unnecessary if this library were installed in a system path.
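A sketch of the build command, with gcc driving the Illumos link-editor (other toolchains spell the runpath option -Wl,-rpath,. instead of -R.):

```shell
gcc -o test test.c -L. -R. -lfoo
```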

Run it.

Now, if I copy the same binaries (test and libfoo.so) up to my production server:

## Understanding the Magic.

The ld.so.1 man page talks briefly about the LD_DEBUG environment variable, but for your own fun, just run any command with LD_DEBUG=help (say, LD_DEBUG=help /bin/true) for a good time! On-line debugging for the run-time linker. cap is a valid LD_DEBUG value. Let's run test with cap debugging:
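A sketch of the invocation (the debug output itself is omitted; it varies by machine):

```shell
LD_DEBUG=cap ./test
```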

The library carries two hardware capability profiles in addition to the default. ld.so.1 notices the machine's capabilities at invocation time, applies them while loading libfoo.so, and selects the best matching profile.

## Summary

This method provides a fantastic way to build and ship both optimized and non-optimized variants in the same binary. It makes debugging easier, as you have fewer binary artifacts to wrangle and the symbols in stack traces are clearly annotated with their %variant names.

The only downside to this approach is that the Linux run-time linker and toolchain don't support it. Viva la Illumos.