.SUNW_cap arcana
This post is going to be useless to almost everyone, yet hopefully eye opening and fascinating. Mostly, the purpose is so that I don’t have to discover this for the third time and can, at some later date, Google this, find my own article, and simply read about it.
This is a tale of linkers and code optimization and perhaps the most elegant ELF loader magic I’ve ever seen.
Backgrounder
Modern processors are pretty badass. They can do amazing things. The new Intel chips on the horizon will have a single instruction for computing a SHA256 hash (or so I’m led to believe). The various magical instructions (or classes of instructions) you can issue on a current 64-bit Intel or AMD chip are mind boggling. However, if you ship binaries to be run somewhere else and they use these instructions, they stand to malfunction on older systems that don’t support them.
A simple example is cx16 (the cmpxchg16b instruction), which atomically compares and swaps 16 bytes of data (two 8-byte pointers). This instruction gives the programmer the tooling to build some powerful lockless data structures. So, as a packager you want everyone to have the best they can, but not everyone has the same chips, so what to do?
OpenSSL solves this issue by compiling several implementations of a given
hot function (from hand-coded assembly), giving each a different name, and
checking the CPU identifier at startup to map out what it can and can’t
do. This is effective but suboptimal, because the entry into these functions
is riddled with branches based on this variable capability set. If you
look at other software, the vast majority simply ships the lowest common
denominator and washes its hands of it. In the open source world, it
would be up to you to compile an optimized copy for yourself; the whole
Gentoo community has been the butt of this joke for some time:
-funroll-loops!
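To make the run-time dispatch approach concrete, here is a minimal sketch in C. It is not OpenSSL's actual code: the variant names, foo_dispatch, and pick_foo are all made up for illustration, and it assumes a GCC or clang toolchain on x86 for the __builtin_cpu_supports probe (elsewhere it just falls back to the plain variant).

```c
#include <stddef.h>

/* Hypothetical variants of one hot function; names are illustrative only. */
static const char *foo_simple(void) { return "simple"; }
static const char *foo_avx(void)    { return "avx"; }
static const char *foo_avx2(void)   { return "avx2"; }

typedef const char *(*foo_fn)(void);

/* Probe the CPU and pick the best variant this build knows about. */
static foo_fn pick_foo(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return foo_avx2;
    if (__builtin_cpu_supports("avx"))  return foo_avx;
#endif
    return foo_simple;
}

/* Resolved lazily on first call (not thread-safe in this sketch). */
static foo_fn foo_impl;

/* Public entry point: every call pays for a check and an indirect call. */
const char *foo_dispatch(void) {
    if (foo_impl == NULL) foo_impl = pick_foo();
    return foo_impl();
}
```

The check and the indirect call in foo_dispatch are exactly the kind of per-call overhead that the rest of this post avoids by letting the run-time linker make the choice once.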
Operating Systems
Operating systems run on processors, of course. It stands to reason that if
an application can check a set of CPU capabilities, then the operating
system can as well. On Linux this is exposed via /proc/cpuinfo, but
on Illumos, you can run the command isainfo -v to see what your
processors are capable of. This is from a VM on my laptop:
; isainfo -v
64-bit amd64 applications
rdrand avx xsave pclmulqdq aes movbe sse4.2 sse4.1 ssse3 popcnt
tscp cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu
32-bit i386 applications
rdrand avx xsave pclmulqdq aes movbe sse4.2 sse4.1 ssse3 popcnt
tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu
And this is from a production server here:
; isainfo -v
64-bit amd64 applications
avx2 fma bmi2 bmi1 rdrand f16c vmx avx xsave pclmulqdq aes movbe
sse4.2 sse4.1 ssse3 popcnt tscp cx16 sse3 sse2 sse fxsr mmx cmov
amd_sysc cx8 tsc fpu
32-bit i386 applications
avx2 fma bmi2 bmi1 rdrand f16c vmx avx xsave pclmulqdq aes movbe
sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx
cmov sep cx8 tsc fpu
You’ll note that while largely the same, the production server can do some things that my laptop cannot: AVX2 instructions, FMA instructions, and F16C instructions to name a few.
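If you want to compare two machines without squinting, a tiny helper can test whether a capability name appears in a list like the ones above. This is only a sketch: has_cap is a name I made up, and it treats the list as plain whitespace-separated tokens rather than asking the operating system directly.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `name` appears as a whole whitespace-separated token in
 * `caps`, a capability list in the style printed by `isainfo -v`.
 */
bool has_cap(const char *caps, const char *name) {
    size_t want = strlen(name);
    const char *p = caps;
    while (*p != '\0') {
        p += strspn(p, " \t\n");            /* skip separators */
        size_t len = strcspn(p, " \t\n");   /* length of this token */
        if (len > 0 && len == want && strncmp(p, name, len) == 0)
            return true;
        p += len;
    }
    return false;
}
```

Matching whole tokens matters here: a naive substring search would happily find "avx" inside "avx2".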
So, if the operating system knows that they are there and the operating system is responsible for running my binaries, could it assist in selecting the right code to run at link-time{% sidebar-link ldso1 %} instead of during run-time?
{% sidebar ldso1 %}
A bit about linking
Typically people think of linking as the stage at which you produce binaries or shared objects from object files as the last step of compiling an application. Dynamic linking also happens at load time. When an ELF binary is loaded, the run-time linker (ld.so.1) lays it out in memory and applies quite a few sophisticated techniques in the process. So, herein I’m referring to run-time linker arcana. {% endsidebar %}
ELF
ELF stands for “Executable and Linking Format” and is the predominant
specification adhered to by modern UNIX platforms (Mac OS X being the
standout exception). It is a robust format for laying out all things
important to a binary (or shared library) such that they can be
assembled into a running process within UNIX. Herein, elfdump is
your best friend.
; elfdump -e libfoo.so
ELF Header
ei_magic: { 0x7f, E, L, F }
ei_class: ELFCLASS64 ei_data: ELFDATA2LSB
ei_osabi: ELFOSABI_SOLARIS ei_abiversion: EAV_SUNW_CURRENT
e_machine: EM_AMD64 e_version: EV_CURRENT
e_type: ET_DYN
e_flags: 0
e_entry: 0 e_ehsize: 64 e_shstrndx: 23
e_shoff: 0x12d0 e_shentsize: 64 e_shnum: 24
e_phoff: 0x40 e_phentsize: 56 e_phnum: 5
While this ELF library is clearly a “Solaris” ELF binary, because ELF is a standard format, Linux can read these just fine. Also, Illumos can read Linux ELF binary objects.
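The first few fields of that header are laid out the same way regardless of platform, which is why the tools can read each other's objects. As a rough sketch (the struct and function names are mine, and it only looks at the first 20 bytes), this is how one might pull out the same fields elfdump shows:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of interest from the start of an ELF header (names are ours). */
struct elf_ident {
    bool is_64bit;       /* EI_CLASS == ELFCLASS64 */
    bool little_endian;  /* EI_DATA == ELFDATA2LSB */
    uint16_t e_type;     /* 3 == ET_DYN (shared object) */
    uint16_t e_machine;  /* 62 == EM_X86_64, a.k.a. EM_AMD64 */
};

/* Read a 16-bit field in the byte order the file declares. */
static uint16_t read_u16(const unsigned char *p, bool le) {
    return le ? (uint16_t)(p[0] | (p[1] << 8))
              : (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse the first 20 bytes of a file image; returns false if not ELF. */
bool parse_elf_ident(const unsigned char *buf, size_t len,
                     struct elf_ident *out) {
    if (len < 20) return false;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return false;                              /* ei_magic */
    if (buf[4] != 1 && buf[4] != 2) return false;  /* ei_class */
    if (buf[5] != 1 && buf[5] != 2) return false;  /* ei_data */
    out->is_64bit = (buf[4] == 2);
    out->little_endian = (buf[5] == 1);
    out->e_type = read_u16(buf + 16, out->little_endian);
    out->e_machine = read_u16(buf + 18, out->little_endian);
    return true;
}
```

Feeding it the start of libfoo.so should report a 64-bit, little-endian ET_DYN object for EM_AMD64, matching the dump above.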
The various sections of the ELF help the linker understand where symbols
are located (which is a lot more complicated than one might expect).
Here’s where it gets interesting. The Sun linker (now the Illumos linker)
has support for creating (object link-editor) and interpreting
(run-time linker) a set of sections around system and object capabilities.
I shall refer to these sections as the .SUNW_cap{% sidebar-link capinfo %} sections.
{% sidebar capinfo %}
.SUNW_cap
There are actually 3 relevant sections: .SUNW_cap, .SUNW_capinfo, .SUNW_capchain; and two supporting sections: .SUNW_capchainsz and .SUNW_capchainent. {% endsidebar %}
The capabilities section allows a developer to “name” capabilities that
describe certain software or hardware requirements. For example, a
developer could name a capability “avx2” that requires that the AVX2
instruction set be available. To see this section, the Illumos-provided
elfdump accepts a -H parameter. Before I jump into this, let’s
look at some code for reference.
A sample project
Let’s build a shared library, libfoo.so, that provides a foo() function
that can potentially be optimized for different available processor
capabilities.
#include <stdio.h>
int foo() {
#ifdef SIMPLE
fprintf(stderr, "Simple foo...\n");
#endif
#ifdef AVX
fprintf(stderr, "AVX foo...\n");
#endif
#ifdef AVX2
fprintf(stderr, "AVX2 foo...\n");
#endif
return 0;
}
Note that in most systems, when something is optimized for a specific instruction set, it is done with hand-coded assembly. Yet, in general, this isn’t a strict requirement. One could simply recompile the same source file with -mavx2 in the case of AVX2 optimizations, and the compiler will elect to use AVX2 instructions where it sees fit. The described techniques work seamlessly in this scenario as well.
Here we have a function foo and three sets of “optimizations” for the purpose of testing only. To build these three differently optimized objects, we can compile them as such:
; gcc -DSIMPLE -fPIC -m64 -o foo.lo -c foo.c
; gcc -DAVX -fPIC -m64 -mavx -o foo.avx-o -c foo.c
; gcc -DAVX2 -fPIC -m64 -mavx2 -o foo.avx2-o -c foo.c
Unearthed Arcana
By diving deeply into the Solaris Linker and Libraries Guide we can find obscure mentions of the .SUNW_capchain. I found them to be enough to begin discovery, but getting to a working result involved significant experimental failure (hence this blog post).
Right now, the three object files we created have different names, different
code, but identical symbol names. So, if we attempted to link them
together, we’d get collisions. Here enters Illumos ld, the mapfile, and
an option they managed to leave out of ld --help and the man page!
Step 1. Mark our objects.
We need to define a set of capabilities called “avx” and another called “avx2.” For that we will create two separate linker mapfiles:
mapfile_avx
$mapfile_version 2
CAPABILITY "avx"
{
HW += AVX;
};
mapfile_avx2
$mapfile_version 2
CAPABILITY "avx2"
{
HW += AVX2;
};
And we will use the link editor to map the symbols.
; ld -r -o foo.avx-lo -M mapfile_avx foo.avx-o
; ld -r -o foo.avx2-lo -M mapfile_avx2 foo.avx2-o
If we attempt to dump the capabilities header for our optimized
foo.avx-o using elfdump -H foo.avx-o, there is no output (because
there is no .SUNW_cap header). And while foo.avx-o and foo.avx-lo
both have the foo function, only foo.avx-lo has the capabilities information.
; nm -e foo.avx-o | grep foo
foo.avx-o:
[8] | 0| 47|FUNC |GLOB |0 |1 |foo
; nm -e foo.avx-lo | grep foo
foo.avx-lo:
[15] | 0| 47|FUNC |GLOB |0 |3 |foo
; elfdump -H foo.avx-o
; elfdump -H foo.avx-lo
Capabilities Section: .SUNW_cap
Object Capabilities:
index tag value
[0] CA_SUNW_ID avx
[1] CA_SUNW_HW_1 0x20000000 [ AVX ]
The same holds for the avx2 variants we’ve created.
Step 2. Alter our symbols
The (undocumented) ld option -z symbolcap will take the .SUNW_cap header
from an object, fold the capability names into the symbol names, and
annotate the chain to record that there is an available symbol with that
stated capability.
; ld -r -o foo.avx-cap-lo -z symbolcap foo.avx-lo
; ld -r -o foo.avx2-cap-lo -z symbolcap foo.avx2-lo
These new objects have altered symbols and all the .SUNW_cap(chain) section information they need to be linked together.
; elfdump -H foo.avx-cap-lo
Capabilities Section: .SUNW_cap
Symbol Capabilities:
index tag value
[1] CA_SUNW_ID avx
[2] CA_SUNW_HW_1 0x20000000 [ AVX ]
Symbols:
index value size type bind oth ver shndx name
[17] 0x0000000000000000 0x000000000000002f FUNC LOCL D 0 .text foo%avx
; elfdump -H foo.avx2-cap-lo
Capabilities Section: .SUNW_cap
Symbol Capabilities:
index tag value
[1] CA_SUNW_ID avx2
[2] CA_SUNW_HW_2 0x20 [ AVX2 ]
Symbols:
index value size type bind oth ver shndx name
[17] 0x0000000000000000 0x000000000000002f FUNC LOCL D 0 .text foo%avx2
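Since the capability ID is folded into the symbol name after a % character (foo%avx, foo%avx2), anything that post-processes symbol names, say a stack trace prettifier, can split the two apart again. A small sketch, assuming the single-% naming seen in the dumps above; split_cap_symbol is a hypothetical helper, not part of any toolchain:

```c
#include <stddef.h>
#include <string.h>

/*
 * Split a capability-annotated symbol name such as "foo%avx2".
 * Copies the base name into `base` (at most base_sz - 1 bytes) and returns
 * a pointer to the capability ID inside `sym`, or NULL if there is none.
 */
const char *split_cap_symbol(const char *sym, char *base, size_t base_sz) {
    const char *pct = strchr(sym, '%');
    size_t n = pct ? (size_t)(pct - sym) : strlen(sym);
    if (base_sz > 0) {
        if (n >= base_sz) n = base_sz - 1;   /* truncate to fit */
        memcpy(base, sym, n);
        base[n] = '\0';
    }
    return pct ? pct + 1 : NULL;
}
```

Given "foo%avx2" this yields the base name "foo" and the ID "avx2"; given plain "foo" it yields "foo" and NULL.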
Step 3. Link it all together.
Now, you’ll note we never did anything special with our foo.lo artifact.
That is just a plain old ELF object ready to be linked. We can link our
three (plain, avx, and avx2) objects together into a shared library now.
We could use gcc -shared to do this, but since we’ve been using ld
directly this whole time, I’ll show how to do that with ld:
; ld -G -o libfoo.so foo.lo foo.avx2-cap-lo foo.avx-cap-lo
We now have a libfoo.so that applications can use. Let’s try it out.
Testing.
Let’s create a tiny test program that calls foo.
extern int foo();
int main() { return foo(); }
Link it.
; gcc -m64 -o test test.c -L. -R. -lfoo
We set a run path of ‘.’ to avoid needing to set LD_LIBRARY_PATH. The ‘-R’ would be unnecessary if this library were installed in a system path.
Run it.
; ./test
AVX foo...
Now, if I copy the same binaries (test and libfoo.so) up to my
production server:
; ./test
AVX2 foo...
Understanding the Magic.
The ld.so.1 man page talks briefly about the LD_DEBUG environment
variable, but for your own fun, just run any command (say /bin/true) with
LD_DEBUG=help set, as in LD_DEBUG=help /bin/true, for a good time! It is
online debugging for the run-time linker. cap is a valid LD_DEBUG value.
Let’s run test with cap debugging:
; LD_DEBUG=cap ./test
debug:
debug: Solaris Linkers: 5.11-1.1750 (illumos)
debug:
12412:
12412: platform capability (CA_SUNW_PLAT) - i86pc
12412: machine capability (CA_SUNW_MACH) - i86pc
12412: hardware capabilities (CA_SUNW_HW_2) - 0x3f
12412: hardware capabilities (CA_SUNW_HW_1) - 0x7bd55c77
12412:
12412: 1:
12412: 1: transferring control: test
12412: 1:
12412: 1: symbol=foo[7]: capability family default
12412: 1: symbol=foo%avx2[1]: capability specific (CA_SUNW_HW_2): [ 0x20 [ AVX2 ] ]
12412: 1: symbol=foo%avx2[1]: capability candidate
12412: 1: symbol=foo%avx[2]: capability specific (CA_SUNW_HW_1): [ 0x20000000 [ AVX ] ]
12412: 1: symbol=foo%avx[2]: capability candidate
12412: 1: symbol=foo%avx2[1]: used
AVX2 foo...
The system has two HW capabilities words (CA_SUNW_HW_1 and CA_SUNW_HW_2).
ld.so.1 notes the current capabilities when it is invoked, applies them to
the loading of libfoo.so, and selects the best matching variant of foo
from the candidates.
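To see what is going on in that trace, here is a rough model of the selection in C. The bit values come straight from the output above (AVX is 0x20000000 in CA_SUNW_HW_1, AVX2 is 0x20 in CA_SUNW_HW_2). The "best match" rule is my assumption, not ld.so.1's documented algorithm: I compare the HW_2 word first, then HW_1, and let the larger value win. The struct and function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* One member of a capability family, e.g. foo, foo%avx, foo%avx2. */
struct cap_variant {
    const char *name;
    uint32_t hw1;   /* required CA_SUNW_HW_1 bits */
    uint32_t hw2;   /* required CA_SUNW_HW_2 bits */
};

/*
 * Pick a variant whose required bits are all present on the system.
 * ASSUMPTION: among candidates, compare HW_2 first, then HW_1; larger wins.
 * variants[0] is the family default (n must be at least 1) and is used
 * when no capability-specific variant fits.
 */
const char *select_variant(const struct cap_variant *variants, size_t n,
                           uint32_t sys_hw1, uint32_t sys_hw2) {
    const struct cap_variant *best = &variants[0];
    for (size_t i = 1; i < n; i++) {
        const struct cap_variant *v = &variants[i];
        if ((v->hw1 & sys_hw1) != v->hw1 || (v->hw2 & sys_hw2) != v->hw2)
            continue;   /* system lacks something this variant needs */
        if (v->hw2 > best->hw2 ||
            (v->hw2 == best->hw2 && v->hw1 > best->hw1))
            best = v;
    }
    return best->name;
}
```

With the production server's words from the trace (HW_1 0x7bd55c77, HW_2 0x3f), both foo%avx and foo%avx2 are candidates and this model settles on foo%avx2, which agrees with the "used" line. On a machine without the AVX2 bit it would fall back to foo%avx, and with neither bit to plain foo.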
Summary
This method provides a fantastic way to build and ship both optimized and
non-optimized variants in the same binary. This makes debugging easier,
as you have fewer binary artifacts to wrangle and the symbols in the
stack traces are clearly annotated with the %variant names.
The only downside to this approach is that the Linux run-time linker and toolchain don’t support it. Viva la Illumos.