Optimising Ubuntu performance on amd64 architecture

Michael Hudson-Doyle

on 12 December 2023

Tags: Intel, Performance, silicon, Ubuntu

Everyone wants the Linux distribution they are using to be fast. This is practically a content-free statement, of course: who would want their distro to be slow?

But at the same time, what does it mean for your distribution to be fast? For example, Ubuntu 21.10 switched the default compression for packages to zstd, which made them faster to both download and decompress, improving the performance of one important operation on Ubuntu. But of course there are many, many other aspects of performance, and this article is about something very different: the processor features Ubuntu assumes are available.

In this post, I will talk a little about the history of the amd64 architecture and some investigations we are doing in collaboration with Intel to make better use of newer processors.

Background

By far and away the most used architecture for Ubuntu is amd64, also known as x86-64 in some contexts. Ubuntu is still built for the very first amd64 CPUs, the AMD K8 from 2003 and Intel’s 64-bit Prescott from 2004, using the original instruction set architecture (ISA).

Over the years, Intel and AMD have added a number of extensions to the ISA, for example:

SIMD: SSE3, SSE4, AVX, AVX512, etc.
Special purpose: RDRAND, AES-NI, VNNI
Slightly more general: cmpxchg16b (atomic compare and exchange), vfmadd* (fused multiply-add for floating point), movbe (byte order conversion)

Not using these new instructions to improve performance throughout the distribution seems like a missed opportunity. A few core packages like glibc and openssl do runtime detection to use newer instructions when they are available but vastly more packages do no such thing.
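
The runtime detection mentioned above follows a simple pattern: check once which features the CPU reports, then route calls to the best implementation available. The sketch below illustrates only that selection logic, in Python. It is not how glibc or openssl implement it (they do this in C and assembly), and the function names, the candidate table and the choice of flags are all hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def sum_squares_portable(values: Iterable[int]) -> int:
    """Baseline variant: assumes nothing beyond the original ISA."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_wide(values: Iterable[int]) -> int:
    """Stand-in for a variant that, in native code, would use newer instructions."""
    return sum(v * v for v in values)


# Hypothetical candidate table, ordered from most to least demanding.
# The last entry requires no extra features, so a match is always found.
CANDIDATES: List[Tuple[FrozenSet[str], Callable[[Iterable[int]], int]]] = [
    (frozenset({"avx2", "fma"}), sum_squares_wide),
    (frozenset(), sum_squares_portable),
]


def select_implementation(cpu_flags: Iterable[str]) -> Callable[[Iterable[int]], int]:
    """Return the first candidate whose required features are all present."""
    flags = frozenset(cpu_flags)
    for required, impl in CANDIDATES:
        if required <= flags:
            return impl
    raise RuntimeError("candidate table has no baseline entry")
```

The selection runs once, for example at start-up, and the returned function is then used for every call. A package that does nothing like this only ever uses the instructions the compiler was allowed to assume at build time.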

A significant difference between an architecture like amd64 and, say, POWER is the diversity of implementations. The POWER architecture has been extended over the years in several ways, but a processor from 2018 can reasonably be assumed to support every instruction supported by a processor from 2013. This is not at all true for the amd64 world. For example, SSE4.1 was introduced in the Penryn microarchitecture in 2007 but as late as 2012 designs that did not support it (e.g. the Centerton range of Atoms) were being released. In addition, both AMD and Intel have introduced extensions that the other has eventually implemented (as well as extensions that never really became widely used and eventually disappeared such as 3DNow! and TSX).

For a long time, the dynamic loader (part of glibc) has allowed distributions to take some advantage of newer CPU features by searching extra directories when support for these features is detected, but on amd64 versions of glibc prior to 2.33 these additional directories were based on ad-hoc, poorly defined selections of capabilities. For example, to my knowledge /lib/x86_64-linux-gnu/haswell was searched on most Intel processors since 2014, but no AMD ones at all.

In 2020, the glibc developers, particularly Florian Weimer of Red Hat, got sufficiently fed up with this mess to propose a solution on the libc-alpha mailing list: assemble reasonable sets of CPU features into “levels” that are mostly supported together, and have the dynamic loader search directories based on these names.
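
As a rough illustration of the idea, here is a sketch of the search order that results since glibc 2.33, which looks in glibc-hwcaps/<level> subdirectories of a library directory before the directory itself, using the level names that were eventually agreed (x86-64-v2, x86-64-v3, x86-64-v4). The helper name and its integer level argument are made up for this example, and the real loader handles many details (caching, other search paths) that are ignored here.

```python
from pathlib import PurePosixPath
from typing import List


def hwcaps_search_dirs(libdir: str, cpu_level: int) -> List[str]:
    """List the directories tried for one library directory, best level first.

    cpu_level is the highest x86-64 level the CPU supports (1 to 4).
    Level 1 is the baseline and has no subdirectory of its own.
    """
    if not 1 <= cpu_level <= 4:
        raise ValueError("cpu_level must be between 1 and 4")
    base = PurePosixPath(libdir)
    dirs = [
        str(base / "glibc-hwcaps" / f"x86-64-v{n}")
        for n in range(cpu_level, 1, -1)
    ]
    dirs.append(str(base))  # the plain directory is always the fallback
    return dirs


if __name__ == "__main__":
    for d in hwcaps_search_dirs("/lib/x86_64-linux-gnu", 3):
        print(d)
```

A distribution can therefore ship an optimised copy of a library in the v3 subdirectory alongside the baseline copy, and machines that do not support v3 never look there.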

Some bikeshedding later, four levels were defined, each including the previous: “v1” or baseline, “v2”, “v3” and “v4”. These definitions were added to the “psABI” specification (roughly speaking, the document that defines what binary code for an amd64 Linux system looks like):

Level Name	CPU Feature	Example instruction
(baseline)	CMOV	cmov
	CX8	cmpxchg8b
	FPU	fld
	FXSR	fxsave
	MMX	emms
	OSFXSR	fxsave
	SCE	syscall
	SSE	cvtss2si
	SSE2	cvtpi2pd
x86-64-v2	CMPXCHG16B	cmpxchg16b
	LAHF-SAHF	lahf
	POPCNT	popcnt
	SSE3	addsubpd
	SSE4_1	blendpd
	SSE4_2	pcmpestri
	SSSE3	phadd
x86-64-v3	AVX	vzeroall
	AVX2	vpermd
	BMI1	andn
	BMI2	bzhi
	F16C	vcvtph2ps
	FMA	vfmadd132pd
	LZCNT	lzcnt
	MOVBE	movbe
	OSXSAVE	xgetbv
x86-64-v4	AVX512F	kmovw
	AVX512BW	vdbpsadbw
	AVX512CD	vplzcntd
	AVX512DQ	vpmullq
	AVX512VL	n/a

Reference: page 14 of the psABI

As alluded to above, it’s not really possible to say that a processor from a given era supports a given level, but as a rough guide most processors from 2009 onward support v2 and most processors from 2015 on support v3.
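
To see where a particular machine falls, the table can be turned into a small check against the flags line of /proc/cpuinfo. This is a sketch under some assumptions: the kernel's flag names differ from the psABI names (SSE3 is pni, LZCNT is abm, CMPXCHG16B is cx16, LAHF-SAHF is lahf_lm), OSFXSR has no flag of its own and is left out, and xsave stands in for OSXSAVE. The function names are invented for this example.

```python
from typing import Iterable, List, Set, Tuple

# Flags required by each level, using Linux /proc/cpuinfo spellings.
# Each level also requires everything from the levels before it.
LEVEL_FLAGS: List[Tuple[int, Set[str]]] = [
    (1, {"cmov", "cx8", "fpu", "fxsr", "mmx", "syscall", "sse", "sse2"}),
    (2, {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}),
    (3, {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
    (4, {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}),
]


def parse_cpuinfo_flags(text: str) -> Set[str]:
    """Return the flags of the first CPU listed in /proc/cpuinfo-style text."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def x86_64_level(flags: Iterable[str]) -> int:
    """Return the highest level (1 to 4) fully covered by flags, or 0 if none."""
    present = set(flags)
    level = 0
    for n, required in LEVEL_FLAGS:
        if not required <= present:
            break  # a missing flag rules out this level and all later ones
        level = n
    return level


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            cpu_flags = parse_cpuinfo_flags(f.read())
    except OSError:
        cpu_flags = set()
    detected = x86_64_level(cpu_flags)
    print(f"x86-64-v{detected}" if detected else "no x86-64 level detected")
```

Because each level includes the previous one, the check stops at the first level with a missing flag: a CPU with AVX2 but without SSE4.2 would still be reported as v1.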

v4 is complicated: Intel’s 11th Gen processors support it but the 12th Gen and 13th Gen processors do not, while AMD’s new Zen 4 microarchitecture adds support. It’s hard to know what the future holds for AVX512, so I’m not going to consider it for the rest of this article.

From glibc to the toolchains

Although the original idea of these levels was to rationalise the process by which the dynamic loader looks for shared libraries, they also provide a sensible label for a set of instructions assumed to be available by all parts of the distribution. Support for using “x86-64-v$N” as values for the -march flag was added to GCC in version 11 and LLVM in version 12.

It is worth noting here that we are only really talking about the C and C++ toolchains in this document. While the distribution clearly contains a great and increasing amount of code in other languages (Python, Go, Rust, Java, Ruby, …), a large majority of the code is in C/C++. For some language ecosystems, in particular Python, a lot of the performance sensitive code is in C/C++ anyway (e.g. numpy). The other statically compiled toolchains (like Rust and Go) do have support for selecting the precise ISA they target but for the rest of this document we will only think about C and C++.

Bumping the baseline?

Changing the default value of -march is a trivial change to the packaging of GCC, and some distributions have already made it: both RHEL9 and SUSE Tumbleweed (as of Nov 2022) target x86-64-v2.

These changes have both a cost and a benefit:

For users that have hardware that is too old to support v2 instructions, these operating systems will not work at all.
For users that have paid for better hardware, these operating systems take better advantage of that hardware.

For a commercial distribution like RHEL, this probably still makes sense: if you are spending the money to get a RHEL (or SLES or …) license you are probably already running reasonably up-to-date hardware, or at least the additional cost of updating to hardware that is less than 10 years old is fairly insignificant. It is interesting to note that SUSE’s new “Adaptable Linux Platform” product originally proposed targeting v3 and later scaled this back to v2.

For a free distribution like Ubuntu (or Fedora), the calculation is different: allowing users to extend the life of hardware by installing a free Linux distribution is a significant, positive aspect of the open source world, and it is very likely that the users who are still using 2008-era hardware with Ubuntu are the users who are least able to upgrade.

That said, hardware doesn’t last forever. A few years ago, the cost of maintaining full support for 32-bit x86 machines started to outweigh the benefits and we stopped building most packages for them. Making a considered decision here requires data. Specifically:

Usage – How many Ubuntu users are using hardware that supports only v1 or v2?
Performance – How much performance improvement does changing the default to x86-64-v2 or x86-64-v3 bring anyway?

Neither of these questions is easy to answer.

Trying it for yourself

While we continue our own performance analysis and further assess the needs of our users, we have released an experimental Ubuntu 23.04 Server build – using -march=x86-64-v3 and -mtune=icelake-server – for the community to try out. As we consider the potential perks and drawbacks of using v3 system-wide, your feedback and observations will be an invaluable part of the process. Here are some of the questions we have for our own efforts:

On aggregate, is the v3 version faster than the baseline v1 version of Ubuntu? As we mentioned above, this can mean a lot of different things, from quantitative benchmarking to a looser, qualitative feeling about speed from the user perspective.
Are there certain domains where performance overwhelmingly benefits from or regresses because of v3?
Do these changes break anything?

This discourse post explains where to find an installer for this build, which is not only built out of the rebuilt packages, but will install packages from the rebuild archive by default. Please note that this is for testing only. Systems installed using this installer will receive no security (or any other) updates and will be in no way suitable for use in production.

We will be making another post when our own benchmarking is complete to explain what we tested and the results we found. Stay tuned!
