And the key point:
"Previous benchmarks we have run (where we rebuilt the entire archive for x86-64-v3) show that most packages show a slight (around 1%) performance improvement and some packages, mostly those that are somewhat numerical in nature, improve more than that."
> show that most packages show a slight (around 1%) performance improvement
This takes me back to arguing with Gentoo users 20 years ago who insisted that compiling everything from source for their machine made everything faster.
The consensus at the time was basically "theoretically, it's possible, but in practice, gcc isn't really doing much with the extra instructions anyway".
Then there's stuff like glibc which has custom assembly versions of things like memcpy/etc, and selects from them at startup. I'm not really sure if that was common 20 years ago but it is now.
It's cool that after 20 years we can finally start using the newer instructions in binary packages, but it definitely seems to not matter all that much, still.
How many additions have there even been outside of AVX-x? Even AVX2 dates back to 2011. If we ignore AVX-x, the last ones I can recall are the few instructions added in the bit-manipulation sets BMI/ABM, but those are Haswell/Piledriver/Jaguar era (2012-2013). While some specific cases could benefit, it doesn't seem like a goldmine of performance improvements.
Further, maybe it has not been a focus for compiler vendors to generate good code for these higher microarchitecture levels if few are using them. So Ubuntu's move could improve that.
Which unfortunately extends all the way to Intel's newest client CPUs, since they're still struggling to ship their own AVX-512 support, which is required for v4. Meanwhile AMD has been on v4 for two generations already.
What are the changes to dpkg and apt? Are they being shared with Debian? Could this be used to address the pesky armel vs. armel+hardfloat vs. armhf issue, or for that matter, the issue of i486 vs. i586 vs. i686 vs. the many varieties of MMX and SSE extensions for 32-bit?
Even if technically possible, it's unlikely this will be used to support any of the variants you mentioned in Debian. Both i386 and armel are effectively dead: i386 is reduced to a partial architecture only for backwards compatibility reasons, and armel has been removed entirely from development of the next release.
This would allow mixing armel and softvfp ABIs, but not hard-float ABIs, at least across compilation unit boundaries (that said, GCC never seems to optimize ABI bottlenecks within a compilation unit anyway).
Over the past year, Intel has pulled back from Linux development.
Intel has reduced its number of employees, and has lost lots of software developers.
So we lost Clear Linux, their Linux distribution that often showcased performance improvements due to careful optimization and utilization of microarchitectural enhancements.
I believe you can still use the Intel compiler, icc, and maybe see some improvements in performance-sensitive code.
Getting a 1% across-the-board general-purpose improvement might sound small, but it is quite significant. Happy to see Canonical invest more heavily in performance and correctness.
Would love to see which packages benefited the most in terms of percentage gain and install base. You could probably back out a kWh / tons-of-CO2-saved metric from it.
> you will not be able to transfer your hard-drive/SSD to an older machine that does not support x86-64-v3. Usually, we try to ensure that moving drives between systems like this would work. For 26.04 LTS, we’ll be working on making this experience cleaner, and hopefully provide a method of recovering a system that is in this state.
Does anyone know what the plans are to accomplish this?
If I were them, I would make sure the v3 instructions are not used until late in the boot process, and provide some apt command that makes sure all installed programs are in the right subarchitecture for the running system, reinstalling as necessary.
But that does not sound like a simple solution for non-technical users.
Then again, non-technical users moving an installation to another, older computer? That sounds unusual.
I am probably going to be the one implementing this and I don't know what I am going to do yet! At the very least we need the failure mode to be better (currently you get an OOPS when the init from the initrd dies due to an illegal instruction exception)
Right, though compared to what one generally thinks of as an “AVX2-compatible” CPU, it curiously omits AES-NI and CLMUL (both relevant to e.g. AES-GCM). Yes, they are not technically part of AVX2, but they are present in all(?) the qualifying Intel and AMD CPUs (like many other technically-not-AVX2 stuff that did get included, like BMI or FMA3).
I'm really "new" to x64 (I only migrated from 32-bit in 2020...) and the difference I noticed between x86-64-v1 and x86-64-v3 was only with video (with ffmpeg), audio (mp3/ogg/mp4...) and encryption; the rest remains practically the same.
Naively, I believe it might be more appropriate to offer x86-64-vN options only for specific software and leave the rest as x86-64-v1.
AVX seemed to give the biggest boost to things.
Regarding those who are making fun of Gentoo users: it really did make a bigger difference in the past, but as compilers have been refined, the difference has diminished. Today, for me, still using Gentoo/CRUX for some specific tasks, what matters is the flexibility to enable or disable what I want in the software, not so much the extra speed anymore.
As an example, I currently use -Os (x86-64-v1) for everything, and only for things related to video/sound/cryptography (mathematics in general, I believe?) do I use -O2 (x86-64-v3) with other flags to squeeze out a little more.
Interestingly, in many cases -Os with -mtune=nocona generates faster binaries, even though I'm only using hardware from Haswell onward (go figure).
This is quite good news, but it's worth remembering that it's a rare piece of software in the modern scientific/numerical world that can be compiled against the versions in distro package managers, since those versions can lag upstream by months after a release.
If you’re doing that sort of work, you also shouldn’t use pre-compiled PyPI packages for the same reason - you leave a ton of performance on the table by not targeting the micro-architecture you’re running on.
My RSS reader trains a model every week or so and takes 15 minutes total with plain numpy, scikit-learn and all that. Intel MKL can do the same job in about half the time as the default BLAS. So you are looking at a noticeable performance boost, but a zero-bullshit install with uv is worth a lot. If I was interested in improving the model, then yeah, I might need to train 200 of them interactively and I'd really feel the difference. Thing is, the model is pretty good as it is, and to make something better I'd have to think long and hard about what 'better' means.
Most of the scientific numerical code I ever used had been in use for decades and would compile on a unix variant released in 1992, much less the distribution version of dependencies that were a year or two behind upstream.
Yup, if you're using OpenCV for instance compiling instead of using pre-built binaries can result in 10x or more speed-ups once you take into account avx/threading/math/blas-libraries etc...
I wonder who downvoted this. The juice you are going to get from building your core applications and libraries to suit your workload is going to be far larger than the small improvements available from microarchitectural targeting. For example, on Ubuntu I have some ETL pipelines that need libxml2. Linking it statically into the application cuts the ETL runtime by 30%. Essentially none of the practices of Debian/Ubuntu Linux are what you'd choose for efficiency. Their practices are designed around some pretty old and arguably obsolete ideas about ease of maintenance.
Thanks for sharing this. I'd love to learn more about micro-architectures and instruction sets - would you have any recommendations for books or sources that would be a good starting place?
This sure feels like overkill that leaks massive complexity into a lot more areas than it's needed in. For the applications that truly need sub-architecture variants, surely different packages or just some sort of meta-package indirection would be better for everyone involved.
So, if I got it right, this is mostly a way to have branches within a specific release for various levels of CPUs and their support of SIMD and other modern opcodes.
And if I have it right, the main advantage should come with the package manager and open source software, where the compiled binaries would be branched to benefit from and optimize for newer CPU features.
Still, this would be most noticeable for apps that benefit from those features, such as audio DSP, or, as mentioned, SSL and crypto.
I would expect compression, encryption, and codecs to have the least noticeable benefit because these already do runtime dispatch to routines suited to the CPU where they are running, regardless of the architecture level targeted at compile time.
Seems like this is not using glibc's hwcaps (where shared libraries are located in microarch-specific subdirs).
To me hwcaps feels like very unfortunate feature creep in glibc now. I don't see why it was ever added, given that it's hard to compile only shared libraries for a specific microarch, and it does not benefit executables. Distros seem to avoid it. All it does is cause unnecessary stat calls when running an executable.
No, it's not using hwcaps. That would only allow optimization of code in shared libraries, would be irritating to implement in a way that didn't require touching each package that includes shared libraries, and would (depending on details) waste a bunch of space on every user's system. I think hwcaps would only make sense for a small number of shared libraries, if at all, not as a system-wide thing.
They do mention it in the linked announcement, although it's not really highlighted, just a quick mention:
> As a result, we’re very excited to share that in Ubuntu 25.10, some packages are available, on an opt-in basis, in their optimized form for the more modern x86-64-v3 architecture level
> Previous benchmarks we have run (where we rebuilt the entire archive for x86-64-v3) show that most packages show a slight (around 1%) performance improvement and some packages, mostly those that are somewhat numerical in nature, improve more than that.
ARM/RISC-V extensions may be another reason. If a widespread variant configuration exists, why not build for it? See:
- RISC-V's official extensions[1]
- ARM's JS-specific float-to-fixed[2]
CachyOS already grabs this one percent of performance gains? Since it chases every performance gain, that's unsurprising. But now I wonder how my laptop from 2012 managed to run CachyOS; they seem to switch based on the hardware, not at image download and boot time.
This is awesome, but... if your process requires deterministic results (speaking mostly about floats/doubles here), then you need to get this straight.
Maybe; more likely we'll trade off the added build/test/storage cost of maintaining each variant, so you might not see amd64v4, but possibly amd64v5, depending on how impactful they turn out to be.
The same will apply to different arm64 or riscv64 variants.
apt (3.1.7) unstable; urgency=medium
.
[ Julian Andres Klode ]
* test-history: Adjust for as-installed testing
.
[ Simon Johnsson ]
* Add history undo, redo, and rollback features
I bet you there is some use case of some app or library where this is like a 2x improvement.
x86-64-v3 is AVX2-capable CPUs.
(There is some older text in the Debian Wiki https://wiki.debian.org/ArchitectureVariants but it's not clear if it's directly related to this effort)
No, because those are different ABIs (and a Debian architecture is really an ABI).
> the issue of i486 vs. i586 vs. i686 vs. the many varieties of MMX and SSE extensions for 32-bit?
It could be used for this, but it's about 15 years too late to care, surely?
> (There is some older text in the Debian Wiki https://wiki.debian.org/ArchitectureVariants but it's not clear if it's directly related to this effort)
Yeah, that is a previous version of the same design. I need to get back to talking to Debian folks about this.
https://clearlinux.org/
"It was actively developed from 2/6/2015-7/18/2025."
> Description: official repositories compiled with LTO, -march=x86-64-vN and -O3.
Packages: https://status.alhp.dev/
I couldn't run something from NPM on an older NAS machine (HP Microserver Gen 7) recently because of this.
1. https://riscv.atlassian.net/wiki/spaces/HOME/pages/16154732/... 2. https://developer.arm.com/documentation/dui0801/h/A64-Floati...
"Changes/Optimized Binaries for the AMD64 Architecture v2" (2025) https://fedoraproject.org/wiki/Changes/Optimized_Binaries_fo... :
> Note that other distributions use higher microarchitecture levels. For example RHEL 9 uses x86-64-v2 as the baseline, RHEL 10 uses x86-64-v3, and other distros provide optimized variants (OpenSUSE, Arch Linux, Ubuntu).
Very odd choice of words. "Better utilize/leverage" is perhaps the right thing to say here.
All the fuss about Ubuntu 25.10 and later being RVA23 only was about nothing?