top | item 27509944

Linux with “memory folios”: a 7% performance boost when compiling the kernel

238 points | marcodiego | 4 years ago | lore.kernel.org | reply

158 comments

[+] marcodiego|4 years ago|reply
Now, consider that there's work going on to enable Linux to be compiled with profile guided optimization with clang[0], the DAMON patchset that enables proactive reclamation with "32% memory saving with only 1.91% runtime overhead"[1] and performance improvements achievable with futex2 system call[2].

Linux's future seems bright with regard to performance.

[0] https://lkml.org/lkml/2021/1/11/98

[1] https://lore.kernel.org/lkml/20210608115254.11930-1-sj38.par...

[2] https://lkml.org/lkml/2021/4/27/1208

[+] londons_explore|4 years ago|reply
Linux needs a standard set of benchmarks that are vaguely representative of things users use Linux for. Phone, laptop and a few server use cases.

It then needs someone to do a big parameter tuning to select optimal settings.

Too many decent algorithms don't make it into the kernel because there are too many tunables, and the ones that do typically aren't well tuned for anyone's use case.

Even big projects like Ubuntu typically don't change many tunables in the kernel.

[+] MaxBarraclough|4 years ago|reply
As someone with a limited knowledge of kernel stuff, does this mean Linux is likely to significantly outperform other comparable kernels like FreeBSD and OpenSolaris? Or are those other kernels keeping pace?
[+] mappu|4 years ago|reply
There is some more in-depth discussion about the core folio idea on LWN at https://lwn.net/Articles/849538/ from a previous iteration of the patch set.
[+] WhyNotHugo|4 years ago|reply
Pretty good explanation.

I don't quite follow why it's all being done as one huge set of patches -- rather than first merge the groundwork and then all the conversions.

[+] atoav|4 years ago|reply
A while ago I switched my main rendering machine to Linux because Blender rendered roughly 10% faster on it back then.

This is equivalent to saving ~16 hours when you have a week of render time.

[+] jabl|4 years ago|reply
This was a 7% perf boost compiling the kernel, which does a lot of small file IO and memory allocations and thus stresses the MM code in the kernel.

Rendering is much more 'pure cpu' work, so you most likely won't see much difference there due to this work.

[+] bluedino|4 years ago|reply
Imagine if you have a whole render farm.
[+] mhh__|4 years ago|reply
One thing I've been mulling over recently is that many containers, like vector in C++ for example, have almost no state.

That is to say, we at most might have a bit of logic to tune whether we do *1.5 or *2 on realloc, but why not more?

There must be patterns we can exploit in common use cases to be sneaky and do less malloc-ing. Profile guided? Runtime? I might have some results by Christmas, I have some ideas.

Food for thought: Your container has a few bytes of state to make decisions with, your branch predictor has a few megabytes these days.

[+] jabl|4 years ago|reply
The size increase factor for vector is a compromise between performance and wasting memory. It's also a fairly hot code path, so you don't want to run some complicated code there to estimate the 'optimal' factor.

About the best you can do, if you know beforehand roughly how big the vector will be, is to reserve that capacity up front with std::vector::reserve().

[+] steerablesafe|4 years ago|reply
std::vector would mostly benefit from an improved allocator interface, where it requests N bytes, but the allocator can give more than that and report the actual value.
[+] viraptor|4 years ago|reply
While some heuristics would be nice if they improve the situation, a lot of apps still leave performance on the table by not estimating the capacity well. You don't have to be very clever about growth if you know that this vector will always have 3 elements and that one will have N that you can estimate from data size.
[+] gary_0|4 years ago|reply
Please don't give the C++ people any more ideas, my compile times are already bad enough!
[+] fulafel|4 years ago|reply
It's surprising that we have stuck with 4 kB pages on x86 since the 386, even though computers have ~10,000x as much memory now (4 MB -> 32 GB).
[+] delsarto|4 years ago|reply
I wrote a thesis on this in 2008 (Transparent large-page support for Itanium Linux https://ts.data61.csiro.au/publications/theses_public/08/Wie...) and Matthew Wilcox was already involved in the area then. I admire his persistence, and I certainly have not kept up with the state of the art. Itanium had probably the greatest ability to select page size of any architecture (?). On x86-64 you really only have 2MB or 4kB to work with in a general-purpose situation. It was difficult to show the benefits of managing all the different page sizes and, as this notes, re/writing everything to be aware of page sizes effectively. Those who had really big workloads that benefited from huge pinned mappings didn't really care that much either. That made the work hard to find traction at the time.
[+] TD-Linux|4 years ago|reply
Linux still has a lot of assumptions baked into the page size. Power9 and some aarch64 systems have 16kB pages, but occasionally you run into some corner cases - for example, you can't mount a btrfs partition created on a x86 machine on a power9 one because the btrfs page size must be >= the mmu page size.
[+] rwmj|4 years ago|reply
RHEL used 64k page size on aarch64 for a while. I believe we have switched back to 4k. It caused some problems, from memory:

* Blow-ups in various kernel data structures. There was some virtio code which was allocating N pages per driver queue.

* Problems with GPUs, either the driver or the firmware assumed 4k pages. (Edit: This actually affected Power, not ARM, but the issue is caused by page size: https://lists.fedoraproject.org/archives/list/[email protected]...)

* Filesystems make assumptions about page size versus block size.

* Processes generally take more RAM, with RAM wasted because of internal fragmentation.

[+] b9a2cab5|4 years ago|reply
Indeed, the M1/A14 on mobile have larger pages, which lets them get more effective TLB coverage with a smaller cache. In some applications this can boost performance by double-digit percentages (which you can simulate by enabling large pages on x86).
[+] DudeInBasement|4 years ago|reply
A lot of peripherals have 4kB address space. So it becomes complicated if you change to 16kB as you'll be mapping in more than you bargained for.
[+] toast0|4 years ago|reply
You can also do large pages, 2MB or 1GB or whatever the obscenely large page size is for 5-level paging on latest systems.

2MB vs 4kB isn't quite the same ratio as 4MB -> 32GB, but it's still far fewer pages to cache in the TLB, and it's not too big to manage when you need to copy on write or swap out (or compress with zram) and whatever else needs to be done at the page level.

[+] dncornholio|4 years ago|reply
The other day I was wondering what would happen if all operating systems stopped developing features and only optimised for a week or two. How much time and electricity could be saved?

If you add up an optimisation of just a nanosecond in something like OpenSSH, how much would that do globally?

[+] nikhizzle|4 years ago|reply
I used to be a kernel developer at Apple starting in 2006. Internally, every alternate major release was exactly this. All common paths were identified, and most of the dev time on the release was spent on optimizing those features to hit a goal for the path, e.g. moving 100 files in Finder should not take more than X ms.
[+] georgyo|4 years ago|reply
> If you add up an optimization of just a nanosecond in something like OpenSSH, how much would that do globally?

I believe optimizations like that at a global scale will not have any impact.

Let's say that this nanosecond is saved trillions of times a day, resulting in minutes to an hour a day saved globally.

* Not a single user will notice.

* In 99.99% of cases the CPU will not be fully pegged, and thus that one nanosecond of compute will not be used to do something else at all.

* CPU throttling isn't that fast, so you won't even save that much power.

If we bump it up by six orders of magnitude to a millisecond, all of that remains true, even though you are potentially saving hundreds of years of computing time a day. Extremely small gains distributed across a very large number of machines don't tend to be as impactful as you would hope on a global scale.

This is not to say that small gains are worthless. Many small gains added together can be substantial.

[+] im3w1l|4 years ago|reply
Very optimized projects are out-competed by well-factored but less optimized projects, because the latter can add features faster. That's why we are where we are today.
[+] account42|4 years ago|reply
> There does not appear to be a way to tell gcc that it can cache the result of compound_head()

Isn't this what __attribute__((pure)) [0] is for?

[0] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attribute...

[+] hyperman1|4 years ago|reply
As I understood it, pure means a function's output depends only on its input:

  y = f(x); z = f(x)   implies   y = z
What they want is something different:

  y = f(x)   implies   y = f(y)
This means that if you give something the head of a list of pages, it won't try to go to the head again and again; it knows it's already there.

The 'folio' idea, as I understand it, is roughly an alias for the existing 'page' structure, but code knows it is already at a head AND that it should do the work on the whole list, not only on the head.

[+] kzrdude|4 years ago|reply
It sounds like the quoted benchmark is for XFS, so other filesystems may be different?