Now, consider that there's work going on to enable Linux to be compiled with profile-guided optimization with Clang[0], the DAMON patch set that enables proactive reclamation with "32% memory saving with only 1.91% runtime overhead"[1], and the performance improvements achievable with the futex2 system call[2].
Linux's future seems bright with regard to performance.
Linux needs a standard set of benchmarks that are vaguely representative of the things people use Linux for: phone, laptop, and a few server use cases.
It then needs someone to do a big parameter-tuning run to select optimal settings.
Too many decent algorithms don't make it into the kernel because they have too many tunables, and the ones that do make it in typically aren't well tuned for anyone's use case.
Even big projects like Ubuntu typically don't change many tunables in the kernel.
As someone with a limited knowledge of kernel stuff, does this mean Linux is likely to significantly outperform other comparable kernels like FreeBSD and OpenSolaris? Or are those other kernels keeping pace?
There is some more in-depth discussion of the core folio idea on LWN at https://lwn.net/Articles/849538/, from a previous iteration of the patch set.
One thing I've been mulling over recently is that many containers, like vector in C++ for example, have almost no state.
That is to say, we at most might have a bit of logic to tune whether we do *1.5 or *2 on realloc, but why not more?
There must be patterns we can exploit in common use cases to be sneaky and do less malloc-ing. Profile guided? Runtime? I might have some results by Christmas, I have some ideas.
Food for thought: Your container has a few bytes of state to make decisions with, your branch predictor has a few megabytes these days.
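As a concrete baseline for that tuning, here's a minimal sketch (count_reallocs is a made-up name, not from any library) of how many growth steps a vector-like container pays when pushing one element at a time under a given growth factor:

```cpp
#include <algorithm>
#include <cstddef>

// Count growth steps for a container that grows by the factor num/den
// (kept as a ratio so the arithmetic stays integral) while its size is
// incremented one element at a time up to n.
std::size_t count_reallocs(std::size_t n, std::size_t num, std::size_t den) {
    std::size_t cap = 1, reallocs = 0;
    for (std::size_t size = 1; size <= n; ++size) {
        if (size > cap) {
            // Grow by num/den, always by at least one element.
            cap = std::max(cap * num / den, cap + 1);
            ++reallocs;
        }
    }
    return reallocs;
}
```

Growing to 1000 elements costs 10 reallocations at ×2 but 17 at ×1.5; the flip side is that ×2 leaves more slack capacity behind, which is exactly the compromise any smarter heuristic would have to navigate.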
The size increase factor for vector is a compromise between performance and wasting memory. It's also a fairly hot code path, so you don't want to run some complicated code there to estimate the 'optimal' factor.
About the best you can do, if you know beforehand roughly how big it will be, is to reserve that capacity with std::vector::reserve().
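To sketch how much that helps (capacity_changes is a made-up name for illustration): the toy function below counts capacity changes while pushing n elements, and each change implies the elements were copied or moved to new storage.

```cpp
#include <cstddef>
#include <vector>

// Count how often the vector's capacity changes while pushing n
// elements, optionally reserving the full capacity up front.
std::size_t capacity_changes(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) v.reserve(n);  // one allocation up front
    std::size_t changes = 0, cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(0);
        if (v.capacity() != cap) { cap = v.capacity(); ++changes; }
    }
    return changes;
}
```

With reserve(n) up front there are zero further capacity changes; without it, a growth factor of 2 costs roughly log2(n) of them.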
std::vector would mostly benefit from an improved allocator interface, where it requests N bytes, but the allocator can give more than that and report the actual value.
While some heuristics would be nice if they improve the situation, a lot of apps still leave performance on the table by not estimating the capacity well. You don't have to be very clever about growth if you know that this vector will always have 3 elements and that one will have N that you can estimate from data size.
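A sketch of that size-reporting interface might look like the following; C++23 did in fact standardize this shape as std::allocate_at_least, and the 16-byte size-class rounding below is a made-up stand-in for a real allocator's size classes.

```cpp
#include <cstddef>
#include <cstdlib>

// The allocator reports what it actually handed out, not just what
// was asked for.
struct sized_allocation {
    void*       ptr;
    std::size_t bytes;  // usable size, may exceed the request
};

sized_allocation allocate_at_least_bytes(std::size_t requested) {
    // Illustrative size-class rounding: up to a multiple of 16 bytes.
    std::size_t granted = (requested + 15) / 16 * 16;
    return { std::malloc(granted), granted };
}
```

A vector built on this can set its capacity to bytes / sizeof(T) rather than the requested count, turning allocator slack that would otherwise be wasted into usable capacity.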
I wrote a thesis on this in 2008 (Transparent large-page support for Itanium Linux, https://ts.data61.csiro.au/publications/theses_public/08/Wie...) and Matthew Wilcox was already involved in the area then. I admire his persistence, and I certainly have not kept up with the state of the art. Itanium probably had the greatest ability to select page sizes of any architecture. On x86-64 you really only have 2MB or 4kB to work with in a general-purpose situation. It was difficult to show the benefits of managing all the different page sizes and, as this notes, of rewriting everything to be aware of page sizes effectively. Those who had really big workloads that benefited from huge pinned mappings didn't really care that much either. That made the work hard to find traction at the time.
Linux still has a lot of assumptions baked into the page size. Power9 systems have 64kB pages and some aarch64 systems have 16kB pages, and occasionally you run into corner cases: for example, you can't mount a btrfs partition created on an x86 machine on a Power9 one, because the btrfs sector size must be >= the MMU page size.
Indeed, the M1/A14 on mobile have larger (16kB) pages, which lets them get more effective TLB coverage with a smaller TLB. In some applications this can boost performance by a double-digit percentage (which you can simulate by enabling large pages on x86).
You can also use large pages: 2MB, or 1GB, or whatever the obscenely large page size is on the latest systems with 5-level paging.
2MB vs 4kB isn't quite the same ratio as 4MB -> 32GB, but it's still a lot fewer pages to cache in the TLB, and it's not too big to manage when you need to copy on write or swap out (or compress with zram) and whatever else needs to be done at the page level.
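To put numbers on the coverage point (the 1536-entry TLB here is an assumed figure, purely for the arithmetic, not any particular CPU's spec):

```cpp
// Back-of-envelope TLB reach: number of entries times page size.
constexpr unsigned long long coverage_bytes(unsigned long long tlb_entries,
                                            unsigned long long page_bytes) {
    return tlb_entries * page_bytes;
}

constexpr unsigned long long KiB = 1024, MiB = 1024 * KiB;

// 1536 entries of 4kB pages cover 6 MiB; the same entries of 2MB
// pages cover 3 GiB.
static_assert(coverage_bytes(1536, 4 * KiB) == 6 * MiB, "4kB pages");
static_assert(coverage_bytes(1536, 2 * MiB) == 3072 * MiB, "2MB pages");
```

Same TLB, roughly 500x the reach; that's where the double-digit gains mentioned upthread come from.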
The other day I was wondering what would happen if all operating systems stopped developing new features and only optimised for a week or two. How much time and electricity could be saved?
If you add up an optimisation of just a nanosecond in, say, OpenSSH, how much would that do globally?
I used to be a kernel developer at Apple, starting in 2006. Internally, every alternate major release was exactly this. All common paths were identified, and most of the dev time on the release was spent optimizing those paths to hit a goal for each one, e.g. moving 100 files in Finder should not take more than X ms.
> If you add up an optimisation of just a nanosecond in, say, OpenSSH, how much would that do globally?
I believe optimizations like that will not have any impact at a global scale.
Let's say that this nanosecond is saved trillions of times a day, resulting in minutes to an hour a day saved globally.
* Not a single user will notice.
* In 99.99% of cases the CPU will not be fully pegged and thus that one nanosecond of compute will not be used to do something else at all.
* CPU throttling isn't that fast, so you won't even save that much power.
If we bump it up by six orders of magnitude to a millisecond, all of that remains true, even though you are potentially saving hundreds of years of computing time a day. Extremely small gains distributed across a very large number of machines don't tend to be as impactful as you would hope on a global scale.
This is not to say that small gains are worthless. Many small gains added together can be substantial.
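For what it's worth, the arithmetic behind the "minutes to an hour" figure, taking "trillions" as 10^12 (an assumption; the real invocation count is unknowable):

```cpp
// Assumed figures: 10^12 calls per day, 1 ns saved per call.
constexpr long long calls_per_day     = 1'000'000'000'000LL;
constexpr long long ns_saved_per_call = 1;
constexpr long long ns_per_second     = 1'000'000'000LL;

constexpr long long seconds_saved_per_day =
    calls_per_day * ns_saved_per_call / ns_per_second;

static_assert(seconds_saved_per_day == 1000, "about 17 minutes per day");
```

A thousand seconds a day, spread across millions of machines, is exactly why no individual user would ever notice.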
Very optimized projects are out-competed by well-factored but less optimized projects, because the latter can add features faster. That's why we are where we are today.
As I understood it, pure means a function's output depends only on its input:
y = f(x); z = f(x) implies y = z
What they want is something different, namely idempotence:
y = f(x) implies y = f(y)
This means if you give something the head of a list of pages, it won't try to go to the head again and again; it knows it's already there.
The 'folio' idea, as I understand it, is roughly an alias for the existing 'page' structure, but the code knows it is already at a head AND that it should do the work on the whole list, not only on the head.
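A toy model of that fixed-point property, with names that only loosely echo the kernel's struct page and compound_head() (this is a sketch, not real kernel code):

```cpp
// Tail pages point at their head; a head page points at itself,
// which is what makes the lookup idempotent.
struct page {
    page* head;
};

page* compound_head(page* p) {
    return p->head;  // one hop reaches a fixed point: f(f(x)) == f(x)
}
```

For any tail page t, compound_head(&t) lands on the head, and applying it again stays there. A folio type encodes that promise statically, so code holding one can skip the repeated head lookups.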
[0] https://lkml.org/lkml/2021/1/11/98
[1] https://lore.kernel.org/lkml/20210608115254.11930-1-sj38.par...
[2] https://lkml.org/lkml/2021/4/27/1208
WhyNotHugo | 4 years ago:
I don't quite follow why it's all being done as one huge set of patches, rather than first merging the groundwork and then all the conversions.
atoav | 4 years ago:
This is equivalent to saving ~16 hours when you have a week of rendertime.
jabl | 4 years ago:
Rendering is much more 'pure CPU' work, so you most likely won't see much difference there due to this work.
rwmj | 4 years ago:
* Blow-ups in various kernel data structures. There was some virtio code which was allocating N pages per driver queue.
* Problems with GPUs, either the driver or the firmware assumed 4k pages. (Edit: This actually affected Power, not ARM, but the issue is caused by page size: https://lists.fedoraproject.org/archives/list/[email protected]...)
* Filesystems make assumptions about page size versus block size.
* Processes generally take more RAM, with RAM wasted because of internal fragmentation.
account42 | 4 years ago:
Isn't this what __attribute__((pure)) [0] is for?
[0] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attribute...
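For reference, a minimal example of how the attribute is applied (c_strlen is a toy reimplementation, purely for illustration):

```cpp
#include <cstddef>

// The result depends only on the argument and readable memory, so the
// compiler may fold repeated calls with the same argument. Note this is
// the y=f(x); z=f(x) implies y=z property from upthread, not the
// idempotence (y=f(x) implies y=f(y)) that the folio change is after.
__attribute__((pure))
std::size_t c_strlen(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}
```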