danlark's comments

danlark | 1 year ago | on: My Favorite Algorithm: Linear Time Median Finding (2018)

3 and 4 elements will fail to prove the complexity is linear

You still can do 3 or 4 but with slight modifications

For example, for 4 elements, it's advised to take lower median for the first half and upper median for the second half. Then the complexity will be linear

danlark | 1 year ago | on: My Favorite Algorithm: Linear Time Median Finding (2018)

Around 4 years ago I compared lots of different median algorithms and the article turned out to be much longer than I anticipated :)

https://danlark.org/2020/11/11/miniselect-practical-and-gene...

danlark | 2 years ago | on: How big is YouTube?

Disclaimer: the author of the comment is the CEO of ClickHouse

danlark | 2 years ago | on: Deepmind Alphadev: Faster sorting algorithms discovered using deep RL

I agree on the sorting front. Removing one cmov is not likely to improve much.

danlark | 2 years ago | on: Deepmind Alphadev: Faster sorting algorithms discovered using deep RL

In this particular case they were universal but in paper it's said the optimizations were done on x86. One of the ideas was to use LLVM IR but intuition for optimizer over optimizer was unlikely to work properly.

danlark | 2 years ago | on: Deepmind Alphadev: Faster sorting algorithms discovered using deep RL

You can see hashing optimizations as well https://www.deepmind.com/blog/alphadev-discovers-faster-sort..., https://github.com/abseil/abseil-cpp/commit/74eee2aff683cc7d...

I was one of the members who reviewed expertly what has been done both in sorting and hashing. Overall it's more about assembly, finding missed compiler optimizations and balancing between correctness and distribution (in hashing in particular).

It was not revolutionary in a sense it hasn't found completely new approaches but converged to something incomprehensible for humans but relatively good for performance which proves the point that optimal programs are very inhuman.

Note that for instructions in sorting, removing them does not always lead to better performance, for example, instructions can run in parallel and the effect can be less profound. Benchmarks can lie and compiler could do something differently when recompiling the sort3 function which was changed.

For hashing it was even funnier, very small strings up to 64 bit already used 3 instructions like add some constant -> multiply 64x64 -> xor upper/lower. For bigger ones the question becomes more complicated, that's why 9-16 was a better spot and it simplified from 2 multiplications to just one and a rotation. Distribution on real workloads was good, it almost passed smhasher and we decided it was good enough to try out in prod. We did not rollback as you can see from abseil :)

But even given all that, it was fascinating to watch how this system was searching and was able to find particular programs can be further simplified. Kudos to everyone involved, it's a great incremental change that can bring more results in the future.

danlark | 2 years ago | on: “csinc”, the AArch64 instruction you didn’t know you wanted

I am the author of this trick as well

You can read about it in https://community.arm.com/arm-community-blogs/b/infrastructu...

danlark | 3 years ago | on: Changing std:sort at Google’s scale and beyond

Sorry, it definitely was a little inconsistent as I stored all snippets without proper formatting and I tried at the same time align texts for phones, etc

Failed at everything. Will improve

danlark | 3 years ago | on: Changing std:sort at Google’s scale and beyond

Hi, the author is here

I am working in MapReduce/Data-pipelining/Dataproc efficiency internally and externally. In cloud you can check out Google Cloud Dataflow

We are working closely with general efficiency and bare metal teams, yes :). They definitely do a fascinating work. This one was my 20% project together with the google efficiency team

danlark | 3 years ago | on: Changing std:sort at Google’s scale and beyond

Thanks, I'll remove the note about "recalling"

danlark | 3 years ago | on: Changing std:sort at Google’s scale and beyond

You don't, it's just not disallowed :)

danlark | 5 years ago | on: I need extra C/C++ performance now. How?

Hi, I don't agree to the end as most of the options are equally applicable to C language except libcxx and others. ThinLTO, inline threshold -- they also can be turned on for C compilation units

danlark | 6 years ago | on: Mimalloc – A compact general-purpose allocator

The developer helped us a lot to improve the speed and now it is almost at the same level as jemalloc for us

danlark | 6 years ago | on: Mimalloc – A compact general-purpose allocator

Yes, our program ran at half speed.

danlark | 6 years ago | on: Mimalloc – A compact general-purpose allocator

We tried mimalloc in ClickHouse and it is two times slower than jemalloc in our common use case https://github.com/microsoft/mimalloc/issues/11