top | item 34312335

Show HN: HyperLogLog in Zig

154 points| seiflotfy | 3 years ago |github.com

34 comments

order

kristoff_it|3 years ago

One change to make this library more idiomatic would be to make HyperLogLog avoid forcing heap allocation on the user.

So from this (simplified code):

    pub fn HyperLogLog(comptime p: u8) type {
        return struct {
            dense: []u6,

            const Self = @This();
            pub fn init(allocator: Allocator) !Self {
                return Self{
                    .dense = try allocator.alloc(u6, 0),
                };
            }
        }
    }
To this:

    pub fn HyperLogLog(comptime p: u8) type {
        return struct {
            dense: [1<<p]u6;

            const Self = @This();
            pub fn init() Self {
                var s = Self{};
                for (s.dense) |*x| x.* = 0;
                return s;
            }
        }
    }
The advantage to this API is that the user can choose where to place the HLL (stack, heap, global memory). Also note how the new init can't fail anymore.

That said, in the code that I've conveniently omitted there's one big complication: when the HLL has only few items in it, it doesn't use the dense representation and instead uses a `std.AutoHashMap` to keep counts.

Unfortunately `std.AutoHashMap` is not a type that can be reasonably disentangled from allocation.

I was about to drop my idea of commenting when I realized that a very neat solution would be to also provide the user with the choice (and admittedly extra responsibility) of when to switch from a hash map to a proper HLL.

    pub fn init(start: std.AutoHashMap) Self {
        var s = Self{};
        for (s.dense) |*s| s.* = 0;
        self.addFromHashMap(start);
        return s;
    }
There you go, allocation-free HyperLogLog :^)

I've also come up with my own funky interface when I implemented Cuckoo Filters in Zig, for those who are interested:

https://github.com/kristoff-it/zig-cuckoofilter

Alifatisk|3 years ago

Why is Zigs syntax so … foreign to me? What’s the hype behind it?

seiflotfy|3 years ago

This is awesome... question though: ```pub fn HyperLogLog(comptime p: u8) type { return struct { dense: [1<<p]u6;

            const Self = @This();
            pub fn init() Self {
                var s = Self{};
                for (s.dense) |*x| x.* = 0;
                return s;
            }
        }
    }```
doesn't this allocate 1<<p upfront though. If yes then the HLL of size 16384 bytes upfront which kinda beats the purpose of having a sparse representation no?

eatonphil|3 years ago

Awesome! Does Axiom use Zig? Most of your projects seem to be in Go. Why did you choose to do this in Zig?

extasia|3 years ago

Looks cool. What are some applications of a cardinality estimator?

zX41ZdbW|3 years ago

Memory bound alternative for the COUNT(DISTINCT ...) aggregate function in SQL.

ClickHouse[1] has a few functions for that, with different tradeoffs in memory usage, precision, and performance:

    - uniqExact - the same as COUNT(DISTINCT ...) - the exact version, using a hash table;
    - uniqCombined - a combination of a linear array of small size in memory arena, a hash table, and a HyperLogLog - this data structure is similar to HyperLogLog++; the log2 of the number of cells is controllable by the user;
    - uniq - a cardinality estimator based on adaptive sampling by hash, lower quality to memory usage compared to HyperLogLog, but beats it in CPU efficiency;
    - uniqTheta - uses the Theta sketch algorithm, whatever it means;
    - uniqUpTo(N) - my favorite, uses a linear array to give the precise answer up to N, and always answers N + 1 when the number of distinct elements is larger than N;
    - groupBitmap - an exact calculation using Roaring Bitmaps, makes sense for dense integer sets;
[1] https://github.com/ClickHouse/ClickHouse/

What is often forgotten in designing a data structure for a cardinality estimator - is that it should work well not only for a few large states but also for a large number of small sets.

For example, in a query like follows:

    SELECT URL, COUNT(DISTINCT UserID) FROM pageviews GROUP BY URL
You should expect that there are a large number of URLs, but most of them will get just 1..2 unique UserIDs. Obviously, it's not practical to allocate even a 4 KB HyperLogLog for every resulting record. That's how complex combined data structures are justified. They should start with ~16..64 bytes of automatic memory in arena or stack, and only then upgrade to a HyperLogLog.

jedisct1|3 years ago

Rate-limiting is one.

cnlwsu|3 years ago

For LSM trees to estimate effectiveness of a compaction

taepper|3 years ago

hashtable sizing for aggregation

VWWHFSfQ|3 years ago

What is the point of Zig? I mean, this looks approximately like code I would write in about a dozen other languages. So, what is this?