I haven't dug extremely deeply into the sources for many F/OSS projects; the code I'm interested in reading has inevitably been opaque (at least to my inexperience). This particular source file (and maybe the rest of Redis?) is really good. I think I'll be taking many more looks at Redis (code- and usage-wise) in the future.
1 million pairs in 16 MB is about 16 bytes per pair, which is perfectly fine but nothing impressive.
The dataset is static, so a simple naive solution would be to create a big array sorted by key. Assuming both photo and user IDs use 4 bytes each, this would result in about 2GB of data. Then use binary search to look up values.
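A minimal sketch of that idea (my own illustration, not from the thread): keep (photo_id, user_id) pairs sorted by photo ID and binary-search on lookup. With 4-byte IDs that's 8 bytes per pair, which is where the ~2GB estimate for 300 million pairs comes from.

```python
import bisect

# Static sorted table of (photo_id, user_id) pairs; the IDs here are
# made-up sample values.
pairs = sorted([(1155315, 939), (1000, 42), (99999999, 7)])
keys = [k for k, _ in pairs]  # parallel array of photo IDs for bisect

def lookup(photo_id):
    # Binary search: O(log n) per lookup over the static dataset.
    i = bisect.bisect_left(keys, photo_id)
    if i < len(keys) and keys[i] == photo_id:
        return pairs[i][1]
    return None
```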
However, if we really want to reduce the size, we could build a finite state machine from the dataset (maybe reverse the values to increase the level of shared suffixes) which should reduce the size by an order of magnitude.
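As a toy illustration of the sharing idea (mine, not from the thread): store the keys as digit strings in a trie, so common prefixes are stored once. A true minimal acyclic FSM (a DAWG) would additionally merge identical sub-trees, which is why reversing the values, as suggested above, can increase how much structure gets shared.

```python
def build_trie(pairs):
    # Plain dict-of-dicts trie keyed on the digits of each ID.
    root = {}
    for key, value in pairs:
        node = root
        for ch in str(key):
            node = node.setdefault(ch, {})
        node["$"] = value  # terminal marker holds the stored user ID
    return root

def trie_lookup(root, key):
    node = root
    for ch in str(key):
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")
```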
Best of all, lookups in hashes are still O(1), making them very quick.
Based on the zipmap code (linked below), a zipmap is implemented as an array of adjacent (key, value) pairs. Lookup in a zipmap is actually linear search. There is no hashing. The lookup will run in time proportional to the number of entries in the map.
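A rough model of that lookup behavior (not the real C layout from zipmap.c, just the access pattern):

```python
def zipmap_get(entries, key):
    # Pairs are stored adjacently and scanned front to back: a linear
    # search with no hashing, so cost grows with the number of entries.
    for k, v in entries:
        if k == key:
            return v
    return None
```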
We found this setting was best around 1000; any higher and the HSET commands would cause noticeable CPU activity.
This shouldn't be surprising either, as the Redis documentation states that "If a specially encoded value will overflow the configured max size, Redis will automatically convert it into normal encoding." Their higher-level bucketing strategy of dividing the media ID by 1000 guarantees that 1000 is the maximum number of entries in any zipmap. Setting hash-max-zipmap-entries to anything lower than 1000 would mean some of their zipmaps get converted to the normal Redis hash encoding.
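The bucketing scheme being discussed can be sketched like this; the `mediabucket` prefix follows Instagram's write-up, while the helper function is my own:

```python
def bucket_key(media_id, bucket_size=1000):
    # Dividing the media ID by 1000 yields the hash key, so each hash
    # holds at most 1000 consecutive IDs and stays zipmap-encoded when
    # hash-max-zipmap-entries is 1000.
    return "mediabucket:%d" % (media_id // bucket_size)

# Against Redis this becomes (illustrative commands):
#   HSET mediabucket:1155 1155315 939
#   HGET mediabucket:1155 1155315
```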
Redis includes a benchmark utility (http://redis.io/topics/benchmarks). I ran it on my MacBook Pro a couple of weeks ago and got north of 50,000 req/s with a bunch of other crap running. I don't think the utility has any tests for hashes, though.
Second, this functionality seems to be a stopgap to support old users who may be using old clients. So they need an array of 300 million elements, each containing an integer of 12 million or less. Assuming 32-bit values (24 bits would work but.. efficiency), that's a 1,144MB array which, in theory, could be stored as a file and cached through the filesystem cache.
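That flat-array idea could be sketched as follows (my illustration; the file name and function names are made up): slot i holds the 32-bit user ID for media ID i, so the record for media ID m lives at byte offset m * 4, and the OS page cache serves hot reads from memory.

```python
import struct

def write_table(path, user_ids):
    # Write each user ID as 4 little-endian bytes, in media-ID order.
    with open(path, "wb") as f:
        for uid in user_ids:
            f.write(struct.pack("<I", uid))

def lookup_uid(path, media_id):
    # Seek straight to the fixed-width record and decode it.
    with open(path, "rb") as f:
        f.seek(media_id * 4)
        return struct.unpack("<I", f.read(4))[0]
```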
I wonder how the performance of that would stack up against Redis. The convenience of Redis being a network daemon out of the box is potentially the big win here, though the memory usage even in the optimized case seems to be around 4x, given that it's doing more than the task strictly requires (from my interpretation of it -- I could be wrong!).
You can find similar examples of this and other Redis memory optimizations -- including Ruby code for the exact implementation Instagram describes -- at the Redis site: http://redis.io/topics/memory-optimization
Lookups are not really O(1); they're O(number of keys per hash) as long as the hashes are zipmaps. When they become full-blown hashes, the memory usage increases.
Still, this is a very good way to store a lot of key/value pairs in Redis.
Big-O notation refers to asymptotic behavior. The zipmap encoding of hashes only applies to sizes bounded by a constant, so hash lookups are still expected O(1) time in Redis. </extreme-pedantry>
Why use clear-text numbers? Most of the time you're going to be using large numbers, so binary-pack them to save more space.
I had the same issue: normal storage was 1.1GB of space, HSET got it down to 200MB, and binary packing every integer brought it right down to 163MB of memory (32-bit instance). For that 163MB, I was slicing an MD5 of the field for the HSET prefix, packing that, and then using the remainder as the HSET suffix (due to the data format of the input field).
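The binary-packing trick can be sketched in Python with `struct` (just an illustration of the idea, not the commenters' actual code): store the integer as 4 raw little-endian bytes rather than its decimal string.

```python
import struct

def pack_id(n):
    # "1155315" is 7 bytes as a decimal string but only 4 bytes packed.
    return struct.pack("<I", n)

def unpack_id(b):
    return struct.unpack("<I", b)[0]
```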
I would like to see some example code or a gist of what you did. I played around with custom number packing and using hashes in Redis, inspired by the original thread topic and your comment. You can find the post here: http://www.christianoestreich.com/2011/11/redis-hashsets-per...
Redis is a series of tradeoffs between simplicity, performance, and power. For the moment, simplicity won as far as keys go -- they're all strings, period, end of story.
Supporting other key types is potentially useful, but brings complications, some of which will be visible to users of Redis.
Your own measurements are telling. Your memory usage went down by about 82%, packing the integers brought that to a bit less than 86%. It's easy to justify a small increase in complexity to save 82%, it's a lot harder to justify even more complexity to bring that number to 86%.
I think they were trying to keep the hash sizes to ~1000 elements per key. With a 3-digit key you would be increasing the hashes to ~300k elements each. You should augment the scripts and see whether that has an impact on performance; I would be curious to see.
The hash data type is probably the most awesome and underused feature of redis. Here is a small gem I wrote to expose it a little better in ruby: https://github.com/lyconic/redis-native_hash
cperciva|14 years ago
If I read the article correctly, existing entries won't change but new entries will be inserted.
maurits|14 years ago
Could you elaborate on that? Or provide a link for the novice?
cperciva|14 years ago
How quick is "very quick"? I was hoping to see some performance benchmarks, not just memory usage benchmarks.
https://github.com/antirez/redis/blob/unstable/src/zipmap.c