Storing hundreds of millions of simple key-value pairs in Redis

[+] moxiemk1|14 years ago|reply

The the code the article links to for zipmap.c (https://github.com/antirez/redis/blob/unstable/src/zipmap.c) is rather literate.

I haven't dug extremely deeply into the sources for many F/OSS projects; the code I'm interested in reading has been inevitably opaque (at least to my inexperience). This particular source file (and maybe the rest of Redis?) is really good. I think I'll be taking many more looks at Redis (code- and usage-wise) in the future.

[+] mikeyk|14 years ago|reply

Back when we were getting started with Redis, the readability / concise nature of the project was one of the things that most excited me about it (here's what it looked like around then: https://github.com/antirez/redis/tree/0b420168b485d0a9c4b66d...)

[+] gorset|14 years ago|reply

1 million pair using 16 MB is about 16 bytes per pair, which is perfectly fine but nothing impressive.

The dataset is static, so a simple naive solution would be to create a big array sorted by key. Assuming both photo and user IDs use 4 bytes each, this would result in about 2GB of data. Then use binary search to lookup values.

However, if we really want to reduce the size, we could build a finite state machine from the dataset (maybe reverse the values to increase the level of shared suffixes) which should reduce the size by an order of magnitude.

[+] cperciva|14 years ago|reply

The dataset is static

If I read the article correctly, existing entries won't change but new entries will be inserted.

[+] maurits|14 years ago|reply

build a finite state machine from the dataset

Could you elaborate on that? Or provide a link for the novice?

[+] cperciva|14 years ago|reply

Best of all, lookups in hashes are still O(1), making them very quick.

How quick is "very quick"? I was hoping to see some performance benchmarks, not just memory usage benchmarks.

[+] minimax|14 years ago|reply

Best of all, lookups in hashes are still O(1), making them very quick.

Based on the zipmap code (linked below), a zipmap is implemented as an array of adjacent (key, value) pairs. Lookup in a zipmap is actually linear search. There is no hashing. The lookup will run in time proportional to the number of entries in the map.

We found this setting was best around 1000; any higher and the HSET commands would cause noticeable CPU activity

This shouldn't be surprising either, as the redis documentation states that "If a specially encoded value will overflow the configured max size, Redis will automatically convert it into normal encoding." Their higher level hashing strategy of dividing the media id by 1000 guarantees that 1000 is the maximum number of entries in any zipmap. Setting hash-max-zipmap-entries to anything lower than 1000 means some of their zipmaps will be converted to normal redis key/value encodings.

https://github.com/antirez/redis/blob/unstable/src/zipmap.c

[+] Aqua_Geek|14 years ago|reply

Redis includes a benchmark utility (http://redis.io/topics/benchmarks). I ran it on my MacBook Pro a couple of weeks ago and got north of 50,000 req/s with a bunch of other crap running. I don't think the utility has any tests for hashes, though.

[+] petercooper|14 years ago|reply

First, I love Redis :-)

Second, this functionality seems to be a stop gap to support old users who may be using old clients. So they need an array of 300 million elements each containing an integer of 12 million or less. Assuming 32 bit values (24 would work but.. efficiency), that's a 1,144MB array which, in theory, could be stored as a file and cached through the filesystem cache.

I wonder how the performance of that would stack up against Redis. The convenience of Redis being a network daemon out of the box is potentially the big win here, though the memory usage even in the optimized case seems to be around 4x given that it's doing more than the task necessarily requires (from my interpretation of it - I could be wrong!)

[+] mythz|14 years ago|reply

I love companies like Instagram who take the time to share their scaling issues and experiences.

[+] thezilch|14 years ago|reply

You can find similar examples -- ruby code for the "exact" implementation as described by Instagram -- of this and other redis, memory optimization(s) at the Redis site: http://redis.io/topics/memory-optimization

[+] catwell|14 years ago|reply

Lookups are not really O(1), they're O(number of keys per hash) as long as the hashes are zipmaps. When they become full blown hashes the memory usage increases.

Still, this is a very good way to store a lot of key/value pairs in Redis.

[+] pjscott|14 years ago|reply

Big-O notation refers to asymptotic behavior. The zipmap encoding of hashes only matters for small values bounded by a constant, so hash lookups are still expected O(1) time in Redis.

</extreme-pedantry>

[+] pedigree|14 years ago|reply

Why use clear text numbers? Most of the time, you're going to be using large numbers, so binary pack them as save more space.

i had the same issue, normal storage was 1.1gb of space, HSET down to 200mb and binary packing every integer down dbl() bought it right down to 163mb of memory (32bit instance). For that 163mb, I was slicing a md5 of the field for the hset prefix, packing that and then using the remainer as the hset suffix. (due to the data format of the input field)

[+] eurleif|14 years ago|reply

Internally, redis stores integers as 64-bit binary values, not strings: http://redis.io/commands/incr.

[+] ctoestreich|14 years ago|reply

I would like to see some example code or a gist of what you did. I played around with custom number packing and using the hashset in redis inspired by the original thread topic and your comment. You can find the post here: http://www.christianoestreich.com/2011/11/redis-hashsets-per...

[+] nknight|14 years ago|reply

Redis is a series of tradeoffs between simplicity, performance, and power. For the moment, simplicity won as far as keys go -- they're all strings, period, end of story.

Supporting other key types is potentially useful, but brings complications, some of which will be visible to users of Redis.

Your own measurements are telling. Your memory usage went down by about 82%, packing the integers brought that to a bit less than 86%. It's easy to justify a small increase in complexity to save 82%, it's a lot harder to justify even more complexity to bring that number to 86%.

[+] zellyn|14 years ago|reply

Why use the whole Media ID as the key within the bucket, rather than just the last three digits?

[+] ctoestreich|14 years ago|reply

I think they were trying to keep the hash sizes to ~1000 elements each key. At a 3 digit key you would be increasing the hashes to 300k elements each. You should augment the scripts and see if that has an impact on the performance, would be curious to see.

[+] examancer|14 years ago|reply

The hash data type is probably the most awesome and underused feature of redis. Here is a small gem I wrote to expose it a little better in ruby: https://github.com/lyconic/redis-native_hash

54 comments