I've done quite a bit of work with dictionaries going as far back as 8-bit machines with tight memory constraints. Bear with me here as it will get interesting.
A Bloom filter is a great data structure but probably not the best choice for this problem. It's not as memory efficient as you think.
Here's why: think of a Bloom filter as a collection of hash values where each individual hash value is expanded by about 44%. Maintaining the same false positive rate as an explicit log2(n)-bit hash costs about 1.44*log2(n) bits (see http://en.wikipedia.org/wiki/Bloom_filter for more detail). So if you want a false positive rate of 1 in 64k, you could store a 16-bit hash in an array or spend about 23 bits per entry in a Bloom filter. That's a significant tax to pay when memory is tight.
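That arithmetic is easy to check directly; here's a quick sketch (function names are mine, not from the thread):

```python
import math

def explicit_hash_bits(p):
    # Bits needed so a random hash collides with probability p.
    return math.log2(1 / p)

def bloom_bits_per_element(p):
    # Optimal Bloom filter size: log2(e) * log2(1/p), i.e. ~1.44x the
    # explicit hash, per stored element.
    return math.log2(math.e) * math.log2(1 / p)

p = 1 / 65536  # a 1-in-64k false positive rate
print(explicit_hash_bits(p))      # -> 16.0
print(bloom_bits_per_element(p))  # -> ~23.08
```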
In this case it's better to keep a packed array of explicit hashes than to use a Bloom filter.
An array of hashes gives you some options to make it smaller:
1. Sort the array by hash value and store only the differences using a delta coding scheme. The downside is that you have to scan the entire array (or maintain some indexes into it) to look up a single word.
2. Break the array up into sub-arrays, where each sub-array holds the hashes whose first N bits are the same (those bits can then be discarded). Each sub-array stores a smaller hash value and can still be binary searched.
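A minimal sketch of option 1, assuming 16-bit hashes and a simple base-128 varint for the deltas (the encoding details are my own illustration, not something from the thread):

```python
def delta_encode(hashes):
    # Sort, then store each hash as the gap from the previous one,
    # using 7 bits per byte with the high bit meaning "more bytes follow".
    out = bytearray()
    prev = 0
    for h in sorted(hashes):
        d = h - prev
        prev = h
        while d >= 0x80:
            out.append((d & 0x7F) | 0x80)
            d >>= 7
        out.append(d)
    return bytes(out)

def delta_decode(data):
    # Inverse: accumulate 7-bit groups, then add the gap to a running total.
    hashes, cur, acc, shift = [], 0, 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            cur += acc
            hashes.append(cur)
            acc, shift = 0, 0
    return hashes
```

Small, dense gaps encode in a single byte each, which is where the savings over a flat 16-bit array come from; the price, as noted above, is that membership tests must scan rather than binary search.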
Of course, strip prefixes and suffixes from the words so the number of hashes that must be stored is reduced.
1.5 bytes per word is easy and with some work you can approach one byte per word with a low error rate.
If you're allowed to do off-line preprocessing on the word list you can do better.
Look up the history of Doug McIlroy's Unix spell checker. He got 75,000 words to fit in about 52k (going from memory here - my copy of Bentley's "Programming Pearls" is at home). McIlroy traded away random access and accepted a fairly high error rate to get that much compression. Essentially, he created a large Bloom filter with a single hash function and delta-encoded the gaps between set bits (similar to option #1 above).
If you want to maintain fast random access for individual words you can do this with a Directed Acyclic Word Graph (DAWG). A simple DAWG with fixed node sizes and no stemming of any kind will get two bytes per word on English. Add variable node sizes, stemming and a few other tricks and you can have fast lookup with a near-zero error rate at a cost of about one byte per word.
It seems to me that this is a classic case of "guess the answer I'm looking for". The interviewer is obviously impressed with the concept of a Bloom filter - and I'm not saying there's anything wrong with Bloom filters - but that kind of thinking in an interview is dangerous. What if the interviewee disagrees with your views and has strong arguments for a completely different approach? Would you accept it, would you at least acknowledge it as valid, or would you stick with your solution as 'the only right way to do it'?
I wrote it, so I'll respond: I do like the concept of a Bloom filter and think it's a pretty good solution to this problem, but I don't really expect anyone to get to that stage on their own - and very few people do. I use it as a destination, and when we finally get there we get to talk about it. You can get a sense of how good a developer is from how quickly they grasp the concept, as well as from the discussion that ensues.
I view an interview as more of a conversation and love learning about new concepts.
I think any question can be dangerous in the wrong hands and suffer from "guess the answer I'm looking for", so it's always up to the interviewer to be unbiased.
If your interview question's success or failure is highly correlated with the person's ability to apply a fairly esoteric data structure it's probably a bad question.
The person who has never come across Bloom filters before has a huge disadvantage in your question, and the person who has seen them before has a huge advantage. I'd guess that as soon as a candidate says "Bloom filter" before you do in an interview, it's an instant "hire" for you, but this is a bad sign since you will get lots of false negatives and false positives.
False negatives will be good devs who never needed a Bloom filter (there are many), and false positives will be green devs who happened to read the Wikipedia article on Bloom filters just before the interview, or who once had a question where that data structure was introduced the way you're introducing it, etc.
So you're saying this interview question is a probabilistic (with the potential for false positives) tool for detecting membership in the set of people who are equipped to understand the concept? Sounds like a Bloom Filter to me.
Nah - if it's obvious the person has heard of a Bloom Filter before it's not a fair question to ask them but I will talk to them about the relationship between the size of the bit array and the number of hash functions to measure how deep the understanding goes.
I actually prefer it when people have never heard of a Bloom filter so that we can actually have a real discussion.
I'd probably just keep a hash of each of the words in the first list rather than all the actual words. Simple & easy to understand for whoever has to deal w/ the code after me.
I'm not 100% convinced that "ability to understand bloom filters after 5 minutes of explanation on them" correlates very highly with what I tend to look for when making a hire. Then again, hiring is such a crap shoot anyways. Mostly you just want to get someone talking about programming for a while and try to determine if they know their shit or not. The exact topic is close to irrelevant.
You pretty much hit the nail on the head with the "get someone talking about programming" piece, but I try to start with one problem and dig deeper into it rather than having many questions, as if going down a checklist.
That's why you also have more than one person do the interview so you can get multiple perspectives.
My personal favorite solution to this problem is to use a trie.
When asked to reduce the memory usage, you can recognize that the trie is actually a DFA and apply a DFA-minimization algorithm. There are several other interesting approaches here: http://en.wikipedia.org/wiki/Trie#Compressing_tries
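A rough sketch of the trie-then-minimize idea: because a trie over a finite word list is acyclic, hash-consing identical subtrees is enough to approximate DFA minimization and yields the DAWG mentioned elsewhere in the thread (the helper names here are my own, hypothetical ones):

```python
def build_trie(words):
    # Plain dict-of-dicts trie; "$" marks end of word.
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

def merge_subtrees(node, seen=None):
    # Post-order hash-consing: identical subtrees (shared suffixes like
    # "-ing", "-s") are stored exactly once.
    if seen is None:
        seen = {}
    for ch, child in node.items():
        node[ch] = merge_subtrees(child, seen)
    key = tuple(sorted((ch, id(c)) for ch, c in node.items()))
    return seen.setdefault(key, node)
```

Lookups work unchanged on the merged structure, since merging only aliases nodes that accept the same set of suffixes.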
The responses here are amazing. No wonder hiring is so broken for developers. Someone actually comes up with a good question and he's practically attacked for it.
Well it's an ongoing process to come up with the question and if I can get better at interviewing it helps me as well.
I usually ask what people think of the question after the interview and the responses tend to be pretty positive (although it may be due to the fact that they are interviewing). It's a great feeling exposing a new concept to a developer though and seeing them get excited about it - those are the exact people you want to work with.
Here is a similar interesting problem that I was once posed. Not sure how common or rare this is:
You get a stream of words (using functions like getFirst(), getNext(), boolNoneRemaining(), etc.). Once you reach the end of this stream, you have to output a random word uniformly chosen from this list.
The first answer is of course to store the entire list and track its size. The moment the stream finishes, you multiply the size by a random number to pick an index ...
Then comes the twist. What if there isn't enough memory to hold the whole list?
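The standard answer to the twist - not spelled out in the comment itself - is reservoir sampling with a reservoir of size one:

```python
import random

def random_word(stream):
    # Keep one candidate; replace it with the i-th word with probability
    # 1/i. Each of the n words is selected with probability exactly 1/n,
    # using O(1) memory and without knowing n in advance.
    choice = None
    for i, word in enumerate(stream, start=1):
        if random.randrange(i) == 0:  # true with probability 1/i
            choice = word
    return choice
```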
The thing that struck me first is that the mere fact that someone posed this problem meant there was probably a solution, which I could then think of and program shortly.
Now frankly, when I interview, this is usually the last question on my list. Most candidates do not get this far. Believe it or not, about half of the applicants are not able to draw a finite state machine that determines whether a given binary number is odd or even (when the input is a stream of bits with the LSB arriving last...).
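For what it's worth, with the LSB arriving last the machine only has to remember the most recent bit, since that bit alone decides parity; a sketch of that two-state idea:

```python
def is_odd(bits):
    # Two states, "even" and "odd", tracking the last bit seen.
    # An empty stream is treated as zero, i.e. even.
    state = "even"
    for b in bits:
        state = "odd" if b == 1 else "even"
    return state == "odd"
```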
Just goes to show that there are almost always better solutions than you can think of!
Given the performance characteristics of human typing speed, I would use a disk-based database. Simple to implement.
I don't think the trie will work as memory keeps on decreasing - is there a way to make a trie probabilistic?