Why retire the question just because you saw it on Glassdoor? The same question was posted to sureinterview last year, so anyone you hired since November is suspect! (If you believe that potential advance knowledge of the question is relevant).
Worse yet, even if you did come up with this problem on your own, this exact problem was a fairly common interview question back in the mid 90s, when string processing interview questions were all the rage for C/C++ programming jobs -- I must have seen it a dozen times in various interviews over the years. The problem is familiar enough that I'd bet a decent amount of money that it must be listed in one of those interview problem books that were popular before the websites for this stuff started showing up over the past couple years... If you really relied on the interviewee having no possible advance knowledge of this question prior to the interview you surely had a false sense of security prior to seeing it appear on glassdoor.
As long as you engage the interviewee to see that s/he really understands the answers they are giving, I don't really see why it matters if the question has appeared on one of these sites. If the interviewee is preparing for their interviews enough that they are actually looking at these sites and understanding the answers to the point where they can intelligently discuss them, that probably already puts them in the upper tier of people you'd want to seriously consider hiring, so retiring the question is probably counter-productive unless you have a non-disclosed alternative that you're sure is as good of a question.
A spiritually similar question at a previous employer resulted in many candidates attempting to iterate over the dictionary rather than iterating over the string. We hired them. At least they could iterate over a dictionary. That's a surprisingly rare skill in the hiring pool.
Maybe I'm just getting cynical in my old age, but sometimes I think the world is awash in incompetence. We see so much of it in tech because our incompetencies are (marginally) harder to hide.
>sometimes I think the world is awash in incompetence. We see so much of it in tech because our incompetencies are (marginally) harder to hide.
A very candid Philip Greenspun in Founders@Work( p341-2) : "In aviation, by the time someone's your employee, their perceived skill and actual skill are reasonably in line. JFK Jr is not working at a charter operation because he's dead. In IT,you have people who think 'I am a great programmer'. But where are the metrics that prove them wrong ? Traffic accidents are very infrequent, so they don't get the feedback that they are a terrible driver.
Programmers walk around with a huge overestimate of their capabilities. That's why a lot of them are very bitter. They sit around stewing at their desks. That's why I don't miss IT, because programmers are very unlikable people. They are not pleasant to manage."
I like that you accepted new approaches as valid even if a candidate did not finish them. Reading this I can’t help but think how this person falls into the java trap of trying to make a ridiculously generic solution which I would consider a danger sign but plenty of people love to see.
So, assuming a real world input and real world dictionary you can try plenty of things that break down for a dictionary that includes four hundred A’s but are actually valid solutions. Also, if you want a fast real world solution then sticking with a pure lookup dictionary would slow things down. Being able to toss 5 letters into a lookup table that says the longest and shortest words that start with those letters would save a lot of time. Basically, ‘xxyzy’ = null, null saves you from looking up x to the maximum word length in your dictionary. Secondly sanitizing inputs is going to be both safer and faster. And anything you are coding in 45 minutes is going to be a long way from production ready code.
PS: Even with backtracking you can code an O(n) solution for vary large input sets. Just keep track of the K best output sequences that are aka N, N-1,N-2,…N-K.
Many years ago ('97) we had a Java programming test for new hires that included a question that involved iterating over a Hashtable containing String keys and values and printing them all out.
Nobody ever answered that question correctly and I used to joke that if anyone ever did we'd hire them immediately.
NB I've long since stopped using those kind of trivia questions in interviewing, but I didn't know any better at the time.
I hope readers aren't getting the impression from this article that the code examples provided are the correct way to do word segmentation in English. (Though I understand this is an article about interviewing and not about word segmentation. And this might be considered a preprocessing step for doing things correctly...)
Norvig gives a very approachable version of English word segmentation that uses a language model below.
Peter actually emailed me directly, though I was already very familiar with his work on this and similar problems. But yes, I make it very clear in the post (and to candidates) that they should not assume an English -- or even a natural language -- dictionary.
I had a similar question with a twist asked of me during an interview. It went something like this:
Given a list of all the short strings in the periodic table of elements (e.g. Na, F, Al, etc) and a list of all the words in the English language: 1) write a method that finds the longest possible English word you can spell given any combination of the strings in the periodic table of elements. Re-usage of elements in the same string are allowed. 2) Describe what kind of data types you would want for the two lists and describe anything special about them. 3) Give a big O estimation.
I would have thought of dynamic programming when asked if I could solve it more efficiently, but I've somehow just not had much occasion to use dynamic programming, so would have had to admit it would take an evening of reviewing my algorithms books to actually solve it that way.
However, I would have been able to come up with an alternative O(n^2) solution, involving building a directed graph with vertices representing each position in the string and the end of the string, with edges connecting two vertices if there is a dictionary word starting at the position corresponding to one vertex and ending just before the position corresponding to the second vertex. This can be done in O(n^2), and then you can find the shortest path from the vertex for 0 to the end vertex on this graph in O(n^2) (e.g., by Dijkstra's algorithm), and that gives you an exact covering of the string using the minimal number of dictionary words.
Your approach is correct. The graph that you've built is a DAG. This means that, you can solve for the lower vertices before solving for the upper ones. This is exactly how a typical Dynamic Programming solution works. If for solving a problem, we can see the relationship b/w this problem and other similar but smaller ones, we solve the smaller problems first and use these solutions to build up a solution for the larger case.
EDIT: A naive implementation of your idea can take upto O(n3m) time where m is the no. of words in dictionary. Using a fancy data structure like a trie or suffix tree or an automaton can speed it upto O(n^2). Can we better that?
Actually, my original implementation used Dijkstra's algorithm. But I think the linear table of suffixes (or prefixes if you prefer) is a cleaner approach.
I love it. You spark an (imho) much more interesting debate: that of raw smartness vs pragmatic productivity.
A colleague of mine (now hired) was asked to code a string compare during her interview for a .NET function. She said "String.Compare()". This puzzled the interviewers for a while. They asked her whether she could write it out, she said she didn't see the point.
They ended up hiring her for other reasons (notably, references from trusted colleagues who had worked with her), but I still wonder whether that attitude worked for her or against her.
The correct answer would be ['aa', 'ab'] but unfortunately findall works greedily and so will not find the optimal solution. It is possible to specify it as a regex, but common implementations might take too much time to come up with the good solution.
There's a library for constructing a deterministic regular expression from a dictionary, by the way, which would right away give you the exponential-time result. If you wrapped it in a * and applied it with a guaranteed-linear-time regular-expression engine like RE2, you'd find out whether the string was segmentable (and as a bonus you wouldn't have to construct the deterministic RE yourself) but I don't know if you'd get the actual segments.
Here's my Haskell attempt (both naive recursive backtrack and dynamic programming version):
type Dict = Set String
hasWord :: Dict -> String -> Bool
hasWord = flip Set.member
fmtWords :: [[String]] -> Maybe String
fmtWords = fmap (intercalate " ") . listToMaybe
splitWordsSimple :: Dict -> String -> Maybe String
splitWordsSimple dct = fmtWords . go where
go [] = return []
go s = do
(i,t) <- zip (inits s) (tails s)
guard $ hasWord dct i
(i:) <$> go t
splitWordsDP :: Dict -> String -> Maybe String
splitWordsDP dct s = fmtWords $ a ! 0 where
a = listArray (0, len) $ map f [0..len-1] ++ [[[]]]
len = length s
f i = do
l <- [1..len-i]
let w = take l $ drop i s
guard $ hasWord dct w
(w:) <$> a ! (i+l)
let dict = Data.Set.fromList["apple", "pie", "bread", "applepie", "piebread"]
let fn q = concat.Data.List.map (\(x,y) -> if (member x dict) then if y == "" then [[x]] else Data.List.map (x:) (fn y) else [] ) $ tail $ zip (inits q) (tails q)
Humbling way to start the work week. I could produce the fizzbuzz solution and in my sharper days the recursive backtracing one, but definitely no further.
I wouldn't feel bad. The advanced answers to this question require spending a lot of time understanding string processing. It's like having a CSS question that can be implemented multiple ways: a simple, obvious, slow way and a complicated, "deep knowledge required", fast way. If you have lots of experience with CSS, you might get the fast way, but it doesn't really say how good a programmer you are.
(Yes, not a perfect analogy, but it hopefully gives the idea.)
Bit of a tangent, but a real question: If a writer doesn't approve of a site's behavior (in this case, Glassdoor is desribed as "a site that does not seem to mind when interview candidates violate NDAs"), why does the writer still link to them? Inbound links (w/o a nofollow) help sites, why help sites you don't like?
Point taken. I thought of not linking to them, or even not mentioning them. But they do play an important part in the story of the post, and I don't think I'm helping them so much as I'm raising employees' and employers' awareness of their existence. At least now a few more interviewers might check to see if their questions are posted there. From my limited data, interviewees are already more aware of Glassdoor than interviewers.
As for feeding them page rank, I don't think I have so much to offer that it helps them materially.
I think the concern about whether or not candidates have seen this, or any, programming question before is missing the point. Think about what we want in the ideal candidate -- we want them to come up with a good (elegant, efficient) solution to the problem, and implement it. We (judging by all the other responses) expect them to do that because they've had a solid CS education (formal or informal) as well as significant experience.
But people with that background will give good answers, even if they haven't seen _this specific problem_, because they have seen lots of problems like it and recognize the pattern. And even in that case, we evaluate them based on how well they can implement the pattern they saw, not just on whether they recognized the correct algorithm. So what if they've seen this problem already? Coding it up efficiently and elegantly in an interview context is still non-trivial, and you can still push them to discuss edge cases and performance tradeoffs.
The person who really has _never_ seen anything like this in his life, and still can give a good answer, I have yet to meet.
With the exception of the last part of the question, you learn everything there in your first year of CS at university. Do people who can't write this really put the language on their resume?
Can I get some stats? I really don't (want to) believe it. What percentage of people get this question wrong? Are they all some sort of eng/cs graduate? I'm not even a coder and I can solve this in a few minutes.
I don't have stats I can share, but I assure you that this problem has confounded many interview candidates with strong resumes. I agree with you that it's all basic material -- that's deliberate. I'm glad you think it's too easy. :-)
Its a great question. Pretty sure its not nearly as much of a secret as the author thinks though. I've seen a detailed write-up before somewhere (HN?) and I'm not even a programmer by profession.
Secret is a strong word -- I link to another post on the subject in the article. Still, seeing it on Glassdoor for my own employer crossed the line of disclosure.
I read this and went, I had someone build this for me years ago!
So I looked at the code, it used the efficient solution. Now I am even more impressed with the programmer. I've always thought highly of him (better than me) but it's hard to evaluate someone better than you. His solution ran circles around mine (I had general simple case with 2 words) and now I know exactly how much more efficient his solution was. Very neat.
Its a nice Dynamic programming problem. The beauty of DP is that simply memorizing one application of it does not guarantee you a solution to an entirely different problem that might have a similar Dynamic Programming solution. Look over the TopCoder SRM archives if you don't believe me.
So even though you are retiring this one, coming up with something similar that tests for basically the same things shouldn't be impossible.
The problem he's having is that good interview questions are getting busted, as people post solutions on the web.
If you have a lot of similar interview questions, then there's no way anyone other than a savant can memorize them without actually learning the theory.
Point taken. But it's hard to come up with good interview questions, as my colleagues here, at Google, and at Endeca can attest. In contrast, it's much easier to post solutions.
That's why I'm working on an approach that assumes the candidate does of prior knowledge of the problem. But not there yet.
The author just mentioned dynamic programming. Usually in dynamic programming, e.g., as in Dreyfus and Law, to say that a problem has a dynamic programming solution we outline the solution. But the author did not outline such a solution.
An outline usually includes at least the definition of the 'stages' of the dynamic programming solution. For the problem of 'string segmentation', the obvious selection of stages would be each of i = 1, 2, ..., n for the given string of length n. But this definition of the stages does not yield a dynamic program because the solution at stage i needs more than just the solution at stage i + 1 and, indeed, potentially needs the solutions at each of stages i + 1, i + 2, ..., n.
So, first-cut, there is no dynamic programming solution. For a second-cut, there might be a dynamic programming solution if the author would outline one!
There is now some question if the Google interviewers really understand dynamic programming!
The "obvious" selection of stages does work.
When you are at position j you check all i<j until you find one where there is a segmentation up to i, and the substring [i,j) is a word.
Memoization is just syntactic (semantic?) sugar on top of this and he provides the code that basically implements the above.
In programming competition circles it's common to just say "it's dynamic programming" when the stages are semi-obvious, and given that he uses memoization I'd say there's a strong chance he's been involved in Topcoder
[+] [-] georgemcbay|14 years ago|reply
Worse yet, even if you did come up with this problem on your own, this exact problem was a fairly common interview question back in the mid 90s, when string processing interview questions were all the rage for C/C++ programming jobs -- I must have seen it a dozen times in various interviews over the years. The problem is familiar enough that I'd bet a decent amount of money that it must be listed in one of those interview problem books that were popular before the websites for this stuff started showing up over the past couple years... If you really relied on the interviewee having no possible advance knowledge of this question prior to the interview you surely had a false sense of security prior to seeing it appear on glassdoor.
As long as you engage the interviewee to see that s/he really understands the answers they are giving, I don't really see why it matters if the question has appeared on one of these sites. If the interviewee is preparing for their interviews enough that they are actually looking at these sites and understanding the answers to the point where they can intelligently discuss them, that probably already puts them in the upper tier of people you'd want to seriously consider hiring, so retiring the question is probably counter-productive unless you have a non-disclosed alternative that you're sure is as good of a question.
[+] [-] patio11|14 years ago|reply
Maybe I'm just getting cynical in my old age, but sometimes I think the world is awash in incompetence. We see so much of it in tech because our incompetencies are (marginally) harder to hide.
[+] [-] dxbydt|14 years ago|reply
A very candid Philip Greenspun in Founders@Work( p341-2) : "In aviation, by the time someone's your employee, their perceived skill and actual skill are reasonably in line. JFK Jr is not working at a charter operation because he's dead. In IT,you have people who think 'I am a great programmer'. But where are the metrics that prove them wrong ? Traffic accidents are very infrequent, so they don't get the feedback that they are a terrible driver.
Programmers walk around with a huge overestimate of their capabilities. That's why a lot of them are very bitter. They sit around stewing at their desks. That's why I don't miss IT, because programmers are very unlikable people. They are not pleasant to manage."
[+] [-] onemoreact|14 years ago|reply
So, assuming a real world input and real world dictionary you can try plenty of things that break down for a dictionary that includes four hundred A’s but are actually valid solutions. Also, if you want a fast real world solution then sticking with a pure lookup dictionary would slow things down. Being able to toss 5 letters into a lookup table that says the longest and shortest words that start with those letters would save a lot of time. Basically, ‘xxyzy’ = null, null saves you from looking up x to the maximum word length in your dictionary. Secondly sanitizing inputs is going to be both safer and faster. And anything you are coding in 45 minutes is going to be a long way from production ready code.
PS: Even with backtracking you can code an O(n) solution for vary large input sets. Just keep track of the K best output sequences that are aka N, N-1,N-2,…N-K.
[+] [-] arethuza|14 years ago|reply
Nobody ever answered that question correctly and I used to joke that if anyone ever did we'd hire them immediately.
NB I've long since stopped using those kind of trivia questions in interviewing, but I didn't know any better at the time.
[+] [-] access_denied|14 years ago|reply
[deleted]
[+] [-] pbh|14 years ago|reply
Norvig gives a very approachable version of English word segmentation that uses a language model below.
http://norvig.com/ngrams/
[+] [-] dtunkelang|14 years ago|reply
[+] [-] JL2010|14 years ago|reply
Given a list of all the short strings in the periodic table of elements (e.g. Na, F, Al, etc) and a list of all the words in the English language: 1) write a method that finds the longest possible English word you can spell given any combination of the strings in the periodic table of elements. Re-usage of elements in the same string are allowed. 2) Describe what kind of data types you would want for the two lists and describe anything special about them. 3) Give a big O estimation.
I thought it was a great question :)
[+] [-] tzs|14 years ago|reply
However, I would have been able to come up with an alternative O(n^2) solution, involving building a directed graph with vertices representing each position in the string and the end of the string, with edges connecting two vertices if there is a dictionary word starting at the position corresponding to one vertex and ending just before the position corresponding to the second vertex. This can be done in O(n^2), and then you can find the shortest path from the vertex for 0 to the end vertex on this graph in O(n^2) (e.g., by Dijkstra's algorithm), and that gives you an exact covering of the string using the minimal number of dictionary words.
[+] [-] VinyleEm|14 years ago|reply
EDIT: A naive implementation of your idea can take upto O(n3m) time where m is the no. of words in dictionary. Using a fancy data structure like a trie or suffix tree or an automaton can speed it upto O(n^2). Can we better that?
[+] [-] dtunkelang|14 years ago|reply
[+] [-] StavrosK|14 years ago|reply
[+] [-] skrebbel|14 years ago|reply
A colleague of mine (now hired) was asked to code a string compare during her interview for a .NET function. She said "String.Compare()". This puzzled the interviewers for a while. They asked her whether she could write it out, she said she didn't see the point.
They ended up hiring her for other reasons (notably, references from trusted colleagues who had worked with her), but I still wonder whether that attitude worked for her or against her.
[+] [-] rfurmani|14 years ago|reply
>>> re.findall('(a|aa|aaa|ab)', 'aaab') ['a', 'a', 'a']
The correct answer would be ['aa', 'ab'] but unfortunately findall works greedily and so will not find the optimal solution. It is possible to specify it as a regex, but common implementations might take too much time to come up with the good solution.
[+] [-] kragen|14 years ago|reply
[+] [-] qntm|14 years ago|reply
[+] [-] pjscott|14 years ago|reply
[+] [-] nandemo|14 years ago|reply
[+] [-] shangaslammi|14 years ago|reply
[+] [-] ottbot|14 years ago|reply
[+] [-] eru|14 years ago|reply
[+] [-] zohebv|14 years ago|reply
<pre><code>
let dict = Data.Set.fromList["apple", "pie", "bread", "applepie", "piebread"]
let fn q = concat.Data.List.map (\(x,y) -> if (member x dict) then if y == "" then [[x]] else Data.List.map (x:) (fn y) else [] ) $ tail $ zip (inits q) (tails q)
fn "applepiebread" [["apple","pie","bread"],["apple","piebread"],["applepie","bread"]]
</code></pre>
[+] [-] wensing|14 years ago|reply
[+] [-] SoftwareMaven|14 years ago|reply
(Yes, not a perfect analogy, but it hopefully gives the idea.)
[+] [-] pkteison|14 years ago|reply
[+] [-] dtunkelang|14 years ago|reply
As for feeding them page rank, I don't think I have so much to offer that it helps them materially.
[+] [-] access_denied|14 years ago|reply
[deleted]
[+] [-] mynegation|14 years ago|reply
[+] [-] dtunkelang|14 years ago|reply
[+] [-] svdad|14 years ago|reply
But people with that background will give good answers, even if they haven't seen _this specific problem_, because they have seen lots of problems like it and recognize the pattern. And even in that case, we evaluate them based on how well they can implement the pattern they saw, not just on whether they recognized the correct algorithm. So what if they've seen this problem already? Coding it up efficiently and elegantly in an interview context is still non-trivial, and you can still push them to discuss edge cases and performance tradeoffs.
The person who really has _never_ seen anything like this in his life, and still can give a good answer, I have yet to meet.
[+] [-] Shenglong|14 years ago|reply
Can I get some stats? I really don't (want to) believe it. What percentage of people get this question wrong? Are they all some sort of eng/cs graduate? I'm not even a coder and I can solve this in a few minutes.
[+] [-] dtunkelang|14 years ago|reply
[+] [-] Havoc|14 years ago|reply
[+] [-] dtunkelang|14 years ago|reply
[+] [-] ohashi|14 years ago|reply
So I looked at the code, it used the efficient solution. Now I am even more impressed with the programmer. I've always thought highly of him (better than me) but it's hard to evaluate someone better than you. His solution ran circles around mine (I had general simple case with 2 words) and now I know exactly how much more efficient his solution was. Very neat.
[+] [-] mak120|14 years ago|reply
So even though you are retiring this one, coming up with something similar that tests for basically the same things shouldn't be impossible.
[+] [-] pavel_lishin|14 years ago|reply
[+] [-] dtunkelang|14 years ago|reply
[+] [-] wisty|14 years ago|reply
If you have a lot of similar interview questions, then there's no way anyone other than a savant can memorize them without actually learning the theory.
[+] [-] dtunkelang|14 years ago|reply
That's why I'm working on an approach that assumes the candidate does of prior knowledge of the problem. But not there yet.
[+] [-] unknown|14 years ago|reply
[deleted]
[+] [-] NY_Entrepreneur|14 years ago|reply
An outline usually includes at least the definition of the 'stages' of the dynamic programming solution. For the problem of 'string segmentation', the obvious selection of stages would be each of i = 1, 2, ..., n for the given string of length n. But this definition of the stages does not yield a dynamic program because the solution at stage i needs more than just the solution at stage i + 1 and, indeed, potentially needs the solutions at each of stages i + 1, i + 2, ..., n.
So, first-cut, there is no dynamic programming solution. For a second-cut, there might be a dynamic programming solution if the author would outline one!
There is now some question if the Google interviewers really understand dynamic programming!
[+] [-] rfurmani|14 years ago|reply
[+] [-] anonymoushn|14 years ago|reply