Or, instead of porting Lucene, just take its main concepts (analysis, tokenization, an in-memory trie or binary search tree, the query parser, terms, queries and collectors) and implement them the way _you_ would, using whatever bitmaps and file formats _you_ see fit when serializing to disk. If you iterate enough times, you'll find that as your understanding of these concepts has grown, your code base has turned into quite a close approximation of Lucene, with the same flaws and the same strengths.
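To make that concrete, here is a minimal sketch of where a first iteration might start: an analyzer, an in-memory term dictionary with postings, and an AND query. The names `analyze` and `TinyIndex` are made up for illustration and none of this is Lucene's API. A plain dict stands in for the trie, and a sorted list of doc ids stands in for the collector.

```python
import re
from collections import defaultdict


def analyze(text):
    """Analysis step: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index: term -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, text):
        """Store a document and record each of its terms in the postings."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in analyze(text):
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, query):
        """Return the sorted ids of documents containing every query term."""
        terms = analyze(query)
        if not terms:
            return []
        # .get avoids creating empty entries in the defaultdict on a miss
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)
```

From here the iterations suggest themselves: swap the dict for a trie to get prefix queries, swap the sets for bitmaps, add scoring to the collector, then work out how to write it all to disk.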

If you feel Lucene is close to a global optimum, or at least a very high local one, then by now you know search.

Now the real fun begins, because now you get to implement your own search model as either a Lucene model or one that is supported by your own code.

There are ten or twelve really interesting problems to solve before you can call it a day. Before you're done, you'll start to see everything, and I do mean everything, as a search problem. Use Wikipedia as your tutor, but prepare to have your world view completely transformed, because what's a search problem, really?

What's a word? What's a phrase? What's their meaning? What's this word's meaning in the context of these other words? What's a paragraph? What's the meaning of this paragraph, in the context of these other paragraphs? What are some of the patterns we can perceive in the binary representation of our data? What are some of the patterns we can perceive in the vector space representation of our data? The answers to these questions and more lie in the search model, and implementing one is the most fascinating thing there is, because no one, not even Google's top engineer, knows what the correct set of questions is.
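As one possible first answer to the vector space question, here is a sketch that treats a text as a bag of word counts and compares two texts by cosine similarity. The function names `vectorize` and `cosine` are my own, whitespace splitting is an assumed stand-in for real analysis, and this is only the simplest such model, not a claim about what the right one is.

```python
import math
from collections import Counter


def vectorize(text):
    """Map a text to a sparse term-count vector (assumes whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Texts that share no words score 0.0 and identical texts score about 1.0, which is exactly where the questions about context and meaning start to bite.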
