Hey, great project - the more competition in this space the better. To be honest, at the moment the algorithm doesn't return any sensible results for anything (at least that I can find), but I hope you can find a way past this, as it's a great space for a project.
I've included some search terms below that I've tried - I've not cherry-picked these and believe they are indicative of current performance. Some of these might be down to the size of the index - however, I suspect it's actually how the search is being parsed/ranked (in particular, I think the top two examples show that).
> Search "best car brands"
Expected: Car Reviews
Returns a page showing the best mobile phone brands.
then...
> Then searching "Best Mobile Phone"
Expected: The article from the search above.
Returns a gizmodo page showing the best apps to buy... "App Deals: Discounted iOS iPhone, iPad, Android, Windows Phone Apps"
> Searching "What is a test?"
Expected result: Some page describing what a test is, maybe wikipedia?
Returns "Test could confirm if Brad Pitt does suffer from face blindness"
> Searching "Duck Duck Go"
Expected result: DDG.com
Returns "There be dragons? Why net neutrality groups won't go to Congress"
> Searching "Google"
Expected result: Google.com
Returns: An article from the independent, "Google has just created the world’s bluest jeans"
I guess that's the real problem. People like to wonder what would be the "ideal world" in a search engine. It may be wishful thinking, I don't know.
It seems really hard to produce quality search results. Takes a lot of investment. Makes it an expensive product. But no one wants to pay. So selling ads is the only way forward.
Maybe there's a way to convince people to pay what it takes? I dunno...
I was curious and tried a bunch of other searches, with similarly disappointing results. My searches were a bit more esoteric than Closi's.
"langlands program" (pure mathematics thing): yup, top result is indeed related to the Langlands program, though it isn't obviously what anyone would want as their first result for that search. Not bad.
"asmodeus" (evil spirit in one of the deuterocanonical books of the Bible, features extensively in later demonology, name used for an evil god in Dungeons & Dragons, etc.): completely blank page, no results, no "sorry, we have no results" message, nothing. Not good.
"clerihew" (a kind of comic biographical short poem popular in the late 19th / early 20th century): completely blank page. Not good.
"marlon brando" (Hollywood actor): first few results are at least related to the actor -- good! -- but I'd have expected to see something like his Wikipedia or IMDB page near the top, rather than the tangentially related things I actually got.
"b minor mass" (one of J S Bach's major compositions): nothing to do with Bach anywhere in the results; putting quotation marks around the search string doesn't help.
"top quark" (fundamental particle): results -- of which there were only 7 -- do seem to be about particle physics, and in some cases about the top quark, but as with Marlon Brando they're not exactly the results one would expect.
"ferrucio busoni" (composer and pianist): blank page.
"dry brine goose" (a thing one might be interested in doing at this time of year): five results, none relevant; top two were about Untitled Goose Game.
"alphazero" (game-playing AI made by Google): blank page. Putting a space in ("alpha zero") returns lots of results related to the word "alpha", none of which have anything to do with AlphaZero.
OK, let's try some more mainstream things.
"harry potter": blank page. Wat. Tried again; did give some results this time. They are indeed relevant to Harry Potter, though the unexpected first-place hit is Eric Raymond's rave review of Eliezer Yudkowsky's "Harry Potter and the Methods of Rationality", which I am fairly sure is not what Google gives as its first result for "harry potter" :-).
"iphone 12" (confession: I couldn't remember what the current generation was, and actually this is last year's): top results are all iPhone-related, but the first one is about the iPhone 6, the second is from 2007, the third is about the iPhone 6, the fourth is from 2007, the fifth is about the iPhone 4S, etc.
"pfizer vaccine": does give fairly relevant-looking results, yay.
> We plan to start work on a distributed crawler, probably implemented as a browser extension that can be installed by volunteers.
Is there a concern that volunteers could manipulate results through their crawler?
You already mentioned distributed search engines have their own set of issues. I'm wondering if a simple centralised non-profit fund à la Wikipedia could work better to fund crawling without these concerns. One anecdote: personally, I would not install a crawler extension, not because I don't want to help, but because my internet connection is pitifully slow. I'd rather donate a small sum that would go way further in a datacenter... although I realise the broader community might be the other way around.
[edit]
Unless the crawler was clever enough to merely feed off the sites I'm already visiting and use minimal upload bandwidth. The only concern then would be privacy. Oh, the irony - but trust goes a long way.
The central problem with this and similar endeavors: nobody is willing to pay what they are worth in ads. Let's say the average Google user in the US earns them $30/year. Are you willing to pay $30/year for an ad-free Google experience? Great! We now know that you are worth at least $60/year.
That little thought experiment is true for many online services, from social networking to (marginally) publishing. But nowhere is it more true than for search, where the ads differ in two fundamental ways: being text-only, they don't bother me anywhere near as much as other ads. And, second, they are an order of magnitude more valuable than drive-by display ads, because the searcher has indicated a need and a willingness to visit a website that isn't among their bookmarks. These two, combined, make this the worst possible case for replacing an ad-based business with a donation model.
The idea mentioned in this readme that "Google intentionally degrades search results to make you also view the second page" is also wrong, bordering on self-delusion. The typical answer to conspiracy theories works here: there are tens of thousands of people at Google. Such self-sabotage would be obvious to many people on the inside, far too many to keep something like this secret.
I would consider Google randomly excluding the most relevant words from my search query intentionally degrading results. It's incredibly frustrating. This shouldn't be the default behavior, maybe an optional link the user can click to try again with some of the terms excluded.
Yes, I know verbatim mode exists, but I always forget to enable it, and the setting eventually gets lost when my cookies are cleared or something.
Unfortunately I can't switch to another search engine because in my experience every other search engine has far inferior results, despite not having the annoying behaviors Google does. DuckDuckGo is only useful for !bangs for me.
> The central problem with this and similar endeavors: nobody is willing to pay what they are worth in ads. Let's say the average Google user in the US earns them $30/year. Are you willing to pay $30/year for an ad-free Google experience? Great! We now know that you are worth at least $60/year.
Is this relevant for non-profit project? Do you pay $30/year for Wikipedia?
1. What is the rationale behind choosing Python as an implementation language? Performance and efficiency are paramount in keeping operational costs low and ensuring a good user experience, especially if the search engine will be used by many users. I guess Python is not the best choice for this, compared to C, Rust or Java.
2. What is the rationale behind implementing a search engine from scratch versus using existing Open Source search engine libraries like Apache Lucene, Apache Solr and Apache Nutch (crawler)?
Also https://searchmysite.net/ for personal and independent websites (essentially a loss-leader for its open source self-hostable search as a service).
In the early Web 2.0 days, it was very in for things to be spelled unpronounceably. For the life of me I can only remember Twttr, but I wanna say Spotify also had an unreadable name in the early days.
I have this feeling that most of the time I "search" for something, I already know what I'm looking for, but Google via Firefox's omnibox is just the fastest way to get there, even though it's a bit indirect. Are they getting paid for that, or am I costing them money in the short term, while they get to build up a profile on me to provide more effective ads later?
I wonder if it's possible to take advantage of that type of search by putting a facade in front of the "search engine": based on the search term and the private local user history, either go direct to a known site or, if a search seems needed, go to a specific search engine. This may open up opportunities for, say, programming-language-specific search engines, searches specific to a program's error messages, or shopping-for-X sites.
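A facade like that could start as a small rule-based router. Here's a minimal sketch - the known-site list, the marker strings, and the engine URLs are all hypothetical placeholders, not anything an existing project ships:

```python
from urllib.parse import quote_plus

# Hypothetical routing rules: marker fragments mapped to
# specialised engines; anything else falls through to a default.
SPECIALISED_ENGINES = {
    "error:": "https://stackoverflow.com/search?q={q}",
    "lang:python": "https://docs.python.org/3/search.html?q={q}",
}
KNOWN_SITES = {"hacker news": "https://news.ycombinator.com"}
DEFAULT_ENGINE = "https://duckduckgo.com/?q={q}"

def route(query: str, local_history: set[str]) -> str:
    q = query.strip().lower()
    # 1. If the user already knows the destination, skip search entirely.
    if q in KNOWN_SITES:
        return KNOWN_SITES[q]
    for visited_url in local_history:
        if q in visited_url:
            return visited_url
    # 2. Otherwise pick an engine based on the query's shape.
    for marker, engine in SPECIALISED_ENGINES.items():
        if marker in q:
            return engine.format(q=quote_plus(q))
    return DEFAULT_ENGINE.format(q=quote_plus(q))
```

A real version would live inside the browser, like the omnibox itself, so the history never leaves the machine.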
I bookmark every site I might possibly want to revisit - make a habit of Ctrl+D. They're totally unsorted, but the key is to wipe the regular history on exit, leaving only the bookmarks as source material for completion. That way I can type something in the url bar and get completion to interesting sites. The url bar (or omnibox) matches on page title as well as the actual address, so it's easy, and always faster than a search engine.
If you set DuckDuckGo as your default search provider, you can use bangs in the omnibox.
You can also toggle between local-area and global search. https://duckduckgo.com/bang e.g. !yt !osm !gi
Most wikis or resource/documentation sites have a local search bar on their homepage, and Firefox has a feature that lets you add a search keyword for that specific site. So if you add, say, pydocs as a keyword for docs.python.org, you can type "pydocs <query>" and it looks up the query on that site.
This is a business model I've been thinking about: what if users earned credits for running a crawler on their machine? In other words, as much as I hate crypto scams, a "tokenized" search engine where the "mining" power was put to good use, i.e. crawling and indexing.
YaCy is decentralized, but without the credit system. Some tokens, like QBUX, have tried to develop decentralized hosting infrastructure.
I also have been wondering how this would play out with some kind of decentralized indexes. The nodes could automatically cluster with other nodes of users sharing the same interests, using some notion of distances between query distributions. The caching and crawling tasks could then be distributed between neighbors.
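One concrete reading of "distance between query distributions" is the Jensen-Shannon divergence between each node's query-term frequencies; nodes would peer with their nearest neighbours under it. A sketch, with invented term counts:

```python
from collections import Counter
from math import log2

def js_divergence(p: Counter, q: Counter) -> float:
    """Jensen-Shannon divergence (base 2) between two term-frequency
    distributions: 0.0 = identical interests, 1.0 = fully disjoint."""
    terms = sorted(set(p) | set(q))
    total_p, total_q = sum(p.values()), sum(q.values())
    P = [p[t] / total_p for t in terms]
    Q = [q[t] / total_q for t in terms]
    M = [(x + y) / 2 for x, y in zip(P, Q)]  # midpoint distribution

    def kl(a, b):  # KL(a || b); terms with zero mass contribute nothing
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

# Invented query histories: alice and bob share interests, carol doesn't.
alice = Counter({"rust": 5, "compiler": 3, "llvm": 2})
bob = Counter({"rust": 4, "borrow": 2, "compiler": 4})
carol = Counter({"sourdough": 6, "starter": 4})
```

Under this metric alice would cluster with bob rather than carol, and the cluster could share crawl and cache duties for their common topics.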
Actually I have an idea for you: I think you can use cryptography to prove that an SSL session really happened. So you could prove indexing of HTTPS sites.
Make it open source and syndicate it. The goal is to get people to contribute both resources and code. Think of Shopify as the model, where many people contribute to create a huge shopping place. People care about their own shop, but ultimately they create a useful shopping area.
Also set up a foundation to guide its development and be able to hire a management team.
The real challenge is not the code development but setting up an organization that will outlast all the challenges that will appear. Wikipedia is the model to follow.
Do you yearn for explainability due to getting irrelevant search results? Is what you're searching for more specialized than what the public might consider general knowledge?
It’s really fast - nice job! Can you elaborate on the ranking algorithm you are using? It seems that this will become more important as you index more pages.
It looks interesting. However, the results appearing so fast as I type, and changing just as fast as I type more, makes it seem like it's flickering and it's painful on my eyes. Perhaps a slight delay and/or a fading effect as the results appear would be a bit easier for me to look at.
Update: there's been interest from a few people so I've started a Matrix chat here for anyone that wants to help out or provide feedback: https://matrix.to/#/#mwmbl:matrix.org
Congrats on the MVP path you took to launch your product. Generally, I think that there is a place for other variations of web search, be it in the way you crawl or perhaps how you monetize. I genuinely believe that it is really hard to build a general purpose search engine like DDG, Google and the like, but you can build a fairly good niche search engine. I'm particularly fond of the idea of community-powered curation in search. Just today I launched my own take on a community-driven search engine - https://github.com/gkasev/chainguide. If you'd like to bounce ideas back and forth with somebody, I'd be very interested to talk to you.
Off-topic [0]: I would be very interested in an economic model that would work for such a search engine. Donations are fine, but (imho) it will take much more than that to keep the lights on, let alone expand...
The "fairest" solution for both sides I can think of is ads which do not send tracking information, and are shown primarily based on search terms and country, or even other parameters that the visitor has set explicitly. Any other ideas on how to finance such an engine so that incentives are aligned?
[0]: EDIT: off-topic because the page clearly states that this project will be financed with donations only.
The model my search uses is for the public search to essentially be a loss leader for the search as a service - site owners can pay a small fee to access extra features such as being able to configure what is indexed, trigger reindexing on demand, etc. It also heavily downranks pages with adverts, to try to eliminate the incentive for spamdexing.
Wikimedia has an estimated $157m in donations this year. If we could get a small fraction of this amount we should be able to build something pretty good.
1. They somewhat get around this with their maps feature, but their regular search doesn't actually search by area; you always get national websites that optimize the best. That would be a nice feature to have starting out without having to type in the specific area you're looking for.
2. Search results for hotels that actually work! Not only if they're set up on OTA's! This could actually get your search engine some traction as the search engine to go to when making travel plans which would give you a nice niche to start out in.
If you filed to become a non-profit, could people "donate" their engineering time as a tax write-off? If you work out the legality of something like this and make it easy to do, it could inspire a lot of collaboration on the project, and I can see a bunch of other areas (outside of search) where services could be provided like this. I'm also sure having a non-profit would make it easier to find cheap hosting, which is a large part of the cost there.
Non-profit search engines are needed. They will probably still be vulnerable to SEO, but will more likely be resistant to being corrupted by the interests of "investors".
Congrats! Very nice to see results being lightning fast, I am getting 100-120ms response with network overhead included and that is impressive. The payload size of only 10-20kb helps immensely, good job!
I've built something similar called Teclis [1] and in my experience a new search engine should focus on a niche and try to be really, really good at it (I focused on non-commercial content for example).
The reason is to be able to narrow down the scope of content to crawl/index/rank, and hopefully, with enough specialization, to be able to offer better results than Google for that niche. This could open doors to an additional monetization path: API access. Newscatcher [2] is an example of where this approach worked (they specialized on "news").
[1] http://teclis.com
[2] https://newscatcherapi.com/
Okay, the cynical quip is "All search engines other than Google's are 'non-profit'." :-) But the reasons for that won't fit in the margin here.
Building search engines is cool and fun! They have what seems like an endless source of hard problems that have to be solved before they are even close to useful!
As a result people who start on this journey often end up crushed by the lack of successes between the start and the point where there is something useful. So if I may, allow me to suggest some alternatives which have all the fun of building a search engine and yet can get you to a useful place sooner.
Consider a 'spam' search engine. Which is to say, a crawler that you work to train on finding spammy, useless web sites. Trust me when I say the current web is a "target rich environment" here. The purpose would be not so much to provide a complete search engine as to provide something like what the realtime blackhole list did for email spam: come up with a list of URLs that could be easily checked with a modified DNS-type server (using the DNS protocol, but expressly for the purpose of answering the query 'Is this URI hosting spam?' in a rapid fashion).
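For reference, that is exactly how email RBLs work over DNS: you query a name built from the thing being checked, and any positive answer means "listed". A sketch of the URL-flavoured version - the list zone spam.example-list.org is made up, but SURBL and the Spamhaus DBL use this same convention for domains:

```python
import socket

def spam_query_name(host: str, list_zone: str = "spam.example-list.org") -> str:
    """Build the DNSBL-style query name for a host: prepend the
    host to the (hypothetical) list's zone, RBL-fashion, e.g.
    'badsite.example' -> 'badsite.example.spam.example-list.org'."""
    return f"{host.rstrip('.')}.{list_zone}"

def is_listed_as_spam(host: str, list_zone: str = "spam.example-list.org") -> bool:
    """A listed host resolves (conventionally to a 127.0.0.x code);
    an unlisted host gets NXDOMAIN, which we treat as 'not spam'."""
    try:
        socket.gethostbyname(spam_query_name(host, list_zone))
        return True
    except socket.gaierror:
        return False
```

NXDOMAIN means "not listed", so the common case costs one (usually cached) DNS round trip - cheap enough for a browser plugin or proxy to check every link.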
There are two "go to market" strategies for such a site. One is a web browser plugin that would either pop up an interstitial page saying "Don't go here, it is just spam" when someone clicked on a link, or a Greasemonkey-style script which would add an indication to a displayed page that a link was spammy (like setting the anchor to blinking red or something). The second is to sell access to this service to web proxies, web filters, and Bing, which could in the course of their operation simply ignore sites that appeared on your list as if they didn't exist.
You will know you are successful when you are approached by shady people trying to buy you out.
Another might be a "fact finding" search engine. This would be something like Wolfram Alpha but for "facts." There are lots of good AI problems here: one which develops a knowledge tree based on crawled and parsed data, and one which answers factual queries like 'capital of alaska' or 'recipe for baked alaska'. The nice thing about facts is that they are well protected against claims of copyright infringement, so people really can't come after you for reproducing the fact that the speed of light is about 300,000 km/s, even if they can prove you crawled their web site to get that fact.
Faaak | 4 years ago:
Apart from that, the misconception that "python is slow" should die :-)
legofr | 4 years ago:
https://www.ecosia.org/
https://ask.moe/
https://ekoru.org/
I remember seeing one more non-profit search engine on HN but can't seem to find it right now.
daoudc | 4 years ago:
"Ecosia is a search engine based in Berlin, Germany. It donates 80% of its profits to nonprofit organizations that focus on reforestation" [1]
"80% of profits will be distributed among charities and non-profit organizations. The remaining 20% will be put aside for a rainy day." [2]
"Ekoru.org is a search engine dedicated to saving the planet. The company donates 60% of revenue generated from clicks on sponsored search results to partner organizations who work on climate change issues" [3]
[1] https://en.wikipedia.org/wiki/Ecosia [2] https://ask.moe/ [3] https://www.forbes.com/sites/meimeifox/2020/01/19/how-the-se...
asicsp | 4 years ago:
Probably this one? "A search engine that favors text-heavy sites and punishes modern web design" https://news.ycombinator.com/item?id=28550764 (3 months ago, 717 comments)
alexdowad | 4 years ago:
"To Google" has entered the English lexicon as a verb, but I don't think anybody will ever say they "mwmbled" something.
ChemSpider | 4 years ago:
Ideally with explainable AI (XAI) that can tell me WHY is result A ranked higher than result B. I would even pay a monthly subscription to use it.
luckylion | 4 years ago:
I think donations are probably workable. It works in the private tracker scene; the larger ones have "donation meters" and never seem to fall behind.
It could also work on a subscription model which is essentially just formalizing the donations and making it easier to plan cash flow.
marginalia_nu | 4 years ago:
You'd be surprised how cheap a search engine can be to operate. My search.marginalia.nu has a burn rate of less than $100/month.