Show HN: I'm building a non-profit search engine

441 points| daoudc | 4 years ago |github.com | reply

199 comments

[+] Closi|4 years ago|reply
Hey, great project - the more competition in this space the better. To be honest, at the moment the algorithm doesn't return any sensible results for anything (at least that I can find), but I hope that you can find a way past this as it's a great place to have a project.

I've included some search terms below that I've tried - I've not cherrypicked these and believe they are indicative of current performance. Some of these might be the size of the index - however I suspect it's actually how the search is being parsed/ranked (in particular I think the top two examples show that).

> Search "best car brands"

Expected: Car Reviews

Returns a page showing the best mobile phone brands.

then...

> Then searching "Best Mobile Phone"

Expected: The article from the search above.

Returns a gizmodo page showing the best apps to buy... "App Deals: Discounted iOS iPhone, iPad, Android, Windows Phone Apps"

> Searching "What is a test?"

Expected result: Some page describing what a test is, maybe wikipedia?

Returns "Test could confirm if Brad Pitt does suffer from face blindness"

> Searching "Duck Duck Go"

Expected result: DDG.com

Returns "There be dragons? Why net neutrality groups won't go to Congress"

> Searching "Google"

Expected result: Google.com

Returns: An article from the independent, "Google has just created the world’s bluest jeans"

[+] rmbyrro|4 years ago|reply
I guess that's the real problem. People like to wonder what would be the "ideal world" in a search engine. It may be wishful thinking, I don't know.

It seems really hard to produce quality search results. Takes a lot of investment. Makes it an expensive product. But no one wants to pay. So selling ads is the only way forward.

Maybe there's a way to convince people to pay what it takes? I dunno...

[+] daoudc|4 years ago|reply
Thanks for the feedback! I'll take a look at your examples and see if I can improve the rankings.
[+] gjm11|4 years ago|reply
I was curious and tried a bunch of other searches, with similarly disappointing results. My searches were a bit more esoteric than Closi's.

"langlands program" (pure mathematics thing): yup, top result is indeed related to the Langlands program, though it isn't obviously what anyone would want as their first result for that search. Not bad.

"asmodeus" (evil spirit in one of the deuterocanonical books of the Bible, features extensively in later demonology, name used for an evil god in Dungeons & Dragons, etc.): completely blank page, no results, no "sorry, we have no results" message, nothing. Not good.

"clerihew" (a kind of comic biographical short poem popular in the late 19th / early 20th century): completely blank page. Not good.

"marlon brando" (Hollywood actor): first few results are at least related to the actor -- good! -- but I'd have expected to see something like his Wikipedia or IMDB page near the top, rather than the tangentially related things I actually got.

"b minor mass" (one of J S Bach's major compositions): nothing to do with Bach anywhere in the results; putting quotation marks around the search string doesn't help.

"top quark" (fundamental particle): results -- of which there were only 7 -- do seem to be about particle physics, and in some cases about the top quark, but as with Marlon Brando they're not exactly the results one would expect.

"ferrucio busoni" (composer and pianist): blank page.

"dry brine goose" (a thing one might be interested in doing at this time of year): five results, none relevant; top two were about Untitled Goose Game.

"alphazero" (game-playing AI made by Google): blank page. Putting a space in results in lots of results related to the word "alpha", none of which has anything to do with AlphaZero.

OK, let's try some more mainstream things.

"harry potter": blank page. Wat. Tried again; did give some results this time. They are indeed relevant to Harry Potter, though the unexpected first-place hit is Eric Raymond's rave review of Eliezer Yudkowsky's "Harry Potter and the Methods of Rationality", which I am fairly sure is not what Google gives as its first result for "harry potter" :-).

"iphone 12" (confession: I couldn't remember what the current generation was, and actually this is last year's): top results are all iPhone-related, but first one is about the iPhone 6, second is from 2007, third is about the iPhone 6, fourth is from 2007, fifth is about the iPhone 4S, etc.

"pfizer vaccine": does give fairly relevant-looking results, yay.

[+] clay-dreidels|4 years ago|reply
What does a search engine algorithm look like, and where can I find examples to build from?
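As a rough illustration of the shape: a common starting point is an inverted index (term -> documents containing it) with TF-IDF style scoring. The sketch below is a toy under that assumption, not what this project uses; `TinyIndex`, `tokenize` and the sample documents are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: term -> {doc_id: number of occurrences}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        self.doc_count += 1
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = count

    def search(self, query, limit=10):
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Terms that appear in few documents get a higher weight.
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, count in docs.items():
                scores[doc_id] += count * idf
        return sorted(scores, key=scores.get, reverse=True)[:limit]


index = TinyIndex()
index.add("cars", "Best car brands reviewed: reliability and price")
index.add("phones", "Best mobile phone brands this year")
index.add("jeans", "Google has created very blue jeans")
results = index.search("best car brands")
```

Here `results` ranks the car page above the phone page because "car" is the rarest query term. Real engines add link analysis, spam filtering and much more on top; open source libraries such as Apache Lucene are good places to read production-grade versions of the same idea.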
[+] tomxor|4 years ago|reply
> We plan to start work on a distributed crawler, probably implemented as a browser extension that can be installed by volunteers.

Is there a concern that volunteers could manipulate results through their crawler?

You already mentioned distributed search engines have their own set of issues. I'm wondering if a simple centralised non-profit fund a la wikipedia could work better to fund crawling without these concerns. One anecdote: Personally I would not install a crawler extension, not because I don't want to help, but because my internet connection is pitifully slow. I'd rather donate a small sum that would go way further in a datacenter... although I realise the broader community might be the other way around.

[edit]

Unless the crawler was clever enough to merely feed off the sites I'm already visiting and use minimal upload bandwidth. The only concern then would be privacy. Oh the irony, but trust goes a long way.

[+] KarlKemp|4 years ago|reply
The central problem with this and similar endeavors: nobody is willing to pay what they are worth in ads. Let's say the average Google user in the US earns them $30/year. Are you willing to pay $30/year for an ad-free Google experience? Great! We now know that you are worth at least $60/year.

That little thought experiment is true for many online services, from social networking to (marginally) publishing. But nowhere is it more true than for search results, which differ in two fundamental ways: being text-only, they don't bother me to anywhere near the degree of other ads. And, second, they are an order of magnitude more valuable than drive-by display ads, because people have indicated a need and a willingness to visit a website that isn't among their bookmarks. These two, combined, make this the worst possible case for replacing an ad-based business with a donation model.

The idea mentioned in this readme that "Google intentionally degrades search results to make you also view the second page" is also wrong, bordering on self-delusion. The typical answer to conspiracy theories works here: there are tens of thousands of people at Google. Such self-sabotage would be obvious to many people on the inside, far too many to keep something like this secret.

[+] nyuszika7h|4 years ago|reply
I would consider Google randomly excluding the most relevant words from my search query intentionally degrading results. It's incredibly frustrating. This shouldn't be the default behavior, maybe an optional link the user can click to try again with some of the terms excluded.

Yes, I know verbatim mode exists, but I always forget to enable it, and the setting eventually gets lost when my cookies are cleared or something.

Unfortunately I can't switch to another search engine because in my experience every other search engine has far inferior results, despite not having the annoying behaviors Google does. DuckDuckGo is only useful for !bangs for me.

[+] timeon|4 years ago|reply
> The central problem with this and similar endeavors: nobody is willing to pay what they are worth in ads. Let's say the average Google user in the US earns them $30/year. Are you willing to pay $30/year for an ad-free Google experience? Great! We now know that you are worth at least $60/year.

Is this relevant for a non-profit project? Do you pay $30/year for Wikipedia?

[+] daoudc|4 years ago|reply
TBF I don't think Google intentionally degrades results, but they have less incentive to improve the results.
[+] wolfgarbe|4 years ago|reply
A laudable effort. Two questions:

1. What is the rationale behind choosing Python as an implementation language? Performance and efficiency are paramount in keeping operational costs low and ensuring a good user experience even if the search engine will be used by many users. I guess Python is not the best choice for this, compared to C, Rust or Java.

2. What is the rationale behind implementing a search engine from scratch versus using existing Open Source search engine libraries like Apache Lucene, Apache Solr and Apache Nutch (crawler)?

[+] Faaak|4 years ago|reply
Premature optimization is the root of all evil. Best to concentrate on the algorithm first, and then, maybe, improve it with a faster language.

Apart from that, the misconception that "python is slow" should die :-)

[+] legofr|4 years ago|reply
> All other search engines that I've come across are for-profit. Please let me know if I've missed one!

https://www.ecosia.org/

https://ask.moe/

https://ekoru.org/

I remember seeing one more non-profit search engine on HN but can't seem to find it right now.

[+] daoudc|4 years ago|reply
Thanks, but these are not technically non-profit:

"Ecosia is a search engine based in Berlin, Germany. It donates 80% of its profits to nonprofit organizations that focus on reforestation" [1]

"80% of profits will be distributed among charities and non-profit organizations. The remaining 20% will be put aside for a rainy day." [2]

"Ekoru.org is a search engine dedicated to saving the planet. The company donates 60% of revenue generated from clicks on sponsored search results to partner organizations who work on climate change issues" [3]

[1] https://en.wikipedia.org/wiki/Ecosia [2] https://ask.moe/ [3] https://www.forbes.com/sites/meimeifox/2020/01/19/how-the-se...

[+] asicsp|4 years ago|reply
>I remember seeing one more non-profit search engine on HN but can't seem to find it right now.

Probably this one? "A search engine that favors text-heavy sites and punishes modern web design" https://news.ycombinator.com/item?id=28550764 (3 months ago, 717 comments)

[+] Minor49er|4 years ago|reply
I just tried ask.moe, but it clearly noted that the search results were provided by Google
[+] m-i-l|4 years ago|reply
Also https://searchmysite.net/ for personal and independent websites (essentially a loss-leader for its open source self-hostable search as a service).
[+] alexdowad|4 years ago|reply
Some idle words from a passer-by: It would have been good if this project had a pronounceable name.

"To Google" has entered the English lexicon as a verb, but I don't think anybody will ever say they "mwmbled" something.

[+] daoudc|4 years ago|reply
It's pronounced "mumble". I live in Mumbles, which is spelt Mwmbwls in Welsh.
[+] wodenokoto|4 years ago|reply
In the early web 2, it was very in for things to be spelled unpronounceably. For the life of me I can only remember Twittr, but I wanna say Spotify also had an unreadable name in the early days.
[+] ZeroGravitas|4 years ago|reply
I have this feeling that most of the time I "search" for something I already know what I'm looking for, but Google via Firefox's omnibox is just the fastest way to get there, even though it's a bit indirect. Are they getting paid for that, or am I costing them money in the short term, but they get to build up a profile on me to provide more effective ads later?

I wonder if it's possible to take advantage of that type of search by putting a facade in front of the "search engine" and, based on the search term and the private local user history, then go direct to a known site, or if it seems a search is needed, go to a specific search engine. This may open up opportunities for, say, programming-language-specific search engines, or program-specific search for error messages, or shopping-for-X sites.
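A minimal sketch of what such a facade could look like, assuming the known sites and specialised engines live in local tables. `KNOWN_SITES`, `TOPIC_ENGINES`, `DEFAULT_ENGINE` and `route` are hypothetical names for illustration; in practice the tables would be built from local history and never leave the machine.

```python
from urllib.parse import quote_plus

# Hypothetical local data, e.g. derived from bookmarks or history.
KNOWN_SITES = {
    "hacker news": "https://news.ycombinator.com/",
    "python docs": "https://docs.python.org/3/",
}
# Hypothetical mapping from a leading keyword to a specialised search.
TOPIC_ENGINES = {
    "python": "https://docs.python.org/3/search.html?q={q}",
}
DEFAULT_ENGINE = "https://duckduckgo.com/?q={q}"


def route(query):
    """Return the URL to open for a query typed into the address bar."""
    key = query.strip().lower()
    # 1. Navigational query: go straight to the known site.
    if key in KNOWN_SITES:
        return KNOWN_SITES[key]
    # 2. Topic query: the first word selects a specialised engine.
    first, _, rest = key.partition(" ")
    if first in TOPIC_ENGINES and rest:
        return TOPIC_ENGINES[first].format(q=quote_plus(rest))
    # 3. Anything else goes to a general-purpose engine.
    return DEFAULT_ENGINE.format(q=quote_plus(key))
```

With these tables, `route("Hacker News")` skips the search engine entirely, while `route("python list comprehension")` lands on the Python documentation search.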

[+] medstrom|4 years ago|reply
I bookmark every site I might possibly want to revisit - make a habit of Ctrl+D. They're totally unsorted, but the key is to wipe the regular history on exit, leaving only the bookmarks as source material for completion. That way I can type something in the url bar and get completion to interesting sites. The url bar (or omnibox) matches on page title as well as the actual address, so it's easy, and always faster than a search engine.
[+] yellowsir|4 years ago|reply
If you set DuckDuckGo as your default search provider, you can use bangs in the omnibox. Also you can toggle between local-area and global search. https://duckduckgo.com/bang e.g. !yt !osm !gi
[+] bluecatswim|4 years ago|reply
Most wikis or resource/documentation sites have a local search bar on their homepage, and Firefox has a feature that lets you add a search keyword for that specific site. So if you add, say, pydocs as a keyword for docs.python.org, you can type "@pydocs <query>" and it looks up the query on that site.
[+] _xnmw|4 years ago|reply
This is a business model I've been thinking about: what if users earned credits for running a crawler on their machine? In other words, as much as I hate crypto scams, a "tokenized" search engine where the "mining" power was put to good use, i.e. crawling and indexing.
[+] thebeastie|4 years ago|reply
How would you judge that they had actually done the work? The output needs to be verifiable.
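One simple (and imperfect) way to make the output checkable, sketched here as an assumption rather than anything proposed in the thread: hand the same URL to several independent volunteers and only accept a result when enough of them report the same content hash. `accept_crawl` and the volunteer names are hypothetical.

```python
import hashlib
from collections import Counter


def content_hash(body):
    """SHA-256 hex digest of a fetched page body (bytes)."""
    return hashlib.sha256(body).hexdigest()


def accept_crawl(reports, quorum=2):
    """reports maps volunteer id -> page body they claim to have fetched.

    Returns the agreed hash, or None when no hash reaches the quorum.
    """
    if not reports:
        return None
    votes = Counter(content_hash(body) for body in reports.values())
    digest, count = votes.most_common(1)[0]
    return digest if count >= quorum else None


reports = {
    "alice": b"<html>hi</html>",
    "bob": b"<html>hi</html>",
    "mallory": b"<html>spam</html>",
}
agreed = accept_crawl(reports)
```

This costs duplicate work and breaks down for pages that legitimately differ between fetches (timestamps, personalisation), so a real system would need to compare normalised content rather than raw bytes.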
[+] Piezoid|4 years ago|reply
YaCy is decentralized, but without the credit system. Some tokens, like QBUX, have tried to develop decentralized hosting infrastructure.

I also have been wondering how this would play out with some kind of decentralized indexes. The nodes could automatically cluster with other nodes of users sharing the same interests, using some notion of distances between query distributions. The caching and crawling tasks could then be distributed between neighbors.
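To make "distance between query distributions" concrete, here is a small sketch using the Jensen-Shannon divergence over term frequencies. The choice of measure and the names `distribution`, `js_divergence` and `nearest_neighbour` are assumptions for illustration; any other distance between distributions would slot in the same way.

```python
import math
from collections import Counter


def distribution(queries):
    """Turn a list of queries into a term -> probability mapping."""
    counts = Counter(term for q in queries for term in q.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint."""
    def kl(a, m):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    terms = set(p) | set(q)
    mid = {t: (p.get(t, 0) + q.get(t, 0)) / 2 for t in terms}
    return (kl(p, mid) + kl(q, mid)) / 2


def nearest_neighbour(node, others):
    """Name of the node whose query distribution is closest to `node`."""
    return min(others, key=lambda name: js_divergence(node, others[name]))


me = distribution(["top quark mass", "quark gluon plasma"])
peers = {
    "physicist": distribution(["top quark decay"]),
    "cook": distribution(["dry brine goose", "roast goose"]),
}
closest = nearest_neighbour(me, peers)
```

In this toy case the node clusters with the other physics-minded peer, since their queries share terms while the cook's do not.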

[+] thebeastie|4 years ago|reply
Actually I have an idea for you: I think you can use cryptography to prove that an SSL session really happened. So you could prove indexing of HTTPS sites.
[+] g105b|4 years ago|reply
I'm very intrigued by this concept.
[+] WheelsAtLarge|4 years ago|reply
Make it open source and syndicate it. The goal is to get people to contribute both resources and code. Think of Shopify as the model, where many people contribute to create a huge shopping place. People care only about their own shop, but ultimately they create a useful shopping area.

Also set up a foundation to guide its development and be able to hire a management team.

The real challenge is not the code development but setting up an organization that will outlast all the challenges that will appear. Wikipedia is the model to follow.

[+] ChemSpider|4 years ago|reply
Really, I don't care if it is for-profit or not. Just a search engine with transparent ranking would be great.

Ideally with explainable AI (XAI) that can tell me WHY result A is ranked higher than result B. I would even pay a monthly subscription to use it.

[+] challenger-derp|4 years ago|reply
Do you yearn for explainability due to getting irrelevant search results? Is what you're searching for more specialized than what the public might consider general knowledge?
[+] tibbar|4 years ago|reply
It’s really fast - nice job! Can you elaborate on the ranking algorithm you are using? It seems that this will become more important as you index more pages.
[+] rascul|4 years ago|reply
It looks interesting. However, the results appearing so fast as I type, and changing just as fast as I type more, makes it seem like it's flickering and it's painful on my eyes. Perhaps a slight delay and/or a fading effect as the results appear would be a bit easier for me to look at.
[+] daoudc|4 years ago|reply
Thanks for the feedback! Yup, a delay is on the to-do list.
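Such a delay is usually implemented as a debounce: wait until typing has paused before running the search. The actual front end is presumably JavaScript; the sketch below only shows the idea in Python with asyncio, and `Debouncer` and `demo` are hypothetical names.

```python
import asyncio


class Debouncer:
    """Run `callback` only after `delay` seconds without new input."""

    def __init__(self, callback, delay=0.2):
        self.callback = callback
        self.delay = delay
        self._task = None

    def submit(self, text):
        # A new keystroke cancels the search scheduled for the previous one.
        if self._task is not None:
            self._task.cancel()
        self._task = asyncio.create_task(self._fire(text))

    async def _fire(self, text):
        await asyncio.sleep(self.delay)
        self.callback(text)


async def demo():
    searched = []
    box = Debouncer(searched.append, delay=0.2)
    for partial in ["h", "ha", "harry", "harry potter"]:
        box.submit(partial)
        await asyncio.sleep(0.01)  # typing faster than the delay
    await asyncio.sleep(0.4)       # pause: only the last query runs
    return searched
```

Running `asyncio.run(demo())` triggers a single search for the final text instead of one per keystroke, which is what removes the flicker.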
[+] gkasev|4 years ago|reply
Congrats on the MVP path you took to launch your product. Generally, I think that there is a place for other variations of web search, be it in the way you crawl or perhaps how you monetize. I genuinely believe that it is really hard to build a general purpose search engine like DDG, Google and the like, but you can build a fairly good niche search engine. I'm particularly fond of the idea of community powered curation in search. Just today I launched my own take on a community driven search engine - https://github.com/gkasev/chainguide. If you'd like to bounce ideas back and forth with somebody, I'd be very interested to talk to you.
[+] amenod|4 years ago|reply
Off-topic [0]: I would be very interested in an economic model that would work for such a search engine. Donations are fine, but (imho) it will take much more than that to keep the lights on, let alone expand...

The "fairest" solution for both sides I can think of is ads which do not send tracking information, and are shown primarily based on search terms and country, or even other parameters that the visitor has set explicitly. Any other ideas on how to finance such an engine so that incentives are aligned?

[0]: EDIT: off-topic because the page clearly states that this project will be financed with donations only.

[+] luckylion|4 years ago|reply
Aren't ads super ineffective, especially when you don't make them very invasive?

I think donations are probably workable. It works in the private tracker scene; the larger ones have "donation meters" and never seem to fall behind.

It could also work on a subscription model which is essentially just formalizing the donations and making it easier to plan cash flow.

[+] m-i-l|4 years ago|reply
The model my search uses is for the public search to essentially be a loss leader for the search as a service - site owners can pay a small fee to access extra features such as being able to configure what is indexed, trigger reindexing on demand, etc. It also heavily downranks pages with adverts, to try to eliminate the incentive for spamdexing.
[+] hosteur|4 years ago|reply
Ads as a business model ends with surveillance as a business model. We know this now.
[+] marginalia_nu|4 years ago|reply
> but (imho) it will take much more than that to keep the lights on, let alone expand...

You'd be surprised how cheap a search engine can be to operate. My search.marginalia.nu has a burn rate of less than $100/month.

[+] daoudc|4 years ago|reply
Wikimedia has an estimated $157m in donations this year. If we could get a small fraction of this amount we should be able to build something pretty good.
[+] quantum2021|4 years ago|reply
Two big things that annoy me about google:

1. They somewhat get around this with their maps feature, but their regular search doesn't actually search by area; you always get national websites that optimize the best. That would be a nice feature to have starting out without having to type in the specific area you're looking for.

2. Search results for hotels that actually work! Not only if they're set up on OTAs! This could actually get your search engine some traction as the search engine to go to when making travel plans, which would give you a nice niche to start out in.

[+] gravypod|4 years ago|reply
If you filed to become a non-profit could people "donate" their engineering time as a tax write off? If you find out the legality of something like this and make it easy to do that could inspire a lot of collaboration on the project and I can see a bunch of other areas (outside of search) where services could be provided like this. I'm also sure having a non-profit would also make it easier to find cheap hosting which is a large part of the cost there.
[+] marcodiego|4 years ago|reply
Non-profit search engines are needed. They will probably still be vulnerable to SEO but will more likely be resistant to being corrupted by the interests of "investors".
[+] freediver|4 years ago|reply
Congrats! Very nice to see results being lightning fast, I am getting 100-120ms response with network overhead included and that is impressive. The payload size of only 10-20kb helps immensely, good job!

I've built something similar called Teclis [1] and in my experience a new search engine should focus on a niche and try to be really, really good at it (I focused on non-commercial content for example).

The reason is to be able to narrow down the scope of content to crawl/index/rank and hopefully, with enough specialization, to be able to offer better results than Google for that niche. This could open doors to an additional monetization path: API access. Newscatcher [2] is an example of where this approach worked (they specialized in "news").

[1] http://teclis.com

[2] https://newscatcherapi.com/

[+] ChuckMcM|4 years ago|reply
Okay, the cynical quip is "All search engines other than Google's are 'non-profit'." :-) But the reasons for that won't fit in the margin here.

Building search engines is cool and fun! They have what seems like an endless source of hard problems that have to be solved before they are even close to useful!

As a result people who start on this journey often end up crushed by the lack of successes between the start and the point where there is something useful. So if I may, allow me to suggest some alternatives which have all the fun of building a search engine and yet can get you to a useful place sooner.

Consider a 'spam' search engine. Which is to say, a crawler that you work to train on finding spammy, useless web sites. Trust me when I say the current web is a "target rich environment" here. The purpose would be not so much to provide a search engine in total as to provide something like what the realtime black hole list did for email spam: come up with a list of URLs that could be easily checked with a modified DNS-type server (using the DNS protocol, but expressly for the purpose of doing the query 'Is this URI hosting spam?' in a rapid fashion).
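A sketch of how such a lookup could work, following the convention email blocklists use: prepend the name being checked to the list's zone, and treat an answer in 127.0.0.0/8 as "listed" and NXDOMAIN as "not listed". The zone `spamlist.example.org` and the helper names are hypothetical, and this version checks only the host, not the full URI.

```python
import socket
from urllib.parse import urlsplit

# Hypothetical zone; a real service would publish its own.
ZONE = "spamlist.example.org"


def query_name(url, zone=ZONE):
    """Map a URL to the DNS name to look up: <host>.<zone>."""
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"no host in {url!r}")
    return f"{host}.{zone}"


def is_spam(url, resolve=socket.gethostbyname):
    """True if the list answers with a 127.x.x.x address for this host."""
    try:
        answer = resolve(query_name(url))
    except socket.gaierror:
        # Resolution failed (typically NXDOMAIN): host is not listed.
        return False
    return answer.startswith("127.")


def fake_resolve(name):
    """Stand-in resolver so the example runs without a network."""
    if name == "spammy.example.com.spamlist.example.org":
        return "127.0.0.2"
    raise socket.gaierror("not listed")
```

Because the answer is an ordinary DNS record, lookups are cached by resolvers along the way, which is what makes the check fast enough for a browser plugin or proxy to do on every link.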

There are two "go to market" strategies for such a site. One is a web browser plugin that would either pop up an interstitial page saying "Don't go here, it is just spam" when someone clicked on a link, or, monkey-script style, add an indication to a displayed page that a link was spammy (like setting the anchor's display style to blinking red or something). The second is to sell access to this service to web proxies, web filters, and Bing, which could in the course of their operation simply ignore sites that appeared on your list as if they didn't exist.

You will know you are successful when you are approached by shady people trying to buy you out.

Another might be a "fact finding" search engine. This would be something like Wolfram Alpha but for "facts." There are lots of good AI problems here: one which develops a knowledge tree based on crawled and parsed data, and one which answers factual queries like 'capital of alaska' or 'recipe for baked alaska'. The nice thing about facts is that they are well protected against claims of copyright infringement, so people really can't come after you for reproducing the fact that the speed of light is roughly 300,000 km/s, even if they can prove you crawled their web site to get that fact.
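The query-answering half can be sketched in a few lines if the facts are already stored as (subject, attribute) pairs. Everything here is a toy assumption: `FACTS` is hand-entered rather than extracted from a crawl, and `answer` only understands queries shaped like "<attribute> of <subject>".

```python
import re

# Hypothetical hand-entered facts; a real system would extract these
# from crawled and parsed pages.
FACTS = {
    ("alaska", "capital"): "Juneau",
    ("light", "speed"): "299,792,458 m/s",
}

QUERY = re.compile(r"\s*(\w+) of (?:the )?(.+?)\s*\??\s*")


def answer(query):
    """Answer '<attribute> of <subject>' queries, or return None."""
    match = QUERY.fullmatch(query.lower())
    if not match:
        return None
    attribute, subject = match.groups()
    return FACTS.get((subject, attribute))
```

`answer("capital of alaska")` finds the stored fact, while `answer("recipe for baked alaska")` returns `None` because it is not a simple attribute lookup; that gap is where the hard parsing and knowledge-extraction problems live.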