Do they really, though, for normal people that is? Some of my searches from today are below (I can't remember the exact terms I used). A mix of DDG and Google.
1) Walt Whitman. I wanted a basic overview of his work to satisfy some idle curiosity. DDG gave me his Wikipedia page. Bingo.
2) EAN-13 check digit. First result: Wikipedia, telling me how to calculate it. I see it is simple, and I have a long list in Excel to check. I can't be bothered to think, so... (see the sketch after this list)
3) EAN-13 Excel. The first result has an example that I copied and pasted.
4) Timezone [niche cloud system]. Said system didn't do what we expected; it seemed to be a timezone issue. The first article discusses this niche issue and offers solutions.
5) Does Shopify support x payments? Yes it does.
6) Coronavirus test. Got straight to the government site.
7) macOS version numbers. First hit...
8) How come my Microsoft x platform is showing as being at y level of service when my buddy's is not? Straight in.
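For the curious, the calculation from item 2 is just a weighted sum, simple enough to sketch. A minimal Python version of what that Wikipedia page describes (the Excel approach is the same sum in formula form):

```python
def ean13_check_digit(first12: str) -> int:
    """EAN-13 check digit: weight the 12 leading digits 1,3,1,3,..., sum them,
    then take whatever is needed to reach the next multiple of 10."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10

# 4006381333931 is a commonly cited valid EAN-13, so the first 12 digits
# should yield a check digit of 1.
assert ean13_check_digit("400638133393") == 1
```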
Am I just the perfect search customer? I don't seem to be having the problems Drew is.
I suspect that anyone who claims that DuckDuckGo "just works" only does English searches. I usually do both English and mother-tongue searches all day. Every time, I need to remember to toggle the regional button, otherwise I get atrocious results. Google, on the other hand, simply understands that if I'm searching in English it should prioritize English results, while if I'm searching in another language it should prioritize that language instead.
It gets tiring quickly, and I find it easier to append !g than to click the regional toggle button.
Yeah, all of these are quite DDG-friendly searches. It is my default engine and, yes, some results do suck quite consistently.
I'm too lazy right now to remember all the problems it has, but some of the most obvious are looking up news on recent events (especially something small that doesn't appear in Reuters and that sort of media) and trying to find basic information about local shops and such (of course, I only know how it feels in my location, not worldwide). On both occasions I pretty much always use "!g ..." right away, because DDG is just clueless about this shit. Google does this just fine (in fact, sometimes it's even impressive: there are thousands of cities like mine, yet Google can often tell me where I can buy something I'd have no idea where to look for).
I am typing this from India. DDG never provides satisfactory results for anything country-specific. As an example, point 6 above is a failure. I used to have DDG as my default, but my workflow got so convoluted: I would search first on DDG, see that the results were not good, then open Google and search again. It was so frustrating that I switched back to Google even though I didn't want to.
In consumer search there is a really long tail of queries (in 2017, 15% of Google's daily queries had never been seen before[1]), and performance on this tail is very important.
I just searched for "lockdown rules for SA" (I'm in South Australia and we just had a new 20-person cluster, so we are going back into lockdown).
On DDG the first result was a Guardian article, which was good, but the rest were a mix of South African articles and blogspam. There were no SA Gov pages on the first page of results.
On Google the first result was the South Australian gov site with the rules, the second was the Guardian article, then more SA Gov pages, and only at result 8 did I get a South African result.
[1] https://searchengineland.com/google-reaffirms-15-searches-ne...
They do. Something fundamentally changed at some point during the past couple of years. It used to be that DDG was the best for verbatim search (meaning I only want results where the exact words I searched for are included).
Now, even with quotes, I routinely get a whole first page of results where my terms are not included anywhere. Google generally respects the quotes.
Google does infer purpose better, and if someone is looking to buy something, it does well there too. DDG is very good at info queries, and the more one uses it, the better it is. What they could do is exactly what Google did: review those uses and improve. But what they have right now is solid, given just a tiny bit of work.
It also sucks at retrieving very new information. And I say this as someone who set DDG as their default. I mean, you do seem to be DDG's ideal user: you searched for mostly technical issues, and a hot political issue.
> Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results.
Not a big fan of this conclusion. Who chooses the whitelist, and why should I trust them? Is it democratically chosen? Just because a site is popular very clearly does not mean it's trustworthy. Does it get vetted? By whom? Also, whose definition of trustworthy are we trusting?
If I want my blog to show up on your search engine, do I have to get it linked by one of those sites, or can I register with you? Will I be tier 1, or
SEO is crushing the utility of Google. It is pretty telling when you need to add things like site:reddit.com to get anything of value. Harnessing real user experiences (blogs, etc.) is the key to a better search engine. Unfortunately, this model crumbles under walled gardens, which are increasingly the preferred location of user activity.
If your goal is "to make something better than the Duck" and you succeed, the Duck dies... what is your goal now?
My main problem with DDG is that there's no way to be sure they actually respect their users' privacy as they claim to.
Ideally, services like theirs would be continuously audited by respectable, trusted organizations like the EFF... multiple such organizations, even.
Then I'd have at least some reason to believe their claims of not collecting data about me.
As it stands, I only have their word for it... which in this day and age is pretty worthless.
That said, I'd still much rather use DDG, who at least pay lip service to privacy, than sites like Google or Facebook, who are openly contemptuous of it.
At the very least it sends a message to these organizations that privacy is still valued, and they'd lose out by not trying to accommodate the privacy needs of their users to some extent.
How would anybody ever know what the server is running and/or doing with the data you send it, regardless of whether it is running open or closed source code?
A service running on somebody else's machine is essentially closed.
I think the only way to have an 'open' service is to have it managed like a co-op, where the users all have access to deployment logs or other such transparency.
Even then, it requires implicit trust in whoever has the authorization to access the servers.
That sounds a bit like YaCy.[1] It is a program that apparently lets you host a search engine on your own machine, or have it run as a P2P node.
I think the next step forward should be to have indices that can be shared/sold for use with local mode. So you might buy specialised indices for particular fields, or general ones like what Google has. The size of Google's index is measured in petabytes, so a normal person would still not have the capability to run something like that locally.
Edit: In another thread, ddorian43 has pointed out the existence of Common Crawl,[2] which provides Web crawl data for free. I have no idea if it can be integrated with YaCy, but it is there.
1. https://yacy.net/
2. https://commoncrawl.org/
In theory, this is the kind of thing that the GPL v3 was trying to address: roughly speaking, if you host & run a service that is derived from GPL-v3'd software, you are obliged to publish your modifications.
But I agree with you, and I don't think the author had really thought through what they were demanding: they made no mention of licensing, other than singing the happy praises of FOSS as if that would magically mean you could trust what a search engine was doing.
> How would anybody ever know what the server is running and/or doing with the data you send it, regardless of whether it is running open or closed source code?
https://en.wikipedia.org/wiki/Homomorphic_encryption
I've also been using DDG exclusively for many years. I usually find what I need in the first couple of results, or in the box on the right, which usually goes directly to the authoritative source anyway. What am I doing wrong (or right) here? I put a thing in and find it. I just don't use Google any more. Genuinely curious why it's working for me and such garbage for everyone else.
For example, you might search for `vue js on show`, whereas `vue on show` will show you (in the UK) results for what is on at Vue cinemas. With Google, I expect it would understand that you are probably searching for JS-related Vue questions and rank those higher.
I'm mostly getting Norwegian results when searching for Danish subjects from a Danish IP address. It also seems it just hasn't indexed as many websites as Google.
I re-search almost everything technical with Google after DDG shows me crap. I still use DDG by default though; it works for most things, just not for work.
Do Google search results work for you? If yes, then I'd say the reason is that you don't see, or don't agree with, how bad results are today (as others have posted extensively about). I for one find DDG to be the search engine that returns the worst results. Qwant is a better Bing-based engine IMO, but it is still bad.
I can think of some improvements (better forum/mailing list coverage), but it's generally pretty good. Lately if I don't find it on DDG I probably won't have much luck anywhere else, either.
I sometimes come across inappropriate results - for example, I search for a hex error code and the results are for other numbers - and sometimes the adverts are misleading, but neither is prevalent enough to harm the experience in general.
I always send feedback when I come across incorrect results and also try to when I get a really easy find.
I have not had to resort to any other search engine for at least five years.
I tried DDG for a while, a couple of years ago, and I got lower-quality results, particularly for technical subjects (which make up the vast majority of my searches). I will give DDG another shot, though.
For generic stuff DDG is mostly OK. But for local results, even though it has a switch for them, it REALLY REALLY REALLY sucks: it often doesn't get any of the expected places anywhere in the first few pages for New Zealand, which makes it somewhat useless.
I'd say about 50% of the time I'm good with DDG. About a third of the time I add !g, usually for weird error messages and tech stuff.
Honestly, we shouldn't be using Google for everything. Why not just search StackExchange or GitHub issues directly for known bugs? If you need a movie, !imdb or !rt forwards you exactly where you really want to search.
If DDG or Google also included independent small blogs in movie results, I could see the value in that. I'd prefer someone's review on their own site or video channel, but they don't include those. We've kinda lost that part of the Internet.
Why couldn't several coordinating specialized search engines share their data via something like "charge the downloader" S3 buckets? Then you get an org like StackExchange, which could provide indexed data from its site and the algorithms to search that data most efficiently; GitHub could do the same for its specific zone of specialty; Amazon, etc.
Then anyone who wants to use the data can either copy it to their own S3 buckets to pay just once, or can use it with some sort of pay-as-you-go method. Anyone who runs a search engine can use the algorithms as a guide for the specific searches they are interested in for their site, or can just make their own.
You could trust the other indexers not to give you bad data, because you'd have some sort of legal agreement and technical standards that would ensure that they couldn't/wouldn't "poison the well" somehow with the data they provide. Further, if a bad actor was providing faulty data, the other actors would notice and kick them out of the group or just stop using their data.
It would have to be fully open source - I agree with the other parts of Drew's essay here - but I think we could share the index/data somehow if we got together and tried to think about it. We just need a standard for how we share the data.
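A rough sketch of the "charge the downloader" piece, using S3's existing requester-pays feature (the bucket and key names here are invented for illustration):

```python
import boto3

# Requester-pays buckets bill the downloader for the transfer, not the bucket
# owner, so each provider (StackExchange, GitHub, ...) could publish index
# shards this way without eating the bandwidth bill.
s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="stackexchange-search-index",  # hypothetical provider bucket
    Key="shards/2020-11/python.idx",      # hypothetical index shard
    RequestPayer="requester",             # we, the downloader, pay for transfer
)
shard = resp["Body"].read()
```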
> they’ve demonstrated gross incompetence in privacy
Not sure I buy the example that is given here.
1. It's an issue in their browser app, not their search service.
2. It's not completely indefensible: it allows fetching favicons (potentially) much faster, since they're cached, and they promise that the favicon service is 100% anonymous anyway.
3. They responded to user feedback and switched to fetching favicons locally, so this is no longer an issue. https://github.com/duckduckgo/Android/issues/527#issuecommen...
> The search results suck! The authoritative sources for anything I want to find are almost always buried beneath 2-5 results from content scrapers and blogspam. This is also true of other search engines like Google.
This part is kinda funny because "DuckDuckGo sucks, it's just as bad as Google" is ... not the sort of complaint you normally hear about an alternative search engine, nor does it really connect with any of the normal reasons people consider alternative search engines.
That said, I agree with this point. Both DDG and Google seem to be losing the spam war, from what I can tell. And the diagnosis is a good one too: the problem with modern search engines is that they're not opinionated / biased enough!
> Crucially, I would not have it crawling the entire web from the outset. Instead, it should crawl a whitelist of domains, or “tier 1” domains. These would be limited mainly to authoritative or high-quality sources for their respective specializations, and would be weighed upwards in search results. Pages that these sites link to would be crawled as well, and given tier 2 status, recursively up to an arbitrary N tiers.
This is, obviously, very different from the modern search engine paradigm where domains are treated neutrally at the outset, and then they "learn" weights from how often they get linked and so on. (I'm not sure whether it's possible to make these opinionated decisions in an open source way, but it seems like obviously the right way to go for higher quality results.) Some kind of logic like "For Python programming queries, docs.python.org and then StackExchange are the tier 1 sources" seems to be the kind of hard-coded information that would vastly improve my experience trying to look things up on DuckDuckGo.
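A minimal sketch of the tiered crawl the quoted passage describes, where a page's tier is just its link distance from the whitelist (fetch_links is a stand-in I've assumed for a real page fetcher):

```python
from collections import deque

def crawl_tiers(tier1_urls, fetch_links, max_tier=3):
    """Breadth-first crawl: a page's tier is its link distance from the
    whitelisted tier-1 seeds; lower tiers get weighted up at ranking time."""
    tier = {url: 1 for url in tier1_urls}
    queue = deque(tier1_urls)
    while queue:
        url = queue.popleft()
        if tier[url] == max_tier:      # stop recursing at the arbitrary tier N
            continue
        for link in fetch_links(url):  # fetch_links: hypothetical page fetcher
            if link not in tier:       # shortest path wins, as in any BFS
                tier[link] = tier[url] + 1
                queue.append(link)
    return tier                        # e.g. {"https://docs.python.org/": 1, ...}
```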
Maybe instead of hard-coding these preferences in the search engine, or having it try to guess for you based on your search history, you could opt in to download and apply such lists of ranking modifiers to your user profile. Those lists would be maintained by third parties and users, just like, e.g., adblock blacklists and whitelists. For example, Python devs might maintain a list of search terms and associated URLs that get boosted, including Stack Exchange and their own docs. "Learn Python" tutorials would recommend you set up your search preferences for efficient Python work, just like they recommend you set up the rest of your workflow. Japanese Python devs might have their own list that boosts the official Python docs and also whatever the popular local equivalent of Stack Exchange is in Japan, which gets recommended by the Japanese tutorials. People really into 3D printing can compile their own list for 3D printing hobbyists. You can apply and remove any number of these to your profile at a time.
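Very roughly, applying one of those opt-in lists could look like this (the list format, field names, and boost values are all invented for the sketch):

```python
# A hypothetical downloadable ranking-modifier list, akin to an adblock list:
# boost results from these domains when the query mentions the trigger terms.
PYTHON_DEV_LIST = [
    {"terms": {"python"}, "domain": "docs.python.org",   "boost": 2.0},
    {"terms": {"python"}, "domain": "stackoverflow.com", "boost": 1.5},
]

def apply_modifiers(query, results, modifier_list):
    """results: list of (url, base_score) pairs; boosts multiply the score."""
    words = set(query.lower().split())
    def boosted(url, score):
        for rule in modifier_list:
            if rule["terms"] & words and rule["domain"] in url:
                score *= rule["boost"]
        return score
    return sorted(((u, boosted(u, s)) for u, s in results),
                  key=lambda pair: pair[1], reverse=True)

# e.g. apply_modifiers("python sort list",
#                      [("https://docs.python.org/3/", 1.0),
#                       ("https://blog-spam.example/", 1.2)],
#                      PYTHON_DEV_LIST) now puts the official docs first.
```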
Agreed. I think the key point here is that the web is a radically different place than it was in 1998 (when Google launched and established the search engine paradigm as we know it). Back then the quality-to-spam ratio was probably much higher, the overall size of the web was certainly much smaller (making scraping the entire thing more tractable), and there were many more self-hosted sources rather than platforms (meaning it was more necessary to rely on inter-linking, and "authoritative domains" weren't as much of a thing). The naive scraping approach was both more crucial and more effective. And in the decades since, it's been a constant war of attrition to keep that model working under more and more adversarial conditions.
So I think that stepping back and re-thinking what a search engine fundamentally is, is a great starting point for disruption.
Additionally, something the OP didn't mention is that ML technologies have progressed dramatically since 1998, and that much of that progress has been done in the open. I can't imagine that not being a force-multiplier for any upstart in this domain.
I think Google sort of takes into account "votes", in that they look at the last thing you clicked on from that search, and consider that the "right answer", which they then feed back into their results.
As such, they effectively have a list of "tier 1" domains.
I thought DDG already crawled their own curated list of sites?
There is a DuckDuckGoBot, and I think it was in an interview or podcast a while back that Gabriel mentioned they use it to fill gaps in the Bing API data in order to provide the instant answers and favicons. Their preference for the instant answers was authoritative references such as docs.python.org. This would have been a while back, though.
> Some kind of logic like "For Python programming queries, docs.python.org and then StackExchange are the tier 1 sources" seems to be the kind of hard-coded information that would vastly improve my experience trying to look things up on DuckDuckGo.
The problem with this strategy is always going to be that different users will regard different sources as most desirable.
For example, it's enormously frustrating that searching for almost anything Python-related on DDG seems to return lots of random blog posts but hardly ever shows the official Python docs near the top. I don't personally think the official Python docs are ideally presented, but they're almost certainly more useful to me at that time than some random blog that happens to mention an API call I'm looking up.
On the other hand, I would gladly have an option in a search engine to hide the entire Stack Exchange network by default. The signal/noise ratio has been so bad for a long time that I would prefer to remove them from my search experience entirely rather than prioritise them. YMMV, of course. (Which is my point.)
With that logic, Apple's OCSP server is also 100% anonymous (which I legitimately can believe it is).
I tried to build something like this in 2007, together with a small band of nerds and geeks and Linux enthusiasts. It was called Beeseek. [0]
I knew close to nothing about building a company or a project, or how a proper business model would have helped it. I was the leader (SABDFL) of the group, and unfortunately I didn't lead it well enough to succeed. We had some good ideas, but ultimately we failed at building more than the initial prototype.
The idea behind it was simple: WorkerBee nodes (users' computers) would crawl the web and provide the computational power to run Beeseek. Users could upvote pages (using "trackers" that anonymously "spy" on the user in order to find new pages - repeat: anonymously). The entire DB would be hosted across multiple nodes. Auth and other functionality would be provided by "higher level" nodes (QueenBee nodes).
Everything was going to be open source.
Well, it didn't work.
Thankfully, because of Beeseek, I met a few very smart people that I am in touch with to this day.
Life is strange and beautiful in its own way.
Weird, though, that today I still believe that Beeseek could have been the right thing to build. Who knows?
[0]: https://launchpad.net/beeseek
In what ways does what OP describes remind you of your project? Just that it was an open source web search?
One difference from what you describe is that the OP is specifically recommending against decentralization/federation, where it seems to have been the core differentiator of your effort. I don't think what OP is describing is quite what you are describing.
Thinking you can design a better search engine by yourself is either egotism or ignorance. Even assuming you based it on state-of-the-art search engine research, and could somehow avoid patent encumbrance, it'd still take you five years to match Google's results (and even then it's not likely), sans all the SEO bullshit.
Most people still believe that it's possible for one search engine to help anyone find anything without it knowing anything about them, which is just ridiculous. To get good search results you practically have to read someone's mind. Google basically does this (along with their e-mails, and voicemails, and texts, and web searches, and AMP links, and PageRanked crawls, and context-aware filters) and they still don't always get it right.
There is no magic algorithm that replaces statistical analysis of a large corpus along with a massive database of customized rulesets.
> We should also prepare the software to boldly lead the way on new internet standards. Crawling and indexing non-HTTP data sources (Gemini? Man pages? Linux distribution repositories?), supporting non-traditional network stacks (Tor? Yggdrasil? cjdns?) and third-party name systems (OpenNIC?), and anything else we could leverage our influence to give a leg up on.
Oh, great, so become the Devil himself, then. Count me out.
Yes, we can do better than DDG. But if you are expecting to fund a real search engine with a few hundred thousand dollars, you are insane. It will take a ton of development and a ton of hardware to create an index that isn't a pile of garbage. This isn't 2000 anymore. You need to index >100 billion pages, you need it kept up to date, you need great crawling and parsing, you need great algorithms (and probably an entirely proprietary engine), and you need to CONSTANTLY refine all of the above until it isn't garbage. Maybe you could muster something passable for $1B over 5 years with a strong core team that attracts great talent. If Apple actually does this, as they are rumored to, I bet they dump $10B into it just for the initial version.
Check out the serious difficulties Common Crawl had with crawling 1% of the public internet on donated money, and then get back to me with a plan. This is really, really hard to do for free. Maybe talk to Gates :)
I don't fully understand something about the general tech industry discourse around search and would love to hear if I'm wrong.
Here's my brief and slightly made up history of search engines:
In the beginning, search engines took a Boolean query (duck AND pond) and found all the documents containing both words using an inverted index, then returned them in something like descending date order. But for queries with big result sets this order wasn't very useful, so search engines began letting users enter more "natural language" queries (duck pond) and sorting documents by how many terms they share with the query. They came up with a bunch of relevance formulas - tf-idf, BM25 - that tried to model this query overlap. But user intent is a really hard problem, so modern search engines just declare that relevance is whatever users click on. Specifically, they model the probability that you're going to click on a link using a DNN fed features like the individual term overlap, the number of users who have clicked on this link, the probability it's spam, the PageRank, etc. Some search engines like Google also include personalized features, like the number of times you have clicked on this particular domain - because, for instance, as a programmer your query (Java) might have different intent than your grandmother's. This score then gets used to sort the results into a ranked list. This is why search engines (DDG included) collect all this data: it makes the relevance problem tractable at web scale.
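To make the tf-idf/BM25 step of that history concrete, here is a toy BM25 scorer (heavily simplified; a real engine precomputes document frequencies and length statistics inside the inverted index rather than rescanning the corpus):

```python
import math
from collections import Counter

def bm25(query, doc, docs, k1=1.5, b=0.75):
    """Score one tokenized document against query terms, using the whole
    corpus (a list of token lists) for IDF and average-length statistics."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)           # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rare terms weigh more
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["duck", "pond", "park"], ["duck", "recipe"], ["pond", "algae"]]
print(bm25(["duck", "pond"], docs[0], docs))  # matches both terms, scores highest
```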
Maybe it's just my perspective, but I really don't understand why OP would want to build an index - it's hard, boring, and expensive, and it isn't the part that violates data privacy - and I don't think people grasp that, at least to some extent, data privacy and relevance are in direct conflict.
I wonder if you could start small on something like this. Build a proof of concept, a search engine for programmers that indexes only programming sites/material. See if you can technically do it, & if you can figure out governance mechanisms for the project. Sort of like Amazon starting with just selling books.
I wonder if instead of another search engine we would benefit from a directory, like DMOZ, or perhaps something tag-based or non-hierarchical. Sometimes I find better results by first finding a good website in the space of my query and then searching within that site, as opposed to applying a specific query over all websites. One example would be recipes: if you search for "bean burger recipe" you will get lots of results across many websites, but some may not be very good, whereas if you already know of recipe websites that you consider high-quality or that match your preferences, then you'll find the best (subjectively) recipe by visiting that site and searching for bean burgers.
I've recently been /tinkering/ with exactly such an idea! In my case, it's even more specific and scoped: a search engine with only allow-listed domains of software engineering/tech/product blogs that I trust.
https://github.com/jmqd/folklore.dev
It's not even really at the POC stage yet, but I hope to host it with a simple web frontend sometime soon. Primarily, this is just for myself... I just want a good way to search the sources that I myself trust.
It's still pretty new and I'm working on it in my spare time, but my side project https://searchmysite.net/ seems pretty close to what the author is after:
- "100% of the software would be free software, and third parties would be encouraged to set up their own installations" - I'm planning on open sourcing it under AGPL soon, once I've got documentation, testing etc. ready. Plus it's easy to set up your own installation (git clone; mkdirs for data; docker-compose up -d).
- "I would not have it crawling the entire web from the outset" - That's one of the key features of my approach, only crawling submitted domains. I'm focussing on personal websites and independent websites at the moment, primarily because I don't currently have the money for infra to crawl big but useful sites like wikipedia, but there's nothing to stop people setting up their own instances for other types of site.
- "who’s going to pay for it? Advertisements or paid results are not going to fly" - A tough anti-advert stance is another key differentiating feature to try to keep out spam, e.g. I detect adverts on indexed pages and make sure those pages are heavily downranked. Planning to pay running costs via a listing fee, which gives access to additional features like greater control over indexing (e.g. being able to trigger reindexing on demand).