gbmatt's comments

gbmatt | 2 years ago | on: OpenAI is too cheap to beat

Only Big Tech (Microsoft, Google, Facebook) can crawl the web at scale, because they own the major content companies and they severely throttle the competition's crawlers, sometimes blocking them outright. I'm not saying it's impossible to get around, but it is certainly very difficult, and you could be thrown in prison for violating the CFAA.

gbmatt | 3 years ago | on: AI chatbots are not a replacement for search engines

Q: how might an AI algorithm be modified in order to return citations with its response?

A: There are several ways in which an AI algorithm could be modified to return citations with its responses. Here are a few possibilities:

    One approach would be to add a citation generation module to the AI algorithm. This module could be trained on a dataset of citations and could be used to generate appropriate citations for the information provided by the AI algorithm.

    Another option would be to modify the AI algorithm to include a database of references, with each reference being associated with a specific piece of information. When the AI algorithm provides a response, it could also return the appropriate reference(s) from the database.

    A third possibility would be to modify the AI algorithm to search external sources of information (such as online databases or libraries) for relevant citations whenever it provides a response. The AI could then return the relevant citations along with its response.

It is worth noting that adding a citation generation module or incorporating external sources of information into an AI algorithm can be a complex task, as it requires the development of additional algorithms and data structures to support these features. It may also require the acquisition of additional data or resources to support the citation generation process.
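The second option above (a database of references, each keyed to a specific piece of information) can be sketched minimally. Everything here is hypothetical, invented for illustration, and not taken from any real AI framework:

```python
# Hypothetical reference database: each piece of information the system
# can state is keyed to one or more citations. Names and data are
# invented for illustration only.
REFERENCE_DB = {
    "anchor text improves single-term ranking": ["Hypothetical IR Textbook, ch. 4"],
    "the web has billions of pages": ["Hypothetical Web Survey, 2020"],
}

def answer_with_citations(response_text, facts_used):
    """Pair a generated response with citations for the facts it drew on."""
    citations = []
    for fact in facts_used:
        for ref in REFERENCE_DB.get(fact, []):
            if ref not in citations:          # avoid duplicate citations
                citations.append(ref)
    return {"response": response_text, "citations": citations}
```

A real system would need the hard part this sketch skips: reliably mapping generated text back to the facts it actually used.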

gbmatt | 4 years ago | on: Lawsuit filed alleging Google is paying Apple to stay out of the search business

I just posted this same comment on the ddg story, but I'm going to post it here as well.

Google forced my search engine (gigablast) basically out of business. I had ixquick.com as a big client at one time; I was providing them with search results from my custom web search engine. Then their CEO called me one day and told me he was cancelling, even though he'd been a client for over 10 years. He said it was because of some change Google had made to their agreement. Ixquick needed Google's results and ads for their startpage.com website, and, even though my results were shown on their ixquick.com and later ixquick.eu site, apparently Google wasn't good with that.

gbmatt | 4 years ago | on: Google manipulating browser extensions to stifle competitors, DDG CEO says

Yeah, Google forced my search engine basically out of business. I had ixquick.com as a big client at one time; I was providing them with search results from my custom web search engine. Then their CEO called me one day and told me he was cancelling, even though he'd been a client for over 10 years. He said it was because of some change Google had made to their agreement. Ixquick needed Google's results and ads for their startpage.com website, and, even though my results were shown on their ixquick.com and later ixquick.eu sites, apparently Google wasn't good with that.

gbmatt | 4 years ago | on: Jack Dorsey and the Unlikely Revolutionaries Who Want to Reboot the Internet

everyone needs equal access to public data. right now only big tech can download the many web pages (without throttling or being ip banned) on linkedin (microsoft), youtube (google), facebook, github (microsoft), and billions more pages. this also leads to a gap in AI training sets, giving big tech even more entrenchment. for instance, only microsoft can build the ai coding application they did, because other companies can't access all of github without being throttled or ip banned (last time i checked, but i could be wrong now) [microsoft owns github]. regardless, we need some sort of bot 'bill of rights' to ensure equal access going forward. perhaps the answer is legislation, or perhaps it is some massive p2p proxy net. i think it is legislation, because the p2p proxy net is too hard to implement and would have to solve turing tests.

but perhaps web 3.0 (dweb) can just bypass all this nonsense and make its own versions of these popular services with baked-in accessibility for all.
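The throttling described above is the flip side of crawler politeness: a small engine has to space out its own requests per host or it gets banned anyway. A minimal sketch of a per-host crawl delay, with the delay value and names invented for illustration:

```python
from urllib.parse import urlparse

class PerHostThrottle:
    """Per-host crawl delay: wait_time() returns how many seconds a
    polite crawler should sleep before its next fetch from that host.
    A sketch only; real crawlers also honor robots.txt rules."""

    def __init__(self, delay=10.0):
        self.delay = delay    # seconds between hits to any one host
        self.next_ok = {}     # host -> earliest allowed fetch time

    def wait_time(self, url, now):
        host = urlparse(url).netloc
        start = max(now, self.next_ok.get(host, now))
        self.next_ok[host] = start + self.delay   # reserve the next slot
        return start - now
```

Even with politeness like this, a site that wants to block an upstart crawler simply will, which is the point of the comment.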

gbmatt | 4 years ago | on: Gigablast Search Engine

hey thanks for the recognition, people. :) finally, all my problems are solved. this comment is here for hacker news karma points.

gbmatt | 4 years ago | on: Ask HN: Why doesn't anyone create a search engine comparable to 2005 Google?

I'll admit I had not been working on the quality of single-term queries as much as I should have lately. However, especially for such simple queries, having a database of link text (inbound hyperlinks and the associated hypertext) is very, very important. And you don't get the necessary corpus of link text if you have a small index. So in this particular case the index size is, indeed, quite likely a factor.

And thank you for the elaborate breakdown. It is quite useful and very informative, and it was nice of you to present it.

And I'm not saying that index size is the only obstacle here. I just feel it's the biggest single issue holding Gigablast's quality back. Certainly, there are other quality issues in the algorithm, and you might have touched on some of them there.
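The link-text point above can be illustrated with a toy inverted index over anchor text. The data structures and names here are invented for illustration, not Gigablast's actual internals:

```python
from collections import defaultdict

# term -> target page -> number of inbound links whose anchor text uses the term
anchor_index = defaultdict(lambda: defaultdict(int))

def add_link(source_page, target_page, anchor_text):
    """Record the anchor text of one hyperlink pointing at target_page."""
    for term in anchor_text.lower().split():
        anchor_index[term][target_page] += 1

def rank(term):
    """Pages ordered by how many inbound links describe them with `term`."""
    hits = anchor_index.get(term.lower(), {})
    return sorted(hits, key=hits.get, reverse=True)
```

The dependence on index size falls out directly: with few pages crawled, most targets have almost no inbound anchor text, so single-term queries have nothing to rank on.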

gbmatt | 4 years ago | on: Ask HN: Why doesn't anyone create a search engine comparable to 2005 Google?

Cloudflare is not the only gatekeeper, either. Keep that in mind. There are many others, and, as an upstart search engine operator, it's quite overwhelming to have to deal with them all. Some of them have contempt for you when you approach them. I've had one gatekeeper actually list my bot as a bad actor in an example in their documentation. So, don't get me wrong, this is about gatekeepers in general, not just Cloudflare and CloudFront.

gbmatt | 4 years ago | on: Ask HN: Why doesn't anyone create a search engine comparable to 2005 Google?

It's not quite that easy. Have you ever tried it? See my post below. Basically, yes, I've done it, but I had to go through a lot and was lucky enough to even get them to listen to me. I just happened to know the right person to get me through. So, super lucky there. Furthermore, they have an AI that takes you off the whitelist if it sees your bot 'misbehave', and what counts as misbehaving is anyone's guess. So if you have a certain kind of bug in your spider, you're going to get kicked off the list. So then what? You have to try to get on the whitelist again? They have Bing and Google on special short lists so those guys don't have to sweat all these hurdles. Lastly, their UI and documentation are heavily centered around Google and Bing, so upstart search engines aren't getting the same treatment.

gbmatt | 4 years ago | on: Ask HN: Why doesn't anyone create a search engine comparable to 2005 Google?

brave 'falls back' to bing, which in my experience is most of the time. in fact, out of all the queries i did a while back, they all seemed to come directly from bing. is there a way to disable the reliance on bing and get pure 'brave only' results? and can you be more specific about what this fraction is? do you blend at all?

gbmatt | 4 years ago | on: Ask HN: Why doesn't anyone create a search engine comparable to 2005 Google?

yes, large proxy networks are potential solutions. but they cost money, you are playing a cat-and-mouse game with turing tests, and some sites require a login. furthermore, people have tried to use these to spider linkedin (sometimes creating fake accounts to log in), only to be sued by microsoft, which swings the CFAA at them. so you start off with an intellectual desire to make a nice search engine and end up getting sidetracked into this pit of muck, with microsoft trying to put you in jail. and, no, i'm not the one microsoft was suing.
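At its simplest, the proxy-network idea amounts to rotating requests across a pool of addresses. A bare-bones sketch with placeholder addresses; this does nothing about the logins or turing tests mentioned above, which is where the real cat-and-mouse game lives:

```python
import itertools

class ProxyPool:
    """Round-robin over a fixed pool of proxy addresses (placeholders).
    Real proxy networks add health checks, per-proxy cooldowns, and
    geographic spread on top of this."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next proxy address in rotation."""
        return next(self._cycle)
```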