top | item 43722118

(no title)

peterjliu | 10 months ago

Another advantage is that people want the Google bot to crawl their pages, which is not true of the crawlers run by most AI companies.


Reddit was an interesting case here. They knew that they had particularly good AI training data, and they were able to withhold it from the Google crawler. That was an awfully high-risk play given how important Google search results are to Reddit ads, but they likely knew that Reddit search results were also really important to Google. I would love to have been able to watch those negotiations from each side; what a crazy high-stakes negotiation that must have been.

mattlondon|10 months ago

Particularly good training data?

You can't mean the bottom-of-the-barrel dross that people post on Reddit, so I'm not sure what data you are referring to. Click-stream?

mmaunder|10 months ago

This is an underrated comment. Yes, it's a big advantage and probably a measurable pain point for Anthropic and OpenAI. In fact, you could just survey a 1% sample of the robots.txt files out there and get a reasonable picture. Maybe a fun project for an HN'er.
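For anyone who wants to try the robots.txt survey, here is a minimal sketch of the counting step, using Python's standard `urllib.robotparser`. It assumes `GPTBot` and `ClaudeBot` are the crawler tokens to compare against `Googlebot`, and the two-file `SAMPLE` is made up for illustration; collecting a real 1% sample of sites is left out.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens to compare. GPTBot and ClaudeBot are assumed here to be the
# user-agent tokens for OpenAI and Anthropic; adjust the list as needed.
AGENTS = ["Googlebot", "GPTBot", "ClaudeBot"]


def allowed_agents(robots_txt, agents=AGENTS, path="/"):
    """Return {agent: bool} saying whether each agent may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


def survey(robots_files, agents=AGENTS):
    """Fraction of sampled robots.txt files that let each agent fetch the root."""
    total = len(robots_files)
    if total == 0:
        return {}
    counts = {agent: 0 for agent in agents}
    for text in robots_files:
        for agent, ok in allowed_agents(text, agents).items():
            counts[agent] += ok  # True counts as 1, False as 0
    return {agent: counts[agent] / total for agent in agents}


# Hypothetical sample: one site that admits only Googlebot, one open to all.
SAMPLE = [
    "User-agent: Googlebot\nAllow: /\n\nUser-agent: *\nDisallow: /\n",
    "User-agent: *\nAllow: /\n",
]

if __name__ == "__main__":
    for agent, share in survey(SAMPLE).items():
        print(f"{agent}: allowed at the root by {share:.0%} of sampled sites")
```

This only checks the root path, so a site that blocks a crawler from some sections but not others would still count as allowing it. A real survey would probably want to test a few representative paths per site.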

newfocogi|10 months ago

This is right on. I work for a company with somewhat of a data moat and AI aspirations. We spend a lot of time blocking everyone's bots except for Google's. We have people whose entire job is to make it faster for Google to access our data. We exist because Google accesses our data. We can't not let them have it.

jiocrag|10 months ago

Excellent point. If they can figure out how to either remunerate or drive traffic to third parties in conjunction with this, it would be huge.