item 29892122

IndexPointer | 4 years ago

Any simple heuristic has false positives, meaning they'll end up taking down legitimate sites that repeated content for a good reason. Say, for example, two sites quoting text from the US Constitution. The second one to be crawled would be flagged as spam copying the first and removed from web results. Then you'd get comments on Hacker News complaining that Google is censoring it for political reasons.
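To make the false positive concrete, here's a minimal sketch of the kind of naive duplicate-content heuristic being described, using word shingles and Jaccard similarity. The shingle size, the threshold, and the two example pages are all hypothetical, not anything Google actually uses:

```python
# Toy duplicate-content heuristic: pages sharing too many word n-grams
# ("shingles") get flagged as copies. Shingle size and threshold are
# made-up values for illustration.

def shingles(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # Similarity of two shingle sets: |intersection| / |union|.
    return len(a & b) / len(a | b)

quote = ("We the People of the United States, in Order to form a more "
         "perfect Union, establish Justice, insure domestic Tranquility")

# Two unrelated, legitimate sites that both quote the same passage.
site_a = "Constitutional history blog. " + quote
site_b = "Civics teaching resources. " + quote

sim = jaccard(shingles(site_a), shingles(site_b))
# The long shared quote dominates both pages, so similarity is high and
# a naive threshold would flag the later-crawled page as a spam copy.
print(sim > 0.5)
```

Both pages are legitimate, yet the heuristic can't tell a scraped copy from a shared public-domain quote; that's exactly the false positive described above.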

And any simple heuristic is quickly reverse-engineered by SEOs, who will find a way to make their spam look legitimate.

tl;dr it's a hard problem.

Kavelach | 4 years ago

They could use the heuristics to build a list of candidate domains to block and then have someone review it. After doing that for a while, they could train a neural model on top of the reviewed decisions and automate it.
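The pipeline suggested above can be sketched as three stages: a cheap heuristic nominates candidates, a human confirms or rejects them, and the labeled decisions accumulate as training data for a later model. Everything here (the signal, the threshold, the domain names) is a hypothetical illustration:

```python
# Hypothetical heuristic-then-review pipeline. The "duplicate pages"
# signal, the 0.8 threshold, and the domains are invented for the sketch.

def heuristic_score(stats):
    # Toy signal: fraction of a domain's pages that duplicate content
    # already seen elsewhere on the web.
    return stats["duplicate_pages"] / stats["total_pages"]

def nominate(domains, threshold=0.8):
    # Stage 1: heuristic nominates candidate domains, it never blocks.
    return [d for d, stats in domains.items()
            if heuristic_score(stats) > threshold]

def human_review(candidates, verdicts):
    # Stage 2: a person confirms or rejects each candidate. The verdict
    # dict stands in for the manual review step.
    labeled = [(d, verdicts[d]) for d in candidates]
    blocklist = [d for d, is_spam in labeled if is_spam]
    # Stage 3: the labeled pairs become training data for a model that
    # eventually automates the review.
    return blocklist, labeled

domains = {
    "scraper.example": {"duplicate_pages": 95, "total_pages": 100},
    "civics.example":  {"duplicate_pages": 85, "total_pages": 100},
    "blog.example":    {"duplicate_pages": 5,  "total_pages": 100},
}
candidates = nominate(domains)  # both heavy duplicators get nominated
blocklist, training_data = human_review(
    candidates, {"scraper.example": True, "civics.example": False})
print(blocklist)
```

Note the key point of the design: the heuristic alone would have blocked `civics.example` (the Constitution-quoting false positive), but the human review stage catches it, and that correction is exactly the label a later model needs.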

As I said, the reason they don't do it isn't that they lack the skills and know-how.