Kavelach | 4 years ago

Just block the domain. At first, you could block manually, but we know Google doesn't like doing things that way. Fortunately, they have a lot of heuristics to find sites like that; usually the content is just copied from another source. And since they scrape the web all the time, they should know which content appeared first, and where.
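The "copied from another source" signal the comment mentions can be approximated with a classic near-duplicate heuristic: compare word shingles between pages with Jaccard similarity. This is only a hedged sketch of the general technique, not Google's actual detection pipeline; the example strings are invented.

```python
def shingles(text, k=5):
    """Split text into overlapping k-word shingles (lowercased)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original  = "the quick brown fox jumps over the lazy dog near the river bank"
scraped   = "the quick brown fox jumps over the lazy dog near the old mill"
unrelated = "completely different article about cooking pasta at home tonight"

# A scraper that lightly edits the source still shares most shingles;
# an unrelated page shares essentially none.
print(jaccard(shingles(original), shingles(scraped)))    # high
print(jaccard(shingles(original), shingles(unrelated)))  # near zero
```

Combined with crawl timestamps, the earlier-crawled copy would be treated as the original and the later one flagged as a candidate scrape.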

But the issue isn't that they can't; the issue is that they don't want to. Why do sites with copied content exist? To earn money through ads. And what earns Google money? Ads!

IndexPointer | 4 years ago

Any simple heuristic has false positives, meaning they'll end up taking down legitimate sites that repeated content for a good reason. Say, for example, two sites quoting text from the US Constitution. The second one to be crawled would be considered spam copying the first one and removed from web results. Then you'll get comments on Hacker News complaining that Google is censoring it for political reasons.

And any simple heuristic is quickly reverse engineered by SEOs, who will find a way to mask it as legitimate.

tl;dr it's a hard problem.

Kavelach | 4 years ago

They could use the heuristics to build a list of domains to block and then have someone review it. After doing it for a long time, they could build a neural model on top of that, and automate it.
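The workflow proposed above (heuristics flag candidates, a human reviews them, and the accumulated decisions become training data) can be sketched as a simple review queue. Everything here is hypothetical: the class, domain names, and method names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """Hypothetical human-in-the-loop blocklist pipeline."""
    pending: list = field(default_factory=list)   # heuristic-flagged domains
    blocklist: set = field(default_factory=set)   # human-confirmed spam
    cleared: set = field(default_factory=set)     # confirmed false positives

    def flag(self, domain):
        """A heuristic flags a domain for human review (skip known ones)."""
        if domain not in self.blocklist | self.cleared:
            self.pending.append(domain)

    def review(self, domain, is_spam):
        """A human reviewer confirms or clears a flagged domain."""
        self.pending.remove(domain)
        (self.blocklist if is_spam else self.cleared).add(domain)

q = ReviewQueue()
q.flag("copied-content.example")
q.flag("legit-quotes.example")
q.review("copied-content.example", is_spam=True)
q.review("legit-quotes.example", is_spam=False)
# The labeled (domain, decision) pairs in blocklist/cleared are exactly
# the training data a later model could learn from to automate reviews.
```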

As I said, the reason they don't do it is not that they lack the skills and know-how.