Great feedback, agree I need to filter here. Some website localization is very hard to work around, because the site will try to geolocate your bot's IP address and redirect it to a given language...
The issue I was having was with the query "term+wikipedia": it shows the Wikipedia article in Czech, Hungarian, Russian, what looks like Arabic, and others before finally showing the English version. On top of that, a lot of those results occur 2, 3, 4+ times with the same URL, differing only in crawl time by a few minutes.
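For the repeated-URL part, a rough sketch of collapsing duplicates on the results side might look like the following. It assumes each result is a dict with `url` and `crawl_time` keys, which are made-up field names for illustration and not the actual schema of the index.

```python
from typing import Iterable


def dedupe_by_url(results: Iterable[dict]) -> list[dict]:
    """Keep one entry per URL, preferring the most recent crawl.

    Assumes hypothetical 'url' and 'crawl_time' keys on each result;
    'crawl_time' only needs to be comparable (e.g. a Unix timestamp).
    """
    latest: dict[str, dict] = {}
    for result in results:
        url = result["url"]
        # Replace the stored entry only if this crawl is newer.
        if url not in latest or result["crawl_time"] > latest[url]["crawl_time"]:
            latest[url] = result
    # Dict order follows the first time each URL was seen.
    return list(latest.values())
```

This only catches exact URL matches; the same page reached via slightly different URLs would still show up more than once.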
It's a difficult problem to fix. You can set an Accept-Language header on crawl requests, but this only works if the target website uses content negotiation. Some sites ignore the header and determine language based on the IP address (Geo-IP) or the URL structure (e.g., /es/ vs /en/). Basically a mess...
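To make that concrete, here is a minimal sketch using only the Python standard library. `build_request` sets the Accept-Language header, and `language_from_url` is a hypothetical helper that guesses the language from the URL structure so non-English pages can be filtered after the fact. The language-subdomain check (as in cs.wikipedia.org) and the regex are my own assumptions, and none of this helps against Geo-IP redirects.

```python
import re
from typing import Optional
from urllib.parse import urlparse
from urllib.request import Request

# Matches codes like "en", "es", or "pt-br" (an assumption; real sites vary).
LANG_SEGMENT = re.compile(r"^[a-z]{2}(?:-[a-z]{2})?$", re.IGNORECASE)


def build_request(url: str, lang: str = "en") -> Request:
    """Build a request that asks for `lang`; only honored by sites
    that actually do content negotiation."""
    return Request(url, headers={"Accept-Language": f"{lang};q=1.0, *;q=0.1"})


def language_from_url(url: str) -> Optional[str]:
    """Guess a language code from the first path segment or the subdomain."""
    parsed = urlparse(url)
    first_segment = parsed.path.strip("/").split("/")[0]
    if LANG_SEGMENT.match(first_segment):
        return first_segment.lower()
    host_parts = (parsed.hostname or "").split(".")
    if len(host_parts) > 2 and LANG_SEGMENT.match(host_parts[0]):
        return host_parts[0].lower()
    return None  # no language hint in the URL
```

It's a heuristic, so a two-letter path like /tv/ would be misread as a language code; checking against a list of known codes would cut down on that.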
1718627440|1 month ago
saltysalt|1 month ago