csiegert | 1 year ago

I’ve got two questions:

1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?

2. The linked page says to avoid Disallow in robots.txt and to rely on the noindex tag. But how can I prevent googlebot from crawling all user profiles to avoid database hits, bandwidth, etc. without an entry in robots.txt? With noindex, googlebot must visit each user profile page to see that it is not supposed to be indexed.
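For concreteness, the two mechanisms being weighed look roughly like this (the /users/ path is just a placeholder for the profile URLs):

    # robots.txt: stops crawling of matching URLs, but a blocked URL can still end up indexed
    User-agent: *
    Disallow: /users/

    <!-- per-page noindex: stops indexing, but only if the crawler is allowed to fetch the page -->
    <meta name="robots" content="noindex">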

seanwilson | 1 year ago

https://developers.google.com/search/docs/crawling-indexing/...

   "Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it."
It's counterintuitive, but if you want a page to never appear in Google Search, you need to flag it as noindex and not block it via robots.txt.
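For reference, the noindex rule can be set either in the page's HTML or as an HTTP response header (the header form also covers non-HTML resources such as PDFs), roughly like this:

    <!-- in the <head> of the page -->
    <meta name="robots" content="noindex">

    # or sent as a response header
    X-Robots-Tag: noindex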

> 1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?

It'll usually list the URL with a description like "No information is available for this page". This can happen, for example, if the page has a lot of backlinks, is blocked via robots.txt, and is missing the noindex flag.

dazc | 1 year ago

'But how can I prevent googlebot from crawling all user profiles to avoid database hits..'

If user profiles are noindexed, why should you care if Google is crawling them, when almost every other crawler out there does not obey robots.txt?

It's not in Google's interest to waste resources on non-indexable content; you are worrying far too much about it.