It's probably just that a short URL is a heuristic for an important page. www.site.com/section is probably more important than www.site.com/section/subsection/detail/page/5/comments. A good move for the crawler: don't get distracted by "deep" pages, and try to stick to high-level ones first.
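To make the heuristic concrete, here is a minimal sketch of a crawl frontier that pops the shallowest URL first and breaks ties by total length. It is only an illustration of the idea in this comment, not how Googlebot is known to work; `path_depth`, `crawl_order` and the `http://` scheme on the example URLs are assumptions added for the example.

```python
import heapq
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Count the non-empty path segments of a URL."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def crawl_order(urls):
    """Yield URLs shallowest first, breaking ties by total string length."""
    heap = [(path_depth(u), len(u), u) for u in urls]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)[2]


# Example URLs from the comment, with an assumed http:// scheme so that
# urlsplit() treats the host and the path separately.
urls = [
    "http://www.site.com/section/subsection/detail/page/5/comments",
    "http://www.site.com/section",
    "http://www.site.com/section/subsection",
]

for url in crawl_order(urls):
    print(path_depth(url), url)
```

A heap is used rather than a one-off sort because a real frontier keeps receiving new URLs while it is being drained.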
Edit: this would also encourage webmasters to use short URLs, which benefits users by being easier to remember, too.
How well-known is this to webmasters? If it's not well-known I cannot see how the second part could be. But the first could very much be true, and is what I would've surmised.
There is probably some highly non-obvious reason that sorting your queue of URLs by length is optimal, which was arrived at after a lot of modelling and testing.
We're unlikely to ever know the answer unless someone from google explains it to us.
The site adheres to a strict URL structure: /state/city/id/schoolname. Entering from the homepage, the only way to crawl the site one level at a time would be crawling the shortest URLs first. This structure is also emphasised in the breadcrumbs on every page, so the shortest URLs are also the ones with the most internal links.
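A level-at-a-time crawl is just a breadth-first traversal with a FIFO queue. The sketch below runs one over a tiny made-up link graph that follows the /state/city/id/schoolname pattern; the page names in `LINKS` are hypothetical and only there to show that the visit order comes out shallowest first.

```python
from collections import deque

# Hypothetical internal link graph for a /state/city/id/schoolname site.
LINKS = {
    "/": ["/tx", "/ca"],
    "/tx": ["/tx/austin"],
    "/ca": ["/ca/fresno"],
    "/tx/austin": ["/tx/austin/101/example-high"],
    "/ca/fresno": ["/ca/fresno/202/sample-elementary"],
}


def bfs_crawl(start, links):
    """Visit pages in the order a FIFO frontier would discover them."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


order = bfs_crawl("/", LINKS)
# Number of path segments for each visited page, in visit order.
depths = [len([seg for seg in page.split("/") if seg]) for page in order]
print(list(zip(depths, order)))
```

On a strictly hierarchical site, depth and URL length grow together, so a plain FIFO crawl from the homepage would be hard to tell apart from a deliberate sort by length.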
I did a more scientific analysis of the googlebot requests in the provided log (graph! http://i.imgur.com/uMoUT.png) and it definitely looks like it is taking shortest urls first. Anyone else with a large site want to check as well for further data?
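For anyone who wants to run the same check on their own logs, here is a rough sketch. It assumes combined-format access log lines and measures how often consecutive Googlebot requests have a non-decreasing path length. The regex, the function names and the sample lines are all made up for illustration (the sample exists only to exercise the parser and says nothing about real Googlebot behaviour).

```python
import re

# Picks the request path and the user agent out of a combined-format log line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_paths(lines):
    """Return request paths, in log order, for lines whose agent mentions Googlebot."""
    paths = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("agent"):
            paths.append(match.group("path"))
    return paths


def nondecreasing_fraction(paths):
    """Fraction of consecutive request pairs where the path did not get shorter."""
    pairs = list(zip(paths, paths[1:]))
    if not pairs:
        return 1.0
    return sum(len(a) <= len(b) for a, b in pairs) / len(pairs)


BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Synthetic log lines, not taken from any real log.
sample = [
    '192.0.2.1 - - [01/Jan/2012:00:00:01 +0000] "GET /tx HTTP/1.1" 200 512 "-" "%s"' % BOT,
    '192.0.2.1 - - [01/Jan/2012:00:00:02 +0000] "GET /tx/austin HTTP/1.1" 200 512 "-" "%s"' % BOT,
    '192.0.2.9 - - [01/Jan/2012:00:00:03 +0000] "GET /zzz HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.1 - - [01/Jan/2012:00:00:04 +0000] "GET /ca HTTP/1.1" 200 512 "-" "%s"' % BOT,
    '192.0.2.1 - - [01/Jan/2012:00:00:05 +0000] "GET /tx/austin/101/example-high HTTP/1.1" 200 512 "-" "%s"' % BOT,
]

paths = googlebot_paths(sample)
print(paths, nondecreasing_fraction(paths))
```

A value close to 1.0 over a long crawl would be consistent with shortest-first ordering; matching on the user agent string alone does not prove the requests really came from Google.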
I imagine that in this specific case it is because longer URLs represent deeper pages on the site that are less "important" (in terms of internal incoming links and PageRank) than the shorter ones. It doesn't seem logical that Google orders the URLs by length and then crawls them in that order; URL length is probably a factor that the bot takes into account, but not the only one in the manner this article suggests :)
Anyway, good point; it deserves more testing before drawing any conclusions.
The question that I have is whether this is a relative behavior (i.e. whether, for a given domain 'domain.com' Google prioritizes domain.com/short-url over domain.com/longer/url.html) or a global one (i.e. prioritizing short.com/url over very-long-domain.com/nested/pages/hierarchy.html, all else equal).
I can definitely see the local/relative effects being a natural consequence of prioritizing by pagerank, but the global part sounds more like a separate signal.
A poor man's PageRank algorithm, assuming nothing else, would assign a higher PageRank to shorter site links on a page. Presumably the crawler visits pages with a higher PR first.
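As a rough check of that intuition, the sketch below runs the textbook PageRank power iteration over a toy hierarchical site where every page links to its children and to its breadcrumb ancestors. The graph, the damping factor of 0.85 and the iteration count are assumptions for the example; this is not Google's actual ranking, and it assumes every page has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; assumes every page has outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site: children are linked from their parent, and breadcrumbs
# link every page back to all of its ancestors.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/a/1", "/a/2"],
    "/b": ["/", "/b/1", "/b/2"],
    "/a/1": ["/", "/a"],
    "/a/2": ["/", "/a"],
    "/b/1": ["/", "/b"],
    "/b/2": ["/", "/b"],
}

ranks = pagerank(SITE)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{ranks[page]:.3f} {page}")
```

In this toy graph the homepage ranks highest, section pages next and leaf pages last, so visiting pages in rank order would also look like visiting the shortest URLs first.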
PageRank determines crawl rate, among other things. I also find it likely that short URLs (especially in the case of a directory-type site) are seen first by the spider and that this order is respected by the crawler (FIFO).
You can also ask in #seo on irc.freenode.net. I know there are knowledgeable SEO people in there who might be able to provide you with a decent answer.
Can't confirm this. Sitemaps are used for discovery: the URLs listed in the sitemap get pushed into the 'discovered URLs queue', then this queue is prioritized for crawling, and, if there are no other factors, the shorter URLs get prioritized higher (as there is a bigger chance that a shorter URL is the canonical version of a longer URL than the other way round).
Maybe because that's how they are sorted in the hashtable/database they use to queue urls? Or maybe because they want to index the shortest pages first, so that they are processed before any duplicates with longer urls (i.e. get /articles/ before /articles/index.php)
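The duplicate-handling idea can be sketched like this: process URLs shortest first and let the first URL seen for each piece of content become the canonical one. The `(url, fingerprint)` pairs and the fingerprint strings are hypothetical stand-ins for whatever content hashing a real crawler would use.

```python
def pick_canonicals(pages):
    """Map each content fingerprint to the shortest URL that serves it.

    pages is an iterable of (url, fingerprint) pairs.
    """
    canonical = {}
    # Shortest URL first; ties are broken alphabetically to stay deterministic.
    for url, fingerprint in sorted(pages, key=lambda p: (len(p[0]), p[0])):
        canonical.setdefault(fingerprint, url)
    return canonical


# Made-up fingerprints: the first two URLs serve identical content.
pages = [
    ("/articles/index.php", "hash-articles"),
    ("/articles/", "hash-articles"),
    ("/about/", "hash-about"),
]

print(pick_canonicals(pages))
```

Because /articles/ is handled before /articles/index.php, the longer URL is recognised as the duplicate rather than the other way round.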
I suspect you're on the right track with your first guess.
Most people posting here are looking for some sort of deep meaning in this when IMO it is more likely just due to a localized side-effect of doing something such as storing the urls in a trie-like structure and then iterating over it breadth-first.
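Here is a small sketch of that side effect: a character-level trie built from nested dicts, walked breadth-first. Depth in the trie equals string length, so shorter URLs come out first no matter what order they were inserted in. The trie layout and the function names are assumptions for the example; nothing here is known about Google's actual storage.

```python
from collections import deque

END = object()  # sentinel key marking that a stored URL ends at this node


def trie_insert(root, url):
    """Insert a URL into a nested-dict trie, one character per level."""
    node = root
    for ch in url:
        node = node.setdefault(ch, {})
    node[END] = True


def trie_bfs(root):
    """Yield stored URLs in breadth-first order, i.e. shortest first."""
    queue = deque([("", root)])
    while queue:
        prefix, node = queue.popleft()
        if END in node:
            yield prefix
        for ch, child in node.items():
            if ch is not END:
                queue.append((prefix + ch, child))


# URL shapes borrowed from the blog example in this thread, inserted out of order.
urls = [
    "sitename.com/year/mo/day/stub-one-two",
    "sitename.com/year/mo/day/stub",
    "sitename.com/year/mo/day/stub-one",
]

root = {}
for url in urls:
    trie_insert(root, url)

print(list(trie_bfs(root)))
```

A segment-level trie (one node per path component) would instead order URLs by directory depth, which would not separate URLs that differ only in the length of their last segment.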
[+] [-] AshleysBrain|14 years ago|reply
[+] [-] orijing|14 years ago|reply
[+] [-] cma|14 years ago|reply
[+] [-] JonnieCache|14 years ago|reply
[+] [-] JonnieCache|14 years ago|reply
[+] [-] esryl|14 years ago|reply
why would you crawl the site in any other way?
[+] [-] foxhop|14 years ago|reply
[+] [-] underdown|14 years ago|reply
I could see crawling pages most likely to have changed first as those pages would most likely lead to fresh content.
[+] [-] personalcompute|14 years ago|reply
[+] [-] foxhop|14 years ago|reply
[+] [-] meow|14 years ago|reply
[+] [-] personalcompute|14 years ago|reply
[+] [-] christianwilde|14 years ago|reply
[+] [-] orijing|14 years ago|reply
Does anyone have insights?
[+] [-] arn|14 years ago|reply
it's not because of sitemap or because of url structure or because of dynamic content.
Mine were blog articles in the same format. This is how it was crawled:
sitename.com/year/mo/day/stub
sitename.com/year/mo/day/stub-one
sitename.com/year/mo/day/stub-one-two
sitename.com/year/mo/day/stub-one-two-three
sitename.com/year/mo/day/stub-one-two-three-four
[+] [-] _grrr|14 years ago|reply
[+] [-] TuxPirate|14 years ago|reply
[+] [-] TuxPirate|14 years ago|reply
[+] [-] abrudtkuhl|14 years ago|reply
[+] [-] bauchidgw|14 years ago|reply
[+] [-] ignifero|14 years ago|reply
[+] [-] georgemcbay|14 years ago|reply