It strikes me that the entire article/proposal is based on a faulty premise:
"After all, the value is not in the index it is in the analysis of that index."
The ability for a given search engine to innovate is based on having control of the index. The line between indexing and analysis isn't quite as clean as what is implied by the article, if only for the simple fact that you can only analyze what is in the index.
For example, at its simplest, an index is a list of which words appear in which documents on the web. But what if I want to give greater weight to words that are in document titles or headings? Then I need to somehow put that into the index.
What if I want to use the proximity between words to determine relevance of a result for a particular phrase? Need to get that info into the index, too.
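Both tweaks fit a positional inverted index: store each term's positions in the document along with a per-occurrence weight. A minimal, illustrative sketch (not any real engine's format):

```python
from collections import defaultdict

# postings[term][doc_id] -> list of (position, weight) tuples.
postings = defaultdict(lambda: defaultdict(list))

def index_document(doc_id, title, body):
    # Title terms get a higher weight than body terms.
    for pos, word in enumerate(title.lower().split()):
        postings[word][doc_id].append((pos, 2.0))
    offset = len(title.split())
    for pos, word in enumerate(body.lower().split()):
        postings[word][doc_id].append((offset + pos, 1.0))

def phrase_score(doc_id, w1, w2):
    """Score a two-word phrase by the closest distance between occurrences."""
    p1 = [p for p, _ in postings[w1].get(doc_id, [])]
    p2 = [p for p, _ in postings[w2].get(doc_id, [])]
    if not p1 or not p2:
        return 0.0
    gap = min(abs(a - b) for a in p1 for b in p2)
    return 1.0 / gap if gap else 0.0

index_document("d1", "Open Search Index", "the value is in the analysis of the index")
```

The point is that both title weighting and proximity scoring require data that must already be present in the index; they cannot be bolted on afterwards.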
In the end, what the author really wants is for someone to maintain a separate copy of the internet for bots. In order for someone to do that, they'd need to charge the bot owners, but the bot owners could just index your content for free, so why would they pay?
Three easy reasons search engine owners /might/ pay for a full copy of the web crawl:
-Faster. You don't have the latency of millions of HTTP connections, but instead a single download. (Or a few dozen. Or a van full of hard drives.)
-Easier. The problem of crawling quickly but politely has been handled for you. The reading of sitemaps has been handled for you. The problem of deciding how deep to crawl, and when to write off a subsite as an endless black hole, has been handled for you. Etc.
-Predictable. Figuring out, in advance, how much it is going to cost you to crawl some/all of the web is, to say the least, tricky. Buying a copy with a known price tag provides a measure of certainty.
Of course, I am leaving out the potential pitfalls, but the point is there /are/ arguments in favor of buying a copy of the web (and then building your own index).
You are setting up a straw man by saying "at its simplest an index is a list of what words are in what documents". That kind of index (an inverted word index) could be generated from the ideal data format, but it would be stupid to store such a rich data set in such a dumb, information-losing format.
"The index" would just be a list of key/value pairs. The key would be a URL, and the value would be the content located at that URL. There would also be some kind of metadata attached to the keys to indicate the HTTP status code, HTTP header information, last-crawled date, and any other interesting data. From this data set, other, more appropriate indexes could be generated (for example, via Hadoop).
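A sketch of what such a record and one derived index might look like (field names here are illustrative, not any real crawler's schema):

```python
import re

# One crawl record per URL: raw content plus fetch metadata.
crawl_store = {
    "http://example.com/": {
        "status": 200,
        "headers": {"Content-Type": "text/html"},
        "last_crawled": "2010-07-16T12:00:00Z",
        "content": "<html><body>open the index</body></html>",
    },
}

def derive_inverted_index(store):
    """One of many possible derived views: term -> set of URLs."""
    index = {}
    for url, rec in store.items():
        if rec["status"] != 200:
            continue
        # Crude tag stripping for the sketch; a real pipeline would parse HTML.
        text = re.sub(r"<[^>]+>", " ", rec["content"])
        for term in text.lower().split():
            index.setdefault(term, set()).add(url)
    return index
```

Because the raw content is preserved, a different engine could derive a positional index, a link graph, or anything else from the same store.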
> In the end, what the author really wants is for someone to maintain a separate copy of the internet for bots.
Yes, but not for bots. It would be for algorithms.
> In order for someone to do that, they'd need to charge the bot owners
Probably. The funding could be like ICANN's, whose long-term funding comes from groups that benefit from its services.
> but the bot owners could just index your content for free, so why would they pay?
How would you create a copy of the internet for free? Are you just going to run your crawler on your home modem while you're at work? Where are you going to store all that data? How are you going to process it? How long is that going to take? Wouldn't it just be easier to (for example) mount a shared EBS volume in EC2 that has the latest, most up-to-date crawl of the internet available for your processing?
What "index" is this article asking Google to open? The index against which they run actual queries has to be tied to Google's core search algorithms, which I doubt they'd want to make public.
So would they open an "index" of web page contents? In this case, why would another search engine access Google's "index" rather than the original server? The original server is guaranteed to be up to date, and there's no single point of failure.
There are two good reasons to access an index instead of the original server:
1) You won't DoS the host site.
2) You don't have to respect robots.txt. If you need to crawl a 1 million page site, and its robots.txt restricts you to 1 page/second, a full crawl takes over 11 days. Downloading a crawl dump from a central repository would be much easier.
From my understanding, the premise is that the 'index' generated from each crawled site will be some set of metadata smaller than the site's actual content. So instead of many robots, each crawling through all the data on your site, there could be one bot, which updates a single (smaller) index that all search engines can access.
I agree that Google's index is probably optimized to work with their search algorithm. From what the author claims, though, this doesn't mean that Google would be losing anything by allowing other engines to use the index, as "all the value is in the analysis" of the index.
I think the idea would be to have a global index that sites use to mirror their data for searching.
This database would normally be accessible only to search engines, and the sites themselves could then disallow direct bot crawling in their robots.txt.
It occurs to me that this might have to be "invite only" - Google invites the sites they trust to put their data there but if they catch someone "cheating" in one way or another, they stop indexing. Plus they wouldn't have to invite really small sites.
I guess I don't understand, if someone provides me with a storage cluster of the_whole_internet for free, won't my proprietary_search_algorithm significantly degrade the IOPS and network bandwidth of the storage? Where would it all be? In some google data center that now anyone can demand colocation in? What happens when I accidentally or maliciously slow down Bing's updates and degrade their quality? And, as others mentioned, what happens when people push data into the index that doesn't represent what they're hosting?
It seems like this would be quite a complex project with a for-the-public-good approach. Maybe it could work as an AWS project to sell Amazon compute cycles.
I suspect the only way to make this work would be for the "index" to actually be some sort of stream. It wouldn't be a "file" or "database". My guess is it would require on the order of hundreds of thousands of dollars worth of hardware just to receive the stream and run a hello-world map-reduce(-esque) calculation on it. It's your job to turn that stream into a queryable database.
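The "hello-world map-reduce" mentioned above, a word count over a stream of crawled documents, could be sketched like this (an in-memory list stands in for the actual stream):

```python
from collections import Counter
from itertools import chain

def map_doc(doc):
    # Map step: emit (word, 1) pairs for each document in the stream.
    return [(w, 1) for w in doc["content"].lower().split()]

def reduce_counts(pairs):
    # Reduce step: sum the counts per word.
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return totals

stream = [
    {"url": "http://a.example/", "content": "open the index"},
    {"url": "http://b.example/", "content": "the index the index"},
]
counts = reduce_counts(chain.from_iterable(map_doc(d) for d in stream))
```

Even this toy version illustrates the point: the consumer never holds the whole corpus at once, only the stream and whatever aggregate it is building.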
As for your last two questions, there's nothing new whatsoever about them. Search engine pollution is ancient news.
yeah, sure... let's make a system and store all kinds of information there, so people can browse it... it would be great to distribute it around the world, maybe across different companies, and sync the data every day so it keeps fresh... I don't know, maybe we could even have every person store their own data on their own private server... but of course, in an open index... </sarcasm>
I'm blinded by what you are trying to imply sarcastically.
A rough distributed model could be implemented similar to the way we (hackers/coders) use github as a central repository for a distributed system. People contributing to the index on a private server could do whatever they want, but since that instance of the index is not public, no one else will care about what the owner has done to it. Forks can be pushed to a public staging area where others can view it and verify its accuracy, and then the major players can merge those changes into their forks.
The complaint (with github) that it is hard to figure out the canonical repo is also invalid in this model, as one can start with a fork of Google or Yahoo's public repo and then build one's own through merging or hacking directly on it, just like one can fork Linus' Linux kernel and then merge in others' forks to incorporate other changes.
Remember, the index itself, as in the raw data taken in by GoogleBot or Yahoo! Slurp, would be the shared information. The analysis of the data, as in PageRank and other factors that Google decides make one page more relevant to a keyword than another, would not be shared, as that is the bread and butter of each engine.
I think this is a good idea. The whole idea of people syncing their own data doesn't work, though: it gives too much room for people to fudge their data into the system so it favours them more.
There would also be a fight over who gets to be the aggregator of the information. Whoever distributes it would have a stranglehold on the industry in terms of how and when it supplies this information.
I can see its uses, but I can equally see a lot of ways for the system not to work, or some serious amount of antitrust trouble.
If you could get an unbiased 3rd party involved to build the database, though, then I think that would work.
For the record, the Google exec (Berthier Ribeiro-Neto) is the co-author of "Modern Information Retrieval" [1], an excellent book and close to a standard text on IR.
I can second the recommendation of that book; I've heard a lot of good things, though I haven't read it. It's recently been updated in a 2nd edition [2], though I have no idea if there are substantive changes. Presumably there are, given that more than a decade has elapsed. If anyone's read the updated version, I'd appreciate knowing if and/or how the book's changed; I've been thinking about picking it up.
I have read pretty big sections of Manning's Introduction to IR, and it served me fairly well as an introduction to the field. It's available online. [3]
[1] http://www.amazon.com/Modern-Information-Retrieval-Ricardo-B...
[2] http://www.amazon.com/Modern-Information-Retrieval-Concepts-...
[3] http://nlp.stanford.edu/IR-book/information-retrieval-book.h...
We're talking about the cache, right? The index, or more likely indices, are optimized data structures used to search the cache. I doubt Google could share those without revealing too much about their ranking algorithm.
Letting sites inject into the cache is an interesting idea, but Google will still have to spider periodically to ensure accuracy. Inevitably, a large number of sites will just screw it up, because the internet is mostly made of fail. This would leave Google with only bad options: If they delist all the sites to punish them, they leave a significant hole in their dataset. But if they don't punish them and just silently fix it by spidering, there is no longer any threat to keep the black hat SEOs in check. Either way, it would cause an explosion in support requirements and Google is apparently already terrible at that.
I think the idea was that only Google will crawl your site and update the index, then the rest of the search engines will use the index instead of hitting your site.
"Each of these robots takes up a considerable amount of my resources. For June, the Googlebot ate up 4.9 gigabytes of bandwidth, Yahoo used 4.8 gigabytes, while an unknown robot used 11.27 gigabytes of bandwidth. Together, they used up 45% of my bandwidth just to create an index of my site."
I don't suppose anyone has considered making an entry in robots.txt that says either:
last change was : <parsable date>
Or a URL list of the form
<relative_url> : <last change date>
There are a relatively small number of robots (a few tens, perhaps) which crawl your web site; all of the legit ones provide contact information either in the User-Agent header or on their web site. If you let them know you had adopted this approach, they could very efficiently avoid crawling your site.
That solves two problems:
- Web sites that sit on the back end of ADSL lines but don't change often wouldn't have their bandwidth chewed by robots.
- The search index would be up to date, so if someone who needed to find you hit that search engine, they would still find you.
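A crawler could consume the proposed listing with a few lines of code. A sketch, assuming the hypothetical `<relative_url> : <last change date>` format above (no real crawler supports this):

```python
from datetime import datetime, timezone

def parse_change_list(text):
    """Parse lines of '<relative_url> : <last change date>' into a dict."""
    changes = {}
    for line in text.strip().splitlines():
        url, _, stamp = line.partition(" : ")
        changes[url.strip()] = datetime.fromisoformat(stamp.strip())
    return changes

def needs_recrawl(changes, url, last_crawled):
    # URLs absent from the listing are crawled anyway, to be safe.
    changed = changes.get(url)
    return changed is None or changed > last_crawled

listing = """
/index.html : 2010-07-01T08:00:00+00:00
/about.html : 2009-01-01T00:00:00+00:00
"""
changes = parse_change_list(listing)
```

A bot would fetch the listing once per visit and then hit only the URLs whose change date is newer than its own last crawl.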
There's already a place for doing this in the HTTP protocol: the Last-Modified response header and If-Modified-Since conditional requests. I would assume that crawlers respect this, if provided, although I haven't tested to verify my expectation.
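The HTTP mechanism in question is the Last-Modified / If-Modified-Since pair: the client sends the timestamp of its cached copy, and the server answers 304 Not Modified if nothing changed. A minimal sketch (no network; just building the header and interpreting the status code):

```python
from email.utils import formatdate

def conditional_headers(last_seen_unixtime):
    """Request headers for a conditional GET based on the last crawl time."""
    return {"If-Modified-Since": formatdate(last_seen_unixtime, usegmt=True)}

def interpret(status):
    # 304 Not Modified: the cached copy is still current, skip re-download.
    return "use-cache" if status == 304 else "refetch"
```

A polite crawler that used this on every request would pay only a header round-trip, not a full page download, for unchanged pages.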
1. If I were Microsoft, I wouldn't trust Google's index. How do I know they aren't doing subtle things to the index to give them an advantage?
2. Having the resources to keep a live snapshot of the web is one of the big players' advantages. Opening the index, while good for the web, would not necessarily be good for the company. Google could mitigate that by licensing the data: for data more than X hours old, you get free access; for data newer than that, you pay a license fee to Google. Furthermore, integrate the data with Google's cloud hosting to provide a way to trivially create map/reduce implementations that use the data.
3. On the other side, what a great opportunity the index could provide for startups. Maintaining a live index of the web is costly and getting more and more difficult as people lock down their robots.txt. Being able to immediately test your algorithms against the whole web would be a godsend for ensuring your algorithms work with the huge dataset and that your performance is sufficient.
Here's to hoping Google goes forward with it!
The first step would be for some top companies (Google, Yahoo...) to share the index. That way, there would be some speed up of the internet, and the index would not be open to abuse by arbitrary people/companies.
The author should use something like "crawl data" instead of "index". An index is the end result of analyzing crawled web pages.
It's a cool idea though because Yahoo sucks up a ton of my bandwidth and delivers very little in SEO traffic. On most of my sites now I have a Yahoo bot specific Crawl-Delay in robots.txt of 60 seconds, which pretty much bans them.
Maybe each site should be able to designate who indexes it, and robots can get that index from that indexer. Let the indexers compete. Let each site decide how frequently it can be indexed. Allow the indexer that gets the business to use the index immediately, with others getting access just once a day. Perhaps a standardized raw sharable index format could be created, with each search company processing it further for their own needs after pulling it.
And let the site notify the indexer when things change, so all the bandwidth isn't used looking for what's changed. Actual changes could make it into the index more quickly if the site could draw attention to them immediately, rather than an army of robots having to invade as frequently as inhumanly possible. The selected indexer could still visit once a day or week to make sure nothing gets missed.
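The notification the site pushes could be as simple as a signed list of changed URLs. A sketch with made-up field names (no such protocol exists; the token is a stand-in for whatever authentication the indexer requires):

```python
import json

def change_notification(site, changed_urls, secret_token):
    """Payload a site might push to its designated indexer when pages change."""
    return json.dumps({
        "site": site,
        "changed": changed_urls,
        "token": secret_token,  # lets the indexer reject spoofed pings
    })
```

On receipt, the indexer would verify the token and queue only the listed URLs for recrawl, instead of sweeping the whole site.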
Their attitude is to take everything in but not to let you automate searches to get data out.
This is the biggest problem I have with search engines - you want to deep index all my sites? Fine, but you better let me search in return - deeper than 1000 results (and ten pages). Give us RSS, etc.
The whole article is about information in its rawest form and has nothing to do with searchable content.
You would write something that takes the information they are referring to in this article; it's how you digest and index that information yourself that makes the difference.
For _most_ of the websites, it's in _their_ interest to have a good SERP ranking, not the other way around.
It strikes me that both in the article and in most comments people have no idea of what they are talking about, and yet they boldly carry on.
"The index"? Feature extraction is the most complex part of almost any machine learning algorithm, and search is no different. Indexing full text documents is a really difficult task, especially if you take inflected languages into account (English is particularly easy).
I don't see a way to "open the index" without disclosing and publishing a huge amount of highly complex code, that also makes use of either large dictionaries, or huge amounts of statistical information. It's not like you can just write a quick spec of "the index" and put it up on github.
FWIW, I run a startup that wrote a search engine for e-commerce (search as a service).
I don't think it's quite that simple. The index that Google serves search query results from is a direct result of the algorithms they've applied to the data the Googlebot has gathered. If by 'index' the author means the data the Googlebot (for example) has downloaded from the internet, that's quite a bit different, but it still probably serves the purpose the author is looking for. The index is a highly specialized representation of all the data they've collected.
Does it seem naive to anyone else to allow site owners to update the index and stop spidering? First, lots of people, for various reasons (ignorance, security through obscurity), would just not update it, and stuff would fall out of search.
Second, this seems incredibly ripe for abuse. As if we don't have enough search spam problems already, letting spammers have more direct access to the content going into their rankings seems like a truly bad idea.
When spiders use more bandwidth than customers, your website must not be very popular. It implies that each page is viewed only a handful of times per month on average.
Edit: SmugMug seems to fall into this category: http://don.blogs.smugmug.com/2010/07/15/great-idea-google-sh...
Also interesting:
And if you think about it, the robots are much harder to optimize for – they’re crawling the long tail, which totally annihilates your caching layers. Humans are much easier to predict and optimize for.