> Then a couple of weeks ago, added [direct] links to the Wayback Machine
Hopefully they are also making substantial donations to the Internet Archive, since they will be directing a lot of traffic into it and basically using their infrastructure as a feature on their main product...
EDIT:
Apparently they are collaborating, but there aren't many details [0]
>Hopefully they are also making substantial donations to the Internet Archive, since they will be directing a lot of traffic into it and basically using their infrastructure as a feature on their main product
The WebArchive link is hidden so deep in the "About the source" page that the vast majority of Google users won't even know it exists.
There is an excellent browser extension called Web Archives [0] that brings all the major web archiving services, e.g. Archive.is, the Wayback Machine, and others, into one place.
It'd be absolutely foolish if the agreement wasn't contingent on funding. I assume the reason it's not explicitly stated was some sort of NDA (since IA is also involved in turmoil and Google doesn't want to be part of that).
I hope Google is NOT going to be a significant source of funding for the Internet Archive. Because I want to trust Wayback Machine and the Internet Archive to be unbiased.
Google likes to influence search results, hiding ones it doesn't like, and elevating those that the Company supports. Wayback Machine has been very reliable so far, I hope it stays that way.
IA needs an alternative - an independent backup archive - more than it needs funding. Unless IA funding exceeds the entire US copyright lobbying industry there is always a chance they will cease to exist without enough notice to save the data somewhere else.
There is also the matter of what IA will be able to archive. With the machine learning gold rush, more and more site operators see dollar bills in front of them and are restricting who can crawl their content. Google is in a special position here because almost no one can afford not to be crawled by Google, which is what made their cache especially valuable in addition to the IA.
Very sad to see it gone. It was always some kind of last resort. Internet Archive is lovely, don't get me wrong, but it relies mostly on people actively queueing up sites to save.
So most of the time for more obscure sites where the bitrot was already in place and they aren't loading anymore you could use the Google cache to get something out of it – where IA had nothing.
I do worry about the future of IA. Simply because of some of their reckless moves with their book lending policy, they have opened themselves up to being bled dry financially. That, plus the amount of copyright infringement openly available on the site, is just waiting to be attacked.
I am waiting for Nintendo to get wind of the huge ROM dumps on there; it is not going to be pretty. No amount of 'moral high ground' will defend against lawyers.
Google Cache was useful because sometimes you couldn't find a term or keyword on the web site itself, but it would be in the cache. Or for sites that have gone offline, or no longer have the item. "It's still in the Google Cache!" You can't say that anymore.
I use Google less and less these days. What's the point when you can just ask an LLM, and it gives you an answer within seconds, with no ads? You can ask for references and links and it will give those to you too. I don't think I've ever been given a link to an SEO content farm, whereas with Google Search it's the entire page. Google Search feels like Yahoo did (maybe even worse) right before it died and was replaced with Bing.
Google's index sometimes also contains content that is behind a paywall or cookie wall. Two major sites in Czechia started implementing cookie walls, which is against the GDPR, but our local office for data privacy is not acting, so it seems they are probably paid by those websites...
> I would presume Google still has all this data. ...
Maybe - I guess they must have served that "cached" content from DB records that saved it all directly (URL X has contents Y, basically a "mirror" of the pages they indexed). Not having to store that mirror, only the search index, might save quite a lot of storage space, plus the I/O and CPU to decompress it, since users won't be requesting it anymore. All in all, that might save quite a lot of infrastructure cost.
> Could this be an advantage that Google can use to train their models on but others won't have access?
Maybe (if they decided to just get rid of the I/O related to the user requests), but on the other hand I don't know if previously any "Google-consumer" was ever able to perform mass-downloads of Google's "cached" data - could that be done without being banned by Google's webpage (or API)?
As I understand it, Google does a decent amount of rendering of a page before indexing; this (a) allows it to index content loaded by JS and (b) prevents some of the ways spammers show Google different content than they show users. Perhaps Google's main way of storing a page no longer matches something that can easily be served as a cache page. This might be a way to remove a legacy copy of each page and reduce storage costs.
Just as with YouTube, the surface area of these services is getting smaller and smaller, and you get less and less. Too much optimization to the detriment of users. All the while, search is still rooted in '90s concepts and only serves as a money-making thing.
I am genuinely surprised to learn that it even still existed. I'm pretty sure it's been years since I have seen a Google result which actually had a cached version for me to pull up.
This was really useful when looking for product support, as companies regularly pull down or move around pages on their websites. Seeing the version of a page from the time Google indexed it was something I did all the time.
Sadly, not knowing what used to be, erases history.
“The past was alterable. The past never had been altered. Oceania was at war with Eastasia. Oceania had always been at war with Eastasia.”
― George Orwell, 1984
Ah, the memories! I remember, in my early years, I was migrating a WordPress site to a new server. The DB backup got corrupted in the process. Google Cache helped me restore the blog entries. Crazy days!
Any solid evidence on why, or why now? I have to assume the additional interest in crawling/scraping data for AI precipitated this. Why deal with all the messiness of crawling the web at large when you can use a Google search and cache: results as your RAG?
The answer could be to push users to their AI offerings, or possibly due to bots scraping up the cached data for their own AI models, where Google wasn't making a profit off providing the data. Most likely the feature wasn't used enough for them to care, and they couldn't find a way to monetize it to make it worth keeping around.
Probably yes. Or websites that Google has made scraping deals with don't want a cache of their content to be publicly available, and the easiest thing to do was to just turn off the public cache completely.
Too bad. It was a great complement to the increasingly unreliable IA, whose list of blacklisted websites just keeps skyrocketing for opaque reasons. I'm guessing it's still available internally, along with snapshots going far, far back in time.
> Too bad. It was a great complement to the increasingly unreliable IA, whose list of blacklisted websites just keeps skyrocketing for opaque reasons
This could be due to site owners contacting the IA and requesting their site be permanently removed from the archive. It's not as easy as pressing a button, but it's not difficult to have your site removed.
I don't think that the IA itself makes editorial decisions as to which sites to include and which to blacklist. It's more likely that the blacklist is a voluntary opt-in thing...
Many years ago Google Cache once saved a site I used to maintain/own. Classic funny story: I accidentally deleted the production database when I was trying to migrate it, but luckily all the data needed to recreate the latest posts (the most important ones for this Japanese music-download-links WordPress site) was stored in HTML attributes and some tags, so I created a script to scrape it all from Google Cache and recreated the DB as best as I could.
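A recovery like that can be sketched roughly with the standard library. The markup here (an `<a class="post">` pattern with the data in `title`/`href` attributes) is invented for illustration; the real cached pages would have their own structure:

```python
# Hypothetical sketch: recover post data from saved cache HTML.
# Assumes posts are marked up as <a class="post" href="..." title="...">,
# which is a made-up convention standing in for the real site's markup.
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect (title, href) pairs from anchors tagged with class 'post'."""

    def __init__(self):
        super().__init__()
        self.posts = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "post" in a.get("class", "").split():
            self.posts.append((a.get("title"), a.get("href")))


def extract_posts(cached_html: str):
    parser = PostExtractor()
    parser.feed(cached_html)
    return parser.posts


# Example with a stand-in snippet of "cached" HTML:
snippet = '<a class="post" href="/x" title="Song X">Song X</a>'
print(extract_posts(snippet))  # [('Song X', '/x')]
```

From here, each recovered pair could be re-inserted into a fresh database, which is roughly what the comment describes doing.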
What are the chances of the Wayback Machine removing snapshots? I found an article on something that is far too taboo to talk about these days, which was removed from the newspaper, out of public pressure, after having been there for more than 5 years.
I really don’t understand killing this useful feature. Between this and the search results being bad, I don’t have much of a reason to visit Google anymore.
> [Google Cache] was meant for helping people access pages when way back, you often couldn't depend on a page loading. These days, things have greatly improved. So, it was decided to retire it.
I wish I knew what he's talking about - not only are sites disappearing left and right, but even those that remain will often change so quickly that your search term is nowhere to be found.
My cynical guess: websites want Google to index them so they show full versions of their articles knowing they won't be penalized for that. Everybody else gets a paywall, but Google Cache let everyone bypass them. Faced with the choice between users and companies, Google threw the users under the bus.
> will often change so quickly that your search term is nowhere to be found
About 5 years ago I was often pulling up the cache to see if the indexed/cached page actually contained the search terms I was looking up, suspecting the site was serving a different page compared to what I was redirected to.
The number of websites doing this to game SEO was (and I suspect still is) substantial, despite google saying they're penalizing this behavior.
Outlets serving full articles to google then presenting you an unreadable mess, often downgraded through JS, is one of the most egregious, and google doesn't seem to care anyway.
This was before I gave up completely on google giving me pages containing the terms I was looking for.
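The check described above (does the page you land on actually contain the terms you searched for?) can be sketched minimally. This is a crude illustration: it strips tags naively with a regex and does no JS rendering, so it only gives a rough signal:

```python
# Crude sketch: test whether a page's HTML actually contains given
# search terms, as a rough cloaking/staleness check.
import re


def page_contains_terms(html: str, terms):
    """Return {term: bool} after naive tag stripping and lowercasing."""
    text = re.sub(r"<[^>]+>", " ", html).lower()
    return {t: t.lower() in text for t in terms}


# Stand-in HTML instead of a live fetch:
html = "<html><body>Google Cache retired</body></html>"
print(page_contains_terms(html, ["cache", "wayback"]))
# {'cache': True, 'wayback': False}
```

A real version would fetch the served page and compare it against the cached copy, but the core test is just this kind of term lookup.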
Cache was an invaluable tool for journalists all over the world, especially in today's fast-moving, information-overloaded world where powerful people try to rewrite history all the time. It sucks.
Sometimes I wonder if it really was a burden for Google?
Can someone ELI5 what google cache was and why it was important? Was it essentially a wayback machine alternative? People are upset about its removal; curious to understand why.
You could access a snapshot of the page that was taken when Google was indexing it.
It was helpful if the site content changed or was removed shortly after Google indexed it. This often led to wrong search result preview texts, which you could still find in the cache. The Internet Archive has a different focus, and you may not find the missing information there that Google had indexed.
[0] https://blog.archive.org/2024/09/11/new-feature-alert-access...
[0] https://github.com/dessant/web-archives
deanCommie | 1 year ago
* I search a keyword
* I see a Google result
* I see the keyword IN THE PREVIEW on Google
* I click on the link
* No keyword
And this isn't hidden SEO spam stuff, it was literally removed. The cache doesn't match the live result.
No recourse.
ThinkBeat | 1 year ago
Could this be an advantage that Google can use to train their models on but others won't have access?
Google wants it to be more difficult to notice rewrites? Journalists too often have found valuable information with it?
selectodude | 1 year ago
Unrelated: Google should probably think about a sizable donation to the Internet Archive.
matt-p | 1 year ago
Presumably historical context is quite useful for some cases, and if they can access new content like books etc., then that'd be another benefit.
It is a win-win for site owners, who currently have everyone and their dog crawling their site.