Google Cache is fully dead

440 points| r721 | 1 year ago |seroundtable.com

237 comments

[+] luizfelberti|1 year ago|reply
> Then a couple of weeks ago, added [direct] links to the Wayback Machine

Hopefully they are also making substantial donations to the Internet Archive, since they will be directing a lot of traffic into it and basically using their infrastructure as a feature on their main product...

EDIT:

Apparently they are collaborating, but there are not many details [0]

[0] https://blog.archive.org/2024/09/11/new-feature-alert-access...

[+] mrkramer|1 year ago|reply
>Hopefully they are also making substantial donations to the Internet Archive, since they will be directing a lot of traffic into it and basically using their infrastructure as a feature on their main product

The WebArchive link is hidden so deep in the "About the source" page that the vast majority of Google users won't even know that it exists.

There is an excellent browser extension called Web Archives[0] that brings all major web archiving services, e.g. Archive.is, Wayback Machine and others, together in one place.

[0] https://github.com/dessant/web-archives

[+] krackers|1 year ago|reply
It'd be absolutely foolish if the agreement wasn't contingent on funding. I assume the reason it's not explicitly stated was some sort of NDA (since IA is also involved in turmoil and Google doesn't want to be part of that).
[+] gibibit|1 year ago|reply
I hope Google is NOT going to be a significant source of funding for the Internet Archive. Because I want to trust Wayback Machine and the Internet Archive to be unbiased.

Google likes to influence search results, hiding ones it doesn't like, and elevating those that the Company supports. Wayback Machine has been very reliable so far, I hope it stays that way.

[+] account42|1 year ago|reply
IA needs an alternative - an independent backup archive - more than it needs funding. Unless IA funding exceeds the entire US copyright lobbying industry there is always a chance they will cease to exist without enough notice to save the data somewhere else.

There is also the matter of what IA will be able to archive. With the machine learning gold rush, more and more site operators see dollar bills in front of them and are restricting who can crawl their content. Google is in a special position here because almost no one can afford not to be crawled by Google, which is what made their cache especially valuable in addition to the IA.

[+] runxel|1 year ago|reply
Very sad to see it gone. It was always some kind of last resort. Internet Archive is lovely, don't get me wrong, but it relies mostly on people actively queueing up sites to save.

So most of the time for more obscure sites where the bitrot was already in place and they aren't loading anymore you could use the Google cache to get something out of it – where IA had nothing.

[+] DaoVeles|1 year ago|reply
I do worry about the future of IA. Simply because of some of their reckless moves with their book lending policy, they have opened themselves up to being bled dry financially. That, plus the amount of copyright infringement openly available on the site, is just waiting to be attacked.

I am waiting for Nintendo to get wind of the huge ROM dumps on there; it is not going to be pretty. No amount of 'moral high ground' will defend against lawyers.

[+] iamleppert|1 year ago|reply
Google Cache was useful because sometimes you could not find a term or keyword on the web site, but it would be in the cache. It also helped with sites that have gone offline or no longer have the item. "It's still in the Google Cache!" You can't say that anymore.

I use Google less and less these days. What's the point when you can just ask an LLM, and it gives you an answer within seconds, with no ads? You can ask for references and links and it will give those to you too. I don't think I've ever been given a link to an SEO content farm, whereas with Google search it's the entire page. Google Search feels like Yahoo was (maybe even worse) right before it died and was replaced with Bing.

[+] deanCommie|1 year ago|reply
This still happens all the time.

* I search a keyword
* I see a google result
* I see the keyword IN THE PREVIEW on Google
* I click on the link
* No keyword

And this isn't hidden SEO spam stuff, it was literally removed. The cache doesn't match the live result.

No recourse.

[+] EasyMark|1 year ago|reply
LLM… no ads…. *For now
[+] neop1x|1 year ago|reply
Google's index sometimes also contains content that is behind a paywall or cookie wall. Two major sites in Czechia started implementing cookie walls, which is against GDPR, but our local office for data privacy is not acting, so it seems they are probably paid by those websites...
[+] cyberax|1 year ago|reply
I used cache a lot, not just to view sites, but see the text versions of PDF and Word documents. RIP.
[+] bjord|1 year ago|reply
oh, wow, same! this comment just made me realize that some of my older projects will no longer work after this
[+] ThinkBeat|1 year ago|reply
I would presume Google still has all this data. They just will not let anyone else use it.

Could this be an advantage that Google can use to train their models on but others won't have access?

Does Google want it to be more difficult to notice rewrites? Have journalists found valuable information with it too often?

[+] selectodude|1 year ago|reply
I feel like the internet archive has taken a lot of that sort of use off of Google.

Unrelated: Google should probably think about a sizable donation to the Internet archive.

[+] zepearl|1 year ago|reply
> I would presume Google still has all this data. ...

Maybe - I guess that they must have served that "cached" content from DB-records that had it all saved directly (URL X has contents Y => basically a "mirror" of the terms that they indexed) => not having to store that "mirror" (only the search index) might save quite a lot of storage space (and I/O and CPU to decompress it, as users won't be requesting it anymore) => all in all that might save quite a lot of infrastructure costs $$$.

> Could this be an advantage that Google can use to train their models on but others won't have access?

Maybe (if they decided to just get rid of the I/O related to the user requests), but on the other hand I don't know if previously any "Google-consumer" was ever able to perform mass-downloads of Google's "cached" data - could that be done without being banned by Google's webpage (or API)?

[+] advisedwang|1 year ago|reply
As I understand it, Google does a decent amount of rendering of a page before indexing; this a) allows it to index content loaded by JS and b) prevents some ways spammers show Google different content from users. Perhaps Google's main way of storing a page no longer matches something that can be easily served as a cache page. This might be a way to remove a legacy copy of each page and reduce storage costs.
[+] lofaszvanitt|1 year ago|reply
Just as with YouTube, the surface area of these services is getting smaller and smaller and you get less and less. Too much optimization to the detriment of users. All the while search is still rooted in 90s concepts and only serves as a money making thing.
[+] bigstrat2003|1 year ago|reply
I am genuinely surprised to learn that it even still existed. I'm pretty sure it's been years since I have seen a Google result which actually had a cached version for me to pull up.
[+] JonChesterfield|1 year ago|reply
One fewer reason to use Google search. Solid effort killing the money printer all around.
[+] karlzt|1 year ago|reply
One more reason to not use Google search, I don't remember when it was the last time I used it, perhaps like twelve years ago.
[+] sandyarmstrong|1 year ago|reply
This was really useful when looking for product support, as companies regularly pull down or move around pages on their website. Seeing the version of a page at the time google associated it as a result was something I did all the time.
[+] RachelF|1 year ago|reply
Sadly, not knowing what used to be, erases history.

“The past was alterable. The past never had been altered. Oceania was at war with Eastasia. Oceania had always been at war with Eastasia.” ― George Orwell, 1984

[+] arshdeep79|1 year ago|reply
Ah, the memories! I remember, in my early years, I was migrating a WordPress site to a new server. The db backup got corrupted in the process. Google cache helped me restore the blog entries. Crazy days!
[+] xnx|1 year ago|reply
Any solid evidence on why, or why now? I have to assume the additional interest in crawling/scraping data for AI precipitated this. Why deal with all the messiness of crawling the web at large when you can use a Google search and cache: results as your RAG?
[+] progmetaldev|1 year ago|reply
The answer could be to push users to their AI offerings, or possibly due to bots scraping up the cached data for their own AI models, where Google wasn't making a profit off providing the data. Most likely the feature wasn't used enough for them to care, and they couldn't find a way to monetize it to make it worth keeping around.
[+] account42|1 year ago|reply
Probably yes. Or websites that Google has made scraping deals with don't want a cache of their content to be publicly available, and the easiest thing to do was to just turn off the public cache completely.
[+] 0x_rs|1 year ago|reply
Too bad. It was a great complement to the increasingly unreliable IA, whose list of blacklisted websites just keeps skyrocketing for opaque reasons. I'm guessing it's still available internally, along with snapshots going far, far back in time.
[+] A_D_E_P_T|1 year ago|reply
> Too bad. It was a great complement to the increasingly unreliable IA, whose list of blacklisted websites just keeps skyrocketing for opaque reasons

This could be due to site owners contacting the IA and requesting their site be permanently removed from the archive. It's not as easy as pressing a button, but it's not difficult to have your site removed.

I don't think that the IA itself makes editorial decisions as to which sites to include and which to blacklist. It's more likely that the blacklist is a voluntary opt-in thing...

[+] mattigames|1 year ago|reply
Many years ago, Google Cache saved a site I used to maintain/own. Classic funny story: I accidentally deleted the production database when I was trying to migrate it, but luckily all the data needed to recreate the latest posts (the most important for this Japanese music-download-links WordPress site) was stored in HTML attributes and some tags, so I created a script to scrape it all from Google Cache and recreated the DB as best as I could.
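
A minimal sketch of that kind of recovery script, assuming the cached pages had already been saved as HTML strings. The tag and attribute names (`article`, `data-post-id`, `data-title`) are hypothetical, since the comment does not say which attributes held the data; it only illustrates pulling records back out of markup with Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect post records stored in HTML attributes of a saved cache page.

    The tag and attribute names used here (article, data-post-id,
    data-title) are hypothetical; a real page needs its own selectors.
    """

    def __init__(self):
        super().__init__()
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attr_map = dict(attrs)
        if tag == "article" and "data-post-id" in attr_map:
            self.posts.append(
                {
                    "id": int(attr_map["data-post-id"]),
                    "title": attr_map.get("data-title", ""),
                }
            )


def extract_posts(cached_html):
    """Return the list of post records found in one cached HTML document."""
    parser = PostExtractor()
    parser.feed(cached_html)
    return parser.posts
```

Each returned dictionary could then be written back into a fresh database table, one row per post.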
[+] nashashmi|1 year ago|reply
What are the chances of wayback machine removing snapshots? I found an article on something that is far too taboo to talk about these days that was removed from the newspaper after having it there for more than 5 years. Out of public pressure.
[+] dimensi0nal|1 year ago|reply
If it's important, it should go in archive.is. Sites have always been able to remove their own content from Wayback Machine.
[+] selimthegrim|1 year ago|reply
What was the article? Is this in Pak?
[+] blackeyeblitzar|1 year ago|reply
I really don’t understand killing this useful feature. Between this and the search results being bad, I don’t have much of a reason to visit Google anymore.
[+] matt-p|1 year ago|reply
On an unrelated note, could IA be charging companies training AI for access to an API with all their data, or an enormous data dump?

Presumably historical context is quite useful for some cases, and if they can access new content like books etc. then that'd be another benefit.

It is a win-win for site owners who currently have everyone and their dog crawling their site.

[+] lithos|1 year ago|reply
Historical data, or data from before AI spam, is the most valuable. Makes sense to pull up the ladders from competitors.
[+] probably_wrong|1 year ago|reply
> [Google Cache] was meant for helping people access pages when way back, you often couldn't depend on a page loading. These days, things have greatly improved. So, it was decided to retire it.

I wish I knew what he's talking about - not only are sites disappearing left and right, but even those that remain will often change so quickly that your search term is nowhere to be found.

My cynical guess: websites want Google to index them so they show full versions of their articles knowing they won't be penalized for that. Everybody else gets a paywall, but Google Cache let everyone bypass them. Faced with the choice between users and companies, Google threw the users under the bus.

[+] wakeupcall|1 year ago|reply
> will often change so quickly that your search term is nowhere to be found

About 5 years ago I was often pulling up the cache to see if the indexed/cached page actually contained the search terms I was looking up, suspecting the site was serving a different page compared to what I was redirected to.

The number of websites doing this to game SEO was (and I suspect still is) substantial, despite google saying they're penalizing this behavior.

Outlets serving full articles to google then presenting you an unreadable mess, often downgraded through JS, is one of the most egregious, and google doesn't seem to care anyway.

This was before I gave up completely on google giving me pages containing the terms I was looking for.
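
A rough sketch of the check described above, assuming you already have the text of the indexed copy and of the live page as plain strings. The function names are hypothetical and the matching is a simple case-insensitive word comparison, not how any search engine actually matches terms.

```python
import re


def missing_terms(query, page_text):
    """Return the query terms that do not appear as words in page_text."""
    words = set(re.findall(r"\w+", page_text.lower()))
    return [term for term in query.lower().split() if term not in words]


def content_differs(query, indexed_text, live_text):
    """True when the indexed copy contains every term but the live page does not."""
    indexed_ok = not missing_terms(query, indexed_text)
    live_ok = not missing_terms(query, live_text)
    return indexed_ok and not live_ok
```

A `True` result from `content_differs` is only a hint that the site changed or served different content to the crawler; it cannot tell those two cases apart.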

[+] cyberax|1 year ago|reply
Google has allowed sites to disable caching since forever. They could also serve the full content only to Google's bots, since Google publishes their IP ranges.
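
As an illustration of the IP-range approach, here is a minimal sketch using Python's standard `ipaddress` module. The ranges listed are reserved documentation blocks used as placeholders, not Google's real ranges, and `is_crawler_ip` is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges for illustration only (reserved documentation blocks).
# A real deployment would load the crawler operator's published list instead.
CRAWLER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_crawler_ip(remote_addr, ranges=CRAWLER_RANGES):
    """Return True if remote_addr falls inside any of the given networks."""
    address = ipaddress.ip_address(remote_addr)
    return any(address in network for network in ranges)
```

A site could call `is_crawler_ip` on the request's source address to decide whether to serve the full article or the paywalled version.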
[+] Arbortheus|1 year ago|reply
That’s sad. I liked that feature a lot.
[+] Giorgi|1 year ago|reply
Cache was an invaluable tool for journalists all over the world, especially in today's fast-moving, information-overloaded world where powerful people try to rewrite history all the time. It sucks.

Sometimes I wonder if it really was a burden for Google?

[+] nomilk|1 year ago|reply
Can someone ELI5 what google cache was and why it was important? Was it essentially a wayback machine alternative? People are upset about its removal; curious to understand why.
[+] nikeee|1 year ago|reply
You could access a snapshot of the page that was taken when Google was indexing it. It is helpful if the site content changed or was removed shortly after Google indexed it. Such changes often led to wrong search result preview texts, whose source you could still find in the cache. The Internet Archive has a different focus, and you may not find there the missing information that Google had indexed.