Hacker News | item 6506727

Javascript apps can be fully crawlable (prerender.io)

201 points | beernutz | 12 years ago | 104 comments
[+] dwwoelfel|12 years ago|reply
This is a great approach, but detecting the user-agent is the wrong way to decide if you should pre-render the page. If you include the following meta tag in the header:

   <meta name="fragment" content="!">
then Google will request the page with the "_escaped_fragment_" query param. That's when you should serve the pre-rendered version of the page.

Google has documentation on this here: https://developers.google.com/webmasters/ajax-crawling/docs/... and we've been using this method at https://circleci.com for the past year.

Waiting for google to request the page with _escaped_fragment_ should also prevent you from getting penalized for slow load times or showing googlebot different content.
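The protocol described above can be sketched as a couple of small helpers (a sketch with my own function names, not prerender's API; the snapshot lookup itself is left out):

```javascript
// Sketch of the _escaped_fragment_ protocol: a crawler that sees
// <meta name="fragment" content="!"> re-requests the page with an
// _escaped_fragment_ query parameter, and that parameter (not the
// user-agent) is the signal to serve the pre-rendered snapshot.

// `query` is a parsed query-string object, as in Express's req.query.
function shouldServeSnapshot(query) {
  return Object.prototype.hasOwnProperty.call(query, '_escaped_fragment_');
}

// Map a crawler request back to the hash-bang URL it stands for, e.g.
// /page?_escaped_fragment_=key=value  ->  /page#!key=value
function fragmentToHashBang(path, query) {
  const frag = query._escaped_fragment_;
  return frag ? path + '#!' + decodeURIComponent(frag) : path;
}
```

Note that a page carrying the meta tag is re-requested with an *empty* `_escaped_fragment_=` parameter, which is why the check is for the key's presence rather than a truthy value.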

[+] Isofarro|12 years ago|reply
That Google Ajax crawler spec is no magic bullet.

Nick Denton: "Dip in uniques largely because of drop in Google refers. Pageviews (which are driven more by core audience) less affected." -- http://twitter.com/nicknotned/status/61152134929981440

Nick Denton: "Google does not fully support "hashbang" URLs. So we're eliminating them rather than waiting for Mountain View." -- http://twitter.com/nicknotned/status/61465859079671808

Nick Denton: "Yeah, I'd advise against hashbang urls. Will kill search traffic -- even if you abide by Google protocol." -- http://twitter.com/nicknotned/status/62595141927583745

[+] bceagle|12 years ago|reply
> This is a great approach

No, it is not. While this will certainly help your client side app get indexed, it is not 'great'. Other commenters on this thread bring up a number of valid concerns, but in my mind it comes down to two very simple things.

One is that when you are fighting for the top spot in organic traffic, this won't cut it. Off-page SEO is more important than on-page optimizations, but on-page optimizations still have value.

The other issue is that this approach assumes that the client side rendered view at a particular hash is exactly what should be initially rendered on the server side. While this could work in some cases, in my experience it either creates a weird user experience or forces hacks on the client side to ensure PhantomJS captures the right HTML.

This is a fine solution for some use cases, but I really hope that the community doesn't think this is the future. This is a temporary hack until we get a good server/client rendering framework in place OR all search engines evolve to capture pure client side apps without any of this.

[+] benaiah|12 years ago|reply
This is a great point. It might seem extreme, but I would advocate never using the User-Agent string to make decisions about what to serve a client. There is too much hackery and history that clouds up the User-Agent (such as every browser identifying itself as Mozilla), and it's almost always a proxy for something else that you actually want to test for.

In some rare situations, it's unavoidable, but even then I'd urge trying to rearchitect the solution to avoid it.

[+] thoop|12 years ago|reply
Thanks, I'll work on adding that. I'd still like to keep a user-agent fallback for the crawlers that might not use the _escaped_fragment_ protocol.
[+] gcb1|12 years ago|reply
No, it is not a great approach for 99% of cases.

The issue with getting content from scripted sites is not the initial render... you could use noscript and be done much more easily.

The real issue is that most sites require user interaction to get to most content. This does nothing besides providing a convenient DoS entry point.

Nice hack, though.

[+] timr|12 years ago|reply
Don't do this.

Rendering different content based on user agent is tempting the webspam gods. Rendering nothing but a big gob of javascript to non-googlebot user agents is a recipe to get the banhammer dropped on your head.

You're either gambling that Google is smart enough to know that your particular big gob of javascript isn't cloaking keyword spam (in which case you should just depend on their JS evaluation, since you already are, implicitly), or you're gambling that they won't bust you even though your site looks like a classic keyword stuffer.

[+] robmcm|12 years ago|reply
Google actually recommend you do this, provided it's the same content that is shown with JS enabled:

https://support.google.com/webmasters/answer/66353

"JavaScript: Place the same content from the JavaScript in a <noscript> tag. If you use this method, ensure the contents are exactly the same as what’s contained in the JavaScript, and that this content is shown to visitors who do not have JavaScript enabled in their browser."

This was done for years with Flash sites, and I never saw Google blacklist anyone doing it legitimately.

You can also serve different content if you want the content to be behind a paywall, although personally I find that a little annoying.

[+] benaiah|12 years ago|reply
Hiding keyword spam behind JS doesn't make any sense in this situation - the whole point is that the JS isn't being served to Google. That's who keyword spammers are trying to fool, not actual humans.
[+] _lex|12 years ago|reply
This will get you penalized for having a website that takes forever to load. This is what happens:

Googlebot requests the page -> your webapp detects Googlebot -> you call a remote service and ask them to crawl your website -> they request the page from you -> you return the regular page, with JS that modifies its look and feel -> the remote service returns the final HTML and CSS to your webapp -> your webapp returns the final HTML and CSS to Googlebot. That's going to be murder on your load times.

If this must be done, for static pages, it should be done by grunt during build time, not by a remote service. For dynamic content, it's best to do the phantomjs rendering locally, and on an hourly (or so) schedule, since it doesn't really matter if googlebot has the latest version of your content.

Or perhaps I'm mistaken and the node-module actually calls the service hourly or so and caches results on app so it doesn't actually call the service during googlebot crawls. If that's the case, I take back my objections, but I'd recommend updating the website to say as much.
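The "render on a schedule, serve crawlers from cache" variant suggested above can be sketched like this (my own sketch; `renderPage` is a placeholder for whatever PhantomJS invocation you actually use, injected here so the caching logic stays testable):

```javascript
// Snapshots are rendered ahead of time on a schedule, so a crawler
// request never waits on PhantomJS (and fake-Googlebot requests can't
// be used to hammer the renderer).
const snapshots = new Map();

function refreshSnapshots(urls, renderPage) {
  for (const url of urls) {
    snapshots.set(url, { html: renderPage(url), renderedAt: Date.now() });
  }
}

// Crawler requests read from the cache only; a miss falls through to
// the normal live-render path instead of blocking.
function getSnapshot(url) {
  const entry = snapshots.get(url);
  return entry ? entry.html : null;
}

// refreshSnapshots would be driven by setInterval or cron, e.g. hourly,
// as suggested above.
```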

[+] benatkin|12 years ago|reply
If it doesn't cache, then besides latency, someone could send fake googlebot requests and overload the prerender service, which is unlikely to be able to handle a lot of traffic.
[+] 10098|12 years ago|reply
Pretty sure the load time problem can be mitigated by caching.
[+] ilaksh|12 years ago|reply
It's not a remote service. It's PhantomJS, which is WebKit rendering on your own server. Where did they say it was going to call a remote service?
[+] Isofarro|12 years ago|reply
An entire project written to simulate progressive enhancement (badly). One that only works for specified whitelisted User-Agents, instead of being based on capability.

I also don't understand the use-case for this project. Every time the topic of "Web Apps", "JavaScript Apps", or "single-page web apps" comes up, evangelists point out that they are applications (or skyscrapers), not just fancy decorators for website content.

So exactly what is this project delivering as fallback content? A server-generated website?

This project just seems pointlessly backwards. Simulating a feature that the JavaScript framework has already deliberately broken. One that introduces a server-side dependency on a project deliberately chosen not to have a server-side framework.

This just looks like a waste of effort, when building the JavaScript application properly the first time, with progressive enhancement, covers this exact use-case, and far, far more use-cases.

The time would have been better spent fixing these evidently broken JavaScript frameworks - Angular, Ember, Backbone. Or at least fixing the tutorial documentation to explain how to build Web things properly. (This stuff isn't difficult, it just requires discipline.)

I call hokum on people saying there's a difference between Websites and Web apps (or the plethora of terms used to obfuscate that: Single-page apps, JavaScript apps). This project proves that these are just Websites, built improperly, and this is the fudge that tries to repair that for Googlebot.

[+] philbo|12 years ago|reply
+100

Why some developers are so against progressive enhancement mystifies me. It is an elegant solution that actually works in all cases rather than an ugly hack that should probably work in the majority of cases. How can there even be a dispute about it? It's insane!

[+] raynjamin|12 years ago|reply
What would you do if you required SEO enhancement AND dynamic loading of content? Are you supposed to just let that portion of the site go without indexing? Surely there are sites that have both requirements.

What's the alternative?

[+] wldlyinaccurate|12 years ago|reply
If you are able to "pre-render" a JavaScript app like this, then you should be serving users the pre-rendered version and then enhancing it with JavaScript after onload.

JavaScript-only apps are a blight on the web. All it takes is a bad SSL cert, or your CDN going down, and your pages become useless to the end-user.

[+] dchest|12 years ago|reply
> All it takes is a bad SSL cert, or your CDN going down, and your pages become useless to the end-user.

How are non-JavaScript pages protected from this?

[+] ewillbefull|12 years ago|reply
Wouldn't the user-agent-based pre-render be penalized, since Google doesn't like being shown different pages than non-Googlebot user agents see?
[+] michaelbuckbee|12 years ago|reply
Google doesn't like it when they are shown different content than a browsing user sees. This is roughly the equivalent of pointing Googlebot at a copy of the requested page that happens to be in Memcached, instead of spinning up the full app stack to do the render.
[+] eonil|12 years ago|reply
Static rendering of dynamic content? I don't think this does make sense.

If it's pre-rendered, it's missing something. If it has all the data at first, then it's not dynamic.

Pre-rendered(static) javascript app(dynamic)...? Hmm... I don't see anything more than something like JWT in JS instead of Java?

[+] FedRegister|12 years ago|reply
>Static rendering of dynamic content? I don't think this does make sense.

Bro, do you even Web 1.0? That's what CGI scripts in Perl did! Pull the data from the database, generate HTML (no JavaScript back then!) on the fly, and send it to the browser.

[+] dchest|12 years ago|reply
> Static rendering of dynamic content?

Yes.

> I don't think this does make sense.

It does, if you use one of the JS frameworks listed on the linked page.

[+] anonymous|12 years ago|reply
I was under the impression that Googlebot already executes javascript on pages.

A more interesting idea would be to do this for every user - prerender the page and send them the result, so they don't have to do the first, heavy JS execution themselves. It sounds a bit backwards at first - you're basically using JavaScript as a server-side page renderer - but think about this: you can choose whether or not to prerender based on the user-agent string - do it for people on mobile phones, but not for desktop users. You can write your entire site with just client-side page generation in JavaScript and let it run client-side at first, then switch to server-side prerendering once you have better hardware.

[+] benaiah|12 years ago|reply
Something similar to that, albeit slightly more elegant, is the work that AirBnB has done with their rendr [0] project, which serves prerendered content that's then rerendered with JS if it needs to be changed. You can do similar things with non-Backbone stacks, of course.

[0]: https://github.com/airbnb/rendr

[+] pzxc|12 years ago|reply
A better way is to do a hybrid single/multipage app as described here:

https://news.ycombinator.com/item?id=6507135

It's a multipage app, that uses ajax to function as a singlepage app. From the user's point of view it's a singlepage app, but it's accessible from any of the URLs that it pushStates to, so it's like the best of both worlds. It's fully crawlable because it functions as a multipage app, but it's got the speed of a singlepage app (if your browser supports pushState)
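The hybrid pattern described above can be sketched roughly like this (my own sketch, not from the linked comment; it assumes each page wraps its main markup in a `#content` div):

```javascript
// Hybrid multi-page/single-page navigation: every URL serves full
// server-rendered HTML (so it's crawlable), but clicks on internal
// links fetch the next page and swap only #content before pushState
// updates the address bar.

// Pure helper: pull the #content fragment out of a fetched page.
// (Naive regex extraction, fine for a sketch; a real app would use
// DOMParser in the browser.)
function extractContent(html) {
  const match = html.match(/<div id="content">([\s\S]*?)<\/div>/);
  return match ? match[1] : null;
}

// Browser-only wiring, guarded so the helper also runs under Node.
if (typeof document !== 'undefined' && window.history.pushState) {
  document.addEventListener('click', function (e) {
    const link = e.target.closest('a[href^="/"]');
    if (!link) return; // not an internal link: normal navigation
    e.preventDefault();
    fetch(link.href)
      .then(res => res.text())
      .then(html => {
        document.querySelector('#content').innerHTML = extractContent(html);
        history.pushState({}, '', link.href); // URL stays shareable/crawlable
      });
  });
  // Back/forward: the simplest correct behavior is a full reload,
  // which the multi-page fallback handles anyway.
  window.addEventListener('popstate', () => location.reload());
}
```

Browsers without pushState (or crawlers) simply follow the links and get the ordinary multi-page site, which is the graceful-degradation point being made above.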

[+] tjmehta|12 years ago|reply
I tried using PhantomJS in the past to server-side render a complex Backbone application for SEO, and it was taking over 15 seconds to return a response (which is bad for SEO).

Looking at prerender's source, I didn't see any caching mechanism.

What kind of load times have you seen rendering your apps?

Have there been recent significant improvements in PhantomJS's performance?

[+] chaddeshon|12 years ago|reply
I run http://www.brombone.com. We provide prerendered snapshots as a service.

You can get it faster than 15 seconds, but you can't really get it fast enough. We precache everything. I would strongly recommend against trying to process the pages in realtime.

[+] ivanhoe|12 years ago|reply
Still, the main problem is not solved: you risk getting penalized for serving different content to Googlebot.
[+] beernutz|12 years ago|reply
I have been looking for something like this for a long time. Seems very straightforward.

I have not tested it yet, but I wonder if the render speed will penalize you in the Google results. Seems like a separate machine with a good CPU might be worthwhile if you are going to run this.

[+] gkoberger|12 years ago|reply
I can see a lot of issues with this (slow, displaying different content to Google can get you penalized, etc)... but this is a really clever hack.

Google is less important (they already execute JS), but it's good for sites like Facebook (which doesn't execute JS when you share a link).

[+] gildas|12 years ago|reply
Shameless plug: http://seo4ajax.com

It's a SaaS that is much more elaborate than this project (there is a year of development in it). We serve and crawl thousands of pages every day without any issues.

[+] se_|12 years ago|reply
If you're using Rails, have a look at https://github.com/seojs/seojs-ruby - it's a gem similar to prerender, but it uses our managed service at http://getseojs.com/ to get the snapshots. There are also ready-to-use integrations for Apache and Nginx.

Some benefits of SEO.js over other approaches:

- it's effortless: you don't need to set up and operate your own PhantomJS server

- snapshots are created and cached in advance so the search engine crawler won't be put off by slow page loads

- snapshots are updated regularly

[+] chadscira|12 years ago|reply
I recently needed to do this for Google, but I wanted the rendering time and delivery of the page to be under 500 ms, so I hacked up something that works with Express:

https://github.com/icodeforlove/node-express-renderer

It uses PhantomJS but removes all the styles initially so the rendering time is much faster. (My Ember app was averaging 70 ms to render, but I prefetch the page data.)

[+] paulocal|12 years ago|reply
Came across this recently and it's super easy to implement.
[+] RoboTeddy|12 years ago|reply
This looks similar to Meteor's "spiderable" package

http://docs.meteor.com/#spiderable

[+] davedx|12 years ago|reply
Considering using this for my Meteor app. Do you have any experience of it?