top | item 24898877

vmatouch | 5 years ago

For more generic web indexing you need to use a browser: you no longer index pages served by a server, you index pages rendered by JavaScript apps in the browser. So as part of the "fetch" stage I usually delegate parsing of the title and other page metadata to a JavaScript script running inside the browser (using https://www.browserless.io/), and then as part of the "parse" phase I use cheerio to extract links and such. It is very tempting to do everything in the browser, but architecturally it does not belong there, so you need to find the balance that works best for you.

mrskitch | 5 years ago

Thanks for the mention! I'm the founder of browserless.io, and agree with pretty much everything you're saying.

Our infrastructure actually follows this procedure for some of our own scraping needs: we scrape puppeteer's GH documentation page to build out our debugger's autocomplete tool. To do this, we "goto" the page, extract the page's content, and then hand it off to nodejs libraries for parsing. This has two benefits: it cuts down the time you have the browser open and running, and lets you "offload" some of that work to your back-end with more sophisticated libraries. You get the best of both worlds with this approach, and it's one we generally recommend to folks everywhere. It's also a great way for us to "dogfood" our own product :)
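The "keep the browser session short" pattern above can be sketched like so. The browser calls are stubbed; in real code they would be puppeteer's `page.goto()` and `page.content()`. The structural point is that only the raw HTML grab happens inside the browser session, and all parsing runs afterwards in plain Node, so the expensive browser can be released as early as possible.

```javascript
// Stand-in for the browser session. In real code roughly:
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   await page.goto(url);
//   const html = await page.content();
//   await browser.close();   // browser is released immediately
async function grabHtml(url) {
  return '<ul><li>page.goto(url)</li><li>page.content()</li></ul>';
}

// Back-end parsing, after the browser is already closed. A real
// autocomplete builder would use a proper HTML parser; this regex
// is only illustrative.
function extractItems(html) {
  return [...html.matchAll(/<li>([^<]*)<\/li>/g)].map(m => m[1]);
}

(async () => {
  const html = await grabHtml('https://pptr.dev/');
  console.log(extractItems(html)); // e.g. method names for autocomplete
})();
```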

paulpro | 5 years ago

What is the reason you are not just getting the page content directly with an HTTP request? Does a headless browser provide some benefit in your case?

domenicd | 5 years ago

Maintainer of jsdom here. jsdom will run the JavaScript on a page, so it can get you pretty far in this regard without a proper browser. It has some definite limitations, most notably that it doesn't do any layout or handling of client-side redirects, but it allows scraping of most single-page client-side-rendered apps.

mnmkng | 5 years ago

Not necessarily. It is true that most websites today are JavaScript heavy. However, they are server-side rendered more often than not. Mostly for performance reasons. Also, not all search engines are as good as Google at indexing dynamic JS websites, so it's better to serve pre-rendered HTML for that reason as well.
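One practical consequence of this point: a crawler can inspect the raw HTTP response first and only fall back to a full browser when the markup looks like an empty client-side shell. The heuristic and threshold below are made up for illustration, not a production rule.

```javascript
// Decide whether a page likely needs browser rendering: strip
// scripts, styles, and tags from the raw response and see how much
// readable text remains. Server-rendered pages carry their content
// in the initial HTML; client-rendered shells carry almost none.
function needsBrowser(rawHtml) {
  const text = rawHtml
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim();
  return text.length < 50; // illustrative threshold
}

// Server-rendered page: plenty of text in the initial response.
const ssr = '<html><body><article>' + 'Long article text. '.repeat(10) +
            '</article></body></html>';
// Client-rendered shell: just a mount point and a bundle.
const csr = '<html><body><div id="root"></div>' +
            '<script src="/bundle.js"></script></body></html>';

console.log(needsBrowser(ssr)); // false
console.log(needsBrowser(csr)); // true
```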