For more generic web indexing you need to use a browser. You no longer index pages served by a server; you index pages rendered by JavaScript apps in the browser. So as part of the "fetch" stage I usually leave parsing of the title and other page metadata to a JavaScript snippet running inside the browser (using https://www.browserless.io/), and then as part of the "parse" phase I use cheerio to extract links and such. It is very tempting to do everything in the browser, but architecturally it does not belong there, so you need to find the balance that works best for you.
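The fetch/parse split described above can be sketched roughly like this. Everything here is an assumption for illustration: the browserless WebSocket endpoint, the metadata fields pulled in the browser, and the regex link extractor, which is a dependency-free stand-in for cheerio so the parse stage stays self-contained:

```javascript
// "Fetch" stage: let the browser render the page and report its metadata.
// Assumed setup: puppeteer-core connecting to a browserless.io endpoint.
async function fetchPage(url) {
  const puppeteer = require('puppeteer-core'); // assumed dependency
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.browserless.io', // hypothetical endpoint
  });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // Title and other metadata are parsed inside the browser, as described above.
  const meta = await page.evaluate(() => ({
    title: document.title,
    description: document.querySelector('meta[name="description"]')?.content,
  }));
  const html = await page.content();
  await browser.close();
  return { meta, html };
}

// "Parse" stage: runs on the back-end, outside the browser.
// In practice you'd use cheerio; this regex version just keeps the sketch runnable.
function extractLinks(html, baseUrl) {
  const links = [];
  for (const m of html.matchAll(/<a\b[^>]*\bhref\s*=\s*["']([^"'#]+)["']/gi)) {
    try {
      links.push(new URL(m[1], baseUrl).href); // resolve relative hrefs
    } catch {
      // skip malformed URLs
    }
  }
  return links;
}
```

The point of the split: the browser only does what requires a rendered DOM, while link extraction and anything else that works on raw HTML moves to the back-end.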
mrskitch|5 years ago
Our infrastructure actually does this for some of our scraping needs: we scrape Puppeteer's GitHub documentation page to build out our debugger's autocomplete tool. To do this, we "goto" the page, extract the page's content, and then hand it off to Node.js libraries for parsing. This has two benefits: it cuts down the time the browser is open and running, and it lets you offload some of that work to your back-end, where more sophisticated libraries are available. You get the best of both worlds with this approach, and it's one we generally recommend to folks. Also a great way for us to "dogfood" our own product :)
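A rough sketch of the goto / extract / hand-off flow described in this comment. The docs page structure (method signatures in `<h3>` headings) and the regex parser are assumptions for illustration, not the actual pipeline:

```javascript
// Keep the browser session as short as possible: render, grab HTML, close.
async function fetchRenderedHtml(url) {
  const puppeteer = require('puppeteer'); // assumed dependency
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.content();
  } finally {
    await browser.close(); // browser time ends here; parsing happens after
  }
}

// Back-end parsing step: pull candidate autocomplete entries out of the HTML.
// A regex stand-in for a real parser like cheerio, so the sketch is runnable;
// the <h3> heading convention is a hypothetical page layout.
function extractMethodHeadings(html) {
  return [...html.matchAll(/<h3[^>]*>([^<]+)<\/h3>/gi)].map((m) => m[1].trim());
}
```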