snehesht | 3 years ago

In the world of SPAs (single-page applications), a headless browser API is super helpful. Playwright[1] and Puppeteer[2] are both very good choices.

[1] https://github.com/microsoft/playwright

[2] https://github.com/puppeteer/puppeteer

lofatdairy|3 years ago

Highly recommend Playwright (if I'm not mistaken, most of the core developers from Puppeteer were hired by MS to work on Playwright). I run into significantly fewer async/await problems with Playwright than I did with Puppeteer, and the codegen tool is super helpful as a first pass.

snehesht|3 years ago

Playwright integrates with a lot of different browsers (Chromium, Firefox, and WebKit), whereas Puppeteer mainly targets Chrome.
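
For what it's worth, here's a rough sketch of rendering the same SPA in a chosen engine with Playwright's Python bindings (Python only to keep it short; the Node API is analogous). It assumes `pip install playwright` and `playwright install` have been run. `render_page` and `SUPPORTED_BROWSERS` are names I made up, and waiting for `networkidle` is just one heuristic for "the SPA has finished loading", not the only way to do it.

```python
from typing import Tuple

# The three engines Playwright ships; hypothetical constant name.
SUPPORTED_BROWSERS: Tuple[str, ...] = ("chromium", "firefox", "webkit")


def render_page(url: str, browser_name: str = "chromium") -> str:
    """Return the HTML of `url` after client-side JavaScript has run."""
    if browser_name not in SUPPORTED_BROWSERS:
        raise ValueError(f"unknown browser: {browser_name!r}")

    # Imported lazily so this module still loads where Playwright
    # is not installed (assumption: it is installed wherever you call this).
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # p.chromium, p.firefox and p.webkit are the browser launchers.
        browser = getattr(p, browser_name).launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so SPA content is rendered.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()
```

Something like `render_page("https://example.com", "firefox")` should then hand back the rendered markup from Firefox instead of Chromium.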

mynameismon|3 years ago

Also useful is the ability to open the Network panel, snoop on requests, and find the exact API call you need for your task, instead of having to pull in all of the HTML/JS/CSS crap. Since a lot of SPAs have essentially pushed everything behind JSON APIs, all the information is usually one (authenticated) API call away.
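
A minimal sketch of that approach, using only the Python standard library. The bearer-token auth scheme and the names `build_api_request` and `fetch_json` are assumptions for illustration; copy the real URL and headers from whatever request you find in the Network panel.

```python
import json
import urllib.request
from typing import Any, Callable, Optional


def build_api_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for a JSON endpoint found in the Network panel."""
    headers = {"Accept": "application/json"}
    if token:
        # Assumption: the site uses bearer tokens; many use cookies instead.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def fetch_json(
    url: str,
    token: Optional[str] = None,
    opener: Callable[..., Any] = urllib.request.urlopen,
    timeout: float = 10.0,
) -> Any:
    """Call the endpoint directly and decode the JSON body.

    `opener` is injectable so the function can be exercised without
    touching the network.
    """
    request = build_api_request(url, token)
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The point is that no browser is involved at all: once you know the endpoint, a plain HTTP client is enough, as long as the site lets you authenticate the same way the SPA does.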

XzAeRosho|3 years ago

Most content-heavy websites that tend to be scraped use server-side rendering for this exact reason, and put many obstacles in the way to make sure the data doesn't get scraped easily. See: product price, stock, delivery information.

snehesht|3 years ago

If you're interested in running Puppeteer in containers, take a look at chrome-aws-lambda[1] and the browserless Docker container[2].

Not affiliated with browserless, but they do have a free/paid cloud service. https://www.browserless.io

[1] https://github.com/alixaxel/chrome-aws-lambda

[2] https://github.com/browserless/chrome

btown|3 years ago

https://chrome.browserless.io/ is perhaps the best technical demo I've ever seen, and shows off Browserless's capabilities amazingly. An incredibly high-quality service and codebase.