top | item 32569210

mnmkng | 3 years ago

Exactly. Dynamic websites need to pull their data from somewhere as well; there's no magic behind it. Either all the data is in the initial payload in some form (not necessarily HTML), or it's downloaded later, again over HTTP.
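A minimal sketch of the first case, under the assumption (common but hypothetical here) that the page embeds its data as a JSON blob in a `<script>` tag in the initial HTML payload, so no browser is needed to recover it:

```python
import json
import re

# Hypothetical page: the "dynamic" data is already present in the
# initial HTML payload as embedded JSON, not rendered markup.
SAMPLE_HTML = """
<html><body>
<script id="app-data" type="application/json">
{"products": [{"name": "Widget", "price": 9.99}]}
</script>
</body></html>
"""

def extract_embedded_json(html: str) -> dict:
    """Pull the JSON payload out of the page's data <script> tag."""
    match = re.search(
        r'<script id="app-data" type="application/json">\s*(.*?)\s*</script>',
        html,
        re.DOTALL,
    )
    if not match:
        raise ValueError("no embedded data blob found")
    return json.loads(match.group(1))

data = extract_embedded_json(SAMPLE_HTML)
print(data["products"][0]["name"])  # Widget
```

The `id="app-data"` tag name is an assumption for illustration; real sites use their own markers (and sometimes non-JSON encodings), which is what "in some form" above refers to.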

Headless browsers are useful when the servers are protected by anti-scraping software you can't reverse engineer, when the data you need is generated dynamically (computed in the browser rather than downloaded), or simply when you don't have the time to understand the website at a deeper level.

Usually it's a tradeoff between development cost and runtime cost. In our case, we always try plain HTTP first. If we can't find an obvious way to do it, we go with browsers, then come back later and optimize the scraper to use plain HTTP, or a combination of plain HTTP and browsers for specific requests like logins, tokens, or cookies.
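The fallback strategy described above can be sketched as a small dispatcher; both fetchers here are hypothetical stubs (a real version would use an HTTP client and a headless-browser driver):

```python
from typing import Optional

def fetch_plain_http(url: str) -> Optional[str]:
    """Assumed cheap fetcher. Returns None when the response is
    unusable (e.g. blocked, or the data isn't in the payload).
    Stubbed here to simulate plain HTTP failing for this URL."""
    return None

def fetch_with_browser(url: str) -> str:
    """Assumed expensive fetcher driving a headless browser (stubbed)."""
    return f"<html>rendered content of {url}</html>"

def fetch(url: str) -> str:
    """Try the cheap plain-HTTP path first; fall back to the
    browser only when it fails."""
    html = fetch_plain_http(url)
    if html is not None:
        return html
    return fetch_with_browser(url)

result = fetch("https://example.com")
print(result)
```

The point of the design is that the expensive path is only the fallback: once you later reverse engineer the site, you swap the stub in `fetch_plain_http` for a real request and the browser stops being invoked, without changing the callers.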
