top | item 17895224

xarball | 7 years ago

Why would you switch from Selenium to Beautiful Soup halfway through what you're trying to do, forcing your program to re-request the same information from the web server? Selenium already has access to the entire DOM and the entire JavaScript session loaded in a running web browser. It has far more power for data mining than Beautiful Soup does.

It looks like they're just trying to use selectors, but these directions completely miss that functionality in Selenium's API. Just search the WebDriver documentation for 'find_element_by_':

https://selenium-python.readthedocs.io/api.html

I use Selenium for all my web crawling precisely because I would rather have one crawler with the full backing of a modern web browser than corner myself into missing something as crucial as a JavaScript parser halfway through implementing a bot that hooks into what is basically an end-user interface sitting on top of all that.

The most obvious benefit of Selenium, to me, is that with all of that in place I can make my interactions with a web server look more like a real user's and fly under the radar a little more. Treating websites as a whole package tends to mean less work on my part (though more RAM, yes!).

chsasank | 7 years ago

One reason to use Beautiful Soup is that Selenium is slow: you have to load the whole web page, including images, CSS, etc. With requests/Beautiful Soup you can just parse the collected URLs very fast.
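The lightweight pipeline being described looks roughly like this. In practice you'd get the markup from requests.get(url).text; a small inline snippet (hypothetical links and class name) keeps the sketch self-contained:

```python
from bs4 import BeautifulSoup

# Stand-in for requests.get(url).text — no browser, no images, no CSS.
html = """
<html><body>
  <a class="storylink" href="https://example.com/a">First story</a>
  <a class="storylink" href="https://example.com/b">Second story</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.select("a.storylink")]
print(links)  # ['https://example.com/a', 'https://example.com/b']
```

Each fetch is just one HTTP request for the HTML itself, which is where the speed difference comes from.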

xarball | 7 years ago

Selenium sets up the browser profile for you, so you can disable images, videos, CSS, JavaScript, and embeds to your heart's content.

I've recently started using Selenium with the Privoxy proxy, precisely because browser headless modes are still fairly new tech. They don't all support the standard profile features (add-ons, settings, etc.), and they don't all behave the same way. It's neat seeing where they're going, but they sometimes need a bit of help MITM-ing traffic, and that's where a good filter comes in handy.

In the user-facing web world, 'slow' is a relative term. Even with a barebones setup you're nearly always going faster than most servers will put out. I just take my chances bringing in bigger tools, because the personal cost of keeping an under-equipped tool up to date as your target site evolves is usually a greater time-waster than the personal cost of waiting for variably-optimized background work to perform its duties.