I'm surprised nobody has mentioned WWW::Mechanize, the classic Perl library [1], or its Python port [2], which is much closer to RoboBrowser than Selenium/PhantomJS/Horseman.
I wish scrapers came in the form of Chrome extensions: one would record my webpage actions as macros, then execute those macros on a remote headless server, revisiting periodically without downtime. No need to program or configure anything.
I would urge you to try Nightmare [1], a web scraping library that uses Electron as a headless browser. It has a plugin called Daydream [2] that records your webpage actions and converts them into a Nightmare script. You may need a couple of retries, but it does work.
Selenium IDE [0] provides a nice and simple way of doing this: it's a very simple Firefox add-on that lets you record and play back mouse movements, typing, etc. You can then refine your macro through Selenium WebDriver.
Does this run JavaScript on the page? I've done quite a bit of scraping with Scrapy, and have had to use PhantomJS in many cases because the static HTML doesn't contain what you're after.
I've had a lot of success scraping websites with Capybara [1]. It's intended for writing acceptance tests of web apps, but it works remarkably well for scraping websites. It's written in Ruby, but the DSL it provides for interacting with web pages should be pretty understandable to anybody who's programmed before. It also supports multiple browsers, which means you can trade off along these axes:
- Headless vs. Not
- JS support vs. Not
I put a repo together with a sample script [2] for scraping leads off of a website which I will not name, but whose name rhymes with 'help'. It uses the PhantomJS browser for headless JS support. It also includes a Vagrantfile so you can avoid installing all the dependencies on your local machine.
I love PhantomJS or SlimerJS for scraping. Everything else brings extra hassle: cookie management, JavaScript emulation, faking user agents, and whatnot. It's best to simply use a headless browser. Selenium seems overly complicated, too.
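To illustrate the hassles a headless browser avoids, here is a minimal sketch, assuming Python's requests library, of managing the user agent and cookies by hand (the values are made up):

```python
import requests

# With a plain HTTP client you manage the user agent and cookies
# yourself; a headless browser handles all of this automatically.
session = requests.Session()

# Fake a desktop-browser user agent (string is just an example).
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
})

# Cookies set on (or by) the session are sent with every
# subsequent request automatically.
session.cookies.set("sessionid", "abc123", domain="example.com")

print(session.headers["User-Agent"])
print(session.cookies.get("sessionid"))
```

This covers the plumbing, but of course it still doesn't execute any JavaScript, which is the other half of what a headless browser gives you.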
Interesting for unprotected websites, but it's easy to detect and block: no valid JS execution, no valid meta headers, no valid cookies, suspicious behavior...
Selenium is a more elaborate solution, but it can still be detected most of the time.
Disclosure: I'm DataDome co-founder.
If you want to detect bad bots and scrapers on your website, don't hesitate to try it out for free and share your feedback with us: https://datadome.co
I realize you have reasons not to answer this question, but out of curiosity, what sorts of thing can tip off the fact that a site is getting scraped by a real browser and selenium?
After a quick look, it seems that if the page needs JavaScript you should use Selenium; otherwise you can use this. So this is like Selenium without JavaScript support. Am I right?
It doesn't. To scrape (or fake-API) js-only websites you have to either:
- drive a browser (firefox/chrome) via already mentioned here selenium/webdriver (potentially hiding the actual browser window into a virtual X by wrapping the whole thing with xvfb-run),
- or use one of the webkit-based toolkits: phantomjs [1] or headless horseman [2].
There is also an interesting project that combines the two, i.e. it drives a Firefox (or, more precisely, slightly outdated version of Gecko) to emulate a phantomjs-compatible API. [3]
phantomjs/slimerjs are pretty popular and even have tools that run on top of them, such as casperjs [4], which are geared more toward automated website testing but can be quite good at scraping or fake-APIing too.
From what I can tell, having only recently started to use Scrapy, a lot more "magic", shall we say, happens in the background, so long procedures that could take a few hundred lines using bs4/requests/mechanize/etc. can be cut down to a lot less. Looking at RoboBrowser, it seems like it will reduce some of the coding effort, but not to the extent that Scrapy does.
I think the main difference is that Scrapy is async - it downloads pages in parallel by default, so it is more efficient. But async APIs can be harder to use - you need callbacks or generators everywhere - so sync packages (like RoboBrowser) can be easier to get started with.
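The concurrency point can be illustrated with a self-contained sketch using Python's asyncio, with fake fetches standing in for real downloads (Scrapy itself is built on Twisted, not asyncio, but the scheduling idea is the same):

```python
import asyncio
import time

# Simulate "downloading" five pages where each fetch takes 0.1s.
# Because the fetches run concurrently, total time approaches the
# slowest single request rather than the sum of all of them.
async def fake_fetch(url):
    await asyncio.sleep(0.1)  # stand-in for network latency
    return f"<html>{url}</html>"

async def crawl(urls):
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

urls = [f"http://example.com/page/{i}" for i in range(5)]
start = time.monotonic()
pages = asyncio.run(crawl(urls))
elapsed = time.monotonic() - start

print(len(pages))     # 5
print(elapsed < 0.5)  # True: concurrent, not 5 x 0.1s sequential
```

A purely synchronous client would take roughly the sum of the latencies instead, which is the efficiency gap the parent comment describes.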
Hmm, I can see why I'd want to use this library over piecing together requests and BS4 myself for every project. I love how simple the examples look.
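For comparison, here is the kind of requests/BS4 code being pieced together, reduced to a self-contained example that parses a static HTML string (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

# In a real scraper this string would come from requests.get(url).text;
# a static string keeps the example self-contained.
html = """
<html><body>
  <a class="result" href="/item/1">First</a>
  <a class="result" href="/item/2">Second</a>
  <form action="/search"><input name="q"></form>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull out the links and the form action by hand - the glue code
# that a library like RoboBrowser wraps up for you.
links = [(a.text, a["href"]) for a in soup.select("a.result")]
form_action = soup.find("form")["action"]

print(links)        # [('First', '/item/1'), ('Second', '/item/2')]
print(form_action)  # /search
```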
I have a project I'm working on that will involve scraping many different websites on a daily basis. My only scraping experience so far is using cheerio[0] to scrape a single page with a 1,000 row HTML table. Should I start with something BS-based like this or should I jump straight into Scrapy? Or are there any other alternatives I should try?
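As a starting point, the single-table case translates directly to BeautifulSoup. A self-contained sketch with made-up markup, turning rows into dicts keyed by the header cells:

```python
from bs4 import BeautifulSoup

# Toy stand-in for the 1,000-row table; the structure is what matters.
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>4.50</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# First row holds the headers; the rest hold the data.
headers = [th.text for th in rows[0].find_all("th")]
records = [
    dict(zip(headers, (td.text for td in row.find_all("td"))))
    for row in rows[1:]
]

print(records)
# [{'Name': 'Widget', 'Price': '9.99'}, {'Name': 'Gadget', 'Price': '4.50'}]
```

Whether to move to Scrapy is mostly a question of scale and concurrency, not parsing.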
I've used robobrowser for a project, where I needed to log in to a website and subsequently access pages as a logged in user. It worked well and I like the API. For "simple" scrapers that require authentication or some form of user interaction this is a good tool. If I need to scrape many pages from a site as fast as possible, I'd probably go for Scrapy though.
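The login-then-scrape pattern described here can also be sketched with a plain requests.Session; the URLs and form field names below are hypothetical, since the real ones depend on the site:

```python
import requests

def login(base_url, username, password):
    """Log in once; the session keeps the auth cookie afterwards."""
    session = requests.Session()
    session.post(f"{base_url}/login", data={
        "username": username,   # hypothetical field names -
        "password": password,   # inspect the site's login form
    })
    return session

def fetch_dashboard(session, base_url):
    # Later requests on the same session carry the cookie automatically.
    return session.get(f"{base_url}/dashboard").text
```

RoboBrowser adds form parsing and submission on top of this, so you don't have to read the field names out of the HTML yourself.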
It's been years since I've used it, but I think cntlm can do this. Point your Scrapy code at the cntlm instance, and it should handle all of the NTLM headers for you.
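Assuming cntlm is listening on its default local port of 3128, pointing the scraper at it can be as simple as setting the standard proxy environment variables, which Scrapy's proxy middleware picks up (the spider name is a placeholder):

```shell
# cntlm runs locally and adds the NTLM auth headers upstream.
export http_proxy=http://127.0.0.1:3128
export https_proxy=http://127.0.0.1:3128
scrapy crawl myspider
```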
Just curious... what is everyone using scrapers for?
I've done a lot of work scraping various sites and I can tell you this: basing any product on your ability to aggregate data via scraping will not work in the long run.
Eventually you will be asked not to scrape, and then you'll get sued if you don't stop.
Case law is not in your favor here; see Craigslist v. 3Taps.
I had a quick look into the repository and unfortunately, it doesn't support WebSockets. Does anyone know of a browser automation library/framework that does support WebSockets?
aexaey | 9 years ago
[1] http://search.cpan.org/~ether/WWW-Mechanize-1.75/lib/WWW/Mec...
[2] https://pypi.python.org/pypi/mechanize/
b3b0p | 9 years ago
[0] https://github.com/sparklemotion/mechanize
Dolores12 | 9 years ago
est | 9 years ago
itsyogesh | 9 years ago
[1] https://github.com/segmentio/nightmare
[2] https://github.com/segmentio/daydream
popey456963 | 9 years ago
[0] http://www.seleniumhq.org/
caseyf7 | 9 years ago
foota | 9 years ago
tsergiu | 9 years ago
cookiecaper | 9 years ago
markbnj | 9 years ago
tekacs | 9 years ago
So static HTML parsing only.
yes_or_gnome | 9 years ago
ddebernardy | 9 years ago
toasterlovin | 9 years ago
[1]: https://github.com/jnicklas/capybara
[2]: https://github.com/toasterlovin/scraping-yalp
facepalm | 9 years ago
Benfromparis | 9 years ago
dchuk | 9 years ago
mathheaven | 9 years ago
r1k | 9 years ago
aexaey | 9 years ago
[1] http://phantomjs.org/
[2] https://github.com/johntitus/node-horseman
[3] https://slimerjs.org/
[4] http://casperjs.org/
enibundo | 9 years ago
pkmishra | 9 years ago
alexroan | 9 years ago
kmike84 | 9 years ago
pcr0 | 9 years ago
[0]: https://github.com/cheeriojs/cheerio
jakubbalada | 9 years ago
Disclaimer: I'm a cofounder there
gkst | 9 years ago
thomasahle | 9 years ago
kej | 9 years ago
IanDrake | 9 years ago
zo1 | 9 years ago
bitfox | 9 years ago
taesu | 9 years ago
PhasmaFelis | 9 years ago
tlrobinson | 9 years ago