
RoboBrowser: Your friendly neighborhood web scraper

182 points | pmoriarty | 10 years ago | github.com

58 comments

[+] aexaey|9 years ago|reply
I'm surprised nobody has mentioned WWW::Mechanize, the classic Perl library [1], or its Python port [2], which are much closer to RoboBrowser than selenium/phantomjs/horseman.

[1] http://search.cpan.org/~ether/WWW-Mechanize-1.75/lib/WWW/Mec...

[2] https://pypi.python.org/pypi/mechanize/

[+] Dolores12|9 years ago|reply
Mechanize is outdated and Python 2 only. We tried it and switched to RoboBrowser.
[+] est|9 years ago|reply
I wish scrapers could come as Chrome extensions: one would record my webpage actions as macros, then execute the macros on a remote headless server, revisiting periodically, with no downtime. No need to program or configure anything.
[+] popey456963|9 years ago|reply
Selenium IDE [0] provides a nice and simple way of doing this. It's just a very simple Firefox addon that lets you record and play back mouse movements, typing, etc. You can then refine your macro with Selenium WebDriver.

[0] http://www.seleniumhq.org/

[+] caseyf7|9 years ago|reply
The Resurrectio Chrome extension allows you to do this with PhantomJS.
[+] foota|9 years ago|reply
I know there are already extensions that help with generating finders for automated e2e testing; you might be able to use something like that.
[+] cookiecaper|9 years ago|reply
Several startups have tried this. http://kimonolabs.com is one I had experimented with; it was recently bought and shut down by Palantir.
[+] markbnj|9 years ago|reply
Does this run javascript on the page? I've done quite a bit of scraping with scrapy, and have had to use phantomjs in many cases because the static HTML doesn't contain what you're after.
[+] tekacs|9 years ago|reply
At a glance, no - it uses Requests to fetch pages and BeautifulSoup to parse them, the latter of which only parses the HTML into a document object.

So static HTML parsing only.
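A standard-library sketch of what "static HTML parsing only" means in practice (`html.parser` stands in for BeautifulSoup here, and the page content is invented): a static parser sees `<script>` contents as inert text, so anything a script would inject never appears in the parsed document.

```python
from html.parser import HTMLParser

# A made-up page: the <script> would add a third link in a real browser.
PAGE = """
<html><body>
  <a href="/about">About</a>
  <a href="/contact">Contact</a>
  <script>document.write('<a href="/js-only">JS only</a>');</script>
</body></html>
"""

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

parser = LinkExtractor()
parser.feed(PAGE)
# Only the two static links are found; the JS-generated one is invisible.
print(parser.links)  # ['/about', '/contact']
```

RoboBrowser's BeautifulSoup backend has the same blind spot: the script text is in the document tree, but it is never executed.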

[+] yes_or_gnome|9 years ago|reply
I use PhantomJS as well, but (assuming you haven't already) look at CasperJS. It runs on top of PhantomJS but is friendlier for bigger tasks.
[+] ddebernardy|9 years ago|reply
You could use Splash for JS as well. (Disclaimer: working for Scrapinghub, the main maintainers of Scrapy/Splash.)
[+] toasterlovin|9 years ago|reply
I've had a lot of success scraping websites with Capybara [1]. It's intended for writing acceptance tests of web apps, but it works remarkably well for scraping websites. It's written in Ruby, but the DSL it provides for interacting with web pages should be pretty understandable to anybody who's programmed before. It also supports multiple browsers, which means you can trade off along these axes:

- Headless vs. not
- JS support vs. not

I put a repo together with a sample script [2] for scraping leads off of a website which I will not name, but whose name rhymes with 'help'. It uses the PhantomJS browser for headless JS support. It also includes a Vagrantfile so you can avoid installing all the dependencies on your local machine.

[1]: https://github.com/jnicklas/capybara

[2]: https://github.com/toasterlovin/scraping-yalp

[+] facepalm|9 years ago|reply
I love PhantomJS or SlimerJS for scraping. Everything else involves extra hassle with cookie management, JavaScript emulation, faking user agents, and whatnot. Best to simply use a headless browser. Selenium seems overly complicated, too.
[+] Benfromparis|9 years ago|reply
Interesting for unprotected websites, but it's easy to detect and block: no valid JS, no valid meta header, no valid cookie, suspicious behavior...

Selenium is a more elaborate solution, but it can still be detected most of the time.

Disclosure: I'm a DataDome co-founder. If you want to detect bad bots and scrapers on your website, don't hesitate to try it out for free and share your feedback with us: https://datadome.co
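A toy sketch of the kind of server-side signal described above. The header names and threshold are purely illustrative (not any vendor's actual rules); real detection combines many more signals.

```python
# Headers a real browser sends on a normal page load. A client missing
# most of these, or advertising a scraping library outright, is suspect.
EXPECTED = {"User-Agent", "Accept", "Accept-Language", "Accept-Encoding"}

def looks_like_bot(headers):
    """Crude heuristic: flag clients whose headers don't look browser-like."""
    if "python-requests" in headers.get("User-Agent", ""):
        return True  # default requests/RoboBrowser UA gives itself away
    missing = EXPECTED - set(headers)
    return len(missing) > 1

# Default requests-style client: flagged.
assert looks_like_bot({"User-Agent": "python-requests/2.9.1"})

# Browser-like header set: passes this (naive) check.
assert not looks_like_bot({
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
    "Accept-Language": "en-US",
    "Accept-Encoding": "gzip",
})
```

Spoofing headers defeats this particular check, which is why the JS, cookie, and behavioral signals mentioned above matter more in practice.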

[+] dchuk|9 years ago|reply
I realize you have reasons not to answer this, but out of curiosity, what sorts of things can tip off a site that it's being scraped by a real browser driven by Selenium?
[+] mathheaven|9 years ago|reply
After a glance, I'd say: if the page needs JavaScript, use Selenium; otherwise use this. So this is like Selenium without JavaScript. Am I right?
[+] r1k|9 years ago|reply
Does it support sites which require a JS enabled browser?
[+] aexaey|9 years ago|reply
It doesn't. To scrape (or fake-API) js-only websites you have to either:

- drive a real browser (Firefox/Chrome) via the Selenium/WebDriver route already mentioned here (potentially hiding the browser window in a virtual X display by wrapping the whole thing with xvfb-run),

- or use one of the WebKit-based toolkits: phantomjs [1] or headless horseman [2].

There is also an interesting project that combines the two: it drives Firefox (or, more precisely, a slightly outdated version of Gecko) to emulate a phantomjs-compatible API. [3]

phantomjs/slimerjs are pretty popular and even have tools that run on top of them, such as casperjs [4], which is geared more toward automated website testing but can be quite good at scraping or fake-APIing too.

[1] http://phantomjs.org/

[2] https://github.com/johntitus/node-horseman

[3] https://slimerjs.org/

[4] http://casperjs.org/

[+] enibundo|9 years ago|reply
Last time I needed something like this I used selenium. And I use requests the rest of the time.
[+] pkmishra|9 years ago|reply
What benefit does it provide in comparison to Scrapy?
[+] alexroan|9 years ago|reply
From what I can tell, having only recently started to use Scrapy, a lot more "magic", shall we say, happens in the background, so long procedures that would take a few hundred lines with bs4/requests/mechanize/etc. can be reduced to far less. Looking at RoboBrowser, it seems like it will cut some of the coding effort, but not to the extent that Scrapy does.
[+] kmike84|9 years ago|reply
I think the main difference is that Scrapy is async: it downloads pages in parallel by default, so it is more efficient. But async APIs can be harder to use (you need callbacks or generators everywhere), so sync packages like RoboBrowser can be easier to get started with.
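That trade-off can be sketched with a simulated fetch. Threads here are only a stand-in for Scrapy's Twisted-based async I/O, and the URLs and delay are invented, but the shape of the difference is the same: identical results, very different wall-clock time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an HTTP request; the sleep makes the timing difference
# visible without touching the network.
def fetch(url):
    time.sleep(0.1)
    return f"<html>{url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(5)]

# Sequential, RoboBrowser-style: one page at a time (~0.5s total here).
start = time.monotonic()
pages_seq = [fetch(u) for u in urls]
seq_elapsed = time.monotonic() - start

# Concurrent, Scrapy-style: all pages in flight at once (~0.1s total).
start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    pages_par = list(pool.map(fetch, urls))
par_elapsed = time.monotonic() - start

assert pages_seq == pages_par     # same pages,
assert par_elapsed < seq_elapsed  # far less wall-clock time
```

The cost, as noted above, is that real async frameworks make you structure the code around callbacks or generators rather than straight-line statements.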
[+] pcr0|9 years ago|reply
Hmm, I can see why I'd want to use this library over piecing together requests and BS4 myself for every project. I love how simple the examples look.

I have a project I'm working on that will involve scraping many different websites on a daily basis. My only scraping experience so far is using cheerio[0] to scrape a single page with a 1,000 row HTML table. Should I start with something BS-based like this or should I jump straight into Scrapy? Or are there any other alternatives I should try?

[0]: https://github.com/cheeriojs/cheerio

[+] gkst|9 years ago|reply
I've used robobrowser for a project, where I needed to log in to a website and subsequently access pages as a logged in user. It worked well and I like the API. For "simple" scrapers that require authentication or some form of user interaction this is a good tool. If I need to scrape many pages from a site as fast as possible, I'd probably go for Scrapy though.
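The log-in-then-browse flow described here is mostly a matter of carrying the session cookie across requests, which RoboBrowser gets for free from its underlying requests session. A standard-library-only sketch against a toy in-process server (the routes, credentials, and cookie value are all invented for illustration):

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.cookies import SimpleCookie
from http.server import BaseHTTPRequestHandler, HTTPServer

# A toy site: POST /login hands out a session cookie; GET /secret
# only answers if that cookie comes back.
class Site(BaseHTTPRequestHandler):
    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length") or 0))
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123")
        self.end_headers()
        self.wfile.write(b"ok")

    def do_GET(self):
        cookies = SimpleCookie(self.headers.get("Cookie", ""))
        authed = self.path == "/secret" and cookies.get("session")
        self.send_response(200 if authed else 401)
        self.end_headers()
        self.wfile.write(b"top secret" if authed else b"login first")

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Site)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# The "browser": an opener that remembers cookies between requests,
# playing the role of RoboBrowser's session.
jar = CookieJar()
browser = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

browser.open(base + "/login", data=b"user=me&pass=secret")  # log in
page = browser.open(base + "/secret").read()                # cookie reused
server.shutdown()
print(page)
```

With RoboBrowser the same flow is `open()` the login page, fill and `submit_form()`, then keep calling `open()`; the cookie bookkeeping is identical, just hidden.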
[+] thomasahle|9 years ago|reply
I'd like to write a small scraper for a website that uses NTLM authentication, the headers it sends are:

    HTTP/1.1 401 Unauthorized
    Server: Microsoft-IIS/8.5
    WWW-Authenticate: NTLM
    WWW-Authenticate: Negotiate
    ...
Does RoboBrowser support these kinds of protocols? I tried to get it to work with Scrapy, but it seemed non-trivial...
[+] kej|9 years ago|reply
It's been years since I've used it, but I think cntlm can do this. Point your Scrapy code at the cntlm instance, and it should handle all of the NTLM headers for you.
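If cntlm fits your setup, the configuration is just a small file plus pointing the scraper's proxy settings at the local port. Every value below is a placeholder, and whether this applies depends on your network layout (cntlm is built around NTLM-authenticating gateways):

```
# cntlm.conf (illustrative values only)
Username    myuser
Domain      CORP
# Safer than a plaintext Password line: generate hashes with `cntlm -H`
PassNTLMv2  D5826E9C665C37C80B53397D5C07BBCB
Proxy       ntlm-gateway.example.com:8080
Listen      3128
```

Scrapy (or requests/RoboBrowser) would then be pointed at `http://127.0.0.1:3128` as its HTTP proxy.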
[+] IanDrake|9 years ago|reply
Just curious... what is everyone using scrapers for?

I've done a lot of work scraping various sites and I can tell you this: basing any product on your ability to aggregate data via scraping will not work in the long run.

Eventually you will be asked not to scrape and then you'll get sued if you don't stop.

Case law is not in your favor here. See Craigslist v. 3Taps.

[+] zo1|9 years ago|reply
I had a quick look into the repository and unfortunately, it doesn't support WebSockets. Does anyone know of a browser automation library/framework that does support WebSockets?
[+] bitfox|9 years ago|reply
What are the differences (advantages) compared to Selenium WebDriver, and why should I use it?
[+] taesu|9 years ago|reply
If this doesn't run JS, then what's its edge over the requests lib?
[+] PhasmaFelis|9 years ago|reply
Could someone explain what this is for, maybe with a couple of examples? This is getting to be a problem on HN.
[+] tlrobinson|9 years ago|reply
Really? It's right there on the main GitHub page: a three-sentence description and six code examples.