top | item 7375575

Python web scraping

138 points | Jake232 | 12 years ago | jakeaustwick.me

62 comments

[+] Denzel|12 years ago|reply
Why not just use Scrapy[1]? It's built for this sort of thing, easily extensible, and written in Python.

[1] http://scrapy.org/

[+] Jake232|12 years ago|reply
I've used Scrapy previously, and agree it's a good tool in the right situation.

It could just be me, but I prefer more control, and using the requests library gave me that. I ran into some obscure problems when trying to use and switch between multiple proxies with Scrapy.

For a long-running spider, Scrapy may be worth looking into. For simple one-off scripts and things that aren't run very often, I'd much rather write something custom than have to learn about Pipelines, Filters, etc.
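For what it's worth, the kind of per-request proxy control described here is only a few lines with requests; a sketch (the helper name and proxy address are invented for illustration):

```python
import requests

def session_for_proxy(proxy_url):
    # one Session per proxy; swap sessions freely between requests,
    # which is the sort of control that takes more ceremony in a framework
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    session.headers["User-Agent"] = "Mozilla/5.0 (compatible; example-bot)"
    return session

# usage: session_for_proxy("http://user:pass@10.0.0.1:8080").get(url)
```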

[+] crdoconnor|12 years ago|reply
I tried using this and it was too much of a straitjacket. It's really, really hard to write a decent framework (for anything!), and I don't think Scrapy really succeeds.

In the end I dumped my scrapy code and replaced it with a combination of celery, mechanize and pyquery. That worked much better, used less code and was much more flexible.

[+] abaker87|12 years ago|reply
Agree. Scrapy is great for this sort of thing. Takes care of little details you don't want to be worrying about.

I coupled Scrapy with Stem to control Tor, on the off chance your IP gets blocked. Works great.

[+] Bocker|12 years ago|reply
Scrapy makes things a lot easier. I was really surprised at how fast it crawled a site the first time I used it (forgot to set a delay).
[+] hekker|12 years ago|reply
Thanks for creating Scrapy, I have used it before many times and it is a great tool!
[+] JonLim|12 years ago|reply
Good resource - I've been using BeautifulSoup[1] for the scraper I set up for my needs, and it's probably worth checking out as well!

[1]: http://www.crummy.com/software/BeautifulSoup/

[+] Jake232|12 years ago|reply
Thanks! I briefly mentioned that BeautifulSoup exists in the lxml section of the prerequisites. I just added a short section on CSS selectors; you can actually use them with lxml. I'll add a little more info mentioning BeautifulSoup and PyQuery though.
[+] caio1982|12 years ago|reply
Great summary on how to start on the topic, really nice! I only wish it were longer, as I love playing with scraping (regex lover here), and unfortunately not many people consider going straight to lxml + XPath, which is ridiculously fast. Sometimes I see people writing a bunch of lines to walk through a tree of elements, using selectors with BeautifulSoup or even a full Scrapy project, and I think, "dude, why didn't you just extract that tiny bit of data with a single XPath?". Suggestion for a future update: try covering the caveats of lxml (invalid pages [technically not lxml's fault, but okay], limitations of XPath 1.0 compared to 2.0 [not supported in lxml], tricky charset detection), and maybe throw in a few code samples in both BS and lxml to compare when to use each of them :-)
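The point about a single XPath replacing a pile of tree-walking code, as a quick lxml sketch (the HTML snippet is invented for illustration):

```python
from lxml import html

doc = html.fromstring("""
<div id="listing">
  <span class="price">$9.99</span>
  <span class="price">$4.50</span>
</div>
""")

# one XPath instead of walking the tree element by element
prices = doc.xpath('//div[@id="listing"]/span[@class="price"]/text()')
```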
[+] Jake232|12 years ago|reply
Hey, thanks for the feedback!

I'm planning on adding more to the article in the near future, this was just a start. I plan on it being a resource with almost everything in, so people can bookmark it for future use.

I really haven't seen lxml choking on pages, are you sure you have the latest libxml2 installed? lxml always seems to work for me. If you know of a URL where it doesn't, I'd love to see it.

[+] gameguy43|12 years ago|reply
For a browser that runs JS the author mentions PhantomJS, but it looks like its Python support is iffy. Mechanize is super easy in Python: http://www.pythonforbeginners.com/cheatsheet/python-mechaniz...

Edit: so easy, in fact, that I prefer to just START by using mechanize to fetch the pages. Why bother testing whether or not your downloader needs JS, cookies, a reasonable user agent, etc.? Just start with them.

[+] pjin|12 years ago|reply
A hearty second for mechanize. It basically wraps urllib2, which is neat. Although I've encountered situations where language support for distributed systems would have really saved some frustration. There's a version of mechanize for Erlang [1], which I intend to try out whenever I get around to learning Erlang :)

[1] https://github.com/tokenrove/mechanizerl

[+] Jake232|12 years ago|reply
Thanks for the mechanize link, I'll add a note about it.

Mechanize is awfully slow though; it isn't asynchronous, so it's not a good fit if you need to crawl quickly. I wouldn't want to use it for general crawling. I guess you could monkey-patch the stdlib with gevent and try to get something working that way.

[+] aGHz|12 years ago|reply
Like the OP, I needed more control over the crawling behaviour for a project. All the scraping code quickly became a mess though, so I wrote a library that lets you declaratively define the data you're interested in (think Django forms). It also provides decorators that allow you to specify imperative code for organizing the data cleanup before and after parsing. See how easy it is to extract data from the HN front page: https://github.com/aGHz/structominer/blob/master/examples/hn...

I'm still working on proper packaging so for the moment the only way to install Struct-o-miner is to clone it from https://github.com/aGHz/structominer.

[+] Jake232|12 years ago|reply
Hey,

I've actually built something similar to this myself, I plan on writing an article in the future with something along these lines.

Yours looks pretty polished though, good job!

[+] rplnt|12 years ago|reply
> [...] prefer to stick with lxml for raw speed.

Is parsing speed really an issue when you are scraping the data you are parsing?

[+] Jake232|12 years ago|reply
If you're running on a server with a 1Gbps connection, then yes, it can be an issue. Another issue (I'm going to add a section on it) is that you can peg the CPU at 100% very easily. Parsing HTML and running XPaths uses a lot of CPU, so if you're running a good number of threads this can quickly become a problem.
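One way to keep CPU-bound parsing from starving the crawler is to push it into worker processes; a sketch using only the standard library (the regex stands in for a real lxml/XPath parse):

```python
import re
from concurrent.futures import ProcessPoolExecutor

def parse(html):
    # stand-in for a real lxml/XPath parse; still CPU-bound in spirit
    return re.findall(r"<title>(.*?)</title>", html)

def parse_all(pages):
    # parsing runs in worker processes, so downloader threads in the
    # main process aren't blocked when the parser pegs a core
    with ProcessPoolExecutor() as pool:
        return list(pool.map(parse, pages))
```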
[+] dikei|12 years ago|reply
It's an issue if you have a fast internet connection. I've had cases where the data could be downloaded faster than it could be parsed, overflowing the internal queue. It's even worse if your crawler isn't asynchronous; in that case a slow parser slows down crawling speed as well.
[+] christianmann|12 years ago|reply
> Use the Network tab in Chrome Developer Tools to find the AJAX request, you'll usually be greeted by a response in json.

But sometimes you won't. Sometimes you'll be assaulted with a response in a proprietary, obfuscated, or encrypted format. In situations where reverse-engineering the Javascript is unrealistic (perhaps it is equally obfuscated), I recommend Selenium[1][2] for scraping. It hooks a remote control to Firefox, Opera, Chrome, or IE, and allows you to read the data back out.

[1]: http://docs.seleniumhq.org/ [2]: http://docs.seleniumhq.org/projects/webdriver/
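In the happy case the quoted line describes, where the AJAX response really is plain JSON, no browser is needed at all; a sketch with an invented payload shape:

```python
import json

# response body captured from the Network tab (hypothetical endpoint
# and shape); in practice you'd fetch the same URL with requests
payload = '{"items": [{"title": "First post"}, {"title": "Second post"}]}'
titles = [item["title"] for item in json.loads(payload)["items"]]
```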

[+] Jake232|12 years ago|reply
Hi.

I mentioned selenium in the section below that, but I'll drop a note to it in the AJAX section too!

[+] joyofdata|12 years ago|reply
Warning: Self-Advertisement!

http://www.joyofdata.de/blog/using-linux-shell-web-scraping/

Okay, it's not as bold as using headers saying "Hire Me", but I would like to emphasize that sometimes even complex tasks can be super easy when you use the right tools. And a combination of Linux shell tools makes this task really very straightforward (literally).

[+] jmduke|12 years ago|reply
This is really cool -- particularly hxselect -- but I don't see how it's particularly simpler than using Python/requests. How would you handle following links and deduping visited URLs via piping?
[+] sdoering|12 years ago|reply
Thanks for the self-advertisement. That way, I was introduced to some fine reading on your blog.

Greetings from Hamburg.

[+] notfoss|12 years ago|reply
> If you need to extract data from a web page, then the chances are you looked for their API. Unfortunately this isn't always available and you sometimes have to fall back to web scraping.

Also, many times, not all functionality/features are available through the API.

Edit: By the way, without JS enabled, the code blocks on your website are basically unviewable (at least on Firefox).

http://imgur.com/CSYxMfL

[+] victoro|12 years ago|reply
This is a great starting point. Can anyone recommend any resources for how to best set up a remote scraping box on AWS or another similar provider? Pitfalls, best tools to help manage/automate scripts etc. I've found a few "getting started" tutorials like this one but I haven't been able to find anything good that discusses scraping beyond running basic scripts on your local machine.
[+] mdaniel|12 years ago|reply
http://scrapy.org/ is more systemic than these ad-hoc solutions, and http://scrapyd.readthedocs.org/en/latest/ is the daemon into which one can deploy scrapers if you want a little more structure.

The plugin architecture alone makes Scrapy a hands-down winner over whatever you might dream up on your own. As with any good plugin architecture, it ships with several optional toys, you can always add your own, and there is a pretty good community (e.g. https://github.com/darkrho/scrapy-redis)

http://crawlera.com/ will start to enter into your discussion unless you have a low-volume crawler, and http://scrapinghub.com/ are the folks behind Crawlera and (AFAIK) sponsor (or actually do) the development for Scrapy.

[+] level09|12 years ago|reply
I created a news aggregator built entirely on a Python scraper. I run scraping jobs as periodic celery tasks. Usually I start with the RSS feeds and parse them with the "feedparser" module, then use "pyquery" to find Open Graph tags.

Pyquery is an excellent module, but I think its parser is not very forgiving, so it might fail on some invalid/non-standard markup.

[+] crdoconnor|12 years ago|reply
pyquery uses lxml, which is pretty forgiving of non-standard/invalid markup.
[+] hatchoo|12 years ago|reply
The part on how to avoid detection was particularly useful for me.

I use webscraping (https://code.google.com/p/webscraping/) + BeautifulSoup. What I like about webscraping is that it automatically creates a local cache of the pages you access, so you don't end up needlessly hitting the site while testing the scraper.
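That caching behaviour is also easy to replicate with just the standard library if you'd rather stay on plain requests; a sketch (the helper name is invented):

```python
import hashlib
import os

def cached_fetch(url, fetch, cache_dir="page_cache"):
    # cache each page on disk, keyed by a hash of the URL, so repeated
    # test runs of the scraper don't re-hit the site
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, hashlib.sha1(url.encode()).hexdigest())
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = fetch(url)  # fetch is e.g. lambda u: requests.get(u).text
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```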

[+] sente|12 years ago|reply
He recommends http://51proxy.com/ - I was curious and signed up for their smallest package. It's been 24 hours since I created the account and I have 48 "Waiting" proxies, and I haven't been able to connect to either of the "Live" ones.

Has anyone had any success with them?

[+] lmz|12 years ago|reply
For all those mentioning Scrapy: would it be a good fit for authenticated scraping with parameters (logging in to different banks and getting recent transactions)?
[+] callmeed|12 years ago|reply
I do a fair amount of web scraping with Ruby. Can anyone who has dabbled in both weigh in? Does Python offer superior libraries/tools?
[+] Jake232|12 years ago|reply
Hey, article author here.

I've done extensive scraping in both Python and Ruby, as I wrote most of the scraping / crawling code at http://serpiq.com, so I can chip in.

Overall, I prefer Python. That is pretty much solely down to the requests library though, it makes everything so simple and quick. I haven't covered it in the article yet, but you can extend the Response() class easily, so you can for example add methods like images(), links(nofollow=True), etc. Overall, I just think the requests library is much more polished than anything available in Ruby.
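A sketch of that extended-Response idea (this is my guess at the shape, not Jake's actual code; method names follow his examples). In practice you'd wire the subclass in by wrapping responses or via a transport adapter:

```python
import requests
from lxml import html

class ScrapeResponse(requests.Response):
    def tree(self):
        # parse lazily from the raw body
        return html.fromstring(self.content)

    def links(self, nofollow=False):
        xpath = '//a[@rel="nofollow"]/@href' if nofollow else "//a/@href"
        return self.tree().xpath(xpath)

    def images(self):
        return self.tree().xpath("//img/@src")
```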

grequests (Python) means I can make things concurrent in a matter of minutes. In Ruby, however, the only capable library supporting concurrent HTTP requests that I liked was Typhoeus. It just wasn't up to the same standard, and I ran into certain issues when using proxies, etc.

As far as the HTML parsing goes, I don't really have any preference. Nokogiri and lxml are both equally capable.

I think they're both perfectly capable languages though, stick with what you prefer. I've been experimenting with Go lately.

[+] flexd|12 years ago|reply
Scrapy, mentioned here, is really easy; or, as the article says, a combination of requests, lxml, and something like PhantomJS (if there is JavaScript involved) makes scraping sites in Python a nice experience.

I briefly built something with Scrapy to scrape my university's websites and notify me when we get new exam results [1]. I might be slightly abusing Scrapy, but it was an okay experience.

Previously I used 'scrubyt' for Ruby to scrape things, but its homepage seems to lead to a skin-related website now. What tools/libraries do you use for scraping with Ruby these days? I remember Mechanize and Nokogiri were good/decent, but it's been more than a few years since I last used Ruby.

[1] https://github.com/flexd/studweb (description in Norwegian but it's not important)

[+] ZenoArrow|12 years ago|reply
I've used both Python and Ruby for web scraping. Whilst Python is my language of choice for most things, I enjoyed the web scraping experience more with Ruby (in particular, Nokogiri). Maybe it's just bad luck on my part, but I tend to find Unicode issues when scraping with Python 2.x, whereas Ruby has had decent Unicode support for a while. I've not used Python 3.x. YMMV.
[+] lukasm|12 years ago|reply
Coolio. I'd use PyQuery instead of xpath.
[+] Jake232|12 years ago|reply
Going to add a section on PyQuery, thanks!