item 3878605: Crawl a website with scrapy and store extracted results with MongoDB (isbullsh.it)
93 points | BaltoRouberol | 14 years ago | 17 comments

JackC | 14 years ago
For really quick one-off scraping, httplib2 + lxml + PyQuery is a pretty neat combination:

    import httplib2, lxml.etree, pyquery

    h = httplib2.Http(".cache")

    def get(url):
        resp, content = h.request(url, headers={'cache-control': 'max-age=3600'})
        return pyquery.PyQuery(lxml.etree.HTML(content))

This gives you a little function that fetches any URL as a jQuery-like object:

    pq = get("http://foo.com/bar")
    checkboxes = pq('form input[type=checkbox]')
    nextpage = pq('a.next').attr('href')

And of course all of the requests are cached using whatever cache headers you want, so repeated requests will load instantly as you iterate. Just something else to throw in the toolbelt...

  the_cat_kittles | 14 years ago
  Have you checked out Kenneth Reitz's requests? It's fantastic; you might like it.

jat1 | 14 years ago
Also check this out for a pretty good discussion on scraping: http://pyvideo.org/video/609/web-scraping-reliably-and-effic...

  BaltoRouberol | 14 years ago
  Yeah, I actually learnt scraping from Asheesh :) He's awesome.

unknown | 14 years ago
[deleted]

danneu | 14 years ago
Here's the same functionality written in Ruby using Chris Kite's crawler called Anemone [1]. Gist: https://gist.github.com/2475824. Screenshot: http://i.imgur.com/cbv9A.png
[1]: http://anemone.rubyforge.org/doc/index.html

ananthrk | 14 years ago
Cool. BTW, is there a reason for naming the file "isullshit_spiders.py" and not "isbullshit_spiders.py"? :)

  BaltoRouberol | 14 years ago
  Oh, that's just a typo. My bad. Edit: there, corrected.

hack_edu | 14 years ago
I really want to read this; the topic is right down my alley. Unfortunately, the page is literally broken and unreadable on Android ICS with Chrome :(

  martius | 14 years ago
  Hi all, thank you for reporting! This is a known issue and I'm working on it. I'll try to push a "responsive" version today or tomorrow.

  noinput | 14 years ago
  http://www.readability.com/mobile/articles/yfhqwo0t cleans it right up.

  joshu | 14 years ago
  Completely unreadable on iPhone too.

  mumphster | 14 years ago
  Same using mobile Safari.
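An aside on JackC's helper: the pyquery half is what turns the fetched page into a jQuery-like object, so selector lookups such as pq('a.next').attr('href') work. For anyone who wants that last step without any third-party dependency, the same attribute lookup can be sketched with the stdlib's html.parser. This is a minimal illustration, not code from the thread; next_href is a hypothetical helper name:

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect (class, href) pairs for every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            self.links.append((d.get("class"), d.get("href")))


def next_href(html):
    """Stdlib equivalent of pq('a.next').attr('href'): href of the
    first <a class="next"> in the page, or None if there is none."""
    parser = LinkParser()
    parser.feed(html)
    for cls, href in parser.links:
        if cls == "next":
            return href
    return None
```

For example, next_href('<a class="next" href="/page/2">more</a>') returns '/page/2'. It is far less expressive than pyquery's CSS selectors, but it needs nothing outside the standard library.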
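The submitted article covers wiring Scrapy to MongoDB; the storage half of that comes down to an item pipeline whose process_item writes each scraped item to a collection. The sketch below shows just that logic, with the collection injected so it runs without a live database; all names are illustrative, not the article's own code (in a real project the collection would come from pymongo, e.g. MongoClient()['scrapy']['items']):

```python
class MongoPipeline:
    """Scrapy-style item pipeline that inserts every item into a
    MongoDB-like collection (any object exposing insert_one)."""

    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        # Store a plain dict; Scrapy items convert cleanly via dict().
        self.collection.insert_one(dict(item))
        return item  # pipelines must return the item for later stages


class InMemoryCollection:
    """Stand-in for a pymongo collection, for demonstration only."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


collection = InMemoryCollection()
pipeline = MongoPipeline(collection)
pipeline.process_item({"title": "Some post", "url": "http://example.com"}, spider=None)
```

After the call, collection.docs holds the stored dict. Swapping InMemoryCollection for a real pymongo collection gives the persistence behaviour the article describes, with no change to the pipeline itself.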