item 3878605: Crawl a website with scrapy and store extracted results with MongoDB (isbullsh.it)
93 points | BaltoRouberol | 14 years ago | 17 comments

JackC | 14 years ago
For really quick one-off scraping, httplib2 + lxml + PyQuery is a pretty neat combination:

    import httplib2, lxml.etree, pyquery

    h = httplib2.Http(".cache")

    def get(url):
        resp, content = h.request(url, headers={'cache-control': 'max-age=3600'})
        return pyquery.PyQuery(lxml.etree.HTML(content))

This gives you a little function that fetches any URL as a jQuery-like object:

    pq = get("http://foo.com/bar")
    checkboxes = pq('form input[type=checkbox]')
    nextpage = pq('a.next').attr('href')

And of course all of the requests are cached using whatever cache headers you want, so repeated requests will load instantly as you iterate. Just something else to throw in the toolbelt...

  the_cat_kittles | 14 years ago
  Have you checked out Kenneth Reitz's requests? It's fantastic; you might like it.

jat1 | 14 years ago
Also check this out for a pretty good discussion on scraping: http://pyvideo.org/video/609/web-scraping-reliably-and-effic...

  BaltoRouberol | 14 years ago
  Yeah, I actually learnt scraping from Asheesh :) He's awesome.

unknown | 14 years ago
[deleted]

danneu | 14 years ago
Here's the same functionality written in Ruby using Chris Kite's crawler called Anemone [1]. Gist: https://gist.github.com/2475824. Screenshot: http://i.imgur.com/cbv9A.png
[1]: http://anemone.rubyforge.org/doc/index.html

ananthrk | 14 years ago
Cool. BTW, is there a reason for naming the file "isullshit_spiders.py" and not "isbullshit_spiders.py"? :)

  BaltoRouberol | 14 years ago
  Oh, that's just a typo. My bad. Edit: there, corrected.

hack_edu | 14 years ago
I really want to read this; the topic is right down my alley. Unfortunately, the page is literally broken and unreadable on Android ICS with Chrome :(

  martius | 14 years ago
  Hi all, thank you for reporting! This is a known issue and I'm working on it. I'll try to push a "responsive" version today or tomorrow.

  noinput | 14 years ago
  http://www.readability.com/mobile/articles/yfhqwo0t cleans it right up.

  joshu | 14 years ago
  Completely unreadable on iPhone too.

  mumphster | 14 years ago
  Same using mobile Safari.
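An aside on JackC's helper: the pyquery half is what turns the fetched page into a jQuery-like object, so selector lookups such as pq('a.next').attr('href') work. For anyone who wants that last step without any third-party dependency, the same attribute lookup can be sketched with the stdlib's html.parser. This is a minimal illustration, not code from the thread; next_href is a hypothetical helper name:

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect (class, href) pairs for every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            self.links.append((d.get("class"), d.get("href")))


def next_href(html):
    """Stdlib equivalent of pq('a.next').attr('href'): href of the
    first <a class="next"> in the page, or None if there is none."""
    parser = LinkParser()
    parser.feed(html)
    for cls, href in parser.links:
        if cls == "next":
            return href
    return None
```

For example, next_href('<a class="next" href="/page/2">more</a>') returns '/page/2'. It is far less expressive than pyquery's CSS selectors, but it needs nothing outside the standard library.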
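The submitted article covers wiring Scrapy to MongoDB; the storage half of that comes down to an item pipeline whose process_item writes each scraped item to a collection. The sketch below shows just that logic, with the collection injected so it runs without a live database; all names are illustrative, not the article's own code (in a real project the collection would come from pymongo, e.g. MongoClient()['scrapy']['items']):

```python
class MongoPipeline:
    """Scrapy-style item pipeline that inserts every item into a
    MongoDB-like collection (any object exposing insert_one)."""

    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        # Store a plain dict; Scrapy items convert cleanly via dict().
        self.collection.insert_one(dict(item))
        return item  # pipelines must return the item for later stages


class InMemoryCollection:
    """Stand-in for a pymongo collection, for demonstration only."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


collection = InMemoryCollection()
pipeline = MongoPipeline(collection)
pipeline.process_item({"title": "Some post", "url": "http://example.com"}, spider=None)
```

After the call, collection.docs holds the stored dict. Swapping InMemoryCollection for a real pymongo collection gives the persistence behaviour the article describes, with no change to the pipeline itself.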