Ask HN: How to aggregate product info from other websites
I have zero experience with web scraping of any kind, so any direction would be helpful. Before I start digging, I figured HNers might have some invaluable advice.
[+] [-] astrec|17 years ago|reply
[+] [-] jreilly|17 years ago|reply
[+] [-] shabda|17 years ago|reply
[+] [-] DenisM|17 years ago|reply
Python has a SAX-style SGML parser, and since HTML is defined as an application of SGML, it can be used to parse HTML. Better than regexps any day.
Python's http client library also supports cookies so that you can pretend to have a "session" with your target website.
EDIT: the libraries are urllib2, sgmllib, cookielib
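A rough sketch of that approach, for anyone who wants something to start from. It assumes Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, and where sgmllib was removed, so html.parser stands in for it as the event-driven parser. The `span class="price"` markup, the `ProductParser` class, and the helper names are made up for illustration; a real target site will need its own tag and attribute checks.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


class ProductParser(HTMLParser):
    """Collects the text inside <span class="price"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        # Only keep text seen while inside a price span
        if self.in_price and data.strip():
            self.prices.append(data.strip())


def make_session_opener():
    """Build an opener that stores and resends cookies, i.e. a 'session'."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


def extract_prices(html):
    """Feed an HTML string through the parser and return the collected prices."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.prices


# Assumed sample page, used here so the sketch runs without network access
SAMPLE = (
    '<ul><li>Widget <span class="price">$9.99</span></li>'
    '<li>Gadget <span class="price">$24.50</span></li></ul>'
)

print(extract_prices(SAMPLE))  # ['$9.99', '$24.50']
```

To use it against a live site, you would fetch each page with the opener from `make_session_opener()` (its `open(url)` method returns a response you can read and decode) and pass the resulting text to `extract_prices`. Reusing the same opener across requests is what keeps the cookies, and so the "session", alive.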
[+] [-] olegp|17 years ago|reply
What experiences has everyone else had?
[+] [-] jreilly|17 years ago|reply
[+] [-] thwarted|17 years ago|reply
[+] [-] petercooper|17 years ago|reply
If you're wondering why, well, consider this script that "learns" how to scrape Google results (from one supplied example of output data):
Reads almost like English in the scraping part!
[+] [-] aneesh|17 years ago|reply
[+] [-] qhoxie|17 years ago|reply
[+] [-] tocomment|17 years ago|reply
[+] [-] jreilly|17 years ago|reply