I'm always surprised by how many web scraping frameworks/libraries I see sprout here on HN on a regular basis. Is web scraping something people are actually doing, or is it the new high-concurrency version of the "to-do list" utility everyone used to write as an exercise?
This is an honest question, I'm not trying to take a dig at anyone in particular.
Just yesterday I wrote a program to scrape some Amazon search and product pages [0].
Why? Because Amazon's search is outright broken. The number of results changes when you change the sorting mode, and sometimes, sorting by a different criterion will just serve you a "no products found" error page.
I'll generally write a personal product comparison program when it becomes clear that I can't be certain that I can find the best product by hand. Often, even specialized websites that should have parametrized product search/filtering don't have their data properly indexed, so you have to scrape and parse it yourself. Another reason is to cross-reference with data from other sources. E.g. what laptop can I buy that has the best single-threaded CPU performance (within some other restrictions) [1]?
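To make the cross-referencing idea concrete, here's a toy Python sketch of that kind of join; all model names, prices, and benchmark scores below are invented, and in practice both tables would come from scraped pages:

```python
# Toy sketch: join a scraped laptop list against a single-threaded CPU
# benchmark table to find the best performer under a price cap.
laptops = [
    {"model": "A13", "cpu": "i7-8550U", "price": 950},
    {"model": "B72", "cpu": "i5-8250U", "price": 700},
    {"model": "C01", "cpu": "i7-8550U", "price": 1200},
]

# Single-threaded benchmark scores, scraped from a second source.
st_scores = {"i7-8550U": 2350, "i5-8250U": 2200}

def best_laptop(laptops, scores, max_price):
    # Keep laptops within budget whose CPU we have a score for,
    # then pick the highest single-threaded score (cheapest on ties).
    candidates = [l for l in laptops
                  if l["price"] <= max_price and l["cpu"] in scores]
    return max(candidates, key=lambda l: (scores[l["cpu"]], -l["price"]))

print(best_laptop(laptops, st_scores, max_price=1000)["model"])  # A13
```

The interesting work in the real version is normalizing CPU names so the join keys actually match across sources.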
I find that I frequently want to dump a table from some webpage into a SQLite database file and run queries on it. Most sites don't offer a CSV or XLS download option, so a small script to scrape the data usually ends up being the simplest option.
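That table-to-SQLite workflow can be sketched with just the standard library. The inline HTML below stands in for a fetched page, and the parser assumes a plain `<tr>`/`<td>` grid with no nested tables:

```python
import sqlite3
from html.parser import HTMLParser

# Collect rows of <td> text from the tables on a page.
class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_td = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag == "td":
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)
            self.row = None
        elif tag == "td":
            self.in_td = False
    def handle_data(self, data):
        if self.in_td:
            self.row.append(data.strip())

html = """<table>
<tr><td>widget</td><td>4.50</td></tr>
<tr><td>gadget</td><td>3.25</td></tr>
</table>"""  # stand-in for a fetched page

p = TableParser()
p.feed(html)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (name TEXT, price REAL)")
db.executemany("INSERT INTO items VALUES (?, ?)", p.rows)
print(db.execute("SELECT name FROM items ORDER BY price").fetchone()[0])  # gadget
```

Once the rows are in SQLite you get sorting, filtering, and joins for free, which is usually the whole point of the exercise.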
The solutions I write using homegrown utilities are both more elegant and faster than any framework or library I have ever seen posted to HN or recommended elsewhere. Not to mention smaller and more agile. IMHO.
All the frameworks and libraries I have seen will fail given the right input, i.e. under fuzzing.
I think parsing and transforming the content from webpages is just viewed as work that no one wants to do because, for whatever reason, webpages are still unpredictable.
Good performance matters if you have decent networking infrastructure or your server has limited resources.
Bandwidth and IP limits are the most common bottlenecks, but these can be solved using multiple proxies and SSH tunnels. Colly has built-in support for switching proxies [1].
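A round-robin rotation like Colly's proxy switcher can be sketched in a few lines. The proxy addresses below are made up, and `fetch` is a placeholder rather than a real request:

```python
from itertools import cycle

# Hypothetical pool of proxies / SSH tunnels (addresses are made up).
proxies = cycle([
    "socks5://127.0.0.1:1080",
    "socks5://127.0.0.1:1081",
    "http://10.0.0.5:3128",
])

def fetch(url, proxy):
    # Placeholder for a real request routed through `proxy`,
    # e.g. urllib.request with a ProxyHandler.
    return f"GET {url} via {proxy}"

# Each request takes the next proxy; the pool wraps around.
requests_made = [fetch(f"https://example.com/page/{i}", next(proxies))
                 for i in range(4)]
print(requests_made[0])
print(requests_made[3])  # back to the first proxy
```

In practice you'd also want to evict proxies that start returning errors or bans, which turns the plain cycle into a small health-checked pool.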
Tell that to the project I migrated from Scrapy to Go six months back. Granted, Scrapy might be doing other "fun" things that eat into performance, but it was really night and day. We immediately went from a CPU bottleneck to a network bottleneck.
Hear, hear. The right tool for the right job. And I can't think of a "righter" tool for this kind of job.
Edit: not picking on you, but given the quality and ecosystem of libraries and ancillary tools around Scrapy, I don't even consider alternatives at this point. Good on anyone who does it to learn, but for actual workloads I won't consider anything else.
Please break up your main `colly.go` file into separate parts. If possible, you shouldn't have a 30-line import block covering everything from cookies and regex to HTML and sync access.
Make sure DNS caching is enabled on the box; otherwise, add it in Go.
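Application-level DNS caching boils down to memoizing the resolver. A rough sketch of the idea (language-agnostic, shown here in Python; the stub resolver is made up so the example runs offline, where a real one would call `socket.getaddrinfo`):

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the "real" resolver is hit

def resolve(host, port):
    # Stub resolver so the sketch runs offline; a real implementation
    # would call socket.getaddrinfo(host, port) here.
    calls["n"] += 1
    return (("93.184.216.34", port),)

@lru_cache(maxsize=1024)
def cached_resolve(host, port):
    return resolve(host, port)

cached_resolve("example.com", 80)
cached_resolve("example.com", 80)  # cache hit, resolver not called again
print(calls["n"])  # 1
```

One caveat: real DNS records carry TTLs, so a production cache should expire entries rather than hold them forever the way `lru_cache` does.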
Colly only supports a single machine, because it tracks visited URLs in an in-memory map. It would be great if you replaced that with a queue backed by something like Redis or beanstalkd.
Please don't follow this suggestion. It's very helpful and healthy to have everything in a single file if you consider that manageable, so no problem at all.
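On the visited-URLs point above: the idea is to hide the "have we seen this URL?" check behind a small interface, so the per-process map can be swapped for a shared store. A rough sketch (class and method names are made up; a Redis-backed version would implement the same `add_if_new` using `SADD`'s return value):

```python
# Pluggable "visited" store for a crawl frontier.
class InMemoryVisited:
    # Mirrors a per-process visited-URL map.
    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        # Returns True if the URL was not seen before.
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

def crawl(frontier, visited):
    scheduled = []
    for url in frontier:
        if visited.add_if_new(url):
            scheduled.append(url)  # a real crawler would fetch here
    return scheduled

urls = ["https://a.example/", "https://b.example/", "https://a.example/"]
print(crawl(urls, InMemoryVisited()))  # duplicate URL is dropped
```

With a shared store behind that interface, several machines can pull from one frontier without re-fetching each other's URLs.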
https://news.ycombinator.com/item?id=15408784
[0]: https://github.com/CyberShadow/choose-product/blob/master/am...
[1]: https://github.com/CyberShadow/choose-product/blob/master/le...
The bottleneck in scraping is never the parsing, DOM representation, or traversal.
[1] http://go-colly.org/docs/best_practices/distributed/
SQLite has a different opinion about splitting code across many files: https://www.sqlite.org/amalgamation.html