item 15953629

Colly – Scraping Framework for Golang

116 points | tampo9 | 8 years ago | github.com

31 comments

[+] dguaraglia|8 years ago|reply
I'm always surprised by how many web scraping frameworks/libraries I see sprout here on HN on a regular basis. Is web scraping something people are doing, or is web scraping the new high-concurrency version of the "to do list" utility everyone used to write as an exercise?

This is an honest question, I'm not trying to take a dig at anyone in particular.

[+] CyberShadow|8 years ago|reply
Just yesterday I wrote a program to scrape some Amazon search and product pages [0].

Why? Because Amazon's search is outright broken. The number of results changes when you change the sorting mode, and sometimes, sorting by a different criterion will just serve you a "no products found" error page.

I'll generally write a personal product comparison program when it becomes clear that I can't be certain that I can find the best product by hand. Often, even specialized websites that should have parametrized product search/filtering don't have their data properly indexed, so you have to scrape and parse it yourself. Another reason is to cross-reference with data from other sources. E.g. what laptop can I buy that has the best single-threaded CPU performance (within some other restrictions) [1]?

[0]: https://github.com/CyberShadow/choose-product/blob/master/am...

[1]: https://github.com/CyberShadow/choose-product/blob/master/le...

[+] curun1r|8 years ago|reply
I find that I frequently want to dump a table from some webpage into a sqlite db file and run queries on it. Most sites don't offer a CSV or XLS download option, so a simple script to scrape the data usually ends up being the simplest option.
[+] feelin_googley|8 years ago|reply
This framework has been posted before.

The solutions I write using homegrown utilities are both more elegant and faster than any framework or library I have ever seen posted to HN or recommended elsewhere. Not to mention smaller and more agile. IMHO.

All frameworks and libraries that I have seen will fail given the right input, i.e. fuzzing.

I think parsing and transforming the content from webpages is just viewed as work that no one wants to do because, for whatever reason, webpages are still unpredictable.

[+] mlevental|8 years ago|reply
I worked at a startup for 6 months (fintech) whose entire value prop was built on the back of scraping.
[+] mlevental|8 years ago|reply
>Lightning Fast and Elegant Scraping Framework for Gophers

The bottleneck in scraping is never the parsing/DOM representation/traversal.

[+] tampo9|8 years ago|reply
Good performance matters if you have decent networking infrastructure or your server has limited resources.

Bandwidth and IP limits are the most common bottlenecks, but these can be worked around using multiple proxies and SSH tunnels. Colly has built-in support for switching proxies [1].

[1] http://go-colly.org/docs/best_practices/distributed/

[+] deoxxa|8 years ago|reply
Tell that to the project I migrated from scrapy to go six months back. Granted, scrapy might be doing other "fun" things to eat into performance, but it was really night and day. Immediately went from CPU bottleneck to network.
[+] ospider|8 years ago|reply
It has always been scheduling and anti-blocking.
[+] blowski|8 years ago|reply
The obvious question - why would I use this over Scrapy?
[+] tptacek|8 years ago|reply
Because you're using Golang and not Python.
[+] sheraz|8 years ago|reply
Hear, hear. The right tool for the right job. And I can't think of a "righter" tool for this kind of job.

Edit - not picking on you, but given the quality and ecosystem of libraries and ancillary tools for scrapy, I don't even consider alternatives at this point. Good on anyone who does it to learn but for actual workloads I won't consider anything else.

[+] fiatjaf|8 years ago|reply
For DOM parsing I cannot imagine that there could be anything better than https://github.com/PuerkitoBio/goquery.
[+] jjuel|8 years ago|reply
Which is funny, because if you look at the code, this is using goquery. Which then makes you wonder: why would I use this when I can just use goquery?
[+] Xeoncross|8 years ago|reply
Please break up your main `colly.go` file into separate parts. If possible you shouldn't have a 30-line imports definition covering everything from cookies and regex to HTML and sync access.

Make sure to use DNS caching on the box, or else add it in Go.

Colly only supports a single machine via a map of visited URLs. It would be great if you replaced that with a queue like Redis or Beanstalkd.

    visitedURLs map[uint64]bool
[+] fiatjaf|8 years ago|reply
Please don't follow this suggestion. It's very helpful and healthy to have everything in a single file if you consider that manageable, so no problem at all.
[+] kondro|8 years ago|reply
How does this go with running the JS on the SPAs that make up a large portion of the web today?