Colly – Scraping Framework for Golang

[+] stablemap|8 years ago|reply

Some discussion from two months ago:

https://news.ycombinator.com/item?id=15408784

[+] dang|8 years ago|reply

Thanks! Missed that earlier.

[+] dguaraglia|8 years ago|reply

I'm always surprised by how many web scraping frameworks/libraries I see sprout here on HN on a regular basis. Is web scraping something people are doing, or is web scraping the new high-concurrency version of the "to do list" utility everyone used to write as an exercise?

This is an honest question, I'm not trying to take a dig at anyone in particular.

[+] CyberShadow|8 years ago|reply

Just yesterday I wrote a program to scrape some Amazon search and product pages [0].

Why? Because Amazon's search is outright broken. The number of results changes when you change sorting mode, and sometimes, sorting by a different criteria will just serve you a "no products found" error page.

I'll generally write a personal product comparison program when it becomes clear that I can't be certain that I can find the best product by hand. Often, even specialized websites that should have parametrized product search/filtering don't have their data properly indexed, so you have to scrape and parse it yourself. Another reason is to cross-reference with data from other sources. E.g. what laptop can I buy that has the best single-threaded CPU performance (within some other restrictions) [1]?

[0]: https://github.com/CyberShadow/choose-product/blob/master/am...

[1]: https://github.com/CyberShadow/choose-product/blob/master/le...

[+] curun1r|8 years ago|reply

I find that I frequently want to dump a table from some webpage into a sqlite db file and run queries on it. Most sites don't offer a CSV or XLS download option, so a simple script to scrape the data usually ends up being the simplest option.

[+] feelin_googley|8 years ago|reply

This framework has been posted before.

The solutions I write using homegrown utilities are both more elegant and faster than any framework or library I have ever seen posted to HN or recommended elsewhere. Not to mention smaller and more agile. IMHO.

All frameworks and libraries that I have seen will all fail given the right input i.e. fuzzing.

I think parsing and transforming the content from webpages is just viewed as work that no one wants to do because, for whatever reason, webpages are still unpredictable.

[+] mlevental|8 years ago|reply

i worked at a startup for 6 months (fintech) whose entire value prop was built on the back of scraping.

[+] 973015077|8 years ago|reply

[deleted]

[+] mlevental|8 years ago|reply

>Lightning Fast and Elegant Scraping Framework for Gophers

the bottle neck in scraping is never the parsing/DOM representation/traversal.

[+] tampo9|8 years ago|reply

Good performance matters if you have decent networking infrastructure or your server has limited resources.

Bandwidth and IP limits are the most common bottle necks, but these can be solved using multiple proxies and ssh tunnels. Colly has built in support for switching proxies [1].

[1] http://go-colly.org/docs/best_practices/distributed/

[+] deoxxa|8 years ago|reply

Tell that to the project I migrated from scrapy to go six months back. Granted, scrapy might be doing other "fun" things to eat into performance, but it was really night and day. Immediately went from CPU bottleneck to network.

[+] ospider|8 years ago|reply

It has always been scheduling and anti-blocking

[+] blowski|8 years ago|reply

The obvious question - why would I use this over Scrapy?

[+] tptacek|8 years ago|reply

Because you're using Golang and not Python.

[+] sheraz|8 years ago|reply

Here here. The right tool for the right job. And I can't think of a "righter" tool for this kind of job.

Edit - not picking on you, but given the quality and ecosystem of libraries and ancillary tools for scrapy, I don't even consider alternatives at this point. Good on anyone who does it to learn but for actual workloads I won't consider anything else.

[+] fiatjaf|8 years ago|reply

For DOM parsing I cannot imagine that there could anything better than https://github.com/PuerkitoBio/goquery.

[+] jjuel|8 years ago|reply

Which is funny because if you look at the code this is using goquery. Which then makes you wonder why would I use this when I can just use goquery?

[+] unknown|8 years ago|reply

[deleted]

[+] Xeoncross|8 years ago|reply

Please breakup your main `colly.go` file into separate parts. If possible you shouldn't have a 30 line imports definition covering everything from cookies and regex to html and sync access.

Make sure to use DNS caching on the box else add it in Go.

Colly only supports a single machine via map of visited URL's. Would be great if you replace with a queue like redis or beanastalkd.

    visitedURLs map[uint64]bool

[+] fiatjaf|8 years ago|reply

Please don't follow this suggestion. It's very helpful and healthy to have everything in a single file if you consider that manageable, so no problem at all.

[+] guessmyname|8 years ago|reply

> […] It's very helpful and healthy to have everything in a single file if you consider that manageable.

SQLite has a different opinion — https://www.sqlite.org/amalgamation.html

[+] kondro|8 years ago|reply

How does this go with running the JS on the SPAs that make up a large portion of the web today?

[+] unknown|8 years ago|reply

[deleted]

[+] wiradikusuma|8 years ago|reply

[deleted]

31 comments