
Ask HN: How to aggregate product info from other websites

14 points | jreilly | 17 years ago | reply

Any input on the best way to aggregate product information from various websites would be much appreciated. Most of the websites I would like to aggregate lack any APIs that I could use to track prices and things of that nature.

I have zero experience web scraping of any kind so any direction would be helpful. Before I start digging I figured HNers may have some invaluable advice.

21 comments

[+] shabda|17 years ago|reply
You might also want to: 1. Worry about copyright law. 2. Make sure you do not hit the site so often that you show up in their logs as a bandwidth hog and get blocked.
[+] DenisM|17 years ago|reply
I use Python to write scripts of this nature (one script so far :)).

Python has an SGML parser with a SAX-style event interface, and since HTML descends from SGML it can be used on web pages. Better than regexps any day.

Python's http client library also supports cookies so that you can pretend to have a "session" with your target website.

EDIT: the libraries are urllib2, sgmllib, cookielib
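Those Python 2 modules (urllib2, sgmllib, cookielib) live on in Python 3 as urllib.request, html.parser, and http.cookiejar. A minimal sketch of the same event-driven parsing and cookie-carrying approach, using the modern names (the tags and URLs here are made-up examples):

```python
import http.cookiejar
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Event-driven parser: collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="/deals">deals</a> and <a href="/faq">FAQ</a>.</p>')
print(parser.links)  # ['/deals', '/faq']

# Cookie-aware opener: subsequent requests through it share a "session"
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# opener.open("http://example.com/login") would now carry cookies between requests
```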

[+] olegp|17 years ago|reply
Very few pages have well-formed mark-up. The few large scraping projects I've seen have started out with a mark-up-based approach and then switched to regular expressions.

What experiences has everyone else had?
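For what it's worth, the regex approach looks something like the sketch below. The `class="price"` markup and the pattern are invented for illustration; the point is that a regex doesn't care whether the surrounding markup is well formed:

```python
import re

# Example snippet; real pages would be fetched and are rarely this tidy
html = '<span class="price">$1,299.99</span> <span class="price">$24.50</span>'

# Pull out dollar amounts that follow a class="price" attribute
price_re = re.compile(r'class="price">\s*\$([\d,]+\.\d{2})')
prices = [float(m.replace(",", "")) for m in price_re.findall(html)]
print(prices)  # [1299.99, 24.5]
```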

[+] jreilly|17 years ago|reply
Thanks for the input. I am currently learning Rails, so I am also wondering if there are any libraries that will make this significantly easier.
[+] thwarted|17 years ago|reply
See if the sites in question are part of an affiliate network, like Commission Junction or LinkShare. They often provide plain-text feeds to affiliates through these programs, and many of their terms of service allow you to build this kind of service (although some restrict mixing their data with a competitor's).

However, I've found that even this data isn't all that great, cleanliness-wise (sometimes you can't trust the name of the product, the price, the link, or the SKU to even match the website), and it isn't updated very often (product availability in particular). But it's a hell of a lot easier than writing a custom parser for each site's HTML (although when I was working on a project like this, I still had to write a custom parser for each feed in order to put them all in a consistent format).
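A hedged sketch of that per-feed normalization step. The two feed formats below are invented for illustration (real networks each ship their own columns and delimiters); the idea is one small parser per feed, all emitting the same record shape:

```python
import csv
import io

# Two hypothetical feeds; column names and delimiters vary per network
feed_a = "sku,name,price\nA1,Widget,9.99\n"
feed_b = "PRICE|PRODUCT|ID\n19.95|Gadget|B2\n"

def parse_feed_a(text):
    """Comma-delimited feed -> normalized records."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"sku": row["sku"], "name": row["name"], "price": float(row["price"])}

def parse_feed_b(text):
    """Pipe-delimited feed with different column names -> same record shape."""
    for row in csv.DictReader(io.StringIO(text), delimiter="|"):
        yield {"sku": row["ID"], "name": row["PRODUCT"], "price": float(row["PRICE"])}

products = list(parse_feed_a(feed_a)) + list(parse_feed_b(feed_b))
print(products)
```

Once every feed is reduced to the same dictionaries, the tracking logic downstream never has to know which network a record came from.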
[+] petercooper|17 years ago|reply
For Ruby, consider Scrubyt: http://scrubyt.org/

If you're wondering why, well, consider this script that "learns" how to scrape Google results (from one supplied example of output data):

  google_data = Scrubyt::Extractor.define do
    fetch 'http://www.google.com/ncr'
    fill_textfield 'q', 'ruby'
    submit

    link "Ruby Programming Language" do
      url "href", :type => :attribute
    end

    next_page "Next", :limit => 2
  end

  puts google_data.to_xml
Reads almost like English in the scraping part!
[+] aneesh|17 years ago|reply
Perl's WWW::Mechanize module is a good choice for scraping & automating website interactions.
[+] qhoxie|17 years ago|reply
Mechanize also has a Ruby port, which is handy since you are working with Rails.
[+] tocomment|17 years ago|reply
What are you trying to do exactly? It depends a lot on the type of data you're trying to gather.
[+] jreilly|17 years ago|reply
I am basically trying to track prices of certain products automatically, so I do not have to do it by hand and check them myself every once in a while.
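A minimal sketch of that kind of price watcher, under stated assumptions: `fetch_price` is a hypothetical stand-in for whatever per-site scraping or feed parsing gets used, and the last-seen prices are kept in a local JSON file:

```python
import json
import pathlib

STATE = pathlib.Path("prices.json")
STATE.unlink(missing_ok=True)  # start fresh for this demo

def check(url, fetch_price):
    """Compare the current price against the last one seen and record it."""
    seen = json.loads(STATE.read_text()) if STATE.exists() else {}
    price = fetch_price(url)
    old = seen.get(url)
    seen[url] = price
    STATE.write_text(json.dumps(seen))
    if old is not None and price != old:
        return f"{url}: {old} -> {price}"
    return None

# Usage with fake fetchers standing in for real scrapers:
print(check("http://example.com/widget", lambda u: 9.99))  # None (first sighting)
print(check("http://example.com/widget", lambda u: 8.49))  # price drop reported
```

Run it from cron every few hours and email yourself whenever `check` returns a non-None message.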