
Ask HN: Is there an HTML table scraper generator, in Python or any other language?

2 points| jeffjia | 12 years ago

Hi,

In one of my projects, I need to get scrapers running for tens of websites to collect the rows and columns of tables (<table>, <ul>, <div>). The tables are well formatted. I have written several scrapers in Python, which basically use CSS selectors and then do some simple transformation with regular expressions. I wonder whether there is a scraper generator that takes a URL and a sample of the target output as input, and produces a scraper automatically?

Any suggestions are welcome. Thanks in advance.

10 comments


tonyfelice|12 years ago

Have you looked at phantomjs?

The webintro example here (https://github.com/ariya/phantomjs/wiki/Examples) scrapes a specific element.

jeffjia|12 years ago

I was using mechanize + Beautiful Soup in Python before, but it seems that this one also needs a human to read the HTML source and pick a CSS selector instead of automating it...

brandonlipman|12 years ago

I would take a look at the Mac app FakeApp. It does a lot of what you are describing, especially with regard to CSS and XPath selectors. I have been using it and have been able to do some really great stuff.

Johnie|12 years ago

If you don't want to build it yourself, check out import.io. They turn any website into an API. They did a demo at SV Newtech a couple months ago.

jeffjia|12 years ago

Thanks Johnie. It is almost what I want, except that it is neither open source nor free...

murtza|12 years ago

Have you taken a look at the Scrapy framework for Python?

http://scrapy.org/

jeffjia|12 years ago

Thanks. I used Beautiful Soup for the parser, and have actually written a crawler framework for my scenario. But I was wondering whether there is any tool that could automate the selection of the CSS selector or XPath.

Larrikin|12 years ago

I wrote a few scrapers and found Scrapy to be my best option.

taddeimania|12 years ago

I've used BeautifulSoup to do stuff like this.

jeffjia|12 years ago

Yeah. Me too. CSS selectors are quite convenient. The only problem is that I need to pick the selector set for each website I need to scrape, and there are tens of them, which makes the work time-consuming...
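
For what it's worth, the "sample output in, selector out" idea asked about above can be sketched in a few lines with only the standard library. This is a rough illustration, not an existing tool: the names PathCollector, collect, infer_path and extract are made up for this sketch, the "path" is a simple chain of tag.class names rather than a real CSS selector, and it assumes the sample values appear verbatim as the text of the target cells.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record (selector-like path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements, e.g. ["table.prices", "tr", "td"]
        self.records = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(".".join([tag] + classes))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append((" > ".join(self.stack), text))


def collect(html):
    """Parse the page and return all (path, text) records."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.records


def infer_path(html, samples):
    """Return the path under which most of the sample values appear, or None."""
    wanted = set(samples)
    hits = Counter(path for path, text in collect(html) if text in wanted)
    return hits.most_common(1)[0][0] if hits else None


def extract(html, path):
    """Return every text value found under the given path."""
    return [text for p, text in collect(html) if p == path]


# Hypothetical page, used only to exercise the sketch.
PAGE = """
<html><body>
<h1>Prices</h1>
<table class="prices">
<tr><td>apple</td><td>1.20</td></tr>
<tr><td>pear</td><td>0.80</td></tr>
</table>
<ul class="nav"><li>Home</li></ul>
</body></html>
"""

path = infer_path(PAGE, ["apple", "0.80"])
print(path)                 # html > body > table.prices > tr > td
print(extract(PAGE, path))  # ['apple', '1.20', 'pear', '0.80']
```

Given two sample cells, the sketch picks the path they share and then pulls every cell under that path. A real generator would also need to group cells into rows, cope with samples that match several paths, and handle pages whose markup differs from one row to the next, so treat this as a starting point only.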