
Show HN: A Python Spider System with Web UI

159 points| binux | 11 years ago |github.com | reply

40 comments

[+] meowface|11 years ago|reply
This looks really nice. The API seems more user-friendly than scrapy's.
[+] adam-_-|11 years ago|reply
How does this compare to scrapy? Why would I use one over the other, or is either a fine choice?
[+] binux|11 years ago|reply
I'm working on a benchmarking suite (https://gist.github.com/binux/67b276c51e988f8e2c31) and have run into some problems...

pyspider comes from a vertical search engine project. We had two requirements:

- 100+ websites, any of which may change its template or go down at some point. We need a dashboard to monitor the changes and the failures.

- Updates within 5 minutes: when a website is updated, we need to pick that up within 5 minutes. We use the update time shown on the index (list) page to tell which pages have changed. Pages should also be re-crawled after about 30 days in case we missed something. A powerful scheduler is needed.
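
The re-crawl rule in the second point can be sketched roughly as follows. This is only an illustration in plain Python, not pyspider's actual scheduler: the function name `needs_recrawl` and its parameters are made up for this example, and the 30-day limit is taken from the "about 30 days" mentioned above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed from the "about 30 days" figure in the comment above.
MAX_AGE = timedelta(days=30)


def needs_recrawl(
    last_crawled: Optional[datetime],
    listed_update: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_AGE,
) -> bool:
    """Decide whether a detail page should be fetched again.

    last_crawled  -- when we last fetched the page (None if never)
    listed_update -- update time shown for the page on the index (list) page
    now           -- current time, passed in so the rule is easy to test
    """
    # Never seen before: always fetch.
    if last_crawled is None:
        return True
    # The index page reports a change newer than our copy.
    if listed_update is not None and listed_update > last_crawled:
        return True
    # Safety net: refresh old copies in case a change was missed.
    return now - last_crawled >= max_age
```

With a rule like this, the index page would be polled every few minutes and only the detail pages for which `needs_recrawl` returns `True` would be queued again.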

Obviously, I hadn't found the right way to do this with scrapy. I'm not very familiar with scrapy, so I can't name anything that pyspider can do but scrapy can't.

[+] skillachie|11 years ago|reply
+1 Scrapy comparison please

Can you compare it to scrapy, as requested by other posters? Why could you not build on top of scrapy and leverage celery for scheduling etc. (http://www.celeryproject.org/)?

What is the immediate value-add of using pyspider?

[+] mrmondo|11 years ago|reply
Nice project! I do wish it supported a PostgreSQL backend rather than (or as well as I guess) MySQL.
[+] _bitliner|11 years ago|reply
I really like the flow/UX. Congratulations! Nice job!

What is the roadmap?

I am really into scraping; it is one of my daily jobs. I could consider integrating it into one of my architectures.

[+] _bitliner|11 years ago|reply
Furthermore, what do you mean by `Javascript pages supported`? Could I just specify where it has to click, or do I need to reverse engineer the AJAX calls?
[+] binux|11 years ago|reply
To make it more flexible and easier to reuse? I have implemented most of the features I need for now.
[+] erikb|11 years ago|reply
What is a "spider system"? Never heard that term before.
[+] Immortalin|11 years ago|reply
Any plans for a gui based web scraper interface similar to portia?
[+] binux|11 years ago|reply
Currently, yes and no.

pyspider runs plain Python code, while something like portia is a code generator (apologies if I'm wrong, I have not used it). So it could be built as another WebUI module.

But I currently have no idea how to do it right while keeping it flexible. So we have a CSS selector helper, but no plans for a complete tool.

[+] zbb|11 years ago|reply
Take a look at the source code. The package hierarchy is not pythonic (using "libs" as a top-level package is not a good idea).
[+] huskyr|11 years ago|reply
Ah, damn. The package hierarchy is not pythonic. That renders all the functionality of this package completely unusable.

Come on people, don't be like this. It takes 5 seconds to rephrase a comment like this into something friendlier.

[+] paulhauggis|11 years ago|reply
Why isn't it a good idea? I have plenty of projects set up this way and it works well.

It looks pretty well organized to me.

[+] binux|11 years ago|reply
agree