Show HN: A Python Spider System with Web UI

[+] meowface|11 years ago|reply

This looks really nice. The API seems more user-friendly than scrapy's.

[+] adam-_-|11 years ago|reply

How does this compare to scrapy? Why would I use one over the other, or is either a fine choice?

[+] binux|11 years ago|reply

I'm working on a benchmarking suite https://gist.github.com/binux/67b276c51e988f8e2c31 and have run into some problems...

pyspider comes from a vertical search engine project, where we had two issues:

- 100+ websites, any of which may change its template or go down at some point. We need a dashboard to monitor the changes and the failures.

- Updates within 5 minutes: when a website is updated, we need to pick that up within 5 minutes. We use the update time shown on the index (list) page to tell which pages have changed. Pages should also be re-crawled after about 30 days in case we missed something. A powerful scheduler is needed.
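
A minimal sketch of the re-crawl rule described in the second point, written in plain Python rather than against pyspider's actual API. The names `Page`, `needs_crawl`, and `MAX_AGE` are hypothetical and only illustrate the idea: fetch a page if it is new, if the index page reports a different update time, or if about 30 days have passed since the last crawl.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed safety-net interval: about 30 days, in seconds.
MAX_AGE = 30 * 24 * 60 * 60


@dataclass
class Page:
    """What a scheduler might remember about a crawled page (hypothetical)."""
    crawled_at: float  # Unix time of the last successful crawl
    update_tag: str    # update time shown on the index page at that crawl


def needs_crawl(known: Optional[Page], index_update_tag: str, now: float) -> bool:
    """Decide whether a page listed on the index page should be fetched again."""
    if known is None:
        return True  # never crawled before
    if known.update_tag != index_update_tag:
        return True  # the index page reports a different update time
    # Unchanged as far as we can tell: re-crawl only once the page is too old,
    # in case a change was missed.
    return now - known.crawled_at >= MAX_AGE
```

Running a check like this against each index page every 5 minutes would give the "follow within 5 minutes" behavior, with the age limit acting as the 30-day fallback.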

Obviously, I hadn't found the right way to do that with scrapy. I'm not very familiar with scrapy, so I can't name anything pyspider can do that scrapy can't.

[+] skillachie|11 years ago|reply

+1 Scrapy comparison please

Can you compare it to scrapy, as requested by other posters? Why could you not build on top of scrapy and leverage celery for scheduling etc. (http://www.celeryproject.org/)?

What is the immediate value-add of using pyspider?

[+] OedipusRex|11 years ago|reply

Can someone explain what this is?

[+] mrmondo|11 years ago|reply

Nice project! I do wish it supported a PostgreSQL backend rather than (or as well as, I guess) MySQL.

[+] _bitliner|11 years ago|reply

I really like the flow/UX. Congratulations! Nice job!

What is the roadmap?

I am really into scraping; it is part of my daily job. I could consider integrating it into one of my architectures.

[+] _bitliner|11 years ago|reply

Furthermore, what do you mean by `Javascript pages supported`? Can I just specify where it has to click, or do I need to reverse-engineer the AJAX calls?

[+] binux|11 years ago|reply

To make it more flexible and easier to reuse? I have already implemented most of the features I need.

[+] kidsil|11 years ago|reply

Thanks for making me feel bad about my python-based aggregation solution :)

https://github.com/AZdv/agricatch

[+] erikb|11 years ago|reply

What is a "spider system"? Never heard that term before.

[+] binux|11 years ago|reply

sorry :(

[+] bowlofstew|11 years ago|reply

That is a nice tool... nice work!

[+] Immortalin|11 years ago|reply

Any plans for a GUI-based web scraper interface similar to portia?

[+] binux|11 years ago|reply

Currently, yes and no.

pyspider runs plain Python code, while something like portia is a code generator (apologies if I'm wrong, I have not used it). So it could be built as another WebUI module.

But I currently have no idea how to do that right while keeping it flexible. So we have a CSS selector helper, but no plan for a complete tool.

[+] bjblazkowicz|11 years ago|reply

How's the performance compared to scrapy?

[+] binux|11 years ago|reply

https://gist.github.com/binux/67b276c51e988f8e2c31

[+] zbb|11 years ago|reply

Take a look at the source code. The package hierarchy is not pythonic (using "libs" as a top-level package is not a good idea).

[+] huskyr|11 years ago|reply

Ah, damn. The package hierarchy is not pythonic. That renders all the functionality of this package completely unusable.

Come on people, don't be like this. It takes 5 seconds to rephrase a comment like this into something friendlier.

[+] paulhauggis|11 years ago|reply

Why isn't it a good idea? I have plenty of projects set up this way, and it works well.

It looks pretty well organized to me.

[+] binux|11 years ago|reply

agree

40 comments