Furthermore, what do you mean by `Javascript pages supported`?
Can I just specify where it has to click, or do I need to reverse-engineer the AJAX calls?
meowface|11 years ago
adam-_-|11 years ago
binux|11 years ago
pyspider comes from a vertical search engine project. We had two issues:
- 100+ websites, which may change their templates or go down at any time. We need a dashboard to monitor the changes and the failures.
- Updates within 5 minutes: when a website is updated, we need to pick that up within 5 minutes. We use the update time from the index (list) page to tell which pages have changed. Pages should also be re-crawled after about 30 days in case we missed something. A powerful scheduler is needed.
Obviously, I hadn't found the right way to do this with scrapy. I'm not very familiar with scrapy, so I can't name something that pyspider can do but scrapy can't.
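The re-crawl rule described in this comment can be sketched in plain Python. This is only an illustration of the rule as stated (poll index pages every 5 minutes, re-fetch a page when its listed update time changes, and re-fetch anyway after about 30 days); it is not pyspider's scheduler code, and the names `INDEX_INTERVAL`, `MAX_AGE`, `index_due`, and `should_recrawl` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed values taken from the comment: 5-minute index polling, 30-day safety net.
INDEX_INTERVAL = timedelta(minutes=5)
MAX_AGE = timedelta(days=30)


def index_due(now: datetime, last_fetched: Optional[datetime]) -> bool:
    """Return True if an index (list) page should be fetched again."""
    return last_fetched is None or now - last_fetched >= INDEX_INTERVAL


def should_recrawl(
    now: datetime,
    last_crawled: Optional[datetime],
    stored_update_time: Optional[str],
    listed_update_time: Optional[str],
) -> bool:
    """Return True if a detail page should be fetched again.

    stored_update_time is the update marker saved at the last crawl;
    listed_update_time is the marker currently shown on the index page.
    """
    if last_crawled is None:
        return True  # never fetched before
    if listed_update_time != stored_update_time:
        return True  # the index page reports a change
    # Unchanged marker: re-fetch only once the page is older than MAX_AGE.
    return now - last_crawled >= MAX_AGE
```

A scheduler loop would call `index_due` for each list page and `should_recrawl` for each link found on it, queuing only the pages for which the function returns True.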
skillachie|11 years ago
Can you compare it to scrapy, as other posters have requested? Why couldn't you build on top of scrapy and use celery for scheduling etc. (http://www.celeryproject.org/)?
What is the immediate value added by using pyspider?
OedipusRex|11 years ago
unknown|11 years ago
[deleted]
mrmondo|11 years ago
_bitliner|11 years ago
What is the roadmap?
I am deep into scraping; it is part of my daily job. I could consider integrating it into one of my architectures.
_bitliner|11 years ago
binux|11 years ago
kidsil|11 years ago
https://github.com/AZdv/agricatch
erikb|11 years ago
binux|11 years ago
bowlofstew|11 years ago
Immortalin|11 years ago
binux|11 years ago
pyspider runs the original Python code, whereas something like portia is a code generator (apologies if I'm wrong, I have not used it). So it could be built as another WebUI module.
But as for flexibility, I currently have no idea how to get it right. So we have a CSS selector helper, but no plan for a complete tool.
bjblazkowicz|11 years ago
binux|11 years ago
zbb|11 years ago
huskyr|11 years ago
Come on people, don't be like this. It takes 5 seconds to rephrase a comment like this into something friendlier.
paulhauggis|11 years ago
It looks pretty well organized to me.
binux|11 years ago