I was scraping with jQuery for a while, but it felt like an awful lot of overhead. For the simpler scraping tasks that happen a lot, I've actually gone back to nuts and bolts with html5[1]'s tokenizer and a custom state machine that accumulates only the data I want. At no point is any DOM node created in memory, let alone the entire DOM tree. That means I feel safer running many of these in parallel on a VPS. It also means I can write a nice streaming API that starts emitting data the moment it has enough input. Buffering input just feels wrong in node.js.
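This isn't pshc's actual code (the html5 module ships a real, spec-compliant tokenizer), but a minimal sketch of the pattern: feed HTML in arbitrary chunks to a tiny state machine that keeps only the data you want, here the `<title>` text, without ever building a DOM.

```javascript
// Toy streaming extractor: accepts HTML in arbitrary chunks and
// accumulates only the <title> text. No DOM node is ever built; the
// whole "parser" is a two-state machine plus a small carry-over buffer.
// (Illustration only: it ignores comments, scripts, and CDATA.)
function createTitleExtractor() {
  let buffer = '';      // unparsed tail carried between chunks
  let inTitle = false;  // state: inside <title>...</title>?
  let title = '';

  return {
    write(chunk) {
      buffer += chunk;
      for (;;) {
        if (!inTitle) {
          const m = buffer.match(/<title[^>]*>/i);
          if (!m) {
            // Keep only a possible partial tag at the end of the buffer,
            // so a <title> split across chunks is still found next time.
            const lt = buffer.lastIndexOf('<');
            buffer = lt === -1 ? '' : buffer.slice(lt);
            return;
          }
          buffer = buffer.slice(m.index + m[0].length);
          inTitle = true;
        } else {
          const end = buffer.search(/<\/title\s*>/i);
          if (end === -1) {
            // Flush all but a tail that might be a split "</title>".
            const keep = Math.min(buffer.length, 16);
            title += buffer.slice(0, buffer.length - keep);
            buffer = buffer.slice(buffer.length - keep);
            return;
          }
          title += buffer.slice(0, end);
          buffer = buffer.slice(end);
          inTitle = false;
        }
      }
    },
    result() { return title; },
  };
}
```

Because state lives in a couple of strings rather than a node tree, memory stays flat no matter how large the page is, which is what makes running many of these in parallel cheap.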
Actually, I'm doing this for my SUPER SECRET startup at the moment. Originally the front-end would just send the back-end the whole HTML of a user's page when they executed the browser plugin, and the back-end would intercept it and knock it up in Perl.
One of my biggest pet peeves with crawling the web is using XPath. Not because I have strong feelings about XPath itself; it's just that I use CSS selector syntax so much that it's a pain not to be able to leverage that knowledge in this domain as well. Something like this is really awesome and is going to make crawling the web more accessible.
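For the simple cases the mapping between the two syntaxes is mechanical, which is why the knowledge feels so transferable. A toy sketch (covering only bare tags, `.class`, and `#id`; real selector engines handle the full grammar of combinators and pseudo-classes):

```javascript
// Toy CSS -> XPath translator for the three simplest selector forms:
//   'div'   -> '//div'
//   '#main' -> '//*[@id="main"]'
//   '.item' -> a class test (XPath has no class-attribute sugar)
function cssToXPath(selector) {
  if (selector.startsWith('#')) {
    return '//*[@id="' + selector.slice(1) + '"]';
  }
  if (selector.startsWith('.')) {
    // @class is a space-separated list, so a plain equality test on the
    // whole attribute would be wrong; pad with spaces and search instead.
    return "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
      + selector.slice(1) + " ')]";
  }
  return '//' + selector; // bare tag name
}
```

The `.class` case is the one that makes hand-written XPath painful: what CSS spells in five characters takes a whole `contains(concat(...))` incantation to get right.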
Wow, I was just thinking this morning how awesome it would be to make a desktop app that could crawl websites with jQuery. And since node.js has a Windows installer, this sounds like a much better solution than the C# HtmlAgilityPack I've been using.
Hm, I tried doing this on Windows, but it turned out to be a lot of work to get it set up correctly. npm is hard to install on Windows, and the jquery project depends on contextify, which includes a native binary. There is a Windows build, though: https://github.com/Benvie/contextify
[+] [-] pshc|14 years ago|reply
But jQuery is a great scraper if your transformation is complex and non-streamable.

[1] https://github.com/aredridel/html5
[+] [-] ricardobeat|14 years ago|reply
[+] [-] peteretep|14 years ago|reply
I wasn't sure how well that was going to scale, and I was worried people would get weird about sending the entire contents of the page they're on. I now have a 90%-working solution where it's all done in-browser, with a bunch of classes I've been building alongside a node.js-based set of testing tools.
[+] [-] bialecki|14 years ago|reply
[+] [-] badmash69|14 years ago|reply
[+] [-] cosmic_shame|14 years ago|reply
[+] [-] orc|14 years ago|reply
[+] [-] slashclee|14 years ago|reply