I was scraping with jQuery for a while, but it felt like an awful lot of overhead. For the simpler scraping tasks that happen a lot, I've actually gone back to nuts and bolts with html5[1]'s tokenizer and a custom state machine that accumulates only the data I want. At no point is any DOM node created in memory, let alone the entire DOM tree. That means I feel safer running many of these in parallel on a VPS. It also means I can write a nice streaming API that starts emitting data the moment it has enough input. Buffering input just feels wrong in node.js.
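This isn't pshc's actual code (the html5 module ships a real, spec-compliant tokenizer), but a minimal sketch of the pattern: feed HTML in arbitrary chunks to a tiny state machine that keeps only the data you want, here the `<title>` text, without ever building a DOM.

```javascript
// Toy streaming extractor: accepts HTML in arbitrary chunks and
// accumulates only the <title> text. No DOM node is ever built; the
// whole "parser" is a two-state machine plus a small carry-over buffer.
// (Illustration only: it ignores comments, scripts, and CDATA.)
function createTitleExtractor() {
  let buffer = '';      // unparsed tail carried between chunks
  let inTitle = false;  // state: inside <title>...</title>?
  let title = '';

  return {
    write(chunk) {
      buffer += chunk;
      for (;;) {
        if (!inTitle) {
          const m = buffer.match(/<title[^>]*>/i);
          if (!m) {
            // Keep only a possible partial tag at the end of the buffer,
            // so a <title> split across chunks is still found next time.
            const lt = buffer.lastIndexOf('<');
            buffer = lt === -1 ? '' : buffer.slice(lt);
            return;
          }
          buffer = buffer.slice(m.index + m[0].length);
          inTitle = true;
        } else {
          const end = buffer.search(/<\/title\s*>/i);
          if (end === -1) {
            // Flush all but a tail that might be a split "</title>".
            const keep = Math.min(buffer.length, 16);
            title += buffer.slice(0, buffer.length - keep);
            buffer = buffer.slice(buffer.length - keep);
            return;
          }
          title += buffer.slice(0, end);
          buffer = buffer.slice(end);
          inTitle = false;
        }
      }
    },
    result() { return title; },
  };
}
```

Because state lives in a couple of strings rather than a node tree, memory stays flat no matter how large the page is, which is what makes running many of these in parallel cheap.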
Actually, I'm doing this for my SUPER SECRET startup at the moment. Originally the front-end would just send the back-end the whole HTML of a user's page when they executed the browser plugin, and the back-end would intercept it and knock it up in Perl.
One of my biggest pet peeves with crawling the web is using XPath. Not because I have strong feelings about XPath itself; it's just that I use CSS selector syntax so much that it's a pain not to be able to leverage that knowledge in this domain as well. Something like this is really awesome and is going to make crawling the web more accessible.
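For the simple cases the mapping between the two syntaxes is mechanical, which is why the knowledge feels so transferable. A toy sketch (covering only bare tags, `.class`, and `#id`; real selector engines handle the full grammar of combinators and pseudo-classes):

```javascript
// Toy CSS -> XPath translator for the three simplest selector forms:
//   'div'   -> '//div'
//   '#main' -> '//*[@id="main"]'
//   '.item' -> a class test (XPath has no class-attribute sugar)
function cssToXPath(selector) {
  if (selector.startsWith('#')) {
    return '//*[@id="' + selector.slice(1) + '"]';
  }
  if (selector.startsWith('.')) {
    // @class is a space-separated list, so a plain equality test on the
    // whole attribute would be wrong; pad with spaces and search instead.
    return "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
      + selector.slice(1) + " ')]";
  }
  return '//' + selector; // bare tag name
}
```

The `.class` case is the one that makes hand-written XPath painful: what CSS spells in five characters takes a whole `contains(concat(...))` incantation to get right.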
Wow, I was just thinking this morning how awesome it would be to make a desktop app that could crawl websites with jQuery. And since node.js has a Windows installer, this sounds like a much better solution than the C# HtmlAgilityPack I've been using.
Hm, I tried doing this on Windows, but it turned out to be a lot of work to get it set up correctly. npm is hard to install on Windows, and the jquery project depends on contextify, which includes a native binary. There is a Windows build, though: https://github.com/Benvie/contextify
[+] [-] pshc|14 years ago|reply
But jQuery is a great scraper if your transformation is complex and non-streamable.

[1] https://github.com/aredridel/html5
[+] [-] ricardobeat|14 years ago|reply
[+] [-] peteretep|14 years ago|reply
I wasn't sure how well that was going to scale, and I was worried people would get weird about sending the entire contents of the page they're on. I now have a 90%-working solution where it's all done in-browser, with a bunch of classes I've been building alongside a node.js-based set of testing tools.
[+] [-] bialecki|14 years ago|reply
[+] [-] badmash69|14 years ago|reply
[+] [-] cosmic_shame|14 years ago|reply
[+] [-] orc|14 years ago|reply
[+] [-] slashclee|14 years ago|reply