item 1665999

Using jQuery and node.js to scrape html pages in 5 lines

133 points| Ainab | 15 years ago |blog.nodejitsu.com | reply

43 comments

[+] drats|15 years ago|reply
Strange popup: "Hello, i see you are coming from hacker news.

the article you clicked on was most certainly not submitted by nodejitsu.

news.ycombinator has a long history of squashing articles and submitters that aren't funded by y-comb.

most of this is done through their "silent" banning and censoring mechanisms, that leave people not even realizing they have been silenced.

i hope you enjoy this article, and remember that HN is extremely biased and that you should keep your horizons broad."

While I would agree that HN is biased towards YC-funded projects, I would not agree that it is biased against non-YC projects or news. In fact, the majority of the items on HN are non-YC, and the same goes for submitters and commenters over the year or more I've been here.

On a different note: Hpricot is no longer representative of Ruby scraping; nokogiri (http://nokogiri.org/) is where it's at, and it has an Hpricot translation layer if you need to migrate. Even though I've decided to standardize on Python for everything else, I'll still go back to Ruby just for nokogiri when it comes to scraping.
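(The kind of one-line, selector-driven link extraction nokogiri gives Ruby can be sketched with Python's stdlib html.parser for a runnable illustration; the class name and sample markup below are made up for the example, and a real nokogiri call would just be doc.css('a').)

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag -- roughly what
    doc.css('a') hands you in nokogiri, minus the nice selector API."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside
        self._text = []     # text fragments collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

html = '<p><a href="http://nokogiri.org/">nokogiri</a> beats <a href="/old">Hpricot</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

The event-driven stdlib parser takes ~25 lines for what a tree-building library does in one; that verbosity gap is exactly why people reach for nokogiri (or jQuery selectors on the node side).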

[+] shadowsun7|15 years ago|reply
Marak, the guy behind Nodejitsu (and, presumably, the popup message), is known to exhibit asshole behaviour. (See: http://news.ycombinator.com/item?id=1448309) The popup message is consistent with what HN knows of him.

Whether being a jerk justifies banning I can't say - but his assertion that HN is biased has little justification (particularly when you consider that the writer himself is biased.) Kindly ignore.

[+] brown9-2|15 years ago|reply
The language in the popup pretty much makes me uninterested in actually reading the article.
[+] robinduckett|15 years ago|reply
It's because the whole Nodejitsu team is perma-banned from HN.
[+] steilpass|15 years ago|reply
Is there any information or discussion about the "censoring mechanisms"?
[+] rb2k_|15 years ago|reply
In my benchmarks, hpricot performs better for simple link extraction (and a regexp performs even better :D)
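(The regexp approach mentioned here might look like the following; this is a minimal Python sketch, not the commenter's actual benchmark code, and the pattern is deliberately naive.)

```python
import re

# A deliberately simple pattern: grab the href value of anchor tags.
# Fast, but fragile -- it misses unusual quoting and will happily match
# hrefs inside comments, which is the usual trade-off versus a real parser.
HREF_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html):
    """Return every href found by the (naive) regex, in document order."""
    return HREF_RE.findall(html)

html = '<a href="http://example.com/">one</a> <A HREF=\'/two\'>two</A>'
print(extract_links(html))  # -> ['http://example.com/', '/two']
```

A single compiled regex pass avoids building any tree at all, which is why it tends to win these micro-benchmarks even against fast parsers like hpricot.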
[+] aneth|15 years ago|reply
I'm not sure how you can be biased toward something without being biased against everything else, but the spirit of your meaning seems true. YC companies and submitters have an institutional advantage, but I am not with a YC company and don't feel discriminated against.
[+] robinduckett|15 years ago|reply
Hey guys. The Nodejitsu team and Marak (http://www.github.com/Marak), the guy behind Nodejitsu, are perma-banned from HN and can't respond to your queries.

He sends his regards, and if you'd like to contact him visit the #Node.js IRC channel @ Freenode

[+] il|15 years ago|reply
I have a question: Does scraping like this execute Javascript on the scraped page? Am I able to access the output of Javascript/AJAX on that page?

As far as I know this is impossible with any other server-side scraping technology.

If so, that would be amazingly useful for a couple of my side projects, much easier than parsing their Javascript code and extracting the info I need.

[+] robinduckett|15 years ago|reply
You'd have to parse the page separately and run each inline or linked script in a sandbox that can talk to jsdom, but it could be done.
[+] vially|15 years ago|reply
If you are using Java you could use HtmlUnit which has fairly good JavaScript support (including AJAX).
[+] fmw|15 years ago|reply
The article lists BeautifulSoup as the Python choice for scraping, but that isn't necessarily true. I'm using http://scrapy.org/, for example, which is a scraping framework that uses lxml and xpath by default.
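(The XPath-driven extraction style Scrapy's selectors give you can be sketched with the stdlib's ElementTree, which supports a limited XPath subset on well-formed markup; the sample document below is invented for the illustration, and real Scrapy selectors sit on lxml, handle tag soup, and accept full XPath.)

```python
import xml.etree.ElementTree as ET

# ElementTree only parses well-formed XML/XHTML and supports a small
# XPath subset, but the querying style matches what Scrapy selectors do.
doc = ET.fromstring(
    '<html><body>'
    '<div class="post"><a href="/a">first</a></div>'
    '<div class="post"><a href="/b">second</a></div>'
    '</body></html>'
)

# Text of links inside div.post, then every href in the document.
titles = [a.text for a in doc.findall(".//div[@class='post']/a")]
hrefs = [a.get('href') for a in doc.findall('.//a[@href]')]
print(titles, hrefs)  # -> ['first', 'second'] ['/a', '/b']
```

The appeal of the XPath style is that the query names the structure you want ("links inside post divs") instead of hand-walking the tree, which is what makes lxml/Scrapy feel closer to jQuery selectors than to BeautifulSoup's navigation API.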
[+] bmelton|15 years ago|reply
I don't know if it was edited after your post, but it lists Scrapy right next to Beautiful Soup.
[+] unknown|15 years ago|reply

[deleted]

[+] wmil|15 years ago|reply
HN deletes posts to try to keep things on topic. It's heavy-handed, but there aren't a lot of ways to stop flame wars and pun threads.
[+] tcarnell|15 years ago|reply
Interesting. When I built http://cQuery.com (Content Query Engine), I investigated a number of HTML parsing and content extraction options. I had played with Rhino and John Resig's env.js (http://ejohn.org/blog/bringing-the-browser-to-the-server/) to run jQuery server-side.

For portability, performance and flexibility, I finally settled on writing my own HTML parser and CSS selection engine from scratch.

[+] knowtheory|15 years ago|reply
The article reads "The challenge with using these libraries is that they all have their own quirks that can make working with HTML, CSS and Javascript challenging."

And that's only true if you insist on doing your page manipulation in Javascript. I'm perfectly happy doing my page manipulation in Ruby with Nokogiri. Here's an example:

(code formatting on HN sucks, so it's on my blog, apologies)

http://blog.knowtheory.net/post/1074676060/xml-manipulation-...

[+] forsaken|15 years ago|reply
Site appears down. Is node popular enough yet for the "Node doesn't scale" talk? :)
[+] drats|15 years ago|reply
Yes, as clichéd as it is, I think it's time. I couldn't use at least 6 of the node challenge top 10 when it hit the HN front page, and the rest were beset by bugs and didn't work: the pixel one where you form characters stopped showing the shape I was supposed to match after a few rounds, and the robot-war one never let me buy or release my wave of robots on Chrome or Firefox. Overall it was a totally disappointing experience.
[+] jfager|15 years ago|reply
Ignoring the drama: my current favorite scraping combo is NekoHTML underneath Scala's completely kickass combo of pattern matching and XML literals.