
Show HN: Kimono – Never write a web scraper again

717 points | pranade | 12 years ago | kimonify.kimonolabs.com | reply

230 comments

[+] randomdrake|12 years ago|reply
The presentation is beautiful and the website is great, but the tech broke so I have no idea how or if this even works. This is a wonderful concept and one I've talked about doing with others. I was really excited to try this. I watched the demo video and it seemed straightforward.

I went to try and use it on the demo page it provides, going through and adding things, but when I went to save it, I just received an error that something went wrong. Well, crap. That was a waste of time. Oh well, maybe it's just me.

Alright, I'll give it another shot using the website they used in the demo. Opened up a Hacker News discussion page and started to give it a try. Immediately it was far less intelligent than the demo. Clicking on a title proceeded to select basically every link on the page. Somehow I clicked on some empty spots as well. Nothing was being intelligently selected like it was in the demo. Fine, that wasn't working tremendously well, but I wanted to at least see the final result.

Same thing: just got an error that something went wrong and it couldn't save my work.

Disappointing. I still might try it again when it works 'cause it's a great idea if they really pulled it off. So far: doesn't seem to be the case.

[+] pranade|12 years ago|reply
Sorry you had a bad first experience. We've tested this on a lot of sites and it works stably across many different cases, but we haven't solved every case yet. Thanks for letting us know about the discussion page. We'll look into the bugs right now.
[+] ricardobeat|12 years ago|reply
HN pages are possibly the worst case: it's very hard to infer structure from their 1998-era coding standards. You'll have a better chance with an alternative interface like http://ihackernews.com/ or http://hckrnews.com (no comments, though).
[+] BinaryBird|12 years ago|reply
In my case, it worked for some pages and not for others. Currently I'm using Feedity (http://feedity.com) for all business-centric data extraction and it has been working great (although it's not as flexible as kimono).
[+] dunham|12 years ago|reply
The Simile group at MIT did something similar back around 2006. Automatic identification of collections in web pages (repeated structures), detection of fields by doing tree comparisons between the repeated structures, and fetching of subsequent pages.

The software is abandoned, but their algorithms are described in a paper:

    http://people.csail.mit.edu/dfhuynh/research/papers/uist2006-augmenting-web-sites.pdf
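The core idea from the Simile work (runs of sibling subtrees with identical structure are probably a collection) can be sketched in a few lines. This toy version uses nested tuples instead of a real DOM and is not the paper's actual algorithm:

```python
from itertools import groupby

# Toy DOM node: (tag, [children]). A real implementation would parse HTML.
def shape(node):
    """Structural signature of a subtree: the tag plus its children's shapes."""
    tag, children = node
    return (tag, tuple(shape(c) for c in children))

def find_collections(node, min_repeat=3):
    """Find runs of consecutive siblings sharing a shape (likely list items)."""
    found = []
    tag, children = node
    for sig, run in groupby(children, key=shape):
        run = list(run)
        if len(run) >= min_repeat:
            found.append(run)
    for child in children:
        found.extend(find_collections(child, min_repeat))
    return found

page = ("body", [
    ("h1", []),
    ("ul", [("li", [("a", [])]), ("li", [("a", [])]), ("li", [("a", [])])]),
])
collections = find_collections(page)
# collections[0] is the run of three structurally identical <li> subtrees
```

Field detection would then diff the members of each run against one another, as the paper describes.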
[+] losvedir|12 years ago|reply
Oh, hey, memories. I worked one summer with David Huynh (who you're linking to there) and David Karger (his thesis advisor) on one of the Simile projects.

I vaguely remember playing around with this tool you mentioned. I thiiiiink it was this one[0], although it seems to be superseded by this one[1] now.

[0] http://simile.mit.edu/wiki/Piggy_Bank [1] http://simile.mit.edu/wiki/Sifter

[+] pranade|12 years ago|reply
Thanks a ton for sharing... the association algorithms are where we've been spending a good chunk of our time. Will read through this.
[+] tlianza|12 years ago|reply
If you're interested in hosted solutions that try to do automatic identification of pages, diffbot is worth a look. We've had some good experiences: http://diffbot.com/
[+] DanBlake|12 years ago|reply
Show me it working with authentication and you will have a customer. Scraping is always something you need to write because the shit you want to get is only shown when you are logged in.
[+] pranade|12 years ago|reply
Yes, it's one of the most popular feature requests. We don't support auth yet, but it's on our shortlist and we hope to have it ready soon.
[+] automately|12 years ago|reply
Creator of Automately here. Our service could definitely be something you'd be interested in. While we aren't directly in the business of web scraping, we do have a powerful automation service that can accomplish those needs using simple JavaScript and our scalable automation API.

We are accepting early access requests right now. Check us out! http://automate.ly/

[+] georgemcbay|12 years ago|reply
I've written more web scraping code than I care to admit. A lot of the apps that ran on chumby devices used scraping to get their data (usually(!) with the consent of the website being scraped) since the device wasn't capable of rendering HTML (it eventually did get a port of Qt/WebKit, but that was right before it died and it wasn't well integrated with the rest of the chumby app ecosystem).

This service looks great. Good work! But since you seem to host the APIs created, how do you plan to get around the centralized access issues? On the chumby we had to do a lot of web scraping on the device itself (even though the string processing needed for scraping required a lot of hoop-jumping optimization to run well in ActionScript 2 on a slow ARMv5 chip with 64 MB of total RAM) to avoid all the requests coming from the same set of chumby-server IP addresses. Companies tend to notice lots of requests coming from the same server block very quickly and will often rate-limit the hell out of you, which could result in a situation where one heavy-usage scraper destroys access for every other client trying to scrape the same source.

[+] pranade|12 years ago|reply
Access, legality and rate-limiting issues come up a lot. We're working on a couple of things to address them. The first is an intelligent job-distribution system that consolidates scrapes across users and hits sites (and pages) at human-like intervals. The second is a portal for webmasters that gives them privileged access to analytics on data being extracted from their sites, and the ability to "turn on or off" kimono APIs if they see fit. This way, via kimono, a webmaster at chumby could "provision" certain kimono users. We're yet to see whether the latter works out. Thanks for the input.
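The consolidation-plus-spacing idea described here can be sketched in a few lines. This is a made-up toy, not Kimono's actual scheduler; all names and numbers are illustrative:

```python
import random

def schedule_fetches(requested_urls, base_delay=5.0, jitter=3.0):
    """Consolidate duplicate scrape requests across users, then space the
    remaining fetches out at randomized, human-like intervals."""
    unique = list(dict.fromkeys(requested_urls))  # dedupe, preserve order
    plan, t = [], 0.0
    for url in unique:
        plan.append((round(t, 2), url))           # (seconds offset, url)
        t += base_delay + random.uniform(0, jitter)
    return plan

# Two users asking for the same page result in a single shared fetch.
plan = schedule_fetches(["a.com/p1", "a.com/p1", "a.com/p2"])
```

The jitter keeps the request pattern from looking like a metronome, which is one of the first things rate limiters flag.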
[+] GigabyteCoin|12 years ago|reply
I'm curious how you plan to avoid/circumvent the inevitable hard IP ban that the largest (and most sought-after) targets will place on you and your services once you begin to take off?

I could have really used a service like this just yesterday, actually. I ended up fiddling around with iMacros and got about 80% of what I was trying to achieve.

[+] pranade|12 years ago|reply
It's a great question. What we're really trying to do is make data accessible programmatically and at scale. We want to connect data providers and data consumers with APIs in a way that's mutually beneficial vs. being a tool for data theft. Our hope is to (once we scale) actually work with data providers directly on the distribution of their data so the IP ban becomes a non-issue.
[+] hcarvalhoalves|12 years ago|reply
This is excellent. Even if it doesn't work for scraping all sites, it simplifies the average use case so much that it's not even funny.

Feature proposal: deal with pagination.

[+] scotty79|12 years ago|reply
Another feature, a simple one: allow adding filters to the data stream. For example: only posts that contain the word "bitcoin" in the title, or only those with 50 upvotes or more.
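The filters being asked for here amount to a couple of predicates over the extracted rows. A minimal sketch (field names like `title` and `points` are assumptions, not Kimono's schema):

```python
def apply_filters(rows, contains=None, min_points=None):
    """Filter extracted rows: keep titles mentioning a word and/or
    rows with at least a given number of upvotes."""
    out = rows
    if contains is not None:
        out = [r for r in out if contains.lower() in r["title"].lower()]
    if min_points is not None:
        out = [r for r in out if r["points"] >= min_points]
    return out

rows = [
    {"title": "Bitcoin hits new high", "points": 120},
    {"title": "Show HN: my side project", "points": 40},
]
bitcoin_only = apply_filters(rows, contains="bitcoin")  # first row only
popular_only = apply_filters(rows, min_points=50)       # first row only
```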
[+] pranade|12 years ago|reply
Thanks, glad you like it. Pagination, dynamic tabs (and crawling in general) are big features we really want to add soon; a lot of people are asking for them. The challenge will be integrating them with the current UX, which we're trying to keep super simple.
[+] fsckin|12 years ago|reply
Constructive Tone: I figured that it might be nifty to scrape cedar pollen count information from a calendar and then shoot myself an email when it was higher than 100 gr/m3.

This would be a pretty difficult thing to grab when scraping normally, but the app errors before loading the content:

https://www.keepandshare.com/calendar/show_month.php?i=19409...

JS error: An error occurred while accessing the server, please try again. Error Reference: 6864046a
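Once the page can be scraped, the alerting half of this use case is tiny. A sketch of the threshold check (the text format and `gr/m3` unit string are assumptions based on the comment; sending the actual email is left out):

```python
import re

def should_alert(scraped_text, threshold=100):
    """Pull the pollen count (grains per cubic metre) out of scraped
    calendar text and compare it against the alert threshold."""
    m = re.search(r"(\d+)\s*gr/m3", scraped_text)
    return bool(m) and int(m.group(1)) > threshold

high = should_alert("Cedar: 250 gr/m3")   # True -> send the email
low = should_alert("Cedar: 40 gr/m3")     # False -> stay quiet
```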

[+] pranade|12 years ago|reply
Thanks for letting us know. Just tried it and I'm getting the same error. The page is loading content dynamically from another source... We'll look into this and see if we can get it working on this page.
[+] thinkzig|12 years ago|reply
Great work so far. The tool was very intuitive and easy to use.

My suggestion: once I've defined an API, let me apply it to multiple targets that I supply to you programmatically.

The use case driving my suggestion: I'm an affiliate for a given eCommerce site. As an affiliate, I get a data feed of items available for sale on the site, but the feed only contains a limited amount of information. I'd like to make the data on my affiliate page richer with extra data that I scrape from a given product page that I get from the feed.

In this case, the page layout for all the various products for sale is exactly the same, but there are thousands of products.

So I'd like to be able to define my Kimono API once - let's call it the CompanyX.com Product Page API - then use the feed from my affiliate partner to generate a list of target URLs that I feed to Kimono.

Bonus points: the list of products changes all the time. New products are added, some go away, etc. I'd need to be able to add/remove target URLs from my Kimono API individually as well as adding them in bulk.

Thanks for listening. Great work, again. I can't wait to see where you go with this.

Cheers!
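The add/remove bookkeeping in this request is a plain set diff between the API's current target list and the latest feed. A small sketch (URLs and function name are illustrative, not a Kimono API):

```python
def sync_targets(current, feed_urls):
    """Diff the API's current target URLs against today's affiliate feed;
    return (urls_to_add, urls_to_remove)."""
    current, feed = set(current), set(feed_urls)
    return sorted(feed - current), sorted(current - feed)

to_add, to_remove = sync_targets(
    current=["x.com/p/1", "x.com/p/2"],
    feed_urls=["x.com/p/2", "x.com/p/3"],
)
# to_add == ["x.com/p/3"], to_remove == ["x.com/p/1"]
```

Running this on every feed refresh keeps the target list in step with the product catalog without any manual edits.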

[+] pranade|12 years ago|reply
Thanks a ton for the feedback. Getting data from multiple similarly structured URLs programmatically is something we're working on now. We love hearing about the use cases you want to use this for so we can make sure we build out the right features to make kimono useful for you.
[+] sync|12 years ago|reply
Undo button is awesome.

More web apps need an undo button.

[+] rlpb|12 years ago|reply
Are you familiar with ScraperWiki? I'm wondering how your work fits in with it.

Edit: looks like they've moved away from that space, but have an old version available at: https://classic.scraperwiki.com/

[+] Maxious|12 years ago|reply
The people who scrape data to avoid paying for APIs are the same people who will not pay for a service to make scraping easier ;)
[+] trey_swann|12 years ago|reply
This is a great tool! In a past life we needed a web scraper to pull single game ticket prices from NBA, MLB, and NHL team pages (e.g. http://www.nba.com/warriors/tickets/single). We needed the data. But, when you factor in dynamic pricing and frequent page changes you are left with a real headache. I wish Kimono was around when we were working on that project.

I love how you can actually use their "web scraper for anyone" on the blog post. Very cool!

[+] pknight|12 years ago|reply
That UI made me go wow; this could be an awesome tool. An idea that pops into my mind: grabbing data from those basic local sites run by councils, local newspapers, etc. and putting it into a useful app.

How dedicated are you guys to making this work? I'd imagine there are quite a few technical hurdles in keeping a service like this running long term while not getting blocked by various sites.

[+] pranade|12 years ago|reply
Love your suggestion. We're committed to making kimono better and we're working on it all the time. We want to make sure it's a responsible scraper, so we want to work together with webmasters in cases where there might be blocking but the data is legal to share...
[+] fnordfnordfnord|12 years ago|reply
>Sorry, can't kimonify

>According that web site's data protection policy, we were unable to kimonify that particular page.

Sigh... Oh well... Back to scraping.

[+] pranade|12 years ago|reply
What page were you trying to hit? We'll check it out.
[+] bambax|12 years ago|reply
> Web scraping. It's something we all love to hate. You wish the data you needed to power your app, model or visualization was available via API. But, most of the time it's not. So, you decide to build a web scraper. You write a ton of code, employ a laundry list of libraries and techniques, all for something that's by definition unstable, has to be hosted somewhere, and needs to be maintained over time.

I disagree. Web scraping is mostly fun. You don't need "a ton of code" and "a laundry list of libraries", just something like Beautiful Soup and maybe XSLT.

The end of the statement is truer: it's not really a problem that your web scraper will have to be hosted somewhere, since the thing you're using it for also has to be hosted somewhere, but yes, it needs to be maintained and it will break if the source changes.

But I don't see how this solution could ever automatically evolve with the source, without the original developer doing anything?

[+] littledot5566|12 years ago|reply
Perhaps this could be automated by finding the same content in two versions of the DOM, doing a diff on the structure, and updating the rules?
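The first half of that suggestion (locate the same content in the old and new DOM, then read off the new path) can be sketched with a toy DOM of `(tag, text, children)` tuples. This illustrates the idea only; a real version would have to tolerate inexact text matches:

```python
def find_path(node, target_text, path=()):
    """Return the tag path to the node whose text equals target_text, or None."""
    tag, text, children = node
    path = path + (tag,)
    if text == target_text:
        return path
    for child in children:
        hit = find_path(child, target_text, path)
        if hit:
            return hit
    return None

old_dom = ("body", None, [("div", None, [("span", "price: $5", [])])])
new_dom = ("body", None, [("section", None, [("em", "price: $5", [])])])

old_path = find_path(old_dom, "price: $5")   # ("body", "div", "span")
new_path = find_path(new_dom, "price: $5")   # ("body", "section", "em")
# The extraction rule could then be rewritten from old_path to new_path.
```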
[+] rpedela|12 years ago|reply
I assume you get an error on the hourly, daily, monthly, whatever update which you are notified about. Then you can redo the semi-manual setup of the scraper.
[+] IbJacked|12 years ago|reply
Wow, this is looking good. I wish I'd had it available to me 6 months ago! Nice job :D

I don't know if it's just me or not, but it's not working for me in Firefox (OS X Mavericks 10.9.1, Firefox v26). The X's and checkmarks aren't showing up next to the highlighted selections. It works fine in Safari.

[+] pranade|12 years ago|reply
Thanks for letting us know. We've tested on some versions of Firefox, but not v26 on Mavericks. We'll look into this.
[+] eth|12 years ago|reply
Great tool!

I'm coming at things from a non-coder perspective and found it easy to use, and easy to export the data I collected into a usable format.

For my own enjoyment, I like to track and analyze Kickstarter project statistics. Options up until now have been either labor intensive (manually entering data into spreadsheets) or tech heavy (JSON queries, KickScraper, etc. pull too much data, and my lack of coding expertise prevents me from paring it down/making it useful quickly and automagically), as Kickstarter lacks a public API. Sure, it is possible to access their internal API, or I could use KickScraper, but did I mention the thing about how I don't, as many of you say, "code"?

What I do understand is auto-updating .CSV files, and that's what I can get from Kimono. Looking forward to continued testing/messing about with Kimono!

[+] alternize|12 years ago|reply
Looks promising!

To be fully usable for me, there are some features missing:

- It lacks manual editing/correcting possibilities: I've tried to create an API for http://akas.imdb.com/calendar/?region=us with "date", "movie", "year". Unfortunately, it failed to group the date (title) with the movies (list entries) and instead created two separate, unrelated collections (one for the dates, one for the movies).

- It lacks the ability to edit an API; the recommended way is to delete and recreate.

Small bug report: there was a problem saving the API, or at least I was told saving failed; it nevertheless seems to be stored in my account.

[+] pranade|12 years ago|reply
Thanks for the feedback. We're working on a feature that will allow you to edit APIs you've created and also edit the selectors and regex (right now, in advanced mode, you can see them, but cannot edit). We're looking into your bug now...
[+] aqme28|12 years ago|reply
I would seriously consider rethinking that Favicon.
[+] misuba|12 years ago|reply
Seconded. I can't show that to anyone at work.
[+] rfnslyr|12 years ago|reply
Why? It looks like an onion with angry eyebrows.

Do you have something against vegetables?

[+] lips|12 years ago|reply
I'm experiencing login errors (PEBKAC caveat: password manager, 2x checked, reset), but the support confirmation page is a nice surprise.

http://i.imgur.com/w01CoUy.jpg