item 7582858

Import.io – Structured Web Data Scraping

100 points | steeples | 12 years ago | import.io

34 comments

[+] uuid_to_string | 12 years ago
As a PoC, I would be willing to "turn the web into data", i.e., produce one of the formats offered by these "services": CSV.

I will use only standard UNIX utilities, no Python, etc. As such, you "own" the code. No SaaS. The result will be portable and run on any UNIX.

I believe I can deliver in fewer words of code and that the result will be easier to modify when sites change.

You pay nothing. Post your scraping "challenges" to HN.

I enjoy turning the web into data.

Some people enjoy working with HTML, CSS, Javascript, etc. I prefer working with raw data.

It is interesting to hear that some people are willing to pay to have the HTML, CSS, Javascript, etc. stripped out.

[+] ycmike | 12 years ago
HN,

So who do you guys use more? Import.io or Kimono? I have heard good things about both.

[+] Jake232 | 12 years ago
I write my own custom scrapers; I prefer the flexibility, and I feel safer knowing the service isn't going to disappear at any minute.

If anybody is interested, I wrote a detailed article on scraping not so long back that was well received here: http://jakeaustwick.me/python-web-scraping-resource/
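In the roll-your-own spirit, here is a minimal sketch of a hand-written extractor using only the Python standard library's html.parser; the `LinkExtractor` class and the sample HTML are invented for illustration, and a real scraper would first fetch the page (e.g. with urllib.request) before feeding it to the parser.

```python
# Minimal link scraper built on the stdlib's HTMLParser -- no framework,
# no SaaS dependency. Collects (href, link text) pairs from a document.
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> we are currently inside
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Invented sample HTML standing in for a fetched page.
sample = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
parser = LinkExtractor()
parser.feed(sample)
print(parser.links)  # [('/a', 'First'), ('/b', 'Second')]
```

The upside of this approach is exactly what the comments argue: the code is yours, it has no external service to vanish, and it is small enough to patch when the target site changes.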

[+] samstave | 12 years ago
I tried Kimono, but it cannot auth into the sites I want to pull the data from.

Just grabbed import.io; will see if it can log into sites and grab the data from services I am already paying thousands per month for.

EDIT:

To add some context: I pay about $3,000 per month for some monitoring services which do not have any real reporting mechanisms. So for my daily and weekly reports, I have to manually compile them, screenshot a ton of things, compose an email, and send it.

I want to configure a scraper to automatically grab screens of things I want regularly and email them.

I want to have a script that will grab many different pieces of data (visual graphs, typically) and put them all into one email.

I am working with my monitoring vendors to get them to add reporting, but until that happens I am tired of spending a couple of hours per week screen-capping graphs...
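The "bundle graphs into one email" half of that workflow can be done with the standard library alone; a sketch follows. The addresses and image bytes are placeholders, and actually capturing the graphs (e.g. with a headless browser) is a separate problem not shown here.

```python
# Sketch: bundle captured monitoring graphs (name -> PNG bytes) into a
# single report email, using only the stdlib's email package.
from email.message import EmailMessage


def build_report(image_blobs):
    """Return an EmailMessage with each image attached as name.png."""
    msg = EmailMessage()
    msg["Subject"] = "Weekly monitoring report"
    msg["From"] = "reports@example.com"  # placeholder address
    msg["To"] = "team@example.com"       # placeholder address
    msg.set_content("Attached: this week's monitoring graphs.")
    for name, blob in image_blobs.items():
        msg.add_attachment(blob, maintype="image", subtype="png",
                           filename=f"{name}.png")
    return msg


# Placeholder bytes standing in for real screenshots.
report = build_report({"cpu": b"\x89PNG...", "latency": b"\x89PNG..."})
print([p.get_filename() for p in report.iter_attachments()])
# ['cpu.png', 'latency.png']
```

Sending is then one call to `smtplib.SMTP(...).send_message(report)`, and the whole script can run from cron for the daily/weekly cadence.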

[+] rch | 12 years ago
I'm evaluating these to augment a system I'm building on top of casper. This is the first I've seen of this one, but right out of the gate I think I prefer Kimono.
[+] thejosh | 12 years ago
I prefer to depend on code that doesn't rely on an API that could vanish the next day or cost a bucket to run.
[+] RaphiePS | 12 years ago
There are a bunch of comments about rolling your own scraper instead of relying upon a possibly unreliable SaaS app.

That makes me think -- would it be viable to run a service that, instead of running the scraping on their own servers, simply gave you a custom binary to run?

Assuming that you trusted the executable, you would never have to worry about the company failing. It'd just be a one-time fee, and yours to use in perpetuity. Presumably updates would be free.

[+] caio1982 | 12 years ago
That's a really neat idea I'd pay for. Not sure about the sustainability of the model though.
[+] hyuuu | 12 years ago
If you use Scrapy (which is an awesome Python scraping framework), you can plug in third-party solutions such as http://crawlera.com/

It's not really server-level hosting, but you get the benefits of their network.
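For reference, wiring Crawlera into a Scrapy project is a settings.py change; the sketch below follows the scrapy-crawlera plugin's documented setting names, and the API key is a placeholder.

```python
# settings.py fragment: route a Scrapy project's requests through
# Crawlera's proxy network via the scrapy-crawlera downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawlera.CrawleraMiddleware": 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your-api-key>"  # placeholder, taken from your account
```

The rest of the spider code stays unchanged, which is the appeal: the proxy rotation is transparent to the scraping logic.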

[+] robotfelix | 12 years ago
Great to see these guys are now out of Beta!

While their real-time Extractors aren't quite as quick as doing it yourself, we've found them particularly useful for sites that require JavaScript and/or cookies.

It's also worth mentioning that it's quick to get started. You can start playing around with real data without having to dig into a site's URL structure, and then write your own scraper later if needed.

[+] chrisherring | 12 years ago
Isn't it illegal to scrape without permission? How would import.io handle the case where a large site comes back with legal threats because one of their users has scraped the wrong site? Can they claim non-responsibility?

Also, what happens when sites start blocking their IPs due to repeated scraping, or is that unlikely to happen?
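The blocking question is partly about crawler etiquette: honoring robots.txt and crawl delays makes blocks much less likely (it doesn't settle the legal question). A sketch using the standard library's robotparser; the rules here are an inline sample rather than fetched from a live site.

```python
# Check robots.txt rules before scraping, using the stdlib's robotparser.
# The rules below are an invented sample; normally you'd call
# rp.set_url("http://example.com/robots.txt") and rp.read() instead.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("mybot", "http://example.com/public/page"))   # True
print(rp.can_fetch("mybot", "http://example.com/private/page"))  # False
```

A scraper that checks `can_fetch` before each request and sleeps for the declared crawl delay is far less likely to end up on a site's block list.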

[+] seivan | 12 years ago
Heads up: the application is placed in ~/Desktop, not in /Applications.
[+] th0br0 | 12 years ago
They presented last year at Yahoo!'s Hack Europe: London hackathon. It's an interesting concept; they've come far since their initial presentation, and while the app has its quirks, I've come to use it occasionally for some tasks.

I hope that they'll manage to monetize this properly. I don't see why I should pay to use a scraping rule when I can just write the scraper myself, which doesn't cost me much more time.

[+] fibertera | 12 years ago
What kind of legitimate uses are there for something like this? This is not a sarcastic question. It seems like an obvious spam magnet, but if people are using it legitimately, wouldn't their sources already be providing an API or RSS feed?
[+] antjanus | 12 years ago
I have my own use case for it, which will probably mirror other people's. I run my own blog and thus have ads and affiliate links there. The thing is, as good as Google AdSense is, it's shitty for my site and my topic (web dev).

What am I left with? Great affiliates like Team Treehouse, Lynda.com, framework themes, and Udemy. The problem is that none of those offer any kind of a good API. All they have is a link and possibly an image that they provide.

By using Kimono, I can scrape (but I don't) all of Udemy's programs, categorize them with custom categories, build a full-text search engine around it, and serve relevant ads per post. For instance, my "Best Bootstrap Themes" post would yield a "Learn Bootstrap" Udemy course and an on-the-fly-but-cached image for it, thus serving relevant ads to my users.

Same goes for Lynda. If someone lands on "Why C# is a great language to learn" (one of my unreleased articles), my custom API built on top of scraped data could serve them an "ASP.NET Essentials" course.

So why use something like this for framework themes? Take Wrapbootstrap.com: they have a great affiliate program. Using Kimono, you can easily get daily refreshes of their main page, which usually has sale-priced themes, featured themes, and rising new themes. This way, you can serve users an ad with up-to-date prices and themes that are hot right now.

What about non-ad uses? You can create custom search weighted according to YOUR metrics, build your own marketplace front end, and aggregate several sources in order to serve users better content.

[+] kyriakos | 12 years ago
We use scraping to gather product prices from online shops for a price comparison site. We have permission from the sites; they just can't be bothered to provide us with a price list beyond their public website. Legal and necessary, so I believe there is a market for this, though I'm not sure about its size.
[+] tomschlick | 12 years ago
Something like http://openstates.org/ is a perfect example. State government data is shitty most of the time and doesn't have a public API you can query, so Open States runs 50+ scrapers to fetch the data and normalize it.
[+] troels | 12 years ago
Very few companies can figure out how to provide proper APIs. Unless it's part of their core business, it'll always be lacking.
[+] thom | 12 years ago
I suspect the real, top-secret business behind import.io is in either training a system to crawl the web and recognize structured data, or gathering over time a very rich crowd-sourced database of structured data.
[+] late2part | 12 years ago
Unfortunately, this doesn't seem to work too well on my Mac. And why do you want to know who my friends on Facebook are?