item 10420329

Show HN: Apifier – hosted web crawler for developers

101 points | jancurn | 10 years ago | apifier.com

35 comments

[+] thecodemonkey | 10 years ago
I don't quite understand why you would use a full-blown browser like phantomjs for crawling (I've seen a lot of projects recently taking this approach, so this critique is not directly towards Apifier).

Yes, I get that in some specific circumstances it would be nice to be able to execute the JavaScript on the page but think about the trade-off here.

In the vast majority of cases a simple HTTP GET request with a DOM parser is all you need -- actually not a single one of the examples on the Apifier homepage has any need for phantomjs.

Wouldn't it be much much cheaper, simpler and faster to ditch phantomjs? Or is there something I'm missing here?

[+] jancurn | 10 years ago
You're right that most of the time you don't need to use JavaScript.

But look at Google Groups for example - there is an infinite scroll to get all the topics, posts are also loaded dynamically, so you have to wait some time to get them.

In the SFO flights example you have to deal with pagination also using JavaScript.

We wanted to build a powerful tool which can crawl and scrape almost any website out there. It's slower, but you can use a bunch of our nodes to do it in parallel.
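The dynamic-pagination problem described above boils down to "keep requesting the next chunk until there is none left". A sketch of that loop, where `fetchPage` is a hypothetical callback standing in for the browser-side work (scrolling, clicking "next", waiting for content to render):

```javascript
// Sketch of draining a dynamically paginated (or infinite-scroll) source:
// keep requesting the next chunk until the source reports there is no more.
// `fetchPage` is a hypothetical callback; in a real crawler it would scroll
// the page or click "next" and wait for the new content to render.
async function collectAll(fetchPage) {
  const items = [];
  let cursor = null;                        // null = start from the beginning
  for (;;) {
    const page = await fetchPage(cursor);   // -> { items: [...], next: cursor | null }
    items.push(...page.items);
    if (page.next === null) break;          // nothing left to load
    cursor = page.next;
  }
  return items;
}
```

The loop itself is trivial; the point is that `fetchPage` needs a real browser when the "next page" only exists after JavaScript runs.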

[+] Eridrus | 10 years ago
I've written some projects which use phantomjs; the primary motivation for me has been the desire to look at the web in general, rather than at specific sites I'm scraping data off, and to have the ability to see what their JavaScript does.
[+] est | 10 years ago
It's OK when you only have to crawl one or two websites; sure, manually analyzing the JS and writing minimal DOM-parsing routines would do.

But what about hundreds or thousands of websites to crawl? Or would you prefer to just use phantomjs and write static extraction rules?

[+] pkulak | 10 years ago
If you limit to just that, then there's no benefit over 10 lines of Ruby and Nokogiri.
[+] jancurn | 10 years ago
Hello HN! Today we're launching what we've been building for the past couple of months. Apifier is a hosted web crawler for developers that enables them to extract data from any website using a few simple lines of JavaScript. We built it because we realized that many existing web scrapers trade off their ability to scrape complex websites for the "simplicity" of their user interface. We thought: we are programmers and we already use JavaScript for client-side development, so why not use it for scraping?

Please have a look at the service, play with the examples and maybe set up your own crawl. My co-founder jakubbalada and myself will be around here to answer your questions. We'd love to hear what you guys think!

[+] jacquesm | 10 years ago
Do you respect robots.txt? Do you publish your IP ranges?
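For reference, a minimal robots.txt check covers only the simplest part of the convention. The sketch below handles just `Disallow:` rules under `User-agent: *`; the real convention (later standardized as RFC 9309) adds specific user agents, `Allow` rules, and wildcards:

```javascript
// Minimal sketch of a robots.txt check. Only `Disallow:` rules under
// `User-agent: *` are handled; real parsers also handle specific
// user agents, Allow rules, wildcards, and more.
function isAllowed(robotsTxt, path) {
  let appliesToUs = false;
  const disallows = [];
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim();
    if (/^user-agent:/i.test(line)) {
      appliesToUs = line.slice('user-agent:'.length).trim() === '*';
    } else if (appliesToUs && /^disallow:/i.test(line)) {
      disallows.push(line.slice('disallow:'.length).trim());
    }
  }
  // An empty Disallow value means "allow everything".
  return !disallows.some((prefix) => prefix !== '' && path.startsWith(prefix));
}
```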
[+] rgbrgb | 10 years ago
Looks really cool! Pricing is the big sticking point for me. I've been burned too many times to build any critical piece of my app with it without knowing how much it'll cost if it gets popular.
[+] jancurn | 10 years ago
Many thanks! You're absolutely right, we'll publish the price ASAP and also provide some long-term guarantee for users who will depend on our service. BTW do you think pricing per GB downloaded is reasonable or would you prefer some flat monthly fee?
[+] necrodome | 10 years ago
How do you access the latest crawling results programmatically? I hope you're not expecting me to click a results link; this is a developer's tool.
[+] jakubbalada | 10 years ago
Of course not, the API will be available soon, in a week or two. If you have any other feature requests, please let us know; it will help us with prioritization.
[+] danielharan | 10 years ago
I'm hoping this could save me some work.

A few questions, if founders are still around:

- Can you cache pages / download entire sites?

- If caching, can you detect changes on a given schedule, trigger the extraction "pageFunction" and save versioned data?

- How do you handle errors?

- Will you handle database extractions and other sites that require multiple levels of what you have as pseudo-URLs?

[+] jancurn | 10 years ago
We're still here :)

- at the moment we don't store the HTML content of the visited pages (except for the last one), so the only way to determine if something changed is to run the 'pageFunction' on each page again and compare the results. This can be optimized in certain situations, e.g. you can crawl a product listing and only go to the product detail page if some basic property changed. Saving the HTML for each page is certainly possible, but after the crawler has finished loading a page, running a pageFunction adds very little extra overhead.

- if a page cannot be loaded for any reason, a detailed description of the error will be present in the JSON results. We want to implement a limited number of retries for these pages, for situations where the error is just temporary.

- certainly, if your crawling strategy cannot be expressed using simple pseudo-URLs, you can use the low-level 'interceptRequest' function to control exactly how each new page navigation request is handled (enqueued/ignored), tell the crawler which URLs refer to the same page and shouldn't be visited again, etc. You can also enqueue arbitrary pages to crawl using 'context.enqueuePage'. In fact, you don't need to use pseudo-URLs at all and can control everything from your code.
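For reference, the pseudo-URL idea, literal text with embedded regex parts in brackets, can be sketched like this (the exact syntax Apifier uses is an assumption, not confirmed by the thread):

```javascript
// Sketch of matching a pseudo-URL: literal text with [regex] parts, e.g.
// "https://example.com/item/[\\d+]". The exact syntax Apifier uses is an
// assumption here; the idea is literal prefix/suffix plus embedded regexes.
function pseudoUrlToRegExp(pseudoUrl) {
  let pattern = '';
  let i = 0;
  while (i < pseudoUrl.length) {
    if (pseudoUrl[i] === '[') {                   // embedded regex part
      const end = pseudoUrl.indexOf(']', i);
      pattern += '(' + pseudoUrl.slice(i + 1, end) + ')';
      i = end + 1;
    } else {                                      // literal part: escape it
      pattern += pseudoUrl[i].replace(/[.*+?^${}()|\\\/]/g, '\\$&');
      i += 1;
    }
  }
  return new RegExp('^' + pattern + '$');
}
```

The crawler would run every discovered link through these patterns to decide whether to enqueue it.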

[+] benjmn | 10 years ago
As a website owner, is it easy to block a rude crawler by contacting you? (How would I identify in the first place that the crawler is operated by you? Would my server logfile have enough data to point back to you?)

Nice & useful demo. I'll give it a try.

[+] jancurn | 10 years ago
Currently, there is no way to distinguish traffic from our crawlers, but of course, let us know at [email protected] and we will blacklist your websites from our crawlers.
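For reference, the usual convention for making a crawler identifiable is a distinctive User-Agent token, which then shows up in the site owner's access log. The token `ExampleCrawler` below is made up:

```javascript
// Sketch of the usual convention: a crawler announces itself with a
// distinctive User-Agent token (the name "ExampleCrawler" is made up),
// so a site owner can spot it in an access log and block or rate-limit it.
const CRAWLER_USER_AGENT =
  'Mozilla/5.0 (compatible; ExampleCrawler/1.0; +https://example.com/bot-info)';

function isCrawlerLogLine(logLine) {
  return logLine.includes('ExampleCrawler');  // grep-able from the server log
}
```

The `+URL` part of the token conventionally points to a page explaining the bot and how to reach its operator.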
[+] Eridrus | 10 years ago
Why should I use this instead of just firing up some spot instances with phantomjs?
[+] jancurn | 10 years ago
You could definitely do that, but then you might also need to:

- implement a mechanism to find and click active page elements, track the browser actions

- recompile PhantomJS to support POST requests

- implement a page queue with checks for duplicate URLs

- implement some parallelization and a failover mechanism for the case when PhantomJS crashes (it does)

- possibly implement support for infinite scroll

- set up your own pool of proxy servers

- set up a database to store the results

and finally make the whole thing simple to set up and use :)
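The "page queue with checks for duplicate URLs" item from the list above can be sketched as URL normalization plus a seen-set. The normalization rules here are illustrative; real crawlers normalize more aggressively:

```javascript
// Sketch of a page queue with duplicate checks: normalize each URL
// (drop the #fragment, which never changes the fetched page) and keep
// a set of everything already enqueued. Real crawlers normalize more
// aggressively (query-parameter order, trailing slashes, etc.).
class CrawlQueue {
  constructor() {
    this.seen = new Set();
    this.queue = [];
  }
  normalize(url) {
    const u = new URL(url);   // parsing also lowercases the hostname
    u.hash = '';
    return u.toString();
  }
  enqueue(url) {
    const key = this.normalize(url);
    if (this.seen.has(key)) return false;   // duplicate, skip it
    this.seen.add(key);
    this.queue.push(key);
    return true;
  }
  next() {
    return this.queue.shift();              // undefined when the crawl is done
  }
}
```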

[+] bentpins | 10 years ago
I love the demos, and that you can use them without registering. One thing I couldn't find without making an account was what happens after you've used a gigabyte. That would be a helpful addition, I think.
[+] jancurn | 10 years ago
TBH we don't have the pricing defined yet, because we don't know what our server costs will be. In a few days we'll know more and put up the prices.
[+] aakilfernandes | 10 years ago
Cool! How do you stop users from trying to run malicious code?
[+] jancurn | 10 years ago
Thank you! The user JavaScript code runs in the context of the web pages, in the same restricted environment as a normal web page's JavaScript. Also, the crawling processes are sandboxed.
[+] thomasfromcdnjs | 10 years ago
Also similar to https://morph.io/ which leans more towards open data sets.
[+] jancurn | 10 years ago
morph.io is a great tool but it requires a non-trivial setup on your machine in order to get things running. We wanted to enable people to create scrapers with no prerequisites.
[+] misiti3780 | 10 years ago
I like the idea - would probably use it in the future - can you talk a little bit about what technologies you are using?
[+] jancurn | 10 years ago
Thank you! The crawler is currently based on PhantomJS; there is a pool of worker nodes that distributes the crawling across multiple machines. We use a Node.js + MongoDB backend and Meteor for the front-end.
[+] asterfield | 10 years ago
I was just thinking yesterday of creating a similar service. I'm glad to see someone else has already made it :D
[+] jancurn | 10 years ago
Cool, if things go well, we'll be hiring soon :)
[+] Raphmedia | 10 years ago
Exactly what I was looking for in order to efficiently improve my searches for a new home. Thanks!
[+] rgbrgb | 10 years ago
Raph, awesome that you're hacking your own tools for homebuying! This is how our company started.

If you're looking in California, let us know if there's a feature we could add to Open Listings to improve your search. Always excited to help HNers hack homebuying. One thing we're adding soon is the ability to filter your property feed by running regular expressions over the descriptions. We'd love to hear other ideas for hacker friendly homebuying :).

[+] jancurn | 10 years ago
Great to hear that, let us know if we can help in any way!