I don't quite understand why you would use a full-blown browser like PhantomJS for crawling (I've seen a lot of projects take this approach recently, so this critique isn't aimed directly at Apifier).
Yes, I get that in some specific circumstances it would be nice to be able to execute the JavaScript on the page, but think about the trade-off here.
In the vast majority of cases a simple HTTP GET request with a DOM parser is all you need -- in fact, not a single one of the examples on the Apifier homepage has any need for PhantomJS.
Wouldn't it be much, much cheaper, simpler and faster to ditch PhantomJS? Or is there something I'm missing here?
You're right that most of the time you don't need to execute the page's JavaScript.
But look at Google Groups, for example: the topics are loaded via infinite scroll, and the posts are loaded dynamically too, so you have to wait a while to get them.
In the SFO flights example, the pagination is also driven by JavaScript.
We wanted to build a powerful tool that can crawl and scrape almost any website out there. It's slower, but you can use a bunch of our nodes to run the crawl in parallel.
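The infinite-scroll case described in this reply boils down to a "scroll and wait until nothing new appears" loop. Here is a generic sketch of that idea; the `scroll` and `countItems` callbacks are hypothetical stand-ins for code a headless browser would run inside the page, not Apifier's actual implementation:

```javascript
// Keep scrolling until the item count stops growing (or we give up).
// scroll() triggers the next batch of dynamically loaded content;
// countItems() reports how many items are currently on the page.
async function scrollUntilSettled(scroll, countItems, { maxRounds = 20, waitMs = 0 } = {}) {
  let previous = -1;
  for (let round = 0; round < maxRounds; round++) {
    const current = countItems();
    if (current === previous) return current; // nothing new loaded: done
    previous = current;
    scroll(); // ask the page for more content
    await new Promise((r) => setTimeout(r, waitMs)); // wait for it to load
  }
  return countItems(); // give up after maxRounds
}
```

The `maxRounds` cap matters in practice: without it, a truly endless feed would never let the crawl finish.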
I've written some projects which use PhantomJS; the primary motivation for me has been the desire to look at the web in general, rather than at specific sites I'm scraping data from, and to be able to see what their JavaScript does.
Hello HN! Today we're launching what we've been building for the past couple of months. Apifier is a hosted web crawler for developers that lets them extract data from any website using a few simple lines of JavaScript. We built it because we realized that many existing web scrapers trade away their ability to scrape complex websites for the "simplicity" of their user interface. We thought: we are programmers and we already use JavaScript for client-side development, so why not use it for scraping?
Please have a look at the service, play with the examples and maybe set up your own crawl. My co-founder jakubbalada and I will be around here to answer your questions. We'd love to hear what you guys think!
Looks really cool! Pricing is the big sticking point for me. I've been burned too many times to build any critical piece of my app on it without knowing how much it'll cost if it gets popular.
Many thanks! You're absolutely right; we'll publish pricing ASAP and also provide a long-term guarantee for users who depend on our service. BTW, do you think pricing per GB downloaded is reasonable, or would you prefer a flat monthly fee?
Of course not, the API will be available soon, in a week or two. If you have other feature requests, please let us know; we need help with prioritization.
A few questions, if the founders are still around:
- Can you cache pages / download entire sites?
- If caching, can you detect changes on a given schedule, trigger the extraction "pageFunction" and save versioned data?
- How do you handle errors?
- Will you handle database extractions and other sites that require multiple levels of what you have as pseudo-URLs?
- At the moment we don't store the HTML content of the visited pages (except for the last one), so the only way to determine whether something changed is to run the 'pageFunction' on each page again and compare the results. This can be optimized in certain situations, e.g. you can crawl a product listing and only go to the product detail page if some basic property changed. Saving the HTML for each page is certainly possible, but once the crawler has finished loading a page, running a pageFunction adds very little extra overhead.
- If a page cannot be loaded for any reason, a detailed description of the error will be present in the JSON results. We want to implement a limited number of retries for these pages, for situations where the error is just temporary.
- Certainly. If your crawling strategy cannot be expressed using simple pseudo-URLs, you can use the low-level 'interceptRequest' function to control exactly how each new page navigation request is handled (enqueued/ignored), tell the crawler which URLs refer to the same page and shouldn't be visited again, etc. You can also enqueue arbitrary pages to crawl using 'context.enqueuePage'. In fact, you don't need to use pseudo-URLs at all and can control everything from your code.
As a website owner, is it easy to block a rude crawler by contacting you? (How would I identify in the first place that the crawler is operated by you? Would my server logfile have enough data to point back to you?)
Currently, there is no way to distinguish traffic from our crawlers, but of course, let us know at [email protected] and we will blacklist your websites from our crawlers.
I love the demos, and that you can use them without registering. One thing I couldn't find without making an account was what happens after you've used a gigabyte. That would be a helpful addition, I think.
Thank you! The user JavaScript code runs in the context of the web pages, in the same restricted environment as a normal web page's JavaScript. Also, the crawling processes are sandboxed.
morph.io is a great tool but it requires a non-trivial setup on your machine in order to get things running. We wanted to enable people to create scrapers with no prerequisites.
Thank you! The crawler is currently based on PhantomJS, and there is a pool of worker nodes that distributes the crawling across multiple machines. We use a Node.js + MongoDB backend and Meteor for the front-end.
Raph, awesome that you're hacking your own tools for homebuying! This is how our company started.
If you're looking in California, let us know if there's a feature we could add to Open Listings to improve your search. Always excited to help HNers hack homebuying. One thing we're adding soon is the ability to filter your property feed by running regular expressions over the descriptions. We'd love to hear other ideas for hacker friendly homebuying :).
est:
But what about crawling hundreds or thousands of websites? Or do you prefer to just use PhantomJS and write static extraction rules?
benjmn:
Nice & useful demo. I'll give it a try.
jancurn:
- implement a mechanism to find and click active page elements, and track the browser's actions
- recompile PhantomJS to support POST requests
- implement a page queue with checks for duplicate URLs
- implement some parallelization, plus a failover mechanism for the case where PhantomJS crashes (it does)
- possibly implement support for infinite scroll
- set up your own pool of proxy servers
- set up a database to store the results
and finally make the whole thing simple to set up and use :)
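To give a flavor of the parallelization and failover items above, here is a sketch of a small task pool that runs crawl tasks concurrently and retries a crashed task once. It is a generic illustration, not how Apifier's worker nodes actually work:

```javascript
// Run up to `concurrency` tasks at a time; if a task throws (the
// "PhantomJS crashed" case), retry it once before recording the error.
async function runPool(tasks, concurrency) {
  const results = [];
  let index = 0;
  async function worker() {
    while (index < tasks.length) {
      const i = index++; // claim the next task
      try {
        results[i] = await tasks[i]();
      } catch (err) {
        try {
          results[i] = await tasks[i](); // one failover retry
        } catch (err2) {
          results[i] = { error: String(err2) }; // permanent failure
        }
      }
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```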