The tutorials as live demos are great. You should say "live demo" somewhere, because "tutorial" makes it sound like text.
The front page is a bit split-personality: you have the idea of data mining and the idea of sharing data sets. I think you should pick one to be the dominant focus, with the other present as an aside. It might make sense to focus on producers at the beginning ("scrape data and share it"), and once you've built up datasets, switch the emphasis to consumers ("Wikipedia of datasets"). Maybe it's worthwhile to check how similar producer-consumer websites did it at the beginning: Wikipedia, YouTube, Flickr.
BTW: What do you think of an interface like Excel has for web scraping: it displays the webpage, and you select the bit you want visually (no coding, not even HTML)?
I hesitate to say I'm working on a startup, but I've been working on a piece of software for a few years now. One of the key components is a scraper, so I have pretty serious interest in this topic.
It looks like they have thought things through pretty well, but I looked around and didn't find interesting data or useful code.
Screen scraping is when a bot pulls the visual data from the screen and analyzes it. A bunch of tutorials talk about web scraping but call it screen scraping.
By the way, if you're interested in scrapers/crawlers, also have a look at 80legs.com, a crawling SaaS with some custom-code capabilities too.
Or, if Python is your bag, there's a scraping library called Scrapy, open-sourced last year, that's decent.
On ScraperWiki, they need source control because the environment actually runs on their servers (a Bespin-type implementation, I'd guess), so you put all your code on the wiki, where it can be augmented and so on; hence the need for SCM. For the moment, at least, it seems they have plans to release their API engine, or at least calls to it, in which case you'd end up writing the code locally and could use your own SCM.
Their confusing naming could be explained by the fact that they're not pitching this at coders, but rather at journalists, to try to get them to use the vast data resources out on the net.
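For anyone who hasn't used Scrapy: the core job such a framework automates (fetch a page, parse it tolerantly, extract fields) can be sketched with nothing but Python's standard library. This toy is not Scrapy's API, just the underlying idea:

```python
# Toy illustration of the extract step a framework like Scrapy wraps.
# Uses only the standard library's tolerant HTML parser.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href found in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links

# In a real scraper the HTML would come from an HTTP fetch;
# a hard-coded page keeps the sketch self-contained.
page = '<html><body><a href="/a">one</a><p><a href="/b">two</a></body></html>'
print(extract_links(page))  # ['/a', '/b']
```

A real framework adds scheduling, politeness (robots.txt, throttling), retries, and link-following on top of this loop, which is why it's worth using one once a project grows past a single page.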
A more useful definition would be "extracting structured data from human interfaces."
This is a great idea. I haven't started on the site yet, so forgive the perhaps-dumb question: is there a standard meta-language for scraping sites? I don't think XPath works with funky HTML. So does anybody have something that would universally describe how to, say, get a user's pictures from Flickr, or the latest comments from Digg?
Something like that -- where the community could then develop cross-platform tools to implement it -- would truly be ground-breaking.
(I don't think this is the same issue as the HTML-vs-regex debate, but I'd be happy to be educated.)
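For what it's worth, Parsley's "parselets" (mentioned elsewhere in this thread) are one real attempt at such a meta-language: a document mapping output fields to selectors, interpreted by a generic engine. The sketch below shows the idea with a deliberately simplified, hypothetical spec format (field name to tag/attribute pair); real parselets use CSS/XPath selectors:

```python
# Sketch of a declarative "scraping spec": the spec names fields and says
# where to find them; a generic engine interprets it. The spec format here
# is made up for illustration, not Parsley's actual format.
from html.parser import HTMLParser

# Hypothetical spec: field name -> (tag to match, attribute to capture).
FLICKR_PHOTOS_SPEC = {"photo_url": ("img", "src"), "caption": ("img", "alt")}

class SpecExtractor(HTMLParser):
    def __init__(self, spec):
        super().__init__()
        self.spec = spec
        self.rows = {field: [] for field in spec}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for field, (want_tag, want_attr) in self.spec.items():
            if tag == want_tag and want_attr in attrs:
                self.rows[field].append(attrs[want_attr])

def scrape(html, spec):
    parser = SpecExtractor(spec)
    parser.feed(html)
    parser.close()
    return parser.rows

page = '<div><img src="a.jpg" alt="cat"><img src="b.jpg" alt="dog"></div>'
print(scrape(page, FLICKR_PHOTOS_SPEC))
# {'photo_url': ['a.jpg', 'b.jpg'], 'caption': ['cat', 'dog']}
```

The point of the separation is exactly the cross-platform dream above: the spec is data, so any language with an engine can run the same spec.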
Too many providers are taking data that the user knows intuitively that they own and sticking it behind a wall. Anything that helps get that data back out is awesome in my book.
I've never tried Parsley, as suggested by another commenter, but I use Nokogiri (Ruby) and love it. It's tolerant of broken HTML and has many easy ways to move through the generated DOM (CSS selectors, XPath, etc.).
This is a nifty idea. I wonder how an application developer would use it. Would you pull data directly from this site, using the most recent scraper code? That would require fixed schemas for the scrapers, or at least backwards compatible schemas.
This might make your app robust against changing data formats, but would also leave it vulnerable to vandalism, which could potentially be far more damaging for this site than for something like Wikipedia.
But, unlike wiki pages, scrapers don't need to improve incrementally; they either work or they don't. So a community verification step before a new scraper goes into "production" might be a feasible way to deal with vandalism.
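That "works or doesn't" property makes the verification gate easy to automate: run the candidate scraper, then check its output against the schema consumers depend on before promoting it. A minimal sketch, with a made-up schema:

```python
# Sketch of the verification gate suggested above: before a new scraper
# version goes into production, check that its output still matches the
# schema consumers depend on. Schema and sample rows are hypothetical.
EXPECTED_SCHEMA = {"name": str, "price": float}

def conforms(rows, schema):
    """True if every row has exactly the expected fields with expected types."""
    return all(
        set(row) == set(schema)
        and all(isinstance(row[field], type_) for field, type_ in schema.items())
        for row in rows
    )

good = [{"name": "widget", "price": 9.99}]
bad = [{"name": "widget"}]  # scraper broke: price field vanished

print(conforms(good, EXPECTED_SCHEMA))  # True
print(conforms(bad, EXPECTED_SCHEMA))   # False
```

A vandalized or silently broken scraper would fail this check on the first run, so it never reaches the apps pulling data downstream.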
Anyway, the IDE approach looks very, very nice. I have done a similar thing before, but using Selenium/jQuery: http://github.com/tszming/Selenium-Google-Scrapper. I still believe my jQuery approach is more flexible for screen scraping :)
I recommend this series of posts on how to compare and choose web scraping tools; it's aimed at executives taking charge of projects that entail scraping information from one or more websites: http://www.fornova.net/blog/?p=18
They run Hacks and Hackers days for journos, recently in Liverpool and Birmingham. I believe they've got videos up of them too.
We're actually talking to them about hosting one in South London in the next month or two.
Not sure what their plans in the US are, but it could be worth dropping the idea to some fellow journos and seeing if there's an organisation willing to act as host.
EDIT: Forgot to mention, if you're interested in data you might also want to take a look at the open data initiative (data.gov.uk) and the Guardian's data store (http://www.guardian.co.uk/data-store). Apologies for the UK focus.
I'm intrigued by this. I once crawled a wiki that later went away, & the database was lost too. I've been looking for other people recreating a wiki from HTML (including the edit pages, that is) for years since then, without success.
I'd be happy to, if I understand you correctly that you want somebody to do the parsing into a database-ready format, or a database you've already got hosted somewhere. Email's in my profile.
If you want or need maximum tolerance of broken HTML, you can use Selenium to scrape: as long as the markup doesn't break Firefox, it won't break your script. (I've had some bad experiences with BeautifulSoup.)
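The tolerance point is easy to demonstrate: Python's standard-library HTML parser keeps going on markup that a strict XML parser rejects outright:

```python
# Demonstration: the stdlib's tolerant HTML parser recovers text from
# markup that a strict XML parser refuses to touch.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

broken = "<html><body><p>first<p>second & third</body>"  # unclosed tags, bare &

# Strict XML parsing fails on this markup...
try:
    ET.fromstring(broken)
    strict_ok = True
except ET.ParseError:
    strict_ok = False
print(strict_ok)  # False

# ...while the tolerant HTML parser still yields the text.
class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []
    def handle_data(self, data):
        self.text.append(data)

p = TextCollector()
p.feed(broken)
p.close()
print("".join(p.text))
```

Selenium goes one step further than any parser: it runs a real browser engine, so even markup that depends on JavaScript to render is recoverable.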
Great idea. I'm convinced that we can find a way to tap into all of the wonderful data on the Internet. Right now it's just too hard; that difficulty, of course, is part of what makes a well-researched article such a delight to read.
"Built in source control"
really???
There used to be a website for sharing "parselet" scripts.
http://scrubyt.org/
http://www.crummy.com/software/BeautifulSoup/