This site is actually really awesome, and has worked for every website I've tried! My only slight issue is that it took me a few minutes to work out what "HTML codes" were, and even then only from watching the video. Have you considered renaming it to something like "HTML Source Code"? It also seems to struggle on web pages where it can't find tables, such as the following website I made, which contains no information:
I have used Google Spreadsheets to extract <TABLE> or <UL> content, and it works very well with those.
I tried it on Indeed job listings, Amazon product search, and Craigslist, and could not get it to work. I suggest you test the tool against the 10-20 most popular websites that contain listing-type data. Our company also did a little side project similar to yours and packaged it as a Chrome extension. We learned that it is quite hard to make a universal tool that guesses where the data is, especially since so many websites use <div> and <ul> with CSS to form table-like structures instead of a plain <table>. If you want, take a look at our tool: https://chrome.google.com/webstore/detail/instant-data-scrap...
I just tested it on Amazon product search, https://www.amazon.com/s/ref=nb_sb_noss_2/130-9531298-529675... . It works, though the result comes out slowly (about 45 seconds). On the result page you can find the product list with 28 items. For better speed, I agree that publishing an API or a Chrome extension would help.
I'm guessing this is doing some kind of tree-diff on the DOM?
[+] [-] changmin|9 years ago|reply
Listly.io is my personal project, built a few days ago. I hope to hear your opinions on whether it is useful for you... or not.
Listly.io turns HTML into Excel in seconds, without coding. It finds the pattern of repeated structure and extracts all the image links and text. It detects not specific tags (table, ul, ...) but the structure itself.
For developers, I think an API would be the best way to adapt this extractor to another scraper, or to your own.
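I'm only guessing at the implementation, but "structure, not tags" could be sketched as grouping siblings by a structural signature and taking the largest group of same-shaped siblings. Everything below (the toy page, `shape`, `repeated_block`) is my own illustration, not Listly's actual code:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy, well-formed page: the repeated records are plain <div>s,
# with no <table> or <ul> in sight.
HTML = """
<html><body>
  <div id="menu"><a>Home</a><a>Help</a></div>
  <div id="feed">
    <div class="post"><img src="a.png"/><p>first</p></div>
    <div class="post"><img src="b.png"/><p>second</p></div>
    <div class="post"><img src="c.png"/><p>third</p></div>
  </div>
</body></html>
"""

def shape(el):
    # A crude structural signature: the tag plus its direct children's tags.
    return (el.tag, tuple(child.tag for child in el))

def repeated_block(root):
    # Find the parent whose children contain the largest group of
    # identically shaped siblings, and return those siblings.
    best = (0, None, None)  # (count, parent, signature)
    for parent in root.iter():
        for sig, n in Counter(shape(c) for c in parent).items():
            if n > best[0]:
                best = (n, parent, sig)
    n, parent, sig = best
    return [c for c in parent if shape(c) == sig]

records = repeated_block(ET.fromstring(HTML))
```

On the toy page this picks the three `.post` divs rather than the menu, because "three siblings shaped (div, (img, p))" beats every other group.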
[+] [-] popey456963|9 years ago|reply
https://hastebin.com/eguluvoquq.html
[+] [-] est|9 years ago|reply
https://support.google.com/docs/answer/3093339?hl=en
I use it all the time for better sorting, filtering, etc.
[+] [-] changmin|9 years ago|reply
Compared to that, listly.io works well with all types of tags as long as there are repeated structures.
In my experiments, it works well with hundreds of kinds of websites:
e.g. Google/Bing search results, Amazon/Walmart/eBay product lists, Twitter/Facebook/Tumblr posts, Twitch lists, Bloomberg finance info, forum threads, Instagram comments, etc.
[+] [-] LeoPanthera|9 years ago|reply
[+] [-] webrobots|9 years ago|reply
[+] [-] changmin|9 years ago|reply
In addition, it also works in seconds with the Craigslist apts/housing page: http://seoul.craigslist.co.kr/search/apa
Sorry for being slow. This is a personal project, and I could not predict this many new visitors; I need to scale the server up and out.
[+] [-] polm23|9 years ago|reply
https://www.import.io/
I tried writing a script to do the same thing before - turns out finding the element on the page with the most children and assuming each child is an entry works surprisingly often.
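The heuristic above (take the element with the most children, treat each child as an entry) can be sketched in a few lines; the toy page and function names here are mine, not from any actual tool:

```python
import xml.etree.ElementTree as ET

# A toy, well-formed page: the listing lives in a <div>, not a <table>.
HTML = """
<html><body>
  <div id="nav"><a>Home</a><a>About</a></div>
  <div id="results">
    <div class="item"><span>Widget A</span><span>$10</span></div>
    <div class="item"><span>Widget B</span><span>$12</span></div>
    <div class="item"><span>Widget C</span><span>$15</span></div>
  </div>
</body></html>
"""

def extract_entries(html: str):
    root = ET.fromstring(html)
    # Heuristic: the element with the most direct children is the listing.
    container = max(root.iter(), key=lambda el: len(list(el)))
    # Each child becomes one row; its non-empty text leaves become columns.
    return [[t.strip() for t in child.itertext() if t.strip()]
            for child in container]

rows = extract_entries(HTML)
```

As the commenter says, this works surprisingly often, though it can lose to navigation menus or comment lists that happen to have more children than the listing you actually want.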
[+] [-] changmin|9 years ago|reply
Import.io requires the user to click to specify what to extract, so the user has to repeat that whenever the web page changes.
Listly.io only needs a URL or the HTML source. It keeps working even when the web page changes.
[+] [-] haberdasher|9 years ago|reply
[+] [-] changmin|9 years ago|reply
[+] [-] fenollp|9 years ago|reply
Now, if you could have this generate a GraphQL schema file, you could run a GraphQL server acting as a proxy to lots of websites. That would be interesting. Not sure how that fares with the websites' owners' ToS, though.
[+] [-] changmin|9 years ago|reply
[+] [-] supergreg|9 years ago|reply
[+] [-] unknown|9 years ago|reply
[deleted]
[+] [-] joss82|9 years ago|reply
https://app.parseur.com
[+] [-] fabianmg|9 years ago|reply
http://webscraper.io/
[+] [-] guipsp|9 years ago|reply
[+] [-] nothrabannosir|9 years ago|reply
[+] [-] cdolan92|9 years ago|reply
Their marketing is poor, but the product is very powerful.
[+] [-] snowpanda|9 years ago|reply