top | item 16368497

(no title)

at_smith | 8 years ago

Awesome tool! How do you handle scraping data that's hiding behind layers of ~fancy~ JS libraries? Is it as simple as triggering click events, pausing for loading, and then grabbing the information?

jardah|8 years ago

This tool basically performs the simplest kind of data loading: it opens the webpage, waits until most XHR requests are done, waits another second (to give JS time to manipulate the DOM), and then loads data from the page. This way, it has what the user sees when they open the page in a browser. So if the data is visible, loaded through XHR, or hidden in a global JS variable, it will see it.
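Roughly, that open / wait / settle / read flow could be sketched like this. It is only an illustration, not the tool's actual code: `load_page_data`, `FakePage`, `pending_requests` and `extract` are all made-up names, and for simplicity it waits for every request to finish rather than "most" of them.

```ruby
# Hypothetical sketch: `page` is any object that responds to
# #pending_requests (Integer) and #extract (the scraped data).
# Neither method comes from a real library.
def load_page_data(page, settle: 1.0, timeout: 30.0, poll: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  # Step 1: wait until in-flight XHR requests have drained, or give up
  # once the timeout has passed.
  while page.pending_requests > 0
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep poll
  end

  # Step 2: give JS a moment to manipulate the DOM.
  sleep settle

  # Step 3: read whatever is on the page now.
  page.extract
end

# Stand-in page so the flow can be tried without a browser.
class FakePage
  def initialize(requests_left)
    @requests_left = requests_left
  end

  # Pretends one request finishes each time the page is polled.
  def pending_requests
    current = @requests_left
    @requests_left -= 1 if @requests_left > 0
    current
  end

  def extract
    { title: 'Example', pending: @requests_left }
  end
end
```

With a real headless browser, `pending_requests` would have to be backed by whatever network tracking the driver exposes; the fixed `settle` delay is the "wait a second" part and is a heuristic, not a guarantee that the DOM is final.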

For more advanced usage (like clicking, or submitting a search request) it would need some kind of scenario like: "Click on this" -> "wait till this loads" -> "type something here" -> "scroll to this" -> load data.

That is possible with headless Chrome, so the trick is to make it general and easy to use (something like recording what the user does through a Chrome plugin). Maybe in future versions :)
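As a rough idea of what such a scenario could look like as data, here is a sketch in which each step is just an action name plus arguments, replayed against a driver. Everything here is hypothetical (`example_scenario`, `run_scenario`, `RecordingDriver`, and the CSS selectors); a real driver would wrap headless Chrome instead of only recording calls.

```ruby
# Hypothetical scenario format: each step is [action, *arguments].
# The selectors are placeholders, not taken from any real page.
def example_scenario
  [
    [:click, '#search-toggle'],
    [:wait_for, '#search-input'],
    [:type, '#search-input', 'some query'],
    [:scroll_to, '#results'],
    [:load_data]
  ]
end

# Replays each step by calling the same-named method on the driver.
# Returns the result of the last step (normally the loaded data).
def run_scenario(driver, steps)
  result = nil
  steps.each do |action, *args|
    unless driver.respond_to?(action)
      raise ArgumentError, "unknown step: #{action}"
    end
    result = driver.public_send(action, *args)
  end
  result
end

# Stand-in driver that only records what it was asked to do.
class RecordingDriver
  attr_reader :log

  def initialize
    @log = []
  end

  %i[click wait_for type scroll_to].each do |name|
    define_method(name) do |*args|
      @log << [name, *args]
      nil
    end
  end

  def load_data
    @log << [:load_data]
    { rows: [] }
  end
end
```

Keeping the steps as plain data is what would make the "record it with a Chrome plugin" idea workable: the recorder only has to emit a list like the one above, and the runner stays the same.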

cseelus|8 years ago

Could be an interesting enhancement. Sounds a little bit like what Capybara, a test framework for Ruby apps, can do [1], with things like

  click_link('Link Text')
  fill_in('Password', with: 'Seekrit')
  choose('A Radio Button')
  check('A Checkbox')
  uncheck('Another Checkbox')
  select('Option', from: 'Select Box')

[1] https://github.com/teamcapybara/capybara#navigating

razki|8 years ago

I'd put money on it using Horseman with PhantomJS in Node.

bdcravens|8 years ago

No, it says on the page it's using headless Chrome.

oodavid|8 years ago

PhantomJS is dead. The arrival of Puppeteer rang the death knell.