item 34976227

portInit | 3 years ago

Thanks for this, we're still trying to figure out these details ourselves.

Related to the quote, we've seen interest in API data wrangling, where prepackaged data feeds can be cumbersome to edit, or where other implementation details become challenging: credential management, domain throttling, scheduling, checkpointing, export, etc.
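To make the domain-throttling point concrete, here is a minimal sketch of one common approach: enforce a minimum delay between requests to the same host. This is not crul's implementation; the class name and interval are hypothetical, using only the Python standard library.

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum delay between requests to the same host.
    (Hypothetical sketch, not crul's actual throttling logic.)"""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_hit = {}  # host -> monotonic timestamp of last request

    def wait(self, url):
        """Block until it is safe to hit the host of `url` again."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_hit.get(host, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_hit[host] = time.monotonic()
```

A caller would invoke `wait(url)` before each request; requests to different hosts are not delayed, so a crawl can interleave domains without stalling.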

It's also interesting for webpage data when you need to use a particular page as an index of links to filter and crawl. We've tried to build an abstraction layer around that.
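The "page as an index of links" pattern can be sketched in a few lines: parse the index page, resolve its anchors against the base URL, and apply a filter before crawling. This is a standard-library sketch under assumed names, not the abstraction layer described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkIndex(HTMLParser):
    """Collect hrefs from an index page, resolved against a base URL.
    (Hypothetical sketch of the index-of-links pattern.)"""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def index_links(html, base_url, keep=lambda url: True):
    """Return the page's links, filtered by the `keep` predicate."""
    parser = LinkIndex(base_url)
    parser.feed(html)
    return [url for url in parser.links if keep(url)]
```

The `keep` predicate is where the filtering happens; the surviving URLs become the crawl frontier.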

Initially, we were mainly focused on webpages, and wanted to bypass the visuals of the browser: use a headless browser to fulfill network requests, render JS, etc., and then convert the page to a flat table of enriched elements. With APIs as another data set, there is some work for us to do around language.
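The "flat table of enriched elements" idea can be illustrated with a toy flattener: walk the element tree and emit one row per element with its tag, depth, attributes, and text. This is a simplified sketch of the general technique (it skips the headless-browser rendering step entirely), not crul's converter.

```python
from html.parser import HTMLParser

class FlatTable(HTMLParser):
    """Flatten markup into rows of {tag, depth, attrs, text}.
    (Toy sketch of the page-to-flat-table idea, not crul's converter.)"""

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags, so depth = len(stack)
        self.rows = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.rows.append({"tag": tag, "depth": len(self.stack),
                          "attrs": dict(attrs), "text": ""})

    def handle_data(self, data):
        if self.rows and data.strip():
            self.rows[-1]["text"] += data.strip()

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def flatten(html):
    """Return the page as a list of element rows."""
    parser = FlatTable()
    parser.feed(html)
    return parser.rows
```

Once a page is rows rather than a tree, it can be filtered, joined, and exported like any other tabular data, which is the appeal of the approach.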

We're now trying to figure out which workflows are most relevant for crul to optimize around. Honestly, we also just built what we thought was cool. Some features/workflows will certainly be more straightforward with existing tools and software, especially for a technically savvy user.
