dante9999 | 10 years ago

> code did not cover some weird edge case on the scraped resource and that all data extracted was now basically untrustworthy and worthless.

Your data should not be worthless just because you don't catch some edge cases early. Sure, there are always edge cases, but the best way to handle them is to have proper validation logic in Scrapy pipelines: if an item is missing required fields, or you get invalid values (e.g. prices as sequences of characters without digits), you should detect that immediately, not after 50k URLs. The rule of thumb is "never trust data from the internet" — always validate it carefully.

If you have validation in place and still encounter edge cases, you can be sure they are genuine outliers that you can either choose to ignore or try to fit into your content model.
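A minimal sketch of the kind of validating item pipeline described above. In a real project this class would live in `pipelines.py` and be enabled via Scrapy's `ITEM_PIPELINES` setting, and the exception would be `scrapy.exceptions.DropItem`; both are stubbed here so the example is self-contained, and the field names are purely illustrative.

```python
import re


class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, stubbed for self-containment."""


# Hypothetical schema: fields this spider is expected to fill in.
REQUIRED_FIELDS = ("title", "url", "price")

# A plausible price must contain at least one digit.
PRICE_RE = re.compile(r"\d")


class ValidationPipeline:
    """Rejects items with missing or obviously invalid fields as they are scraped."""

    def process_item(self, item, spider=None):
        # Fail fast on missing required fields...
        for field in REQUIRED_FIELDS:
            if not item.get(field):
                raise DropItem(f"missing required field: {field!r}")
        # ...and on values that cannot possibly be valid,
        # e.g. a "price" with no digits in it.
        if not PRICE_RE.search(str(item["price"])):
            raise DropItem(f"invalid price: {item['price']!r}")
        return item
```

Because the pipeline runs on every item as it is scraped, a broken selector surfaces as a burst of dropped items in the first few pages rather than as silently corrupt data discovered 50k URLs later.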

kami8845 | 10 years ago

Hmm, I'll have to investigate that. Any tips for validation libraries that tie in well with Scrapy?

What do you do if you discover that your parsing logic needs to change after you've already scraped a few thousand items? Re-run your spiders on the URLs that raised errors?