dante9999 | 10 years ago
Your data shouldn't be worthless just because you don't catch some edge cases early. There will always be edge cases, but the best way to handle them is proper validation logic in Scrapy pipelines: if an item is missing a required field, or you get an invalid value (e.g. a price scraped as a string containing no digits), you should detect that immediately, not after 50k URLs. The rule of thumb is "never trust data from the internet": always validate it carefully.
If you have validation in place and still hit edge cases, you can be confident they are genuine outliers that you can either ignore or try to fit into your model of the content.
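The validation approach described above can be sketched as a small item pipeline. This is a minimal example, not Scrapy's own API: the field names in `REQUIRED_FIELDS` are hypothetical, and a real Scrapy pipeline would raise `scrapy.exceptions.DropItem` instead of `ValueError`, as noted in the comments.

```python
import re

# Hypothetical required fields; adjust to match your item schema.
REQUIRED_FIELDS = ("url", "title", "price")

def validate_item(item: dict) -> list:
    """Return a list of validation errors; an empty list means the item is valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not item.get(field):
            errors.append("missing required field: %s" % field)
    # A scraped price must contain at least one digit.
    price = item.get("price")
    if price is not None and not re.search(r"\d", str(price)):
        errors.append("price has no digits: %r" % price)
    return errors

class ValidationPipeline:
    """Sketch of a Scrapy-style pipeline hook.

    In a real project you would raise scrapy.exceptions.DropItem here
    so the bad item is rejected (and logged) the moment it is scraped.
    """
    def process_item(self, item, spider=None):
        errors = validate_item(item)
        if errors:
            raise ValueError("; ".join(errors))  # DropItem in real Scrapy
        return item
```

Failing fast like this means a broken selector or a site redesign shows up in your logs on the first bad item, rather than as a silently corrupted dataset.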
kami8845 | 10 years ago
What do you do if you discover that your parsing logic needs to change after you've already scraped a few thousand items? Re-run your spiders on the URLs that raised errors?
stummjr | 10 years ago