giu|5 years ago
I definitely concur with your first point, since I've had the same experience, specifically when working with company-specific datasets.
In my experience, one also underestimates how much time cleaning up the data takes; there are quite a few steps you need to go through before you can really start to analyze a dataset.
iagovar|5 years ago
I never stumbled upon any dataset (tabular, at least) that wasn't heavily curated.
Keep in mind that I studied sociology, so stuff that is a given for most HN people isn't for me. I had to learn a lot of CSS (for selectors), regex (still hate it), what OLAP is and how to take advantage of it (DuckDB), and a lot of stuff I'm not even aware of now.
But I remember taking courses with R and Python at my uni, and later on. It was interesting, but no matter how deep I went into the rabbit hole of weird models, it felt... IDK, shallow?
Imagine yourself pulling data out of a company ERP, with human-entered data. It won't be a walk in the park where you just fit some logit models and call it a day. You'll spend a lot of time trying to understand what's going on, and only then do you fit the models or build a dashboard.
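To make that concrete, here's a rough sketch of the kind of cleanup that comes before any modelling, in Python with only the standard library. Everything in it is made up for illustration: the sample rows, the column names, and the `clean_customer` / `clean_amount` / `clean_date` helpers. It also assumes dates are typed day-first, which a real export may not honor.

```python
import re
from datetime import datetime

# Hypothetical rows, as people might type them into an ERP form.
RAW_ROWS = [
    {"customer": "  ACME Corp. ", "amount": "1.234,50 €", "date": "03/02/2021"},
    {"customer": "acme corp", "amount": "1234.5", "date": "2021-02-03"},
    {"customer": "Globex", "amount": "n/a", "date": "3.2.21"},
]

# Assumption: apart from ISO dates, people type day first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%y")


def clean_customer(value):
    """Trim, lowercase, and drop punctuation so spelling variants match."""
    return re.sub(r"[^a-z0-9 ]", "", value.strip().lower())


def clean_amount(value):
    """Parse an amount typed with ',' or '.' as decimal mark; None if unparseable."""
    digits = re.sub(r"[^0-9.,]", "", value)
    if not digits:
        return None
    # Heuristic: the last separator is the decimal mark, earlier ones are
    # thousands separators. An input like "1,234" is ambiguous and is read as 1.234.
    last = max(digits.rfind(","), digits.rfind("."))
    if last == -1:
        return float(digits)
    whole = re.sub(r"[.,]", "", digits[:last])
    return float(f"{whole}.{digits[last + 1:]}")


def clean_date(value):
    """Try each known format and return an ISO date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


cleaned = [
    {
        "customer": clean_customer(row["customer"]),
        "amount": clean_amount(row["amount"]),
        "date": clean_date(row["date"]),
    }
    for row in RAW_ROWS
]
```

Unparseable values become `None` rather than being guessed, so you can count them and go ask whoever typed them in what they meant.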
giu|5 years ago
Scraping websites can be quite the messy business, since some websites change their document structure more often than others.
Nonetheless, it's still a very instructive activity and you can build quite the pipeline around it (scraping multiple websites, joining datasets, efficiently storing the data, etc.).
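For what it's worth, here is a tiny sketch of such a pipeline using only Python's standard library. The two HTML snippets stand in for pages fetched from two hypothetical sites, and the `RowParser`, `scrape`, and `build_db` names, the `item` class, and the table layout are all assumptions for the example. The one point it tries to show is that the scraper should fail loudly when the document structure changes, rather than quietly storing garbage.

```python
import sqlite3
from html.parser import HTMLParser

# Stand-ins for pages fetched from two hypothetical websites.
PAGE_PRICES = """
<table><tr class="item"><td>widget</td><td>9.99</td></tr>
<tr class="item"><td>gadget</td><td>24.50</td></tr></table>
"""
PAGE_STOCK = """
<table><tr class="item"><td>widget</td><td>12</td></tr>
<tr class="item"><td>gadget</td><td>0</td></tr></table>
"""


class RowParser(HTMLParser):
    """Collect the <td> texts of every <tr class="item"> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row being read, None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and dict(attrs).get("class") == "item":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape(html, expected_cells):
    """Extract item rows, raising if the page no longer has the expected shape."""
    parser = RowParser()
    parser.feed(html)
    if not parser.rows or any(len(r) != expected_cells for r in parser.rows):
        raise ValueError("page structure changed; update the parser")
    return parser.rows


def build_db(prices_html, stock_html):
    """Store both scraped datasets in an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE prices (name TEXT PRIMARY KEY, price REAL)")
    db.execute("CREATE TABLE stock (name TEXT PRIMARY KEY, qty INTEGER)")
    db.executemany("INSERT INTO prices VALUES (?, ?)",
                   [(n, float(p)) for n, p in scrape(prices_html, 2)])
    db.executemany("INSERT INTO stock VALUES (?, ?)",
                   [(n, int(q)) for n, q in scrape(stock_html, 2)])
    return db


db = build_db(PAGE_PRICES, PAGE_STOCK)
# Join the two sources on the item name and keep what is in stock.
in_stock = db.execute(
    "SELECT p.name, p.price, s.qty FROM prices p JOIN stock s USING (name) "
    "WHERE s.qty > 0 ORDER BY p.name"
).fetchall()
```

In a real setup you would swap the inline strings for fetched pages and the in-memory database for a file, but the shape (scrape, validate, store, join) stays the same.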