giu | 5 years ago

By real datasets you mean company-specific ones? Or do you happen to have some examples that are openly available which helped you a lot?

I definitely concur with your first point, since I've had the same experience, specifically when working with company-specific datasets.

In my experience, people also underestimate how much time cleaning up the data takes; there are quite a few steps you need to go through before you can really start to analyze a dataset.

iagovar|5 years ago

I happen to scrape a lot of large websites (mostly forums currently) and that's messy enough to force you into learning tricks.

I haven't stumbled upon any (tabular, at least) dataset that wasn't very curated.

Keep in mind that I studied sociology, so stuff that is a given for most HN people isn't for me. I had to learn a lot of CSS (for selectors), regex (still hate it), what OLAP is and how to take advantage of it (DuckDB), and a lot of stuff I'm not even aware of now.

But I remember taking courses with R and Python at my uni, and later on. It was interesting, but no matter how deep I went into the rabbit hole of weird models, it felt... IDK, shallow?

Imagine yourself pulling data out of a company ERP, with human-filled data. It won't be a walk in the park where you just make some logit models and call it a day. You'll spend a lot of time trying to understand what's going on. Only then do you fit the models or make a dashboard.
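
To make the "human-filled data" point concrete, here is a minimal sketch of the kind of cleanup such fields tend to need before any modelling. It uses only the Python standard library; the function names `clean_name` and `parse_amount` and the sample inputs are made up for illustration, and the rule "a trailing separator followed by one or two digits is the decimal mark" is an assumption that will not fit every dataset.

```python
import re
import unicodedata


def clean_name(raw):
    """Normalize a hand-typed name: Unicode form, whitespace, casing."""
    text = unicodedata.normalize("NFKC", raw)   # e.g. non-breaking space -> space
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text.casefold()                      # case-insensitive comparison key


def parse_amount(raw):
    """Parse amounts such as '1.234,50' or '1,234.50'; return None if unparseable."""
    text = re.sub(r"[^\d,.\-]", "", raw)        # drop currency symbols, spaces, letters
    if not re.search(r"\d", text):
        return None
    # Assumption: a final separator followed by 1-2 digits is the decimal mark;
    # every other separator is treated as a thousands separator.
    match = re.search(r"[.,](\d{1,2})$", text)
    try:
        if match:
            integer_part = re.sub(r"[.,]", "", text[:match.start()])
            return float(f"{integer_part}.{match.group(1)}")
        return float(re.sub(r"[.,]", "", text))
    except ValueError:
        return None


# Hypothetical rows as they might come out of a manually maintained system.
rows = [("  ACME\u00a0 Corp. ", "1.234,50 EUR"), ("acme corp.", "$1,234.50"), ("Acme  Corp.", "n/a")]
cleaned = [(clean_name(name), parse_amount(amount)) for name, amount in rows]
```

After cleaning, the three spellings collapse to one key, and the unparseable amount becomes `None` instead of silently turning into a number, which is usually the safer default before fitting anything.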

giu|5 years ago

Thanks a lot for your reply!

Scraping websites can be quite the messy business, since some websites change their document structure more often than others.

Nonetheless, it's still a very instructive activity and you can build quite the pipeline around it (scraping multiple websites, joining datasets, efficiently storing the data, etc.).
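
A rough sketch of such a pipeline (extract from several sites, store, then join), using only the Python standard library. Everything specific here is an assumption for illustration: the two inline HTML snippets stand in for fetched pages, the `a.title` markup and the `posts` table are invented, and SQLite stands in for whatever storage you would actually pick (DuckDB, mentioned above, would be another option). The small `HTMLParser` subclass plays the role a CSS selector like `a.title` would play in a dedicated scraping library.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of every <a> element whose class list contains 'title'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


def store(conn, site, titles):
    # UNIQUE + INSERT OR IGNORE makes re-running the scrape idempotent.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (site TEXT, title TEXT, UNIQUE(site, title))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO posts VALUES (?, ?)", [(site, t) for t in titles]
    )
    conn.commit()


# Hypothetical pages; a real run would download these instead.
PAGES = {
    "forum_a": '<ul><li><a class="title" href="/t/1">DuckDB tips</a></li>'
               '<li><a class="title" href="/t/2">Regex help</a></li></ul>',
    "forum_b": '<div><a class="title big" href="/p/9">Regex help</a>'
               '<a class="nav" href="/">Home</a></div>',
}

conn = sqlite3.connect(":memory:")
for site, html in PAGES.items():
    store(conn, site, extract_titles(html))

# Join the two sources: titles that appear on both sites.
shared = conn.execute(
    "SELECT a.title FROM posts a JOIN posts b ON a.title = b.title "
    "WHERE a.site = 'forum_a' AND b.site = 'forum_b'"
).fetchall()
```

Keeping extraction separate from storage means that when a site changes its document structure, only the parser for that site needs to change.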