
digitalzombie | 7 years ago

More data is better.

You can reduce it via PCA, one of the many techniques in multivariate statistics.

You can use ANOVA to select your predictors.

In general, you can work with a subset of it using the tools that statistics provides.
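As a rough sketch of the two techniques mentioned above (assuming scikit-learn is available; the data here is synthetic, just to show the shapes):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))           # 200 examples, 20 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label depends on the first two features

# PCA: project the 20 features down to 5 principal components
X_pca = PCA(n_components=5).fit_transform(X)

# ANOVA F-test: keep the 5 features most associated with the label
X_anova = SelectKBest(f_classif, k=5).fit_transform(X, y)

print(X_pca.shape)    # (200, 5)
print(X_anova.shape)  # (200, 5)
```

Note that both reduce the number of columns (features), not the number of rows (examples).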

Complaining about messy data... welcome to the real world. As for complaining about non-reproducible models, choose reproducible ones. I've mostly used statistical models and forest-based algorithms, and they're all reproducible.

All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?

> The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns

The CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.

YeGoblynQueenne | 7 years ago

>> All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?

The solution is to direct research effort towards learning algorithms that generalise well from few examples.

Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.

>> You can reduce it via PCA one of the many techniques in multivariate statistic.

PCA is a dimensionality reduction technique. It reduces the number of features required to learn. It doesn't do anything about the number of examples needed to guarantee good performance. The article is addressing the need for more examples, not more features.

nimithryn | 7 years ago

>>>Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.

This is only true for the Facebooks and Googles of the world. There are definitely small companies (like the one I work for) trying very hard to figure out how to build models that use less data because we don't have access to those large datasets.

The industry is larger than just the Big N.

spongepoc | 7 years ago

>CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.

Yes it does. It's even implied in the name 'limit'. As the number of samples grows, the sampling distribution of the mean approaches a normal whose standard error shrinks like σ/√n, so quadrupling the data only halves the error. That's diminishing returns.
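A quick simulation makes this concrete (a hedged sketch using NumPy with synthetic exponential data, not anything from the article):

```python
import numpy as np

rng = np.random.default_rng(42)

# Estimate the mean of a skewed distribution at increasing sample sizes.
# The standard error of the sample mean shrinks like sigma / sqrt(n),
# so each 4x increase in n only halves the error.
errors = []
for n in (100, 400, 1600, 6400):
    means = [rng.exponential(scale=1.0, size=n).mean() for _ in range(2000)]
    errors.append(np.std(means))  # empirical standard error of the mean
    print(n, round(errors[-1], 4))
```

Each printed error is roughly half the previous one despite four times the data.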

>All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?

It's fine to point out problems without giving solutions. You seem very aggravated.

b_tterc_p | 7 years ago

PCA has specific use cases. It's not a catch-all dimensionality reduction technique. You can't use it effectively, for example, if things are not linearly correlated. There are of course many tools for addressing many problems, but as the title states, this is often a grind. For any practical problem, exclusive of huge black-box neural nets where you don't need to understand the model, you are probably better off starting with a smaller set of reasonable-sounding features and then slowly growing out your model to incorporate others.

Also if you meant random forest by forests... those aren’t especially reproducible. Understanding what’s going on is not always easy, and most people seem to misinterpret the idea of “variable importance” when you have a mix of categorical and numeric features. Decision trees and linear regressions are nice and reproducible.
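The importance pitfall is easy to demonstrate with scikit-learn's impurity-based `feature_importances_` (a sketch on pure-noise synthetic data, so neither feature actually matters):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(0, 2, size=n),  # binary "categorical" feature
    rng.normal(size=n),          # continuous feature
])
y = rng.integers(0, 2, size=n)   # label is pure noise: no feature is informative

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Impurity-based importance inflates the continuous feature simply because
# it offers many more candidate split points to overfit the noise.
print(rf.feature_importances_)
```

Here the continuous feature gets most of the importance even though both features are equally useless, which is exactly the kind of misreading the mixed categorical/numeric case invites.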

apercu | 7 years ago

> Complaining about messy data... welcome to the real world.

I mean, that's the crux: if you have bad data you will have bad results. Data cleanup/transformation is key for everything (reporting, etc.), not just ML, which only gets the attention because it's sexy these days.

Breza | 7 years ago

Nice to see a statistician weighing in on this post

iagovar | 7 years ago

Thank you, I'm not crazy. I was reading HN very confused.