digitalzombie | 7 years ago
You can reduce it via PCA, one of the many techniques in multivariate statistics.
You can do an ANOVA to select your predictors.
In general, you can use a subset of it using the tools that statistics has provided.
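The ANOVA-style predictor screening suggested above can be sketched as a one-way F-statistic in plain NumPy. The three-group dataset, group means, and seed here are all hypothetical, just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical screening setup: a 3-level categorical predictor
# and a numeric response. A one-way ANOVA F-statistic compares
# between-group variance to within-group variance.
groups = [rng.normal(loc=m, scale=1.0, size=100) for m in (0.0, 0.1, 1.5)]

k = len(groups)                     # number of groups
n = sum(len(g) for g in groups)    # total observations
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
F = (ss_between / (k - 1)) / (ss_within / (n - k))
print(F)
```

A large F suggests the predictor separates the response; in practice you would compare it against an F(k-1, n-k) critical value before keeping or dropping the predictor.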
Complaining about messy data... welcome to the real world. As for complaining about non-reproducible models, choose reproducible ones. I've mostly done statistical models and forest-based algorithms, and they're all reproducible.
All I see in this post is complaints and no real solutions. What solution is given? Have less data?
> The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns
CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.
YeGoblynQueenne|7 years ago
The solution is to direct research effort towards learning algorithms that generalise well from few examples.
Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.
>> You can reduce it via PCA one of the many techniques in multivariate statistic.
PCA is a dimensionality reduction technique. It reduces the number of features required to learn. It doesn't do anything about the number of examples needed to guarantee good performance. The article is addressing the need for more examples, not more features.
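A minimal sketch of that distinction, implementing PCA via SVD on a hypothetical 500-example, 50-feature dataset: the feature count shrinks, the example count doesn't.

```python
import numpy as np

# Hypothetical dataset: 500 examples, 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))

# PCA via SVD: centre the data, then project onto the top-k
# right singular vectors (the principal components).
k = 10
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:k].T

# Columns (features) drop from 50 to 10; rows (examples)
# are untouched. PCA says nothing about how many examples
# you need.
print(X_reduced.shape)  # (500, 10)
```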
nimithryn|7 years ago
This is only true for the Facebooks and Googles of the world. There are definitely small companies (like the one I work for) trying very hard to figure out how to build models that use less data because we don't have access to those large datasets.
The industry is larger than just the Big N.
spongepoc|7 years ago
Yes it does. It's even implied in the name: 'limit'. In the limit of infinitely many samples, the sampling distribution approaches a normal distribution, and the approximation improves like 1/sqrt(n), so each additional sample buys less than the last. That's the diminishing return.
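One way to make that concrete: the diminishing returns come from the 1/sqrt(n) rate at which the sample mean's standard error shrinks, not from the limit itself. A minimal sketch, assuming i.i.d. samples with a (hypothetical) standard deviation of 1:

```python
import math

# Standard error of a sample mean shrinks like sigma / sqrt(n):
# quadrupling n only halves the error.
sigma = 1.0
for n in (100, 400, 1_600, 6_400):
    se = sigma / math.sqrt(n)
    print(n, round(se, 4))

# Adding 300 samples early vs. adding 19,200 samples late:
gain_early = sigma / math.sqrt(100) - sigma / math.sqrt(400)
gain_late = sigma / math.sqrt(6_400) - sigma / math.sqrt(25_600)
print(gain_early, gain_late)
```

Going from 100 to 400 samples cuts the error by 0.05; going from 6,400 to 25,600 cuts it by only about 0.006, despite adding 64 times as many samples.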
>All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?
It's fine to point out problems without giving solutions. You seem very aggravated.
b_tterc_p|7 years ago
Also, if by forests you meant random forests... those aren't especially reproducible. Understanding what's going on is not always easy, and most people seem to misinterpret the idea of "variable importance" when you have a mix of categorical and numeric features. Decision trees and linear regressions are nice and reproducible.
apercu|7 years ago
I mean, that's the crux: if you have bad data, you will have bad results. Data cleanup/transformation is key for anything (reporting, etc.), not just ML; ML just gets the attention because it's sexy these days.