Let's not pretend that manipulating data to get the outcome you want and manipulating data to make it more accurate (e.g. compensating for biased sampling) are the same.
That the data isn't perfect when you get it is not a justification to further falsify it.
SpicyLemonZest|3 years ago

The challenge is that, even though those two phrasings have very different tones, they are quite literally the same operation. Compensating for biased sampling is done by saying "well, I don't think this sample represents what I was looking for, so I'm going to pretend that some parts of the sample are less common than they really were and other parts are more common than they really were." The bias isn't an inherent property of the sample; it's an interaction between the characteristics of the sample and the characteristics we'd like it to have.

lyubalesya|3 years ago

[deleted]

magicalist|3 years ago

> That the data isn't perfect when you get it is not a justification to further falsify it.

Falsify what?

Leaving aside the GP's important first point that scraping the internet is indeed an extremely biased sample, an LLM (for instance) is not an exercise in modeling the average person's writing on the internet; it's building a model for some purpose. Fulfilling that purpose is the goal, and nonrandom sampling, generating data, etc. are universally used tools to get there.
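The "pretend some parts of the sample are less common than they really were" operation described in the thread is, concretely, post-stratification reweighting. A minimal sketch, with entirely hypothetical group names and numbers, of how down-weighting an over-represented group changes the estimate:

```python
# Post-stratification reweighting: compensate for a biased sample by weighting
# each group by target_share / sample_share. All numbers below are hypothetical.

def reweight(sample_shares, target_shares):
    """Weight each group by how under- or over-represented it is in the sample."""
    return {g: target_shares[g] / sample_shares[g] for g in sample_shares}

def weighted_mean(group_means, sample_shares, weights):
    """Estimate the population mean by up/down-weighting each group's contribution."""
    num = sum(group_means[g] * sample_shares[g] * weights[g] for g in group_means)
    den = sum(sample_shares[g] * weights[g] for g in group_means)
    return num / den

# Hypothetical survey: group A is 80% of the sample but only 50% of the population.
sample_shares = {"A": 0.8, "B": 0.2}
target_shares = {"A": 0.5, "B": 0.5}
group_means = {"A": 10.0, "B": 30.0}

w = reweight(sample_shares, target_shares)  # A down-weighted (0.625), B up-weighted (2.5)
print(weighted_mean(group_means, sample_shares, w))  # 20.0, versus the raw sample mean of 14.0
```

The point of contention in the thread is visible in the code itself: `target_shares` is not a property of the data; it is a choice about what the analyst thinks the sample should have looked like.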