Let's not pretend that manipulating data to get the outcome you want and manipulating data to make it more accurate (e.g. compensating for biased sampling) are the same.
That the data isn't perfect when you get it is not a justification to further falsify it.
SpicyLemonZest|3 years ago

The challenge is that, even though those two phrasings have very different tones, they are quite literally the same operation. Compensating for biased sampling is done by saying "well, I don't think this sample represents what I was looking for, so I'm going to pretend that some parts of the sample are less common than they really were and other parts are more common than they really were." The bias isn't an inherent property of the sample; it's an interaction between the characteristics of the sample and the characteristics we'd like it to have.

lyubalesya|3 years ago

[deleted]

magicalist|3 years ago

> That the data isn't perfect when you get it is not a justification to further falsify it.

Falsify what?

Leaving aside the GP's important first point that scraping the internet is indeed an extremely biased sample, an LLM (for instance) is not an exercise in modeling the average person's writing on the internet; it's building a model for some purpose. Fulfilling that purpose is the goal, and nonrandom sampling, generating data, etc. are universally used tools to get there.
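The "pretend some parts of the sample are less common than they really were" operation described in the thread is, concretely, post-stratification reweighting. A minimal sketch, with entirely hypothetical group names and numbers, of how down-weighting an over-represented group changes the estimate:

```python
# Post-stratification reweighting: compensate for a biased sample by weighting
# each group by target_share / sample_share. All numbers below are hypothetical.

def reweight(sample_shares, target_shares):
    """Weight each group by how under- or over-represented it is in the sample."""
    return {g: target_shares[g] / sample_shares[g] for g in sample_shares}

def weighted_mean(group_means, sample_shares, weights):
    """Estimate the population mean by up/down-weighting each group's contribution."""
    num = sum(group_means[g] * sample_shares[g] * weights[g] for g in group_means)
    den = sum(sample_shares[g] * weights[g] for g in group_means)
    return num / den

# Hypothetical survey: group A is 80% of the sample but only 50% of the population.
sample_shares = {"A": 0.8, "B": 0.2}
target_shares = {"A": 0.5, "B": 0.5}
group_means = {"A": 10.0, "B": 30.0}

w = reweight(sample_shares, target_shares)  # A down-weighted (0.625), B up-weighted (2.5)
print(weighted_mean(group_means, sample_shares, w))  # 20.0, versus the raw sample mean of 14.0
```

The point of contention in the thread is visible in the code itself: `target_shares` is not a property of the data; it is a choice about what the analyst thinks the sample should have looked like.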