item 18043408

asploder|7 years ago

I'm glad to have kept reading to the author's conclusion:

> As a hybrid approach, you could produce a large number of inferred sentiments for words, and have a human annotator patiently look through them, making a list of exceptions whose sentiment should be set to 0. The downside of this is that it’s extra work; the upside is that you take the time to actually see what your data is doing. And that’s something that I think should happen more often in machine learning anyway.

Couldn't agree more. Annotating ML data for quality control seems essential both for making it work and for building human trust.
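The hybrid approach quoted above can be sketched in a few lines: start from machine-inferred word sentiments, then neutralize the entries a human annotator has flagged. Everything here (the function name, the lexicon values, the flagged word) is hypothetical, not from the article.

```python
def apply_exceptions(inferred, exceptions):
    """Return a copy of the lexicon with human-flagged words set to 0."""
    return {word: (0.0 if word in exceptions else score)
            for word, score in inferred.items()}

# Made-up inferred sentiments, including a spuriously negative entry
inferred = {"delightful": 2.1, "terrible": -2.4, "mexican": -0.5, "pasta": 0.3}

# Human-reviewed exception list: words whose sentiment should be neutral
exceptions = {"mexican"}

cleaned = apply_exceptions(inferred, exceptions)
```

The point of the manual pass isn't the code, which is trivial, but that building the exception list forces an annotator to actually read what the model inferred.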


ma2rten|7 years ago

This approach only works under OP's assumption that a text's sentiment is the average of its words' sentiments. That assumption is obviously flawed (e.g. "The movie was not boring at all" would come out negative).

Making this assumption is fine in some cases (for example, if you don't have training data for your domain), but if you're building a classifier on that basis, why not just use an off-the-shelf sentiment lexicon? Do you really need to assign a sentiment to every noun known to mankind? I doubt that doing so improves classification results, regardless of the bias problem.
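The flaw ma2rten describes is easy to demonstrate with a toy bag-of-words scorer. The lexicon values below are made up for illustration; the point is only that negation words contribute nothing, so "boring" drags the average negative.

```python
# Toy per-word sentiment lexicon; words absent from it score 0.
lexicon = {"boring": -2.0, "great": 2.0}

def average_sentiment(text):
    """Score a text as the mean sentiment of its words (the flawed assumption)."""
    words = text.lower().split()
    return sum(lexicon.get(w, 0.0) for w in words) / len(words)

# "not" and "at all" carry no signal in this model, so the
# negated sentence still scores below zero.
score = average_sentiment("The movie was not boring at all")
```

Handling this properly requires modeling context (negation scoping at minimum), which is exactly what a plain word-average cannot do.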

jakelazaroff|7 years ago

Sure, it's flawed, but that's the point of the post: that assumptions about your dataset can lead to unexpected forms of bias.

> Do you really need to assign a sentiment to every noun known to mankind?

No, but it's a simple (and seemingly innocuous) mistake that many programmers can and will make.

swingline-747|7 years ago

Heck, it's so important that it needs people with attention to detail and solid judgment, because crowdsourcing (i.e. populism) may not be the best source of ethical mooring (see Godwin's law).

User23|7 years ago

The old Wise and Benevolent Philosopher King model of governance applied to machine learning?

rhizome|7 years ago

Another point in favor of having moderators.