top | item 9080179


wrath | 11 years ago

1. You can try using bi-grams or even tri-grams to make your word list a little more precise.
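A minimal sketch of the bi-gram idea: look up longer phrases as well as single words, so negations like "not good" get their own score instead of being counted as "good". The lexicon entries and scores here are made up for illustration.

```python
def ngrams(tokens, n):
    """Return successive n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical scored lexicon: bi-gram entries catch phrases whose
# polarity differs from their individual words.
lexicon = {
    ("good",): 1.0,
    ("not", "good"): -1.5,
    ("waste", "of"): -2.0,
}

def score(text):
    """Sum lexicon scores over all unigrams and bi-grams in the text.

    Note this simple sketch counts overlapping matches: "not good"
    scores both ("good",) and ("not", "good"), i.e. 1.0 - 1.5 = -0.5.
    """
    tokens = text.lower().split()
    total = 0.0
    for n in (1, 2):
        for gram in ngrams(tokens, n):
            total += lexicon.get(gram, 0.0)
    return total
```

A more careful version would let a bi-gram match consume its tokens so the unigram inside it isn't double-counted, but even this naive overlap already pushes "not good" negative.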

2. Create a validation set by manually labeling each review as positive or negative. Each time you modify your algorithm, run it against your validation set and note the results in a spreadsheet. If you don't do that, you'll never know if and how you've improved the results. The bigger the validation set the better. Similarly, you can use part of your validation set as a training set for a classifier.
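A tiny sketch of that evaluation loop, assuming labels "pos"/"neg"; `my_classifier` and the sample reviews are placeholders for whatever scoring function and hand-labeled data you actually have:

```python
def accuracy(classifier, labelled_reviews):
    """Fraction of (text, label) pairs the classifier gets right."""
    correct = sum(1 for text, label in labelled_reviews
                  if classifier(text) == label)
    return correct / len(labelled_reviews)

def my_classifier(text):
    # Placeholder: positive iff the review mentions 'good'.
    return "pos" if "good" in text.lower() else "neg"

# Hand-labeled validation set (made-up examples).
validation = [
    ("Really good product", "pos"),
    ("Terrible, broke in a day", "neg"),
    ("Good value", "pos"),
    ("Not worth the money", "neg"),
]

print(accuracy(my_classifier, validation))  # record this per iteration
```

The point is just to re-run the same fixed set after every tweak and write the number down, so regressions are visible.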

3. Find a scale that works to bias your score. For example, I would try to bias your negative score using a log scale: the fewer negative words you have, the more each one is worth; the more you have, the less each one is worth.
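One way to sketch that log-scale bias (the weight and the choice of `log1p` are illustrative assumptions, not a prescription):

```python
import math

def negative_score(neg_count, weight=1.0):
    """Total negative contribution for `neg_count` matched words.

    log1p gives diminishing returns: the first negative word
    contributes the most, each additional one contributes less.
    """
    return weight * math.log1p(neg_count)
```

So the marginal cost of the second negative word is smaller than that of the first, which keeps a single rant word from being swamped by review length.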


markovbling | 11 years ago

Definitely think I should look at using bi-grams and tri-grams.

Interesting reflection on society if there are more 1-gram ways of communicating negativity than positivity, e.g. I'm more inclined to say 'terrible' for something very bad, while it feels more natural to say 'very good' than 'excellent'. If that makes any sense :)

namecast | 11 years ago

I found this paper useful for a side project I worked on a few months ago, one that made use of n-grams in a naive Bayes classifier:

http://arxiv.org/pdf/1305.6143v2.pdf

and the lead author's GitHub repos are:

https://github.com/vivekn/sentiment

https://github.com/vivekn/sentiment-web

He's implemented 'negative bi-gram detection' (my phrasing, not his) with this function:

https://github.com/vivekn/sentiment/blob/master/info.py#L26-...

...which I found useful as a jumping off point. Good luck!
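In the same spirit, here's a generic sketch of a naive Bayes classifier over unigram plus bi-gram features with Laplace smoothing. This is my own illustration of the technique, not the linked author's code:

```python
import math
from collections import Counter

class NaiveBayes:
    """Toy naive Bayes over unigram + bi-gram features."""

    def __init__(self):
        self.counts = {"pos": Counter(), "neg": Counter()}
        self.totals = {"pos": 0, "neg": 0}

    def features(self, text):
        tokens = text.lower().split()
        bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        return tokens + bigrams

    def train(self, text, label):
        for f in self.features(text):
            self.counts[label][f] += 1
            self.totals[label] += 1

    def classify(self, text):
        vocab = len(set(self.counts["pos"]) | set(self.counts["neg"]))

        def log_prob(label):
            # Sum of log likelihoods with add-one (Laplace) smoothing.
            return sum(
                math.log((self.counts[label][f] + 1)
                         / (self.totals[label] + vocab))
                for f in self.features(text))

        return max(("pos", "neg"), key=log_prob)
```

With a couple of training examples the bi-gram features let "not good" pull a review toward the negative class even though "good" alone is a positive signal. (Class priors are omitted here for brevity; a real version would add `log P(label)`.)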