
Twitter sentiment analysis using Python and NLTK

79 points | ananthrk | 14 years ago | laurentluce.com

8 comments

[+] detour|14 years ago|reply
The Pattern library has sentiment analysis built-in, pretty fun toolkit to play around with.

http://www.clips.ua.ac.be/pages/pattern-en#sentiment

[+] fdb|14 years ago|reply
In Pattern, sentiment analysis is a one-liner:

    >>> from pattern.en import sentiment
    >>> print(sentiment(
    ...     "The movie attempts to be surreal by incorporating various time paradoxes,"
    ...     "but it's presented in such a ridiculous way it's seriously boring."))
    (-0.34, 1.0)
[+] lrvick|14 years ago|reply
Great write-up. My company (Tawlk) actually open sourced a library to automate this very thing. We typically get around 80% accuracy with about 2 million samples.

You can grab our sample set here: https://github.com/downloads/Tawlk/synt/sample_data.bz2

And check out the project here: http://github.com/Tawlk/synt

It also ships with a full CLI interface if you just want to play with it without getting knee deep into the code.

Also, if you want to see a stripped-down, stand-alone code sample that steps you through the process, I made this gist:

https://gist.github.com/1266556

Enjoy :)

[+] abyssknight|14 years ago|reply
Sounds like what tawlk does. Wonder if their training data/method is better, though.
[+] lrvick|14 years ago|reply
The method is mostly the same one that is used within our synt library (http://github.com/Tawlk/synt). We built quite a bit on top of it, however. That said, the author did a great job of explaining the process.

Good encouragement for me to better document synt.

[+] jasonkolb|14 years ago|reply
What are neutral tweets classified as?
[+] lrvick|14 years ago|reply
It is a binary classifier, so everything is at least slightly negative or slightly positive, on a scale from -1 to 1.

Think of it like a level used in construction. Nothing is ever _perfectly_ level. It is either tilting one way or the other, but there is an acceptable range people will generally call 'level'. Neutral is the same.

If the classifier rates something as 0.001, it is probably safe to call it 'neutral'. It would be up to the application to decide on a 'neutral range'. You could, for instance, flag anything between -0.2 and 0.2 as 'neutral'. It is good to define functions like these last so you can adjust the range manually until you have reduced false positives to a minimum with your particular data set.
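
A rough sketch of that kind of 'neutral range' function, in plain Python. This is not code from synt: the name `label_sentiment` and the `neutral_range` parameter are made up for illustration, and the 0.2 default is only the example cutoff mentioned above. It assumes the classifier hands back a single score between -1 and 1.

```python
# Hypothetical helper, not part of synt or any other library.
# Assumes the classifier returns one score in the range [-1, 1].

NEUTRAL_RANGE = 0.2  # example cutoff only; tune it against your own data


def label_sentiment(score, neutral_range=NEUTRAL_RANGE):
    """Map a score in [-1, 1] to 'negative', 'neutral' or 'positive'."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be between -1 and 1")
    # Anything close enough to zero is treated as neutral.
    if abs(score) <= neutral_range:
        return "neutral"
    return "positive" if score > 0 else "negative"


if __name__ == "__main__":
    for s in (0.001, -0.15, 0.6, -0.9):
        print(s, label_sentiment(s))
```

Keeping the cutoff as a parameter means you can widen or narrow the band per data set without touching the classifier itself.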