Twitter sentiment analysis using Python and NLTK

[+] detour|14 years ago|reply

The Pattern library has sentiment analysis built in. It's a pretty fun toolkit to play around with.

http://www.clips.ua.ac.be/pages/pattern-en#sentiment

[+] fdb|14 years ago|reply

In Pattern, sentiment analysis is a one-liner:

    >>> from pattern.en import sentiment
    >>> print(sentiment(
    ...     "The movie attempts to be surreal by incorporating various time paradoxes,"
    ...     "but it's presented in such a ridiculous way it's seriously boring."))

    (-0.34, 1.0)

[+] lrvick|14 years ago|reply

Great write-up. My company (Tawlk) actually open sourced a library to automate this very thing. We typically get around 80% accuracy with about 2 million samples.

You can grab our sample set here: https://github.com/downloads/Tawlk/synt/sample_data.bz2

And check out the project here: http://github.com/Tawlk/synt

It also ships with a full CLI if you just want to play with it without getting knee-deep in the code.

Also, if you want to see a stripped-down, stand-alone code sample that steps you through the process, I made this gist:

https://gist.github.com/1266556

Enjoy :)

[+] denzil_correa|14 years ago|reply

A better example is shown by Jacob Perkins on his blog - http://streamhacker.com/2010/05/10/text-classification-senti...

[+] abyssknight|14 years ago|reply

Sounds like what Tawlk does. I wonder if their training data or method is better, though.

[+] lrvick|14 years ago|reply

The method is mostly the same one that is used within our synt library (http://github.com/Tawlk/synt). We built quite a bit on top of it, however. That said, the author did a great job of explaining the process.

Good encouragement for me to better document synt.

[+] jasonkolb|14 years ago|reply

What are neutral tweets classified as?

[+] lrvick|14 years ago|reply

It is a binary classifier, so everything is at least slightly negative or slightly positive, on a scale from -1 to 1.

Think of it like a level used in construction. Nothing is ever _perfectly_ level. It is always tilting one way or the other, but there is an acceptable range people will generally call 'level'. Neutral is the same.

If the classifier rates something as 0.001, then it is probably safe to call it 'neutral'. It would be up to the application to decide on a 'neutral range'. You could, for instance, just flag anything between -0.2 and 0.2 as 'neutral'. It is good to define functions like these last, so you can adjust the range manually until you have reduced false positives to a minimum with your particular data set.
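A minimal sketch of that kind of 'neutral range' function, in plain Python. It is not part of synt or any other library: the name `label_sentiment` and the `neutral_band` parameter are hypothetical, and treating the band edges as neutral is just one possible choice. It only assumes the classifier hands back a score between -1 and 1.

```python
def label_sentiment(score, neutral_band=0.2):
    """Map a polarity score in [-1, 1] to 'negative', 'neutral' or 'positive'.

    neutral_band is a hypothetical, tunable half-width: any score whose
    absolute value is at or below it is reported as 'neutral'.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be in [-1, 1], got %r" % (score,))
    if abs(score) <= neutral_band:
        return "neutral"
    return "positive" if score > 0 else "negative"


if __name__ == "__main__":
    # Example scores only; real values would come from the classifier.
    for s in (0.001, -0.15, 0.35, -0.8):
        print(s, label_sentiment(s))
```

Keeping the band as a parameter means you can widen or narrow it against your own data without touching the classifier itself.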

8 comments