jonathanbgn's comments

jonathanbgn | 6 years ago | on: Show HN: Visualize how HN/Reddit talk about your company and products

For inference speed I recommend a Naive Bayes model. I've tried this on Twitter messages and got close to 90% accuracy on 3-class classification (positive, negative, neutral).

The easiest library to do that would probably be scikit-learn with their ComplementNB class: https://scikit-learn.org/stable/modules/generated/sklearn.na...

For the data you can use the SemEval 2017 Task4-A dataset (around 10K labeled tweets): https://github.com/cbaziotis/datastories-semeval2017-task4/t...
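
A minimal sketch of that setup, assuming scikit-learn is installed. The handful of training sentences below are made-up stand-ins for the SemEval tweets, and the TF-IDF features with unigrams and bigrams are my own assumption, not something tuned on real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); a real run would load the labeled tweets.
train_texts = [
    "I love this phone, the camera is great",
    "best purchase I have made this year",
    "this update is terrible and slow",
    "worst customer service ever",
    "the store opens at nine tomorrow",
    "new model will be announced next week",
]
train_labels = [
    "positive", "positive",
    "negative", "negative",
    "neutral", "neutral",
]

# TF-IDF features feed the Complement Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), ComplementNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["I love the camera", "the service is terrible"])
print(list(predictions))
```

Both fitting and prediction come down to sparse matrix operations, which is why this kind of model stays fast on a CPU.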

jonathanbgn | 6 years ago | on: Show HN: Visualize how HN/Reddit talk about your company and products

The algorithm tries to give more weight to words that appear rarely overall and are only used alongside the chosen brand name (similar to TF-IDF). This is why weird words can sometimes surface in the word cloud, especially when the sample of messages is small.
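
To make the idea concrete, here is a rough sketch of a TF-IDF-style score. It is not the app's actual formula: the function names, the tokenizer, and the exact weighting (count in brand messages times log inverse document frequency) are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def brand_word_scores(messages, brand):
    """Score words by frequency near the brand, discounted by how common they are overall."""
    brand = brand.lower()
    doc_freq = Counter()    # number of messages containing each word
    brand_freq = Counter()  # word occurrences in messages that mention the brand
    for message in messages:
        tokens = tokenize(message)
        doc_freq.update(set(tokens))
        if brand in tokens:
            brand_freq.update(t for t in tokens if t != brand)
    n = len(messages)
    return {w: c * math.log(n / doc_freq[w]) for w, c in brand_freq.items()}


messages = [
    "Mazda is great",
    "the Mazda handling is great",
    "Toyota is reliable",
    "my Mazda has a zoomzoom sticker",
]
scores = brand_word_scores(messages, "Mazda")
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(score, 3))
```

With only four messages, a one-off word like "zoomzoom" outranks a common word like "is", which is the small-sample effect described above.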

To prevent those words from appearing, I was thinking of implementing a dictionary check to only allow meaningful words. However, this approach also has a drawback: it restricts the vocabulary people can use, so you can miss important new concepts.

Thanks for the feedback.

jonathanbgn | 6 years ago | on: Show HN: Visualize how HN/Reddit talk about your company and products

Let me elaborate a bit more on how the app computes sentiment. For a particular word, its sentiment is the average sentiment of the sentences that contain both the word and the brand name (in order to identify the sentiment targeted at the brand, not just the overall sentiment).

For example, in the case of Mazda where you say that "regret" is classified as positive, if you look into which message it comes from you can see the original sentence: "Buy a Mazda, you won't regret it :)"
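
The averaging can be sketched roughly as follows. The function name `word_sentiments`, the tokenizer, and the hard-coded sentence scores are hypothetical; in the app the per-sentence score would come from the sentiment model, not from a lookup table.

```python
import re
from collections import defaultdict


def word_sentiments(sentences, brand, score_sentence):
    """Average the sentence-level sentiment over sentences containing both word and brand."""
    brand = brand.lower()
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        if brand not in words:
            continue  # only sentences mentioning the brand count
        score = score_sentence(sentence)
        for word in words - {brand}:
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


# Made-up sentence scores standing in for a real sentiment model.
toy_scores = {
    "Buy a Mazda, you won't regret it :)": 0.8,
    "I regret selling my Toyota": -0.6,
}
result = word_sentiments(list(toy_scores), "Mazda", toy_scores.get)
print(result["regret"])
```

Here "regret" ends up positive because the only sentence that mentions both the word and the brand is a positive one; the Toyota sentence is ignored.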

I agree with you that the word cloud is not useful on its own, and this is why you can click on a word to see the actual messages. Think of the word cloud as merely an entry point into a more detailed analysis by a human.

Thanks for the feedback.

jonathanbgn | 6 years ago | on: Show HN: Visualize how HN/Reddit talk about your company and products

That's a fair criticism. Sentiment analysis is quite hard to get right on social media messages because of diversity, subtlety, and many other aspects. From my experience with similar commercial and (very!) expensive products, their accuracy is far from perfect too.

Also consider the lack of labeled data for HN and Reddit messages: I had to use Twitter messages to train the classifiers.

This is the reason why I tried to play with BERT to see if I could get a model to generalize well from only Twitter messages. From my experiments, if you activate BERT (which makes the app much slower), you should be able to get 60-70% accuracy.

It's not perfect, but it's not too bad either if you are averaging over a large number of messages.

Overall it's still a work in progress, and I expect to greatly improve the accuracy over the coming weeks!

jonathanbgn | 6 years ago | on: Show HN: Visualize how HN/Reddit talk about your company and products

Thanks! The basic (default) version of the sentiment analysis is based on the TextBlob library, but you can choose to activate deep learning to analyze sentiment with Google AI's BERT (trained on Twitter messages). It is quite slow at the moment because inference runs on a CPU and not a GPU.

The back-end is just Python/Flask and I use the free Algolia and Pushshift.io APIs to source the messages from HN and Reddit (big thanks to them!)
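
For the HN side, a request to the Algolia search API might look roughly like this. The query parameters and the `comment_text` field reflect my understanding of the public HN Search API and should be checked against its documentation; the helper names are hypothetical, and Pushshift is not covered here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search"


def build_hn_query(brand, page=0):
    """Build a search URL for HN comments mentioning the brand."""
    params = {"query": brand, "tags": "comment", "page": page}
    return HN_SEARCH_URL + "?" + urlencode(params)


def fetch_hn_comments(brand, page=0):
    """Fetch one page of matching comments (performs a network request)."""
    with urlopen(build_hn_query(brand, page), timeout=10) as response:
        payload = json.load(response)
    # Assumption: each hit carries the comment body under "comment_text".
    return [hit.get("comment_text") or "" for hit in payload.get("hits", [])]


print(build_hn_query("Mazda"))
```

Calling `fetch_hn_comments("Mazda")` would return the raw comment bodies, which would still need HTML cleanup and sentence splitting before any sentiment scoring.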
