Predicting Stack Overflow Tags with Google’s Cloud AI

[+] yeldarb|6 years ago|reply

Slightly off topic but I’ve been playing with this dataset for a while now to learn ML. It’s remarkably humbling.

No NLP approach I’ve tried has been able to predict question score based on content better than my baseline of “choose the mean”. (I’ve tried random forest on bag of words, AWD-LSTM, and Google AutoML so far).
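
For readers unfamiliar with the "choose the mean" baseline, here is a minimal sketch in plain Python. It is not the commenter's code: the scores and the model predictions are made up, and the helper names (`mse`, `mean_baseline`, `beats_baseline`) are hypothetical. It only shows the comparison being described: a model is useful only if its error on held-out questions is lower than the error from predicting the training mean every time.

```python
from statistics import mean

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def mean_baseline(train_scores, n):
    """Predict the training mean for every one of n test questions."""
    m = mean(train_scores)
    return [m] * n

def beats_baseline(train_scores, test_scores, model_preds):
    """True only if the model's test MSE is lower than the mean baseline's."""
    base = mean_baseline(train_scores, len(test_scores))
    return mse(test_scores, model_preds) < mse(test_scores, base)

# Made-up question scores, purely for illustration.
train = [0, 1, 0, 2, 5, 0, 1, 3]
test = [1, 0, 4, 2]
model_preds = [3, 3, 0, 0]  # hypothetical output of some NLP model

print(beats_baseline(train, test, model_preds))
```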

The author of this post tried score prediction as well and pivoted to tag prediction after she couldn’t find anything that worked well: https://twitter.com/srobtweets/status/1125860523377979398?s=...

It’s so crazy to me that which posts get popular might just be random. And makes me wonder about the correlation between post content and popularity on other social sites like HN and reddit.

[+] existencebox|6 years ago|reply

Quick question: Did you try/find any correlation with post time/seasonality? I recall that in previous examinations of HN/Reddit top/viral posts, there was some amount of signal along that dimension. (I notice someone mentions this in the twitter thread too but I didn't see a response)

Additionally, you found average (all up) to be a better predictor than average per user or per category?
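
One way to test the per-user or per-category idea is a grouped-mean baseline that falls back to the global mean for unseen groups. The sketch below is an assumption about how that could be set up, not something from the article or the thread; the tags and scores are invented, and the same grouping key could just as well be the posting hour to probe seasonality.

```python
from collections import defaultdict
from statistics import mean

def fit_group_means(rows):
    """rows: iterable of (group_key, score). Returns (per-group means, global mean)."""
    rows = list(rows)
    by_group = defaultdict(list)
    for key, score in rows:
        by_group[key].append(score)
    group_means = {k: mean(v) for k, v in by_group.items()}
    return group_means, mean(s for _, s in rows)

def predict(key, group_means, global_mean):
    """Use the group's mean if the group was seen in training, else the global mean."""
    return group_means.get(key, global_mean)

# Invented (category, score) pairs, purely for illustration.
train = [("python", 2), ("python", 4), ("ruby", 0), ("ruby", 1), ("ruby", 2)]
group_means, global_mean = fit_group_means(train)

print(predict("python", group_means, global_mean))
print(predict("go", group_means, global_mean))  # unseen group, falls back to global mean
```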

Apologies for grilling you here, I should frankly dig in myself, but if you happen to feel like indulging me it's much appreciated :)

[+] jzwinck|6 years ago|reply

I have some unproven factors for predicting certain popular low-scoring questions: https://meta.stackoverflow.com/questions/373412/require-conf...

A lot of questions that should have a very low score get fixed by people other than the original author. This may make your modeling more difficult because not only are there multiple authors, but the expected score of a question is not time-invariant. In fact, the final score of a question is a path dependent composite score of potentially several intermediate formulations.

[+] AlexCoventry|6 years ago|reply

It's not that it's random, it's that the signal is too abstract for such simple methods to grasp it. Machine Learning which can predict Stack Overflow question scores is going to have to understand what's being asked, and how useful the responses are going to be for people interested in the question.

[+] raz32dust|6 years ago|reply

Just curious - did you include only the title and text of the question, or also the tags and author history?

[+] Darkphibre|6 years ago|reply

I've used simple TF/IDF with a custom stemmer as a similarity scoring tool against the XDK documentation. It worked pretty well, I'd be curious to know how that'd fare vs. the neural model (i.e. I suspect it'd be better in identifying the rare, high-signal words that were excluded by the author's 400-most-common word limit).

For my side project: As we received emails from developers asking for clarification/help with the APIs, the system would provide relevant documentation URLs so that anyone could pick up an inquiry and brush up on the API (and have a handy link if the docs could be leveraged in the response).
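
A rough sketch of TF-IDF similarity scoring of this kind, in plain Python, might look like the following. It is an illustration under assumptions, not the system described above: there is no custom stemmer, the IDF uses a common smoothed formula, and the three "documentation pages" and the query are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumerics (no stemming in this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    tf = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    """Return doc indices sorted by similarity to the query, best first."""
    idf = build_idf(docs)
    q = tfidf(query, idf)
    scores = [cosine(q, tfidf(d, idf)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Invented documentation snippets and an invented developer email.
docs = [
    "How to initialize the audio device",
    "Submitting a leaderboard score to the service",
    "Handling controller input events",
]
print(rank("my leaderboard score is not submitting", docs))
```

Because rare terms get a higher IDF weight, a single distinctive word in an email can dominate the ranking, which is the property the comment expects to help against a fixed most-common-words vocabulary.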

[+] btgeekboy|6 years ago|reply

Between this and the Pi calculation "achievement" it appears Google's all-in on the grassroots marketing these days.

[+] boulos|6 years ago|reply

Disclosure: I work on Google Cloud.

Both Emma and Sara are in our Developer Relations (aka Dev Rel) organization, like Kelsey Hightower and Felipe Hoffa.

They’re explicitly not in Sales, and their job is focused on explaining and demonstrating stuff to Developers. They go to meetups, give invited talks, write blog posts, and so on.

Sorry if that doesn’t come across as clear. They work for Google, are paid by Google for that work, but aren’t measured by revenue or anything.

[+] zitterbewegung|6 years ago|reply

By grassroots marketing you mean paying people to use their services?

[+] gridlockd|6 years ago|reply

Out of curiosity, does anyone ever use tags on Stackoverflow? Does anybody use search on Stackoverflow? Does anybody ever use in-site search on any website, instead of just using Google?

[+] inglor|6 years ago|reply

I'm a "heavy" Stack Overflow user https://stackoverflow.com/users/1348195/benjamin-gruenbaum and I use tags all the time.

Otherwise you're just playing "fastest gunslinger in the west" trying to answer generic stuff before anyone rather than sniping questions you genuinely find interesting and can teach you.

[+] amrrs|6 years ago|reply

I had a daily practice of looking at the tags of my languages of choice, `R` and `Python`, to see the kind of incoming questions and the variety of answers arriving at different times. This is like a break from my full-time work, let's say after I finish a task or something. Somehow this has helped me improve my coding.

[+] w-m|6 years ago|reply

Stackoverflow may be the only site where I ever use tags. Tags are useful there for finding questions you can answer. You can watch tags of topics that you probably can reply to and browse the front page with a collection of interesting questions. Or look up the new questions page of a specific tag.

The same could probably be done with search terms saved to your profile, but the tags are a much more organized alternative.

[+] jodrellblank|6 years ago|reply

Yes, tags are mandatory when posting a question and useful when looking for questions to answer.

And yes, Google is worse than useless, it doesn't even search for what I type in, and when I force it to do that, it returns SEO spam websites, bot-farmed-content, clickbait and advertising ahead of useful content. Or instead of useful content.

[+] Theodores|6 years ago|reply

You are expected to put a few tags on a question if you ask one, and normally some editor person will mangle your English, removing the nuance your question was really about, and possibly also update the tags so that they no longer quite fit your question. This is remarkably unappreciative; however, it is interesting how any website that gets established garners an army of helpers who do these things. Wikipedia being the classic for this expertise-ism.

So even if tags are not your thing then whatever question you see on StackOverflow will be fully tagged up. If the software doesn't suggest some when asking, some keen person will add them in.

If you are interested in a particularly obscure software package that does not have its own StackOverflow site then you will find the square bracket search option (for the tags) is a good way of finding out what is new in that niche and what cool features or tips you can borrow for your own project.

At a guess the unrelated questions - 'top network questions' - are more likely to get clicks than the SO search box. So, to answer your question, the answer is 'no'. Aside from the one DDG user on HN that maintains a gopher site and the really computer-phobic uncle that uses Bing! with Windows XP the whole English speaking world is using Google.

China is different.

[+] EastSmith|6 years ago|reply

I used to have an RSS subscription on an obscure SO tag I was interested in years ago. The tag had like a couple of posts a week, and I would read the question in the RSS reader and click on what I was interested in.

So tags probably do not work well on a popular topic, but outside of that, they can be useful.

[+] tyingq|6 years ago|reply

"Does anybody ever use in-site search on any website, instead of just using Google?"

It's certainly common for e-commerce. Especially when you can search by specific product attributes, shipping options, etc.

[+] baroffoos|6 years ago|reply

You don't use tags for searching for answers. You use tags to browse questions on topics you know a lot about and you have a high chance of being able to answer. Most of the questions on stack overflow I know nothing about so I browse the ruby tag where a lot of the questions are in the range of what I could answer.

That's why they have the requirement for a tag: "Could someone be an expert in this tag?"

[+] kevinventullo|6 years ago|reply

I used tags on MathOverflow because questions outside my area of expertise often may as well have been written in Latin.

[+] onurcel|6 years ago|reply

If you want to experiment with question tag prediction on your laptop, you can also play with fastText : https://fasttext.cc/docs/en/supervised-tutorial.html
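
For anyone trying this, fastText's supervised mode reads plain text with one example per line, each label carrying the `__label__` prefix. The sketch below only prepares such lines from (question, tags) pairs; the helper name `to_fasttext_line` and the sample question are made up, and the lowercasing and punctuation spacing is just a light preprocessing step assumed here, not a requirement.

```python
import re

def to_fasttext_line(text, tags):
    """Format one question as '__label__tag1 __label__tag2 cleaned text'."""
    labels = " ".join(f"__label__{tag}" for tag in tags)
    # Lowercase and put spaces around punctuation so it becomes separate tokens.
    cleaned = re.sub(r"([.!?,'/()])", r" \1 ", text.lower())
    cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
    return f"{labels} {cleaned}"

# Invented example question with two tags.
line = to_fasttext_line("How do I merge two DataFrames?", ["python", "pandas"])
print(line)
```

Writing one such line per question to a text file gives a training file in the shape the linked tutorial works with.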

[+] ACow_Adonis|6 years ago|reply

Kaggle ran a facebook recruitment challenge trying to do something similar a few years ago for those interested (I am because I participated in it):

https://www.kaggle.com/c/facebook-recruiting-iii-keyword-ext...

I'm in a waiting room right now, but is anyone interested in summarising or commenting on the differences/ gains/ losses between the performance and aspects of comparing the two? :)

[+] lettergram|6 years ago|reply

For those interested, I’ve done similar work in the past. I’ve also written a guide along with explanations of how it works (on sentence classification):

https://github.com/lettergram/sentence-classification

It uses Keras and goes through everything from encodings, to the way various networks function, to hyperparameter tuning.

[+] milad_nazari|6 years ago|reply

Similarly, predicting issue tags on Github would be interesting.

[+] aficionado|6 years ago|reply

Links to dataset do not work.

[+] turtlegrids|6 years ago|reply

Not sure why you're being downvoted. I also cannot get to at least one dataset - https://storage.googleapis.com/cloudml-demo-lcm/SO_ml_tags_a... which is linked to from the fourth paragraph directly above the "What Is The Bag Of Words Model?" section heading.

[+] sararob|6 years ago|reply

Article author here.

To access the dataset you need to be logged in with a Google account. Details here: https://github.com/GoogleCloudPlatform/ai-platform-text-clas...

43 comments