Show HN: Generating fun Stack Exchange questions using Markov chains

[+] Findus23|8 years ago|reply

Hi everyone, I hope you like my latest side project!

I'm an astronomy student who likes programming in his free time. This time I wanted to write something that handles larger amounts of data. Since I recently came across the Stack Exchange data dump, which includes all questions and answers, I had the idea of using it to create Markov chains for (nearly) every Stack Exchange site. The website displays the resulting content, which is often surprisingly coherent and entertaining, and allows upvotes/downvotes so that the best questions get to the front page. As a bonus, I created a quiz where one can guess which site a random question is based on.

If you are interested in my other projects, check out https://lw1.at. If you want to see the code, everything is open source and can be found here: https://github.com/Findus23/se-simulator

Please excuse the very minimal design, but after writing a lot of Single Page Applications I wanted to go the opposite way and write a website with less than 25KB.

[+] froindt|8 years ago|reply

>Please excuse the very minimal design, but after writing a lot of Single Page Applications I wanted to go the opposite way and write a website with less than 25KB.

The world could use more of this.

I took the simple quiz and my first question was from the Russian language site. I hadn't considered that Markov chains could work in other languages too. I wonder if there are any significant differences?

[+] reificator|8 years ago|reply

I'm not going to lie, I started reading this thinking it was an example. I was very impressed as often Markov chain generated sentences break down after a common joining word.

[+] jaytaylor|8 years ago|reply

Really awesome that you shared the source code. I'm having a hard time figuring out how to get it all working, though.

Any chance you'll be including setup instructions?

[+] Waterluvian|8 years ago|reply

Markov chains are so much fun. They produce believable relevant text that ultimately makes no sense, which is basically a definition of comedy. And they're also super simple to understand and implement. I can have lots of fun without having to do any wild natural language processing.
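To show how little code the basic idea needs, here is a minimal word-level sketch using only the Python standard library. It is not the code behind the site; the function names `build_chain` and `generate`, the toy corpus, and the default state size of 2 are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        # Repeated followers are kept, so frequent continuations are picked more often.
        chain[state].append(words[i + state_size])
    return chain


def generate(chain, max_words=30, seed=None):
    """Random-walk the chain from a random starting state."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    output = list(state)
    while len(output) < max_words:
        followers = chain.get(state)
        if not followers:
            # Dead end: this state only occurred at the very end of the corpus.
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Hypothetical toy corpus; a real one would be many thousands of posts.
corpus = (
    "how do I tune an antenna for my radio "
    "and how do I ground an antenna for my tower"
)
chain = build_chain(corpus)
print(generate(chain, max_words=12, seed=1))
```

A larger state size makes the output more coherent but closer to copying the source, which is probably why small values like 2 are a common choice.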

[+] Findus23|8 years ago|reply

I have to fully agree. The language-processing part is really simple (partly thanks to the really cool markovify library [1]).

You may also like my older project [2] even though it is partly in German. I'm using Markov Chains for the titles and some custom regex-based language processing for the descriptions.

[1] https://github.com/jsvine/markovify/

[2] https://nonsense.lw1.at/

[+] koolba|8 years ago|reply

> Do Greeks driving affect the whaling industry?

I’ve always wondered this as well.

[+] arbie|8 years ago|reply

> Like all animal abuse in particular: can one truly know?

Comedy gold.

[+] Findus23|8 years ago|reply

I have now written a bit more on how to hopefully get this to run locally and how everything works here:

https://github.com/Findus23/se-simulator#se-simulator

[+] bcaa7f3a8bbc|8 years ago|reply

Ham Radio: https://se-simulator.lw1.at/q/which-mode-describes-this

> If you can already be synchronized when it comes through the use of your test. That's a switching powersupply. I disable AGC in my comments above as an antenna analyzer that works depends on the Pi transmit frequency that isn't necessary to send an SWL, but let's dig further by adding another radial... You have bigger problems.

Any sufficiently advanced technology is indistinguishable from magic.

[+] chris_wot|8 years ago|reply

I'd like to see Stack Exchange moderator responses using Markov chains. Like "Stop answering this guy as he is posing useless questions", or "This duplicate is considered not relevant".

[+] everdev|8 years ago|reply

Beautiful.

> Configure anonymous DDOS attacks on internal servers?

[+] bcaa7f3a8bbc|8 years ago|reply

> How to brute force against quantum computer

> This may or may not be a major flaw with your function F is for the second bytes of randomness, this requires access to the algorithm.

> Okamoto-Tanaka Revisited: Fully Authenticated Diffie-Hellman with Schnorr signatures and KCDSA are two obvious caveats to this question is off-topic...

https://se-simulator.lw1.at/q/how-to-brute-force-against-qua...

[+] exabrial|8 years ago|reply

Ok some of these I actually want to know the answer to... Like what windsurfing equipment is good for deep sea fishing

[+] aetherspawn|8 years ago|reply

> Essential windsurfing equipment to fish?

> Any additional info will be suspicious, they've had the card

[+] pavel_lishin|8 years ago|reply

> Remove Broken Lightbulb from the toe?

> As it is horizontal? This means the color and material like them a few sheets of paper on the internet?

I don't know how much effort you put into this, but that alone was absolutely worth it.

[+] dkersten|8 years ago|reply

> What is an open-commercial license?

> Why did they determine that he used / invisibility cloak technology?

> How did newton APPROXIMATE THE AREA UNDER THESE PARTICULAR CURVES

[+] stared|8 years ago|reply

I am curious how those results would compare with RNN models, such as the ones in Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks" http://karpathy.github.io/2015/05/21/rnn-effectiveness/

(E.g. as they are able to learn code grammar.)

[+] Findus23|8 years ago|reply

Hi, I had already expected a question about neural networks :)

When I had the idea I also toyed around with word-rnn and similar RNN libraries. The results I got were pretty good, but training was extremely resource-consuming. CUDA gave an 8x boost, but training one of the smaller sites still took 20 minutes on my simple graphics card, while creating the Markov chain takes 2 minutes.

I also have absolutely no experience with Machine Learning and just the setup was already quite an experience. So I stuck with what I know and stayed "the traditional way".

[+] jpatokal|8 years ago|reply

Turns out the random mishmash of pseudoscience and conspiracy theories that is Skeptics.SE is hilarious fodder for a Markov chain:

https://se-simulator.lw1.at/q/do-greeks-driving-affect-the-w...

[+] fourthark|8 years ago|reply

This is a lot of fun.

Why are some of the choice buttons colored in the easy quiz? Sometimes it seemed like a hint, sometimes just random.

[+] krallja|8 years ago|reply

For the colors, see https://stackexchange.com/sites -- it seems to be an approximation of the actual sites' themes/branding.

[+] jeroen|8 years ago|reply

Is there supposed to be a close button on the yellowish popup? It's not on screen on my iPhone SE.

[+] Findus23|8 years ago|reply

Hm, there ought to be one in the top right corner.

Can you try tapping there even though you don't see it?

[+] mmirate|8 years ago|reply

Next step: post these questions to the Stack Exchange sites from whence they were generated.

[+] pbhjpbhj|8 years ago|reply

I wonder if you split the corpus and only used low voted questions/answers for one corpus and edited questions + high-voted answers for the other ... could we tell the difference in the output chains?

SE often has translated questions, or questions of low quality, IME.

[+] Findus23|8 years ago|reply

In a way I did it for Stack Overflow. All posts together are 60GB, which is a bit too much for my desktop PC to handle as a single chain. So I only used questions and answers with a score >=10 there to get it to a similar size as the other sites. It still took half an hour to parse the XML and another hour to create the chain.
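For anyone wanting to try a similar score filter, a rough sketch with the standard library's streaming XML parser might look like the following. It assumes the dump's `Posts.xml` layout of one `<row>` element per post with `Score` and `Body` attributes; the function name `iter_high_score_bodies` and the tiny sample document are made up for illustration, and this is not necessarily how se-simulator does it.

```python
import io
import xml.etree.ElementTree as ET


def iter_high_score_bodies(source, min_score=10):
    """Yield the Body of every <row> whose Score is at least `min_score`.

    `source` is a filename or a binary file object. iterparse reads it
    incrementally, so the whole file is never parsed into one tree up front.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row" and int(elem.get("Score", "0")) >= min_score:
            yield elem.get("Body", "")
        # Drop the element's attributes and children once it has been handled.
        # Emptied elements stay attached to the root, so memory still grows
        # slowly on very large files.
        elem.clear()


# Hypothetical miniature stand-in for a real Posts.xml file.
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="25" Body="How do I parse XML?" />
  <row Id="2" PostTypeId="2" Score="3" Body="Use a regex." />
  <row Id="3" PostTypeId="2" Score="10" Body="Use a streaming parser." />
</posts>"""

bodies = list(iter_high_score_bodies(io.BytesIO(sample)))
print(bodies)
```

Running the same filter twice with different thresholds (or with the comparison inverted) would give the two corpora suggested above, one from low-voted and one from high-voted posts.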

[+] unknown|8 years ago|reply

[deleted]

[+] sudouser|8 years ago|reply

awesome, reminds me of ‘how is babby formed’

36 comments