
Text Mining South Park

202 points | eamonncarey | 10 years ago | kaylinwalker.com

48 comments

nanis | 10 years ago
I was in the process of reading this when I thought to check who this person is. Of course, by that time the site had failed, so I haven't read the whole thing yet.

But it seems to me that the author is falling into a trap that many an unwary data "scientist" falls into: not understanding the discipline of Statistics.

When one has the entire population data (i.e. a census), rather than a sample, there is no point in carrying out statistical tests.

If I know ALL the words spoken by someone, then I know which words they say the most without resorting to any tests simply by counting.

No concept of "statistical significance" is applicable because there is no sample. We can calculate the population value of any parameter we can think of, because, we have the entire population (in this specific instance, ALL the words spoken by all the characters).
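
The counting point can be made concrete: with the whole population in hand, "most frequent" is an exact count rather than an estimate. A minimal Python sketch over made-up lines (not the actual dataset):

```python
from collections import Counter

# Hypothetical mini-"census": every line a character ever speaks.
lines = [
    "oh my god they killed kenny",
    "you killed kenny you bastards",
    "oh man this is bad",
]

# With the entire population available, the top words are exact
# population values; no significance test is involved, just counting.
counts = Counter(word for line in lines for word in line.split())
print(counts.most_common(3))
```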

FYI, all budding data "scientists" ...

bonoboTP | 10 years ago
Why so bitter and angry? As far as I can see, his calculations make sense and lead to interesting results. Instead of philosophical nitpicking, why not help him improve his understanding by explaining how you would have calculated/formalized/modeled this thing, so the scare-quote data "scientists" can learn something?

By the way, we definitely don't hear all the words these characters speak in their lives. It's implied that there are conversations we don't get to see in the actual episodes; these imaginary characters speak a lot more than what's shown. For example, we don't see each and every breakfast, lunch and dinner discussion, we don't hear all their words in the classroom, etc.

Now of course the sampling isn't random, because the creators obviously "select" the more interesting bits of the characters' lives, but in statistics we always make assumptions that simplify the procedure but are known to be technically wrong.

vsbuffalo | 10 years ago
You're treating this sample-is-the-population issue as if it's resolved in the statistics literature. It is not. Gelman has written on this [1][2], as the issue comes up frequently in political science data. As Gelman points out, the 50 states are not a sample of states; they are the entire population. Similarly, the Correlates of War data [3] covers every militarized international dispute between 1816 and 2007 that fits certain criteria; it too is not a sample but the entire population.

Treating his population as a large sample of a process that's uncertain or noisy and then applying frequentist statistics is not inherently wrong in the way you say it is. It may be that there's a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point than the one you make.

[1]: http://andrewgelman.com/2009/07/03/how_does_statis/

[2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)

walkerkq | 10 years ago
Hi, I'm the author. I appreciate the time you've taken to read and provide constructive criticism of my work. Here's my full write up (on GitHub, so it should continue to work): https://github.com/walkerkq/textmining_southpark/blob/master...

I was working under the assumption that we do not know ALL the words since the show's been renewed through 2019. This covers the first 18 seasons.

Additionally, simply counting up their most frequent words produced results with very little semantic meaning (things like "just" and "dont"), which can be seen in this (really boring) wordcloud: https://github.com/walkerkq/textmining_southpark/blob/master...

Looking into the log likelihood of each word for each speaker produced results that were much more intuitive and carried more meaning. As ppod said below, I think the idea is that what we are really trying to measure is something unobservable, like the underlying nature of the character or the writers' tendency to give characters certain ways of speaking.
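
The log-likelihood comparison described here is commonly formalized in corpus linguistics as Dunning's log-likelihood ratio (G2) between a character's word counts and the rest of the corpus; whether this exact statistic matches the author's code is an assumption. A sketch with made-up counts:

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning-style G2 keyness.

    a: count of the word in the character's speech
    b: count of the word in the rest of the corpus
    c: total words spoken by the character
    d: total words in the rest of the corpus
    """
    e1 = c * (a + b) / (c + d)  # expected count for the character
    e2 = d * (a + b) / (c + d)  # expected count for everyone else
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Hypothetical counts: a word the character uses disproportionately often
# scores high even if it is not their most frequent word overall.
print(log_likelihood(50, 10, 10_000, 500_000))
```

A word used at exactly the corpus-wide rate scores zero, which is why this surfaces characteristic words rather than merely common ones.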

minimaxir | 10 years ago
> If I know ALL the words spoken by someone, then I know which words they say the most without resorting to any tests simply by counting.

From the text, the author is performing a statistical test (chi-squared) for which words are most unique to a character, not which words they say the most (although the two metrics are somewhat correlated).
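
For each word, that test amounts to a 2x2 contingency table (this word vs. all other words, this character vs. everyone else). A stdlib-only sketch with hypothetical counts; the author's actual code may differ:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table.

    a: occurrences of the word in the character's speech
    b: all other words spoken by the character
    c: occurrences of the word in everyone else's speech
    d: all other words spoken by everyone else
    """
    n = a + b + c + d
    # Shortcut formula: n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# A word used in exact proportion to the character's share of dialogue
# scores 0; a disproportionate word scores high (hypothetical counts).
print(chi_square_2x2(10, 990, 100, 9_900))     # proportional -> 0.0
print(chi_square_2x2(40, 9_960, 60, 490_040))  # disproportionate -> large
```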

nanis | 10 years ago
Also, I am going to go out on a limb here and guess that R's `read.csv` doesn't do what one hopes it would when fed this CSV:

    10,3,Brian,"You mean like the time you had tea with
    Mohammad, the prophet of the Muslim faith?
    Peter:
    Come on, Mohammad, let's get some tea.
    Mr. T:
    Try my ""Mr. T. ...tea.""
    "

Well, it seems people are not understanding the problem with this line. Here is a screenshot of the original script: http://imgur.com/pcu5N2U

    Brian: 	You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? [flashback #3]
    Peter: 	Come on, Mohammad, let's get some tea. [Mohammad is covered by a black box with the words "IMAGE CENSORED BY FOX" printed several times from top to bottom inside the box. They stop at a tea stand.]
    Mr. T: 	Try my "Mr. T. ...tea." [squints]
There, three characters speak.

However, R's read.csv will assign all three characters' speech to Brian: http://imgur.com/gLpPKdl

    > x[596, ]
        Season Episode Character
    596     10       3     Brian
              Line
    596 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \n

    > x[597,]
        Season Episode Character
    597     10       3     Brian
                                                Line
    597 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \nMr. T:\nTry my "Mr. T. ...tea." \n
as well as seemingly duplicating part of the conversation.

PS: Both "Muhammad" and "Mohammad" appear in the data, presumably under-counting references to the prophet.
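
The root of the problem is in the data itself: the doubled quotes make this valid CSV, so the other two speakers' dialogue legitimately sits inside one quoted field on Brian's row, and any conforming parser will credit it all to Brian. Python's csv module shows the same thing (the duplication is a separate read.csv quirk):

```python
import csv
import io

# The record as it appears in the dataset: one quoted "Line" field that
# contains the other speakers' turns and doubled-quote escapes.
raw = (
    '10,3,Brian,"You mean like the time you had tea with\n'
    "Mohammad, the prophet of the Muslim faith?\n"
    "Peter:\n"
    "Come on, Mohammad, let's get some tea.\n"
    "Mr. T:\n"
    'Try my ""Mr. T. ...tea.""\n'
    '"\n'
)

rows = list(csv.reader(io.StringIO(raw)))
print(len(rows), len(rows[0]))  # one record, four fields
print(rows[0][2])               # Brian gets credited with everything
```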

ZoF | 10 years ago
This implies there aren't future episodes upon which this type of statistical analysis could be applied.

This also strongly implies you think the author is a 'budding data scientist' out of his/her depth.

This is very much a 'sample' given the context that South Park is still releasing new episodes.

FYI all elitist 'statisticians' ...

make3 | 10 years ago
Wouldn't the fact that he/she does not have the future episodes' text, and that this dataset is used as a sample of all the South Park ever to be written (in a predictive mode), make this make sense?

JoeAltmaier | 10 years ago
Hm. The show is still running? Then the show can be considered a sample of what the characters (OK, the writers) will say / put in their mouths. The statistics then have predictive value.

wodenokoto | 10 years ago
> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between identifying characteristic words using log likelihood versus using tf-idf?

minimaxir | 10 years ago
Relevant line in code:

    # remove sparse terms
    all.tdm.75 <- removeSparseTerms(all.tdm, 0.75) # 3117 / 728215

I believe that's the sparsity threshold: terms absent from more than 75% of the documents get dropped.
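
Per the tm package docs, removeSparseTerms(x, sparse) drops terms whose sparsity, i.e. the fraction of documents they do not appear in, exceeds sparse; with 0.75 that keeps only terms present in at least a quarter of the documents. It filters on document frequency, not tf-idf. A rough Python sketch of that filter over a hypothetical toy corpus:

```python
# Each "document" is the set of words used in one episode (made-up data).
docs = [
    {"kenny", "dude"},
    {"kenny", "dude"},
    {"kenny", "school"},
    {"dude", "school"},
    {"kenny", "mephesto"},
]

MAX_SPARSITY = 0.75  # mirrors removeSparseTerms(all.tdm, 0.75)

vocabulary = set().union(*docs)
kept = {
    term
    for term in vocabulary
    # sparsity = share of documents the term is absent from
    if sum(term not in doc for doc in docs) / len(docs) <= MAX_SPARSITY
}
print(sorted(kept))  # "mephesto" (sparsity 0.8) is dropped
```
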

LoSboccacc | 10 years ago
I would have loved to see the log-likelihood characterization for the Canadian characters, even if they aren't part of the main cast.

peg_leg | 10 years ago
This should be nominated for an Ig Nobel.

agentgt | 10 years ago
I wonder how the results would change if it was based not on words but rather by lines (not string lines but actor lines in conversation).

It's also funny how Stan talks more than Kyle, given that the show now has a recurring joke making fun of Kyle's long educational dialogues.

cdubzzz | 10 years ago
Maybe because of Kyle's decision to not give long speeches last season (:

flashman | 10 years ago
It would definitely change. For instance, I'd expect Kyle's words-per-sentence (or at least his 90th-percentile sentence length) to be higher than Stan's, due to his speeches.
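
That comparison is straightforward to sketch once lines are grouped per character; splitting on sentence-ending punctuation is a crude stand-in for real sentence segmentation, and the dialogue below is made up:

```python
import re

# Hypothetical dialogue; real input would be the scraped scripts.
lines = {
    "Kyle": "You see, I learned something today. When we judge people "
            "before we know them, we only hurt ourselves.",
    "Stan": "Dude. This is pretty messed up right here.",
}

def words_per_sentence(text):
    """Crude sentence split on ., ! and ?; returns a word count per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

for character, text in lines.items():
    wps = words_per_sentence(text)
    print(character, sum(wps) / len(wps))
```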