Submetrics – Search for your favorite show

[+] alexholehouse|10 years ago|reply

I don't understand:

1) What are the 'top words' that appear when you search for a show? I just get a bunch of profanities (for basically any show, even those which are PG-13). Is this meant to be the top words found in the show's subtitles (it's not) or the most searched-for words (in which case, why am I being shown that)? Further searches suggest that some shows do work (e.g. Homeland or Game of Thrones).

2) Expanding the 'top words' gives (apparently) a top 100, except many words appear more than once - in my 'top words' for 'The Simpsons', 'MOM' appears 7 times.

3) What are the 'Top topics'? Again, examining The Simpsons, the top topics are, 'Case/investigation', 'noisey', and 'spooky'.

4) Browser 'back' doesn't work from top topics or top words

Edit: Having read the 'about', I'm feeling far less critical, given this is part of a Big Data course project. Initially, I wondered if the prevalence of profanities in speech (generally) was causing a weird biasing effect (i.e. a single word being said repeatedly), but given there shouldn't be any 'fuck's in The Simpsons/Modern Family/Friends, my guess is something may be off on the back-end?

[+] spgenot|10 years ago|reply

First of all, this is all still very early beta work! I'm not sure how it got to HN, but here it is, so all your feedback is great. I'll try to answer some of your questions as best I can:

1) The top words are those that best characterize the show. This is not a perfect science; they are an output of the LDA algorithm, but they already give a good indication. Some words indeed shouldn't be there. Possible explanations: a subtitle mistake or a bug...

2) The words that appear more than once are again a glitch, and should be fixed. Again, work in progress...

3) The top topics are found using a topic modelling algorithm. It splits a corpus of documents into a number of topics, and every document contains a certain proportion of each topic (20% Police, 80% Terrorism, for example). The topics are bags of words, so we manually give each one the name we think fits best.

4) Again beta...

I hope the 'about' is clear enough. If you have any questions, feel free to ask!
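
For readers unfamiliar with the topic proportions described in point 3, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling. It is not the site's actual pipeline: the function name `lda_gibbs`, the toy `docs`, and the hyperparameters `alpha` and `beta` are all assumptions made for illustration. It only shows how each document ends up with a proportion per topic and how each topic is a ranked bag of words that someone then has to name by hand.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative, not optimized).

    docs: list of token lists. Returns (theta, top_words), where theta[d][k]
    is the estimated proportion of topic k in document d.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    # Each topic is just a ranked bag of words; naming it is a manual step.
    top_words = [
        [w for w, c in topic_word[k].most_common() if c > 0][:3]
        for k in range(num_topics)
    ]
    return theta, top_words


# Hypothetical "subtitle" snippets, invented for this example.
docs = [
    "police arrest suspect police case".split(),
    "bomb threat terror bomb attack".split(),
    "police case bomb attack suspect terror".split(),
]
theta, top_words = lda_gibbs(docs, num_topics=2)
```

Printing `theta` gives one row of topic proportions per document, and `top_words` gives the words someone would look at when choosing a topic name. With a corpus this small the split between topics is not guaranteed to be meaningful.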

[+] noxToken|10 years ago|reply

3.) I think the 'Top Topics' are categorizations based upon the words found in the subtitles. Each word in the English language is mapped to a category, and based upon the word content of the show, it is assigned a category. While interesting in theory, it definitely misses the mark on certain shows. I assume The Simpsons is due to their 27 Treehouse of Horror events while the rest of the show does not necessarily have a central focus.

[+] frazras|10 years ago|reply

Right! I saw profanity in The Big Bang Theory too, but unless it was bleeped out, I don't believe that has ever happened.

[+] zdmc|10 years ago|reply

Based on the headline, I expected the site to return the name of a TV series based on a search of subtitles. i.e., "shootin some bball outside of the school"

For me, "find" implies search, while "discover" implies recommendation.

[+] vojant|10 years ago|reply

For example: Breaking Bad (http://www.submetrics.org/#/show/1069) Top topics: Party, Gossip, Show...

I don't understand how it can help me pick similar shows.

Edit: I played with it a little more, and for some TV shows it gives better results. It is certainly interesting, but it requires a lot more work to be actually useful as a TV show recommendation tool.

[+] fla|10 years ago|reply

96.58% Similar to Veronica Mars. Not sure if top words is a good metric here.

[+] TillE|10 years ago|reply

Tried it with Buffy, and the similar shows look completely unrelated. Also, I'm not sure where it's getting "king" and "dynasty" as keywords.

[+] Ygg2|10 years ago|reply

I'm wondering that too about Farscape.

Also the lack of frell makes me question it :P

[+] hias|10 years ago|reply

Top word for all shows these days seems to be 'fuck' oO

[+] domfletcher|10 years ago|reply

Yeah, it's a nice idea but doesn't seem to work at the moment. Give it a few months and someone may well implement it properly.

As an aside does anyone recognise what they've used for the data vis on http://www.submetrics.org/#/about ?

[+] dikaiosune|10 years ago|reply

It looks quite similar to a graph visualization method I once used in Gephi [1].

[1] https://gephi.github.io/images/screenshots/preview4.png

[+] matthewbauer|10 years ago|reply

Game of Thrones has top words of "rome", "england", and "france"? The only reason I can think of is if it's also including audio commentary.

[+] fla|10 years ago|reply

Interesting. Definitely seems like there is something wrong with the data.

[+] jamesbrownuhh|10 years ago|reply

This seems like a nice idea but all the results seem to be pretty much indistinguishable from noise. Nearly every show I tried returns the same genres and keywords (and I think it's reasonable to say that Frasier is NOT a crime show in space.)

If this is just counting the frequency of individual words, perhaps that's too simplistic an approach.
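
To illustrate why raw counts can make every show look alike, here is a small sketch comparing plain word frequency with TF-IDF weighting. Whether the site counts words this way is a guess; the show names and token lists are made up, and `tf_idf` is a hypothetical helper, not anything from the site.

```python
import math
from collections import Counter


def tf_idf(shows):
    """shows: dict of show name -> list of tokens.

    Returns dict of show name -> {word: score}. A word that occurs in
    every show gets log(1) = 0, so it cannot dominate any show's ranking.
    """
    n = len(shows)
    doc_freq = Counter()
    for tokens in shows.values():
        doc_freq.update(set(tokens))  # count each word once per show
    scores = {}
    for name, tokens in shows.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            w: (c / total) * math.log(n / doc_freq[w])
            for w, c in counts.items()
        }
    return scores


# Invented data: the same filler word is the most frequent in every show.
shows = {
    "space_show": "damn ship ship planet damn damn".split(),
    "crime_show": "damn case case detective damn damn".split(),
    "sitcom": "damn coffee coffee couch damn damn".split(),
}

raw_top = {name: Counter(t).most_common(1)[0][0] for name, t in shows.items()}
scores = tf_idf(shows)
weighted_top = {name: max(s, key=s.get) for name, s in scores.items()}
```

Here `raw_top` picks the shared filler word for all three shows, while `weighted_top` picks a word specific to each one, because the inverse document frequency term zeroes out anything that appears everywhere.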

[+] vkjv|10 years ago|reply

Doing some ad-hoc searches, it appears that recommendations tend to favor shows with the same writer rather than shows in the same genre. I'm guessing this is because writers tend to have a similar writing style across genres.

For example, search for a Joss Whedon show and get Joss Whedon shows recommended.
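
A sketch of how shared vocabulary could drive this, assuming (and it is only an assumption) that similarity is something like cosine similarity over word counts. The function name and the sample token lists are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Invented snippets: two shows in different genres that share one writer's
# pet phrases, and a third show in the same genre as the first.
vampire_show = "shiny wacky slay vampire shiny".split()
space_western = "shiny wacky ship captain shiny".split()
other_vampire_show = "blood vampire fang night coffin".split()

same_writer = cosine_similarity(vampire_show, space_western)
same_genre = cosine_similarity(vampire_show, other_vampire_show)
```

With these made-up inputs `same_writer` comes out higher than `same_genre`, since the repeated stylistic words outweigh the single shared genre word.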

[+] xamdam|10 years ago|reply

Top words for Seinfeld are things you can't say on TV. Data broken?

[+] JackFr|10 years ago|reply

Based on poking around the site, it looks like there's something seriously wrong with the data.

[+] unknown|10 years ago|reply

[deleted]

[+] cheriot|10 years ago|reply

That's really cool, is the raw data available for people to play with? I've been looking for something interesting for some textual analysis experiments.

[+] habosa|10 years ago|reply

Basically every show I tried just gave me a word cloud with a big "Fuck" in the middle.

[+] malkia|10 years ago|reply

I already know where "Enhance!" is going to lead me! :)

[+] zongitsrinzler|10 years ago|reply

Subtitle analysis is a really cool idea!
