1) What are the 'top words' that appear when you search for a show? I just get a bunch of profanities (for basically any show, even those which are PG-13). Is this meant to be the top words found in the show's subtitles (it's not) or the most searched-for words (in which case, why am I being shown that)? Further searches seem to show some work (e.g. Homeland, or Game of Thrones).
2) Expanding the 'top words' gives (apparently) a top 100, except many words appear more than once - in my 'top words' for 'The Simpsons', 'MOM' appears 7 times.
3) What are the 'Top topics'? Again, examining The Simpsons, the top topics are, 'Case/investigation', 'noisey', and 'spooky'.
4) Browser 'back' doesn't work from top topics or top words
Edit: Having read the 'about' I'm feeling far less critical, given this is part of a Big Data course project. Initially, I wondered if the prevalence of profanities in speech (generally) was causing a weird biasing effect (i.e. a single word being said repeatedly), but given there shouldn't be any 'fuck's in The Simpsons/Modern Family/Friends, my guess is something may be off on the back-end?
First of all, this is all still very early beta work! I'm not sure how it got to HN, but here it is, so all your feedback is great. I'll try to answer some of your questions as best I can:
1) The top words are those that best characterize the show. This is not a perfect science, and is an output of the LDA algorithm, but it already gives a good indication. Some words indeed shouldn't be there. Some possible explanations: a subtitle mistake or a bug...
2) The words that appear more than once are again a glitch, and should be fixed. Again, work in progress...
3) The top topics are found using a topic modelling algorithm. It splits a corpus of documents into a number of topics, and every document contains a certain proportion of each topic (20% Police, 80% Terrorism, for example). The topics are bags of words, so we manually give them the names we think fit best.
4) Again beta...
I hope the 'about' is clear enough. If you have any questions, feel free to ask!
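To make point 3 concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is not the site's code: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy "subtitle" documents are all assumptions made up for illustration. It shows the two outputs described above: a per-document mix of topics, and each topic as an unnamed bag of words that a human would still have to label.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]   # count of word w in topic k
    topic_total = [0] * n_topics                        # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed share of each topic in each document (each row sums to 1).
    proportions = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # Each topic is just a bag of words; naming it is a manual step.
    top_words = [
        [w for w, c in tw.most_common() if c > 0][:3] for tw in topic_word
    ]
    return proportions, top_words


# Made-up "subtitle" documents, one token list per show.
docs = [
    "police arrest suspect police evidence suspect".split(),
    "bomb threat agent bomb agent threat".split(),
    "police suspect bomb agent evidence threat".split(),
]
proportions, top_words = lda_gibbs(docs, n_topics=2)
for doc_mix in proportions:
    print([round(p, 2) for p in doc_mix])
print(top_words)
```

With a corpus this small the result depends heavily on the seed and hyperparameters, which is one plausible reason real output can look noisy.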
3.) I think the 'Top Topics' are categorizations based upon the words found in the subtitles. Each word in the English language is mapped to a category, and based upon the word content of the show, it is assigned a category. While interesting in theory, it definitely misses the mark on certain shows. I assume The Simpsons is due to their 27 Treehouse of Horror events while the rest of the show does not necessarily have a central focus.
Based on the headline, I expected the site to return the name of a TV series based on a search of subtitles. i.e., "shootin some bball outside of the school"
For me, "find" implies search, while "discover" implies recommendation.
I don't understand how it can help me pick similar shows
Edit: I played a little more, and for some TV shows it gives better results. It is certainly interesting, but it requires a lot more work to be actually useful as a TV show recommendation tool.
This seems like a nice idea but all the results seem to be pretty much indistinguishable from noise. Nearly every show I tried returns the same genres and keywords (and I think it's reasonable to say that Frasier is NOT a crime show in space.)
If this is just counting the frequency of individual words, perhaps that's too simplistic an approach.
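A small sketch of why raw word counts would look the same for every show, and how a standard reweighting (TF-IDF) changes that. This is an illustration only, not a description of what the site does; the helper name `tf_idf` and the three token lists are made up. Words that occur in every document get an inverse document frequency of log(1) = 0, so only words that distinguish one show from the others keep a non-zero score.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        term_counts = Counter(doc)
        scores.append({
            w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
            for w, c in term_counts.items()
        })
    return scores


# Made-up token lists standing in for three shows' subtitles.
docs = [
    ["you", "know", "you", "murder", "case"],
    ["you", "know", "you", "spaceship", "alien"],
    ["you", "know", "you", "radio", "caller"],
]

raw_top = [Counter(doc).most_common(1)[0][0] for doc in docs]
scores = tf_idf(docs)
weighted_top = [max(s, key=s.get) for s in scores]

print(raw_top)       # the most frequent raw word per show
print(weighted_top)  # the highest TF-IDF word per show
```

In this toy data the most frequent raw word is identical across all three shows, while the TF-IDF winner differs per show.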
Doing some ad-hoc searches, it appears that recommendations tend to favor shows with the same writer rather than shows in the same genre. I'm guessing this is because writers tend to have a similar writing style across genres.
For example, search for a Joss Whedon show and get Joss Whedon shows recommended.
That's really cool. Is the raw data available for people to play with? I've been looking for something interesting for some textual analysis experiments.
[+] [-] alexholehouse|10 years ago|reply
[+] [-] spgenot|10 years ago|reply
[+] [-] noxToken|10 years ago|reply
[+] [-] frazras|10 years ago|reply
[+] [-] zdmc|10 years ago|reply
[+] [-] vojant|10 years ago|reply
[+] [-] fla|10 years ago|reply
[+] [-] TillE|10 years ago|reply
[+] [-] Ygg2|10 years ago|reply
Also the lack of frell makes me question it :P
[+] [-] hias|10 years ago|reply
[+] [-] domfletcher|10 years ago|reply
As an aside does anyone recognise what they've used for the data vis on http://www.submetrics.org/#/about ?
[+] [-] dikaiosune|10 years ago|reply
[1] https://gephi.github.io/images/screenshots/preview4.png
[+] [-] matthewbauer|10 years ago|reply
[+] [-] fla|10 years ago|reply
[+] [-] jamesbrownuhh|10 years ago|reply
[+] [-] vkjv|10 years ago|reply
[+] [-] xamdam|10 years ago|reply
[+] [-] JackFr|10 years ago|reply
[+] [-] cheriot|10 years ago|reply
[+] [-] habosa|10 years ago|reply
[+] [-] malkia|10 years ago|reply
[+] [-] zongitsrinzler|10 years ago|reply