I'll ask plainly what others are hinting at: is this actually a service you built yourself, or are you a proxy for something like the Google Translate API[1]?
If it's your own service, it's critical that you explain how and why you arrived at your forecast availability and scalability numbers for your chosen architecture, given who you are competing with.
Alternatively, people can just download langid.py[1] and do language detection locally. This is not a particularly hard problem - I think it's doable in undergrad ML or NLP classes.
The tricky parts are usually political - are users going to be angry if you confuse Indonesian with Malaysian, and so on?
In fact, we ran a course for high school students where they learnt how a language guesser works and had to modify one themselves. A simplistic method that already works very well is:
* Create an n-gram fingerprint for each language by making a list of character uni-, bi-, and trigrams ordered by their frequency in a text. Retain the (say) 300 most frequent n-grams.
* To categorize a text, create a fingerprint for that text, then compute the sum of n-gram rank differences against each language's fingerprint. If an n-gram does not occur in a language's fingerprint, use the fingerprint size as the difference. Finally, pick the language with the lowest sum.
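The two steps above can be sketched in a few lines of plain Python. This is a minimal toy version of the rank-difference ("out-of-place") idea, not any particular library's implementation; the helper names and the tiny training texts are my own, and a real system would train each fingerprint on a few KB of text per language.

```python
from collections import Counter

def fingerprint(text, size=300):
    """Rank the `size` most frequent character uni-, bi-, and trigrams."""
    counts = Counter(
        text[i:i + n]
        for n in (1, 2, 3)
        for i in range(len(text) - n + 1)
    )
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(size))}

def out_of_place(doc_fp, lang_fp, size=300):
    """Sum of rank differences; a missing n-gram costs the full fingerprint size."""
    return sum(
        abs(rank - lang_fp[gram]) if gram in lang_fp else size
        for gram, rank in doc_fp.items()
    )

def guess(text, lang_fps):
    """Pick the language whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text)
    return min(lang_fps, key=lambda lang: out_of_place(doc_fp, lang_fps[lang]))

# Toy fingerprints built from a sentence each (far too little data for real use).
lang_fps = {
    "en": fingerprint("the quick brown fox jumps over the lazy dog "
                      "and the cat sat on the mat while the sun was shining"),
    "nl": fingerprint("de snelle bruine vos springt over de luie hond "
                      "en de kat zat op de mat terwijl de zon scheen"),
}
```

Even with these toy fingerprints, `guess("de zon scheen op de kat", lang_fps)` comes out as "nl", since most of the sentence's trigrams miss the English fingerprint entirely and each miss costs the full fingerprint size.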
Of course, you can do fancier things, such as training a SVM or logistic regression classifier with n-grams and words as features, etc.
An interesting variation is distinguishing different languages within a single text, e.g. a Dutch text with English quotes.
The design is fine, but the language used on the page itself isn't quite right.
I see three spelling errors in your language list:
- Panjabi should be Punjabi;
- Teligu should be Telugu;
- Ukraininan should be Ukrainian.
There are also a few grammar problems earlier in the document, and style problems (e.g. English doesn't use a space before sentence-ending punctuation marks).
It's probably overloaded because it's on Hacker News, and it's based on the same features (character n-grams) as Google Translate. Your text is simply too short for character n-grams to be 100% reliable.
Looks interesting. Why not have an input on the landing page where someone can try it out without even signing up? Then people could give it a spin before giving away their email address. Otherwise, the user just has to trust your 99% figure; it might be helpful to give some data around it, even if only as a footnote (on a corpus of x, over x period of time, etc.)
Also, I think it would be clearer if it said "A simple and scalable way to automatically classify text by language" instead of "A simple and scalable way to classify automatically text by language".
Design looks very clean though. Nice work.
EDIT: Also, your social media links at the bottom aren't hooked up yet.
I've used detectlanguage.com[1] in the past, which seems like a very similar service to getlang.io. With both of them it is hard to know what is behind the scenes...
I wonder how this performs on short text posts like tweets. At my last gig, where we did social media text analysis, we used a few different packages (chromium's language detector, guess-language, and our own n-gram classifier) and still had pretty low accuracy on tweets.
You guys might want to handle GET requests for the /try URL (https://getlang.io/try) as well. Currently it returns "Server Error (500)" for GET requests.
I can't stand this sentence: "A simple and scalable way to classify automatically text by language". "Classify" and "automatically" need to switch places.
[1]https://developers.google.com/translate/v2/using_rest#detect...
[1] https://github.com/saffsd/langid.py
http://www.let.rug.nl/vannoord/TextCat/
Python version: http://thomas.mangin.com/data/source/ngram.py
It's something that is fun to implement and doesn't take more than a few hours at most.
[1] http://detectlanguage.com/
As others already mentioned, it would be good to have users try examples before signup.
https://code.google.com/p/chromium-compact-language-detector...
https://github.com/mzsanford/cld