
Agglutinative Language

59 points | cubecul | 6 years ago | en.wikipedia.org

29 comments

[+] bonoboTP|6 years ago|reply
I often wonder how much of a head start the isolating nature of English gave for computing. It allowed ignoring a lot of inflectional and agglutinative complexity.

Concretely I mean it's very easy to generate text using sentence templates. Just plug in words and it works out. "The $process_name has completed running." "Like $username's comment" "Ban $username".
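In code, that template trick is plain string substitution (a toy sketch; the template names are invented):

```python
# English's isolating morphology means naive string templates mostly work:
templates = {
    "done": "The {process_name} has completed running.",
    "like": "Like {username}'s comment",
    "ban":  "Ban {username}",
}

msg = templates["done"].format(process_name="backup")
print(msg)  # The backup has completed running.
```

In an inflected language the slotted noun itself may need a case ending that depends on the surrounding template, so plain substitution is no longer enough.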

Relatedly, I think focusing NLP efforts on English masks a lot of interesting phenomena, because English text already comes in a reasonably tokenized, chunked-up, pre-digested, easy-to-handle form. For example, speech recognition systems started out with closed vocabularies of larger and larger size, and even in their toy forms you could recognize some proper English sentences. To do that in Hungarian, for example, the "upfront costs" of a "somewhat usable" system are much higher, because a closed vocabulary doesn't get you anywhere. (Similarly, learning basic English is very easy: you can build 100% correct sentences on day 1. You learn "I", "you", "see" and "hear", and can say "I see", "You see", "I see you" and "I hear Peter", which are all 100% correct. In Hungarian these are "nézek", "nézel", "nézlek", "hallom Pétert", requiring you to learn several suffixes, vowel harmony, and the definite/indefinite conjugation. The learning curve to your first 100% correct 3-5 word sentences is just steeper.)

I don't mean it's impossible to handle agglutinative languages in NLP, just that the "minimum viable model" is much simpler and more attainable for English, which on the one hand kickstarted and propelled the early research phases and on the other hand perhaps fueled a bit too much optimism.

English can seem very well structured, which can tempt one to think of language in a very symbolic, within-the-box, rule-based way: in terms of syntax trees, sets of valid sentences, etc., instead of the "fuzzy probabilistic mess" that it really is. Certainly, the syntax-tree, generative-grammar approach (Chomsky and others) gave us a lot of computer science, but this kind of "clean", purely symbolic parsing doesn't seem to be what drives today's NLP progress.

In summary, I wonder how linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic or Hungarian.

[+] romwell|6 years ago|reply
>I often wonder how much of a head start the isolating nature of English gave for computing.

That's like saying "I wonder when you stopped beating your wife"; you assume there was a head start, when, in fact, the world's first commercial computer was German[1].

And until recently, natural languages had a near-zero effect on computing. Worst case, users ended up seeing messages which weren't grammatically perfect, and it wasn't a big deal.

>I wonder how linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic

Would have? NLP has only started to matter recently, at a time when it has to work in all languages from the get-go. The current evolution includes contributions of people from many languages and cultures.

And for that matter, English makes a lot of things harder.

[1] https://en.wikipedia.org/wiki/Z4_(computer)

[+] jcranmer|6 years ago|reply
Agglutinative languages would probably work as well as isolating languages, since they tend to work by just shoving things on the end of words rather than inflecting them. It does potentially raise a segmentation problem, but I'm not really sufficiently familiar with any agglutinative language to know how hard a problem it is in practice.
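The segmentation problem can be sketched with a toy suffix-stripper (English examples; purely illustrative, since real morphological analyzers are far more involved):

```python
# Greedy single-suffix stripper. An agglutinative language would need to
# peel off several stacked suffixes and undo sound changes at the seams.
SUFFIXES = ["ness", "ish", "ing", "ly", "ed", "s"]

def segment(word):
    for suf in SUFFIXES:
        # require a plausible remaining stem (at least 3 characters)
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[:-len(suf)], suf]
    return [word]

print(segment("blueness"))  # ['blue', 'ness']
print(segment("quickly"))   # ['quick', 'ly']
```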

The difficult languages are inflectional languages, where you make things completely different instead of just tacking something on the end.

It's worth pointing out that in Fortran, the first programming language, all whitespace is completely optional: doi=0,10 is exactly the same as DO I = 0, 10. So it's not like early computing relied heavily on gratuitous whitespace.

[+] jhanschoo|6 years ago|reply
Possibly less than you think. (I'm not addressing the NLP part)

For example, speakers of every language are already used to mathematical notation, and programming languages draw more inspiration from math notation than from English.

That leaves naming. Here agglutinative languages should have an advantage: they offer more natural ways to describe roles, like how in English we may have caller and callee, rather than more clumsily camel-casing something like sumOfLists.

> linguistics and especially computational linguistics and NLP would have evolved in a non-Anglo culture, e.g. Slavic or Hungarian.

Probably not much different, except that more elements of morphology are treated together with syntax.

> Concretely I mean it's very easy to generate text using sentence templates. Just plug in words and it works out. "The $process_name has completed running." "Like $username's comment" "Ban $username".

If computing had primarily been championed by speakers of a fusional language (agglutinative languages usually have fairly "clean" morphology), I imagine that inflection libraries would be more prominently used, the way more polished English apps use a pluralizer library. One natural shape for an inflection API is a fluent API.
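A toy version of such a fluent API might look like this (all names invented; a real inflection library would cover far more rules and irregulars):

```python
class Phrase:
    """Hypothetical fluent inflection API with naive English-only rules."""

    def __init__(self, noun):
        self.noun = noun

    def plural(self):
        # crude pluralizer; real libraries handle irregulars, f->ves, etc.
        if self.noun.endswith(("s", "x", "z", "ch", "sh")):
            return Phrase(self.noun + "es")
        return Phrase(self.noun + "s")

    def possessive(self):
        suffix = "'" if self.noun.endswith("s") else "'s"
        return Phrase(self.noun + suffix)

    def __str__(self):
        return self.noun

print(Phrase("box").plural())               # boxes
print(Phrase("user").possessive())          # user's
print(Phrase("cat").plural().possessive())  # cats'
```

Each call returns a new `Phrase`, so inflections chain naturally, which is what makes the fluent style a good fit for stacking morphology.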

[+] eindiran|6 years ago|reply
Certainly English's morphosyntactic simplicity helped out NLP; your phrase "minimum viable model" hits the nail on the head. But over the last 5-10 years there has been a lot of progress on techniques for handling morphological complexity. Some of the unsupervised tokenization methods that first saw use for English (e.g. Goldsmith's work) now see use for agglutinative languages: see here for an example[0]. So it's not clear to me whether NLP in a non-Anglo culture would just have used the same techniques (arriving at practical achievements a decade later) or whether there would be fundamentally different techniques that are totally unobvious to me now.
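To give a flavor of the unsupervised approach, here is a minimal byte-pair-encoding-style learner (my own sketch, not Goldsmith's algorithm): the most frequent adjacent symbol pair is repeatedly merged into a larger subword unit, and the learned units often end up aligning with stems and suffixes.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Greedily merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # symbol-tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_merges(["walks", "walked", "walking", "talks", "talked"], 6)
print(merges)  # stems like "walk" and the suffix "ed" tend to emerge as units
```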

Re your point on language being a "[f]uzzy probabilistic mess": language is absolutely NOT a fuzzy probabilistic mess, and it's a damn shame that NLP built its success on black-box models, because it means no one bothers to realize that language isn't a mess at all. See Jelinek's law of speech recognizer accuracy[1]. Simply because we get results using messy black-box models doesn't mean that's how things work under the hood.

[0] https://www.researchgate.net/publication/221013038_Unsupervi...

[1] https://en.wikipedia.org/wiki/Frederick_Jelinek

[+] jerf|6 years ago|reply
Being able to encode English reasonably in 5 bits and comfortably in 6 (adding case and a few more nice symbols) was helpful too.
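For scale: 2^5 = 32 code points already cover the 26 letters with room to spare, which is roughly how Baudot-era teleprinter codes worked (a quick sanity check, not the actual Baudot table):

```python
# 26 uppercase letters + space fit in 5 bits, with 5 codes left over
# for controls; lowercase needs the jump to 6 bits (2**6 = 64 codes).
alphabet = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"
code = {ch: i for i, ch in enumerate(alphabet)}

encoded = [format(code[ch], "05b") for ch in "I SEE YOU"]
print(encoded)  # nine 5-bit codes, one per character
```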
[+] sansnomme|6 years ago|reply
Turkish is probably strict enough to be used as a programming language. The only downside is that its vocabulary is utterly alien for most speakers of Latin/Anglo-Saxon languages aside from some borrowed words from French and Arabic.
[+] yabadabadoes|6 years ago|reply
It's actually quite a bit easier to learn, since it has few false friends with Latin languages. I've often wondered, though, whether search engines written by English speakers around bag-of-words models can work very well in Turkish.
[+] romwell|6 years ago|reply
Sumerian, an agglutinative language, is an important plot point in a famous cyberpunk novel, Snow Crash by Neal Stephenson[1] (which also popularized the word "avatar" as we use it today).

If you find the concept interesting, you will enjoy reading the novel.

[1] https://en.wikipedia.org/wiki/Snow_Crash

[+] Bootwizard|6 years ago|reply
Can someone here explain this in an easier to understand way? This was a bit too dense for my understanding...
[+] eindiran|6 years ago|reply
The smallest unit of language that has meaning is called a morpheme. Some languages, like English, use relatively few morphemes per word: for example, the word "cats" can be broken into two morphemes, "cat" and "-s", while "two" can't be broken down any further, so it is a single morpheme mapping to a single word.

Other languages use a lot of morphemes-per-word. One strategy to create words from morphemes is called agglutination (meaning to glue things together). An agglutinative language takes all the morphemes that are going to go into a word, and with minimal or no changes, glues them together to form a word.

For example, the Yupik word "tuntussuqatarniksaitengqiggtuq" means "He had not yet said again that he was going to hunt reindeer". It is formed by taking the following morphemes and agglutinating them:

"tuntu-ssur-qatar-ni-ksaite-ngqiggte-uq"

[+] bonoboTP|6 years ago|reply
Agglutinative just means you glue (the -glu- refers to this) pieces (suffixes) onto the end of words to express lots of things. This exists in English as well, but in restricted forms: for example blue+ish, quick+ly, blue+ness, look+ed. In an agglutinative language, this is how most things are expressed.

For example a totally normal Hungarian word is: szolgáltatásaiért = szolgá+l+tat+ás+a+i+ért = for his/her/its services. Szolga means servant, from Slavic origin. Szolgál is a verb meaning to serve. Szolgáltat means to provide service. Szolgáltatás means service (as in "goods and services", "internet service", etc.). Szolgáltatása means his/her/its service. Szolgáltatásai means his/her/its services. Szolgáltatásaiért means "for his/her/its services".
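That buildup can be replayed mechanically; in this particular word plain concatenation happens to work (illustration only; Hungarian in general also needs vowel harmony and stem changes, and the glosses below just restate the breakdown above):

```python
stem = "szolgál"  # to serve
suffix_glosses = [
    ("tat", "causative: to provide service"),
    ("ás",  "nominalizer: service"),
    ("a",   "his/her/its"),
    ("i",   "plural, after a possessive"),
    ("ért", "for"),
]

word = stem
for suffix, gloss in suffix_glosses:
    word += suffix
    print(word, "-", gloss)

print(word == "szolgáltatásaiért")  # True
```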

[+] sansnomme|6 years ago|reply
Simply put: a lot of the grammar is based on appending to words. E.g. the Turkish word for book is kitab (shared by a bunch of other Middle Eastern languages too). "My book" is kitabım. "Your book" is kitabsin. (Note: the last example is vastly simplified; a proper Turkish speaker should correct it.)

It allows for a lot of really short sentences; here's a nonsensical example:

His book is on fire - kitabı yanıyor.

The word endings are sufficient to provide context and meaning.

If you find Turkish to be too difficult to learn, try Malay. It's also agglutinative and used by ~300 million people (Malay and Indonesian are for all practical purposes the same language).

[+] Jhsto|6 years ago|reply
The article notes that in some languages it is possible to form whole sentences by chaining suffixes onto words.

An example in Finnish:

- juosta (the basic form of the verb "to run")

- juoksen (I run)

- juoksentelen (I run around)

- juoksentelisinkohan (I wonder should I run around)

- juostaankohammekohaan (I wonder do we run)

The last two forms are very rarely used, and I have no idea whether the last one is even correct, though I have some friends who insist on talking like this. Usually people express the same thing with more words; juoksentelisinkohan is roughly equivalent to:

- Mietin, että pitäisikö minun juosta ympäriinsä.

- I wonder / that / should / my (in this context, me) / run / around.

The slashes separate the words.

Yet it would be perfectly fine to just append a question mark to juoksentelisinkohan or juostaankohammekohaan and it would be a one-word sentence. An interesting remark: in practice the question mark is redundant in both cases, as the -ko- part of the word already restricts its only interpretation to a question.

I have absolutely no idea how one would formalize all this.
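One classic formalization is finite-state morphology (two-level morphology was in fact developed with Finnish in mind). A drastically simplified sketch of the generator side for this one word, with suffix glosses that are only approximate:

```python
def attach(stem, suffix):
    # toy alternation rule: stem-final 'e' drops before a suffix
    # starting with 'i'; real systems encode many such rules as
    # finite-state transducers
    if stem.endswith("e") and suffix.startswith("i"):
        stem = stem[:-1]
    return stem + suffix

word = "juokse"            # inflectional stem of juosta, "to run"
for suffix in ("ntele",    # frequentative: "around"
               "isi",      # conditional: "should"
               "n",        # first person singular
               "ko",       # question particle
               "han"):     # "I wonder" flavor
    word = attach(word, suffix)

print(word)  # juoksentelisinkohan
```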

[+] monkeycantype|6 years ago|reply
I was just reading this yesterday after the term came up in a Japanese grammar book.
[+] foobar_|6 years ago|reply
Forth is probably the only agglutinative programming language, in a way.