

The next trick would be to create a matrix that is as close as possible to all of the languages and has no gaps from missing words. Using English as the identity is probably a bit ethnocentric, and another reification of the three-percent problem of literary translation: http://www.rochester.edu/College/translation/threepercent/ .

Maybe we can call this new matrix Mondoshawan?


sls56|8 years ago

Yes, definitely! We didn't want to complicate the repository, but from a few in-house experiments we already know that it is possible to improve the rotation matrices by:

1. First aligning every language to a reference language (English)

2. Then defining a new reference as the mean vector across all the languages for each entry in the training dictionary

3. Re-aligning the languages to this new reference "language"

4. Repeating steps 2 and 3 until convergence
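For anyone who wants to play with the idea, here is a minimal NumPy sketch of that loop. It is not the code from our experiments, just an illustration under some assumptions: each language is given as an array of shape (n_entries, dim) whose row i is the vector for the same training-dictionary entry in every language (so no missing words), the first array plays the role of English, and each alignment step is solved as an orthogonal Procrustes problem via SVD. The names `procrustes_rotation` and `align_to_mean` are made up for this example.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Orthogonal W minimising ||source @ W - target|| (Frobenius norm)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def align_to_mean(embeddings, n_iter=50, tol=1e-8):
    """Iteratively align every language to the mean of the aligned languages.

    embeddings: list of (n_entries, dim) arrays with matching rows
    (an assumption of this sketch). embeddings[0] is the initial reference.
    Returns one rotation matrix per language and the final mean reference.
    """
    # Step 1: align everything to the initial reference language.
    reference = embeddings[0]
    rotations = [procrustes_rotation(X, reference) for X in embeddings]

    for _ in range(n_iter):
        # Step 2: the new reference is the per-entry mean of the aligned vectors.
        aligned = [X @ W for X, W in zip(embeddings, rotations)]
        new_reference = np.mean(aligned, axis=0)

        # Step 3: re-align every language to the new reference.
        rotations = [procrustes_rotation(X, new_reference) for X in embeddings]

        # Step 4: stop once the reference has stopped moving.
        shift = np.linalg.norm(new_reference - reference)
        reference = new_reference
        if shift < tol:
            break

    return rotations, reference
```

Calling `rotations, mean_language = align_to_mean([en, fr, de])` would give one rotation per input array plus the mean "language" itself. The stopping rule (Frobenius norm of the change in the reference) and the default `n_iter` and `tol` values are arbitrary choices for the sketch, and handling dictionary entries that are missing in some languages is left out entirely.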

As you suggest, this mean language could potentially be a set of really high-quality word vectors in its own right, but we haven't looked at this yet...