
Using aligned word vectors for instant translations with Python and Rust

112 points| beau | 4 years ago |instantdomainsearch.com | reply

35 comments

[+] beau|4 years ago|reply
We've released the underlying Rust implementation here: https://github.com/InstantDomain/instant-distance with Python bindings at https://pypi.org/project/instant-distance — feedback welcome!
[+] habibur|4 years ago|reply
For Linux, in the Makefile change the copy command to

cp target/release/libinstant_distance.so instant-distance-py/test/instant_distance.so

and it works. Built and running. The main tree was macOS only.

Here's resource consumption in a sample run.

Time: 4.49 s, Memory: 1552 MB.

Single word. Three langs including en.

[+] arbol|4 years ago|reply
Did you try spaCy's most_similar method? It's written in Cython, so it is presumably quite fast as well. Thanks for the Rust implementation though, I will most likely use this.
[+] Fiahil|4 years ago|reply
I’ve not much to say on the actual lib, it seems great! However, don’t feel compelled to put all your rust code into a single lib.rs. You can split your work into several files and use ‘pub use’ and ‘mod’ in lib.rs to re-export your functions & types into a public API of your choosing.

cargo check and format time might also slightly improve!

[+] maeln|4 years ago|reply
This webpage constantly uses a significant amount of CPU for no apparent reason (as far as I can see it is mostly a static webpage). What the hell? Is it mining crypto in the background?
[+] maybevain|4 years ago|reply
At a quick glance it seems like some React component is constantly re-rendering.

Quick glance in this case: took a snapshot of a couple of seconds on the Performance tab and saw a lot of React-related calls.

[+] beau|4 years ago|reply
Sorry, this page had a useEffect/setState render loop. We are running react@experimental with concurrent mode, and missed the error. Rolling out a fix now. Thanks!
[+] denysvitali|4 years ago|reply
> For example, here are the results of translating the English word "hello":

> Language: fr, Translation: bonjours

> Language: fr, Translation: bonsoir

> Language: fr, Translation: salutations

> Language: it, Translation: buongiorno

> Language: it, Translation: buonanotte

> Language: fr, Translation: rebonjour

> Language: it, Translation: auguri

> Language: fr, Translation: bonjour,

> Language: it, Translation: buonasera

> Language: it, Translation: chiamatemi

Is it just me, or are these machine translations worse than ... Google Translate?

[+] beau|4 years ago|reply
These results are less accurate than Google Translate. But they are far faster to get, and far less expensive to generate (see https://cloud.google.com/translate/pricing). Our goal here is speed. We want to search through many possibilities as quickly as possible.

The word vectors have been aligned in multiple languages. Using an approximate nearest neighbor search we are able to find the nearest vector to the input in multiple languages very quickly.
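
A minimal sketch of that lookup, for illustration only. It is not the instant-distance API: it swaps the approximate index for an exact brute-force scan, and the 3-dimensional vectors are made-up toy values rather than real fastText data. Because the vectors for all languages live in one aligned space, a single search over the combined vocabulary returns neighbours from every language at once.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_words(query, vocab, k=3):
    """Exact nearest-neighbour scan over (lang, word, vector) entries.

    A real system would replace this O(n) scan with an approximate
    index; the result shape is the same.
    """
    ranked = sorted(vocab, key=lambda entry: cosine_distance(query, entry[2]))
    return [(lang, word) for lang, word, _ in ranked[:k]]


# Hypothetical toy data: real aligned vectors have hundreds of dimensions.
ENGLISH = {"hello": [0.90, 0.10, 0.00]}
OTHER_LANGUAGES = [
    ("fr", "bonjour", [0.88, 0.12, 0.01]),
    ("it", "buongiorno", [0.85, 0.20, 0.00]),
    ("fr", "fromage", [0.00, 0.10, 0.95]),
]

if __name__ == "__main__":
    for lang, word in nearest_words(ENGLISH["hello"], OTHER_LANGUAGES, k=2):
        print(f"Language: {lang}, Translation: {word}")
```

The brute-force scan touches every vector per query, which is exactly the cost an approximate index is meant to avoid on a vocabulary of millions of words.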

To keep the example simple, we did not try to filter the data through hand-built language dictionaries. In fact, we simply drop words in other languages that also appear in the English .vec file. Words like "ciao" appear frequently enough in otherwise English sentences that the example code drops it from Italian, and so it is not shown in the results:

% curl -s "https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki..." | grep -n ciao
50393:ciao 0.0120 ...

One improvement would be to filter out any words that do not appear in a hand-curated dictionary instead of filtering out words that already appear in English. We decided not to show how to do this because we'd already introduced a few concepts, like aligned word vectors and approximate nearest neighbour searches, and wanted to keep the example as simple as possible.
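
The two filtering strategies can be sketched side by side. This is an illustration under assumptions, not the example code from the article: the function names and the tiny word lists below are hypothetical.

```python
def drop_english_overlap(words, english_vocab):
    """Current approach: drop any foreign word that also appears in English."""
    return [w for w in words if w not in english_vocab]


def keep_dictionary_words(words, dictionary):
    """Suggested improvement: keep only words found in a curated dictionary."""
    return [w for w in words if w in dictionary]


# Hypothetical toy data standing in for the .vec vocabularies.
english_vocab = {"hello", "ciao", "pizza"}
italian_words = ["ciao", "buongiorno", "pizza", "chiamatemi"]
italian_dictionary = {"ciao", "buongiorno", "pizza"}

if __name__ == "__main__":
    # Loses "ciao" because it also shows up in the English vocabulary.
    print(drop_english_overlap(italian_words, english_vocab))
    # Keeps "ciao" and instead drops whatever the dictionary does not list.
    print(keep_dictionary_words(italian_words, italian_dictionary))
```

The overlap filter needs no extra data, which is why it suits a short example; the dictionary filter trades that simplicity for the cost of obtaining a word list per language.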

[+] toxik|4 years ago|reply
Google Translate is state of the art, so I’m not sure why that would be surprising. That said, is there something wrong with the translations offered?
[+] ampdepolymerase|4 years ago|reply
It would be better to run the vectors through an attention layer if you want sentence to sentence translation.
[+] fulafel|4 years ago|reply
Was disappointed this can't translate from Python to Rust.
[+] dukeofdoom|4 years ago|reply
Can something like this be done to compare/translate subsequences of the COVID genetic code to SARS and other virus genetic codes? It would be interesting to see how much overlap there is, and it would further the research into where it came from.

Full genome of COVID-19 is available:

https://www.snapgene.com/resources/coronavirus-resources/?re...

[+] nestorD|4 years ago|reply
Bioinformaticists have been able to do that with traditional algorithms for years (dynamic programming gets you a long way to compute an edit distance for example).
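
For reference, a minimal sketch of the dynamic-programming edit distance mentioned above (plain Levenshtein distance with unit costs). It is a toy illustration only: real sequence-alignment tools use tuned scoring schemes and heuristics, and the short strings in the usage lines are made up, not real genome data.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b.

    prev[j] holds the distance between the first i-1 characters of a
    and the first j characters of b; curr is the row being filled in.
    """
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if char_a == char_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Made-up short sequences; one base is missing from the second.
    print(edit_distance("ACGT", "AGT"))
    print(edit_distance("kitten", "sitting"))
```

Keeping only two rows brings memory down to O(len(b)), though the running time is still O(len(a) * len(b)).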

It was probably the first thing done once the COVID-19 genome was made public. A quick search gave me this summary of the results: https://www.news-medical.net/health/How-Does-the-SARS-Virus-...

[+] mattkrause|4 years ago|reply
It sounds like you're thinking of "sequence alignment", which is a pretty standard bioinformatics tool.

BLAST (Basic Local Alignment Search Tool) is one common version, and the NIH's NCBI has a variety of nice online tools here: https://blast.ncbi.nlm.nih.gov/Blast.cgi

Note that it does take a little bit of background knowledge to interpret: some motifs are just really common, others are shared.

[+] PaulHoule|4 years ago|reply
Nice example.

The short text and the fact that your application would tolerate or celebrate catchy neologisms play to fasttext's strengths.

[+] shakow|4 years ago|reply
> fast translates to vite in French

Only as an adverb; otherwise it should be "rapide".

[+] aitk|4 years ago|reply
At first glance at the title, I thought it was translating Python code to Rust code.
[+] adsharma|4 years ago|reply
It may not be a bad candidate for writing the Rust part in Python and then running it through py2many to generate the Rust.