statusfailed | 2 years ago

OK I played with this some more, it's so good - exactly what I dreamed of!!

A couple more bits of feedback:

(1) The "suggestion" / "I'm unsure" etc. feedback is fantastic

(2) Word segmentation doesn't seem to be working properly, and so the context lookup doesn't work right. Example:

中国 should be parsed as a single word ("China"), but it's parsed as individual characters ("middle", "kingdom").

This means I have to tab out to a dictionary to look up words, and it's a bit annoying to select the right text.

Hadjimina|2 years ago

Thanks! The tricky bit is making this work for languages such as Chinese, where spaces are not used to separate words. We should implement a real Chinese lemmatizer there to chunk the words.

Not sure if you saw it, but we already have pinyin in there. If you open up the settings and tick "show pronunciations" they will appear above the chat messages.

yorwba|2 years ago

> We should implement a real Chinese lemmatizer there to chunk the words.

Or find all substrings that are listed in a dictionary (nearly everyone uses CC-CEDICT: https://www.mdbg.net/chinese/dictionary?page=cc-cedict ) and give translations for all of them. That way, the user won't be limited to any particular chunking granularity, which is always a finicky aspect of word segmenters to fine-tune.
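A minimal sketch of that substring lookup, to make the idea concrete. The function name `dictionary_matches`, the `max_len` cap, and the three-entry toy dictionary are all assumptions made for illustration; a real version would load the CC-CEDICT file into the mapping instead.

```python
def dictionary_matches(text, dictionary, max_len=8):
    """Return (start, end, word, gloss) for every substring of `text`
    that appears as a key in `dictionary`.

    `max_len` caps the substring length so the scan stays cheap;
    the value 8 is an arbitrary assumption, not a CC-CEDICT property.
    """
    matches = []
    for start in range(len(text)):
        longest_end = min(len(text), start + max_len)
        for end in range(start + 1, longest_end + 1):
            word = text[start:end]
            if word in dictionary:
                matches.append((start, end, word, dictionary[word]))
    return matches


# Toy stand-in for a real dictionary (hypothetical entries, glosses
# taken from the example earlier in the thread).
TOY_DICT = {
    "中": "middle",
    "国": "kingdom",
    "中国": "China",
}

if __name__ == "__main__":
    for start, end, word, gloss in dictionary_matches("中国", TOY_DICT):
        print(start, end, word, gloss)
```

Since every match is returned together with its offsets, the UI can show "China" for the whole span and still offer the single-character glosses, without committing to one segmentation.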

statusfailed|2 years ago

At least for Chinese there are off-the-shelf word segmenters you can use, like jieba[0]. I used it in gptlingo and it Just Works(TM).
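Roughly what that looks like, as a sketch rather than the actual gptlingo code. `jieba.lcut` is jieba's list-returning cut function; the `segment` wrapper and its `cutter` parameter are hypothetical names added here so the segmenter can be swapped out.

```python
def segment(text, cutter=None):
    """Split `text` into word tokens, dropping whitespace-only tokens.

    `cutter` is any callable mapping a string to a list of tokens.
    When omitted, jieba.lcut is used (requires `pip install jieba`).
    """
    if cutter is None:
        import jieba  # imported lazily so the wrapper works without it
        cutter = jieba.lcut
    return [token for token in cutter(text) if token.strip()]


if __name__ == "__main__":
    # With jieba installed, this prints the segmenter's word tokens.
    print(segment("我喜欢中国菜"))
```

Each returned token can then be used as the lookup unit for the context popup instead of a single character.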

The "show pronunciations" setting just turns on pinyin above characters. What I want is to type pinyin and enter Chinese characters. Actually, showing the pinyin above characters is quite distracting!

[0]: https://pypi.org/project/jieba/