top | item 42937651


routerl | 1 year ago

Thanks for the kind words!

I'm using Jieba[0] because it strikes a nice balance between speed and accuracy. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-processing after segmentation. For example, Jieba tends to split chengyu into two words, but I've decided they should be displayed as a single word, since a chengyu is typically a single entry in dictionaries.

[0] https://github.com/fxsjy/jieba
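
To make the chengyu step concrete, here is a minimal sketch of one possible post-segmentation pass. It is not the project's actual code: `merge_chengyu`, the `chengyu` set, and the `max_parts` limit are hypothetical names for illustration. It assumes you already have a token list from the segmenter and a set of known chengyu, and it rejoins adjacent tokens whose concatenation is in that set.

```python
def merge_chengyu(tokens, chengyu, max_parts=3):
    """Rejoin adjacent tokens whose concatenation is a known chengyu.

    tokens    -- segmenter output, e.g. ["一步", "登天", "："]
    chengyu   -- set of idioms to keep whole (hypothetical lookup table)
    max_parts -- most tokens to try joining at once (an assumed limit)
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the widest window first so a longer idiom wins over a shorter one.
        for width in range(min(max_parts, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + width])
            if candidate in chengyu:
                merged.append(candidate)
                i += width
                break
        else:
            # No window starting at this token forms a chengyu; keep it as is.
            merged.append(tokens[i])
            i += 1
    return merged
```

With Jieba, the token list would come from something like `jieba.lcut(text)` after loading a custom dictionary with `jieba.load_userdict(path)`. The function itself does not depend on Jieba, so it can sit behind any segmenter.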


rmccrear|1 year ago

Great project! It's fascinating how hard segmentation is and how many approaches there are. I thought I'd mention a trick that lets you segment without a backend. When you double-click Chinese text in the browser, it highlights an entire word. For example, try double-clicking on the text here: 一步登天：走一步就到天堂美好境地。 It highlights/segments the first four characters as a chengyu, and the rest as one- or two-character words. I haven't been able to discover what method Apple and Microsoft use to segment, but it seems to do a good job. You can even use JavaScript's Range.expand() function to do this programmatically. I once even made a little JS library that can run in the background and segment words on a page.

yorwba|1 year ago

Last I checked, browsers basically wrap ICU's word-break iterator: https://unicode-org.github.io/icu/userguide/boundaryanalysis...

imron|1 year ago

That’s neat!