
entilzha | 1 year ago

(Author Here)

Good description! Maybe what the parent got mixed up on: an alternate way to view this is as trying to chunk bytes so that each chunk carries roughly similar information. We initially tried a bunch of patching schemes, e.g., keeping a running total of entropy until the total exceeds a threshold, but ended up finding that simple things worked better.

I'll see if we can add more information about the small CNN in the next update to the arXiv paper.
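The "running total of entropy" scheme mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; the function names and the `budget` parameter are hypothetical, and the per-byte entropies would in practice come from a small byte-level language model.

```python
import math

def entropies_from_probs(prob_rows):
    """Shannon entropy (in bits) of each next-byte probability distribution."""
    return [-sum(p * math.log2(p) for p in row if p > 0) for row in prob_rows]

def patch_by_entropy_budget(entropies, budget):
    """Sketch of the 'running total' scheme: accumulate per-byte entropy
    and start a new patch once the accumulated total exceeds `budget`.
    Returns patches as (start, end) half-open index ranges."""
    patches, start, total = [], 0, 0.0
    for i, h in enumerate(entropies):
        total += h
        if total > budget:
            patches.append((start, i + 1))  # patch covers bytes [start, i]
            start, total = i + 1, 0.0
    if start < len(entropies):
        patches.append((start, len(entropies)))  # trailing partial patch
    return patches
```

With uniform entropies of 1 bit per byte and a budget of 1.5 bits, every patch ends up two bytes long, which illustrates why this scheme equalizes information per patch rather than bytes per patch.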

discuss

order

cschmidt | 1 year ago

I'm curious if you're aware of some papers from around 2005 on using contextual entropy to do unsupervised word segmentation on Chinese, and other languages that don't use spaces for word boundaries.

https://aclanthology.org/Y03-1017/
https://aclanthology.org/I05-1009/
https://aclanthology.org/P06-2056/

Exactly the same approach of segmenting a word when the entropy goes up compared to the previous byte.

entilzha | 1 year ago

I wasn't aware of this work, at least, but thanks for the refs! I'm always curious to read papers from 10-20+ years ago that have similarly inspired ideas. If it makes sense, we'll mention those in the next related-work update.

psb217 | 1 year ago

One way of thinking about the "Approximate Monotonic Constraint" is that you're running a quick and dirty edge detector on the entropy. I.e., you're splitting based on the gradient of per-byte entropy wrt timestep, compared to detecting an edge based on the gradient of per-pixel intensity wrt pixel coordinates. It would be interesting to look at the raw sequences of per-byte entropies to see how strongly these sorts of "edges" correlate with human-interpretable boundaries (words, prefixes, suffixes, etc.).
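The 1-D "edge detector" reading above can be sketched as a split whenever entropy jumps relative to the previous byte. Again a sketch under stated assumptions: the function name and `threshold` parameter are hypothetical, and this is the entropy-jump idea as described in the thread, not the paper's exact constraint.

```python
def patch_by_entropy_jump(entropies, threshold):
    """Start a new patch when per-byte entropy rises by more than
    `threshold` relative to the previous byte, i.e. a crude edge
    detector on the entropy sequence. Returns (start, end) ranges."""
    boundaries = [0]
    for i in range(1, len(entropies)):
        if entropies[i] - entropies[i - 1] > threshold:
            boundaries.append(i)  # an "edge": entropy jumped upward here
    boundaries.append(len(entropies))
    return list(zip(boundaries, boundaries[1:]))
```

Intuitively, entropy spikes at hard-to-predict positions like the start of a new word, so these "edges" tend to land near word boundaries, which is exactly the correlation psb217 suggests inspecting.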

yorwba | 1 year ago

Figure 4 plots the entropy of each byte in "Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin."
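For intuition about how such a per-byte entropy curve is produced, here is a toy stand-in: estimating H(next byte | previous byte) from bigram counts. The paper uses a small byte-level language model rather than bigrams; this function and its arguments are illustrative only.

```python
import math
from collections import Counter, defaultdict

def bigram_next_byte_entropies(corpus, text):
    """Per-position entropy of the next-byte distribution under a toy
    bigram model fit on `corpus` (both are bytes objects). Returns one
    entropy value (in bits) per byte of `text` except the last."""
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1  # count byte bigrams
    out = []
    for a in text[:-1]:
        total = sum(counts[a].values()) or 1
        out.append(-sum((c / total) * math.log2(c / total)
                        for c in counts[a].values()))
    return out
```

Plotting such a curve over a sentence would show the spikes-at-word-starts pattern that the entropy-based patching schemes exploit.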