dserban | 21 days ago

At some point, I needed to write a function that, given a collection of product titles, picks one that is neither the longest nor the shortest: the one that best captures the essence of the product without being excessively verbose. For example, given the product titles below, it should pick "Portable Two-Way Translator, Handheld".

Portable Two-Way Translator, Handheld

Portable Handheld Translator

Handheld Two-Way Translator

Electronic Two-Way Portable Instant Voice Translator, 40 Languages, Handheld

Based on previous experience with centroid-based algos, the function I wrote does a first pass throwing all words from all product titles into one big bag, then computing a centroid (frequency histogram with low-frequency words removed). The second pass is to compute a cosine similarity score for each product title (its own frequency histogram against the centroid). Whichever product title is the most similar to the centroid wins.
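Here is a rough Python sketch of the two passes described above, not the original function. The details are my guesses: the `tokenize` helper, the `min_count=2` cutoff for "low-frequency" words, and the use of raw word counts as histogram weights are all assumptions, since the post does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(title):
    # Assumed tokenizer: lowercase, keep runs of letters, digits, and hyphens,
    # so "Two-Way" stays a single token and trailing commas are dropped.
    return re.findall(r"[a-z0-9-]+", title.lower())


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_title(titles, min_count=2):
    # First pass: throw every word from every title into one big bag.
    bag = Counter()
    for title in titles:
        bag.update(tokenize(title))

    # Centroid: the bag's histogram with low-frequency words removed.
    # min_count is a hypothetical threshold chosen for this sketch.
    centroid = Counter({w: c for w, c in bag.items() if c >= min_count})

    # Second pass: score each title's own histogram against the centroid
    # and return the most similar title.
    return max(titles, key=lambda t: cosine(Counter(tokenize(t)), centroid))


titles = [
    "Portable Two-Way Translator, Handheld",
    "Portable Handheld Translator",
    "Handheld Two-Way Translator",
    "Electronic Two-Way Portable Instant Voice Translator, 40 Languages, Handheld",
]
print(pick_title(titles))  # Portable Two-Way Translator, Handheld
```

With these four titles the centroid keeps only "portable", "two-way", "translator", and "handheld". The first title covers all four with no extra words, so it scores about 0.99; the two three-word titles score about 0.90, and the long one is penalized for its one-off words and scores about 0.66.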

That algo may have existed already in some academic paper somewhere, but I came up with it independently.
