exogen | 1 month ago

This could also be applied to record linkage. With search, there will usually be multiple results, and there's always a "top" match even if its confidence/score is quite low. In record linkage, at least if you're automating it, you need to minimize false positives and only automatically link records if confidence is super high that they're a match – and that doesn't just mean the top scoring match has high confidence, but that there's also no 2nd best match with a good score. If that's not the case, leave the records for manual human review.
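To make that rule concrete, here's a minimal sketch in Python. The function name, the `(record_id, score)` input shape, and both thresholds are assumptions made up for illustration; real cutoffs would have to be tuned against your own data.

```python
from typing import Optional


def auto_link(candidates: list[tuple[str, float]],
              min_top: float = 0.95,
              max_runner_up: float = 0.80) -> Optional[str]:
    """Return the record id to link automatically, or None for manual review.

    candidates: (record_id, score) pairs with scores in [0, 1].
    min_top and max_runner_up are illustrative thresholds, not recommendations.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return None  # nothing to link against

    best_id, best_score = ranked[0]
    if best_score < min_top:
        return None  # the top match is not confident enough

    # A strong second-best match means the top match is ambiguous.
    if len(ranked) > 1 and ranked[1][1] >= max_runner_up:
        return None

    return best_id
```

Anything that comes back as `None` goes in the manual review pile.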

My experience here is also related to music. Here are some cases to think about:

What's the actual title of the song "Mambo #5" vs. how you might search for it or find it referenced in other records? Mambo #5? Mambo No. 5? Mambo No. Five? Mambo Number 5? Mambo Number Five? And that's not even getting to the fact that the real title is longer, with a parenthetical. This is a case where bigrams, trigrams, or other string similarity metrics wouldn't perform very well. Same with the Beatles song: is it "Dr. Robert" or "Doctor Robert"? Most string similarity algorithms put "Dr" and "Doctor" pretty far apart, but with embedding vectors they should be practically equivalent.

How about "You've Lost that Loving Feeling"? Aren't there some dropped Gs in those gerunds? Is it You've Lost That Lovin' Feeling? You've Lost That Lovin' Feelin'? You've Lost That Loving Feelin'? In this case, string similarity (including trigrams) performs very well.
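A rough way to see the difference between the two cases is a plain trigram Jaccard similarity. This is a sketch, not any particular library's implementation; the padding scheme (two leading spaces, one trailing) is just one common convention that I'm assuming here.

```python
def trigrams(text: str) -> set[str]:
    """Set of character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


# Dropped Gs change only a handful of trigrams, so this pair scores high.
dropped_g = trigram_similarity("You've Lost That Lovin' Feelin'",
                               "You've Lost That Loving Feeling")

# "Dr." vs. "Doctor" share almost no trigrams, so this pair scores much lower.
abbreviation = trigram_similarity("Dr. Robert", "Doctor Robert")
```

With this scheme the dropped-G pair lands well above the abbreviation pair, even though a human would call both pairs obvious matches.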

How about songs with censored titles? Some records will certainly have profanity censored, but would it be like "F*ck", "F**k", "F@$k", or what? And is the censorship actually part of the canonical song title, or just some references to it?

In the "#5" and "Dr." cases, this could be solved pretty effectively by the normalization step described in the article (hardcoding what #, No., and Dr. expand to) – although even that can get pretty complicated: what do you do about numbers? Do you normalize every numerical reference, e.g. "10 Thousand", to digits, or to words? What about rarely used abbreviations, or cases where an abbreviation is ambiguous and could mean different things in different contexts? If someone has a song called "PT Cruiser" are you gonna accidentally normalize that to "Part Cruiser"? For this reason, I like to see this not as a "normalization" step, where there's a single normalized form, but rather a "query expansion" step – generate all the possible variants, and those are your actual comparison strings.
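Here's a small sketch of what I mean by query expansion. The equivalence table, the tokenizer regex, and the function names are all made up for illustration; a real table would be much larger and probably context-dependent.

```python
import re
from itertools import product

# Hypothetical groups of tokens treated as interchangeable.
EQUIVALENT_TOKENS = [
    {"#", "no.", "number"},
    {"5", "five"},
    {"dr.", "doctor"},
    {"pt", "part"},
]


def alternatives(token: str) -> list[str]:
    """All spellings considered equivalent to the token, itself included."""
    for group in EQUIVALENT_TOKENS:
        if token in group:
            return sorted(group)
    return [token]


def expand(title: str) -> set[str]:
    """Every variant of the title, used as an internal comparison key."""
    # Split "#" off as its own token so "#5" becomes "#", "5".
    tokens = re.findall(r"#|[^\s#]+", title.lower())
    choices = [alternatives(t) for t in tokens]
    return {" ".join(combo) for combo in product(*choices)}


def could_match(a: str, b: str) -> bool:
    """Two titles are candidates if any of their variants coincide."""
    return bool(expand(a) & expand(b))
```

The point of expanding rather than normalizing: `expand("PT Cruiser")` contains both "pt cruiser" and "part cruiser", so the original spelling is never lost, and `could_match("Mambo #5", "Mambo Number Five")` still comes out true. The obvious cost is that the variant count multiplies with every ambiguous token.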

It seems like embeddings could do the job of automatically considering different spellings/abbreviations of words as equivalent. I'm just a casual observer here, but I'm sure this is also a well-explored topic in speech-to-text, since you have to convert someone's utterances to match actual entity names, like movie titles for example.
