elyase|9 years ago
We evaluated using ConceptNet Numberbatch but in the end went with fastText because of its treatment of OOV words using subword information. This is important for us because we work with social media, where misspellings are very frequent, and we have found this helps a lot. Are you also looking into this sort of enhancement? How do you usually deal with OOV words?
rspeer|9 years ago
Our OOV strategy was pretty important in SemEval. The first line of defense -- so fundamental to Numberbatch that I don't even think of it as OOV -- is to see if the term exists in ConceptNet but with too low a degree to make it into the matrix. In that case, we average the vectors from its neighbors in the graph that are in the matrix.
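The neighbor-averaging fallback described above can be sketched as follows. This is an illustrative simplification, not the actual Numberbatch code: `graph_neighbors` and `matrix` are hypothetical stand-ins for ConceptNet's graph and the embedding matrix.

```python
import numpy as np

def neighbor_average_vector(term, graph_neighbors, matrix):
    """Fallback for a term that exists in ConceptNet but had too low a
    degree to get a row in the embedding matrix: average the vectors of
    its graph neighbors that *are* in the matrix.
    (Sketch only; data structures here are hypothetical stand-ins.)"""
    known = [matrix[n] for n in graph_neighbors.get(term, []) if n in matrix]
    if not known:
        return None  # truly OOV; fall through to the next strategy
    return np.mean(known, axis=0)

# Toy example: "kitteh" is in the graph but not in the matrix,
# so it inherits the average of its in-matrix neighbors.
matrix = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
graph_neighbors = {"kitteh": ["cat", "dog"]}
vec = neighbor_average_vector("kitteh", graph_neighbors, matrix)  # → [0.5, 0.5]
```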
For handling words that are truly OOV for ConceptNet, we ended up using a simple strategy of matching prefixes of the word against known words (and also checking whether a word that's supposed to be in a different language was known in English).
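The prefix-matching back-off might look something like this minimal sketch; the real SemEval code surely differs in detail (and also does the cross-language English lookup mentioned above, which is omitted here).

```python
def longest_prefix_vector(term, matrix, min_prefix=3):
    """Back off to the vector of the longest known proper prefix of an
    unknown word. Sketch of the strategy described above, assuming
    `matrix` maps known terms to vectors."""
    for end in range(len(term) - 1, min_prefix - 1, -1):
        prefix = term[:end]
        if prefix in matrix:
            return matrix[prefix]
    return None  # no known prefix long enough

# A misspelled or inflected form falls back to a known stem:
matrix = {"misspell": "vec_misspell"}
print(longest_prefix_vector("misspelling", matrix))  # → vec_misspell
```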
fastText's sub-word strategy, which is learned along with the vocabulary instead of after the fact, is indeed a benefit they have. But am I right that the sub-word information isn't present in these vectors they released?
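For comparison, the subword units fastText learns are character n-grams of the word with boundary markers, extracted roughly as below. This is a simplified sketch: the real model hashes these n-grams into a bucket table and trains their vectors jointly with the word vectors, then represents an OOV word as the sum of its n-gram vectors.

```python
def char_ngrams(word, nmin=3, nmax=6):
    """Character n-grams as fastText extracts them, with '<' and '>'
    marking word boundaries. (Simplified; the real model hashes these
    into a fixed number of buckets.)"""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(nmin, nmax + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("cat"))
# → ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
```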
There's a paper on the SemEval results that just needs to be reviewed by the other participants, and I'm also working on a blog update about it.
kortex|9 years ago