brockf | 6 years ago

Most implementations are actually moving in the opposite direction. Previously, there was a tendency to look to aggregate words into phrases to better capture the "context" of a word. Now, most approaches are splitting words into sub-word parts or even characters. With networks that capture temporal relationships across tokens (as opposed to older, "bag of words" models), multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts.
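To make the sub-word splitting concrete, here is a minimal sketch of a greedy longest-match-first splitter in the style of WordPiece. The vocabulary, the `##` continuation marker handling, and the function names are assumptions for illustration; real tokenizers learn their vocabulary from data and handle punctuation and casing more carefully.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real model.
TOY_VOCAB = {"un", "##break", "##able", "break", "##ing", "token", "##s"}


def tokenize(text, vocab=TOY_VOCAB):
    """Split on whitespace, then split each word into sub-word pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


print(tokenize("unbreakable tokens"))
```

The network then sees the pieces in order, so a multi-word pattern is just a longer run of pieces rather than a separate phrase entry in the vocabulary.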

LunaSea|6 years ago

> multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts

Indeed. Do you have an example of a library or snippet that demonstrates this?

My limited understanding of BERT (and other) word embeddings was that they only contain the word's position in the 768 (I believe) dimensional space but don't contain queryable temporal information, no?

I like ngrams as a sort of untagged / unlabelled entity.

PeterisP|6 years ago

When you use BERT (and the many things like it, such as the earlier ELMo and ULMFiT and the later RoBERTa/ERNIE/ALBERT/etc.) as the 'embeddings', you provide all the tokens in a sequence as input. You don't get an "embedding for word foobar in position 123"; you get embeddings for the whole sequence at once, so whatever corresponds to that token is a 768-dimensional (for BERT-base) "embedding for word foobar in position 123, conditional on all the particular other words that were before and after it", including very long-distance relations.

One of the simpler ways to try that out in your code seems to be running BERT-as-a-service (https://github.com/hanxiao/bert-as-service), or alternatively the Hugging Face libraries that are discussed in the original article.

It's kind of the other way around compared to word2vec-style systems. Before, you had a 'thin' embedding layer that was essentially just a lookup table, followed by a bunch of complex neural network layers (e.g. multiple Bi-LSTMs followed by a CRF). In the 'current style' you have "thick embeddings", which means running the input through all the many transformer layers in a pretrained BERT-like system, followed by a thin custom layer that's often just a glorified linear regression.
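A toy sketch of that contrast, in plain Python: a lookup table returns the same vector for "bank" in every sentence, while a single self-attention-style mixing step returns a vector that depends on the neighbouring tokens. The 2-dimensional vectors and function names are made up for illustration, and the sketch leaves out what a real BERT-like model has (learned projections, position encodings, many stacked layers), so it is not sensitive to token order.

```python
import math

# Hypothetical 2-dimensional static vectors; real models learn hundreds of dimensions.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def static_embed(tokens):
    """'Thin' embedding: a plain lookup, identical in every context."""
    return [STATIC[t] for t in tokens]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contextual_embed(tokens):
    """One toy attention pass: each output is a softmax-weighted mix of all inputs."""
    vectors = static_embed(tokens)
    outputs = []
    for query in vectors:
        scores = [dot(query, key) for key in vectors]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [
            sum(w / total * v[i] for w, v in zip(weights, vectors))
            for i in range(len(query))
        ]
        outputs.append(mixed)
    return outputs


# Same token, two contexts.
river_bank = contextual_embed(["river", "bank"])[1]
money_bank = contextual_embed(["money", "bank"])[1]
print(river_bank, money_bank)
```

The static lookup gives "bank" one vector no matter what surrounds it; after the mixing step, the two "bank" vectors lean toward "river" and "money" respectively.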

visarga|6 years ago

> Do you have an example of a library or snippet that demonstrates this?

All NLP neural nets (based on LSTM or Transformer) do this. It's their main function - to create contextual representations of the input tokens.

The word's 'position' in the 768-dimensional space is an embedding, and it can be compared with other words by dot product. There are libraries that can do dot product ranking fast (such as Annoy).
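As a rough illustration of dot product ranking, here is a brute-force sketch in plain Python. The 3-dimensional vectors and the function name are invented for the example; a library such as Annoy builds an index so it can approximate the same search without scoring every candidate.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot_product(query, candidates, top_k=3):
    """Return the top_k (name, score) pairs, highest dot product first."""
    scored = [(name, dot(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional embeddings, for illustration only.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

query = [1.0, 0.0, 0.0]
ranking = rank_by_dot_product(query, EMBEDDINGS, top_k=2)
print(ranking)
```

Note that a raw dot product favours long vectors; normalizing the vectors first turns the same ranking into cosine similarity.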