talolard | 3 years ago
In both cases, "token-free" has been the way to go for extractive tasks like entity recognition.
It's unfortunately a technique that is off the beaten path, so a lot of extra engineering effort has to be paid down. For example, when dealing with spans while embedding the text at the byte level, one needs to account for multi-byte codepoints that throw off the alignment between character offsets and byte offsets.
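A minimal sketch of the kind of bookkeeping involved (assuming spans arrive as character offsets and the model consumes UTF-8 bytes; the function name is just illustrative):

```python
# Map a character-offset span to a byte-offset span in UTF-8 text.
# Multi-byte codepoints (accents, CJK, emoji) make the two drift apart.
def char_span_to_byte_span(text: str, start: int, end: int) -> tuple[int, int]:
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = byte_start + len(text[start:end].encode("utf-8"))
    return byte_start, byte_end

text = "café au lait"
# Characters 5..7 are "au", but "é" occupies 2 bytes, so the byte span shifts.
print(char_span_to_byte_span(text, 5, 7))  # (6, 8)
```

Forgetting this conversion silently corrupts every span after the first multi-byte character.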
As the article alludes, the increase in sequence length can be very expensive. In extractive tasks, I've found it effective to use much lighter models with limited (but large) context length like ByteNet[0].
Thinking out loud: as always, there is no one tool for everything. Often the summarization/few-shot capabilities of an off-the-shelf transformer are so far ahead that it's not worth building a model from ±scratch just to sidestep the subtleties of tokenization. Other times, you don't need the unbounded context, or you have a simpler task, and can forgo the power of off-the-shelf models for the specificity of a token-free one.
[0] Neural Machine Translation in Linear Time (https://arxiv.org/abs/1610.10099)
DoctorOetker | 3 years ago
Word2Vec was (despite its poor performance compared to transformers) much more amenable to mathematical interpretation (partition function, analogies, ...).
Rhetorical question (i.e. not directed at you): how about fixing word2vec instead of letting it rot in the history of machine learning books?
We are all familiar with word2vec's property that the vector sum "king" - "man" + "woman" lies close to the vector of "queen". Imperfections for other words are due to things like word ambiguity: the vector of "ball" is a linear interpolation between the vectors of "ball.1", "ball.2", ... that we would get if the corpus had each word annotated with its distinct senses. Other reasons for imperfection include a limited corpus, or stopping training early due to diminishing returns or a rational fear of overfitting.
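With toy, hand-crafted vectors (not real word2vec weights) the analogy arithmetic looks like this; the axis meanings are an assumption for illustration:

```python
import numpy as np

# Toy 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only).
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(v):
    # Nearest neighbour by cosine similarity over the toy vocabulary.
    return max(emb, key=lambda w: v @ emb[w] / (np.linalg.norm(v) * np.linalg.norm(emb[w])))

query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query))  # queen
```

Real embeddings only approximate this: the analogy lands *near* "queen", not exactly on it.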
These analogy sets are directed shifts, for example "man"-"woman", "bull"-"cow", "husband"-"wife", ...
Suppose one were to ask an audience to randomly give examples of antonym pairs: people would submit things like:
loud, silent
dry, wet
love, hate
poor, rich
young, old
male, female
strong, weak
day, night
etc..., then someone submits:
bad, good
At this point the imaginary lecturer stops consulting the crowd and explains that in word2vec antonyms appear at a similar distance from each other, and that the axes going through each pair are roughly parallel.
The crowd slowly comes to the realization that this effectively sorts each pair into one of two global classes, say the class containing "good" and the class containing "bad".
good: silent, wet, hate, rich, old, female, weak, night
bad: loud, dry, love, poor, young, male, strong, day
Obviously something went horribly wrong. "This is unethical and must be stopped now!" a certain person exclaims.
The programmers blame the humans who expressed the corpus. Or perhaps the team that curated the corpus. Or perhaps they shift responsibility to the end-user for using the embeddings responsibly.
Someone mutters "algorithmic bias" whatever that means, the problem surely can't be due to the simple mathematical constraints we posed, or could they?
Proof by contradiction:
Suppose there was such a thing as an antonym shift vector.
One expects that if such a shift exists, the antonym of the antonym of a word should land back near the original word: "male" + 2 v_antonym = "male".
Simple linear algebra in R^n says this cannot be satisfied by a nonzero vector: it forces v_antonym = 0, i.e. each word would have the same embedding location as its antonym!
If v_antonym = 0, the equation is satisfied trivially, but then the model cannot differentiate between concepts and their antonyms.
If v_antonym != 0, any approximate antonym direction the algorithm produces is forced into spontaneous symmetry breaking, so the slightest bias in the corpus gets amplified.
So yes, sometimes the oversimplification is due to the mathematician or programmer, ... and not due to the corpus, corpus curator or end-user!
=====
Now imagine generalizing word2vec so that instead of the embeddings lying in R^n they lie in T^n, an n-dimensional torus.
Consider a shift vector with all coordinates zero except one, which goes halfway around the torus in that direction.
Applying such a shift twice results in a zero shift. Aha! A non-zero vector that, applied twice, returns to the original position!
How many such vectors are available? Naively n (one per coordinate), but really every combination of half-shifts is valid, e.g.
(0, 0, 0.5, 0, 0.5, ...) x 2 = (0, 0, 0, 0, 0, ...) (mod 1),
so the number of antonym types on an n-dimensional torus is 2^n (including the trivial zero shift).
That should more than suffice to allow many types of undirected antonyms: negating an antonym shift vector on an n-dimensional torus gives back the same shift, so antonym pairs can form without the algorithm forcing them into a global positive and a global negative class.
For example, the shifts for (child, parent, child, ...) and (male, female, male, ...) would be different antonym types, and sometimes their combined shift is occupied by words: it satisfies both (father, daughter, father, ...) and (mother, son, mother, ...).
Other advantages of a toroidal representation: weekdays can be placed 1/7th of a full cycle apart along some direction (or 0/7ths in others, or 2/7ths, etc.), months 1/12th apart, and so on.
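A quick numerical check of both claims, modelling T^n as coordinates taken mod 1 (a sketch, not a trained model):

```python
import numpy as np

def shift(x, v):
    # Toroidal embedding: each coordinate lives on [0, 1) and wraps around.
    return (x + v) % 1.0

def circ_dist(a, b):
    # Distance on the circle, so 0.9999... and 0.0 count as close.
    d = np.abs(a - b) % 1.0
    return np.minimum(d, 1.0 - d)

word = np.array([0.13, 0.42, 0.77])

# A half-period shift is its own inverse: applied twice it is the identity,
# yet it is a nonzero vector -- impossible in R^n.
v_antonym = np.array([0.5, 0.0, 0.5])
assert np.all(circ_dist(shift(shift(word, v_antonym), v_antonym), word) < 1e-9)

# Cyclic features fit naturally: weekdays 1/7 of a full cycle apart.
monday = np.array([0.0])
x = monday
for _ in range(7):
    x = shift(x, 1.0 / 7.0)
assert np.all(circ_dist(x, monday) < 1e-9)  # seven day-shifts return to Monday
```

The `circ_dist` helper matters: floating-point wrap-around means you cannot compare toroidal coordinates with plain `np.allclose`.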
=======
Character-level tokens:
In word2vec it's unclear how to generalize from words to sentences:
simply adding the word vectors of a sentence does not preserve meaning.
for example:
"the cat bit the dog"
means something very different from
"the dog bit the cat"
So concatenation of words can't be addition of vectors, since vector addition commutes: v + w = w + v.
So "the" + ("cat" + "bit") + "the dog" would equal "the" + ("bit" + "cat") + "the dog".
If we can swap adjacent words (thanks to commutativity), we can repeat the swap and move "cat" to the place of "dog" and vice versa.
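A toy check that a sum of word vectors is order-blind (random vectors standing in for trained embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for trained word embeddings.
emb = {w: rng.standard_normal(8) for w in ["the", "cat", "bit", "dog"]}

def sentence_vec(words):
    # Bag-of-vectors: just add the word embeddings.
    return sum(emb[w] for w in words)

a = sentence_vec(["the", "cat", "bit", "the", "dog"])
b = sentence_vec(["the", "dog", "bit", "the", "cat"])
assert np.allclose(a, b)  # the two sentences collapse to the same vector
```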
Similarly word2vec is not used at the character level.
Observe that some words or concepts do in fact roughly commute in English:
"the green round block" means roughly the same as "the round green block"
So adjectives largely commute (though not entirely: "the large red block" sounds more natural to me than "the red large block").
So we sometimes want commutativity and sometimes don't, depending on the pair of words.
Next, consider training some kind of generalized word2vec, call it character2element, such that the character elements combine mathematically into a new element that roughly corresponds to what we call a word vector in word2vec.
Some words are made of 1 character, others are made of many characters.
So the elements should have some type of "product" such that "multiplying" 2 elements results in an element of the same space.
This could be satisfied with matrices (char2matrix) or perhaps "geometric algebra" multivectors (char2gamv).
In this way we train each character's (matrix or multivector) coefficients such that, for example, multiplying the characters of two synonyms (assuming each word has exactly one sense) results in the same matrix or multivector. (I don't propose this as the training loss, though!)
i.e. multiplying the characters q u i c k = f a s t (or roughly so)
Similarly, multiplying g r e e n and r o u n d should result in elements that roughly commute (with an extra space character): the full products of "green round" and "round green" should be similar.
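A sketch of the char2matrix idea with random, untrained matrices, just to show that matrix products are order-sensitive where vector sums are not; a trained model would have to learn any approximate commutativity:

```python
import numpy as np

rng = np.random.default_rng(1)
# One small matrix per character (random here; training would fit these).
chars = {c: rng.standard_normal((4, 4)) for c in sorted(set("green round"))}

def word_matrix(s):
    # Compose a word (or phrase) by multiplying its character matrices.
    m = np.eye(4)
    for c in s:
        m = m @ chars[c]
    return m

# Unlike vector addition, the product depends on order, so "green round"
# and "round green" get different representations by default:
assert not np.allclose(word_matrix("green round"), word_matrix("round green"))
```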
word2vec uses an inner product, supposedly between a context and a focus word vector. (From a physics perspective the only way to get an invariant is by contracting covariant with contravariant coordinates, so there's a hidden metric. Instead of adding the context and focus vectors after training to get the "embeddings", I believe providing a metric and using the same contravariant coordinates would make more sense; alternatively, keep a unit metric and redistribute the parameters freed up from the context vectors to the focus word vectors.)
In word2vec this inner product is used to get a scalar (possibly negative), and a partition function is used to get the joint probability of a context and focus word.
How should the matrix or multivector be compressed to a scalar to pass to the partition function? A series of informed guesses could be made (determinant or trace of a matrix, norm of a multivector) and the best-performing one selected...
OR we could instead run through the corpus, split into an integral number of sentences at a time, and corrupt it (at first only one character at a time) and optimize the coefficients for autoencoding loss.
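The trace-plus-partition-function guess from above can be sketched as follows (illustrative only: random matrices, a three-word vocabulary, and the trace chosen arbitrarily among the candidate scalarizations):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["quick", "fast", "slow"]
# Random stand-ins for trained word matrices.
mats = {w: rng.standard_normal((4, 4)) for w in vocab}

def score(context, focus):
    # Compress the matrix product to a scalar via the trace.
    return np.trace(mats[context] @ mats[focus])

def p_focus_given_context(context):
    # Softmax partition function over the (tiny) vocabulary.
    s = np.array([score(context, w) for w in vocab])
    e = np.exp(s - s.max())  # subtract max for numerical stability
    return dict(zip(vocab, e / e.sum()))

probs = p_focus_given_context("quick")
assert abs(sum(probs.values()) - 1.0) < 1e-9  # a proper distribution
```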
Can you read this? 𝕆𝔹𝕍𝕀𝕆𝕌𝕊𝕃𝕐
When we read it we see the normal capitalized word "OBVIOUSLY", but somehow entirely emphasized in this mathematical double-struck notation.
So the matrix for each double-struck letter should be related to the matrix of the normal capital letter.
Suppose there was an invertible matrix M with an inverse N such that MN=1
then we can hypothesize that
"𝕆" = M "O" N,
"𝔹" = M "B" N, etc...
so that
"𝕆𝔹𝕍𝕀𝕆𝕌𝕊𝕃𝕐" = M"O"N M"B"N M"V"N M"I"N M"O"N M"U"N M"S"N M"L"N M"Y"N
but since NM=1 the inner factors cancel and we get:
M"OBVIOUSLY"N
so this effectively allows such decorative variants of words to exist in the model.
The same argument can be made for upper and lower case.
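The telescoping argument is easy to verify numerically: with a random invertible M and N = M^-1, the conjugated letters compose to the conjugated word (random matrices stand in for learned letter matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))  # stands in for the matrix of "O"
B = rng.standard_normal((4, 4))  # stands in for the matrix of "B"
M = rng.standard_normal((4, 4))  # the "font change"; random, almost surely invertible
N = np.linalg.inv(M)

# (M A N)(M B N) = M (A B) N because the inner N M cancels to the identity.
lhs = (M @ A @ N) @ (M @ B @ N)
rhs = M @ (A @ B) @ N
assert np.allclose(lhs, rhs)
```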
Not sure how much of what I wrote is compatible with HN's markdown...
So not only words but complete sentences would result in a "thought matrix" or "thought multivector".
Decoding such a thought multivector might be done by performing gradient descent on unit-sum interpolations of character matrices for each position in a string of such matrices.