I'm a relative novice to machine learning, but here's my best attempt to summarize what's going on in layman's terms. Please correct me if I'm wrong.
- Encode the words in the source (aka embedding, section 3.1)
- Feed every run of k words into a convolutional layer producing an output; repeat this process 6 layers deep (section 3.2).
- Decide on which input word is most important for the "current" output word (aka attention, section 3.3).
- The most important word is decoded into the target language (section 3.1 again).
You repeat this process with every word as the "current" word. The critical insight of using this mechanism over an RNN is that you can do this repetition in parallel because each "current" word does not depend on any of the previous ones.
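The attention step above can be sketched in a few lines of numpy (purely illustrative names and shapes, not the paper's actual code):

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """Score each source word against the current decoder state,
    then return a weighted average of the encoder states."""
    scores = encoder_states @ decoder_state  # one dot product per source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax -> importance of each word
    return weights @ encoder_states         # context vector for the "current" word

# 4 source words, hidden size 3
enc = np.random.rand(4, 3)
dec = np.random.rand(3)
context = attention(dec, enc)
print(context.shape)  # (3,)
```

The key property is that this function needs no previous decoder outputs during training, so it can be evaluated for every target position at once.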
Yes, that's pretty accurate. Step 3 (attention) is repeated multiple times, i.e. for each layer in the decoder. With each additional layer, you incorporate more of the previously translated text as well as information about which parts of the source sentence representation were used to generate it. The independence of the current word from the previous words applies to the training phase, as a complete reference translation is provided and the model is trained to predict only the single next word. This kind of computation would be very inefficient with an RNN: it would have to run over each word in every layer sequentially, which prohibits efficient batching.
When generating a translation for a new sentence, the model uses classic beam search where the decoder is evaluated on a word-by-word basis. It's still pretty fast since the source-side network is highly parallelizable and running the decoder for a single word is relatively cheap.
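That word-by-word decoding can be sketched as a toy beam search; `score_next` and `fake_model` here are made-up stand-ins for the real decoder:

```python
import math

def beam_search(score_next, beam_size=3, max_len=5, eos="</s>"):
    """Toy beam search: score_next(prefix) -> {word: log_prob}.
    Keeps the beam_size best partial translations at each step."""
    beams = [([], 0.0)]  # (words so far, total log-prob)
    for _ in range(max_len):
        candidates = []
        for words, lp in beams:
            if words and words[-1] == eos:
                candidates.append((words, lp))  # finished hypothesis carries over
                continue
            for word, wlp in score_next(words).items():
                candidates.append((words + [word], lp + wlp))
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_size]
    return beams[0]

# A fake model that always prefers "bonjour" and then ends the sentence.
def fake_model(prefix):
    if not prefix:
        return {"bonjour": math.log(0.7), "salut": math.log(0.3)}
    return {"</s>": math.log(0.9), "le": math.log(0.1)}

best, logprob = beam_search(fake_model)
print(best)  # ['bonjour', '</s>']
```

Each call to `score_next` stands in for one cheap decoder evaluation; the expensive encoder pass over the source happens only once.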
One logical continuation of adding more attention steps is to let the network itself decide how many attention steps to take, a la "Adaptive Computation Time for Recurrent Neural Networks". Are you planning to go in that direction?
As far as I understood it, Facebook put lots of research into optimizing a certain type of neural network (CNNs), while everyone else is using another type called RNNs. Up until now, CNNs were faster but less accurate. However, FB has progressed CNNs to the point where they can compete in accuracy, particularly in speech recognition. And most importantly, they are releasing the source code and papers. Does that sound right?
Traditional Neural Networks worked like this: you have k inputs to a layer, and j outputs, so you have O(k * j) parameters, effectively multiplying the inputs by the parameters to get the outputs. And if you have lots of inputs to each layer, and lots of layers, you have a lot of parameters. Too many parameters = overfitting to your training data pretty quickly. But you want big networks, ideally, to get super accuracy. So the question is how to reduce the number of parameters while still having the same 'power' in the network.
CNNs (Convolutional Neural Networks) solve this problem by tying weights together. Instead of multiplying every input by every output, you build a small set of functions at each layer, each with a small number of parameters, and apply them to small, nearby groups of inputs. Images are the best way to describe this: a function will take as inputs small (3x3 or 5x5) groups of pixels in the image, and output a single result. Crucially, the same function is applied all over the image. Picture a little 5x5 box moving around the image, running a function at each stop.
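The parameter savings from this weight tying are easy to see with a quick count (illustrative sizes, not from any particular model):

```python
# Parameter counts for a 28x28 single-channel image, 100 output features.
pixels = 28 * 28
hidden = 100

dense_params = pixels * hidden  # every input connected to every output
conv_params = 5 * 5 * hidden    # one 5x5 filter per output feature,
                                # reused at every position in the image
print(dense_params, conv_params)  # 78400 2500
```

Same number of output features, roughly 30x fewer parameters, because the 5x5 filter is shared across all positions.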
This has given some pretty incredible results in the image-recognition problem space, and they're super simple to train.
Another approach, Recurrent Neural Networks (RNNs), turns the model around in a different way. Instead of having a long list of inputs that all come at once, it takes each input one at a time (or maybe a group at a time, same idea) and runs the neural-network machinery to build up to a single answer. So you might feed it one word at a time of English input, and after a few words, it starts outputting one word at a time in French until the inputs run out and the output says it's the end of the sentence.
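A minimal sketch of that one-word-at-a-time loop (all names and sizes here are made up); note how each step needs the previous hidden state, which is exactly what blocks parallelism:

```python
import numpy as np

def rnn_encode(words, W, U, embed):
    """Minimal RNN: consume one word at a time, updating a single
    hidden state that summarizes everything read so far."""
    h = np.zeros(W.shape[0])
    for w in words:
        h = np.tanh(W @ h + U @ embed[w])  # each step depends on the previous h
    return h

rng = np.random.default_rng(0)
embed = {"the": rng.standard_normal(4), "cat": rng.standard_normal(4)}
W = rng.standard_normal((8, 8))
U = rng.standard_normal((8, 4))
h = rnn_encode(["the", "cat"], W, U, embed)
print(h.shape)  # (8,)
```

A CNN over the same sentence has no such chain: every window of words can be processed at the same time.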
What Facebook is doing is applying CNNs to text-sequence and translation problems. It seems to me that what they have here is kind of a RNN-CNN hybrid.
Caveats: I'm an idiot! I just read a lot and play around with ML, but I'm not an expert. Please correct me if I'm wrong, smarter people, by replying.
Not an expert, but as I understand it, common practice (everywhere, not just at Facebook) is to use CNNs for understanding images and other kinds of non-sequential data. RNNs are commonly used for handling text and other kinds of sequential data.
They showed how to use a CNN with text to get a speed boost, even though that's not how it's normally been done.
Yes, there have been a couple of attempts to use CNNs for translation already, but none of them outperformed big and well-tuned LSTM systems. We propose an architecture that is fast to run, easy to optimize and can scale to big networks, and could thus be used as a base architecture for future research.
There are a couple of contributions in the paper (https://arxiv.org/abs/1705.03122) apart from demonstrating the feasibility of CNNs for translation, e.g. the multi-hop attention in combination with a CNN language model, the wiring of the CNN encoder[1], or an initialization scheme for GLUs that, when combined with appropriate scaling for residual connections, enables the training of very deep networks without batch normalization.
[1] In previous work (https://arxiv.org/abs/1611.02344), we required two CNNs in the encoder: one for the keys (dot products) and one for the values (decoder input).
In this work, Convolutional Neural Nets (spatial models with a weakly ordered context, as opposed to Recurrent Neural Nets, which are sequential models with a strongly ordered context) are demonstrated to achieve State of the Art results in Machine Translation.
It seems the combination of gated linear units / residual connections / attention was the key to bringing this architecture to State of the Art.
It's worth noting that previously the QRNN and ByteNet architectures have used Convolutional Neural Nets for machine translation also. IIRC, those models performed well on small tasks but were not able to best SotA performance on larger benchmark tasks.
I believe it is almost always more desirable to encode a sequence using a CNN if possible as many operations are embarrassingly parallel!
This smells of "we built custom silicon to do fast image processing using CNNs and fully connected networks, and now we want to use that same silicon for translations. "
I was reading about SyntaxNet (I believe an RNN) developed by Google yesterday. One interesting problem they've run into is getting the system to properly interpret ambiguities. They use the example sentence "Alice drove down the street in her car":
"The first [possible interpretation] corresponds to the (correct) interpretation where Alice is driving in her car; the second [possible interpretation] corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity."[1]
One thing I believe helps humans interpret these ambiguities is the ability to form visuals from language. An NN that could potentially interpret/manipulate images and decode language seems like it could help solve the above problem and also be applied to a great deal of other things. I imagine (I know embarrassingly little about NNs) this would also introduce a massive amount of complexity.
I wonder if they can combine this with ByteNet (dilated convolutions in place of vanilla convs), which gives you a larger receptive field; add in attention and then you probably have a new SotA.
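Rough numbers on why dilation helps: for a stack of kernel-3 convolutions with stride 1, the receptive field is 1 plus the sum of (kernel - 1) * dilation over the layers. A doubling dilation schedule (ByteNet-style, sketched here with made-up depths) grows it much faster:

```python
def receptive_field(layers, kernel=3, dilations=None):
    """Receptive field of a stack of 1-D convolutions (stride 1).
    With dilation, each layer skips over inputs, so the field grows
    geometrically with depth instead of linearly."""
    dilations = dilations or [1] * layers
    field = 1
    for d in dilations:
        field += (kernel - 1) * d
    return field

print(receptive_field(4))                          # vanilla: 9
print(receptive_field(4, dilations=[1, 2, 4, 8]))  # dilated: 31
```

Same four layers and parameter count, over three times the context.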
CGamesPlay|8 years ago
Am I on the right track?
albertzeyer|8 years ago
DeepMind has also released a framework with building blocks for translation: https://github.com/deepmind/sonnet
jorgemf|8 years ago
Google and DeepMind have released a lot of stuff; I don't feel I have the right to complain about it.
gavinpc|8 years ago
That's a rather strong statement, for a company that has become one of the world's most complained-about black boxes.
But yes, they have done a lot of good in the computer science space.
blacksmythe|8 years ago
Like many big companies, they want to commoditize their products' complements.
"Smart companies try to commoditize their products' complements." https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
snippyhollow|8 years ago
code: https://github.com/facebookresearch/fairseq
pre-trained models: https://github.com/facebookresearch/fairseq#evaluating-pre-t...
pwaivers|8 years ago
Can anyone else give us an ELI5?
mrdrozdov|8 years ago
The BLEU scores in this work were the following (previous baseline -> this work):
- WMT’16 English-Romanian: 28.1 -> 29.88
- WMT’14 English-German: 24.61 -> 25.16
- WMT’14 English-French: 39.92 -> 40.46
alexanderdmitri|8 years ago
"The first [possible interpretation] corresponds to the (correct) interpretation where Alice is driving in her car; the second [possible interpretation] corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity."[1]
One thing I believe helps humans interpret these ambiguities is the ability to form visuals from language. A NN that could potentially interpret/manipulate images and decode language seems like it could help solve the above problem and also be applied to a great deal of other things. I imagine (I know embarrassingly little about NNs) this would also introduce a massive amount of complexity.
[1] https://research.googleblog.com/2016/05/announcing-syntaxnet...
danielvf|8 years ago
But go read the article: nice animated diagrams in there.