igul222 | 10 years ago
I'd be curious to know if you have a reference for this. Given that the answers are one word, a word-level RNN language model output should basically be the same thing as a straight 1000-way softmax.
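The equivalence claimed here can be sketched numerically. The sizes and weights below are hypothetical placeholders, not taken from any cited model: the point is that when a decoder emits exactly one word from a fixed <START> input, its single step is just a deterministic nonlinear head over the encoding, i.e. a 1000-way softmax classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: 512-d encoding of (image, question), 1000 candidate answers.
d, V = 512, 1000
h0 = rng.standard_normal(d)        # CNN/LSTM encoding, used as the initial state
x_start = rng.standard_normal(d)   # embedding of the <START> token (a constant)
Wh = rng.standard_normal((d, d))
Wx = rng.standard_normal((d, d))
Wo = rng.standard_normal((V, d))

# One decoding step of a vanilla RNN emitting the (single-word) answer:
h1 = np.tanh(Wh @ h0 + Wx @ x_start)
p_rnn = softmax(Wo @ h1)

# Because <START> is fixed, the same computation is just a V-way softmax
# classifier with one hidden layer applied to the encoding h0:
p_clf = softmax(Wo @ np.tanh(Wh @ h0 + Wx @ x_start))

assert np.allclose(p_rnn, p_clf)
print(p_rnn.shape)  # (1000,)
```

So for one-word answers the two formulations differ only in training framing (sequence loss vs. classification loss), not in the function being computed at the single output step.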
iamaaditya | 10 years ago
Source:

(a) "Several early papers about VQA directly adapt the image captioning models to solve the VQA problem [10][11] by generating the answer using a recurrent LSTM network conditioned on the CNN output. But these models' performance is still limited [10][11]."

(b) "our own implementation of this model is less accurate on [2] than other baseline models."

The two quotes above are from -
However, my wording was sloppy: I could not find more concrete proof in the literature, so I will revisit my notes in detail to recollect where I read that generating answers with an RNN does not outperform softmax classification over the top-K answer distribution.

Also, I would like to note that I am not using only one-word answers as the possible answer set. It contains a few two-word answers, and very few three- and four-word answers.

Here is the distribution (key = number of words in the answer, value = count of answers with that many words):
Counter({1: 855, 2: 112, 3: 32, 4: 1})
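For reference, a distribution like the one above can be computed in a couple of lines. The answer list here is a made-up stand-in (the real candidate set has 1000 entries); only the counting pattern is the point.

```python
from collections import Counter

# Hypothetical candidate answers; the real set contains 1000 entries.
answers = ["yes", "no", "red", "fire hydrant", "stop sign", "new york city"]

# Key = number of words in the answer, value = count of answers of that length.
length_dist = Counter(len(a.split()) for a in answers)
print(length_dist)  # Counter({1: 3, 2: 2, 3: 1})
```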