top | item 11435005


igul222 | 10 years ago

> It is not for lack of trying that all the top papers in visual question answering end up doing this as a classification task. Results are really poor when it is used as RNN generation

I'd be curious to know if you have a reference for this. Given that the answers are one word, a word-level RNN language model output should basically be the same thing as a straight 1000-way softmax.
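To make the equivalence concrete, here is a minimal NumPy sketch (all shapes, weights, and names are made up for illustration) of why a single-step decoder over a 1000-word answer vocabulary has the same functional form as a 1000-way classifier: with no previous token to condition on, the first generation step is just an affine map of the fused image+question feature followed by a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fused image+question feature (e.g. CNN output + question encoding).
feat = rng.normal(size=128)

# Classification head: a single 1000-way softmax over the answer vocabulary.
W_cls = rng.normal(size=(1000, 128))
cls_logits = W_cls @ feat

# One-step "RNN generation": the decoder emits its first (and, for one-word
# answers, only) token from the same 1000-word vocabulary. The first step is
# an affine map of the conditioning vector (here passed through tanh, a
# typical RNN state nonlinearity) followed by a softmax -- the same form as
# the classifier, up to parameterization.
W_dec = rng.normal(size=(1000, 128))
rnn_logits = W_dec @ np.tanh(feat)

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p_cls = softmax(cls_logits)  # distribution over 1000 answer classes
p_rnn = softmax(rnn_logits)  # distribution over the same 1000 tokens
```

Both outputs are distributions over the same 1000 answers; the difference only appears once answers span multiple tokens.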

iamaaditya | 10 years ago

1.

   Model   Q+I [1]   Q+I+C [1]   ATT 1000   ATT Full
   Acc.    0.2678    0.2939      0.4838     0.4651
Here "ATT Full" uses all the words in the vocabulary as the answer set; as you can see, it performs worse than "ATT 1000" (the most frequent 1000 answers).

Source:

   Chen, K., Wang, J., Chen, L. C., Gao, H., Xu, W., & Nevatia, R. (2015). ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering. arXiv preprint arXiv:1511.05960.
2.

(a) Several early papers about VQA directly adapt the image captioning models to solve the VQA problem [10][11] by generating the answer using a recurrent LSTM network conditioned on the CNN output. But these models' performance is still limited [10][11].

(b) our own implementation of this model is less accurate on [2] than other baseline models

The above two quotes are from:

   Xu, Huijuan, and Kate Saenko. "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering." arXiv preprint arXiv:1511.05234 (2015).
However, I think my wording was sloppy: I could not find more concrete evidence in the literature, but I will revisit these papers in detail to recall where I read that RNN answer generation does not outperform softmax classification over the top-K distribution of answers.

Also, I would like to note that I am not using only one-word answers as the possible answer set. It contains a few two-word answers, and very few three- and four-word answers.

Here is the distribution (key = number of words in the answer, value = count of answers with that many words):

Counter({1: 855, 2: 112, 3: 32, 4: 1})
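That distribution is just a `Counter` over per-answer word counts; a minimal sketch with a made-up sample of answers (the real set has 1000 entries):

```python
from collections import Counter

# Hypothetical sample of candidate answers, for illustration only.
answers = ["yes", "no", "red", "fire hydrant", "stop sign", "in the park"]

# Key = number of words in the answer, value = count of answers of that length.
length_dist = Counter(len(a.split()) for a in answers)
# e.g. Counter({1: 3, 2: 2, 3: 1}) for the sample above
```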