Comments (16)
Totally agree with the paraphrase issue! We have been brainstorming about this in the lab as well. Simply mapping OOV words to something within the vocabulary is a good place to start.
If I had to say one thing was unsatisfying about the model, I'd say it's the multiclass classification output.
I am not sure I would agree. As our paper explains, most answers in our dataset are 1-3 words long, so it really is /mostly/ a large multiclass classification problem. Our choice of 1K answers in the model is simply one convenient choice. It covers ~82% of all answers.
Have you tried having the model output a vector, and using it to find a nearest neighbour?
No, but how would such a system be trained by backprop? 1-NN isn't amenable to gradient-based learning.
from vqa_lstm_cnn.
I guess I just feel like having such a small and fixed answer vocabulary makes the task a little bit more artificial. I think the coverage you observe is mostly a fact about the collection methodology, not about language in general.
Re training: Off the top of my head, maybe noise contrastive estimation? That's how the QANTA paper did it.
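To make the NCE suggestion concrete, here is a rough numpy-only sketch of negative-sampling/NCE-style training for a vector-output answer model (all names here are illustrative, not from the repo or the QANTA paper): the model's output vector is pulled toward the correct answer's embedding and pushed away from a few sampled wrong answers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_style_loss(q, pos, negs):
    """Pull q toward the correct answer vector, push it away
    from sampled incorrect ('noise') answer vectors."""
    loss = -np.log(sigmoid(q @ pos))
    for n in negs:
        loss -= np.log(sigmoid(-(q @ n)))
    return loss

# toy setup: q is the model's output; pos/negs are answer embeddings
d = 8
q = rng.normal(size=d)
pos = rng.normal(size=d)
negs = [rng.normal(size=d) for _ in range(3)]

loss_before = nce_style_loss(q, pos, negs)

# a few gradient steps on q only (a real model would also backprop
# into the answer embeddings and the question encoder)
lr = 0.1
for _ in range(50):
    grad = -(1.0 - sigmoid(q @ pos)) * pos
    for n in negs:
        grad += sigmoid(q @ n) * n
    q -= lr * grad

loss_after = nce_style_loss(q, pos, negs)
```

At inference you would return the answer whose embedding is nearest the output vector (e.g. by cosine similarity), which is what sidesteps the fixed softmax layer.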
I guess I just feel like having such a small and fixed answer vocabulary makes the task a little bit more artificial.
:-). I would counter that just because the space of answers is small does not make the learning problem easy. Even binary questions such as "Is this person expecting company?" can require fairly heavy lifting on the vision/reasoning side.
Hey, I'm not saying it's easy, or that it's not impressive and interesting :)
But a fixed answer vocabulary isn't the future of this task. I think the technology would take a big step towards practicality if you were learning to produce a meaning representation. That way, to learn a new answer, you just have to learn its vector. If you add another class to the model, you don't know how many weights might have to be adjusted. Probably a lot.
My hunch is that it would actually be better for accuracy, too. But, you know the evaluation much better.
Agreed.
:)
Now, about the paraphrases. There are a couple of ways we could do this.

1. I could simply compute paraphrased versions of the task data, and give you a pull request with them. But then when someone ran the system with their own image/questions, you wouldn't be able to answer.
2. I could compute the paraphrase dictionary and check that in.
3. I could check in a script that generates the paraphrase dictionary.
4. I can bake the paraphrase mapping into spaCy's lexicon, and add a mode to prepro.py that uses spaCy with paraphrasing.
5. I could make prepro.py accept a word vectors file as an argument, and then it would compute the paraphrases based on the vectors.

I think 3 and 4 make the most sense.
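The script-based options (generating a paraphrase dictionary from word vectors) might look something like this at their core; this is a sketch, not the actual prepro.py code, and `build_paraphrase_dict` plus the toy vectors are made up for illustration. Each out-of-vocabulary word is mapped to its nearest in-vocabulary neighbour by cosine similarity.

```python
import numpy as np

def build_paraphrase_dict(vectors, vocab, min_sim=0.5):
    """Map each word outside `vocab` to its nearest in-vocab
    neighbour by cosine similarity; words with no neighbour above
    `min_sim` are left unmapped (they'd fall back to UNK)."""
    in_vocab = [w for w in vectors if w in vocab]
    M = np.stack([vectors[w] for w in in_vocab])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    mapping = {}
    for w, v in vectors.items():
        if w in vocab:
            continue
        sims = M @ (v / np.linalg.norm(v))
        best = int(np.argmax(sims))
        if sims[best] >= min_sim:
            mapping[w] = in_vocab[best]
    return mapping

# toy vectors: "hound" is close to "dog", so it gets mapped there
vectors = {
    "dog": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
    "hound": np.array([0.9, 0.1]),
}
mapping = build_paraphrase_dict(vectors, vocab={"dog", "cat"})
```

The `min_sim` cutoff is there so that truly unrelated rare words stay as UNK rather than getting a misleading substitute.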
I like 4 the best.
@jiasenlu @abhshkdz @dexter1691: Do you have a preference?
I agree, 3 and 4 would be great.
Yeah, 3 and 4 make sense.
Great. I left a job running overnight compiling the paraphrases from GloVe, but I messed something up. Restarting now.
Here are some random substitutions for relatively frequent words. Some of these substitutions look good; others look pretty problematic.
u'Yeah' --> u'all'
u'fuck' --> u'what'
u'fucking' --> u'dude'
u'etc' --> u'toppers'
u'myself' --> u'herself'
u'comment' --> u'leave'
u'thanks' --> u'thank'
u'reddit' --> u'librarian'
u'yeah' --> u'all'
u'whatever' --> u'else'
u'questions' --> u'question'
u'lol' --> u'but'
u'sorry' --> u'okay'
u'hell' --> u'well'
u'Edit' --> u'again'
u'unless' --> u'except'
u'gonna' --> u'#'
u'However' --> u'even'
u'basically' --> u'apparently'
u'knew' --> u'told'
u'comments' --> u'opinion'
u'damn' --> u'dude'
u'Sorry' --> u'okay'
u'worse' --> u'better'
u'account' --> u'credit'
u'advice' --> u'help'
u'Reddit' --> u'librarian'
u'Hey' --> u'hi'
u'subreddit' --> u'#'
u'however' --> u'even'
u'kinda' --> u'seems'
u'Wow' --> u'dude'
u'certainly' --> u'definitely'
u'explain' --> u'describe'
I think it will help a lot to have vectors over POS tagged text. Then we can make sure we don't change the part-of-speech when we do the replacement.
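To sketch what POS-constrained replacement could look like (the function and the toy tag dictionary below are hypothetical; in practice the tags would come from tagging the corpus, e.g. with spaCy):

```python
def pos_safe_substitute(word, candidates, pos_of):
    """Return the most similar candidate that shares the original
    word's part-of-speech, so e.g. the noun 'comment' can't be
    replaced by the verb 'leave'. `candidates` is a list of
    (word, similarity) pairs sorted by similarity, descending."""
    target = pos_of.get(word)
    for cand, _sim in candidates:
        if pos_of.get(cand) == target:
            return cand
    return word  # no same-POS candidate: keep the original word

# toy tags: 'leave' is more similar but has the wrong POS
pos_of = {"comment": "NOUN", "leave": "VERB", "remark": "NOUN"}
result = pos_safe_substitute(
    "comment", [("leave", 0.7), ("remark", 0.6)], pos_of)
```

This would rule out pairs like `comment --> leave` from the list above, at the cost of sometimes keeping the original word.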
The instructions in the readme don't specify a frequency threshold for the training data. Is this the config you've been running in your experiments?
Below are the numbers of excluded words at various thresholds. With a frequency threshold of 1 or 5, there seems to be relatively little advantage to having a complicated replacement scheme: only 1% of the tokens are affected. With so many other moving parts and difficulties in the task, I doubt changing the representation of those tokens would help very much.
I would be curious to try an aggressive threshold, leaving only one or two thousand words in the vocab.
Threshold | Bad words            | Vocab size | UNKs
1         | 4854/12603  (38.51%) | 7749       | 4854/1537357    (0.32%)
5         | 8387/12603  (66.55%) | 4216       | 15355/1537357   (1.00%)
20        | 10449/12603 (82.91%) | 2154       | 37260/1537357   (2.42%)
50        | 11304/12603 (89.69%) | 1299       | 64905/1537357   (4.22%)
100       | 11722/12603 (93.01%) | 881        | 94764/1537357   (6.16%)
200       | 12073/12603 (95.79%) | 530        | 144454/1537357  (9.40%)
1000      | 12459/12603 (98.86%) | 144        | 317024/1537357  (20.62%)
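Stats like these can be reproduced with a straightforward frequency count. This is a sketch, assuming "threshold N" means words occurring N times or fewer become UNK, which matches the numbers above (at threshold 1, the bad-word and UNK counts are both 4854, i.e. all hapaxes):

```python
from collections import Counter

def vocab_stats(tokens, threshold):
    """Count 'bad' word types (freq <= threshold), the resulting
    vocab size, and the fraction of tokens that would become UNK."""
    counts = Counter(tokens)
    bad = [w for w, c in counts.items() if c <= threshold]
    n_unk = sum(counts[w] for w in bad)
    return {
        "bad_words": len(bad),
        "vocab_size": len(counts) - len(bad),
        "unk_rate": n_unk / len(tokens),
    }

# toy corpus: only "c" falls below the threshold
stats = vocab_stats("a a a b b c".split(), threshold=1)
```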
@honnibal I use th = 0 (preserve all the words) in the pre-trained model. Previously I tried using only the top 1000 words in our bag-of-words baseline (roughly th=50~100), and the performance was much worse than the method here (http://arxiv.org/abs/1512.02167), which has a similar network structure. So I doubt an aggressive threshold will improve the performance.
I think replacing the tokens with UNK makes an aggressive threshold pretty problematic. I'm wondering whether it might work with this paraphrase replacement, though.
It seems quite difficult to learn a representation for a word occurring only once in the training data. You also don't learn any representation for the UNK token that you'll be using over the dev data. So I think th=1 seems better than th=0? That would be my guess.
Here's how the paraphrased data looks at th=50. I need to clean things up a bit before I can give you a pull request. I would say that the current results look a little promising, but we can do better.
It seems to me like crucial words of the question are often relatively rare in the data. The current paraphrase model often messes them up. But, I'm not sure how well you can train the model to learn them, on only a few examples.
What regional architecture is represented here?
[u'what', u'center', u'design', u'is', u'represented', u'here', u'?']
Where is the cat sitting?
[u'where', u'is', u'the', u'cat', u'sitting', u'?']
What vehicle is in the picture on the wall?
[u'what', u'vehicle', u'is', u'in', u'the', u'picture', u'on', u'the', u'wall', u'?']
Is the television set turned on or off?
[u'is', u'the', u'television', u'set', u'turned', u'on', u'or', u'off', u'?']
Is this a professional sport event?
[u'is', u'this', u'a', u'professional', u'sport', u'event', u'?']
What is the man doing?
[u'what', u'is', u'the', u'man', u'doing', u'?']
Is the player's uniform dirty?
[u'is', u'the', u'player', u"'s", u'uniform', u'dirty', u'?']
What sole topping is shown on the pizza?
[u'what', u'boots', u'topping', u'is', u'shown', u'on', u'the', u'pizza', u'?']
What object is this?
[u'what', u'object', u'is', u'this', u'?']
What is the pizza sitting in?
[u'what', u'is', u'the', u'pizza', u'sitting', u'in', u'?']
Can you see the moon here?
[u'can', u'you', u'see', u'the', u'dark', u'here', u'?']
What is in the sky?
[u'what', u'is', u'in', u'the', u'sky', u'?']
How does the parachute stay in the air?
[u'how', u'does', u'the', u'military', u'keep', u'in', u'the', u'air', u'?']
Is the train moving through the countryside?
[u'is', u'the', u'train', u'moving', u'through', u'the', u'rural', u'?']
What can be read on the train?
[u'what', u'can', u'be', u'read', u'on', u'the', u'train', u'?']
Is this train moving?
[u'is', u'this', u'train', u'moving', u'?']
What is the color of the dog?
[u'what', u'is', u'the', u'color', u'of', u'the', u'dog', u'?']
Is this a library?
[u'is', u'this', u'a', u'book', u'?']
Can the dog read?
[u'can', u'the', u'dog', u'read', u'?']
I think replacing the tokens with UNK makes an aggressive threshold pretty problematic. I'm wondering whether it might work with this paraphrase replacement, though.
Yes, I agree. I think this will help.
It seems quite difficult to learn a representation for a word occurring only once in the training data. You also don't learn any representation for the UNK token that you'll be using over the dev data. So I think th=1 seems better than th=0? That would be my guess.
I'm not sure about this; maybe we can run some experiments on it. Basically, at th=1 we replace the random vector with the same UNK representation.
Here's how the paraphrased data looks at th=50. I need to clean things up a bit before I can give you a pull request. I would say that the current results look a little promising, but we can do better.
Yes, agreed.
It seems to me like crucial words of the question are often relatively rare in the data. The current paraphrase model often messes them up. But, I'm not sure how well you can train the model to learn them, on only a few examples.
Yeah, I also tried some experiments on paraphrasing the questions; we can discuss this more if you are interested.
Jiasen
I'm not sure about this, maybe we can do some experiment on this. basically th=1 is we replace the random vector with the same UNK representation.
I think the random vector seems potentially problematic. It could be like replacing the word with a random one from your vocabulary: you could get a vector that's close to or identical to some common word. Maybe empirically it makes no difference. I'm always trying to replace experiments with intuition, though :). I find I always have too many experiments to run, so I'm always trying to make these guesses.
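To make the th=0 vs th>=1 distinction concrete, here is a sketch of how the embedding table could be built (illustrative names, not the repo's code): at th>=1 every rare word shares one UNK row that actually gets trained, while at th=0 each rare word would keep its own random row that the model barely sees.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_embedding_table(words, counts, dim, threshold):
    """Rare words (freq <= threshold) all share a single UNK row,
    so the model learns one 'rare word' vector instead of many
    untrained random ones; with threshold=0, every word keeps its
    own (initially random) row."""
    unk_row = rng.normal(scale=0.1, size=dim)
    table = {}
    for w in words:
        if counts.get(w, 0) <= threshold:
            table[w] = unk_row  # shared UNK representation
        else:
            table[w] = rng.normal(scale=0.1, size=dim)
    return table

counts = {"cat": 120, "dog": 95, "zyzzyva": 1, "quokka": 1}
table = build_embedding_table(counts.keys(), counts, dim=4, threshold=1)
```

With threshold=1, both singleton words above end up with the identical shared row, which is exactly the th=1 behaviour being discussed.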
I've made a pull request that gives you a spacy option on prepro.py, and another script to create the necessary data files. I know you're probably already running a lot of experiments, but I'd be interested to see th=5 with the paraphrasing, if you have time. I might have time to try a simple BoW experiment tomorrow, but I doubt I have time to set up Torch and run the full model.
Example output at threshold 50 below. I expect much lower thresholds to perform better, but it's harder to see the paraphrasing working when relatively fewer tokens are replaced.
What kind of meals are there? [u'What', u'kind', u'of', u'meal', u'are', u'there', u'?']
Where is the arrow pointing? [u'Where', u'is', u'the', u'arrow', u'pointing', u'?']
Where is the rolling pin in the kitchen? [u'Where', u'is', u'the', u'roll', u'wire', u'in', u'the', u'kitchen', u'?']
How tall are the ceilings? [u'How', u'tall', u'are', u'the', u'ceiling', u'?']
Which hand has a mitt? [u'Which', u'hand', u'has', u'a', u'glove', u'?']
IS this woman wearing sneakers? [u'IS', u'this', u'woman', u'wearing', u'sneakers', u'?']
What is on the blond man's head? [u'What', u'is', u'on', u'the', u'blonde', u'man', u"'s", u'head', u'?']
Which front leg has more white? [u'Which', u'front', u'leg', u'has', u'more', u'white', u'?']
Bike or a car? [u'bike', u'or', u'a', u'car', u'?']
What is the dog watching over? [u'What', u'is', u'the', u'dog', u'watching', u'over', u'?']
58364 paraphrased/215375 (99.83% done)
example processed tokens:
What is the table made of? [u'What', u'is', u'the', u'table', u'made', u'of', u'?']
Is the food napping on the table? [u'Is', u'the', u'food', u'asleep', u'on', u'the', u'table', u'?']
What has been upcycled to make lights? [u'What', u'has', u'been', u'sweater', u'to', u'make', u'lights', u'?']
Is this an Spanish town? [u'Is', u'this', u'an', u'English', u'town', u'?']
Are there shadows on the sidewalk? [u'Are', u'there', u'shadows', u'on', u'the', u'sidewalk', u'?']
What is in the top right corner? [u'What', u'is', u'in', u'the', u'top', u'right', u'corner', u'?']
Is it cold outside? [u'Is', u'it', u'cold', u'outside', u'?']
What is leaning against the house? [u'What', u'is', u'leaning', u'against', u'the', u'house', u'?']
How many windows can you see? [u'How', u'many', u'windows', u'can', u'you', u'see', u'?']
Is this in a park? [u'Is', u'this', u'in', u'a', u'park', u'?']