
Comments (16)

dhruvbatra commented on July 18, 2024

Totally agree with the paraphrase issue! We have been brainstorming about this in the lab as well. Simply mapping OOV words to something within the vocabulary is a good place to start.

If I had to say one thing was unsatisfying about the model, I'd say it's the multiclass classification output.

I am not sure I would agree. As our paper explains, most answers in our dataset are 1-3 words long, so it really is /mostly/ a large multiclass classification problem. Our choice of 1K answers in the model is simply one convenient choice. It covers ~82% of all answers.
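The ~82% figure is just top-k answer coverage, which is easy to compute. Here is a minimal Python sketch (the function name is mine, not from the VQA codebase):

```python
from collections import Counter

def top_k_coverage(answers, k):
    # Fraction of all answer occurrences covered by the k most
    # frequent answers (the ~82% figure corresponds to k=1000).
    counts = Counter(answers)
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(k))
    return covered / total
```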

Have you tried having the model output a vector, and using it to find a nearest neighbour?

No, but how would such a system be trained by backprop? 1-NN isn't amenable to gradient-based learning.

from vqa_lstm_cnn.

honnibal commented on July 18, 2024

I guess I just feel like having such a small and fixed answer vocabulary makes the task a little bit more artificial. I think the coverage you observe is mostly a fact about the collection methodology, not about language in general.

Re training: Off the top of my head, maybe noise contrastive estimation? That's how the QANTA paper did it.
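For what it's worth, here is a minimal numpy sketch of what that could look like: a contrastive (NCE-style) logistic loss over answer vectors for training, and cosine nearest-neighbour decoding at test time. This is an illustrative formulation, not the exact one from the QANTA paper, and all names are made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(pred, pos_vec, neg_vecs):
    # Logistic loss: push the predicted vector towards the true answer
    # vector and away from k sampled "noise" answers. The loss is
    # differentiable, so the whole network still trains by backprop.
    pos_term = -np.log(sigmoid(pred @ pos_vec))
    neg_term = -np.log(sigmoid(-(neg_vecs @ pred))).sum()
    return pos_term + neg_term

def nearest_answer(pred, answer_vecs):
    # At test time, decode by cosine nearest neighbour over the
    # (possibly open-ended) answer embedding table.
    sims = (answer_vecs @ pred) / (
        np.linalg.norm(answer_vecs, axis=1) * np.linalg.norm(pred))
    return int(np.argmax(sims))
```

The point is that 1-NN is only used at decode time; the training signal comes from the contrastive loss, so no gradient has to flow through the neighbour search.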

dhruvbatra commented on July 18, 2024

I guess I just feel like having such a small and fixed answer vocabulary makes the task a little bit more artificial.

:-). I would counter that just because the space of answers is small does not make the learning problem easy. Even binary questions such as "Is this person expecting company?" can require fairly heavy lifting on the vision/reasoning side.

honnibal commented on July 18, 2024

Hey, I'm not saying it's easy, or that it's not impressive and interesting :)

But a fixed answer vocabulary isn't the future of this task. I think the technology would take a big step towards practicality if you were learning to produce a meaning representation. That way, to learn a new answer, you just have to learn its vector. If you add another class to the model, you don't know how many weights might have to be adjusted. Probably a lot.

My hunch is that it would actually be better for accuracy, too. But, you know the evaluation much better.

dhruvbatra commented on July 18, 2024

Agreed.

honnibal commented on July 18, 2024

:)

Now, about the paraphrases. There are a couple of ways we could do this.

  1. I could simply compute paraphrased versions of the task data, and give you a pull request with them. But then when someone ran the system with their own image/questions, you wouldn't be able to answer.

  2. I could compute the paraphrase dictionary and check that in.

  3. I could check in a script that generates the paraphrase dictionary.

  4. I can bake the paraphrase mapping into spaCy's lexicon, and add a mode to prepro.py that uses spaCy with paraphrasing.

  5. I could make prepro.py accept a word vectors file as an argument, and then it would compute the paraphrases based on the vectors.

I think 3 and 4 make the most sense.
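Whichever option we pick, the core step is the same: map each out-of-vocabulary word to its nearest in-vocabulary neighbour in vector space. A minimal numpy sketch (the interface is hypothetical, not what prepro.py exposes):

```python
import numpy as np

def build_paraphrase_dict(vectors, vocab):
    # vectors: {word: np.ndarray}; vocab: the set of words the model keeps.
    # Maps every OOV word to its nearest in-vocab neighbour by cosine
    # similarity over the word vectors.
    in_vocab = [w for w in vocab if w in vectors]
    matrix = np.stack([vectors[w] for w in in_vocab])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    mapping = {}
    for word, vec in vectors.items():
        if word in vocab:
            continue
        sims = matrix @ (vec / np.linalg.norm(vec))
        mapping[word] = in_vocab[int(np.argmax(sims))]
    return mapping
```

Option 3 would check a script like this in; option 5 would wire the `vectors` argument up to a user-supplied word vectors file.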

dhruvbatra commented on July 18, 2024

I like 4 the best.

@jiasenlu @abhshkdz @dexter1691: Do you have a preference?

abhshkdz commented on July 18, 2024

I agree, 3 and 4 would be great.

dexter1691 commented on July 18, 2024

Yeah, 3 and 4 make sense.

honnibal commented on July 18, 2024

Great. I left a job running overnight compiling the paraphrases from GloVe, but I messed something up. Restarting now.

Here are some random substitutions for relatively frequent words. Some of these substitutions look good; others look pretty problematic.

u'Yeah' --> u'all'
u'fuck' --> u'what'
u'fucking' --> u'dude'
u'etc' --> u'toppers'
u'myself' --> u'herself'
u'comment' --> u'leave'
u'thanks' --> u'thank'
u'reddit' --> u'librarian'
u'yeah' --> u'all'
u'whatever' --> u'else'
u'questions' --> u'question'
u'lol' --> u'but'
u'sorry' --> u'okay'
u'hell' --> u'well'
u'Edit' --> u'again'
u'unless' --> u'except'
u'gonna' --> u'#'
u'However' --> u'even'
u'basically' --> u'apparently'
u'knew' --> u'told'
u'comments' --> u'opinion'
u'damn' --> u'dude'
u'Sorry' --> u'okay'
u'worse' --> u'better'
u'account' --> u'credit'
u'advice' --> u'help'
u'Reddit' --> u'librarian'
u'Hey' --> u'hi'
u'subreddit' --> u'#'
u'however' --> u'even'
u'kinda' --> u'seems'
u'Wow' --> u'dude'
u'certainly' --> u'definitely'
u'explain' --> u'describe'

I think it will help a lot to have vectors over POS tagged text. Then we can make sure we don't change the part-of-speech when we do the replacement.
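A sketch of what the POS constraint would look like, with the tags supplied as a plain dict (in practice they could come from spaCy's tagger; all names here are illustrative):

```python
import numpy as np

def pos_safe_neighbour(word, pos_tags, vectors, vocab):
    # Restrict the candidate replacements to in-vocab words that share
    # the word's part-of-speech, so e.g. a noun is never swapped for a
    # verb. pos_tags maps word -> coarse POS tag.
    candidates = [w for w in vocab
                  if w in vectors and pos_tags.get(w) == pos_tags.get(word)]
    if not candidates or word not in vectors:
        return None
    vec = vectors[word] / np.linalg.norm(vectors[word])
    sims = [float(vectors[c] @ vec / np.linalg.norm(vectors[c]))
            for c in candidates]
    return candidates[int(np.argmax(sims))]
```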

honnibal commented on July 18, 2024

The instructions in the readme don't specify a frequency threshold for the training data. Is this the config you've been running in your experiments?

Below are the numbers of excluded words at various thresholds. With a frequency threshold of 1 or 5, there seems to be relatively little advantage in having a complicated replacement scheme: only 1% of the tokens are affected. With so many other moving parts and difficulties in the task, I doubt changing the representation of those tokens would help very much.

I would be curious to try an aggressive threshold, leaving only one or two thousand words in the vocab.

Threshold 1
number of bad words: 4854/12603 = 38.51%
number of words in vocab would be 7749
number of UNKs: 4854/1537357 = 0.32%

Threshold 5
number of bad words: 8387/12603 = 66.55%
number of words in vocab would be 4216
number of UNKs: 15355/1537357 = 1.00%

Threshold 20
number of bad words: 10449/12603 = 82.91%
number of words in vocab would be 2154
number of UNKs: 37260/1537357 = 2.42%

Threshold 50
number of bad words: 11304/12603 = 89.69%
number of words in vocab would be 1299
number of UNKs: 64905/1537357 = 4.22%

Threshold 100
number of bad words: 11722/12603 = 93.01%
number of words in vocab would be 881
number of UNKs: 94764/1537357 = 6.16%

Threshold 200
number of bad words: 12073/12603 = 95.79%
number of words in vocab would be 530
number of UNKs: 144454/1537357 = 9.40%

Threshold 1000
number of bad words: 12459/12603 = 98.86%
number of words in vocab would be 144
number of UNKs: 317024/1537357 = 20.62%
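For reference, here is a sketch of how numbers like these are computed, assuming (as in similar preprocessing scripts) that a "bad" word is one occurring at or below the threshold:

```python
from collections import Counter

def vocab_stats(token_counts, threshold):
    # Words at or below the threshold become "bad" and are mapped to
    # UNK; everything else stays in the vocab. The UNK rate is measured
    # over token occurrences, not word types.
    bad = {w for w, c in token_counts.items() if c <= threshold}
    total = sum(token_counts.values())
    unks = sum(token_counts[w] for w in bad)
    return {'bad_words': len(bad),
            'vocab_size': len(token_counts) - len(bad),
            'unk_rate': unks / total}
```

This is why the bad-word percentage climbs much faster than the UNK rate: most word types are rare, but rare types account for few token occurrences.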

jiasenlu commented on July 18, 2024

@honnibal I used th = 0 (preserving all the words) in the pre-trained model. Previously I tried using only the top 1000 words in our bag-of-words baseline (between th=50~100), and the performance was much worse than the method here (http://arxiv.org/abs/1512.02167), which has a similar network structure. So I doubt using an aggressive threshold will improve the performance.

honnibal commented on July 18, 2024

I think replacing the tokens with UNK makes an aggressive threshold pretty problematic. I'm wondering whether it might work with this paraphrase replacement, though.

It seems quite difficult to learn a representation for a word occurring only once in the training data. You also don't learn any representation for the UNK token that you'll be using over the dev data. So I think th=1 seems better than th=0? That would be my guess.

Here's how the paraphrased data looks at th=50. I need to clean things up a bit before I can give you a pull request. I would say that the current results look a little promising, but we can do better.

It seems to me like crucial words of the question are often relatively rare in the data. The current paraphrase model often messes them up. But, I'm not sure how well you can train the model to learn them, on only a few examples.

What regional architecture is represented here?
 [u'what', u'center', u'design', u'is', u'represented', u'here', u'?']
Where is the cat sitting?
 [u'where', u'is', u'the', u'cat', u'sitting', u'?']
What vehicle is in the picture on the wall?
 [u'what', u'vehicle', u'is', u'in', u'the', u'picture', u'on', u'the', u'wall', u'?']
Is the television set turned on or off?
 [u'is', u'the', u'television', u'set', u'turned', u'on', u'or', u'off', u'?']
Is this a professional sport event?
 [u'is', u'this', u'a', u'professional', u'sport', u'event', u'?']
What is the man doing?
 [u'what', u'is', u'the', u'man', u'doing', u'?']
Is the player's uniform dirty?
 [u'is', u'the', u'player', u"'s", u'uniform', u'dirty', u'?']
What sole topping is shown on the pizza?
 [u'what', u'boots', u'topping', u'is', u'shown', u'on', u'the', u'pizza', u'?']
What object is this?
 [u'what', u'object', u'is', u'this', u'?']
What is the pizza sitting in?
 [u'what', u'is', u'the', u'pizza', u'sitting', u'in', u'?']
Can you see the moon here?
 [u'can', u'you', u'see', u'the', u'dark', u'here', u'?']
What is in the sky?
 [u'what', u'is', u'in', u'the', u'sky', u'?']
How does the parachute stay in the air?
 [u'how', u'does', u'the', u'military', u'keep', u'in', u'the', u'air', u'?']
Is the train moving through the countryside?
 [u'is', u'the', u'train', u'moving', u'through', u'the', u'rural', u'?']
What can be read on the train?
 [u'what', u'can', u'be', u'read', u'on', u'the', u'train', u'?']
Is this train moving?
 [u'is', u'this', u'train', u'moving', u'?']
What is the color of the dog?
 [u'what', u'is', u'the', u'color', u'of', u'the', u'dog', u'?']
Is this a library?
 [u'is', u'this', u'a', u'book', u'?']
Can the dog read?
 [u'can', u'the', u'dog', u'read', u'?']

jiasenlu commented on July 18, 2024

I think replacing the tokens with UNK makes an aggressive threshold pretty problematic. I'm wondering whether it might work with this paraphrase replacement, though.

Yes, I agree. I think this will help.

It seems quite difficult to learn a representation for a word occurring only once in the training data. You also don't learn any representation for the UNK token that you'll be using over the dev data. So I think th=1 seems better than th=0? That would be my guess.

I'm not sure about this; maybe we can run some experiments on it. Basically, at th=1 we replace the random vectors with the same UNK representation.

Here's how the paraphrased data looks at th=50. I need to clean things up a bit before I can give you a pull request. I would say that the current results look a little promising, but we can do better.

Yes, agreed.

It seems to me like crucial words of the question are often relatively rare in the data. The current paraphrase model often messes them up. But, I'm not sure how well you can train the model to learn them, on only a few examples.

Yeah, I also tried some experiments on paraphrasing the questions; we can discuss this more if you are interested.

Jiasen

honnibal commented on July 18, 2024

I'm not sure about this, maybe we can do some experiment on this. basically th=1 is we replace the random vector with the same UNK representation.

I think the random vector seems potentially problematic. It could be like replacing the word with a random one from your vocabulary: you could get a vector that's close or identical to some common word's. Maybe empirically it makes no difference. I'm always trying to replace experiments with intuition, though :) I find I always have too many experiments to run, so I'm always trying to make these guesses.
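To make the th=1 scheme concrete, here is a toy sketch of the replacement step, where every rare word shares a single UNK token (and hence a single trained embedding); prepro.py's actual implementation differs:

```python
def apply_threshold(tokens, counts, threshold, unk='UNK'):
    # Every word at or below the count threshold maps to one shared UNK
    # token, so the model learns a single reusable UNK embedding instead
    # of one barely-updated vector per rare word.
    return [t if counts.get(t, 0) > threshold else unk for t in tokens]
```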

I've made a pull request that adds a spaCy option to prepro.py, and another script to create the necessary data files. I know you're probably already running a lot of experiments, but I'd be interested to see th=5 with the paraphrasing, if you have time. I might have time to try a simple BoW experiment tomorrow, but I doubt I'll have time to set up Torch and run the full model.

honnibal commented on July 18, 2024

Example output at threshold 50 below. I expect much lower thresholds to perform better, but it's harder to see the paraphrasing working when fewer tokens are replaced.

What kind of meals are there? [u'What', u'kind', u'of', u'meal', u'are', u'there', u'?']
Where is the arrow pointing? [u'Where', u'is', u'the', u'arrow', u'pointing', u'?']
Where is the rolling pin in the kitchen? [u'Where', u'is', u'the', u'roll', u'wire', u'in', u'the', u'kitchen', u'?']
How tall are the ceilings? [u'How', u'tall', u'are', u'the', u'ceiling', u'?']
Which hand has a mitt? [u'Which', u'hand', u'has', u'a', u'glove', u'?']
IS this woman wearing sneakers? [u'IS', u'this', u'woman', u'wearing', u'sneakers', u'?']
What is on the blond man's head? [u'What', u'is', u'on', u'the', u'blonde', u'man', u"'s", u'head', u'?']
Which front leg has more white? [u'Which', u'front', u'leg', u'has', u'more', u'white', u'?']
Bike or a car? [u'bike', u'or', u'a', u'car', u'?']
What is the dog watching over? [u'What', u'is', u'the', u'dog', u'watching', u'over', u'?']
58364 paraphrased/215375 (99.83% done)   
example processed tokens:
What is the table made of? [u'What', u'is', u'the', u'table', u'made', u'of', u'?']
Is the food napping on the table? [u'Is', u'the', u'food', u'asleep', u'on', u'the', u'table', u'?']
What has been upcycled to make lights? [u'What', u'has', u'been', u'sweater', u'to', u'make', u'lights', u'?']
Is this an Spanish town? [u'Is', u'this', u'an', u'English', u'town', u'?']
Are there shadows on the sidewalk? [u'Are', u'there', u'shadows', u'on', u'the', u'sidewalk', u'?']
What is in the top right corner? [u'What', u'is', u'in', u'the', u'top', u'right', u'corner', u'?']
Is it cold outside? [u'Is', u'it', u'cold', u'outside', u'?']
What is leaning against the house? [u'What', u'is', u'leaning', u'against', u'the', u'house', u'?']
How many windows can you see? [u'How', u'many', u'windows', u'can', u'you', u'see', u'?']
Is this in a park? [u'Is', u'this', u'in', u'a', u'park', u'?']
