GithubHelp home page GithubHelp logo

theshadow29 / awesome-grounding Goto Github PK

View Code? Open in Web Editor NEW
958.0 28.0 92.0 176 KB

awesome grounding: A curated list of research papers in visual grounding

License: MIT License

computer-vision natural-language-processing grounding awesome-list papers arxiv visual-grounding image-grounding video-understanding video-grounding

awesome-grounding's Introduction

Awesome Visual Grounding

A curated list of research papers in grounding. Link to the code if available is also present.

Have a look at SCOPE.md to get familiar with what grounding means and the tasks considered in this repository.

To maintaing the quality of the repo, I have gone through all the listed papers at least once before adding them to ensure their relevance to grounding. However, I might have missed some paper(s) or added some irrelevant paper(s). Feel free to open an issue in that case. I will go through the paper and then add / remove it.

Table of Contents

Contributing

Feel free to contact me via email ([email protected]) or open an issue or submit a pull request. To add a new paper via pull request:

  1. Fork the repo, change readme. Put the new paper under the correct heading, and place it at the correct chronological position.
  2. Copy its reference in MLA format
  3. Put ** around the title
  4. Provide link to the paper (arxiv/semantic scholar/conference proceedings).
  5. If code or website exists, link that too.
  6. Send a pull request. Ideally, I will review the request within a week.

Demos

  1. MATTNet demo: http://vision2.cs.unc.edu/refer/comprehension

Other Compilations:

Shoutout to some other awesome stuff on vision and language grounding:

  1. Multi-modal Reading List by Paul Liang (@pliang279) : https://github.com/pliang279/awesome-multimodal-ml/
  2. Temporal Grounding by Mu Ketong (@iworldtong): https://github.com/iworldtong/Awesome-Grounding-Natural-Language-in-Video
  3. Temporal Grounding by WuJie (@WuJie1010): https://github.com/WuJie1010/Awesome-Temporally-Language-Grounding. Also, checkout their implementation of some of the popular papers: https://github.com/WuJie1010/Temporally-language-grounding

Datasets

Image Grounding Datasets

  1. Flickr30k: Plummer, Bryan A., et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Proceedings of the IEEE international conference on computer vision. 2015. [Paper] [Code] [Website]

  2. RefClef: Kazemzadeh, Sahar, et al. Referitgame: Referring to objects in photographs of natural scenes. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014. [Paper] [Website]

  3. RefCOCOg: Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]

  4. Visual Genome: Krishna, Ranjay, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123.1 (2017): 32-73. [Paper] [Website]

  5. RefCOCO and RefCOCO+: 1. Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]

  6. GuessWhat: De Vries, Harm, et al. Guesswhat?! visual object discovery through multi-modal dialogue. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. [Paper] [Code] [Website]

  7. Clevr-ref+: Liu, Runtao, et al. Clevr-ref+: Diagnosing visual reasoning with referring expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code] [Website]

  8. Talk2Car: Deruyttere, Thierry, et al. Talk2Car: Taking Control of Your Self-Driving Car Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. [Paper] [Website] [Code]

  9. KB-Ref: Wang, Peng, et al. Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge. Proceedings of the 28th ACM International Conference on Multimedia. 2020. [Paper] [Code]

  10. Ref-Reasoning: Yang, Sibei, Guanbin Li, and Yizhou Yu. Graph-structured referring expression reasoning in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code] [Website]

  11. Cops-Ref: Chen, Zhenfang, et al. Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]

  12. SUNRefer: Liu, Haolin, et al. Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code] [Website]

Video Datasets

  1. TaCoS: Regneri, Michaela, et al. Grounding action descriptions in videos. Transactions of the Association of Computational Linguistics 1 (2013): 25-36. [Paper] [Website]

  2. Charades: Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Website]

  3. Charades-STA: Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017).[Paper] [Code]

  4. Distinct Describable Moments (DiDeMo): Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code] [Website]

  5. ActivityNet Captions: Krishna, Ranjay, et al. Dense-captioning events in videos. Proceedings of the IEEE International Conference on Computer Vision. 2017. [Paper] [Website]

  6. Charades-Ego: [Website]

    • Sigurdsson, Gunnar, et al. Actor and Observer: Joint Modeling of First and Third-Person Videos. CVPR-IEEE Conference on Computer Vision & Pattern Recognition. 2018. [Paper] [Code]
    • Sigurdsson, Gunnar A., et al. "Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos." arXiv preprint arXiv:1804.09626 (2018). [Paper] [Code]
  7. TEMPO: Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]

  8. ActivityNet-Entities: Zhou, Luowei, et al. Grounded video description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code]

Embodied Agents Platforms:

  1. Matterport3D: Chang, Angel, et al. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158 (2017). [Paper] [Code] [Website]

    • Photorealistic rooms
  2. AI2-THOR: Kolve, Eric, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 (2017). [Paper] [Website]

    • Actionable objects!
  3. Habitat AI: Savva, Manolis, et al. Habitat: A platform for embodied ai research. Proceedings of the IEEE International Conference on Computer Vision. 2019. (ICCV 2019) [Paper] [Website]

Paper Roadmap (Chronological Order):

Visual Grounding / Referring Expressions (Images):

  1. Karpathy, Andrej, Armand Joulin, and Li F. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems. 2014. [Paper]

  2. Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. Method name: Neural Talk. [Paper] [Code] [Torch Code] [Website]

  3. Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]

  4. Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]

  5. Wang, Liwei, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]

  6. Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]

  7. Nagaraja, Varun K., Vlad I. Morariu, and Larry S. Davis. Modeling context between objects for referring expression understanding. European Conference on Computer Vision. Springer, Cham, 2016.[Paper] [Code]

  8. Rohrbach, Anna, et al. Grounding of textual phrases in images by reconstruction. European Conference on Computer Vision. Springer, Cham, 2016. Method Name: GroundR [Paper] [Tensorflow Code] [Torch Code]

  9. Wang, Mingzhe, et al. Structured matching for phrase localization. European Conference on Computer Vision. Springer, Cham, 2016. Method name: Structured Matching [Paper] [Code]

  10. Hu, Ronghang, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Website]

  11. Fukui, Akira et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP (2016). Method name: MCB [Paper][Code]

  12. Endo, Ko, et al. An attention-based regression model for grounding textual phrases in images. Proc. IJCAI. 2017. [Paper]

  13. Chen, Kan, et al. MSRC: Multimodal spatial regression with semantic context for phrase grounding. International Journal of Multimedia Information Retrieval 7.1 (2018): 17-28. [Paper -Springer Link]

  14. Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]

  15. Yu, Licheng, et al. A joint speakerlistener-reinforcer model for referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code][Website]

  16. Hu, Ronghang, et al. Modeling relationships in referential expressions with compositional modular networks. Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017. [Paper] [Code]

  17. Luo, Ruotian, and Gregory Shakhnarovich. Comprehension-guided referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code]

  18. Liu, Jingyu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. Proceedings of CVPR. 2017. [Paper]

  19. Xiao, Fanyi, Leonid Sigal, and Yong Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. arXiv preprint arXiv:1705.01371 (2017). [Paper]

  20. Plummer, Bryan A., et al. Phrase localization and visual relationship detection with comprehensive image-language cues. Proc. ICCV. 2017. [Paper] [Code]

  21. Chen, Kan, Rama Kovvuri, and Ram Nevatia. Query-guided regression network with context policy for phrase grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: QRC [Paper] [Code]

  22. Liu, Chenxi, et al. Recurrent Multimodal Interaction for Referring Image Segmentation. ICCV. 2017. [Paper] [Code]

  23. Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]

  24. Li, Xiangyang, and Shuqiang Jiang. Bundled Object Context for Referring Expressions. IEEE Transactions on Multimedia (2018). [Paper ieee link]

  25. Yu, Zhou, et al. Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. arXiv preprint arXiv:1805.03508 (2018). [Paper] [Code]

  26. Yu, Licheng, et al. Mattnet: Modular attention network for referring expression comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. [Paper] [Code] [Website]

  27. Deng, Chaorui, et al. Visual Grounding via Accumulated Attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper]

  28. Li, Ruiyu, et al. Referring image segmentation via recurrent refinement networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper] [Code]

  29. Zhang, Yundong, Juan Carlos Niebles, and Alvaro Soto. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining. arXiv preprint arXiv:1808.00265 (2018). [Paper]

  30. Chen, Kan, Jiyang Gao, and Ram Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. arXiv preprint arXiv:1803.03879 (2018). [Paper] [Code]

  31. Zhang, Hanwang, Yulei Niu, and Shih-Fu Chang. Grounding referring expressions in images by variational context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code]

  32. Cirik, Volkan, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Using syntax to ground referring expressions in natural images. arXiv preprint arXiv:1805.10547 (2018).[Paper] [Code]

  33. Margffoy-Tuay, Edgar, et al. Dynamic multimodal instance segmentation guided by natural language queries. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]

  34. Shi, Hengcan, et al. Key-word-aware network for referring expression image segmentation. Proceedings of the European Conference on Computer Vision (ECCV). 2018.[Paper] [Code]

  35. Plummer, Bryan A., et al. Conditional image-text embedding networks. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]

  36. Akbari, Hassan, et al. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding. arXiv preprint arXiv:1811.11683 (2018). [Paper]

  37. Kovvuri, Rama, and Ram Nevatia. PIRC Net: Using Proposal Indexing, Relationships and Context for Phrase Grounding. arXiv preprint arXiv:1812.03213 (2018). [Paper]

  38. Chen, Xinpeng, et al. Real-Time Referring Expression Comprehension by Single-Stage Grounding Network. arXiv preprint arXiv:1812.03426 (2018). [Paper]

  39. Wang, Peng, et al. Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks. arXiv preprint arXiv:1812.04794 (2018). [Paper]

  40. Liu, Daqing, et al. Learning to Assemble Neural Module Tree Networks for Visual Grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2019. [Paper] [Code]

  41. RETRACTED (see #2): Deng, Chaorui, et al. You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding. arXiv preprint arXiv:1902.04213 (2019). [Paper]

  42. Hong, Richang, et al. Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). 2019. [Paper]

  43. Liu, Xihui, et al. Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]

  44. Dogan, Pelin, Leonid Sigal, and Markus Gross. Neural Sequential Phrase Grounding (SeqGROUND). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]

  45. Datta, Samyak, et al. Align2ground: Weakly supervised phrase grounding guided by image-caption alignment. arXiv preprint arXiv:1903.11649 (2019). (ICCV 2019) [Paper]

  46. Fang, Zhiyuan, et al. Modularized textual grounding for counterfactual resilience. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]

  47. Ye, Linwei, et al. Cross-Modal Self-Attention Network for Referring Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]

  48. Yang, Sibei, Guanbin Li, and Yizhou Yu. Cross-Modal Relationship Inference for Grounding Referring Expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]

  49. Yang, Sibei, Guanbin Li, and Yizhou Yu. Dynamic Graph Attention for Referring Expression Comprehension. arXiv preprint arXiv:1909.08164 (2019). (ICCV 2019) [Paper] [Code]

  50. Wang, Josiah, and Lucia Specia. Phrase Localization Without Paired Training Examples. arXiv preprint arXiv:1908.07553 (2019). (ICCV 2019) [Paper] [Code]

  51. Yang, Zhengyuan, et al. A Fast and Accurate One-Stage Approach to Visual Grounding. arXiv preprint arXiv:1908.06354 (2019). (ICCV 2019) [Paper] [Code]

  52. Sadhu, Arka, Kan Chen, and Ram Nevatia. Zero-Shot Grounding of Objects from Natural Language Queries. arXiv preprint arXiv:1908.07129 (2019).(ICCV 2019) [Paper] [Code] Disclaimer: I am an author of the paper

  53. Liu, Xuejing, et al. Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding. arXiv preprint arXiv:1908.10568 (2019). (ICCV 2019) [Paper] [Code]

  54. Chen, Yi-Wen, et al. Referring Expression Object Segmentation with Caption-Aware Consistency. arXiv preprint arXiv:1910.04748 (2019). (BMVC 2019) [Paper] [Code]

  55. Liu, Jiacheng, and Julia Hockenmaier. Phrase Grounding by Soft-Label Chain Conditional Random Field. arXiv preprint arXiv:1909.00301 (2019) (EMNLP 2019). [Paper] [Code]

  56. Liu, Yongfei, Wan Bo, Zhu Xiaodan and He Xuming. Learning Cross-modal Context Graph for Visual Grounding. arXiv preprint arXiv: (2019) (AAAI-2020). [Paper] [Code]

  57. Yu, Tianyu, et al. Cross-Modal Omni Interaction Modeling for Phrase Grounding. Proceedings of the 28th ACM International Conference on Multimedia. ACM 2020. [Paper: ACM Link] [Code]

  58. Qiu, Heqian, et al. Language-Aware Fine-Grained Object Representation for Referring Expression Comprehension. Proceedings of the 28th ACM International Conference on Multimedia. ACM 2020. [Paper: ACM Link]

  59. Wang, Qinxin, et al. MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding. arXiv preprint arXiv:2010.05379 (2020). [Paper] [Code]

  60. Liao, Yue, et al. A real-time cross-modality correlation filtering method for referring expression comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper]

  61. Hu, Zhiwei, et al. Bi-directional relationship inferring network for referring image segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]

  62. Yang, Sibei, Guanbin Li, and Yizhou Yu. Graph-structured referring expression reasoning in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]

  63. Luo, Gen, et al. Multi-task collaborative network for joint referring expression comprehension and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]

  64. Gupta, Tanmay, et al. Contrastive learning for weakly supervised phrase grounding. Proceedings of the European Conference on Computer Vision (ECCV). 2020. [Paper] [Code]

  65. Yang, Zhengyuan, et al. Improving one-stage visual grounding by recursive sub-query construction. Proceedings of the European Conference on Computer Vision (ECCV). 2020. [Paper] [Code]

  66. Wang, Liwei, et al. Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper]

  67. Sun, Mingjie, Jimin Xiao, and Eng Gee Lim. Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code]

  68. Liu, Haolin, et al. Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code]

  69. Liu, Yongfei, et al. Relation-aware Instance Refinement for Weakly Supervised Visual Grounding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code]

  70. Lin, Xiangru, Guanbin Li, and Yizhou Yu. Scene-Intuitive Agent for Remote Embodied Visual Grounding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper]

  71. Sun, Mingjie, et al. Discriminative triad matching and reconstruction for weakly referring expression grounding. IEEE transactions on pattern analysis and machine intelligence (TPAMI 2021). [Paper] [Code]

  72. Mu, Zongshen, et al. Disentangled Motif-aware Graph Learning for Phrase Grounding. arXiv preprint arXiv:2104.06008 (AAAI 2021). [Paper]

  73. Chen, Long, et al. Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding. arXiv preprint arXiv:2009.01449 (AAAI-2021). [Paper] [Code]

  74. Deng, Jiajun, et al. TransVG: End-to-End Visual Grounding with Transformers. arXiv preprint arXiv:2104.08541 (2021). [Paper] [Unofficial Code]

  75. Du, Ye, et al. Visual Grounding with Transformers. arXiv preprint arXiv:2105.04281 (2021). [Paper]

  76. Kamath, Aishwarya, et al. MDETR--Modulated Detection for End-to-End Multi-Modal Understanding. arXiv preprint arXiv:2104.12763 (2021). [Paper]

  77. Cho, Junhyeong, et al. Collaborative Transformers for Grounded Situation Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2022. [Paper] [Code] [Website]

Natural Language Object Retrieval (Images)

  1. Guadarrama, Sergio, et al. Open-vocabulary Object Retrieval. Robotics: science and systems. Vol. 2. No. 5. 2014. [Paper] [Code]

  2. Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]

  3. Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]

  4. Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]

  5. Nguyen, Anh, et al. Object Captioning and Retrieval with Natural Language. arXiv preprint arXiv:1803.06152 (2018). [Paper] [Website]

  6. Plummer, Bryan A., et al. Open-vocabulary Phrase Detection. arXiv preprint arXiv:1811.07212 (2018). [Paper] [Code]

Grounding Relations / Referring Relations

  1. Krishna, Ranjay, et al. Referring relationships. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code] [Website]

  2. Raboh, Moshiko et al. Differentiable Scene Graphs. (2019). [Paper]

  3. Conser, Erik, et al. Revisiting Visual Grounding. arXiv preprint arXiv:1904.02225 (2019). [Paper]

    • Critique of Referring Relationship paper

Video Grounding (Activity Localization) using Natural Language:

  1. Yu, Haonan, et al. Grounded Language Learning from Video Described with Sentences Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2013. [Paper]

  2. Xu, Ran, et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. Proceedings of the AAAI Conference on Artificial Intelligence. 2015. [Paper]

  3. Song, Young Chol, et al. Unsupervised Alignment of Actions in Video with Text Descriptions Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2016. [Paper]

  4. Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017). Method name: TALL [Paper] [Code]

  5. Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code]

  6. Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. Video Object Segmentation with Language Referring Expressions. arXiv preprint arXiv:1803.08006 (2018). [Paper] [Website]

  7. Xu, Huijuan, et al. Joint Event Detection and Description in Continuous Video Streams. arXiv preprint arXiv:1802.10250 (2018). [Paper] [Code]

  8. Xu, Huijuan, et al. Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning. arXiv preprint arXiv:1804.05113 (2018). [Paper] [Code]

  9. Liu, Bingbin, et al. Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos. European Conference on Computer Vision. Springer, Cham, 2018. [Paper] [Website]

  10. Liu, Meng, et al. Attentive Moment Retrieval in Videos. Proceedings of the International ACM SIGIR Conference . 2018. [Paper] [Website]

  11. Chen, Jingyuan, et al. Temporally grounding natural sentence in video. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. [Paper]

  12. Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]

  13. Wu, Aming, and Yahong Han. Multi-modal Circulant Fusion for Video-to-Language and Backward. IJCAI. Vol. 3. No. 4. 2018. [Paper] [Code]

  14. Ge, Runzhou, et al. MAC: Mining Actiivity Concepts for Language-based Temporal Localization. arXiv preprint arXiv:1811.08925 (2018). [Paper] [Code]

  15. Zhang, Da, et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. arXiv preprint arXiv:1812.00087 (2018). [Paper]

  16. He, Dongliang, et al. Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Proceedings of the AAAI Conference on Artificial Intelligence. 2019. [Paper]

  17. Wang, Weining, Yan Huang, and Liang Wang. Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]

  18. Ghosh, Soham, et al. ExCL: Extractive Clip Localization Using Natural Language Descriptions. arXiv preprint arXiv:1904.02755 (2019). [Paper]

  19. Chen, Shaoxiang, and Yu-Gang Jiang. Semantic Proposal For Activity Localization In Videos Via Sentence Query. Proceedings of the AAAI Conference on Artificial Intelligence. 2019.[Paper]

  20. Yuan Y, Mei T, Zhu W. To Find Where You Talk: Temporal Sentence Localization In Video With Attention Based Location Regression. Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 9159-9166. [Paper]

  21. Mithun N C, Paul S, Roy-Chowdhury A K. Weakly Supervised Video Moment Retrieval From Text Queries. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 11592-11601.[Paper]

  22. Escorcia, Victor, et al. Temporal Localization of Moments in Video Collections with Natural Language. arXiv preprint arXiv:1907.12763 (2019). (ICCV 2019) [Paper] [Code]

  23. Wang, Jingwen, Lin Ma, and Wenhao Jiang. Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction. arXiv preprint arXiv:1909.05010 (2019). (AAAI 2020) [Paper] [Code]

Grounded Description (Image) (WIP)

  1. Hendricks, Lisa Anne, et al. Generating visual explanations. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Pytorch Code]

  2. Jiang, Ming, et al. TIGEr: Text-to-Image Grounding for Image Caption Evaluation. arXiv preprint arXiv:1909.02050 (2019). (EMNLP 2019) [Paper] [Code]

  3. Lee, Jason, Kyunghyun Cho, and Douwe Kiela. Countering language drift via visual grounding. arXiv preprint arXiv:1909.04499 (2019). (EMNLP 2019) [Paper]

Grounded Description (Video) (WIP)

  1. Ma, Chih-Yao, et al. Grounded Objects and Interactions for Video Captioning. arXiv preprint arXiv:1711.06354 (2017). [Paper]

  2. Zhou, Luowei, et al. Grounded video description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code]

Visual Grounding Pretraining

  1. Sun, Chen, et al. Videobert: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766 (2019). [Paper]

  2. Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv preprint arXiv:1908.02265 (Neurips 2019) [Paper] [Code]

  3. Li, Liunian Harold, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv preprint arXiv:1908.03557 (2019). [Paper] [Code]

  4. Li, Gen, et al. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066 (2019). [Paper]

  5. Tan, Hao, and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019). [Paper] [Code]

  6. Su, Weijie, et al. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019). [Paper]

  7. Chen, Yen-Chun, et al. UNITER: Learning UNiversal Image-TExt Representations. arXiv preprint arXiv:1909.11740 (2019). [Paper]

  8. Li Liunian Harold, Pengchuan Zhang, Haotian Zhang, et al. Grounded language-image pre-training. arXiv preprint arXiv:2112.03857 (2021). [Paper] [Code]

Visual Grounding in 3D

  1. Chen, Dave Zhenyu, Angel X. Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. Springer International Publishing, 2020. [Paper] [Code]

  2. Achlioptas, Panos, et al. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. European Conference on Computer Vision. Springer, Cham, 2020. [Website]

  3. Yuan, Zhihao, et al. Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. arXiv preprint arXiv:2103.01128 (2021). [Paper] [Code]

  4. Rozenberszki, David, et al. Language-Grounded Indoor 3D Semantic Segmentation in the Wild. European Conference on Computer Vision. Springer, Tel Aviv, 2022. [Website] [Paper] [Code]

Grounding for Embodied Agents (WIP):

  1. Shridhar, Mohit, et al. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. arXiv preprint arXiv:1912.01734 (2019). [Paper] [Code] [Website]

Misc:

  1. Han, Xudong, Philip Schulz, and Trevor Cohn. Grounding learning of modifier dynamics: An application to color naming. arXiv preprint arXiv:1909.07586 (2019). (EMNLP 2019) [Paper] [Code]

  2. Yu, Xintong, et al. What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues. arXiv preprint arXiv:1909.00421 (2019). (EMNLP 2019) [Paper] [Code]

awesome-grounding's People

Contributors

chopinsharp avatar coformer avatar curryyuan avatar evinpinar avatar haotian-zhang avatar mzshuo avatar nku-shengzheliu avatar rozdavid avatar theshadow29 avatar thierryderuyttere avatar youngfly11 avatar yufansong avatar zyang-ur avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

awesome-grounding's Issues

Please add GLIP

Hi,

Thanks for sharing this great repo. Just want to share some of our recent work GLIP on visual grounding pre-training and I hope they could help this community through your platform. Thank you very much!

Abstract: This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head. Code will be released at (https://github.com/microsoft/GLIP).

wrong link

The link to the paper for the "Guess What" dataset takes you to a different paper.

Add 3d visual grounding papers

There are some research working on visual grounding in 3d point clouds.

ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language Website

ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes Website

InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring code

Please add Bongard-HOI and RelViT

Hi,

Thanks for making this learning list and indeed I learned a lot. Just want to share some of our recent work on visual grounding and I hope they could help this community through your platform:

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions (CVPR 2022, oral)
arxiv | code
In this work, we create a new dataset for few-shot visual reasoning with HOI concepts. It introduces great challenges to the state-of-the-art few-shot learners but seems straightforward for humans.

RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning (ICLR 2022)
arxiv | code
In this work, we explore vision transformers for many visual relational reasoning tasks, including HICO and GQA. We further introduce concept-guided contrastive learning that helps these models master visual reasoning without massive pertaining or extra training data.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.