Research article · DOI: 10.1145/3347450.3357655

Visually Grounded Language Learning for Robot Navigation

Published: 15 October 2019

ABSTRACT

We present an end-to-end deep learning model for robot navigation from raw visual pixel input and natural-language instructions. The proposed model is an LSTM-based sequence-to-sequence neural network architecture with attention, trained on instruction-perception data samples collected in a synthetic environment. We conduct experiments on the SAIL dataset, which we reconstruct in 3D so as to generate the 2D images associated with the data. Our experiments show that the performance of our model is on par with the state of the art, even though it learns navigational language end-to-end from raw visual data.
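The abstract alone does not fix implementation details, so the following is a purely illustrative sketch, not the authors' code: a PyTorch-style LSTM encoder-decoder with attention that maps a tokenized instruction and a sequence of per-step image features to navigation actions. All module names, feature dimensions, and the choice of a Luong-style attention mechanism are assumptions made here for illustration.

```python
# Hypothetical sketch of the kind of architecture the abstract describes
# (LSTM sequence-to-sequence with attention over the instruction).
# Not the authors' code; names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class InstructionToActionModel(nn.Module):
    def __init__(self, vocab_size, n_actions, img_feat_dim=512, hidden=256, emb=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        # Bidirectional LSTM encodes the natural-language instruction.
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        # Decoder LSTM consumes the current visual observation (e.g. CNN
        # features of the raw pixels) together with the attended instruction.
        self.decoder = nn.LSTMCell(img_feat_dim + 2 * hidden, hidden)
        self.attn_proj = nn.Linear(hidden, 2 * hidden)
        self.action_head = nn.Linear(hidden + 2 * hidden, n_actions)

    def forward(self, instr_tokens, img_feats):
        # instr_tokens: (B, T_text)   img_feats: (B, T_steps, img_feat_dim)
        enc_out, _ = self.encoder(self.embed(instr_tokens))          # (B, T_text, 2H)
        B, T_steps, _ = img_feats.shape
        h = enc_out.new_zeros(B, self.decoder.hidden_size)
        c = enc_out.new_zeros(B, self.decoder.hidden_size)
        logits = []
        for t in range(T_steps):
            # Luong-style attention over the encoded instruction.
            scores = torch.bmm(enc_out, self.attn_proj(h).unsqueeze(2))  # (B, T_text, 1)
            weights = torch.softmax(scores, dim=1)
            context = (weights * enc_out).sum(dim=1)                     # (B, 2H)
            step_in = torch.cat([img_feats[:, t], context], dim=1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.action_head(torch.cat([h, context], dim=1)))
        return torch.stack(logits, dim=1)                                # (B, T_steps, n_actions)
```

In the paper's setting the image features would presumably come from a CNN applied to the rendered 2D views of the reconstructed environment, and training would minimize a per-step action classification loss over the collected instruction-perception samples; both choices are assumptions of this sketch rather than details stated in the abstract.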


Published in
MULEA '19: 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications
October 2019, 65 pages
ISBN: 9781450369183
DOI: 10.1145/3347450
Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
